network-change-simulator / docs /RISK_MODEL.md

Graham Paasch

Add MAESTRO-aware lightning risk model

6161d19 about 1 month ago

Risk Model (Draft)

1. Overview

Lightning mode translates a preset change request into a deterministic risk score (0–100) and a risk level (low, medium, or high). Phase 1 uses intent metadata only, with no MAESTRO telemetry, so that judges can see how the MCP server reasons about risk in milliseconds. Each preset captures pre-change health, change magnitude, and post-change signals, and the scoring engine turns those inputs into the same risk JSON exposed by the FastAPI MCP endpoint.

2. Inputs

For every (change_type, preset_id) pair, we define the following fields:

| Field | Description |
| --- | --- |
| `change_type` | `vlan`, `interface`, or `bgp_neighbor`. Determines base impact weight. |
| `preset_id` | Scenario identifier (e.g. `leaf_tor_vlan_stage`, `tor_uplink_shutdown`). |
| `pre_core_healthy` | `True`/`False` flag indicating control-plane health before the change. |
| `pre_interface_errors` | Whether interface errors already exist on affected devices. |
| `pre_existing_alarms` | Whether any alarms are active in the change scope. |
| `num_devices_touched` | How many devices the change modifies. Used for impact magnitude. |
| `post_lost_adjacencies` | Count of fabric adjacencies that disappear after the change. |
| `post_new_alarms` | Whether new alarms fire after the change. |
| `post_interface_errors` | Whether interface errors appear after the change. |
| `blast_radius_summary` | Human-readable description of the scope. |
| `context_note` | Short narrative used to build the explanation string. |

These values live in server/app/mcp.py inside the PRESETS mapping.
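To make the field list concrete, here is a minimal sketch of what one preset entry might look like. The `Preset` dataclass, the `PRESETS_SKETCH` name, and both text strings are assumptions for illustration; the actual `PRESETS` mapping in server/app/mcp.py may be structured differently. The numeric and boolean values follow the VLAN worked example in section 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    """Hypothetical container for the fields listed in the table above."""
    change_type: str
    preset_id: str
    pre_core_healthy: bool
    pre_interface_errors: bool
    pre_existing_alarms: bool
    num_devices_touched: int
    post_lost_adjacencies: int
    post_new_alarms: bool
    post_interface_errors: bool
    blast_radius_summary: str
    context_note: str


# Keyed by (change_type, preset_id), as the text describes.
# The two strings below are placeholders, not the repository's wording.
PRESETS_SKETCH = {
    ("vlan", "leaf_tor_vlan_stage"): Preset(
        change_type="vlan",
        preset_id="leaf_tor_vlan_stage",
        pre_core_healthy=True,
        pre_interface_errors=False,
        pre_existing_alarms=False,
        num_devices_touched=2,
        post_lost_adjacencies=0,
        post_new_alarms=False,
        post_interface_errors=False,
        blast_radius_summary="placeholder scope description",
        context_note="placeholder narrative",
    ),
}
```

A frozen dataclass is used here only so that a preset cannot be mutated between runs, which fits the deterministic intent of Lightning mode.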

3. Scoring algorithm

Baseline pre-change (0–30)

baseline = 0
+15 if pre_core_healthy is False
+10 if pre_interface_errors is True
+10 if pre_existing_alarms is True
clamp 0–30
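The baseline rules above can be sketched as a small function. The function name is an assumption, not necessarily what server/app/mcp.py uses. Note that the three increments sum to 35, so the clamp to 30 only takes effect when all three conditions hold.

```python
def baseline_score(pre_core_healthy: bool,
                   pre_interface_errors: bool,
                   pre_existing_alarms: bool) -> int:
    """Pre-change baseline in the range 0-30 (hypothetical helper)."""
    score = 0
    if not pre_core_healthy:
        score += 15  # unhealthy control plane before the change
    if pre_interface_errors:
        score += 10  # interface errors already present
    if pre_existing_alarms:
        score += 10  # alarms already active in the change scope
    return max(0, min(30, score))  # clamp 0-30
```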

Change impact (10–55)

impact_type_base = 10 (VLAN) | 25 (interface) | 35 (BGP neighbor)
impact_magnitude = min(20, 2 * num_devices_touched)
change_impact = impact_type_base + impact_magnitude
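A minimal sketch of the impact calculation follows. The dictionary and function names are assumptions; the base weights and the cap of 20 come directly from the formulas above. The magnitude term saturates once a change touches 10 or more devices.

```python
# Base weights per change type, taken from the formulas above.
IMPACT_TYPE_BASE = {"vlan": 10, "interface": 25, "bgp_neighbor": 35}


def change_impact(change_type: str, num_devices_touched: int) -> int:
    """Change impact: type base plus a magnitude term capped at 20."""
    base = IMPACT_TYPE_BASE[change_type]  # KeyError for unmodeled types
    magnitude = min(20, 2 * num_devices_touched)
    return base + magnitude
```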

Post-change penalties (0–40)

post_penalty = 0
+20 if post_lost_adjacencies > 0
+10 if post_new_alarms is True
+10 if post_interface_errors is True
clamp 0–40
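The post-change penalties can be sketched the same way (the function name is an assumption). Any lost adjacency counts once, regardless of how many are lost, and the three penalties sum to exactly 40, so the clamp acts only as a guard.

```python
def post_penalty(post_lost_adjacencies: int,
                 post_new_alarms: bool,
                 post_interface_errors: bool) -> int:
    """Post-change penalty in the range 0-40 (hypothetical helper)."""
    penalty = 0
    if post_lost_adjacencies > 0:
        penalty += 20  # one or more fabric adjacencies disappeared
    if post_new_alarms:
        penalty += 10  # new alarms fired after the change
    if post_interface_errors:
        penalty += 10  # interface errors appeared after the change
    return max(0, min(40, penalty))  # clamp 0-40
```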

Final score + level

risk_score_raw = baseline + change_impact + post_penalty
risk_score = clamp(risk_score_raw, 0, 100)

Levels:

0–30 → low
31–70 → medium
71–100 → high

The FastAPI server uses the same logic in simulate_network_change.
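As an illustration of the final step, the sketch below combines the three component scores and maps the result to a level. The name `finalize` is an assumption and is not taken from `simulate_network_change`. The components can sum to as much as 30 + 55 + 40 = 125, which is why the final clamp to 100 matters.

```python
from __future__ import annotations


def finalize(baseline: int, change_impact: int, post_penalty: int) -> tuple[int, str]:
    """Combine component scores into (risk_score, risk_level)."""
    raw = baseline + change_impact + post_penalty
    score = max(0, min(100, raw))  # clamp 0-100
    if score <= 30:
        level = "low"
    elif score <= 70:
        level = "medium"
    else:
        level = "high"
    return score, level
```

The level boundaries are inclusive on the upper end: 30 is still low and 70 is still medium.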

4. Worked examples

VLAN – `leaf_tor_vlan_stage`

Inputs: healthy core, no alarms, 2 devices touched, no post-change penalties.
Scores: baseline 0, impact 14 (10 + 2 × 2), post 0 → risk 14 (low).
Interpretation: localized change with clean pre/post checks → safe to stage.

Interface – `tor_uplink_shutdown`

Inputs: healthy pre-state, 1 device, but 1 adjacency lost + new alarms after shutdown.
Scores: baseline 0, impact 27 (25 + 2 × 1), post 30 (20 + 10) → risk 57 (medium).
Interpretation: redundancy keeps risk from going high, but alarms + lost adjacency matter.

BGP – `leaf_bgp_fabric_neighbor_add`

Inputs: healthy pre-state, 1 device, no penalties.
Scores: baseline 0, impact 37 (35 + 2 × 1), post 0 → risk 37 (medium).
Interpretation: even clean BGP adds carry control-plane sensitivity, so Lightning keeps risk mid-range.
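The three worked examples can be checked offline with a compact end-to-end sketch of the section 3 rules. The `score` function, the `CLEAN` template, and the plain-dict input shape are assumptions for illustration. Flags the examples do not state explicitly (such as `pre_interface_errors`) are inferred from the listed component scores.

```python
BASE = {"vlan": 10, "interface": 25, "bgp_neighbor": 35}

# Template for a clean pre/post state; each example overrides what differs.
CLEAN = {
    "pre_core_healthy": True, "pre_interface_errors": False,
    "pre_existing_alarms": False, "post_lost_adjacencies": 0,
    "post_new_alarms": False, "post_interface_errors": False,
}


def score(p: dict) -> tuple:
    """Return (baseline, impact, post, risk_score, level) for one preset."""
    baseline = min(30, 15 * (not p["pre_core_healthy"])
                   + 10 * p["pre_interface_errors"]
                   + 10 * p["pre_existing_alarms"])
    impact = BASE[p["change_type"]] + min(20, 2 * p["num_devices_touched"])
    post = min(40, 20 * (p["post_lost_adjacencies"] > 0)
               + 10 * p["post_new_alarms"]
               + 10 * p["post_interface_errors"])
    total = max(0, min(100, baseline + impact + post))
    level = "low" if total <= 30 else "medium" if total <= 70 else "high"
    return baseline, impact, post, total, level


vlan = {**CLEAN, "change_type": "vlan", "num_devices_touched": 2}
uplink = {**CLEAN, "change_type": "interface", "num_devices_touched": 1,
          "post_lost_adjacencies": 1, "post_new_alarms": True}
bgp = {**CLEAN, "change_type": "bgp_neighbor", "num_devices_touched": 1}

print(score(vlan))    # (0, 14, 0, 14, 'low')
print(score(uplink))  # (0, 27, 30, 57, 'medium')
print(score(bgp))     # (0, 37, 0, 37, 'medium')
```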

5. Limitations / future work

Presets emulate checks; future phases will populate them from MAESTRO telemetry.
Only three change types are modeled. WAN/core workflows will add more bases and penalties.
Full mode is still a placeholder; Lightning simply annotates that mode=full is not implemented yet.
No randomness; this phase is deterministic by design so MCP judges can validate outputs offline.