network-change-simulator / docs /RISK_MODEL.md

Graham Paasch

Add MAESTRO-aware lightning risk model

6161d19 about 1 month ago


	# Risk Model (Draft)

	## 1. Overview
	Lightning mode translates a preset change request into a deterministic risk score (0–100) and a risk level (`low`,
	`medium`, or `high`). The model uses intent metadata only (no MAESTRO telemetry in Phase 1), so judges can see how the
	MCP server reasons about risk and get a result in milliseconds. Each preset captures pre-change health, change
	magnitude, and post-change signals, and the scoring engine turns those inputs into the same risk JSON exposed by the
	FastAPI MCP endpoint.

	## 2. Inputs
	For every `(change_type, preset_id)` pair we define the following fields:

	| Field | Description |
	|-------|-------------|
	| `change_type` | `vlan`, `interface`, or `bgp_neighbor`. Determines base impact weight. |
	| `preset_id` | Scenario identifier (e.g. `leaf_tor_vlan_stage`, `tor_uplink_shutdown`). |
	| `pre_core_healthy` | `True/False` flag indicating control-plane health before the change. |
	| `pre_interface_errors` | Whether interface errors already exist on affected devices. |
	| `pre_existing_alarms` | Whether any alarms are active in the change scope. |
	| `num_devices_touched` | How many devices the change modifies. Used for impact magnitude. |
	| `post_lost_adjacencies` | Count of fabric adjacencies that disappear after the change. |
	| `post_new_alarms` | Whether new alarms fire after the change. |
	| `post_interface_errors` | Whether interface errors appear after the change. |
	| `blast_radius_summary` | Human-readable description of the scope. |
	| `context_note` | Short narrative used to build the explanation string. |

	These values live in `server/app/mcp.py` inside the `PRESETS` mapping.
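
To make the field list concrete, here is a minimal sketch of how one preset could be represented. The `Preset` dataclass, the tuple-keyed dictionary, and the two text values are assumptions for illustration; the actual `PRESETS` mapping in `server/app/mcp.py` may be laid out differently. The numeric and boolean values follow the `tor_uplink_shutdown` worked example in section 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    """One Lightning scenario; field names follow the Inputs table."""
    change_type: str              # "vlan", "interface", or "bgp_neighbor"
    preset_id: str
    pre_core_healthy: bool
    pre_interface_errors: bool
    pre_existing_alarms: bool
    num_devices_touched: int
    post_lost_adjacencies: int
    post_new_alarms: bool
    post_interface_errors: bool
    blast_radius_summary: str
    context_note: str


# Hypothetical layout keyed by (change_type, preset_id); not the real mapping.
PRESETS = {
    ("interface", "tor_uplink_shutdown"): Preset(
        change_type="interface",
        preset_id="tor_uplink_shutdown",
        pre_core_healthy=True,
        pre_interface_errors=False,
        pre_existing_alarms=False,
        num_devices_touched=1,
        post_lost_adjacencies=1,
        post_new_alarms=True,
        post_interface_errors=False,
        blast_radius_summary="placeholder summary text",  # illustrative only
        context_note="placeholder context note",          # illustrative only
    ),
}

preset = PRESETS[("interface", "tor_uplink_shutdown")]
print(preset.num_devices_touched, preset.post_lost_adjacencies)
```

Freezing the dataclass keeps presets immutable, which fits the deterministic design: a scenario cannot drift between two scoring calls.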

	## 3. Scoring algorithm

	1. Baseline pre-change (0–30)
	```text
	baseline = 0
	+15 if pre_core_healthy is False
	+10 if pre_interface_errors is True
	+10 if pre_existing_alarms is True
	clamp 0–30
	```

	2. Change impact (10–55)
	```text
	impact_type_base = 10 (VLAN) | 25 (interface) | 35 (BGP neighbor)
	impact_magnitude = min(20, 2 * num_devices_touched)
	change_impact = impact_type_base + impact_magnitude
	```

	3. Post-change penalties (0–40)
	```text
	post_penalty = 0
	+20 if post_lost_adjacencies > 0
	+10 if post_new_alarms is True
	+10 if post_interface_errors is True
	clamp 0–40
	```

	4. Final score + level
	```text
	risk_score_raw = baseline + change_impact + post_penalty
	risk_score = clamp(risk_score_raw, 0, 100)
	```
	Levels:
	* 0–30 → `low`
	* 31–70 → `medium`
	* 71–100 → `high`

	The FastAPI server uses the same logic in `simulate_network_change`.
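
The four steps above can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the code in `simulate_network_change`; the function names (`baseline_score`, `change_impact`, `post_penalty`, `risk_level`, `score_change`) and the tuple return value are assumptions.

```python
# Base impact per change type, taken from step 2.
IMPACT_TYPE_BASE = {"vlan": 10, "interface": 25, "bgp_neighbor": 35}


def clamp(value: int, low: int, high: int) -> int:
    return max(low, min(high, value))


def baseline_score(pre_core_healthy: bool, pre_interface_errors: bool,
                   pre_existing_alarms: bool) -> int:
    """Step 1: pre-change penalties, clamped to 0-30."""
    score = 0
    if not pre_core_healthy:
        score += 15
    if pre_interface_errors:
        score += 10
    if pre_existing_alarms:
        score += 10
    return clamp(score, 0, 30)


def change_impact(change_type: str, num_devices_touched: int) -> int:
    """Step 2: type base plus a magnitude term capped at 20."""
    return IMPACT_TYPE_BASE[change_type] + min(20, 2 * num_devices_touched)


def post_penalty(post_lost_adjacencies: int, post_new_alarms: bool,
                 post_interface_errors: bool) -> int:
    """Step 3: post-change penalties, clamped to 0-40."""
    score = 0
    if post_lost_adjacencies > 0:
        score += 20
    if post_new_alarms:
        score += 10
    if post_interface_errors:
        score += 10
    return clamp(score, 0, 40)


def risk_level(risk_score: int) -> str:
    """Step 4: map the clamped score onto low / medium / high."""
    if risk_score <= 30:
        return "low"
    if risk_score <= 70:
        return "medium"
    return "high"


def score_change(change_type: str, num_devices_touched: int, *,
                 pre_core_healthy: bool = True,
                 pre_interface_errors: bool = False,
                 pre_existing_alarms: bool = False,
                 post_lost_adjacencies: int = 0,
                 post_new_alarms: bool = False,
                 post_interface_errors: bool = False) -> tuple[int, str]:
    raw = (baseline_score(pre_core_healthy, pre_interface_errors, pre_existing_alarms)
           + change_impact(change_type, num_devices_touched)
           + post_penalty(post_lost_adjacencies, post_new_alarms, post_interface_errors))
    risk_score = clamp(raw, 0, 100)
    return risk_score, risk_level(risk_score)


# Worst case: every penalty fires on a large BGP change.
print(score_change("bgp_neighbor", 12, pre_core_healthy=False,
                   pre_interface_errors=True, pre_existing_alarms=True,
                   post_lost_adjacencies=3, post_new_alarms=True,
                   post_interface_errors=True))
```

The worst case shows why the final clamp matters: the three components can reach 30 + 55 + 40 = 125, which is capped at 100. The baseline clamp is also active, since the three pre-change penalties sum to 35.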

	## 4. Worked examples

	### VLAN – `leaf_tor_vlan_stage`
	* Inputs: healthy core, no alarms, 2 devices touched, no post-change penalties.
	* Scores: baseline 0, impact 14, post 0 → risk 14 (`low`).
	* Interpretation: localized change with clean pre/post checks → safe to stage.

	### Interface – `tor_uplink_shutdown`
	* Inputs: healthy pre-state, 1 device, but 1 adjacency lost + new alarms after shutdown.
	* Scores: baseline 0, impact 27, post 30 → risk 57 (`medium`).
	* Interpretation: redundancy keeps risk from going `high`, but alarms + lost adjacency matter.

	### BGP – `leaf_bgp_fabric_neighbor_add`
	* Inputs: healthy pre-state, 1 device, no penalties.
	* Scores: baseline 0, impact 37, post 0 → risk 37 (`medium`).
	* Interpretation: even clean BGP adds carry control-plane sensitivity, so Lightning keeps risk mid-range.
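
Because the model is deterministic, the three examples can be re-derived offline. The sketch below is a compact, table-driven check; the `breakdown` helper and the `EXAMPLES` table are hypothetical names, and the helper hard-codes a baseline of 0 because all three examples report a clean pre-change state.

```python
BASES = {"vlan": 10, "interface": 25, "bgp_neighbor": 35}


def breakdown(change_type, devices, lost_adjacencies=0, new_alarms=False):
    """Return (baseline, impact, post, risk, level) for a healthy pre-state."""
    baseline = 0  # assumption: no pre-change penalties, as in all three examples
    impact = BASES[change_type] + min(20, 2 * devices)
    post = min(40, (20 if lost_adjacencies > 0 else 0) + (10 if new_alarms else 0))
    risk = max(0, min(100, baseline + impact + post))
    level = "low" if risk <= 30 else "medium" if risk <= 70 else "high"
    return baseline, impact, post, risk, level


# preset_id -> (arguments, expected breakdown from the worked examples)
EXAMPLES = {
    "leaf_tor_vlan_stage": (("vlan", 2), (0, 14, 0, 14, "low")),
    "tor_uplink_shutdown": (("interface", 1, 1, True), (0, 27, 30, 57, "medium")),
    "leaf_bgp_fabric_neighbor_add": (("bgp_neighbor", 1), (0, 37, 0, 37, "medium")),
}

for preset_id, (args, expected) in EXAMPLES.items():
    actual = breakdown(*args)
    assert actual == expected, (preset_id, actual, expected)
    print(preset_id, actual)
```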

	## 5. Limitations / future work
	* Presets emulate checks; future phases will populate them from MAESTRO telemetry.
	* Only three change types are modeled. WAN/core workflows will add more bases and penalties.
	* Full mode is still a placeholder; Lightning simply annotates that `mode=full` is not implemented yet.
	* No randomness; this phase is deterministic by design so MCP judges can validate outputs offline.