Graham Paasch
Add MAESTRO-aware lightning risk model
6161d19
# Risk Model (Draft)
## 1. Overview
Lightning mode translates a preset change request into a deterministic risk score (0–100) and a risk level. The model
focuses on intent metadata only (no MAESTRO telemetry in Phase 1), so that judges can see how the MCP server reasons about
risk in milliseconds. Each preset captures pre-change health, magnitude, and post-change signals, and the scoring engine
turns those inputs into the same risk JSON exposed by the FastAPI MCP endpoint.
## 2. Inputs
For every `(change_type, preset_id)` we define the following fields:
| Field | Description |
|-------|-------------|
| `change_type` | `vlan`, `interface`, or `bgp_neighbor`. Determines base impact weight. |
| `preset_id` | Scenario identifier (e.g. `leaf_tor_vlan_stage`, `tor_uplink_shutdown`). |
| `pre_core_healthy` | `True/False` flag indicating control-plane health before the change. |
| `pre_interface_errors` | Whether interface errors already exist on affected devices. |
| `pre_existing_alarms` | Whether any alarms are active in the change scope. |
| `num_devices_touched` | How many devices the change modifies. Used for impact magnitude. |
| `post_lost_adjacencies` | Count of fabric adjacencies that disappear after the change. |
| `post_new_alarms` | Whether new alarms fire after the change. |
| `post_interface_errors` | Whether interface errors appear after the change. |
| `blast_radius_summary` | Human-readable description of the scope. |
| `context_note` | Short narrative used to build the explanation string. |
These values live in `server/app/mcp.py` inside the `PRESETS` mapping.
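
To make the field list concrete, here is a minimal sketch of how one preset could be represented. The `Preset` dataclass, the tuple-keyed layout of `PRESETS`, and the two string values are assumptions for illustration; the actual structure in `server/app/mcp.py` may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    """Hypothetical container mirroring the fields in the table above."""
    change_type: str              # "vlan", "interface", or "bgp_neighbor"
    preset_id: str
    pre_core_healthy: bool
    pre_interface_errors: bool
    pre_existing_alarms: bool
    num_devices_touched: int
    post_lost_adjacencies: int
    post_new_alarms: bool
    post_interface_errors: bool
    blast_radius_summary: str
    context_note: str


# Assumed layout: one entry per (change_type, preset_id) pair.
PRESETS = {
    ("vlan", "leaf_tor_vlan_stage"): Preset(
        change_type="vlan",
        preset_id="leaf_tor_vlan_stage",
        pre_core_healthy=True,
        pre_interface_errors=False,
        pre_existing_alarms=False,
        num_devices_touched=2,
        post_lost_adjacencies=0,
        post_new_alarms=False,
        post_interface_errors=False,
        blast_radius_summary="placeholder summary",  # illustrative text only
        context_note="placeholder note",             # illustrative text only
    ),
}
```

A frozen dataclass is used here so that a preset cannot be mutated between requests, which fits the deterministic intent of this phase.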
## 3. Scoring algorithm
1. **Baseline pre-change (0–30)**
```text
baseline = 0
+15 if pre_core_healthy is False
+10 if pre_interface_errors is True
+10 if pre_existing_alarms is True
clamp to 0–30 (the raw sum can reach 35)
```
2. **Change impact (10–55)**
```text
impact_type_base = 10 (VLAN) | 25 (interface) | 35 (BGP neighbor)
impact_magnitude = min(20, 2 * num_devices_touched)
change_impact = impact_type_base + impact_magnitude
```
3. **Post-change penalties (0–40)**
```text
post_penalty = 0
+20 if post_lost_adjacencies > 0
+10 if post_new_alarms is True
+10 if post_interface_errors is True
clamp 0–40
```
4. **Final score + level**
```text
risk_score_raw = baseline + change_impact + post_penalty
risk_score = clamp(risk_score_raw, 0, 100)
```
Levels:
* 0–30 → `low`
* 31–70 → `medium`
* 71–100 → `high`
The FastAPI server uses the same logic in `simulate_network_change`.
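
The four steps above can be sketched as a single pure function. This is an illustrative reading of the pseudocode, not the code in `simulate_network_change`; the function name `score_change`, the helper `clamp`, and the output keys `risk_score` and `risk_level` are assumptions.

```python
# Base impact weight per change type, taken from step 2.
IMPACT_TYPE_BASE = {"vlan": 10, "interface": 25, "bgp_neighbor": 35}


def clamp(value, low, high):
    """Restrict value to the inclusive range [low, high]."""
    return max(low, min(high, value))


def score_change(
    change_type,
    *,
    pre_core_healthy=True,
    pre_interface_errors=False,
    pre_existing_alarms=False,
    num_devices_touched=1,
    post_lost_adjacencies=0,
    post_new_alarms=False,
    post_interface_errors=False,
):
    # Step 1: baseline pre-change risk, clamped to 0-30.
    baseline = 0
    if not pre_core_healthy:
        baseline += 15
    if pre_interface_errors:
        baseline += 10
    if pre_existing_alarms:
        baseline += 10
    baseline = clamp(baseline, 0, 30)

    # Step 2: change impact = type base + magnitude (capped at 20).
    impact_magnitude = min(20, 2 * num_devices_touched)
    change_impact = IMPACT_TYPE_BASE[change_type] + impact_magnitude

    # Step 3: post-change penalties, clamped to 0-40.
    post_penalty = 0
    if post_lost_adjacencies > 0:
        post_penalty += 20
    if post_new_alarms:
        post_penalty += 10
    if post_interface_errors:
        post_penalty += 10
    post_penalty = clamp(post_penalty, 0, 40)

    # Step 4: final score and level.
    risk_score = clamp(baseline + change_impact + post_penalty, 0, 100)
    if risk_score <= 30:
        risk_level = "low"
    elif risk_score <= 70:
        risk_level = "medium"
    else:
        risk_level = "high"
    return {"risk_score": risk_score, "risk_level": risk_level}
```

Note that the three component maxima (30 + 55 + 40) sum to 125, so the final clamp to 100 matters for worst-case inputs.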
## 4. Worked examples
### VLAN – `leaf_tor_vlan_stage`
* Inputs: healthy core, no alarms, 2 devices touched, no post-change penalties.
* Scores: baseline 0, impact 14, post 0 → risk 14 (`low`).
* Interpretation: localized change with clean pre/post checks → safe to stage.
### Interface – `tor_uplink_shutdown`
* Inputs: healthy pre-state, 1 device, but 1 adjacency lost + new alarms after shutdown.
* Scores: baseline 0, impact 27, post 30 → risk 57 (`medium`).
* Interpretation: redundancy keeps risk from going `high`, but alarms + lost adjacency matter.
### BGP – `leaf_bgp_fabric_neighbor_add`
* Inputs: healthy pre-state, 1 device, no penalties.
* Scores: baseline 0, impact 37, post 0 → risk 37 (`medium`).
* Interpretation: even clean BGP adds carry control-plane sensitivity, so Lightning keeps risk mid-range.
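
The component figures in the three examples can be reproduced with a compact breakdown helper. This is a sketch for checking the arithmetic only; `breakdown` and `EXAMPLES` are hypothetical names, and the helper assumes a healthy pre-state (baseline 0), as in every example above.

```python
# Base impact weight per change type, from the scoring algorithm.
BASE = {"vlan": 10, "interface": 25, "bgp_neighbor": 35}


def breakdown(change_type, devices, lost_adjacencies=0,
              new_alarms=False, new_errors=False):
    """Return (baseline, impact, post, total) for a healthy pre-state."""
    baseline = 0  # all three worked examples start from a clean pre-state
    impact = BASE[change_type] + min(20, 2 * devices)
    post = 0
    if lost_adjacencies > 0:
        post += 20
    if new_alarms:
        post += 10
    if new_errors:
        post += 10
    post = min(40, post)
    total = min(100, baseline + impact + post)
    return baseline, impact, post, total


EXAMPLES = {
    "leaf_tor_vlan_stage": breakdown("vlan", 2),
    "tor_uplink_shutdown": breakdown("interface", 1,
                                     lost_adjacencies=1, new_alarms=True),
    "leaf_bgp_fabric_neighbor_add": breakdown("bgp_neighbor", 1),
}

for preset_id, parts in EXAMPLES.items():
    print(preset_id, parts)
# leaf_tor_vlan_stage (0, 14, 0, 14)
# tor_uplink_shutdown (0, 27, 30, 57)
# leaf_bgp_fabric_neighbor_add (0, 37, 0, 37)
```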
## 5. Limitations / future work
* Presets hard-code the results of the pre- and post-change checks; future phases will populate them from MAESTRO telemetry.
* Only three change types are modeled. WAN/core workflows will add more bases and penalties.
* Full mode is still a placeholder; Lightning simply annotates that `mode=full` is not implemented yet.
* No randomness; this phase is deterministic by design so MCP judges can validate outputs offline.