Graham Paasch
Add MAESTRO-aware lightning risk model
6161d19

Risk Model (Draft)

1. Overview

Lightning mode translates a preset change request into a deterministic risk score (0–100) and a risk level (low, medium, or high). The model uses intent metadata only (no MAESTRO telemetry in Phase 1), so that judges can see how the MCP server reasons about risk in milliseconds. Each preset captures pre-change health, change magnitude, and post-change signals, and the scoring engine turns those inputs into the same risk JSON exposed by the FastAPI MCP endpoint.

2. Inputs

For every (change_type, preset_id) pair we define the following fields:

Field                  Description
change_type            vlan, interface, or bgp_neighbor. Determines base impact weight.
preset_id              Scenario identifier (e.g. leaf_tor_vlan_stage, tor_uplink_shutdown).
pre_core_healthy       True/False flag indicating control-plane health before the change.
pre_interface_errors   Whether interface errors already exist on affected devices.
pre_existing_alarms    Whether any alarms are active in the change scope.
num_devices_touched    How many devices the change modifies. Used for impact magnitude.
post_lost_adjacencies  Count of fabric adjacencies that disappear after the change.
post_new_alarms        Whether new alarms fire after the change.
post_interface_errors  Whether interface errors appear after the change.
blast_radius_summary   Human-readable description of the scope.
context_note           Short narrative used to build the explanation string.

These values live in server/app/mcp.py inside the PRESETS mapping.
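
As an illustration, one entry in that mapping might look like the sketch below. The field names are the ones listed above, and the boolean and numeric values are those implied by the tor_uplink_shutdown worked example in section 4. The key layout (a tuple of change_type and preset_id) and the two placeholder strings are assumptions; the real mapping in server/app/mcp.py may be organized differently.

```python
# Hypothetical shape of one PRESETS entry. Only the field names and the
# tor_uplink_shutdown values come from this document; the tuple key and
# the placeholder strings are assumptions made for illustration.
PRESETS = {
    ("interface", "tor_uplink_shutdown"): {
        "change_type": "interface",
        "preset_id": "tor_uplink_shutdown",
        "pre_core_healthy": True,
        "pre_interface_errors": False,
        "pre_existing_alarms": False,
        "num_devices_touched": 1,
        "post_lost_adjacencies": 1,
        "post_new_alarms": True,
        "post_interface_errors": False,
        "blast_radius_summary": "placeholder scope description",
        "context_note": "placeholder narrative",
    },
}

# Look up a preset by its (change_type, preset_id) pair.
preset = PRESETS[("interface", "tor_uplink_shutdown")]
```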

3. Scoring algorithm

  1. Baseline pre-change (0–30)

    baseline = 0
    +15 if pre_core_healthy is False
    +10 if pre_interface_errors is True
    +10 if pre_existing_alarms is True
    clamp 0–30
    
  2. Change impact (10–55)

    impact_type_base = 10 (VLAN) | 25 (interface) | 35 (BGP neighbor)
    impact_magnitude = min(20, 2 * num_devices_touched)
    change_impact = impact_type_base + impact_magnitude
    
  3. Post-change penalties (0–40)

    post_penalty = 0
    +20 if post_lost_adjacencies > 0
    +10 if post_new_alarms is True
    +10 if post_interface_errors is True
    clamp 0–40
    
  4. Final score + level

    risk_score_raw = baseline + change_impact + post_penalty
    risk_score = clamp(risk_score_raw, 0, 100)
    

    Levels:

    • 0–30 → low
    • 31–70 → medium
    • 71–100 → high

The FastAPI server uses the same logic in simulate_network_change.
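
The four steps can be sketched in Python as follows. This is a minimal sketch, not the code in simulate_network_change: the function name score_preset, the constant IMPACT_TYPE_BASE, and the keys of the returned dict are assumptions, and only the weights, clamps, and level thresholds are taken from the steps above.

```python
# Sketch of the four scoring steps. Names are illustrative; only the
# weights, clamps, and thresholds come from the algorithm description.
IMPACT_TYPE_BASE = {"vlan": 10, "interface": 25, "bgp_neighbor": 35}


def clamp(value, low, high):
    """Restrict value to the inclusive range [low, high]."""
    return max(low, min(high, value))


def score_preset(p):
    # Step 1: baseline pre-change (0-30)
    baseline = 0
    if not p["pre_core_healthy"]:
        baseline += 15
    if p["pre_interface_errors"]:
        baseline += 10
    if p["pre_existing_alarms"]:
        baseline += 10
    baseline = clamp(baseline, 0, 30)

    # Step 2: change impact (10-55)
    impact_magnitude = min(20, 2 * p["num_devices_touched"])
    change_impact = IMPACT_TYPE_BASE[p["change_type"]] + impact_magnitude

    # Step 3: post-change penalties (0-40)
    post_penalty = 0
    if p["post_lost_adjacencies"] > 0:
        post_penalty += 20
    if p["post_new_alarms"]:
        post_penalty += 10
    if p["post_interface_errors"]:
        post_penalty += 10
    post_penalty = clamp(post_penalty, 0, 40)

    # Step 4: final score and level
    risk_score = clamp(baseline + change_impact + post_penalty, 0, 100)
    if risk_score <= 30:
        risk_level = "low"
    elif risk_score <= 70:
        risk_level = "medium"
    else:
        risk_level = "high"

    # Hypothetical result keys; the real risk JSON may use other names.
    return {"risk_score": risk_score, "risk_level": risk_level}


# Worst case: every flag set on a BGP change touching 10 devices.
worst = score_preset({
    "change_type": "bgp_neighbor",
    "pre_core_healthy": False,
    "pre_interface_errors": True,
    "pre_existing_alarms": True,
    "num_devices_touched": 10,
    "post_lost_adjacencies": 3,
    "post_new_alarms": True,
    "post_interface_errors": True,
})
```

In this worst case the three components sum to 30 + 55 + 40 = 125, so the final clamp to 100 does take effect; the step-1 clamp also matters, because the three baseline increments add up to 35.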

4. Worked examples

VLAN – leaf_tor_vlan_stage

  • Inputs: healthy core, no alarms, 2 devices touched, no post-change penalties.
  • Scores: baseline 0, impact 14 (10 + 2 × 2), post 0 → risk 14 (low).
  • Interpretation: localized change with clean pre/post checks, so it is safe to stage.

Interface – tor_uplink_shutdown

  • Inputs: healthy pre-state, 1 device, but 1 adjacency lost + new alarms after shutdown.
  • Scores: baseline 0, impact 27 (25 + 2 × 1), post 30 (20 + 10) → risk 57 (medium).
  • Interpretation: redundancy keeps risk from going high, but alarms + lost adjacency matter.

BGP – leaf_bgp_fabric_neighbor_add

  • Inputs: healthy pre-state, 1 device, no penalties.
  • Scores: baseline 0, impact 37 (35 + 2 × 1), post 0 → risk 37 (medium).
  • Interpretation: even clean BGP adds carry control-plane sensitivity, so Lightning keeps risk mid-range.
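
Because the model is deterministic, the three examples can be rechecked offline with a few lines of Python. The sketch below is a table-driven check under the weights from section 3; the helper names recompute and level_for are illustrative and do not refer to functions in the server code.

```python
# Each row: preset_id, type base, devices touched, baseline, post penalty,
# expected score, expected level (all values taken from the examples above).
EXAMPLES = [
    ("leaf_tor_vlan_stage", 10, 2, 0, 0, 14, "low"),
    ("tor_uplink_shutdown", 25, 1, 0, 20 + 10, 57, "medium"),
    ("leaf_bgp_fabric_neighbor_add", 35, 1, 0, 0, 37, "medium"),
]


def level_for(score):
    """Map a 0-100 score to the low / medium / high bands."""
    if score <= 30:
        return "low"
    if score <= 70:
        return "medium"
    return "high"


def recompute(type_base, devices, baseline, post):
    """Re-derive the final score from already-clamped components."""
    impact = type_base + min(20, 2 * devices)
    return max(0, min(100, baseline + impact + post))


results = []
for preset_id, base, devices, baseline, post, want_score, want_level in EXAMPLES:
    score = recompute(base, devices, baseline, post)
    results.append((preset_id, score, level_for(score)))
    # Fail loudly if a documented example drifts from the algorithm.
    assert (score, level_for(score)) == (want_score, want_level), preset_id
```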

5. Limitations / future work

  • Presets emulate checks; future phases will populate them from MAESTRO telemetry.
  • Only three change types are modeled. WAN/core workflows will add more bases and penalties.
  • Full mode is still a placeholder; Lightning simply annotates that mode=full is not implemented yet.
  • No randomness; this phase is deterministic by design so MCP judges can validate outputs offline.