overgrowth / SUZIEQ_INTEGRATION.md

Graham Paasch

feat: SuzieQ drift detection and auto-remediation (Todo #4)

b9fb9ea 15 days ago

SuzieQ Integration - Multi-Vendor Drift Detection

Overview

SuzieQ is an open-source network observability framework that provides multi-vendor state collection, topology discovery, and historical analysis. Overgrowth integrates SuzieQ for continuous drift detection: it compares the actual network state against the intended state in the NetBox source of truth (SoT) and automatically generates remediation plans.

What is Configuration Drift?

Configuration drift occurs when the actual network state diverges from the intended state defined in your source of truth (SoT). Common causes:

Manual changes made directly on devices
Failed automation runs leaving partial configs
Hardware failures requiring emergency workarounds
Shadow IT adding unauthorized VLANs/subnets
Config erosion: small undocumented changes accumulating over time

Why SuzieQ?

Feature	SuzieQ	Traditional Monitoring
Multi-vendor	✅ Arista, Cisco, Juniper, Cumulus, etc.	❌ Often vendor-specific
Agentless	✅ SSH-based collection	❌ Often requires agents
Historical data	✅ Parquet files for time-travel queries	❌ Often limited retention
Topology discovery	✅ LLDP/CDP-based	❌ Often manual mapping
Open source	✅ Apache 2.0	❌ Often commercial

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Overgrowth Pipeline                       │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │   NetBox    │→ │  SuzieQ      │→ │ Drift Detection  │   │
│  │   (SoT)     │  │  Collector   │  │  & Remediation   │   │
│  └─────────────┘  └──────────────┘  └──────────────────┘   │
└─────────────────────────────────────────────────────────────┘
         ↓                  ↓                      ↓
    Intended State    Actual State         Drift Analysis
    
┌────────────────────────────────────────────────────────────┐
│  Stage 7: Observability - Collect actual network state     │
│  Stage 7b: Drift Detection - Compare actual vs intended    │
│  Stage 8: Validation - Auto-remediate approved changes     │
└────────────────────────────────────────────────────────────┘

What SuzieQ Detects

1. Configuration Mismatches

Device hostname changes
Management IP changes
Unexpected device roles (leaf acting as spine)

2. VLAN Drift

Missing VLANs: Intended in SoT but not on device
Extra VLANs: Present on device but not in SoT
VLAN name mismatches
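
The three VLAN checks above reduce to set comparisons on VLAN IDs. The following is a minimal sketch, assuming the intended and actual VLANs for one device are available as `{vlan_id: name}` dicts. The function name `diff_vlans` and the result keys are illustrative and are not taken from Overgrowth's code.

```python
def diff_vlans(intended: dict[int, str], actual: dict[int, str]) -> dict[str, list[int]]:
    """Compare intended VLANs (from the SoT) with VLANs seen on one device."""
    intended_ids, actual_ids = set(intended), set(actual)
    return {
        # In the SoT but not on the device
        "missing": sorted(intended_ids - actual_ids),
        # On the device but not in the SoT
        "extra": sorted(actual_ids - intended_ids),
        # Present on both sides, but named differently
        "name_mismatch": sorted(
            vid for vid in intended_ids & actual_ids if intended[vid] != actual[vid]
        ),
    }


result = diff_vlans(
    intended={10: "Users", 20: "Servers", 99: "Management"},
    actual={10: "Users", 99: "Mgmt", 666: "Unknown"},
)
print(result)
```

In practice the default VLAN 1 would probably need to be excluded before the comparison, otherwise it shows up as an extra VLAN on most switches.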

3. IP Address Conflicts

Duplicate IPs across devices
IP mismatches vs NetBox IPAM
Gateway conflicts
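
Duplicate-IP detection can be sketched as grouping observed addresses by IP and keeping the groups with more than one owner. This assumes the collected state can be flattened into `(device, interface, ip)` tuples; that shape and the name `find_duplicate_ips` are assumptions made for illustration.

```python
from collections import defaultdict


def find_duplicate_ips(
    assignments: list[tuple[str, str, str]],
) -> dict[str, list[tuple[str, str]]]:
    """Return every IP that appears on more than one (device, interface) pair."""
    owners = defaultdict(list)
    for device, interface, ip in assignments:
        owners[ip].append((device, interface))
    return {ip: where for ip, where in owners.items() if len(where) > 1}


observed = [
    ("leaf-01", "Vlan10", "10.10.10.1"),
    ("leaf-02", "Vlan10", "10.10.10.1"),  # same address seen on a second device
    ("spine-01", "Loopback0", "10.0.0.1"),
]
duplicates = find_duplicate_ips(observed)
print(duplicates)
```

Designs that share a gateway address on purpose (anycast gateways, for example) would need an allow-list so they are not reported as conflicts.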

4. Interface State Drift

Interfaces expected UP but actually DOWN
Interfaces expected DOWN but actually UP
Description mismatches
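
The interface checks can be expressed as a per-interface comparison of expected and observed state. This sketch assumes both sides are `{interface_name: state}` dicts. The finding keys mirror the JSON shown in Example 3 later in this document, but the function itself is hypothetical.

```python
def interface_state_drift(expected: dict[str, str], actual: dict[str, str]) -> list[dict]:
    """List interfaces whose observed state differs from the expected state."""
    findings = []
    for name, want in sorted(expected.items()):
        # An interface absent from the collected data is reported as "missing"
        have = actual.get(name, "missing")
        if have != want:
            findings.append(
                {"interface": name, "expected_state": want, "actual_state": have}
            )
    return findings


findings = interface_state_drift(
    expected={"Ethernet1": "up", "Ethernet2": "down", "Ethernet48": "up"},
    actual={"Ethernet1": "up", "Ethernet2": "up", "Ethernet48": "down"},
)
for finding in findings:
    print(finding)
```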

5. Routing Issues

BGP neighbor states
OSPF adjacency problems
Route count anomalies

Usage Examples

Basic Drift Detection

from agent.pipeline_engine import OvergrowthPipeline
from agent.network_model import NetworkModel

# Create pipeline with SuzieQ enabled
pipeline = OvergrowthPipeline()

# Your network model (from NetBox or YAML)
model = NetworkModel(...)

# Stage 7: Collect actual state
obs_results = pipeline.stage7_observability(model)
print(f"Collected state from {obs_results['collection']['devices_polled']} devices")

# Stage 7b: Detect drift
drift_results = pipeline.stage7b_drift_detection(model)

if drift_results['drift_detected']:
    print(f"⚠️  Drift detected! Score: {drift_results['drift_score']:.2f}")
    print("Issues found:")
    print(f"  - Config mismatches: {drift_results['summary']['config_mismatches']}")
    print(f"  - Missing VLANs: {drift_results['summary']['missing_vlans']}")
    print(f"  - Interface issues: {drift_results['summary']['interfaces_down']}")
else:
    print("✓ No drift - network matches SoT")

Auto-Remediation

# Stage 8: Validate and auto-remediate
validation = pipeline.stage8_validation(model)

compliance = validation['compliance_report']
print(f"Compliance Status: {compliance['status']}")
print(f"Drift Score: {compliance['drift_score']:.2f}")

if 'remediation' in validation:
    print(f"Applied {validation['remediation']['applied']} automatic fixes")
    print(f"Skipped {validation['remediation']['skipped']} (require manual approval)")

Direct SuzieQ Client Usage

from agent.suzieq_client import SuzieQClient

# Initialize client
suzieq = SuzieQClient(use_suzieq=True)

# Collect state from devices
devices = [
    {'name': 'leaf-01', 'ip': '10.0.0.11', 'username': 'admin', 'password': 'admin'},
    {'name': 'spine-01', 'ip': '10.0.0.1', 'username': 'admin', 'password': 'admin'}
]

collection = suzieq.collect_network_state(devices)
print(f"Collected from {collection['devices_polled']} devices")

# Get topology
topology = suzieq.get_topology()
print(f"Discovered {len(topology['nodes'])} nodes")
print(f"Found {len(topology['edges'])} LLDP/CDP connections")

# Get VLAN summary
vlans = suzieq.get_vlan_summary()
for device, vlan_list in vlans.items():
    print(f"{device}: {vlan_list}")

# Detect drift
intended_state = {
    'devices': [...],
    'vlans': [...],
    'subnets': [...]
}

drift = suzieq.detect_drift(intended_state)

if drift.has_drift:
    print(f"Drift Score: {drift.drift_score:.2f}")
    print(f"Missing VLANs: {len(drift.missing_vlans)}")
    print(f"Extra VLANs: {len(drift.extra_vlans)}")
    
    # Generate remediation plan
    plan = suzieq.generate_remediation_plan(drift)
    
    for action in plan:
        status = "AUTO-FIX" if action['auto_fix'] else "MANUAL"
        print(f"[{status}] {action['action']} on {action['device']}")
        print(f"  Commands: {action['commands']}")
    
    # Apply auto-approved fixes
    results = suzieq.apply_remediation(plan, auto_approve=True)
    print(f"Applied: {results['applied']}, Skipped: {results['skipped']}")

Remediation Safety

Auto-Fix vs Manual Approval

Overgrowth's SuzieQ client classifies each remediation action by risk:

Action	Auto-Fix	Reason
Add missing VLAN	✅ Yes	Safe - doesn't disrupt traffic
Remove extra VLAN	❌ No	Dangerous - could break connectivity
Enable interface	❌ No	Dangerous - interface may be down intentionally
Fix IP mismatch	✅ Yes	Safe - corrects IPAM drift
Update descriptions	✅ Yes	Safe - cosmetic change
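
One way to encode the table above is a lookup that fails closed, so any action type not listed is routed to manual review. This is a sketch under assumptions: `add_vlan`, `fix_ip_mismatch`, and `update_description` appear elsewhere in this document, while `remove_vlan` and `enable_interface` are hypothetical identifiers chosen for illustration.

```python
# Policy copied from the table above. True means safe to apply automatically.
AUTO_FIX_POLICY = {
    "add_vlan": True,
    "remove_vlan": False,        # hypothetical identifier
    "enable_interface": False,   # hypothetical identifier
    "fix_ip_mismatch": True,
    "update_description": True,
}


def is_auto_fixable(action_type: str) -> bool:
    """Unknown action types default to manual review (fail closed)."""
    return AUTO_FIX_POLICY.get(action_type, False)


def split_plan(plan: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a remediation plan into (auto-fix, manual-review) lists."""
    auto = [a for a in plan if is_auto_fixable(a["action"])]
    manual = [a for a in plan if not is_auto_fixable(a["action"])]
    return auto, manual


auto, manual = split_plan(
    [{"action": "add_vlan"}, {"action": "remove_vlan"}, {"action": "reboot"}]
)
print(f"auto={len(auto)} manual={len(manual)}")
```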

Approval Workflow

# Get remediation plan
plan = suzieq.generate_remediation_plan(drift)

# Filter by auto-fix status
auto_fixes = [a for a in plan if a['auto_fix']]
manual_review = [a for a in plan if not a['auto_fix']]

print(f"Auto-fix ready: {len(auto_fixes)}")
print(f"Require approval: {len(manual_review)}")

# Apply only auto-approved
suzieq.apply_remediation(plan, auto_approve=True)

# For manual items, integrate with your ticketing system.
# create_change_request() is a placeholder; wire it to Jira, ServiceNow, etc.
for action in manual_review:
    create_change_request(
        title=f"Fix {action['action']} on {action['device']}",
        commands=action['commands'],
        reason=action['reason']
    )

Installation

Option 1: Mock Mode (Default)

No installation required. Overgrowth includes a mock SuzieQ backend for testing:

suzieq = SuzieQClient(use_suzieq=True)
# Automatically uses mock mode if suzieq not installed

Mock mode simulates:

State collection from devices
Topology discovery
Drift detection with heuristic rules
Remediation plan generation
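
The fallback to mock mode could be decided by checking whether the `suzieq` package is importable. The sketch below shows one way to do that with the standard library. It is an assumption about how such a check might look, not a description of how `SuzieQClient` is actually implemented, and `choose_mode` is a hypothetical helper.

```python
import importlib.util


def suzieq_available() -> bool:
    """True if the `suzieq` package can be found in this environment."""
    return importlib.util.find_spec("suzieq") is not None


def choose_mode(use_suzieq: bool) -> str:
    """Use 'real' only when requested and installed, otherwise fall back to 'mock'."""
    return "real" if use_suzieq and suzieq_available() else "mock"


print(f"Mode: {choose_mode(use_suzieq=True)}")
```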

Option 2: Real SuzieQ

Install SuzieQ for production use:

# Install SuzieQ
pip install suzieq

# Verify installation
suzieq-cli --help

# Create SuzieQ directory
mkdir -p ~/.suzieq/parquet

Configure SuzieQ inventory (~/.suzieq/inventory.yml):

sources:
  - name: overgrowth
    hosts:
      - url: ssh://[email protected]
        devtype: eos
        
      - url: ssh://[email protected]
        devtype: eos

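Start the SuzieQ poller. The suzieq package installs it as sq-poller; check sq-poller --help for the flags your version supports:

sq-poller -I ~/.suzieq/inventory.yml -d ~/.suzieq/parquet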

Configuration

SuzieQ Client Options

from pathlib import Path

# Custom data directory
suzieq = SuzieQClient(
    suzieq_dir=Path("/opt/suzieq/data"),
    use_suzieq=True
)

# Collect with custom namespace
suzieq.collect_network_state(
    devices=[...],
    namespace="production"  # vs "staging", "lab", etc.
)

# Query specific namespace
topology = suzieq.get_topology(namespace="production")

Drift Tolerance

Adjust the drift score threshold in stage8_validation() by changing the comparison (pick one of the following):

# Default: 20% drift allowed
results['validation_passed'] = drift_score < 0.2

# Stricter: 10% drift
results['validation_passed'] = drift_score < 0.1

# Looser: 30% drift
results['validation_passed'] = drift_score < 0.3

Drift score calculation:

drift_score = total_drift_items / (devices_checked * expected_resources)

Examples:
- 0.0 = Perfect match
- 0.15 = Minor drift (2-3 VLANs missing)
- 0.5 = Moderate drift (half of config missing)
- 1.0 = Complete drift (nothing matches)
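
A worked instance of the formula above, combined with the default 0.2 threshold. The handling of an empty denominator and the clamp to 1.0 are assumptions added for the sketch; the source formula does not say how those cases are treated.

```python
def compute_drift_score(
    total_drift_items: int, devices_checked: int, expected_resources: int
) -> float:
    """drift_score = total_drift_items / (devices_checked * expected_resources)."""
    denominator = devices_checked * expected_resources
    if denominator <= 0:
        return 0.0  # assumption: nothing to compare counts as no drift
    # Assumption: clamp so extra (unexpected) items cannot push the score above 1.0
    return min(total_drift_items / denominator, 1.0)


# 3 drift items across 2 devices with 10 expected resources each: 3 / 20 = 0.15
score = compute_drift_score(total_drift_items=3, devices_checked=2, expected_resources=10)
validation_passed = score < 0.2
print(f"score={score:.2f} passed={validation_passed}")
```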

Integration with Pipeline

Stage 7: Observability

Collects actual network state via SuzieQ:

Device inventory
Interface states
VLAN configurations
IP addressing
Routing protocol status
Topology via LLDP/CDP

obs_result = pipeline.stage7_observability(model)
# Returns: collection stats, topology, VLAN summary

Stage 7b: Drift Detection

Compares actual vs intended (NetBox SoT):

Config mismatches
Missing/extra VLANs
IP conflicts
Interface state drift
Routing issues

drift_result = pipeline.stage7b_drift_detection(model)
# Returns: drift score, detailed findings, remediation plan

Stage 8: Validation & Remediation

Validates network compliance and auto-remediates:

Generates compliance report
Applies auto-approved fixes
Queues manual approval items
Re-checks drift after remediation

val_result = pipeline.stage8_validation(model)
# Returns: validation status, compliance report, remediation results

Drift Detection Examples

Example 1: Missing VLAN

Intended (NetBox):

vlans:
  - id: 10
    name: Users
  - id: 20
    name: Servers
  - id: 99
    name: Management

Actual (Device):

show vlan brief
VLAN Name                             Status    Ports
---- -------------------------------- --------- ------
1    default                          active    
10   Users                            active    Et1-10
99   Management                       active    Et48

Drift Detected:

{
  "missing_vlans": [{
    "device": "leaf-01",
    "vlan_id": 20,
    "vlan_name": "Servers",
    "severity": "ERROR"
  }]
}

Remediation:

! Auto-fix: Add missing VLAN
vlan 20
  name Servers
exit
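
The remediation commands above can be derived mechanically from the drift finding. A minimal sketch, assuming the finding has the `vlan_id` and `vlan_name` keys shown in the JSON above; `vlan_add_commands` is a hypothetical helper and the command syntax simply mirrors this example.

```python
def vlan_add_commands(finding: dict) -> list[str]:
    """Build the config lines that add the VLAN described by a missing-VLAN finding."""
    vlan_id = int(finding["vlan_id"])
    if not 1 <= vlan_id <= 4094:  # valid 802.1Q VLAN ID range
        raise ValueError(f"invalid VLAN ID: {vlan_id}")
    return [f"vlan {vlan_id}", f"name {finding['vlan_name']}", "exit"]


finding = {"device": "leaf-01", "vlan_id": 20, "vlan_name": "Servers", "severity": "ERROR"}
commands = vlan_add_commands(finding)
print(commands)
```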

Example 2: Extra VLAN (Shadow IT)

Intended: VLANs 10, 20, 99
Actual: VLANs 10, 20, 99, 666 (unauthorized)

Drift Detected:

{
  "extra_vlans": [{
    "device": "leaf-01",
    "vlan_id": 666,
    "severity": "WARNING"
  }]
}

Remediation:

! Manual approval required - could disrupt traffic
no vlan 666

Example 3: Interface Down

Intended: All uplinks should be UP
Actual: Ethernet48 is DOWN

Drift Detected:

{
  "interface_down": [{
    "device": "leaf-01",
    "interface": "Ethernet48",
    "expected_state": "up",
    "actual_state": "down",
    "severity": "WARNING"
  }]
}

Remediation:

! Manual approval - verify interface should be up
interface Ethernet48
  no shutdown
exit

Troubleshooting

Mock Mode vs Real Mode

Check if SuzieQ is installed:

suzieq = SuzieQClient(use_suzieq=True)
print(f"Mock mode: {suzieq.mock_mode}")

# Expected output:
# WARNING: suzieq not installed - using mock mode
# Mock mode: True

SuzieQ Not Collecting Data

Check SSH connectivity:

ssh [email protected]

Verify inventory:

cat ~/.suzieq/inventory.yml

Check poller logs:

tail -f ~/.suzieq/suzieq-poller.log

Test with CLI:

suzieq-cli
device show

Drift Detection Returns Empty

Cause: SuzieQ hasn't collected data yet

Solution: Run initial collection

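# Run a single collection pass, then exit
# (installed as sq-poller; some versions expect a mode after --run-once, see sq-poller --help)
sq-poller -I ~/.suzieq/inventory.yml -d ~/.suzieq/parquet --run-once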
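
Before running drift detection, it can help to confirm that the poller has written anything at all. The sketch below only checks for the presence of `.parquet` files under the data directory used in this guide; `has_collected_data` is a hypothetical helper, and it does not verify that the data is recent or complete.

```python
from pathlib import Path


def has_collected_data(data_dir: Path) -> bool:
    """True if at least one .parquet file exists anywhere under data_dir."""
    return data_dir.is_dir() and any(data_dir.rglob("*.parquet"))


parquet_dir = Path.home() / ".suzieq" / "parquet"
print(f"Data present in {parquet_dir}: {has_collected_data(parquet_dir)}")
```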

Auto-Fix Not Working

Cause: auto_approve=False (default)

Solution:

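# Enable auto-approval
results = suzieq.apply_remediation(plan, auto_approve=True)

# Or apply manually via Netmiko (pip install netmiko)
from netmiko import ConnectHandler

for action in plan:
    if action['auto_fix']:
        device = ConnectHandler(
            device_type='cisco_ios',  # use the Netmiko device_type for your platform, e.g. 'arista_eos'
            host=action['device'],    # must resolve to the device's management address
            username='admin',
            password='admin'
        )
        device.send_config_set(action['commands'])
        device.disconnect()  # close the SSH session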

Performance

Collection Frequency

Suggested SuzieQ poller intervals by environment:

Lab: Every 1 minute (rapid testing)
Staging: Every 5 minutes (drift detection)
Production: Every 15 minutes (capacity planning)

Data Retention

SuzieQ stores data in Parquet files:

# Check storage usage
du -sh ~/.suzieq/parquet

# Clean up data files older than 30 days (files only, directories are left in place)
find ~/.suzieq/parquet -type f -mtime +30 -delete
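
Because the cleanup is destructive, a dry run that only lists the candidates may be worth doing first. This sketch approximates the `find` age test using file modification times. `files_older_than` is a hypothetical helper, and the path is the default used in this guide.

```python
import time
from pathlib import Path


def files_older_than(root: Path, days: int) -> list[Path]:
    """Return regular files under root whose mtime is more than `days` days ago."""
    cutoff = time.time() - days * 86400
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.stat().st_mtime < cutoff
    )


parquet_dir = Path.home() / ".suzieq" / "parquet"
stale = files_older_than(parquet_dir, days=30) if parquet_dir.is_dir() else []
print(f"{len(stale)} files older than 30 days (nothing deleted)")
```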

Drift Detection Performance

Network Size	Devices	Drift Check Time
Small	1-10	< 1 second
Medium	11-100	1-5 seconds
Large	101-500	5-15 seconds
Enterprise	More than 500	15-60 seconds

Best Practices

1. Use Namespaces

Separate environments:

# Production namespace
suzieq.collect_network_state(devices, namespace="production")

# Staging namespace
suzieq.collect_network_state(devices, namespace="staging")

2. Schedule Regular Drift Checks

#!/bin/bash
# Cron job: check drift every hour
cd /opt/overgrowth || exit 1
source venv/bin/activate
python -c "
from agent.pipeline_engine import OvergrowthPipeline
from agent.network_model import NetworkModel
pipeline = OvergrowthPipeline()
model = NetworkModel.from_yaml('network.yaml')
drift = pipeline.stage7b_drift_detection(model)
if drift['drift_detected']:
    print(f'ALERT: Drift score {drift[\"drift_score\"]:.2f}')
    # Send alert to Slack/PagerDuty
"

3. Auto-Fix Low-Risk Changes

# Safe changes: Add VLANs, update descriptions
auto_fix_actions = ['add_vlan', 'update_description', 'fix_ip_mismatch']

# Apply only safe actions
safe_plan = [a for a in plan if a['action'] in auto_fix_actions]
suzieq.apply_remediation(safe_plan, auto_approve=True)

# Manual review for everything else
manual_plan = [a for a in plan if a['action'] not in auto_fix_actions]
notify_team(manual_plan)

4. Track Drift Over Time

from datetime import datetime

# Log drift history
drift_log = {
    'timestamp': datetime.now().isoformat(),
    'drift_score': drift.drift_score,
    'devices_checked': drift.devices_checked,
    'issues': {
        'config_mismatches': len(drift.config_mismatches),
        'missing_vlans': len(drift.missing_vlans),
        'extra_vlans': len(drift.extra_vlans)
    }
}

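# Store in database or CSV (append_to_history is a placeholder for your own storage code)
append_to_history(drift_log)

# Alert if drift is increasing; previous_score is the score from the last logged run
if drift.drift_score > previous_score * 1.5:
    alert("Drift increasing rapidly!")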
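
The storage and comparison steps can be sketched with a CSV file from the standard library. The helpers `append_score`, `last_score`, and `is_rising_fast` are hypothetical stand-ins, and the 1.5x factor is the one used in the snippet above. The previous score has to be read before the new one is appended.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def append_score(path: Path, timestamp: str, drift_score: float) -> None:
    """Append one row to the history file, writing a header for a new file."""
    new_file = not path.exists()
    with path.open("a", newline="") as fh:
        writer = csv.writer(fh)
        if new_file:
            writer.writerow(["timestamp", "drift_score"])
        writer.writerow([timestamp, f"{drift_score:.4f}"])


def last_score(path: Path) -> Optional[float]:
    """Return the most recently logged score, or None if there is no history."""
    if not path.exists():
        return None
    with path.open(newline="") as fh:
        rows = list(csv.DictReader(fh))
    return float(rows[-1]["drift_score"]) if rows else None


def is_rising_fast(current: float, previous: Optional[float], factor: float = 1.5) -> bool:
    """True if the current score exceeds the previous one by more than `factor`."""
    return previous is not None and previous > 0 and current > previous * factor


with tempfile.TemporaryDirectory() as tmp:
    history = Path(tmp) / "drift_history.csv"
    append_score(history, datetime.now().isoformat(), 0.10)
    previous = last_score(history)  # read before logging the new score
    append_score(history, datetime.now().isoformat(), 0.20)
    print(f"Rising fast: {is_rising_fast(0.20, previous)}")
```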

Future Enhancements

Planned Features

StackStorm Integration: Event-driven auto-remediation when drift detected
RAG-based Learning: Learn from past drift incidents to prevent recurrence
Change Correlation: Link drift events to recent changes (Git, tickets)
Predictive Drift: ML model to predict drift before it happens
Multi-Region Sync: Ensure consistency across global deployments

Community Contributions

See CONTRIBUTING.md for how to add:

New drift detection rules
Additional remediation actions
Custom compliance policies
Integration with other observability tools

References

SuzieQ Documentation: https://suzieq.readthedocs.io/
SuzieQ GitHub: https://github.com/netenglabs/suzieq
Overgrowth Repo: https://huggingface.co/spaces/MCP-1st-Birthday/overgrowth
NetBox Integration: See NETBOX_INTEGRATION.md
Batfish Digital Twin: See BATFISH_INTEGRATION.md

Support

Questions? Issues? Contributions?

Open an issue on HuggingFace Spaces
Join our Discord: [link]
Email: [email protected]