overgrowth / SUZIEQ_INTEGRATION.md
Graham Paasch
feat: SuzieQ drift detection and auto-remediation (Todo #4)
b9fb9ea

SuzieQ Integration - Multi-Vendor Drift Detection

Overview

SuzieQ is an open-source network observability framework that provides multi-vendor state collection, topology discovery, and historical analysis. Overgrowth integrates SuzieQ for continuous drift detection: it compares the actual network state against the intended state held in the NetBox source of truth (SoT) and automatically generates remediation plans.

What is Configuration Drift?

Configuration drift occurs when the actual network state diverges from the intended state defined in your source of truth (SoT). Common causes:

  • Manual changes made directly on devices
  • Failed automation runs leaving partial configs
  • Hardware failures requiring emergency workarounds
  • Shadow IT adding unauthorized VLANs/subnets
  • Gradual config erosion as small, untracked changes accumulate

Why SuzieQ?

| Feature | SuzieQ | Traditional Monitoring |
|---|---|---|
| Multi-vendor | ✅ Arista, Cisco, Juniper, Cumulus, etc. | ❌ Vendor-specific |
| Agentless | ✅ SSH-based collection | ❌ Requires agents |
| Historical data | ✅ Parquet files for time-travel | ❌ Limited retention |
| Topology discovery | ✅ LLDP/CDP-based | ❌ Manual mapping |
| Open source | ✅ Apache 2.0 | ❌ Commercial |

Architecture

┌────────────────────────────────────────────────────────────┐
│                    Overgrowth Pipeline                     │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │   NetBox    │→ │  SuzieQ      │→ │ Drift Detection  │   │
│  │   (SoT)     │  │  Collector   │  │  & Remediation   │   │
│  └─────────────┘  └──────────────┘  └──────────────────┘   │
└────────────────────────────────────────────────────────────┘
         ↓                  ↓                      ↓
    Intended State    Actual State         Drift Analysis

┌────────────────────────────────────────────────────────────┐
│  Stage 7: Observability - Collect actual network state     │
│  Stage 7b: Drift Detection - Compare actual vs intended    │
│  Stage 8: Validation - Auto-remediate approved changes     │
└────────────────────────────────────────────────────────────┘

What SuzieQ Detects

1. Configuration Mismatches

  • Device hostname changes
  • Management IP changes
  • Unexpected device roles (leaf acting as spine)

2. VLAN Drift

  • Missing VLANs: Intended in SoT but not on device
  • Extra VLANs: Present on device but not in SoT
  • VLAN name mismatches
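
The three VLAN checks above reduce to set operations on VLAN IDs. The following is a minimal sketch, assuming the intended and actual VLANs of one device are available as `{vlan_id: name}` dictionaries. The helper name `diff_vlans`, the result keys, and the sample data are illustrative assumptions, not Overgrowth's actual API.

```python
def diff_vlans(intended: dict, actual: dict) -> dict:
    """Compare intended vs actual VLANs for one device (hypothetical helper)."""
    intended_ids, actual_ids = set(intended), set(actual)
    return {
        # Defined in the SoT but absent on the device
        "missing": sorted(intended_ids - actual_ids),
        # Present on the device but absent from the SoT
        "extra": sorted(actual_ids - intended_ids),
        # Present on both sides, but under different names
        "name_mismatch": sorted(
            vid for vid in intended_ids & actual_ids
            if intended[vid] != actual[vid]
        ),
    }


# Made-up sample data for illustration
intended = {10: "Users", 20: "Servers", 99: "Management"}
actual = {10: "Users", 99: "Mgmt", 666: "Rogue"}
result = diff_vlans(intended, actual)
print(result)
```

A real comparison would probably also need an allow-list for VLANs that exist on every device by default (such as VLAN 1), which this sketch does not handle.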

3. IP Address Conflicts

  • Duplicate IPs across devices
  • IP mismatches vs NetBox IPAM
  • Gateway conflicts
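
Duplicate-IP detection can be pictured as grouping observed addresses by IP and flagging any address that shows up on more than one device. This is a sketch under stated assumptions: the `(device, ip)` input shape, the helper name `find_duplicate_ips`, and the sample addresses are hypothetical and not taken from Overgrowth.

```python
from collections import defaultdict


def find_duplicate_ips(assignments):
    """Return {ip: [devices]} for every IP seen on more than one device."""
    owners = defaultdict(set)
    for device, ip in assignments:
        owners[ip].add(device)
    return {ip: sorted(devs) for ip, devs in owners.items() if len(devs) > 1}


# Made-up sample data for illustration
observed = [
    ("leaf-01", "10.0.10.1"),
    ("leaf-02", "10.0.10.1"),  # same address as leaf-01
    ("spine-01", "10.0.0.1"),
]
duplicates = find_duplicate_ips(observed)
print(duplicates)
```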

4. Interface State Drift

  • Interfaces expected UP but actually DOWN
  • Interfaces expected DOWN but actually UP
  • Description mismatches
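
Interface state drift is a per-interface comparison of expected and observed state. The sketch below assumes both sides are available as `{interface: state}` dictionaries. The helper name `interface_drift` and the sample data are assumptions, while the `expected_state` and `actual_state` keys mirror the finding format used in the drift examples in this document.

```python
def interface_drift(expected: dict, actual: dict) -> list:
    """List interfaces whose observed state differs from the intended one."""
    findings = []
    for ifname, want in sorted(expected.items()):
        # An interface absent from the collected data is reported as "missing"
        have = actual.get(ifname, "missing")
        if have != want:
            findings.append({
                "interface": ifname,
                "expected_state": want,
                "actual_state": have,
            })
    return findings


# Made-up sample data for illustration
expected = {"Ethernet1": "up", "Ethernet2": "down", "Ethernet48": "up"}
actual = {"Ethernet1": "up", "Ethernet2": "up", "Ethernet48": "down"}
findings = interface_drift(expected, actual)
for f in findings:
    print(f)
```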

5. Routing Issues

  • BGP neighbor states
  • OSPF adjacency problems
  • Route count anomalies

Usage Examples

Basic Drift Detection

from agent.pipeline_engine import OvergrowthPipeline
from agent.network_model import NetworkModel

# Create pipeline with SuzieQ enabled
pipeline = OvergrowthPipeline()

# Your network model (from NetBox or YAML)
model = NetworkModel(...)

# Stage 7: Collect actual state
obs_results = pipeline.stage7_observability(model)
print(f"Collected state from {obs_results['collection']['devices_polled']} devices")

# Stage 7b: Detect drift
drift_results = pipeline.stage7b_drift_detection(model)

if drift_results['drift_detected']:
    print(f"⚠️  Drift detected! Score: {drift_results['drift_score']:.2f}")
    print("Issues found:")
    print(f"  - Config mismatches: {drift_results['summary']['config_mismatches']}")
    print(f"  - Missing VLANs: {drift_results['summary']['missing_vlans']}")
    print(f"  - Interface issues: {drift_results['summary']['interfaces_down']}")
else:
    print("✓ No drift - network matches SoT")

Auto-Remediation

# Stage 8: Validate and auto-remediate
validation = pipeline.stage8_validation(model)

compliance = validation['compliance_report']
print(f"Compliance Status: {compliance['status']}")
print(f"Drift Score: {compliance['drift_score']:.2f}")

if 'remediation' in validation:
    print(f"Applied {validation['remediation']['applied']} automatic fixes")
    print(f"Skipped {validation['remediation']['skipped']} (require manual approval)")

Direct SuzieQ Client Usage

from agent.suzieq_client import SuzieQClient

# Initialize client
suzieq = SuzieQClient(use_suzieq=True)

# Collect state from devices
devices = [
    {'name': 'leaf-01', 'ip': '10.0.0.11', 'username': 'admin', 'password': 'admin'},
    {'name': 'spine-01', 'ip': '10.0.0.1', 'username': 'admin', 'password': 'admin'}
]

collection = suzieq.collect_network_state(devices)
print(f"Collected from {collection['devices_polled']} devices")

# Get topology
topology = suzieq.get_topology()
print(f"Discovered {len(topology['nodes'])} nodes")
print(f"Found {len(topology['edges'])} LLDP/CDP connections")

# Get VLAN summary
vlans = suzieq.get_vlan_summary()
for device, vlan_list in vlans.items():
    print(f"{device}: {vlan_list}")

# Detect drift
intended_state = {
    'devices': [...],
    'vlans': [...],
    'subnets': [...]
}

drift = suzieq.detect_drift(intended_state)

if drift.has_drift:
    print(f"Drift Score: {drift.drift_score:.2f}")
    print(f"Missing VLANs: {len(drift.missing_vlans)}")
    print(f"Extra VLANs: {len(drift.extra_vlans)}")
    
    # Generate remediation plan
    plan = suzieq.generate_remediation_plan(drift)
    
    for action in plan:
        status = "AUTO-FIX" if action['auto_fix'] else "MANUAL"
        print(f"[{status}] {action['action']} on {action['device']}")
        print(f"  Commands: {action['commands']}")
    
    # Apply auto-approved fixes
    results = suzieq.apply_remediation(plan, auto_approve=True)
    print(f"Applied: {results['applied']}, Skipped: {results['skipped']}")

Remediation Safety

Auto-Fix vs Manual Approval

Overgrowth's SuzieQ client classifies each remediation action by safety. Only actions marked as auto-fix are applied without manual approval:

| Action | Auto-Fix | Reason |
|---|---|---|
| Add missing VLAN | ✅ Yes | Safe - doesn't disrupt traffic |
| Remove extra VLAN | ❌ No | Dangerous - could break connectivity |
| Enable interface | ❌ No | Dangerous - interface may be down intentionally |
| Fix IP mismatch | ✅ Yes | Safe - corrects IPAM drift |
| Update descriptions | ✅ Yes | Safe - cosmetic change |
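
One way to express the table above in code is a lookup from action type to an auto-fix flag and reason, with unknown actions falling back to manual review. This is only a sketch: the names `add_vlan`, `update_description`, and `fix_ip_mismatch` appear elsewhere in this document, but `remove_vlan`, `enable_interface`, `AUTO_FIX_POLICY`, and `classify` are hypothetical names chosen for illustration.

```python
# Policy table mirroring the classification above.
# 'remove_vlan' and 'enable_interface' are assumed action names.
AUTO_FIX_POLICY = {
    "add_vlan": (True, "Safe - doesn't disrupt traffic"),
    "remove_vlan": (False, "Dangerous - could break connectivity"),
    "enable_interface": (False, "Dangerous - interface may be down intentionally"),
    "fix_ip_mismatch": (True, "Safe - corrects IPAM drift"),
    "update_description": (True, "Safe - cosmetic change"),
}


def classify(action: str) -> dict:
    """Attach auto-fix status to an action; unknown actions need approval."""
    auto_fix, reason = AUTO_FIX_POLICY.get(
        action, (False, "Unknown action - manual review required")
    )
    return {"action": action, "auto_fix": auto_fix, "reason": reason}


print(classify("add_vlan"))
print(classify("reboot_device"))  # not in the policy, so not auto-fixed
```

Defaulting to manual review means that a new action type cannot be applied automatically until someone adds it to the policy on purpose.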

Approval Workflow

# Get remediation plan
plan = suzieq.generate_remediation_plan(drift)

# Filter by auto-fix status
auto_fixes = [a for a in plan if a['auto_fix']]
manual_review = [a for a in plan if not a['auto_fix']]

print(f"Auto-fix ready: {len(auto_fixes)}")
print(f"Require approval: {len(manual_review)}")

# Apply only auto-approved
suzieq.apply_remediation(plan, auto_approve=True)

# For manual items, integrate with ticketing system
for action in manual_review:
    # Create Jira ticket, ServiceNow change request, etc.
    create_change_request(
        title=f"Fix {action['action']} on {action['device']}",
        commands=action['commands'],
        reason=action['reason']
    )

Installation

Option 1: Mock Mode (Default)

No installation required! Overgrowth includes mock SuzieQ for testing:

suzieq = SuzieQClient(use_suzieq=True)
# Automatically uses mock mode if suzieq not installed

Mock mode simulates:

  • State collection from devices
  • Topology discovery
  • Drift detection with heuristic rules
  • Remediation plan generation
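
The automatic fallback to mock mode could be implemented by checking whether the `suzieq` package is importable. The sketch below illustrates that pattern with a stand-in class. `ClientSketch` and `suzieq_available` are hypothetical names, and the real `SuzieQClient` may decide differently.

```python
import importlib.util


def suzieq_available() -> bool:
    """True if the suzieq package can be imported in this environment."""
    return importlib.util.find_spec("suzieq") is not None


class ClientSketch:
    """Stand-in for SuzieQClient, showing only the mock-mode decision."""

    def __init__(self, use_suzieq: bool = True):
        # Use the real backend only if requested AND installed
        self.mock_mode = not (use_suzieq and suzieq_available())


client = ClientSketch(use_suzieq=True)
print(f"Mock mode: {client.mock_mode}")
```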

Option 2: Real SuzieQ

Install SuzieQ for production use:

# Install SuzieQ
pip install suzieq

# Verify installation
suzieq-cli --help

# Create SuzieQ directory
mkdir -p ~/.suzieq/parquet

Configure SuzieQ inventory (~/.suzieq/inventory.yml):

sources:
  - name: overgrowth
    hosts:
      - url: ssh://[email protected]
        devtype: eos
        
      - url: ssh://[email protected]
        devtype: eos

Start SuzieQ poller:

suzieq-poller -I ~/.suzieq/inventory.yml -d ~/.suzieq/parquet

Configuration

SuzieQ Client Options

from pathlib import Path
from agent.suzieq_client import SuzieQClient

# Custom data directory
suzieq = SuzieQClient(
    suzieq_dir=Path("/opt/suzieq/data"),
    use_suzieq=True
)

# Collect with custom namespace
suzieq.collect_network_state(
    devices=[...],
    namespace="production"  # vs "staging", "lab", etc.
)

# Query specific namespace
topology = suzieq.get_topology(namespace="production")

Drift Tolerance

Adjust drift score threshold in stage8_validation():

# Default: 20% drift allowed
results['validation_passed'] = drift_score < 0.2

# Stricter: 10% drift
results['validation_passed'] = drift_score < 0.1

# Looser: 30% drift
results['validation_passed'] = drift_score < 0.3

Drift score calculation:

drift_score = total_drift_items / (devices_checked * expected_resources)

Examples:
- 0.0 = Perfect match
- 0.15 = Minor drift (2-3 VLANs missing)
- 0.5 = Moderate drift (half of config missing)
- 1.0 = Complete drift (nothing matches)
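
As a worked example of the formula above, the following sketch computes the score and applies the default 0.2 threshold. It assumes `expected_resources` means the number of intended resources per device. The zero-denominator guard and the clamp to 1.0 are assumptions added here and are not confirmed behavior of `stage8_validation()`.

```python
def drift_score(total_drift_items: int, devices_checked: int,
                expected_resources: int) -> float:
    """drift_score = total_drift_items / (devices_checked * expected_resources)."""
    denominator = devices_checked * expected_resources
    if denominator <= 0:
        # Nothing was checked, so there is nothing to compare (assumption)
        return 0.0
    # Clamp so that extra, unexpected items cannot push the score above 1.0 (assumption)
    return min(1.0, total_drift_items / denominator)


# 2 devices with 10 expected resources each, 3 drift items found
score = drift_score(total_drift_items=3, devices_checked=2, expected_resources=10)
validation_passed = score < 0.2
print(f"Drift score: {score:.2f}, validation passed: {validation_passed}")
```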

Integration with Pipeline

Stage 7: Observability

Collects actual network state via SuzieQ:

  • Device inventory
  • Interface states
  • VLAN configurations
  • IP addressing
  • Routing protocol status
  • Topology via LLDP/CDP

obs_result = pipeline.stage7_observability(model)
# Returns: collection stats, topology, VLAN summary

Stage 7b: Drift Detection

Compares actual vs intended (NetBox SoT):

  • Config mismatches
  • Missing/extra VLANs
  • IP conflicts
  • Interface state drift
  • Routing issues

drift_result = pipeline.stage7b_drift_detection(model)
# Returns: drift score, detailed findings, remediation plan

Stage 8: Validation & Remediation

Validates network compliance and auto-remediates:

  • Generates compliance report
  • Applies auto-approved fixes
  • Queues manual approval items
  • Re-checks drift after remediation

val_result = pipeline.stage8_validation(model)
# Returns: validation status, compliance report, remediation results

Drift Detection Examples

Example 1: Missing VLAN

Intended (NetBox):

vlans:
  - id: 10
    name: Users
  - id: 20
    name: Servers
  - id: 99
    name: Management

Actual (Device):

show vlan brief
VLAN Name                             Status    Ports
---- -------------------------------- --------- ------
1    default                          active    
10   Users                            active    Et1-10
99   Management                       active    Et48

Drift Detected:

{
  "missing_vlans": [{
    "device": "leaf-01",
    "vlan_id": 20,
    "vlan_name": "Servers",
    "severity": "ERROR"
  }]
}

Remediation:

! Auto-fix: Add missing VLAN
vlan 20
  name Servers
exit
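
A finding like the one above could be turned into a remediation action roughly as follows. This is a sketch, assuming the finding carries `device`, `vlan_id`, and `vlan_name` as in the JSON shown. The helper name `missing_vlan_action` is hypothetical, while the action keys (`action`, `device`, `commands`, `auto_fix`) follow the plan entries used in the usage examples.

```python
def missing_vlan_action(finding: dict) -> dict:
    """Build an auto-fixable remediation action for a missing-VLAN finding."""
    return {
        "action": "add_vlan",
        "device": finding["device"],
        "commands": [
            f"vlan {finding['vlan_id']}",
            f"name {finding['vlan_name']}",
            "exit",
        ],
        # Adding a VLAN is classified as safe, so it can be applied automatically
        "auto_fix": True,
    }


finding = {"device": "leaf-01", "vlan_id": 20, "vlan_name": "Servers", "severity": "ERROR"}
action = missing_vlan_action(finding)
print(action["commands"])
```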

Example 2: Extra VLAN (Shadow IT)

Intended: VLANs 10, 20, 99
Actual: VLANs 10, 20, 99, 666 (unauthorized)

Drift Detected:

{
  "extra_vlans": [{
    "device": "leaf-01",
    "vlan_id": 666,
    "severity": "WARNING"
  }]
}

Remediation:

! Manual approval required - could disrupt traffic
no vlan 666

Example 3: Interface Down

Intended: All uplinks should be UP
Actual: Ethernet48 is DOWN

Drift Detected:

{
  "interface_down": [{
    "device": "leaf-01",
    "interface": "Ethernet48",
    "expected_state": "up",
    "actual_state": "down",
    "severity": "WARNING"
  }]
}

Remediation:

! Manual approval - verify interface should be up
interface Ethernet48
  no shutdown
exit

Troubleshooting

Mock Mode vs Real Mode

Check if SuzieQ is installed:

suzieq = SuzieQClient(use_suzieq=True)
print(f"Mock mode: {suzieq.mock_mode}")

# Expected output when suzieq is not installed:
# WARNING: suzieq not installed - using mock mode
# Mock mode: True

SuzieQ Not Collecting Data

  1. Check SSH connectivity:

ssh [email protected]

  2. Verify inventory:

cat ~/.suzieq/inventory.yml

  3. Check poller logs:

tail -f ~/.suzieq/suzieq-poller.log

  4. Test with CLI:

suzieq-cli
device show

Drift Detection Returns Empty

Cause: SuzieQ hasn't collected data yet

Solution: Run initial collection

# Run a single collection pass, then exit
suzieq-poller -I ~/.suzieq/inventory.yml -d ~/.suzieq/parquet --run-once

Auto-Fix Not Working

Cause: auto_approve=False (default)

Solution:

# Enable auto-approval
results = suzieq.apply_remediation(plan, auto_approve=True)

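# Or apply manually via Netmiko
from netmiko import ConnectHandler

for action in plan:
    if action['auto_fix']:
        # device_type must match the target platform (e.g. 'arista_eos' for EOS)
        with ConnectHandler(
            device_type='cisco_ios',
            host=action['device'],  # must be a resolvable hostname or an IP
            username='admin',
            password='admin'
        ) as conn:
            conn.send_config_set(action['commands'])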

Performance

Collection Frequency

SuzieQ poller intervals:

  • Lab: Every 1 minute (rapid testing)
  • Staging: Every 5 minutes (drift detection)
  • Production: Every 15 minutes (capacity planning)

Data Retention

SuzieQ stores data in Parquet files:

# Check storage usage
du -sh ~/.suzieq/parquet

# Clean up data files older than 30 days (files only, keep the directory tree)
find ~/.suzieq/parquet -type f -mtime +30 -delete

Drift Detection Performance

| Network Size | Devices | Drift Check Time |
|---|---|---|
| Small | 1-10 | < 1 second |
| Medium | 10-100 | 1-5 seconds |
| Large | 100-500 | 5-15 seconds |
| Enterprise | 500+ | 15-60 seconds |

Best Practices

1. Use Namespaces

Separate environments:

# Production namespace
suzieq.collect_network_state(devices, namespace="production")

# Staging namespace
suzieq.collect_network_state(devices, namespace="staging")

2. Schedule Regular Drift Checks

#!/bin/bash
# Cron job: check drift every hour
cd /opt/overgrowth
source venv/bin/activate
python -c "
from agent.pipeline_engine import OvergrowthPipeline
from agent.network_model import NetworkModel
pipeline = OvergrowthPipeline()
model = NetworkModel.from_yaml('network.yaml')
drift = pipeline.stage7b_drift_detection(model)
if drift['drift_detected']:
    print(f'ALERT: Drift score {drift[\"drift_score\"]:.2f}')
    # Send alert to Slack/PagerDuty
"

3. Auto-Fix Low-Risk Changes

# Safe changes: Add VLANs, update descriptions
auto_fix_actions = ['add_vlan', 'update_description', 'fix_ip_mismatch']

# Apply only safe actions
safe_plan = [a for a in plan if a['action'] in auto_fix_actions]
suzieq.apply_remediation(safe_plan, auto_approve=True)

# Manual review for everything else
manual_plan = [a for a in plan if a['action'] not in auto_fix_actions]
notify_team(manual_plan)

4. Track Drift Over Time

from datetime import datetime

# Log drift history
drift_log = {
    'timestamp': datetime.now().isoformat(),
    'drift_score': drift.drift_score,
    'devices_checked': drift.devices_checked,
    'issues': {
        'config_mismatches': len(drift.config_mismatches),
        'missing_vlans': len(drift.missing_vlans),
        'extra_vlans': len(drift.extra_vlans)
    }
}

# Store in database or CSV
append_to_history(drift_log)

# Alert if drift is rising quickly; previous_score is the score from the last logged run
if drift.drift_score > previous_score * 1.5:
    alert("Drift increasing rapidly!")
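
The snippet above leaves the history storage and the previous score undefined. One possible shape for them, sketched with a CSV file, is shown below. The helper names, the extra `path` argument, the reduced set of columns, and the sample values are all assumptions for illustration, and the 1.5 factor is taken from the check above.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional

FIELDS = ["timestamp", "drift_score", "devices_checked"]


def append_to_history(path: Path, entry: dict) -> None:
    """Append one drift record to a CSV file, writing the header on first use."""
    is_new = not path.exists()
    with path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({key: entry[key] for key in FIELDS})


def last_score(path: Path) -> Optional[float]:
    """Return the most recently logged drift score, or None if there is none."""
    if not path.exists():
        return None
    with path.open(newline="") as fh:
        rows = list(csv.DictReader(fh))
    return float(rows[-1]["drift_score"]) if rows else None


def is_rising_fast(current: float, previous: Optional[float],
                   factor: float = 1.5) -> bool:
    """True if the current score exceeds the previous one by the given factor."""
    return previous is not None and previous > 0 and current > previous * factor


# Made-up sample values; a temporary directory keeps the example self-contained
history = Path(tempfile.mkdtemp()) / "drift_history.csv"
append_to_history(history, {
    "timestamp": datetime.now().isoformat(),
    "drift_score": 0.10,
    "devices_checked": 12,
})
previous = last_score(history)
print(is_rising_fast(0.18, previous))
```

Reading the previous score before appending the current one keeps the comparison between two distinct runs.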

Future Enhancements

Planned Features

  • StackStorm Integration: Event-driven auto-remediation when drift detected
  • RAG-based Learning: Learn from past drift incidents to prevent recurrence
  • Change Correlation: Link drift events to recent changes (Git, tickets)
  • Predictive Drift: ML model to predict drift before it happens
  • Multi-Region Sync: Ensure consistency across global deployments

Community Contributions

See CONTRIBUTING.md for how to add:

  • New drift detection rules
  • Additional remediation actions
  • Custom compliance policies
  • Integration with other observability tools

Support

Questions? Issues? Contributions?