
RAG-based Incident Learning System

Overview

Overgrowth's incident learning system captures deployment failures and network incidents, performs root cause analysis using RAG (Retrieval-Augmented Generation), and automatically generates regression tests to prevent recurrence. This creates a continuous learning loop where every failure makes the system smarter.

The Learning Loop

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  1. Incident Occurs                                          β”‚
β”‚     β€’ Deployment failure                                     β”‚
β”‚     β€’ Validation error                                       β”‚
β”‚     β€’ Network outage                                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  2. Capture & Store                                          β”‚
β”‚     β€’ Incident details                                       β”‚
β”‚     β€’ Network model                                          β”‚
β”‚     β€’ Validation errors                                      β”‚
β”‚     β€’ Affected devices                                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  3. RAG Analysis (Vector Search)                             β”‚
β”‚     β€’ Search for similar historical incidents                β”‚
β”‚     β€’ Extract common patterns                                β”‚
β”‚     β€’ Suggest root cause with confidence score               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  4. Generate Regression Test                                 β”‚
β”‚     β€’ pyATS test for routing issues                          β”‚
β”‚     β€’ pytest for config validation                           β”‚
β”‚     β€’ Prevent same issue from recurring                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  5. Update Knowledge Base                                    β”‚
β”‚     β€’ Add to vector database                                 β”‚
β”‚     β€’ Update LLM prompts                                     β”‚
β”‚     β€’ Improve future predictions                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Core Components

1. Incident Database

Local JSON database with optional ChromaDB vector search:

from agent.incident_learning import IncidentDatabase, Incident

# Initialize database
db = IncidentDatabase()

# Create incident
incident = Incident(
    id="deploy-20251125-120000",
    timestamp="2025-11-25T12:00:00",
    severity="high",  # critical, high, medium, low
    category="deployment_failure",
    description="VLAN 100 duplicate configuration",
    affected_devices=["leaf-01", "leaf-02"],
    network_model={...},  # Full network model
    validation_errors=[...]  # Errors encountered
)

# Store incident
db.add_incident(incident)

# Search for similar incidents
similar = db.search_similar("VLAN duplicate error", n_results=5)

# Get all deployment failures
failures = db.get_all_incidents(category="deployment_failure")
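The examples above imply a simple record type. A minimal sketch of what Incident might look like, with the field set inferred from the constructor call and the workflow sections below (this is an illustration, not the module's actual definition, and not exhaustive):

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Incident:
    """Sketch of an incident record; fields inferred from the examples in this doc."""
    id: str
    timestamp: str
    severity: str                           # critical, high, medium, low
    category: str                           # e.g. deployment_failure
    description: str
    affected_devices: list = field(default_factory=list)
    network_model: dict = field(default_factory=dict)
    validation_errors: list = field(default_factory=list)
    root_cause: Optional[str] = None        # filled in by the learning workflow
    regression_test: Optional[str] = None   # generated test code, once available
    resolution: Optional[str] = None        # set when the incident is resolved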

2. Root Cause Analyzer

Uses RAG to find patterns and suggest root causes:

from agent.incident_learning import RootCauseAnalyzer

analyzer = RootCauseAnalyzer(incident_db)

# Analyze incident
analysis = analyzer.analyze(incident)

print(f"Suggested root cause: {analysis['suggested_root_cause']}")
print(f"Confidence: {analysis['confidence']:.2f}")
print(f"Similar incidents: {len(analysis['similar_incidents'])}")
print(f"Patterns found: {analysis['patterns_found']}")

Analysis Output:

{
  "suggested_root_cause": "Schema validation missed duplicate VLAN IDs. Common pattern: pre-flight checks need stricter VLAN uniqueness validation",
  "similar_incidents": ["deploy-20251120-100000", "deploy-20251118-143000"],
  "patterns_found": [
    "Common root cause: Schema validation missed duplicate VLAN IDs",
    "Commonly affected devices: leaf-01, leaf-02, leaf-03",
    "Common category: deployment_failure"
  ],
  "confidence": 0.85
}
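The patterns_found entries above can be produced with simple frequency counting over the retrieved neighbors. A minimal sketch, assuming each similar incident carries the fields shown earlier (an illustration, not the analyzer's actual code):

from collections import Counter

def extract_patterns(similar_incidents):
    """Derive 'Common ...' pattern strings from a list of similar incidents."""
    patterns = []
    # Most frequent known root cause among the neighbors
    causes = Counter(i.root_cause for i in similar_incidents if i.root_cause)
    if causes:
        patterns.append(f"Common root cause: {causes.most_common(1)[0][0]}")
    # Devices that show up repeatedly across incidents
    devices = Counter(d for i in similar_incidents for d in i.affected_devices)
    repeat = [d for d, n in devices.most_common(3) if n > 1]
    if repeat:
        patterns.append(f"Commonly affected devices: {', '.join(repeat)}")
    # Dominant category
    cats = Counter(i.category for i in similar_incidents)
    if cats:
        patterns.append(f"Common category: {cats.most_common(1)[0][0]}")
    return patterns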

3. Regression Test Generator

Automatically generates pyATS/pytest tests:

from pathlib import Path
from agent.incident_learning import RegressionTestGenerator

generator = RegressionTestGenerator()

# Generate test for incident
test_code = generator.generate_test(incident)

# Save test
test_path = Path("tests/regression") / f"test_{incident.id}.py"
test_path.write_text(test_code)

Generated Test Example:

"""
Regression test for incident deploy-20251125-120000
VLAN 100 duplicate configuration
Generated: 2025-11-25T12:30:00
"""
from pyats import aetest

class TestDeploy20251125120000(aetest.Testcase):
    """Prevent recurrence of VLAN duplicate configuration"""
    
    @aetest.setup
    def setup(self, testbed):
        self.devices = {}
        for device_name in ['leaf-01', 'leaf-02']:
            device = testbed.devices[device_name]
            device.connect()
            self.devices[device_name] = device
    
    @aetest.test
    def verify_vlan_uniqueness(self):
        """Verify no duplicate VLAN IDs"""
        all_vlans = {}
        
        for device_name, device in self.devices.items():
            output = device.execute('show vlan brief')
            vlans = parse_vlans(output)
            
            for vlan_id in vlans:
                if vlan_id in all_vlans:
                    self.failed(f"Duplicate VLAN {vlan_id} on {device_name} and {all_vlans[vlan_id]}")
                all_vlans[vlan_id] = device_name
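The generated test calls a parse_vlans helper that isn't shown above. A minimal regex-based sketch of what such a helper could look like (the real generator may emit something platform-aware):

import re

def parse_vlans(output):
    """Extract VLAN IDs from 'show vlan brief' output (naive line-based parse)."""
    vlan_ids = []
    for line in output.splitlines():
        match = re.match(r'^\s*(\d+)\s', line)
        if match:
            vlan_ids.append(int(match.group(1)))
    return vlan_ids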

Usage Examples

Automatic Incident Capture

Incidents are automatically captured when validation fails:

from agent.pipeline_engine import OvergrowthPipeline

pipeline = OvergrowthPipeline()

# Run pre-flight validation
results = pipeline.stage0_preflight(network_model)

# If validation fails, incident is automatically captured
if not results['ready_to_deploy']:
    print("Validation failed - incident captured for learning")
    
    # View recent incidents
    incidents = pipeline.incident_db.get_all_incidents(limit=5)
    for inc in incidents:
        print(f"{inc.id}: {inc.description}")

Manual Incident Capture

For incidents outside the pipeline:

from agent.incident_learning import capture_deployment_failure

# Capture deployment failure
incident = capture_deployment_failure(
    description="BGP peering failed - incorrect AS number",
    network_model=model.to_dict(),
    validation_errors=[
        {"error": "BGP AS mismatch", "type": "routing"}
    ],
    affected_devices=["spine-01", "spine-02"]
)

print(f"Captured incident: {incident.id}")

Complete Learning Workflow

from agent.incident_learning import learn_from_incident

# Full learning cycle: analyze β†’ test β†’ update
learnings = learn_from_incident(incident)

print(f"Root cause: {learnings['root_cause']}")
print(f"Regression test: {learnings['regression_test']}")
print(f"Confidence: {learnings['confidence']:.2f}")

# Incident is updated with learnings
updated = db.get_incident(incident.id)
assert updated.root_cause is not None
assert updated.regression_test is not None

Batch Learning from Recent Incidents

# Analyze last 10 incidents
learnings = pipeline.learn_from_recent_incidents(limit=10)

print(f"Total incidents: {learnings['total_incidents']}")
print(f"Unresolved: {learnings['unresolved']}")
print(f"Analyzed: {learnings['analyzed']}")

# Each learning includes regression test
for learning in learnings['learnings']:
    print(f"  {learning['incident_id']}: {learning['root_cause']}")

Incident Categories

Deployment Failures

  • Pre-flight validation errors
  • Config generation failures
  • Deployment script errors
  • Syntax errors

Configuration Errors

  • Invalid parameters
  • Duplicate IDs (VLANs, IPs, etc.)
  • Reference errors
  • Constraint violations

Routing Issues

  • Routing loops
  • BGP/OSPF misconfigurations
  • Missing routes
  • Blackholes

Network Outages

  • Link failures
  • Device failures
  • Cascading failures
  • Service disruptions
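Only deployment_failure appears verbatim in the examples above; the other category strings below are assumptions for illustration. Counting incidents per category is a short loop over the database:

# Category strings other than "deployment_failure" are assumed, not confirmed
CATEGORIES = ["deployment_failure", "configuration_error",
              "routing_issue", "network_outage"]

for category in CATEGORIES:
    incidents = db.get_all_incidents(category=category)
    print(f"{category}: {len(incidents)} incidents")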

Confidence Scoring

The RCA analyzer calculates confidence based on:

| Factor                  | Weight | Example            |
|-------------------------|--------|--------------------|
| Similar incidents found | 40%    | 3+ similar = +0.4  |
| Resolved incidents      | 30%    | 2+ resolved = +0.3 |
| Patterns extracted      | 30%    | 2+ patterns = +0.3 |

Confidence Levels:

  • 0.7 - 1.0: High confidence - safe to auto-apply learnings
  • 0.4 - 0.7: Medium confidence - review before applying
  • 0.0 - 0.4: Low confidence - requires manual analysis
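Put together, the scoring amounts to a few threshold checks. A sketch assuming the thresholds in the table are hard cutoffs (the analyzer may instead scale within each band), and treating an incident with a recorded root_cause as resolved, as the report example later in this doc does:

def confidence_score(similar_incidents, patterns):
    """Weighted confidence per the table above: 0.4 + 0.3 + 0.3 = 1.0 max."""
    score = 0.0
    if len(similar_incidents) >= 3:     # similar incidents found (40%)
        score += 0.4
    resolved = [i for i in similar_incidents if i.root_cause]
    if len(resolved) >= 2:              # resolved incidents (30%)
        score += 0.3
    if len(patterns) >= 2:              # patterns extracted (30%)
        score += 0.3
    return score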

Vector Search with ChromaDB

Installation

# Install ChromaDB for vector search
pip install chromadb

# Verify
python -c "import chromadb; print('ChromaDB installed')"

Benefits Over Keyword Search

| Feature             | ChromaDB                                | Keyword Search              |
|---------------------|-----------------------------------------|-----------------------------|
| Semantic similarity | ✅ Finds conceptually similar incidents | ❌ Exact keyword match only |
| Typo tolerance      | ✅ Handles misspellings                 | ❌ Requires exact match     |
| Context-aware       | ✅ Understands intent                   | ❌ Literal matching         |
| Performance         | ✅ Fast vector search                   | ⚠️ Linear scan              |

Example: Semantic vs Keyword

Query: "BGP session won't come up"

ChromaDB finds:

  • "BGP neighbor not establishing"
  • "BGP peering failed"
  • "Routing protocol adjacency issue"

Keyword search finds:

  • Only exact matches with "BGP session"
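Under the hood, wiring this up takes only a few calls to the standard chromadb client API. A minimal sketch; the collection name and storage path are assumptions for illustration:

import chromadb

# Persistent client; the storage path is an assumption
client = chromadb.PersistentClient(path=".overgrowth/chroma")
collection = client.get_or_create_collection("incidents")

# Index an incident: ChromaDB embeds the document text automatically
collection.add(
    ids=[incident.id],
    documents=[incident.description],
    metadatas=[{"category": incident.category, "severity": incident.severity}],
)

# Semantic query: matches "BGP neighbor not establishing" etc., not just keywords
results = collection.query(query_texts=["BGP session won't come up"], n_results=5)
print(results["ids"][0])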

Integration with Pipeline

Stage 0: Pre-flight Validation

Incidents automatically captured on validation failures:

# Pipeline captures incident when validation fails
results = pipeline.stage0_preflight(model)

if not results['ready_to_deploy']:
    # Incident created with:
    # - Validation errors
    # - Network model
    # - Affected devices
    # - Timestamp
    pass
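Internally, the capture step amounts to calling the helper shown earlier with the failing validation context. A sketch, not the pipeline's exact code; the results key names are assumptions:

# Sketch of what the pipeline does on failure; key names are assumptions
if not results['ready_to_deploy']:
    incident = capture_deployment_failure(
        description="Pre-flight validation failed",
        network_model=model.to_dict(),
        validation_errors=results.get('errors', []),
        affected_devices=results.get('affected_devices', []),
    )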

Background Learning Job

In production, run learning as a background job:

#!/bin/bash
# Cron job: learn from incidents every hour
cd /opt/overgrowth
source venv/bin/activate

python -c "
from agent.pipeline_engine import OvergrowthPipeline

pipeline = OvergrowthPipeline()
learnings = pipeline.learn_from_recent_incidents(limit=20)

if learnings['analyzed'] > 0:
    print(f'Analyzed {learnings[\"analyzed\"]} incidents')
    print(f'Generated {learnings[\"analyzed\"]} regression tests')
"

Continuous Improvement Loop

1. Deploy β†’ Fail β†’ Capture incident
2. Analyze β†’ Find root cause
3. Generate test β†’ Prevent recurrence
4. Next deploy β†’ Test catches issue
5. Fix β†’ Deploy succeeds
6. Knowledge updated β†’ Future deploys smarter

Regression Test Types

pyATS Tests (Network Validation)

For routing issues, interface states, protocol validation:

# Generated for routing incidents (illustrative)
from pyats import aetest

class TestRoutingLoop(aetest.Testcase):
    @aetest.test
    def verify_no_routing_loops(self, testbed):
        for device_name in affected_devices:  # device names from the incident
            device = testbed.devices[device_name]
            routes = device.parse('show ip route')
            # no_loops_detected is a helper emitted alongside the generated test
            assert no_loops_detected(routes)

pytest Tests (Config Validation)

For schema errors, policy violations, syntax issues:

# Generated for config incidents
def test_vlan_uniqueness():
    """Prevent duplicate VLAN IDs"""
    model = load_network_model()
    vlan_ids = [v['id'] for v in model['vlans']]
    
    # Check for duplicates
    assert len(vlan_ids) == len(set(vlan_ids)), "Duplicate VLAN IDs found"
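load_network_model above is assumed rather than shown; a minimal JSON-file loader could stand in for it (the file path is a placeholder):

import json
from pathlib import Path

def load_network_model(path="network_model.json"):
    """Hypothetical loader; real tests would read the pipeline's model store."""
    return json.loads(Path(path).read_text())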

Knowledge Base Updates

LLM Prompt Updates

Based on learnings, update system prompts:

# Before learning
"Generate network configurations ensuring basic syntax validity"

# After 5 VLAN duplicate incidents
"Generate network configurations. CRITICAL: Ensure VLAN IDs are unique across all devices. Check for duplicates before generating configs. This is a common failure point."

Policy Engine Updates

Add new rules based on incidents:

# After learning from incident
class NetworkPolicy:
    def check_vlan_uniqueness(self, model):
        """Added after incident: deploy-20251125-120000"""
        vlan_ids = [v['id'] for v in model['vlans']]
        duplicates = sorted({v for v in vlan_ids if vlan_ids.count(v) > 1})
        
        if duplicates:
            self.add_violation(
                severity='ERROR',
                message=f"Duplicate VLAN IDs: {duplicates}",
                learned_from='deploy-20251125-120000'
            )

Query Examples

Find Incidents by Pattern

# Find all BGP-related incidents
bgp_incidents = db.search_similar("BGP peering routing protocol", n_results=10)

# Find VLAN issues
vlan_incidents = db.search_similar("VLAN configuration duplicate", n_results=10)

# Find recent critical incidents
critical = db.get_all_incidents(severity="critical", limit=20)

Analyze Incident Trends

# Get incidents from last 30 days
from datetime import datetime, timedelta

all_incidents = db.get_all_incidents(limit=1000)
recent = [
    i for i in all_incidents 
    if datetime.fromisoformat(i.timestamp) > datetime.now() - timedelta(days=30)
]

# Group by category
from collections import Counter
categories = Counter(i.category for i in recent)

print("Incident trends:")
for category, count in categories.most_common():
    print(f"  {category}: {count}")

Root Cause Analysis Report

# Generate root cause analysis report
incidents = db.get_all_incidents(limit=50)
analyzer = RootCauseAnalyzer(db)

report = []
for incident in incidents:
    if not incident.root_cause:  # Unresolved
        analysis = analyzer.analyze(incident)
        
        report.append({
            'incident': incident.id,
            'description': incident.description,
            'suggested_cause': analysis['suggested_root_cause'],
            'confidence': analysis['confidence'],
            'similar_count': len(analysis['similar_incidents'])
        })

# Sort by confidence
report.sort(key=lambda x: x['confidence'], reverse=True)

for item in report[:10]:
    print(f"{item['incident']}: {item['suggested_cause']} (confidence: {item['confidence']:.2f})")

Best Practices

1. Capture Rich Context

# Good: Includes full context
incident = Incident(
    description="Deployment failed: duplicate VLAN 100",
    network_model=model.to_dict(),  # Full model
    validation_errors=errors,  # All errors
    affected_devices=["leaf-01", "leaf-02"],  # Specific devices
    config_changes=[...]  # What changed
)

# Bad: Minimal context
incident = Incident(
    description="Deployment failed"
)

2. Resolve Incidents

from datetime import datetime

# Update with resolution
db.update_incident(incident.id, {
    'root_cause': 'Schema validation missed duplicate check',
    'resolution': 'Added VLAN uniqueness validator',
    'resolved_at': datetime.now().isoformat()
})

# Resolved incidents improve future analysis

3. Run Regression Tests

# Add generated tests to CI/CD
tests/
  regression/
    test_deploy_20251125_120000.py  # Auto-generated
    test_routing_20251120_100000.py
    test_vlan_20251118_143000.py

# Run before each deployment
pytest tests/regression/ --tb=short

4. Review Learnings

# Weekly review of learnings
learnings = pipeline.learn_from_recent_incidents(limit=50)

for learning in learnings['learnings']:
    if learning['confidence'] > 0.7:
        print(f"High confidence learning:")
        print(f"  Incident: {learning['incident_id']}")
        print(f"  Root cause: {learning['root_cause']}")
        print(f"  Test: {learning['regression_test']}")

Troubleshooting

ChromaDB Not Installing

# If ChromaDB fails to install
pip install chromadb --no-deps
pip install onnxruntime pydantic-settings

# Or use mock mode (automatic fallback)
db = IncidentDatabase()
# Will use keyword search instead of vector search

Incident Database Corruption

# Backup incidents
cp ~/.overgrowth/incidents/incidents.json ~/incidents_backup.json

# Reset database
rm -rf ~/.overgrowth/incidents/

# Restore from backup
mkdir -p ~/.overgrowth/incidents
cp ~/incidents_backup.json ~/.overgrowth/incidents/incidents.json

Low Confidence Scores

Causes:

  • Few historical incidents
  • No resolved incidents
  • No similar patterns

Solutions:

  1. Manually resolve incidents with root causes
  2. Add more context to incident descriptions
  3. Wait for more incidents to build history
  4. Use LLM for better analysis

Future Enhancements

Planned Features

  • LLM Integration: Claude/GPT-4 for advanced root cause analysis
  • Automated Fix Generation: AI-generated config fixes
  • Incident Clustering: Group related incidents automatically
  • Predictive Alerts: Warn before incidents occur
  • Multi-tenant: Separate incident databases per environment

Community Contributions

See CONTRIBUTING.md for:

  • Adding new incident categories
  • Improving root cause heuristics
  • Custom regression test templates
  • Integration with monitoring tools (Prometheus, Grafana)

Support

Questions? Found a bug? Want to contribute? See CONTRIBUTING.md.