
RAG-based Incident Learning System

Overview

Overgrowth's incident learning system captures deployment failures and network incidents, performs root cause analysis using RAG (Retrieval-Augmented Generation), and automatically generates regression tests to prevent recurrence. This creates a continuous learning loop where every failure makes the system smarter.

The Learning Loop

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  1. Incident Occurs                                          β”‚
β”‚     β€’ Deployment failure                                     β”‚
β”‚     β€’ Validation error                                       β”‚
β”‚     β€’ Network outage                                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  2. Capture & Store                                          β”‚
β”‚     β€’ Incident details                                       β”‚
β”‚     β€’ Network model                                          β”‚
β”‚     β€’ Validation errors                                      β”‚
β”‚     β€’ Affected devices                                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  3. RAG Analysis (Vector Search)                             β”‚
β”‚     β€’ Search for similar historical incidents                β”‚
β”‚     β€’ Extract common patterns                                β”‚
β”‚     β€’ Suggest root cause with confidence score               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  4. Generate Regression Test                                 β”‚
β”‚     β€’ pyATS test for routing issues                          β”‚
β”‚     β€’ pytest for config validation                           β”‚
β”‚     β€’ Prevent same issue from recurring                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  5. Update Knowledge Base                                    β”‚
β”‚     β€’ Add to vector database                                 β”‚
β”‚     β€’ Update LLM prompts                                     β”‚
β”‚     β€’ Improve future predictions                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Core Components

1. Incident Database

Local JSON database with optional ChromaDB vector search:

from agent.incident_learning import IncidentDatabase, Incident

# Initialize database
db = IncidentDatabase()

# Create incident
incident = Incident(
    id="deploy-20251125-120000",
    timestamp="2025-11-25T12:00:00",
    severity="high",  # critical, high, medium, low
    category="deployment_failure",
    description="VLAN 100 duplicate configuration",
    affected_devices=["leaf-01", "leaf-02"],
    network_model={...},  # Full network model
    validation_errors=[...]  # Errors encountered
)

# Store incident
db.add_incident(incident)

# Search for similar incidents
similar = db.search_similar("VLAN duplicate error", n_results=5)

# Get all deployment failures
failures = db.get_all_incidents(category="deployment_failure")
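The examples above imply a simple record type. A minimal sketch of what Incident might look like, with the field set inferred from the constructor call and the workflow sections below (this is an illustration, not the module's actual definition, and not exhaustive):

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Incident:
    """Sketch of an incident record; fields inferred from the examples in this doc."""
    id: str
    timestamp: str
    severity: str                           # critical, high, medium, low
    category: str                           # e.g. deployment_failure
    description: str
    affected_devices: list = field(default_factory=list)
    network_model: dict = field(default_factory=dict)
    validation_errors: list = field(default_factory=list)
    root_cause: Optional[str] = None        # filled in by the learning workflow
    regression_test: Optional[str] = None   # generated test code, once available
    resolution: Optional[str] = None        # set when the incident is resolved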

2. Root Cause Analyzer

Uses RAG to find patterns and suggest root causes:

from agent.incident_learning import RootCauseAnalyzer

analyzer = RootCauseAnalyzer(incident_db)

# Analyze incident
analysis = analyzer.analyze(incident)

print(f"Suggested root cause: {analysis['suggested_root_cause']}")
print(f"Confidence: {analysis['confidence']:.2f}")
print(f"Similar incidents: {len(analysis['similar_incidents'])}")
print(f"Patterns found: {analysis['patterns_found']}")

Analysis Output:

{
  "suggested_root_cause": "Schema validation missed duplicate VLAN IDs. Common pattern: pre-flight checks need stricter VLAN uniqueness validation",
  "similar_incidents": ["deploy-20251120-100000", "deploy-20251118-143000"],
  "patterns_found": [
    "Common root cause: Schema validation missed duplicate VLAN IDs",
    "Commonly affected devices: leaf-01, leaf-02, leaf-03",
    "Common category: deployment_failure"
  ],
  "confidence": 0.85
}
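The patterns_found entries above can be produced with simple frequency counting over the retrieved neighbors. A minimal sketch, assuming each similar incident carries the fields shown earlier (an illustration, not the analyzer's actual code):

from collections import Counter

def extract_patterns(similar_incidents):
    """Derive 'Common ...' pattern strings from a list of similar incidents."""
    patterns = []
    # Most frequent known root cause among the neighbors
    causes = Counter(i.root_cause for i in similar_incidents if i.root_cause)
    if causes:
        patterns.append(f"Common root cause: {causes.most_common(1)[0][0]}")
    # Devices that show up repeatedly across incidents
    devices = Counter(d for i in similar_incidents for d in i.affected_devices)
    repeat = [d for d, n in devices.most_common(3) if n > 1]
    if repeat:
        patterns.append(f"Commonly affected devices: {', '.join(repeat)}")
    # Dominant category
    cats = Counter(i.category for i in similar_incidents)
    if cats:
        patterns.append(f"Common category: {cats.most_common(1)[0][0]}")
    return patterns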

3. Regression Test Generator

Automatically generates pyATS/pytest tests:

from pathlib import Path
from agent.incident_learning import RegressionTestGenerator

generator = RegressionTestGenerator()

# Generate test for incident
test_code = generator.generate_test(incident)

# Save test
test_path = Path("tests/regression") / f"test_{incident.id}.py"
test_path.write_text(test_code)

Generated Test Example:

"""
Regression test for incident deploy-20251125-120000
VLAN 100 duplicate configuration
Generated: 2025-11-25T12:30:00
"""
from pyats import aetest

class TestDeploy20251125120000(aetest.Testcase):
    """Prevent recurrence of VLAN duplicate configuration"""
    
    @aetest.setup
    def setup(self, testbed):
        self.devices = {}
        for device_name in ['leaf-01', 'leaf-02']:
            device = testbed.devices[device_name]
            device.connect()
            self.devices[device_name] = device
    
    @aetest.test
    def verify_vlan_uniqueness(self):
        """Verify no duplicate VLAN IDs"""
        all_vlans = {}
        
        for device_name, device in self.devices.items():
            output = device.execute('show vlan brief')
            vlans = parse_vlans(output)
            
            for vlan_id in vlans:
                if vlan_id in all_vlans:
                    self.failed(f"Duplicate VLAN {vlan_id} on {device_name} and {all_vlans[vlan_id]}")
                all_vlans[vlan_id] = device_name
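The generated test calls a parse_vlans helper that isn't shown above. A minimal regex-based sketch of what such a helper could look like (the real generator may emit something platform-aware):

import re

def parse_vlans(output):
    """Extract VLAN IDs from 'show vlan brief' output (naive line-based parse)."""
    vlan_ids = []
    for line in output.splitlines():
        match = re.match(r'^\s*(\d+)\s', line)
        if match:
            vlan_ids.append(int(match.group(1)))
    return vlan_ids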

Usage Examples

Automatic Incident Capture

Incidents are automatically captured when validation fails:

from agent.pipeline_engine import OvergrowthPipeline

pipeline = OvergrowthPipeline()

# Run pre-flight validation
results = pipeline.stage0_preflight(network_model)

# If validation fails, incident is automatically captured
if not results['ready_to_deploy']:
    print("Validation failed - incident captured for learning")
    
    # View recent incidents
    incidents = pipeline.incident_db.get_all_incidents(limit=5)
    for inc in incidents:
        print(f"{inc.id}: {inc.description}")

Manual Incident Capture

For incidents outside the pipeline:

from agent.incident_learning import capture_deployment_failure

# Capture deployment failure
incident = capture_deployment_failure(
    description="BGP peering failed - incorrect AS number",
    network_model=model.to_dict(),
    validation_errors=[
        {"error": "BGP AS mismatch", "type": "routing"}
    ],
    affected_devices=["spine-01", "spine-02"]
)

print(f"Captured incident: {incident.id}")

Complete Learning Workflow

from agent.incident_learning import learn_from_incident

# Full learning cycle: analyze β†’ test β†’ update
learnings = learn_from_incident(incident)

print(f"Root cause: {learnings['root_cause']}")
print(f"Regression test: {learnings['regression_test']}")
print(f"Confidence: {learnings['confidence']:.2f}")

# Incident is updated with learnings
updated = db.get_incident(incident.id)
assert updated.root_cause is not None
assert updated.regression_test is not None

Batch Learning from Recent Incidents

# Analyze last 10 incidents
learnings = pipeline.learn_from_recent_incidents(limit=10)

print(f"Total incidents: {learnings['total_incidents']}")
print(f"Unresolved: {learnings['unresolved']}")
print(f"Analyzed: {learnings['analyzed']}")

# Each learning includes regression test
for learning in learnings['learnings']:
    print(f"  {learning['incident_id']}: {learning['root_cause']}")

Incident Categories

Deployment Failures

  • Pre-flight validation errors
  • Config generation failures
  • Deployment script errors
  • Syntax errors

Configuration Errors

  • Invalid parameters
  • Duplicate IDs (VLANs, IPs, etc.)
  • Reference errors
  • Constraint violations

Routing Issues

  • Routing loops
  • BGP/OSPF misconfigurations
  • Missing routes
  • Blackholes

Network Outages

  • Link failures
  • Device failures
  • Cascading failures
  • Service disruptions
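Only deployment_failure appears verbatim in the examples above; the other category strings below are assumptions for illustration. Counting incidents per category is a short loop over the database:

# Category strings other than "deployment_failure" are assumed, not confirmed
CATEGORIES = ["deployment_failure", "configuration_error",
              "routing_issue", "network_outage"]

for category in CATEGORIES:
    incidents = db.get_all_incidents(category=category)
    print(f"{category}: {len(incidents)} incidents")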

Confidence Scoring

The RCA analyzer calculates confidence based on:

| Factor                  | Weight | Example            |
|-------------------------|--------|--------------------|
| Similar incidents found | 40%    | 3+ similar = +0.4  |
| Resolved incidents      | 30%    | 2+ resolved = +0.3 |
| Patterns extracted      | 30%    | 2+ patterns = +0.3 |

Confidence Levels:

  • 0.7 - 1.0: High confidence - safe to auto-apply learnings
  • 0.4 - 0.7: Medium confidence - review before applying
  • 0.0 - 0.4: Low confidence - requires manual analysis
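Put together, the scoring amounts to a few threshold checks. A sketch assuming the thresholds in the table are hard cutoffs (the analyzer may instead scale within each band), and treating an incident with a recorded root_cause as resolved, as the report example later in this doc does:

def confidence_score(similar_incidents, patterns):
    """Weighted confidence per the table above: 0.4 + 0.3 + 0.3 = 1.0 max."""
    score = 0.0
    if len(similar_incidents) >= 3:     # similar incidents found (40%)
        score += 0.4
    resolved = [i for i in similar_incidents if i.root_cause]
    if len(resolved) >= 2:              # resolved incidents (30%)
        score += 0.3
    if len(patterns) >= 2:              # patterns extracted (30%)
        score += 0.3
    return score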

Vector Search with ChromaDB

Installation

# Install ChromaDB for vector search
pip install chromadb

# Verify
python -c "import chromadb; print('ChromaDB installed')"

Benefits Over Keyword Search

| Feature             | ChromaDB                                | Keyword Search              |
|---------------------|-----------------------------------------|-----------------------------|
| Semantic similarity | ✅ Finds conceptually similar incidents | ❌ Exact keyword match only |
| Typo tolerance      | ✅ Handles misspellings                 | ❌ Requires exact match     |
| Context-aware       | ✅ Understands intent                   | ❌ Literal matching         |
| Performance         | ✅ Fast vector search                   | ⚠️ Linear scan              |

Example: Semantic vs Keyword

Query: "BGP session won't come up"

ChromaDB finds:

  • "BGP neighbor not establishing"
  • "BGP peering failed"
  • "Routing protocol adjacency issue"

Keyword search finds:

  • Only exact matches with "BGP session"
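Under the hood, wiring this up takes only a few calls to the standard chromadb client API. A minimal sketch; the collection name and storage path are assumptions for illustration:

import chromadb

# Persistent client; the storage path is an assumption
client = chromadb.PersistentClient(path=".overgrowth/chroma")
collection = client.get_or_create_collection("incidents")

# Index an incident: ChromaDB embeds the document text automatically
collection.add(
    ids=[incident.id],
    documents=[incident.description],
    metadatas=[{"category": incident.category, "severity": incident.severity}],
)

# Semantic query: matches "BGP neighbor not establishing" etc., not just keywords
results = collection.query(query_texts=["BGP session won't come up"], n_results=5)
print(results["ids"][0])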

Integration with Pipeline

Stage 0: Pre-flight Validation

Incidents automatically captured on validation failures:

# Pipeline captures incident when validation fails
results = pipeline.stage0_preflight(model)

if not results['ready_to_deploy']:
    # Incident created with:
    # - Validation errors
    # - Network model
    # - Affected devices
    # - Timestamp
    pass
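Internally, the capture step amounts to calling the helper shown earlier with the failing validation context. A sketch, not the pipeline's exact code; the results key names are assumptions:

# Sketch of what the pipeline does on failure; key names are assumptions
if not results['ready_to_deploy']:
    incident = capture_deployment_failure(
        description="Pre-flight validation failed",
        network_model=model.to_dict(),
        validation_errors=results.get('errors', []),
        affected_devices=results.get('affected_devices', []),
    )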

Background Learning Job

In production, run learning as a background job:

#!/bin/bash
# Cron job: learn from incidents every hour
cd /opt/overgrowth
source venv/bin/activate

python -c "
from agent.pipeline_engine import OvergrowthPipeline

pipeline = OvergrowthPipeline()
learnings = pipeline.learn_from_recent_incidents(limit=20)

if learnings['analyzed'] > 0:
    print(f'Analyzed {learnings[\"analyzed\"]} incidents')
    print(f'Generated {learnings[\"analyzed\"]} regression tests')
"

Continuous Improvement Loop

1. Deploy β†’ Fail β†’ Capture incident
2. Analyze β†’ Find root cause
3. Generate test β†’ Prevent recurrence
4. Next deploy β†’ Test catches issue
5. Fix β†’ Deploy succeeds
6. Knowledge updated β†’ Future deploys smarter

Regression Test Types

pyATS Tests (Network Validation)

For routing issues, interface states, protocol validation:

# Generated for routing incidents (illustrative)
from pyats import aetest

class TestRoutingLoop(aetest.Testcase):
    @aetest.test
    def verify_no_routing_loops(self, testbed):
        for device_name in affected_devices:  # device names from the incident
            device = testbed.devices[device_name]
            routes = device.parse('show ip route')
            # no_loops_detected is a helper emitted alongside the generated test
            assert no_loops_detected(routes)

pytest Tests (Config Validation)

For schema errors, policy violations, syntax issues:

# Generated for config incidents
def test_vlan_uniqueness():
    """Prevent duplicate VLAN IDs"""
    model = load_network_model()
    vlan_ids = [v['id'] for v in model['vlans']]
    
    # Check for duplicates
    assert len(vlan_ids) == len(set(vlan_ids)), "Duplicate VLAN IDs found"
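load_network_model above is assumed rather than shown; a minimal JSON-file loader could stand in for it (the file path is a placeholder):

import json
from pathlib import Path

def load_network_model(path="network_model.json"):
    """Hypothetical loader; real tests would read the pipeline's model store."""
    return json.loads(Path(path).read_text())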

Knowledge Base Updates

LLM Prompt Updates

Based on learnings, update system prompts:

# Before learning
"Generate network configurations ensuring basic syntax validity"

# After 5 VLAN duplicate incidents
"Generate network configurations. CRITICAL: Ensure VLAN IDs are unique across all devices. Check for duplicates before generating configs. This is a common failure point."

Policy Engine Updates

Add new rules based on incidents:

# After learning from incident
class NetworkPolicy:
    def check_vlan_uniqueness(self, model):
        """Added after incident: deploy-20251125-120000"""
        vlan_ids = [v['id'] for v in model['vlans']]
        duplicates = sorted({v for v in vlan_ids if vlan_ids.count(v) > 1})
        
        if duplicates:
            self.add_violation(
                severity='ERROR',
                message=f"Duplicate VLAN IDs: {duplicates}",
                learned_from='deploy-20251125-120000'
            )

Query Examples

Find Incidents by Pattern

# Find all BGP-related incidents
bgp_incidents = db.search_similar("BGP peering routing protocol", n_results=10)

# Find VLAN issues
vlan_incidents = db.search_similar("VLAN configuration duplicate", n_results=10)

# Find recent critical incidents
critical = db.get_all_incidents(severity="critical", limit=20)

Analyze Incident Trends

# Get incidents from last 30 days
from datetime import datetime, timedelta

all_incidents = db.get_all_incidents(limit=1000)
recent = [
    i for i in all_incidents 
    if datetime.fromisoformat(i.timestamp) > datetime.now() - timedelta(days=30)
]

# Group by category
from collections import Counter
categories = Counter(i.category for i in recent)

print("Incident trends:")
for category, count in categories.most_common():
    print(f"  {category}: {count}")

Root Cause Analysis Report

# Generate root cause analysis report
incidents = db.get_all_incidents(limit=50)
analyzer = RootCauseAnalyzer(db)

report = []
for incident in incidents:
    if not incident.root_cause:  # Unresolved
        analysis = analyzer.analyze(incident)
        
        report.append({
            'incident': incident.id,
            'description': incident.description,
            'suggested_cause': analysis['suggested_root_cause'],
            'confidence': analysis['confidence'],
            'similar_count': len(analysis['similar_incidents'])
        })

# Sort by confidence
report.sort(key=lambda x: x['confidence'], reverse=True)

for item in report[:10]:
    print(f"{item['incident']}: {item['suggested_cause']} (confidence: {item['confidence']:.2f})")

Best Practices

1. Capture Rich Context

# Good: Includes full context
incident = Incident(
    description="Deployment failed: duplicate VLAN 100",
    network_model=model.to_dict(),  # Full model
    validation_errors=errors,  # All errors
    affected_devices=["leaf-01", "leaf-02"],  # Specific devices
    config_changes=[...]  # What changed
)

# Bad: Minimal context
incident = Incident(
    description="Deployment failed"
)

2. Resolve Incidents

from datetime import datetime

# Update with resolution
db.update_incident(incident.id, {
    'root_cause': 'Schema validation missed duplicate check',
    'resolution': 'Added VLAN uniqueness validator',
    'resolved_at': datetime.now().isoformat()
})

# Resolved incidents improve future analysis

3. Run Regression Tests

# Add generated tests to CI/CD
tests/
  regression/
    test_deploy_20251125_120000.py  # Auto-generated
    test_routing_20251120_100000.py
    test_vlan_20251118_143000.py

# Run before each deployment
pytest tests/regression/ --tb=short

4. Review Learnings

# Weekly review of learnings
learnings = pipeline.learn_from_recent_incidents(limit=50)

for learning in learnings['learnings']:
    if learning['confidence'] > 0.7:
        print(f"High confidence learning:")
        print(f"  Incident: {learning['incident_id']}")
        print(f"  Root cause: {learning['root_cause']}")
        print(f"  Test: {learning['regression_test']}")

Troubleshooting

ChromaDB Not Installing

# If ChromaDB fails to install
pip install chromadb --no-deps
pip install onnxruntime pydantic-settings

# Or use mock mode (automatic fallback)
db = IncidentDatabase()
# Will use keyword search instead of vector search

Incident Database Corruption

# Backup incidents
cp ~/.overgrowth/incidents/incidents.json ~/incidents_backup.json

# Reset database
rm -rf ~/.overgrowth/incidents/

# Restore from backup
mkdir -p ~/.overgrowth/incidents
cp ~/incidents_backup.json ~/.overgrowth/incidents/incidents.json

Low Confidence Scores

Causes:

  • Few historical incidents
  • No resolved incidents
  • No similar patterns

Solutions:

  1. Manually resolve incidents with root causes
  2. Add more context to incident descriptions
  3. Wait for more incidents to build history
  4. Use LLM for better analysis

Future Enhancements

Planned Features

  • LLM Integration: Claude/GPT-4 for advanced root cause analysis
  • Automated Fix Generation: AI-generated config fixes
  • Incident Clustering: Group related incidents automatically
  • Predictive Alerts: Warn before incidents occur
  • Multi-tenant: Separate incident databases per environment

Community Contributions

See CONTRIBUTING.md for:

  • Adding new incident categories
  • Improving root cause heuristics
  • Custom regression test templates
  • Integration with monitoring tools (Prometheus, Grafana)

Support

Questions? Found a bug? Want to contribute? See CONTRIBUTING.md.