RAG-based Incident Learning System
Overview
Overgrowth's incident learning system captures deployment failures and network incidents, performs root cause analysis using RAG (Retrieval-Augmented Generation), and automatically generates regression tests to prevent recurrence. This creates a continuous learning loop where every failure makes the system smarter.
The Learning Loop
┌──────────────────────────────────────────────────┐
│ 1. Incident Occurs                               │
│    • Deployment failure                          │
│    • Validation error                            │
│    • Network outage                              │
└───────────────┬──────────────────────────────────┘
                ▼
┌──────────────────────────────────────────────────┐
│ 2. Capture & Store                               │
│    • Incident details                            │
│    • Network model                               │
│    • Validation errors                           │
│    • Affected devices                            │
└───────────────┬──────────────────────────────────┘
                ▼
┌──────────────────────────────────────────────────┐
│ 3. RAG Analysis (Vector Search)                  │
│    • Search for similar historical incidents     │
│    • Extract common patterns                     │
│    • Suggest root cause with confidence score    │
└───────────────┬──────────────────────────────────┘
                ▼
┌──────────────────────────────────────────────────┐
│ 4. Generate Regression Test                      │
│    • pyATS test for routing issues               │
│    • pytest for config validation                │
│    • Prevent same issue from recurring           │
└───────────────┬──────────────────────────────────┘
                ▼
┌──────────────────────────────────────────────────┐
│ 5. Update Knowledge Base                         │
│    • Add to vector database                      │
│    • Update LLM prompts                          │
│    • Improve future predictions                  │
└──────────────────────────────────────────────────┘
Core Components
1. Incident Database
Local JSON database with optional ChromaDB vector search:
from agent.incident_learning import IncidentDatabase, Incident
# Initialize database
db = IncidentDatabase()
# Create incident
incident = Incident(
    id="deploy-20251125-120000",
    timestamp="2025-11-25T12:00:00",
    severity="high",  # critical, high, medium, low
    category="deployment_failure",
    description="VLAN 100 duplicate configuration",
    affected_devices=["leaf-01", "leaf-02"],
    network_model={...},      # Full network model
    validation_errors=[...]   # Errors encountered
)
# Store incident
db.add_incident(incident)
# Search for similar incidents
similar = db.search_similar("VLAN duplicate error", n_results=5)
# Get all deployment failures
failures = db.get_all_incidents(category="deployment_failure")
2. Root Cause Analyzer
Uses RAG to find patterns and suggest root causes:
from agent.incident_learning import RootCauseAnalyzer
analyzer = RootCauseAnalyzer(incident_db)
# Analyze incident
analysis = analyzer.analyze(incident)
print(f"Suggested root cause: {analysis['suggested_root_cause']}")
print(f"Confidence: {analysis['confidence']:.2f}")
print(f"Similar incidents: {len(analysis['similar_incidents'])}")
print(f"Patterns found: {analysis['patterns_found']}")
Analysis Output:
{
  "suggested_root_cause": "Schema validation missed duplicate VLAN IDs. Common pattern: pre-flight checks need stricter VLAN uniqueness validation",
  "similar_incidents": ["deploy-20251120-100000", "deploy-20251118-143000"],
  "patterns_found": [
    "Common root cause: Schema validation missed duplicate VLAN IDs",
    "Commonly affected devices: leaf-01, leaf-02, leaf-03",
    "Common category: deployment_failure"
  ],
  "confidence": 0.85
}
3. Regression Test Generator
Automatically generates pyATS/pytest tests:
from pathlib import Path
from agent.incident_learning import RegressionTestGenerator
generator = RegressionTestGenerator()
# Generate test for incident
test_code = generator.generate_test(incident)
# Save test
test_path = Path("tests/regression") / f"test_{incident.id}.py"
test_path.write_text(test_code)
Generated Test Example:
"""
Regression test for incident deploy-20251125-120000
VLAN 100 duplicate configuration
Generated: 2025-11-25T12:30:00
"""
from pyats import aetest

class TestDeploy20251125120000(aetest.Testcase):
    """Prevent recurrence of VLAN duplicate configuration"""

    @aetest.setup
    def setup(self, testbed):
        self.devices = {}
        for device_name in ['leaf-01', 'leaf-02']:
            device = testbed.devices[device_name]
            device.connect()
            self.devices[device_name] = device

    @aetest.test
    def verify_vlan_uniqueness(self):
        """Verify no duplicate VLAN IDs"""
        all_vlans = {}
        for device_name, device in self.devices.items():
            output = device.execute('show vlan brief')
            vlans = parse_vlans(output)
            for vlan_id in vlans:
                if vlan_id in all_vlans:
                    self.failed(f"Duplicate VLAN {vlan_id} on {device_name} and {all_vlans[vlan_id]}")
                all_vlans[vlan_id] = device_name
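The generated test calls a parse_vlans helper, which is assumed to be emitted alongside the testcase. A minimal sketch of what such a helper could look like for IOS-style "show vlan brief" output:
import re

def parse_vlans(output):
    """Hypothetical helper: extract VLAN IDs from 'show vlan brief' output.

    Assumes IOS-style lines that begin with a numeric VLAN ID, e.g.:
    '100  USERS    active    Gi1/0/1, Gi1/0/2'
    """
    vlan_ids = []
    for line in output.splitlines():
        match = re.match(r'^\s*(\d+)\s', line)
        if match:
            vlan_ids.append(int(match.group(1)))
    return vlan_ids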
Usage Examples
Automatic Incident Capture
Incidents are automatically captured when validation fails:
from agent.pipeline_engine import OvergrowthPipeline
pipeline = OvergrowthPipeline()
# Run pre-flight validation
results = pipeline.stage0_preflight(network_model)
# If validation fails, incident is automatically captured
if not results['ready_to_deploy']:
    print("Validation failed - incident captured for learning")

# View recent incidents
incidents = pipeline.incident_db.get_all_incidents(limit=5)
for inc in incidents:
    print(f"{inc.id}: {inc.description}")
Manual Incident Capture
For incidents outside the pipeline:
from agent.incident_learning import capture_deployment_failure
# Capture deployment failure
incident = capture_deployment_failure(
    description="BGP peering failed - incorrect AS number",
    network_model=model.to_dict(),
    validation_errors=[
        {"error": "BGP AS mismatch", "type": "routing"}
    ],
    affected_devices=["spine-01", "spine-02"]
)
print(f"Captured incident: {incident.id}")
Complete Learning Workflow
from agent.incident_learning import learn_from_incident
# Full learning cycle: analyze → test → update
learnings = learn_from_incident(incident)
print(f"Root cause: {learnings['root_cause']}")
print(f"Regression test: {learnings['regression_test']}")
print(f"Confidence: {learnings['confidence']:.2f}")
# Incident is updated with learnings
updated = db.get_incident(incident.id)
assert updated.root_cause is not None
assert updated.regression_test is not None
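Conceptually, learn_from_incident chains the components described above. A rough sketch of that flow (illustrative only, not the actual implementation):
from agent.incident_learning import (
    IncidentDatabase, RootCauseAnalyzer, RegressionTestGenerator
)

def learn_from_incident_sketch(incident):
    """Illustrative: analyze, generate a regression test, persist the learnings."""
    db = IncidentDatabase()
    analysis = RootCauseAnalyzer(db).analyze(incident)
    test_code = RegressionTestGenerator().generate_test(incident)
    # Write the learnings back onto the incident record
    db.update_incident(incident.id, {
        'root_cause': analysis['suggested_root_cause'],
        'regression_test': test_code,
    })
    return {
        'root_cause': analysis['suggested_root_cause'],
        'regression_test': test_code,
        'confidence': analysis['confidence'],
    }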
Batch Learning from Recent Incidents
# Analyze last 10 incidents
learnings = pipeline.learn_from_recent_incidents(limit=10)
print(f"Total incidents: {learnings['total_incidents']}")
print(f"Unresolved: {learnings['unresolved']}")
print(f"Analyzed: {learnings['analyzed']}")
# Each learning includes regression test
for learning in learnings['learnings']:
    print(f" {learning['incident_id']}: {learning['root_cause']}")
Incident Categories
Deployment Failures
- Pre-flight validation errors
- Config generation failures
- Deployment script errors
- Syntax errors
Configuration Errors
- Invalid parameters
- Duplicate IDs (VLANs, IPs, etc.)
- Reference errors
- Constraint violations
Routing Issues
- Routing loops
- BGP/OSPF misconfigurations
- Missing routes
- Blackholes
Network Outages
- Link failures
- Device failures
- Cascading failures
- Service disruptions
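If you need to reference these categories programmatically, one option is a simple enum. This is a sketch; only "deployment_failure" appears in the examples above, so the other values are assumptions and may differ from agent.incident_learning:
from enum import Enum

class IncidentCategory(str, Enum):
    """Hypothetical enum mirroring the categories above."""
    DEPLOYMENT_FAILURE = "deployment_failure"    # pre-flight, config generation, scripts, syntax
    CONFIGURATION_ERROR = "configuration_error"  # invalid params, duplicate IDs, references
    ROUTING_ISSUE = "routing_issue"              # loops, BGP/OSPF misconfig, blackholes
    NETWORK_OUTAGE = "network_outage"            # link/device/cascading failures

# str-valued members compare equal to the raw strings stored on incidents
assert IncidentCategory.DEPLOYMENT_FAILURE == "deployment_failure"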
Confidence Scoring
The RCA analyzer calculates confidence based on:
| Factor | Weight | Example |
|---|---|---|
| Similar incidents found | 40% | 3+ similar = +0.4 |
| Resolved incidents | 30% | 2+ resolved = +0.3 |
| Patterns extracted | 30% | 2+ patterns = +0.3 |
Confidence Levels:
- 0.7 - 1.0: High confidence - safe to auto-apply learnings
- 0.4 - 0.7: Medium confidence - review before applying
- 0.0 - 0.4: Low confidence - requires manual analysis
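A minimal sketch of how a score with these weights could be computed, assuming the analyzer resolves similar matches to Incident objects (the real analyzer's thresholds may differ):
def confidence_score(similar_incidents, patterns_found):
    """Hypothetical scoring using the weights from the table above."""
    score = 0.0
    if len(similar_incidents) >= 3:            # 40%: similar incidents found
        score += 0.4
    resolved = [i for i in similar_incidents if i.root_cause]
    if len(resolved) >= 2:                     # 30%: resolved incidents
        score += 0.3
    if len(patterns_found) >= 2:               # 30%: patterns extracted
        score += 0.3
    return score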
Vector Search with ChromaDB
Installation
# Install ChromaDB for vector search
pip install chromadb
# Verify
python -c "import chromadb; print('ChromaDB installed')"
Benefits Over Keyword Search
| Feature | ChromaDB | Keyword Search |
|---|---|---|
| Semantic similarity | ✅ Finds conceptually similar incidents | ❌ Exact keyword match only |
| Typo tolerance | ✅ Handles misspellings | ❌ Requires exact match |
| Context-aware | ✅ Understands intent | ❌ Literal matching |
| Performance | ✅ Fast vector search | ⚠️ Linear scan |
Example: Semantic vs Keyword
Query: "BGP session won't come up"
ChromaDB finds:
- "BGP neighbor not establishing"
- "BGP peering failed"
- "Routing protocol adjacency issue"
Keyword search finds:
- Only exact matches with "BGP session"
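For reference, a self-contained sketch of what semantic search looks like against the ChromaDB client directly (the collection name, storage path, and documents are illustrative, not Overgrowth's internals):
import chromadb

client = chromadb.PersistentClient(path="incident_index")  # illustrative path
collection = client.get_or_create_collection("incidents")

# Index incident descriptions (embeddings are computed automatically)
collection.add(
    ids=["deploy-20251120-100000", "deploy-20251118-143000"],
    documents=["BGP neighbor not establishing", "BGP peering failed"],
)

# Semantic query: matches conceptually similar incidents, not just keywords
results = collection.query(query_texts=["BGP session won't come up"], n_results=5)
print(results["ids"][0])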
Integration with Pipeline
Stage 0: Pre-flight Validation
Incidents automatically captured on validation failures:
# Pipeline captures incident when validation fails
results = pipeline.stage0_preflight(model)
if not results['ready_to_deploy']:
    # Incident created with:
    #   - Validation errors
    #   - Network model
    #   - Affected devices
    #   - Timestamp
    pass
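What that capture could look like inside the pipeline, sketched with the Incident fields shown earlier (the helper name and the result-dict keys are assumptions):
from datetime import datetime
from agent.incident_learning import Incident

def _capture_preflight_failure(incident_db, network_model, results):
    """Hypothetical: build and store an incident from failed pre-flight results."""
    now = datetime.now()
    incident = Incident(
        id=f"deploy-{now:%Y%m%d-%H%M%S}",
        timestamp=now.isoformat(),
        severity="high",
        category="deployment_failure",
        description="Pre-flight validation failed",
        affected_devices=results.get("affected_devices", []),  # assumed result key
        network_model=network_model,
        validation_errors=results.get("errors", []),           # assumed result key
    )
    incident_db.add_incident(incident)
    return incident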
Background Learning Job
In production, run learning as a background job:
#!/bin/bash
# Cron job: learn from incidents every hour
cd /opt/overgrowth
source venv/bin/activate
python -c "
from agent.pipeline_engine import OvergrowthPipeline
pipeline = OvergrowthPipeline()
learnings = pipeline.learn_from_recent_incidents(limit=20)
if learnings['analyzed'] > 0:
    print(f'Analyzed {learnings[\"analyzed\"]} incidents')
    print(f'Generated {learnings[\"analyzed\"]} regression tests')
"
Continuous Improvement Loop
1. Deploy → Fail → Capture incident
2. Analyze → Find root cause
3. Generate test → Prevent recurrence
4. Next deploy → Test catches issue
5. Fix → Deploy succeeds
6. Knowledge updated → Future deploys smarter
Regression Test Types
pyATS Tests (Network Validation)
For routing issues, interface states, protocol validation:
# Generated for routing incidents
class TestRoutingLoop(aetest.Testcase):
    @aetest.test
    def verify_no_routing_loops(self, testbed):
        # affected_devices and no_loops_detected() are helpers assumed to
        # accompany the generated test, scoped to the originating incident
        for device in affected_devices:
            routes = device.parse('show ip route')
            # Check for loops
            assert no_loops_detected(routes)
pytest Tests (Config Validation)
For schema errors, policy violations, syntax issues:
# Generated for config incidents
def test_vlan_uniqueness():
    """Prevent duplicate VLAN IDs"""
    model = load_network_model()
    vlan_ids = [v['id'] for v in model['vlans']]
    # Check for duplicates
    assert len(vlan_ids) == len(set(vlan_ids)), "Duplicate VLAN IDs found"
Knowledge Base Updates
LLM Prompt Updates
Based on learnings, update system prompts:
# Before learning
"Generate network configurations ensuring basic syntax validity"
# After 5 VLAN duplicate incidents
"Generate network configurations. CRITICAL: Ensure VLAN IDs are unique across all devices. Check for duplicates before generating configs. This is a common failure point."
Policy Engine Updates
Add new rules based on incidents:
# After learning from incident
class NetworkPolicy:
    def check_vlan_uniqueness(self, model):
        """Added after incident: deploy-20251125-120000"""
        vlan_ids = [v['id'] for v in model['vlans']]
        duplicates = [v for v in vlan_ids if vlan_ids.count(v) > 1]
        if duplicates:
            self.add_violation(
                severity='ERROR',
                message=f"Duplicate VLAN IDs: {duplicates}",
                learned_from='deploy-20251125-120000'
            )
Query Examples
Find Incidents by Pattern
# Find all BGP-related incidents
bgp_incidents = db.search_similar("BGP peering routing protocol", n_results=10)
# Find VLAN issues
vlan_incidents = db.search_similar("VLAN configuration duplicate", n_results=10)
# Find recent critical incidents
critical = db.get_all_incidents(severity="critical", limit=20)
Analyze Incident Trends
# Get incidents from last 30 days
from datetime import datetime, timedelta
all_incidents = db.get_all_incidents(limit=1000)
recent = [
    i for i in all_incidents
    if datetime.fromisoformat(i.timestamp) > datetime.now() - timedelta(days=30)
]
# Group by category
from collections import Counter
categories = Counter(i.category for i in recent)
print("Incident trends:")
for category, count in categories.most_common():
    print(f" {category}: {count}")
Root Cause Analysis Report
# Generate root cause analysis report
incidents = db.get_all_incidents(limit=50)
analyzer = RootCauseAnalyzer(db)
report = []
for incident in incidents:
    if not incident.root_cause:  # Unresolved
        analysis = analyzer.analyze(incident)
        report.append({
            'incident': incident.id,
            'description': incident.description,
            'suggested_cause': analysis['suggested_root_cause'],
            'confidence': analysis['confidence'],
            'similar_count': len(analysis['similar_incidents'])
        })

# Sort by confidence
report.sort(key=lambda x: x['confidence'], reverse=True)
for item in report[:10]:
    print(f"{item['incident']}: {item['suggested_cause']} (confidence: {item['confidence']:.2f})")
Best Practices
1. Capture Rich Context
# Good: Includes full context
incident = Incident(
    description="Deployment failed: duplicate VLAN 100",
    network_model=model.to_dict(),            # Full model
    validation_errors=errors,                 # All errors
    affected_devices=["leaf-01", "leaf-02"],  # Specific devices
    config_changes=[...]                      # What changed
)

# Bad: Minimal context
incident = Incident(
    description="Deployment failed"
)
2. Resolve Incidents
# Update with resolution
db.update_incident(incident.id, {
    'root_cause': 'Schema validation missed duplicate check',
    'resolution': 'Added VLAN uniqueness validator',
    'resolved_at': datetime.now().isoformat()
})
# Resolved incidents improve future analysis
3. Run Regression Tests
# Add generated tests to CI/CD
tests/
  regression/
    test_deploy_20251125_120000.py   # Auto-generated
    test_routing_20251120_100000.py
    test_vlan_20251118_143000.py

# Run before each deployment
pytest tests/regression/ --tb=short
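To make that gate mandatory, a hypothetical pre-deploy wrapper could block deployment on any regression failure (the deploy entry point is an assumption):
#!/bin/bash
# Hypothetical pre-deploy gate: abort deployment when any regression test fails
set -euo pipefail
pytest tests/regression/ --tb=short || {
    echo "Regression tests failed - aborting deployment" >&2
    exit 1
}
python deploy.py  # assumed deploy entry point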
4. Review Learnings
# Weekly review of learnings
learnings = pipeline.learn_from_recent_incidents(limit=50)
for learning in learnings['learnings']:
    if learning['confidence'] > 0.7:
        print("High confidence learning:")
        print(f" Incident: {learning['incident_id']}")
        print(f" Root cause: {learning['root_cause']}")
        print(f" Test: {learning['regression_test']}")
Troubleshooting
ChromaDB Not Installing
# If ChromaDB fails to install
pip install chromadb --no-deps
pip install onnxruntime pydantic-settings
# Or use mock mode (automatic fallback)
db = IncidentDatabase()
# Will use keyword search instead of vector search
Incident Database Corruption
# Backup incidents
cp ~/.overgrowth/incidents/incidents.json ~/incidents_backup.json
# Reset database
rm -rf ~/.overgrowth/incidents/
# Restore from backup
mkdir -p ~/.overgrowth/incidents
cp ~/incidents_backup.json ~/.overgrowth/incidents/incidents.json
Low Confidence Scores
Causes:
- Few historical incidents
- No resolved incidents
- No similar patterns
Solutions:
- Manually resolve incidents with root causes
- Add more context to incident descriptions
- Wait for more incidents to build history
- Use LLM for better analysis
Future Enhancements
Planned Features
- LLM Integration: Claude/GPT-4 for advanced root cause analysis
- Automated Fix Generation: AI-generated config fixes
- Incident Clustering: Group related incidents automatically
- Predictive Alerts: Warn before incidents occur
- Multi-tenant: Separate incident databases per environment
Community Contributions
See CONTRIBUTING.md for:
- Adding new incident categories
- Improving root cause heuristics
- Custom regression test templates
- Integration with monitoring tools (Prometheus, Grafana)
References
- ChromaDB Documentation: https://docs.trychroma.com/
- pyATS Documentation: https://developer.cisco.com/docs/pyats/
- Overgrowth Repository: https://huggingface.co/spaces/MCP-1st-Birthday/overgrowth
- Related Docs:
  - NETBOX_INTEGRATION.md - Source of truth
  - BATFISH_INTEGRATION.md - Static analysis
  - SUZIEQ_INTEGRATION.md - Drift detection
Support
Questions? Found a bug? Want to contribute?
- Open an issue on HuggingFace Spaces
- Join Discord: [link]
- Email: [email protected]