# RAG-based Incident Learning System

## Overview

Overgrowth's incident learning system captures deployment failures and network incidents, performs root cause analysis using RAG (Retrieval-Augmented Generation), and automatically generates regression tests to prevent recurrence. This creates a **continuous learning loop** where every failure makes the system smarter.

### The Learning Loop

```
┌─────────────────────────────────────────────────────────────┐
│ 1. Incident Occurs                                          │
│    • Deployment failure                                     │
│    • Validation error                                       │
│    • Network outage                                         │
└───────────────┬─────────────────────────────────────────────┘
                ↓
┌─────────────────────────────────────────────────────────────┐
│ 2. Capture & Store                                          │
│    • Incident details                                       │
│    • Network model                                          │
│    • Validation errors                                      │
│    • Affected devices                                       │
└───────────────┬─────────────────────────────────────────────┘
                ↓
┌─────────────────────────────────────────────────────────────┐
│ 3. RAG Analysis (Vector Search)                             │
│    • Search for similar historical incidents                │
│    • Extract common patterns                                │
│    • Suggest root cause with confidence score               │
└───────────────┬─────────────────────────────────────────────┘
                ↓
┌─────────────────────────────────────────────────────────────┐
│ 4. Generate Regression Test                                 │
│    • pyATS test for routing issues                          │
│    • pytest for config validation                           │
│    • Prevent same issue from recurring                      │
└───────────────┬─────────────────────────────────────────────┘
                ↓
┌─────────────────────────────────────────────────────────────┐
│ 5. Update Knowledge Base                                    │
│    • Add to vector database                                 │
│    • Update LLM prompts                                     │
│    • Improve future predictions                             │
└─────────────────────────────────────────────────────────────┘
```

## Core Components

### 1. Incident Database

Local JSON database with optional ChromaDB vector search:

```python
from agent.incident_learning import IncidentDatabase, Incident

# Initialize database
db = IncidentDatabase()

# Create incident
incident = Incident(
    id="deploy-20251125-120000",
    timestamp="2025-11-25T12:00:00",
    severity="high",  # critical, high, medium, low
    category="deployment_failure",
    description="VLAN 100 duplicate configuration",
    affected_devices=["leaf-01", "leaf-02"],
    network_model={...},     # Full network model
    validation_errors=[...]  # Errors encountered
)

# Store incident
db.add_incident(incident)

# Search for similar incidents
similar = db.search_similar("VLAN duplicate error", n_results=5)

# Get all deployment failures
failures = db.get_all_incidents(category="deployment_failure")
```

### 2. Root Cause Analyzer

Uses RAG to find patterns and suggest root causes:

```python
from agent.incident_learning import RootCauseAnalyzer

analyzer = RootCauseAnalyzer(incident_db)

# Analyze incident
analysis = analyzer.analyze(incident)

print(f"Suggested root cause: {analysis['suggested_root_cause']}")
print(f"Confidence: {analysis['confidence']:.2f}")
print(f"Similar incidents: {len(analysis['similar_incidents'])}")
print(f"Patterns found: {analysis['patterns_found']}")
```

**Analysis Output:**

```json
{
  "suggested_root_cause": "Schema validation missed duplicate VLAN IDs. Common pattern: pre-flight checks need stricter VLAN uniqueness validation",
  "similar_incidents": ["deploy-20251120-100000", "deploy-20251118-143000"],
  "patterns_found": [
    "Common root cause: Schema validation missed duplicate VLAN IDs",
    "Commonly affected devices: leaf-01, leaf-02, leaf-03",
    "Common category: deployment_failure"
  ],
  "confidence": 0.85
}
```
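The pattern strings in the analysis output are frequency summaries over the retrieved neighbors. As a rough illustration of how they could be derived, here is a minimal sketch; the `extract_patterns` helper and its thresholds are hypothetical, not the shipped `RootCauseAnalyzer` internals:

```python
from collections import Counter

def extract_patterns(similar_incidents):
    """Hypothetical sketch: derive pattern strings like those in the
    analysis output above from a list of similar Incident objects."""
    patterns = []

    # Most common root cause among resolved look-alikes
    causes = Counter(i.root_cause for i in similar_incidents if i.root_cause)
    if causes:
        cause, count = causes.most_common(1)[0]
        if count >= 2:  # threshold is an assumption
            patterns.append(f"Common root cause: {cause}")

    # Devices that recur across incidents
    devices = Counter(d for i in similar_incidents for d in i.affected_devices)
    repeat = [d for d, n in devices.most_common() if n >= 2]
    if repeat:
        patterns.append(f"Commonly affected devices: {', '.join(repeat)}")

    # Dominant category
    categories = Counter(i.category for i in similar_incidents)
    if categories:
        patterns.append(f"Common category: {categories.most_common(1)[0][0]}")

    return patterns
```

Counting only resolved incidents for the root-cause pattern keeps unverified guesses from reinforcing themselves.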
### 3. Regression Test Generator

Automatically generates pyATS/pytest tests:

```python
from pathlib import Path

from agent.incident_learning import RegressionTestGenerator

generator = RegressionTestGenerator()

# Generate test for incident
test_code = generator.generate_test(incident)

# Save test
test_path = Path("tests/regression") / f"test_{incident.id}.py"
test_path.write_text(test_code)
```

**Generated Test Example:**

```python
"""
Regression test for incident deploy-20251125-120000
VLAN 100 duplicate configuration
Generated: 2025-11-25T12:30:00
"""
from pyats import aetest


class TestDeploy20251125120000(aetest.Testcase):
    """Prevent recurrence of VLAN duplicate configuration"""

    @aetest.setup
    def setup(self, testbed):
        self.devices = {}
        for device_name in ['leaf-01', 'leaf-02']:
            device = testbed.devices[device_name]
            device.connect()
            self.devices[device_name] = device

    @aetest.test
    def verify_vlan_uniqueness(self):
        """Verify no duplicate VLAN IDs"""
        all_vlans = {}
        for device_name, device in self.devices.items():
            output = device.execute('show vlan brief')
            # parse_vlans: extracts VLAN IDs from 'show vlan brief' output
            vlans = parse_vlans(output)
            for vlan_id in vlans:
                if vlan_id in all_vlans:
                    self.failed(f"Duplicate VLAN {vlan_id} on {device_name} and {all_vlans[vlan_id]}")
                all_vlans[vlan_id] = device_name
```

## Usage Examples

### Automatic Incident Capture

Incidents are automatically captured when validation fails:

```python
from agent.pipeline_engine import OvergrowthPipeline

pipeline = OvergrowthPipeline()

# Run pre-flight validation
results = pipeline.stage0_preflight(network_model)

# If validation fails, an incident is automatically captured
if not results['ready_to_deploy']:
    print("Validation failed - incident captured for learning")

# View recent incidents
incidents = pipeline.incident_db.get_all_incidents(limit=5)
for inc in incidents:
    print(f"{inc.id}: {inc.description}")
```

### Manual Incident Capture

For incidents outside the pipeline:

```python
from agent.incident_learning import capture_deployment_failure

# Capture deployment failure
incident = capture_deployment_failure(
    description="BGP peering failed - incorrect AS number",
    network_model=model.to_dict(),
    validation_errors=[
        {"error": "BGP AS mismatch", "type": "routing"}
    ],
    affected_devices=["spine-01", "spine-02"]
)

print(f"Captured incident: {incident.id}")
```

### Complete Learning Workflow

```python
from agent.incident_learning import learn_from_incident

# Full learning cycle: analyze → test → update
learnings = learn_from_incident(incident)

print(f"Root cause: {learnings['root_cause']}")
print(f"Regression test: {learnings['regression_test']}")
print(f"Confidence: {learnings['confidence']:.2f}")

# The incident record is updated with the learnings
updated = db.get_incident(incident.id)
assert updated.root_cause is not None
assert updated.regression_test is not None
```

### Batch Learning from Recent Incidents

```python
# Analyze the last 10 incidents
learnings = pipeline.learn_from_recent_incidents(limit=10)

print(f"Total incidents: {learnings['total_incidents']}")
print(f"Unresolved: {learnings['unresolved']}")
print(f"Analyzed: {learnings['analyzed']}")

# Each learning includes a regression test
for learning in learnings['learnings']:
    print(f"  {learning['incident_id']}: {learning['root_cause']}")
```

## Incident Categories

### Deployment Failures

- Pre-flight validation errors
- Config generation failures
- Deployment script errors
- Syntax errors

### Configuration Errors

- Invalid parameters
- Duplicate IDs (VLANs, IPs, etc.)
- Reference errors
- Constraint violations

### Routing Issues

- Routing loops
- BGP/OSPF misconfigurations
- Missing routes
- Blackholes

### Network Outages

- Link failures
- Device failures
- Cascading failures
- Service disruptions

## Confidence Scoring

The RCA analyzer calculates confidence based on:

| Factor | Weight | Example |
|--------|--------|---------|
| **Similar incidents found** | 40% | 3+ similar = +0.4 |
| **Resolved incidents** | 30% | 2+ resolved = +0.3 |
| **Patterns extracted** | 30% | 2+ patterns = +0.3 |

**Confidence Levels:**

- **0.7 - 1.0**: High confidence - safe to auto-apply learnings
- **0.4 - 0.7**: Medium confidence - review before applying
- **0.0 - 0.4**: Low confidence - requires manual analysis
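To make the weighting concrete, here is a minimal sketch of how those three factors could combine into a single score. Only the weights and thresholds come from the table above; the linear ramps (partial credit below the 3+/2+/2+ thresholds) are an assumption:

```python
def score_confidence(similar_incidents, patterns):
    """Illustrative combination of the factors in the table above;
    not the actual RootCauseAnalyzer implementation."""
    score = 0.0

    # Similar incidents found (40%): full weight at 3+ matches
    score += 0.4 * min(len(similar_incidents) / 3, 1.0)

    # Resolved incidents (30%): full weight at 2+ resolved matches
    resolved = [i for i in similar_incidents if i.root_cause]
    score += 0.3 * min(len(resolved) / 2, 1.0)

    # Patterns extracted (30%): full weight at 2+ patterns
    score += 0.3 * min(len(patterns) / 2, 1.0)

    return round(score, 2)
```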
## Vector Search with ChromaDB

### Installation

```bash
# Install ChromaDB for vector search
pip install chromadb

# Verify
python -c "import chromadb; print('ChromaDB installed')"
```

### Benefits Over Keyword Search

| Feature | ChromaDB | Keyword Search |
|---------|----------|----------------|
| **Semantic similarity** | ✅ Finds conceptually similar incidents | ❌ Exact keyword match only |
| **Typo tolerance** | ✅ Handles misspellings | ❌ Requires exact match |
| **Context-aware** | ✅ Understands intent | ❌ Literal matching |
| **Performance** | ✅ Fast vector search | ⚠️ Linear scan |

### Example: Semantic vs Keyword

**Query:** "BGP session won't come up"

**ChromaDB finds:**

- "BGP neighbor not establishing"
- "BGP peering failed"
- "Routing protocol adjacency issue"

**Keyword search finds:**

- Only exact matches with "BGP session"
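`IncidentDatabase` wires this up for you, but if you want to experiment with the underlying ChromaDB API directly, here is a self-contained sketch; the collection name, storage path, and sample data are illustrative:

```python
import chromadb

# Persistent local store (path is illustrative)
client = chromadb.PersistentClient(path="./chroma_incidents")
collection = client.get_or_create_collection("incidents")

# Index an incident description; metadata enables filtered queries
collection.add(
    ids=["deploy-20251125-120000"],
    documents=["VLAN 100 duplicate configuration on leaf-01/leaf-02"],
    metadatas=[{"category": "deployment_failure", "severity": "high"}],
)

# Semantic query: matches the stored incident even though the
# wording differs from its description
results = collection.query(
    query_texts=["duplicate VLAN IDs rejected during deploy"],
    n_results=5,
    where={"category": "deployment_failure"},
)
print(results["ids"][0])
```

Metadata filters (`where=...`) let you combine semantic search with the same category/severity filtering that `get_all_incidents` offers.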
## Integration with Pipeline

### Stage 0: Pre-flight Validation

Incidents are automatically captured on validation failures:

```python
# Pipeline captures an incident when validation fails
results = pipeline.stage0_preflight(model)

if not results['ready_to_deploy']:
    # Incident created with:
    # - Validation errors
    # - Network model
    # - Affected devices
    # - Timestamp
    pass
```

### Background Learning Job

In production, run learning as a background job:

```bash
#!/bin/bash
# Cron job: learn from incidents every hour
cd /opt/overgrowth
source venv/bin/activate
python -c "
from agent.pipeline_engine import OvergrowthPipeline

pipeline = OvergrowthPipeline()
learnings = pipeline.learn_from_recent_incidents(limit=20)

if learnings['analyzed'] > 0:
    print(f'Analyzed {learnings[\"analyzed\"]} incidents')
    print(f'Generated {learnings[\"analyzed\"]} regression tests')
"
```

### Continuous Improvement Loop

```
1. Deploy → Fail → Capture incident
2. Analyze → Find root cause
3. Generate test → Prevent recurrence
4. Next deploy → Test catches issue
5. Fix → Deploy succeeds
6. Knowledge updated → Future deploys smarter
```
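For deployments that happen outside the pipeline, the same loop can be closed with a small wrapper. A hedged sketch: `deploy_fn` stands in for whatever actually pushes configs, and the error handling is an assumption; `capture_deployment_failure` and `learn_from_incident` are the helpers shown earlier:

```python
from agent.incident_learning import capture_deployment_failure, learn_from_incident

def deploy_with_learning(deploy_fn, model, devices):
    """Hypothetical wrapper: run a deployment callable and feed any
    failure back into the learning loop (steps 1-6 above)."""
    try:
        return deploy_fn(model, devices)  # your actual deploy mechanism
    except Exception as exc:
        # Steps 1-2: the failure becomes a stored incident
        incident = capture_deployment_failure(
            description=f"Deployment failed: {exc}",
            network_model=model.to_dict(),
            validation_errors=[{"error": str(exc), "type": "deployment"}],
            affected_devices=devices,
        )
        # Steps 3-6: analyze, generate a regression test, update knowledge
        learnings = learn_from_incident(incident)
        print(f"{incident.id}: {learnings['root_cause']} "
              f"(confidence {learnings['confidence']:.2f})")
        raise
```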
## Regression Test Types

### pyATS Tests (Network Validation)

For routing issues, interface states, and protocol validation:

```python
# Generated for routing incidents
class TestRoutingLoop(aetest.Testcase):

    @aetest.test
    def verify_no_routing_loops(self, testbed):
        # affected_devices comes from the incident record
        for device in affected_devices:
            routes = device.parse('show ip route')
            # Check for loops
            assert no_loops_detected(routes)
```

### pytest Tests (Config Validation)

For schema errors, policy violations, and syntax issues:

```python
# Generated for config incidents
def test_vlan_uniqueness():
    """Prevent duplicate VLAN IDs"""
    model = load_network_model()
    vlan_ids = [v['id'] for v in model['vlans']]

    # Check for duplicates
    assert len(vlan_ids) == len(set(vlan_ids)), "Duplicate VLAN IDs found"
```

## Knowledge Base Updates

### LLM Prompt Updates

Based on learnings, update system prompts:

```python
# Before learning
"Generate network configurations ensuring basic syntax validity"

# After 5 VLAN duplicate incidents
"Generate network configurations. CRITICAL: Ensure VLAN IDs are unique
across all devices. Check for duplicates before generating configs.
This is a common failure point."
```

### Policy Engine Updates

Add new rules based on incidents:

```python
# After learning from incident
class NetworkPolicy:

    def check_vlan_uniqueness(self, model):
        """Added after incident: deploy-20251125-120000"""
        vlan_ids = [v['id'] for v in model['vlans']]
        duplicates = [v for v in vlan_ids if vlan_ids.count(v) > 1]

        if duplicates:
            self.add_violation(
                severity='ERROR',
                message=f"Duplicate VLAN IDs: {duplicates}",
                learned_from='deploy-20251125-120000'
            )
```

## Query Examples

### Find Incidents by Pattern

```python
# Find all BGP-related incidents
bgp_incidents = db.search_similar("BGP peering routing protocol", n_results=10)

# Find VLAN issues
vlan_incidents = db.search_similar("VLAN configuration duplicate", n_results=10)

# Find recent critical incidents
critical = db.get_all_incidents(severity="critical", limit=20)
```

### Analyze Incident Trends

```python
from collections import Counter
from datetime import datetime, timedelta

# Get incidents from the last 30 days
all_incidents = db.get_all_incidents(limit=1000)
recent = [
    i for i in all_incidents
    if datetime.fromisoformat(i.timestamp) > datetime.now() - timedelta(days=30)
]

# Group by category
categories = Counter(i.category for i in recent)

print("Incident trends:")
for category, count in categories.most_common():
    print(f"  {category}: {count}")
```

### Root Cause Analysis Report

```python
# Generate a root cause analysis report
incidents = db.get_all_incidents(limit=50)
analyzer = RootCauseAnalyzer(db)

report = []
for incident in incidents:
    if not incident.root_cause:  # Unresolved
        analysis = analyzer.analyze(incident)
        report.append({
            'incident': incident.id,
            'description': incident.description,
            'suggested_cause': analysis['suggested_root_cause'],
            'confidence': analysis['confidence'],
            'similar_count': len(analysis['similar_incidents'])
        })

# Sort by confidence
report.sort(key=lambda x: x['confidence'], reverse=True)

for item in report[:10]:
    print(f"{item['incident']}: {item['suggested_cause']} (confidence: {item['confidence']:.2f})")
```

## Best Practices

### 1. Capture Rich Context

```python
# Good: includes full context
incident = Incident(
    description="Deployment failed: duplicate VLAN 100",
    network_model=model.to_dict(),            # Full model
    validation_errors=errors,                 # All errors
    affected_devices=["leaf-01", "leaf-02"],  # Specific devices
    config_changes=[...]                      # What changed
)

# Bad: minimal context
incident = Incident(
    description="Deployment failed"
)
```

### 2. Resolve Incidents

```python
from datetime import datetime

# Update with resolution
db.update_incident(incident.id, {
    'root_cause': 'Schema validation missed duplicate check',
    'resolution': 'Added VLAN uniqueness validator',
    'resolved_at': datetime.now().isoformat()
})

# Resolved incidents improve future analysis
```

### 3. Run Regression Tests

Add generated tests to CI/CD:

```
tests/
  regression/
    test_deploy_20251125_120000.py   # Auto-generated
    test_routing_20251120_100000.py
    test_vlan_20251118_143000.py
```

Run them before each deployment:

```bash
pytest tests/regression/ --tb=short
```

### 4. Review Learnings

```python
# Weekly review of learnings
learnings = pipeline.learn_from_recent_incidents(limit=50)

for learning in learnings['learnings']:
    if learning['confidence'] > 0.7:
        print("High confidence learning:")
        print(f"  Incident: {learning['incident_id']}")
        print(f"  Root cause: {learning['root_cause']}")
        print(f"  Test: {learning['regression_test']}")
```

## Troubleshooting

### ChromaDB Not Installing

```bash
# If ChromaDB fails to install
pip install chromadb --no-deps
pip install onnxruntime pydantic-settings
```

Or rely on the automatic fallback:

```python
db = IncidentDatabase()  # Will use keyword search instead of vector search
```

### Incident Database Corruption

```bash
# Back up incidents
cp ~/.overgrowth/incidents/incidents.json ~/incidents_backup.json

# Reset database
rm -rf ~/.overgrowth/incidents/

# Restore from backup
mkdir -p ~/.overgrowth/incidents
cp ~/incidents_backup.json ~/.overgrowth/incidents/incidents.json
```

### Low Confidence Scores

**Causes:**

- Few historical incidents
- No resolved incidents
- No similar patterns

**Solutions:**

1. Manually resolve incidents with root causes
2. Add more context to incident descriptions
3. Wait for more incidents to build history
4. Use an LLM for better analysis

## Future Enhancements

### Planned Features

- **LLM Integration**: Claude/GPT-4 for advanced root cause analysis
- **Automated Fix Generation**: AI-generated config fixes
- **Incident Clustering**: Group related incidents automatically
- **Predictive Alerts**: Warn before incidents occur
- **Multi-tenant**: Separate incident databases per environment

### Community Contributions

See `CONTRIBUTING.md` for:

- Adding new incident categories
- Improving root cause heuristics
- Custom regression test templates
- Integration with monitoring tools (Prometheus, Grafana)

## References

- **ChromaDB Documentation**: https://docs.trychroma.com/
- **pyATS Documentation**: https://developer.cisco.com/docs/pyats/
- **Overgrowth Repository**: https://huggingface.co/spaces/MCP-1st-Birthday/overgrowth
- **Related Docs**:
  - `NETBOX_INTEGRATION.md` - Source of truth
  - `BATFISH_INTEGRATION.md` - Static analysis
  - `SUZIEQ_INTEGRATION.md` - Drift detection

## Support

Questions? Found a bug? Want to contribute?

- Open an issue on HuggingFace Spaces
- Join Discord: [link]
- Email: overgrowth@example.com