# RAG-based Incident Learning System

## Overview

Overgrowth's incident learning system captures deployment failures and network incidents, performs root cause analysis using RAG (Retrieval-Augmented Generation), and automatically generates regression tests to prevent recurrence. This creates a **continuous learning loop** where every failure makes the system smarter.

### The Learning Loop

```
┌─────────────────────────────────────────────────────────────┐
│ 1. Incident Occurs                                          │
│    • Deployment failure                                     │
│    • Validation error                                       │
│    • Network outage                                         │
└───────────────┬─────────────────────────────────────────────┘
                ↓
┌─────────────────────────────────────────────────────────────┐
│ 2. Capture & Store                                          │
│    • Incident details                                       │
│    • Network model                                          │
│    • Validation errors                                      │
│    • Affected devices                                       │
└───────────────┬─────────────────────────────────────────────┘
                ↓
┌─────────────────────────────────────────────────────────────┐
│ 3. RAG Analysis (Vector Search)                             │
│    • Search for similar historical incidents                │
│    • Extract common patterns                                │
│    • Suggest root cause with confidence score               │
└───────────────┬─────────────────────────────────────────────┘
                ↓
┌─────────────────────────────────────────────────────────────┐
│ 4. Generate Regression Test                                 │
│    • pyATS test for routing issues                          │
│    • pytest for config validation                           │
│    • Prevent same issue from recurring                      │
└───────────────┬─────────────────────────────────────────────┘
                ↓
┌─────────────────────────────────────────────────────────────┐
│ 5. Update Knowledge Base                                    │
│    • Add to vector database                                 │
│    • Update LLM prompts                                     │
│    • Improve future predictions                             │
└─────────────────────────────────────────────────────────────┘
```

## Core Components

### 1. Incident Database

Local JSON database with optional ChromaDB vector search:

```python
from agent.incident_learning import IncidentDatabase, Incident

# Initialize database
db = IncidentDatabase()

# Create incident
incident = Incident(
    id="deploy-20251125-120000",
    timestamp="2025-11-25T12:00:00",
    severity="high",  # critical, high, medium, low
    category="deployment_failure",
    description="VLAN 100 duplicate configuration",
    affected_devices=["leaf-01", "leaf-02"],
    network_model={...},     # Full network model
    validation_errors=[...]  # Errors encountered
)

# Store incident
db.add_incident(incident)

# Search for similar incidents
similar = db.search_similar("VLAN duplicate error", n_results=5)

# Get all deployment failures
failures = db.get_all_incidents(category="deployment_failure")
```

### 2. Root Cause Analyzer

Uses RAG to find patterns and suggest root causes:

```python
from agent.incident_learning import RootCauseAnalyzer

analyzer = RootCauseAnalyzer(incident_db)

# Analyze incident
analysis = analyzer.analyze(incident)

print(f"Suggested root cause: {analysis['suggested_root_cause']}")
print(f"Confidence: {analysis['confidence']:.2f}")
print(f"Similar incidents: {len(analysis['similar_incidents'])}")
print(f"Patterns found: {analysis['patterns_found']}")
```

**Analysis Output:**

```json
{
  "suggested_root_cause": "Schema validation missed duplicate VLAN IDs. Common pattern: pre-flight checks need stricter VLAN uniqueness validation",
  "similar_incidents": ["deploy-20251120-100000", "deploy-20251118-143000"],
  "patterns_found": [
    "Common root cause: Schema validation missed duplicate VLAN IDs",
    "Commonly affected devices: leaf-01, leaf-02, leaf-03",
    "Common category: deployment_failure"
  ],
  "confidence": 0.85
}
```
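The pattern strings in the analysis output are frequency summaries over the retrieved neighbors. As a rough illustration of how they could be derived, here is a minimal sketch; the `extract_patterns` helper and its thresholds are hypothetical, not the shipped `RootCauseAnalyzer` internals:

```python
from collections import Counter

def extract_patterns(similar_incidents):
    """Hypothetical sketch: derive pattern strings like those in the
    analysis output above from a list of similar Incident objects."""
    patterns = []

    # Most common root cause among resolved look-alikes
    causes = Counter(i.root_cause for i in similar_incidents if i.root_cause)
    if causes:
        cause, count = causes.most_common(1)[0]
        if count >= 2:  # threshold is an assumption
            patterns.append(f"Common root cause: {cause}")

    # Devices that recur across incidents
    devices = Counter(d for i in similar_incidents for d in i.affected_devices)
    repeat = [d for d, n in devices.most_common() if n >= 2]
    if repeat:
        patterns.append(f"Commonly affected devices: {', '.join(repeat)}")

    # Dominant category
    categories = Counter(i.category for i in similar_incidents)
    if categories:
        patterns.append(f"Common category: {categories.most_common(1)[0][0]}")

    return patterns
```

Counting only resolved incidents for the root-cause pattern keeps unverified guesses from reinforcing themselves.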
### 3. Regression Test Generator

Automatically generates pyATS/pytest tests:

```python
from pathlib import Path

from agent.incident_learning import RegressionTestGenerator

generator = RegressionTestGenerator()

# Generate test for incident
test_code = generator.generate_test(incident)

# Save test
test_path = Path("tests/regression") / f"test_{incident.id}.py"
test_path.write_text(test_code)
```

**Generated Test Example:**

```python
"""
Regression test for incident deploy-20251125-120000
VLAN 100 duplicate configuration
Generated: 2025-11-25T12:30:00
"""
from pyats import aetest


class TestDeploy20251125120000(aetest.Testcase):
    """Prevent recurrence of VLAN duplicate configuration"""

    @aetest.setup
    def setup(self, testbed):
        self.devices = {}
        for device_name in ['leaf-01', 'leaf-02']:
            device = testbed.devices[device_name]
            device.connect()
            self.devices[device_name] = device

    @aetest.test
    def verify_vlan_uniqueness(self):
        """Verify no duplicate VLAN IDs"""
        all_vlans = {}
        for device_name, device in self.devices.items():
            output = device.execute('show vlan brief')
            # parse_vlans: extracts VLAN IDs from 'show vlan brief' output
            vlans = parse_vlans(output)
            for vlan_id in vlans:
                if vlan_id in all_vlans:
                    self.failed(f"Duplicate VLAN {vlan_id} on {device_name} and {all_vlans[vlan_id]}")
                all_vlans[vlan_id] = device_name
```

## Usage Examples

### Automatic Incident Capture

Incidents are automatically captured when validation fails:

```python
from agent.pipeline_engine import OvergrowthPipeline

pipeline = OvergrowthPipeline()

# Run pre-flight validation
results = pipeline.stage0_preflight(network_model)

# If validation fails, an incident is automatically captured
if not results['ready_to_deploy']:
    print("Validation failed - incident captured for learning")

# View recent incidents
incidents = pipeline.incident_db.get_all_incidents(limit=5)
for inc in incidents:
    print(f"{inc.id}: {inc.description}")
```

### Manual Incident Capture

For incidents outside the pipeline:

```python
from agent.incident_learning import capture_deployment_failure

# Capture deployment failure
incident = capture_deployment_failure(
    description="BGP peering failed - incorrect AS number",
    network_model=model.to_dict(),
    validation_errors=[
        {"error": "BGP AS mismatch", "type": "routing"}
    ],
    affected_devices=["spine-01", "spine-02"]
)

print(f"Captured incident: {incident.id}")
```

### Complete Learning Workflow

```python
from agent.incident_learning import learn_from_incident

# Full learning cycle: analyze → test → update
learnings = learn_from_incident(incident)

print(f"Root cause: {learnings['root_cause']}")
print(f"Regression test: {learnings['regression_test']}")
print(f"Confidence: {learnings['confidence']:.2f}")

# The incident record is updated with the learnings
updated = db.get_incident(incident.id)
assert updated.root_cause is not None
assert updated.regression_test is not None
```

### Batch Learning from Recent Incidents

```python
# Analyze the last 10 incidents
learnings = pipeline.learn_from_recent_incidents(limit=10)

print(f"Total incidents: {learnings['total_incidents']}")
print(f"Unresolved: {learnings['unresolved']}")
print(f"Analyzed: {learnings['analyzed']}")

# Each learning includes a regression test
for learning in learnings['learnings']:
    print(f"  {learning['incident_id']}: {learning['root_cause']}")
```

## Incident Categories

### Deployment Failures

- Pre-flight validation errors
- Config generation failures
- Deployment script errors
- Syntax errors

### Configuration Errors

- Invalid parameters
- Duplicate IDs (VLANs, IPs, etc.)
- Reference errors
- Constraint violations

### Routing Issues

- Routing loops
- BGP/OSPF misconfigurations
- Missing routes
- Blackholes

### Network Outages

- Link failures
- Device failures
- Cascading failures
- Service disruptions

## Confidence Scoring

The RCA analyzer calculates confidence based on:

| Factor | Weight | Example |
|--------|--------|---------|
| **Similar incidents found** | 40% | 3+ similar = +0.4 |
| **Resolved incidents** | 30% | 2+ resolved = +0.3 |
| **Patterns extracted** | 30% | 2+ patterns = +0.3 |

**Confidence Levels:**

- **0.7 - 1.0**: High confidence - safe to auto-apply learnings
- **0.4 - 0.7**: Medium confidence - review before applying
- **0.0 - 0.4**: Low confidence - requires manual analysis
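To make the weighting concrete, here is a minimal sketch of how those three factors could combine into a single score. Only the weights and thresholds come from the table above; the linear ramps (partial credit below the 3+/2+/2+ thresholds) are an assumption:

```python
def score_confidence(similar_incidents, patterns):
    """Illustrative combination of the factors in the table above;
    not the actual RootCauseAnalyzer implementation."""
    score = 0.0

    # Similar incidents found (40%): full weight at 3+ matches
    score += 0.4 * min(len(similar_incidents) / 3, 1.0)

    # Resolved incidents (30%): full weight at 2+ resolved matches
    resolved = [i for i in similar_incidents if i.root_cause]
    score += 0.3 * min(len(resolved) / 2, 1.0)

    # Patterns extracted (30%): full weight at 2+ patterns
    score += 0.3 * min(len(patterns) / 2, 1.0)

    return round(score, 2)
```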
## Vector Search with ChromaDB

### Installation

```bash
# Install ChromaDB for vector search
pip install chromadb

# Verify
python -c "import chromadb; print('ChromaDB installed')"
```

### Benefits Over Keyword Search

| Feature | ChromaDB | Keyword Search |
|---------|----------|----------------|
| **Semantic similarity** | ✅ Finds conceptually similar incidents | ❌ Exact keyword match only |
| **Typo tolerance** | ✅ Handles misspellings | ❌ Requires exact match |
| **Context-aware** | ✅ Understands intent | ❌ Literal matching |
| **Performance** | ✅ Fast vector search | ⚠️ Linear scan |

### Example: Semantic vs Keyword

**Query:** "BGP session won't come up"

**ChromaDB finds:**

- "BGP neighbor not establishing"
- "BGP peering failed"
- "Routing protocol adjacency issue"

**Keyword search finds:**

- Only exact matches with "BGP session"
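`IncidentDatabase` wires this up for you, but if you want to experiment with the underlying ChromaDB API directly, here is a self-contained sketch; the collection name, storage path, and sample data are illustrative:

```python
import chromadb

# Persistent local store (path is illustrative)
client = chromadb.PersistentClient(path="./chroma_incidents")
collection = client.get_or_create_collection("incidents")

# Index an incident description; metadata enables filtered queries
collection.add(
    ids=["deploy-20251125-120000"],
    documents=["VLAN 100 duplicate configuration on leaf-01/leaf-02"],
    metadatas=[{"category": "deployment_failure", "severity": "high"}],
)

# Semantic query: matches the stored incident even though the
# wording differs from its description
results = collection.query(
    query_texts=["duplicate VLAN IDs rejected during deploy"],
    n_results=5,
    where={"category": "deployment_failure"},
)
print(results["ids"][0])
```

Metadata filters (`where=...`) let you combine semantic search with the same category/severity filtering that `get_all_incidents` offers.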
## Integration with Pipeline

### Stage 0: Pre-flight Validation

Incidents are automatically captured on validation failures:

```python
# Pipeline captures an incident when validation fails
results = pipeline.stage0_preflight(model)

if not results['ready_to_deploy']:
    # Incident created with:
    # - Validation errors
    # - Network model
    # - Affected devices
    # - Timestamp
    pass
```

### Background Learning Job

In production, run learning as a background job:

```bash
#!/bin/bash
# Cron job: learn from incidents every hour
cd /opt/overgrowth
source venv/bin/activate
python -c "
from agent.pipeline_engine import OvergrowthPipeline

pipeline = OvergrowthPipeline()
learnings = pipeline.learn_from_recent_incidents(limit=20)

if learnings['analyzed'] > 0:
    print(f'Analyzed {learnings[\"analyzed\"]} incidents')
    print(f'Generated {learnings[\"analyzed\"]} regression tests')
"
```

### Continuous Improvement Loop

```
1. Deploy → Fail → Capture incident
2. Analyze → Find root cause
3. Generate test → Prevent recurrence
4. Next deploy → Test catches issue
5. Fix → Deploy succeeds
6. Knowledge updated → Future deploys smarter
```
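For deployments that happen outside the pipeline, the same loop can be closed with a small wrapper. A hedged sketch: `deploy_fn` stands in for whatever actually pushes configs, and the error handling is an assumption; `capture_deployment_failure` and `learn_from_incident` are the helpers shown earlier:

```python
from agent.incident_learning import capture_deployment_failure, learn_from_incident

def deploy_with_learning(deploy_fn, model, devices):
    """Hypothetical wrapper: run a deployment callable and feed any
    failure back into the learning loop (steps 1-6 above)."""
    try:
        return deploy_fn(model, devices)  # your actual deploy mechanism
    except Exception as exc:
        # Steps 1-2: the failure becomes a stored incident
        incident = capture_deployment_failure(
            description=f"Deployment failed: {exc}",
            network_model=model.to_dict(),
            validation_errors=[{"error": str(exc), "type": "deployment"}],
            affected_devices=devices,
        )
        # Steps 3-6: analyze, generate a regression test, update knowledge
        learnings = learn_from_incident(incident)
        print(f"{incident.id}: {learnings['root_cause']} "
              f"(confidence {learnings['confidence']:.2f})")
        raise
```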
## Regression Test Types

### pyATS Tests (Network Validation)

For routing issues, interface states, and protocol validation:

```python
# Generated for routing incidents
class TestRoutingLoop(aetest.Testcase):

    @aetest.test
    def verify_no_routing_loops(self, testbed):
        # affected_devices comes from the incident record
        for device in affected_devices:
            routes = device.parse('show ip route')
            # Check for loops
            assert no_loops_detected(routes)
```

### pytest Tests (Config Validation)

For schema errors, policy violations, and syntax issues:

```python
# Generated for config incidents
def test_vlan_uniqueness():
    """Prevent duplicate VLAN IDs"""
    model = load_network_model()
    vlan_ids = [v['id'] for v in model['vlans']]

    # Check for duplicates
    assert len(vlan_ids) == len(set(vlan_ids)), "Duplicate VLAN IDs found"
```

## Knowledge Base Updates

### LLM Prompt Updates

Based on learnings, update system prompts:

```python
# Before learning
"Generate network configurations ensuring basic syntax validity"

# After 5 VLAN duplicate incidents
"Generate network configurations. CRITICAL: Ensure VLAN IDs are unique
across all devices. Check for duplicates before generating configs.
This is a common failure point."
```

### Policy Engine Updates

Add new rules based on incidents:

```python
# After learning from incident
class NetworkPolicy:

    def check_vlan_uniqueness(self, model):
        """Added after incident: deploy-20251125-120000"""
        vlan_ids = [v['id'] for v in model['vlans']]
        duplicates = [v for v in vlan_ids if vlan_ids.count(v) > 1]

        if duplicates:
            self.add_violation(
                severity='ERROR',
                message=f"Duplicate VLAN IDs: {duplicates}",
                learned_from='deploy-20251125-120000'
            )
```

## Query Examples

### Find Incidents by Pattern

```python
# Find all BGP-related incidents
bgp_incidents = db.search_similar("BGP peering routing protocol", n_results=10)

# Find VLAN issues
vlan_incidents = db.search_similar("VLAN configuration duplicate", n_results=10)

# Find recent critical incidents
critical = db.get_all_incidents(severity="critical", limit=20)
```

### Analyze Incident Trends

```python
from collections import Counter
from datetime import datetime, timedelta

# Get incidents from the last 30 days
all_incidents = db.get_all_incidents(limit=1000)
recent = [
    i for i in all_incidents
    if datetime.fromisoformat(i.timestamp) > datetime.now() - timedelta(days=30)
]

# Group by category
categories = Counter(i.category for i in recent)

print("Incident trends:")
for category, count in categories.most_common():
    print(f"  {category}: {count}")
```

### Root Cause Analysis Report

```python
# Generate a root cause analysis report
incidents = db.get_all_incidents(limit=50)
analyzer = RootCauseAnalyzer(db)

report = []
for incident in incidents:
    if not incident.root_cause:  # Unresolved
        analysis = analyzer.analyze(incident)
        report.append({
            'incident': incident.id,
            'description': incident.description,
            'suggested_cause': analysis['suggested_root_cause'],
            'confidence': analysis['confidence'],
            'similar_count': len(analysis['similar_incidents'])
        })

# Sort by confidence
report.sort(key=lambda x: x['confidence'], reverse=True)

for item in report[:10]:
    print(f"{item['incident']}: {item['suggested_cause']} (confidence: {item['confidence']:.2f})")
```

## Best Practices

### 1. Capture Rich Context

```python
# Good: includes full context
incident = Incident(
    description="Deployment failed: duplicate VLAN 100",
    network_model=model.to_dict(),            # Full model
    validation_errors=errors,                 # All errors
    affected_devices=["leaf-01", "leaf-02"],  # Specific devices
    config_changes=[...]                      # What changed
)

# Bad: minimal context
incident = Incident(
    description="Deployment failed"
)
```

### 2. Resolve Incidents

```python
from datetime import datetime

# Update with resolution
db.update_incident(incident.id, {
    'root_cause': 'Schema validation missed duplicate check',
    'resolution': 'Added VLAN uniqueness validator',
    'resolved_at': datetime.now().isoformat()
})

# Resolved incidents improve future analysis
```

### 3. Run Regression Tests

Add generated tests to CI/CD:

```
tests/
  regression/
    test_deploy_20251125_120000.py   # Auto-generated
    test_routing_20251120_100000.py
    test_vlan_20251118_143000.py
```

Run them before each deployment:

```bash
pytest tests/regression/ --tb=short
```

### 4. Review Learnings

```python
# Weekly review of learnings
learnings = pipeline.learn_from_recent_incidents(limit=50)

for learning in learnings['learnings']:
    if learning['confidence'] > 0.7:
        print("High confidence learning:")
        print(f"  Incident: {learning['incident_id']}")
        print(f"  Root cause: {learning['root_cause']}")
        print(f"  Test: {learning['regression_test']}")
```

## Troubleshooting

### ChromaDB Not Installing

```bash
# If ChromaDB fails to install
pip install chromadb --no-deps
pip install onnxruntime pydantic-settings
```

Or rely on the automatic fallback:

```python
db = IncidentDatabase()  # Will use keyword search instead of vector search
```

### Incident Database Corruption

```bash
# Back up incidents
cp ~/.overgrowth/incidents/incidents.json ~/incidents_backup.json

# Reset database
rm -rf ~/.overgrowth/incidents/

# Restore from backup
mkdir -p ~/.overgrowth/incidents
cp ~/incidents_backup.json ~/.overgrowth/incidents/incidents.json
```

### Low Confidence Scores

**Causes:**

- Few historical incidents
- No resolved incidents
- No similar patterns

**Solutions:**

1. Manually resolve incidents with root causes
2. Add more context to incident descriptions
3. Wait for more incidents to build history
4. Use an LLM for better analysis

## Future Enhancements

### Planned Features

- **LLM Integration**: Claude/GPT-4 for advanced root cause analysis
- **Automated Fix Generation**: AI-generated config fixes
- **Incident Clustering**: Group related incidents automatically
- **Predictive Alerts**: Warn before incidents occur
- **Multi-tenant**: Separate incident databases per environment

### Community Contributions

See `CONTRIBUTING.md` for:

- Adding new incident categories
- Improving root cause heuristics
- Custom regression test templates
- Integration with monitoring tools (Prometheus, Grafana)

## References

- **ChromaDB Documentation**: https://docs.trychroma.com/
- **pyATS Documentation**: https://developer.cisco.com/docs/pyats/
- **Overgrowth Repository**: https://huggingface.co/spaces/MCP-1st-Birthday/overgrowth
- **Related Docs**:
  - `NETBOX_INTEGRATION.md` - Source of truth
  - `BATFISH_INTEGRATION.md` - Static analysis
  - `SUZIEQ_INTEGRATION.md` - Drift detection

## Support

Questions? Found a bug? Want to contribute?

- Open an issue on HuggingFace Spaces
- Join Discord: [link]
- Email: overgrowth@example.com