overgrowth / DEPLOYMENT_GUIDE.md
Graham Paasch
docs: Complete deployment guide for Stage 6
c81a736

Stage 6: Autonomous Deployment Guide

Complete guide for deploying configurations to real network devices using Overgrowth's autonomous deployment engine.

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                     Deployment Orchestration                     │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  1. Config Generation (Jinja2 Templates)                   │  │
│  │  2. Pre-Deployment Validation                              │  │
│  │  3. Device Connection (Netmiko/NAPALM)                     │  │
│  │  4. Configuration Deployment                               │  │
│  │  5. Post-Deployment Verification                           │  │
│  │  6. Automatic Rollback (on failure)                        │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
           │                    │                    │
           ▼                    ▼                    ▼
    ┌──────────┐         ┌──────────┐        ┌──────────┐
    │ Cisco    │         │ Arista   │        │ Juniper  │
    │ IOS/NXOS │         │ EOS      │        │ JunOS    │
    └──────────┘         └──────────┘        └──────────┘

Components

1. DeviceDriver (agent/device_driver.py)

Manages connections and config deployment to network devices.

Supported Platforms:

  • Cisco IOS
  • Cisco NXOS (Nexus)
  • Cisco IOS-XE (ASR, ISR, etc.)
  • Arista EOS
  • Juniper JunOS

Features:

  • Connection pooling and management
  • Config backup before deployment
  • Automatic rollback on failure
  • Dry-run mode (validate without deploying)
  • Mock mode (testing without devices)

2. ConfigTemplateEngine (agent/config_templates.py)

Generates device configs from Jinja2 templates.

Built-in Templates:

  • cisco_ios_l2_switch - Layer 2 access switch
  • cisco_ios_l3_router - Layer 3 router with OSPF/BGP
  • arista_eos - Arista EOS switch/router
  • juniper_junos - Juniper router/switch

Features:

  • Variable substitution from NetworkModel
  • Custom template support
  • Template validation
  • Vendor-specific config generation

3. DeploymentEngine (agent/deployment_engine.py)

Orchestrates the entire deployment workflow.

Workflow:

  1. Generate config from template
  2. Connect to device
  3. Run pre-deployment checks
  4. Backup current config
  5. Deploy new config
  6. Run post-deployment checks
  7. Rollback if checks fail
  8. Record deployment history
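
The steps above can be read as a single control flow. The sketch below is a minimal, self-contained illustration and not the actual DeploymentEngine code: `run_workflow`, `WorkflowResult`, and the injected callables are hypothetical names, and steps 1 and 2 (config generation and connection) are assumed to have happened already.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class WorkflowResult:
    """Hypothetical result record; field names are illustrative only."""
    status: str                  # 'success', 'failed', or 'rolled_back'
    error: Optional[str] = None


def run_workflow(
    new_config: str,
    get_config: Callable[[], str],
    apply_config: Callable[[str], None],
    pre_checks: List[Callable[[], bool]],
    post_checks: List[Callable[[], bool]],
    history: List[WorkflowResult],
) -> WorkflowResult:
    """Run steps 3-8 of the workflow with injected device operations."""
    # Step 3: abort before touching the device if any pre-check fails.
    if not all(check() for check in pre_checks):
        result = WorkflowResult("failed", "pre-check failed")
    else:
        backup = get_config()            # Step 4: back up the current config
        apply_config(new_config)         # Step 5: deploy the new config
        # Step 6: verify the device after the change.
        if all(check() for check in post_checks):
            result = WorkflowResult("success")
        else:
            apply_config(backup)         # Step 7: restore the backup
            result = WorkflowResult("rolled_back", "post-check failed")
    history.append(result)               # Step 8: record the outcome
    return result


# Usage with an in-memory stand-in for a device.
device = {"config": "hostname old"}
history: List[WorkflowResult] = []
outcome = run_workflow(
    new_config="hostname new",
    get_config=lambda: device["config"],
    apply_config=lambda cfg: device.update(config=cfg),
    pre_checks=[lambda: True],
    post_checks=[lambda: False],         # force a post-check failure
    history=history,
)
print(outcome.status, device["config"])
```

Passing the device operations in as callables keeps the control flow testable without any hardware. That is one possible design choice, not necessarily how the engine is structured.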

Usage Examples

Example 1: Deploy Single Device

from agent.deployment_engine import DeploymentEngine, DeploymentTask, DeviceType

# Initialize engine
deployer = DeploymentEngine(use_napalm=True)

# Create deployment task
task = DeploymentTask(
    device_id='core-sw-1',
    device_type=DeviceType.CISCO_IOS,
    hostname='192.168.1.10',
    username='admin',
    password='admin123',
    config="""
hostname core-sw-1
!
vlan 10
 name DATA
vlan 20
 name VOICE
!
interface GigabitEthernet0/1
 switchport mode access
 switchport access vlan 10
 no shutdown
!
""",
    dry_run=False,
    pre_checks=['command:show version'],
    post_checks=['interface:GigabitEthernet0/1']
)

# Deploy
result = deployer.deploy_single_device(task)

print(f"Status: {result.status.value}")
print(f"Duration: {result.duration_seconds:.1f}s")

if result.status.value == 'success':
    print("✓ Deployment successful!")
else:
    print(f"✗ Deployment failed: {result.error}")
    if result.rolled_back:
        print("✓ Configuration rolled back")

Example 2: Generate and Deploy from Template

from agent.deployment_engine import DeploymentEngine
from agent.pipeline_engine import NetworkModel, Device, NetworkIntent

# Create network model
model = NetworkModel(
    name="campus-network",
    version="1.0",
    intent=NetworkIntent(
        description="Campus network deployment",
        business_requirements=["High availability", "VLAN segmentation"],
        constraints=["Budget friendly"]
    ),
    devices=[
        Device(
            name="access-sw-1",
            role="access",
            model="Catalyst 2960",
            vendor="Cisco",
            mgmt_ip="192.168.1.20",
            location="Building A",
            interfaces=[
                {
                    "name": "GigabitEthernet0/1",
                    "description": "Uplink to core",
                    "mode": "trunk",
                    "enabled": True
                },
                {
                    "name": "GigabitEthernet0/2",
                    "description": "Workstation port",
                    "mode": "access",
                    "vlan": 10,
                    "enabled": True
                }
            ]
        )
    ],
    vlans=[
        {"id": 10, "name": "DATA"},
        {"id": 20, "name": "VOICE"},
        {"id": 99, "name": "MANAGEMENT"}
    ],
    subnets=[
        {"network": "10.0.10.0/24", "vlan": 10},
        {"network": "10.0.20.0/24", "vlan": 20}
    ],
    routing={},
    services=["DHCP", "NTP"]
)

# Initialize deployer
deployer = DeploymentEngine(use_napalm=True)

# Network context for templates
network_context = {
    'vlans': model.vlans,
    'routing': model.routing,
    'domain_name': 'campus.local',
    'ntp_servers': ['0.pool.ntp.org'],
    'dns_servers': ['8.8.8.8']
}

# Credentials
credentials = {
    'username': 'admin',
    'password': 'secure123'
}

# Deploy each device
for device in model.devices:
    result = deployer.generate_and_deploy(
        device=device,
        network_context=network_context,
        credentials=credentials,
        dry_run=False,
        pre_checks=['command:show version'],
        post_checks=['command:show running-config']
    )
    
    print(f"{device.name}: {result.status.value}")
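
The examples in this guide hardcode passwords for brevity. One way to build the same credentials dict without putting secrets in source is to read them from the environment. This is only a sketch: the variable names `OVERGROWTH_USERNAME` and `OVERGROWTH_PASSWORD` and the helper `load_credentials` are hypothetical and not defined by Overgrowth.

```python
import os
from typing import Dict, Mapping

# Hypothetical variable names chosen for this sketch.
REQUIRED_VARS = ("OVERGROWTH_USERNAME", "OVERGROWTH_PASSWORD")


def load_credentials(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build a {'username', 'password'} dict from environment variables."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "username": env["OVERGROWTH_USERNAME"],
        "password": env["OVERGROWTH_PASSWORD"],
    }


# Usage with an explicit mapping, so the sketch runs without real secrets.
credentials = load_credentials(
    {"OVERGROWTH_USERNAME": "admin", "OVERGROWTH_PASSWORD": "example-only"}
)
print(sorted(credentials))
```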

Example 3: Dry-Run Mode (Test Without Deploying)

from agent.pipeline_engine import OvergrowthPipeline

pipeline = OvergrowthPipeline()

# Generate network model
intent = pipeline.stage1_consultation("Deploy 3-tier campus network")
model = pipeline.stage2_generate_sot(intent)

# Dry-run deployment (generates configs, validates, but doesn't deploy)
results = pipeline.stage6_autonomous_deploy(
    model=model,
    credentials={'username': 'admin', 'password': 'admin'},
    dry_run=True,  # No actual changes to devices
    parallel=False
)

print(f"Dry-run complete: {results['successful']}/{results['total_devices']} would succeed")

for r in results['results']:
    print(f"  {r['device_id']}: {r['status']}")
    if r['status'] == 'failed':
        print(f"    Error: {r['error']}")

Example 4: Parallel Deployment with Ray

from agent.pipeline_engine import OvergrowthPipeline

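pipeline = OvergrowthPipeline()

# Generate network model (same steps as Example 3)
intent = pipeline.stage1_consultation("Deploy 3-tier campus network")
model = pipeline.stage2_generate_sot(intent)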

# Enable parallel mode
pipeline.enable_parallel_mode()

# Deploy to all devices in parallel
results = pipeline.stage6_autonomous_deploy(
    model=model,
    credentials={'username': 'admin', 'password': 'admin'},
    dry_run=False,
    parallel=True  # Use Ray for concurrent deployment
)

print(f"Deployed to {results['successful']}/{results['total_devices']} devices")
print(f"Success rate: {results['success_rate']:.1f}%")
print(f"Rolled back: {results['rolled_back']}")

Example 5: Custom Pre/Post Validation Checks

from agent.deployment_engine import DeploymentEngine, DeploymentTask, DeviceType

deployer = DeploymentEngine()

task = DeploymentTask(
    device_id='border-rtr-1',
    device_type=DeviceType.CISCO_XE,
    hostname='10.0.0.1',
    username='admin',
    password='admin',
    config=router_config,  # router_config: a config string prepared earlier (not shown here)
    dry_run=False,
    # Pre-deployment checks
    pre_checks=[
        'command:show version',
        'command:show ip interface brief',
        'ping:8.8.8.8',  # Check internet connectivity
    ],
    # Post-deployment checks
    post_checks=[
        'interface:GigabitEthernet0/0',  # Verify interface up
        'ping:10.0.1.1',  # Verify internal connectivity
        'command:show ip bgp summary',  # Verify BGP
    ]
)

result = deployer.deploy_single_device(task)

# Check which validations passed/failed
print("Pre-checks:", result.pre_check_results)
print("Post-checks:", result.post_check_results)

Configuration Templates

Cisco IOS L2 Switch Template

Located in config_templates.py as CISCO_IOS_L2_SWITCH_TEMPLATE.

Variables:

  • device.name - Hostname
  • device.mgmt_ip - Management IP
  • vlans - List of VLAN dicts (id, name)
  • device.interfaces - List of interface dicts
  • default_gateway - Default gateway IP
  • ntp_servers - List of NTP server IPs
  • dns_servers - List of DNS server IPs

Example:

from agent.config_templates import generate_cisco_ios_config

config = generate_cisco_ios_config(
    device=my_device,
    vlans=[
        {"id": 10, "name": "DATA"},
        {"id": 20, "name": "VOICE"}
    ],
    ntp_servers=['0.pool.ntp.org'],
    dns_servers=['8.8.8.8'],
    default_gateway='192.168.1.1'
)

Cisco IOS L3 Router Template

Includes routing protocols (OSPF, BGP, static routes).

Variables:

  • All L2 variables plus:
  • routing.protocol - 'ospf', 'bgp', or 'static'
  • routing.process_id - OSPF process ID
  • routing.area - OSPF area ID (the example below passes 'area': 0)
  • routing.networks - List of networks to advertise
  • routing.asn - BGP AS number
  • routing.neighbors - List of BGP neighbor dicts

Example:

from agent.config_templates import generate_cisco_ios_config

config = generate_cisco_ios_config(
    device=router_device,
    vlans=[],
    routing={
        'protocol': 'ospf',
        'process_id': 1,
        'networks': ['10.0.0.0 0.0.255.255'],
        'area': 0
    }
)

Custom Templates

Create custom Jinja2 templates:

from agent.config_templates import ConfigTemplateEngine

engine = ConfigTemplateEngine()

# Add custom template
custom_template = """
hostname {{ device.name }}
!
{% for vlan in vlans %}
vlan {{ vlan.id }}
 name {{ vlan.name }}
{% endfor %}
!
"""

engine.add_custom_template('my_custom_template', custom_template)

# Use it
config = engine.render_template('my_custom_template', {
    'device': {'name': 'my-switch'},
    'vlans': [{'id': 10, 'name': 'DATA'}]
})

Validation Checks

Check Types

Command checks:

'command:show version'  # Run command, pass if no error

Ping checks:

'ping:8.8.8.8'  # Ping target, pass if successful

Interface checks:

'interface:GigabitEthernet0/1'  # Check interface status, pass if up
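
All three check types share a `type:argument` shape. How the engine parses these strings is not shown in this guide. The sketch below is one plausible way to split and validate them, and `Check`, `parse_check`, and `KNOWN_CHECK_TYPES` are hypothetical names.

```python
from typing import NamedTuple

KNOWN_CHECK_TYPES = {"command", "ping", "interface"}


class Check(NamedTuple):
    kind: str    # 'command', 'ping', or 'interface'
    target: str  # command text, ping destination, or interface name


def parse_check(spec: str) -> Check:
    """Split a 'type:argument' check string into its two parts."""
    # partition() splits on the first colon only, so later colons
    # (for example in an IPv6 address) stay in the target.
    kind, sep, target = spec.partition(":")
    if not sep or not target.strip():
        raise ValueError(f"Expected 'type:argument', got {spec!r}")
    if kind not in KNOWN_CHECK_TYPES:
        raise ValueError(f"Unknown check type {kind!r} in {spec!r}")
    return Check(kind, target.strip())


for spec in ["command:show version", "ping:8.8.8.8", "interface:GigabitEthernet0/1"]:
    print(parse_check(spec))
```

Validating check strings before connecting to a device surfaces typos such as `pign:8.8.8.8` early, rather than partway through a deployment.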

Check Timing

  • Pre-checks: Run before config deployment

    • Verify device accessible
    • Check current state
    • Validate prerequisites

  • Post-checks: Run after config deployment

    • Verify config applied
    • Test connectivity
    • Validate services

Error Handling & Rollback

Automatic Rollback

If post-deployment checks fail, the engine automatically rolls back:

  1. Detect check failure
  2. Log error details
  3. Deploy previous config (from backup)
  4. Mark deployment as ROLLED_BACK

result = deployer.deploy_single_device(task)

if result.rolled_back:
    print("Deployment failed and was rolled back")
    print(f"Reason: {result.error}")
    print(f"Config restored to: {result.config_before[:100]}...")

Manual Rollback

from agent.device_driver import DeviceDriver, DeviceCredentials, DeviceType

driver = DeviceDriver()

# Connect
creds = DeviceCredentials(
    hostname='192.168.1.10',
    username='admin',
    password='admin',
    device_type=DeviceType.CISCO_IOS
)

conn = driver.connect(creds)

# Get current config
backup = driver.get_config('192.168.1.10', 'running')

# ... something goes wrong ...

# Rollback
driver.rollback_config('192.168.1.10', backup)

Multi-Vendor Support

Cisco IOS/IOS-XE

from agent.device_driver import DeviceType

# Cisco Catalyst, ISR, ASR
device_type = DeviceType.CISCO_IOS  # or CISCO_XE

Supported Features:

  • Config merge and replace
  • Running/startup config backup
  • Auto-save on Cisco IOS

Cisco NXOS (Nexus)

device_type = DeviceType.CISCO_NXOS

Features:

  • Checkpoint/rollback support
  • Config replace via NAPALM

Arista EOS

device_type = DeviceType.ARISTA_EOS

Features:

  • Config sessions
  • Atomic commits
  • Fast boot times

Juniper JunOS

device_type = DeviceType.JUNIPER_JUNOS

Features:

  • Candidate config
  • Commit confirmed
  • Rollback points

Testing Without Devices (Mock Mode)

All components support mock mode for development/testing:

from agent.device_driver import DeviceDriver

# Initialize in mock mode (auto-detected if Netmiko/NAPALM not installed)
driver = DeviceDriver()

print(f"Mock mode: {driver.mock_mode}")  # True if Netmiko/NAPALM are not installed

# Mock connections always succeed
conn = driver.connect(credentials)
print(f"Connected: {conn.status}")  # CONNECTED

# Mock deployments simulate success
result = driver.deploy_config('device-1', config)
print(f"Deployed: {result.success}")  # True
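
The auto-detection mentioned above can be approximated by checking whether the connection libraries are importable. This is a sketch of the general technique, not the DeviceDriver implementation: `any_library_available` is a hypothetical helper, and the rule "mock mode when neither library is installed" is an assumption.

```python
import importlib.util
from typing import Iterable


def any_library_available(names: Iterable[str]) -> bool:
    """Return True if at least one of the named top-level modules is importable."""
    return any(importlib.util.find_spec(name) is not None for name in names)


# Assumed rule: mock mode is on when neither connection library is installed.
mock_mode = not any_library_available(["netmiko", "napalm"])
print(f"Mock mode: {mock_mode}")
```

`importlib.util.find_spec` only looks the module up and does not import it, so the check is cheap and has no side effects from the libraries themselves.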

Deployment History & Metrics

from agent.deployment_engine import DeploymentEngine

deployer = DeploymentEngine()

# ... deploy devices ...

# Get summary
summary = deployer.get_deployment_summary()

print(f"Total deployments: {summary['total_deployments']}")
print(f"Success rate: {summary['success_rate']:.1f}%")
print(f"Average duration: {summary['avg_duration']:.1f}s")

# Recent deployments
for dep in summary['latest_deployments']:
    print(f"{dep['device_id']}: {dep['status']} ({dep['duration']:.1f}s)")
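
For intuition, the summary fields used above could be derived from a list of per-deployment records roughly as follows. The record layout and the `summarize` helper are assumptions made for this sketch. Only the key names `total_deployments`, `success_rate`, `avg_duration`, and `latest_deployments` come from the example above.

```python
from typing import Dict, List


def summarize(history: List[Dict], latest: int = 5) -> Dict:
    """Aggregate deployment records into summary metrics."""
    total = len(history)
    successes = sum(1 for rec in history if rec["status"] == "success")
    return {
        "total_deployments": total,
        # Guard against division by zero when nothing has been deployed yet.
        "success_rate": 100.0 * successes / total if total else 0.0,
        "avg_duration": sum(rec["duration"] for rec in history) / total if total else 0.0,
        "latest_deployments": history[-latest:],
    }


history = [
    {"device_id": "core-sw-1", "status": "success", "duration": 12.0},
    {"device_id": "access-sw-1", "status": "rolled_back", "duration": 20.0},
]
summary = summarize(history)
print(f"{summary['success_rate']:.1f}% over {summary['total_deployments']} deployments")
```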

Troubleshooting

Connection Failures

Symptom: Failed to connect: timeout

Solutions:

from agent.device_driver import DeviceDriver, DeviceCredentials, DeviceType

# Increase timeout
credentials = DeviceCredentials(
    hostname='192.168.1.10',
    username='admin',
    password='admin',
    device_type=DeviceType.CISCO_IOS,
    timeout=60  # Increase from default 30s
)

# Check network connectivity
driver = DeviceDriver()
driver.verify_connectivity('device-id')

# Enable session logging for debugging
credentials.session_log = '/tmp/device-session.log'

Authentication Failures

Symptom: Failed to connect: authentication failed

Solutions:

from agent.device_driver import DeviceCredentials, DeviceType

# For devices requiring enable password
credentials = DeviceCredentials(
    hostname='192.168.1.10',
    username='admin',
    password='admin',
    secret='enable_password',  # Enable secret
    device_type=DeviceType.CISCO_IOS
)

Config Deployment Failures

Symptom: Deployment failed: command error

Solutions:

# Use dry-run to validate first
# (deploy_config is called on a DeviceDriver instance, as in the mock-mode example)
result = driver.deploy_config(
    device_id='device-1',
    config=config,
    dry_run=True  # Test without applying
)

print(f"Would work: {result.success}")

# Check diff before deploying
if result.output:
    print(f"Changes:\n{result.output}")
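
The exact contents of the dry-run output are not specified here. As an independent way to preview changes, the backed-up config and the candidate config can be compared locally with Python's standard `difflib`. `config_diff` is a hypothetical helper written for this sketch.

```python
import difflib


def config_diff(before: str, after: str) -> str:
    """Return a unified diff between two config texts (empty if identical)."""
    lines = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="running-config",
        tofile="candidate-config",
        lineterm="",
    )
    return "\n".join(lines)


before = "hostname core-sw-1\nvlan 10\n name DATA\n"
after = "hostname core-sw-1\nvlan 10\n name DATA\nvlan 20\n name VOICE\n"
print(config_diff(before, after))
```

A text diff only shows line-level differences. It does not tell you whether the device will accept the new lines, so it complements a dry run rather than replacing it.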

Post-Check Failures

Symptom: Post-deployment check failed: ping:8.8.8.8

Solutions:

# Add delay before post-checks
import time
time.sleep(5)  # Wait for config to take effect

# Use more specific checks
post_checks=[
    'command:show ip interface brief',  # More specific than ping
    'interface:GigabitEthernet0/1'
]

# Disable rollback for troubleshooting
# (manually verify and fix)
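
A fixed sleep either waits too long or not long enough. An alternative is to retry the check a few times with a short delay between attempts. The sketch below shows that pattern in isolation: `wait_for_check` and its parameters are hypothetical, and the check is a plain callable, not an Overgrowth check string.

```python
import time
from typing import Callable


def wait_for_check(
    check: Callable[[], bool],
    attempts: int = 5,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry a check until it passes or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            sleep(delay_seconds)  # give the device time to converge
    return False


# Usage: a stand-in check that passes on the third call.
calls = {"n": 0}


def flaky_check() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


passed = wait_for_check(flaky_check, attempts=5, delay_seconds=0.0)
print(passed, calls["n"])
```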

Production Best Practices

1. Always Use Dry-Run First

# Test deployment
dry_result = pipeline.stage6_autonomous_deploy(model, dry_run=True)

# Review results
if dry_result['success_rate'] == 100.0:
    # Now deploy for real
    real_result = pipeline.stage6_autonomous_deploy(model, dry_run=False)

2. Use Pre-Flight Validation

# Run Stage 0 validation before deployment
preflight = pipeline.stage0_preflight(model)

if not preflight['ready_to_deploy']:
    print("Pre-flight failed - aborting")
    print(f"Errors: {preflight['errors']}")
    raise SystemExit(1)

# Deploy only after validation passes
pipeline.stage6_autonomous_deploy(model)

3. Implement Change Windows

from datetime import datetime, time as dt_time

def in_change_window():
    """Check if current time is in approved change window"""
    now = datetime.now()
    # Only deploy between 2 AM - 4 AM
    return dt_time(2, 0) <= now.time() <= dt_time(4, 0)

if not in_change_window():
    print("Outside change window - aborting")
    raise SystemExit(1)

# Deploy during approved window
pipeline.stage6_autonomous_deploy(model)
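
The function above hardcodes a 02:00 to 04:00 window and reads the local clock directly. A slightly more general sketch takes the window bounds and the current time as parameters, which makes it testable and lets it handle windows that cross midnight. The signature is an illustrative assumption.

```python
from datetime import datetime, time as dt_time
from typing import Optional


def in_change_window(start: dt_time, end: dt_time, now: Optional[datetime] = None) -> bool:
    """Return True if `now` (default: local time) falls inside [start, end]."""
    current = (now or datetime.now()).time()
    if start <= end:
        return start <= current <= end
    # Window wraps past midnight, e.g. 23:00 to 01:00.
    return current >= start or current <= end


print(in_change_window(dt_time(2, 0), dt_time(4, 0), datetime(2024, 1, 1, 3, 0)))
print(in_change_window(dt_time(23, 0), dt_time(1, 0), datetime(2024, 1, 1, 0, 30)))
```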

4. Use Parallel Deployment Carefully

# Start with small batch
results = pipeline.parallel_deploy_fleet(
    model=model,
    staggered=True,
    stages=[0.01, 0.05, 0.1, 1.0]  # 1%, 5%, 10%, 100%
)

# Circuit breaker stops on high failure rate
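
The internals of `parallel_deploy_fleet` are not shown in this guide. To illustrate the idea of staged fractions with a circuit breaker, here is a sequential sketch under stated assumptions: the stages are treated as cumulative fractions of the fleet, and `staged_rollout` and `max_failure_rate` are hypothetical names.

```python
import math
from typing import Callable, Dict, List, Sequence


def staged_rollout(
    devices: Sequence[str],
    deploy: Callable[[str], bool],
    stages: Sequence[float] = (0.01, 0.05, 0.1, 1.0),
    max_failure_rate: float = 0.2,
) -> Dict[str, List[str]]:
    """Deploy in growing cumulative stages; stop when a stage fails too often."""
    succeeded: List[str] = []
    failed: List[str] = []
    done = 0
    for fraction in stages:
        # Cumulative target for this stage, always at least one device.
        # round() avoids float noise such as 5.000000001 being ceiled to 6.
        wanted = math.ceil(round(fraction * len(devices), 9))
        target = min(len(devices), max(1, wanted))
        batch = devices[done:target]
        if not batch:
            continue
        batch_failures = 0
        for device in batch:
            if deploy(device):
                succeeded.append(device)
            else:
                failed.append(device)
                batch_failures += 1
        done = target
        # Circuit breaker: halt the rollout if this stage failed too often.
        if batch_failures / len(batch) > max_failure_rate:
            break
    return {"succeeded": succeeded, "failed": failed, "skipped": list(devices[done:])}


devices = [f"sw-{i}" for i in range(100)]
result = staged_rollout(devices, deploy=lambda name: name != "sw-1")
print(len(result["succeeded"]), len(result["failed"]), len(result["skipped"]))
```

Starting with a very small first stage limits the blast radius: a bad config is caught on a handful of devices, and the remaining devices are never touched.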

5. Maintain Deployment Audit Trail

deployer = DeploymentEngine()

# Deploy
result = deployer.deploy_single_device(task)

# Log to external system
import json
import os
with open(f'/var/log/deployments/{result.device_id}.json', 'w') as f:
    json.dump({
        'device_id': result.device_id,
        'status': result.status.value,
        'timestamp': result.timestamp.isoformat(),
        'config_before': result.config_before,
        'config_after': result.config_after,
        'deployed_by': os.environ.get('USER'),
        'duration': result.duration_seconds
    }, f, indent=2)

Next Steps

  • ✅ Multi-vendor device support
  • ✅ Config templating
  • ✅ Pre/post validation
  • ✅ Automatic rollback
  • 🚧 Full Ray parallel deployment integration
  • 🚧 Advanced validation (pyATS test cases)
  • 🚧 Change request workflow
  • 🚧 Approval gates for production

You're now ready to deploy to real network devices! 🚀