[Day #59 PyATS Series] Detect Split-Horizon Issues in Large Networks Using pyATS for Cisco [Python for Network Engineer]



Introduction — key points

Split-horizon and related route-advertisement problems are subtle but can be catastrophic in large networks: routes sometimes fail to propagate where they should or, worse, are advertised back toward their source and create loops. Detecting these issues at scale requires automation, consistent evidence (raw CLI plus parsed JSON), and a methodical validation workflow.

In this article you will get:

  • A practical detection strategy for split-horizon and advertisement anomalies across EIGRP/RIP/BGP scenarios.
  • A pyATS job that snapshots interfaces, neighbors, and route tables, builds a topology map, and applies heuristics to flag suspicious routes.
  • Step-by-step guidance on how to interpret results, reduce false positives, and integrate with GUI reporting (Elasticsearch/Kibana).
  • Hands-on CLI examples (what to expect) and remediation tips.

Topology Overview

Use a compact multi-site lab that exercises interaction between distributed protocols:

  • PE-A and PE-B are edge routers: they peer between sites and with route reflectors.
  • C1/C2 are customer/LAN devices running distance-vector protocols (EIGRP/RIP) connecting into the backbone.
  • The automation host (pyATS) can SSH to all devices and also run data-plane tests (ping/traceroute) where needed.

This topology exercises the three common split-horizon flavors:

  1. Classic split-horizon (distance-vector like RIP/EIGRP): routes learned on an interface should not be advertised back out that same interface. Misconfig causes re-advertisement and possible loops.
  2. iBGP route distribution / route reflection anomalies: route reflectors that fail to forward routes to clients, or that re-advertise routes incorrectly, create reachability differences.
  3. Asymmetric propagation: a route is present on one side of the network but missing on the other (propagation failure).

Topology & Communications — what we collect and why

Management plane: SSH via pyATS (Genie-backed Device objects). For scale, run collection concurrently (we show a single-threaded version; you can upgrade to thread pools).

Key CLI outputs to collect (per device):

  • show ip interface brief — build IP → device mapping (needed to map next-hop IP to device).
  • show ip route — full RIB to discover prefixes, route codes, and next-hops.
  • Protocol neighbor commands:
    • EIGRP: show ip eigrp neighbors
    • RIP: show ip rip database and show ip protocols, or simply filter show ip route for R entries
    • BGP: show ip bgp summary, show ip bgp
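  • show logging (last 200 lines or so): search for route-withdraw / flapping messages. Classic IOS and IOS-XE have no | tail output filter, so trim the output on the automation host if the device rejects the pipe.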
  • Optional: show ip cef or forwarding table for data-plane validation.

Why collect both interface lists and routes?
To detect split-horizon anomalies we often need to map next-hop addresses (the address after "via" in a route line) back to the device that owns the address. The interface table provides that mapping.
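
As a minimal sketch of that lookup, the snippet below inverts per-device interface tables into an IP-to-owner dictionary. The addresses are illustrative lab values, and owner_of is a hypothetical helper that is not part of the script shown later.

```python
# Hypothetical per-device interface tables, shaped like parse_interfaces() output.
all_ifaces = {
    "PE_A": {"10.0.1.1": {"intf": "GigabitEthernet0/0", "status": "up"}},
    "PE_B": {"10.0.1.2": {"intf": "GigabitEthernet0/0", "status": "up"}},
}

# Invert to ip -> device so a next hop can be resolved to its owner.
ip2device = {ip: dev for dev, ifs in all_ifaces.items() for ip in ifs}

def owner_of(next_hop):
    """Return the device that owns next_hop, or None if no collected device has it."""
    return ip2device.get(next_hop)

print(owner_of("10.0.1.2"))   # PE_B
print(owner_of("192.0.2.2"))  # None: the address is outside the testbed
```

A next hop that resolves to None is not an error; it usually means the neighbor is outside the set of devices you collected from.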

Validation signals we produce:

  • Mutual next-hop anomalies: A learns prefix P via B and B learns P via A — suspicious (possible advertisement back).
  • Missing neighbor propagation: Originator has P → neighbor does not see P (neighbor should receive it).
  • Asymmetric reachability: control-plane state that disagrees with the data plane (traceroute/ping failure), which adds confidence to a finding.
  • Event correlation: syslog deny/withdraw messages logged around the time of a change.
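
To illustrate the asymmetric-reachability signal, here is a small sketch that compares a control-plane fact (does the device hold a route?) with the summary line of an IOS-style ping. The function names and labels are our own assumptions; the main script does not run pings.

```python
import re

# IOS-style summary line printed at the end of a ping.
PING_RE = re.compile(r"Success rate is (\d+) percent \((\d+)/(\d+)\)")

def ping_success_rate(raw):
    """Return the success percentage as an int, or None if no summary line is found."""
    m = PING_RE.search(raw or "")
    return int(m.group(1)) if m else None

def classify(control_plane_has_route, raw_ping):
    """Combine a control-plane fact with a data-plane probe into one label."""
    rate = ping_success_rate(raw_ping)
    if rate is None:
        return "unknown"
    if control_plane_has_route and rate == 0:
        return "route present but unreachable"
    if not control_plane_has_route and rate > 0:
        return "reachable without a specific route"
    return "consistent"

sample = (
    "Sending 5, 100-byte ICMP Echos to 172.16.10.1, timeout is 2 seconds:\n"
    ".....\n"
    "Success rate is 0 percent (0/5)"
)
print(classify(True, sample))  # route present but unreachable
```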

Workflow Script — full pyATS job (runnable)

Below is a single script detect_splithorizon.py. It is self-contained and includes robust parsing heuristics. Save next to your testbed.yml and run with python detect_splithorizon.py --testbed testbed.yml --run-id run001.

Warning: This script reads devices (read-only). Do not add write/clear commands. Test in lab.

#!/usr/bin/env python3
"""
detect_splithorizon.py
Detect split-horizon and route advertisement anomalies using pyATS.
Produces results/<run_id>/* with raw CLI + parsed JSON + anomaly report.
"""

import argparse, json, os, re, time
from pathlib import Path
from datetime import datetime
from genie.testbed import load

OUTDIR = Path("results")
OUTDIR.mkdir(exist_ok=True)

# Regex helpers
IP_RE = r'(?:\d{1,3}\.){3}\d{1,3}'
PREFIX_RE = r'\d+\.\d+\.\d+\.\d+/\d+'

# Basic route line regex (many IOS outputs follow this)
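# The lazy ".*?" sits inside the optional group so the next hop is actually captured;
# a greedy ".*" placed before the group would swallow the "via <ip>" part.
ROUTE_LINE_RE = re.compile(r'^(?P<code>[A-Z]+)\s+(?P<prefix>' + PREFIX_RE + r')(?:\s+.*?\bvia\s+(?P<next>' + IP_RE + r'))?', re.IGNORECASE)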
IFACE_LINE_RE = re.compile(r'^(?P<intf>\S+)\s+(?P<ip>' + IP_RE + r')\s+\S+\s+\S+\s+(?P<status>\S+)\s+(?P<protocol>\S+)', re.IGNORECASE)
NEIGH_IP_RE = re.compile(IP_RE)

def ts():
    return datetime.utcnow().isoformat() + "Z"

def save_text(run_id, device_name, label, text):
    d = OUTDIR / run_id / device_name
    d.mkdir(parents=True, exist_ok=True)
    p = d / f"{label}.txt"
    with open(p, "w") as f:
        f.write(text or "")
    return str(p)

def save_json(run_id, device_name, label, obj):
    d = OUTDIR / run_id / device_name
    d.mkdir(parents=True, exist_ok=True)
    p = d / f"{label}.json"
    with open(p, "w") as f:
        json.dump(obj, f, indent=2)
    return str(p)

def collect_device_outputs(device, run_id):
    """Collect the minimal set of outputs needed for analysis."""
    name = device.name
    print(f"[{ts()}] Collecting from {name}")
    device.connect(log_stdout=False)
    device.execute('terminal length 0')
    outputs = {}
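    cmds = {
        "interfaces": "show ip interface brief",
        "routes": "show ip route",
        "bgp_summary": "show ip bgp summary",
        "eigrp_neighbors": "show ip eigrp neighbors",
        "rip_status": "show ip rip database",
        # IOS/IOS-XE have no "| tail" output filter; collect the full buffer
        # and trim to the last N lines on the automation host if needed.
        "logs": "show logging"
    }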
    for label, cmd in cmds.items():
        try:
            out = device.execute(cmd)
        except Exception as e:
            out = f"ERROR executing {cmd}: {e}"
        outputs[label] = out
        save_text(run_id, name, label, out)
    device.disconnect()
    save_json(run_id, name, "raw_outputs", outputs)
    return outputs

def parse_interfaces(raw):
    """Return dict ip -> (intf, status)."""
    ip2intf = {}
    if not raw:
        return ip2intf
    for line in raw.splitlines():
        m = IFACE_LINE_RE.match(line.strip())
        if m:
            ip = m.group('ip')
            intf = m.group('intf')
            status = m.group('status')
            ip2intf[ip] = {"intf": intf, "status": status}
    return ip2intf

def parse_routes(raw):
    """
    Parse show ip route crude lines and return list of routes:
    [{prefix, code, next_hop (may be None), raw_line}]
    """
    routes = []
    if not raw:
        return routes
    for line in raw.splitlines():
        m = ROUTE_LINE_RE.match(line.strip())
        if m:
            routes.append({
                "prefix": m.group('prefix'),
                "code": m.group('code'),
                "next_hop": m.group('next'),
                "raw": line.strip()
            })
    return routes

def parse_neighbors(raw):
    """Extract neighbor IPs (from any neighbor command output) — heuristic."""
    if not raw:
        return []
    neighs = set()
    for m in NEIGH_IP_RE.finditer(raw):
        neighs.add(m.group(0))
    return sorted(neighs)  # always a list, in a stable order

def build_ip_owner_map(all_ifaces):
    """
    all_ifaces: dict device -> {ip -> info}
    return ip -> device mapping (first match wins)
    """
    ip2device = {}
    for dev, ifs in all_ifaces.items():
        for ip in ifs.keys():
            ip2device.setdefault(ip, dev)  # first match wins, as documented
    return ip2device

def detect_mutual_next_hop(routes_by_device, ip2device):
    """
    Find cases where device A has prefix P via next-hop B, and B has prefix P via next-hop A.
    Returns list of anomalies.
    """
    anomalies = []
    # build prefix-> device->nexthop mapping
    prefix_map = {}
    for dev, routes in routes_by_device.items():
        for r in routes:
            prefix_map.setdefault(r['prefix'], {})[dev] = r.get('next_hop')
    seen = set()  # report each (prefix, device pair) only once
    for prefix, dev_map in prefix_map.items():
        for a, next_a in dev_map.items():
            if not next_a:
                continue
            b = ip2device.get(next_a)
            if not b or b == a:
                continue
            # does b have the same prefix learned via a next hop owned by a?
            next_b_ip = dev_map.get(b)
            if not next_b_ip:
                continue
            if ip2device.get(next_b_ip) != a:
                continue
            key = (prefix, frozenset((a, b)))
            if key in seen:
                continue  # already reported from the other side
            seen.add(key)
            anomalies.append({
                "prefix": prefix, "device_a": a, "device_b": b,
                "a_next_hop": next_a, "b_next_hop": next_b_ip,
                "description": "Mutual next-hop detected (possible advertisement loop)"
            })
    return anomalies

def detect_missing_propagation(prefix_origins, routes_by_device, adjacency_map):
    """
    For each prefix origin device, check immediate neighbors (adjacency_map)
    to see if they have the prefix. If a neighbor lacks it, flag missing propagation.
    adjacency_map: device -> list of neighbor devices (by management IP mapping)
    """
    missing = []
    for prefix, origins in prefix_origins.items():
        for origin in origins:
            # neighbors of origin (list of neighbor device names)
            neighbors = adjacency_map.get(origin, [])
            for nbr in neighbors:
                # does neighbor have prefix in its routes?
                has = any(r['prefix'] == prefix for r in routes_by_device.get(nbr, []))
                if not has:
                    missing.append({
                        "prefix": prefix, "origin": origin, "neighbor": nbr,
                        "description": "Neighbor missing prefix (expected to receive advertisement)"
                    })
    return missing

def find_prefix_origins(all_ifaces, routes_by_device):
    """
    Heuristic: if a device has an interface IP that falls within a prefix,
    treat it as an origin of that prefix. Very coarse: simple containment,
    so every device attached to a shared subnet counts as an origin.
    The default route (/0) is skipped because it contains every address.
    """
    def ip_to_int(ip):
        parts = [int(p) for p in ip.split('.')]
        return (parts[0]<<24)|(parts[1]<<16)|(parts[2]<<8)|parts[3]
    def prefix_contains(prefix, ip):
        p, plen = prefix.split('/')
        plen = int(plen)
        mask = (0xffffffff << (32-plen)) & 0xffffffff
        return (ip_to_int(ip) & mask) == (ip_to_int(p) & mask)
    # unique prefixes seen anywhere in the network (computed once, not per interface)
    prefixes = {r['prefix'] for routes in routes_by_device.values() for r in routes}
    prefixes = {p for p in prefixes if not p.endswith('/0')}
    origins = {}
    for dev, ifs in all_ifaces.items():
        for ip in ifs:
            for pre in prefixes:
                try:
                    if prefix_contains(pre, ip):
                        origins.setdefault(pre, set()).add(dev)
                except ValueError:
                    pass  # malformed prefix length; ignore the entry
    # convert sets to lists
    return {p: sorted(s) for p, s in origins.items()}

def build_adjacency_map(neigh_raw_by_device, ip2device):
    """
    neigh_raw_by_device: device -> raw neighbor output text
    Use neighbor IPs from the raw neighbor output and map them to device names via ip2device
    """
    adj = {}
    for dev, raw in neigh_raw_by_device.items():
        adj.setdefault(dev, set())
        for ip in parse_neighbors(raw):
            owner = ip2device.get(ip)
            if owner and owner != dev:
                adj[dev].add(owner)
                adj.setdefault(owner, set()).add(dev)
    # convert sets to lists
    return {k:list(v) for k,v in adj.items()}

def main(testbed_file, run_id):
    tb = load(testbed_file)
    all_ifaces = {}
    routes_by_device = {}
    neigh_raw_by_device = {}
    # 1) collect outputs
    for name, dev in tb.devices.items():
        outputs = collect_device_outputs(dev, run_id)
        # parse interfaces
        all_ifaces[name] = parse_interfaces(outputs.get('interfaces'))
        # parse routes
        routes_by_device[name] = parse_routes(outputs.get('routes'))
        # neighbor raw (concatenate protocol neighbor outputs)
        neigh_raw = (outputs.get('eigrp_neighbors') or '') + "\n" + (outputs.get('bgp_summary') or '') + "\n" + (outputs.get('rip_status') or '')
        neigh_raw_by_device[name] = neigh_raw
    # 2) build ip->device map
    ip2device = build_ip_owner_map(all_ifaces)
    save_json(run_id, "global", "ip2device", ip2device)
    # 3) build adjacency map
    adjacency_map = build_adjacency_map(neigh_raw_by_device, ip2device)
    save_json(run_id, "global", "adjacency", adjacency_map)
    # 4) detect mutual next-hop anomalies
    mutuals = detect_mutual_next_hop(routes_by_device, ip2device)
    # 5) detect missing propagation (heuristic)
    origins = find_prefix_origins(all_ifaces, routes_by_device)
    missing = detect_missing_propagation(origins, routes_by_device, adjacency_map)
    # 6) assemble report
    report = {
        "run_id": run_id,
        "collected_at": ts(),
        "devices": list(tb.devices.keys()),
        "ip2device": ip2device,
        "adjacency": adjacency_map,
        "mutual_next_hop_anomalies": mutuals,
        "missing_propagation": missing,
        "summary": {
            "total_devices": len(tb.devices),
            "mutual_issues": len(mutuals),
            "missing_propagations": len(missing)
        }
    }
    save_json(run_id, "global", "splithorizon_report", report)
    print(f"[{ts()}] Report saved under results/{run_id}/global/splithorizon_report.json")
    print(json.dumps(report['summary'], indent=2))
    return report

if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("--testbed", required=True)
    ap.add_argument("--run-id", required=True)
    args = ap.parse_args()
    main(args.testbed, args.run_id)

How to run (example):

python detect_splithorizon.py --testbed testbed.yml --run-id run001

Explanation by Line

I’ll walk through the important parts so you and your students understand decisions and limitations.

Top-level structure

  • collect_device_outputs() — runs a set of read-only commands and saves raw output (audit trail). The script stores raw outputs so you can always re-parse or hand them to engineers for investigation.

Parsing interface addresses

  • parse_interfaces() uses show ip interface brief lines and a regex to map IP → interface for each device. This mapping is crucial to resolve next-hop IP addresses to actual devices, so we know which device “owns” a next hop.

Parsing routes

  • parse_routes() uses a conservative regex to find route lines that start with a single route code (like O, B, D, R) and to capture the first next-hop IP after via <ip>. This covers the common single-line entries in IOS/IOS-XE/IOS-XR style RIB dumps, but it is not a full route parser: lines with compound codes (O E2, D EX, S*) and wrapped multi-line entries are skipped, and only one next hop per prefix is kept. That is enough for the detection logic here. Where a Genie parser exists for the platform, replace parse_routes() with device.parse('show ip route') for more accuracy.
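
If you do switch to Genie, the parsed result is a nested dictionary rather than a list of lines. The sketch below flattens it into the same record shape that parse_routes() returns. It assumes the dictionary is nested as vrf → address_family → routes → next_hop → next_hop_list, which is how we understand the IOS-XE show ip route parser output; check the parser schema for your platform before relying on the key names. routes_from_genie is a hypothetical helper, and the sample data is hand-written, not captured from a device.

```python
def routes_from_genie(parsed):
    """Flatten a Genie-style 'show ip route' dict into parse_routes()-style records."""
    routes = []
    for vrf_name, vrf in parsed.get("vrf", {}).items():
        for af in vrf.get("address_family", {}).values():
            for prefix, entry in af.get("routes", {}).items():
                code = entry.get("source_protocol_codes")
                hops = entry.get("next_hop", {}).get("next_hop_list", {})
                if not hops:
                    # connected/local routes carry only an outgoing interface
                    routes.append({"prefix": prefix, "code": code,
                                   "next_hop": None, "vrf": vrf_name})
                for hop in hops.values():
                    # one record per next hop, so ECMP paths are preserved
                    routes.append({"prefix": prefix, "code": code,
                                   "next_hop": hop.get("next_hop"), "vrf": vrf_name})
    return routes

# Hand-written sample in the assumed shape.
sample = {
    "vrf": {"default": {"address_family": {"ipv4": {"routes": {
        "172.16.10.0/24": {
            "source_protocol_codes": "O",
            "next_hop": {"next_hop_list": {1: {"index": 1, "next_hop": "10.0.1.2"}}},
        },
        "10.0.1.0/24": {
            "source_protocol_codes": "C",
            "next_hop": {"outgoing_interface": {
                "GigabitEthernet0/0": {"outgoing_interface": "GigabitEthernet0/0"}}},
        },
    }}}}}
}

for record in routes_from_genie(sample):
    print(record)
```

Because this version emits one record per next hop, the downstream prefix map would need to hold a list of next hops per device instead of a single value.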

Building ip→device map

  • build_ip_owner_map() simply walks all interface IPs discovered and maps them to device names. For multi-homed IPs or NATs this can be ambiguous; first match wins. In practice, management and interface IPs in lab environments are unique.
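
Before trusting the map, it can help to list the addresses that more than one device claims. The following is a minimal sketch with a hypothetical helper name and made-up lab addresses; it takes the same device → {ip → info} structure the script builds.

```python
from collections import defaultdict

def find_ambiguous_ips(all_ifaces):
    """Return {ip: [devices]} for addresses that appear on more than one device."""
    owners = defaultdict(set)
    for dev, ifs in all_ifaces.items():
        for ip in ifs:
            owners[ip].add(dev)
    return {ip: sorted(devs) for ip, devs in owners.items() if len(devs) > 1}

# Made-up example: an anycast loopback configured on both PEs.
all_ifaces = {
    "PE_A": {"10.0.1.1": {}, "10.255.0.1": {}},
    "PE_B": {"10.0.1.2": {}, "10.255.0.1": {}},
}
print(find_ambiguous_ips(all_ifaces))  # {'10.255.0.1': ['PE_A', 'PE_B']}
```

Any address in this result should be excluded from next-hop resolution or resolved by hand, since the owner map can only name one device.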

Adjacency map

  • build_adjacency_map() scans neighbor outputs (EIGRP/BGP/RIP) for IPs and resolves them to device names using the ip→device map. This forms a lightweight graph of immediate neighbors — used to decide “which neighbors should have seen an advertisement”.

Mutual next-hop detection (detect_mutual_next_hop)

  • This is the core heuristic for finding split-horizon-like anomalies: if Device A has prefix P with next-hop an IP owned by B, and B has P with next-hop an IP owned by A, then each appears to be depending on the other — symptomatic of misadvertisement or loops.
  • This is not absolute proof of split-horizon being disabled, but it is a strong signal worth investigating.
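
Here is the same heuristic reduced to a few lines and run on two small route tables. The data is hypothetical: we assume 10.0.2.1 is an address on PE_A and 10.0.1.2 is an address on PE_B. mutual_pairs is an illustrative helper, not the function used in the script.

```python
ip2device = {"10.0.2.1": "PE_A", "10.0.1.2": "PE_B"}

# device -> {prefix -> next hop}
next_hops = {
    "PE_A": {"172.16.10.0/24": "10.0.1.2", "203.0.113.0/24": "192.0.2.2"},
    "PE_B": {"172.16.10.0/24": "10.0.2.1", "203.0.113.0/24": "192.0.2.1"},
}

def mutual_pairs(next_hops, ip2device):
    """Return sorted (prefix, dev_a, dev_b) tuples where each device routes via the other."""
    found = set()
    for a, table in next_hops.items():
        for prefix, nh in table.items():
            b = ip2device.get(nh)
            if b is None or b == a:
                continue  # next hop is outside the testbed or points at ourselves
            back = next_hops.get(b, {}).get(prefix)
            if ip2device.get(back) == a:
                # sort the pair so A/B and B/A collapse into one finding
                found.add((prefix,) + tuple(sorted((a, b))))
    return sorted(found)

print(mutual_pairs(next_hops, ip2device))
# [('172.16.10.0/24', 'PE_A', 'PE_B')]
```

The 203.0.113.0/24 entries are not flagged because their next hops do not resolve to any collected device.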

Missing propagation detection

  • We heuristically find prefix origins by checking if any interface IP sits within a prefix. If device X is an origin for prefix P, we expect its neighbors (from adjacency map) to see P. If a neighbor lacks P, it’s either a valid policy (intended) or a propagation problem. The script flags it so an operator can investigate.
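
The containment test and the neighbor check can also be written with the standard ipaddress module, which avoids hand-rolled bit masks. This is a sketch under the same assumptions as the script (global table, IPv4 only); the helper names and the sample adjacency are hypothetical.

```python
import ipaddress

def is_origin(prefix, interface_ips):
    """True if any interface address falls inside prefix (the default route never counts)."""
    net = ipaddress.ip_network(prefix, strict=False)
    if net.prefixlen == 0:
        return False
    return any(ipaddress.ip_address(ip) in net for ip in interface_ips)

def missing_on_neighbors(prefix, origin, adjacency, prefixes_by_device):
    """List the origin's neighbors that do not hold prefix in their table."""
    return [n for n in adjacency.get(origin, [])
            if prefix not in prefixes_by_device.get(n, set())]

# Hypothetical data: C2 originates 10.10.5.0/24 and peers with both PEs.
adjacency = {"C2": ["PE_A", "PE_B"]}
prefixes_by_device = {"PE_A": {"172.16.10.0/24"}, "PE_B": {"10.10.5.0/24"}}

print(is_origin("10.10.5.0/24", ["10.10.5.1"]))  # True
print(missing_on_neighbors("10.10.5.0/24", "C2", adjacency, prefixes_by_device))  # ['PE_A']
```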

Limitations & false positives

  • Not all missing advertisements are bugs — policy filters, route-maps, VRFs, and distribute-lists intentionally limit propagation. Always cross-check suspected anomalies with intended policies. The script provides raw evidence (saved CLI) to do that.
  • Multi-area or multi-VRF networks require extending the script to be VRF-aware (show ip route vrf <vrf>).
  • BGP route reflection complexities (communities, localpref, suppress-maps) require protocol-specific logic for accurate detection. We provide a practical starting point.

testbed.yml Example

A minimal testbed for the lab. Put this as testbed.yml and update credentials/IPs for your lab.

testbed:
  name: splithorizon_lab
  credentials:
    default:
      username: netops
      password: NetOps!23
devices:
  PE_A:
    os: iosxe
    type: router
    connections:
      cli:
        protocol: ssh
        ip: 10.0.100.11
  PE_B:
    os: iosxr
    type: router
    connections:
      cli:
        protocol: ssh
        ip: 10.0.100.12
  C1:
    os: iosxe
    type: router
    connections:
      cli:
        protocol: ssh
        ip: 10.0.100.21
  C2:
    os: iosxe
    type: router
    connections:
      cli:
        protocol: ssh
        ip: 10.0.100.22

Notes:

  • Use separate management IPs accessible to the automation host.
  • Add custom testbed fields describing role/site to enrich the final report if desired.

Post-validation CLI (Real expected output)

Below are textual screenshots (fixed-width) that you can paste into a blog or teaching slides. These are realistic outputs and what the script expects to parse.

A — show ip interface brief example

PE_A# show ip interface brief
Interface              IP-Address      OK? Method Status                Protocol
GigabitEthernet0/0     10.0.1.1        YES manual up                    up
Loopback0              10.10.0.1       YES manual up                    up
GigabitEthernet0/1     192.168.1.1     YES manual up                    up

B — show ip route snippet (mutual next-hop suspicious case)

PE_A# show ip route
O    172.16.10.0/24 [110/2] via 10.0.1.2, 00:02:10, GigabitEthernet0/0
B    203.0.113.0/24 [20/0] via 192.0.2.2, 00:01:23
PE_B# show ip route
O    172.16.10.0/24 [110/2] via 10.0.2.1, 00:02:11, GigabitEthernet0/0
B    203.0.113.0/24 [20/0] via 192.0.2.1, 00:02:00

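Interpretation: PE_A reports 172.16.10.0/24 via next-hop 10.0.1.2 (likely owned by PE_B), and PE_B reports the same prefix via 10.0.2.1. If 10.0.2.1 resolves to a PE_A interface (it is not among the three interfaces shown in example A, so assume it sits on another PE_A link), the two routers point at each other for the prefix: a mutual next-hop.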

C — show ip eigrp neighbors example

C1# show ip eigrp neighbors
H   Address        Interface       Hold Uptime    SRTT   RTO   Q   Seq
0   10.0.1.1       Gi0/0           11   00:02:30  20     500   0   12

D — Example saved report excerpt (results/run001/global/splithorizon_report.json)

{
  "run_id": "run001",
  "collected_at": "2025-08-28T12:00:00Z",
  "mutual_next_hop_anomalies": [
    {
      "prefix": "172.16.10.0/24",
      "device_a": "PE_A",
      "device_b": "PE_B",
      "a_next_hop": "10.0.1.2",
      "b_next_hop": "10.0.2.1",
      "description": "Mutual next-hop detected (possible advertisement loop)"
    }
  ],
  "missing_propagation": [
    {
      "prefix": "10.10.5.0/24",
      "origin": "C2",
      "neighbor": "PE_A",
      "description": "Neighbor missing prefix (expected to receive advertisement)"
    }
  ]
}

These artifacts are what you present to NOC teams or attach to change tickets.


FAQs

Q1 — What exactly is split-horizon and why does it matter here?

A: Split-horizon (in distance-vector protocols) prevents a router from advertising a route back out the interface it learned that route from. This avoids two-router loops. If split-horizon is disabled or misapplied, routes may be re-advertised back, causing mutual dependencies and potential loops. Our script identifies patterns that look like mutual advertisement (device A routes via B while B routes via A), which is a strong red flag.


Q2 — How does the script avoid false positives when policy filters intentionally block propagation?

A: It doesn't, by design. The output flags anomalies; you must correlate them with configuration (ACLs, distribute-lists, route-maps). The script saves raw show ip route and show logging output as evidence, but it does not collect show running-config; add that command to the collector if you want configuration evidence alongside each finding. To reduce noise, add a configuration rule check step: ignore missing propagation when a policy explicitly prevents propagation.
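
A first version of that rule check could simply look for filter statements in the saved configuration. The sketch below is deliberately coarse: it only detects that a device has some distribute-list or per-neighbor BGP filter, not whether that filter matches the missing prefix. It assumes you have extended the collector to save show running-config; policy_lines and split_by_policy are hypothetical names.

```python
import re

# Config lines that usually indicate intentional route filtering (IOS-style syntax).
POLICY_RE = re.compile(
    r"^[ \t]*(?:distribute-list\b|"
    r"neighbor \S+ (?:route-map|prefix-list|distribute-list|filter-list)\b).*$",
    re.MULTILINE,
)

def policy_lines(running_config):
    """Return config lines that may explain a missing advertisement."""
    return [m.group(0).strip() for m in POLICY_RE.finditer(running_config or "")]

def split_by_policy(missing, config_by_device):
    """Split missing-propagation findings into (unexplained, explained)."""
    unexplained, explained = [], []
    for finding in missing:
        lines = policy_lines(config_by_device.get(finding["origin"], ""))
        lines += policy_lines(config_by_device.get(finding["neighbor"], ""))
        if lines:
            explained.append(dict(finding, policy_lines=lines))
        else:
            unexplained.append(finding)
    return unexplained, explained

# Made-up configuration and findings for illustration.
config_by_device = {
    "C2": "router eigrp 100\n"
          " distribute-list prefix BLOCK-LAN out GigabitEthernet0/0\n"
          " network 10.0.0.0\n",
}
missing = [
    {"prefix": "10.10.5.0/24", "origin": "C2", "neighbor": "PE_A"},
    {"prefix": "10.20.0.0/24", "origin": "C1", "neighbor": "PE_A"},
]
unexplained, explained = split_by_policy(missing, config_by_device)
print(len(unexplained), len(explained))  # 1 1
```

Treat the "explained" list as lower priority rather than closed: a filter on the device does not prove that this particular prefix was meant to be blocked.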


Q3 — Can this detect issues in BGP route reflection setups?

A: Yes, indirectly. For iBGP with route reflectors, the symptom is missing prefixes on clients. The script flags “missing propagation” when a prefix originates on a device but a neighbor reachable via the RR doesn’t see it. For precise RR behavior you should enrich the script to parse show ip bgp neighbors and community/localpref data and verify RR client lists.


Q4 — How will this scale to hundreds of devices?

A: The single-threaded script is a starting point. For scale:

  • Run collectors concurrently (ThreadPoolExecutor or pyATS test runner with concurrency).
  • Use streaming telemetry (gNMI/telemetry) instead of CLI polling where available.
  • Centralize processing (collect raw outputs to an object store and run analysis jobs on a server cluster).
  • Limit parsing to prefixes of interest (critical services) rather than the full Internet routing table.
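
For the first option, a thread pool around the per-device collector is usually enough. The sketch below uses a stand-in collector so it runs without a testbed; in the real job you would pass a function that calls collect_device_outputs(dev, run_id). collect_all and fake_collect are hypothetical names, and the sketch assumes each thread only touches its own Device object.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def collect_all(devices, collect_fn, max_workers=10):
    """Run collect_fn(name, device) for every device in parallel threads.

    Returns (results, errors): two dicts keyed by device name.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(collect_fn, name, dev): name
                   for name, dev in devices.items()}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                results[name] = fut.result()
            except Exception as exc:  # one unreachable device must not abort the run
                errors[name] = str(exc)
    return results, errors

# Stand-in collector for demonstration only.
def fake_collect(name, dev):
    if dev is None:
        raise ConnectionError(f"cannot reach {name}")
    return {"routes": f"routes from {name}"}

results, errors = collect_all(
    {"PE_A": object(), "PE_B": object(), "C1": None}, fake_collect, max_workers=3
)
print(sorted(results), errors)  # ['PE_A', 'PE_B'] {'C1': 'cannot reach C1'}
```

Keep max_workers modest at first; the limiting factor is often the number of VTY lines and AAA sessions the devices and TACACS/RADIUS servers will accept, not the automation host.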

Q5 — What about VRFs and multi-tenant networks?

A: Extend the collector to iterate per VRF: run show ip route vrf <vrf>, list each VRF's interfaces (for example show ip vrf interfaces on IOS/IOS-XE, or show ip interface brief vrf <vrf> on NX-OS), and maintain VRF context in the ip→device map. Our heuristic assumes a global table; for VRFs you must namespace prefixes and interface lookups by VRF.
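
One simple way to namespace the lookups is to key the owner map by a (vrf, ip) tuple instead of the bare address. This is a sketch with a hypothetical helper and made-up VRF names; it assumes the collector already produces interface addresses grouped per device and per VRF.

```python
def build_vrf_owner_map(ifaces_by_device_vrf):
    """ifaces_by_device_vrf: device -> vrf -> iterable of IPs. Returns {(vrf, ip): device}."""
    owners = {}
    for dev, vrfs in ifaces_by_device_vrf.items():
        for vrf, ips in vrfs.items():
            for ip in ips:
                owners.setdefault((vrf, ip), dev)  # first match wins, per VRF
    return owners

# Made-up lab data: the same address reused in two customer VRFs.
ifaces = {
    "PE_A": {"CUST_RED": ["10.1.1.1"], "default": ["10.0.1.1"]},
    "PE_B": {"CUST_BLUE": ["10.1.1.1"], "default": ["10.0.1.2"]},
}
owners = build_vrf_owner_map(ifaces)
print(owners[("CUST_RED", "10.1.1.1")])   # PE_A
print(owners[("CUST_BLUE", "10.1.1.1")])  # PE_B
```

The route records then need a vrf field as well, so that a next hop is always resolved within the VRF of the route that uses it.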


Q6 — How should operators remediate a mutual next-hop anomaly?

A: Typical steps:

  1. Confirm the anomaly in the saved raw outputs.
  2. Inspect where the route originated (check show ip route on origin) and whether policies intentionally filter.
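  3. If unintended: check the interfaces for no ip split-horizon (RIP) or no ip split-horizon eigrp <as>, and check for manual route redistribution rules; remove whichever misconfiguration you find.
  4. Use clear ip route <prefix> and clear ip bgp cautiously if needed (prefer a soft reset, such as clear ip bgp <neighbor> soft, over a hard session reset).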
  5. After remediation, re-run the script to verify the anomaly cleared.

Q7 — Can we get a confidence score for each finding?

A: Yes — implement scoring by combining signals:

  • Mutual next-hop = high severity.
  • Missing propagation with no policy that explains it = medium.
  • Missing propagation where a distribute-list is found = low.

Add scoring logic by correlating config lines and syslog context.
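
A minimal sketch of such scoring follows. The weights and thresholds are illustrative assumptions, not calibrated values, and score_finding is a hypothetical helper; policy_found and syslog_hits are inputs you would derive from the saved config and logs.

```python
def score_finding(kind, policy_found=False, syslog_hits=0):
    """Return (severity, score) for one finding; the weights are illustrative only."""
    if kind == "mutual_next_hop":
        base = 80
    elif kind == "missing_propagation":
        base = 20 if policy_found else 50
    else:
        raise ValueError(f"unknown finding kind: {kind}")
    # Each correlated syslog event adds 5 points, capped at +20.
    score = min(100, base + 5 * min(syslog_hits, 4))
    severity = "high" if score >= 70 else "medium" if score >= 40 else "low"
    return severity, score

print(score_finding("mutual_next_hop"))                         # ('high', 80)
print(score_finding("missing_propagation"))                     # ('medium', 50)
print(score_finding("missing_propagation", policy_found=True))  # ('low', 20)
```

With these weights, syslog evidence can raise a finding by one level, for example a policy-explained gap with four correlated events moves from low to medium.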

Q8 — How do we visualize findings for NOC and change managers?

A: Index the JSON report into Elasticsearch (index splithorizon-*) and build Kibana dashboards:

  • Table: recent runs with anomalies count.
  • Heatmap: devices with most anomalies.
  • Per-prefix detail panels linking to raw CLI snapshots.

Alternatively, generate a simple HTML report from splithorizon_report.json and attach it to change tickets.
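
For the Elasticsearch route, one document per finding tends to be easier to chart than one large report document. The sketch below renders a report as a newline-delimited body for the _bulk API using only the standard library. The daily index naming (splithorizon-YYYY-MM-DD) and the added kind and @timestamp fields are our assumptions, and to_bulk_ndjson is a hypothetical helper; sending the body to your cluster is left out.

```python
import json

def to_bulk_ndjson(report, index_prefix="splithorizon"):
    """Render one report as an Elasticsearch _bulk body: one document per finding."""
    index = f"{index_prefix}-{report['collected_at'][:10]}"  # e.g. splithorizon-2025-08-28
    lines = []
    for kind in ("mutual_next_hop_anomalies", "missing_propagation"):
        for finding in report.get(kind, []):
            doc = {**finding, "kind": kind, "run_id": report["run_id"],
                   "@timestamp": report["collected_at"]}
            lines.append(json.dumps({"index": {"_index": index}}))  # action line
            lines.append(json.dumps(doc))                           # document line
    if not lines:
        return ""
    return "\n".join(lines) + "\n"  # the bulk API expects a trailing newline

report = {
    "run_id": "run001",
    "collected_at": "2025-08-28T12:00:00Z",
    "mutual_next_hop_anomalies": [
        {"prefix": "172.16.10.0/24", "device_a": "PE_A", "device_b": "PE_B"}
    ],
    "missing_propagation": [
        {"prefix": "10.10.5.0/24", "origin": "C2", "neighbor": "PE_A"}
    ],
}
print(to_bulk_ndjson(report), end="")
```

The resulting text can be sent to the cluster's _bulk endpoint with the Content-Type header set to application/x-ndjson.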

YouTube Link

Watch the Complete Python for Network Engineer: Detect split-horizon issues in large networks Using pyATS for Cisco [Python for Network Engineer] Lab Demo & Explanation on our channel:

Master Python Network Automation, Ansible, REST API & Cisco DevNet

Join Our Training

If you want guided, instructor-led, hands-on training to implement, harden, and productionize automation flows like this — including pyATS, Genie parsers, telemetry, CI/CD integration and dashboards — join Trainer Sagar Dhawan’s 3-month instructor-led course: Python, Ansible, API & Cisco DevNet for Network Engineers. The course walks you through building full toolchains, from scripts to enterprise deployment, and will accelerate your path to become a confident Python for Network Engineer.

Learn more and enroll: https://course.networkjourney.com/python-ansible-api-cisco-devnet-for-network-engineers/

Join the program and start automating network reliability with confidence — from split-horizon detection to automated remediation.

Enroll Now & Future‑Proof Your Career
Email: info@networkjourney.com
WhatsApp / Call: +91 97395 21088