[Day #48 PyATS Series] EIGRP Neighbor Health Check (Cisco IOS-XE / IOS-XR) using pyATS for Cisco [Python for Network Engineer]


Introduction — key points (what you’ll learn)

EIGRP neighbor stability is critical for routing convergence and network stability. Small changes in SRTT, increasing retransmissions, or flapping adjacencies can lead to route withdrawal and traffic disruption. In this masterclass you will learn how to:

  • Automate collection of EIGRP neighbor and topology data from Cisco IOS-XE and IOS-XR devices using pyATS / Genie.
  • Parse structured data (or fall back to robust regex parsing) to extract neighbor IP, interface, hold time, uptime, SRTT, RTO, queue, sequence, and route counts.
  • Compute health indicators (SRTT vs threshold, hold time, route count, adjacency flaps, uptime stability).
  • Validate findings via CLI checks and push results to a GUI (Elasticsearch + Kibana or Grafana) for visualization.
  • Build remediation-friendly reports (JSON, CSV) and baseline historical comparisons for detecting flaps over time.

This is a step-by-step teaching lecture — everything from the pyATS script to GUI validation is covered so you can replicate the lab in your environment.


Topology Overview

Minimal multi-device lab for this workbook:

  • Devices run EIGRP in IPv4 under AS 65001.
  • AutomationHost accesses devices over the management network. Kibana/Grafana is used to visualize pyATS results.

Topology & Communications

  • Management network: 10.0.1.0/24 (AutomationHost, Elasticsearch, device mgmt IPs).
  • pyATS connection: SSH to devices; terminal length 0 used to avoid paged output.
  • Data gathered (CLI):
    • show ip eigrp neighbors
    • show ip eigrp topology
    • show ip route eigrp
    • show running-config | section router eigrp (verify config)
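    • IOS-XR uses show eigrp neighbors instead. The script as written issues the IOS-XE commands on every device, so select the command per device.os before pointing it at IOS-XR; the parse attempt and regex fallback stay the same.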
  • GUI: Push parsed JSON to Elasticsearch index eigrp-health-* and visualize via Kibana/Grafana dashboards. This lets network ops quickly see neighbors failing thresholds, flapping, or with low route counts.
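
Because the neighbor command differs by platform, one option is to key it on the testbed's os value. The snippet below is a minimal sketch of that idea, not part of the script that follows: EIGRP_NEIGHBOR_CMDS and neighbor_command are hypothetical names, and only the two commands quoted in this post are listed, so verify them against your software release.

```python
# Hypothetical helper: choose the EIGRP neighbor command from the testbed `os` value.
# Only the two commands named in this post are listed; verify them on your release.
EIGRP_NEIGHBOR_CMDS = {
    "iosxe": "show ip eigrp neighbors",
    "iosxr": "show eigrp neighbors",
}


def neighbor_command(os_name):
    """Return the neighbor command for a platform, defaulting to the IOS-XE form."""
    return EIGRP_NEIGHBOR_CMDS.get(os_name.lower(), EIGRP_NEIGHBOR_CMDS["iosxe"])


print(neighbor_command("iosxr"))
```

Inside a collection function you would call neighbor_command(device.os) and pass the result to device.execute().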

Workflow Script (pyATS)

Save this as eigrp_health.py. It’s a single script with robust parsing, fallbacks, thresholds, persistence for historical comparison, and optional Elasticsearch push.

#!/usr/bin/env python3
"""
eigrp_health.py
Collect EIGRP neighbor data from pyATS testbed devices (IOS-XE / IOS-XR),
calculate health metrics and optionally push results to Elasticsearch.
"""

import json
import os
import re
from datetime import datetime
from pathlib import Path

from genie.testbed import load

# --- Configurable thresholds ---
SRRT_THRESHOLD_MS = 200          # SRTT above this is warning
ROUTE_COUNT_MIN = 1              # minimum routes expected from neighbor
HOLD_TIME_WARNING_SEC = 3        # short hold time indicates instability
HISTORY_FILE = "eigrp_history.json"
ES_PUSH = False                  # set True to push to Elasticsearch
# Index name is an example; it must match the eigrp-health-* pattern used in Kibana
ES_URL = "http://localhost:9200/eigrp-health-lab/_doc/"

# Output dirs
OUT_DIR = Path("results")
OUT_DIR.mkdir(exist_ok=True)

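# Regex to parse typical "show ip eigrp neighbors" rows (IOS-XE style).
# Each row starts with the handle column "H" (0, 1, ...), so it is matched
# optionally and discarded before the neighbor address.
NEIGHBOR_LINE_RE = re.compile(
    r"^\s*(?:\d+\s+)?(?P<ip>\d+\.\d+\.\d+\.\d+)\s+(?P<intf>\S+)\s+(?P<hold>\d+)\s+(?P<uptime>\S+)\s+(?P<srtt>\d+)\s+(?P<rto>\d+)\s+(?P<q>\d+)\s+(?P<seq>\d+)",
    re.IGNORECASE
)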

def load_history():
    if os.path.exists(HISTORY_FILE):
        with open(HISTORY_FILE) as f:
            return json.load(f)
    return {}

def save_history(h):
    with open(HISTORY_FILE, "w") as f:
        json.dump(h, f, indent=2)

def parse_neighbors_raw(output):
    """
    Fallback parser: extract neighbor entries from raw 'show ip eigrp neighbors' text.
    Returns list of dicts: ip, interface, hold, uptime, srtt, rto, q, seq
    """
    neighbors = []
    for line in output.splitlines():
        m = NEIGHBOR_LINE_RE.match(line)
        if m:
            d = m.groupdict()
            # convert numeric fields
            for k in ("hold", "srtt", "rto", "q", "seq"):
                try:
                    d[k] = int(d[k])
                except (TypeError, ValueError):
                    d[k] = None
            neighbors.append(d)
    return neighbors

def parse_topology_count_raw(output):
    """
    Parse 'show ip eigrp topology' to count prefixes learned by EIGRP.
    Counts passive entries, i.e. lines like 'P 10.1.0.0/16 ...'. Requiring a
    digit after 'P' keeps legend text such as 'P - Passive' out of the count.
    """
    count = 0
    for line in output.splitlines():
        if re.match(r"\s*P\s+\d", line):
            count += 1
    return count

def collect_device_data(device):
    """
    Connect to device, run EIGRP commands, parse outputs and return structured dict.
    """
    result = {"device": device.name, "timestamp": datetime.utcnow().isoformat() + "Z"}
    try:
        device.connect(log_stdout=False)
        device.execute("terminal length 0")
        # Neighbors. device.parse() is attempted first, but its nested result is
        # not normalized yet, so the neighbor list always comes from the regex
        # parser on the raw text, whether or not a Genie parser exists.
        try:
            nbrs_parsed = device.parse("show ip eigrp neighbors")
        except Exception:
            nbrs_parsed = None
        raw_nbrs = device.execute("show ip eigrp neighbors")
        neighbors = parse_neighbors_raw(raw_nbrs)

        # Topology count: same pattern, a parse attempt followed by the raw-text heuristic
        try:
            topo_parsed = device.parse("show ip eigrp topology")
        except Exception:
            topo_parsed = None
        raw_topo = device.execute("show ip eigrp topology")
        topology_count = parse_topology_count_raw(raw_topo)

        # Routes via EIGRP in RIB
        try:
            raw_routes = device.execute("show ip route eigrp")
            # count route entries only: lines starting with code 'D' (or 'D EX')
            # followed by a prefix; legend and "is subnetted" lines are skipped
            route_count = sum(
                1 for line in raw_routes.splitlines()
                if re.match(r"\s*D(\s+EX)?\s+\d+\.\d+\.\d+\.\d+", line)
            )
        except Exception:
            route_count = 0

        # Basic config check
        try:
            cfg = device.execute("show running-config | section router eigrp")
        except Exception:
            cfg = ""

        device.disconnect()

        result.update({
            "neighbors": neighbors,
            "topology_count": topology_count,
            "route_count": route_count,
            "config_snippet": cfg
        })
        return result
    except Exception as e:
        # Ensure device disconnect if connection failed halfway
        try:
            device.disconnect()
        except Exception:
            pass
        result["error"] = str(e)
        return result

def uptime_to_seconds(text):
    """
    Convert an EIGRP uptime string to seconds.
    Handles 'hh:mm:ss' (00:12:45) and unit-suffixed forms (1d02h, 2w3d).
    Returns None when the format is not recognized.
    """
    text = text or ""
    m = re.fullmatch(r"(\d+):(\d{2}):(\d{2})", text)
    if m:
        hours, minutes, seconds = (int(x) for x in m.groups())
        return hours * 3600 + minutes * 60 + seconds
    units = {"y": 31536000, "w": 604800, "d": 86400, "h": 3600, "m": 60, "s": 1}
    parts = re.findall(r"(\d+)([ywdhms])", text)
    if parts and "".join(n + u for n, u in parts) == text:
        return sum(int(n) * units[u] for n, u in parts)
    return None

def evaluate_health(device_result, history):
    """
    Compute health metrics for each neighbor and the device overall.
    """
    now = device_result["timestamp"]
    device_name = device_result["device"]
    neighbors = device_result.get("neighbors", [])
    summary = {"device": device_name, "timestamp": now, "neighbors": [], "status": "OK"}

    for nbr in neighbors:
        ip = nbr.get("ip")
        srtt = nbr.get("srtt") or 0
        hold = nbr.get("hold") or 0
        uptime = nbr.get("uptime")
        # route-count per neighbor is not directly available in basic outputs; we use device topology and route_count as proxies
        nbr_status = "OK"
        issues = []

        if srtt and srtt > SRRT_THRESHOLD_MS:
            nbr_status = "WARN"
            issues.append(f"SRTT {srtt}ms > {SRRT_THRESHOLD_MS}ms")
        if hold and hold <= HOLD_TIME_WARNING_SEC:
            nbr_status = "WARN"
            issues.append(f"Hold time low: {hold}s")
        # detect flapping: an adjacency that reset since the last run reports a
        # shorter uptime than the value stored in history. A plain string
        # comparison would fire on every run because uptime always grows.
        prev = history.get(device_name, {}).get(ip)
        if prev and prev.get("uptime") and uptime:
            prev_s = uptime_to_seconds(prev["uptime"])
            cur_s = uptime_to_seconds(uptime)
            if prev_s is not None and cur_s is not None and cur_s < prev_s:
                issues.append(f"Uptime dropped from {prev['uptime']} to {uptime} (possible flap)")
                nbr_status = "WARN"

        if issues:
            summary["status"] = "WARN"
        summary["neighbors"].append({
            "ip": ip,
            "interface": nbr.get("intf"),
            "srtt_ms": srtt,
            "hold_sec": hold,
            "uptime": uptime,
            "status": nbr_status,
            "issues": issues
        })

    # Device-level checks
    if device_result.get("route_count", 0) < ROUTE_COUNT_MIN:
        summary["status"] = "WARN"
        summary.setdefault("device_issues", []).append(f"route_count {device_result.get('route_count')} < {ROUTE_COUNT_MIN}")

    summary["topology_count"] = device_result.get("topology_count", 0)
    summary["route_count"] = device_result.get("route_count", 0)
    return summary

def push_to_es(doc):
    if not ES_PUSH:
        return False
    import requests
    # timeout keeps an unreachable Elasticsearch from stalling the whole run
    r = requests.post(ES_URL, json=doc, timeout=10)
    r.raise_for_status()
    return True

def main():
    testbed = load("testbed.yml")
    devices = list(testbed.devices.values())
    history = load_history()

    all_results = {}
    health_reports = {}
    for dev in devices:
        print(f"[{datetime.utcnow().isoformat()}] Collecting EIGRP data from {dev.name}...")
        dr = collect_device_data(dev)
        all_results[dev.name] = dr
        health = evaluate_health(dr, history)
        health_reports[dev.name] = health

        # update history per neighbor
        history.setdefault(dev.name, {})
        for n in dr.get("neighbors", []):
            history[dev.name][n.get("ip")] = {"uptime": n.get("uptime"), "last_seen": datetime.utcnow().isoformat()}

        # save per device
        fname = OUT_DIR / f"{dev.name}_eigrp_raw.json"
        with open(fname, "w") as f:
            json.dump(dr, f, indent=2)

        # save health
        hfname = OUT_DIR / f"{dev.name}_eigrp_health.json"
        with open(hfname, "w") as f:
            json.dump(health, f, indent=2)

        # optionally push to ES
        try:
            push_to_es(health)
        except Exception as e:
            print("ES push failed:", e)

    # aggregate save
    with open(OUT_DIR / "aggregate_results.json", "w") as f:
        json.dump({"collected": all_results, "health": health_reports}, f, indent=2)

    save_history(history)
    print("Done. Results saved in 'results/' directory.")

if __name__ == "__main__":
    main()

What this script does (summary):

  • Loads devices from testbed.yml.
  • For each device, attempts Genie parsing, then builds the neighbor list and prefix counts from raw execute() output with regex parsing.
  • Extracts neighbor rows into a normalized structure.
  • Reads EIGRP topology and route counts to understand prefix counts.
  • Computes a per-neighbor and per-device health assessment using configurable thresholds.
  • Persists outputs and health JSON to results/ and updates a simple eigrp_history.json to detect flaps across runs.
  • Optional Elasticsearch push (toggle ES_PUSH).

Explanation by Line (annotated deep-dive)

This section unpacks the most important parts of the script so you — the engineer — know why each step exists and how to adapt it.

Thresholds & Persistence

SRRT_THRESHOLD_MS = 200
ROUTE_COUNT_MIN = 1
HOLD_TIME_WARNING_SEC = 3
HISTORY_FILE = "eigrp_history.json"
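  • SRRT_THRESHOLD_MS: an SRTT (Smoothed Round Trip Time) above this value may indicate congestion or link issues. Pick a value based on your site SLA. (The constant is spelled SRRT in the script; the metric is SRTT.)
  • ROUTE_COUNT_MIN: if the device has fewer than this many EIGRP RIB entries, it may not be receiving the expected prefixes.
  • HOLD_TIME_WARNING_SEC: a neighbor whose remaining hold time is at or below this value is flagged. With the default 5-second hellos and 15-second hold time on LAN interfaces, a hold of 3 seconds or less means several consecutive hellos were missed.
  • HISTORY_FILE: used to detect adjacency flaps by comparing uptime values between runs.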

Regex parsing fallback

NEIGHBOR_LINE_RE = re.compile(...)
  • We match typical show ip eigrp neighbors columns: IP, Interface, Hold, Uptime, SRTT, RTO, Q, Seq.
  • This regex is deliberately permissive; test it against your device outputs and modify as necessary.
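
A quick way to test a pattern like this is to run it over a captured output and list the lines it skipped. The sketch below uses an illustrative pattern (ROW_RE, a hypothetical name) that tolerates the leading handle column "H"; the sample text is copied from the expected output later in this post. Adjust both to your own captures.

```python
import re

# Illustrative pattern (an assumption, not necessarily the script's exact regex):
# the leading handle column "H" is optional so rows with or without it match.
ROW_RE = re.compile(
    r"^\s*(?:\d+\s+)?(?P<ip>\d+\.\d+\.\d+\.\d+)\s+(?P<intf>\S+)\s+"
    r"(?P<hold>\d+)\s+(?P<uptime>\S+)\s+(?P<srtt>\d+)\s+(?P<rto>\d+)\s+"
    r"(?P<q>\d+)\s+(?P<seq>\d+)"
)

SAMPLE = """\
EIGRP-IPv4 Neighbors for AS(65001)
H   Address        Interface        Hold Uptime  SRTT   RTO  Q  Seq
0   10.0.1.12      Gi0/0            12   01:22:34  30    200  0  10
"""

lines = [line for line in SAMPLE.splitlines() if line.strip()]
rows = [m.groupdict() for m in map(ROW_RE.match, lines) if m]
unmatched = [line for line in lines if not ROW_RE.match(line)]

print(rows)       # parsed neighbor rows (values are still strings here)
print(unmatched)  # header lines the pattern skipped; review these for missed rows
```

If a real neighbor row shows up in unmatched, the pattern needs widening for that platform or software release.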

collect_device_data()

  • device.connect(log_stdout=False) — connect silently.
  • device.execute("terminal length 0") — avoid --More-- pagination which would corrupt parsing.
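  • The script calls device.parse("show ip eigrp neighbors") first, because Genie returns structured dicts where a parser exists. As written, though, the parsed result is not consumed: the neighbor list always comes from raw device.execute() output run through the regex, whether or not parse succeeds (missing parser, unsupported platform). Normalizing the Genie dict is the natural next step.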
  • Also collect topology and route data to compute coverage.
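
To consume the Genie result instead of the raw text, the nested dict has to be flattened into the same row shape the regex parser returns. The following is a minimal sketch, assuming the nested layout shown in FAQ Q3 of this post (eigrp_instance → vrf → address_family → eigrp_interface → eigrp_nbr); flatten_genie_neighbors is a hypothetical helper name. Check the key names and the SRTT units against the output of your own Genie version before comparing values with a millisecond threshold.

```python
def flatten_genie_neighbors(parsed):
    """Flatten a nested Genie-style EIGRP neighbor dict into a list of rows."""
    rows = []
    for as_num, inst in parsed.get("eigrp_instance", {}).items():
        for vrf_name, vrf in inst.get("vrf", {}).items():
            for af_name, af in vrf.get("address_family", {}).items():
                for intf, intf_data in af.get("eigrp_interface", {}).items():
                    for ip, nbr in intf_data.get("eigrp_nbr", {}).items():
                        rows.append({
                            "as": as_num,
                            "vrf": vrf_name,
                            "af": af_name,
                            "ip": ip,
                            "intf": intf,
                            "uptime": nbr.get("uptime"),
                            "srtt": nbr.get("srtt"),  # units as reported by the parser
                            "q": nbr.get("q_cnt"),
                        })
    return rows


# Sample input mirroring the structure shown in FAQ Q3
sample = {
    "eigrp_instance": {"100": {"vrf": {"default": {"address_family": {"ipv4": {
        "eigrp_interface": {"GigabitEthernet0/0/0": {"eigrp_nbr": {
            "10.1.1.2": {"uptime": "02:35:14", "srtt": 20, "q_cnt": 0}
        }}}
    }}}}}}
}

flat_rows = flatten_genie_neighbors(sample)
print(flat_rows)
```

Using .get() with empty-dict defaults at every level means a device with no EIGRP neighbors yields an empty list rather than a KeyError.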

evaluate_health()

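  • For each neighbor, check SRTT and hold time, and set WARN if a threshold is breached.
  • For flaps: compare each neighbor’s uptime with the value stored from the previous run. Uptime grows between runs on a stable adjacency, so comparing the strings for inequality would flag every neighbor; the meaningful signal is an uptime that is shorter than before, which means converting the strings to seconds first.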
  • Device-level check: low route_count triggers WARN.

Persistence & ES push

  • Results are saved per device and aggregated. This makes it simple to look at time series or manual inspection.
  • Optionally push health docs to an Elasticsearch index to visualize them.
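
The script writes JSON only. If you also want the CSV report mentioned in the introduction, a health document can be flattened to one row per neighbor with the standard csv module. This is a sketch under that assumption; health_to_csv and FIELDS are hypothetical names, and the column order is a choice, not something the script defines.

```python
import csv
import io

# Column order is an illustrative choice
FIELDS = ["device", "timestamp", "ip", "interface", "srtt_ms",
          "hold_sec", "uptime", "status", "issues"]


def health_to_csv(health, stream):
    """Write one CSV row per neighbor from a health report dict."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for nbr in health.get("neighbors", []):
        row = {key: nbr.get(key, "") for key in FIELDS}
        row["device"] = health.get("device", "")
        row["timestamp"] = health.get("timestamp", "")
        row["issues"] = "; ".join(nbr.get("issues", []))
        writer.writerow(row)


# Sample document in the shape the script saves to results/
health = {
    "device": "CORE_RTR_XE",
    "timestamp": "2025-08-28T12:00:00Z",
    "neighbors": [
        {"ip": "10.0.1.12", "interface": "Gi0/0", "srtt_ms": 30,
         "hold_sec": 12, "uptime": "01:22:34", "status": "OK", "issues": []},
    ],
}

buf = io.StringIO()
health_to_csv(health, buf)
csv_lines = buf.getvalue().splitlines()
print(buf.getvalue())
```

To write a file instead, open it with newline="" and pass the file object as stream.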

testbed.yml Example

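Use lab-only credentials in examples like this one. In production, keep secrets out of the file: pyATS testbed YAML accepts %ENV{VAR_NAME} placeholders that are resolved from environment variables at load time, or you can pull credentials from a secrets manager such as Vault.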

testbed:
  name: eigrp_masterclass
  credentials:
    default:
      username: admin
      password: Cisco123!
devices:
  CORE_RTR_XE:
    os: iosxe
    type: router
    connections:
      cli:
        protocol: ssh
        ip: 10.0.1.11
  CORE_RTR_XR:
    os: iosxr
    type: router
    connections:
      cli:
        protocol: ssh
        ip: 10.0.1.12
  DIST_SW1:
    os: iosxe
    type: switch
    connections:
      cli:
        protocol: ssh
        ip: 10.0.1.21

Notes:

  • os key allows pyATS/Genie to choose correct parsers. If you have non-Cisco or custom devices, provide appropriate os values and consider custom parsers.

Post-validation CLI (Real expected output)

Below are illustrative, abbreviated outputs (some header and legend lines are trimmed) of the kind you’ll see on devices, plus sample script output. Save your own captures as screenshots in your workbook when you run them.

A. show ip eigrp neighbors (Cisco IOS-XE)

CORE_RTR_XE# show ip eigrp neighbors
EIGRP-IPv4 Neighbors for AS(65001)
H   Address        Interface        Hold  Uptime    SRTT  RTO  Q  Seq
0   10.0.1.12      Gi0/0            12    01:22:34  30    200  0  10
1   10.0.1.21      Gi0/1            11    00:12:45  25    100  0  23

B. show ip eigrp topology

CORE_RTR_XE# show ip eigrp topology
P 10.10.0.0/16 (0/120) via 10.0.1.12, 00:01:12, Gi0/0
P 192.168.1.0/24 (0/120) via 10.0.1.21, 00:02:10, Gi0/1

C. show ip route eigrp

CORE_RTR_XE# show ip route eigrp
D 10.10.0.0/16 [90/30720] via 10.0.1.12, 00:01:15, GigabitEthernet0/0
D 192.168.1.0/24 [90/30720] via 10.0.1.21, 00:02:12, GigabitEthernet0/1

D. Script (JSON) health snippet produced

{
  "device": "CORE_RTR_XE",
  "timestamp": "2025-08-28T12:00:00Z",
  "neighbors": [
    {
      "ip": "10.0.1.12",
      "interface": "Gi0/0",
      "srtt_ms": 30,
      "hold_sec": 12,
      "uptime": "01:22:34",
      "status": "OK",
      "issues": []
    },
    {
      "ip": "10.0.1.21",
      "interface": "Gi0/1",
      "srtt_ms": 25,
      "hold_sec": 11,
      "uptime": "00:12:45",
      "status": "OK",
      "issues": []
    }
  ],
  "topology_count": 2,
  "route_count": 2,
  "status": "OK"
}

When thresholds are breached (e.g., SRTT > 200ms) you’ll see "status":"WARN" and descriptive issues.
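
To make the WARN case concrete, here is a stripped-down sketch of the two per-neighbor threshold checks applied to a hypothetical neighbor whose values were chosen to breach both limits. check_neighbor is a hypothetical standalone function; it mirrors the thresholds above but omits the history-based flap check.

```python
SRTT_LIMIT_MS = 200   # same value as the script's SRTT threshold
HOLD_LIMIT_SEC = 3    # same value as the script's hold-time threshold


def check_neighbor(nbr):
    """Apply the SRTT and hold-time checks to one neighbor row."""
    issues = []
    if nbr["srtt"] > SRTT_LIMIT_MS:
        issues.append(f"SRTT {nbr['srtt']}ms > {SRTT_LIMIT_MS}ms")
    if 0 < nbr["hold"] <= HOLD_LIMIT_SEC:
        issues.append(f"Hold time low: {nbr['hold']}s")
    return {"ip": nbr["ip"], "status": "WARN" if issues else "OK", "issues": issues}


# Hypothetical values chosen to breach both thresholds
warn_result = check_neighbor({"ip": "10.0.1.21", "srtt": 350, "hold": 2})
print(warn_result)
```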


Appendix — GUI Validation: push results to Elasticsearch + Kibana (step-by-step)

1. Elasticsearch Index mapping (simple):

  • After enabling ES_PUSH = True in the script and setting ES_URL to your ES endpoint, each health dict will be POSTed to ES.
  • Define an index template mapping srtt_ms as number, timestamp as date, status as keyword.
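
The mapping described above can be expressed as the request body for a composable index template (PUT _index_template/<name>, available in Elasticsearch 7.8 and later). The sketch below only builds and prints the body; build_index_template is a hypothetical helper, and the field types for the neighbor sub-fields are assumptions based on the health JSON shown earlier.

```python
import json


def build_index_template(pattern="eigrp-health-*"):
    """Return a composable index template body for the health documents."""
    return {
        "index_patterns": [pattern],
        "template": {
            "mappings": {
                "properties": {
                    "device": {"type": "keyword"},
                    "timestamp": {"type": "date"},
                    "status": {"type": "keyword"},
                    # neighbors is an array of objects; "nested" keeps each
                    # neighbor's fields together when querying
                    "neighbors": {
                        "type": "nested",
                        "properties": {
                            "ip": {"type": "ip"},
                            "interface": {"type": "keyword"},
                            "srtt_ms": {"type": "integer"},
                            "hold_sec": {"type": "integer"},
                            "status": {"type": "keyword"},
                        },
                    },
                }
            }
        },
    }


template_body = build_index_template()
print(json.dumps(template_body, indent=2))
```

Send the printed body with your HTTP client of choice, or paste it into Kibana Dev Tools after a PUT _index_template line with a template name of your choosing.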

2. Kibana Dashboard suggestions:

  • Saved Search: eigrp-health-* index, filter status: WARN.
  • Visualization 1: Metric — count of WARN vs OK.
  • Visualization 2: Time series — avg srtt_ms per neighbor (use nested fields or use neighbor docs as separate docs).
  • Visualization 3: Table — latest neighbor statuses with device, ip, interface, srtt_ms, issues.
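
For the "neighbor docs as separate docs" option, each health document can be split into one flat document per neighbor before it is indexed, which is usually easier to chart than nested fields. A minimal sketch, with neighbor_docs as a hypothetical helper name:

```python
def neighbor_docs(health):
    """Yield one flat document per neighbor, carrying device and timestamp along."""
    for nbr in health.get("neighbors", []):
        doc = {"device": health["device"], "timestamp": health["timestamp"]}
        doc.update(nbr)
        yield doc


# Sample document in the shape the script produces
health = {
    "device": "CORE_RTR_XE",
    "timestamp": "2025-08-28T12:00:00Z",
    "neighbors": [
        {"ip": "10.0.1.12", "interface": "Gi0/0", "srtt_ms": 30, "status": "OK"},
        {"ip": "10.0.1.21", "interface": "Gi0/1", "srtt_ms": 25, "status": "OK"},
    ],
}

docs = list(neighbor_docs(health))
for doc in docs:
    print(doc)
```

Each resulting dict can be posted on its own, so a Kibana time series can average srtt_ms and split by ip without nested aggregations.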

3. Example simple Elasticsearch query to retrieve last 15 minutes for device:

GET /eigrp-health-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "device": "CORE_RTR_XE" } },
        { "range": { "timestamp": { "gte": "now-15m" } } }
      ]
    }
  }
}

Final instructor note (safety & best practices)

  • Do not run debug commands in production unless absolutely required and with maintenance windows. Prefer passively collecting show ip eigrp neighbors and syslogs.
  • Tune thresholds for your network — wide area links and satellite links will have higher expected SRTT.
  • Secure automation credentials — do not hardcode admin passwords in testbed.yml in production; use Vault, environment variables, or pyATS credential stores.
  • Test in lab first, then stage, then production.

FAQs – EIGRP Neighbor Health Check with pyATS

Q1. Why do we need to validate EIGRP neighbor health using pyATS when the CLI already provides the data?

Answer:
While show ip eigrp neighbors or show eigrp neighbors (IOS-XR) gives the information, it is manual, error-prone, and not scalable when you’re managing hundreds of routers. With pyATS, you:

  • Automate parsing of neighbor states.
  • Compare expected neighbors with actual neighbors in seconds.
  • Get structured JSON outputs instead of raw text.
  • Run validations across multi-device topologies with one command.

For large-scale production networks, relying on manual CLI checks alone does not scale. Automation removes the copy-and-compare steps where human error usually creeps in.
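
Comparing expected and actual neighbors comes down to two set differences. The sketch below assumes you keep a list of expected neighbor IPs per device somewhere (inventory, source of truth); compare_neighbors and the addresses used are illustrative.

```python
def compare_neighbors(expected, actual):
    """Return neighbors that are expected but absent, and present but unexpected."""
    expected, actual = set(expected), set(actual)
    return {
        "missing": sorted(expected - actual),
        "unexpected": sorted(actual - expected),
    }


# Illustrative addresses: one expected neighbor is absent, one unknown peer is present
diff = compare_neighbors(
    expected=["10.0.1.12", "10.0.1.21"],
    actual=["10.0.1.12", "10.0.1.99"],
)
print(diff)
```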

Q2. What does a healthy EIGRP neighbor relationship look like in both IOS-XE and IOS-XR?

Answer:
A healthy EIGRP neighbor has the following indicators:

  • State: The neighbor is listed with an incrementing uptime. Both platforms show adjacencies in a table, so a missing row means the adjacency is down.
  • Hold Time: Continuously refreshing, not dropping to 0.
  • SRTT (Smoothed Round-Trip Time): Small, stable values (e.g., 10–30ms in LAN).
  • Queue Count: Should stay at 0. If increasing, packets are being delayed.

Example IOS-XE healthy neighbor:

0   10.1.1.2  Gi0/0/0  12  02:35:14  20  200  0  54

Here, uptime is steady, SRTT is low, and Q Count = 0.


Q3. How does pyATS parse EIGRP neighbor data differently for IOS-XE vs IOS-XR?

Answer:
Cisco platforms output EIGRP data differently:

  • IOS-XE: Uses show ip eigrp neighbors with a columnar table.
  • IOS-XR: Uses show eigrp neighbors; the table is similar, but the header line and interface naming differ.

Where a Genie parser exists for the command and platform, pyATS returns a nested Python dictionary (JSON-like), so downstream code does not depend on the column layout.
For example:

{
  "eigrp_instance": {
    "100": {
      "vrf": {
        "default": {
          "address_family": {
            "ipv4": {
              "eigrp_interface": {
                "GigabitEthernet0/0/0": {
                  "eigrp_nbr": {
                    "10.1.1.2": {
                      "uptime": "02:35:14",
                      "srtt": 20,
                      "q_cnt": 0
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

This means engineers don’t have to write a regex for each platform that Genie already covers; for commands without a parser, a regex fallback like the one in this script is still needed.


Q4. What are common reasons for EIGRP neighbors going down, and how can pyATS help detect them?

Answer:
Neighbors may go down due to:

  1. Interface issues: Link flap or shutdown.
  2. K-values mismatch: EIGRP metric mismatch prevents adjacency.
  3. Authentication failure: Key mismatch on either side.
  4. Access-lists or firewalls: Blocking multicast/hello packets.
  5. MTU mismatches.

pyATS can detect these by:

  • Capturing interface status (via show ip interface brief).
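  • Parsing EIGRP adjacency-change syslog messages (%DUAL-5-NBRCHANGE in show logging), which state the reason for the drop without enabling debugs.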
  • Cross-checking expected neighbor count vs actual.

With automation, you catch root causes faster instead of just seeing “neighbor down.”
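
One passive source for the reason behind a drop is the %DUAL-5-NBRCHANGE syslog message that IOS logs on every adjacency change. The sketch below parses one such line with a regex; parse_nbrchange is a hypothetical helper and the sample line is illustrative, so confirm the exact message text on your platform.

```python
import re

# Matches the adjacency-change syslog message; the process label before the
# AS number (for example "EIGRP-IPv4") is matched loosely as one token.
NBRCHANGE_RE = re.compile(
    r"%DUAL-5-NBRCHANGE: \S+ (?P<asn>\d+): Neighbor (?P<ip>\S+) "
    r"\((?P<intf>[^)]+)\) is (?P<state>up|down): (?P<reason>.+)$"
)


def parse_nbrchange(line):
    """Return the fields of an NBRCHANGE message as a dict, or None if no match."""
    m = NBRCHANGE_RE.search(line)
    return m.groupdict() if m else None


# Illustrative log line, not captured from a real device
log_line = ("%DUAL-5-NBRCHANGE: EIGRP-IPv4 65001: Neighbor 10.0.1.12 "
            "(GigabitEthernet0/0) is down: holding time expired")
event = parse_nbrchange(log_line)
print(event)
```

Grouping these events by neighbor and reason points at the likely cause (expired hold timer, interface down, and so on) without running debugs.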


Q5. How can we validate EIGRP neighbors with both CLI and GUI using pyATS?

Answer:

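  • CLI Validation: Run python eigrp_health.py and read the OK or WARN status per device in the JSON under results/. If you wrap the checks in an AEtest script with a job file (for example eigrp_health_job.py, not shown in this post), pyats run job eigrp_health_job.py reports PASSED or FAILED per test in the console.
  • GUI Validation: Push the health documents to Elasticsearch and review them in Kibana or Grafana. For job runs, pyats logs view opens the archived results and logs in a browser.

For network engineers, this dual validation provides both:

  • Quick checks (CLI).
  • Executive-friendly reports (GUI).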

Q6. Can pyATS help with EIGRP neighbor performance checks, not just up/down states?

Answer:
Yes. Beyond adjacency status, pyATS can validate:

  • SRTT thresholds (alert if >50ms).
  • Queue count growth (troubleshooting CPU or congestion).
  • Neighbor uptime consistency (to detect flapping).

For example, you can write assertions like:

assert neighbor['srtt'] < 50, f"High latency on neighbor {nbr_ip}"

This turns pyATS into a proactive monitoring tool, not just a connectivity checker.
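
A bare assert stops at the first failing neighbor. When you want the full list in one pass, collect the violations instead and fail once at the end. A minimal sketch, assuming a dict of neighbors keyed by IP with srtt and q_cnt fields as in the Genie example above; check_performance and the sample values are hypothetical.

```python
def check_performance(neighbors, srtt_max=50, q_max=0):
    """Return a list of human-readable violations across all neighbors."""
    failures = []
    for ip, nbr in neighbors.items():
        if nbr["srtt"] >= srtt_max:
            failures.append(f"High latency on neighbor {ip}: srtt={nbr['srtt']}")
        if nbr["q_cnt"] > q_max:
            failures.append(f"Queue building on neighbor {ip}: q_cnt={nbr['q_cnt']}")
    return failures


# Hypothetical values: the second neighbor breaches both limits
perf_failures = check_performance({
    "10.1.1.2": {"srtt": 20, "q_cnt": 0},
    "10.1.1.6": {"srtt": 80, "q_cnt": 3},
})
print(perf_failures)
```

In a test section you would then assert that the list is empty and use the joined messages as the failure reason.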


Q7. What is the advantage of using pyATS over SNMP/EEM for EIGRP neighbor monitoring?

Answer:

  • SNMP: Provides counters but may lack real-time granularity.
  • EEM (Embedded Event Manager): Reactive, device-local only.
  • pyATS:
    • Vendor-agnostic.
    • Centralized automation across hundreds of routers.
    • Converts CLI to structured data.
    • Integrates with the rest of a Python automation toolchain (Ansible, REST APIs, dashboards).

Thus, pyATS scales and adapts more easily than device-local or counter-only methods.

Q8. Can pyATS EIGRP health checks be extended for multi-vendor routing (like OSPF, BGP, IS-IS)?

Answer:
Absolutely. While EIGRP is Cisco-proprietary, the same pyATS framework works for:

  • OSPF neighbors (show ip ospf neighbor).
  • BGP peers (show bgp summary).
  • IS-IS adjacencies.

The methodology (parse → compare → validate) remains identical. This makes pyATS an investment in automation skills — not limited to one protocol.


YouTube Link

Watch the Complete Python for Network Engineer: EIGRP neighbor health check (Cisco IOS-XE/XR) Using pyATS for Cisco [Python for Network Engineer] Lab Demo & Explanation on our channel:

Master Python Network Automation, Ansible, REST API & Cisco DevNet

Join Our Training

If you want to go deeper — building visual dashboards, automating remediation, or integrating EIGRP validation into CI/CD pipelines — Trainer Sagar Dhawan runs a 3-month instructor-led program covering Python, Ansible, APIs, and Cisco DevNet for Network Engineers. The course walks you through real-world automation projects (like this EIGRP health checker), best practices, and career-facing skills.

Enroll / learn more:
https://course.networkjourney.com/python-ansible-api-cisco-devnet-for-network-engineers/

This course is the fastest way to become a confident Python for Network Engineer practitioner and lead automation initiatives in your network team.

Enroll Now & Future‑Proof Your Career
Email: info@networkjourney.com
WhatsApp / Call: +91 97395 21088