Why These 10?
The network stack grows more complex every year: hybrid cloud, SD‑WAN, SASE, microservices, zero trust, and high‑frequency telemetry. Instead of a long catalog, this list focuses on tools that are battle‑tested, actively maintained, and composable—they plug cleanly into automation, CI/CD, and observability pipelines. Where it helps, I’ve also suggested free/open‑source alternatives and commercial counterparts you’ll encounter in the wild.

1) Wireshark / TShark (Packet Capture & Protocol Analysis)
What it is: Wireshark is the de facto standard GUI packet analyzer; TShark is the CLI sibling for headless/automated use.
Why it matters in 2025: Encrypted transports (TLS 1.3, QUIC) and overlay networks (VXLAN, Geneve) make issues hard to see. Wireshark’s modern dissectors, TLS session key logging, and QUIC/HTTP/3 visibility turn guesswork into evidence.
Real‑world use cases
- TLS handshake failures between apps and load balancers; extract SNI/ALPN and verify cipher negotiation.
- “Slow app” complaints; correlate TCP retransmissions, out‑of‑order packets, and window scaling with server CPU spikes.
- Overlay visibility; decode VXLAN to confirm VNI, underlay MTU, and ECMP hashing.
Starter commands
# Capture to a ring buffer on a Linux jump host (20 files of ~100 MB each; filesize is in kB)
sudo tshark -i eth0 -b filesize:100000 -b files:20 -w /tmp/cap.pcap
# Capture filter: only SYN and SYN-ACK packets (connection setup)
sudo tshark -i eth0 -f "tcp[tcpflags] & tcp-syn != 0"
Pros
- Best‑in‑class dissectors and readability; reproducible evidence for RCAs.
- Runs on Linux, macOS, and Windows; integrates with TraceWrangler, Brim (now Zui)/Zeek, and CI.
Cons
- Heavy captures can overwhelm disks; must filter aggressively.
- Encrypted payloads require keys and careful privacy handling.
Best practices
- Use capture filters (BPF) at the source; avoid “capture everything, filter later.”
- Log TLS secrets (where policy allows): SSLKEYLOGFILE=/tmp/keys.log.
- For remote sites, deploy SPAN/ERSPAN with timestamping; normalize time via NTP/PTP.
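To turn a capture into something a ticket or CI job can consume, the retransmission check from the use cases above can be scripted. The sketch below assumes you have already exported retransmitted segments with a command along the lines of `tshark -r cap.pcap -Y tcp.analysis.retransmission -T fields -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport` (tab-separated output); the function name and the sample data are illustrative only.

```python
from collections import Counter


def count_retransmissions(tshark_fields_output: str) -> Counter:
    """Count retransmitted segments per (src, sport, dst, dport) flow.

    Input: tab-separated lines with exactly four columns
    (ip.src, tcp.srcport, ip.dst, tcp.dstport).
    """
    flows = Counter()
    for line in tshark_fields_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 4 or not all(parts):
            continue  # skip blank rows and rows with empty columns (e.g. non-IPv4)
        src, sport, dst, dport = parts
        flows[(src, int(sport), dst, int(dport))] += 1
    return flows


if __name__ == "__main__":
    # Made-up sample rows, standing in for real tshark output
    sample = (
        "10.0.0.5\t51512\t203.0.113.10\t443\n"
        "10.0.0.5\t51512\t203.0.113.10\t443\n"
        "10.0.0.6\t40000\t203.0.113.10\t443\n"
    )
    for flow, hits in count_retransmissions(sample).most_common():
        print(flow, hits)
```

Sorting flows by retransmission count is a quick way to see whether a "slow app" complaint maps to one client, one server, or the whole path.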
2) tcpdump (First‑Responder CLI Sniffer)
What it is: Lightweight CLI sniffer using libpcap. Ubiquitous on servers, appliances, and even containers.
Why it matters: When access is limited (appliance shells, minimal OS images), tcpdump is often the only tool you have. It pairs perfectly with TShark/Wireshark for deeper analysis.
Real‑world use cases
- Confirming bidirectional traffic across firewalls and NATs.
- Capturing SYN/SYN‑ACK timing to diagnose asymmetry or policing.
- Sampling packets from a high‑rate interface to understand burst patterns.
Starter commands
# Show SYN and SYN-ACK packets on port 443 with full timestamps
sudo tcpdump -nni eth0 -tttt 'tcp port 443 and (tcp[tcpflags] & tcp-syn != 0)'
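# Capture only the first 128 bytes per packet; rotate every 60 s and stop after 10 files
# (-G needs a strftime pattern in the file name, otherwise each rotation overwrites the last)
sudo tcpdump -i eth0 -s 128 -G 60 -W 10 -w '/tmp/web-%Y%m%d-%H%M%S.pcap' 'host 203.0.113.10'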
Pros
- Tiny, fast, scriptable; available almost everywhere.
Cons
- Raw output can be cryptic; not ideal for long traces.
Best practices
- Rotate captures with -G (time-based) or -C (size-based), and add -W to cap the number of files.
- Always record interface, time, and host clock skew in your ticket notes.
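The SYN/SYN-ACK timing use case can be automated once the capture is printed as text. The following is a rough sketch that assumes IPv4 output from `tcpdump -nn -tttt` (lines such as `... IP a.b.c.d.port > e.f.g.h.port: Flags [S], ...`); the regular expression and function name are my own, and lines with unusual flag combinations (for example ECN-setup SYNs) are simply ignored.

```python
import re
from datetime import datetime

# Matches the leading part of an IPv4 TCP line printed by `tcpdump -nn -tttt`
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) IP "
    r"(?P<src>\S+) > (?P<dst>\S+): Flags \[(?P<flags>[^\]]+)\]"
)


def handshake_delays(tcpdump_text: str) -> dict:
    """Return {(client, server): seconds between first SYN and its SYN-ACK}."""
    pending = {}  # (client, server) -> timestamp of the first SYN seen
    delays = {}
    for line in tcpdump_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
        src, dst, flags = m["src"], m["dst"], m["flags"]
        if flags == "S":
            # Keep the first SYN so retransmitted SYNs show up as a long delay
            pending.setdefault((src, dst), ts)
        elif flags == "S." and (dst, src) in pending:
            delays[(dst, src)] = (ts - pending.pop((dst, src))).total_seconds()
    return delays


if __name__ == "__main__":
    # Made-up sample lines in tcpdump's text format
    sample = (
        "2025-03-01 10:15:30.100000 IP 10.0.0.5.51512 > 203.0.113.10.443: "
        "Flags [S], seq 1, win 64240, length 0\n"
        "2025-03-01 10:15:30.145000 IP 203.0.113.10.443 > 10.0.0.5.51512: "
        "Flags [S.], seq 2, ack 2, win 65160, length 0\n"
    )
    print(handshake_delays(sample))
```

A delay close to 1 s or 3 s usually means the first SYN was dropped and a retry got through, which points at policing or asymmetric paths rather than a slow server.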
3) Nmap (Network Scanning & Service Fingerprinting)
What it is: The standard for host discovery, port scanning, and service/version detection; includes the Nmap Scripting Engine (NSE).
Why it matters: In hybrid estates, you inherit shadow IT, ephemeral hosts, and forgotten services. Nmap quickly maps exposure and verifies firewall intent.
Real‑world use cases
- Quarterly attack surface sweeps; find forgotten management ports.
- Validate firewall rule changes (before/after compare by ASG/VPC/VNET).
- Detect weak ciphers/services with NSE scripts (where permitted).
Starter commands
# Fast TCP SYN scan + service detection (SYN scan needs root)
sudo nmap -sS -sV -T4 10.10.0.0/24
# Scripted TLS scan (example)
nmap --script ssl-enum-ciphers -p 443 edge1.sanchitgurukul.xyz
Pros
- Mature, reliable, huge community and script library.
Cons
- Can trigger IDS/IPS; always get change approval.
Best practices
- Slow the scan down in fragile networks: -T3 is already the default, so use -T2 or cap the rate with --max-rate.
- Export XML/grepable outputs for pipelines; track drifts over time.
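Tracking drift comes down to diffing the open ports of two scans. Below is a minimal sketch that reads Nmap's XML output (`-oX`) with the standard library; it assumes the usual `host` / `address` / `ports` / `port` / `state` layout and takes the first `address` element of each host, which is normally the IP. The helper names are illustrative.

```python
import xml.etree.ElementTree as ET


def open_ports(nmap_xml: str) -> set:
    """Return {(address, protocol, port)} for every port reported as open."""
    found = set()
    root = ET.fromstring(nmap_xml)
    for host in root.iter("host"):
        addr_el = host.find("address")
        if addr_el is None:
            continue
        addr = addr_el.get("addr")
        for port in host.iter("port"):
            state = port.find("state")
            if state is not None and state.get("state") == "open":
                found.add((addr, port.get("protocol"), int(port.get("portid"))))
    return found


def drift(before_xml: str, after_xml: str) -> dict:
    """Compare two scans and list ports that opened or closed in between."""
    before, after = open_ports(before_xml), open_ports(after_xml)
    return {"opened": sorted(after - before), "closed": sorted(before - after)}


# Usage (file names are hypothetical):
#   with open("scan-before.xml") as b, open("scan-after.xml") as a:
#       print(drift(b.read(), a.read()))
```

Run it against the scans taken before and after a firewall change and attach the result to the change record.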
4) iperf3 (Throughput, Jitter, and Loss Testing)
What it is: A client/server tool to measure TCP/UDP bandwidth, jitter, and loss, including parallel streams.
Why it matters: Controlled load tests isolate link, QoS, or policing issues from application problems.
Real‑world use cases
- Pre‑cutover WAN validation (MPLS ↔ DIA/SD‑WAN).
- Confirming QoS classes and policers under contention.
- Measuring inter‑AZ/inter‑region cloud paths.
Starter commands
# Server on port 5201
iperf3 -s
# TCP test, 8 parallel streams for 60s
iperf3 -c 203.0.113.20 -P 8 -t 60
# UDP test at 100 Mbps with jitter report
iperf3 -u -b 100M -c 203.0.113.20 -t 30
Pros
- Predictable, scriptable, cross‑platform.
Cons
- Tests reflect host NIC/CPU tuning; results must be contextualized.
Best practices
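- Pin NIC interrupts and the iperf3 process to specific cores (-A sets CPU affinity); disable power saving on test hosts.
- Use bi-directional tests (--bidir, in recent iperf3 releases) and reverse mode (-R) to check asymmetry.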
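For pipelines, iperf3 can emit JSON with `-J`, which is easier to gate on than the text report. The sketch below pulls out the headline numbers. It assumes the field names I have seen in iperf3 3.x reports (`end.sum_received` for TCP, `end.sum` with `jitter_ms` and `lost_percent` for UDP); check them against the version you run, since the layout has changed between releases.

```python
import json


def summarize_iperf3(report_json: str) -> dict:
    """Extract throughput (and jitter/loss for UDP) from an `iperf3 -J` report."""
    end = json.loads(report_json)["end"]
    udp = end.get("sum", {})
    if "jitter_ms" in udp:  # UDP reports carry jitter and loss in end.sum
        return {
            "protocol": "udp",
            "mbps": udp["bits_per_second"] / 1e6,
            "jitter_ms": udp["jitter_ms"],
            "lost_percent": udp["lost_percent"],
        }
    return {  # TCP: use the receiver's view of throughput
        "protocol": "tcp",
        "mbps": end["sum_received"]["bits_per_second"] / 1e6,
        "retransmits": end.get("sum_sent", {}).get("retransmits"),
    }


if __name__ == "__main__":
    # Trimmed, made-up report containing only the fields used above
    sample = json.dumps(
        {"end": {"sum": {"bits_per_second": 99_500_000.0,
                         "jitter_ms": 0.42, "lost_percent": 0.1}}}
    )
    print(summarize_iperf3(sample))
```

A pre-cutover job could fail the change if `mbps` falls below the circuit's committed rate or `lost_percent` exceeds the class target.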
5) traceroute / mtr (Path Mapping & Loss Localization)
What it is: traceroute maps the hop path; mtr combines ping + traceroute for continuous, per‑hop loss and latency.
Why it matters: Cloud‑age routing adds NAT, tunnels, and load‑balancing. Traceroute/MTR reveals where latency and loss originate.
Real‑world use cases
- Pinpointing where packets disappear in an SD‑WAN or ISP core.
- Verifying symmetric return paths after a new BGP policy.
Starter commands
# TCP traceroute to port 443 (SYN probes often pass filters that drop UDP/ICMP probes)
sudo traceroute -T -p 443 www.example.com
# mtr report mode: 200 cycles, wide hostnames, AS numbers
mtr -rwz -c 200 edge1.sanchitgurukul.xyz
Pros
- Lightweight, universally available; great for triage.
Cons
- ICMP/TCP TTL behaviors vary; load‑balancers can mislead paths.
Best practices
- Try UDP/ICMP/TCP methods and compare.
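- Read per-hop loss carefully: loss at an intermediate hop that does not continue to later hops is usually ICMP rate limiting on that router. Loss that persists to the final hop is what users feel.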
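The rule of thumb about per-hop loss can be made mechanical: only count loss at a hop if every later hop shows at least that much. Here is a small sketch, assuming you have already reduced an mtr report to an ordered list of `(host, loss_percent)` pairs; the host names and the 1% threshold are placeholders.

```python
def first_persistent_loss(hops, threshold=1.0):
    """Return (host, loss) for the first hop whose loss carries to the destination.

    hops: ordered list of (host, loss_percent), first hop first.
    The loss attributed to a hop is the minimum loss seen from that hop
    onward, so a router that merely rate-limits ICMP replies is ignored.
    """
    carried = float("inf")
    effective = []
    for host, loss in reversed(hops):
        carried = min(carried, loss)
        effective.append((host, carried))
    effective.reverse()
    for host, loss in effective:
        if loss >= threshold:
            return host, loss
    return None


if __name__ == "__main__":
    report = [
        ("branch-gw", 0.0),
        ("isp-edge", 30.0),   # high loss here, but it does not continue
        ("isp-core", 0.0),
        ("cdn-edge", 5.0),
        ("destination", 5.0),
    ]
    print(first_persistent_loss(report))
```

In the sample, the 30% at `isp-edge` is discarded because the next hop is clean, and the function points at `cdn-edge` instead.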
6) dig / kdig (DNS Diagnostics)
What it is: dig (BIND) and kdig (Knot) query DNS servers with precision.
Why it matters: DNS is the “phone book” of everything—misconfigs cause widespread brownouts.
Real‑world use cases
- DNSSEC validation failures; inspect RRSIG/DS chains.
- CDN troubleshooting; compare answers by resolver/edns‑client‑subnet.
- Split‑horizon and conditional forwarder issues in hybrid environments.
Starter commands
# Query authoritative with DNSSEC details
dig +dnssec @ns1.sanchitgurukul.xyz www.sanchitgurukul.xyz A
# Trace the delegation path (dig; kdig does not implement +trace)
dig +trace www.sanchitgurukul.xyz
# Compare two resolvers for geo‑bias
dig @1.1.1.1 www.example.com; dig @8.8.8.8 www.example.com
Pros
- Precise, scriptable, shows exactly what the resolver/authority said.
Cons
- Output can be verbose; needs practice to interpret DNSSEC.
Best practices
- Always capture authority and additional sections.
- Use +cd (checking disabled) vs. validated paths to isolate resolver issues.
7) netcat / socat (Swiss‑Army Knife for Sockets)
What it is: Simple tools to read from and write to sockets and to build ad-hoc servers. For TLS endpoints, use socat's OPENSSL address type or pair them with openssl s_client.
Why it matters: When apps fail to connect and logs are thin, netcat proves reachability and data flow.
Real‑world use cases
- Troubleshoot load‑balancer VIPs; test health‑check payloads.
- Validate firewall pinholes during a change window.
- Quick file transfer over restricted networks.
Starter commands
# Listen on 9000 and print received bytes
nc -lv 0.0.0.0 9000
# Send a test request to the LB VIP (HTTP needs CRLF line endings)
printf 'GET /health HTTP/1.1\r\nHost: vip.example.com\r\nConnection: close\r\n\r\n' | nc vip.example.com 80
Pros
- Minimal, ubiquitous, great for creative debugging.
Cons
- No protocol smarts; easy to shoot yourself in the foot.
Best practices
- Pair with packet capture; log timestamps; clean up listeners.
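When netcat is not installed, or you want the same reachability proof inside a script, the check is a few lines with the standard socket module. This is a sketch of a TCP connect test with a timeout; the function name is my own, and it proves only that the three-way handshake completes, not that the application behind the port is healthy.

```python
import socket
import time


def tcp_check(host: str, port: int, timeout: float = 3.0):
    """Try a TCP connect.

    Returns (True, seconds_to_connect) on success,
    or (False, error_text) if the connection fails or times out.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError as exc:  # refused, timed out, unreachable, DNS failure
        return False, str(exc)


# Usage during a change window (host name is a placeholder):
#   ok, detail = tcp_check("vip.example.com", 443)
#   print("open" if ok else f"blocked: {detail}")
```

The error text separates "connection refused" (the host answered and nothing is listening) from a timeout (something on the path is dropping packets), which is often the detail a firewall team needs.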
8) NetBox (Source of Truth for Networks)
What it is: Open‑source DCIM/IPAM platform that models sites, devices, interfaces, IPs, VLANs, circuits, and more. Provides a powerful REST API and webhooks.
Why it matters: Automation breaks without a reliable Source of Truth (SoT). NetBox centralizes inventory and intent so templates, playbooks, and pipelines stay deterministic.
Real‑world use cases
- Auto‑generate configs (LLDP, VLAN trunks, BGP neighbors) from SoT.
- Golden‑config compliance; drift reports for audits.
- Change previews: safe diffs before any device is touched.
Starter ideas
- Populate with device roles, platforms, and interfaces.
- Use the built-in circuits models, add a secrets plugin if you need one (secrets left NetBox core in v3.0), and wire webhooks → GitHub Actions deploy.
Pros
- Extensible, strong API, vibrant community.
Cons
- Requires ownership and data hygiene; initial load can be heavy.
Best practices
- Define data standards (naming, tagging). Enforce via CI (schema checks).
- Make NetBox the single source—no side spreadsheets.
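A schema check in CI can be as simple as validating names against the agreed convention before data is accepted. The sketch below assumes a hypothetical naming standard of `<3-letter site><2-digit index>-<role>-<2-digit unit>` (for example `nyc01-edge-01`) and device records shaped like the entries NetBox's device list API returns (dicts with a `name` key); both the pattern and the role list are placeholders to replace with your own standard.

```python
import re

# Hypothetical convention: site code, role, unit number (e.g. "nyc01-edge-01")
NAME_RE = re.compile(r"[a-z]{3}\d{2}-(edge|core|access)-\d{2}")


def naming_violations(devices):
    """Return the names of devices that do not match the naming convention."""
    return [
        d.get("name")
        for d in devices
        if not NAME_RE.fullmatch(d.get("name") or "")
    ]


if __name__ == "__main__":
    inventory = [
        {"name": "nyc01-edge-01"},
        {"name": "NYC-Edge-Router"},  # wrong case and layout
        {"name": None},               # unnamed device
    ]
    bad = naming_violations(inventory)
    print(bad)
    # In CI, exit non-zero when `bad` is non-empty so the pipeline fails.
```

Failing the pipeline on violations keeps the source of truth clean without relying on reviewers to spot typos.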
9) Ansible (Configuration as Code & Orchestration)
What it is: Agentless automation for push‑based configuration, templating (Jinja2), idempotent tasks, and role reuse.
Why it matters: Change safety and velocity. In 2025, a common network CI/CD pattern pairs a runner such as GitHub Actions with Ansible to render, diff, deploy, and roll back.
Real‑world use cases
- Bulk VLAN changes across access switches during maintenance windows.
- Staging and hitless upgrades on pairs (A/B) with guardrails.
- Drift remediation from a golden config policy.
Starter play snippet
- hosts: edge
  gather_facts: false
  tasks:
    # Render on the control node, not on the network device
    - name: Render interface configs
      ansible.builtin.template:
        src: templates/intf.j2
        dest: "rendered/{{ inventory_hostname }}.cfg"
      delegate_to: localhost

    - name: Push config
      cisco.ios.ios_config:
        src: "rendered/{{ inventory_hostname }}.cfg"
        save_when: changed
Pros
- Huge module ecosystem; human‑readable YAML; good for CAB evidence.
Cons
- Push based; less ideal for closed devices; scaling demands discipline.
Best practices
- Use PR‑based workflows with linting and unit tests.
- Canary by site/role; auto‑rollback on post‑check failure.
Alternatives & complements: Nornir (Pythonic concurrency), Salt, pyATS/Genie (validation), Batfish (pre‑change modeling).
10) Prometheus + Grafana + Alertmanager (Metrics & Alerting Stack)
What it is: Pull‑based time‑series metrics (Prometheus), flexible dashboards (Grafana), and routing policy (Alertmanager).
Why it matters: Streaming telemetry (gNMI/Telegraf), SNMP exporters, and flow metrics can be scraped and visualized at scale with sane cardinality controls.
Real‑world use cases
- WAN health SLOs: per‑class latency/jitter, BGP peers, interface errors.
- Device health and capacity planning with dynamic thresholds.
- NOC dashboards and SRE SLO burn‑rate alerts.
Starter config (Prometheus scrape)
scrape_configs:
  - job_name: edge-routers
    static_configs:
      - targets: ['edge1.sanchitgurukul.xyz:9100', 'edge2.sanchitgurukul.xyz:9100']
Pros
- Open ecosystem, strong exporters, integrates with anything.
Cons
- Requires thoughtful design to avoid high cardinality; HA setup needed.
Best practices
- Central Alertmanager with routing, silences, escalations.
- Use recording rules to pre‑compute SLO‑friendly series.
Putting It Together: A Modern Troubleshooting & Automation Flow
- Detect: Prometheus alert fires for high loss on branch‑east class EF.
- Correlate: Grafana shows mtr-like per-hop loss; NetFlow flags a traffic spike toward a new CDN ASN.
- Verify: mtr + traceroute confirm path change; dig shows different CDN answer via resolver.
- Deep dive: tcpdump + Wireshark validate retransmissions and MSS/MTU issues.
- Remediate: Ansible pushes a temporary policy‑route or adjusts QoS shaping.
- Validate: iperf3 proves EF class meets throughput; pyATS/Genie post‑checks pass.
- Document: GitHub Actions stores diffs, captures, and dashboards as artifacts linked to the CAB.
Real‑World Scenarios (Playbooks You’ll Actually Use)
Scenario A: VoIP Jitter Across SD‑WAN
- Symptoms: MOS < 3.5 in two branches during peak.
- Workflow: Prometheus alert → mtr pinpoints loss at ISP handoff → iperf3 with EF marking (-S 0xB8) shows the EF class is bandwidth-constrained → Ansible raises the EF shaper by 10% and lowers the AF allocation → Post-checks confirm MOS recovery.
Scenario B: DNSSEC Breakage After Zone Roll
- Symptoms: Intermittent resolution failures.
- Workflow: dig +dnssec shows an expired RRSIG → compare answers from public vs. internal resolvers → fix clock skew on the signer → add a CI job to validate DS/RRSIG before publishing.
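The CI job in that workflow can start as a plain validity-window check on the RRSIG records that `dig +dnssec` prints. The sketch below assumes the standard presentation format, where the fields after `RRSIG` are type covered, algorithm, labels, original TTL, expiration, inception, key tag, signer, and signature, with timestamps as `YYYYMMDDHHMMSS` in UTC. The two-day safety margin and the sample record are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def rrsig_window(record: str):
    """Return (inception, expiration) as UTC datetimes from one RRSIG line."""
    fields = record.split()
    i = fields.index("RRSIG")
    # Offsets after "RRSIG": +5 is expiration, +6 is inception
    fmt = "%Y%m%d%H%M%S"
    expiration = datetime.strptime(fields[i + 5], fmt).replace(tzinfo=timezone.utc)
    inception = datetime.strptime(fields[i + 6], fmt).replace(tzinfo=timezone.utc)
    return inception, expiration


def rrsig_ok(record: str, now: datetime, min_remaining=timedelta(days=2)) -> bool:
    """True if the signature is already valid and not about to expire."""
    inception, expiration = rrsig_window(record)
    return inception <= now and expiration - now >= min_remaining


if __name__ == "__main__":
    # Illustrative record; the signature data is a placeholder
    line = ("www.example.com. 300 IN RRSIG A 13 3 300 "
            "20250215000000 20250201000000 12345 example.com. AbCdEf==")
    print(rrsig_ok(line, datetime.now(timezone.utc)))
```

This does not verify the signature cryptographically. It only catches the expired or not-yet-valid window that signer clock skew produces.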
Scenario C: New App Behind LB Times Out Under Load
- Symptoms: 502/504 at peak; app team blames network.
- Workflow: tcpdump shows server advertising tiny receive windows; Wireshark confirms window scaling mismatch; fix sysctl + LB TCP profile; validate with TShark.
Tool Selection Cheat‑Sheet (Pros/Cons Summary)
| Category | Tool(s) | Biggest Strength | Watch‑outs |
|---|---|---|---|
| Packet analysis | Wireshark/TShark, tcpdump | Definitive ground truth | Privacy, storage, encryption keys |
| Scanning | Nmap | Rapid exposure mapping | Can trigger IDS; get approvals |
| Performance | iperf3 | Controlled load generation | Host tuning affects results |
| Path | traceroute/mtr | Hop‑by‑hop loss & latency | Load‑balancers may mislead |
| DNS | dig/kdig | Precise resolver/authority view | DNSSEC complexity |
| Sockets | netcat/socat | Quick connectivity proofs | No protocol validation |
| Source of Truth | NetBox | Automation foundation | Data hygiene required |
| Automation | Ansible | Idempotent changes, CI/CD | Push model scaling |
| Metrics/Alerting | Prometheus/Grafana | Open, flexible | Cardinality/HA |
Implementation Best Practices (2025 Edition)
- Everything as Code: Store NetBox seed data, Ansible inventories, and Prometheus rules in Git. PR reviews catch mistakes early.
- Pre‑Change Modeling: Use Batfish to fail PRs that would leak routes or break ACL intent before any device is touched.
- Validation First: pyATS/Genie jobs run before and after Ansible deploys; block merges if post‑checks degrade.
- Telemetry First: Prefer gNMI/streaming where hardware supports it; fall back to SNMP with sane intervals.
- SLOs over Thresholds: Move from arbitrary CPU>80% alerts to burn‑rate alerts tied to user impact.
- Compliance Hooks: Tag every deployment with ticket IDs; export artifacts (diffs, captures) for audits.
- Security: Handle PCAPs as sensitive data; encrypt at rest; scrub PII; rotate credentials; least‑privilege for automation.
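To make "SLOs over thresholds" concrete: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). A burn rate of 1 spends the budget exactly over the SLO period, while higher values spend it faster. The sketch below applies the widely used two-window pattern, paging only when both a long and a short window burn fast. The 14.4 threshold is the commonly cited figure for spending 2% of a 30-day budget in one hour, used here as an assumption rather than a recommendation for your network.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Error-budget burn rate: observed error ratio / allowed error ratio."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # Example: 99.9% of probes on a WAN class must succeed.
    # 2% of probes failed over the last hour and over the last 5 minutes.
    print(should_page(0.02, 0.02, slo_target=0.999))
```

For network SLOs, the "error ratio" can be the share of synthetic probes that were lost or exceeded a latency bound for a given class, computed with recording rules.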
Honorable Mentions (Know Them Too)
- Batfish – offline network modeling and policy verification.
- pyATS/Genie – vendor‑neutral operational testing and parsing.
- ElastiFlow / pmacct – flow analytics at scale.
- Zeek – network security monitoring with protocol scripting.
- ThousandEyes / Kentik – SaaS path/experience monitoring and network analytics.
- FRRouting (FRR) – open routing stack for labs and automation.
Summary
Mastering these ten tools isn’t about memorizing commands—it’s about building repeatable workflows. When combined with a clean Source of Truth, CI/CD, and observability, they let you move quickly without breaking things. Start small: standardize capture/playbook templates, codify dashboard panels, and commit your runbooks. The compounding payoff is massive.
