The Config Report

The Blame Funnel: How to Troubleshoot Without Randomly Pinging Things

It’s Probably Not the Network Series: Issue 1 of 5

JJ – Chief Packet Pusher
Jun 15, 2026

Dear packet witnesses,

There is a sacred ritual in IT.

An application gets slow.
A user opens a ticket.
A manager forwards it with three question marks.
Someone says, “Can the network team check?”

And just like that, you are summoned.

No logs.
No timestamps.
No source IP.
No destination IP.
No screenshot.
No error message.
No evidence that any packet has committed a felony.

Just vibes.

Welcome to the first issue of:

It’s Probably Not the Network: A Troubleshooting Survival Guide

This series is for every network engineer, sysadmin, infrastructure gremlin, and emotionally load-balanced IT professional who has ever been asked to “check the network” because one SaaS page loaded like it was being delivered by carrier pigeon.

This week, we’re starting with the most important troubleshooting skill of all:

How to avoid randomly pinging things until something looks suspicious.

Because that is not troubleshooting.

That is network astrology.


The Problem: Everyone Thinks the Network Is One Big Magic Pipe

To most users, the network is simple.

Their laptop connects to “the Wi-Fi.”
The Wi-Fi connects to “the internet.”
The internet connects to “the app.”
The app works.

That’s it.

To them, the entire path looks like this:

Laptop → Magic → Payroll

So when Payroll breaks, the obvious conclusion is:

“The magic is down.”

Unfortunately, the actual path usually looks more like this:

User device → Wi-Fi/AP → switch → VLAN → firewall → DNS → proxy/SASE/VPN → ISP → cloud provider → load balancer → application server → database → identity provider → some certificate nobody has renewed since the Obama administration

But sure.

Let’s blame the switch.


The Goal Is Not to Prove the Network Is Innocent

This is important.

The goal of troubleshooting is not to prove the network is innocent.

That sounds defensive.

And let’s be honest, sometimes the network absolutely did it.

Sometimes someone fat-fingered a route, changed an ACL, broke NAT, moved a subnet, rebooted a firewall, disabled the wrong interface, or created a VLAN that exists in spirit but not in trunk configuration.

The goal is better than that:

Prove where the problem lives.

Maybe it’s the network.

Maybe it’s DNS.

Maybe it’s the firewall.

Maybe it’s the server.

Maybe it’s the application.

Maybe it’s authentication.

Maybe it’s the user’s laptop running 47 browser extensions and a PDF toolbar from 2011.

The job is not to argue.

The job is to narrow the blast radius until the truth has nowhere left to hide.


Enter: The Blame Funnel

When something breaks, everyone starts with a giant vague statement:

“The network is slow.”

That sentence means nothing.

It is not a problem statement.
It is a distress signal.

Your job is to push the issue through the Blame Funnel.

The Blame Funnel takes a vague complaint and squeezes it down into something useful.

It starts here:

“The network is slow.”

Then you ask enough good questions to turn it into this:

“Users in Building B on VLAN 30 are seeing 12–18% packet loss to the internal inventory app at 10.40.12.25, but only over Wi-Fi, starting around 9:15 AM after the access switch maintenance window.”

Now we have something.

Now we can work.

Now the network team can stop being treated like the town wizard.


Stage 1: Define the Actual Symptom

Before you touch a command line, define what “broken” means.

Because users say things like:

  • “The internet is down”

  • “The app is slow”

  • “Wi-Fi isn’t working”

  • “The server is broken”

  • “The VPN hates me”

  • “Everything is down”

And every one of those could mean 15 different things.

So start with this:

What exactly is failing?

Can the user not connect at all?

Can they connect but the application is slow?

Can they authenticate but not load data?

Can they reach the app by IP but not by name?

Can they access other apps?

Can other users access the same app?

Is the problem constant or intermittent?

Does it happen on wired and wireless?

Does it happen on VPN and in-office?

Does it happen for one user, one subnet, one building, one region, or everyone?

You are not being annoying.

You are transforming fog into facts.

And facts are how we avoid rebooting production firewalls because “Kevin said Teams felt weird.”


Stage 2: Scope the Blast Radius

Once you know the symptom, figure out how big the problem is.

This is where the investigation gets real.

Ask:

Who is affected?

One user?

A department?

A floor?

A building?

A site?

All remote users?

All users?

Only executives, because naturally the universe enjoys comedy?

What is affected?

One application?

Multiple applications?

Internal apps only?

Internet only?

SaaS only?

Voice?

Printing?

Authentication?

File shares?

Everything except YouTube, which somehow always works because it has made dark agreements with the routing gods?

Where is it happening?

Wired?

Wireless?

VPN?

Branch office?

Data center?

Cloud?

Specific VLAN?

Specific SSID?

Specific firewall zone?

Specific ISP circuit?

The blast radius tells you where to look first.

If one user is affected, maybe it is their device.

If one VLAN is affected, maybe it is a gateway, ACL, DHCP, DNS, or routing issue.

If one site is affected, maybe it is WAN, firewall, ISP, SD-WAN, local switching, or power.

If everyone everywhere is affected, congratulations. Your day is about to become a meeting with screen sharing.
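
The scope-to-suspect mapping above fits in a small lookup, which makes it easy to paste into a ticket template. This is only a sketch: the scope labels and the `first_suspects` helper are invented for illustration, and the suspect lists are simply the ones from the paragraphs above.

```python
from __future__ import annotations

# Hypothetical scope labels; the suspect lists mirror the text above.
FIRST_SUSPECTS = {
    "one_user": ["user device"],
    "one_vlan": ["gateway", "ACL", "DHCP", "DNS", "routing"],
    "one_site": ["WAN", "firewall", "ISP", "SD-WAN", "local switching", "power"],
    "everyone": ["start the bridge call"],
}


def first_suspects(scope: str) -> list[str]:
    """Return where to look first for a given blast radius."""
    try:
        return FIRST_SUSPECTS[scope]
    except KeyError:
        # An unknown scope means Stage 2 is not finished yet.
        raise ValueError(f"unknown scope {scope!r}: scope the problem first") from None
```

The lookup does not find the fault. It only keeps the first guess proportional to the blast radius.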


Stage 3: Identify the Path

Now we find the actual packet journey.

Not the imaginary vendor diagram.

Not the Visio from 2018 that still shows a firewall you decommissioned before the pandemic.

The real path.

You need:

  • Source IP

  • Destination IP or FQDN

  • Source VLAN/subnet

  • Destination subnet

  • Protocol

  • Port

  • DNS resolver

  • Gateway

  • Firewall path

  • NAT behavior

  • VPN/SASE/proxy involvement

  • Any recent change windows

Without source, destination, protocol, and port, you are not troubleshooting.

You are participating in infrastructure improv.

A useful problem statement looks like this:

“Client 10.20.30.55 on VLAN 30 cannot connect to app.company.local at 10.80.12.40 over TCP/443. Other users on VLAN 30 are also affected. Users on VLAN 20 are not affected.”

That is beautiful.

That is actionable.

That is the kind of ticket that deserves a tiny parade.

A bad problem statement looks like this:

“App slow. Network?”

That ticket should be returned to sender with a lint roller and a juice box.
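
One way to enforce the "source, destination, protocol, port" rule is to make the ticket template refuse to describe a flow until all four are present. The sketch below assumes a hypothetical `FlowDescription` record; the field names are illustrative and not taken from any real ticketing system.

```python
from __future__ import annotations

from dataclasses import dataclass

# The four fields the text calls non-negotiable.
REQUIRED = ("source_ip", "destination", "protocol", "port")


@dataclass
class FlowDescription:
    """Hypothetical intake record for one broken flow."""
    source_ip: str | None = None
    destination: str | None = None
    protocol: str | None = None
    port: int | None = None


def missing_fields(flow: FlowDescription) -> list[str]:
    """List required fields the ticket has not supplied yet."""
    return [name for name in REQUIRED if getattr(flow, name) in (None, "")]


def describe(flow: FlowDescription) -> str:
    """Render the flow, or refuse if the ticket is still infrastructure improv."""
    gaps = missing_fields(flow)
    if gaps:
        raise ValueError("cannot troubleshoot yet, missing: " + ", ".join(gaps))
    return f"{flow.source_ip} -> {flow.destination} over {flow.protocol}/{flow.port}"
```

Fed the good ticket above, `describe` returns `10.20.30.55 -> 10.80.12.40 over TCP/443`. Fed "App slow. Network?", it raises and names what is missing.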


Stage 4: Test in Layers, Not Panic Spirals

The classic troubleshooting model is the OSI stack, Layer 1 through Layer 7.

And yes, everyone jokes about it.

But it works.

The key is to move through the layers with intent instead of randomly trying whatever command your fingers remember.

Layer 1: Physical

Is the cable connected?

Is the interface up?

Any errors?

Any flaps?

Any bad optics?

Any speed/duplex weirdness?

Any power issue?

Any access point down?

Any switch stack member pretending to be alive while contributing nothing, like a printer support contract?

Layer 2: Data Link

Is the MAC address learned?

Is the VLAN correct?

Is the trunk allowing the VLAN?

Any spanning-tree weirdness?

Any port-channel mismatch?

Any excessive broadcasts?

Any duplicate MACs, or MAC addresses flapping between ports?

Any access port in the wrong VLAN because someone “temporarily” moved it six months ago?

Layer 3: Network

Does the client have the right IP?

Right mask?

Right gateway?

Can it reach the gateway?

Does routing exist both ways?

Any asymmetric routing?

Any missing route?

Any route pointing to a device that is technically powered on but spiritually gone?
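
The first three Layer 3 questions (right IP, right mask, right gateway) can be sanity-checked offline with Python's standard `ipaddress` module. This is a rough sketch: `check_client_addressing` is a made-up helper, and it only checks internal consistency. Whether the values match what your DHCP scope or IPAM says they should be is still on you.

```python
from __future__ import annotations

import ipaddress


def check_client_addressing(ip: str, prefix_len: int, gateway: str) -> list[str]:
    """Return a list of obvious addressing problems (empty list means none found)."""
    problems = []
    iface = ipaddress.ip_interface(f"{ip}/{prefix_len}")
    gw = ipaddress.ip_address(gateway)

    if iface.ip.is_link_local:
        # 169.254.0.0/16 usually means the client never got a DHCP lease.
        problems.append("client has a link-local address; DHCP probably failed")
    if gw not in iface.network:
        problems.append(f"gateway {gw} is outside client subnet {iface.network}")
    if gw == iface.ip:
        problems.append("client and gateway share the same address")
    return problems
```

For example, `check_client_addressing("10.20.30.55", 24, "10.20.31.1")` reports that the gateway sits outside 10.20.30.0/24, which is usually a wrong mask or a wrong gateway.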

Layer 4: Transport

Is the port open?

Is TCP completing the handshake?

Are packets being reset?

Are packets timing out?

Does UDP disappear into the void like a change request with no business justification?

Layers 5–7: Session, Presentation, Application

Is authentication working?

Is TLS/certificate negotiation working?

Is DNS resolving correctly?

Is the application responding?

Is the backend database alive?

Is the load balancer pool healthy?

Is the app team saying “nothing changed” in a way that suggests something absolutely changed?

You do not always need to go perfectly in order.

But you do need to know which layer you are testing.

“Ping failed” does not mean “the app is down.”

“Ping worked” does not mean “the app is fine.”

Ping is a flashlight.

Not a court verdict.


Stage 5: Separate Reachability From Usability

This is where a lot of troubleshooting goes sideways.

There is a huge difference between:

  • “Can I reach it?”

  • “Can I use it?”

  • “Does it perform well?”

  • “Does the application actually work?”

You can have perfect ping and a broken app.

You can have blocked ping and a healthy app.

You can have TCP/443 open and the application returning errors.

You can have DNS resolving successfully, but to the wrong destination.

You can have a firewall policy allowing traffic while NAT quietly ruins everyone’s afternoon.

So test like this:

Basic reachability

Can the client reach the gateway?

Can the client reach DNS?

Can the client reach the destination IP?

Can the client reach other known-good systems?

Name resolution

Does the hostname resolve?

Does it resolve to the expected IP?

Does it resolve differently internally and externally?

Are different clients using different DNS servers?

Is the record cached?

Is the TTL doing something unhelpful?
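
"Does it resolve to the expected IP?" is easy to script. The sketch below uses the OS resolver through `socket.getaddrinfo`, so it sees what the client sees, including any local cache or hosts file. `check_resolution` and its verdict strings are illustrative names, and the expected IP list is something you have to supply from a source you trust.

```python
from __future__ import annotations

import socket
from typing import Callable, Iterable


def system_resolve(hostname: str) -> set[str]:
    """Ask the OS resolver, i.e. whatever this client would really use."""
    infos = socket.getaddrinfo(hostname, None)
    return {info[4][0] for info in infos}


def check_resolution(
    hostname: str,
    expected_ips: Iterable[str],
    resolve: Callable[[str], set[str]] = system_resolve,
) -> tuple[str, set[str]]:
    """Return (verdict, answers); verdict is 'no answer', 'expected', or 'unexpected'."""
    try:
        answers = set(resolve(hostname))
    except socket.gaierror:
        return "no answer", set()
    if not answers:
        return "no answer", set()
    if answers <= set(expected_ips):
        return "expected", answers
    # It resolved, but at least one answer is not on the expected list.
    return "unexpected", answers
```

The `resolve` parameter is there so the same check can be run against answers collected from a different client or resolver and compared side by side.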

Port connectivity

Can the client connect to the required TCP/UDP port?

Does the connection time out, reset, or complete?

Is the firewall allowing it?

Is the server listening?
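
The "time out, reset, or complete" question maps neatly onto a plain TCP connect. Here is a minimal sketch using only the standard library; `tcp_probe` and its return strings are invented for this example, and it tests TCP only, not UDP.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP handshake and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "complete"  # SYN, SYN-ACK, ACK all happened
    except ConnectionRefusedError:
        return "reset"  # something answered the SYN with a RST
    except (socket.timeout, TimeoutError):
        return "timeout"  # no answer at all within the timeout
    except OSError as exc:
        return f"error: {exc}"  # e.g. no route to host, name lookup failure
```

As a rough guide, and not a guarantee: "reset" often means the host is reachable but nothing is listening, or a device is actively rejecting the connection, while "timeout" often means the packets are being dropped silently somewhere along the path. Either way, you now know which question to ask next.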

Application response

Does the login page load?

Does authentication complete?

Does the error appear after login?

Does the app fail only when pulling data?

Does the app fail only for certain users or roles?

This matters because “the site loads but login fails” is not the same problem as “TCP/443 never connects.”

One is likely application/authentication.

The other might be network, firewall, routing, or server.

Same user complaint.

Completely different investigation.


Stage 6: Check Recent Changes

I know.

Nobody changed anything.

A timeless classic.

Infrastructure’s favorite bedtime story.

But check anyway.

Ask:

  • Any firewall changes?

  • Any routing changes?

  • Any switch maintenance?

  • Any wireless changes?

  • Any DNS changes?

  • Any certificate changes?

  • Any identity provider changes?

  • Any application release?

  • Any server patching?

  • Any ISP maintenance?

  • Any cloud provider incidents?

  • Any “quick cleanup” someone did before lunch?

The phrase “nothing changed” usually means one of three things:

  1. Nobody knows what changed.

  2. The person who changed it is not in the meeting.

  3. The change was technically “not supposed to affect anything,” which is how you know it affected something.

Change correlation does not prove causation.

But it gives you a place to start digging.

And sometimes the fastest troubleshooting tool is not ping.

It is the change calendar.
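
If the change calendar can be exported as a list of timestamps and summaries, correlating it with the incident start takes a few lines. This sketch assumes a hypothetical `Change` record and a four-hour lookback window; both are illustrative choices, not a standard.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    """Hypothetical change-calendar entry."""
    when: datetime
    summary: str


def changes_before(
    incident_start: datetime,
    changes: list[Change],
    lookback: timedelta = timedelta(hours=4),  # assumed window, tune to taste
) -> list[Change]:
    """Return changes that landed in the lookback window, most recent first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c.when <= incident_start]
    return sorted(hits, key=lambda c: c.when, reverse=True)
```

Anything this returns is a suspect, not a conviction. It still has to be tested like any other theory.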


Stage 7: Collect Evidence Before Escalating

When you escalate, do not send a shrug wearing a ticket number.

Send evidence.

Bad escalation:

“Network looks fine. Please check app.”

Better escalation:

“Client 10.20.30.55 can resolve app.company.local to 10.80.12.40. TCP/443 completes successfully. Firewall logs show allowed sessions from VLAN 30 to the app server. Packet capture confirms the SYN, SYN-ACK, ACK handshake completes. The application returns HTTP 500 after login. No packet loss observed between client and destination during testing. Please review application/backend logs around 10:15–10:30 AM.”

That second one is not “blaming the app team.”

That is handing them a flashlight, a map, and the approximate location of the goblin.

Good evidence includes:

  • Timestamp

  • Source IP

  • Destination IP/FQDN

  • Protocol/port

  • Test location

  • Test result

  • Packet loss/latency/jitter if relevant

  • Firewall/session log result

  • DNS result

  • Traceroute/MTR/path result if useful

  • Screenshot or exact error

  • Recent changes checked

  • Known-good comparison

This is how you avoid the infinite ticket ping-pong championship.

You are not just saying “not network.”

You are saying:

“Here is what was tested. Here is what passed. Here is what failed. Here is where the failure appears to begin.”

That is the difference between troubleshooting and departmental dodgeball.
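
The "what passed, what failed, where the failure begins" structure can be generated from an ordered list of test results. A minimal sketch, assuming a hypothetical `CheckResult` record and checks listed in path order, from the client toward the application:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CheckResult:
    """One test, recorded in the order traffic would hit it."""
    name: str
    passed: bool
    detail: str


def first_failure(results: list[CheckResult]) -> CheckResult | None:
    """Return the earliest failing check, or None if everything passed."""
    for result in results:
        if not result.passed:
            return result
    return None


def escalation_note(results: list[CheckResult]) -> str:
    """Render a plain-text summary suitable for pasting into a ticket."""
    lines = [
        f"[{'PASS' if r.passed else 'FAIL'}] {r.name}: {r.detail}" for r in results
    ]
    failed = first_failure(results)
    if failed is not None:
        lines.append(f"Failure appears to begin at: {failed.name}")
    else:
        lines.append("No failing check found in the tests listed above.")
    return "\n".join(lines)
```

With the better escalation above as input (DNS passes, TCP/443 passes, HTTP after login fails with a 500), the last line of the note points at the application response and not at the network. Timestamps, test location, and the known-good comparison still need to go in the ticket alongside it.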


Common Traps That Waste Everyone’s Time

Trap 1: Starting With the Firewall Every Time

The firewall is suspicious.

Always.

It sits there silently judging traffic and logging just enough to be useful but never enough to be emotionally satisfying.

But not every problem is the firewall.

Before you blame it, confirm:

  • Is traffic reaching the firewall?

  • Is there a matching policy?

  • Is NAT involved?

  • Is the return path correct?

  • Is the session established?

  • Is inspection interfering?

  • Is the destination actually listening?

Firewall troubleshooting without source, destination, and port is just reading logs with hope in your heart.

Trap 2: Trusting “It Affects Everyone”

Users say “everyone” when they mean:

  • Me

  • Me and Bob

  • Three people near the printer

  • Anyone I asked in the last 45 seconds

  • The entire known universe, based on vibes

Always confirm the scope.

“Everyone” is not a measurement.

It is a panic adjective.

Trap 3: Treating Wireless Bars Like Network Health

Full bars do not mean good Wi-Fi.

Full bars mean the client hears the AP loudly.

That does not tell you:

  • Channel utilization

  • Noise

  • SNR

  • Roaming behavior

  • Authentication health

  • DHCP timing

  • DNS behavior

  • Client driver nonsense

  • Whether the device is clinging to an AP in another zip code

Wireless gets its own issue later in this series because Wi-Fi is not a network.

It is a negotiation with physics.
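
To make the bars-versus-SNR point concrete: SNR in dB is just received signal strength minus the noise floor, both in dBm. The readings below are invented for illustration, and the 25 dB threshold is a common rule of thumb used here as an assumption, not a requirement from any standard.

```python
def snr_db(rssi_dbm: float, noise_floor_dbm: float) -> float:
    """Signal-to-noise ratio in dB from two dBm readings."""
    return rssi_dbm - noise_floor_dbm


def looks_healthy(rssi_dbm: float, noise_floor_dbm: float, min_snr_db: float = 25.0) -> bool:
    """Rough check against an assumed rule-of-thumb SNR threshold."""
    return snr_db(rssi_dbm, noise_floor_dbm) >= min_snr_db


# Made-up readings: a loud AP in a noisy room vs. a quieter AP in a clean one.
loud_but_noisy = snr_db(-50, -60)   # 10 dB: full bars, poor SNR
quiet_but_clean = snr_db(-67, -95)  # 28 dB: fewer bars, better SNR
```

The client showing full bars is the one having the worse day.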

Trap 4: Assuming DNS Is Fine Because “It Resolved”

DNS can resolve and still be wrong.

It can resolve to the wrong IP.

It can resolve differently for different users.

It can be cached.

It can be split-brain.

It can depend on VPN state.

It can depend on which DNS server the client asked.

DNS is not innocent.

DNS is just well-dressed.

Trap 5: Believing the App Team’s “Nothing Changed”

Maybe they are right.

Maybe they changed nothing.

Maybe the release pipeline changed something.

Maybe a dependency changed.

Maybe a certificate expired.

Maybe the identity provider changed behavior.

Maybe the database is slow.

Maybe the app is waiting on an API that is waiting on another API that is waiting on a cloud service that is currently held together with YAML and prayers.

“Nothing changed” should be treated as a hypothesis.

Not a blood oath.


The Troubleshooting Mindset

Good troubleshooting is not about knowing every command.

It is about asking better questions.

Bad troubleshooting says:

“Can you ping it?”

Good troubleshooting says:

“From where, to what, over which path, using which protocol, and what result would prove or disprove the current theory?”

That is less catchy.

But it works.

The best network engineers are not the ones who immediately know the answer.

They are the ones who know how to remove wrong answers quickly.

They test.

They narrow.

They document.

They compare.

They resist the urge to reboot things like an angry wizard.

That is the whole game.


Free Takeaway: The 5-Minute Blame Funnel

The next time someone says “the network is slow,” do not start with ping.

Start with this:

1. Who is affected?

One user, one group, one site, or everyone?

2. What is affected?

One app, many apps, internet, internal systems, voice, Wi-Fi, VPN?

3. Where are they?

Wired, wireless, VPN, branch, data center, cloud, specific VLAN, specific SSID?

4. What exactly fails?

Can’t connect, slow response, login failure, timeout, reset, error after login, intermittent drops?

5. When did it start?

Exact time, recent changes, maintenance windows, app releases, firewall changes, DNS changes?

6. What path does the traffic take?

Source IP, destination IP/FQDN, protocol, port, gateway, firewall, NAT, proxy, VPN/SASE, cloud?

7. What evidence do we have?

Ping, DNS result, TCP test, traceroute/MTR, firewall logs, packet capture, app error, known-good comparison?

That is the Blame Funnel.

It turns:

“The network is broken.”

Into:

“Users on wireless VLAN 40 at the warehouse can resolve the app name but TCP/443 to 10.80.12.40 times out. Wired users at the same site are fine. Issue started after AP profile changes at 8:30 AM.”

Now you are troubleshooting.

Not guessing.

Not defending.

Not sacrificing a switch to appease the outage gods.


Root Access Bonus: The “Prove It’s Not the Network” Toolkit

Alright, Root Access crew.

Now that we have the concept, let’s turn it into something you can actually use during the next outage, bridge call, or emotionally unsafe Teams thread.

Below is a practical troubleshooting kit you can copy into your NOC wiki, ticket templates, or personal “please stop blaming the network” survival binder.

© 2026 JJ from The Config Report