Troubleshooting tools for remote VPN Users

What are some excellent tools out there for remote VPN users? I’m looking for something like Smokeping that runs hidden as a background process on the workstation, that we could remotely manage and point at different targets. It would track instances of packet loss, jitter, etc., and let us review historical data going back days if not weeks.

Lacking a tool like this, I find remote VPN users’ connectivity complaints very difficult to troubleshoot. At this point we’re kind of relying on saying things like “well, this isn’t happening to the other 3k users on the vpn so it must be you,” which quite frankly as an engineer doesn’t really satisfy me as a good answer.

I want to show charts and data that point out “see, you lost connection to our network briefly; you need to call your home ISP.”

How have some of you experts in heavy WFH environments handled persistent complaints? Complaints range from “I get kicked off all the time,” to just plain “it’s so slow on the vpn!”

Of course we’re able to vet every inch of our network, checking for congestion and errors along every interface through our ingress and VPN concentrator, and we’re able to check CPU/memory for every device the traffic goes through… what we are not able to check is whether this user’s ISP dropped traffic, or whether they are taking errors at some peering point on the way to our ASN.

We need better tools I think!

We run an HTML5 speedtest server internally at each site where a VPN concentrator exists ( https://github.com/librespeed/speedtest ) with telemetry enabled. We have users run a test, and if it seems crap we have them run one from speedtest.net and compare. 99% of the time it points at a local/ISP issue.

“well, this isn’t happening to the other 3k users on the vpn so it must be you,” which quite frankly as an engineer doesn’t really satisfy me as a good answer.

I’ve got much better things to do than troubleshoot why someone with their wifi router 4 floors up locked in a metal safe is getting shit speeds. We refer people to Geeksquad for local help.

The webex/spark media test is another good tool for general statistics: https://mediatest.ciscospark.com/

Or pay ThousandEyes an unreasonable amount of money to have it tell you it’s your employee’s shit wifi or bargain-basement ISP. Which you can’t fix.

Fairly simple thing, but if an end user is on a wireless network and complaining about poor connectivity, my go-to will usually be the command:

netsh wlan show wlanreport

It essentially shows every wireless event that has happened over the past 48hrs: connects, disconnects, length of sessions, etc. Useful tool & baked right into Windows.

Honestly ignore that it’s a VPN; it’s a network. Set up some monitoring of the VPN hosts from the server side. Just a ping every 60s with latency and % lost over the past 24 hours sounds like it might be sufficient?

Unfortunately I don’t know of anything purpose-made for that, although I am sure something exists that would work out of the box.
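A minimal sketch of the bookkeeping side of that idea, in Python. It assumes some collector (not shown) probes each VPN host roughly every 60 seconds and hands back a round-trip time in milliseconds, or `None` for a lost probe. The class and field names here are hypothetical, not from any existing tool.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional, Tuple


@dataclass
class WindowStats:
    sent: int
    lost: int
    loss_pct: float
    avg_ms: Optional[float]
    max_ms: Optional[float]


class RollingPingLog:
    """Keeps probe results for one host over a sliding time window."""

    def __init__(self, window_s: float = 24 * 3600) -> None:
        self.window_s = window_s
        # Each entry: (timestamp in seconds, round-trip ms or None if lost).
        self.samples: Deque[Tuple[float, Optional[float]]] = deque()

    def record(self, ts: float, rtt_ms: Optional[float]) -> None:
        self.samples.append((ts, rtt_ms))
        # Drop samples that have aged out of the window.
        while self.samples and self.samples[0][0] < ts - self.window_s:
            self.samples.popleft()

    def stats(self) -> WindowStats:
        rtts = [r for _, r in self.samples if r is not None]
        sent = len(self.samples)
        lost = sent - len(rtts)
        loss_pct = 100.0 * lost / sent if sent else 0.0
        avg = sum(rtts) / len(rtts) if rtts else None
        return WindowStats(sent, lost, loss_pct, avg, max(rtts) if rtts else None)


# Simulated probes, one per 60 s; a real collector would ping the host here.
log = RollingPingLog()
for i, rtt in enumerate([20.0, 22.0, None, 30.0]):
    log.record(i * 60.0, rtt)
print(log.stats())
```

One log per connected client, fed on a timer, would give the "latency and % lost over the past 24 hours" view. This only works if the pings actually reach the client rather than being answered by something in between.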

If your VPN server supports it, run netflow on the ingress and egress interface. You’re still going to have a reactive approach to issues, but IMO this is acceptable for VPN. Our solution also allows a connected client to run an in-depth report that helps diagnose issues.

A few people have mentioned ThousandEyes. There is a cost to it, but it’s built for this kind of thing: running tests against the stuff you don’t own or don’t have the ability to actively monitor.

Not sure if it was checked, but we have had a few people where I work complaining about VPN drops. Turns out they were using wireless on their work computer. I believe they were either decently far from their wifi or they lived in a congested area (apartment complex or something similar).

At this point we’re kind of relying on saying things like “well, this isn’t happening to the other 3k users on the vpn so it must be you,” which quite frankly as an engineer doesn’t really satisfy me as a good answer.

And you still have to find the issue to fix it. If not for Mary in AP, definitely for the CFO who is 67 and avoiding the office at all costs.

Not directly related, but I support a Cato SDWAN environment. Many users who have Fios Gbps home service have noticed that speed tests routinely come out to like 200-300 Mbps while on the VPN (going full tunnel). Cato support has actually clarified that they consider anything 100 Mbps or higher to be correct. They do not support infinite bandwidth, and in fact if they did their product would be cost prohibitive (according to them). Mostly this is not an issue because as long as the connection is stable, 100 Mbps is good enough for most standard business applications. But this is an example of how VPNs can be the wild west.

Most users are on a consumer grade home ISP service, because it’s the only one they can get, and it suited them perfectly fine before WFH. But you still have to prove that this is the root cause, and even then “it can’t be fixed” is just not a good enough answer.

TL;DR: I feel your pain, but I don’t have a great answer.

Use PingPlotter. It can run in the background on a “troubled user’s” PC and report data back to a host that is inside the VPN network. It can track packet loss per hop and create a graph as a visual representation to show a user what’s wrong, without just “telling” them.

You can track both targets on the VPN and known reliable public hosts. If they are losing connection to both at the same time, you can reliably show them it’s their ISP.
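The "both at the same time" reasoning can be sketched in a few lines of Python. This assumes each probe interval yields two reachability results taken together, one for a target behind the VPN and one for a public host. The labels are hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable, Tuple

LABELS = ("ok", "local_or_isp", "vpn_path", "public_target")


def classify(vpn_ok: bool, public_ok: bool) -> str:
    """Label one probe interval from two simultaneous reachability results."""
    if vpn_ok and public_ok:
        return "ok"
    if not vpn_ok and not public_ok:
        # Both paths failed together: the shared segment
        # (home LAN, wifi, ISP) is the likely culprit.
        return "local_or_isp"
    if not vpn_ok:
        # Public internet is fine, only the VPN target is unreachable.
        return "vpn_path"
    # Only the public host failed; it may simply be a bad reference target.
    return "public_target"


def summarize(samples: Iterable[Tuple[bool, bool]]) -> Dict[str, int]:
    """Count how many intervals fall under each label."""
    counts = {label: 0 for label in LABELS}
    for vpn_ok, public_ok in samples:
        counts[classify(vpn_ok, public_ok)] += 1
    return counts


# Each tuple is (vpn_target_reachable, public_host_reachable).
history = [(True, True), (False, False), (False, True), (False, False)]
print(summarize(history))
```

A history dominated by `local_or_isp` is the chart-ready evidence for "call your ISP", while a run of `vpn_path` points back at the tunnel or the concentrator side.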

I heard about something called ThousandEyes on a podcast recently; they were discussing this exact use case.

Look at ThousandEyes. It has embedded endpoint agents that will give you explicit details of local network, wifi, gateway, internet routing, and even application steps if you wish. It’s an opex model. All cloud.

pathping on Windows 10 does a traceroute and then does 200s of performance testing to the various gateways. It’s obviously ICMP-based, but just have the users point it at the outside IP of your VPN gateway.

netsh wlan show interfaces
Shows wifi signal strength. Anything under 80% might experience packet loss.
netsh wlan show wlanreport
On Windows 10 this generates an HTML report that may show whether they have been getting disconnected constantly, or other issues.
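If the helpdesk collects that output as text, the signal check can be automated. Below is a rough Python sketch. The sample text is an abbreviated, made-up excerpt in the shape of English-language `netsh wlan show interfaces` output (the label differs on localized Windows), and the function names are hypothetical. The threshold of 80 is just the rule of thumb from above.

```python
import re
from typing import Optional

# Matches a line such as "    Signal                 : 72%".
SIGNAL_RE = re.compile(r"^\s*Signal\s*:\s*(\d+)%", re.MULTILINE)


def parse_signal_pct(netsh_output: str) -> Optional[int]:
    """Return the Signal percentage from the text, or None if absent."""
    match = SIGNAL_RE.search(netsh_output)
    return int(match.group(1)) if match else None


def is_weak(signal_pct: Optional[int], threshold: int = 80) -> bool:
    """True when a signal reading exists and is below the threshold."""
    return signal_pct is not None and signal_pct < threshold


# Abbreviated, illustrative sample; real output has many more fields.
sample = """
    Name                   : Wi-Fi
    State                  : connected
    SSID                   : HomeNet
    Signal                 : 72%
    Profile                : HomeNet
"""

pct = parse_signal_pct(sample)
print(pct, is_weak(pct))
```

On a real workstation the text would come from running the command (for example via `subprocess.run`) instead of the hard-coded sample.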

If you have money, get the ThousandEyes endpoint agent. You have to buy a minimum of 100 licenses, but you can install it on users’ PCs and run all kinds of tests from their PC to other network resources and apps.

Honestly, I’m not sure if there’s a troubleshooting tool for it, but I’ve seen cases where the home network is fine and it still dies. Unless the VPN itself is misconfigured, getting them to reboot the router seems to work every time, even if “everything else is fine”.

Turns out, some home routers suck at NAT.

The only times I’ve had issues that weren’t resolved by that were users who somehow got it stuck that if they are on the VPN they can disconnect from the wifi.

The biggest issue I have seen with VPNs is MTU.

<internet> ----- MTU 1452 ----- <consumer router> ----- MTU 1500 ----- <end user device>

In this situation the VPN on the end user device assumes it can send packets of size 1500 (minus IP/UDP headers), configures the tunnel for this, and sets the Don’t Fragment (DF) bit. Then the traffic reaches the router and gets rejected because fragmentation is required. Or traffic flowing in the other direction hits the same problem.

This, in combination with firewalls that drop any kind of ICMP, means even more problems, because the “fragmentation needed” messages never make it back to the sender.

If the VPN client does not set the DF (Don’t Fragment) bit, it means that every packet that is too big gets fragmented, which increases the chance of packet loss, as any fragment not arriving means the packet cannot be decoded. Not every firewall handles out-of-order fragments properly; some just drop them.

So, for testing tools, make sure to include a tool for Path MTU Discovery.
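As a rough illustration of what such a tool does, here is a Python sketch of the search itself. It assumes a `probe(size)` callable that sends one packet of `size` total bytes with DF set and reports whether it got through. That callable is hypothetical and is faked below with a 1452-byte bottleneck like the one in the diagram. On Windows a real probe could wrap `ping -f -l <payload>`, on Linux `ping -M do -s <payload>`.

```python
from typing import Callable

IPV4_HEADER = 20  # bytes, without IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def find_path_mtu(probe: Callable[[int], bool], lo: int = 576, hi: int = 1500) -> int:
    """Binary-search the largest packet size in [lo, hi] that probe() accepts."""
    if not probe(lo):
        raise ValueError("even the smallest packet size did not get through")
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            lo = mid       # mid fits, so the answer is mid or larger
        else:
            hi = mid - 1   # mid is too big
    return lo


def icmp_payload_for(mtu: int) -> int:
    """Payload size to pass to ping so the whole IPv4 packet equals mtu."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def fake_probe(size: int) -> bool:
    # Stand-in for a real DF-set ping: simulates a path with MTU 1452.
    return size <= 1452


mtu = find_path_mtu(fake_probe)
print(mtu, icmp_payload_for(mtu))
```

The payload arithmetic matters when testing by hand: a path with MTU 1452 passes a DF-set ping with a 1424-byte payload and rejects 1425. The tunnel’s own MTU then has to be set lower still to leave room for the VPN encapsulation overhead, which depends on the product.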

PingPlotter with their cloud agent is what you are looking for. Deploy the cloud agent, start some pings remotely, view the data later, and work with the end user to provide it to their ISP. There are more hands-on ways of troubleshooting, but reviewing a PingPlotter chart takes much less engineer time than remoting in, calling the user, etc.

We did this exact same thing, also with librespeed, when users started working remotely. Made those who complained run through the speed test, and if A=B, they were instructed to move closer to their “wifi router”, or just plug in directly. If tests didn’t improve, call your ISP.

It was a good troubleshooting tool for the helpdesk as it cut out the fluff like “it feels really slow”.

For some problematic users I set up Smokeping to monitor their connection. Every time they popped onto the network, Smokeping would start graphing things. That proved extremely useful for “it was slow last night”. Yup, I can see it was slow. Oh, you were sitting 4 floors down from your “wifi router”. I told you not to do that.

The best issue was one user who lived close enough to one of our sites that if they were sitting at their kitchen table, the laptop could see the SSID from the site and attempt to connect. The wifi profiles were configured to prioritize connecting to the corporate SSID over everything else. Of course seeing the SSID != good connection. The solution for them was to find a different room in their house to work from. Once they moved to a room on the other side of their house, the laptop connected to their home network. Problem solved.

I’ve got much better things to do than troubleshoot why someone with their wifi router 4 floors up locked in a metal safe is getting shit speeds. We refer people to Geeksquad for local help.

I have a home user that opens a NOC ticket every week for “Internet connectivity issues” while working from home. I’ve just started removing the engineer the ticketing system auto assigns and letting them rot in the queue.

I had no idea this existed. Super useful. Thank you!

Unfortunately, with the VPN solution we use, when you ping remote hosts from the server side the VPN concentrator itself replies to ICMP on their behalf, always showing them as a 1ms response time. They’ll also “continue to ping” unless the connection hard times out, meaning you’ll never see brief packet loss from our end.

This has frustratingly taken away our most basic troubleshooting tool (ICMP).