Slow IPSEC VPN on High Availability / three different customers / long post

As a SonicWALL user for over 20 years, I now need help with a very strange behaviour and am asking for the swarm intelligence.

Three of my clients have a high-availability installation according to spec, nothing very special: router(s) from the ISP(s) with an IP range of at least 8 fixed IPs (normally PPPoE on WAN, MTU 1492), and behind the router the two HA-configured boxes, normally on X1, with X2 for the second ISP.

I have clients with 2 x TZ300 / 2 x TZ400 / 2 x TZ670, so two older models and a brand-new one.

Everything seemed to work fine. I tested failover, different power rails / UPS etc., and it all works as intended.

Two months ago I had a service call: “slow data transfer over VPN”. The company has 1gbps symmetrical fiber; I tested with speedtest.net / fast.com and got over 900mbps in both directions, so my standard answer was “your home internet is the problem, let me check it quick”.

Boy, was I wrong. The home office user had 600+ mbps (over WLAN, but still) and only an SMB throughput of 7.5MByte/sec / 60mbps with GVC. I checked whether my arch-nemesis “RSC” was active on her notebook - nope.

As I have 1gbps on my side too, I double-checked and got the exact same speed of 7.5MByte/sec!
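For comparing the SMB figures (MByte/sec) with the line-speed figures (mbps) in this thread, here is a minimal conversion sketch. It assumes decimal prefixes and 8 bits per byte; the function name is just for illustration.

```python
def mbyte_per_s_to_mbps(mbyte_per_s: float) -> float:
    """Convert MByte/sec to megabit/sec (assumes 1 byte = 8 bits, decimal prefixes)."""
    return mbyte_per_s * 8


# The SMB throughput measured over the tunnel, expressed in mbps.
print(f"7.5 MByte/sec = {mbyte_per_s_to_mbps(7.5):.0f} mbps")
```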

I told her I had to check on their side. Long story short: I exchanged the router and patch cables at the customer site, still the same.

Then came the same complaint from another company, the one with the two TZ670. Tested: the same 7.5MByte/sec SMB speed. Checked with my Site2Site VPN between my office and the first company: still 7.5MByte/sec SMB speed.

As SMB can be tricky, I went methodical. iPerf3 check over the same IPSEC tunnel:

  • 1 stream: 64mbps
  • 10 streams: 180mbps

iPerf3 check over Internet:

  • 1 stream: 250mbps
  • 10 streams: 980mbps
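For anyone who wants to repeat the 1-stream / 10-stream comparison, here is a minimal sketch of how such a run could be scripted. It is not the exact procedure used above. It assumes the `iperf3` binary is on the PATH, that an iperf3 server (`iperf3 -s`) is already listening on the far side, and that the test is TCP; the function names are made up for illustration.

```python
import json
import subprocess


def parse_iperf3_mbps(report: dict) -> float:
    """Return the receiver-side throughput in mbps from an iperf3 JSON report (TCP test)."""
    return report["end"]["sum_received"]["bits_per_second"] / 1_000_000


def run_iperf3(server: str, streams: int, seconds: int = 10) -> float:
    """Run one iperf3 client test against `server` and return the measured mbps.

    -P sets the number of parallel streams, -t the duration, -J requests JSON output.
    """
    cmd = ["iperf3", "-c", server, "-P", str(streams), "-t", str(seconds), "-J"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_iperf3_mbps(json.loads(result.stdout))
```

Calling `run_iperf3(server, 1)` and `run_iperf3(server, 10)` once through the tunnel address and once through the public address of the same host would give the two pairs of numbers compared here.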

Tried to replicate with another FTTH customer with a TZ400, but he has no HA configuration, and voilà:

iPerf3 check over IPSEC tunnel:

  • 1 stream: 475mbps
  • 10 streams: 560mbps

iPerf3 check over Internet:

  • 1 stream: 928mbps
  • 10 streams: 980mbps

Maybe blame the ISP for its PPPoE? I tested with another customer who has an IP range via PPPoE, but just one FTTH bridge and just one TZ300, so no HA pair. He gets full speed according to the specs of the TZ300; the old/slow CPU is the limitation, so blaming PPPoE was off the table.

Furious, I installed an Ubuntu 24.04.1 VM with WarpSpeed (WireGuard server for dummies) at both complaining customer sites and tested it over my 1gbps connection.

iPerf3 check over WarpSpeed tunnel, practically full internet speed:

  • 1 stream: 780mbps
  • 10 streams: 900mbps

So in conclusion, there is an extreme slowing down of the traffic in the IPSEC tunnel when the IPSEC tunnel terminates on the HA pair, and it doesn’t matter if it is Site2Site or Client2Site.

All firmware is current; I even tried the one-week-old 6.5.4.15-117.

Does anyone have a clue what I’m missing? Or should I open a SonicWALL support case, escalate this strange behaviour and hope for a solution?

-----------------------------------

UPDATE - WORKING, but without any work from me, so probably the ISP (who said "nothing changed on RADIUS")

On Tuesday, October 29th 2024, I collected data/tests etc. again. As mentioned in a post below, I had an appointment in the evening with one of my customers to install my spare TZ400 in his environment. So I first checked in my office that it was registered, updated etc., and wanted to preconfigure it, document my test IP ranges etc.

Then I ran the test to this client from my “slow” G.fast 500/100 connection, where I had a spare fixed IP, and… WTF? 18MByte per second SMB, not 7.5MByte per second?

Configured X2 of this spare TZ400 on my fast 1gbps link, and WTF^2 - 55MByte/sec? Yesterday, with my own HA TZ270 it was 7.5MByte/sec (for these quick-and-dirty tests, just a 1gbyte test-file).

So, what changed? Nothing on my side, no config changes nor Windows-Updates, nothing.

Then with my same device (Notebook), I moved the same 1gbyte test-file again over my still existing Site2Site VPN (HA 270 on my side ↔ HA 670 on their side, both on 1gbps) and… 55MByte/sec, during the day when everyone was working!

Did the same with the HA TZ400 customer, who has the same ISP with 8 fixed IPs and a slightly smaller Mikrotik RB3011 router - 55MByte/sec, exactly the same speed.

OK. Am I nuts? Maybe, but excited like a toddler in a candy store.

Then tested with the customer with the “oldest installation” HA TZ300 on G.fast 475/100mbps: 7.8MByte/sec, not 7.5MByte/sec. OK, this is slightly better, but still slow.

Logged onto their TZ300 Firewall → System → Diagnostics → Diagnostic Tool → changed to “Multi-Core Monitor” and… this box has just 2 cores (the TZ400 has 4 cores). Core 1 was below 10%, Core 2 was at 99%. I canceled the file transfer and Core 2 dropped to around 2%.

So the TZ300 is definitely too slow (and old) for today’s speeds. My recommendation will be to replace it with at least 2 x TZ270, if not something higher once they move to 1gbps fibre.

Then I ran the same test procedures again from my colleague’s 1gbps FTTH connection (he has just a bridge with DHCP, dynamic IPv4), using my test VM on his side. The exact same speed with SMB.

So, as I now have fast SMB, let’s test iPerf3 (this is during the day right now, so it should be a little higher at night, when I normally test and don’t disturb 25+ users).

Customer 1 (HA TZ670, 1gbps fibre)

HA TZ270 → HA TZ670 over IPSEC

  • 1 stream: 430mbps
  • 10 streams: 855mbps

HA TZ270 ← HA TZ670 over IPSEC (other direction)

  • 1 stream: 663mbps
  • 10 streams: 770mbps

Customer 2 (HA TZ400, 1gbps fibre)

HA TZ270 → HA TZ400 over IPSEC

  • 1 stream: 487mbps
  • 10 streams: 535mbps

HA TZ270 ← HA TZ400 over IPSEC (other direction)

  • 1 stream: 300mbps
  • 10 streams: 600mbps

Customer 3 (HA TZ300, G.fast 475/100 not 1gbps fibre, CPU2 is the limiting factor)

HA TZ270 → HA TZ300 over IPSEC

  • 1 stream: 64mbps
  • 10 streams: 140mbps

HA TZ270 ← HA TZ300 over IPSEC (other direction)

  • 1 stream: 54mbps
  • 10 streams: 55mbps

Conclusion: “Nobody has done nothing”, it simply works as intended.

As I have no evidence, the only logical conclusion is that my ISP changed “something” in its environment, whether on purpose or just through some reboot/reset, namely on its RADIUS servers, which are responsible for the PPPoE + IP range distribution for customers with routers and IP ranges. All other customers normally don’t have PPPoE; they have DHCP and a dynamic or 1 fixed IP.

I say this because I opened a ticket a couple of weeks ago for this exact reason, as it looked like an ISP issue (which they denied).

OR

They moved the connection from their RADIUS1 cluster to their RADIUS2 cluster, because one is older and had an issue in the past. But as I didn’t track the routing with traceroute (my bad), I cannot confirm this.

So, thanks to all of you for your input and support, and for keeping my standards and morale up in the search for the solution.

Except for the ISP. I have to talk with them at C-level.

Have a nice day!

Kudos to you for being so methodical in your debugging and communicating it here objectively. Your organization is lucky to have you.

Wow, the level of due diligence and documentation you provided is pretty amazing. Not something I normally see here. I’ve got next to nothing as far as an answer; you’ve already answered all my questions. Only thing I can think of is maybe an HA setting. My crappy recommendation, I know it’s not a good recommendation, is to break the HA pair. I mean ensure the secondary is powered down. Disable HA on the primary. Not the active/passive device. The primary as far as licensing is concerned. Turn HA off on the primary. Reboot the primary and test again. Again, crappy answer, but once we are this far down the rabbit hole, grasping at straws is ok.

Are you using a single cable for your HA link between the sonicwall appliances, or two cables?

So if you break the HA pair, you get better speeds?

Wow…OP wasn’t joking when they put “long post” in the title/subject :laughing:

Haven’t read it yet, but will give it a read.

any updates on this? great details and troubleshooting so far!!!

Does it perform fine for local transfers?

Thank you for your appreciation. It helps keep up my morale and my will to solve it.

As you can see in my answer above, I negotiated with the HA TZ670 client that I can install one of my spare TZ400s on his router (he has two free IPs left) tomorrow.

Then it will go [ISP]<-Optic->[FTTH-Router]<-RJ-45 Cat 6a->[SonicWALL TZ400]<-RJ-45 Cat 6a->[Hypervisor]<-VM Network->[Test-VM Windows 2022].

I will keep you informed.

The older TZ300 / TZ400: one cable, as they just support one cable.
The newer TZ670: both X7 and X8.

I forgot to mention that I physically unplugged one box (the secondary) from power to simulate just one box; still the same problem.

I believe that is what he is saying, waiting for confirmation. I have HA pairs too and WFH 100% over GVC; I get about 100mb/s at home with 1/1g fiber, and my colo has 1/1g fiber as well. So luckily my speed isn’t as bad as his, but it’s interesting to see this. I’m hoping support can replicate this, as sooooooooo many people complain about this exact same issue.

Nope.
When I break the HA, it is still the same slow speed of 7.5MByte/sec SMB.

Hi,
I updated the OP (not sure if this is the correct way to update/close this post).

Hope this helps.

What do you mean by local transfers? Those that don’t go over the firewall, just via the switch?
Yes, they work according to LAN speed, so 1 / 2.5 / 10gbps.

I worked for Sonicwall from 2007 until my position was eliminated in the last round of layoffs a couple of months ago. Primarily in support, so I’m speaking from experience when I say I’m highly impressed with your level of diligence. It’s guys like you that made my job fun but challenging. You have already done all the hard work, so there is not a lot for me to suggest, which is the challenging part. I had one customer who was a large msp who always ended up on my plate. The tech was so good, we wouldn’t even troubleshoot. I would just gather all the data he had already collected and open up a bug report. You remind me of him. I look forward to seeing the results of your testing.

So it’s not HA specific then?

That is my issue. I didn’t revert it to a non-HA unit, I just unplugged the power adapter. For a full test, tomorrow I have the unique opportunity to go to the HA TZ670 client, connect one of my spare TZ400s (an older, but still OK model) directly to his 1gbps router, and attach it to the second hypervisor with a gigabit LAN cable.

As I have the full control (and permission) for those tests because he’s affected too, I will boot-up a VM and then test again vice versa over router ↔ TZ400 ↔ VM.

If the speed is the expected 50+ MByte/sec (400mbps) of VPN throughput, it is still an HA issue. If it’s not, then the SonicWALL devices cannot process IPSEC correctly through a routed environment, because I tried four different routers (ZyXEL G.fast XMG3927, old ZyXEL SBG3500 FTTH, Mikrotik hEX S, Mikrotik RB3011), and on every router I get the exact same SMB speed of 7.5MByte/sec (60mbps).

Have you tested your MTU is correct?

The MTU on PPPoE? Yes, see OP. I tried different MTUs as well (divisible by 8, so 1492, 1460, 1404 etc.), but got an MTU mismatch from the PPPoE servers.

Then changed back to 1492.

The MTU on the normal network side was and still is 1500.

And you tested the MTU transmission with ping to check for fragmentation? I’ve seen some very strange results with PPPoE…

ping www.google.com -f -l 1492

Decrease 8 bytes at a time until you get a clean result…
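The step-down procedure above could be automated roughly like this. This is a sketch under stated assumptions, not a tested tool. It uses the Windows `ping` flags from the command above plus `-n 1`, and it treats a reply as clean when the exit code is 0 and the output contains `TTL=`, which is a heuristic. Note that `-l` sets the ICMP payload, and an IPv4 echo adds 28 bytes of headers (20 IP + 8 ICMP), so `-l 1492` produces a 1520-byte packet; on a 1492-byte PPPoE MTU the largest payload that fits is 1464. The function names are made up for illustration.

```python
import subprocess
from typing import Callable, Optional

IPV4_ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP echo header


def max_icmp_payload(mtu: int) -> int:
    """Largest ping payload that fits into one unfragmented IPv4 packet of size `mtu`."""
    return mtu - IPV4_ICMP_OVERHEAD


def ping_df(host: str, payload: int) -> bool:
    """Send one echo with Don't Fragment set (Windows ping syntax); True on a clean reply."""
    result = subprocess.run(
        ["ping", "-f", "-l", str(payload), "-n", "1", host],
        capture_output=True, text=True, errors="replace",
    )
    # Heuristic (assumption): a real echo reply line contains "TTL=".
    return result.returncode == 0 and "TTL=" in result.stdout


def find_clean_payload(probe: Callable[[int], bool], start: int,
                       step: int = 8, floor: int = 1200) -> Optional[int]:
    """Walk down from `start` in `step`-byte decrements until `probe` succeeds."""
    size = start
    while size >= floor:
        if probe(size):
            return size
        size -= step
    return None  # nothing passed above `floor`
```

For example, `find_clean_payload(lambda size: ping_df("www.google.com", size), start=1472)` starts at the largest payload a 1500-byte MTU allows; adding 28 to the returned payload gives the usable path MTU.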