As a SonicWALL user for over 20 years, I now need help with some very strange behaviour and am calling on the swarm intelligence.
Three of my clients have a high availability installation according to spec. Nothing very special: router(s) from the ISP(s) with an IP range of at least 8 fixed IPs (normally PPPoE on WAN, MTU 1492), and behind the router the two boxes configured as an HA pair, normally on X1, with X2 for the second ISP.
I have clients with 2 x TZ300 / 2 x TZ400 / 2 x TZ670, so two older and a brand new one.
Everything seemed to work fine. I tested failover, different power rails / UPS etc., and it all works as intended.
Two months ago I had a service call: “slow data transfer over VPN”. The company has 1gbps symmetrical fiber; I tested with speedtest.net / fast.com and got over 900mbps in both directions, so my standard answer was “your home internet is the problem, let me check it quick”.
Boy, was I wrong. Home office user had 600+ mbps (over WLAN, but still) and only an SMB throughput of 7.5MByte/sec / 60mbps with GVC. Checked if my arch-nemesis “RSC” was active on her notebook - nope.
As I have 1gbps on my side too, I double-checked and got the exact same speed of 7.5MByte/sec!
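To put the SMB figure next to the link speeds, here is a minimal sketch of the unit conversion (1 byte = 8 bits) and of the share of the link that is actually used. The function names are made up for illustration; the numbers are the ones quoted above.

```python
def mbyte_per_s_to_mbps(mbyte_per_s: float) -> float:
    """Convert a file-copy rate in MByte/s to a line rate in mbps (1 byte = 8 bits)."""
    return mbyte_per_s * 8


def link_share_percent(throughput_mbps: float, link_mbps: float) -> float:
    """Percentage of the available link speed that the transfer actually uses."""
    return 100 * throughput_mbps / link_mbps


smb_mbps = mbyte_per_s_to_mbps(7.5)  # SMB copy over GVC
print(f"SMB copy: {smb_mbps:.0f} mbps")
print(f"Share of a 600 mbps home link: {link_share_percent(smb_mbps, 600):.0f}%")
print(f"Share of a 1000 mbps link: {link_share_percent(smb_mbps, 1000):.0f}%")
```

Getting the same small share on two very different links already hints that the bottleneck is not the home connection.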
I told her I would have to check on their side. Long story short: I swapped the router and patch cables at the customer site, and it was still the same.
Then the same complaint came from another company with the two TZ670. I tested and got the same 7.5MByte/sec SMB speed. I checked with my Site2Site VPN between my office and the first company: still 7.5MByte/sec SMB speed.
As SMB can be tricky, I went methodical. iPerf3 check over the same IPSEC tunnel:
- 1 stream: 64mbps
- 10 streams: 180mbps
iPerf3 check over Internet:
- 1 stream: 250mbps
- 10 streams: 980mbps
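For anyone who wants to repeat the 1-stream / 10-stream comparison, here is a rough sketch of how such a run could be scripted. It assumes `iperf3` is installed locally and that an `iperf3 -s` server is reachable on the other side; the host name in the usage comment is a placeholder, and the helper names are made up for illustration. It only uses the standard client options `-c`, `-P`, `-t`, `-R` and `-J` (JSON output).

```python
import json
import subprocess


def build_iperf3_cmd(server: str, streams: int, seconds: int = 10,
                     reverse: bool = False) -> list[str]:
    """Build an iperf3 client command line with N parallel streams and JSON output."""
    cmd = ["iperf3", "-c", server, "-P", str(streams), "-t", str(seconds), "-J"]
    if reverse:
        cmd.append("-R")  # server sends, client receives (tests the other direction)
    return cmd


def received_mbps(report: dict) -> float:
    """Read the summed receive rate from an iperf3 TCP JSON report, in mbps."""
    return report["end"]["sum_received"]["bits_per_second"] / 1e6


def run_test(server: str, streams: int) -> float:
    """Run one iperf3 test and return the received throughput in mbps."""
    result = subprocess.run(build_iperf3_cmd(server, streams),
                            capture_output=True, text=True, check=True)
    return received_mbps(json.loads(result.stdout))


# Example usage (placeholder host, not run here):
#   for n in (1, 10):
#       print(n, "stream(s):", round(run_test("iperf.example.net", n)), "mbps")
```

Running the same script once through the tunnel (against an internal address) and once over the plain Internet keeps the two measurements comparable.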
I tried to replicate it with another FTTH customer with a TZ400, but he has no HA configuration, and voilà:
iPerf3 check over IPSEC tunnel:
- 1 stream: 475mbps
- 10 streams: 560mbps
iPerf3 check over Internet:
- 1 stream: 928mbps
- 10 streams: 980mbps
Maybe blame the ISP for its PPPoE? I tested with another customer who has an IP range via PPPoE, but just one FTTH bridge and just one TZ300, so no HA pair. He gets full speed according to the specs of the TZ300 (the old/slow CPU is the limitation), so blaming PPPoE was off the table.
Furious, I installed an Ubuntu 24.04.1 VM with WarpSpeed (WireGuard server for dummies) at both complaining customer sites and tested it over my 1gbps connection.
iPerf3 check over WarpSpeed tunnel, practically full internet speed:
- 1 stream: 780mbps
- 10 streams: 900mbps
So in conclusion, traffic in the IPSEC tunnel is slowed down extremely when the tunnel terminates on the HA pair, and it doesn’t matter whether it is Site2Site or Client2Site.
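One way to make that comparison explicit is to express each tunnel result as a percentage of the plain Internet result measured on the same line. A small sketch using only the iPerf3 numbers listed above (the labels and the helper name are made up for illustration):

```python
# (1 stream, 10 streams) in mbps, taken from the iPerf3 results above
results = {
    "HA pair":         {"tunnel": (64, 180),  "internet": (250, 980)},
    "single TZ400":    {"tunnel": (475, 560), "internet": (928, 980)},
}


def tunnel_share(tunnel_mbps: float, internet_mbps: float) -> float:
    """Tunnel throughput as a percentage of the raw Internet throughput."""
    return 100 * tunnel_mbps / internet_mbps


for site, data in results.items():
    for label, tun, net in zip(("1 stream", "10 streams"),
                               data["tunnel"], data["internet"]):
        print(f"{site:14s} {label:10s} {tunnel_share(tun, net):5.1f}% of Internet speed")
```

With these figures the HA site reaches roughly a quarter of its Internet speed or less through the tunnel, while the single TZ400 stays above half.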
All firmware is current; I even tried the one-week-old 6.5.4.15-117.
Does anyone have a clue what I’m missing? Or should I open a SonicWALL support case, escalate this strange behaviour and hope for a solution?
-----------------------------------
UPDATE - WORKING, but without any work from me, so probably the ISP (who said "nothing changed on RADIUS")
On Tuesday, October 29th 2024 I collected data and ran tests again. As mentioned in a post below, I had an appointment that evening with one of my customers to install my spare TZ400 in his environment. So I first checked it in my office to see whether it was registered, updated etc., and wanted to preconfigure it, document my test IP ranges etc.
Then I ran the test to this client from my “slow” G.fast 500/100 connection, where I had a spare fixed IP, and… WTF? 18MByte per second, not 7.5MByte per second SMB?
I configured X2 of this spare TZ400 on my fast 1gbps link, and WTF^2: 55MByte/sec? Yesterday, with my own HA TZ270, it was 7.5MByte/sec (for these quick-and-dirty tests, just a 1gbyte test file).
So, what changed? Nothing on my side, no config changes nor Windows-Updates, nothing.
Then, with the same device (notebook), I moved the same 1gbyte test file again over my still-existing Site2Site VPN (HA 270 on my side ↔ HA 670 on their side, both on 1gbps) and… 55MByte/sec, during the day when everyone was working!
I did the same with the HA TZ400 customer, who has the same ISP with 8 fixed IPs and a slightly smaller MikroTik RB3011 router: 55MByte/sec, exactly the same speed.
OK. Am I nuts? Maybe, but excited like a toddler in a candy store.
Then I tested with the customer with the “oldest installation”, the HA TZ300 on G.fast 475/100mbps: 7.8MByte/sec, not 7.5MByte/sec. OK, this is slightly better, but still slow.
I logged onto their TZ300 firewall → System → Diagnostics → Diagnostic Tool → changed to “Multi-Core Monitor” and… this box has just 2 cores (the TZ400 has 4 cores). Core 1 was below 10%, Core 2 was at 99%. I canceled the file transfer and Core 2 dropped to around 2%.
So the TZ300 is definitely too slow (and old) for today’s speeds. My recommendation will be to replace it with at least 2 x TZ270, if not higher once they move to 1gbps fibre.
Then I ran the same test procedures again from my colleague’s 1gbps FTTH connection (he has just a bridge with DHCP, dynamic IPv4) with my test VM on his side. The exact same speed with SMB.
So, as I now have fast SMB, let’s test iPerf3 (this is during the day, so it should be a little higher at night, when I normally test so that I don’t disturb 25+ users):
Customer 1 (HA TZ670, 1gbps fibre)
HA TZ270 → HA TZ670 over IPSEC
- 1 stream: 430mbps
- 10 streams: 855mbps
HA TZ270 ← HA TZ670 over IPSEC (other direction)
- 1 stream: 663mbps
- 10 streams: 770mbps
Customer 2 (HA TZ400, 1gbps fibre)
HA TZ270 → HA TZ400 over IPSEC
- 1 stream: 487mbps
- 10 streams: 535mbps
HA TZ270 ← HA TZ400 over IPSEC (other direction)
- 1 stream: 300mbps
- 10 streams: 600mbps
Customer 3 (HA TZ300, G.fast 475/100 not 1gbps fibre, CPU2 is the limiting factor)
HA TZ270 → HA TZ300 over IPSEC
- 1 stream: 64mbps
- 10 streams: 140mbps
HA TZ270 ← HA TZ300 over IPSEC (other direction)
- 1 stream: 54mbps
- 10 streams: 55mbps
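To read these numbers side by side, a small sketch that computes how much each site gains from 10 parallel streams compared with a single one. The data is copied from the lists above; "to" means HA TZ270 → customer, "from" the other direction, and the helper name is made up for illustration.

```python
# (1 stream, 10 streams) in mbps, per customer firewall and direction
results = {
    ("TZ670", "to"):   (430, 855),
    ("TZ670", "from"): (663, 770),
    ("TZ400", "to"):   (487, 535),
    ("TZ400", "from"): (300, 600),
    ("TZ300", "to"):   (64, 140),
    ("TZ300", "from"): (54, 55),
}


def scaling(one_stream_mbps: float, ten_streams_mbps: float) -> float:
    """Factor by which throughput grows when going from 1 to 10 streams."""
    return ten_streams_mbps / one_stream_mbps


for (model, direction), (one, ten) in results.items():
    print(f"HA {model} {direction:4s}: {one:3d} -> {ten:3d} mbps "
          f"(x{scaling(one, ten):.2f})")
```

A factor close to 1 means extra streams add nothing, so something other than the single TCP connection is the limit. For the TZ300 in the "from" direction that fits the pegged core seen in the Multi-Core Monitor.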
Conclusion: “Nobody has done nothing”, it simply works as intended.
As I have no evidence, the only logical conclusion is that my ISP changed “something” in its environment, whether on purpose or through some reboot/reset, namely its RADIUS servers, which are responsible for the PPPoE + IP range distribution for customers with routers and IP ranges. All other customers normally don’t have PPPoE; they have DHCP and a dynamic IP or 1 fixed IP.
I say this because a couple of weeks ago I opened a ticket for this exact reason, as it looked like an ISP issue (which they denied).
OR
They moved the connection from their RADIUS1 cluster to their RADIUS2 cluster, because one is older and had an issue in the past. But as I didn’t track the routing with traceroute (my bad), I cannot confirm this.
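For next time, a minimal sketch of how the route could be tracked: store a timestamped hop list from `traceroute -n` and compare it with a later run. It assumes a Linux-style `traceroute` is installed; the helper names are made up for illustration, and the parser simply takes the first address reported per hop (`*` when a hop does not answer).

```python
import subprocess
from datetime import datetime, timezone
from itertools import zip_longest


def parse_hops(traceroute_output: str) -> list[str]:
    """Return the first address of every hop line from `traceroute -n` output."""
    hops = []
    for line in traceroute_output.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) >= 2:
            hops.append(fields[1])  # fields[0] is the hop number
    return hops


def changed_hops(before: list[str], after: list[str]) -> list[tuple[int, str, str]]:
    """List (hop number, old address, new address) for every hop that differs."""
    pairs = zip_longest(before, after, fillvalue="-")
    return [(i + 1, a, b) for i, (a, b) in enumerate(pairs) if a != b]


def snapshot(host: str) -> tuple[str, list[str]]:
    """Run traceroute once and return (UTC timestamp, hop list)."""
    out = subprocess.run(["traceroute", "-n", host],
                         capture_output=True, text=True, check=True).stdout
    return datetime.now(timezone.utc).isoformat(), parse_hops(out)
```

Saving one `snapshot()` per day against the customer's WAN address and feeding two of them to `changed_hops()` would show whether the path through the ISP moved between the "slow" and the "fast" days. Load-balanced paths can change hops without any real incident, so a difference is a hint, not proof.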
So, thank you all for your input, your support, and for keeping my standards and morale up during the search for a solution.
Except for the ISP. I have to talk with them at C-level.
Have a nice day!