VNG rebooting every month

I have been tracking a case with MSFT support for a couple of months now. At a random time every month we see our VNGs reboot. The primary connection to the VNG goes offline for about 10-15 minutes, comes back online, and then the secondary connection to the VNG goes offline for about 10-15 minutes.

Does anyone else that’s using VNG see this as well?

10-15 minutes sounds weird, having a case open for months (!) even weirder.

Have you reviewed the HA article? https://learn.microsoft.com/en-us/azure/vpn-gateway/vpn-gateway-highlyavailable#about-vpn-gateway-redundancy

Every Azure VPN gateway consists of two instances in an active-standby configuration. For any planned maintenance or unplanned disruption that happens to the active instance, the standby instance would take over (failover) automatically, and resume the S2S VPN or VNet-to-VNet connections. The switch over will cause a brief interruption. For planned maintenance, the connectivity should be restored within 10 to 15 seconds. For unplanned issues, the connection recovery is longer, about 1 to 3 minutes in the worst case.

This is normal maintenance. You can’t control Azure VPN Gateway maintenance procedures. You need to build your design to accommodate the monthly maintenance.

Need to confirm: does a VNG live in a single DC, or is it a regional service by default? Maybe try to make it redundant if these 15 minutes matter.

Hmm, was that planned maintenance on the GW? What’s your VNG SKU?

We are using the active-active VPN topology, where we have active VPN connections to both the primary and secondary endpoints of the VNG.
What I really can’t wrap my head around is this monthly maintenance. Why does the VNG specifically need monthly maintenance, and why does it have to occur at a random time? How is anyone supposed to reliably monitor connectivity via the VPN connection status or BGP routes if they are just going to go down without notice?

If by redundant you mean forming a VPN tunnel to both endpoints of each VNG and using BGP to learn and advertise routes, that is what we currently have.

When these VNGs go through a reboot, all of the primary endpoint tunnels at each location go down and come back up.
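For what it’s worth, this is roughly how we keep an eye on the tunnel state from the Azure side. It’s a minimal sketch using the azure-mgmt-network Python SDK; the resource group and connection names below are just placeholders, not our real resources:

```python
# Minimal sketch: poll S2S connection status for both active-active tunnels.
# Assumes azure-identity and azure-mgmt-network are installed; names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-connectivity"                           # placeholder
CONNECTIONS = ["cn-onprem-primary", "cn-onprem-secondary"]   # placeholder connection names

client = NetworkManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

for name in CONNECTIONS:
    conn = client.virtual_network_gateway_connections.get(RESOURCE_GROUP, name)
    # connection_status is one of: Unknown, Connecting, Connected, NotConnected
    print(f"{name}: {conn.connection_status} "
          f"(ingress={conn.ingress_bytes_transferred}, egress={conn.egress_bytes_transferred})")
```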

I am mostly looking to see whether this is standard behavior for a VNG, because to me it seems like there might be something wrong with the VNG itself.

I am told by Azure support that it is planned monthly maintenance that occurs randomly. Wrap your head around that :wink:

I’ll get back to you on the SKU.

The answer is that you don’t. It’s one of the main drawbacks for Azure VPN Gateway. This is really a non-issue for our organization as the maintenance windows don’t have any production impact as we have non-critical workload connectivity requirements to/from on-prem.

There isn’t a lot of documentation about this on the Azure side, but they perform the maintenance for instance upgrades and underlay upgrades. It’s a black-box architecture, so it’s important to realize when building your infrastructure on public cloud that this is a factor outside of your control. It’s also important to consider this as part of your overall design when weighing VPN Gateway against other options, like an SD-WAN NVA that can terminate IPsec connections to on-prem.

If you’re so inclined, you can create a maintenance configuration and attach it to the VPN Gateway to control the specific time of day when the VPN Gateway maintenance occurs, but you cannot control the day on which it occurs, only the time of day.
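A rough sketch of what that looks like with the azure-mgmt-maintenance Python SDK is below. The resource names are placeholders, and the exact model fields and operation names are from memory, so verify against the current SDK (or just do it in the portal) before relying on this:

```python
# Rough sketch: daily maintenance window for a VPN Gateway (names/fields are assumptions).
from azure.identity import DefaultAzureCredential
from azure.mgmt.maintenance import MaintenanceManagementClient
from azure.mgmt.maintenance.models import MaintenanceConfiguration, ConfigurationAssignment

SUBSCRIPTION_ID = "<subscription-id>"
RG = "rg-connectivity"      # placeholder resource group
GATEWAY = "vng-hub-01"      # placeholder VPN Gateway name

client = MaintenanceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# 1. Create the maintenance configuration: you pick the recurring daily window
#    (time of day); Azure still picks the day the gateway maintenance lands on.
config = client.maintenance_configurations.create_or_update(
    RG,
    "mc-vng-offhours",
    MaintenanceConfiguration(
        location="westeurope",
        maintenance_scope="Resource",
        start_date_time="2024-01-01 02:00",
        duration="05:00",                         # gateway maintenance needs a long window
        time_zone="W. Europe Standard Time",
        recur_every="Day",
    ),
)

# 2. Attach the configuration to the gateway.
client.configuration_assignments.create_or_update(
    resource_group_name=RG,
    provider_name="Microsoft.Network",
    resource_type="virtualNetworkGateways",
    resource_name=GATEWAY,
    configuration_assignment_name="mc-vng-offhours-assignment",
    configuration_assignment=ConfigurationAssignment(maintenance_configuration_id=config.id),
)
```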

Hope this helps,

I’ve set up probably a hundred different VNGs for different clients and I genuinely don’t think I’ve seen anything similar.

I am not a pro in the networking area, but I know there are banks and other large companies that use site-to-site VPN as the main (and only) connection between cloud and on-premises systems. If this 15-minute break couldn’t be avoided somehow, they would use Direct Connect instead, or as redundancy.