Cloud VPN HA topology vs regional GCP outage

keftes · March 10, 2025, 4:22pm

We’re looking to use Cloud VPN to establish connectivity from on-premise to a single VPC on GCP. The VPC is configured to support global route propagation, which means a single Cloud Router with a single Cloud VPN gateway can route traffic to all the subnets/regions in our VPC.

The described topology looks like this: https://cloud.google.com/network-connectivity/docs/vpn/images/ha-vpn-gcp-to-on-prem-2-a.svg

However, in the above design, while having two VPN tunnels gives us some redundancy for our connections, a GCP regional failure would take out both our Cloud Router and the VPN gateway itself.

For this reason we’re considering adding a second VPN gateway and a second Cloud Router in another region. My question is: if we have two Cloud Routers, two VPN gateways and two pairs of tunnels, with the second set in a second region, how will the VPC be able to route traffic back to on-premise?

For example:

GCP VPC (10.100.0.0/24) —> on-premise network (10.200.0.0/24)

With two Cloud Routers, how will a decision be made so that traffic can egress the GCP VPC and end up in the single on-premise network?

Does all this make sense? (protecting your VPN link from a regional GCP failure)

greenlakejohnny · March 10, 2025, 4:22pm

With two Cloud Routers, how will a decision be made so that traffic can egress the GCP VPC and end up in the single on-premise network?

The short answer is “BGP”.

The longer answer is when advertising your routes to BGP, their cloud router will translate the BGP MED (metric) numerical value to a route priority number. The lower the number, the better the route.

You can use this to influence path selection into your network, and run both paths active/active with each region taking its optimal path.

For example, let’s say you advertise metric=10 to region X and metric=25 to region Y. Traffic from region X or region Y will take the direct route.

Now where it gets really interesting is say you have region Z. The latency between region X and Z is 40ms, and Y to Z is 5ms.

GCP will automatically set the priority of the Z via X path to 250 (200 + 40 + 10) and the priority of the Z via Y path to 230 (200 + 5 + 25).

Even though the Y region is your backup VPN, traffic will take this path since 230 is less than 250.
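
The arithmetic in this example can be sketched in a few lines of Python. This is only a model of the simplified formula used in this post (a base of 200, plus inter-region latency, plus the advertised MED), which is what produces the totals of 250 and 230. The function name and the `BASE_COST` constant are assumptions made for illustration; the real inter-region cost is determined by Google Cloud and is not something you compute yourself.

```python
# Simplified model of the example in this post; not GCP's actual algorithm.
BASE_COST = 200  # assumed base added to cross-region routes in this example


def route_priority(med: int, inter_region_latency_ms: int) -> int:
    """Priority of a learned route as seen from another region (lower wins)."""
    return BASE_COST + inter_region_latency_ms + med


# Paths from region Z to on-prem: MED advertised to X or Y, latency from Z.
paths_from_z = {
    "via X": route_priority(med=10, inter_region_latency_ms=40),
    "via Y": route_priority(med=25, inter_region_latency_ms=5),
}

best_path = min(paths_from_z, key=paths_from_z.get)
print(paths_from_z)  # {'via X': 250, 'via Y': 230}
print(best_path)     # via Y
```

In this model, the "backup" region Y wins for traffic from Z because its lower latency outweighs its higher MED.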

Cidan · March 10, 2025, 4:22pm

Hm, if I’m reading this correctly (and please point out if I’m not), you’ll want to split the cloud router’s routes so the one in central routes only 10.0.1.0/24 and the one in west only routes 10.0.2.0/24. This way each region is compartmentalized and is responsible for its own routing.

Cidan · March 10, 2025, 4:22pm

/u/keftes This is also a really good solution here, but be aware that you’ll pay for GCP bandwidth across your regions, which can be a non-trivial cost.

keftes · March 10, 2025, 4:22pm

The longer answer is when advertising your routes to BGP, their cloud router will translate the BGP MED (metric) numerical value to a route priority number. The lower the number, the better the route.

Does this essentially mean that I would designate the Cloud Router in region A with higher priority than the router in region B using the advertised MED value? That way, while both my Cloud Routers will advertise the same routes to my VPC, the route with the lowest MED will be preferred by the on-prem device?

Edit: I think I got this now, thanks.
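
As a rough sketch of that idea, the snippet below models an on-prem device that receives the same prefix from two Cloud Routers and prefers the lowest MED, then falls back to the other region when the first one disappears. The region names, the MED values of 100 and 200, and the `Advertisement` / `best_route` names are all hypothetical. Real BGP best-path selection looks at other attributes (such as local preference and AS path length) before MED, and MED is normally only compared between routes learned from the same neighboring AS, so treat this as an illustration of the preference order only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advertisement:
    prefix: str
    region: str  # region of the Cloud Router that sent the route
    med: int     # lower MED is preferred


def best_route(adverts, prefix, failed_regions=frozenset()):
    """Pick the lowest-MED advertisement for a prefix from regions still up."""
    candidates = [
        a for a in adverts
        if a.prefix == prefix and a.region not in failed_regions
    ]
    return min(candidates, key=lambda a: a.med, default=None)


# Both routers advertise the same VPC range; region-a is the preferred path.
adverts = [
    Advertisement("10.100.0.0/24", "region-a", med=100),
    Advertisement("10.100.0.0/24", "region-b", med=200),
]

primary = best_route(adverts, "10.100.0.0/24")
after_outage = best_route(adverts, "10.100.0.0/24", failed_regions={"region-a"})
print(primary.region)       # region-a
print(after_outage.region)  # region-b
```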

keftes · March 10, 2025, 4:22pm

Thanks for the reply. To better explain this, let’s assume that I have a single VPC with 5 subnets, each in a different region.

To set up connectivity with an on-prem network we’re going to deploy a Cloud VPN gateway (which is a regional resource) and a Cloud Router in the same region. With global route propagation on the VPC, this single router can handle routing traffic to all 5 subnets. If the router’s region goes down, however, that’s an outage (from the point of view of connectivity between on-prem and those 5 subnets).

You’re suggesting to split the routes across two cloud routers (located in two different regions). This would work if you had two subnets (located in two regions). But if you have more than two, then this would open us up to an outage, if one of the routers (or its whole region) went down, no?

I’m more or less looking for a reasonable topology to ensure that my “HA Cloud VPN” can support routing on-prem traffic to multiple subnets of my VPC (subnets located in different regions) in the event of a single region failure (losing two GCP regions is not a scenario I’m pursuing, since then we’re talking about DR). So more or less, having two routers / VPN gateways / tunnels - all in two different regions.

I’m a bit surprised that this isn’t a valid pattern (I’ve searched a lot through the docs), since regional outages do happen and you don’t want to lose the whole link to on-prem when that occurs.

Cidan · March 10, 2025, 4:22pm

I think the pattern you are describing is a bit unusual.

To clarify, are you saying that you will send all traffic from every region, across regions, to your VPN link that sits in a single region today? If so, this would cost you in terms of cross-region traffic, which most people try to avoid. What I’ve seen people do in these situations is give every region its own VPN gateway and Cloud Router, with each gateway and router responsible only for the subnet in the region it exists in.

Let’s look at your example: If you have 5 subnets (a subnet per region), then you would have 5 Cloud VPN gateways and 5 Cloud Routers. Each Cloud Router only advertises the subnets within the region it is created (i.e. do not select “Advertise all subnets visible to the Cloud Router” in the UI if you’re doing this manually). This way, if a region goes down, only that region’s link back to your on-prem network goes down – the other 4 continue to work as expected.
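
To make the difference concrete, here is a small Python model comparing one router that advertises every subnet with one router per region that advertises only its local subnet. The region names, CIDR ranges, and the `reachable_subnets` helper are hypothetical and only model which subnets on-prem can still reach after a single region fails; this is not a Cloud Router API.

```python
# Hypothetical region names and subnet ranges, for illustration only.
SUBNETS = {
    "region-1": "10.0.1.0/24",
    "region-2": "10.0.2.0/24",
    "region-3": "10.0.3.0/24",
    "region-4": "10.0.4.0/24",
    "region-5": "10.0.5.0/24",
}


def reachable_subnets(router_adverts, failed_region):
    """Subnets on-prem can still reach after one region fails.

    router_adverts maps a router's region to the set of subnets it advertises.
    """
    reachable = set()
    for router_region, subnets in router_adverts.items():
        if router_region == failed_region:
            continue  # the router and its VPN gateway are lost with the region
        reachable |= subnets
    # The failed region's own subnet is down regardless of who advertises it.
    return reachable - {SUBNETS[failed_region]}


# One router in region-1 advertising all subnets (global route propagation).
single_router = {"region-1": set(SUBNETS.values())}
# One router per region, each advertising only its local subnet.
per_region_routers = {region: {cidr} for region, cidr in SUBNETS.items()}

print(len(reachable_subnets(single_router, "region-1")))       # 0
print(len(reachable_subnets(per_region_routers, "region-1")))  # 4
```

With the single router, losing its region cuts on-prem off from every subnet; with per-region routers, only the failed region's subnet is lost.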

I think what you’re trying to do is treat Cloud VPN/Cloud Routers as a “global load balancer” of sorts, but this is incorrect. You should be treating these as regional resources, and scope their work to only the region in which they exist. In this model, you are compartmentalizing a region to be fully independent of any other region.

Again, if I’m misunderstanding what you’re trying to do, I apologize.

keftes · March 10, 2025, 4:22pm

If you have 5 subnets (a subnet per region), then you would have 5 Cloud VPN gateways and 5 Cloud Routers. Each Cloud Router only advertises the subnets within the region it is created (i.e. do not select “Advertise all subnets visible to the Cloud Router” in the UI if you’re doing this manually). This way, if a region goes down, only that region’s link back to your on-prem network goes down – the other 4 continue to work as expected.

I think you nailed this. Separate gateway/router/tunnels per supported subnet (in a dedicated region). Each router responsible for its own region.

In the above scenario, would it make sense to have two routers per region? I’m just thinking of a potential failure of the Cloud Router now. Redundancy is an important factor for me I’m afraid. With the current budget, Cloud VPN is all I have.

Cidan · March 10, 2025, 4:22pm

If you’re using an HA VPN, there’s no need so long as you’re configured across two interfaces in your VPN gateway peer. We will automatically launch two Cloud Routers behind the scenes, one for each interface.

keftes · March 10, 2025, 4:22pm

We will automatically launch two Cloud Routers behind the scenes, one for each interface.

Fantastic! That covers that.

If we’re worried about the single VPN Gateway per region (there is periodic maintenance - according to the docs), we can simply provision a second one in the region, with its own set of tunnels (using the same Cloud Router as the primary) I would assume.

In that case, would the Cloud Router be able to determine which gateway (one of the two in the same region) to use when routing back to on-prem?

Regardless, thanks a lot, this was really helpful.

Cidan · March 10, 2025, 4:22pm

I’m not sure off the top of my head. There’s only one way to find out, though.

You’re very welcome!