gRPC is great and amazing. So is AWS. But putting them together means learning a few new things.
gRPC Just Loves HTTP2
Ok, so you read why prefab.cloud switched to gRPC and you're ready to go! Your first learning will be that traditional ELBs won’t work at all for gRPC. Ok, fine fine. We should be able to use the newer and recommended ALBs, right? Actually no.
gRPC is an unapologetic HTTP2 native. If gRPC is a millennial on their HTTP2 smart phone, AWS is… a bit more like a Gen X-er trying to snap the chats.
ALBs only sort of support HTTP2
While ALBs do “sort of” support HTTP2, they don’t do so in a way that is sufficient for gRPC. ALBs will happily accept multiplexed HTTP2 requests from clients, but they then de-multiplex them and forward them on to the targets as HTTP 1 requests.
This is well and good if you are trying to upgrade a traditional API that sits behind an ALB so that you can get the load-balancing benefits of multiplexed HTTP2, but gRPC is much more interested in a different part of the HTTP2 spec, namely its support for long-lived connections. Built-in bi-directional streaming is one of the big selling points of gRPC and that requires full-fledged HTTP2 support. Prefab.cloud uses this streaming support to quickly push feature flag updates to all clients.
So what is to be done? Well, there are two basic approaches that we can take. The first is to eschew AWS load balancers entirely, set up an nginx/traefik/envoy tier and take load balancing into our own hands. That’s definitely a viable option, but it is also a not insignificant piece of infrastructure that you might not have wanted to build and run.
AWS Network Load Balancers (NLBs) to the rescue
The second option is to use AWS’s newest load balancer, the shiny new “Network Load Balancer” or NLB. NLBs are significantly different from ALBs. They are built for truly massive scale & absolutely minimal latency. Nothing is free however, so you will lose quite a few things you may have enjoyed. In particular NLBs are "Layer 4" load balancers, whereas ALBs are "Layer 7". If you're a bit rusty on your network layers, what that means is that NLBs just see TCP connections and balance them pretty blindly. At layer 7, ALBs can inspect the traffic, since they know about HTTP and HTTPS. The good news as far as we're concerned however is that since the NLB doesn't actually know anything about HTTP1 or HTTP2 it also doesn't know enough to get in our way!
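To make "balance them pretty blindly" concrete, here is a minimal sketch of a layer 4 relay in Python. It is only an illustration of the idea, not how an NLB is implemented, and the names `pipe` and `start_l4_relay` are made up for this example. The point is that the relay copies bytes in both directions without ever parsing them, so an HTTP2 connection passes through untouched.

```python
import socket
import threading


def pipe(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes from src to dst until src closes; never inspects them."""
    try:
        while chunk := src.recv(4096):
            dst.sendall(chunk)
    except OSError:
        pass  # one side went away; nothing more to relay
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)  # pass the end-of-stream along
        except OSError:
            pass


def start_l4_relay(backend_host: str, backend_port: int) -> int:
    """Listen on an ephemeral local port and relay every connection to the
    backend. Returns the port the relay is listening on."""
    listener = socket.create_server(("127.0.0.1", 0))

    def accept_loop() -> None:
        while True:
            client, _ = listener.accept()
            upstream = socket.create_connection((backend_host, backend_port))
            # One thread per direction. The relay has no idea whether the
            # bytes are HTTP1, HTTP2, or something else entirely.
            threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
            threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return listener.getsockname()[1]
```

A layer 7 balancer, by contrast, would have to read and understand the request before deciding what to do with it, which is exactly where the ALB's HTTP2 handling gets in the way.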
Downsides: AKA You Don't Miss AWS Certificate Manager Until It's Gone
So sadly, at the time of writing, it's no longer an option to terminate SSL at the LB layer if you use an NLB. For gRPC the good news is that the gRPC servers are very ready to do SSL for you. The more annoying news is that you’re going to need to spend some money & effort distributing your private key to your various instances. We dive deeper into that in ECS and EFS for SSL for your NLB.
gRPC Healthchecks are a Royal PITA
The second head scratcher we ran into was that there’s no way for an NLB to properly health check a gRPC service. This one I’m blaming on the gRPC side of the fence. gRPC only accepts HTTP2 and its health check needs to be a specific HTTP2 POST. This breaks approximately every health-check system that expects to be able to send a simple HTTP GET to an endpoint and treat a 200 as meaning the service is up.
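To see why a plain GET won't cut it, here is a rough sketch of what a gRPC health check request looks like on the wire, assuming the service implements the standard `grpc.health.v1.Health` service (the article doesn't say which health service is in play, so treat that as an assumption). The helper names are invented for this example, and the header dict lists only the most relevant headers, not a complete request.

```python
import struct

# Key request headers for the standard gRPC health service. Note the POST
# and the gRPC content type; there is no GET variant.
HEALTH_CHECK_HEADERS = {
    ":method": "POST",
    ":path": "/grpc.health.v1.Health/Check",
    "content-type": "application/grpc",
    "te": "trailers",
}


def encode_varint(n: int) -> bytes:
    """Protobuf base-128 varint encoding of a non-negative integer."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def health_check_request(service: str = "") -> bytes:
    """Protobuf-encode HealthCheckRequest, whose only field is `service`
    (field number 1, a string). An empty string encodes as zero bytes."""
    if not service:
        return b""
    name = service.encode("utf-8")
    return b"\x0a" + encode_varint(len(name)) + name


def grpc_frame(message: bytes) -> bytes:
    """Wrap a message in gRPC framing: a 1-byte compressed flag (0 here)
    followed by a 4-byte big-endian length, then the message itself."""
    return struct.pack(">BI", 0, len(message)) + message
```

So even the simplest possible check, `grpc_frame(health_check_request())`, is a binary body sent as an HTTP2 POST, and the result comes back as a protobuf message plus trailers rather than as a friendly 200.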
Our current solution here is to run two processes on each node. Our deployable is a simple Dropwizard Java app that basically just has a health check. That Dropwizard app then spins up the gRPC server on a separate port. It makes for an imperfect health check since it bypasses the actual “gRPC” part of gRPC, but the health check can still verify that the underlying infrastructure of the service is sound.
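The shape of that workaround can be sketched with nothing but the Python standard library. This is a stand-in for the Dropwizard app, not our actual code: the `/healthcheck` path, the default port 8081, and the function names are all assumptions for illustration, and the gRPC server itself (which would be started alongside on its own port) is left out because it needs the gRPC library.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers a plain HTTP GET, the only thing the load balancer needs."""

    def do_GET(self) -> None:
        if self.path == "/healthcheck":
            status, body = 200, b"ok"
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args) -> None:
        pass  # keep probe traffic out of the logs


def start_health_server(host: str = "0.0.0.0", port: int = 8081) -> ThreadingHTTPServer:
    """Serve the health endpoint on a background thread and return the
    server. The gRPC server would listen on a different port."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

You would then point the load balancer's health check at the HTTP port and its listener at the gRPC port. As noted above, a 200 here only tells you the process is alive, not that gRPC calls are succeeding.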
- You can't run gRPC on traditional AWS load balancers (ELBs and ALBs); use an NLB instead
- You will need to DIY SSL Termination
- NLB health checks of gRPC require a bit of a workaround
One final note is that the load balancing this setup achieves is one where all requests from a single client connection will go to the SAME instance. In my humble opinion that is… kind of the point of gRPC (long-lived connection, low overhead, streaming) but depending on your load profile it’s important to understand.
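A toy model of that behavior, if it helps: a layer 4 balancer picks a target once per TCP connection, so every RPC multiplexed over that connection lands on the same instance. The hash below is purely illustrative and is not AWS's actual algorithm; `pick_target` and the addresses are made up for the example.

```python
import hashlib


def pick_target(flow: tuple, targets: list) -> str:
    """Map a TCP flow (src ip, src port, dst ip, dst port) to one target."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return targets[int.from_bytes(digest[:8], "big") % len(targets)]


targets = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]
flow = ("203.0.113.7", 51234, "198.51.100.1", 443)

# Every RPC sent over this one connection shares the same flow, so the
# routing decision never changes no matter how many requests are made.
chosen = {pick_target(flow, targets) for _rpc in range(1000)}
```

Here `chosen` ends up holding exactly one address. If a handful of clients generate most of your traffic, that is the load profile where this stickiness starts to matter.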