The node app that won't go down

Except, surprise surprise, it's 2017, so

node.js

docker container!

Anyhow, let's minimise the likelihood of your service/app/website being down.
This presentation assumes AWS, because it seems to be the most common provider and it's the one I have worked with.

Main idea

Find out the SLA of each AWS service you rely on, and add redundancy where it reduces the risk of downtime the most.

Optimise for $cost/%downtime
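
To see why the individual SLAs matter: if the app needs every dependency to be up at once, their availabilities multiply. A rough sketch, assuming the dependencies fail independently. Only the 99.95% figure is the EC2 SLA; the other two numbers and the function name are placeholders, not real AWS figures.

```javascript
// Availability of a chain of dependencies that must ALL be up.
// Assumption: failures are independent, so availabilities multiply.
function combinedAvailability(availabilities) {
  return availabilities.reduce((total, a) => total * a, 1);
}

// 0.9995 is the EC2 SLA; 0.999 and 0.9999 are hypothetical placeholders.
const dependencies = [0.9995, 0.999, 0.9999];
const combined = combinedAvailability(dependencies);
console.log("combined availability: " + (combined * 100).toFixed(3) + "%");
```

The combined figure is always lower than the weakest dependency, which is where redundancy tends to pay off first.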

This we won't cover

Deployment and build pipeline resiliency.

AWS 101

  • Region (e.g. us-east-1 , eu-west-1)
  • Availability Zone (e.g. eu-west-1a, eu-west-1b)
  • Managed region-replicated services

So in a nutshell, how do you get node on AWS ?

spin ubuntu, ssh into it, git clone, nohup npm start

ECS


why ?

How should we run ECS ?

why should I tell you...
Reference architecture!
github.com/awslabs/ecs-refarch-cloudformation

What can go wrong ?

EC2 Instances

99.95%

AWS will use commercially reasonable efforts to make Amazon EC2 and Amazon EBS each available with a Monthly Uptime Percentage of at least 99.95% [1]
0.36 hours every month
where does that number come from ?
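
One way to answer that: a 99.95% uptime commitment leaves up to 0.05% of the month as downtime. A minimal sketch, assuming a 30-day month (720 hours); the function name is made up for illustration.

```javascript
// Downtime per month that an uptime percentage still allows.
// Assumption: a 30-day month, i.e. 720 hours.
function allowedDowntimeHours(uptimePercent, hoursInMonth = 30 * 24) {
  const downtimeFraction = (100 - uptimePercent) / 100;
  return downtimeFraction * hoursInMonth;
}

const hours = allowedDowntimeHours(99.95);
console.log(hours.toFixed(2) + " hours, roughly " + Math.round(hours * 60) + " minutes");
```

So 0.0005 × 720 hours gives the 0.36 hours on the slide.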
AutoScaling group helps
Application Load Balancer ?

let's hope not

It's a service managed by AWS, so it shouldn't fail. However, AWS has no exact reference saying that it is AZ-replicated and highly available.
NAT Gateway ?

nope

"Highly available. NAT gateways in each Availability Zone are implemented with redundancy. Create a NAT gateway in each Availability Zone to ensure zone-independent architecture." [1]
Internet Gateway ?

nope

"Internet gateway is horizontally scaled, redundant, and highly available. It therefore imposes no availability risks or bandwidth constraints on your network traffic." [1]
A whole AZ ?

maybe...

That's why we have multiple AZs. But how often does it happen ?
A whole region ?

maybe...

That's why we have multiple regions. But how often does it happen ?
A managed service ?

maybe...

Remember the S3 outage earlier this year ?
So how likely is our application to be down ?

P(A and B) = P(A) * P(B|A)

but generalise it to the number of EC2 instances you have
All 4 EC2 instances being down ?

6.25000e-14

assuming the probability of an instance being down is 0.05% (0.0005, from the 99.95% SLA) and that instances fail independently
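
The generalisation can be sketched as follows, using the 99.95% SLA figure from the EC2 slide. It assumes instances fail independently, which is optimistic when they share an AZ; the function name is made up for illustration.

```javascript
// Probability that every instance is down at the same moment.
// Assumption: failures are independent, so P(B|A) = P(B) and the
// chain rule collapses to p * p * ... * p = p^n.
function probabilityAllDown(pInstanceDown, instanceCount) {
  return Math.pow(pInstanceDown, instanceCount);
}

// 99.95% uptime means each instance is down with probability 0.0005 (0.05%).
const pDown = 1 - 0.9995;
for (const n of [1, 2, 4]) {
  console.log(n + " instance(s): " + probabilityAllDown(pDown, n).toExponential(2));
}
```

With 4 instances this gives the 6.25e-14 shown above.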
Want to be really paranoid ? Do you lose more money per 12 hours of downtime than the cost of running your app in 2 AZs ?
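
That question is a plain comparison. A sketch with made-up numbers; every figure and the function name are hypothetical, so plug in your own revenue and hosting data.

```javascript
// Is the extra redundancy worth paying for?
// Compares the money lost during an outage with the extra hosting
// cost over the same accounting period. All numbers are hypothetical.
function redundancyPaysOff(lossPerHour, outageHours, extraHostingCost) {
  const outageLoss = lossPerHour * outageHours;
  return outageLoss > extraHostingCost;
}

// Hypothetical shop: loses $200 per hour of downtime, and a second
// copy of the stack costs $1500 for the period being considered.
console.log(redundancyPaysOff(200, 12, 1500)); // true: 2400 > 1500

// Hypothetical personal blog: loses nothing while it is down.
console.log(redundancyPaysOff(0, 12, 1500)); // false
```

A stricter version would also weight the loss by how likely the outage is in that period.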

Some ideas

Go multi region

Run the same reference stack in a different region.

Go multi provider

  1. Switch to Kubernetes or Docker Swarm
  2. Profit
  3. Make sure your non-managed load balancers and orchestrators are not SPOFs

Tip #1

Think about the cost
Don't make your personal blog multi-region if you don't have to. Unless you're doing it for fun.
Remember the us-east-1 S3 failure that took down many things, among them Docker Hub ?
S3 is not supposed to fail (lol), but they could very easily have been multi region. (pic of the AWS console multi-region setting)
But Docker Hub is a free service (the paid version didn't go down), so why would they pay twice the hosting cost to avoid 8 hours of downtime in 3 years ?

Tip #2

break it

Many monkeys

Chaos Monkey - Latency Monkey - Conformity Monkey - Doctor Monkey - Janitor Monkey - Security Monkey - 10–18 Monkey - Chaos Gorilla
Chaos Monkey shuts down instances, Chaos Gorilla shuts down whole AZs
Pro Tip: Do it in a separate AWS account
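
The idea behind Chaos Monkey fits in a few lines. This is a toy illustration, not Netflix's implementation: the instance IDs are hypothetical, and `terminate` is an injected callback so the sketch runs without AWS credentials. In a real setup it would call the EC2 API.

```javascript
// Chaos Monkey in miniature: pick one instance at random and kill it.
// `terminate` is supplied by the caller; `random` can be replaced
// to make the choice deterministic.
function unleashMonkey(instanceIds, terminate, random = Math.random) {
  if (instanceIds.length === 0) return null;
  const victim = instanceIds[Math.floor(random() * instanceIds.length)];
  terminate(victim);
  return victim;
}

// Hypothetical fleet of four instances.
const fleet = ["i-aaa", "i-bbb", "i-ccc", "i-ddd"];
const killed = unleashMonkey(fleet, (id) => console.log("terminating " + id));
console.log("survivors:", fleet.filter((id) => id !== killed));
```

If the app stays up with the survivors, the redundancy works. If not, better to find out here than in production.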

Tip #3

use more "managed" things, offload HA concerns to AWS
where you can,

go serverless

because API Gateway and Lambdas will be managed, replicated and highly available out of the box :)

reference stack ++

github.com/joaojeronimo/paranoid-ha-ecs
helps you create, update and delete stacks in all regions
But how do you send traffic to all those ALBs ?
no idea
But I think NS1 has a product for that
(anycasted network plus one of the DNS traffic management things)
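
The core of what a DNS traffic management product does can be sketched as: keep a list of regional endpoints, health check them, and answer with a healthy one. A toy version, not NS1's or any other provider's API; the hostnames and the `healthy` flags are hypothetical stand-ins for real health checks.

```javascript
// Pick which regional load balancer to send traffic to.
// Prefers endpoints in list order (e.g. closest region first) and
// skips any that failed their last health check.
function pickEndpoint(endpoints) {
  const healthy = endpoints.filter((e) => e.healthy);
  return healthy.length > 0 ? healthy[0].host : null;
}

// Hypothetical ALBs, one per region.
const albs = [
  { host: "alb.eu-west-1.example.com", healthy: false }, // region is down
  { host: "alb.us-east-1.example.com", healthy: true },
];
console.log(pickEndpoint(albs)); // alb.us-east-1.example.com
```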