Anyhow, let's minimise the likelihood of your service/app/website being down
This presentation assumes AWS because it
seems to be the most common..?
it's the one I worked with
Main idea
Find out the SLA of the AWS services you rely on, and add redundancy to minimise
the risk of downtime where it will have the most impact.
Optimise for $cost/%downtime
This we won't cover
Deployment and build pipeline resiliency.
AWS 101
Region (e.g. us-east-1 , eu-west-1)
Availability Zone (e.g. eu-west-1a, eu-west-1b)
Managed region-replicated services
So in a nutshell, how do you get Node running on AWS ?
spin up an Ubuntu instance, ssh into it, git clone, nohup npm start
ECS
why ?
How should we run ECS ?
why should I tell you...
Reference architecture!
github.com/awslabs/ecs-refarch-cloudformation
What can go wrong ?
EC2 Instances
99.95%
AWS will use commercially reasonable efforts to make Amazon EC2 and Amazon EBS each available with a Monthly Uptime Percentage of at least 99.95% [1]
0.36 hours every month
where does that number come from ?
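The arithmetic behind it, as a quick sketch (730 is roughly the number of hours in an average month):

```python
# Monthly downtime implied by a 99.95% SLA
hours_per_month = 365 * 24 / 12        # ~730 hours in an average month
downtime_hours = (1 - 0.9995) * hours_per_month
print(round(downtime_hours, 2))        # 0.36
```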
AutoScaling group helps
Application Load Balancer ?
let's hope not
It's a managed service by AWS that shouldn't fail; however, AWS doesn't explicitly state anywhere that it is AZ-replicated and highly available
NAT Gateway ?
nope
"Highly available. NAT gateways in each Availability Zone are implemented with redundancy. Create a NAT gateway in each Availability Zone to ensure zone-independent architecture." [1]
Internet Gateway ?
nope
"Internet gateway is horizontally scaled, redundant, and highly available. It therefore imposes no availability risks or bandwidth constraints on your network traffic." [1]
A whole AZ ?
maybe...
That's why we have multiple AZs. But how often does it happen ?
A whole region ?
maybe...
That's why we have multiple regions. But how often does it happen ?
A managed service ?
maybe...
Remember the S3 outage earlier this year ?
So how likely is our application to be down ?
P(A and B) = P(A) * P(B|A)
but generalise it to the number of EC2 instances you have
All 4 EC2 instances being down ?
6.25000e-14
assuming the probability of an instance being down is 0.0005 (0.05%, from the 99.95% SLA)
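The slide's number can be reproduced with a tiny sketch (assuming instance failures are independent, which real correlated outages violate):

```python
def p_all_down(p_instance_down, n_instances):
    """P(all n instances down at once) = p ** n, assuming independent failures."""
    return p_instance_down ** n_instances

# 99.95% SLA -> 0.05% = 0.0005 downtime probability per instance
print(p_all_down(0.0005, 4))  # ~6.25e-14
```

Real failures are correlated (shared AZ, shared region, shared deploy), so treat this as an optimistic lower bound.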
Want to be really paranoid ? Do you lose more money per 12 hours of downtime than
the cost of running your app in 2 AZs ?
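That trade-off can be written down as a crude break-even check (a sketch; every number below is a made-up placeholder, not a real AWS price):

```python
def second_az_worth_it(loss_per_hour_down, downtime_hours_per_year, extra_cost_per_year):
    """Is the expected loss avoided by redundancy bigger than what it costs?"""
    return loss_per_hour_down * downtime_hours_per_year > extra_cost_per_year

# e.g. losing $1000/hour of downtime, ~4.4 h/year at 99.95%, +$3000/year for a 2nd AZ
print(second_az_worth_it(1000, 4.4, 3000))  # True
```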
Some ideas
Go multi region
Run the same reference stack in a different region.
Go multi provider
Switch to Kubernetes or Docker Swarm
Profit
Make sure your non-managed load balancers and orchestrators are not SPOFs
Tip #1
Think about the cost
Don't make your personal blog multi-region if you don't have to.
Unless you're doing it for fun.
Remember the us-east-1 S3 failure that took down many things, among them Docker Hub ?
S3 is not supposed to fail (lol), but Docker Hub could very easily have been multi-region.
(pic of the AWS console multi-region setting)
But Docker Hub is a free service (the paid version didn't go down), and why would they pay
twice the hosting cost to avoid being down for 8 hours in 3 years ?
Tip #2
break it
Many monkeys
Chaos Monkey - Latency Monkey - Conformity Monkey - Doctor Monkey - Janitor Monkey - Security Monkey - 10–18 Monkey - Chaos Gorilla
Chaos Monkey shuts down instances, Chaos Gorilla shuts down whole AZs
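A toy version of that idea fits in a few lines (purely illustrative; the instance IDs are made up, and the commented-out boto3 call is how you might actually terminate the victim):

```python
import random

def pick_victim(instance_ids, seed=None):
    """Chaos-Monkey style: choose one instance at random to kill."""
    return random.Random(seed).choice(instance_ids)

victim = pick_victim(["i-0aaa", "i-0bbb", "i-0ccc"])  # hypothetical IDs
print(f"terminating {victim}")
# boto3.client("ec2").terminate_instances(InstanceIds=[victim])
```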
Pro Tip: Do it in a separate AWS account
Tip #3
use more "managed" things, offload HA concerns to AWS
where you can,
go serverless
because API Gateway and Lambda are managed, replicated and highly available out of the box :)
reference stack ++
github.com/joaojeronimo/paranoid-ha-ecs
helps you create, update and delete stacks in all regions
But how do you send traffic to all those ALBs ?
no idea
But I think NS1 has a product for that
(an anycasted network plus one of their DNS traffic-management offerings)