Chaos Lambda, a lightweight serverless chaos monkey for AWS

Chaos Lambda is a serverless "chaos monkey" implementation for AWS. Use the chaos-lambda CLI to deploy and configure an AWS Lambda function that terminates some EC2 & ECS instances at regular intervals. It selects instances only from the Auto Scaling groups (ASGs) that you configure, and leaves every other instance in your account alone.

Chaos Lambda's purpose is to help you build robust & highly-available systems that recover gracefully from failures. Use it on its own, or together with Artillery to build systems that keep working well under high load and in the presence of failure.

Quickstart

  • Install the CLI with npm install -g chaos-lambda
  • Set up an IAM role for the Lambda function (with the AmazonEC2FullAccess managed policy, so that it can terminate instances)
  • Deploy the lambda with chaos-lambda deploy -r $lambda-role-arn

For more details, see the Chaos Lambda README.

Source Code

The source code for the CLI & the lambda function is on GitHub: shoreditchops/chaos-lambda.

How Does It Work?

  • You use the CLI to set up and configure an AWS Lambda function in your AWS account. The function has a schedule and a list of ASGs to pick instances from.
  • Each time the lambda is invoked on that schedule, it picks an instance at random from one of the ASGs you whitelisted and terminates it. The whitelist is empty by default, so a freshly deployed lambda terminates nothing and is safe to deploy.
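
The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the function name pick_victim, the dictionary shape, and the choice to pick an ASG first and then an instance within it are all hypothetical. It makes no AWS calls; in a real function the returned ID would be passed to the EC2 TerminateInstances API.

```python
import random
from typing import Dict, List, Optional, Sequence


def pick_victim(
    asg_instances: Dict[str, List[str]],
    whitelist: Sequence[str],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Return one instance ID from a whitelisted ASG, or None if none is eligible.

    asg_instances maps an ASG name to the instance IDs currently in it
    (a hypothetical shape chosen for this sketch).
    """
    rng = rng or random.Random()
    # Only ASGs that are whitelisted AND currently hold instances are candidates.
    eligible = [name for name in whitelist if asg_instances.get(name)]
    if not eligible:
        # An empty whitelist (the default) means nothing is ever selected.
        return None
    asg = rng.choice(eligible)
    return rng.choice(asg_instances[asg])


if __name__ == "__main__":
    fleet = {
        "web-asg": ["i-0aaa", "i-0bbb"],
        "worker-asg": ["i-0ccc"],
        "db-asg": ["i-0ddd"],
    }
    # With the default empty whitelist, no instance is chosen.
    print(pick_victim(fleet, whitelist=[]))
    # With a whitelist, the result is always an instance from a listed ASG.
    print(pick_victim(fleet, whitelist=["web-asg", "worker-asg"]))
```

Passing a seeded random.Random as rng makes the choice reproducible, which is handy for testing the selection logic without touching any real instances.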

That's it. It's that simple!

Why Chaos Lambda?

Failures happen, and they inevitably happen when least desired. If your application can't tolerate a system failure would you rather find out by being paged at 3am or after you are in the office having already had your morning coffee? Even if you are confident that your architecture can tolerate a system failure, are you sure it will still be able to next week, how about next month? Software is complex and dynamic, that "simple fix" you put in place last week could have undesired consequences. Do your traffic load balancers correctly detect and route requests around system failures? Can you reliably rebuild your systems? Perhaps an engineer "quick patched" a live system last week and forgot to commit the changes to your source repository?

Source: Chaos Monkey wiki

How is this different from Netflix's Chaos Monkey?

If you use AWS, want something very lightweight that you can set up in under 15 minutes, don't want to configure & run another EC2 instance, and don't use Spinnaker, then Chaos Lambda is for you.

Stay up to date

Chaos Lambda updates, along with other posts on performance & reliability engineering, will be posted on our blog and on Twitter at @ShoreditchOps.