Why Do You Need Chaos Engineering?

Amruta Bhaskar
Jun 23, 2021

Chaos engineering is the process of testing a distributed computing system to ensure that it can withstand unexpected disruptions. It relies on concepts underlying chaos theory, which focus on random and unpredictable behaviour. The goal of chaos engineering is to identify weaknesses in a system through controlled experiments that introduce random and unpredictable behaviour.

The main benefit of chaos engineering is that organizations can use it to identify vulnerabilities before a hacker exploits them or before they cause a system failure. Changes made as a result of chaos engineering testing increase confidence in an organization's systems.

Some IT groups hold chaos engineering game days where teams try to break or breach systems. They use failure mode and effects analysis (FMEA) or other tactics to gain insight into potential points of failure in their organization's systems.

With the rise of microservices and distributed cloud architectures, the web has grown increasingly complex. We all depend on these systems more than ever, yet failures have become much harder to predict.

These failures cause costly outages for companies. The outages hurt customers trying to shop, transact business, and get work done. Even brief outages can impact a company's bottom line, so the cost of downtime is becoming a KPI for many engineering teams. For example, in 2017, 98% of organizations said a single hour of downtime would cost their business over $100,000. One outage can cost a single company millions of dollars. The CEO of British Airways recently explained how one failure that stranded tens of thousands of British Airways (BA) passengers in May 2017 cost the company 80 million pounds ($102.19 million USD).

Companies need a solution to this challenge; waiting for the next costly outage is not an option. To meet the challenge head-on, more and more companies are turning to Chaos Engineering.

Designing any experiment requires four things: a hypothesis, independent variables, dependent variables, and, of course, context. The following principles serve as a guide for designing chaos engineering experiments:

• Construct a hypothesis around steady-state behaviour.

• Trigger real-world behaviour, utilizing both a control and an experimental group.

• Run experiments in production by injecting failures into the experimental group.

• Automate experiments to run continuously, attempting to disprove the hypothesis that your system is resilient.

Robust experiments should trigger the loss of availability of several components within the system. Experiments need to mimic real-world events, avoiding the happy path. Tests should utilize all possible inputs while also recreating scenarios from historical system outages.
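The principles above can be sketched in a few lines of Python. This is a minimal simulation, not a production harness: the `handle_request` function, its success rates, and the 5% tolerance are all assumptions made for illustration. It compares a control group with an experimental group that has a failure injected, then checks whether the steady-state hypothesis survives.

```python
import random


def handle_request(rng, inject_failure):
    """Hypothetical service call; returns True when the request succeeds.

    With inject_failure set, the primary dependency is treated as down and a
    fallback serves the request, succeeding slightly less often (assumed rates).
    """
    p_success = 0.97 if inject_failure else 0.99
    return rng.random() < p_success


def observed_success_rate(rng, n_requests, inject_failure):
    """Fraction of successful requests observed for one group."""
    successes = sum(handle_request(rng, inject_failure) for _ in range(n_requests))
    return successes / n_requests


def hypothesis_holds(control_rate, experiment_rate, tolerance=0.05):
    """Steady state holds if the experimental group is at most `tolerance` worse."""
    return control_rate - experiment_rate <= tolerance


rng = random.Random(42)  # seeded so the simulation is repeatable
control = observed_success_rate(rng, 10_000, inject_failure=False)
experiment = observed_success_rate(rng, 10_000, inject_failure=True)
print(f"control={control:.3f} experiment={experiment:.3f} "
      f"steady state holds: {hypothesis_holds(control, experiment)}")
```

In a real experiment, the two rates would come from production metrics for the two groups rather than from a random number generator, and the tolerance would be agreed on before the experiment starts.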

How chaos engineering works

Chaos engineering is similar to stress testing in that it aims to identify and correct system or network issues. Unlike stress testing, chaos engineering doesn't test and correct one component at a time.

Chaos engineering examines problems that have a seemingly infinite number of possible causes. It looks beyond the obvious issues and tests distributed systems against problems or sets of problems that are less likely to happen. The goal is to gain new knowledge about the system.

The process is typically divided into several steps:

Set the baseline. Identify how the system should operate under optimal conditions and specify what constitutes a normal working state.
Create a hypothesis. Consider one or more potential weaknesses and formulate a hypothesis about the effects of those weaknesses. For example, software testers might want to know what will happen if a large traffic spike occurs.
Test. Conduct experiments to gauge the consequences of a large spike. The experiments might reveal an error in a critical process or an unexpected cause-and-effect relationship. For example, a traffic spike simulation might reveal a storage performance issue.
Evaluate. Measure and evaluate how the hypothesis holds up and determine which problems to fix.
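The four steps can be walked through with a toy model of the traffic-spike example. Everything here is assumed for illustration: the `storage_latency_ms` model, the capacity of 1,000 requests per second, and the rule that latency may at most double during a spike. The point is only to show where each step sits in the flow.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    baseline_ms: float   # latency in the normal working state
    limit_ms: float      # worst latency the hypothesis allows
    observed_ms: float   # latency measured during the spike

    @property
    def hypothesis_holds(self):
        return self.observed_ms <= self.limit_ms


def storage_latency_ms(requests_per_second, capacity_rps=1000, base_ms=20.0):
    """Toy model: latency is flat until the storage tier saturates,
    then grows linearly with the overload."""
    overload = max(0.0, requests_per_second / capacity_rps - 1.0)
    return base_ms * (1.0 + 4.0 * overload)


def run_experiment(normal_rps, spike_rps, max_slowdown=2.0):
    baseline = storage_latency_ms(normal_rps)    # 1. set the baseline
    limit = baseline * max_slowdown              # 2. hypothesis: at most 2x slower
    observed = storage_latency_ms(spike_rps)     # 3. test with a simulated spike
    return ExperimentResult(baseline, limit, observed)  # 4. evaluate


result = run_experiment(normal_rps=500, spike_rps=3000)
print(result, "hypothesis holds:", result.hypothesis_holds)
```

With these assumed numbers the hypothesis is disproved, which is the useful outcome: the experiment has surfaced a storage bottleneck to fix.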

Chaos Engineering Tools

It is essential to minimize the blast radius when designing chaos experiments, ideally injecting one small failure at a time. Scope experiments carefully so that they are low-risk: involve few users, limit user flows, limit the number of live devices, and so on. At the start, it is wise to inject failures that verify functionality for a subset or small group of clients and devices. As these low-risk experiments succeed, you can proceed to small-scale diffuse experiments that impact a small percentage of traffic, distributed evenly across production servers.

A small-scale diffuse experiment's main advantage is that it does not cross thresholds that could open circuit breakers. This lets you verify single-request fallbacks and timeouts while demonstrating the system's resilience to transient errors. It verifies the logical correctness of fallbacks, but not the characteristics of the system during large-scale fallout.
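One possible way to select a small, evenly spread slice of traffic is to hash each request identifier into a bucket and inject failures only for the lowest buckets. The sketch below assumes requests carry a string identifier. The function name `in_blast_radius` and the bucket count are illustrative and do not come from any particular tool.

```python
import hashlib


def in_blast_radius(request_id: str, percent: float) -> bool:
    """Return True if this request falls in the experimental slice.

    The request id is hashed into one of 10,000 buckets, so `percent`
    can be set in steps of 0.01. The same id always gets the same answer,
    no matter which server handles the request.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Example: aim for roughly 1% of requests in the experiment.
request_ids = [f"req-{i}" for i in range(100_000)]
affected = sum(in_blast_radius(rid, 1.0) for rid in request_ids)
print(f"{affected} of {len(request_ids)} requests selected")
```

Because selection depends only on the request identifier, the affected requests are spread across production servers rather than concentrated on one host, and the percentage can be raised gradually as confidence grows.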

The following is a list of tools to get you started.

Chaos Monkey: The OG of chaos engineering. The tool is still maintained and currently integrated into Spinnaker, a continuous delivery platform developed initially by Netflix to release software changes rapidly and reliably.
Mangle: Enables you to run chaos engineering experiments against applications and infrastructure components and quickly assess resiliency and fault tolerance. It is designed to introduce faults with minimal pre-configuration and supports a wide range of tooling, including K8S, Docker, vCenter, or any remote machine with SSH enabled.
Gremlin: Founded by former Netflix and Amazon engineers who productized Chaos as a Service (CaaS). Gremlin is a paid service that provides a command-line interface, an agent, and an intuitive web interface for setting up chaos experiments quickly. There is also a big red HALT button that makes it simple for Gremlin users to roll back experiments if an attack negatively impacts the customer experience.
Chaos Toolkit: An open-source project that tries to make chaos experiments easier by creating an open API and a standard JSON format for describing experiments. There are many drivers for running experiments against AWS, Azure, Kubernetes, PCF, and Google Cloud. It also includes integrations for monitoring systems and chat, such as Prometheus and Slack.
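To give a feel for a declarative JSON experiment, the sketch below builds one as a Python dictionary and serializes it. The field names follow the Chaos Toolkit experiment format as I understand it, so check them against the project's documentation before relying on them. The health-check URL and the script path are hypothetical placeholders.

```python
import json

# Hypothetical experiment: does the service stay healthy when one instance stops?
experiment = {
    "version": "1.0.0",
    "title": "Service stays available when one instance is stopped",
    "description": "Stop a single instance and check that the health endpoint still answers.",
    "steady-state-hypothesis": {
        "title": "Health endpoint returns HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "health-endpoint-responds",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/health",  # placeholder URL
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "stop-one-instance",
            "provider": {
                "type": "process",
                "path": "/usr/local/bin/stop-one-instance.sh",  # placeholder script
            },
        }
    ],
    "rollbacks": [],
}

as_json = json.dumps(experiment, indent=2)
print(as_json)
```

The resulting JSON would typically be saved to a file and passed to the tool's command-line runner. Keeping the hypothesis, the injected action, and the rollbacks in one document is what makes such experiments easy to review and automate.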

Chaos engineering best practices

Chaos engineering is complicated. Following these best practices can help avoid common problems:

Understand the usual behaviour of the system. Having a solid understanding of the system when it is healthy will help in diagnosing problems.
Simulate realistic scenarios. Focus on injecting likely failures and bugs. For example, if latency has been a problem in the past, inject faults that induce latency.
Test using real-world conditions. This yields the most accurate results. Chaos engineering is often performed in production environments, especially when it is too cumbersome or expensive to duplicate a large, distributed system for testing purposes.
Minimize the blast radius. Chaos engineering can be highly disruptive. Success demands coordination among IT staff, developers and business units. Experiments in production environments are rarely run at peak times, and ideally, nobody using the system will be able to tell that chaos experiments are taking place. Redundancy should be in place to ensure that services remain available if experiments do cause issues.
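As an illustration of two of these practices, injecting latency and keeping the blast radius small, here is a minimal sketch of a latency injector with a halt switch. The class name `LatencyInjector`, its parameters, and the `fetch_profile` function are hypothetical and not taken from any of the tools listed earlier.

```python
import functools
import random
import time


class LatencyInjector:
    """Delays a fraction of calls to a wrapped function until halted."""

    def __init__(self, delay_seconds, probability, rng=None, sleep=time.sleep):
        self.delay_seconds = delay_seconds
        self.probability = probability      # fraction of calls to delay
        self.rng = rng or random.Random()
        self.sleep = sleep                  # injectable, which helps in tests
        self.enabled = True
        self.injected = 0                   # number of calls delayed so far

    def halt(self):
        """Stop injecting immediately, e.g. when customers are affected."""
        self.enabled = False

    def wrap(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if self.enabled and self.rng.random() < self.probability:
                self.sleep(self.delay_seconds)
                self.injected += 1
            return func(*args, **kwargs)
        return wrapper


# Delay roughly 10% of calls by 50 ms.
injector = LatencyInjector(delay_seconds=0.05, probability=0.1, rng=random.Random(7))


@injector.wrap
def fetch_profile(user_id):
    """Stand-in for a real downstream call."""
    return {"user_id": user_id}


profiles = [fetch_profile(i) for i in range(20)]
injector.halt()  # no further calls are delayed after this point
```

Starting with a low probability and a short delay keeps the experiment low-risk, and the `halt` method plays the role of an emergency stop if the experiment begins to hurt real users.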

SkillRary