How Capital One Engineers Chaos into Production


“We like to say we’re a tech company that does banking,” Brian Pinos, senior director of software engineering at Capital One, told the online audience at this week’s DevOps Enterprise Summit.

The Fortune 100 company with 100 million customers is 28 years old: young for a US bank holding company, but built well before the age of cloud native fintechs. Nevertheless, in 2020, Capital One became the first bank to run entirely on the public cloud, making it one of Amazon Web Services’ biggest customers. This flies in the face of the finance industry’s hesitancy to move to the cloud, which John Cragg, CEO of MYHSM, pegs on uncertainty around regulations, costs and security.

So how does Capital One overcome that hesitancy to provide always-on banking and credit services through the public cloud? With a staunch investment in planned and unplanned chaos engineering and testing in production, plus a heavy dose of open source. Read on to find out how controlled chaos reigns at Capital One and what the team thinks it needs to reach the next level and become an elite DevOps organization.

How Capital One tests in production

Chaos engineering is a decade-old DevOps practice that puts a positive spin on tech’s passion for moving fast and breaking things. Despite the oxymoronic name, chaos engineering is actually controlled experimentation aimed at pushing services to their limits and testing the processes around software resiliency. Rather than being chaotic, it is systematic, measurable and, when done well, quick to roll back. As TNS colleague Maria Korolov writes, companies typically begin their chaos engineering with what they already know is broken, before moving on to address the goals of overall business impact and resilience.

While the principles of chaos engineering focus exclusively on withstanding turbulent conditions in production, only some organizations have reached the point where they are driving chaos into production. After all, no one really wants to break customer service.

“‘Test in production’ was considered a bad word: you didn’t test your code before the release,” said Pinos’ colleague Yar Savchenko, director of stability and site reliability engineering at Capital One. In fact, he says, maintaining consistency between a lower QA environment and production is notoriously difficult and costly, “almost impossible and not worth the trouble.”

Instead, his team generates realistic, high loads in production, such as a payday Friday, using AWS Fault Injection Simulator, AWS Systems Manager and other chaos engineering tools, as well as tools they developed internally.

They “create tools to fight complexity and admit failure,” Pinos explained, including monthly game days as well as “unplanned” chaos experiments, which run with no prior warning.

You can’t have chaos without infrastructure as code

The Capital One team groups large-scale failures into a few buckets:

  • Application layer failure: The internal tool Cloud Doctor helps teams understand the complexity of the environment by provoking an app layer failure in a small percentage of the production environment, so they can see what happens and then work to make it more resilient.
  • Availability zone failure: Simulating, while everyone is still awake and working, what happens if a zone goes down. Will traffic automatically fail over to a new zone, and will the containers automatically come back up and rejoin?
  • Regional failure: They need to ensure that they are able to exit a region if one or more go down.
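The idea of limiting an application layer failure to a small percentage of production can be sketched in a few lines of Python. This is only an illustration of blast-radius sampling, not how Cloud Doctor works; the function name `pick_targets`, the instance names and the 5% fraction are all assumptions made for the example.

```python
import random

def pick_targets(instances, fraction, seed=None):
    """Pick a random subset covering roughly `fraction` of the fleet.

    Hypothetical helper: returns at least one instance when the fleet is
    non-empty and fraction > 0, so small fleets still get an experiment.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    if not instances or fraction == 0:
        return []
    count = max(1, int(len(instances) * fraction))
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return rng.sample(list(instances), count)

# Hypothetical fleet of 200 instances; fault only 5% of them.
fleet = [f"app-{i:03d}" for i in range(200)]
targets = pick_targets(fleet, fraction=0.05, seed=42)
print(len(targets), "of", len(fleet), "instances selected")
```

Keeping the fraction as an explicit parameter makes the blast radius something stakeholders can agree on before the experiment starts.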

Although this is a very technical internal process, it begins with a hypothetical conversation (what will we do if X goes down?) that creates disaster scenarios for specific architectural failures.

Pinos says the Capital One team would not be able to achieve any of this without standardized deployment through infrastructure as code (IaC). “To understand the complexity of all the cloud, you have to invest in tooling,” he continued, and you “get rid of manual intervention through targeted exercises.”

Chaos Engineering for low latency and high capacity

The first round of Capital One’s chaos experiments set out to answer several questions:

  • How will a critical system perform under extreme load?
  • What happens if one region or data center fails?
  • Is API Gateway capable of scaling?
  • Is the load balancer sized correctly?

The answers helped them proactively identify several scenarios in which latency could increase across multiple microservices.

“By conducting several chaos exercises, both planned and unplanned, we identified latency. We cannot beat the speed of light, so the more data you have to push and the further apart your data centers are from each other, the more your latency will increase,” Savchenko said in the same talk. Data will just bounce back and forth, he says, and then, “if something changes, like a component fails or your primary database moves from one center or region to another,” things time out and customers are negatively affected.

The further away the components are from each other, the more latency is introduced. “No client is going to wait 30 seconds for your application to load,” Savchenko continued. Therefore, Capital One focused on moving components closer together and right-sizing components for the cloud.
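A back-of-the-envelope calculation shows why distance matters. The sketch below assumes light travels through optical fiber at roughly 200,000 km/s (about two-thirds of its speed in a vacuum) and ignores routing, queuing and processing time, so it gives only a lower bound. The 4,000 km distance and the count of ten sequential calls are made-up numbers for illustration, not figures from the talk.

```python
# Assumption: signal speed in fiber is roughly two-thirds of c in a vacuum.
SPEED_IN_FIBER_KM_PER_S = 200_000

def min_round_trip_ms(distance_km):
    """Lower bound on one network round trip between two sites, in ms."""
    return 2 * distance_km / SPEED_IN_FIBER_KM_PER_S * 1000

def min_request_latency_ms(distance_km, sequential_round_trips):
    """Lower bound when a request needs several round trips in sequence."""
    return min_round_trip_ms(distance_km) * sequential_round_trips

# Hypothetical: two regions 4,000 km apart, ten chatty cross-region calls.
one_trip = min_round_trip_ms(4_000)
chatty = min_request_latency_ms(4_000, sequential_round_trips=10)
print(f"one round trip: at least {one_trip:.0f} ms")
print(f"ten sequential round trips: at least {chatty:.0f} ms")
```

Under these assumptions a single cross-region hop costs tens of milliseconds, and a call path that bounces back and forth multiplies that floor before any real work is done.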

They also made some findings about capacity. “The advantage of the cloud is you can scale unlimited, zero to any reasonable number. There is a cost but sometimes you don’t size your cloud native resource correctly, so when traffic shifts, you don’t have enough computing power,” he continued.

Through chaos engineering, they found many cases where resources weren’t the right size for a spike in user access, “sometimes when we expect it, when people get paychecks, but also other unexpected reasons,” he explained.

While “no one in IT likes processes,” Savchenko said they needed to create a process for what to do next. The goal is to remediate the discovered issues within 30 days, usually through configuration changes, capacity expansion or reengineering. Typically, he said, sizing and latency defects can be resolved quickly.

Risk of testing in production

As with any experiment conducted on real users, there is an inherent risk to testing in production, but, Pinos argues, the value far outweighs the risks.

Unintended effects include:

  • increased latency
  • lack of sufficient capacity
  • actual failures
  • an actual incident occurring at the same time as the chaos experiment, in the same area of the product

“The key to risk management is mitigation,” Pinos said. You need to have testers sit down with both business and technical stakeholders from the start to create an agreed-upon playbook on the boundaries of your experiment. If you reach that limit, you abort the test. He says everyone has to agree on rollback triggers and techniques that can be executed within five minutes — “anything to eliminate that lack of control.”

Of course, in order to implement chaos engineering on a large scale, you need to know when there is a problem, which comes down to real-time monitoring of all critical systems and transactions. Monitoring is needed first to understand the steady state before the test, then during the test, and then again afterward to confirm the return to the steady state.

“If you put latency into the call path and you don’t verify that the latency goes away, you have a huge problem,” commented Pinos.
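The before, during and after pattern, together with an agreed abort limit, can be sketched as a small experiment runner. This is a minimal illustration rather than Capital One’s tooling: `run_experiment`, the `FakeService` stand-in, the 300 ms abort threshold and the 10% recovery tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class ExperimentResult:
    aborted: bool       # the abort limit was hit during the experiment
    recovered: bool     # latency returned to (near) the baseline afterward
    baseline_ms: float
    after_ms: float

def run_experiment(sample_latency_ms, inject, rollback,
                   abort_above_ms, tolerance=0.10, samples=5):
    """Measure steady state, inject a fault, always roll back, then verify."""
    baseline = median(sample_latency_ms() for _ in range(samples))
    inject()
    aborted = False
    try:
        for _ in range(samples):
            if sample_latency_ms() > abort_above_ms:
                aborted = True   # agreed limit reached: stop the test
                break
    finally:
        rollback()               # runs on abort, completion or error
    after = median(sample_latency_ms() for _ in range(samples))
    recovered = after <= baseline * (1 + tolerance)
    return ExperimentResult(aborted, recovered, baseline, after)

class FakeService:
    """Hypothetical stand-in for a monitored call path."""
    def __init__(self):
        self.extra_ms = 0.0
    def latency(self):
        return 100.0 + self.extra_ms
    def add_latency(self):
        self.extra_ms = 400.0
    def remove_latency(self):
        self.extra_ms = 0.0

svc = FakeService()
result = run_experiment(svc.latency, svc.add_latency, svc.remove_latency,
                        abort_above_ms=300.0)
print(result)
```

The final measurement is the part Pinos warns about: if `recovered` comes back false, the injected latency never left the call path.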

Monitoring is also necessary to measure the impact of no-notice chaos events. During these, experienced site reliability engineers are on standby to help in case something goes really bad, but the rest of the affected teams are not aware of the experiment. This is when Capital One breaks things to see whether people and processes, along with automation, respond as expected: How quickly do engineers jump on the incident bridge?

Stepping up to the next level of resilience

“You should never stop growing. The status quo is the worst thing you can achieve in the IT industry,” Savchenko said.

Capital One is currently performing chaos engineering in production for all web and mobile customer-facing products. In 2023, Savchenko hopes to expand the scope to all critical applications, including call centers and interactive voice response systems, which, he says, are typically considered separate but are both certainly parts of the customer experience to improve.

Furthermore, they are extending their chaos experiments to third-party vendors. “Chaos will allow us to test those third-party vendors and identify gaps, including latency,” Savchenko said. This also covers planned failures of major communication tools like Zoom, Slack and Splunk, ensuring engineers can easily switch to a different tool.

They are also experimenting with generating fake HTTP error codes and injecting them into downstream applications to see how those applications handle the errors.
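A simplified picture of HTTP error injection is a wrapper that occasionally replaces a real response with a fake error status, so the consuming code’s handling can be observed. The sketch below is an assumption-laden illustration, not the team’s tool: handlers are modeled as plain functions returning a `(status, body)` tuple, and `with_error_injection` and `downstream_client` are hypothetical names.

```python
import random

def with_error_injection(handler, error_rate, status=503, rng=None):
    """Wrap a handler so a fraction of calls return a fake error status."""
    rng = rng or random.Random()
    def wrapped(request):
        if rng.random() < error_rate:
            return status, "injected fault"   # fake error, handler not called
        return handler(request)
    return wrapped

def downstream_client(call, request, retries=2):
    """Hypothetical consumer: retry on 5xx, then serve a fallback."""
    for _ in range(retries + 1):
        status, body = call(request)
        if status < 500:
            return status, body
    return 200, "fallback response"

def real_handler(request):
    return 200, "ok:" + request

# Inject errors into 20% of calls (illustrative rate) and observe the client.
flaky = with_error_injection(real_handler, error_rate=0.2,
                             rng=random.Random(7))
statuses = [downstream_client(flaky, "balance")[0] for _ in range(50)]
print(set(statuses))
```

Setting `error_rate` to 1.0 forces every call to fail, which is a quick way to confirm the consumer’s fallback path actually works.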

The Capital One engineering team is working toward being able to execute all types of tests on the highest-volume days without advance notice. The team is also working hard to make sure that any incident that occurs in production is self-healing.

“Our ultimate goal is to make sure that game day drills are unannounced … to use a tool with one click to start the experiment, and with one click to roll back if anything goes wrong,” Savchenko said, as they strive to achieve a high degree of resilience.
