Amazon TechOn Conference, Iași 8 October 2016
“It’s Always Day One at Amazon”
On Saturday, October 8th, I attended the Amazon TechOn Conference in Iași. Why Iași? Because Amazon has had a development center here since 2005.
The conference was hosted in the iconic building of the National Theater.
The Amazon TechO(n) conference is held every 2 years, and speakers come from the US headquarters to present technical goodies and best practices from Amazon.
This year, 7 main topics were covered:
1. S3 Under the Covers: “Distributed Systems at Scale”
2. Customer Experience: “Learn & Be Curious”
3. Amazon Robotics Overview
4. Developer tools: “Agility at Amazon”
5. Customer Behavior: “Experimentation and Failure”
6. Security: “A Day in the Life of an Amazon Application Security Engineer”
7. Building Testing Ecosystems
By far the most captivating one was “S3 Under the Covers: distributed systems at scale”:
1. S3 Under the Covers: distributed systems at scale
Core Systems Primitives that power AWS
by Allan Vermeulen, Distinguished Engineer, AWS
The speaker, Allan Vermeulen, works in the AWS division and is one of the 10 Amazon “Distinguished Engineers”; he has worked there since the ’90s. He was part of the team that laid the building blocks for S3, the Amazon storage solution.
My takeaways from his talk:
- Distributed systems metaphor: A man with a watch knows what the time is. A man with two watches – never knows for sure!
- How to deal with partial failures? Be Paranoid!
- When working with high load, be prepared to deal with partial failures under unthinkable scenarios: bit rot, flipped bits in memory, et cetera. They had an 8-hour outage at Amazon because a bit-flip in memory started a chain reaction that collapsed whole racks.
- With incoming data, calculate the checksums after storing data on shards, just to be sure it remains unaltered
- Be smart when sharding the data. Latest Amazon strategy: use “cells” (groups with an equal number of servers) to partition the server clusters, so that your clients are isolated from one another.
- Distributed Computing Challenges:
- Discovery of group membership of objects stored in S3: a “Gossip Protocol”, based on an “epidemic algorithm”, custom-made for S3
- Failure detection: silent or dead? Make your servers talk, never assume status based on server silence.
- Amazon uses TLA+, a formal specification language, to model and verify the designs of the systems behind their cloud: https://en.wikipedia.org/wiki/TLA%2B
- How do you test this kind of distributed system? As you cannot practically generate huge amounts of load on the test systems, you use a Chaos Monkey to inject entropy into the systems.
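The checksum advice above can be sketched in a few lines. This is my own minimal illustration, not how S3 does it: the `Shard` class is a hypothetical in-memory stand-in for a storage node, and SHA-256 is an assumed choice of checksum.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the hex SHA-256 digest of a payload."""
    return hashlib.sha256(data).hexdigest()


class Shard:
    """Hypothetical in-memory stand-in for one storage shard."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def write(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def read(self, key: str) -> bytes:
        return self._blobs[key]


def store_verified(shards: list[Shard], key: str, data: bytes) -> str:
    """Write data to every shard, then read it back and compare checksums."""
    expected = checksum(data)  # computed once, on the incoming bytes
    for index, shard in enumerate(shards):
        shard.write(key, data)
        # Be paranoid: trust what the shard returns, not what we sent.
        if checksum(shard.read(key)) != expected:
            raise IOError(f"checksum mismatch on shard {index} for key {key!r}")
    return expected
```

Calling `store_verified([Shard(), Shard()], "photo.jpg", payload)` returns the digest only if every shard hands back exactly the bytes that came in; a real system would also keep the digest and re-check it periodically to catch bit rot later on.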
2. Learn and Be Curious
It’s always day one at Amazon
by Sean Scott (VP of Consumer Experience Technology, Amazon)
- Ask questions. Use the five whys to get to the root cause of a problem
- Seek out different opinions.
- Make mistakes
- Test experiences, not features. Focus on the path to the result, not the result itself.
- Work backwards, starting with the customers. Use techniques like triangulation (eyeballs heat map)
- Some discoveries at Amazon:
- The shipping test: “super saver shipping” was changed into “Free shipping” (no one was reading the previous wording; after this experiment they hired linguists)
- Amazon Author Pages – people were looking for all books by …, so they created the author pages → eventually transformed into a business itself (1.4 million pages automatically created for authors)
- “We plant lots of seeds”
3. Amazon Robotics Overview (amazonrobotics.com)
by Drago Kassabov (Director of Operations, Amazon Robotics)
- They use robots in “Fulfillment centres”, to help partners bring goods to be packaged.
- Next challenge (a still-unsolved problem): make robots grab things from the pallets and put them in the right basket.
- For that they went to big universities across the globe and organized “Tech Jams” to test robot prototypes.
- Sponsored event: Amazon Picking Challenge in Germany. Next event will be in Japan.
4. Agility at Amazon
by Ken Exner, director of AWS Developer Tools
Historically, Amazon had a monolithic development lifecycle.
With the move to microservices, they needed to change the development lifecycle.
They organized into teams of up to 8 people (“two-pizza teams”), with full ownership, including maintenance.
Such teams own one or more “primitives” (the Amazon term for microservices) end-to-end.
They have automated the entire process. They have hooks in place: for example, if CPU consumption escalates, the deployment is rolled back automatically.
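To make the rollback hook concrete, here is a minimal sketch of my own. The `Deployment` record, the `cpu_hook` function and the 85% threshold are all hypothetical; the talk did not show Amazon's actual tooling.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Deployment:
    """Hypothetical record of which version is running on a fleet."""
    current: str
    previous: str
    history: list[str] = field(default_factory=list)

    def rollback(self) -> None:
        self.history.append(f"rollback {self.current} -> {self.previous}")
        self.current = self.previous


def cpu_hook(deployment: Deployment, cpu_samples: list[float],
             threshold: float = 85.0) -> bool:
    """Roll back if average CPU over the sample window exceeds the threshold.

    Returns True when a rollback was triggered. The threshold is an
    assumed value for illustration only.
    """
    if cpu_samples and mean(cpu_samples) > threshold:
        deployment.rollback()
        return True
    return False
```

The point of the design is that nobody has to be awake to make the call: the hook compares a metric with a limit and reverts on its own.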
- Use release pipelines all the way to production
- Pessimistic deployments (Invest in validation)
- Faster isn’t better!
- Deployment strategy: deploy to one box, then test (with sophisticated analysis of the transactions), then deploy to one “AZ” (Availability Zone), then to one region, etc.
- Optimize ECT (Edit-Compile-Test) loop, because you want to catch problems as early as possible.
- The other loops, like Code Review, or Staging or Production testing move much more slowly.
- Monitor everything.
- When you encounter a live problem, ask: do we have a graph for that problem?
- On Wednesday morning there is a meeting where all higher management looks at the graphs and spots issues
- Measure everything you can.
- Convince people that automation is better
- Canary = tests running against staging = synthetic monitoring
- Code coverage checks: if unit test coverage is below 70%, gate (stop) the release
- Tests on systems: Unsolved problem: How to do automated integration tests? (they’re trying to solve it today at Amazon)
- They are experimenting with policies as part of the pipeline, although these will inevitably add bottlenecks
- Lessons learned:
With thousands of two-pizza teams and a microservices architecture, they do
64 million deployments a year across multiple environments.
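The “pessimistic deployment” idea from the list above (coverage gate first, then one box, one AZ, one region) can be sketched as a tiny pipeline. This is only my illustration: the wave names, the `validate` callback and the function name are assumptions, not Amazon's tooling.

```python
from typing import Callable

# Waves ordered from smallest blast radius to largest (illustrative names).
WAVES = ["one-box", "one-az", "one-region", "all-regions"]


def run_pipeline(coverage: float,
                 validate: Callable[[str], bool],
                 min_coverage: float = 0.70) -> list[str]:
    """Return the waves that were deployed and passed validation.

    The release is gated when unit test coverage is below min_coverage,
    and the rollout halts at the first wave whose validation fails.
    """
    if coverage < min_coverage:
        return []  # gate: nothing gets deployed
    completed = []
    for wave in WAVES:
        if not validate(wave):
            break  # stop here, wider waves never see the change
        completed.append(wave)
    return completed
```

For example, `run_pipeline(0.9, lambda wave: wave != "one-az")` stops after the one-box wave, so a problem found in the first Availability Zone never reaches a whole region.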
5. Experimentation and Failure
Success, one disappointment at a time
by Blair Hotchkies, director of Consumer Behavior
- Jeff Bezos quotes that drive the experimentation at Amazon:
- “Failure and invention are inseparable twins”.
- “If you know in advance that it’s going to work, it’s not an experiment”
- “Most large organizations embrace the idea of invention, but are not willing to suffer the string of failed experiments necessary to get there”
- Experiment -> from failure to success: from Fire Phone to Amazon Echo (Alexa). Without the failure of the Fire Phone, they might never have launched Echo
- How they experiment:
- Pay attention to Culture Antipatterns (e.g. jumping to conclusions, tendency to tell & believe stories instead of backing up with data)
- Hire high-judgement individuals, reward decisions and not outcomes
- Focus on learning
- Bake in Checkpoints
- Celebrate abandonment and negative findings
- Also publish negative learnings (failed experiments)
- They developed an Amazon Experimentation Tool.
- Question: How do you decide between small gains and long-term loss? Answer: It is not a solved problem at Amazon. We are working on it.
- Question: How many experiments do you run at any given moment? Answer: A few thousand. Many are tiny ones.
Tool: “Customer Session Replay” demoed:
Live browsing data is recorded and reproduced on screen, so the analyst can see the exact customer environment.
6. A Day in the Life of a Security Engineer at Amazon
by Jon McClintock (Principal Security Engineer in Information Security, Amazon)
- Security is part of the development process
- In an Agile world, integrate security into the process, just like the code review process
- As a Security Engineer, you make an educated risk decision
- During design phase:
- Threat Modelling: D.R.E.A.D. Model: Damage, Reproducibility, Exploitability, Affected Users, Discoverability
Biggest challenge: how paranoid do you need to be?
There are no black and white decisions!
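As a small illustration of how D.R.E.A.D. is commonly applied, each category gets a rating and the risk score is the average. The 1 to 10 scale and the class below are my assumptions; the talk did not prescribe a particular scale or formula.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class DreadRating:
    """One threat rated on the five D.R.E.A.D. categories (assumed 1-10 scale)."""
    damage: int
    reproducibility: int
    exploitability: int
    affected_users: int
    discoverability: int

    def score(self) -> float:
        """Average of the five ratings; higher means riskier."""
        values = astuple(self)
        if any(not 1 <= value <= 10 for value in values):
            raise ValueError("each rating must be between 1 and 10")
        return sum(values) / len(values)


# Hypothetical threat, rated by hand for the sake of the example.
sql_injection = DreadRating(damage=8, reproducibility=6, exploitability=7,
                            affected_users=9, discoverability=5)
print(sql_injection.score())  # 7.0
```

The number is not a verdict, only a way to compare threats against each other, which fits the “no black and white decisions” remark.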
- Best security engineers are those that can turn a security no into a security yes
- Rather than have access to data, think about alternative ways to identify problems w/o access to data.
- Change the solution so that data passes encrypted through Tiers where we do not need it.
- This way you turn the sec. no into a Yes, and developers can deploy safely into production
- Security Resources:
7. Building Testing Ecosystems
by Phil Sigel (Senior SDET in Digital Content Platform, Amazon)
- At Amazon, the communities of practice are called “Samurai Groups”
- learning, training, coaching, etc.
- “If a test isn’t automated, it does not exist”
- Load Testing – questions:
- Load testing: Send realistic amounts of data, then do an offline replay – replay it right away
- Shadow Testing: Intercept requests and responses from the new system
- Service Mocking: hook mocks on the repo., decorate clients to dependent ecosystems
- Create an ecosystem in the end, where tools work together and share data.
- We should fail in a way that our customers do not have a lousy experience.
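Shadow testing from the list above can be sketched roughly like this: the live system answers the customer, the new system receives a copy of the same request, and differences are only logged. The function and handler names are hypothetical, and plain dicts stand in for real requests and responses.

```python
from typing import Callable

Handler = Callable[[dict], dict]


def shadow_call(request: dict, current: Handler, candidate: Handler,
                mismatches: list[dict]) -> dict:
    """Serve the request from the current system and shadow it to the candidate.

    The candidate's response is compared and recorded but never returned,
    so a bug in the new system cannot reach the customer.
    """
    live = current(request)
    try:
        shadow = candidate(request)
        if shadow != live:
            mismatches.append({"request": request, "live": live, "shadow": shadow})
    except Exception as exc:  # the new system may crash; the customer must not notice
        mismatches.append({"request": request, "live": live, "error": repr(exc)})
    return live
```

After a run, the `mismatches` list is the test report: empty means the new system behaved like the old one for all the traffic it saw.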
That was it!
I am looking forward to the next edition of the Amazon TechOn Conference, hopefully sooner than in 2 years 😉