The Science behind Amazon Go Stores: A Simplified Explanation

High Level Functionalities

At the heart of the Go Store is Computer Vision based Deep Learning, used to seamlessly track everyone in the store and estimate their intentions.

Let us start with the high level functionalities that the algorithms perform in the store. Below is a view of these functionalities that I have created based on my interpretation of the process.

[Figure: high level functionalities performed by the Go Store algorithms]

High level problems

Now switching from functionalities to problems: in order to perform these functionalities, several types of problems need to be solved. Aggregated, these problem types boil down to the following three high level problems:

  • Item Identification
  • Person identification
  • Who took what (Customer association)

I have represented these three problems at a high level in the illustration below:

[Figure: the three high level problems]

In an architecture format, the Amazon Go system looks like the illustration below. Terms that have not been introduced yet (locator, linker, tangled state) will be explained in subsequent sections, so no need to panic:

[Figure: Amazon Go architecture]

Now let us explore these three in detail to understand how the logic works.

You will see that in some places I have mentioned the specific Neural Networks that were used, but you don’t need to know anything about them to understand this article. That is for reference purposes only, in case you want to do any further reading.

Person Identification

There are two key aspects of person identification in this problem. One is to identify each person as soon as they arrive in the store. The second is to keep track of each person and their activities as they move within the store.

The first key objective is to track each person the whole time they are in the store, from the moment they walk in until they leave. While it may sound relatively simple, some of the difficult problems Amazon had to solve were:

  • Occlusion: where a person is blocked from view by something in the store
  • Tangled State:  where people are very close to each other

To address these problems, Amazon uses custom camera hardware that captures both RGB video and distance. From there, they segment each image at the pixel level, group pixels into blobs, and label each blob as person or not-person. Finally, they build a location map for the frame by triangulating each person across multiple cameras.
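
To make the pixel-to-blob step concrete, here is a minimal sketch in Python. Everything in it (the function name, the flood-fill grouping, the thresholds) is my own illustration of the idea, not Amazon’s actual code:

```python
import numpy as np

def find_person_blobs(mask: np.ndarray, min_pixels: int = 50):
    """Group foreground pixels into 4-connected blobs and keep the big ones.

    `mask` is a binary image where 1 marks pixels a per-pixel classifier
    called 'person'. Returns one (row, col) centroid per blob; in the real
    system each blob would then be triangulated across multiple cameras.
    """
    visited = np.zeros_like(mask, dtype=bool)
    blobs = []
    rows, cols = mask.shape
    for r in range(rows):
        for c in range(cols):
            if mask[r, c] and not visited[r, c]:
                visited[r, c] = True
                stack, pixels = [(r, c)], []
                while stack:  # flood fill one blob
                    y, x = stack.pop()
                    pixels.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols \
                                and mask[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            stack.append((ny, nx))
                if len(pixels) >= min_pixels:
                    ys, xs = zip(*pixels)
                    blobs.append((sum(ys) / len(ys), sum(xs) / len(xs)))
    return blobs

# toy usage: a single 3x3 'person' blob
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:5, 2:5] = 1
print(find_person_blobs(mask, min_pixels=4))  # -> [(3.0, 3.0)]
```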

The next task for Amazon was to ensure the labels are preserved across frames in the video, moving from locating to tracking the customers in the store. The problems experienced in this phase were:

  • Tangled State re-identification: when two people get very close together, confidence about who is who drops. The Go Store technology handles this by marking these customers as low confidence and scheduling them to be re-identified over time (see the sketch after this list)
  • Distinguishing store associates, who behave differently from customers
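
Here is a toy sketch of the tracking idea: labels are carried from frame to frame by nearest-neighbour matching, and any tracks that get too close to each other are downgraded and queued for re-identification. The data structures, threshold, and matching rule are all my assumptions, far simpler than a production tracker:

```python
from dataclasses import dataclass

@dataclass
class Track:
    person_id: str
    position: tuple           # last known (x, y) floor position, in metres
    confidence: float = 1.0   # belief that person_id is still correct

TANGLE_DISTANCE_M = 0.5       # assumed distance at which two tracks "tangle"

def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def update_tracks(tracks, detections):
    """Carry person labels from the previous frame to the current one.

    Greedy nearest-neighbour matching of detections to existing tracks;
    a production tracker would be far more robust. Returns the tracks
    that should be queued for re-identification.
    """
    for track in tracks:
        track.position = min(detections, key=lambda d: dist(track.position, d))
    for i, a in enumerate(tracks):          # mark tangled pairs
        for b in tracks[i + 1:]:
            if dist(a.position, b.position) < TANGLE_DISTANCE_M:
                a.confidence = b.confidence = 0.5
    return [t for t in tracks if t.confidence < 0.9]
```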

Item identification

The key question to answer here is:

Which specific items are off the shelf and in someone’s hand?

Some of the problems faced and solutions in this phase were:

  • Items that are very similar, like two different flavors of the same brand of drink, are distinguished using a specific type of Neural Network (residual neural networks) that performs refined product recognition across multiple frames, after a Neural Network of another type (a Convolutional Neural Network, or CNN) identifies the item class (see the sketch after this list)
  • Lighting and deformation change how items look, which was solved by generating a lot of training data for these specific challenges
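
As a sketch of that two-stage idea (the function names are hypothetical, and the per-frame models are passed in as black boxes): a coarse classifier votes on the product class across frames, then a finer model votes on the exact item within that class:

```python
from collections import Counter

def identify_item(frames, coarse_model, fine_model):
    """Two-stage item identification across multiple video frames.

    `coarse_model(frame)` -> broad product class (the CNN stage)
    `fine_model(frame, cls)` -> exact product within that class
    (the residual-network stage). Both are stand-ins here.
    """
    votes = Counter(coarse_model(f) for f in frames)
    cls = votes.most_common(1)[0][0]           # majority class across frames
    products = Counter(fine_model(f, cls) for f in frames)
    return products.most_common(1)[0][0]       # majority product across frames
```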

Customer association

Probably the most challenging problem is combining all of the information from the above steps to finally answer the “Who took what?” question.

The location-tracking Go Store cameras look from the top down, not from an isometric view, so the system needs to trace a path through the pixels representing the arm between the items and a customer. A simple top down model did not work well enough to solve this problem, so the team set out to build a stick-figure-like model of the customer.

A novel Deep Learning model was needed to build an articulated model of each customer from the video. (Technical stuff: it uses a CNN with a cross entropy loss function to build the joint detection point cloud, self regression for vector generation, and pairwise regression to group the vectors together.)
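
A heavily simplified sketch of how such an articulated model might be used: once a network has produced per-person joint locations, a hand detected near a shelf is attributed to the customer with the nearest wrist. The nearest-wrist rule here is only a stand-in for the pairwise-regression grouping described above:

```python
def assign_hand_to_customer(hand_xy, skeletons):
    """Attribute a detected hand to the nearest wrist of any skeleton.

    `skeletons` maps person_id -> {joint_name: (x, y)}, as a joint-detection
    network might output. This heuristic is illustrative only.
    """
    def d2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    best_id, best_d2 = None, float("inf")
    for person_id, joints in skeletons.items():
        for wrist in ("left_wrist", "right_wrist"):
            if wrist in joints:
                candidate = d2(hand_xy, joints[wrist])
                if candidate < best_d2:
                    best_id, best_d2 = person_id, candidate
    return best_id

# toy usage with two tracked customers
skeletons = {"cust_1": {"left_wrist": (2.0, 1.0)},
             "cust_2": {"right_wrist": (5.0, 1.2)}}
print(assign_hand_to_customer((4.8, 1.1), skeletons))  # -> cust_2
```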

The system also needs to accurately account for a world where the customer can put items back on the shelf. One of the problems here can be seen in the picture below (an actual feed picture from an Amazon Go store camera).

[Figure: Amazon Go store camera feed of a shelf where it appears an item was taken]

The obvious answer to the question is that an item was taken, but this is incorrect. Instead, a customer put an item back and pushed the remaining ones further back on the shelf. To solve for this, the system needs to count all the items on the shelf rather than using a simple assumption based on space.
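
A toy illustration of the count-based decision (the numbers are made up): comparing actual item counts before and after the interaction gives the right answer even when the shelf gap is misleading:

```python
def items_taken(count_before: int, count_after: int) -> int:
    """Decide take vs. return from item counts, not from shelf gaps.
    A negative result means items were put back."""
    return count_before - count_after

# the scene from the picture above: one item returned, row pushed back,
# so the gap at the front of the shelf grew even though nothing was taken
before, after = 4, 5
delta = items_taken(before, after)
print("took" if delta > 0 else "returned", abs(delta), "item(s)")  # returned 1
```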

As you can imagine, there are a massive number of poses people can be in when picking an object off the shelf, especially when you consider multiple customers in close proximity. There simply isn’t enough labeled data to train a model for each of these. Even with human labeling, it wouldn’t be possible to scale the training dataset (in terms of both money and time).

To solve for this, the team took on the ambitious project of generating synthetic activity data using simulators. Within these simulators, they needed to create virtual customers (including variations in clothing, hair, build, height, etc.), cameras, lighting and shadows, and simulate the same camera hardware limitations. However, the payoff was huge:

  1. The data is pre-annotated because it is generated, which makes simulated data roughly three orders of magnitude cheaper to annotate.
  2. The team could scale out the compute to generate data (and they had the AWS cloud to do so).
  3. The annotations are very consistent across frames, which is not the case with human annotators.
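
A minimal sketch of why simulated data comes pre-annotated: the simulator chooses every attribute it renders, so perfect labels fall out as a by-product instead of requiring a human pass. Everything here (attribute names, the renderer stub) is illustrative:

```python
import random

def render(attrs):
    """Stand-in for the actual simulator/renderer."""
    return f"<frame of a {attrs['clothing']} shopper doing {attrs['pose']}>"

def simulate_shopper():
    """Generate one synthetic training example with free, exact labels."""
    attrs = {
        "height_cm": random.randint(150, 200),
        "clothing": random.choice(["coat", "t_shirt", "hoodie"]),
        "pose": random.choice(["reach_high", "reach_low", "crouch"]),
    }
    image = render(attrs)
    return image, attrs  # image plus perfect annotations, no human labeling

dataset = [simulate_shopper() for _ in range(1000)]
```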

By using simulation to build a massive training set, the team was able to leverage the power of the cloud to train on months’ worth of data in a day, eliminating the time bottleneck and allowing rapid progress.

This is very similar to the techniques used by DeepMind to train AlphaStar, OpenAI to train OpenAI Five, and self-driving companies to train their driver models.

Physical Hardware & Infrastructure

Cameras

None of the Computer Vision magic can work without the video feeds. The initial challenge to solve was getting the video out of the store and to the cloud for processing. This system had the following components:

  1. Video capture with compute on board to do basic preprocessing and cut down the bandwidth requirements
  2. Video streamer appliance on site to handle video codecs, network issues, and guarantee delivery to the cloud
  3. Video servers on the cloud to capture and store video in S3 and Dynamo

Technical stuff: A key aspect of this stage in the pipeline is redundancy and anomaly detection to handle real-world failure scenarios across the system (camera, network, cloud infra, etc.) and provide resiliency.
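
As a sketch of one slice of that resiliency (retry with exponential backoff when a chunk fails to reach the cloud); the function and its failure handling are my illustration, not Amazon’s pipeline:

```python
import time

def deliver_chunk(chunk: bytes, send, max_retries: int = 5):
    """Deliver one video chunk, retrying with exponential backoff.

    `send` is any callable that raises on failure (camera drop, network
    blip, cloud error). A real system would also fail over to redundant
    paths and flag persistent failures to an anomaly detector.
    """
    for attempt in range(max_retries):
        try:
            return send(chunk)
        except Exception:
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
    raise RuntimeError("chunk undeliverable; raise an anomaly alert")
```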

Kiosks at entrance

The next challenge is detecting when people enter and exit the store to create what you can call a shopping session. This system has the following components:

  1. Mobile App to scan a QR code when you show up at the store. They spent a lot of time doing UX testing on this (scan with phone up or down, how to handle groups, etc.)
  2. Association System that ties your likeness in the video to your account, based on your position at the store entrance when you scan the QR code
  3. Session creation, which happens based on that association

When implementing the system, the team had to solve for a few additional scenarios. First, people might scan multiple times, so they had to delete any session with no items on a second scan.

A more difficult problem is customers (especially families) want to shop as a group but only have one person pay. To enable this, the head/payer scans the same code for each person as they enter the store. This creates a session that links all of the people in the group to the same account. From there, the people in the group can leave the store whenever they choose. By moving the session up to the group level, and treating individual shoppers as a ‘group of one’, the team was able to overcome this challenge and let individuals enter or exit the group at any time.
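
In data-structure terms, the “group of one” idea might look like the sketch below (the class and field names are mine):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Session:
    account_id: str                                   # the single paying account
    member_ids: List[str] = field(default_factory=list)
    items: List[str] = field(default_factory=list)

    def scan(self, person_id: str):
        """Each scan of the payer's QR code adds one member to the group;
        a lone shopper is simply a group of one."""
        self.member_ids.append(person_id)

    def member_exits(self, person_id: str) -> bool:
        """Members may leave independently; returns True once everyone has
        left and the session can close and bill account_id for `items`."""
        self.member_ids.remove(person_id)
        return not self.member_ids

# a family of three shopping on one account
s = Session(account_id="payer_123")
for person in ("payer", "kid_1", "kid_2"):
    s.scan(person)
s.member_exits("kid_1")                  # kids can leave early
s.member_exits("kid_2")
print(s.member_exits("payer"))           # -> True, session closes
```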


Based on my own research.

 
