How Smart Warehouse Picking Bots Navigate: A Simplified Explanation of Reinforcement Learning for Managers

“An artificial adaptive thinker sees the world through applied mathematics translated into machine representations” – Isaac Asimov

As autonomous bots move into warehouse operations, warehouse managers, whose working lives are already complicated, will also have to make sure that the bot fleet is doing its job.

They will obviously have help from an on-site team of engineers to troubleshoot or investigate when things go south. You are not required to know exactly how the bots work, but if you are technically curious, this article will explain in simple terms a Machine Learning (ML) algorithm that many bots use in some variation: reinforcement learning.

Dude, where is my self-driving car?

As I have mentioned in many of my posts, at a high level, an actual “Intelligent” AI solution will have the following capabilities:

  • Sense
  • Plan (interpret, analyze)
  • Execute (act)
  • Learn

Applied to smart bots, this means that for a robot to operate successfully in a given environment, it:

  • must make sense of it somehow,
  • plan its actions and
  • execute those plans using some means of actuation, while
  • using feedback to make sure everything proceeds according to plan.

While all of these come easily to humans, they are often incredibly challenging for a computer. We have recently seen an explosion in the use of the term “self-driving” and a great increase in applications of deep learning to various computer vision tasks. But why is a fully hands-off self-driving vehicle (forget about a self-driving supply chain) still years away, even though we keep hearing that the technology is “almost there”?

The hard fact is that the broad spectrum of skills required to make sense of a typical pedestrian street scene is still not within the grasp of current technology. Yes, despite all the advances. This is called the “long tail” of training. Remember, in simple terms, Artificial Intelligence algorithms are not “intelligent” in real terms. They have to be exposed to the situations they will encounter in their role and trained to respond to those scenarios. And you can imagine that the number of scenarios a self-driving car algorithm can encounter is, well, limitless.

Robotics in Supply Chain domain

Manufacturing was one of the early adopters of vanilla robotics. The reason is simple: there is a vast difference between operating on an assembly line and operating in the street.

In the assembly line, everything in the environment can usually be precisely controlled, and the tasks needed to be performed by the robot are often very specific and narrow. The robot is not doing any thinking of its own or making any decisions. It is pre-programmed to take certain actions.

But even for these vanilla robots, designing the motion planning and control algorithms is a long and tedious process that requires the combined efforts of many domain experts. This makes it very costly, which explains the vast gap between our current capabilities and those required for actually intelligent robots that need to operate in much more general environments and perform an array of tasks.

Reinforcement learning in Supply Chains

The successes of deep learning and reinforcement learning in recent years have led many researchers to develop methods to control robots using RL. The motivation is obvious: can we automate the process of designing sensing, planning and control algorithms by letting the robot learn them autonomously?

If we could, we would solve two of our problems at once:

  • Save the time and energy we spend on designing algorithms for the problems that we know how to solve today (industrial robots) and
  • Gain solutions to those harder problems that we have no current solution for.

Starting with the basics: learn like a baby

Reinforcement learning is all about learning from interaction and from experience, just like human babies do.

As babies and kids, we don’t know everything about how the world around us works. It is only after years of interaction with the world that we start to understand how it responds to our actions, and only then can we take specific actions to achieve our goals. That process, at a high level, is depicted in the illustration below, where the “agent” is the baby, trying to learn from the environment through action and outcome.


This is the high level logic of reinforcement learning. In reinforcement learning, we have an agent interacting with an environment. At each time step, the agent receives the environment’s current state, and the agent must choose an appropriate action in response. After the agent executes the action, the agent receives a reward and a new state.
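For the technically curious, that loop can be sketched in a few lines of Python. This is only a minimal illustration: the `CorridorEnv` toy environment (a bot moving along a five-cell corridor), the reward of 1.0 at the last cell and the `run_episode` helper are all made up for this sketch and do not come from any real bot software.

```python
class CorridorEnv:
    """Hypothetical toy environment: cells 0..4 in a corridor, goal at cell 4."""

    def __init__(self):
        self.state = 0

    def reset(self):
        # Put the bot back at the start and report the starting state.
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (move left) or +1 (move right); walls clamp the move.
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else 0.0  # assumed reward: 1.0 only at the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run the loop: observe state, choose action, receive reward and new state."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent chooses an action
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward
```

With a policy that always moves right, `run_episode(CorridorEnv(), lambda state: 1)` reaches the goal and collects the single reward. A policy that always moves left never leaves the first cell and collects nothing.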

Reinforcement learning is the training of machine learning models to make a sequence of decisions. The agent learns to achieve a goal in an uncertain, potentially complex environment. In reinforcement learning, an artificial intelligence faces a game-like situation. The computer employs trial and error to come up with a solution to the problem. To get the machine to do what the programmer wants, the artificial intelligence gets either rewards or penalties for the actions it performs. Its goal is to maximize the total reward.

Although the designer sets the reward policy (that is, the rules of the game), the designer gives the model no hints or suggestions for how to solve the game. It’s up to the model to figure out how to perform the task to maximize the reward, starting from totally random trials and, ideally, finishing with sophisticated tactics and superhuman skills.

Our Example Scenario: Warehouse picking bots


Let’s say you are developing an algorithm for “smart” picking bots for a distribution warehouse. The warehouse is divided into areas represented by the letters A to F, as shown in the simplified map in the following diagram.


You are developing a prototype algorithm for a bot guiding system. You have started with a scenario in which the system helps a bot get from Aisle A to Aisle F. The guiding system’s state indicates the complete path to reach F.

Understanding how they work

To understand how the bot will plan its navigation, let us break things down further. Below is some terminology we will use:


  1. s is the bot’s current state: its position, in this case Aisle A;
  2. a is the action the bot needs to decide on, which is to go to the next area;
  3. s’ is the new state the bot will be in there, as a result of action a.
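The three terms above can be written down as a tiny data structure. Since the diagram is not reproduced here, the `AISLE_MAP` connections below are an assumed layout, chosen only to be consistent with the moves mentioned in this article. `available_actions` and `transition` are hypothetical helper names.

```python
# Assumed layout: which areas can be reached directly from each area.
AISLE_MAP = {
    "A": ["D", "E"],
    "B": ["D", "F"],
    "C": ["E"],
    "D": ["A", "B", "F"],
    "E": ["A", "C"],
    "F": ["B", "D"],
}


def available_actions(s):
    """Actions open to the bot in state s: the areas it can move to next."""
    return AISLE_MAP[s]


def transition(s, a):
    """Apply action a (move to area a) in state s and return the new state s'."""
    if a not in AISLE_MAP[s]:
        raise ValueError(f"cannot go directly from {s} to {a}")
    return a  # in this simple model the new state is the area moved to


s = "A"                    # current state
a = "D"                    # chosen action: go to area D
s_next = transition(s, a)  # new state s'
```

In this simple model an action is just the name of the next area, so the new state equals the action. A real bot would track much more in its state, such as battery level or load.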

Here is how a typical bot algorithm works in this scenario.

The bot reviews where it needs to go (in this case, the location of an item it needs to pick) and checks its mapping environment, which represents all the aisles in the diagram above, from A to F.

Now let us assume that the bot is going to make this decision for the first time. How should the bot decide how to get from A to F? The bot uses a reward system: its decision on what action to take next is based on how much “reward” it will get if it takes that action.

Since it cannot eat a piece of cake to reward itself, the bot uses numbers. Our bot is a real number cruncher. When it is wrong, it gets a poor reward or nothing in this model. When it’s right, it gets a reward represented by the letter R.

This action-value (reward) function, often named the Q function in reinforcement learning, estimates how much reward the bot can expect from taking a given action in a given state. It is the core of many reinforcement learning algorithms.
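One common way to learn a Q function is tabular Q-learning, where the bot keeps a table of values for each (state, action) pair and nudges them after every move. I am not claiming this is the variant any particular bot uses. It is a minimal sketch, and the learning rate `alpha`, the discount `gamma` and the reward of 10 for reaching F are all assumed values.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.5, gamma=0.9):
    """One tabular Q-learning update for the pair (state, action).

    alpha (learning rate) and gamma (discount) are assumed example values.
    """
    # Best value the bot currently expects from the next state (0 if terminal).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    # Move the old estimate part of the way toward the new target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)  # every Q value starts at 0.0

# The bot moves D -> F and receives an assumed reward of 10 at the goal.
q_update(q, "D", "F", reward=10.0, next_state="F", next_actions=[])

# Later it moves A -> D with no immediate reward; D now "looks good"
# because the table already knows that D -> F pays off.
q_update(q, "A", "D", reward=0.0, next_state="D", next_actions=["F", "B"])
```

After these two updates the value of going from D to F is positive, and part of it has already flowed back to the earlier move from A to D. That is how the bot gradually learns which first steps lead to good outcomes.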

When our bot goes from one state to another, it performs a transition and gets a reward. For example, the transition can be from A to E, state 1 to state 2, or s1 to s2.

The bot’s objective is to maximize the reward number. The reward can be as simple as this: if the bot travels the minimum distance to get from one point to another, it is assigned a reward value. In this case, what your bot is watching is the total distance from A to F to check whether things are OK. That means that the agent is evaluating all the states on the way from A to F.
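A distance-based reward like this can be sketched as the negative of the total distance traveled, so that a shorter route earns a higher (less negative) reward. The distances in `DISTANCES` are invented for illustration, and `route_reward` is a hypothetical helper.

```python
# Assumed distances in metres between directly connected areas.
DISTANCES = {
    ("A", "D"): 4,
    ("D", "F"): 3,
    ("D", "B"): 2,
    ("B", "F"): 3,
}


def route_reward(route, distances=DISTANCES):
    """Reward for a whole route: the negative of its total distance."""
    total = 0
    for u, v in zip(route, route[1:]):
        # Aisles can be walked in either direction, so check both orders.
        total += distances[(u, v)] if (u, v) in distances else distances[(v, u)]
    return -total


direct = route_reward(["A", "D", "F"])       # 4 + 3 metres
detour = route_reward(["A", "D", "B", "F"])  # 4 + 2 + 3 metres
```

Here `direct` is larger than `detour`, so a bot that maximizes reward prefers the route through D straight to F, as long as that route is open.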

Say that the bot’s algorithm tells it that it can go from A to D to F. The bot takes action a to reach the next state, D. Once the bot reaches D, it knows that F is the best next state, because the reward for going to F is higher than for going anywhere else.

But say that a cluster of bots and a human picker suddenly gather between D and F, blocking the straight-line path from D to F. The agent takes the congestion into account and reroutes, still chasing the reward for minimizing the distance traveled, under the new conditions. So our bot now recalculates its route with the new constraint that it cannot go directly from D to F. Again using the reward system, but with the straight path from D to F unavailable, the bot calculates that going to state B is the best option, and from there it can go from B to F.
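The rerouting step can be sketched as follows. To keep the example short and repeatable, it uses value iteration (a planning method that applies the same "maximize reward" idea to a known map) instead of trial-and-error learning. The layout and distances in `DISTANCES` are assumptions, and `build_graph` and `best_route` are hypothetical helpers.

```python
# Assumed distances in metres between directly connected areas.
DISTANCES = {
    ("A", "D"): 4, ("A", "E"): 3, ("B", "D"): 2,
    ("B", "F"): 3, ("C", "E"): 2, ("D", "F"): 3,
}


def build_graph(distances, blocked=()):
    """Build a two-way map, leaving out any congested (blocked) links."""
    graph = {}
    for (u, v), d in distances.items():
        if (u, v) in blocked or (v, u) in blocked:
            continue
        graph.setdefault(u, {})[v] = d
        graph.setdefault(v, {})[u] = d
    return graph


def best_route(graph, start, goal):
    """Greedy route under values computed with reward = negative distance."""
    if start not in graph or goal not in graph:
        return None
    # value[s] = best total reward still obtainable from state s.
    value = {s: float("-inf") for s in graph}
    value[goal] = 0.0
    for _ in range(len(graph)):  # enough sweeps for a map this small
        for s in graph:
            if s != goal:
                value[s] = max(-d + value[n] for n, d in graph[s].items())
    if value[start] == float("-inf"):
        return None  # the goal cannot be reached at all
    route = [start]
    while route[-1] != goal:
        s = route[-1]
        # Pick the next area with the highest reward-plus-future-value.
        route.append(max(graph[s], key=lambda n: -graph[s][n] + value[n]))
    return route


normal = best_route(build_graph(DISTANCES), "A", "F")
congested = build_graph(DISTANCES, blocked=[("D", "F")])
rerouted = best_route(congested, "D", "F")
```

With these assumed distances, `normal` goes from A through D to F. Once the link between D and F is blocked, `rerouted` sends the bot from D to B and then to F, which matches the behavior described above.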

The logic is very similar to what self-driving cars use. A totally hands-off autonomous bot is more realistic than an autonomous car because these bots operate in a controlled environment, within the four walls of a warehouse or manufacturing plant, where you can actually train the bots on the unique situations they will run into while navigating the warehouse floor.

If you think about the gamut of “unique” situations a self-driving car will encounter in a busy landscape (say, downtown New York), it is almost infinite. That is why I believe that true, fully autonomous cars are decades away, if they become a possibility at all.

Any views expressed are my own.
