A Soft Introduction to Artificial Neural Networks (ANN) for Managers

The core of Deep Learning (DL)

ANNs, which are inspired by how a human brain works, form the core of DL and its true realization. Today’s revolution around DL would not have been possible without ANNs. Thus, to understand DL, we need to understand how neural networks work.

The mostly automated Supply Chains of the future that we envision today can ONLY be achieved through leveraging Artificial Neural Network methods. No amount of predictive machine learning methods can provide us the leverage that we need to develop Supply Chains of the future.

And this is why you need to have a basic understanding of what Artificial Neural Networks are and where they can be leveraged.

ANN and the human brain

ANNs mimic one aspect of the human nervous system: it consists of a large number of neurons that communicate with each other through axons (see the illustration below). Receptors receive stimuli, either internally or from the external world, and pass this information to the biological neurons for further processing. Each neuron has a number of dendrites, plus one long extension called the axon. Toward the axon’s extremities, there are minuscule structures called synaptic terminals, which connect one neuron to the dendrites of other neurons. Biological neurons receive short electrical impulses, called signals, from other neurons and, in response, trigger their own signals.

We can, therefore, summarize that the neuron comprises a cell body (also known as the soma), one or more dendrites for receiving signals from other neurons, and an axon for carrying out the signals that are generated by the neurons. A neuron is in an active state when it is sending signals to other neurons. However, when it is receiving signals from other neurons, it is in an inactive state.

In an idle state, a neuron accumulates all the signals that are received before reaching a certain activation threshold. This whole process motivated researchers to test out ANNs. 

How does an ANN learn?

Based on the concept of biological neurons, the term and idea of ANNs arose. Similar to biological neurons, the artificial neuron consists of the following:

  • One or more incoming connections that aggregate signals from neurons
  • One or more output connections for carrying the signal to the other neurons
  • An activation function, which determines the numerical value of the output signal

Before we move further, here is a high-level comparison infographic (from towardsdatascience.com):


Assigned weight is a key aspect

Just like neurons in biological neural networks, the neurons in an ANN have to get “activated” before they pass a signal on. But a very important aspect of ANNs is the assigned weights, which influence the connections within the network.

Each weight has a numerical value indicated by wij, which is the synaptic weight connecting neuron i to neuron j. Now, for each neuron i, an input vector can be defined by xi = (x1, x2, …, xn), and a weight vector can be defined by wi = (wi1, wi2, …, win). An example is shown in the illustration below (source: Analytics Vidhya):


Before we delve deeper, let us touch upon forward propagation.

Forward Propagation, Depth and Width

Now, referring again to the illustration above, depending on the position of a neuron, the weights and the output function determine the behavior of an individual neuron. Then, during forward propagation, each unit in the hidden layer gets the following signal:
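In the notation introduced above, this signal can be written as the weighted sum below. This is a reconstruction under assumptions: the name net_i for the signal reaching unit i is chosen here for illustration and does not come from the original figure.

```latex
\[
  \mathrm{net}_i = \sum_{j=1}^{n} w_{ij}\, x_j
\]
```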


Nevertheless, among the weights, there is also a special type of weight called a bias unit, b. Technically, bias units aren’t connected to any previous layer, so they don’t have true activity. But still, the bias b value allows the neural network to shift the activation function to the left or right. By taking the bias unit into consideration, the modified network output is formulated as follows:
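Assuming the same notation as before (inputs x_j, weights w_ij, and the illustrative name net_i for the signal reaching unit i), adding the bias b gives a formula of roughly this form:

```latex
\[
  \mathrm{net}_i = \sum_{j=1}^{n} w_{ij}\, x_j + b
\]
```

Changing b moves the point at which the activation function switches on, without changing any of the weights.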


The preceding equation signifies that each hidden unit gets the weighted sum of its inputs, that is, each input multiplied by its corresponding weight and then added up. This is known as the Summing junction. The result of the Summing junction is then passed through the activation function, which squashes the output, as depicted in the following diagram (from Machinelearningmastery.com):
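For readers who like to see the arithmetic, here is a minimal Python sketch of a single artificial neuron: a summing junction followed by an activation function. The function names, the example numbers, and the choice of the sigmoid as the squashing function are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias):
    """Summing junction followed by the activation function."""
    # Weighted sum of the inputs, shifted by the bias.
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The activation function squashes the summed signal.
    return sigmoid(net)


# Illustrative values only.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1
print(neuron_output(inputs, weights, bias))
```

Here the weighted sum works out to about -0.4, so the neuron's output lands a little below 0.5.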


Activation functions

To allow a neural network to learn complex decision boundaries, we apply a non-linear activation function to some of its layers. Commonly used functions include Tanh, ReLU, softmax, and variants of these.

More technically, each neuron receives as its signal the sum of the activation values of the neurons connected to it as input, each multiplied by the corresponding synaptic weight.

One of the most widely used functions for this purpose is the so-called sigmoid logistic function, which is defined as follows:
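The standard form of the sigmoid logistic function is shown below. The names net_i (the summed signal reaching neuron i) and out_i (its output) are assumptions used here for illustration.

```latex
\[
  \mathrm{out}_i = \sigma(\mathrm{net}_i) = \frac{1}{1 + e^{-\mathrm{net}_i}}
\]
```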


More on Activation functions is in the appendix below.

A practical neural network architecture, however, is composed of input, hidden, and output layers that are composed of nodes that make up a network structure. It still follows the working principle of an artificial neuron model, as shown in the preceding section.
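As a rough sketch of how the layers chain together, the Python snippet below passes three inputs through one hidden layer of two neurons and one output neuron. The weights, biases, layer sizes, and the use of a sigmoid in every layer are made-up assumptions, not values from a real model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Compute the output of every neuron in one layer."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def network_forward(inputs, layers):
    """Feed the inputs through each layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 3 inputs -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.4, 0.1], [0.3, 0.8, -0.6]], [0.0, 0.1]),  # hidden layer
    ([[1.2, -0.7]], [0.05]),                              # output layer
]
print(network_forward([1.0, 0.5, -1.5], layers))
```

Each layer simply repeats the single-neuron calculation, using the previous layer's outputs as its inputs.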


The hidden layers perform most of the computation needed to learn the patterns. The network then evaluates how accurate its prediction is compared to the actual output, using a special mathematical function called the loss function. Typically, with neural networks, we seek to minimize the error. As such, the objective function is often referred to as a cost function or a loss function, and the value calculated by the loss function is referred to simply as the loss.

Loss Function: The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function.

It could be a complex one or a very simple mean squared error, which can be defined as follows:
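In its usual form, the mean squared error over N training examples looks like this. The index k and the count N are assumptions introduced for illustration; Ŷ denotes the network's prediction and Y the actual output.

```latex
\[
  \mathrm{MSE} = \frac{1}{N} \sum_{k=1}^{N} \left( \hat{Y}_k - Y_k \right)^2
\]
```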


In the preceding equation, Ŷ is the prediction made by the network, while Y represents the actual or expected output.
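A small worked example may help. The sketch below computes the mean squared error for three made-up predictions; the function name and the numbers are illustrative assumptions.

```python
def mean_squared_error(predicted, actual):
    """Average of the squared differences between prediction and target."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sum(squared_errors) / len(squared_errors)


# Illustrative values: two predictions are off by 0.5, one is exact.
y_hat = [2.5, 0.0, 2.0]
y = [3.0, -0.5, 2.0]
print(mean_squared_error(y_hat, y))
```

The squared errors are 0.25, 0.25, and 0, so the loss is 0.5 divided by 3. A perfect set of predictions would give a loss of exactly zero.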

Finally, when the error is no longer being reduced, the neural network converges and makes a prediction through the output layer.

Training a neural network

The learning process for a neural network is an iterative optimization of the weights, which are updated in each epoch. Once training starts, the aim is to make the predictions more accurate by minimizing the loss function. The performance of the network is then evaluated on the test set. We already know about the simple concept of an artificial neuron. However, generating only some artificial signals is not enough to learn a complex task. As such, a commonly used supervised learning algorithm, backpropagation, is very often used to train a complex ANN.

Backpropagation, short for “backward propagation of errors,” is an algorithm for supervised learning of artificial neural networks using gradient descent (covered below). Given an artificial neural network and an error function, the method calculates the gradient of the error function with respect to the neural network’s weights. Essentially, this approach makes the network work backward through all its layers, updating the weights and biases of each node in the direction opposite to the gradient of the loss function.

Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model.
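To make the idea concrete, here is a minimal sketch of gradient descent on a toy function with a single parameter, f(w) = (w - 3)^2, whose minimum is at w = 3. The function, the learning rate, and the number of steps are assumptions picked for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(steps):
        # Move in the direction of steepest descent.
        w -= learning_rate * gradient(w)
    return w


def toy_gradient(w):
    """Derivative of the toy function f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w_min = gradient_descent(toy_gradient, start=0.0)
print(w_min)  # ends up very close to 3
```

A neural network does the same thing, except that w stands for all of its weights and biases at once.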

More on Gradient Descent is in the Appendix section of the article below.

So at this point in the post, we can comfortably say that ultimately:

training a neural network is an optimization problem, too, in which we try to minimize the error by adjusting network weights and biases iteratively, by using backpropagation through gradient descent (GD).
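The sketch below puts these pieces together for the smallest possible case: a single sigmoid neuron that learns the logical OR of two inputs. It assumes a squared-error loss, full-batch gradient descent, and hand-picked starting weights, learning rate, and epoch count; a real network repeats the same chain-rule step layer by layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, weights, bias):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train(samples, learning_rate=1.0, epochs=5000):
    """Fit one sigmoid neuron by gradient descent on the mean squared error."""
    weights = [0.1, -0.1]  # arbitrary small starting values (assumption)
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for inputs, target in samples:
            out = forward(inputs, weights, bias)
            # Chain rule: d(loss)/d(net) for loss = (out - target)^2.
            delta = 2.0 * (out - target) * out * (1.0 - out)
            for k, x in enumerate(inputs):
                grad_w[k] += delta * x / n
            grad_b += delta / n
        # Step against the gradient.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Training data for logical OR.
or_samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train(or_samples)
for inputs, target in or_samples:
    print(inputs, target, round(forward(inputs, weights, bias), 3))
```

After training, the outputs for the three "true" cases should sit close to 1 and the output for [0, 0] close to 0.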

Weight and bias initialization

Now, here’s a tricky question: how do we initialize the weights? Well, if we initialize all the weights to the same value (for example, 0 or 1), each hidden neuron will get the same signal. Let’s try to break it down:

  • If all weights are initialized to 1, then each unit gets a signal equal to the sum of the inputs.
  • If all weights are 0, which is even worse, then every neuron in a hidden layer will get zero signal.
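The two cases above can be checked with a few lines of Python. The input values and layer size are made up for illustration; the point is only that every hidden unit receives exactly the same signal when all weights are equal.

```python
def hidden_signals(inputs, weight_matrix):
    """Weighted sum reaching each hidden unit (one row of weights per unit)."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weight_matrix]


inputs = [0.2, 0.5, 0.3]
all_ones = [[1.0] * len(inputs) for _ in range(4)]   # 4 hidden units
all_zeros = [[0.0] * len(inputs) for _ in range(4)]

print(hidden_signals(inputs, all_ones))   # every unit gets the sum of the inputs
print(hidden_signals(inputs, all_zeros))  # every unit gets zero
```

Because identical units also receive identical updates during training, they would stay identical, which is why the weights need to start out different from one another.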

For network weight initialization, Xavier initialization is widely used. It is similar to random initialization, but often turns out to work much better, since it sets the scale of the initial weights according to the total number of input and output neurons of a layer. You may be wondering whether you can get rid of random initialization while training a regular DNN. You cannot: some randomness is still needed so that the neurons in a layer start out different from one another.

Xavier Initialization: In short, it helps signals reach deep into the network.

  • If the weights in a network start too small, then the signal shrinks as it passes through each layer until it’s too tiny to be useful.
  • If the weights in a network start too large, then the signal grows as it passes through each layer until it’s too massive to be useful.

Xavier initialization makes sure the weights are ‘just right’, keeping the signal in a reasonable range of values through many layers.
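One common variant, the uniform form of Xavier (also called Glorot) initialization, draws each weight at random from a range whose width depends on the number of input and output neurons of the layer. The sketch below assumes that variant; the function name, layer sizes, and seed are illustrative.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng=None):
    """Draw a fan_out x fan_in weight matrix from U(-limit, +limit)."""
    rng = rng or random.Random()
    # The more connections a layer has, the smaller each weight should be.
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [
        [rng.uniform(-limit, limit) for _ in range(fan_in)]
        for _ in range(fan_out)
    ]


# Hypothetical layer with 100 inputs and 50 outputs.
layer_weights = xavier_uniform(100, 50, rng=random.Random(42))
print(len(layer_weights), len(layer_weights[0]))
print(math.sqrt(6.0 / 150))  # the limit used for this layer
```

For this layer every weight falls within plus or minus 0.2, while a layer with fewer connections would be allowed larger starting weights.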

Where can you leverage them in your Supply Chain?

Supply Chains need to leverage Deep Learning methodologies more intensively than other functions such as Marketing, because Supply Chains, as they stand today, have so many manual touch points. I will be posting a separate article this week on why Deep Learning (DL) is more important than Machine Learning (ML) in the Supply Chain context.

Some of the ways you can leverage Neural Networks in Supply Chains are indicated in this article on my Blog.

A Supply Chain Executive’s summary of Deep Learning : With 15+ innovative application opportunities across Supply Chain


Article based on Individual knowledge and Expertise

Article Appendix

Activation functions

To recap from the main article, non-linear activation functions such as Tanh, ReLU, and softmax let a neural network learn complex decision boundaries, and each neuron applies its activation function to the weighted sum of the activation values of the neurons connected to it as input. One of the most widely used choices is the sigmoid logistic function, σ(x) = 1 / (1 + e^(−x)).


The domain of this function includes all real numbers, and its range is (0, 1). This means that any value obtained as an output from a neuron (as per the calculation of its activation state) will always be between zero and one. The Sigmoid function, as represented in the following diagram, provides an interpretation of the saturation rate of a neuron, from not being active (output approaching 0) to complete saturation (output approaching the maximum value of 1):


Sigmoid versus Tanh activation function

The hyperbolic tangent, or Tanh, is another form of activation function. Tanh squashes a real-valued number into the range between -1 and 1. The preceding graph shows the difference between the Tanh and Sigmoid activation functions. Mathematically speaking, the tanh activation function can be expressed as follows:
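The standard definition of the hyperbolic tangent is shown below, with x standing for any real-valued input (the symbol is chosen here for illustration).

```latex
\[
  \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
\]
```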


In general, in the last layer of a feedforward neural network (FFNN), the softmax function is applied as the decision boundary. This is a common case, especially when solving a classification problem. The softmax function produces a probability distribution over the possible classes in a multiclass classification problem. To conclude, choosing proper activation functions and a proper network weight initialization are two decisions that help a network train well and perform at its best.
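As a minimal sketch, softmax can be written in a few lines of Python. The scores are made-up numbers, and subtracting the largest score before exponentiating is a common numerical safeguard added here, not something the article describes.

```python
import math


def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three classes.
probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)
print(sum(probabilities))
```

The class with the highest raw score always receives the highest probability, and the probabilities add up to 1.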

Gradient Descent

The process using GD described in the article above does not guarantee that the global minimum is reached. The presence of hidden units and the non-linearity of the output function mean that the behavior of the error is very complex and has many local minima. The backpropagation step is typically performed thousands or millions of times, using many training batches, until the model parameters converge to values that minimize the cost function. The training process ends when the error on the validation set begins to increase, because this could mark the beginning of a phase of overfitting:


Searching for the minimum of the error function E, we move in the direction opposite to the gradient G of E, that is, the direction in which E decreases most steeply. The downside of plain GD is that every update uses the whole training set, so it can take too long to converge and becomes impractical for large-scale training data. Therefore, a faster variant, called stochastic gradient descent (SGD), was proposed; it is also a widely used optimizer in DNN training. In SGD, we use only one training sample per iteration from the training set to update the network parameters, which gives a stochastic approximation of the true cost gradient.
There are other advanced optimizers nowadays, such as Adam, RMSProp, AdaGrad, and Momentum. Each of them is, directly or indirectly, an optimized version of SGD.
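The sketch below illustrates SGD on a deliberately tiny problem: fitting a straight line y = w·x + b to five points that lie exactly on y = 2x + 1. The data, learning rate, iteration count, and seed are assumptions chosen for illustration; the key detail is that each update looks at a single randomly chosen sample.

```python
import random


def sgd_fit_line(data, learning_rate=0.05, iterations=5000, seed=0):
    """Fit y = w * x + b using one random sample per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(iterations):
        x, y = rng.choice(data)          # one training sample per iteration
        error = (w * x + b) - y
        # Gradient of the squared error for this single sample.
        w -= learning_rate * 2.0 * error * x
        b -= learning_rate * 2.0 * error
    return w, b


# Made-up, noise-free data on the line y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
w, b = sgd_fit_line(data)
print(round(w, 3), round(b, 3))
```

Each step is cheap because it touches one sample, which is what makes the approach attractive for large training sets.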

