## The core of Deep Learning (DL)

Artificial Neural Networks (ANNs), which are inspired by how the human brain works, form the core of DL and are what make it work in practice. Today’s revolution around DL would not have been possible without ANNs. Thus, to understand DL, we need to understand how neural networks work.

The mostly automated Supply Chains of the future that we envision today can ONLY be achieved through leveraging Artificial Neural Network methods. No amount of predictive machine learning methods can provide us the leverage that we need to develop Supply Chains of the future.

## ANN and the human brain

Biological neurons communicate with each other through **axons** (see the illustration below). The receptors receive stimuli either internally or from the external world and pass this information to the biological neurons for further processing. Each neuron has a number of **dendrites**, in addition to another long extension called the axon. Toward the axon’s extremities, there are minuscule structures called **synaptic terminals**, which connect one neuron to the dendrites of other neurons. Biological neurons receive short electrical impulses, called signals, from other neurons and, in response, trigger their own signals.

We can, therefore, summarize that the neuron comprises a cell body (also known as the **soma**), one or more dendrites for receiving signals from other neurons, and an axon for carrying the signals that the neuron generates to other neurons. A neuron is in an active state when it is sending signals to other neurons. However, when it is receiving signals from other neurons, it is in an inactive state.

In an idle state, a neuron accumulates all the signals that are received before reaching a certain activation threshold. This whole process motivated researchers to test out ANNs.

## How does an ANN learn?

Based on the concept of biological neurons, the term and idea of ANNs arose. Similar to biological neurons, the artificial neuron consists of the following:

- One or more incoming connections that aggregate signals from neurons
- One or more output connections for carrying the signal to the other neurons
- An activation function, which determines the numerical value of the output signal

Before we move further, here is a high-level comparison infographic (from towardsdatascience.com):

**Assigned weight is a key aspect**

Just like biological neural networks, the neurons in an ANN have to get “activated” before they pass a signal on. But a very important aspect of ANNs is the **assigned weights**, which determine how strongly each connection influences the network.

Each weight has a numerical value indicated by $w_{ij}$, which is the synaptic weight connecting neuron $i$ to neuron $j$. Now, for each neuron $i$, an input vector can be defined by $x_i = (x_1, x_2, \ldots, x_n)$, and a weight vector can be defined by $w_i = (w_{i1}, w_{i2}, \ldots, w_{in})$. An example is shown in the illustration below (source: Analytics Vidhya):

Before we delve deeper, let us touch upon forward propagation.

## Forward Propagation, Depth and Width

The input $X$ provides the initial information, which then propagates to the hidden units at each layer and finally produces the output $\hat{y}$. The architecture of the network entails determining its depth, width, and the activation functions used on each layer. **Depth** is the number of hidden layers. **Width** is the number of units (nodes) on each hidden layer, since we control neither the input layer nor the output layer dimensions.
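To make depth and width concrete, here is a minimal sketch that reads both off a list of layer sizes. The function name `describe_architecture` and the example sizes are assumptions made for illustration only.

```python
def describe_architecture(layer_sizes):
    """Return depth and width for a network described by its layer sizes.

    layer_sizes lists the number of units per layer, input layer first
    and output layer last (a hypothetical convention for this sketch).
    """
    hidden = layer_sizes[1:-1]  # everything between input and output
    return {"depth": len(hidden), "width": hidden}


# A made-up network: 4 inputs, two hidden layers of 8 units, 3 outputs.
arch = describe_architecture([4, 8, 8, 3])
print(arch)  # {'depth': 2, 'width': [8, 8]}
```

Only the hidden layers count here, because the input and output sizes are dictated by the data and the task.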

Now, referring again to the illustration above, the weights and the output function determine the behavior of an individual neuron, depending on its position in the network. Then, during forward propagation, each unit $i$ in the hidden layer gets the following signal:

$$\mathrm{net}_i = \sum_{j=1}^{n} w_{ij} x_j$$

Nevertheless, among the weights, there is also a special type of weight called a bias unit, $b$. Technically, bias units aren’t connected to any previous layer, so they don’t have true activity. But still, the bias value $b$ allows the neural network to shift the activation function to the left or right. By taking the bias unit into consideration, the modified network output is formulated as follows:

$$\mathrm{net}_i = \sum_{j=1}^{n} w_{ij} x_j + b$$

The preceding equation signifies that each hidden unit gets the sum of inputs, multiplied by the corresponding weight—this is known as the **Summing junction**. Then, the resultant output in the **Summing junction** is passed through the **activation function, which squashes the output,** as depicted in the following diagram (from Machinelearningmastery.com):
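The summing junction and the activation step can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function names and the input values are assumptions, and the sigmoid is used as the squashing function.

```python
import math


def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias):
    """Compute the output of one artificial neuron."""
    # Summing junction: weighted sum of the inputs plus the bias.
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Activation function: squash the summed signal.
    return sigmoid(net)


# net = 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0.0) = 0.5
print(neuron_output([1.0, 2.0], [0.5, -0.25], 0.0))  # 0.5
```

Raising the bias pushes the output toward 1 for the same inputs, which is the "shift" the bias unit provides.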

### Activation functions

More technically, each neuron receives a signal of the weighted sum of the synaptic weights and the activation values of the neurons that are connected as input.

One of the most widely used functions for this purpose is the so-called sigmoid logistic function, which is defined as follows:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

More on activation functions is in the appendix below.

A practical neural network architecture, however, is composed of **input, hidden, and output layers** that are composed of nodes that make up a network structure. It still follows the working principle of an artificial neuron model, as shown in the preceding section.
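As a rough sketch of how a signal moves through input, hidden, and output layers, the following code chains several layers of artificial neurons. The 2-2-1 layout, the parameter values, and the helper names are all made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Compute the activations of one layer.

    weights[i] holds the incoming weights of unit i in this layer.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, layers):
    """Propagate x through a list of (weights, biases) pairs."""
    activation = x
    for weights, biases in layers:
        activation = layer_forward(activation, weights, biases)
    return activation


# Hypothetical network: 2 inputs, 2 hidden units, 1 output unit.
hidden = ([[0.1, 0.4], [-0.3, 0.2]], [0.0, 0.1])
output = ([[0.7, -0.5]], [0.05])
y_hat = forward([1.0, 0.5], [hidden, output])
print(y_hat)
```

Each layer reuses the same single-neuron rule, so a deeper network is just a longer list of `(weights, biases)` pairs.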

The **hidden layers** perform most of the computation to learn the patterns, and the network evaluates how accurate its prediction is compared to the actual output using a special mathematical function called the **loss function**. Typically, with neural networks, we seek to minimize the error. As such, the objective function is often referred to as a cost function or a loss function, and the value calculated by the loss function is referred to as simply **“loss.”**

**Loss function:** The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function.

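The loss function could be a complex one or a very simple mean squared error, which can be defined as follows:

$$E = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{Y}_i - Y_i\right)^2$$

In the preceding equation, $\hat{Y}$ is the prediction made by the network, $Y$ represents the actual or expected output, and $n$ is the number of samples.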
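Mean squared error averages the squared differences between predictions and expected outputs. A minimal sketch, with made-up numbers and an assumed function name:

```python
def mean_squared_error(y_pred, y_true):
    """Average of the squared differences between predictions and targets."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(y_pred, y_true)]
    return sum(squared_errors) / len(squared_errors)


# Two perfect predictions and one that is off by 2: (0 + 0 + 4) / 3
loss = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(loss)
```

Squaring the error penalizes large misses much more than small ones, which is one reason this loss is a common default for regression-style outputs.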

Finally, when the error is no longer being reduced, the neural network **converges** and **makes a prediction through the output layer**.

## Training a neural network

Once training starts, the aim is to **generate predictions by minimizing the loss function**.

The performance of the network is then evaluated on the test set. We already know about the simple concept of an artificial neuron. However, generating only some artificial signals is not enough to learn a complex task. As such, a commonly used supervised learning algorithm is the **backpropagation algorithm**, which is very often used to train a complex ANN.

Backpropagation, short for “backward propagation of errors,” is an algorithm for supervised learning of artificial neural networks using gradient descent (we will cover gradient descent below). Given an artificial neural network and an error function, the method calculates the gradient of the error function with respect to the neural network’s weights. Essentially, this approach forces the network to backtrack through all its layers and update the weights and biases across nodes in the direction opposite to the gradient of the loss function.
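To show what “calculating the gradient with respect to the weights” looks like, here is a sketch of backpropagation on a deliberately tiny network: one input, one sigmoid hidden unit, and one linear output unit. The network shape, the parameter values, and the squared-error loss `0.5 * (y_hat - target) ** 2` are assumptions chosen to keep the chain rule readable.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(params, x, target):
    """Forward pass followed by a squared-error loss."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y_hat = w2 * h + b2
    return 0.5 * (y_hat - target) ** 2


def backprop(params, x, target):
    """Return the gradient of the loss for [w1, b1, w2, b2]."""
    w1, b1, w2, b2 = params
    # Forward pass, keeping the intermediate values.
    h = sigmoid(w1 * x + b1)
    y_hat = w2 * h + b2
    # Backward pass: apply the chain rule from the output toward the input.
    d_y = y_hat - target         # dL/dy_hat
    d_w2 = d_y * h               # dL/dw2
    d_b2 = d_y                   # dL/db2
    d_h = d_y * w2               # error passed back to the hidden unit
    d_net = d_h * h * (1.0 - h)  # sigmoid'(net) = h * (1 - h)
    d_w1 = d_net * x             # dL/dw1
    d_b1 = d_net                 # dL/db1
    return [d_w1, d_b1, d_w2, d_b2]


params = [0.5, -0.1, 0.8, 0.2]
grads = backprop(params, x=1.5, target=1.0)
print(grads)
```

A quick sanity check for code like this is to compare each gradient against a finite-difference estimate of `loss`, nudging one parameter at a time.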

Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model. More on gradient descent is in the appendix section of the article below.
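The update rule can be sketched on a one-dimensional toy problem. The function being minimized, $f(w) = (w - 3)^2$, the learning rate, and the step count are all assumptions picked for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(steps):
        w -= learning_rate * grad(w)  # move in the direction of steepest descent
    return w


# Toy objective f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), start=0.0)
print(w_min)
```

With this learning rate the distance to the minimum shrinks by a constant factor on every step. A learning rate that is too large would make the iterates overshoot and diverge instead.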

So at this point in the post, we can comfortably say that ultimately:

Training a neural network is an optimization problem, too, in which we try to minimize the error by adjusting network weights and biases iteratively, using **backpropagation** through **gradient descent** (GD).

### Weight and bias initialization

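Initializing all the weights to the same value is a poor choice, because every hidden neuron then receives the same signal:

- If all weights are initialized to 1, then each unit gets a signal equal to the sum of the inputs.
- If all weights are 0, which is even worse, then every neuron in a hidden layer will get zero signal.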
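The two cases above are easy to check numerically. In this sketch, the input values and the layer size of four hidden units are made up, and the helper name `hidden_signals` is hypothetical.

```python
def hidden_signals(inputs, weights):
    """Weighted sum received by each hidden unit (no bias, no activation)."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


inputs = [0.2, 0.5, 0.3]
all_ones = [[1.0] * len(inputs) for _ in range(4)]
all_zeros = [[0.0] * len(inputs) for _ in range(4)]

ones_signals = hidden_signals(inputs, all_ones)
zeros_signals = hidden_signals(inputs, all_zeros)
print(ones_signals)   # every unit receives the sum of the inputs
print(zeros_signals)  # every unit receives 0.0
```

In both cases the hidden units are indistinguishable from one another, so they would also receive identical updates during training. That is why some randomness in the initial weights is needed.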

For network weight initialization, **Xavier initialization** is widely used. It is similar to random initialization, but often works much better, because it sets the scale of the initial weights based on the number of input and output neurons of each layer. In short, it helps signals reach deep into the network.

**Xavier initialization:**

- If the weights in a network start too small, then the signal shrinks as it passes through each layer until it’s too tiny to be useful.
- If the weights in a network start too large, then the signal grows as it passes through each layer until it’s too massive to be useful.

Xavier initialization makes sure the weights are ‘just right’, keeping the signal in a reasonable range of values through many layers.
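One common formulation of this idea, often called Glorot uniform, draws each weight uniformly from $[-\text{limit}, \text{limit}]$ with $\text{limit} = \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$. The sketch below assumes that variant. The names `fan_in` and `fan_out` (number of input and output neurons of the layer) and the layer sizes are chosen for illustration.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng=None):
    """Return a fan_out x fan_in weight matrix with Xavier (Glorot) uniform values."""
    rng = rng or random.Random()
    # The range shrinks as the layer gets wider, keeping the signal scale stable.
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [
        [rng.uniform(-limit, limit) for _ in range(fan_in)]
        for _ in range(fan_out)
    ]


# Hypothetical layer with 256 inputs and 128 outputs; seeded for repeatability.
weights = xavier_uniform(fan_in=256, fan_out=128, rng=random.Random(0))
limit = math.sqrt(6.0 / (256 + 128))
print(limit)
```

Deep learning libraries typically ship their own version of this initializer, so in practice you would rarely write it by hand.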

## Where can you leverage them in your Supply Chain?

Supply Chains need to leverage Deep Learning methodologies more intensively than other functions like Marketing, because Supply Chains, as they stand today, have so many manual touch points. I will be posting a separate article this week on why Deep Learning (DL) is more important than Machine Learning (ML) in the Supply Chain context.

Some of the ways you can leverage Neural Networks in Supply Chains are indicated in this article on my Blog.

---

Article based on Individual knowledge and Expertise

## Article Appendix

### Activation functions

The domain of the sigmoid function includes all real numbers, and the codomain is (0, 1). This means that any value obtained as an output from a neuron (as per the calculation of its activation state) will always be between zero and one. The sigmoid function, as represented in the following diagram, provides an interpretation of the saturation rate of a neuron, from not being active (equal to 0) to complete saturation, which occurs at a predetermined maximum value (equal to 1):

*Sigmoid versus Tanh activation function*

On the other hand, the hyperbolic tangent, or **Tanh**, is another form of activation function. **Tanh** squashes a real-valued number into the range between **-1** and **1**. The preceding graph shows the difference between the **Tanh** and **Sigmoid** activation functions. Mathematically speaking, the *tanh* activation function can be expressed as follows, where $\sigma$ denotes the sigmoid function:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\sigma(2x) - 1$$
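The two activation functions can be compared side by side with the standard library. The sample points are arbitrary, and the last value in each tuple checks the identity $\tanh(x) = 2\sigma(2x) - 1$ numerically.

```python
import math


def sigmoid(x):
    """Logistic function with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def compare(x):
    """Return (sigmoid(x), tanh(x), tanh rebuilt from the sigmoid)."""
    return sigmoid(x), math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0


for x in (-2.0, 0.0, 2.0):
    print(x, compare(x))
```

Sigmoid outputs are centered around 0.5, while tanh outputs are centered around 0.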

In general, in the last layer of a **feedforward neural network** (**FFNN**), the softmax function is applied to make the final decision. This is a common case, especially when solving a classification problem. The softmax function produces a probability distribution over the possible classes in a multiclass classification problem. To conclude, choosing proper activation functions and network weight initializations are two decisions that help a network perform at its best and train well.
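A minimal softmax sketch follows. The raw scores are made up. Subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(scores):
    """Turn raw output-layer scores into a probability distribution."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(probs.index(max(probs)))  # index of the predicted class
```

The outputs are all positive and sum to one, so the largest score always maps to the most probable class.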

### Gradient Descent

The process using GD described in the article above does not guarantee that the global minimum is reached. The presence of hidden units and the non-linearity of the output function mean that the behavior of the error is very complex and has many local minima. **This backpropagation step is typically performed thousands or millions of times, using many training batches, until the model parameters converge to values that minimize the cost function**. The training process ends when the error on the validation set begins to increase, because this could mark the beginning of a phase of **overfitting**.
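The stopping rule based on validation error can be sketched as follows. The `patience` parameter (how many non-improving epochs to tolerate) and the error values are assumptions added for illustration. The text only says that training ends once the validation error starts to rise.

```python
def early_stopping_epoch(validation_errors, patience=1):
    """Return the epoch with the best validation error before it rises.

    Scanning stops once the error has failed to improve for `patience`
    consecutive epochs.
    """
    best_error = float("inf")
    best_epoch = None
    waited = 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_epoch, waited = error, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation error is rising: possible overfitting
    return best_epoch


# Made-up validation errors: improving until epoch 2, then getting worse.
stop_at = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7])
print(stop_at)  # 2
```

In a real training loop you would also keep a copy of the model parameters from the best epoch and restore them after stopping.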

As an alternative, **stochastic gradient descent** (**SGD**) was proposed, which is also a widely used optimizer in DNN training. In SGD, we use only one training sample per iteration from the training set to update the network parameters, which gives a stochastic approximation of the true cost gradient.
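To illustrate the one-sample-per-update idea, here is a sketch of SGD fitting a straight line $y = wx + b$. The model, the noise-free data, the learning rate, and the epoch count are all assumptions chosen so the example stays small. A real DNN has many more parameters, but the update pattern is the same.

```python
import random


def sgd_linear_fit(samples, learning_rate=0.05, epochs=500, seed=0):
    """Fit y = w * x + b with one parameter update per training sample."""
    rng = random.Random(seed)
    data = list(samples)  # copy so the caller's list is not shuffled
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a random order each epoch
        for x, y in data:
            error = (w * x + b) - y         # derivative of 0.5 * error**2
            w -= learning_rate * error * x  # gradient step from this one sample
            b -= learning_rate * error
    return w, b


# Made-up, noise-free data generated from y = 2x + 1.
samples = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]
w, b = sgd_linear_fit(samples)
print(w, b)
```

Each update uses the gradient of a single sample rather than the whole training set, which makes individual steps cheap but noisy.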