The core of Deep Learning (DL)
Artificial neural networks (ANNs), which are inspired by how the human brain works, form the core of DL and make its realization possible. Today’s revolution around DL would not have been possible without ANNs. Thus, to understand DL, we need to understand how neural networks work.
The mostly automated Supply Chains of the future that we envision today can ONLY be achieved through leveraging Artificial Neural Network methods. No amount of predictive machine learning methods can provide us the leverage that we need to develop Supply Chains of the future.
In summary, a biological neuron comprises a cell body (also known as the soma), one or more dendrites for receiving signals from other neurons, and an axon for carrying away the signals that the neuron generates. A neuron is in an active state when it is sending signals to other neurons. However, when it is receiving signals from other neurons, it is in an inactive state.
In an idle state, a neuron accumulates all the signals that are received before reaching a certain activation threshold. This whole process motivated researchers to test out ANNs.
Based on the concept of biological neurons, the term and idea of ANNs arose. Similar to biological neurons, the artificial neuron consists of the following:
- One or more incoming connections that aggregate signals from neurons
- One or more output connections for carrying the signal to the other neurons
- An activation function, which determines the numerical value of the output signal
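The three components listed above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive model: the function names, the example weights, and the threshold value of 1.0 are assumptions chosen here, and the step activation simply mirrors the idea of an activation threshold.

```python
def step(total, threshold=1.0):
    """Hypothetical step activation: fire (1.0) once the threshold is reached."""
    return 1.0 if total >= threshold else 0.0

def artificial_neuron(inputs, weights, activation):
    """Aggregate the weighted incoming signals, then apply the activation."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return activation(total)

# Two incoming connections with illustrative weights
strong = artificial_neuron([1.0, 0.5], [0.8, 0.6], step)
weak = artificial_neuron([0.2, 0.1], [0.8, 0.6], step)
print(strong, weak)
```

The value returned by the neuron is what its output connections would carry to the next neurons.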
Before we move further, here is a high-level comparison infographic (from towardsdatascience.com):
Assigned weight is a key aspect
Just like biological neural networks, ANNs have to get “activated” to pass a signal on. But a very important aspect of ANNs is the assigned weights, which influence the connections within the network.
Each weight has a numerical value denoted $w_{ij}$, which is the synaptic weight connecting neuron $i$ to neuron $j$. For each neuron $i$, an input vector can be defined by $x_i = (x_1, x_2, \ldots, x_n)$, and a weight vector can be defined by $w_i = (w_{i1}, w_{i2}, \ldots, w_{in})$. An example is shown in the illustration below (source: AnalyticsVidya):
Before we delve deeper, let us touch upon Forward Propagation
Forward Propagation, Depth and Width
The input X provides the initial information, which then propagates to the hidden units at each layer and finally produces the output $\hat{y}$. The architecture of the network entails determining its depth, its width, and the activation functions used on each layer. Depth is the number of hidden layers. Width is the number of units (nodes) on each hidden layer, since we control neither the input layer nor the output layer dimensions.
Now, referring again to the illustration above, depending on the position of a neuron, the weights and the output function determine the behavior of an individual neuron. Then, during forward propagation, each unit in the hidden layer gets the following signal:
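In the notation introduced earlier, and assuming unit $i$ receives $n$ inputs, this signal can be written in the standard weighted-sum form (the label $net_i$ is chosen here for illustration):

$$
net_i = \sum_{j=1}^{n} w_{ij} x_j
$$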
Nevertheless, among the weights, there is also a special type of weight called a bias unit, b. Technically, bias units aren’t connected to any previous layer, so they don’t have true activity. But still, the bias b value allows the neural network to shift the activation function to the left or right. By taking the bias unit into consideration, the modified network output is formulated as follows:
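A standard way to write this, assuming $b_i$ denotes the bias of unit $i$ and $net_i$ is an illustrative label for the signal that the unit receives from its $n$ inputs, is:

$$
net_i = \sum_{j=1}^{n} w_{ij} x_j + b_i
$$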
The preceding equation signifies that each hidden unit gets the sum of inputs, multiplied by the corresponding weight—this is known as the Summing junction. Then, the resultant output in the Summing junction is passed through the activation function, which squashes the output, as depicted in the following diagram (from Machinelearningmastery.com):
More technically, each neuron receives a signal of the weighted sum of the synaptic weights and the activation values of the neurons that are connected as input.
One of the most widely used functions for this purpose is the so-called sigmoid logistic function, which is defined as follows:
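In its standard form, the sigmoid (logistic) function maps any real input $x$ to a value between 0 and 1:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$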
More on Activation functions is in the appendix below.
A practical neural network architecture, however, is composed of input, hidden, and output layers that are composed of nodes that make up a network structure. It still follows the working principle of an artificial neuron model, as shown in the preceding section.
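To make the layered structure concrete, here is a minimal NumPy sketch of forward propagation through a small network. The layer sizes (3 inputs, two hidden layers of 4 units, 1 output), the random weights, and the function names are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Propagate x through a list of (W, b) pairs, one pair per layer."""
    a = x
    for W, b in layers:
        a = sigmoid(a @ W + b)  # summing junction, then activation
    return a

rng = np.random.default_rng(42)
sizes = [3, 4, 4, 1]  # input, two hidden layers (depth 2, width 4), output
layers = [
    (rng.normal(size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

x = np.array([0.5, -1.0, 2.0])
y_hat = forward(x, layers)
print(y_hat)
```

Each layer repeats the same single-neuron computation for all of its units at once, which is why a matrix product is enough.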
The hidden layers perform most of the computation to learn the patterns, and the network evaluates how accurate its prediction is compared to the actual output using a special mathematical function called the loss function. Typically, with neural networks, we seek to minimize the error. As such, the objective function is often referred to as a cost function or a loss function and the value calculated by the loss function is referred to as simply “loss.”
Loss Function: The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function.
It could be a complex one or a very simple mean squared error, which can be defined as follows:
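A common form of the mean squared error, assuming $N$ training examples indexed by $i$ (the symbols $E$, $N$, and $i$ are chosen here for illustration), is:

$$
E = \frac{1}{N} \sum_{i=1}^{N} \left( Y_i - \hat{Y}_i \right)^2
$$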
In the preceding equation, $\hat{Y}$ is the prediction made by the network, while $Y$ represents the actual or expected output.
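As a small sketch of how the mean squared error could be computed in practice, the helper below averages the squared differences between expected and predicted values. The function name `mse` and the sample numbers are assumptions for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between expected outputs and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Errors of 1 and 2 give (1 + 4) / 2
loss = mse([1.0, 2.0], [2.0, 4.0])
print(loss)
```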
Finally, when the error is no longer being reduced, the neural network converges and makes a prediction through the output layer.
The performance of the network is then evaluated on the test set. We already know about the simple concept of an artificial neuron. However, generating only some artificial signals is not enough to learn a complex task. As such, a commonly used supervised learning algorithm is the backpropagation algorithm, which is very often used to train a complex ANN.
Backpropagation, short for “backward propagation of errors,” is an algorithm for supervised learning of artificial neural networks using gradient descent (we will cover gradient descent below). Given an artificial neural network and an error function, the method calculates the gradient of the error function with respect to the neural network’s weights. Essentially, this approach forces the network to backtrack through all its layers to update the weights and biases across nodes in the direction opposite to the gradient of the loss function.
Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model.
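The update rule can be sketched on a one-parameter toy problem. This is only an illustration: the function $f(w) = (w - 3)^2$, the learning rate, and the number of steps are assumptions chosen so the behavior is easy to follow.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)  # move in the direction of steepest descent
    return w

# f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)
```

In a neural network the single parameter `w` is replaced by all weights and biases, and the gradient comes from backpropagation.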
More on Gradient Descent is in the Appendix section of the article below.
So at this point in the post, we can comfortably say that ultimately:
training a neural network is an optimization problem, too, in which we try to minimize the error by adjusting network weights and biases iteratively, by using backpropagation through gradient descent (GD).
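The whole loop (forward propagation, loss, backpropagation, gradient descent update) can be sketched for a tiny network in NumPy. This is a minimal sketch under stated assumptions, not a production implementation: one hidden layer of 4 sigmoid units, mean squared error as the loss, full-batch updates, and the logical OR function as toy data are all choices made here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_tiny_network(X, Y, hidden=4, lr=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer network; return the loss at every epoch."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0, size=(hidden, Y.shape[1]))
    b2 = np.zeros(Y.shape[1])
    losses = []
    for _ in range(epochs):
        # Forward propagation
        H = sigmoid(X @ W1 + b1)
        Y_hat = sigmoid(H @ W2 + b2)
        losses.append(float(np.mean((Y - Y_hat) ** 2)))

        # Backpropagation: chain rule from the loss back to each layer
        d_out = (2.0 / Y.size) * (Y_hat - Y) * Y_hat * (1.0 - Y_hat)
        d_hid = (d_out @ W2.T) * H * (1.0 - H)

        # Gradient descent step on weights and biases
        W2 -= lr * (H.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_hid)
        b1 -= lr * d_hid.sum(axis=0)
    return losses

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0], [1.0], [1.0], [1.0]])  # logical OR as toy targets
losses = train_tiny_network(X, Y)
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Comparing the first and last entries of `losses` shows the error shrinking as the weights and biases are adjusted iteratively.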
Weight and bias initialization
- If all weights are initialized to 1, then each unit gets a signal equal to the sum of the inputs.
- If all weights are 0, which is even worse, then every neuron in a hidden layer will get zero signal.
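Both cases can be checked numerically. In the sketch below, the input values and the layer size (3 inputs, 4 hidden units) are assumptions for illustration. Because every hidden unit receives exactly the same signal, the units cannot learn different features.

```python
import numpy as np

x = np.array([0.2, 0.5, 0.3])  # illustrative input vector

W_ones = np.ones((3, 4))    # all weights initialized to 1
W_zeros = np.zeros((3, 4))  # all weights initialized to 0

signal_ones = x @ W_ones    # every unit receives sum(x)
signal_zeros = x @ W_zeros  # every unit receives 0

print(signal_ones, signal_zeros)
```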
For network weight initialization, Xavier initialization is widely used. It is similar to random initialization, but often turns out to work much better, because it scales the initial weights according to the number of input and output neurons of each layer. You may be wondering whether you can get rid of random initialization while training a regular DNN.
Xavier Initialization: In short, it helps signals reach deep into the network.
- If the weights in a network start too small, then the signal shrinks as it passes through each layer until it’s too tiny to be useful.
- If the weights in a network start too large, then the signal grows as it passes through each layer until it’s too massive to be useful.
Xavier initialization makes sure the weights are ‘just right’, keeping the signal in a reasonable range of values through many layers.
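One common variant, the uniform form of Xavier (Glorot) initialization, draws weights from a range whose width depends on the number of input and output neurons of the layer. The sketch below assumes that variant; the function name and the layer sizes are chosen here for illustration.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Draw weights uniformly from [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = xavier_uniform(256, 128, rng)

# The variance of the drawn weights should be close to 2 / (fan_in + fan_out)
print(W.var(), 2.0 / (256 + 128))
```

Wider layers get smaller initial weights, which is what keeps the signal from growing or shrinking as it passes through many layers.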
Where can you leverage them in your Supply Chain?
Supply Chains need to leverage Deep Learning methodologies more intensively than other functions like Marketing, because Supply Chains, as they stand today, have so many manual touch points. I will be posting a separate article this week on why Deep Learning (DL) is more important than Machine Learning (ML) in the Supply Chain context.
Some of the ways you can leverage Neural Networks in Supply Chains are indicated in this article on my Blog.
Article based on Individual knowledge and Expertise
The domain of the sigmoid function includes all real numbers, and the co-domain is (0, 1). This means that any value obtained as an output from a neuron (as per the calculation of its activation state) will always be between zero and one. The sigmoid function, as represented in the following diagram, provides an interpretation of the saturation rate of a neuron, from not being active (equal to 0) to complete saturation, which occurs at a predetermined maximum value (equal to 1):
Sigmoid versus Tanh activation function
On the other hand, the hyperbolic tangent, or Tanh, is another form of activation function. Tanh squashes a real-valued number into the range between -1 and 1. The preceding graph shows the difference between the Tanh and Sigmoid activation functions. Mathematically speaking, the tanh activation function can be expressed as follows:
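In its standard form, the hyperbolic tangent of a real input $x$ is:

$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$

Equivalently, $\tanh(x) = 2\sigma(2x) - 1$, where $\sigma(x) = 1 / (1 + e^{-x})$ is the sigmoid function, so Tanh can be read as a rescaled and shifted sigmoid.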
In general, the softmax function is applied in the last layer of a feedforward neural network (FFNN) as the decision function. This is a common case, especially when solving a classification problem. The softmax function produces a probability distribution over the possible classes in a multiclass classification problem. To conclude, choosing proper activation functions and a proper weight initialization are two decisions that help a network train well and perform at its best.
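The softmax step mentioned above can be sketched as follows. The function name and the example scores are assumptions for illustration. Subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)  # stability shift, result unchanged
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax([2.0, 1.0, 0.1])  # illustrative scores for three classes
print(probs, probs.argmax())
```

The class with the highest probability (`probs.argmax()`) is then taken as the prediction.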
The GD process described in the article above does not guarantee that the global minimum is reached. The presence of hidden units and the non-linearity of the output function mean that the behavior of the error is very complex and has many local minima. The backpropagation step is typically performed thousands or millions of times, using many training batches, until the model parameters converge to values that minimize the cost function. The training process ends when the error on the validation set begins to increase, because this could mark the beginning of a phase of overfitting.
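This stopping rule is often called early stopping. The sketch below is one simple way to express it, under assumptions made here for illustration: the function name, the `patience` parameter (how many epochs without improvement to tolerate), and the sample validation losses are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the best epoch seen before training is stopped."""
    best = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation error keeps rising: likely overfitting
    return best_epoch

# Validation error falls, then starts to rise after epoch 2
history = [0.9, 0.7, 0.5, 0.6, 0.65, 0.7, 0.4]
stop_at = early_stopping_epoch(history)
print(stop_at)
```

A larger `patience` tolerates longer plateaus before stopping, at the cost of more training time.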