## Technical content: not a generic read

The data set used in the model can't be shared for confidentiality reasons. The assembly line pictures used in this post are also not the ones used in the real project.

## A use case for the enthusiastic Data Scientist

In this example, I will walk you through an overview of a Python program that I helped develop for an Indian biscuit manufacturing giant.

### The goal

The primary goal of this model is to detect production efficiency flaws on a food-processing conveyor belt. Datasets such as CIFAR-10 (images) and MNIST (handwritten digits) are useful for understanding and training some models. However, at some point, real-life datasets must be used to sell and implement deep learning and artificial intelligence in general.

The following photograph shows a section of the conveyor belt that contains an acceptable level of products, in this case, portions of chocolate cookies:

However, sometimes the production slows down, and the output goes down to an alert level, as shown in the following photograph:

The alert-level image shows a gap that will slow down the packaging section of the factory dramatically.

### Compiling the model

Compiling a **Keras** model requires a minimum of two options:

- a loss function and
- an optimizer

You evaluate how much you are losing and then optimize your parameters, just as in real life. A metric option has been added to measure the performance of the model. With a metric, you can analyze your losses and optimize your situation, as shown in the following code:

`classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])`

#### Loss function

The loss function provides information on how far the state of the model *y1* (weights, biases) is from its target state *y*.

A description of the quadratic loss function precedes that of the binary cross-entropy function applied to the case study model.

#### Quadratic loss function

To refresh the concept of gradient descent, imagine you are on a hill and want to walk down that hill. Your goal is to get to *y*, the bottom of the hill. Presently you are at *a*. Google Maps shows you that you still have to go a certain distance:

$$y - a$$

That formula is great for the moment. But now suppose you are almost at the bottom of the hill, and the person walking in front of you dropped a coin. You have to slow down now, and Google Maps is not helping much because it doesn't display such small distances that well. You open an imaginary application called **tiny maps** that zooms into small distances with a quadratic objective (or cost) function:

$$O = (y - a)^2$$

To make it more comfortable to analyze, *O* is divided by 2, producing a standard quadratic cost function:

$$O = \frac{1}{2}(y - a)^2$$

- *y* is the goal.
- *a* is the result of applying the weights, biases, and finally the activation functions.

With the derivatives of the results, the weights and biases can be updated. In our hill example, if you move one meter (*y*) per step (*x*), that is much more than moving 0.5 meters (*y*) per step. Depending on your position on the hill, you can see that you cannot apply a constant learning rate (conceptually the length of your step); you adapt it just like Adam, the optimizer, does.
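The hill walk can be sketched in a few lines of plain Python. This is only an illustration, assuming the half-squared-distance form of the quadratic cost; the starting position, the goal, and the fixed step length are made-up values, and a real optimizer such as Adam would adapt the step instead of keeping it constant.

```python
def quadratic_cost(y, a):
    """Half the squared distance between the goal y and the output a."""
    return 0.5 * (y - a) ** 2


def quadratic_cost_gradient(y, a):
    """Derivative of the quadratic cost with respect to a."""
    return a - y


y = 0.0              # goal: the bottom of the hill (illustrative value)
a = 4.0              # current position (illustrative value)
learning_rate = 0.5  # assumed constant step length

history = []
for step in range(5):
    history.append((a, quadratic_cost(y, a)))
    # Move against the gradient, i.e. downhill.
    a -= learning_rate * quadratic_cost_gradient(y, a)

for position, cost in history:
    print(f"a = {position:.4f}  cost = {cost:.5f}")
```

The cost shrinks much faster than the distance itself, which is what makes the quadratic form convenient close to the goal.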

#### Binary cross-entropy

Cross-entropy comes in handy when the learning slows down. In the hill example, it slowed down at the bottom. But, remember, a path can lead you sideways, meaning you are momentarily stuck at a given height. Cross-entropy solves that by being able to function well with very small values (steps on the hill).

Suppose you have the following structure:

- Inputs: *{x1, x2, …, xn}*
- Weights: *{w1, w2, …, wn}*
- A bias (or sometimes more): *b*
- An activation function (ReLU, logistic sigmoid, or other)

Before the activation, *z* represents the sum of the classical operations:

$$z = \sum_{i=1}^{n} w_i x_i + b$$

Now the activation function is applied to *z* to obtain the present output of the model, *y1*.
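As a small sketch of these two steps, the following plain-Python snippet computes *z* for one neuron and then applies an activation. The input, weight, and bias values are invented for illustration, and the logistic sigmoid is just one of the possible activation choices listed above.

```python
import math


def weighted_sum(inputs, weights, bias):
    """z: sum of each input times its weight, plus the bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def sigmoid(z):
    """Logistic sigmoid activation: squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """ReLU activation: passes positive values, clips negatives to 0."""
    return max(0.0, z)


x = [0.5, -1.0, 2.0]   # illustrative inputs
w = [0.4, 0.3, -0.2]   # illustrative weights
b = 0.1                # illustrative bias

z = weighted_sum(x, w, b)
y1 = sigmoid(z)        # present output of the model

print(f"z = {z:.4f}, sigmoid(z) = {y1:.4f}, relu(z) = {relu(z):.4f}")
```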

With this in mind, the cross-entropy loss formula can be explained:

$$\text{loss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{M} y_{i,c} \log\big((y_1)_{i,c}\big)$$

In this function:

- *n* is the total number of items in the training input.
- *M* is the number of classes. With multiclass data (*M* > 2), a separate loss is calculated for each class label and the results are summed.
- The choice of the logarithm base (2, *e*, 10) changes the scale of the loss.
- *y* is the output goal.
- *y1* is the present value, as described previously.

This loss function is never negative: the logarithm of a probability between 0 and 1 is negative (or zero), and the function starts with a minus sign that cancels it. The output produces small numbers that tend to zero as the system progresses.
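A minimal plain-Python sketch can make that behavior concrete. It assumes the standard multiclass form of cross-entropy (the mean, over items, of minus the log of the probability given to the true class) with the natural logarithm; the `eps` guard against `log(0)` and the sample predictions are assumptions added for the illustration, not part of the project code.

```python
import math


def categorical_cross_entropy(targets, predictions, eps=1e-12):
    """Mean cross-entropy over n items, each with M class probabilities."""
    n = len(targets)
    total = 0.0
    for y_row, y1_row in zip(targets, predictions):
        for y, y1 in zip(y_row, y1_row):
            # eps avoids log(0) when a predicted probability is exactly 0.
            total += y * math.log(max(y1, eps))
    return -total / n


# Two items, three classes, one-hot targets (illustrative values).
targets = [[1, 0, 0], [0, 1, 0]]
early_predictions = [[0.40, 0.30, 0.30], [0.30, 0.40, 0.30]]
late_predictions = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01]]

loss_early = categorical_cross_entropy(targets, early_predictions)
loss_late = categorical_cross_entropy(targets, late_predictions)

print(f"early in training: {loss_early:.4f}")
print(f"late in training:  {loss_late:.4f}")
```

The loss stays positive in both cases and moves towards zero as the predictions approach the targets.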

The loss function in **Keras**, which uses **TensorFlow**, uses this basic concept with more mathematical inputs to update the parameters.

A binary cross-entropy loss function applies to binomial (two-class) problems: the target label is either 0 or 1, and the model output is a probability between 0 and 1 that is compared with that label. In the binomial classification model, the final class decision will be 0 or 1.

In this case, the sum over classes is not necessary because *M* (the number of classes) = 2. The binary cross-entropy loss function is then as follows:

$$\text{loss} = -\big[\, y \log(y_1) + (1 - y) \log(1 - y_1) \,\big]$$
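As a sketch, the two-class case can be written for a single item in plain Python, assuming the usual form of binary cross-entropy with the natural logarithm. The clipping constant `eps` is an assumption added to keep `log` defined; it is not taken from the project.

```python
import math


def binary_cross_entropy(y, y1, eps=1e-12):
    """Binary cross-entropy for one item: y is 0 or 1, y1 is a probability."""
    # Clip the prediction away from exactly 0 and 1 so log() stays finite.
    y1 = min(max(y1, eps), 1.0 - eps)
    return -(y * math.log(y1) + (1 - y) * math.log(1.0 - y1))


confident_and_right = binary_cross_entropy(1, 0.9)
confident_and_wrong = binary_cross_entropy(1, 0.1)
right_for_class_zero = binary_cross_entropy(0, 0.1)

print(f"target 1, predicted 0.9: {confident_and_right:.4f}")
print(f"target 1, predicted 0.1: {confident_and_wrong:.4f}")
print(f"target 0, predicted 0.1: {right_for_class_zero:.4f}")
```

A confident wrong prediction is penalized far more heavily than a confident right one, which keeps the gradient useful even when learning would otherwise slow down.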

#### Adam optimizer

In the hill example, you first walked with big strides down the hill using momentum (larger strides because you are going in the right direction). Then you had to take smaller steps to find the object. You are adapting your estimation of your moment to your need; hence, the name **adaptive moment estimation** (**Adam**).

Adam constantly compares the mean past gradients to present gradients. In the hill example, it compares how fast you were going.

The Adam optimizer represents an alternative to the classical gradient descent method or stochastic gradient descent method. Adam goes further by applying its optimizer to random (stochastic) mini-batches of the dataset. This makes it a version of stochastic gradient descent.

Then, with even more inventiveness, Adam adds **root mean square propagation** (**RMSprop**) to the process by applying per-parameter learning rates. It tracks a running mean of the squared gradients (the steepness of the slope in our hill example) and adapts the learning rates accordingly.
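The two ingredients, a running mean of the gradients (momentum) and a running mean of the squared gradients (the RMSprop part), can be sketched for a single parameter in plain Python. This follows the commonly published Adam update rule with its usual default decay rates; the toy function, the starting point, the learning rate, and the number of steps are assumptions chosen for the illustration and are not Keras internals.

```python
import math


def adam_minimize(grad, x0, steps=300, lr=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a one-parameter function with the Adam update rule."""
    x = x0
    m = 0.0  # running mean of gradients (momentum)
    v = 0.0  # running mean of squared gradients (RMSprop-style scaling)
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction: both means start at 0 and are too small early on.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # The step length adapts to the recent size of the gradients.
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy "hill": f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_final = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(f"x after optimization: {x_final:.4f}")
```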

#### Metrics

Metrics are there to measure the performance of your model. The metric function behaves like a loss function. However, it is not used to train the model.

In this case, the accuracy parameter was this:

`...metrics = ['accuracy'])`

Here, accuracy is the fraction of samples classified correctly. A value that moves up towards 1 shows that the training is on the right track, while a value that stays low or drops shows that the Adam optimizer still has work to do to set the training on track again.
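For a binary classifier, accuracy can be sketched as the share of predictions that match the labels once each predicted probability is turned into a class. The 0.5 threshold and the sample values below are assumptions for illustration.

```python
def binary_accuracy(targets, probabilities, threshold=0.5):
    """Fraction of items whose thresholded prediction matches the label."""
    correct = 0
    for y, p in zip(targets, probabilities):
        predicted = 1 if p > threshold else 0
        correct += int(predicted == y)
    return correct / len(targets)


targets = [1, 0, 1, 1, 0]                  # illustrative labels
probabilities = [0.9, 0.2, 0.4, 0.8, 0.6]  # illustrative model outputs

accuracy = binary_accuracy(targets, probabilities)
print(f"accuracy = {accuracy:.2f}")
```

Unlike the loss, this number ignores how confident each prediction was, which is one reason it is reported for monitoring but not used to update the weights.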

### Training dataset

The training dataset can't be shared for confidentiality reasons.

The goal of the model is to detect the alert levels described above.

#### Data augmentation

Data augmentation increases the size of the dataset by generating distorted versions of the images provided.

The `ImageDataGenerator` class generates batches of image data in tensor format. It will perform data augmentation by distorting the images (shear range, for example). Data augmentation is a fast way to use the images you have and create more virtual images through distortions:

```
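# Import path for standalone Keras; adjust it to your installation if needed.
from keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values and apply random shear, zoom, and horizontal flips.
train_datagen = ImageDataGenerator(rescale = 1./255,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True)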
```

The code description is as follows:

- `rescale` will rescale the input image if not 0 (or `None`). In this case, the data is multiplied by 1/255 before applying any other operation.
- `shear_range` will displace each value in the same direction, determined in this case by the 0.2. It will slightly distort the image at one point, giving some more virtual images to train on.
- `zoom_range` is the range of the random zoom.
- `horizontal_flip` is set to `True`. This is a Boolean that randomly flips inputs horizontally.

`ImageDataGenerator` provides many more options for real-time data augmentation, such as rotation range, height shift, and more.
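To show what two of these options do to the pixel values, here is a plain-Python sketch on a tiny grayscale "image" stored as nested lists. It is not how Keras implements augmentation; the helper names and the sample pixels are hypothetical, and the 50% flip probability is an assumption made for the illustration.

```python
import random


def rescale(image, factor=1.0 / 255):
    """Multiply every pixel by factor (maps 0..255 to 0..1)."""
    return [[pixel * factor for pixel in row] for row in image]


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def random_horizontal_flip(image, rng):
    """Flip with an assumed probability of 0.5, otherwise keep the image."""
    return horizontal_flip(image) if rng.random() < 0.5 else image


image = [[0, 51, 255],
         [102, 204, 0]]   # illustrative 2 x 3 grayscale image

scaled = rescale(image)
flipped = horizontal_flip(scaled)
augmented = random_horizontal_flip(scaled, random.Random(0))

print("scaled: ", scaled)
print("flipped:", flipped)
```

Each pass over the dataset can therefore see a slightly different version of the same photograph, which is what makes the dataset virtually larger.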

#### Loading the data

The data is loaded through the `train_datagen` image preprocessing generator (described previously), as implemented in the following code:

```
print("Step 7b training set")
# directory (dataset root path) and batchs (batch size) are set elsewhere in the program.
training_set = train_datagen.flow_from_directory(directory+'training_set',
                                                 target_size = (64, 64),
                                                 batch_size = batchs,
                                                 class_mode = 'binary')
```

The flow in this program uses the following options:

- `flow_from_directory` sets `directory + 'training_set'` as the path where the two binary classes to train are stored.
- `target_size` is the dimension to which all images will be resized. In this case, it is 64 x 64.
- `batch_size` is the size of the batches of data. The default value is 32; it is set to 10 in this case.
- `class_mode` determines the label arrays returned: `categorical` returns 2D one-hot encoded labels, while `None` returns no labels. In this case, `binary` returns 1D binary labels.
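`flow_from_directory` expects one subfolder per class under the given path and, when no class list is passed, assigns label indices in alphanumeric order of the subfolder names. The following sketch mimics that mapping with the standard library only; the class folder names `acceptable` and `alert` are hypothetical, since the real dataset layout is not shown here.

```python
import os
import tempfile


def class_indices(root):
    """Map each class subfolder of root to an integer label, in sorted order."""
    classes = sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
    return {name: index for index, name in enumerate(classes)}


# Build a throwaway layout: <directory>/training_set/<class name>/
with tempfile.TemporaryDirectory() as directory:
    for class_name in ("alert", "acceptable"):  # hypothetical class names
        os.makedirs(os.path.join(directory, "training_set", class_name))
    labels = class_indices(os.path.join(directory, "training_set"))

print(labels)
```

Knowing which folder becomes label 0 and which becomes label 1 matters later, when the predictions of the binary classifier have to be read back as "acceptable" or "alert".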

### Testing dataset

The testing dataset flow follows the same structure as the training dataset flow described previously. However, for testing purposes, the task can be made easier or more difficult depending on the choice of the model. This can be done by adding images with defects or noise. This will force the system into more training and the project team into more hard work to fine-tune the model. Data augmentation provides an efficient way of producing distorted images without adding images to the dataset. Both methods, among many others, can be applied at the same time when necessary.

#### Data augmentation

In this model, the data only goes through rescaling. Many other options could be added to complicate the training task to avoid overfitting, for example, or simply because the dataset is small:

```
print("Step 8a test")
test_datagen = ImageDataGenerator(rescale = 1./255)
```

#### Loading the data

Loading the testing data remains limited to what is necessary for this model. Other options can fine-tune the task at hand:

```
print("Step 8b testing set")
test_set = test_datagen.flow_from_directory(directory+'test_set',
                                            target_size = (64, 64),
                                            batch_size = batchs,
                                            class_mode = 'binary')
```

Never underestimate dataset fine-tuning. Sometimes, this phase can last weeks before finding the right dataset and arguments.

### Training with the classifier

The classifier has been built and can be run:

```
print("Step 9 training")
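# fit_generator is deprecated in recent TensorFlow/Keras releases, where fit() accepts generators directly.
print("Classifier",classifier.fit_generator(training_set,
                                            steps_per_epoch = estep,
                                            epochs = ep,
                                            validation_data = test_set,
                                            validation_steps = vs))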
```

The `fit_generator` function, which fits the model on data generated batch by batch, contains the main hyperparameters to run the training session through the following arguments in this model. The hyperparameter settings determine the behavior of the training algorithm:

- `training_set` is the training set flow described previously.
- `steps_per_epoch` is the total number of steps (batches of samples) to yield from the generator in each epoch. The variable used is `estep`.
- `epochs` is the total number of iterations made on the data input. The variable used is `ep`.
- `validation_data=test_set` is the testing data flow.
- `validation_steps=vs` is used with the generator and defines the number of batches of samples to test.

The variables `estep`, `vs`, and `ep` are set in the following code:

```
estep=100 #10000
vs=100 #5000
ep=2 #50
```
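The relationship between these settings and the amount of data processed is simple arithmetic, sketched below. The batch size of 10 comes from the description of `batch_size` above; the dataset size of 250 images is a made-up number, because the real one is confidential.

```python
import math


def samples_seen(steps_per_epoch, batch_size, epochs):
    """Number of (possibly augmented) images drawn during training."""
    return steps_per_epoch * batch_size * epochs


def steps_for_full_pass(num_samples, batch_size):
    """Steps needed for one epoch to cover every image once."""
    return math.ceil(num_samples / batch_size)


batchs = 10   # batch size stated in the text
estep = 100   # steps per epoch
ep = 2        # epochs

total = samples_seen(estep, batchs, ep)
full_pass = steps_for_full_pass(num_samples=250, batch_size=batchs)  # hypothetical dataset size

print(f"images drawn during training: {total}")
print(f"steps to cover 250 images once: {full_pass}")
```

When `steps_per_epoch` exceeds the number of steps needed for one full pass, the generator simply keeps yielding batches, so the same photographs come back with fresh augmentations.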

While the training runs, measurements are displayed: loss, accuracy, epochs, information on the structure of the layers, and the steps calculated by the algorithm.

Here is an example of the loss and accuracy data displayed:

```
Epoch 1/2
- 23s - loss: 0.1437 - acc: 0.9400 - val_loss: 0.4083 - val_acc: 0.5000
Epoch 2/2
- 21s - loss: 1.9443e-06 - acc: 1.0000 - val_loss: 0.3464 - val_acc: 0.5500
```
