Predictive Maintenance Analytics with Python: A Full-Cycle, Step-by-Step Illustrative Example

To give a real-life glimpse into predictive maintenance, we will use the open-source Turbofan Engine Degradation Simulation dataset, released in 2008 by the Prognostics Center of Excellence at NASA's Ames Research Center. The dataset can be downloaded from https://ti.arc.nasa.gov/c/6/.

Describing the dataset

The dataset consists of multiple multivariate time series of sensor readings from a fleet of simulated aircraft gas turbine engines operating under different conditions. It is split into separate training and test sets. The test set is similar to the training set, except that each engine's measurements are truncated some (unknown) amount of time before it fails. The data is provided as ZIP-compressed text files with 26 columns of numbers. Each row represents a snapshot of data taken during a single operational cycle, and each column represents a different variable. The columns correspond to the following attributes:

  • Unit number
  • Time, in cycles
  • Operational setting 1
  • Operational setting 2
  • Operational setting 3
  • Sensor measurement 1
  • Sensor measurement 2
  • …
  • Sensor measurement 21

In addition, the dataset provides a vector of true RUL values for the test data, which will be used as the ground truth when training and evaluating the models.

Exploratory analysis

To get an idea of the sensor readings that reflect the physical state of the engine (for example, the temperature of a component, the fan speed of the turbine, and so on), we will extract the first unit of the first dataset across all of its sensor channels. For this, we have written a script (see make_dataset.py) that collects all of the data files from the input directory, parses the raw files, and returns a single aggregated DataFrame with the appropriate column names:

import pandas as pd
from glob import glob
from os.path import basename

# signature inferred from the variables the original snippet uses
def load_data(file_pattern, columns, label_data=False):
    data_sets = []
    for data_file in glob(file_pattern):
        if label_data:
            # read the RUL label file as a DataFrame
            subset_df = pd.read_csv(data_file, header=None)
            # need to create a unit_id column explicitly
            unit_id = range(1, subset_df.shape[0] + 1)
            subset_df.insert(0, 'unit_id', unit_id)
        else:
            # read in the contents as a DataFrame
            subset_df = pd.read_csv(data_file, sep=' ', header=None, usecols=range(26))
        # extract the id of the dataset from the file name and add it as a column
        dataset_id = basename(data_file).split("_")[1][:5]
        subset_df.insert(0, 'dataset_id', dataset_id)
        # add to the list
        data_sets.append(subset_df)
    # combine the per-file DataFrames
    df = pd.concat(data_sets)
    df.columns = columns
    # return the result
    return df
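
The columns list assigned near the end of the listing is defined elsewhere in make_dataset.py. A plausible definition, consistent with the attribute list above (the exact names are assumptions, chosen to match the sensor/setting prefixes and the cycle column used by the exploratory code below), is:

columns = (['dataset_id', 'unit_id', 'cycle']
           + ['setting{}'.format(i) for i in range(1, 4)]    # 3 operational settings
           + ['sensor{}'.format(i) for i in range(1, 22)])   # 21 sensor measurements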

To use this script, first copy all the raw data files into the data/raw/ directory, and then execute the following command:

$ python3 make_dataset.py data/raw/ data/processed/

This command will generate three files (train.csv, test.csv, and RUL.csv) for the training set, test set, and labels, respectively. Now that our dataset is ready for exploratory analysis, we can read each CSV file as a pandas DataFrame:

import pandas as pd

# load the processed data in CSV format
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
rul_df = pd.read_csv('RUL.csv')

# for convenience, identify the sensor and operational setting columns
sensor_columns = [col for col in train_df.columns if col.startswith("sensor")]
setting_columns = [col for col in train_df.columns if col.startswith("setting")]

Then, extract the first unit from the first dataset (we name the variable first_unit rather than slice, so as not to shadow the Python built-in):

first_unit = train_df[(train_df.dataset_id == 'FD001') & (train_df.unit_id == 1)]

Then, we plot its sensor traces over time on a 7 * 3 = 21 grid of subplots, one for each of the 21 sensor channels:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(7, 3, figsize=(15, 10), sharex=True)

for index, ax in enumerate(axes.ravel()):
    sensor_col = sensor_columns[index]
    first_unit.plot(x='cycle', y=sensor_col, ax=ax, color='blue');
    # label formatting
    if index % 3 == 0:
        ax.set_ylabel("Sensor reading", size=10);
    else:
        ax.set_ylabel("");
    ax.set_xlabel("Time (cycle)");
    ax.set_title(sensor_col.title(), size=14);
    ax.legend_.remove();

# plot formatting
fig.suptitle("Sensor readings: unit 1, dataset 1", size=20, y=1.025)
fig.tight_layout();

As seen in the following diagram, data from some of the sensor channels is quite noisy and fluctuates over time, while data from other channels does not seem to change at all. Note also that each engine's lifetime is different, so the traces start and end at different points on the x axis:

[Figure: Sensor readings for unit 1, dataset 1]

We can see that each engine has a slightly different lifetime and failure pattern. Next, we can visualize the data from all the sensor channels against time for a random sample of 10 engines from the training set:

import numpy as np

# randomly select 10 units from dataset 1 to plot
all_units = train_df[train_df['dataset_id'] == 'FD001']['unit_id'].unique()
units_to_plot = np.random.choice(all_units, size=10, replace=False)

# get the data for these units
plot_data = train_df[(train_df['dataset_id'] == 'FD001') &
                     (train_df['unit_id'].isin(units_to_plot))].copy()

# plot their sensor traces (overlaid)
fig, axes = plt.subplots(7, 3, figsize=(15, 10), sharex=True)

for index, ax in enumerate(axes.ravel()):
    sensor_col = sensor_columns[index]

    for unit_id, group in plot_data.groupby('unit_id'):
        # plot the raw sensor trace
        (group.plot(x='cycle', y=sensor_col, alpha=0.45, ax=ax, color='gray', legend=False));
        # overlay the 10-cycle rolling mean sensor trace for visual clarity
        (group.rolling(window=10, on='cycle')
             .mean()
             .plot(x='cycle', y=sensor_col, alpha=.75, ax=ax, color='black', legend=False));

    # label formatting
    if index % 3 == 0:
        ax.set_ylabel("Sensor Value", size=10);
    else:
        ax.set_ylabel("");
    ax.set_title(sensor_col.title());
    ax.set_xlabel("Time (Cycles)");

# plot formatting
fig.suptitle("All Sensor Traces: Dataset 1 (Random Sample of 10 Units)", size=20, y=1.025);
fig.tight_layout();

The preceding code segment produces the following graph of sensor readings for a random sample of 10 units from dataset 1:

[Figure: All sensor traces, dataset 1 (random sample of 10 units)]

From the preceding graph, we can see that each engine's progression over time is not aligned with the others. This is an impediment: it prevents us from comparing, for example, the fifth cycle of one engine with the fifth cycle of another.

Inspecting failure modes

Since it is already known when each engine in the training set fails, we can compute a time before failure value at each time step, which can be defined as follows:

Time before failure (TBF) = current elapsed life (in cycles) - total operating lifetime (TOL)

This number is negative while the engine is running and reaches zero at the moment of failure; for example, an engine with a 200-cycle lifetime has TBF = -150 at cycle 50. It can be considered as a countdown to failure for each engine, which allows us to align different engines' data to a common end:

# generate the lifetimes series
lifetimes = train_df.groupby(['dataset_id', 'unit_id'])['cycle'].max()

# apply the cycles_until_failure helper to the data we're plotting
plot_data['ctf'] = plot_data.apply(lambda r: cycles_until_failure(r, lifetimes), axis=1)

# plot the sensor traces (overlaid)
fig, axes = plt.subplots(7, 3, figsize=(15, 10), sharex=True)
for index, ax in enumerate(axes.ravel()):
    sensor_col = sensor_columns[index]
    # use the same subset of data as above
    for unit_id, group in plot_data.groupby('unit_id'):
        # plot the raw sensor trace, using ctf on the time axis
        (group.plot(x='ctf', y=sensor_col, alpha=0.45, ax=ax, color='gray', legend=False));

        # overlay the 10-cycle rolling mean sensor trace for visual clarity
        (group.rolling(window=10, on='ctf')
             .mean()
             .plot(x='ctf', y=sensor_col, alpha=.75, ax=ax, color='black', legend=False));

    # label formatting
    if index % 3 == 0:
        ax.set_ylabel("Sensor Value", size=10);
    else:
        ax.set_ylabel("");
    ax.set_title(sensor_col.title());
    ax.set_xlabel("Time Before Failure (Cycles)");

    # add a vertical red line to signal common time of failure
    ax.axvline(x=0, color='r', linewidth=3);

    # extend the x-axis to compensate
    ax.set_xlim([None, 10]);
fig.suptitle("All Sensor Traces: Dataset 1 (Random Sample of 10 Units)", size=20, y=1.025);
fig.tight_layout();
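
The cycles_until_failure helper is provided by the chapter's companion code and is not shown in the listing. A minimal sketch consistent with the plots (a negative countdown that reaches zero at the failure cycle; the implementation is an assumption):

def cycles_until_failure(row, lifetimes):
    # lifetimes is the series of maximum cycles indexed by (dataset_id, unit_id)
    return row['cycle'] - lifetimes[(row['dataset_id'], row['unit_id'])]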

The following graph shows the same sensor channels for the same engines. The only difference is that the traces are now plotted against the time before failure, so each engine ends at the same instant (t = 0). This alignment reveals a common pattern across different engines: some sensor readings consistently rise or fall right before a failure, while others (for example, sensor 14) exhibit different failure behaviors:

[Figure: All sensor traces, dataset 1, aligned on time before failure]

This pattern is very common in many predictive maintenance problems: failure is often a confluence of different processes, so real-world equipment is likely to exhibit multiple failure modes. Because of these diverging patterns in the data, predicting the RUL is very challenging.

Prediction challenges

As shown in the following diagram, after observing the engine's sensor measurements and operating conditions for a certain amount of time (133 cycles in the diagram), the challenge is to predict the amount of time (in other words, the RUL) that the engine will continue to function before it fails:

[Figure: RUL prediction after observing 133 cycles of sensor data]

An ML/DL model can make an incorrect prediction in two directions. Underestimating the true RUL of a particular engine brings the turbine in for maintenance too early, when it could have operated for a while longer without any issues arising. Overestimating the true RUL, on the other hand, might allow a degrading aircraft to keep flying, risking catastrophic engine failure. Clearly, the costs of these two outcomes are not the same. Considering these challenges, in the next section, we will focus on using DL-based techniques for predicting RUL.
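
One way to encode this cost asymmetry is the scoring function used in the PHM08 prognostics challenge that accompanied this dataset: it penalizes late predictions (overestimated RUL) exponentially more heavily than early ones. A minimal sketch (the constants 13 and 10 are the challenge's published decay parameters):

import numpy as np

def phm08_score(y_true, y_pred):
    # d > 0 means the RUL was overestimated (a late prediction),
    # which is penalized more heavily than an early one (d < 0)
    d = np.asarray(y_pred) - np.asarray(y_true)
    return np.sum(np.where(d < 0, np.exp(-d / 13.0) - 1, np.exp(d / 10.0) - 1))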

DL for predicting RUL

As we have discussed, we are trying to calculate the amount of time before an engine needs maintenance. What makes this dataset special is that the engines run all the way until failure, giving us the precise RUL information for every engine at every point in time.

Calculating cut-off times

Let's consider the FD004 dataset, which contains as many as 249 engines (engine_no) monitored over time (time_in_cycles). Each engine has operational_settings and sensor_measurements recorded for each cycle:

data_path = 'train_FD004.txt'
data = utils.load_data(data_path)
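
The utils.load_data helper comes with the chapter's companion code and is not shown. A minimal sketch, assuming it reads the space-separated raw file, names the columns to match the entity set built below, and adds explicit index and time columns (the synthetic timestamp is an assumption; featuretools merely needs a monotonic time index):

import pandas as pd

def load_data(path):
    cols = (['engine_no', 'time_in_cycles']
            + ['operational_setting_{}'.format(i) for i in range(1, 4)]
            + ['sensor_measurement_{}'.format(i) for i in range(1, 22)])
    df = pd.read_csv(path, sep=' ', header=None, usecols=range(26), names=cols)
    df = df.reset_index()  # an explicit 'index' column for the entity set
    # a synthetic timestamp (one day per cycle) to serve as the time index
    df['time'] = pd.Timestamp('2000-01-01') + pd.to_timedelta(df['time_in_cycles'], unit='D')
    return df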

To train a model that will predict RUL, we can simulate real predictions by choosing a random point in the life of each engine and only using the data from before that point. We can easily create features under that restriction by using cut-off times:

def make_cutoff_times(data):
    gb = data.groupby(['engine_no'])
    labels = []
    for engine_no_df in gb:
        # number of recordings for this engine
        instances = engine_no_df[1].shape[0]
        # remaining cycles until failure (the label) for each recording
        label = [instances - i - 1 for i in range(instances)]
        labels += label
    return new_labels(data, labels)

The preceding function computes, for each recording, the number of cycles remaining until failure, and then delegates to the new_labels helper (also from utils), which samples one point per engine and pairs its cut-off time with the corresponding label. It can be called as follows:

cutoff_times = utils.make_cutoff_times(data)
cutoff_times.head()

The preceding lines of code show the cut-off time and RUL values for the first five engines:

[Table: cut-off times and RUL values for the first five engines]
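
The new_labels helper itself is provided by utils and is not shown. A loose sketch of the assumed behavior (sample one random recording per engine and pair its timestamp, the cut-off time, with the remaining-cycles label):

def new_labels(data, labels):
    frame = data[['engine_no', 'time']].copy()
    frame['RUL'] = labels
    # pick one random point from each engine's life
    sampled = (frame.groupby('engine_no', group_keys=False)
                    .apply(lambda df: df.sample(1)))
    return sampled.rename(columns={'time': 'cutoff_time'}).reset_index(drop=True)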

Deep feature synthesis

Then, we generate the features using Deep Feature Synthesis (DFS). For this, we need to establish an entity set structure for our data. We can create an engines entity by normalizing the engine_no column in the raw data:

def make_entityset(data):
    es = ft.EntitySet('Dataset')
    es.entity_from_dataframe(dataframe=data,
                             entity_id='recordings',
                             index='index',
                             time_index='time')
    es.normalize_entity(base_entity_id='recordings',
                        new_entity_id='engines',
                        index='engine_no')
    es.normalize_entity(base_entity_id='recordings',
                        new_entity_id='cycles',
                        index='time_in_cycles')
    return es
es = make_entityset(data)

The preceding code block will generate the following statistics of the entity set:

Entityset: Dataset
   Entities:
     recordings [Rows: 20631, Columns: 28]
     engines [Rows: 100, Columns: 2]
     cycles [Rows: 362, Columns: 2]
   Relationships:
     recordings.engine_no -> engines.engine_no
     recordings.time_in_cycles -> cycles.time_in_cycles

The ft.dfs function takes an entity set and stacks primitives such as max, min, and last exhaustively across entities:

fm, features = ft.dfs(entityset=es,
                      target_entity='engines',
                      agg_primitives=['last', 'max', 'min'],
                      trans_primitives=[],
                      cutoff_time=cutoff_times,
                      max_depth=3,
                      verbose=True)
fm.to_csv('FM.csv')

ML baselines

Now that we have generated the features, we can start by training a first ML model, RandomForestRegressor; then, we will gradually move to DL using a Long Short-Term Memory (LSTM) network. Random forest (RF) is an ensemble technique that builds several decision trees and merges them to obtain a more accurate and stable prediction. The deeper a tree, the more complex its decision rules and the more closely it fits the data; the forest reduces the resulting variance by averaging many such trees. In other words, a final prediction based on the majority vote of a panel of independent jurors is usually better and more reliable than that of the single best juror. The following diagram shows a random forest and its ensembling technique:

[Figure: Random forest and its ensembling technique]

So, let’s get started by preparing the separate training set and test set:

from sklearn.model_selection import train_test_split

fm = pd.read_csv('FM.csv', index_col='engine_no')
X = fm.copy().fillna(0)
y = X.pop('RUL')
X_train, X_test, y_train, y_test = train_test_split(X, y)

Then, using the training set, we will check the following baselines:

  • Always predict the median value of y_train.
  • Always predict the RUL as if every engine has the median lifespan in X_train.

We will evaluate those baselines by computing the mean of the absolute values of the errors, known as the Mean Absolute Error (MAE), using the mean_absolute_error function from scikit-learn:

from sklearn.metrics import mean_absolute_error

yhat_median_predict = [np.median(y_train) for _ in y_test]
print('Baseline by median label: MAE = {:.2f}'.format(
    mean_absolute_error(yhat_median_predict, y_test)))

# collect the recordings of the engines in the training set
recordings_from_train = es['recordings'].df[es['recordings'].df['engine_no'].isin(y_train.index)]
median_life = np.median(recordings_from_train.groupby(['engine_no']).apply(lambda df: df.shape[0]))

# collect the recordings of the engines in the test set
recordings_from_test = es['recordings'].df[es['recordings'].df['engine_no'].isin(y_test.index)]
life_in_test = recordings_from_test.groupby(['engine_no']).apply(lambda df: df.shape[0]) - y_test

# compute the MAE of the baseline given by the median lifespan
yhat_median_predict2 = (median_life - life_in_test).apply(lambda row: max(row, 0))
print('Baseline by median life: MAE = {:.2f}'.format(
    mean_absolute_error(yhat_median_predict2, y_test)))

The preceding code block should produce the following output showing the baseline MAE values:

Baseline by median label: MAE = 66.72
Baseline by median life: MAE = 59.96

Making predictions

Now we can use the features we created to fit a RandomForestRegressor to our data and see whether we can improve on the previous scores:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()  # instantiate RandomForestRegressor from scikit-learn
rf.fit(X_train, y_train)  # train the regressor on the training set

preds = rf.predict(X_test)  # make predictions on unseen observations
scores = mean_absolute_error(preds, y_test)  # compute the MAE

print('Mean Abs Error: {:.2f}'.format(scores))
high_imp_feats = utils.feature_importances(X, rf, feats=10)  # print the feature importances

The preceding code block should produce the following output showing the MAE value and the 10 most important features:

Mean Abs Error: 31.04
 1: LAST(recordings.cycles.LAST(recordings.sensor_measurement_4)) [0.395]
 2: LAST(recordings.sensor_measurement_4) [0.192]
 3: MAX(recordings.sensor_measurement_4) [0.064]
 4: LAST(recordings.cycles.MIN(recordings.sensor_measurement_11)) [0.037]
 5: LAST(recordings.cycles.MAX(recordings.sensor_measurement_12)) [0.029]
 6: LAST(recordings.sensor_measurement_15) [0.020]
 7: LAST(recordings.cycles.MAX(recordings.sensor_measurement_11)) [0.020]
 8: LAST(recordings.cycles.LAST(recordings.sensor_measurement_15)) [0.018]
 9: MAX(recordings.cycles.MAX(recordings.sensor_measurement_20)) [0.016]
 10: LAST(recordings.time_in_cycles) [0.014]
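
Here, utils.feature_importances is another companion-code helper. A minimal sketch that matches the output format above (the implementation is an assumption):

def feature_importances(X, reg, feats=10):
    # rank the fitted model's feature importances and print the top entries
    ranked = sorted(zip(reg.feature_importances_, X.columns), reverse=True)
    for i, (imp, name) in enumerate(ranked[:feats], 1):
        print('{}: {} [{:.3f}]'.format(i, name, imp))
    return [name for imp, name in ranked[:feats]]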

Then, we compute the feature matrix for the test data, reusing the feature definitions learned from the training data:

data2 = utils.load_data('test_FD004.txt')
es2 = make_entityset(data2)
fm2 = ft.calculate_feature_matrix(entityset=es2, features=features, verbose=True)
fm2.head()

The loaded data should have 41,214 recordings from 249 engines, in which 21 sensor measurements are recorded under three operational settings. Then, we prepare both the features and the labels from the loaded data, which we can do using the following code:

X = fm2.copy().fillna(0)
y = pd.read_csv('RUL_FD004.txt', sep=' ', header=None, names=['RUL'], index_col=False)

preds2 = rf.predict(X)
print('Mean Abs Error: {:.2f}'.format(mean_absolute_error(preds2, y)))

yhat_median_predict = [np.median(y_train) for _ in preds2]
print('Baseline by median label: MAE = {:.2f}'.format(
    mean_absolute_error(yhat_median_predict, y)))

yhat_median_predict2 = (median_life - es2['recordings'].df.groupby(['engine_no']).apply(lambda df: df.shape[0])).apply(lambda row: max(row, 0))

print('Baseline by median life: MAE = {:.2f}'.format(
    mean_absolute_error(yhat_median_predict2, y)))

The preceding code block should produce the following output showing the predicted MAE and the baseline MAE values:

Mean Abs Error: 40.33
Baseline by median label: Mean Abs Error = 52.08
Baseline by median life: Mean Abs Error = 49.55

As we can see, the predicted MAE value is lower than both baseline MAE values. Next, we will try to improve the MAE even further using an LSTM network.

Improving MAE with LSTM

We will use a Keras-based LSTM network to predict the RUL. However, for this, we first need to convert the data so that the LSTM model, which expects input in a three-dimensional (samples, timesteps, features) format, can consume it:

# prepare the data for the Keras-based LSTM model
def prepareData(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    X_train = X_train.values   # .as_matrix() is deprecated in newer pandas
    X_test = X_test.values
    y_train = y_train.values
    y_test = y_test.values
    y_train = y_train.reshape((y_train.shape[0], 1))
    y_test = y_test.reshape((y_test.shape[0], 1))
    # each sample becomes a sequence with a single timestep
    X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))
    X_test = np.reshape(X_test, (X_test.shape[0], 1, X_test.shape[1]))
    return X_train, X_test, y_train, y_test
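
A quick shape check clarifies the layout the LSTM will consume; each engine's feature vector becomes a sequence with a single timestep:

X_train, X_test, y_train, y_test = prepareData(X, y)
print(X_train.shape)  # (n_samples, 1, n_features)
print(y_train.shape)  # (n_samples, 1)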

Now that we have the data in a format appropriate for the LSTM model, we can construct the network. We use a simple LSTM network that has an LSTM layer followed by a dense layer, after which we apply a dropout layer for better regularization. Then, we have another dense layer, whose output is projected through an activation layer with the linear activation function so that the network produces real-valued outputs. For optimization, we use RMSProp, a variant of SGD, to minimize the Mean Squared Error (MSE):

#Create LSTM model
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.recurrent import LSTM
from keras.layers import Dropout
from keras.layers import GaussianNoise

def createLSTMModel(X_train, hidden_neurons):
    model = Sequential()
    model.add(LSTM(hidden_neurons, input_shape=(X_train.shape[1], X_train.shape[2])))
    model.add(Dense(hidden_neurons))
    model.add(Dropout(0.7))
    model.add(Dense(1))
    model.add(Activation("linear"))
    model.compile(loss="mean_squared_error", optimizer="rmsprop")
    return model

Then, we train the LSTM model with the training set:

X_train, X_test, y_train, y_test = prepareData(X, y)
hidden_neurons = 128
model = createLSTMModel(X_train, hidden_neurons)
history = model.fit(X_train, y_train, batch_size=32, nb_epoch=5000, validation_split=0.20)

The preceding lines of code should produce some logs, which give us an idea of whether the training and validation losses are decreasing across epochs:

Train on 60 samples, validate on 15 samples
 Epoch 1/5000
 60/60 [==============================] - 1s 11ms/step - loss: 7795.0232 - val_loss: 8052.6118
 Epoch 2/5000
 60/60 [==============================] - 0s 301us/step - loss: 7466.3648 - val_loss: 7833.4321
 ...
 60/60 [==============================] - 0s 259us/step - loss: 1822.5668 - val_loss: 1420.7977
 Epoch 4976/5000
 60/60 [==============================] - ETA: 0s - loss: 1862.04

Now that training has finished, we can plot the training and validation loss:

# plot history
plt.plot(history.history['loss'], label='Training')
plt.plot(history.history['val_loss'], label='Validation')
plt.legend()
plt.show()

The preceding code block should produce the following graph, in which we can see that the validation loss drops below the training loss:

[Figure: Training versus validation loss across epochs]

The model may be overfitting the training data; measuring and plotting the MAE during training may shed more light on this. Let's take a look at the MAE on the test set:

predicted = model.predict(X_test)
print('Mean Abs Error: {:.2f}'.format(mean_absolute_error(predicted, y_test)))

We should get an MAE of 38.32, which means the error has been reduced a bit compared to the RF model (which gave an MAE of 40.33) but is still not convincing. There could be several reasons behind such a high MAE: first, we do not have sufficient training data; second, we used an inefficient way of generating the entity set. For the first problem, we could use the full dataset to train the model. We can also apply additional regularization techniques, such as Gaussian noise layers, specifying the standard deviation of the noise:

def createLSTMModel(X_train, hidden_neurons):
    model = Sequential()
    model.add(LSTM(hidden_neurons, input_shape=(X_train.shape[1], X_train.shape[2])))
    model.add(GaussianNoise(0.2))
    model.add(Dense(hidden_neurons))
    model.add(Dropout(0.7))
    model.add(Dense(1))
    model.add(GaussianNoise(0.5))
    model.add(Activation("linear"))
    model.compile(loss="mean_squared_error", optimizer="rmsprop")
    return model    

A Gaussian noise layer can be used as an input layer to add noise directly to the input variables, which is the traditional use of noise as a regularization method in neural networks. The noise can be added either before or after the activation function; adding it before the activation arguably makes more sense, but both options are possible. In our case, we added a Gaussian noise layer with a standard deviation of 0.2 after the LSTM layer and before the first dense layer.

Then, we have another Gaussian noise layer, with a standard deviation of 0.5, that adds noise to the output of the second dense layer before the final linear activation. Training the LSTM model on the same data with this noise introduced should produce a slightly lower MAE value of around 35.25. We can also inspect the plot showing the training and validation loss:

[Figure: Training versus validation loss with Gaussian noise regularization]

The preceding diagram shows that the training and validation losses are more or less the same, which indicates better regularization of the model. Hence, the model also performed better on the test set. However, the MAE could perhaps still be reduced by using better-quality features. Let's explore this with a better feature generation technique.

Unsupervised deep feature synthesis

We will now see how the entity set structure can help improve predictive accuracy. We will also build custom primitives using time-series functions from the tsfresh library. Before that, we will make cut-off times by selecting a random point from the life of each engine. We are going to make ten sets of cut-off times to use for cross-validation:

from tqdm import tqdm
splits = 10
cutoff_time_list = []
for i in tqdm(range(splits)):
    cutoff_time_list.append(utils.make_cutoff_times(data))
cutoff_time_list[0].head()

The preceding code block should show the cut-off times and RUL values for the first five engines, as shown here:

[Table: cut-off times and RUL values for the first five engines]

Then, we will use an unsupervised way of enriching the entity set. The values of operational settings 1 to 3 are continuous, but they create an implicit relation between different engines: if two engines have similar operational settings, their sensor measurements tend to take similar values. The idea is to apply k-means clustering to those settings and then create a new entity from the clusters of similar values:

from sklearn.cluster import KMeans

nclusters = 50

def make_entityset(data, nclusters, kmeans=None):
    X = data[['operational_setting_1', 'operational_setting_2', 'operational_setting_3']]
    if kmeans is None:
        # fit the clustering on the training data only
        kmeans = KMeans(n_clusters=nclusters).fit(X)
    data['settings_clusters'] = kmeans.predict(X)
    es = ft.EntitySet('Dataset')
    es.entity_from_dataframe(dataframe=data,
                             entity_id='recordings',
                             index='index',
                             time_index='time')
    es.normalize_entity(base_entity_id='recordings',
                        new_entity_id='engines',
                        index='engine_no')
    es.normalize_entity(base_entity_id='recordings',
                        new_entity_id='settings_clusters',
                        index='settings_clusters')
    return es, kmeans

es, kmeans = make_entityset(data, nclusters)

The preceding code segment generates an entity set showing the following relations:

Entityset: Dataset
   Entities:
     settings_clusters [Rows: 50, Columns: 2]
     recordings [Rows: 61249, Columns: 29]
     engines [Rows: 249, Columns: 2]
   Relationships:
     recordings.engine_no -> engines.engine_no
     recordings.settings_clusters -> settings_clusters.settings_clusters
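
Note that make_entityset returns the fitted kmeans object so that test data can be mapped to exactly the same clusters; for example (test_data here stands for a DataFrame loaded in the same way as data):

es_test, _ = make_entityset(test_data, nclusters, kmeans=kmeans)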

In addition to changing our entity set structure, we are also going to use a complexity time-series primitive built from the tsfresh package. Any function that takes in a pandas Series and outputs a float can be converted into an aggregation primitive using the make_agg_primitive function, as shown here:

from featuretools.primitives import make_agg_primitive
import featuretools.variable_types as vtypes
from tsfresh.feature_extraction.feature_calculators import cid_ce

Complexity = make_agg_primitive(lambda x: cid_ce(x, False),
                                input_types=[vtypes.Numeric],
                                return_type=vtypes.Numeric,
                                name="complexity")
fm, features = ft.dfs(entityset=es, 
                      target_entity='engines',
                      agg_primitives=['last', 'max', Complexity],
                      trans_primitives=[],
                      chunk_size=.26,
                      cutoff_time=cutoff_time_list[0],
                      max_depth=3,
                      verbose=True)
fm.to_csv('Advanced_FM.csv')
fm.head()
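
For intuition, tsfresh's cid_ce(x, normalize) estimates the complexity of a series as the square root of the sum of squared successive differences, so jagged traces score higher than flat ones. A quick check:

import numpy as np
from tsfresh.feature_extraction.feature_calculators import cid_ce

print(cid_ce(np.array([0.0, 1.0, 0.0, 1.0]), False))  # sqrt(3), a jagged series
print(cid_ce(np.array([5.0, 5.0, 5.0, 5.0]), False))  # 0.0, a flat series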

Using this approach, we managed to generate 12 more features than before (previously, we had 290). Then, we built nine more feature matrices with the same feature set but different cut-off times. This lets us test the pipeline multiple times before using it on the test data:

fm_list = [fm]
for i in tqdm(range(1, splits)):
    fm = ft.calculate_feature_matrix(entityset=make_entityset(data, nclusters, kmeans=kmeans)[0], 
         features=features, chunk_size=.26, cutoff_time=cutoff_time_list[i])
    fm_list.append(fm)

Then, using recursive feature elimination (RFE), we again fit RF regressors so that the model keeps only the most important features, letting it make better predictions:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.feature_selection import RFE

def pipeline_for_test(fm_list, hyperparams={'n_estimators':100, 'max_feats':50, 'nfeats':50}, do_selection=False):
    scores = []
    regs = []
    selectors = []
    for fm in fm_list:
        X = fm.copy().fillna(0)
        y = X.pop('RUL')
        reg = RandomForestRegressor(n_estimators=int(hyperparams['n_estimators']), 
              max_features=min(int(hyperparams['max_feats']), int(hyperparams['nfeats'])))
        X_train, X_test, y_train, y_test = train_test_split(X, y)

        if do_selection:
            reg2 = RandomForestRegressor(n_jobs=3)
            selector = RFE(reg2, int(hyperparams['nfeats']), step=25)
            selector.fit(X_train, y_train)
            X_train = selector.transform(X_train)
            X_test = selector.transform(X_test)
            selectors.append(selector)
        reg.fit(X_train, y_train)
        regs.append(reg)
        preds = reg.predict(X_test)
        scores.append(mean_absolute_error(preds, y_test))
    return scores, regs, selectors  
  
scores, regs, selectors = pipeline_for_test(fm_list)
print([float('{:.1f}'.format(score)) for score in scores])
print('Average MAE: {:.1f}, Std: {:.2f}\n'.format(np.mean(scores), np.std(scores)))
most_imp_feats = utils.feature_importances(fm_list[0], regs[0])

The preceding code block should produce the following output showing the predicted MAE in each iteration, along with their average and standard deviation, followed by the most important features:

[33.9, 34.5, 36.0, 32.1, 36.4, 30.1, 37.2, 34.7, 38.6, 34.4]
 Average MAE: 33.1, Std: 4.63
 1: MAX(recordings.settings_clusters.LAST(recordings.sensor_measurement_13)) [0.055]
 2: MAX(recordings.sensor_measurement_13) [0.044]
 3: MAX(recordings.sensor_measurement_4) [0.035]
 4: MAX(recordings.settings_clusters.LAST(recordings.sensor_measurement_4)) [0.029]
 5: MAX(recordings.sensor_measurement_11) [0.028]

Now, let's try the LSTM again to see whether we can reduce the MAE:

X = fm.copy().fillna(0)
y = X.pop('RUL')
X_train, X_test, y_train, y_test = prepareData(X, y)

hidden_neurons = 128
model = createLSTMModel(X_train, hidden_neurons)

history = model.fit(X_train, y_train, batch_size=32, nb_epoch=5000, validation_split=0.20)
# plot history
plt.plot(history.history['loss'], label='Training')
plt.plot(history.history['val_loss'], label='Validation')
plt.legend()
plt.show()

The preceding lines of code should produce the following diagram, in which the validation loss drops below the training loss:

[Figure: Training versus validation loss for the LSTM on the new features]

Finally, we can evaluate the model's performance based on the MAE:

predicted = model.predict(X_test)
print('Mean Abs Error: {:.2f}'.format(mean_absolute_error(predicted, y_test)))

The preceding code block should produce an MAE of 52.40, which is, however, higher than the MAE we achieved in the previous section.
