Commonly leveraged Machine Learning methods
Machine learning methods in transportation are currently confined mostly to asset management (primarily predictive analytics) and route optimization. In this article, I propose a few additional areas where machine learning can be leveraged within transportation analytics.
While many types of machine learning methods can, in theory, be applied to transportation data, in this post we will discuss two families of methods that have actually been used in industrial applications (i.e., they are not mere theory):
- Supervised learning, where previously labeled data is used to guide the learning process
- Unsupervised learning, where only unlabeled data is used
I will assume that readers are familiar with basic machine learning terms such as supervised and unsupervised learning, so I will touch on the definitions only briefly here. A supervised learning method trains a function (or algorithm) to compute output variables from a given data set in which both input and output variables are present.
Commonly used supervised learning methods
Here, we discuss two big categories of supervised learning methods, namely classification and regression. For example, given the speed information of individual vehicles for a highway section, the problem can be defined in the following ways:
- Estimating how many drivers are speeding based on the speed limit provided for the highway
- Estimating the average speed on the highway in the future based on past data
In the first case, because the solution relies on separating the data into users who are speeding and users who are driving below the speed limit, the problem can be thought of as a classification problem. In the second case, the solution maps past data to an estimate of the future average speed of the highway section, so it can be thought of as a regression problem.
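To make the two framings concrete, here is a minimal sketch of how the same speed records give a discrete target for classification and a continuous target for regression. The observations, the 65 mph limit, and all names are assumptions I made up for illustration, not real data.

```python
from collections import defaultdict

SPEED_LIMIT_MPH = 65  # assumed limit for this hypothetical highway section

# Hypothetical observations: (hour of day, vehicle speed in mph)
observations = [
    (7, 62.0), (7, 71.5), (7, 66.0),
    (8, 58.0), (8, 61.0), (8, 69.0),
    (9, 64.0), (9, 72.0),
]

# Classification framing: one discrete label per vehicle.
labels = [
    "speeding" if speed > SPEED_LIMIT_MPH else "not-speeding"
    for _, speed in observations
]
speeding_count = labels.count("speeding")

# Regression framing: one continuous value (average speed) per time period.
speeds_by_hour = defaultdict(list)
for hour, speed in observations:
    speeds_by_hour[hour].append(speed)
average_speed_by_hour = {
    hour: sum(speeds) / len(speeds) for hour, speeds in speeds_by_hour.items()
}

print(speeding_count)
print(average_speed_by_hour)
```

Nothing is learned in this sketch. It only shows what the target variable looks like in each case: a class label per vehicle versus a number per time period.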
For a classification problem, the goal of the machine learning algorithm is to categorize or classify given inputs based on the training data set. The training data set in a classification problem consists of a set of input-output pairs, where each output is a class. Many classification problems are binary, i.e., only two classes such as True and False are involved.
An example in a transportation context: an individual vehicle’s speed data over time can be classified as “speeding” or “not-speeding.” Another example is categorical classification, e.g., volume and speed data over time for a highway segment can be classified into level of service “A,” “B,” “C,” “D,” “E,” and “F.”
When a new set of observations is presented to a trained classification algorithm, the algorithm assigns each observation to one of a set of predetermined classes.
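As an illustration of a trained classifier labeling new observations, here is a minimal k-nearest-neighbors sketch. k-nearest neighbors is my choice for the example, not a method this post prescribes, and the training pairs, feature choice (speed and speed limit), and function name are hypothetical. It uses `math.dist`, which requires Python 3.8 or later.

```python
import math
from collections import Counter

# Hypothetical labeled training data: ((speed mph, speed limit mph), label)
training_data = [
    ((72.0, 65.0), "speeding"),
    ((80.0, 65.0), "speeding"),
    ((61.0, 55.0), "speeding"),
    ((50.0, 45.0), "speeding"),
    ((60.0, 65.0), "not-speeding"),
    ((52.0, 55.0), "not-speeding"),
    ((40.0, 45.0), "not-speeding"),
    ((63.0, 65.0), "not-speeding"),
]

def classify(observation, training_data, k=3):
    """Return the majority label among the k nearest training examples."""
    nearest = sorted(
        training_data,
        key=lambda pair: math.dist(observation, pair[0]),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# New, unlabeled observations presented to the classifier.
new_observations = [(78.0, 65.0), (42.0, 45.0)]
predictions = [classify(obs, training_data) for obs in new_observations]
print(predictions)
```

With so few training examples the classifier is only reliable far from the class boundary. A real application would need far more labeled data and probably a different feature set.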
For a regression problem, the goal of the machine learning algorithm is to model the relationship between outputs and inputs as a continuous function, so that the machine can capture how the outputs change for given inputs. Regression problems can also be viewed as prediction problems. For example, given historical information about volume and speed for a given highway, the output can be the average speed of the highway for the next time period. The relationship between output variables and input variables can be defined by various mathematical functions, such as linear, nonlinear, and logistic functions.
For example, for a given highway, input parameters can be
- Volume (i.e., number of vehicles per hour)
- Current time
- Age of the driver
The corresponding output parameter can be the average traffic speed. A learning algorithm can use this information to automatically train a function (or algorithm) that computes the speed from a given input.
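The training step described above can be sketched with ordinary least squares on a single input. To keep the example short I assume that only volume is used as the input and that the relationship is linear; the data points and function names are hypothetical.

```python
# Hypothetical hourly history: (volume in vehicles/hour, average speed in mph)
history = [
    (800, 64.0),
    (1200, 61.0),
    (1600, 57.0),
    (2000, 52.0),
    (2400, 46.0),
]

def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(history)

def predict_speed(volume):
    """Estimate average speed (mph) for a given volume using the fitted line."""
    return intercept + slope * volume

print(predict_speed(1800))
```

Adding the other inputs from the list (time, driver age) would turn this into a multiple regression, which is usually done with a numerical library rather than by hand.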
Unsupervised learning methods depend only on the underlying unlabeled data to identify hidden patterns in the data, instead of inferring models for known input-output pairs. Clustering and association are two popular families of methods for unsupervised learning.
Clustering methods focus on grouping data into multiple clusters based on the similarity between data points. Usually, clustering methods rely on mathematical models to identify similarities between unlabeled data points. Similarity is measured with metrics such as Euclidean distance.
Consider an example of a transportation engineer with a closed circuit television (CCTV) recording of peak hour traffic data for a highway segment, without control information. The engineer’s goal is to find clusters such as aggressive drivers, slow drivers, and normal drivers by observing driving pattern data such as acceleration and deceleration. In this case, it is important to note that the logic rules that define such clusters are set by the engineer based on their own domain expertise.
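One way to sketch this kind of grouping is k-means clustering with Euclidean distance. k-means is my choice for the illustration, not something the example requires. The per-driver features, their values, and the starting centroids are all hypothetical, and the sketch assumes the features have already been extracted from the video. It uses `math.dist`, which requires Python 3.8 or later.

```python
import math

# Hypothetical per-driver features extracted from the recording:
# (mean acceleration, mean deceleration), both in m/s^2
drivers = [
    (2.9, 3.4), (3.1, 3.6), (2.8, 3.2),
    (1.5, 1.8), (1.6, 1.7), (1.4, 1.9),
    (0.6, 0.7), (0.5, 0.8), (0.7, 0.6),
]

def k_means(points, initial_centroids, iterations=20):
    """Plain k-means; returns final centroids and a cluster index per point."""
    centroids = list(initial_centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (Euclidean).
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, assignments

# Fixed starting centroids keep the sketch deterministic.
centroids, assignments = k_means(
    drivers, initial_centroids=[drivers[0], drivers[3], drivers[6]]
)
print(assignments)
```

The algorithm only returns cluster indices. Deciding that one cluster means "aggressive" and another "slow" is the interpretation step the engineer supplies from domain expertise.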
Association methods focus on identifying trends in the given data set that represent major data patterns, the so-called significant association rules that connect data patterns with each other.
For example, given crash data for a highway section, finding an association between the age of the drivers involved in a crash, the driver's blood-alcohol level at the time of the crash, and the time of day can provide critical information for planning sobriety checkpoint locations and times to reduce crashes as well as fatalities.
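The strength of such an association is commonly summarized with support, confidence, and lift. Here is a minimal sketch of those three measures. The crash records, the item encoding (for example `"bac:high"`), and the rule being tested are made up for illustration.

```python
# Hypothetical crash records, each encoded as a set of categorical items
crashes = [
    {"age:16-25", "bac:high", "time:night"},
    {"age:16-25", "bac:high", "time:night"},
    {"age:26-40", "bac:high", "time:night"},
    {"age:26-40", "bac:none", "time:day"},
    {"age:41-65", "bac:none", "time:day"},
    {"age:16-25", "bac:none", "time:night"},
    {"age:41-65", "bac:high", "time:evening"},
    {"age:26-40", "bac:none", "time:day"},
]

def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    return sum(1 for record in records if itemset <= record) / len(records)

def confidence(antecedent, consequent, records):
    """Of the records matching the antecedent, the share that also match the consequent."""
    return support(antecedent | consequent, records) / support(antecedent, records)

def lift(antecedent, consequent, records):
    """Confidence relative to the baseline frequency of the consequent."""
    return confidence(antecedent, consequent, records) / support(consequent, records)

# Candidate rule: crashes at night -> driver had a high blood-alcohol level
antecedent, consequent = {"time:night"}, {"bac:high"}
print(support(antecedent | consequent, crashes))
print(confidence(antecedent, consequent, crashes))
print(lift(antecedent, consequent, crashes))
```

A lift above 1 suggests the two patterns occur together more often than they would if they were independent, which is the kind of signal that could inform checkpoint timing. Rule-mining algorithms such as Apriori automate the search over many candidate rules; this sketch only scores one.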