Getting Started - Method/Algorithm Selection

by Patricia Hoffman, PhD

Determining a good description of a mathematical problem goes a long way toward finding the solution. In other words, finding the right question to ask is a very important first step in mathematics. However, machine learning problems are data driven. The data reveals the question that should be asked. So, knowing the data well is the first step in solving a machine learning problem.

Simple first steps in looking at the data include finding missing values. What is the significance of a missing value? Are there inconsistencies in the data? How valid is the data? Is it timely and at the right resolution? Is it sequential, graphical, spatial, or a time series? If the data is in the form of a matrix, how sparse is that matrix? Another important question is the size of the data: a great deal of effort is currently going into scaling machine learning techniques to handle the explosion in data volume. The answers to these questions have a major impact on which methods are available.
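A few of these questions (size, missing values, sparsity) can be answered with a quick profile of the data. The following is a minimal sketch, not a complete audit. The function name profile_matrix and the convention that a missing value is stored as None are assumptions made for illustration.

```python
from typing import Optional, Sequence


def profile_matrix(rows: Sequence[Sequence[Optional[float]]]) -> dict:
    """Summarize size, missing values, and sparsity of a row-major matrix.

    Assumption: a missing value is encoded as None.
    """
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    total = n_rows * n_cols
    # Count cells with no value at all.
    missing = sum(1 for row in rows for v in row if v is None)
    # Count cells that hold an explicit zero (None is not counted as zero).
    zeros = sum(1 for row in rows for v in row if v == 0)
    return {
        "rows": n_rows,
        "cols": n_cols,
        "missing": missing,
        "missing_fraction": missing / total if total else 0.0,
        "sparsity": zeros / total if total else 0.0,
    }


# Hypothetical 3 x 3 data set with two missing cells and four zeros.
data = [
    [0.0, 1.5, None],
    [0.0, 0.0, 2.0],
    [3.0, None, 0.0],
]
summary = profile_matrix(data)
print(summary)
```

A high sparsity value suggests methods and storage formats designed for sparse matrices, while a high missing fraction raises the question of why the values are missing before any method is chosen.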

There are two main groups of machine learning techniques: unsupervised learning and supervised learning. Supervised methods learn from examples with known target values, so the first question is whether the quantity of known target values is large enough to support a supervised method. Linear regression is an example of a supervised learning algorithm, while k-means clustering is an unsupervised method.
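The difference shows up directly in what each method takes as input. In the sketch below, the regression fit needs a target y for every x, while the clustering routine sees only the points themselves. Both functions are simplified one-dimensional illustrations. The names fit_line and kmeans_1d, the starting centers, and the data are assumptions chosen for the example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Supervised: least-squares slope and intercept from (x, y) pairs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def kmeans_1d(points, centers, iterations=10):
    """Unsupervised: Lloyd's algorithm on 1-D points; no targets are used."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


# Supervised: targets follow y = 2x + 1 exactly.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Unsupervised: two obvious groups, near 1 and near 9.
centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], centers=[0.0, 5.0])
print(slope, intercept, centers)
```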

An important aspect of the data is the number of factors and their degree of correlation. Various types of regression have been developed to address situations in which the number of factors is very large: ridge regression, lasso, and least angle regression. Various kernel methods also excel at handling huge numbers of factors simultaneously. A support vector machine (SVM) is a good example of a kernel method. These methods are in direct contrast to the random forest, an ensemble technique that randomly selects which factors to consider as the algorithm progresses.
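To see why correlation among factors matters, consider the extreme case of two identical factors. Ordinary least squares has no unique solution, because the normal equations are singular, whereas the ridge penalty makes the system solvable and splits the weight between the two factors. The sketch below solves the two-factor ridge system (X^T X + lam * I) w = X^T y by Cramer's rule, with no intercept. The function name ridge_two_factors, the penalty value, and the data are assumptions for illustration only.

```python
def ridge_two_factors(x1, x2, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for two factors, no intercept."""
    a = sum(u * u for u in x1) + lam           # top-left entry
    b = sum(u * v for u, v in zip(x1, x2))     # off-diagonal entry
    d = sum(v * v for v in x2) + lam           # bottom-right entry
    r1 = sum(u * t for u, t in zip(x1, y))     # first entry of X^T y
    r2 = sum(v * t for v, t in zip(x2, y))     # second entry of X^T y
    det = a * d - b * b
    if det == 0:
        raise ValueError("singular system: the factors are perfectly correlated")
    # Cramer's rule for a symmetric 2 x 2 system.
    return (d * r1 - b * r2) / det, (a * r2 - b * r1) / det


x1 = [1.0, 2.0, 3.0]
x2 = [1.0, 2.0, 3.0]   # an exact copy of x1: perfectly correlated
y = [2.0, 4.0, 6.0]    # y = 2 * x1

# With a penalty, the weight is shared equally between the two factors.
w1, w2 = ridge_two_factors(x1, x2, y, lam=1.0)
print(w1, w2)

# With lam = 0 this reduces to ordinary least squares, which fails here.
try:
    ridge_two_factors(x1, x2, y, lam=0.0)
    ols_failed = False
except ValueError:
    ols_failed = True
print(ols_failed)
```

The two ridge weights sum to slightly less than 2, which reflects the shrinkage that the penalty introduces.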

Some machine learning techniques are used to describe the data that you already have, while other techniques are used to predict answers for data that is not currently under consideration. Data can be described using the mixture of Gaussians method, which is in contrast to an SVM. An SVM predicts the category for a new data point, but does not describe the data set.

Another aspect of the data which drives algorithm selection is whether the data is numeric or categorical. SVM works well with numeric data, whereas tree algorithms are a more natural choice for data with categorical features.

Is the outlier data point the interesting anomaly that you are looking for (as in fraud detection), or is it an insignificant bit of noise to be ignored? An outlier will definitely skew an ordinary least squares linear regression, but will not have much effect on the k-nearest neighbor algorithm, which bases each prediction only on nearby points.
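The contrast can be checked on a small example. The sketch below fits a least squares line and makes a k-nearest neighbor prediction on clean data that follows y = x, then repeats both after one outlier is appended. The function names, the choice of k = 3, the query point, and the data are assumptions for illustration.

```python
from statistics import mean


def ols_slope(xs, ys):
    """Slope of the ordinary least squares line through (x, y) pairs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def knn_predict(xs, ys, query, k=3):
    """Average the targets of the k training points closest to `query`."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query))[:k]
    return mean([y for _, y in nearest])


# Clean data: y = x for x = 1..10.
xs = [float(x) for x in range(1, 11)]
ys = list(xs)

# Same data with a single outlier appended at x = 11.
xs_out = xs + [11.0]
ys_out = ys + [100.0]

clean_slope = ols_slope(xs, ys)
dirty_slope = ols_slope(xs_out, ys_out)

clean_knn = knn_predict(xs, ys, query=3.0)
dirty_knn = knn_predict(xs_out, ys_out, query=3.0)

print(clean_slope, dirty_slope)
print(clean_knn, dirty_knn)
```

The single outlier changes the fitted slope for every prediction, while the k-nearest neighbor prediction at x = 3 is unchanged. A query close to x = 11 would still be affected, because the outlier would then be among its neighbors.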

Many research papers have been written comparing machine learning algorithms. No one algorithm has been found to be the best for all data sets, but each algorithm can be used to discover different aspects of the data. Cross-validation is one of the main techniques used to score the results of an algorithm. Techniques for comparing bias, variance, and complexity should be considered in model selection. Ensemble learning often improves accuracy by combining the strengths of collections of simpler base models. The type of ensemble method used also depends on what you want to find in the data.
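As an illustration of scoring by cross-validation, the sketch below splits the data into k folds, fits a model on all folds but one, and measures the mean squared error on the held-out fold. It compares two simple models, a constant predictor and a least squares line. All names, the contiguous unshuffled folds, and the noise-free data are assumptions for the example. In practice the data would usually be shuffled before splitting.

```python
from statistics import mean


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_val_mse(xs, ys, fit, k=5):
    """Average held-out mean squared error of `fit` over k folds."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mean([(predict(xs[i]) - ys[i]) ** 2 for i in test]))
    return mean(scores)


def fit_constant(xs, ys):
    """Baseline model: always predict the training mean."""
    y_bar = mean(ys)
    return lambda x: y_bar


def fit_line(xs, ys):
    """Least squares line through the training points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


# Hypothetical data that follows y = 2x + 1 exactly.
xs = [float(x) for x in range(10)]
ys = [2.0 * x + 1.0 for x in xs]

constant_mse = cross_val_mse(xs, ys, fit_constant)
line_mse = cross_val_mse(xs, ys, fit_line)
print(constant_mse, line_mse)
```

The model with the lower held-out error is preferred for this data set. The same loop can score any other model that follows the fit-then-predict convention used here.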