Statistical Learning


What is Statistical Learning?

We try to determine the association between input variables and output variables.

Input variables are also known as predictors, independent variables, features, or sometimes just variables.

Output variables are also known as response or dependent variables.

Generally, we suppose a relationship $$Y = f\left(X\right) + \epsilon$$ where $Y$ is the response, $X = \left(X_{1}, X_{2}, \cdots, X_{p}\right)$ represents the $p$ predictors, $f$ is some fixed but unknown function of those predictors, and $\epsilon$ is a random error term, independent of $X$, with mean zero.
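
As a concrete illustration, here is a minimal sketch that simulates data from a model of this form, assuming a sine curve for $f$ and normally distributed errors with standard deviation 0.3; both choices are arbitrary and only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The "true" but, in practice, unknown function; a sine curve is an arbitrary choice.
    return np.sin(2 * np.pi * x)

n = 100
X = rng.uniform(0, 1, size=n)                 # predictor values
eps = rng.normal(loc=0.0, scale=0.3, size=n)  # error term: mean zero, independent of X
Y = f(X) + eps                                # observed responses: Y = f(X) + epsilon
```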

In essence, statistical learning refers to a set of approaches for estimating $f$.

Why estimate $f$?

There are two main reasons to estimate $f$:

  • Prediction
  • Inference

Prediction

In many situations, a set of inputs X are readily available, but the output Y cannot be easily obtained. In this setting, since the error term averages to zero, we can predict Y using $$\hat{Y} = \hat{f}\left(X\right)$$ where $\hat{f}$ represents our estimate for $f$, and $\hat{Y}$ represents the resulting prediction for $Y$.

In this setting, $\hat{f}$ is generally treated as a black box, in the sense that we are not typically concerned with the exact form of $\hat{f}$, provided that it yields accurate predictions of $Y$.

The accuracy of $\hat{Y}$ as a prediction of $Y$ depends on two quantities:

  • reducible error
  • irreducible error

In general, $\hat{f}$ will not be a perfect estimate for $f$, and this inaccuracy will introduce some error. This error is reducible because we can potentially improve the accuracy of $\hat{f}$ by using the most appropriate statistical learning technique to estimate $f$. However, even if it were possible to form a perfect estimate for $f$, so that our estimated response took the form $\hat{Y} = f(X)$, our prediction would still have some error in it! This is because Y is also a function of $\epsilon$, which, by definition, cannot be predicted using X. Therefore, variability associated with $\epsilon$ also affects the accuracy of our predictions. This is known as the irreducible error, because no matter how well we estimate $f$, we cannot reduce the error introduced by $\epsilon$.

Consider a given estimate $\hat{f}$ and a set of predictors $X$, which yields the prediction $\hat{Y} = \hat{f}(X)$. Assume for a moment that both $\hat{f}$ and $X$ are fixed. Then, it is easy to show that $$E\left(Y - \hat{Y}\right)^{2} = E\left[f(X) + \epsilon - \hat{f}(X)\right]^{2} = \left[f(X) - \hat{f}(X)\right]^{2} + Var\left(\epsilon\right)$$ where $E\left(Y - \hat{Y}\right)^{2}$ represents the average, or expected value, of the squared difference between the predicted and actual value of Y, and Var($\epsilon$) represents the variance associated with the error term $\epsilon$ which is the irreducible error while $\left[f(X) - \hat{f}(X)\right]^{2}$ is the reducible error.
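
A quick Monte Carlo check of this decomposition, for a fixed predictor value, a fixed imperfect estimate $\hat{f}(x)$, and the simulated sine model from the earlier sketch (all of these are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

x = 0.3                       # a fixed predictor value
f_x = np.sin(2 * np.pi * x)   # the true f(x), known only because we simulate the data
fhat_x = 0.8                  # a fixed, imperfect estimate fhat(x)
sigma = 0.3                   # standard deviation of the error term epsilon

# Simulate many realizations of Y at this x and average the squared prediction error.
Y = f_x + rng.normal(0.0, sigma, size=1_000_000)
mse = np.mean((Y - fhat_x) ** 2)

reducible = (f_x - fhat_x) ** 2   # [f(x) - fhat(x)]^2
irreducible = sigma ** 2          # Var(epsilon)
print(mse, reducible + irreducible)   # the two values agree up to simulation noise
```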

Inference

We are often faced with situations in which we wish to understand how $Y$ changes as a function of $X_{1}, \cdots, X_{p}$. In this case, $\hat{f}$ cannot be treated as a black box.

Several questions may then arise:

  • Which predictors are associated with the response?
  • What is the relationship between the response and each predictor?
  • Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

Depending on whether our ultimate goal is prediction, inference, or a combination of the two, different methods for estimating $f$ may be appropriate. For example, linear models allow for relatively simple and interpretable inference, but may not yield as accurate predictions as some other approaches. In contrast, some of the highly non-linear approaches can potentially provide quite accurate predictions for $Y$, but this comes at the expense of a less interpretable model for which inference is more challenging.

How do we estimate $f$?

There are two main methods to estimate $f$:

  • Parametric methods (E.g. Linear model fit)
  • Non-Parametric methods (E.g. Thin-plate spline)

Parametric methods make an explicit assumption about the functional form, or shape, of $f$ and then use the training data to fit or train the model. Non-parametric methods do not make explicit assumptions about the functional form of $f$. This gives them greater flexibility over the shape of the model and can lead to a more accurate estimate of $f$.

The potential disadvantage of a parametric approach is that the model we choose will usually not match the true unknown form of $f$. If the chosen model is too far from the true $f$, then our estimate will be poor. We can try to address this problem by choosing flexible models that can fit many different possible functional forms for $f$. But in general, fitting a more flexible model requires estimating a greater number of parameters.
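
To make the distinction concrete, the sketch below fits a parametric model (ordinary least squares) and a non-parametric one to simulated data; KNN regression is used here as a simple non-parametric stand-in for the thin-plate spline, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)

# Simulated training data with a nonlinear true f (illustrative choice).
X = rng.uniform(0, 1, size=(200, 1))
Y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0.0, 0.3, size=200)

# Parametric: assume a linear functional form and estimate its coefficients.
linear = LinearRegression().fit(X, Y)

# Non-parametric: no assumed functional form; KNN regression stands in for a spline here.
knn = KNeighborsRegressor(n_neighbors=10).fit(X, Y)

x_grid = np.linspace(0, 1, 5).reshape(-1, 1)
print(linear.predict(x_grid))   # a straight line, no matter how nonlinear the truth is
print(knn.predict(x_grid))      # follows the curvature of the underlying f more closely
```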

Trade-off between Prediction Accuracy and Model Interpretability

Linear regression is a relatively inflexible approach, because it can only generate linear functions. Other methods, such as thin-plate splines, are considerably more flexible because they can generate a much wider range of possible shapes to estimate $f$.

Why would we ever choose to use a more restrictive method instead of a very flexible approach?

There are several reasons that we might prefer a more restrictive model. If we are mainly interested in inference, then restrictive models are much more interpretable. For instance, when inference is the goal, the linear model may be a good choice since it will be quite easy to understand the relationship between $Y$ and $X_{1}, X_{2}, \cdots, X_{p}$.

We have established that when inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods. In some settings, however, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest. For instance, if we seek to develop an algorithm to predict the price of a stock, our sole requirement for the algorithm is that it predict accurately; interpretability is not a concern. In this setting, we might expect that it will be best to use the most flexible model available.

In general, as the flexibility of a method increases, its interpretability decreases.

Supervised vs. Unsupervised Learning

Most statistical learning problems fall into one of two categories:

  • Supervised
  • Unsupervised

In supervised learning, there is an associated response measurement $y_{i}$ for each predictor measurement $x_{i};\ i = 1, \cdots, n$. In contrast, unsupervised learning is a somewhat challenging situation wherein for each observation $i = 1, \cdots, n$, we observe a vector of measurements $x_{i}$, but no associated response $y_{i}$.

Regression vs. Classification Problems

Variables can be characterized as either quantitative or qualitative. Quantitative variables take on numerical values. Examples include a person’s age, height, or income, the value of a house, and the price of a stock. In contrast, qualitative variables take on values in one of K different classes, or categories. Examples of qualitative variables include a person’s gender (male or female), the brand of product purchased (brand A, B, or C), whether a person defaults on a debt (yes or no), or a cancer diagnosis (Acute Myelogenous Leukemia, Acute Lymphoblastic Leukemia, or No Leukemia). We tend to refer to problems with a quantitative response as regression problems, while those involving a qualitative response are often referred to as classification problems.

However, whether the predictors are qualitative or quantitative is generally considered less important than the type of the response when choosing a statistical learning method.

Assessing Model Accuracy

In regression settings, the most common measure of accuracy is the mean squared error (MSE), given by $$MSE = \frac{1}{n}\sum_{i = 1}^{n}{(y_{i} - \hat{f}(x_{i}))^{2}}$$ where $\hat{f}(x_{i})$ is the prediction that $\hat{f}$ gives for the $i^{th}$ observation. Computed on the training data this is the training MSE; what we really care about is the test MSE, the average squared prediction error over previously unseen test observations $\left(x_{0}, y_{0}\right)$, $$Avg\left(y_{0} - \hat{f}\left(x_{0}\right)\right)^{2}$$ The MSE will be small if the predicted responses are very close to the true responses, and will be large if for some of the observations, the predicted and true responses differ substantially.
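
A minimal sketch of this computation with NumPy, using made-up observed and predicted values:

```python
import numpy as np

def mse(y, y_hat):
    # Average squared difference between observed and predicted responses.
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

# Hypothetical observed and predicted values, purely to show the computation.
y_obs = [3.0, 1.5, 2.2, 4.1]
y_hat = [2.8, 1.7, 2.0, 4.5]
print(mse(y_obs, y_hat))   # training MSE on these pairs; applying the same function to
                           # held-out (x0, y0) pairs gives the test MSE
```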

In classification settings, the most common measure of accuracy is the error rate, given by $$E = \frac{1}{n}\sum_{i = 1}^{n}I\left(y_{i} \neq \hat{y}_{i}\right)$$ where $I\left(y_{i} \neq \hat{y}_{i}\right)$ is an indicator variable that equals 1 if $y_{i} \neq \hat{y}_{i}$ and 0 otherwise. Computed on the training data this is the training error rate; the test error rate is the corresponding average over test observations $\left(x_{0}, y_{0}\right)$, $$Avg\left(I\left(y_{0} \neq \hat{y}_{0}\right)\right)$$
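
The error rate can be computed in the same way; the class labels below are made up for illustration:

```python
import numpy as np

def error_rate(y, y_hat):
    # Fraction of observations whose predicted class differs from the true class.
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean(y != y_hat)

# Hypothetical class labels, for illustration only.
y_obs = ["yes", "no", "no", "yes", "no"]
y_hat = ["yes", "no", "yes", "yes", "no"]
print(error_rate(y_obs, y_hat))   # 0.2: one of five observations is misclassified
```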

Bias-Variance Trade-off

Expected test MSE, for a given value $x_{0}$, can always be decomposed into the sum of three fundamental quantities:

  • variance of $\hat{f}(x_{0})$
  • squared bias of $\hat{f}(x_{0})$
  • variance of error terms $\epsilon$
$$E\left(y_{0} - \hat{f}\left(x_{0}\right)\right)^{2} = Var\left(\hat{f}\left(x_{0}\right)\right) + \left[Bias\left(\hat{f}\left(x_{0}\right)\right)\right]^{2} + Var\left(\epsilon\right)$$

Here, $E\left(y_{0} - \hat{f}\left(x_{0}\right)\right)^{2}$ defines the expected test MSE.

To minimize the expected test error (the reducible part of it), we need a method that achieves both low variance and low bias. Since both the variance and the squared bias are non-negative, the expected test MSE can never lie below $Var\left(\epsilon\right)$, the irreducible error.

Variance refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set. Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes that there is a linear relationship between $Y$ and $X_{1}, X_{2}, \cdots, X_{p}$. It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of $f$. Generally, more flexible methods have higher variance and lower bias, while less flexible methods have lower variance and higher bias.
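
The decomposition can be checked empirically by repeatedly drawing training sets, refitting the model, and measuring the spread and systematic offset of $\hat{f}(x_{0})$ at a single test point. The sketch below does this for a linear fit using scikit-learn; the true $f$, noise level, and sample size are all chosen purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

def true_f(x):
    return np.sin(2 * np.pi * x)   # illustrative nonlinear "truth"

x0, sigma, n, reps = 0.9, 0.3, 50, 2000
fhat_x0 = np.empty(reps)

for r in range(reps):
    # Draw a fresh training set and refit the (inflexible) linear model.
    X = rng.uniform(0, 1, size=(n, 1))
    Y = true_f(X[:, 0]) + rng.normal(0.0, sigma, size=n)
    fhat_x0[r] = LinearRegression().fit(X, Y).predict([[x0]])[0]

variance = np.var(fhat_x0)                        # Var(fhat(x0)) across training sets
bias_sq = (np.mean(fhat_x0) - true_f(x0)) ** 2    # [Bias(fhat(x0))]^2
print(variance, bias_sq, variance + bias_sq + sigma ** 2)   # last value approximates
                                                            # the expected test MSE at x0
```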

Bayes Classifier

The test error rate can be minimized by a very simple classifier that assigns each observation to the most likely class, given its predictor values. In other words, we simply assign a test observation with predictor vector $x_{0}$ to the class $j$ for which $$Pr\left(Y = j | X = x_{0}\right)$$ is largest. This is the conditional probability that $Y = j$, given the observed predictor vector $x_{0}$. This is called the Bayes classifier. The set of points at which the most likely classes are exactly tied (in a two-class problem, the points where $Pr\left(Y = 1 | X = x_{0}\right) = 0.5$) is known as the Bayes decision boundary, and the Bayes classifier's prediction is determined by which side of this boundary an observation falls on. The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate.

The Bayes error rate is given by $$1 - E\left(\max_{j} Pr\left(Y = j|X\right)\right)$$

The Bayes error rate is analogous to the irreducible error.

In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of Y given X, and so computing the Bayes classifier is impossible.
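
When we simulate the data ourselves, however, the conditional distribution is known, so the Bayes classifier can be written down exactly. The sketch below assumes a simple two-class model with equal priors and unit-variance Gaussian class densities; this generative model is purely an illustrative choice.

```python
import numpy as np

# Assumed generative model, known only because we chose it: two equally likely classes,
# with X | Y=0 ~ N(-1, 1) and X | Y=1 ~ N(+1, 1).
priors = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])

def bayes_classifier(x0):
    # Normal densities of x0 under each class, then Bayes' rule for Pr(Y = j | X = x0).
    likelihood = np.exp(-0.5 * (x0 - means) ** 2) / np.sqrt(2 * np.pi)
    posterior = priors * likelihood
    posterior /= posterior.sum()
    return int(np.argmax(posterior)), posterior

print(bayes_classifier(0.3))   # x0 = 0.3 lies right of the decision boundary at 0,
                               # so the Bayes classifier assigns class 1
```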

K-Nearest Neighbors Classifier

Given a positive integer $K$ and a test observation $x_{0}$, the KNN classifier first identifies the $K$ points in the training data that are closest to $x_{0}$, represented by $N_{0}$. It then estimates the conditional probability for class $j$ as the fraction of points in $N_{0}$ whose response values equal $j$: $$Pr\left(Y = j|X = x_{0}\right) = \frac{1}{K}\sum_{i \in N_{0}}I\left(y_{i} = j\right)$$ Finally, KNN applies Bayes rule and classifies the test observation $x_{0}$ to the class with the largest probability.
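
A direct, minimal NumPy implementation of this estimate (scikit-learn's KNeighborsClassifier provides a full-featured version); the toy training set is made up for illustration:

```python
import numpy as np

def knn_classify(X_train, y_train, x0, K):
    # Identify N0, the K training points closest to x0 (Euclidean distance).
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    dists = np.linalg.norm(X_train - np.asarray(x0, dtype=float), axis=1)
    N0 = np.argsort(dists)[:K]
    # Estimate Pr(Y = j | X = x0) as the fraction of the K neighbours in class j.
    classes, counts = np.unique(y_train[N0], return_counts=True)
    probs = counts / K
    # Classify x0 to the class with the largest estimated probability.
    return classes[np.argmax(probs)], dict(zip(classes, probs.tolist()))

# Tiny illustrative training set with two classes.
X_train = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y_train = ["blue", "blue", "orange", "orange", "orange"]
print(knn_classify(X_train, y_train, [0.8, 0.8], K=3))   # predicts "orange"
```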

In both the regression and classification settings, choosing the correct level of flexibility is critical to the success of any statistical learning method.