Statistical Learning


Linear Regression

Consider the Advertising dataset. It displays sales of a particular product as a function of advertising budgets for TV, radio and newspaper media. Here are a few important questions that we might seek to address:

  • Is there a relationship between advertising budget and sales?
  • How strong is the relationship between advertising budget and sales?
  • Which media contribute to sales?
  • How accurately can we estimate the effect of each medium on sales?
  • How accurately can we predict future sales?
  • Is the relationship linear?
  • Is there synergy/interaction among the advertising media?

Simple Linear Regression

Simple linear regression is a straightforward approach for predicting a quantitative response $Y$ on the basis of a single predictor variable $X$. It assumes that there is approximately a linear relationship between $X$ and $Y$. Mathematically, we can write this linear relationship as $$Y \approx \beta_{0} + \beta_{1}X$$ where $\beta_{0}$ is the intercept and $\beta_{1}$ is the slope term in the linear model. They are known as the model coefficients or parameters. For example, regressing sales onto TV can be written as $$sales \approx \beta_{0} + \beta_{1} \times TV$$ Once we have used our training data to produce estimates $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ for the model coefficients, we predict future sales on the basis of a particular value of TV advertising by computing $$\hat{y} = \hat{\beta_{0}} + \hat{\beta_{1}}x$$ where $\hat{y}$ is the prediction of $Y$ on the basis of $X = x$. We use $\hat{}$ to denote the estimated value for an unknown parameter or coefficient, or to denote the predicted value of the response.

Estimating the coefficients

In practice, $\beta_{0}$ and $\beta_{1}$ are unknown. Our goal is to obtain coefficient estimates $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ such that the linear model $y_{i} \approx \hat{\beta_{0}} + \hat{\beta_{1}}x_{i}$ for $i = 1, \cdots, n$ fits the available data well. In other words, we want to find an intercept $\hat{\beta_{0}}$ and slope $\hat{\beta_{1}}$ such that the resulting line is as close as possible to the $n$ data points.

Let $\hat{y_{i}} = \hat{\beta_{0}} + \hat{\beta_{1}}x_{i}$ be the prediction for $Y$ based on the $i^{th}$ value of $X$. Then $e_{i} = y_{i} - \hat{y_{i}}$ represents the $i^{th}$ residual. Thus the residual sum of squares (RSS) is $$RSS = {e_{1}}^{2} + {e_{2}}^{2} + \cdots + {e_{n}}^{2}$$ This is equivalent to $$RSS = \left(y_{1} - \hat{\beta_{0}} - \hat{\beta_{1}}x_{1}\right)^{2} + \left(y_{2} - \hat{\beta_{0}} - \hat{\beta_{1}}x_{2}\right)^{2} + \cdots + \left(y_{n} - \hat{\beta_{0}} - \hat{\beta_{1}}x_{n}\right)^{2}$$ The least squares approach chooses $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ to minimize the RSS. The minimizing values are given by $$\hat{\beta_{1}} = \frac{\sum_{i = 1}^{n}\left(x_{i} - \bar{x}\right)\left(y_{i} - \bar{y}\right)} {\sum_{i = 1}^{n}\left(x_{i} - \bar{x}\right)^{2}}$$

$$\hat{\beta_{0}} = \bar{y} - \hat{\beta_{1}}\bar{x}$$

where $$\bar{y} = \frac{1}{n}\sum_{i = 1}^{n}y_{i}$$ and $$\bar{x} = \frac{1}{n}\sum_{i = 1}^{n}x_{i}$$ are the sample means. This defines the least squares coefficient estimates for simple linear regression.

For example, let's consider the Advertising dataset (we will be using the R language):

In [1]:
adv <- read.csv('../../datasets/Advertising.csv')
adv
A data.frame: 200 × 5
     X    TV radio newspaper sales
     1 230.1  37.8      69.2  22.1
     2  44.5  39.3      45.1  10.4
     3  17.2  45.9      69.3   9.3
     4 151.5  41.3      58.5  18.5
     5 180.8  10.8      58.4  12.9
     6   8.7  48.9      75.0   7.2
     ⋮     ⋮     ⋮         ⋮     ⋮
   200 232.1   8.6       8.7  13.4

(output truncated; 200 rows in total)

We can drop the first column X as it contains only the row indexes and is of no particular use to us.

In [2]:
adv <- adv[-1]
head(adv)
A data.frame: 6 × 4
    TV radio newspaper sales
 230.1  37.8      69.2  22.1
  44.5  39.3      45.1  10.4
  17.2  45.9      69.3   9.3
 151.5  41.3      58.5  18.5
 180.8  10.8      58.4  12.9
   8.7  48.9      75.0   7.2

Now, we want to regress sales over TV considering the linear relation as $$sales \approx \hat{\beta_{0}} + \hat{\beta_{1}} \times TV$$ $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ can be calculated as follows:

In [3]:
y <- adv$sales
x <- adv$TV
y.bar <- mean(y)
x.bar <- mean(x)

b1 <- sum((x - x.bar) * (y - y.bar)) / sum((x - x.bar)**2)
b0 <- y.bar - b1 * x.bar

paste('b0(intercept) == ', b0)
paste('b1(slope) == ', b1)
'b0(intercept) == 7.03259354912769'
'b1(slope) == 0.0475366404330197'

Here, sales are in thousands of units and TV refers to advertising budget in thousands of dollars.

Here, $\hat{\beta_{0}} = 7.03$ and $\hat{\beta_{1}} = 0.0475$. In other words, according to this approximation, an additional \$1,000 spent on *TV* advertising is associated with selling approximately 47.5 additional units of the product, and a *TV* advertising budget of \$0 corresponds to a predicted sale of approximately 7,030 units.

In [4]:
plot(x, y, pch = 19, col = 'blue', main = 'sales vs. TV', xlab = 'TV', ylab = 'sales')
abline(b0, b1, lwd = 2, col = 'red')
grid()

The residual sum of squares (RSS) is

In [5]:
rss <- sum((y - b0 - b1 * x)**2)
print(rss)
[1] 2102.531

Assessing the Accuracy of the Coefficient Estimates

We assume that the true relationship between $X$ and $Y$ takes the form $Y = f\left(X\right) + \epsilon$ for some unknown function $f$, where $\epsilon$ is a mean-zero random error term. If $f$ is to be approximated by a linear function, then we can write this relationship as: $$Y = \beta_{0} + \beta_{1}X + \epsilon$$

Here $\beta_{0}$ is the intercept term — that is, the expected value of $Y$ when $X = 0$, and $\beta_{1}$ is the slope — the average increase in $Y$ associated with a one-unit increase in $X$. The error term is a catch-all for what we miss with this simple model: the true relationship is probably not linear, there may be other variables that cause variation in $Y$, and there may be measurement error. We typically assume that the error term is independent of $X$.

The model defines the population regression line, which is the best linear approximation to the true relationship between $X$ and $Y$. The least squares regression coefficient estimates characterize the least squares line.

The true relationship is generally not known for real data, but the least squares line can always be computed using the coefficient estimates. In real applications, we have access to a set of observations from which we can compute the least squares line; however, the population regression line is unobserved.

Different data sets generated from the same true model result in slightly different least squares lines, but the unobserved population regression line does not change.

At first glance, the difference between the population regression line and the least squares line may seem subtle and confusing. We only have one data set, and so what does it mean that two different lines describe the relationship between the predictor and the response? Fundamentally, the concept of these two lines is a natural extension of the standard statistical approach of using information from a sample to estimate characteristics of a large population. For example, suppose that we are interested in knowing the population mean $\mu$ of some random variable $Y$. Unfortunately, $\mu$ is unknown, but we do have access to $n$ observations from $Y$, which we can write as $y_{1}, y_{2}, \cdots, y_{n}$, and which we can use to estimate $\mu$. A reasonable estimate is $\hat{\mu} = \bar{y}$, where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_{i}$ is the sample mean. The sample mean and the population mean are different, but in general the sample mean will provide a good estimate of the population mean. In the same way, the unknown coefficients $\beta_{0}$ and $\beta_{1}$ in linear regression define the population regression line. We seek to estimate these unknown coefficients using $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$. These coefficient estimates define the least squares line.

The analogy between linear regression and estimation of the mean of a random variable is an apt one based on the concept of bias. If we use the sample mean $\hat{\mu}$ to estimate $\mu$, this estimate is unbiased, in the sense that on average, we expect $\hat{\mu}$ to equal $\mu$. What exactly does this mean? It means that on the basis of one particular set of observations $y_{1}, \cdots, y_{n}$, $\hat{\mu}$ might overestimate $\mu$, and on the basis of another set of observations, $\hat{\mu}$ might underestimate $\mu$. But if we could average a huge number of estimates of $\mu$ obtained from a huge number of sets of observations, then this average would exactly equal $\mu$. Hence, an unbiased estimator does not systematically over- or under-estimate the true parameter. The property of unbiasedness holds for the least squares coefficient estimates. If we estimate $\beta_{0}$ and $\beta_{1}$ on the basis of a particular data set, then our estimates won’t be exactly equal to $\beta_{0}$ and $\beta_{1}$. But if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would be spot on.

We continue the analogy with the estimation of the population mean $\mu$ of a random variable $Y$. A natural question is as follows: how accurate is the sample mean $\hat{\mu}$ as an estimate of $\mu$? We have established that the average of $\hat{\mu}$'s over many data sets will be very close to $\mu$, but that a single estimate $\hat{\mu}$ may be a substantial underestimate or overestimate of $\mu$. How far off will that single estimate $\hat{\mu}$ be? In general, we answer this question by computing the standard error of $\hat{\mu}$, written $SE\left(\hat{\mu}\right)$:

$$Var\left(\hat{\mu}\right) = SE\left(\hat{\mu}\right)^{2} = \frac{\sigma^{2}}{n}$$

where $\sigma$ is the standard deviation of each of the realizations $y_{i}$ of $Y$. Roughly speaking, the standard error tells us the average amount that the estimate $\hat{\mu}$ differs from the actual value of $\mu$. Notice also that the standard error shrinks as $n$ grows: the more observations we have, the closer a single estimate $\hat{\mu}$ tends to be to $\mu$.
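
As a quick illustration of this formula for the sample mean of sales, with $\sigma$ estimated by the sample standard deviation (a sketch, reusing the y vector defined above):

# Estimated standard error of the sample mean of sales: sd(y) / sqrt(n)
print(sd(y) / sqrt(length(y)))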

Similarly, $$SE\left(\hat{\beta_{0}}\right)^{2} = \sigma^{2}\left[\frac{1}{n} + \frac{\bar{x}^{2}}{\sum_{i=1}^{n}\left(x_{i} - \bar{x}\right)^{2}}\right]$$

$$SE\left(\hat{\beta_{1}}\right)^{2} = \frac{\sigma^{2}}{\sum_{i=1}^{n}\left(x_{i} - \bar{x}\right)^{2}}$$

where $\sigma^{2} = Var\left(\epsilon\right)$. Notice that $SE\left(\hat{\beta_{1}}\right)$ is smaller when the $x_{i}$ are more spread out; intuitively, we have more *leverage* to estimate a slope in that case.

The estimate of $\sigma$ is known as the residual standard error (RSE): $$RSE = \sqrt{\frac{RSS}{n - 2}}$$

Strictly speaking, when $\sigma^{2}$ is estimated from the data, we should write $\hat{SE}\left(\hat{\beta_{1}}\right)$ to indicate that an estimate has been made, but for simplicity of notation we will drop this extra “hat”.

In [6]:
rse <- sqrt(rss / (dim(adv)[1] - 2))
print(rse)
[1] 3.258656

Standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. The range is defined in terms of lower and upper limits computed from the sample of data. For linear regression, the 95% confidence interval for $\beta_{1}$ approximately takes the form $$\hat{\beta_{1}} \pm 2 \times SE\left(\hat{\beta_{1}}\right)$$ and similarly for $\beta_{0}$.

In the case of the Advertising data, the 95% confidence intervals for $\beta_{0}$ and $\beta_{1}$ are $\left[6.130, 7.935\right]$ and $\left[0.042, 0.053\right]$ respectively. Therefore, we can conclude that in the absence of any advertising, sales will, on average, fall somewhere between 6,130 and 7,935 units. Furthermore, for each \$1,000 increase in television advertising, there will be an average increase in sales of between 42 and 53 units. As a sanity check, each of our computed estimates $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ lies approximately at the midpoint of its confidence interval.
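
These standard errors and intervals can also be computed by hand from the formulas above, reusing the x, y, x.bar, b0, b1 and rse objects defined earlier (a sketch; the intervals below use the exact t quantile, which is what confint() reports later, while the factor of 2 above is an approximation to it):

n <- length(y)
se.b1 <- rse / sqrt(sum((x - x.bar)**2))                     # SE(beta1-hat)
se.b0 <- rse * sqrt(1 / n + x.bar**2 / sum((x - x.bar)**2))  # SE(beta0-hat)

t975 <- qt(0.975, df = n - 2)                                # 97.5% quantile of t with n - 2 df
c(b0 - t975 * se.b0, b0 + t975 * se.b0)                      # 95% CI for beta0
c(b1 - t975 * se.b1, b1 + t975 * se.b1)                      # 95% CI for beta1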

At this point, we can test the hypothesis of whether there is a relationship between $X$ and $Y$:

  • The null hypothesis $\left(H_{0}\right)$: there is no relationship between $X$ and $Y$.
  • The alternative hypothesis $\left(H_{a}\right)$: there is some relationship between $X$ and $Y$.

Mathematically, $$H_{0} : \beta_{1} = 0$$ vs. $$H_{a} : \beta_{1} \neq 0$$

To test the null hypothesis, we compute the t-statistic $$t = \frac{\hat{\beta_{1}} - 0}{SE\left(\hat{\beta_{1}}\right)}$$ which measures the number of standard errors that $\hat{\beta_{1}}$ is away from 0, together with the probability of observing any value equal to $|t|$ or larger in absolute value, assuming $\beta_{1} = 0$; this probability is known as the p-value. A small p-value indicates that it is unlikely we would observe such a substantial association between the predictor and the response purely by chance, and so we can reject the null hypothesis.
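
Continuing the sketch above (reusing b1, se.b1 and n), the t-statistic and two-sided p-value for $H_{0}: \beta_{1} = 0$ can be computed directly; the values should agree with the table below:

t.stat <- b1 / se.b1                        # t-statistic for H0: beta1 = 0
p.val <- 2 * pt(-abs(t.stat), df = n - 2)   # two-sided p-value
print(t.stat)
print(p.val)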

In our dataset,

            Estimate   Std. Error   t-statistic   p-value
(Intercept) 7.032594   0.457843     15.36         <2e-16
adv$TV      0.047537   0.002691     17.67         <2e-16

Notice that the coefficient estimates $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ are very large relative to their standard errors, so the t-statistics are also large; the probabilities of seeing such values if $H_{0}$ were true are virtually zero. Hence we can conclude that $\beta_{0} \neq 0$ and $\beta_{1} \neq 0$.

A small p-value for the intercept indicates that we can reject the null hypothesis that $\beta_{0} = 0$, and a small p-value for TV indicates that we can reject the null hypothesis that $\beta_{1} = 0$. Rejecting the latter null hypothesis allows us to conclude that there is a relationship between TV and sales. Rejecting the former allows us to conclude that in the absence of TV expenditure, sales are non-zero.

Assessing the Model Accuracy

Once we have rejected the null hypothesis in favor of the alternative hypothesis, it is natural to want to quantify the extent to which the model fits the data.

The quality of a linear regression fit is typically assessed using two quantities:

  • residual standard error (RSE)
  • $R^{2}$ statistic

These values for the linear regression of the number of units sold on the TV advertising budget are:

Quantity                  Value
Residual standard error   3.259
Multiple $R^{2}$          0.6119
F-statistic               312.1

Residual Standard Error

With each observation, there is an associated error term $\epsilon$. Due to the presence of these error terms, even if we knew the true regression line (i.e. even if $\beta_{0}$ and $\beta_{1}$ were known), we would not be able to perfectly predict $Y$ from $X$.

The RSE is an estimate of standard deviation of $\epsilon$. The RSE is considered a measure of the lack of fit of the model to the data. Roughly speaking, it is the average amount that the response will deviate from the true regression line.

$$RSE = \sqrt{\frac{RSS}{n - 2}} = \sqrt{\frac{\sum_{i=1}^{n}\left(y_{i} - \hat{y_{i}}\right)^{2}}{n - 2}}$$

In our case, RSE is estimated to be 3.259. This can be interpreted as actual sales in each market deviate from the true regression line by approximately 3259 units, on average. Another way to think about this is that even if the model were correct and the true values of the unknown coefficients $\beta_{0}$ and $\beta_{1}$ were known exactly, any prediction of sales on the basis of TV advertising would still be off by about 3259 units on average.

The mean value of sales is:

In [7]:
print(y.bar)
[1] 14.0225

So, the percentage error is approximately

In [8]:
rse / y.bar * 100
23.2387688974895

$R^{2}$ Statistic

The RSE provides an absolute measure of lack of fit of the model to the data. But since it is measured in the units of $Y$, it is not always clear what constitutes a good RSE. The $R^{2}$ statistic provides an alternative measure of fit. It takes the form of a proportion — the proportion of variance explained — and so it always takes on a value between 0 and 1, and is independent of the scale of $Y$. $R^{2}$ is defined as

$$R^{2} = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$$

where $TSS = \sum{\left(y_{i} - \bar{y}\right)^{2}}$ which is total sum of squares. TSS measures the total variance in the response $Y$, and can be thought of as the amount of variability inherent in the response before the regression is performed. In contrast, RSS measures the amount of variability that is left unexplained after performing the regression. Hence, TSS − RSS measures the amount of variability in the response that is explained (or removed) by performing the regression, and $R^{2}$ measures the proportion of variability in $Y$ that can be explained using $X$. An $R^{2}$ statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression. A number near 0 indicates that the regression did not explain much of the variability in the response; this might occur because the linear model is wrong, or the inherent error $\sigma^{2}$ is high, or both. The $R^{2}$ was 0.6119, and so just about two-thirds of the variability in sales is explained by a linear regression on TV. The $R^{2}$ statistic has an interpretational advantage over the RSE since unlike the RSE, it always lies between 0 and 1. The $R^{2}$ statistic is a measure of the linear relationship between $X$ and $Y$.

In [9]:
tss <- sum((y - y.bar)**2)
print(tss)
[1] 5417.149
In [10]:
r.sqr = 1 - (rss / tss)
print(r.sqr)
[1] 0.6118751

We can also use the correlation between $X$ and $Y$; in the simple linear regression setting, $R^{2}$ equals the squared correlation ${cor\left(X, Y\right)}^{2}$:

$$cor(X, Y) = \frac{\sum_{i=1}^{n}\left(x_{i} - \bar{x}\right)\left(y_{i} - \bar{y}\right)} {\sqrt{\sum_{i=1}^{n}\left(x_{i} - \bar{x}\right)^{2}}\sqrt{\sum_{i=1}^{n}\left(y_{i} - \bar{y}\right)^{2}}}$$
In [11]:
print(sum((x - x.bar) * (y - y.bar)) / sqrt(sum((x - x.bar)**2)) / sqrt(sum((y - y.bar)**2)))
[1] 0.7822244

Or simply,

In [12]:
print(cor(x, y))
[1] 0.7822244

All of this can be done using the lm function:

In [13]:
fit <- lm(adv$sales ~ adv$TV, data = adv)
print(fit)
Call:
lm(formula = adv$sales ~ adv$TV, data = adv)

Coefficients:
(Intercept)       adv$TV  
    7.03259      0.04754  

In [14]:
summary(fit)
Call:
lm(formula = adv$sales ~ adv$TV, data = adv)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3860 -1.9545 -0.1913  2.0671  7.2124 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
adv$TV      0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared:  0.6119,	Adjusted R-squared:  0.6099 
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
In [15]:
coef(fit)
(Intercept)            adv$TV
7.03259354912769       0.0475366404330198
In [16]:
confint(fit)
A matrix: 2 × 2 of type dbl
                 2.5 %     97.5 %
(Intercept) 6.12971927 7.93546783
adv$TV      0.04223072 0.05284256
In [17]:
TV <- lm(adv$sales ~ adv$TV)
summary(TV)
plot(adv$TV, adv$sales, xlab = 'TV', ylab = 'sales', main = 'sales vs TV', pch = 19, col = 'blue')
abline(TV, col = 'red')
grid()
Call:
lm(formula = adv$sales ~ adv$TV)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3860 -1.9545 -0.1913  2.0671  7.2124 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
adv$TV      0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared:  0.6119,	Adjusted R-squared:  0.6099 
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
In [18]:
newspaper <- lm(adv$sales ~ adv$newspaper)
summary(newspaper)
plot(adv$newspaper, adv$sales, xlab = 'newspaper', ylab = 'sales', main = 'sales vs newspaper', pch = 19, col = 'blue')
abline(newspaper, col = 'red')
grid()
Call:
lm(formula = adv$sales ~ adv$newspaper)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.2272  -3.3873  -0.8392   3.5059  12.7751 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   12.35141    0.62142   19.88  < 2e-16 ***
adv$newspaper  0.05469    0.01658    3.30  0.00115 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.092 on 198 degrees of freedom
Multiple R-squared:  0.05212,	Adjusted R-squared:  0.04733 
F-statistic: 10.89 on 1 and 198 DF,  p-value: 0.001148
In [19]:
radio <- lm(adv$sales ~ adv$radio)
summary(radio)
plot(adv$radio, adv$sales, xlab = 'radio', ylab = 'sales', main = 'sales vs radio', pch = 19, col = 'blue')
abline(radio, col = 'red')
grid()
Call:
lm(formula = adv$sales ~ adv$radio)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.7305  -2.1324   0.7707   2.7775   8.1810 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.31164    0.56290  16.542   <2e-16 ***
adv$radio    0.20250    0.02041   9.921   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.275 on 198 degrees of freedom
Multiple R-squared:  0.332,	Adjusted R-squared:  0.3287 
F-statistic: 98.42 on 1 and 198 DF,  p-value: < 2.2e-16

For equivalent Python3 code, go to linear_regression.py

Multiple Linear Regression

Multiple linear regression with $p$ predictors takes the form

$$Y = \beta_{0} + \sum_{j=1}^{p}{\beta_{j}X_{j}} + \epsilon$$

In simple linear regression, each fitted slope ignores the other predictors, whereas in multiple linear regression the coefficient $\beta_{j}$ represents the average effect on $Y$ of a one-unit increase in $X_{j}$, holding all other predictors fixed.
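
As in the simple case, the least squares estimates minimize the RSS; in matrix form they are $\hat{\beta} = \left(X^{T}X\right)^{-1}X^{T}y$, where $X$ is the design matrix with a leading column of ones for the intercept. A minimal sketch of this computation (the values should match coef() of the fitted model below):

# Least squares via the normal equations: beta-hat = (X'X)^{-1} X'y
X <- cbind(1, adv$TV, adv$radio, adv$newspaper)   # design matrix with intercept column
beta.hat <- solve(t(X) %*% X, t(X) %*% adv$sales)
rownames(beta.hat) <- c('(Intercept)', 'TV', 'radio', 'newspaper')
beta.hat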

The linear regression considering all the variables together can be done as follows:

In [20]:
mul.fit <- lm(adv$sales ~ ., data = adv)
coef(mul.fit)
(Intercept)           TV                    radio                 newspaper
2.93888936945941      0.0457646454553976    0.188530016918204     -0.00103749304247628
In [21]:
summary(mul.fit)
Call:
lm(formula = adv$sales ~ ., data = adv)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8277 -0.8908  0.2418  1.1893  2.8292 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.938889   0.311908   9.422   <2e-16 ***
TV           0.045765   0.001395  32.809   <2e-16 ***
radio        0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared:  0.8972,	Adjusted R-squared:  0.8956 
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
In [22]:
confint(mul.fit)
A matrix: 4 × 2 of type dbl
                  2.5 %     97.5 %
(Intercept)  2.32376228 3.55401646
TV           0.04301371 0.04851558
radio        0.17154745 0.20551259
newspaper   -0.01261595 0.01054097

Notice that while the newspaper coefficient estimate in the simple regression of sales on newspaper is significantly non-zero, the coefficient estimate for newspaper in the multiple regression model is close to zero, and the corresponding p-value is no longer significant, with a value around 0.86. This illustrates that the simple and multiple regression coefficients can be quite different.

Does it make sense for the multiple regression to suggest no relationship between sales and newspaper while the simple linear regression implies the opposite? In fact it does. Consider the correlation matrix for the three predictor variables and response variable. Notice that the correlation between radio and newspaper is 0.35. This reveals a tendency to spend more on newspaper advertising in markets where more is spent on radio advertising. Now suppose that the multiple regression is correct and newspaper advertising has no direct impact on sales, but radio advertising does increase sales. Then in markets where we spend more on radio our sales will tend to be higher, and as our correlation matrix shows, we also tend to spend more on newspaper advertising in those same markets. Hence, in a simple linear regression which only examines sales versus newspaper, we will observe that higher values of newspaper tend to be associated with higher values of sales, even though newspaper advertising does not actually affect sales. So newspaper sales are a surrogate for radio advertising; newspaper gets “credit” for the effect of radio on sales.

In [23]:
cor(adv)
A matrix: 4 × 4 of type dbl
                  TV      radio  newspaper     sales
TV        1.00000000 0.05480866 0.05664787 0.7822244
radio     0.05480866 1.00000000 0.35410375 0.5762226
newspaper 0.05664787 0.35410375 1.00000000 0.2282990
sales     0.78222442 0.57622257 0.22829903 1.0000000

Some Important Questions

We may be interested in answering a few important questions:

  1. Is at least one of the predictors $X_{1}, X_{2}, \cdots, X_{p}$ useful in predicting the response?
  2. Do all the predictors help to explain $Y$, or is only a subset of predictors useful?
  3. How well does the model fit the data?
  4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

One: Is there a relation between response and predictors?

In simple linear regression, in order to determine whether there is a relationship between the response and the predictor, we simply check whether $\beta_{1} = 0$. In multiple linear regression, we need to ask whether all $p$ of the regression coefficients are zero. Thus we test the null hypothesis

$$H_{0}: \beta_{1} = \beta_{2} = \cdots = \beta_{p} = 0$$

versus the alternative hypothesis

$$H_{a}: \text{at least one of the } \beta_{j} \text{ is non-zero}$$

This hypothesis test is performed by computing the F-statistic,

$$F = \frac{\left(TSS - RSS\right) / p}{RSS / \left(n - p - 1\right)}$$

For the multiple regression of sales on TV, radio, and newspaper, these quantities are:

Quantity                  Value
Residual standard error   1.686
Multiple $R^{2}$          0.8972
F-statistic               570.3

where $TSS = \sum{\left(y_{i} - \bar{y}\right)^2}$ and $RSS = \sum{\left(y_{i} - \hat{y_{i}}\right)^{2}}$. If the linear model assumptions are true, one can show that $$E\{ RSS / \left(n - p - 1\right) \} = \sigma^{2}$$ and that, provided $H_{0}$ is true, $$E\{\left(TSS - RSS\right) / p\} = \sigma^{2}$$

Hence, when there is no relationship between the response and the predictors, one would expect the F-statistic to take on a value close to 1. On the other hand, if $H_{a}$ is true, then $E\{\left(TSS - RSS\right) / p\} > \sigma^{2}$, so we expect $F$ to be larger than 1.

Thus, in addition to the individual p-values, it is important to check the overall F-statistic: when the number of predictors $p$ is large, some individual p-values will be small purely by chance, whereas the F-statistic adjusts for the number of predictors. In our case, the F-statistic is far larger than 1, providing compelling evidence against $H_{0}$.

In [24]:
n <- 200
p <- 3
In [25]:
# Use the RSS of the multiple regression model (the rss computed earlier was for the TV-only model)
rss.mul <- sum(residuals(mul.fit)**2)
F <- ((tss - rss.mul) / p) / (rss.mul / (n - p - 1))
print(round(F, 1))
[1] 570.3

Two: Deciding on important variables

For multiple regression, our task is to determine which predictors are associated with the response. This is known as variable selection.

There are three classical approaches for this task:

  • Forward Selection: We begin with the null model — a model that contains an intercept but no predictors. We then fit p simple linear regressions and add to the null model the variable that results in the lowest RSS. We then add to that model the variable that results in the lowest RSS for the new two-variable model. This approach is continued until some stopping rule is satisfied.
  • Backward Selection: We start with all variables in the model, and remove the variable with the largest p-value—that is, the variable that is the least statistically significant. The new (p − 1) variable model is fit, and the variable with the largest p-value is removed. This procedure continues until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some threshold.
  • Mixed Selection: This is a combination of forward and backward selection. We start with no variables in the model, and as with forward selection, we add the variable that provides the best fit. We continue to add variables one-by-one. Of course, as we noted with the Advertising example, the p-values for variables can become larger as new predictors are added to the model. Hence, if at any point the p-value for one of the variables in the model rises above a certain threshold, then we remove that variable from the model. We continue to perform these forward and backward steps until all variables in the model have a sufficiently low p-value, and all variables outside the model would have a large p-value if added to the model.

Backward selection cannot be used if $p > n$, while forward selection can always be used. Forward selection is a greedy approach, and might include variables early that later become redundant. Mixed selection can remedy this.
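
As an illustration, here is a minimal sketch of forward selection by RSS on the Advertising data, adding at each step the predictor that most reduces the RSS (in practice a stopping rule such as a p-value threshold, AIC/BIC, or cross-validation error would be applied):

predictors <- c('TV', 'radio', 'newspaper')
selected <- character(0)
for (k in seq_along(predictors)) {
  remaining <- setdiff(predictors, selected)
  # RSS of each candidate model formed by adding one remaining predictor
  rss.candidates <- sapply(remaining, function(v) {
    form <- reformulate(c(selected, v), response = 'sales')
    sum(residuals(lm(form, data = adv))**2)
  })
  best <- names(which.min(rss.candidates))
  selected <- c(selected, best)
  cat('Step', k, ': add', best, '; RSS =', round(min(rss.candidates), 1), '\n')
}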

Three: Model Fit

Two of the most common numerical measures of model fit are the RSE and $R^{2}$, the fraction of variance explained. These quantities are computed and interpreted in the same fashion as for simple linear regression. In simple linear regression, $R^{2}$ is the square of the correlation between the response and the predictor. In multiple linear regression, it turns out that it equals $Cor\left(Y, \hat{Y}\right)^{2}$, the square of the correlation between the response and the fitted values; in fact, one property of the fitted linear model is that it maximizes this correlation among all possible linear models.
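
As a quick check of this identity on the Advertising data (reusing mul.fit from above), the squared correlation between the response and the fitted values should reproduce the multiple $R^{2}$ of roughly 0.8972:

# R^2 as the squared correlation between the response and the fitted values
print(cor(adv$sales, fitted(mul.fit))**2)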

An $R^{2}$ value close to 1 indicates that the model explains a large portion of the variance in the response variable. As an example, we saw that for the Advertising data, the model that uses all three advertising media to predict sales has an $R^{2}$ of 0.8972. On the other hand, the model that uses only TV and radio to predict sales has an $R^{2}$ value of 0.89719. In other words, there is a small increase in $R^{2}$ if we include newspaper advertising in the model that already contains TV and radio advertising, even though we saw earlier that the p-value for newspaper advertising is not significant. The fact that adding newspaper advertising to the model containing only TV and radio advertising leads to just a tiny increase in $R^{2}$ provides additional evidence that newspaper can be dropped from the model. Essentially, newspaper provides no real improvement in the model fit to the training samples, and its inclusion will likely lead to poor results on independent test samples due to overfitting.

In contrast, the model containing only TV as a predictor had an $R^{2}$ of 0.61. Adding radio to the model leads to a substantial improvement in $R^{2}$. This implies that a model that uses TV and radio expenditures to predict sales is substantially better than one that uses only TV advertising. We could further quantify this improvement by looking at the p-value for the radio coefficient in a model that contains only TV and radio as predictors.

The model that contains only TV and radio as predictors has an RSE of 1.681, and the model that also contains newspaper as a predictor has an RSE of 1.686. In contrast, the model that contains only TV has an RSE of 3.26. This corroborates our previous conclusion that a model that uses TV and radio expenditures to predict sales is much more accurate (on the training data) than one that only uses TV spending. Furthermore, given that TV and radio expenditures are used as predictors, there is no point in also using newspaper spending as a predictor in the model. It is often useful to plot the data as well, since graphical summaries can reveal problems with a model, such as underfitting or overfitting, that numerical summaries may miss.
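
These RSE values can be reproduced directly; a short sketch comparing the three nested models (summary()$sigma is the residual standard error of a fitted lm object):

print(summary(lm(sales ~ TV, data = adv))$sigma)                      # TV only: about 3.26
print(summary(lm(sales ~ TV + radio, data = adv))$sigma)              # TV and radio: about 1.681
print(summary(lm(sales ~ TV + radio + newspaper, data = adv))$sigma)  # all three media: about 1.686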

Four: Predictions

Once we have fit the model to predict the response $Y$ from predictors $X_{1}, X_{2}, \cdots, X_{p}$, there are three sorts of uncertainty associated with the predictions:

  1. The coefficient estimates $\hat{\beta_{0}}, \hat{\beta_{1}}, \cdots, \hat{\beta_{p}}$ are only estimates for $\beta_{0}, \beta_{1}, \cdots, \beta_{p}$. That is, the least squares plane is only an estimate of the true regression plane. This is referred to as reducible error. We can use a confidence interval to determine how close $\hat{Y}$ will be to $f\left(X\right)$.
  2. There is another potential source of reducible error, model bias: the linear plane is only an approximation to the true surface $f\left(X\right)$.
  3. Even if we knew $f\left(X\right)$ exactly, the response could not be perfectly predicted because of the irreducible error $\epsilon$. We can use a prediction interval to determine how much $\hat{Y}$ will vary from $Y$ (see the sketch below).
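
A minimal sketch of the two interval types using predict(); the model formula and the hypothetical budgets (TV = 100, radio = 20, newspaper = 30, all in thousands of dollars) are illustrative choices, not values from the text:

fit.all <- lm(sales ~ TV + radio + newspaper, data = adv)
new.budget <- data.frame(TV = 100, radio = 20, newspaper = 30)   # hypothetical market
predict(fit.all, newdata = new.budget, interval = 'confidence')  # interval for the average response f(X)
predict(fit.all, newdata = new.budget, interval = 'prediction')  # wider interval for an individual response Y
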
In [26]:
plot(adv, pch = 19, col = 'orange')

Other considerations

Qualitative Predictors

If a predictor is qualitative (categorical) rather than quantitative, we can include it in the regression model through one or more dummy variables. There are many different ways of coding qualitative variables besides the dummy variable approach. All of these approaches lead to equivalent model fits, but the coefficients are different and have different interpretations, and are designed to measure particular contrasts.
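
The Advertising data contains no qualitative predictor, so the following is a purely hypothetical sketch: we attach a made-up region factor and let lm() construct the dummy variables automatically through its contrasts.

set.seed(1)
adv2 <- adv
adv2$region <- factor(sample(c('East', 'North', 'West'), nrow(adv2), replace = TRUE))  # hypothetical factor

# lm() expands the factor into dummy variables; one level ('East') becomes the baseline
fit.qual <- lm(sales ~ TV + region, data = adv2)
coef(fit.qual)                        # regionNorth and regionWest are contrasts against the baseline
head(model.matrix(~ region, adv2))    # the underlying dummy-variable coding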

Extensions to the Linear Model

The standard linear regression model $Y = \beta_{0} + \sum_{j = 1}^{p}\beta_{j}X_{j} + \epsilon$ provides interpretable results and works quite well on many real-world problems. However, it makes several highly restrictive assumptions that are often violated in practice. Two of the most important assumptions are that the relationship between the predictors and the response is additive and linear.

The additive assumption means that the effect of changes in a predictor $X_{j}$ on the response $Y$ is independent of the values of the other predictors. The linear assumption states that the change in the response $Y$ due to a one-unit change in $X_{j}$ is constant, regardless of the value of $X_{j}$. These assumptions are generally relaxed by including interaction terms or non-linear transformations of the predictors, as sketched below.
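
A minimal sketch of both extensions on the Advertising data: TV * radio adds the TV:radio interaction term (the synergy effect revisited at the end of this section), and I(TV^2) adds a simple non-linear term (the quadratic form is an illustrative choice):

# Interaction (non-additive) model: sales ~ TV + radio + TV:radio
fit.inter <- lm(sales ~ TV * radio, data = adv)
summary(fit.inter)$r.squared        # R^2 rises substantially relative to the additive TV + radio model

# Non-linear (polynomial) term for TV
fit.poly <- lm(sales ~ TV + I(TV^2) + radio, data = adv)
summary(fit.poly)$coefficients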

Potential Problems

When we fit a linear regression model to a particular data set, many problems may occur. Most common among these are the following:

  1. Non-linearity of the response-predictor relationships.
  2. Correlation of error terms.
  3. Non-constant variance of error terms.
  4. Outliers.
  5. High-leverage points.
  6. Collinearity.
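
R's built-in residual diagnostics are a common first check for several of these problems (non-linearity, non-constant variance, outliers, and high-leverage points). A sketch using the multiple regression fit from above:

par(mfrow = c(2, 2))
plot(mul.fit)          # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))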

We can now answer a few questions:

  • Is there a relationship between advertising sales and budget?
    • This question can be answered by fitting a multiple regression model of sales onto TV, radio, and newspaper, and testing the hypothesis $H_{0}$: $\beta_{TV} = \beta_{radio} = \beta_{newspaper} = 0$. The F-statistic can be used to determine whether or not we should reject this null hypothesis. In this case the p-value corresponding to the F-statistic is very low, indicating clear evidence of a relationship between advertising and sales.
  • How strong is the relationship?
    • First, the RSE estimates the standard deviation of the response from the population regression line. For the Advertising data, the RSE is 1,681 units while the mean value for the response is 14,022 units, indicating a percentage error of roughly 12%. Second, the $R^{2}$ statistic records the percentage of variability in the response that is explained by the predictors. The predictors explain almost 90% of the variance in sales.
  • Which media contribute to sales?
    • This can be answered by examining the p-values associated with each predictor’s t-statistic. In the multiple linear regression, the p-values for TV and radio are low, but the p-value for newspaper is not. This suggests that only TV and radio are related to sales.
  • How large is the effect of each medium on sales?
    • The standard error of $\hat{\beta_{j}}$ can be used to construct confidence intervals for $\beta_{j}$. For the Advertising data, the 95% confidence intervals are as follows: (0.043, 0.049) for TV, (0.172, 0.206) for radio, and (−0.013, 0.011) for newspaper. The confidence intervals for TV and radio are narrow and far from zero, providing evidence that these media are related to sales. But the interval for newspaper includes zero, indicating that the variable is not statistically significant given the values of TV and radio. Collinearity can result in very wide standard errors. The variance inflation factor $$VIF\left(\hat{\beta_{j}}\right) = \frac{1}{1 - R_{X_{j}|X_{-j}}^{2}}$$ is 1.005, 1.145, and 1.145 for TV, radio, and newspaper, suggesting no evidence of collinearity (a sketch computing these VIFs by hand appears after this list). In order to assess the association of each medium with sales individually, we can perform three separate simple linear regressions. There is evidence of an extremely strong association between TV and sales and between radio and sales. There is evidence of a mild association between newspaper and sales when the values of TV and radio are ignored.
  • How accurately can we predict future sales?
    • The accuracy associated with this estimate depends on whether we wish to predict an individual response, $Y = f\left(X\right) + \epsilon$, or the average response, $f\left(X\right)$. If the former, we use a prediction interval, and if the latter, we use a confidence interval. Prediction intervals will always be wider than confidence intervals because they account for the uncertainty associated with $\epsilon$, the irreducible error.
  • Is the relationship linear?
    • Residual plots can be used in order to identify non-linearity. If the relationships are linear, then the residual plots should display no pattern.
  • Is there synergy among the advertising media?
    • The standard linear regression model assumes an additive relationship between the predictors and the response. An additive model is easy to interpret because the effect of each predictor on the response is unrelated to the values of the other predictors. However, the additive assumption may be unrealistic for certain data sets. We need to include an interaction term in the regression model in order to accommodate non-additive relationships. A small p-value associated with the interaction term indicates the presence of such relationships. Including an interaction term in the model results in a substantial increase in $R^{2}$, from around 90% to almost 97%.
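
As referenced above, the variance inflation factors can be computed directly from their definition by regressing each predictor on the remaining predictors (a sketch; add-on packages such as car also provide a ready-made vif() function):

# VIF_j = 1 / (1 - R^2 of X_j regressed on the other predictors)
vif.manual <- sapply(c('TV', 'radio', 'newspaper'), function(v) {
  others <- setdiff(c('TV', 'radio', 'newspaper'), v)
  r2 <- summary(lm(reformulate(others, response = v), data = adv))$r.squared
  1 / (1 - r2)
})
print(vif.manual)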