Simple linear regression
In the case of one feature (\(d=1\)), linear regression is written as
\[
\widehat y = ax + b. \tag{5}
\]
The parameters of this model are \(\boldsymbol \theta = (a, b)\), where \(b\) is the intercept (or bias) and \(a\) is the slope.
The feature matrix here has only one column; denote it \(\boldsymbol x\), and let \(\boldsymbol y\) be the vector of corresponding labels. Also denote
\(\overline {\boldsymbol x} = \frac 1n \sum\limits_{i=1}^n x_i\) — sample mean of predictors;
\(\overline {\boldsymbol y} = \frac 1n \sum\limits_{i=1}^n y_i\) — sample mean of targets.
MSE fit
Use MSE to fit the parameters \(\boldsymbol \theta = (a, b)\):
\[
\mathcal L(a, b) = \frac 1n \sum\limits_{i=1}^n (ax_i + b - y_i)^2 \to \min\limits_{a, b}. \tag{6}
\]
What about some calculus?
We have
\[
\frac{\partial \mathcal L}{\partial a} = \frac 2n \sum\limits_{i=1}^n (ax_i + b - y_i) x_i, \qquad
\frac{\partial \mathcal L}{\partial b} = \frac 2n \sum\limits_{i=1}^n (ax_i + b - y_i).
\]
Setting both derivatives to zero, from the last equality it follows that
\[
b = \overline {\boldsymbol y} - a \overline {\boldsymbol x}.
\]
Substituting this into the first equality gives
\[
a \sum\limits_{i=1}^n (x_i - \overline {\boldsymbol x}) x_i = \sum\limits_{i=1}^n (y_i - \overline {\boldsymbol y}) x_i.
\]
Since \(\sum\limits_{i=1}^n (x_i - \overline {\boldsymbol x}) x_i = \sum\limits_{i=1}^n (x_i - \overline {\boldsymbol x})^2\) and \(\sum\limits_{i=1}^n (y_i - \overline {\boldsymbol y}) x_i = \sum\limits_{i=1}^n (y_i - \overline {\boldsymbol y})(x_i - \overline {\boldsymbol x})\), solving for \(a\) yields the formula below.
The optimal parameters are
\[
\widehat a = \frac{\sum\limits_{i=1}^n (x_i - \overline {\boldsymbol x})(y_i - \overline {\boldsymbol y})}{\sum\limits_{i=1}^n (x_i - \overline {\boldsymbol x})^2}, \qquad
\widehat b = \overline {\boldsymbol y} - \widehat a \overline {\boldsymbol x}. \tag{7}
\]
Note that the slope is equal to the ratio of the sample covariance of \(\boldsymbol x\) and \(\boldsymbol y\) to the sample variance of \(\boldsymbol x\).
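As a quick illustration (an added sketch with simulated data; the names x_sim, y_sim, a_hat, b_hat are ours), formula (7) recovers the coefficients of a noisy linear relationship:
import numpy as np

# simulate data from y = 3x - 1 + noise and recover the coefficients with (7)
rng = np.random.default_rng(0)
x_sim = rng.uniform(0, 10, size=200)
y_sim = 3 * x_sim - 1 + rng.normal(scale=2, size=200)

a_hat = np.sum((x_sim - x_sim.mean()) * (y_sim - y_sim.mean())) / np.sum((x_sim - x_sim.mean()) ** 2)
b_hat = y_sim.mean() - a_hat * x_sim.mean()
a_hat, b_hat   # should be close to 3 and -1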
Question
Does (7) work for all possible values of \(\boldsymbol x\) and \(\boldsymbol y\)?
Dummy model
The model (5) is simple, but we can go even simpler! Let \(a=0\) and predict labels by a constant (or dummy) model \(\widehat y = b\). In fact, constant predictions don't require any features at all; the labels alone do the job.
Question
Which value of \(b\) minimizes the MSE (6)?
Linear regression can be used with different loss functions. For example, we can choose mean absolute error (MAE) instead of MSE:
\[
\frac 1n \sum\limits_{i=1}^n \vert ax_i + b - y_i\vert \to \min\limits_{a, b}.
\]
This time it is unlikely that we can find an analytical solution. But maybe it can be done for the dummy model?
Question
For which value of \(b\) is the MAE
\[
\frac 1n \sum\limits_{i=1}^n \vert y_i - b\vert
\]
minimal?
Answer
\(\widehat b = \mathrm{med}(\boldsymbol y)\) (see this discussion for details)
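Here is a small numerical illustration of this answer (an added sketch with simulated data): scanning a grid of constant predictions, the MAE is minimized near the sample median.
import numpy as np

rng = np.random.default_rng(0)
y_sample = rng.exponential(size=101)          # a skewed sample, so mean and median differ
bs = np.linspace(y_sample.min(), y_sample.max(), 1001)
maes = np.abs(y_sample[None, :] - bs[:, None]).mean(axis=1)
print("best constant:", bs[np.argmin(maes)])
print("median:       ", np.median(y_sample))  # almost the same value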
RSS and \(R^2\)-score
Putting the optimal weights \(\widehat a\) and \(\widehat b\) into the loss function (6), we obtain the residual square error (RSE). Multiplying by \(n\), we get the residual sum of squares
\[
\mathrm{RSS} = \sum\limits_{i=1}^n (y_i - \widehat a x_i - \widehat b)^2.
\]
Also, the total sum of squares is defined as
\[
\mathrm{TSS} = \sum\limits_{i=1}^n (y_i - \overline {\boldsymbol y})^2.
\]
A popular metric called the coefficient of determination (or \(R^2\)-score) is defined as
\[
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}.
\]
The coefficient of determination shows the proportion of variance explained. The \(R^2\)-score does not exceed \(1\) (the greater, the better).
The ratio \(\mathrm{RSS}/\mathrm{TSS}\) measures the amount of variability that is left unexplained after performing the regression, so the \(R^2\)-score shows how much better the model works in comparison with the dummy prediction \(\widehat y = \overline {\boldsymbol y}\).
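To make these definitions concrete, here is a toy check (an added sketch with made-up numbers; it assumes scikit-learn is installed) that \(1 - \mathrm{RSS}/\mathrm{TSS}\) coincides with sklearn.metrics.r2_score:
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
y_pred = np.array([2.5, 1.5, 3.5, 1.0, 4.5])
rss = np.sum((y_true - y_pred) ** 2)
tss = np.sum((y_true - y_true.mean()) ** 2)
print(1 - rss / tss, r2_score(y_true, y_pred))  # the two values coincide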
Example: Boston dataset
import pandas as pd
boston = pd.read_csv("../ISLP_datsets/Boston.csv").drop("Unnamed: 0", axis=1)
boston.head()
Let the predictor be lstat and the target be medv. Let's calculate the regression coefficients using (7):
import numpy as np

x = boston['lstat']
y = boston['medv']
# slope and intercept according to (7)
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()
a, b
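As a sanity check (an added sketch, reusing x, y, a, and b from the cell above), np.polyfit should return the same slope and intercept:
print(np.polyfit(x, y, deg=1))   # [slope, intercept] from numpy's least squares
print(a, b)                      # the closed-form values from (7)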
Now plot the data and the regression line:
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'svg'
plt.scatter(x, y, s=10, c='b', alpha=0.7)
xs = np.linspace(x.min(), x.max(), num=10)
plt.plot(xs, a*xs + b, c='r', lw=2, label=r"$y = \widehat a x + \widehat b$")
plt.plot([xs.min(), xs.max()], [y.mean(), y.mean()], c='orange', lw=2, label=r"$y = \overline y$")
plt.xlabel("lstat")
plt.ylabel("medv")
plt.title("Simple linear regression vs dummy model")
plt.legend()
plt.grid(ls=":");
Calculate MSE:
mse_lin_reg = np.mean((y - a*x - b)**2)
mse_dummy = np.mean((y - y.mean())**2)
print("Linear regression MSE:", mse_lin_reg)
print("Dummy model MSE:", mse_dummy)
Coefficient of determination:
print("R2-score:", 1 - mse_lin_reg / np.mean((y - y.mean())**2))
print("Dummy R2-score:", 1 - mse_dummy / np.mean((y - y.mean())**2))
Of course, the linear regression line can be found automatically:
import seaborn as sb

sb.regplot(data=boston, x="lstat", y="medv",
           scatter_kws={"color": "blue", "s": 10}, line_kws={"color": "red"},
           ).set_title('Boston linear regression');
Linear regression from sklearn:
from sklearn.linear_model import LinearRegression
LR = LinearRegression()
x_reshaped = x.values.reshape(-1, 1)
LR.fit(x_reshaped, y)
print("intercept:", LR.intercept_)
print("slope:", LR.coef_[0])
print("r-score:", LR.score(x_reshaped, y))
print("MSE:", np.mean((LR.predict(x_reshaped) - y) ** 2))
Compare this with the dummy model:
dummy_mse = np.mean((y - y.mean())**2)
print(dummy_mse)
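The same baseline can also be obtained with scikit-learn's DummyRegressor (an added sketch, reusing x_reshaped and y from the cells above):
from sklearn.dummy import DummyRegressor

dummy = DummyRegressor(strategy="mean")   # always predicts y.mean()
dummy.fit(x_reshaped, y)
print("Dummy MSE:", np.mean((dummy.predict(x_reshaped) - y) ** 2))
print("Dummy R2-score:", dummy.score(x_reshaped, y))  # equals 0 on the training data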
Exercises
Prove that \(\frac 1n \sum\limits_{i=1}^n (x_i - \overline {\boldsymbol x})^2 = \overline {\boldsymbol x^2} - (\overline {\boldsymbol x})^2\) where \(\overline {\boldsymbol x^2} = \frac 1n\sum\limits_{i=1}^n x_i^2\).
TODO
Add some quizzes
Add more datasets (maybe even simulated)
Think about cases where the performance of the linear model is poor