Multinomial logistic regression#

Let \(\mathcal Y\) be a finite set of classes, e.g. \(\mathcal Y=\{1, 2, \ldots, K\}\), and let the training dataset be

\[ \mathcal D = \{(\boldsymbol x_i, y_i)\}_{i=1}^n, \quad \boldsymbol x_i\in\mathbb R^d, \quad y_i\in \mathcal Y. \]

Multinomial logistic regression predicts a vector of probabilities

\[ \boldsymbol{\widehat y} = (p_1,\ldots, p_K), \quad p_k > 0, \quad \sum\limits_{k=1}^K p_k = 1. \]

How do we predict these probabilities? We can use a linear model whose output is a vector of \(K\) scores, one per class:

\[ z_k = \boldsymbol x^\top \boldsymbol w_k, \quad \boldsymbol w_k \in \mathbb R^d, \quad k = 1, \ldots, K. \]

Now convert the vector \(\boldsymbol z \in \mathbb R^K\) to the vector of probabilities \(\boldsymbol{\widehat y}\) via softmax:

\[ \boldsymbol{\widehat y} = \mathrm{Softmax}(\boldsymbol z) = \bigg(\frac{e^{z_1}}{\sum_{j=1}^K e^{z_j}}, \ldots, \frac{e^{z_K}}{\sum_{j=1}^K e^{z_j}}\bigg) \]
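For instance, here is a minimal NumPy sketch of softmax (subtracting the row maximum is a standard numerical-stability trick; it does not change the result, since softmax is invariant to adding a constant to all logits):

import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the output is unchanged
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

softmax(np.array([1.0, 2.0, 3.0]))  # array([0.09003057, 0.24472847, 0.66524096])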

If we need to pick a single class, we can choose the most probable one:

\[ \arg\max\limits_{1\leqslant k \leqslant K} p_k = \arg\max\limits_{1\leqslant k \leqslant K} \Big\{\frac{\exp(\boldsymbol x^\top \boldsymbol w_k)}{\sum_{j=1}^K \exp(\boldsymbol x^\top \boldsymbol w_j)}\Big\} \]

Since the denominator does not depend on \(k\) and \(\exp\) is increasing, this is the same as \(\arg\max\limits_{1\leqslant k \leqslant K} \boldsymbol x^\top \boldsymbol w_k\): we simply pick the class with the largest linear score.

The parameters \(\boldsymbol w_k\) naturally form a matrix

\[ \boldsymbol W = [\boldsymbol w_1 \ldots \boldsymbol w_K] \]

Q. What is the shape of this matrix? How many parameters does multinomial logistic regression have?

Loss function#

The optimal parameters \(\boldsymbol W\) are solutions of the following optimization problem (note that \(\mathcal L\) here is the log-likelihood, which we maximize):

(26)#\[\mathcal L (\boldsymbol W) = \sum\limits_{i=1}^n \bigg(\boldsymbol x_i^\top\boldsymbol w_{y_i} -\log\Big(\sum\limits_{k=1}^K \exp(\boldsymbol x_i^\top\boldsymbol w_{k})\Big)\bigg) \to \max\limits_{\boldsymbol w_{1}, \ldots, \boldsymbol w_{K}}\]

If the targets \(y_i\) are one-hot encoded, then they form a matrix

\[\begin{split} \boldsymbol Y = \begin{pmatrix} \boldsymbol y_1^\top \\ \vdots \\ \boldsymbol y_n^\top \end{pmatrix}, \quad y_{ik} \geqslant 0, \quad \sum\limits_{k=1}^K y_{ik} = 1. \end{split}\]
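A quick way to build such a one-hot matrix in NumPy (a small sketch, assuming the labels are integers \(0, \ldots, K-1\)):

y = np.array([0, 2, 1, 2])  # integer class labels
K = 3
Y_onehot = np.eye(K)[y]     # shape (n, K): one row per sample, a single 1 per row
Y_onehot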

Accordingly, the loss function (26) can be written as

(27)#\[\mathcal L (\boldsymbol W) = \sum\limits_{i=1}^n \sum\limits_{k=1}^K y_{ik}\log\bigg(\frac{\exp(\boldsymbol x_i^\top\boldsymbol w_k)}{\sum\limits_{j=1}^K\exp(\boldsymbol x_i^\top\boldsymbol w_{j})}\bigg) = \sum\limits_{i=1}^n \sum\limits_{k=1}^K y_{ik} \log \widehat y_{ik},\]

and this is exactly the cross-entropy loss (3) taken with the opposite sign.
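As a sanity check, here is a small NumPy sketch of (27) that reuses the `softmax` helper from above; the toy data is made up purely for illustration:

def log_likelihood(W, X, Y):
    # Y is the one-hot (n, K) target matrix, X is (n, d), W is (d, K)
    Y_hat = softmax(X @ W)  # predicted probabilities, row-wise
    return np.sum(Y * np.log(Y_hat))

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 3))
W_toy = rng.normal(size=(3, 4))
Y_toy = np.eye(4)[rng.integers(0, 4, size=5)]
log_likelihood(W_toy, X_toy, Y_toy)  # non-positive; training maximizes it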

Question

Denote \(\boldsymbol{\widehat Y} = (\widehat y_{ik}) = \mathrm{Softmax}(\boldsymbol {XW})\) (softmax is applied to each row). Rewrite the loss function (27) in matrix form.

Regularized version:

\[ \sum\limits_{i=1}^n \bigg(\boldsymbol x_i^\top\boldsymbol w_{y_i} -\log\Big(\sum\limits_{k=1}^K \exp(\boldsymbol x_i^\top\boldsymbol w_{k})\Big)\bigg) - C\Vert \boldsymbol W \Vert_F^2 \to \max\limits_{\boldsymbol W}, \quad C > 0. \]

Q. Why is there a minus sign before the regularization term?

Example: MNIST#

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

%config InlineBackend.figure_format = 'svg'

# Download MNIST: 70,000 grayscale 28×28 digit images, flattened to 784 features
X, Y = fetch_openml('mnist_784', return_X_y=True, parser='auto')

# Scale pixel intensities to [0, 1] and convert string labels to integers
X = X.astype(float).values / 255
Y = Y.astype(int).values

Visualize data#

Visualize some random samples:
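The helper `plot_digits` is not defined on this page, so here is a reconstruction sketch of what it might look like, assuming it draws a row of random digits with their true labels and, optionally, predicted labels in parentheses:

def plot_digits(X, y_true, y_pred=None, n=10, random_state=None):
    # pick n random samples and show each 784-vector as a 28×28 image
    rng = np.random.default_rng(random_state)
    idx = rng.choice(len(X), size=n, replace=False)
    plt.figure(figsize=(2 * n, 3))
    for i, j in enumerate(idx):
        plt.subplot(1, n, i + 1)
        plt.imshow(X[j].reshape(28, 28), cmap='gray')
        plt.axis('off')
        title = str(y_true[j])
        if y_pred is not None:
            title += f" ({y_pred[j]})"  # predicted label, if provided
        plt.title(title, size=20)
    plt.show()

plot_digits(X, Y, random_state=12)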


Splitting into train and test#

# hold out 10,000 of the 70,000 samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=10000)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

Check that the classes are roughly balanced:

np.unique(y_test, return_counts=True)

Fit and evaluate#

Fit the logistic regression:

%%time
# With the default lbfgs solver, multiclass targets are fit with the
# multinomial (softmax) model described above
LR = LogisticRegression(max_iter=100)
LR.fit(X_train, y_train)

Make predictions:

y_hat = LR.predict(X_test)
y_hat

We can also predict probabilities:

y_proba = LR.predict_proba(X_test)
y_proba[:3]

Calculate metrics:

print("Accuracy:", accuracy_score(y_test, y_hat))

Visualize performance#

plt.figure(figsize=(10, 8))
plt.title("Logistic regression on MNIST")
sns.heatmap(confusion_matrix(y_test, y_hat), annot=True);

Plot some samples with predictions and ground truths:

plot_digits(X_test, y_test, y_hat)