Softmax Regression (SR)

Softmax regression (SR) (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes. Same as the blog about LR, this blog will detail the modeling approach, loss function, forward and backward propagation of SR. In the end, I will use python with numpy to implement SR and give the use on data sets iris and mnist. You can find all the code here.

Softmax function

The softmax function is defined by the following formula. Where $ K $ is the number of classes and $ K \geqslant 2 $.
$$ softmax(x)=\frac{1}{\sum_{j=1}^{K}e^{x^{(j)}}}\begin{bmatrix} e^{x^{(1)}}\ e^{x^{(2)}}\ \vdots \ e^{x^{(K)}} \end{bmatrix} $$
Unfortunately, the original softmax definition has a numerical overflow problem in actual use. For a large positive $ x^{(i)} $ value, the value of $ e^{x^{(i)}} $ may be quite large and cause a numerical overflow. Similarly, for a smaller negative $ x^{(i)} $ value, the value of $ e^{x^{(i)}} $ may be very close to zero, resulting in a numerical underflow. Therefore, in practice we use the following equivalent formula.
$$ softmax(x-D)=\frac{1}{\sum_{j=1}^{K}e^{x^{(j)-D}}}\begin{bmatrix} e^{x^{(1)}-D}\ e^{x^{(2)}-D}\ \vdots \ e^{x^{(K)}-D} \end{bmatrix}=\frac{1}{e^{-D}\sum_{j=1}^{K}e^{x^{(j)}}}\begin{bmatrix} e^{-D}e^{x^{(1)}}\ e^{-D}e^{x^{(2)}}\ \vdots \ e^{-D}e^{x^{(K)}} \end{bmatrix}=softmax(x) $$
Where $ D=max(x) $.
We need to use the derivative of softmax in backpropagation, so let’s calculate it first. For writing convenience, let $ \hat{y}=softmax(x) $, then $ \hat{y^{(i)}}=\frac{e^{x^{(i)}}}{\sum_{j=1}^{K}e^{x^{(j)}}} $.
$$ if \ j=i, \ \ \ \ \frac{\partial \hat{y^{(i)}}}{\partial x^{(j)}}=\frac{e^{x^{(i)}}\sum_{j=1}^{K}e^{x^{(j)}}-e^{x^{(i)}}e^{x^{(j)}}}{(\sum_{j=1}^{K}e^{x^{(j)}})^2}=\hat{y^{(i)}}-(\hat{y^{(i)}})^2 $$
$$ if \ j \neq i, \ \ \ \ \frac{\partial \hat{y^{(i)}}}{\partial x^{(j)}}=\frac{-e^{x^{(i)}e^{x^{(j)}}}}{(\sum_{j=1}{k}e^{x^{(j)}})^2}=-\hat{y^{(i)}}\hat{y^{(j)}} $$

Modeling approach

In LR we assumed that the labels were binary: $ y\in { {0, 1} } $. SR allows us to handle $ K $ classification problem. In SR we often use one hot vector to represent the label. For example, in the MNIST digit recognition task, we will use $ y=[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]^T $ to represent the label of the image with the number 3. In SR we use softmax function to model probability. Suppose we have a training set $ { {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)} } $ of $ m $ labeled examples, where the input features are $ x_i \in \Re^{[n, 1]} $. We can use the $i$-th output of the softmax function as the probability that the current sample belongs to the $i$-th class. The formal expression is as follows.
$$ P(y^{(i)}=1|x;w^{(i)}, b^{(i)})=\frac{e^{(w^{(i)})^Tx+b^{(i)}}}{\sum_{j=1}^{K}e^{(w^{(j)})^Tx+b^{(j)}}} $$

Where
$$ w^{(i)} \in \Re^{[n, 1]} $$
$$ w=\begin{bmatrix} |& |& |& |\ w^{(1)}& w^{(2)}& \cdots & w^{(K)} \ |& |& |& | \end{bmatrix} \in \Re^{[n, K]} $$
$$ b^{(i)} \in \Re $$
$$ b=\begin{bmatrix} b^{(1)} & b^{(2)} & \cdots & b^{(K)} \end{bmatrix} \in \Re^{[1, K]} $$

For the convenience of writing, let $ \hat{y^{(i)}}=P(y^{(i)}=1|x;w^{(i)}, b^{(i)}) $.

Loss function

In SR our goal is to
$$ maximize \ \ \ \ \prod_{i=1}^{K}(P(y^{(i)}=1|x;w^{(i)},b^{(i)}))^{I{y^{(i)}=1}} $$
$$ maximize \ \ \ \ \sum_{i=1}^{K}I{y^{(i)}=1}log(\hat{y^{(i)}}) $$
$$ minimize \ \ \ \ -\sum_{i=1}^{K}I{y^{(i)}=1}log(\hat{y^{(i)}})=-\sum_{i=1}^{K}y^{(i)}log(\hat{y^{(i)}}) $$
So the loss function of SR is:
$$ \mathcal L(y,\hat{y})=-\sum_{i=1}^{K}y^{(i)}log(\hat{y^{(i)}}) $$

Note: $I$ is the indicator function, in which
$$ I{condition} = \begin{cases} 1 \ if \ condition \ is \ true \ 0 \ else \end{cases} $$

Gradient

The forward propagation of SR is similar with LR, for one example:
$$ z=w^Tx+b^T $$
$$ \hat{y}=softmax(z) $$
vectorization:
$$ Z=Xw+b $$
$$ \hat{Y}=softmax(Z) $$

We can derivative the gradient based on the chain rule. For one example:
$$ \frac{\partial \mathcal L}{\partial z^{(i)}}=\left{\begin{matrix}
-\frac{1}{\hat{y^{(k)}}}(\hat{y^{(k)}}-(\hat{y^{(k)}})^2)=\hat{y^{(k)}}-1 & if \ i=k \
-\frac{1}{\hat{y^{k}}}(-\hat{y^{(k)}}\hat{y^{(i)}})=\hat{y^{(i)}} & if \ i\neq k
\end{matrix}\right. \Rightarrow \frac{\partial \mathcal L}{\partial z}=\hat{y}-y $$

$$ \frac{\partial z^{(i)}}{\partial w^{(i)}}=x,\ \ \ \ \frac{\partial z^{(i)}}{\partial b^{(i)}}=1 $$
$$ \frac{\partial \mathcal L}{\partial w^{(i)}}=\frac{\partial \mathcal L}{z^{(i)}}\frac{\partial z^{(i)}}{\partial w^{(i)}}=x(\hat{y^{(i)}}-y^{(i)}) \Rightarrow \frac{\partial \mathcal L}{\partial w}=x(\hat{y}-y)^T $$
$$ \frac{\partial \mathcal L}{b^{(i)}}=\frac{\partial \mathcal L}{\partial z^{(i)}}\frac{\partial z^{(i)}}{\partial b^{(i)}}=\hat{y^{(i)}}-y^{(i)} \Rightarrow \frac{\partial \mathcal L}{\partial b}=(\hat{y}-y)^T $$
vectorization:
$$ \frac{\partial \mathcal L}{\partial w}=\frac{1}{m}X^T(\hat{Y}-Y) $$
$$ \frac{\partial \mathcal L}{\partial b}= \frac{1}{m}\sum_{row}(\hat{Y}-Y) $$

Implementation

Here we use python with numpy to implement the forward and backward propagation of SR.

1
2
3
4
5
6
7
8
9
10
11
def softmax(x):
"""
Softmax regression for a vector or matrix.
Args:
x: [n_examples, n_classes]
Returns: values after softmax.
"""
b = x - np.max(x, axis=1, keepdims=True)
expb = np.exp(b)
softmax = expb / np.sum(expb, axis=1, keepdims=True)
return softmax

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
class SoftmaxRegression:
def __init__(self, max_iter=200, learning_rate=0.01):
self.max_iter = max_iter
self.learning_rate = learning_rate

def fit(self, X, Y):
"""
Train the model.
Args:
X: [n_samples, n_features]
Y: [n_samples, n_classes]
"""
m, n = X.shape
_, K = Y.shape
self.w_ = np.zeros([n, K])
self.b_ = np.zeros([1, K])
self.cost_ = []

for i in range(self.max_iter):
Y_hat = self.predict(X)

cost = -np.sum(Y * np.log(Y_hat)) / m

if i != 0 and i % 10 == 0:
print("Step: " + str(i) + ", Cost: " + str(cost))

self.cost_.append(cost)

self.w_ -= self.learning_rate * np.dot(X.T, Y_hat - Y) / m
self.b_ -= self.learning_rate * np.sum(Y_hat - Y, axis=0) / m

def predict(self, X):
"""
Predict the given examples.
Args:
X: [n_samples, n_features]
"""
z = np.dot(X, self.w_)
return softmax(np.dot(X, self.w_) + self.b_)

def score(self, X, Y):
Y_hat = self.predict(X)
Y_hat = np.argmax(Y_hat, axis=1)
Y = np.argmax(Y, axis=1)
true_num = np.sum(Y_hat == Y)
return true_num / len(X)

Example

In order to verify the correctness of the implementation. I experimented on the irsi dataset and the mnist dataset. The parameters and results of the experiment are as follows:

iris mnist
learnig rate 0.1 0.01
max iterate 100 10000
test accuracy 100% 90.98%

You can find the all the experimental code here and reproduce the experimental results.

Automatic Differentiation Based on Computation Graph Logistic Regression (LR)

本博客所有文章除特别声明外, 均采用CC BY-NC-SA 3.0 CN许可协议. 转载请注明出处!



关注笔者微信公众号获得最新文章推送

Comments

Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now

×