Softmax regression (SR), or multinomial logistic regression, is a generalization of logistic regression to the case where we want to handle multiple classes. As in the blog about LR, this post details the modeling approach, loss function, and forward and backward propagation of SR. At the end, I use Python with NumPy to implement SR and apply it to the iris and MNIST datasets. You can find all the code here.
Softmax function
The softmax function is defined by the following formula, where $ K $ is the number of classes and $ K \geqslant 2 $.
$$ softmax(x)=\frac{1}{\sum_{j=1}^{K}e^{x^{(j)}}}\begin{bmatrix} e^{x^{(1)}} \\ e^{x^{(2)}} \\ \vdots \\ e^{x^{(K)}} \end{bmatrix} $$
Unfortunately, the original softmax definition has a numerical overflow problem in practice. For a large positive $ x^{(i)} $, the value of $ e^{x^{(i)}} $ can become so large that it overflows. Similarly, for a very negative $ x^{(i)} $, the value of $ e^{x^{(i)}} $ can be so close to zero that it underflows. Therefore, in practice we use the following equivalent formula.
$$ softmax(x-D)=\frac{1}{\sum_{j=1}^{K}e^{x^{(j)}-D}}\begin{bmatrix} e^{x^{(1)}-D} \\ e^{x^{(2)}-D} \\ \vdots \\ e^{x^{(K)}-D} \end{bmatrix}=\frac{1}{e^{-D}\sum_{j=1}^{K}e^{x^{(j)}}}\begin{bmatrix} e^{-D}e^{x^{(1)}} \\ e^{-D}e^{x^{(2)}} \\ \vdots \\ e^{-D}e^{x^{(K)}} \end{bmatrix}=softmax(x) $$
Where $ D=max(x) $.
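As a quick sanity check, here is a small NumPy sketch (the function names are my own) showing that subtracting the maximum avoids overflow while leaving the output unchanged:

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)               # overflows for large entries
    return e / e.sum()

def softmax_stable(x):
    e = np.exp(x - np.max(x))   # shift by D = max(x)
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(x))   # [nan nan nan] plus an overflow warning
print(softmax_stable(x))  # [0.09003057 0.24472847 0.66524096]
```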
We need to use the derivative of softmax in backpropagation, so let’s calculate it first. For writing convenience, let $ \hat{y}=softmax(x) $, then $ \hat{y^{(i)}}=\frac{e^{x^{(i)}}}{\sum_{j=1}^{K}e^{x^{(j)}}} $.
$$ if \ j=i, \ \ \ \ \frac{\partial \hat{y^{(i)}}}{\partial x^{(j)}}=\frac{e^{x^{(i)}}\sum_{k=1}^{K}e^{x^{(k)}}-e^{x^{(i)}}e^{x^{(j)}}}{(\sum_{k=1}^{K}e^{x^{(k)}})^2}=\hat{y^{(i)}}-(\hat{y^{(i)}})^2 $$
$$ if \ j \neq i, \ \ \ \ \frac{\partial \hat{y^{(i)}}}{\partial x^{(j)}}=\frac{-e^{x^{(i)}}e^{x^{(j)}}}{(\sum_{k=1}^{K}e^{x^{(k)}})^2}=-\hat{y^{(i)}}\hat{y^{(j)}} $$
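Together, the two cases say that the Jacobian of softmax is $ diag(\hat{y})-\hat{y}\hat{y}^T $, which can be verified numerically with finite differences; a minimal sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.random.randn(4)
y_hat = softmax(x)
# analytic Jacobian: diag(y_hat) - y_hat y_hat^T
jac = np.diag(y_hat) - np.outer(y_hat, y_hat)

# numerical Jacobian by central differences
eps = 1e-6
num = np.zeros((4, 4))
for j in range(4):
    d = np.zeros(4)
    d[j] = eps
    num[:, j] = (softmax(x + d) - softmax(x - d)) / (2 * eps)

print(np.allclose(jac, num, atol=1e-6))  # True
```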
Modeling approach
In LR we assumed that the labels were binary: $ y\in \{0, 1\} $. SR allows us to handle $ K $-class classification problems. In SR we often use a one-hot vector to represent the label. For example, in the MNIST digit recognition task, we use $ y=[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]^T $ to represent the label of an image of the digit 3. In SR we use the softmax function to model the probability. Suppose we have a training set $ \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\} $ of $ m $ labeled examples, where the input features are $ x_i \in \Re^{[n, 1]} $. We can use the $i$-th output of the softmax function as the probability that the current sample belongs to the $i$-th class. The formal expression is as follows.
$$ P(y^{(i)}=1|x;w^{(i)}, b^{(i)})=\frac{e^{(w^{(i)})^Tx+b^{(i)}}}{\sum_{j=1}^{K}e^{(w^{(j)})^Tx+b^{(j)}}} $$
Where
$$ w^{(i)} \in \Re^{[n, 1]} $$
$$ w=\begin{bmatrix} | & | & | & | \\ w^{(1)} & w^{(2)} & \cdots & w^{(K)} \\ | & | & | & | \end{bmatrix} \in \Re^{[n, K]} $$
$$ b^{(i)} \in \Re $$
$$ b=\begin{bmatrix} b^{(1)} & b^{(2)} & \cdots & b^{(K)} \end{bmatrix} \in \Re^{[1, K]} $$
For the convenience of writing, let $ \hat{y^{(i)}}=P(y^{(i)}=1|x;w^{(i)}, b^{(i)}) $.
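Putting the definitions together for a single example, the predicted probabilities are just a softmax over $ K $ affine scores. A small sketch with made-up shapes (variable names are my own; the shapes follow the definitions above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

n, K = 4, 3                  # feature dimension and number of classes
x = np.random.randn(n, 1)    # one example, shape [n, 1]
w = np.random.randn(n, K)    # weights, shape [n, K]
b = np.random.randn(1, K)    # biases, shape [1, K]

z = w.T @ x + b.T            # scores, shape [K, 1]
y_hat = softmax(z)           # y_hat[i] = P(y^(i) = 1 | x)
print(y_hat.sum())           # 1.0
```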
Loss function
In SR our goal for a single example is to
$$ maximize \ \ \ \ \prod_{i=1}^{K}(P(y^{(i)}=1|x;w^{(i)},b^{(i)}))^{I\{y^{(i)}=1\}} $$
Taking the logarithm, this is equivalent to
$$ maximize \ \ \ \ \sum_{i=1}^{K}I\{y^{(i)}=1\}log(\hat{y^{(i)}}) $$
or, flipping the sign,
$$ minimize \ \ \ \ -\sum_{i=1}^{K}I\{y^{(i)}=1\}log(\hat{y^{(i)}})=-\sum_{i=1}^{K}y^{(i)}log(\hat{y^{(i)}}) $$
So the loss function of SR is:
$$ \mathcal L(y,\hat{y})=-\sum_{i=1}^{K}y^{(i)}log(\hat{y^{(i)}}) $$
Note: $I$ is the indicator function, in which
$$ I\{condition\} = \begin{cases} 1 & if \ condition \ is \ true \\ 0 & else \end{cases} $$
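In code, the loss over a batch of one-hot labels is just the average negative log-probability assigned to the true class. A minimal sketch (the small `eps` guarding against `log(0)` is my own addition):

```python
import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    # Y, Y_hat: shape [m, K]; average of -sum_i y^(i) log(y_hat^(i)) over the batch
    return -np.mean(np.sum(Y * np.log(Y_hat + eps), axis=1))

Y = np.array([[0, 0, 1], [1, 0, 0]])
Y_hat = np.array([[0.1, 0.2, 0.7], [0.8, 0.1, 0.1]])
print(cross_entropy(Y, Y_hat))  # ~0.29, i.e. -(log 0.7 + log 0.8) / 2
```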
Gradient
The forward propagation of SR is similar to that of LR. For one example:
$$ z=w^Tx+b^T $$
$$ \hat{y}=softmax(z) $$
vectorization:
$$ Z=Xw+b $$
$$ \hat{Y}=softmax(Z) $$
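A row-wise softmax turns the vectorized forward pass into two lines of NumPy; a sketch (here `X` stacks the $ m $ examples as rows):

```python
import numpy as np

def softmax_rows(Z):
    # row-wise stable softmax, Z has shape [m, K]
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(X, w, b):
    Z = X @ w + b            # [m, n] @ [n, K] + [1, K] -> [m, K]
    return softmax_rows(Z)
```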
We can derive the gradient using the chain rule. For one example whose true class is $ k $ (i.e. $ y^{(k)}=1 $):
$$ \frac{\partial \mathcal L}{\partial z^{(i)}}=\begin{cases} -\frac{1}{\hat{y^{(k)}}}(\hat{y^{(k)}}-(\hat{y^{(k)}})^2)=\hat{y^{(k)}}-1 & if \ i=k \\ -\frac{1}{\hat{y^{(k)}}}(-\hat{y^{(k)}}\hat{y^{(i)}})=\hat{y^{(i)}} & if \ i\neq k \end{cases} \Rightarrow \frac{\partial \mathcal L}{\partial z}=\hat{y}-y $$
$$ \frac{\partial z^{(i)}}{\partial w^{(i)}}=x,\ \ \ \ \frac{\partial z^{(i)}}{\partial b^{(i)}}=1 $$
$$ \frac{\partial \mathcal L}{\partial w^{(i)}}=\frac{\partial \mathcal L}{\partial z^{(i)}}\frac{\partial z^{(i)}}{\partial w^{(i)}}=x(\hat{y^{(i)}}-y^{(i)}) \Rightarrow \frac{\partial \mathcal L}{\partial w}=x(\hat{y}-y)^T $$
$$ \frac{\partial \mathcal L}{\partial b^{(i)}}=\frac{\partial \mathcal L}{\partial z^{(i)}}\frac{\partial z^{(i)}}{\partial b^{(i)}}=\hat{y^{(i)}}-y^{(i)} \Rightarrow \frac{\partial \mathcal L}{\partial b}=(\hat{y}-y)^T $$
vectorization (the gradients below are averaged over the $ m $ training examples):
$$ \frac{\partial \mathcal L}{\partial w}=\frac{1}{m}X^T(\hat{Y}-Y) $$
$$ \frac{\partial \mathcal L}{\partial b}= \frac{1}{m}\sum_{row}(\hat{Y}-Y) $$
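These vectorized gradients translate directly into a few lines of NumPy; a sketch (array shapes follow the conventions above):

```python
def backward(X, Y, Y_hat):
    # X: [m, n], Y and Y_hat: [m, K]
    m = X.shape[0]
    dZ = Y_hat - Y                            # dL/dZ
    dw = X.T @ dZ / m                         # dL/dw, shape [n, K]
    db = dZ.sum(axis=0, keepdims=True) / m    # dL/db, shape [1, K]
    return dw, db
```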
Implementation
Here we use Python with NumPy to implement the forward and backward propagation of SR.
```python
import numpy as np

def softmax(x):
    # numerically stable softmax: subtract D = max(x) before exponentiating
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)
```
```python
class SoftmaxRegression:
    # minimal version following the vectorized formulas above
    def __init__(self, learning_rate=0.1, max_iter=100):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.w = None
        self.b = None

    def fit(self, X, Y):
        # X: [m, n] features, Y: [m, K] one-hot labels
        m, n = X.shape
        K = Y.shape[1]
        self.w = np.zeros((n, K))
        self.b = np.zeros((1, K))
        for _ in range(self.max_iter):
            Y_hat = softmax(X @ self.w + self.b)      # forward propagation
            dZ = Y_hat - Y                            # dL/dZ
            dw = X.T @ dZ / m                         # dL/dw
            db = dZ.sum(axis=0, keepdims=True) / m    # dL/db
            self.w -= self.learning_rate * dw         # gradient descent step
            self.b -= self.learning_rate * db
        return self

    def predict(self, X):
        # most probable class index for each example
        return np.argmax(X @ self.w + self.b, axis=1)
```
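A quick usage sketch on random data (purely illustrative; it assumes the `fit`/`predict` interface shown above):

```python
X = np.random.randn(100, 4)                 # 100 examples, 4 features
labels = np.random.randint(0, 3, size=100)  # 3 classes
Y = np.eye(3)[labels]                       # one-hot targets, shape [100, 3]

model = SoftmaxRegression(learning_rate=0.1, max_iter=100)
model.fit(X, Y)
print((model.predict(X) == labels).mean())  # training accuracy
```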
Example
In order to verify the correctness of the implementation, I ran experiments on the iris and MNIST datasets. The parameters and results are as follows:
| | iris | mnist |
| --- | --- | --- |
| learning rate | 0.1 | 0.01 |
| max iterations | 100 | 10000 |
| test accuracy | 100% | 90.98% |
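For reference, the iris experiment might be set up roughly as follows (a sketch that assumes scikit-learn only for loading and splitting the data; the preprocessing in the original experiments may differ):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

Y_train = np.eye(3)[y_train]                     # one-hot labels
model = SoftmaxRegression(learning_rate=0.1, max_iter=100)
model.fit(X_train, Y_train)
print((model.predict(X_test) == y_test).mean())  # test accuracy
```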
You can find all the experimental code here and reproduce the experimental results.