Softmax regression (SR), or multinomial logistic regression, is a generalization of logistic regression to the case where we want to handle multiple classes. As in the blog about LR, this post details the modeling approach, loss function, and forward and backward propagation of SR. At the end, I use Python with NumPy to implement SR and apply it to the iris and MNIST datasets. You can find all the code here.
Softmax function
The softmax function is defined by the following formula, where $ K $ is the number of classes and $ K \geqslant 2 $.
$$ softmax(x)=\frac{1}{\sum_{j=1}^{K}e^{x^{(j)}}}\begin{bmatrix} e^{x^{(1)}} \\ e^{x^{(2)}} \\ \vdots \\ e^{x^{(K)}} \end{bmatrix} $$
Unfortunately, the original softmax definition has a numerical overflow problem in practice. For a large positive $ x^{(i)} $, the value of $ e^{x^{(i)}} $ can become so large that it overflows. Similarly, for a very negative $ x^{(i)} $, the value of $ e^{x^{(i)}} $ can be so close to zero that it underflows. Therefore, in practice we use the following equivalent formula.
$$ softmax(x-D)=\frac{1}{\sum_{j=1}^{K}e^{x^{(j)}-D}}\begin{bmatrix} e^{x^{(1)}-D} \\ e^{x^{(2)}-D} \\ \vdots \\ e^{x^{(K)}-D} \end{bmatrix}=\frac{1}{e^{-D}\sum_{j=1}^{K}e^{x^{(j)}}}\begin{bmatrix} e^{-D}e^{x^{(1)}} \\ e^{-D}e^{x^{(2)}} \\ \vdots \\ e^{-D}e^{x^{(K)}} \end{bmatrix}=softmax(x) $$
Where $ D=max(x) $.
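As a quick sanity check, here is a small NumPy sketch (the function names are my own) showing that subtracting the maximum avoids overflow while leaving the output unchanged:

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)               # overflows for large entries
    return e / e.sum()

def softmax_stable(x):
    e = np.exp(x - np.max(x))   # shift by D = max(x)
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(x))   # [nan nan nan] plus an overflow warning
print(softmax_stable(x))  # [0.09003057 0.24472847 0.66524096]
```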
We need to use the derivative of softmax in backpropagation, so let’s calculate it first. For writing convenience, let $ \hat{y}=softmax(x) $, then $ \hat{y^{(i)}}=\frac{e^{x^{(i)}}}{\sum_{j=1}^{K}e^{x^{(j)}}} $.
$$ if \ j=i, \ \ \ \ \frac{\partial \hat{y^{(i)}}}{\partial x^{(j)}}=\frac{e^{x^{(i)}}\sum_{k=1}^{K}e^{x^{(k)}}-e^{x^{(i)}}e^{x^{(j)}}}{(\sum_{k=1}^{K}e^{x^{(k)}})^2}=\hat{y^{(i)}}-(\hat{y^{(i)}})^2 $$
$$ if \ j \neq i, \ \ \ \ \frac{\partial \hat{y^{(i)}}}{\partial x^{(j)}}=\frac{-e^{x^{(i)}}e^{x^{(j)}}}{(\sum_{k=1}^{K}e^{x^{(k)}})^2}=-\hat{y^{(i)}}\hat{y^{(j)}} $$
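Together, the two cases say that the Jacobian of softmax is $ diag(\hat{y})-\hat{y}\hat{y}^T $, which can be verified numerically with finite differences; a minimal sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.random.randn(4)
y_hat = softmax(x)
# analytic Jacobian: diag(y_hat) - y_hat y_hat^T
jac = np.diag(y_hat) - np.outer(y_hat, y_hat)

# numerical Jacobian by central differences
eps = 1e-6
num = np.zeros((4, 4))
for j in range(4):
    d = np.zeros(4)
    d[j] = eps
    num[:, j] = (softmax(x + d) - softmax(x - d)) / (2 * eps)

print(np.allclose(jac, num, atol=1e-6))  # True
```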
Modeling approach
In LR we assumed that the labels were binary: $ y\in \{0, 1\} $. SR allows us to handle $ K $-class classification problems. In SR we often use a one-hot vector to represent the label. For example, in the MNIST digit recognition task, we use $ y=[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]^T $ to represent the label of an image of the digit 3. In SR we use the softmax function to model the probability. Suppose we have a training set $ \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\} $ of $ m $ labeled examples, where the input features are $ x_i \in \Re^{[n, 1]} $. We can use the $i$-th output of the softmax function as the probability that the current sample belongs to the $i$-th class. The formal expression is as follows.
$$ P(y^{(i)}=1|x;w^{(i)}, b^{(i)})=\frac{e^{(w^{(i)})^Tx+b^{(i)}}}{\sum_{j=1}^{K}e^{(w^{(j)})^Tx+b^{(j)}}} $$
Where
$$ w^{(i)} \in \Re^{[n, 1]} $$
$$ w=\begin{bmatrix} | & | & | & | \\ w^{(1)} & w^{(2)} & \cdots & w^{(K)} \\ | & | & | & | \end{bmatrix} \in \Re^{[n, K]} $$
$$ b^{(i)} \in \Re $$
$$ b=\begin{bmatrix} b^{(1)} & b^{(2)} & \cdots & b^{(K)} \end{bmatrix} \in \Re^{[1, K]} $$
For the convenience of writing, let $ \hat{y^{(i)}}=P(y^{(i)}=1|x;w^{(i)}, b^{(i)}) $.
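Putting the definitions together for a single example, the predicted probabilities are just a softmax over $ K $ affine scores. A small sketch with made-up shapes (variable names are my own; the shapes follow the definitions above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

n, K = 4, 3                  # feature dimension and number of classes
x = np.random.randn(n, 1)    # one example, shape [n, 1]
w = np.random.randn(n, K)    # weights, shape [n, K]
b = np.random.randn(1, K)    # biases, shape [1, K]

z = w.T @ x + b.T            # scores, shape [K, 1]
y_hat = softmax(z)           # y_hat[i] = P(y^(i) = 1 | x)
print(y_hat.sum())           # 1.0
```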
Loss function
In SR our goal for a single example is to
$$ maximize \ \ \ \ \prod_{i=1}^{K}(P(y^{(i)}=1|x;w^{(i)},b^{(i)}))^{I\{y^{(i)}=1\}} $$
Taking the logarithm, this is equivalent to
$$ maximize \ \ \ \ \sum_{i=1}^{K}I\{y^{(i)}=1\}log(\hat{y^{(i)}}) $$
or, flipping the sign,
$$ minimize \ \ \ \ -\sum_{i=1}^{K}I\{y^{(i)}=1\}log(\hat{y^{(i)}})=-\sum_{i=1}^{K}y^{(i)}log(\hat{y^{(i)}}) $$
So the loss function of SR is:
$$ \mathcal L(y,\hat{y})=-\sum_{i=1}^{K}y^{(i)}log(\hat{y^{(i)}}) $$
Note: $I$ is the indicator function, in which
$$ I\{condition\} = \begin{cases} 1 & if \ condition \ is \ true \\ 0 & else \end{cases} $$
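In code, the loss over a batch of one-hot labels is just the average negative log-probability assigned to the true class. A minimal sketch (the small `eps` guarding against `log(0)` is my own addition):

```python
import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    # Y, Y_hat: shape [m, K]; average of -sum_i y^(i) log(y_hat^(i)) over the batch
    return -np.mean(np.sum(Y * np.log(Y_hat + eps), axis=1))

Y = np.array([[0, 0, 1], [1, 0, 0]])
Y_hat = np.array([[0.1, 0.2, 0.7], [0.8, 0.1, 0.1]])
print(cross_entropy(Y, Y_hat))  # ~0.29, i.e. -(log 0.7 + log 0.8) / 2
```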
Gradient
The forward propagation of SR is similar to that of LR. For one example:
$$ z=w^Tx+b^T $$
$$ \hat{y}=softmax(z) $$
vectorization:
$$ Z=Xw+b $$
$$ \hat{Y}=softmax(Z) $$
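A row-wise softmax turns the vectorized forward pass into two lines of NumPy; a sketch (here `X` stacks the $ m $ examples as rows):

```python
import numpy as np

def softmax_rows(Z):
    # row-wise stable softmax, Z has shape [m, K]
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(X, w, b):
    Z = X @ w + b            # [m, n] @ [n, K] + [1, K] -> [m, K]
    return softmax_rows(Z)
```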
We can derive the gradient using the chain rule. For one example whose true class is $ k $ (i.e. $ y^{(k)}=1 $):
$$ \frac{\partial \mathcal L}{\partial z^{(i)}}=\begin{cases} -\frac{1}{\hat{y^{(k)}}}(\hat{y^{(k)}}-(\hat{y^{(k)}})^2)=\hat{y^{(k)}}-1 & if \ i=k \\ -\frac{1}{\hat{y^{(k)}}}(-\hat{y^{(k)}}\hat{y^{(i)}})=\hat{y^{(i)}} & if \ i\neq k \end{cases} \Rightarrow \frac{\partial \mathcal L}{\partial z}=\hat{y}-y $$
$$ \frac{\partial z^{(i)}}{\partial w^{(i)}}=x,\ \ \ \ \frac{\partial z^{(i)}}{\partial b^{(i)}}=1 $$
$$ \frac{\partial \mathcal L}{\partial w^{(i)}}=\frac{\partial \mathcal L}{\partial z^{(i)}}\frac{\partial z^{(i)}}{\partial w^{(i)}}=x(\hat{y^{(i)}}-y^{(i)}) \Rightarrow \frac{\partial \mathcal L}{\partial w}=x(\hat{y}-y)^T $$
$$ \frac{\partial \mathcal L}{\partial b^{(i)}}=\frac{\partial \mathcal L}{\partial z^{(i)}}\frac{\partial z^{(i)}}{\partial b^{(i)}}=\hat{y^{(i)}}-y^{(i)} \Rightarrow \frac{\partial \mathcal L}{\partial b}=(\hat{y}-y)^T $$
vectorization (the gradients below are averaged over the $ m $ training examples):
$$ \frac{\partial \mathcal L}{\partial w}=\frac{1}{m}X^T(\hat{Y}-Y) $$
$$ \frac{\partial \mathcal L}{\partial b}= \frac{1}{m}\sum_{row}(\hat{Y}-Y) $$
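These vectorized gradients translate directly into a few lines of NumPy; a sketch (array shapes follow the conventions above):

```python
def backward(X, Y, Y_hat):
    # X: [m, n], Y and Y_hat: [m, K]
    m = X.shape[0]
    dZ = Y_hat - Y                            # dL/dZ
    dw = X.T @ dZ / m                         # dL/dw, shape [n, K]
    db = dZ.sum(axis=0, keepdims=True) / m    # dL/db, shape [1, K]
    return dw, db
```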
Implementation
Here we use Python with NumPy to implement the forward and backward propagation of SR.
```python
import numpy as np

def softmax(x):
    # numerically stable softmax: subtract D = max(x) before exponentiating
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)
```
```python
class SoftmaxRegression:
    # minimal version following the vectorized formulas above
    def __init__(self, learning_rate=0.1, max_iter=100):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.w = None
        self.b = None

    def fit(self, X, Y):
        # X: [m, n] features, Y: [m, K] one-hot labels
        m, n = X.shape
        K = Y.shape[1]
        self.w = np.zeros((n, K))
        self.b = np.zeros((1, K))
        for _ in range(self.max_iter):
            Y_hat = softmax(X @ self.w + self.b)      # forward propagation
            dZ = Y_hat - Y                            # dL/dZ
            dw = X.T @ dZ / m                         # dL/dw
            db = dZ.sum(axis=0, keepdims=True) / m    # dL/db
            self.w -= self.learning_rate * dw         # gradient descent step
            self.b -= self.learning_rate * db
        return self

    def predict(self, X):
        # most probable class index for each example
        return np.argmax(X @ self.w + self.b, axis=1)
```
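A quick usage sketch on random data (purely illustrative; it assumes the `fit`/`predict` interface shown above):

```python
X = np.random.randn(100, 4)                 # 100 examples, 4 features
labels = np.random.randint(0, 3, size=100)  # 3 classes
Y = np.eye(3)[labels]                       # one-hot targets, shape [100, 3]

model = SoftmaxRegression(learning_rate=0.1, max_iter=100)
model.fit(X, Y)
print((model.predict(X) == labels).mean())  # training accuracy
```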
Example
In order to verify the correctness of the implementation, I ran experiments on the iris and MNIST datasets. The parameters and results are as follows:
| | iris | mnist |
| --- | --- | --- |
| learning rate | 0.1 | 0.01 |
| max iterations | 100 | 10000 |
| test accuracy | 100% | 90.98% |
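For reference, the iris experiment might be set up roughly as follows (a sketch that assumes scikit-learn only for loading and splitting the data; the preprocessing in the original experiments may differ):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

Y_train = np.eye(3)[y_train]                     # one-hot labels
model = SoftmaxRegression(learning_rate=0.1, max_iter=100)
model.fit(X_train, Y_train)
print((model.predict(X_test) == y_test).mean())  # test accuracy
```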
You can find all the experimental code here and reproduce the experimental results.