Softmax regression (SR), or multinomial logistic regression, is a generalization of logistic regression to the case where we want to handle multiple classes. As in the blog about LR, this blog will detail the **modeling approach**, **loss function**, and **forward and backward propagation** of SR. In the end, I will use Python with numpy to implement SR and demonstrate it on the iris and MNIST datasets. You can find all the code here.

# Softmax function

The softmax function is defined by the following formula:

$$\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$

where $K$ is the number of classes and $j = 1, \dots, K$.

Unfortunately, the original softmax definition has a numerical overflow problem in actual use. For a large positive $z_j$, the value of $e^{z_j}$ may be quite large and cause a numerical overflow. Similarly, for a very negative $z_j$, the value of $e^{z_j}$ may be very close to zero, resulting in a numerical underflow. Therefore, in practice we use the following equivalent formula:

$$\mathrm{softmax}(z)_j = \frac{e^{z_j - M}}{\sum_{k=1}^{K} e^{z_k - M}}$$

where $M = \max_{k} z_k$.
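To make the stability issue concrete, here is a small numpy sketch comparing the naive formula with the shifted one (the subtraction of $M$ is the only difference; the inputs are made-up numbers):

```python
import numpy as np

z = np.array([1000.0, 1001.0, 1002.0])

# naive softmax: e^1000 overflows to inf, and inf/inf gives nan
naive = np.exp(z) / np.sum(np.exp(z))

# shifted softmax: subtract M = max(z) first; mathematically identical
shifted = np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))

print(naive)    # [nan nan nan]
print(shifted)  # a valid probability distribution summing to 1
```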

We need to use the derivative of softmax in backpropagation, so let's calculate it first. For writing convenience, let $a_j = \mathrm{softmax}(z)_j$; then

$$\frac{\partial a_j}{\partial z_i} = \begin{cases} a_j (1 - a_j) & i = j \\ -a_i a_j & i \neq j \end{cases}$$
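This two-case formula can be checked numerically. Below is a quick sketch that builds the analytic Jacobian and compares it against a central finite-difference approximation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0])
a = softmax(z)

# analytic Jacobian: entry (j, i) is a_j*(1 - a_j) if i == j, else -a_i*a_j
jac = np.diag(a) - np.outer(a, a)

# finite-difference approximation of the same Jacobian
eps = 1e-6
num = np.zeros((3, 3))
for i in range(3):
    d = np.zeros(3)
    d[i] = eps
    num[:, i] = (softmax(z + d) - softmax(z - d)) / (2 * eps)

print(np.max(np.abs(jac - num)))  # should be very close to zero
```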

# Modeling approach

In LR we assumed that the labels were binary: $y^{(i)} \in \{0, 1\}$. SR allows us to handle $K$-class classification problems. In SR we often use a one-hot vector to represent the label. For example, in the MNIST digit-recognition task, we will use $(0,0,0,1,0,0,0,0,0,0)^T$ to represent the label of an image of the digit 3. In SR we use the softmax function to model the class probabilities. Suppose we have a training set $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$ of $m$ labeled examples, where the input features are $x^{(i)} \in \mathbb{R}^{n}$. We can use the $j$-th output of the softmax function as the probability that the current sample belongs to the $j$-th class. The formal expression is as follows:

$$P(y = j \mid x; W, b) = \frac{e^{w_j^T x + b_j}}{\sum_{k=1}^{K} e^{w_k^T x + b_k}}$$

Where $W \in \mathbb{R}^{K \times n}$, $b \in \mathbb{R}^{K}$, $w_j^T$ is the $j$-th row of $W$, and $b_j$ is the $j$-th component of $b$.

For the convenience of writing, let $z = Wx + b$ and $a = \mathrm{softmax}(z)$, so that $a_j = P(y = j \mid x; W, b)$.
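Under these definitions, computing the class probabilities for one example is a few lines of numpy. The weights below are random made-up numbers, purely for illustration:

```python
import numpy as np

K, n = 3, 4                      # 3 classes, 4 input features
rng = np.random.default_rng(0)
W = rng.normal(size=(K, n))      # weight matrix, one row per class
b = rng.normal(size=K)           # bias vector
x = rng.normal(size=n)           # one input example

z = W @ x + b                    # logits z = Wx + b
a = np.exp(z - z.max())
a = a / a.sum()                  # a_j = P(y = j | x; W, b)

print(a, a.sum())  # a valid probability distribution over the 3 classes
```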

# Loss function

In SR our goal is to maximize the likelihood of the training set, i.e. to maximize

$$\prod_{i=1}^{m} P\!\left(y^{(i)} \mid x^{(i)}; W, b\right)$$

So the loss function of SR is the average negative log-likelihood, i.e. the cross-entropy loss:

$$J(W, b) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{K} 1\{y^{(i)} = j\} \log a_j^{(i)}$$

Note: $1\{\cdot\}$ is the indicator function, in which $1\{\text{a true statement}\} = 1$ and $1\{\text{a false statement}\} = 0$.
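With one-hot labels, the indicator sum simply picks out the log-probability of the true class, so the loss is one line of numpy. A sketch, assuming the predicted probabilities are already stored column-wise (one column per example, hand-picked numbers for illustration):

```python
import numpy as np

# predicted probabilities, shape (K, m): one column per example
A = np.array([[0.7, 0.1],
              [0.2, 0.8],
              [0.1, 0.1]])
# one-hot labels, shape (K, m): example 1 is class 0, example 2 is class 1
Y = np.array([[1, 0],
              [0, 1],
              [0, 0]])

m = A.shape[1]
# J = -(1/m) * sum_i sum_j 1{y^(i) = j} * log a_j^(i)
J = -np.sum(Y * np.log(A)) / m
print(J)  # average of -log 0.7 and -log 0.8
```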

# Gradient

The forward propagation of SR is similar to LR. For one example:

$$z = Wx + b, \qquad a = \mathrm{softmax}(z)$$

Vectorization (stacking the $m$ examples as the columns of $X \in \mathbb{R}^{n \times m}$, with softmax applied column-wise):

$$Z = WX + b, \qquad A = \mathrm{softmax}(Z)$$

We can derive the gradient based on the chain rule. For one example (with $y$ the one-hot label):

$$\frac{\partial J}{\partial z} = a - y, \qquad \frac{\partial J}{\partial W} = (a - y)\,x^T, \qquad \frac{\partial J}{\partial b} = a - y$$

Vectorization:

$$\frac{\partial J}{\partial W} = \frac{1}{m}(A - Y)X^T, \qquad \frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(a^{(i)} - y^{(i)}\right)$$
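The vectorized gradient formulas can be verified against finite differences. A quick sketch on random made-up data:

```python
import numpy as np

def forward(W, b, X):
    Z = W @ X + b
    E = np.exp(Z - Z.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def loss(W, b, X, Y):
    A = forward(W, b, X)
    return -np.sum(Y * np.log(A)) / X.shape[1]

rng = np.random.default_rng(0)
K, n, m = 3, 4, 5
X = rng.normal(size=(n, m))
Y = np.eye(K)[:, rng.integers(0, K, m)]       # random one-hot labels, (K, m)
W = rng.normal(size=(K, n))
b = rng.normal(size=(K, 1))

# analytic gradients from the formulas above
A = forward(W, b, X)
dW = (A - Y) @ X.T / m
db = np.sum(A - Y, axis=1, keepdims=True) / m

# finite-difference check of one entry of dW
eps = 1e-6
Wp = W.copy(); Wp[1, 2] += eps
Wm = W.copy(); Wm[1, 2] -= eps
num = (loss(Wp, b, X, Y) - loss(Wm, b, X, Y)) / (2 * eps)
print(dW[1, 2], num)  # the two numbers should agree closely
```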

# Implementation

Here we use python with numpy to implement the forward and backward propagation of SR.

```python
import numpy as np

def softmax(x):
    # column-wise softmax with the max-subtraction trick for stability
    e = np.exp(x - np.max(x, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)
```

```python
class SoftmaxRegression:
    ...
```
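The listing above is truncated. A minimal end-to-end sketch of such a class, following the forward and backward formulas derived earlier (the method and parameter names here are my assumptions, not necessarily the original code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

class SoftmaxRegression:
    def __init__(self, n_features, n_classes, lr=0.1, n_iter=100):
        self.W = np.zeros((n_classes, n_features))
        self.b = np.zeros((n_classes, 1))
        self.lr = lr
        self.n_iter = n_iter

    def fit(self, X, Y):
        # X: (n_features, m), Y: one-hot labels, (n_classes, m)
        m = X.shape[1]
        for _ in range(self.n_iter):
            A = softmax(self.W @ X + self.b)             # forward propagation
            dW = (A - Y) @ X.T / m                        # backward propagation
            db = np.sum(A - Y, axis=1, keepdims=True) / m
            self.W -= self.lr * dW                        # gradient descent step
            self.b -= self.lr * db
        return self

    def predict(self, X):
        # argmax over logits; softmax is monotone so it is not needed here
        return np.argmax(self.W @ X + self.b, axis=0)
```

On well-separated toy data this reaches high training accuracy within a few hundred iterations.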

# Example

In order to verify the correctness of the implementation, I experimented on the iris dataset and the MNIST dataset. The parameters and results of the experiments are as follows:

| | iris | mnist |
|---|---|---|
| learning rate | 0.1 | 0.01 |
| max iterations | 100 | 10000 |
| test accuracy | 100% | 90.98% |

You can find all the experimental code here and reproduce the experimental results.
