In the last lesson, we learned about perceptrons which represent a linear boundary. In this lesson, we will learn about the error function to quantify the error of the boundary.

### Error Functions

An error function returns a number that represents how far we are from the solution. A high error value indicates that we are far from the solution, and the error value near zero indicates that we are near the solution.

A popular approach to reach the solution is a method known as gradient descent. This example is best illustrated with hiking down from a mountain. The hiker wants to quickly reach the bottom of the mountain. The gist of gradient descent is to choose the steepest direction of descent and take a step in that direction. The gradient descent approach limits the shape of the error function. Because at every step, the hiker needs to choose the direction of steepest descent, the error function must be continuous. If the error function is discrete, the hiker might have no step in any direction that leads to descent and falsely believe that he or she has reached the bottom. Therefore, we use a continuous error function. The error function should assign high error to misclassified points and low error to correctly classified points. ### Continuous Prediction

To use a continuous error function, we need continuous prediction $\hat{y}$. Previously, the prediction $\hat{y}$ was either 0 (rejected) or 1 (accepted). Instead, we use a probability, where $\hat{y} = 0.7$ would mean that the algorithm predicts that this point has 70% chance of getting accepted. Points on the line have prediction of 0.5, and the further the point is inside the positive region, its prediction is higher. The further the point is inside the negative region, its prediction is lower. To use a continuous function, we need to change the activation function. The activation function was the second function of the perceptron. Before, we used the step function, where $\hat{y} = 1$ when $Wx + b \geq 0$ and $\hat{y} = 0$ when $Wx + b < 0$. Instead, we now use the sigmoid function. The sigmoid function is defined as follows: Note that when $Wx + b = 0$, $\sigma(Wx + b) = 0.5$, which denotes 50% probability. For positive $Wx + b$, $\sigma(Wx + b) > 0.5$, and for negative $Wx + b$, $\sigma(Wx + b) < 0.5$.

### Multi-class Classification and Softmax

So far, we categorized all examples into two categories. This is called binary classification. However, sometimes we want to categorize them into multiple categories. For this, we use the softmax function.

Suppose that we have animal classifier that classifies duck, beaver, and walrus. Given some features $x$, we get a score $z_1, \dots, z_n$ for each category. From these scores, we want to calculate the probability. We want some function $f$ so that $\mathbb{P}_i = f (z_i)$ . First, since the output is a probability, the sum must be 1. Therefore, a simplest implementation would be

However, the score can be negative or zero, so the total score could become zero. Also, if the total score is negative and the individual score is positive, the probability will be negative. Therefore, we take the exponents of the scores to guarantee positivity.

Here is the softmax function for the duck-beaver-walrus example. Note that the softmax function is always increasing, so higher score means higher probability. Note that if $n = 2$, the softmax function is just the sigmoid function.

### One-Hot Encoding

In the college acceptance example, we only looked at quantitative inputs. The test score of 5 and the test score of 10 had clear numerical implications such as orderings. However, there are also categorical inputs such as the student’s nationality. (We assume no multi-nationality student here) Categorical inputs are also normally labeled with numbers $1, 2, \ldots$ but these numbers have no numerical meaning.

Therefore, to make sure that the model does not confuse categorical inputs with quantitative inputs and infer numerical meanings, we transform categorical inputs to one-hot encodings. Instead of having one label for nationality and labeling numbers $1, 2, \ldots, N$ for $N$ different nations, we have $N$ labels, and for each student, one of the label has value $1$, and the rest $0$. ### Maximum Likelihood

The Maximum Likelihood is a simple idea: we pick the model with that give the existing examples the highest probability. For example, suppose that weather channel A said that it would rain with 80% chance, and channel B said that it would rain with 40% chance. If it did rain, we would believe channel A to be more accurate, and if it did not, we would believe channel B to be more accurate.

We can calculate the probability of the model by taking a product of the probability of the label for each example. In the example below, the model has the probability $0.6 \times 0.2 \times 0.1 \times 0.7 = 0.0084$. Then, we can compare this probability for different models. In the example below, the model on the right has higher probability, so it is a better model. Therefore, our goal is to improve the probability of our model. ### Cross Entropy

In many cases, we have millions of examples that we want to fit the model. Even if the model is correctly trained and each label has probability $0.9$ , the probability of the model would be about $p = 0.9^{1000000}$. It is very easy for the number to get too small for the computer to handle. Therefore, we use the natural logarithm function to convert products to sums. $\log 0.9^{1000000} = 1000000 \log 0.9$ is more manageable. Since $\log x$ is always negative if $0 < x < 1$, and since probability is from 0 to 1, we use $- \log p$ for the score of the model. This is called the Cross entropy. Cross entropy measures given events and probabilities, how likely the given events happen based on given probabilities. The cross entropy formula can be written concisely with this formula:

where $m$ is the number of example, $y_i$ is the correct label of example $i$, and $p_i$ is the predicted probability that $y_i = 1$. If $y_i = 1$, then the second term is 0, so we add $\log(p_i)$. If $y_i = 0$, then the first term is 0, so we add $\log(1 - p_i)$.

Note that as a convention, we use the average of the sum of cross entropy for the error function.

### Multi-class Cross Entropy

We can also use cross entropy error function for multi-class classification. We can no longer use the convenient formula above, but the rest of the logic holds.

where $c$ is the number of classes, and $y_{ij}$ is 1 if $i$-th example is labeled $j$, and 0 otherwise.

### Other Error Functions

The Sum of Squared Error (SSE) function is another possible error function.

where $\mu$ is the data point and $j$ is the output unit. As the name suggested, we take the sum of all squared errors. This error function is nice, but summing all the errors could make $E$ huge when $\mu$ is high. Therefore, we normally use the Mean Squared Error function instead.

The Mean Squared Error (MSE) function is another popular choice for an error function. Therefore,

The $\frac{1}{2}$ exists to make the derivation clean. Both the MSE and the SSE has the nice property of penalizing the outliers more than the small errors by squaring the error.