Skip to content

0 1 Loss Classification Essay

I am trying to get a grasp on what is the purpose of the loss function and I can't quite understand it.

So, as far as I understand loss function is for introducing some kind of metric that we can measure the "cost" of an incorrect decision with.

So let's say I have a dataset of 30 objects, I divided them to training / testing sets like 20 / 10. I will be using 0-1 loss function, so lets say my set of class labels is M and the function looks like this:

$$ L(i, j) = \begin{cases} 0 \qquad i = j \\ 1 \qquad i \ne j \end{cases} \qquad i,j \in M $$

So I builded some model on my training data, lets say I am using Naive Bayes classifier, and this model classified 7 objects correctly (assigned them the correct class labels) and 3 objects were classified incorrectly.

So my loss function would return "0" 7 times and "1" 3 times - what kind of information can I get from that? That my model classified 30% of the objects incorrectly? Or is there more to it?

If there are any mistakes in my way of thinking I am very sorry, I am just trying to learn. If the example I provided is "too abstract", let me know, I'll try to be more specific. If you will try ti explain the concept using different example, please use 0-1 loss function.


In machine learning and mathematical optimization, loss functions for classification are computationally feasible loss functions representing the price paid for inaccuracy of predictions in classification problems (problems of identifying which category a particular observation belongs to).[1] Given as the vector space of all possible inputs, and Y = {–1,1} as the vector space of all possible outputs, we wish to find a function which best maps to .[2] However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same to generate different .[3] As a result, the goal of the learning problem is to minimize expected risk, defined as

where is the loss function, and is the probability density function of the process that generated the data, which can equivalently be written as

In practice, the probability distribution is unknown. Consequently, utilizing a training set of independently and identically distributed sample points

drawn from the data sample space, one seeks to minimize empirical risk

as a proxy for expected risk.[3] (See statistical learning theory for a more detailed description.)

For computational ease, it is standard practice to write loss functions as functions of only one variable. Within classification, loss functions are generally written solely in terms of the product of the true classifier and the predicted value .[4] Selection of a loss function within this framework

impacts the optimal which minimizes empirical risk, as well as the computational complexity of the learning algorithm.

Given the binary nature of classification, a natural selection for a loss function (assuming equal cost for false positives and false negatives) would be the 0–1 indicator function which takes the value of 0 if the predicted classification equals that of the true class or a 1 if the predicted classification does not match the true class. This selection is modeled by

where indicates the Heaviside step function. However, this loss function is non-convex and non-smooth, and solving for the optimal solution is an NP-hard combinatorial optimization problem.[5] As a result, it is better to substitute continuous, convex loss function surrogates which are tractable for commonly used learning algorithms. In addition to their computational tractability, one can show that the solutions to the learning problem using these loss surrogates allow for the recovery of the actual solution to the original classification problem.[6] Some of these surrogates are described below.

Bounds for classification[edit]

Utilizing Bayes' theorem, it can be shown that the optimal for a binary classification problem is equivalent to

(when ).

Furthermore, it can be shown that for any convex loss function , where is the function that minimizes this loss, if and is decreasing in a neighborhood of 0, then where is the sign function (for proof see [1]). Note also that in practice when the loss function is differentiable at the origin. This fact confers a consistency property upon all convex loss functions; specifically, all convex loss functions will lead to consistent results with the 0–1 loss function given the presence of infinite data. Consequently, we can bound the difference of any of these convex loss function from expected risk.[1]

Simplifying expected risk for classification[edit]

Given the properties of binary classification, it is possible to simplify the calculation of expected risk from the integral specified above. Specifically,

The second equality follows from the properties described above. The third equality follows from the fact that 1 and −1 are the only possible values for , and the fourth because . As a result, one can solve for the minimizers of for any convex loss functions with these properties by differentiating the last equality with respect to and setting the derivative equal to 0. Thus, minimizers for all of the loss function surrogates described below are easily obtained as functions of only and .[3]

Square loss[edit]

While more commonly used in regression, the square loss function can be re-written as a function and utilized for classification. Defined as

the square loss function is both convex and smooth and matches the 0–1 indicator function when and when . However, the square loss function tends to penalize outliers excessively, leading to slower convergence rates (with regards to sample complexity) than for the logistic loss or hinge loss functions.[1] In addition, functions which yield high values of for some will perform poorly with the square loss function, since high values of will be penalized severely, regardless of whether the signs of and match.

A benefit of the square loss function is that its structure lends itself to easy cross validation of regularization parameters. Specifically for Tikhonov regularization, one can solve for the regularization parameter using leave-one-out cross-validation in the same time as it would take to solve a single problem.[7]

The minimizer of for the square loss function is

This function notably equals for the 0–1 loss function when or , but predicts a value between the two classifications when the classification of is not known with absolute certainty.

Hinge loss[edit]

Main article: Hinge loss

The hinge loss function is defined as

The hinge loss provides a relatively tight, convex upper bound on the 0–1 indicator function. Specifically, the hinge loss equals the 0–1 indicator function when and . In addition, the empirical risk minimization of this loss is equivalent to the classical formulation for support vector machines (SVMs). Correctly classified points lying outside the margin boundaries of the support vectors are not penalized, whereas points within the margin boundaries or on the wrong side of the hyperplane are penalized in a linear fashion compared to their distance from the correct boundary.[5]

While the hinge loss function is both convex and continuous, it is not smooth (that is not differentiable) at . Consequently, the hinge loss function cannot be used with gradient descent methods or stochastic gradient descent methods which rely on differentiability over the entire domain. However, the hinge loss does have a subgradient at , which allows for the utilization of subgradient descent methods.[5] SVMs utilizing the hinge loss function can also be solved using quadratic programming.

The minimizer of for the hinge loss function is


Plot of various functions. Blue is the 0–1 indicator function. Green is the square loss function. Purple is the hinge loss function. Yellow is the logistic loss function. Note that all surrogates give a loss penalty of 1 for y=f(x= 0)