The problem with accuracy

Standard accuracy is defined as the ratio of correct classifications to the total number of classifications made.

\[\begin{aligned} accuracy := \frac{\text{correct classifications}}{\text{number of classifications}}\end{aligned}\]

It is thus an overall measure over all classes and, as we'll shortly see, it's not a good measure for telling an oracle apart from an actually useful test. An oracle, in this context, is a classification function that returns a random guess for each sample; what we want is a way to rate how much better than such guessing our classification function performs. Accuracy can be a useful measure if we have the same number of samples per class, but if we have an imbalanced set of samples accuracy isn't useful at all. Even more so, a test can have a high accuracy but actually perform worse than a test with a lower accuracy.

If we have a distribution of samples such that 90% of the samples belong to class \(\mathcal{A}\), 5% to \(\mathcal{B}\) and another 5% to \(\mathcal{C}\), then the following classification function will have an accuracy of \(0.9\):

\[\begin{aligned} classify(sample) := \begin{cases} \mathcal{A} & \text{if }\top \\ \end{cases}\end{aligned}\]

Yet, given that we know how \(classify\) works, it is obvious that it cannot tell the classes apart at all. Likewise, we can construct a classification function

\[\begin{aligned} classify(sample) := \text{guess} \begin{cases} \mathcal{A} & \text{with p } = 0.96 \\ \mathcal{B} & \text{with p } = 0.02 \\ \mathcal{C} & \text{with p } = 0.02 \\ \end{cases}\end{aligned}\]

which has an accuracy of \(0.96 \cdot 0.9 + 0.02 \cdot 0.05 \cdot 2 = 0.866\) and will not always predict \(\mathcal{A}\); but again, given that we know how \(classify\) works, it is obvious that it cannot tell the classes apart. Accuracy in this case only tells us how good our classification function is at guessing. This means that accuracy is not a good measure for telling an oracle apart from a useful test.
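
To make the arithmetic concrete, here is a minimal Python sketch (the variable names are made up for illustration) that computes the expected accuracy of this guessing classifier both analytically and by simulation:

```python
import random

# Class distribution and guessing probabilities from the example above.
class_probs = {"A": 0.90, "B": 0.05, "C": 0.05}   # how the samples are distributed
guess_probs = {"A": 0.96, "B": 0.02, "C": 0.02}   # how classify() guesses

# Expected accuracy: the guess is independent of the sample, so we sum
# P(sample is c) * P(guess is c) over all classes c.
expected = sum(class_probs[c] * guess_probs[c] for c in class_probs)
print(f"expected accuracy: {expected:.3f}")   # 0.9*0.96 + 0.05*0.02 + 0.05*0.02 = 0.866

# Monte-Carlo check of the same number.
random.seed(0)
classes = list(class_probs)
n = 100_000
samples = random.choices(classes, weights=[class_probs[c] for c in classes], k=n)
guesses = random.choices(classes, weights=[guess_probs[c] for c in classes], k=n)
hits = sum(s == g for s, g in zip(samples, guesses))
print(f"simulated accuracy: {hits / n:.3f}")  # close to 0.866
```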

Further, if we have many classes (say 20), an accuracy of 95% also sounds good, but it might, for example, mean that the classifier wrongly classifies all samples of class \(\mathcal{B}\) as class \(\mathcal{C}\). That not only makes it unable to recognize class \(\mathcal{B}\), it also reduces the confidence that a classification as \(\mathcal{C}\) is actually correct.

Accuracy per Class

We can compute the accuracy individually per class by giving our classification function only samples from a single class, counting the correct and incorrect classifications, and then computing \(accuracy := \text{correct}/(\text{correct} + \text{incorrect})\). We repeat this for every class. If we have a classification function that accurately recognizes class \(\mathcal{A}\) but outputs a random guess for the other classes, this results in an accuracy of \(1.00\) for \(\mathcal{A}\) and an accuracy of \(0.33\) for each of the other classes. This already provides a much better way to judge the performance of our classification function. An oracle that always guesses the same class will produce a per-class accuracy of \(1.00\) for that class, but \(0.00\) for the other classes. If our test is useful, all the per-class accuracies should be \(>0.5\); otherwise, our test isn't better than chance.

However, accuracy per class does not take false positives into account. Even though our classification function has a 100% accuracy for class \(\mathcal{A}\), there can still be false positives for \(\mathcal{A}\) (such as a \(\mathcal{B}\) wrongly classified as an \(\mathcal{A}\)). This means that accuracy per class doesn't really tell us how good our test is at recognizing non-\(\mathcal{A}\)s. As an extreme example, suppose we have 10 \(\mathcal{A}\)s and classify them all correctly. Then we have 1000 \(\mathcal{B}\)s and 1000 \(\mathcal{C}\)s, and we classify seven \(\mathcal{B}\)s and three \(\mathcal{C}\)s as \(\mathcal{A}\)s. Our per-class accuracies are 100% for \(\mathcal{A}\), 99.3% for \(\mathcal{B}\) and 99.7% for \(\mathcal{C}\), but our test didn't really recognize all non-\(\mathcal{A}\)s correctly: it classifies only 99.5% of the non-\(\mathcal{A}\)s as non-\(\mathcal{A}\), which is something that accuracy per class doesn't show directly.
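
As a quick illustration of that last point, here is a small sketch of the extreme example (the counts are exactly the ones from the paragraph above):

```python
# Correct/incorrect counts per class for the extreme example above:
# all 10 A's are classified correctly, 7 of 1000 B's and 3 of 1000 C's
# are wrongly classified as A, everything else is correct.
correct   = {"A": 10, "B": 993, "C": 997}
incorrect = {"A": 0,  "B": 7,   "C": 3}

for cls in correct:
    acc = correct[cls] / (correct[cls] + incorrect[cls])
    print(f"accuracy per class for {cls}: {acc:.3f}")
# A: 1.000, B: 0.993, C: 0.997

# What accuracy per class does not show: the fraction of non-A samples
# that are *not* classified as A, i.e. how well non-A's are recognized as such.
non_a_total = 1000 + 1000
non_a_called_a = 7 + 3
print(f"non-A's recognized as non-A: {1 - non_a_called_a / non_a_total:.3f}")  # 0.995
```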

Sensitivity and Specificity

In medical tests, sensitivity is defined as the ratio between the people correctly identified as having the disease and the number of people actually having the disease. Specificity is defined as the ratio between the people correctly identified as healthy and the number of people who are actually healthy. The number of people actually having the disease is the number of true positive test results plus the number of false negative test results. The number of actually healthy people is the number of true negative test results plus the number of false positive test results.

Binary Classification

In binary classification problems there are two classes \(\mathcal{P}\) and \(\mathcal{N}\). \(T_{n}\) refers to the number of samples that were correctly identified as belonging to class \(n\) and \(F_{n}\) refers to the number of samples that were falsely identified as belonging to class \(n\). In this case sensitivity and specificity are defined as follows:

\[\begin{aligned} sensitivity := \frac{T_{\mathcal{P}}}{T_{\mathcal{P}}+F_{\mathcal{N}}} \\ specificity := \frac{T_{\mathcal{N}}}{T_{\mathcal{N}}+F_{\mathcal{P}}}\end{aligned}\]

\(T_{\mathcal{P}}\) being the true positives, \(F_{\mathcal{N}}\) being the false negatives, \(T_{\mathcal{N}}\) being the true negatives and \(F_{\mathcal{P}}\) being the false positives. Thinking in terms of negatives and positives is fine for medical tests, but to get a better intuition we should think not in terms of negatives and positives but in terms of generic classes \(\alpha\) and \(\beta\). Then we can say that the number of samples correctly identified as belonging to \(\alpha\) is \(T_{\alpha}\) and the number of samples that actually belong to \(\alpha\) is \(T_{\alpha} + F_{\beta}\). The number of samples correctly identified as not belonging to \(\alpha\) is \(T_{\beta}\) and the number of samples actually not belonging to \(\alpha\) is \(T_{\beta} + F_{\alpha}\). This gives us the sensitivity and specificity for \(\alpha\), but we can also apply the same reasoning to the class \(\beta\). The number of samples correctly identified as belonging to \(\beta\) is \(T_{\beta}\) and the number of samples actually belonging to \(\beta\) is \(T_{\beta} + F_{\alpha}\). The number of samples correctly identified as not belonging to \(\beta\) is \(T_{\alpha}\) and the number of samples actually not belonging to \(\beta\) is \(T_{\alpha} + F_{\beta}\). We thus get a sensitivity and specificity per class:

\[\begin{aligned} sensitivity_{\alpha} := \frac{T_{\alpha}}{T_{\alpha}+F_{\beta}} \\ specificity_{\alpha} := \frac{T_{\beta}}{T_{\beta} + F_{\alpha}} \\ sensitivity_{\beta} := \frac{T_{\beta}}{T_{\beta}+F_{\alpha}} \\ specificity_{\beta} := \frac{T_{\alpha}}{T_{\alpha} + F_{\beta}} \\\end{aligned}\]

However, we observe that \(sensitivity_{\alpha} = specificity_{\beta}\) and \(specificity_{\alpha} = sensitivity_{\beta}\). This means that if we only have two classes we don't need a per-class sensitivity and specificity.
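
A small sketch with made-up counts (the numbers are purely illustrative) shows the definitions and this symmetry directly:

```python
# Hypothetical counts for a binary classifier, purely for illustration:
# T_x = samples correctly classified as x, F_x = samples falsely classified as x.
T_alpha, F_alpha = 90, 5
T_beta,  F_beta  = 80, 10

sensitivity_alpha = T_alpha / (T_alpha + F_beta)   # actual alphas recognized as alpha
specificity_alpha = T_beta  / (T_beta  + F_alpha)  # actual betas not called alpha
sensitivity_beta  = T_beta  / (T_beta  + F_alpha)  # actual betas recognized as beta
specificity_beta  = T_alpha / (T_alpha + F_beta)   # actual alphas not called beta

# The symmetry observed above holds by construction:
assert sensitivity_alpha == specificity_beta
assert specificity_alpha == sensitivity_beta
print(sensitivity_alpha, specificity_alpha)        # 0.9 and ~0.941
```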

N-ary Classification

Sensitivity and specificity per class aren't useful if we only have two classes, but we can extend them to multiple classes. Sensitivity and specificity are defined as:

\[\begin{aligned} \text{sensitivity} := \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \\ \text{specificity} := \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}} \\\end{aligned}\]

\(F_{n,i}\) refers to the false classification of an input of class \(n\) as class \(i\).

The true positives are simply \(T_{n}\), the false negatives are \(\sum_{i}(F_{n,i})\) and the false positives are \(\sum_{i}(F_{i,n})\). Finding the true negatives is a bit harder, but we can say that if a sample is correctly classified as belonging to a class different from \(n\), it counts as a true negative. This gives us at least \(\sum_{i}(T_{i}) - T_{n}\) true negatives. However, these aren't all the true negatives: every wrong classification where neither the input class nor the predicted class is \(n\) is also a true negative, because the sample correctly wasn't identified as belonging to \(n\). \(\sum_{i}(\sum_{k}(F_{i,k}))\) represents all wrong classifications. From this we have to subtract the cases where the input class was \(n\), i.e. the false negatives for \(n\), which are \(\sum_{i}(F_{n,i})\); and we also have to subtract the false positives for \(n\), \(\sum_{i}(F_{i,n})\), because they are false positives and not true negatives. We finally get \(\sum_{i}(T_{i}) - T_{n} + \sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{n,i}) - \sum_{i}(F_{i,n})\) true negatives. As a summary we have:

\[\begin{aligned} \text{true positives} := T_{n} \\ \text{true negatives} := \sum_{i}(T_{i}) - T_{n} + \sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{n,i}) - \sum_{i}(F_{i,n}) \\ \text{false positives} := \sum_{i}(F_{i,n}) \\ \text{false negatives} := \sum_{i}(F_{n,i})\end{aligned}\]

\[\begin{aligned} sensitivity(n) := \frac{T_{n}}{T_{n} + \sum_{i}(F_{n,i})} \\ specificity(n) := \frac{\sum_{i}(T_{i}) - T_{n} + \sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{n,i}) - \sum_{i}(F_{i,n})}{\sum_{i}(T_{i}) - T_{n} + \sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{n,i})}\end{aligned}\]
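
A direct translation of these formulas into Python might look as follows; the representation of the counts as dictionaries `T[n]` and `F[(n, i)]` is an assumption made for this sketch, not something prescribed above:

```python
# T[n]      : number of samples of class n correctly classified as n
# F[(n, i)] : number of samples of class n falsely classified as class i
def sensitivity(n, T, F):
    false_negatives = sum(F.get((n, i), 0) for i in T)
    return T[n] / (T[n] + false_negatives)

def specificity(n, T, F):
    false_negatives = sum(F.get((n, i), 0) for i in T)
    false_positives = sum(F.get((i, n), 0) for i in T)
    # True negatives, exactly as derived above: correct classifications of
    # other classes plus all misclassifications not involving n at all.
    true_negatives = (sum(T.values()) - T[n]
                      + sum(F.values())
                      - false_negatives
                      - false_positives)
    return true_negatives / (true_negatives + false_positives)
```

The example below plugs concrete counts into exactly these formulas.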

Example

Let's say we have three classes \(\mathcal{A}\), \(\mathcal{B}\) and \(\mathcal{C}\). We get the following values for \(T_{n}\) and \(F_{n,i}\) (the row \(F_{\mathcal{A},*}\) lists, per column, how many \(\mathcal{A}\) samples were misclassified as that class, and likewise for the other \(F\) rows):

Example classification
\(\mathcal{A}\) \(\mathcal{B}\) \(\mathcal{C}\)
\(T\) 86 3 3
\(F_{\mathcal{A},*}\) 0 1 1
\(F_{\mathcal{B},*}\) 0 0 3
\(F_{\mathcal{C},*}\) 0 3 0

What is our accuracy? Well, we have \(\sum_{i}(T_{i})+\sum_{i}(\sum_{k}(F_{i,k}))\) total samples, which in this case is 100 samples. Out of those we classify \(\sum_{i}(T_{i})\) correctly, thus:

\[\begin{aligned} accuracy := \frac{\sum_{i}(T_{i})}{\sum_{i}(T_{i})+\sum_{i}(\sum_{k}(F_{i,k}))}\end{aligned}\]

In this example we classify \(86 + 3 + 3 = 92\) samples correctly, thus our accuracy is \(0.92\). That sounds pretty good, but let's calculate the sensitivities and specificities:

Sensitivities and Specificities
\(\mathcal{A}\) \(\mathcal{B}\) \(\mathcal{C}\)
\(sensitivity\) \(0.977\) \(0.500\) \(0.500\)
\(specificity\) \(1.000\) \(0.957\) \(0.957\)
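
The numbers in both tables can be reproduced with a short, self-contained sketch; the `(actual, predicted)` dictionary layout is again just an illustrative choice, and the true negatives are computed as the total minus TP, FN and FP, which is equivalent to the formula above:

```python
# Counts from the example table, keyed as (actual class, predicted class).
counts = {
    ("A", "A"): 86, ("A", "B"): 1, ("A", "C"): 1,
    ("B", "B"): 3,  ("B", "C"): 3,
    ("C", "B"): 3,  ("C", "C"): 3,
}
classes = ["A", "B", "C"]
total = sum(counts.values())                     # 100
correct = sum(counts[(c, c)] for c in classes)   # 92
print(f"accuracy: {correct / total:.2f}")        # 0.92

for n in classes:
    tp = counts[(n, n)]
    fn = sum(v for (a, p), v in counts.items() if a == n and p != n)
    fp = sum(v for (a, p), v in counts.items() if a != n and p == n)
    tn = total - tp - fn - fp                    # equivalent to the formula above
    print(f"{n}: sensitivity {tp / (tp + fn):.3f}, specificity {tn / (tn + fp):.3f}")
# A: 0.977 / 1.000,  B: 0.500 / 0.957,  C: 0.500 / 0.957
```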

If our classification function responds with \(\mathcal{B}\), can we be 50.0% confident that this is correct? Absolutely not. The sensitivity only means that, given a \(\mathcal{B}\), the function will identify it correctly in 50.0% of cases; since there are also false positives for \(\mathcal{B}\), the answer \(\mathcal{B}\) isn't going to be correct in 50.0% of cases. We have 6 \(\mathcal{B}\)s and we identify 3 of them correctly, so we get an accuracy per class of 50%, which is actually the same thing as the sensitivity. But we get 7 identifications (either correct or wrong) as \(\mathcal{B}\); 3 of those are correct, 4 aren't, which gives us \(\frac{3}{3+4} = 0.429\), so we can be about 42.9% confident that this result is actually correct. On the other hand, we can see that 100% of non-\(\mathcal{A}\)s are correctly identified as not being \(\mathcal{A}\), and 95.7% of non-\(\mathcal{B}\)s (and likewise 95.7% of non-\(\mathcal{C}\)s) are correctly identified as not belonging to those classes.

While this kind of confidence depends on the distribution of the classes, sensitivity and specificity do not. We do, however, need to make sure that there are enough samples per class: if we only have four samples for some class \(\mathcal{D}\), we might get a sensitivity of 75% for \(\mathcal{D}\), but with a huge uncertainty, because it might as well be that our classification function actually has a sensitivity of 4% and those four correctly classified samples were exactly the samples our classification function is good at.

Introducing Confidence

We define a \(confidence^{\top}\), which is a measure of how confident we can be that the reply of our classification function is actually correct. \(T_{n} + \sum_{i}(F_{i,n})\) counts all the cases where the classification function replied with \(n\), but only \(T_{n}\) of those are correct. We thus define

\[\begin{aligned} confidence^{\top}(n) := \frac{T_{n}}{T_{n}+\sum_{i}(F_{i,n})}\end{aligned}\]

But can we also define a \(confidence^{\bot}\), which is a measure of how confident we can be that, if our classification function responds with a class different from \(n\), the sample actually wasn't an \(n\)?

Well, we get in total \(\sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{i,n}) + \sum_{i}(T_{i}) - T_{n}\) identifications as something different from \(n\), all of which are correct except \(\sum_{i}(F_{n,i})\). Thus, we define

\[\begin{aligned} confidence^{\bot}(n) := \frac{\sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{i,n}) + \sum_{i}(T_{i}) - T_{n}-\sum_{i}(F_{n,i})}{\sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{i,n}) + \sum_{i}(T_{i}) - T_{n}}\end{aligned}\]

Let's calculate the \(confidence\) for the example above:

\(confidence\)
\(\mathcal{A}\) \(\mathcal{B}\) \(\mathcal{C}\)
\(confidence^{\top}\) \(1.000\) \(0.429\) \(0.429\)
\(confidence^{\bot}\) \(0.857\) \(0.967\) \(0.967\)
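
These values can be reproduced the same way as before (up to rounding in the last digit):

```python
# Same counts as before: (actual class, predicted class) -> count.
counts = {
    ("A", "A"): 86, ("A", "B"): 1, ("A", "C"): 1,
    ("B", "B"): 3,  ("B", "C"): 3,
    ("C", "B"): 3,  ("C", "C"): 3,
}
classes = ["A", "B", "C"]
total = sum(counts.values())

for n in classes:
    answered_n     = sum(v for (a, p), v in counts.items() if p == n)  # T_n + false positives
    answered_not_n = total - answered_n                                # all replies other than n
    missed_n       = sum(v for (a, p), v in counts.items() if a == n and p != n)
    conf_top = counts[(n, n)] / answered_n
    conf_bot = (answered_not_n - missed_n) / answered_not_n
    print(f"{n}: confidence_top {conf_top:.3f}, confidence_bot {conf_bot:.3f}")
# A: 1.000 / 0.857,  B: 0.429 / 0.968,  C: 0.429 / 0.968
```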

Confidence is however very dependent on the distribution of classes.

Prevalence Independence

Which measure do we have that is independent of the prevalence/distribution of classes? If we go back to our medical test, we have \(p := \text{prevalence}\) ill people and \(1-p\) healthy people. Suppose our test is perfect at identifying ill people, so it has a sensitivity of \(1.0\), but it has a specificity of only \(0.7\), meaning it identifies healthy people correctly only 70% of the time (the rest are diagnosed as ill although they are actually healthy). If we have \(p\) ill people and \(1-p\) healthy people, then a perfect test must diagnose exactly \(p\) people as ill and \(1-p\) people as healthy. We can calculate the fraction of people a test identifies (either correctly or wrongly) as ill: \(p \cdot sensitivity + (1-p) \cdot (1-specificity)\). Thus, \(m := p \cdot sensitivity + (1-p) \cdot (1-specificity) - p\) is a measure of over/under-diagnosis: its magnitude tells us by how much we over- or under-diagnose, and its sign tells us which of the two it is. Of course, a test can identify the correct number of people as ill while, in theory, none of them is actually ill, so this is not a true measure of MISdiagnosis, only of over/under-diagnosis. We can observe that, regardless of the prevalence, we over/under-diagnose by at most \(1.0-min(specificity, sensitivity)\).

We could also calculate the accuracy, which is given as \(accuracy := p \cdot sensitivity + (1-p) \cdot specificity\) and is bounded from below by \(min(specificity, sensitivity)\), so we can say that, regardless of prevalence, our test has a worst-case accuracy of \(min(specificity, sensitivity)\). Going back to our example, this means that if we want to know how confident we can be without knowing the distribution of classes, we can take \(T := min(\min_{i}(specificity(i)),\min_{i}(sensitivity(i)))\) as an absolute worst-case measure of how good our classification function is. However, this will be in terms of worst-case accuracy, not worst-case confidence. Why? Because accuracy is nothing but a linear interpolation of the form

\[\begin{aligned} sensitivity \cdot prevalence + specificity \cdot (1-prevalence) = a \cdot p + b \cdot (1 - p)\end{aligned}\]

which always returns a value between \(a\) and \(b\). If the prevalence is zero, the accuracy equals the specificity; if the prevalence is one, the accuracy equals the sensitivity; for any prevalence in between, the accuracy lies somewhere between the two.
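
A tiny sketch of this interpolation, using the sensitivity of \(1.0\) and specificity of \(0.7\) from the medical-test example, shows that the accuracy always stays between the two values, so \(min(specificity, sensitivity)\) is indeed the worst case over all prevalences:

```python
# Accuracy as a function of prevalence p for the medical-test example:
# sensitivity 1.0, specificity 0.7.
sensitivity, specificity = 1.0, 0.7

def accuracy(p):
    # Linear interpolation between specificity (p = 0) and sensitivity (p = 1).
    return p * sensitivity + (1 - p) * specificity

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"prevalence {p:.2f}: accuracy {accuracy(p):.3f}")
# The accuracy never drops below min(sensitivity, specificity) = 0.7,
# which is therefore the worst-case accuracy over all prevalences.
```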

We can also use

\[\begin{aligned} Y_{s} := \frac{-N + \sum_{i}(sensitivity_{i}+specificity_{i})}{N}\end{aligned}\]

and

\[\begin{aligned} Y_{c} := \frac{-N + \sum_{i}(confidence^{\top}_{i}+confidence^{\bot}_{i})}{N}\end{aligned}\]

(where \(N\) is the number of classes).

If one wants to optimize for a specific class it's possible to add weights:

\[\begin{aligned} Y_{s} := \frac{-N + \sum_{i}(sensitivity_{i}\cdot 2\cdot w_{i}+specificity_{i}\cdot 2 \cdot (1-w_{i}))}{N}\end{aligned}\]

and

\[\begin{aligned} Y_{c} := \frac{-N + \sum_{i}(confidence^{\top}_{i}\cdot 2\cdot w_{i}+confidence^{\bot}_{i}\cdot 2 \cdot (1-w_{i}))}{N}\end{aligned}\]

However, \(Y_{c}\) is not prevalence independent.
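
For the example above, \(Y_{s}\) and \(Y_{c}\) can be computed from the per-class values in the earlier tables; the weights used in the weighted variant below are hypothetical and only illustrate the mechanism:

```python
# Per-class sensitivities, specificities and confidences from the example above.
sens     = {"A": 0.977, "B": 0.500, "C": 0.500}
spec     = {"A": 1.000, "B": 0.957, "C": 0.957}
conf_top = {"A": 1.000, "B": 0.429, "C": 0.429}
conf_bot = {"A": 0.857, "B": 0.967, "C": 0.967}
N = len(sens)

Y_s = (sum(sens[c] + spec[c] for c in sens) - N) / N
Y_c = (sum(conf_top[c] + conf_bot[c] for c in conf_top) - N) / N
print(f"Y_s = {Y_s:.3f}, Y_c = {Y_c:.3f}")

# Weighted variant: w[c] = 0.5 reproduces the unweighted score, a larger w[c]
# emphasizes the sensitivity of class c, a smaller w[c] its specificity.
w = {"A": 0.5, "B": 0.8, "C": 0.5}   # hypothetical weights, only for illustration
Y_s_weighted = (sum(sens[c] * 2 * w[c] + spec[c] * 2 * (1 - w[c]) for c in sens) - N) / N
print(f"weighted Y_s = {Y_s_weighted:.3f}")
```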

Normalization

We can express our \(T\) and \(F\) values as fractions of the number of samples of the corresponding class rather than as absolute counts, and calculate our measures on these normalized values. However, this implies the assumption that the classes in the real world are distributed uniformly.

Example classification
\(\mathcal{A}\) \(\mathcal{B}\) \(\mathcal{C}\)
\(T\) 86 3 3
\(F_{\mathcal{A},*}\) 0 1 1
\(F_{\mathcal{B},*}\) 0 0 3
\(F_{\mathcal{C},*}\) 0 3 0

We can turn this into:

Normalized
\(\mathcal{A}\) \(\mathcal{B}\) \(\mathcal{C}\)
\(T\) 0.977 0.500 0.500
\(F_{\mathcal{A},*}\) 0.000 0.011 0.011
\(F_{\mathcal{B},*}\) 0.000 0.000 0.500
\(F_{\mathcal{C},*}\) 0.000 0.500 0.000

And compute our measurements:

Measurements
\(\mathcal{A}\) \(\mathcal{B}\) \(\mathcal{C}\)
\(sensitivity\) \(0.978\) \(0.500\) \(0.500\)
\(specificity\) \(1.000\) \(0.744\) \(0.744\)
\(confidence^{\top}\) \(1.000\) \(0.495\) \(0.495\)
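
The normalization and the resulting measures can be reproduced with the following sketch; small differences to the table above come from rounding the normalized values before computing the measures there, whereas the sketch keeps the exact fractions:

```python
# Raw counts from the example: (actual class, predicted class) -> count.
counts = {
    ("A", "A"): 86, ("A", "B"): 1, ("A", "C"): 1,
    ("B", "B"): 3,  ("B", "C"): 3,
    ("C", "B"): 3,  ("C", "C"): 3,
}
classes = ["A", "B", "C"]

# Normalize each row by the number of samples of that actual class.
# This is the same as pretending all classes are equally frequent.
per_class = {c: sum(v for (a, _), v in counts.items() if a == c) for c in classes}
norm = {(a, p): v / per_class[a] for (a, p), v in counts.items()}

total = sum(norm.values())   # equals the number of classes, since each row sums to 1
for n in classes:
    tp = norm[(n, n)]
    fn = sum(v for (a, p), v in norm.items() if a == n and p != n)
    fp = sum(v for (a, p), v in norm.items() if a != n and p == n)
    tn = total - tp - fn - fp
    print(f"{n}: sensitivity {tp / (tp + fn):.3f}, "
          f"specificity {tn / (tn + fp):.3f}, "
          f"confidence_top {tp / (tp + fp):.3f}")
```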

Conclusion

Sensitivity and specificity are good measures for rating classification performance. Additionally, confidence can be calculated to estimate the probability that a classification is actually correct, given the distribution of classes.