Confusion Matrix for Experiment Readouts

A confusion matrix turns binary outcomes into countable diagnostics. In experiment and measurement workflows, it helps separate model quality from class balance. Each observation falls into one of four cells:

  • True positive (TP): predicted positive and observed positive
  • False positive (FP): predicted positive and observed negative
  • False negative (FN): predicted negative and observed positive
  • True negative (TN): predicted negative and observed negative

2x2 count table

For binary outcomes, keep counts in a standard layout with marginals:

| Observed \ Predicted | + | - | Row total |
|---|---|---|---|
| + | TP | FN | TP + FN |
| - | FP | TN | FP + TN |
| Column total | TP + FP | FN + TN | N |

From this table:

\[ \begin{aligned} \text{TPR} &= \text{sensitivity} = \text{recall} = \frac{TP}{TP + FN}, \\ \text{TNR} &= \text{specificity} = \frac{TN}{TN + FP}, \\ \text{PPV} &= \text{precision} = \frac{TP}{TP + FP}, \\ \text{NPV} &= \frac{TN}{TN + FN}, \\ \text{accuracy} &= \frac{TP + TN}{TP + FP + FN + TN}, \\ F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. \end{aligned} \]
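The definitions above translate directly into code. The following is a minimal sketch, not a reference implementation: the function name `confusion_rates` and the choice to return NaN for an empty denominator are assumptions made for illustration, and the counts are the ones used in the worked example later in this section.

```python
def confusion_rates(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Return the rates defined above from the four confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        # An empty denominator leaves the rate undefined; report NaN (assumed convention).
        return num / den if den else float("nan")

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "tpr": recall,
        "tnr": ratio(tn, tn + fp),
        "ppv": precision,
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Counts taken from the worked example: TP = 84, FP = 16, FN = 24, TN = 176.
rates = confusion_rates(tp=84, fp=16, fn=24, tn=176)
for name, value in rates.items():
    print(f"{name:>8}: {value:.2%}")
```

Keeping each rate next to its own denominator makes it easier to attach the right sample size when estimating uncertainty.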

Standard deviation under model assumptions

Rate uncertainty depends on the data-generating assumption behind each metric denominator.

Assumption A: independent Bernoulli trials (binomial model)

For a rate \(\hat p = x/n\) (for example recall with \(x=TP\), \(n=TP+FN\)),

\[ \operatorname{SD}(\hat p) = \sqrt{\frac{\hat p(1-\hat p)}{n}}. \]

Derivation of the SD equation

Let \(Y_i \in \{0,1\}\) indicate whether trial \(i\) is a success, with \(\Pr(Y_i=1)=p\), and assume \(Y_1,\dots,Y_n\) are independent and identically distributed. Define the sample proportion:

\[ \hat p = \frac{1}{n}\sum_{i=1}^{n} Y_i. \]

Because a Bernoulli variable satisfies \(\operatorname{Var}(Y_i)=p(1-p)\),

\[ \operatorname{Var}(\hat p) = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(Y_i) = \frac{1}{n^2}\cdot n\,p(1-p) = \frac{p(1-p)}{n}. \]

So the standard deviation is

\[ \operatorname{SD}(\hat p)=\sqrt{\operatorname{Var}(\hat p)} = \sqrt{\frac{p(1-p)}{n}}. \]

In practice, \(p\) is unknown, so replace it with \(\hat p\) to get the plug-in estimate used above:

\[ \widehat{\operatorname{SD}}(\hat p) = \sqrt{\frac{\hat p(1-\hat p)}{n}}. \]

This is the default planning approximation for recall, precision, specificity, and accuracy when outcomes are independent inside each denominator group.
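As a small illustration of the plug-in estimate, the sketch below computes the binomial SD for a rate \(x/n\). The helper name `binomial_sd` is hypothetical, and the counts (TP = 84, FN = 24) are borrowed from the worked example in this section.

```python
import math


def binomial_sd(x: int, n: int) -> float:
    """Plug-in binomial SD of the rate x/n, assuming independent trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= x <= n:
        raise ValueError("x must lie between 0 and n")
    p_hat = x / n
    return math.sqrt(p_hat * (1 - p_hat) / n)


# Recall uses x = TP and n = TP + FN.
sd_recall = binomial_sd(84, 84 + 24)
print(f"recall SD: {sd_recall:.2%}")
```

Note that the plug-in SD is exactly zero when \(x = 0\) or \(x = n\), which understates the real uncertainty for small \(n\).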

Assumption B: large-sample normal approximation

If \(n\hat p\) and \(n(1-\hat p)\) are both reasonably large (common rules of thumb ask for at least 5 or 10 each), use

\[ \hat p \pm z_{\alpha/2}\,\operatorname{SD}(\hat p) \]

as an approximate confidence interval. Typical multipliers are:

  • \(z \approx 1\) for about 68% coverage
  • \(z \approx 1.96\) for about 95% coverage
  • \(z \approx 3\) for conservative 3-SD screening (about 99.7% coverage)
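The interval above can be sketched with the standard library alone. The function name `wald_interval` is an assumption for illustration, and clipping the bounds to \([0, 1]\) is a convenience added here rather than part of the formula.

```python
import math
from statistics import NormalDist


def wald_interval(x: int, n: int, coverage: float = 0.95) -> tuple[float, float]:
    """Normal-approximation interval p_hat +/- z * SD(p_hat)."""
    p_hat = x / n
    # Two-sided multiplier z_{alpha/2}; about 1.96 for 95% coverage.
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    # Clip to the valid range of a proportion (added convenience).
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Recall with TP = 84 and TP + FN = 108 from the worked example.
low, high = wald_interval(84, 108)
print(f"95% interval for recall: {low:.2%} to {high:.2%}")
```

When \(n\hat p\) or \(n(1-\hat p)\) is small, this approximation can undercover, so the result is best treated as a rough planning figure.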

Assumption C: overdispersed Bernoulli (beta-binomial style)

When repeated measurements are correlated (batch effects, instrument drift, cohort structure), inflate variance with a dispersion factor \(\phi \ge 1\):

\[ \operatorname{SD}_{\text{over}}(\hat p) = \sqrt{\phi}\,\operatorname{SD}(\hat p). \]

A beta-binomial model makes that inflation explicit:

\[ X \mid \theta \sim \operatorname{Binomial}(n, \theta), \qquad \theta \sim \operatorname{Beta}(\alpha, \beta), \qquad \hat p = \frac{X}{n}. \]

If \(\mu = \mathbb{E}[\theta]\), then

\[ \mathbb{E}[\hat p] = \mu, \qquad \operatorname{Var}(\hat p) = \frac{\mu(1-\mu)}{n}\,\Bigl[1 + (n-1)\rho\Bigr], \]

where \(\rho = 1/(\alpha + \beta + 1)\) is the intra-class correlation. In the notation above, \(\phi = 1 + (n-1)\rho\).

Before introducing \(\phi\), the CI width is fixed by \(\hat p\) and \(n\) through the binomial standard deviation. After introducing \(\phi\), the same \(\hat p\) can produce a wider interval because the width is multiplied by \(\sqrt{\phi}\).
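To see how quickly \(\phi\) grows with \(n\), the sketch below evaluates the beta-binomial expressions above. The function name `beta_binomial_sd` and the parameter values \(\alpha = 8\), \(\beta = 2\), \(n = 100\) are hypothetical and chosen only to give round numbers.

```python
import math


def beta_binomial_sd(n: int, alpha: float, beta: float) -> tuple[float, float, float]:
    """Return (mu, phi, SD of p_hat) under the beta-binomial model above."""
    mu = alpha / (alpha + beta)           # mean of the Beta(alpha, beta) prior
    rho = 1.0 / (alpha + beta + 1.0)      # intra-class correlation
    phi = 1.0 + (n - 1) * rho             # dispersion factor
    sd_binomial = math.sqrt(mu * (1 - mu) / n)
    return mu, phi, math.sqrt(phi) * sd_binomial


# Hypothetical parameters, not taken from any dataset.
mu, phi, sd_over = beta_binomial_sd(n=100, alpha=8.0, beta=2.0)
print(f"mu = {mu:.2f}, phi = {phi:.2f}, overdispersed SD = {sd_over:.4f}")
```

With these values the binomial SD of 0.04 is inflated by a factor of about \(\sqrt{10}\), which shows that even a modest \(\rho\) dominates the uncertainty once \(n\) is large.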

Figure 1: Approximate 95% confidence-interval width versus \(\hat{p}\) under different dispersion factors.

Use the overdispersed model when empirical residuals are wider than binomial expectations.

Example matrix

Table 1

| Observed \ Predicted | Predicted + | Predicted - | Row total |
|---|---|---|---|
| + | 84 | 24 | 108 |
| - | 16 | 176 | 192 |
| Column total | 100 | 200 | 300 |

Table 2

| Metric | Estimate | n | SD | -1 SD | +1 SD | -2 SD | +2 SD |
|---|---|---|---|---|---|---|---|
| TPR (Recall) | 77.78% | 108 | 4.00% | 73.78% | 81.78% | 69.78% | 85.78% |
| PPV (Precision) | 84.00% | 100 | 3.67% | 80.33% | 87.67% | 76.67% | 91.33% |
| TNR (Specificity) | 91.67% | 192 | 1.99% | 89.67% | 93.66% | 87.68% | 95.66% |
| NPV | 88.00% | 200 | 2.30% | 85.70% | 90.30% | 83.40% | 92.60% |
| Accuracy | 86.67% | 300 | 1.96% | 84.70% | 88.63% | 82.74% | 90.59% |

In this worked example, precision and recall are estimated on smaller denominators (100 and 108) than accuracy (300), so their SD values are larger. That is expected: fewer effective trials imply greater rate uncertainty.
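The values in Table 2 can be reproduced from the Table 1 counts with a few lines of Python. This is a sketch under the binomial assumption (Assumption A); the dictionary layout and variable names are illustrative choices.

```python
import math

# Counts from Table 1.
tp, fn, fp, tn = 84, 24, 16, 176

# Each metric is a (successes, trials) pair matching its own denominator.
metrics = {
    "TPR (Recall)": (tp, tp + fn),
    "PPV (Precision)": (tp, tp + fp),
    "TNR (Specificity)": (tn, tn + fp),
    "NPV": (tn, tn + fn),
    "Accuracy": (tp + tn, tp + fp + fn + tn),
}

rows = {}
for name, (x, n) in metrics.items():
    p_hat = x / n
    sd = math.sqrt(p_hat * (1 - p_hat) / n)  # plug-in binomial SD
    rows[name] = (p_hat, n, sd)
    print(
        f"{name:<18} {p_hat:7.2%} {n:4d} {sd:6.2%} "
        f"{p_hat - sd:7.2%} {p_hat + sd:7.2%} "
        f"{p_hat - 2 * sd:7.2%} {p_hat + 2 * sd:7.2%}"
    )
```

Each printed row lists the estimate, its denominator, the SD, and the ±1 SD and ±2 SD bands in the same column order as Table 2.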

Rate profile figure

Figure 2: Rate profile derived from the confusion matrix example.

Interactive companion

Use the interactive tool at ED confusion matrix workbench to enter integer counts, calculate rates, and inspect the live distribution chart.

Class imbalance

When positive and negative observations are very unbalanced, the standard confusion matrix metrics behave differently and some become misleading.

Accuracy is unreliable under imbalance

A classifier that labels every observation negative achieves accuracy equal to the negative prevalence:

\[ \text{accuracy}_{\text{trivial}} = \frac{TN}{TP + FP + FN + TN} = \frac{N_-}{N}, \]

where \(N_-\) is the count of negative observations. If 95% of observations are negative, this trivial classifier reports 95% accuracy while detecting no positives at all.

Prevalence shifts PPV and NPV

PPV and NPV depend on the prevalence \(\pi = (TP + FN)/N\) through Bayes’ theorem even when sensitivity and specificity are fixed:

\[ \text{PPV} = \frac{\text{TPR} \cdot \pi}{\text{TPR} \cdot \pi + (1 - \text{TNR}) \cdot (1-\pi)}, \]

\[ \text{NPV} = \frac{\text{TNR} \cdot (1-\pi)}{\text{TNR} \cdot (1-\pi) + (1 - \text{TPR}) \cdot \pi}. \]

At low prevalence, PPV collapses even with high sensitivity and specificity. This is the false positive paradox common in rare-event screening: most positive calls are wrong.
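The collapse of PPV is easy to see numerically. The sketch below applies the two Bayes formulas above; the function name `ppv_npv` and the illustrative sensitivity and specificity of 0.90 are assumptions, not values from any dataset.

```python
def ppv_npv(tpr: float, tnr: float, prevalence: float) -> tuple[float, float]:
    """PPV and NPV implied by sensitivity, specificity, and prevalence."""
    tp_share = tpr * prevalence
    fp_share = (1 - tnr) * (1 - prevalence)
    tn_share = tnr * (1 - prevalence)
    fn_share = (1 - tpr) * prevalence
    nan = float("nan")
    # A rate is undefined when no calls of that kind are made.
    ppv = tp_share / (tp_share + fp_share) if tp_share + fp_share > 0 else nan
    npv = tn_share / (tn_share + fn_share) if tn_share + fn_share > 0 else nan
    return ppv, npv


# Hypothetical classifier with 90% sensitivity and 90% specificity.
for pi in (0.50, 0.10, 0.01):
    ppv, npv = ppv_npv(tpr=0.90, tnr=0.90, prevalence=pi)
    print(f"prevalence {pi:5.0%}: PPV {ppv:6.1%}, NPV {npv:6.1%}")
```

With the classifier held fixed, PPV falls from 90% at 50% prevalence to roughly 8% at 1% prevalence, while NPV moves toward 100%.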

Imbalance example

Table 3: Imbalanced example: 50 positives, 950 negatives.

| Observed \ Predicted | Predicted + | Predicted - | Row total |
|---|---|---|---|
| + | 40 | 10 | 50 |
| - | 95 | 855 | 950 |
| Column total | 135 | 865 | 1000 |

Table 4: Metric comparison: balanced vs. imbalanced data (similar sensitivity and specificity).

| Metric | Balanced | Imbalanced |
|---|---|---|
| Prevalence | 36.0% | 5.0% |
| Accuracy | 86.7% | 89.5% |
| Recall (TPR) | 77.8% | 80.0% |
| Specificity (TNR) | 91.7% | 90.0% |
| Precision (PPV) | 84.0% | 29.6% |
| F1 | 80.8% | 43.2% |
| MCC | 0.707 | 0.446 |

Accuracy is similar in both examples, and even slightly higher in the imbalanced one, which hides the drop in PPV from 84.0% to 29.6%.

Associated metrics under imbalance

  • Recall (TPR) measures detection of actual positives and does not depend on \(N_-\).
  • Precision (PPV) measures reliability of positive calls and is sensitive to the FP count.
  • F1 score is the harmonic mean of precision and recall, penalising either being low.
  • Balanced accuracy averages TPR and TNR, giving equal weight to both classes regardless of count:

\[ \text{balanced accuracy} = \frac{\text{TPR} + \text{TNR}}{2}. \]

  • Matthews Correlation Coefficient (MCC) summarises all four cells in a single value between −1 and +1, robust to large imbalances:

\[ \text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}. \]

An MCC of 0 corresponds to performance no better than random guessing regardless of class balance; +1 indicates perfect prediction and −1 indicates total disagreement.
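Balanced accuracy and MCC from the list above can be computed directly from the four counts. The sketch below uses the counts from Table 1 and Table 3; the function names are illustrative, and returning 0.0 when a marginal is empty is one common convention that is assumed here.

```python
import math


def balanced_accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Mean of TPR and TNR; assumes both classes are present."""
    return (tp / (tp + fn) + tn / (tn + fp)) / 2


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from the four cells."""
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # An empty row or column makes MCC undefined; 0.0 is an assumed convention.
    return (tp * tn - fp * fn) / den if den else 0.0


balanced = dict(tp=84, fp=16, fn=24, tn=176)    # Table 1
imbalanced = dict(tp=40, fp=95, fn=10, tn=855)  # Table 3

for label, cells in (("balanced", balanced), ("imbalanced", imbalanced)):
    print(
        f"{label:>10}: balanced accuracy {balanced_accuracy(**cells):.3f}, "
        f"MCC {mcc(**cells):.3f}"
    )
```

Balanced accuracy is about 0.85 in both cases because it depends only on TPR and TNR, whereas MCC drops in the imbalanced case because it also reflects the weaker precision.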

Planning under imbalance

When designing an experiment with rare positives:

  • Set sample size against recall and precision targets, not accuracy.
  • Report both PPV and NPV together with the observed prevalence so readers can recalibrate for different populations.
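  • Use stratified sampling or oversampling of the minority class to ensure the positive denominator is large enough for stable rate estimates. Because enrichment changes the sample prevalence, recompute PPV and NPV at the target population's prevalence before reporting them.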
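One way to turn the first and third points into numbers is to invert the binomial SD formula: a target SD \(s\) for a rate near \(p\) needs roughly \(n \ge p(1-p)/s^2\) observations in that rate's denominator. The sketch below is a rough planning aid under the independent-trials assumption; the function names and the target values (recall near 0.80, SD of 0.03, prevalence of 5%) are hypothetical.

```python
import math


def positives_needed(expected_rate: float, target_sd: float) -> int:
    """Smallest n with binomial SD at or below target_sd for the expected rate."""
    if not 0 < target_sd:
        raise ValueError("target_sd must be positive")
    return math.ceil(expected_rate * (1 - expected_rate) / target_sd ** 2)


def total_needed(n_positive: int, prevalence: float) -> int:
    """Approximate total sample size under simple random sampling."""
    if not 0 < prevalence <= 1:
        raise ValueError("prevalence must be in (0, 1]")
    return math.ceil(n_positive / prevalence)


# Hypothetical planning targets.
n_pos = positives_needed(expected_rate=0.80, target_sd=0.03)
n_total = total_needed(n_pos, prevalence=0.05)
print(f"positives needed: {n_pos}, total under random sampling: about {n_total}")
```

The gap between the two numbers is the practical argument for stratified or enriched sampling: collecting the required positives directly avoids scaling the whole study by \(1/\pi\).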
