2025 Statistical Methods in Imaging Conference
2025-05-21
High accuracy makes it difficult to differentiate model performance.
Benchmark data for high-risk fields such as medical imaging needs adequate uncertainty quantification.
Data: COVID-19 Radiography Database from Kaggle
X-ray images of 4 lung conditions: Normal, COVID, lung opacity (LO), and viral pneumonia (VP)
Each image is 299 × 299 pixels; the classes contain 10,192, 3,616, 6,012, and 1,345 images, respectively.
Classical deep learning leads to
Overfitting
Overconfidence / inadequate calibration / poor uncertainty quantification
Bayesian approaches come into play because
they incorporate regularization through the prior
they provide a systematic and unified framework for quantifying uncertainty in predictions via Bayesian Model Averaging (BMA)
BMA can be considered an ensembling method (Jospin et al. 2022; Wilson and Izmailov 2020) that has been shown to improve predictive performance
Bayesian deep learning is an emerging field, making it worthwhile to survey existing methods and compare and contrast their differences, strengths, and weaknesses.
👉 How do BNN algorithms compare to classical deep neural networks (DNNs) and machine learning (ML) algorithms?
👉 Which methods perform best in terms of accuracy and uncertainty quality?
👉 When is it better to use which methods?
Method | Full Method Name | Method | Full Method Name |
---|---|---|---|
SGLD | Stochastic gradient Langevin dynamics | SWAG | Stochastic weight averaging Gaussian |
pSGLD | Preconditioned SGLD | MSWAG | MultiSWAG |
SGHMC | Stochastic gradient Hamiltonian Monte Carlo | SGD | Stochastic gradient descent |
BBB | Bayes by backpropagation | GB | Gradient boosting |
MCD | Monte Carlo dropout | RF | Random forests |
LIVI | Linearized implicit variational inference | GP | Gaussian process classification |
KFL | Kronecker factored Laplace | SVM | Support vector machine |
LL | Low-rank Laplace | KNN | K-nearest neighbors |
DL | Diagonal Laplace | LR | Logistic regression |
SL | Subnetwork Laplace | DT | Decision tree classifier |
DE | Deep ensembles | NB | Naive Bayes |
AlexNet (2012) (Krizhevsky, Sutskever, and Hinton 2017) is chosen for its historical importance, simplicity, and common use as a benchmark.
It still outperforms some modern architectures such as VGG16 and ResNet-50 on lung image data (Jaffar, Khan, and Mosavi 2022).
With a prior distribution on the weights $\pi(\theta)$ and training data $\mathcal{D}$,
$$p(y^* \mid x^*, \mathcal{D}, \text{BNN}) = \int_{\Theta} p(y^* \mid x^*, \theta', \text{BNN})\, \pi(\theta' \mid \mathcal{D}, \text{BNN})\, d\theta'$$
[Pros] Aggregates different sets of weights, each producing a high-performing network that interprets the data differently while achieving comparable accuracy.
BMA assumes there is exactly one true model described by only one set of parameters that generates the data. (Minka 2002; Monteith et al. 2011)
[Cons] When data are generated from a combination of models, overfitting remains even when BMA is performed. (Domingos 2000)
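As a minimal illustration of how the BMA integral above is approximated in practice, the sketch below averages softmax outputs over posterior weight samples. It is a hedged PyTorch sketch, not the code used in this study; `bma_predict`, `model`, and `posterior_samples` (a list of state dicts) are hypothetical names.

```python
# Sketch only (assumed PyTorch setup): Monte Carlo approximation of the BMA
# predictive by averaging class probabilities over S posterior weight samples.
import torch

@torch.no_grad()
def bma_predict(model, posterior_samples, x):
    """posterior_samples: list of state_dicts, each a draw from pi(theta | D)."""
    probs = []
    for state_dict in posterior_samples:
        model.load_state_dict(state_dict)   # plug in one sampled set of weights
        model.eval()
        probs.append(torch.softmax(model(x), dim=-1))
    return torch.stack(probs).mean(dim=0)   # approximate p(y* | x*, D)
```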
Stochastic Gradient MCMC (SG-MCMC) is a marriage of SGD and MCMC that uses minibatches to make computation feasible.
SGLD (Welling and Teh 2011) | pSGLD (Li et al. 2016) | SGHMC (Chen, Fox, and Guestrin 2014) |
---|---|---|
SGD + Langevin Dynamics | SGLD + preconditioner | SGD + modified Hamiltonian Dynamics |
suffers from pathological curvature | makes the curvature similar in all directions | the noise injected by the stochastic gradient breaks the dynamics |
mode collapse | moves faster away from ill-conditioned areas | adds a friction term to the momentum update in HMC |
The learning rate $\epsilon_t$ decreases toward zero, resulting in highly autocorrelated samples.
In practice, $\epsilon_t$ is fixed at a small value for later steps.
In practice, the posterior sample size is small, leading to more variable and biased results.
Computationally intensive and memory hungry.
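For concreteness, a minimal sketch of one SGLD update is given below; it is an illustration under assumed PyTorch conventions, not the authors' implementation, and `sgld_step` with its arguments is a hypothetical helper. It shows how the injected Gaussian noise, scaled by the step size, turns an SGD step into an approximate posterior sampling step.

```python
# Minimal SGLD sketch: theta <- theta + (eps/2) * grad_log_posterior + N(0, eps).
import torch

def sgld_step(params, grad_log_post, step_size):
    """grad_log_post: minibatch estimate of the gradient of log pi(theta | D),
    i.e., grad log prior + (N / batch_size) * sum of per-example log-likelihood
    gradients. The injected noise has variance equal to the step size eps."""
    for p, g in zip(params, grad_log_post):
        noise = torch.randn_like(p) * step_size ** 0.5
        p.data.add_(0.5 * step_size * g + noise)
```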
VI methods embrace the stochasticity introduced by SGD, leading to stochastic variational inference (SVI): Bayes by backpropagation (BBB), Monte Carlo dropout (MCD), and Linearized implicit variational inference (LIVI).
BBB (Blundell et al. 2015) | MCD (Gal and Ghahramani 2016) | LIVI (Uppal et al. 2023) |
---|---|---|
maximize the approximated ELBO using samples $\theta^{(s)} \sim q(\theta \mid \eta)$ | $q(\mathbf{W})$ is such that $\mathbf{W}_l = \mathbf{M}_l \cdot \mathrm{diag}\big([z_{l,j}]_{j=1}^{K_{l-1}}\big)$ with $z_{l,j} \sim \mathrm{Bernoulli}(p_l)$ | $q(\theta \mid \eta) = \int q(\theta \mid \eta, z)\, q(z)\, dz$ with $q(\theta \mid \eta, z) = N(g_\eta(z), \sigma^2 I)$ and $q(z) = N(0, I)$ |
reparameterization trick to enable backpropagation | equivalent to an approximate deep Gaussian process (GP) (Damianou and Lawrence 2013) | specifies the correlation of parameters |
the number of parameters is doubled | adding a normal prior on $\mathbf{W}$ is equivalent to L2 regularization of the dropout NN or GP | approximates the ELBO by a computable LIVI bound |
increased computational costs | high memory usage | |
convergence issues | | small sample sizes lead to variability in predictive accuracy and underconfident predictions |
mode collapse | | |
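Among the VI methods in the table, MCD is the simplest to sketch: dropout is left active at test time and several stochastic forward passes are averaged. The snippet below is an assumed-setup illustration for any PyTorch model containing `nn.Dropout` layers; `mc_dropout_predict` and `n_samples` are illustrative names, not part of the study's code.

```python
# MC dropout sketch: average T stochastic forward passes with dropout left on.
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=50):
    model.train()  # keeps nn.Dropout stochastic (caveat: also affects batch norm)
    probs = torch.stack(
        [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
    )
    return probs.mean(dim=0), probs.std(dim=0)  # predictive mean and spread
```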
Gaussian approximation includes Laplace methods and Stochastic Weight Averaging Gaussian.
Laplace approximation: $\pi(\theta \mid \mathcal{D}) \approx N\big(\theta^*, H(\theta^*)^{-1}\big)$, where $H(\theta^*)$ is the Hessian of the negative log posterior $-\log \pi(\theta \mid \mathcal{D})$ evaluated at the MAP estimate $\theta^*$.
Infeasible to compute the entire $H$.
Hessian approximation methods:
DL (Farquhar, Smith, and Gal 2020) | KFL (Ritter, Botev, and Barber 2018) | SL (Daxberger et al. 2022) |
---|---|---|
$H$ is diagonal | approximates $H$ by a block-diagonal matrix | treats a subset of weights probabilistically |
naive for CNNs | $H_l \approx V_l \otimes U_l$ | keeps the remaining parameters at their MAP values |
 | captures essential second-order curvature | infers a full-covariance Gaussian posterior over the subnetwork |
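To make the table concrete, here is a rough sketch of the diagonal (DL) variant, assuming a PyTorch classifier already trained to its MAP estimate. The Hessian diagonal is approximated with squared minibatch gradients (a crude stand-in for the per-example empirical Fisher) plus the prior precision; `diagonal_laplace_std`, `loader`, and `prior_precision` are illustrative names, not the study's code.

```python
# Diagonal Laplace sketch: posterior precision ~ prior precision + squared gradients.
import torch
import torch.nn.functional as F

def diagonal_laplace_std(model, loader, prior_precision=1.0):
    """Returns a per-weight posterior standard deviation around the MAP weights."""
    precision = {n: torch.full_like(p, prior_precision)
                 for n, p in model.named_parameters()}
    for x, y in loader:
        model.zero_grad()
        F.cross_entropy(model(x), y, reduction='sum').backward()
        for n, p in model.named_parameters():
            precision[n] += p.grad.detach() ** 2  # crude diagonal curvature term
    return {n: prec.rsqrt() for n, prec in precision.items()}
```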
[Figure omitted. Source: Izmailov et al. (2018)]
Ensembling has been interpreted as an approximation to BMA (Wilson and Izmailov 2020; Jospin et al. 2022)
DE (Lakshminarayanan, Pritzel, and Blundell 2017) | MultiSWAG (Wilson and Izmailov 2020) |
---|---|
$\theta_1, \ldots, \theta_M$ learned independently from $M$ models (non-Bayesian MultiSGD) | Gaussian mixture approximation to the posterior |
$p(y \mid x) = M^{-1} \sum_{m=1}^{M} p(y \mid x, \theta_m)$ approximates the BMA | each Gaussian component is centered around a different basin of attraction |
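The diagonal part of SWAG can be sketched in a few lines: keep running first and second moments of the weights along the SGD trajectory and sample from the fitted Gaussian; MultiSWAG repeats this from several independent runs and averages the resulting predictives. The class below is an illustrative sketch under assumed PyTorch conventions (`SwagDiagonal`, `collect`, and `sample` are hypothetical names) and omits SWAG's low-rank covariance term.

```python
# SWAG-diagonal sketch: fit N(mean, diag(sq_mean - mean^2)) to SGD iterates.
import torch

class SwagDiagonal:
    def __init__(self, model):
        self.mean = {n: p.detach().clone() for n, p in model.named_parameters()}
        self.sq_mean = {n: p.detach().clone() ** 2 for n, p in model.named_parameters()}
        self.count = 1

    def collect(self, model):
        """Call periodically (e.g., once per epoch) along the SGD trajectory."""
        self.count += 1
        for n, p in model.named_parameters():
            w = p.detach()
            self.mean[n] += (w - self.mean[n]) / self.count
            self.sq_mean[n] += (w ** 2 - self.sq_mean[n]) / self.count

    def sample(self):
        """One weight draw from the fitted diagonal Gaussian."""
        return {n: self.mean[n]
                   + (self.sq_mean[n] - self.mean[n] ** 2).clamp_min(0).sqrt()
                   * torch.randn_like(self.mean[n])
                for n in self.mean}
```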
Method | Rank by accuracy ↓ | Rank by MCC ↓ |
---|---|---|
MSWAG | 1 | 1 |
SL | 2 | 2 |
DE | 3.62 | 3.62 |
MCD | 3.62 | 3.62 |
KFL | 4.75 | 4.75 |
GB | 6.12 | 6.12 |
LL | 6.88 | 6.88 |
SWAG | 8.25 | 8.12 |
BBB | 9 | 8.88 |
pSGLD | 9.75 | 10 |
SGD | 11.2 | 13.2 |
GP | 11.8 | 11 |
RF | 13.1 | 12.2 |
SGHMC | 13.9 | 13.5 |
SGLD | 15.2 | 15.2 |
LIVI | 16.2 | 16.2 |
SVM | 16.5 | 16.5 |
KNN | 18 | 18 |
LR | 19 | 19 |
DT | 20 | 20 |
NB | 21 | 21 |
DL | 22 | 22 |
Ranks are averaged over the 4 thresholds (lower is better).
MSWAG, SL, DE, MCD, and KFL show consistent performance across the different metrics.
Preconditioning is needed for MCMC-based BNNs.
Due to its small posterior sample size, LIVI shows large variability, resulting in low accuracy on average.
SGD performance falls between the BNN and ML methods.
DNNs outperform ML methods, but Gradient Boosting (GB) stands out.
Random Forests (RF) and Gaussian Process (GP) classification perform better than the MCMC-based BNNs.
Important
The ability to capture multiple basins of attraction is crucial (SWAG to MSWAG, SGD to DE)
For Laplace methods, a well-approximated H using a subnetwork is better than a full network with a poorly approximated H.
Method | Accuracy (Avg) |
---|---|
MSWAG | .932 |
MCD | .905 |
SWAG | .895 |
DE | .882 |
KFL | .877 |
SL | .876 |
LL | .875 |
GB | .872 |
SGD | .868 |
GP | .848 |
LIVI | .842 |
RF | .833 |
SVM | .807 |
pSGLD | .782 |
BBB | .768 |
SGLD | .767 |
LR | .753 |
SGHMC | .735 |
MCMC methods and BBB show much lower accuracy, potentially suffering from sampling inaccuracy or approximation error.
MCMC methods require a larger parameter sample size during training (accuracy increases by 5% when the sample size grows from 20 to 100).
GB achieves the highest accuracy among the ML methods, followed by GP and RF.
ATTENTION
BNN performance depends on architecture.
ResNet-50 improves accuracy by 4 - 5% across BNN methods
FINDINGS
Predictive accuracy is influenced by three key factors
Method | ECE Avg ↓ | Confidence (correct) Avg ↑ | Confidence (incorrect) Avg ↓ |
---|---|---|---|
MSWAG | .011 | .928 | .768 |
MCD | .008 | .961 | .877 |
SWAG | .048 | .951 | .878 |
DE | .008 | .922 | .904 |
KFL | .230 | .485 | .354 |
GB | .012 | .972 | .847 |
SL | .105 | .986 | .907 |
SGD | .066 | .924 | .898 |
LL | .436 | .259 | .256 |
GP | .111 | .699 | .483 |
RF | .095 | .718 | .501 |
SVM | .021 | .796 | .615 |
LIVI | .312 | .256 | .246 |
pSGLD | .032 | .954 | .870 |
BBB | .063 | .998 | .998 |
SGLD | .042 | .972 | .930 |
LR | .065 | .853 | .690 |
SGHMC | .251 | .472 | .422 |
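For reference, ECE in the table is computed along these lines: predictions are binned by their top-class confidence and the bin-weighted |accuracy - confidence| gaps are summed. The snippet is a standard equal-width-bin sketch (15 bins assumed), not necessarily the exact implementation used here.

```python
# Equal-width-bin ECE sketch: weighted average of |accuracy - confidence| per bin.
import torch

def expected_calibration_error(probs, labels, n_bins=15):
    conf, preds = probs.max(dim=-1)       # top-class confidence and predicted label
    correct = preds.eq(labels).float()
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = (correct[in_bin].mean() - conf[in_bin].mean()).abs()
            ece += in_bin.float().mean() * gap
    return ece.item()
```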
KFL, LL, LIVI, and SGHMC have the worst ECE due to their lack of confidence and their inability to assign high (low) probabilities to true positives (negatives).
Despite its low ECE, DE tends to be overly confident in mislabeled predictions.
KFL is conservative: for incorrect predictions, it is much less certain about the result.
BBB shows poor calibration by assigning extremely high probabilities to the wrong class. ~An arrogant and opinionated one!~
LL and LIVI exhibit poor calibration by distributing probabilities almost evenly across each class. ~One with no self-confidence and decidophobia!~
GB is slightly overconfident in the incorrectly labeled group, whereas GP and RF are better calibrated.
VP images are likely to be labeled as COVID. Both involve lung inflammation with similar symptoms?
DNN methods tend to be overly confident when labeling OOD images.
GB and GP tend to assign a higher probability of being COVID to a VP image.
They show only minor OOD overconfidence compared with the DNN methods.
DNNs tend to be overconfident on both in-sample and OOD data.
Ensembling, purely Bayesian or not, substantially enhances predictive accuracy, but not calibration.
Bayesian methods that experience mode collapse, such as VI methods with unimodal variational distributions, may perform worse than non-Bayesian or ensemble methods.
When the data-generating process involves multiple hypotheses, prediction using BMA could result in overfitting/overconfidence. (Consider Bayesian model combination, Bayesian ensembling)
Fully connected networks often underperform compared to those incorporating regularization such as dropout or subnetwork inference.
No single method outperforms in all performance measures!
❓ Computing power and resources are a concern
👉 ML ensembles, such as (extreme) gradient boosting.
❓ Powerful computing resources are available
👉 MCD, SL, MSWAG, and DE (ordered by computing time) are the best for predictive accuracy.
👉 For multiclass tasks, deep learning is preferred with more advanced architectures like ResNet-50 for improved performance.
❓ The data come from different sources or may be contaminated, or overconfidence is a primary concern
👉 KFL or GP provides more reasonable confidence and calibration.
When medical imaging is the primary focus, tasks such as segmentation, registration, and reconstruction play crucial roles in enhancing inference and analysis quality. (Wang et al. 2012; Kumar et al. 2023; Rayed et al. 2024)
The development of BNNs specifically tailored for segmentation or general image pre-processing remains limited.
There are other popular architectures such as VGGNet (Simonyan and Zisserman 2014), GoogLeNet (Szegedy et al. 2015), MobileNet (Howard et al. 2017), and EfficientNet (Tan and Le 2019).
A comprehensive study of how BNNs and associated algorithms perform on different architectures could be a valuable work that benefits Bayesian and deep learning communities.