
Applications of Certainty Scoring for Machine Learning Classification in Multi-modal Contexts
  • Alexander Berenbeim,
  • David Bierbrauer,
  • Iain Cruickshank,
  • Robert Thomson,
  • Nathaniel Bastian
United States Military Academy

Corresponding Author: [email protected]


Abstract

Quantitative characterizations and estimations of uncertainty are of fundamental importance for machine learning classification, particularly in safety-critical settings such as the military battlefield, where continuous real-time monitoring requires explainable and reliable scoring. Reliance on the maximum a posteriori principle to determine label classification can obscure a model’s certainty of label assignment. We develop quantitative scores of certainty and competence based on predicted probability estimates as an effective tool for inferring the verity of positives across different data modalities and architectures. Our theoretical results establish that competent models have distinct distributions of certainty for true and false positives. Our empirical results bear out that certainty scores are distributed distinctly on training and holdout data, as well as on data that is a priori out-of-distribution. Further, we find that the most reliable test for out-of-distribution data is to compare the global true-positive certainty score distribution against the certainty scores of test data. At least 92.3% of out-of-distribution inputs are successfully identified this way across our two experimental modalities at the tranche level. Moreover, 100% of the out-of-context images are identified as out-of-distribution using the stochastic form of our out-of-distribution detection test across all five stochastic variants of the ResNet models. Consequently, we find that our certainty framework provides a robust means of detecting out-of-distribution inputs, while also serving as a reliable mechanism for comparing how accurately models distinguish between true and false positives, particularly in safety-critical contexts.
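
The tranche-level test described above compares the distribution of certainty scores on incoming test data against the global true-positive certainty distribution from training. The Python sketch below illustrates that idea only; the abstract does not specify the paper's exact certainty score or test statistic, so the margin-based score, the two-sample Kolmogorov-Smirnov test, the names margin_certainty and flag_ood_tranche, and the significance level alpha are all illustrative assumptions, not the authors' method.

import numpy as np
from scipy import stats

def margin_certainty(probs: np.ndarray) -> np.ndarray:
    # Illustrative certainty score (an assumption, not the paper's
    # definition): the margin between the two largest predicted class
    # probabilities per sample; probs has shape [n_samples, n_classes].
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def flag_ood_tranche(tp_certainty: np.ndarray,
                     tranche_certainty: np.ndarray,
                     alpha: float = 0.05) -> bool:
    # Flag a test tranche as out-of-distribution when a two-sample
    # Kolmogorov-Smirnov test rejects, at level alpha, the hypothesis
    # that its certainty scores share the training true-positive
    # certainty distribution.
    _, p_value = stats.ks_2samp(tp_certainty, tranche_certainty)
    return p_value < alpha

# Synthetic demonstration: in-distribution true positives concentrate
# at high certainty, while an out-of-distribution tranche is diffuse.
rng = np.random.default_rng(0)
tp_scores = rng.beta(8, 2, size=500)    # high-certainty true positives
ood_scores = rng.beta(2, 2, size=200)   # diffuse OOD certainty scores
print(flag_ood_tranche(tp_scores, ood_scores))  # expected: True

Any two-sample distributional test could stand in for the KS test here; the essential design point from the abstract is that detection operates on the distribution of certainty scores over a tranche rather than on per-sample thresholds.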