The Imperative for Provable Guarantees in Safety-Critical AI
The rapid integration of Artificial Intelligence (AI), particularly machine learning (ML) models, into the core operational fabric of society marks a paradigm shift in technological capability. This shift is most profound in safety-critical systems, where the consequences of failure are measured in loss of life, significant property damage, or severe environmental harm.1 Domains such as autonomous vehicles, medical diagnostics, aerospace control systems, and critical infrastructure management are increasingly reliant on ML for perception, decision-making, and control.3 The deployment of AI in these high-stakes environments is not a future prospect but a present reality, driven by the promise of superhuman performance, enhanced efficiency, and novel capabilities that were previously unattainable.7
However, this proliferation of AI introduces a new and formidable class of systemic risk. The very properties that make modern ML models, especially deep neural networks (DNNs), so powerful—their ability to learn complex, high-dimensional, non-linear functions directly from data—also render them inherently vulnerable.10 A vast body of research has demonstrated that these models are susceptible to adversarial attacks: malicious techniques that manipulate a model by feeding it deceptive data, often modified in ways that are subtle or entirely imperceptible to humans, to cause incorrect or unintended behavior.12 An adversarial example, such as a digital image of a stop sign with minuscule, carefully crafted noise added, might be misclassified by an autonomous vehicle’s perception system as a speed limit sign with high confidence, with potentially catastrophic consequences.4 This vulnerability is not a mere software bug that can be patched but a fundamental characteristic of the decision boundaries learned by these models.10
This fundamental brittleness of ML systems creates a direct and irreconcilable conflict with the foundational principles of safety engineering. Traditional safety-critical software is built upon a bedrock of deterministic logic, formal specifications, and exhaustive verification and validation (V&V) processes that provide traceability from high-level requirements down to individual lines of code.1 The failure modes of such systems, while not always perfectly anticipated, are generally constrained by their logical design. ML models defy this paradigm. Their behavior is emergent, learned from data, and often inscrutable, leading to failure modes that are bizarre and counter-intuitive to human experts—a model might classify a giraffe as a toaster or a benign tumor as malignant with unshakable confidence.4 This “black box” nature means that traditional V&V frameworks, such as the V-model, are fundamentally inadequate. The sheer size of the input space makes exhaustive testing impossible, and the space of potential adversarial perturbations is effectively infinite. This has prompted the development of new assurance paradigms, like the W-model proposed by the European Union Aviation Safety Agency (EASA), which explicitly integrate a “learning assurance process” to address the unique challenges of data-driven systems.2
In response to the threat of adversarial attacks, the research community has developed a wide array of empirical defenses. Techniques like adversarial training, which involves augmenting the training dataset with adversarial examples, have shown considerable success in hardening models against known attack methods.15 However, these defenses exist within a perpetual and ultimately unwinnable “arms race”.17 The history of the field is replete with examples of novel defenses being proposed, only to be “broken” by the subsequent development of stronger, adaptive attacks that were not anticipated by the defenders.18 This reactive cycle of attack and defense, where security is only validated against the current state-of-the-art adversary, is fundamentally unacceptable for systems that require formal certification and regulatory approval from bodies like the Federal Aviation Administration (FAA).1
Therefore, for AI to be responsibly deployed in safety-critical applications, a higher standard of assurance is non-negotiable. Empirical validation, while necessary, is insufficient. The field requires a paradigm shift from reactive, heuristic defenses to proactive, certified safety. This necessitates the development and deployment of certified defenses—methods that provide a provable guarantee of robustness.20 A provable guarantee is a formal, mathematical proof that a model’s output will remain correct and unchanged against
any possible attack within a well-defined and formally specified threat model.19 It is this transition from “works well in practice” to “is provably correct” that represents the most critical challenge—and the greatest imperative—for the future of AI in safety-critical domains. The problem is not merely about improving algorithmic performance but about re-engineering AI systems to be compatible with the rigorous, unforgiving standards of safety engineering that have governed critical technologies for decades.5
The Adversarial Threat Landscape: A Formal Taxonomy
To comprehend the mechanisms and limitations of certified defenses, it is essential to first establish a formal and precise taxonomy of the adversarial threats they are designed to mitigate. This involves defining the adversary’s objectives, the extent of their knowledge about the target model, the specific vectors through which they can attack, and, most critically, the mathematical formalisms used to constrain their power.
Defining the Adversary: Goals and Capabilities
The nature of an adversarial attack is shaped by the adversary’s intent and their level of access to the target system. These two dimensions—goals and knowledge—form the primary axes for classifying attacks.
Attacker Goals
An adversary’s objective determines the desired outcome of the attack. The two primary goals are:
- Untargeted Attacks: The adversary’s aim is simply to cause the model to produce any incorrect output.13 For example, an attack on an autonomous vehicle’s perception system would be considered successful if an image of a stop sign is misclassified as anything other than a stop sign.13
- Targeted Attacks: The adversary seeks to force the model to produce a specific, predefined incorrect output.13 This is a more challenging but often far more dangerous form of attack. For instance, an adversary might not just want a stop sign to be misclassified, but to be misclassified specifically as a “speed limit 100” sign, thereby inducing a specific, malicious behavior in the system.4
Attacker Knowledge
The effectiveness and methodology of an attack are heavily dependent on the information available to the adversary about the target model. This knowledge spectrum is typically categorized as follows:
- White-Box Attacks: The adversary has complete knowledge of and access to the target model, including its architecture, parameters (weights and biases), gradients, and potentially even the training data.13 This represents the worst-case scenario for the defender, as the attacker can use gradient-based optimization methods to precisely craft the most effective adversarial examples.14 Certified defenses are designed to provide guarantees even under this powerful threat model.
- Black-Box Attacks: The adversary has no internal knowledge of the model and can only interact with it by submitting queries and observing the input-output behavior.14 Attacks in this setting often rely on making a large number of queries to infer the model’s decision boundaries or by training a local “surrogate” model to mimic the target and then crafting adversarial examples for the surrogate, which often transfer to the target model (a technique known as a transfer attack).13
A Taxonomy of Attack Vectors
Adversarial attacks can be launched at different stages of the machine learning lifecycle. While certified defenses primarily focus on inference-time attacks, a comprehensive understanding of the threat landscape requires acknowledging other vectors.
- Evasion Attacks (Inference-Time): This is the most common type of attack, where an adversary manipulates an input at inference time to deceive an already trained and deployed model.10 The classic example of adding imperceptible noise to an image to cause misclassification falls into this category. This is the primary threat that certified robustness aims to provably mitigate.
- Poisoning Attacks (Training-Time): In a poisoning attack, the adversary injects malicious or mislabeled data into the model’s training set.10 The goal is to corrupt the learning process itself, leading to degraded performance, biased predictions, or the creation of “backdoors” that can be exploited later. For example, a chatbot like Microsoft’s Tay was effectively poisoned by malicious users who flooded it with offensive content, causing it to learn and reproduce that behavior.25 While distinct from evasion, the principles of data sanitization and outlier removal are related defensive concepts.11
- Model Extraction and Privacy Attacks: These attacks do not seek to cause a model to misclassify, but rather to compromise its intellectual property or the privacy of its training data.13 In a model extraction attack, an adversary uses repeated queries to create a functional replica of a proprietary black-box model.13 Privacy attacks, such as membership inference, aim to determine whether a specific individual’s data was used in the model’s training set.13
Quantifying Perturbations: The Threat Model
The concept of a threat model is the cornerstone of certified defense. It provides a formal, mathematical definition of the set of allowable perturbations an adversary can apply to an input.27 A provable guarantee is only meaningful and valid
with respect to a specific threat model. If an adversary operates outside these defined constraints, the guarantee no longer holds. The most common threat models are defined using $L_p$ norms, which measure the “size” or “magnitude” of the perturbation vector $\delta$ added to an original input $x$.
- $L_\infty$ (L-infinity) Norm: Defined as $\|\delta\|_\infty = \max_i |\delta_i|$, this norm measures the maximum absolute change to any single feature (e.g., a pixel’s intensity value). An $L_\infty$-bounded attack constrains the adversary to make small changes to many pixels, ensuring the perturbation is spread out and less perceptible. This is one of the most widely studied threat models in the literature.17
- $L_2$ Norm: Defined as $\|\delta\|_2 = \sqrt{\sum_i \delta_i^2}$, this is the standard Euclidean distance. An $L_2$-bounded attack constrains the total “energy” of the perturbation. These perturbations often manifest as low-magnitude, diffuse noise across the entire input.17
- $L_0$ Norm: Defined as $\|\delta\|_0 = |\{i : \delta_i \neq 0\}|$, this norm simply counts the number of features that have been altered. An $L_0$-bounded attack allows the adversary to make arbitrarily large changes to a small, fixed number of features. This model is effective at representing sparse attacks, such as digitally altering a few key pixels or placing a small “sticker” on an object.7
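As a minimal numeric illustration (with a hypothetical four-feature input, using NumPy), the three norms can be computed directly from a perturbation vector:

```python
import numpy as np

# Hypothetical clean and adversarial inputs; delta is the perturbation.
x_clean = np.array([0.10, 0.52, 0.87, 0.33])
x_adv   = np.array([0.12, 0.52, 0.79, 0.33])
delta = x_adv - x_clean

l_inf = np.max(np.abs(delta))      # largest change to any single feature
l_2   = np.linalg.norm(delta)      # Euclidean "energy" of the perturbation
l_0   = np.count_nonzero(delta)    # number of features that were altered

print(f"L_inf = {l_inf:.3f}, L_2 = {l_2:.3f}, L_0 = {l_0}")
```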
The choice of threat model has profound implications. A model certified to be robust against $L_\infty$ perturbations is not necessarily robust against $L_2$ or $L_0$ attacks. This highlights a critical gap between the mathematical abstractions used in research and the complex, structured nature of threats in the physical world. While $L_p$ norms provide a tractable way to formulate the verification problem, they are an imperfect proxy for real-world adversarial manipulations. For instance, a physical sticker placed on a stop sign is a localized, high-magnitude, and semantically meaningful perturbation that is not well-captured by a simple $L_\infty$ or $L_2$ ball.15 Similarly, transformations like rotation, scaling, or changes in lighting conditions represent structured changes to the input that fall outside the scope of standard $L_p$ norm-based threat models.7 This discrepancy means that a “provable guarantee” against a small digital perturbation might offer a false sense of security against a physically realizable attack. Consequently, a major frontier for certified defense research is the development of methods that can certify robustness against these more realistic and semantically rich classes of transformations.6
Paradigms of Defense: From Empirical Resilience to Certified Invulnerability
The pursuit of adversarial robustness has given rise to two fundamentally different defense paradigms: empirical defense and certified defense. Understanding the distinction between these approaches is crucial for appreciating why the latter is indispensable for safety-critical applications.
The Vicious Cycle of Empirical Defense
The initial response to the discovery of adversarial examples was the development of a host of empirical defenses. These methods aim to make models more resilient by anticipating and training against specific attack strategies. Prominent examples include:
- Adversarial Training: The most enduring and effective empirical defense, where the model’s training data is augmented with adversarial examples generated on-the-fly. This forces the model to learn decision boundaries that are less sensitive to the directions in which adversaries push inputs.15
- Defensive Distillation: A technique where a model is trained to produce softer probability distributions over classes, making it harder for an attacker to exploit sharp decision boundaries.19
- Gradient Masking/Obfuscation: Methods that attempt to defend a model by hiding or distorting its gradient information, thereby frustrating the gradient-based attacks that adversaries rely on.20
While these techniques can significantly improve a model’s robustness against known attacks, their core weakness lies in their reactive nature. They are validated empirically, meaning their effectiveness is measured by their performance against a battery of existing attack algorithms. This leads to a cat-and-mouse game: a new defense is proposed, and soon after, researchers develop a new, more sophisticated adaptive attack that specifically bypasses it.18 Gradient masking defenses were broken by attacks designed to approximate the gradient 20, and defensive distillation was defeated by attacks tailored to its mechanism.19 This cycle of broken defenses underscores a fundamental limitation: empirical methods provide no guarantee of security against future, unforeseen attacks.7 For a system that must be certified to be safe, such as an aircraft collision avoidance system, this lack of a forward-looking guarantee is an unacceptable risk.1
Defining the “Provable Guarantee”
Certified defense breaks this cycle by shifting the objective from resisting known attacks to proving immunity against entire classes of attacks. A provable guarantee is a formal, mathematical proof that a model’s behavior is invariant within a specified region around a given input.23
Formally, for a classifier $f$, an input $x$, and a threat model defined by a perturbation set $B(x)$ (e.g., an $L_p$ ball of radius $\epsilon$), a provable guarantee certifies that:
$$ f(x') = f(x) \quad \text{for all } x' \in B(x) $$
This statement asserts that no adversary, no matter how clever or computationally powerful, can find an adversarial example within the defined perturbation set that changes the model’s prediction.1
The Certified Radius
Instead of verifying robustness for a fixed, predefined perturbation size $\epsilon$, certification methods are often used to compute the maximum size of the perturbation for which the guarantee holds. This value is known as the certified radius, $R$.28 A larger certified radius indicates a more robust model. The primary metric for evaluating and comparing certified defenses is certified accuracy at a given radius $r$: the percentage of samples in a test set that are not only classified correctly but for which the method can also prove a certified radius $R \geq r$.28
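A minimal sketch of how this metric is typically tabulated from per-sample certification results; the arrays and threshold values below are illustrative placeholders, not outputs of any particular tool:

```python
import numpy as np

# Per-sample certification output: whether the clean prediction was correct,
# and the certified radius proved for that sample (0 if certification failed).
correct = np.array([True, True, False, True, True])
radius  = np.array([0.50, 0.12, 0.00, 0.31, 0.00])

def certified_accuracy(correct, radius, r):
    """Fraction of samples that are correctly classified AND certified at radius >= r."""
    return np.mean(correct & (radius >= r))

for r in (0.1, 0.25, 0.5):
    print(f"certified accuracy at r={r}: {certified_accuracy(correct, radius, r):.2f}")
```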
Empirical vs. Certified Robustness: A Fundamental Dichotomy
The distinction between these two paradigms can be summarized by the nature of the bound they provide on the model’s true robust accuracy (the accuracy on worst-case adversarial examples).
- Empirical Robustness: Provides an upper bound on the true robust accuracy. It is determined by testing against a finite set of the strongest known attacks. This bound is optimistic; the true robust accuracy is at most as high as the empirical one, and it could be lower if a stronger, unknown attack exists. As such, this bound is fragile and can be invalidated by future research.20
- Certified Robustness: Provides a lower bound on the true robust accuracy. It is a theoretical guarantee derived from a mathematical proof that holds for an infinite set of potential attacks within the threat model. This bound is pessimistic but durable; the true robust accuracy is at least as high as the certified one. It is a provable statement of security.20
This fundamental difference has profound implications for safety-critical systems. While an empirical defense might report 95% accuracy against a strong attack, a new attack could emerge tomorrow that drops its accuracy to 0%. A certified defense might only be able to guarantee 70% accuracy at a certain radius, but that 70% is a provable floor on its performance against any attack within that threat model, today and in the future.
However, the very language of certification—”provable,” “guaranteed,” “certified”—can be a double-edged sword. To practitioners, regulators, and the public, these terms suggest absolute, unconditional safety.19 This perception creates a dangerous semantic gap. A certificate is not a blanket statement of security; it is a highly conditional one. The guarantee is only valid for the specific threat model it was evaluated against (e.g., an
$L_p$ norm ball of a specific radius $\epsilon$) and says nothing about the model’s behavior against larger perturbations or different types of attacks (e.g., attacks under a different norm, or physical-world attacks).27 Furthermore, achieving certified robustness often comes at the cost of reduced accuracy on clean, benign data, a trade-off that must be carefully managed.30 This potential for overconfidence and misunderstanding of a certificate’s limitations is itself a security risk. It underscores the need for clear standards and communication about what a certificate truly represents: not a declaration of invulnerability, but one component within a comprehensive, defense-in-depth security architecture designed to manage and mitigate residual risk.19
A Technical Deep Dive into Certified Defense Mechanisms
The goal of providing a provable guarantee against adversarial attacks has led to the development of several distinct families of certified defense techniques. Each approach is built on different mathematical principles and offers a unique profile of strengths, weaknesses, and computational trade-offs. The four primary paradigms are Randomized Smoothing, Interval Bound Propagation (and its derivatives), Abstract Interpretation, and Semidefinite Programming Relaxations.
Randomized Smoothing
Randomized Smoothing is a probabilistic certification technique that has become a leading method due to its remarkable scalability and model-agnostic nature.28
Core Mechanism
The core idea is to transform any arbitrary base classifier, $f$, into a new, “smoothed” classifier, $g$. The prediction of this smoothed classifier for an input $x$, denoted $g(x)$, is defined as the class that the base classifier is most likely to output when the input is perturbed by noise drawn from a standard distribution, typically an isotropic Gaussian $\mathcal{N}(0, \sigma^2 I)$.28 Formally, the smoothed classifier is:
$$ g(x) = \arg\max_{c \in \mathcal{Y}} \mathbb{P}_{\delta \sim \mathcal{N}(0, \sigma^2 I)} [f(x+\delta) = c] $$
In practice, this probability is estimated using Monte Carlo sampling: a large number of noisy samples of x are generated and passed through the base classifier f, and the class with the most “votes” is returned as the prediction of g.44 The intuition is that the large, random Gaussian perturbations effectively “drown out” any small, maliciously crafted adversarial perturbation, making the majority vote stable.28
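A minimal PyTorch-style sketch of this Monte Carlo voting step, assuming a hypothetical `base_model` and illustrative values of `sigma` and `n` (a sketch of the idea, not a reference implementation, which would also include the statistical abstention test used in practice):

```python
import torch

def smoothed_predict(base_model, x, sigma=0.25, n=1000, num_classes=10):
    """Majority-vote prediction of the smoothed classifier g at input x.

    base_model: any classifier mapping a batch of inputs to logits.
    sigma:      standard deviation of the isotropic Gaussian noise.
    n:          number of Monte Carlo noise samples.
    """
    counts = torch.zeros(num_classes, dtype=torch.long)
    with torch.no_grad():
        for _ in range(n):
            noise = torch.randn_like(x) * sigma          # delta ~ N(0, sigma^2 I)
            pred = base_model(x + noise).argmax(dim=-1)  # base classifier's "vote"
            counts[pred] += 1
    return counts.argmax().item(), counts
```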
The Guarantee
Randomized Smoothing provides a high-probability certificate of robustness, predominantly for the $L_2$ norm. The Neyman-Pearson lemma can be used to prove that if the probability of the most likely class, $p_A$, is sufficiently high, then the prediction of the smoothed classifier is guaranteed to be constant within an $L_2$ ball around $x$. The certified radius $R$ is a direct function of the standard deviation $\sigma$ of the Gaussian noise and the Clopper-Pearson lower bound of the probability of the top-ranked class, $\underline{p_A}$.44 A simplified form of the radius calculation is:
$$ R = \frac{\sigma}{2}\left( \Phi^{-1}(\underline{p_A}) - \Phi^{-1}(\overline{p_B}) \right) $$
where $\overline{p_B}$ is the upper bound on the probability of the “runner-up” class and $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function.39
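In code, this radius calculation is a one-liner using SciPy's inverse normal CDF (`scipy.stats.norm.ppf`); the probability bounds below are illustrative placeholders rather than outputs of a real certification run:

```python
from scipy.stats import norm

def certified_radius(p_a_lower, p_b_upper, sigma):
    """L2 certified radius from the class-probability bounds of the smoothed classifier."""
    if p_a_lower <= p_b_upper:
        return 0.0  # cannot certify: top class is not confidently separated
    return (sigma / 2.0) * (norm.ppf(p_a_lower) - norm.ppf(p_b_upper))

# Example: top-class lower bound 0.90, runner-up upper bound 0.05, noise sigma 0.25.
print(certified_radius(0.90, 0.05, 0.25))  # ≈ 0.366
```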
Strengths and Weaknesses
The primary strength of Randomized Smoothing is its scalability. Because it only requires black-box query access to the base classifier, it can be applied to arbitrarily large and complex models, including state-of-the-art architectures trained on massive datasets like ImageNet, where most other certification methods fail.28
However, this scalability comes with significant drawbacks. First, it is computationally expensive at inference time. Certifying a single prediction requires hundreds or thousands of forward passes through the base model to get a tight estimate of the class probabilities, which is prohibitive for real-time applications like autonomous driving.28 Second, while it provides strong guarantees for
$L_2$ perturbations, its certified radius for the crucial $L_\infty$ threat model scales as $O(1/\sqrt{d})$ (where $d$ is the input dimension), rendering the bounds vacuous for high-dimensional data like images.30 Finally, the noise level $\sigma$ introduces a direct and often severe trade-off between standard accuracy and certified robustness. A larger $\sigma$ leads to a larger potential certified radius but degrades the model’s accuracy on clean, unperturbed inputs.42
Interval Bound Propagation (IBP) and its Derivatives
Interval Bound Propagation is a deterministic certification method prized for its computational efficiency, which makes it particularly well-suited for use within the training loop of a neural network.24
Core Mechanism
IBP works by propagating a set of possible inputs, represented as a hyper-rectangle (an interval for each dimension), through the network layer by layer. Given an input $x$ and an $L_\infty$ perturbation budget $\epsilon$, the initial input set is defined by the interval $[x - \epsilon, x + \epsilon]$. For each subsequent layer, IBP calculates the lower and upper bounds of the possible activation values for every neuron, given the bounds from the previous layer.24 For a linear layer followed by a monotonic activation function like ReLU, these bounds can be computed straightforwardly. For example, the upper bound for a neuron is found by multiplying the positive weights by the upper bounds of the preceding neurons and the negative weights by their lower bounds.50 This process continues until the final output layer.
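A minimal NumPy sketch of one such propagation step for a linear layer followed by ReLU, illustrating the rule described above (a toy example, not any particular library's implementation):

```python
import numpy as np

def ibp_linear_relu(lower, upper, W, b):
    """Propagate interval bounds [lower, upper] through y = relu(W @ x + b)."""
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    # Upper bound: positive weights take upper bounds, negative weights take lower bounds.
    out_upper = W_pos @ upper + W_neg @ lower + b
    # Lower bound: the mirror image.
    out_lower = W_pos @ lower + W_neg @ upper + b
    # ReLU is monotonic, so it can be applied elementwise to both bounds.
    return np.maximum(out_lower, 0), np.maximum(out_upper, 0)

# Toy example: 2-D input x with an L_inf budget epsilon = 0.1.
x, eps = np.array([0.5, -0.2]), 0.1
lower, upper = x - eps, x + eps
W = np.array([[1.0, -2.0], [0.5, 0.3]])
b = np.array([0.1, -0.1])
print(ibp_linear_relu(lower, upper, W, b))
```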
The Guarantee
The guarantee is derived from the final output bounds. For a given class, if the computed lower bound of its corresponding logit is greater than the computed upper bounds of the logits for all other classes, then the model’s prediction is guaranteed to be constant for any input within the initial hyper-rectangle. The model is thus certified robust for that input and perturbation size $\epsilon$.27
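Continuing the sketch above, the certification check itself reduces to a comparison of output-layer bounds; `logit_lower` and `logit_upper` stand in for the bounds computed by the propagation:

```python
import numpy as np

def ibp_certified(logit_lower, logit_upper, true_class):
    """Certified robust iff the true class's lower bound beats every other class's upper bound."""
    others = np.delete(logit_upper, true_class)
    return logit_lower[true_class] > np.max(others)

# Hypothetical output bounds for a 3-class problem, true class = 0.
print(ibp_certified(np.array([2.1, -0.3, 0.4]),
                    np.array([3.0,  1.5, 1.9]), true_class=0))  # True
```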
Strengths and Weaknesses
IBP’s main advantage is its efficiency. The bound propagation is a simple, fast, and parallelizable forward pass, making it cheap enough to be incorporated directly into the loss function during training. This enables certified adversarial training, where the model is optimized to minimize an upper bound of the worst-case loss over the entire perturbation set, directly improving its provable robustness.24
The principal weakness of IBP is the “wrapping effect”. At each layer, the dependencies between neuron activations are lost as they are all bounded by a single hyper-rectangle. This causes the bounds to become progressively looser and more over-approximated with each additional layer. For deep networks, this effect can be so severe that the final bounds become vacuous (i.e., too wide to prove anything useful), severely limiting IBP’s effectiveness for certifying deep, pre-trained models.50
To address this, hybrid methods like CROWN-IBP have been developed. CROWN-IBP combines the speed of IBP with the tightness of a more sophisticated linear relaxation-based method called CROWN (Convex Relaxation based On Network). It uses a fast IBP forward pass to establish initial loose bounds, followed by a tighter, CROWN-based backward pass to refine them. This hybrid approach has achieved state-of-the-art results for deterministic certified training, balancing efficiency and the tightness of the final guarantee.51
Abstract Interpretation
Abstract Interpretation is a theory of sound approximation of program semantics, originating from the field of formal methods for software verification. Its application to neural networks provides a highly rigorous framework for certification.56
Core Mechanism
The core idea is to soundly overapproximate the set of all possible network outputs that can result from a given set of inputs. This is achieved using abstract domains, which are mathematical structures used to represent sets of concrete values (e.g., intervals, zonotopes, polyhedra), and abstract transformers, which are functions that compute the effect of each network layer on these abstract sets.56 The analysis begins with an abstract element representing the initial input perturbation set. This element is then propagated through the network by applying the corresponding abstract transformer for each layer (e.g., affine transformation, ReLU activation, max pooling).6
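To make the domain/transformer structure concrete, the sketch below implements a zonotope abstract domain with an exact affine transformer and a DeepZ-style single-slope over-approximation of ReLU; it is an illustrative simplification under those assumptions, not the DeepPoly analysis itself:

```python
import numpy as np

class Zonotope:
    """Abstract element: the set {center + generators.T @ eps : eps in [-1, 1]^k}."""
    def __init__(self, center, generators):
        self.center = center          # shape (n,)
        self.generators = generators  # shape (k, n): k noise symbols over n neurons

    def bounds(self):
        radius = np.sum(np.abs(self.generators), axis=0)
        return self.center - radius, self.center + radius

def affine_transformer(z, W, b):
    """Exact abstract transformer for an affine layer y = W @ x + b."""
    return Zonotope(W @ z.center + b, z.generators @ W.T)

def relu_transformer(z):
    """Sound over-approximation of ReLU (single-slope relaxation for unstable neurons)."""
    l, u = z.bounds()
    lam = np.where(u <= 0, 0.0, np.where(l >= 0, 1.0, u / (u - l + 1e-12)))
    mu = np.where((l < 0) & (u > 0), -lam * l / 2.0, 0.0)
    center = lam * z.center + mu
    gens = lam * z.generators        # scale the existing noise symbols
    new_gens = np.diag(mu)           # one fresh noise symbol per unstable neuron
    return Zonotope(center, np.vstack([gens, new_gens]))
```

An input $L_\infty$ ball of radius $\epsilon$ corresponds to the zonotope with center $x$ and generators $\epsilon I$; after propagating through all layers, robustness is checked by concretizing the output zonotope to interval bounds and applying the same logit comparison used for IBP.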
The Guarantee
The guarantee is sound due to the over-approximation property: the final abstract element in the output space is guaranteed to contain all possible concrete outputs. Robustness is then verified by checking if this final abstract output set is fully contained within the decision region of the correct class.6 If it is, the network is provably robust for the initial set of inputs.
Strengths and Weaknesses
The main strength of Abstract Interpretation lies in its rigor and flexibility. It is a well-established theory from software verification, providing a solid formal foundation.56 Furthermore, by choosing more expressive abstract domains (e.g., moving from simple intervals to more complex polyhedra, as in the DeepPoly tool), one can achieve tighter bounds and more precise verification, albeit at a higher computational cost.6
This trade-off points to its primary weakness: computational complexity. The cost of the analysis scales with the expressiveness of the abstract domain and the size of the network. While methods using simple intervals are fast (and are in fact equivalent to IBP), more precise domains like polyhedra can become computationally intractable for very large, deep networks, limiting their applicability.58
Semidefinite Programming (SDP) Relaxations
This approach leverages powerful tools from convex optimization to compute very tight bounds on a network’s robustness.
Core Mechanism
The problem of finding the worst-case adversarial perturbation can be formulated as a non-convex optimization problem, specifically a Quadratically Constrained Quadratic Program (QCQP), where the non-convexity arises from the ReLU activation constraints.61 The key idea is to “relax” this intractable non-convex problem into a larger but convex Semidefinite Program (SDP). This is done by lifting the variables into a higher-dimensional space and replacing the non-convex constraints with convex ones that are guaranteed to enclose the original feasible set. This convex SDP can then be solved efficiently using standard solvers.31
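Concretely, the ReLU constraint $z = \max(0, Wx + b)$ admits an exact quadratic description, and the relaxation replaces products of variables with entries of a single positive semidefinite matrix (a sketch of the standard lifting, not a complete program):
$$ z \geq Wx + b, \qquad z \geq 0, \qquad z \odot \big(z - (Wx + b)\big) = 0 $$
Introducing $v = (1, x^{\top}, z^{\top})^{\top}$ and a matrix variable $P$ standing in for $v v^{\top}$, every quadratic term becomes a linear function of the entries of $P$; the intractable equality $P = v v^{\top}$ is then relaxed to the convex constraints $P \succeq 0$ and $P_{11} = 1$, and maximizing the adversary's (now linear) objective over this convex set yields the guaranteed upper bound described next.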
The Guarantee
The solution to the relaxed SDP provides a guaranteed upper bound on the worst-case objective the adversary can achieve (for example, the margin by which any incorrect class can outscore the correct one). If this upper bound is negative for every incorrect class, it serves as a certificate that no adversarial example exists within the defined threat model.61
Strengths and Weaknesses
The primary advantage of SDP relaxations is the tightness of the bounds they provide. They are provably tighter than relaxations based on Linear Programming (LP) because the SDP formulation can capture and reason about the joint correlations between different neuron activations, whereas LP relaxations treat them independently.61 This allows SDP-based methods to provide meaningful robustness certificates even for “foreign” networks that were not specifically trained to be robust.61
The overwhelming weakness is the prohibitive computational cost. Solving large-scale SDPs is extremely resource-intensive, and the size of the SDP grows rapidly with the number of neurons in the network. As a result, its application has largely been limited to smaller networks, often with only one or two hidden layers in research settings, making it currently impractical for verifying the large, deep models used in most real-world applications.18
A Comparative Analysis of Certification Techniques: The Impossible Triangle
Choosing a certified defense method for a safety-critical application is not a matter of selecting a single “best” algorithm. Instead, it requires navigating a complex landscape of trade-offs. The performance of these techniques can be understood through the lens of an “impossible triangle,” where three desirable properties—high certified accuracy, high clean accuracy, and computational scalability—are in fundamental tension. No single method currently excels at all three simultaneously, forcing developers and safety engineers to make principled choices based on the specific constraints and requirements of their system.36
The “Impossible Triangle”: Certified Accuracy, Clean Accuracy, and Scalability
- Certified Accuracy (Tightness of Bounds): This refers to the model’s accuracy on worst-case adversarial examples, as guaranteed by the certification method. It is directly related to the tightness of the bounds the method can prove. Tighter bounds lead to larger certified radii and thus higher certified accuracy for a given perturbation budget $\epsilon$.
- Clean Accuracy: This is the model’s standard performance on benign, unperturbed data. An ideal defense would have minimal impact on this metric, as a model that is robust but useless for its primary task has no practical value.
- Scalability: This encompasses both training and inference efficiency. A scalable method can be applied to large, deep neural networks (like those used in production) and can perform its function (training or certification) within a reasonable time and computational budget.
The interplay between these factors dictates the practical utility of each certification paradigm.
Scalability vs. Tightness of Bounds
There is a direct and often sharp trade-off between the computational scalability of a certification method and the tightness of the robustness bounds it can provide.
- High Scalability, Looser Bounds: At one end of the spectrum, Randomized Smoothing stands out for its ability to scale to massive, ImageNet-sized models, a feat beyond the reach of most other methods.28 However, this scalability is achieved by using a probabilistic, sampling-based approach that provides bounds that are often looser than deterministic methods and are highly dependent on the number of Monte Carlo samples used.46 Similarly,
Interval Bound Propagation (IBP) is extremely fast and scalable for training, but it is notoriously prone to the “wrapping effect,” which leads to very loose bounds, especially in deep networks.54
- Low Scalability, Tighter Bounds: At the opposite end, Semidefinite Programming (SDP) Relaxations offer the tightest known bounds among convex relaxation techniques.61 However, the computational cost of solving the required SDPs is so high that these methods are generally intractable for all but the smallest networks.18
Abstract Interpretation occupies a middle ground; its scalability is inversely proportional to the precision of its abstract domain. Using simple intervals (equivalent to IBP) is fast but loose, while using more expressive domains like polyhedra (e.g., DeepPoly) yields much tighter bounds at a significant computational cost that limits its applicability to moderately sized networks.6
The Robustness-Accuracy Trade-off
A near-universal challenge in the field is that increasing a model’s certified robustness almost invariably leads to a decrease in its standard accuracy on clean data.30 This trade-off arises because the training objectives used to promote certified robustness act as a form of strong regularization, forcing the model to learn smoother, simpler decision boundaries. While these smoother boundaries are less susceptible to small perturbations, they may be less capable of fitting the intricate patterns present in the clean training data. This effect is particularly pronounced in methods like IBP-based certified training, where the optimization process directly penalizes sharp decision boundaries, and in Randomized Smoothing, where the addition of high-variance noise during training and inference inherently degrades performance on clean inputs.42 Managing this trade-off is a key engineering challenge in deploying robust models.
Applicability to Threat Models
The effectiveness of a certified defense is also highly dependent on the threat model under consideration.
- Randomized Smoothing is natively suited for providing certificates against $L_2$ norm perturbations, where it achieves state-of-the-art results. However, its guarantees for the $L_\infty$ norm are notoriously weak and become impractical in high dimensions.48
- In contrast, deterministic methods like IBP, Abstract Interpretation, and SDP Relaxations are primarily designed for and evaluated against $L_\infty$ norm threats, which model the common scenario of small, bounded changes to each input feature.6
This specialization means that the choice of defense must be aligned with the most plausible threat model for the target application. An autonomous vehicle perception system might be more concerned with sparse $L_0$ attacks (stickers) or broader $L_\infty$ attacks, making $L_2$-focused methods like Randomized Smoothing a potentially poor fit.
The following table provides a structured summary of these multi-dimensional trade-offs, offering a comparative overview of the primary certified defense paradigms. It serves as a high-level guide for practitioners to understand the value proposition and core limitations of each approach.
| Technique | Core Mechanism | Primary Threat Model | Scalability (Network Size) | Tightness of Bounds | Robustness-Accuracy Trade-off | Key Limitation |
| --- | --- | --- | --- | --- | --- | --- |
| Randomized Smoothing | Probabilistic; Monte Carlo sampling over noisy inputs 28 | $L_2$ norm 28 | High (scales to ImageNet) 28 | Moderate (probabilistic, depends on sample count $n$) 46 | High (significant impact on clean accuracy) 42 | High inference cost; weak guarantees for $L_\infty$ norm 46 |
| IBP & CROWN-IBP | Deterministic; propagation of interval/linear bounds 24 | $L_\infty$ norm 51 | High (for training) 49 | Varies (IBP is loose, CROWN-IBP is tighter) 51 | Moderate to High (strong regularization needed) 65 | “Wrapping effect” leads to loose bounds in deep networks 53 |
| Abstract Interpretation | Formal; sound overapproximation of reachable states 56 | $L_\infty$, $L_2$, other geometric sets 6 | Low to Moderate 58 | Can be very tight with expressive domains (e.g., DeepPoly) 6 | Varies by domain precision | High computational complexity; domain-specific transformers 6 |
| SDP Relaxation | Formal; convex relaxation of network constraints 61 | $L_\infty$ norm 61 | Low 18 | Tightest among convex relaxations 61 | High | Computationally prohibitive for large networks 18 |
Certified Robustness in Practice: Case Studies in Safety-Critical Domains
The theoretical frameworks of certified robustness are being actively adapted and applied to address the unique and demanding challenges of specific safety-critical domains. The distinct operational constraints, threat environments, and required functionalities of automated driving, medical imaging, and aviation control are driving the specialized evolution of certification techniques. There is no “one-size-fits-all” solution; rather, the state-of-the-art is fragmenting into a toolbox of domain-specific methods.
Automated Driving
The deployment of autonomous vehicles (AVs) represents one of the most visible and high-stakes applications of AI. The perception, prediction, and planning systems of AVs are heavily reliant on deep neural networks, making their robustness a paramount safety concern.
Domain-Specific Challenges
- Real-Time Performance: AV systems must process sensor data and make decisions in milliseconds. This places extreme constraints on the computational overhead of any defense mechanism, making many certification techniques that are expensive at inference time, such as standard Randomized Smoothing, impractical for online deployment.46
- Multi-Modal Perception: AVs rely on a fusion of sensors, including cameras, LiDAR, and radar, to build a comprehensive understanding of their environment.67 Securing these systems requires defenses that can handle multi-modal data and are robust to attacks that may target one or more sensor streams simultaneously.67
- Physical-World Threats: The primary threats to AVs are not just digital perturbations but physical ones. Adversaries can use physical objects like stickers, patches, adversarial textures, or camouflage to fool perception systems.7 These attacks do not conform neatly to the simple
$L_p$-norm threat models used in most certification research, creating a significant gap between theoretical guarantees and real-world security.67
- Regression and Control Tasks: Beyond simple classification, AVs rely on NNs for regression tasks (e.g., vehicle localization, distance estimation) and for learning control policies via deep reinforcement learning (e.g., collision avoidance maneuvers).45 Certification methods must be extended to provide guarantees for these continuous-output and sequential decision-making problems.
State-of-the-Art and Applications
Research in this area is focused on adapting certification methods to these demanding constraints. Randomized Smoothing has been a popular choice due to its scalability and model-agnosticism. Studies have extended it to provide certified robustness for regression models used in visual positioning systems, which are essential for autonomous navigation.45 Other work has focused on making Randomized Smoothing more efficient by reducing the number of required Monte Carlo samples, a crucial step towards enabling real-time certification.46 There is also active research into developing certified defenses for deep reinforcement learning policies, for example, by computing guaranteed lower bounds on state-action values to ensure the selection of a safe action even under worst-case input perturbations in pedestrian collision avoidance scenarios.66 The field is acutely aware of the limitations of current methods, and a key research direction is bridging the gap between digital certifications and robustness against tangible, physical-world attacks.67
Medical Image Analysis
In medicine, AI is being used to assist clinicians in tasks like diagnosing diseases from radiological scans, segmenting tumors for treatment planning, and analyzing pathological slides. The safety-critical nature of these decisions makes the trustworthiness and reliability of the underlying models a matter of patient health and life.
Domain-Specific Challenges
- High-Stakes Decisions: An incorrect prediction can lead to a misdiagnosis, a flawed treatment plan, or a missed critical finding. The tolerance for error is extremely low.
- Segmentation as a Core Task: Many medical imaging applications involve segmentation—outlining specific organs, tissues, or pathologies on a pixel-wise basis—rather than simple image-level classification. This requires new definitions of certified robustness based on overlap metrics like the Dice score or Intersection over Union (IoU), rather than classification accuracy.70 (Both metrics are recalled just after this list.)
- Data Scarcity and Specificity: Unlike general computer vision, medical imaging often deals with smaller, highly specialized datasets. Models must be robust without the benefit of training on web-scale data.
- Interpretability: For clinical acceptance, it is often not enough for a model to be robust; its decisions must also be interpretable to a human expert.
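For reference, the two overlap metrics mentioned above are defined for a predicted mask $P$ and a ground-truth mask $G$ as:
$$ \mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}, \qquad \mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|} $$
Certification results for segmentation are therefore typically reported as certified (worst-case) Dice or IoU scores rather than certified classification accuracy.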
State-of-the-Art and Applications
The application of certified defenses to medical imaging is a nascent but rapidly growing field. A significant breakthrough has been the development of the first certified segmentation baselines for medical imaging.70 This pioneering work leverages
Randomized Smoothing in conjunction with pre-trained denoising diffusion models. The diffusion model acts as a powerful denoiser, cleaning the noisy inputs before they are passed to the segmentation model. This approach has been shown to maintain high certified Dice scores on a variety of tasks, including the segmentation of lungs on chest X-rays, skin lesions, and polyps in colonoscopies, even under significant perturbation levels.70 The offline nature of many diagnostic tasks makes the higher computational cost of such methods more acceptable than in real-time domains like autonomous driving. The relevance of other techniques, such as
Interval Bound Propagation, has also been noted for medical data analysis, highlighting the potential for a variety of methods to be adapted to this domain.53 A key future direction is the establishment of standardized benchmarks to drive progress in this largely uncharted but critically important area.70
Aviation Control Systems
The aerospace industry has the most stringent safety and certification requirements of any domain. The integration of AI into flight-critical systems, such as collision avoidance, represents the ultimate challenge for provable AI safety.
Domain-Specific Challenges
- Extreme Reliability: Aviation systems demand near-perfect reliability. For example, a collision avoidance system must be shown to provide the correct advisory in virtually 100% of cases, a standard that is extremely difficult for NNs to meet on their own.73
- Formal Certification Standards: All software and hardware in commercial aircraft must be certified according to rigorous standards like DO-178C for software and DO-254 for hardware. These standards were designed for traditional, deterministic systems, and new standards for AI (such as ARP6983) are still under development, creating a significant regulatory gap.9
- Infinite-Time Horizon Guarantees: For a control system, it is not enough to certify the robustness of a single, static prediction. Safety must be guaranteed over the entire operational envelope of the system, requiring verification across an infinite-time horizon.73
- Hardware Implementation Gaps: Theoretical guarantees are often derived for idealized, real-valued NNs. However, real-world avionics hardware uses finite-precision arithmetic, which introduces roundoff errors. A true safety guarantee must be robust to these finite-precision perturbations in sensing, computation, and actuation.73
State-of-the-Art and Applications
The extreme demands of aviation have pushed the field beyond standard certified robustness and deep into the realm of formal methods for Neural Network Control Systems (NNCS) verification. The primary application is the next-generation Airborne Collision Avoidance System (ACAS X), which uses a set of NNs to compress massive (multi-gigabyte) lookup tables into a compact form that can run on avionics hardware.73 The verification task is then to prove that these compressed NN models are safe and behave correctly.
This is accomplished using advanced formal verification tools that perform reachability analysis. Techniques based on star-set reachability and Differential Dynamic Logic (dL) are used to compute an over-approximation of all possible states a system can reach over time, proving that it never enters an unsafe state (e.g., a collision).73 This approach provides much stronger, system-level guarantees than the input-output robustness offered by standard certified defenses. However, it is also vastly more complex and computationally demanding. The research in this area is focused on bridging the gap between these powerful theoretical guarantees and practical implementation by accounting for factors like finite-precision errors and by developing so-called “safety nets,” which combine a simple, verifiable component (like a sparse lookup table) with the more powerful but harder-to-verify NN to ensure safety.73
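As a toy illustration of the reachability idea only (not the ACAS X verification pipeline itself), interval arithmetic can propagate a set of states through a discrete-time closed loop, with the neural controller's output bounded by any of the certification techniques discussed earlier; the `controller_bounds` callable here is a hypothetical stand-in:

```python
import numpy as np

def reachable_sets(lower, upper, A, B, controller_bounds, steps):
    """Over-approximate the reachable state sets of x' = A x + B u over several steps."""
    A_pos, A_neg = np.maximum(A, 0), np.minimum(A, 0)
    B_pos, B_neg = np.maximum(B, 0), np.minimum(B, 0)
    sets = [(lower, upper)]
    for _ in range(steps):
        u_lo, u_hi = controller_bounds(lower, upper)  # e.g., IBP bounds on the NN controller
        new_lower = A_pos @ lower + A_neg @ upper + B_pos @ u_lo + B_neg @ u_hi
        new_upper = A_pos @ upper + A_neg @ lower + B_pos @ u_hi + B_neg @ u_lo
        lower, upper = new_lower, new_upper
        sets.append((lower, upper))
    return sets  # safety check: no set may intersect the unsafe region

# Toy usage with a trivially bounded controller output u in [-1, 1].
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
sets = reachable_sets(np.array([0.9, -0.1]), np.array([1.1, 0.1]), A, B,
                      lambda lo, hi: (np.array([-1.0]), np.array([1.0])), steps=3)
print(sets[-1])
```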
Barriers to Deployment: Practical Challenges and Current Limitations
Despite significant academic progress, the transition of certified defenses from research laboratories to widespread deployment in industrial and safety-critical systems is hindered by a formidable set of practical, regulatory, and economic challenges. These barriers extend far beyond the technical trade-offs of scalability and accuracy, touching upon the fundamental realities of engineering, regulation, and business operations.
Integration with Legacy Infrastructure (Brownfield Deployments)
Many safety-critical domains, particularly in industrial control systems (ICS) and manufacturing, are characterized by “brownfield” environments. These systems often consist of decades-old legacy hardware, proprietary communication protocols, and networks that were designed for operational reliability and physical isolation, not for cybersecurity.37 Integrating modern, AI-based components—let alone those equipped with computationally intensive certified defenses—into this existing infrastructure is a monumental engineering challenge. The historical lack of cybersecurity focus means these legacy systems often have known vulnerabilities, outdated software, and a lack of modern security features like encryption or authentication.77 Layering a new AI system onto this foundation without introducing new attack surfaces or creating unforeseen interactions is a complex and risky endeavor that requires careful consideration of the entire layered technology stack.37
Regulatory and Certification Hurdles
The regulatory frameworks that govern safety-critical industries were built for a world of deterministic, verifiable software. Standards like DO-178C in aviation or ISO 26262 in the automotive sector are predicated on principles of requirements traceability, exhaustive testing, and predictable system behavior—principles that are fundamentally challenged by the data-driven, probabilistic, and often opaque nature of machine learning.1 Consequently, there is currently no established certification process for deploying deep learning systems in most safety-critical applications.78
While new standards are being developed—such as the joint SAE/EUROCAE effort on ARP6983 for AI in aeronautical systems—this is a slow, consensus-driven process involving multiple international regulatory bodies like the FAA and EASA.9 In the interim, there is a lack of clear, standardized criteria for practitioners and regulators to evaluate the claims made by different certification schemes.19 Questions persist: What constitutes a sufficient certified radius? Which threat model is appropriate for a given application? How should the trade-off between certified robustness and clean accuracy be managed? Without clear answers and regulatory guidance, deploying these systems involves significant legal and safety liability.
Defining Realistic Threat Models
A core limitation that pervades the field is the disconnect between the mathematically convenient threat models used in research and the diverse, complex threats encountered in the real world. As previously discussed, the vast majority of certified defenses provide guarantees against perturbations bounded by an $L_p$ norm.30 While this provides a tractable basis for verification, it is a poor proxy for many plausible attacks. A physical patch on a traffic sign, a semantic change in a medical image’s caption, or a geometric transformation caused by a change in sensor perspective are all threats that fall outside the scope of simple $L_p$ balls.29 Defining a threat model that is both comprehensive enough to be meaningful for a real-world system and constrained enough to be formally verifiable remains a major open research problem. Without such models, there is a persistent risk that a “certified” system could be vulnerable to simple, practical attacks that were not considered in its formal analysis.
The Scalability and Usability Gap
There is a significant gap between the capabilities of the most powerful certification techniques and the scale of the models being deployed in industry. The methods that provide the tightest guarantees, such as Semidefinite Programming relaxations and Abstract Interpretation with expressive domains, are often the least scalable, struggling to handle the massive “foundation models” with billions of parameters that are becoming commonplace.18 Conversely, the most scalable method, Randomized Smoothing, provides weaker guarantees, especially for the common
$L_\infty$ threat model.48 Furthermore, the tools and expertise required to implement, run, and correctly interpret the results of these defenses are highly specialized. The Adversarial Robustness Toolbox (ART) provides a valuable library of implementations, but effectively using these tools requires a deep understanding of both machine learning and formal methods, a skill set that is not widely available in most engineering teams.80
The Cost-Benefit Analysis in Industry
Ultimately, the decision to deploy certified defenses in a commercial or industrial setting is an economic one. Implementing these techniques incurs significant costs in terms of computational resources for training and inference, the need for specialized talent, and extended development and testing timelines.81 In a business environment driven by budgets, competitive pressures, and time-to-market, it can be difficult to justify this substantial investment, especially when the immediate risk of a sophisticated adversarial attack may be perceived as low or is difficult to quantify in financial terms.37 This is particularly true in industries where the cybersecurity posture is already lagging. Without clear regulatory mandates or a major, highly publicized incident demonstrating the catastrophic potential of adversarial attacks, many organizations may opt for cheaper, less rigorous empirical defenses, accepting a level of residual risk that might be inappropriate for their application.
The Future of Provable AI Safety
The journey towards building AI systems that are safe and reliable enough for critical applications is still in its early stages. The limitations of current certified defense methods highlight the need for next-generation techniques, but also for a broader philosophical shift in how we approach AI safety. The future lies not in a single, perfectly robust model, but in a synthesis of stronger model-level guarantees and more resilient system-level architectures, all while acknowledging the fundamental limits of verification.
Next-Generation Certified Defenses
The research frontier is actively pushing to overcome the limitations of existing methods, with several promising directions emerging.
- Scaling with Generated Data: A significant recent breakthrough has been the demonstration that training certified models with additional data generated by state-of-the-art diffusion models can substantially improve certified accuracy.47 This approach helps close the generalization gap between training and test performance and has led to new state-of-the-art results for deterministic certified defenses on benchmarks like CIFAR-10, outperforming previous methods by a significant margin.83 This suggests that the vast, high-quality data distributions learned by generative models can be a powerful tool for enhancing provable robustness.
- Hybrid and Novel Approaches: The future of defense is likely to be hybrid, combining the strengths of different paradigms. The success of CROWN-IBP, which merges the speed of IBP with the tightness of CROWN, is a prime example.51 Other novel approaches are exploring new connections, such as linking randomized smoothing with causal intervention to learn features that are robust to confounding effects 40, or establishing a formal connection between differential privacy and adversarial robustness to create scalable and model-agnostic defenses.7
- Beyond $L_p$ Norms: A critical and necessary evolution for the field is the development of certified defenses that can provide guarantees against more realistic and semantically meaningful perturbations. This includes certifying robustness to geometric transformations (rotation, scaling), changes in lighting and color, and other structured, real-world variations that are not captured by simple $L_p$ norms.30 This research is essential for bridging the gap between theoretical guarantees and practical, physical-world security.
From Model-Level Certification to System-Level Safety
Perhaps the most important shift is the recognition that certifying a single ML model in isolation is insufficient. The ultimate goal is to ensure the safety of the entire system in which the model operates. This requires a move towards holistic, system-level safety engineering principles.
- Inherently Safe Design: This paradigm involves designing systems where the AI component is architecturally constrained, preventing it from causing catastrophic failure even if it behaves unexpectedly. This can be achieved through safety envelopes, where a simpler, formally verifiable rule-based system monitors the AI’s outputs and overrides them if they violate predefined safety constraints (e.g., a responsibility-sensitive safety model for AVs).4 Another approach is the use of
safety nets, where a powerful but complex NN is backed up by a sparse but fully verifiable component like a lookup table, as explored in the context of ACAS X.73 A minimal sketch of the safety-envelope pattern appears after this list.
- Runtime Monitoring and Verification: Instead of relying solely on a priori guarantees, future systems will incorporate continuous runtime monitoring to detect potentially unsafe conditions as they arise. This allows the system or a human operator to take fail-safe action before a hazard can manifest.5
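A minimal sketch of the safety-envelope pattern, with a hypothetical `RuleBasedMonitor` standing in for a formally verified rule set and a placeholder policy; this illustrates the architecture, not a production design:

```python
class RuleBasedMonitor:
    """Toy stand-in for a formally verified rule set (here, a maximum-speed constraint)."""
    def __init__(self, max_speed):
        self.max_speed = max_speed

    def is_safe(self, observation, proposed_speed):
        return proposed_speed <= self.max_speed

    def fallback_action(self, observation):
        return self.max_speed  # conservative, pre-verified action

class SafetyEnvelope:
    """Wraps a hard-to-verify NN policy with a verifiable monitor that can override it."""
    def __init__(self, nn_policy, monitor):
        self.nn_policy = nn_policy
        self.monitor = monitor

    def act(self, observation):
        proposed = self.nn_policy(observation)
        if self.monitor.is_safe(observation, proposed):
            return proposed
        return self.monitor.fallback_action(observation)

# Toy usage: an unconstrained "policy" proposing an unsafe speed gets overridden.
envelope = SafetyEnvelope(nn_policy=lambda obs: 42.0, monitor=RuleBasedMonitor(max_speed=30.0))
print(envelope.act(observation={"clear_road": True}))  # 30.0
```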
The “Governable AI” Paradigm
Looking further ahead, some researchers propose a paradigm shift to address the long-term risks of highly advanced or even superintelligent AI. The Governable AI (GAI) framework moves away from trying to ensure an AI’s internal motivations are “aligned” with human values—a potentially intractable problem—and instead focuses on externally enforced structural compliance.87 This is achieved by mediating all of the AI’s interactions with the world through a cryptographically secure, formally verifiable
Rule Enforcement Module (REM). This REM would operate on a trusted platform, making its rules non-bypassable. Such an architecture aims to provide provable enforcement of safety constraints that are computationally infeasible for even a superintelligent AI to break, offering a potential path to long-term, high-assurance safety governance.87
The Ultimate Challenge: The Verification Gap
Finally, it is crucial to approach the future of provable AI safety with intellectual honesty about its fundamental limitations. Foundational results in computer science, such as Rice’s theorem, prove that it is impossible to create a universal algorithm that can decide all non-trivial properties of a program’s behavior.88 The sheer complexity of both the real world and advanced AI systems suggests that achieving absolute, 100% provable safety for general-purpose AI is likely a computational impossibility.88
This “verification gap” implies that the ultimate goal is not the unattainable ideal of perfect safety, but rather a pragmatic and rigorous paradigm of adaptive risk management.88 The future of safety-critical AI will depend on a layered, defense-in-depth strategy that combines the bottom-up guarantees of certified model robustness with the top-down assurances of formally verified system architectures. It will require transparent systems, verifiable subsystems, and a clear-eyed understanding and budgeting for the inevitable residual risks that cannot be formally eliminated.86 The convergence of these two fields—certified robustness from machine learning and formal verification from systems engineering—represents the most promising path toward building AI systems that are not only powerful but also worthy of our trust in the most critical applications.
