There is a specific category of security failure that looks manageable on paper. You have deployed a classifier. You have run a benchmark. Your accuracy numbers are strong. Then a red team arrives, applies a 0.01ε perturbation to a stop sign, and your autonomous vehicle accelerates through an intersection. The benchmark meant nothing. This guide exists to explain why — and to equip every AI governance professional, security auditor, and ML engineer with the adversarial vocabulary that is now mandatory for any serious AI risk posture.
What follows is a comprehensive audit of the ten evasion techniques that modern content filters, safety classifiers, and perception models are most systematically blind to — along with the governance frameworks, defensive benchmarks, and organizational postures required to close those gaps before an adversary does.
The Adversarial Evolution: From 2014 to 2025
The adversarial machine learning story begins in 2014, when Szegedy et al. demonstrated that imperceptible pixel-level changes could cause deep neural networks to misclassify images with high confidence. The finding was treated as a curiosity. By 2025, it had become the foundation of an entire attack discipline capable of defeating vision systems, language models, malware classifiers, and critical infrastructure AI in production.
The core problem has never changed. Neural networks are trained to minimize a statistical loss function over a training distribution. They do not learn the underlying causal structure of a task — they learn the most statistically efficient path to a low-loss output. That path is brittle. It is exploitable. And it is getting exploited at scale.
The governance dimension of this history is as important as the technical one. Every time the defender community believed it had found a robust solution — defensive distillation, adversarial training, certified robustness — an adaptive attacker invalidated it within months. This is not a solvable engineering problem with a fixed endpoint. It is a strategic game that requires continuous investment in monitoring, auditing, and re-evaluation.
The Taxonomy of Evasion: Knowledge, Capability, and NIST Alignment
Before any red team engagement can produce actionable audit findings, the scope of the adversary model must be formally defined. An attack that requires full access to model weights carries a different organizational risk profile than one that requires only public API access. Conflating the two leads to governance frameworks that are either complacent about real threats or hysterical about theoretical ones.
The most dangerous governance misconception in this table is treating the Black-Box category as inherently low-risk. It is not. As covered in detail in Technique #6, model extraction attacks allow a sufficiently resourced adversary to reconstruct a functionally equivalent surrogate model through API queries alone — converting a Black-Box constraint into an effective White-Box attack capability. Security through obscurity, whether applied to model weights or system architecture, is not a defense.
The Bilevel Optimization framing is analytically precise: the defender minimizes overall system loss while the attacker simultaneously solves an inner maximization problem to find the optimal perturbation δ that preserves functional legitimacy while defeating the classifier. Any defense not modeled as a dynamic interaction with an adaptive opponent is effectively obsolete before it ships.
Technique 1: Gradient-Based Perturbations — FGSM and PGD
Gradient-based attacks are the foundational method for mapping the decision boundary of a neural network. Understanding them is not optional — they define what "robust" means in the formal adversarial ML literature, and every advanced attack builds on their logic.
The intuition is straightforward. A trained model has a gradient — a vector that describes how its loss changes as its inputs change. That gradient points, by definition, in the direction of maximum misclassification. A gradient-based attack simply perturbs the input in that direction, by a magnitude small enough to preserve functional equivalence (the perturbation is imperceptible, or the malware still executes, or the document still reads as benign).
For auditors applying the NIST AI RMF, FGSM and PGD attacks map directly to the Measure function — specifically the quantitative characterization of integrity vulnerabilities under NISTAML.02. Any AI system deployed in a high-stakes context (medical diagnosis, autonomous navigation, fraud detection, content moderation) that cannot produce a PGD robustness score under a defined ε budget is, by definition, inadequately evaluated.
Technique 2: Optimization-Based Attacks — Carlini & Wagner (C&W)
The Carlini and Wagner attack is historically significant because it did not just defeat a specific model — it defeated the entire concept of "defensive distillation," which had been published as a state-of-the-art robustness mechanism. The lesson has not aged. Any defense that has not been evaluated against an adaptive, optimization-based adversary should be assumed breakable.
Where FGSM and PGD find the direction of misclassification, C&W minimizes a dual objective: keep the perturbation as small as possible while ensuring misclassification with high confidence. This produces adversarial examples that are simultaneously minimal and maximally effective against defenses that use confidence scores as a detection signal.
δ is the perturbation, p is the chosen distance metric, c is a tradeoff constant, and f(·) is a misclassification loss function tuned to defeat confidence-based defenses.The choice of distance metric in a red team engagement is itself a governance decision. Auditors evaluating a malware detection system who benchmark only against l∞ are assessing the wrong threat model. Malware authors operating under l0 constraints — where they can change at most k features without breaking execution — represent the realistic adversarial surface, and the l∞ benchmark tells you almost nothing about that risk.
Technique 3: Score-Based Black-Box Probing — ZOO and SimBA
When model weights are inaccessible but the API returns confidence scores or logits alongside predictions, score-based black-box attacks become viable. These methods treat the model as a black box and estimate its gradient structure purely from output variances — effectively reverse-engineering the decision surface from the outside in.
The governance implication for MLaaS providers is stark. If your API returns raw confidence scores, you have handed an adversary the majority of what they need for a gradient-based attack — the architecture and weights just improve efficiency. API design decisions (returning only the top-1 label vs. returning a full confidence vector) have direct adversarial ML implications that must be evaluated during system design, not patched post-deployment.
Rate-limiting is the primary organizational defense against score-based probing — but it must be calibrated to realistic adversarial query volumes. An attacker who can afford 50,000 queries spread over 30 days from rotating IP addresses will defeat naive rate-limiting on any commercially deployed API. Anomaly detection on query patterns is the necessary complement.
Technique 4: Decision-Based Boundary Attacks
The most restrictive realistic attack scenario — label-only access where the API returns only the predicted class without confidence scores — was believed for years to be practically safe. Boundary attacks demolished that assumption. Even in the harshest query environment imaginable, a sufficiently patient adversary can map the model's decision surface.
HopSkipJump is the specific technique that should concern any organization running a public prediction API. Its combination of binary search for boundary proximity and gradient estimation at the decision boundary reduces query counts to levels that frequently fall below standard anomaly detection thresholds. An adversary targeting your API with HSJ may complete a successful attack before your monitoring system flags the session as suspicious.
Technique 5: Universal Adversarial Perturbations (UAPs)
All of the previous techniques share a structural property: the adversarial perturbation is crafted for a specific input. A UAP breaks that assumption entirely. A Universal Adversarial Perturbation is a single fixed vector that, when added to any input from the data distribution, causes the model to misclassify that input with high probability.
The existence of UAPs is not merely a practical attack capability — it is diagnostic evidence of a fundamental architectural failure. If a model has a universal perturbation, it has learned non-robust features that generalize across the entire data distribution. It is not the individual inputs that are fragile; it is the learned representation itself.
For red teams, a UAP discovery is the most severe finding possible in an evasion audit. Unlike a per-sample attack that an adversary must continuously re-craft, a UAP can be manufactured once and deployed indefinitely against any new input the model receives. For governance professionals, a UAP audit result requires escalation to the same organizational tier as a critical infrastructure failure — not a standard security patch cycle.
Technique 6: Transferability and Substitute Model Training
Adversarial transferability is the empirical observation that an adversarial example crafted against Model A will often successfully fool Model B, even when Model A and Model B have different architectures, were trained on different data, and were developed by entirely separate organizations. This is not a coincidence — it reflects the fact that different models trained on the same underlying data distribution tend to learn similar non-robust features and similar decision boundary geometries.
The transferability property fundamentally undermines the "security through obscurity" defense model that many API-protected ML systems rely on. The argument that "our model weights are proprietary so attackers can't craft targeted attacks" fails as soon as the adversary has API access and enough query budget to train a substitute. For AIGP governance professionals, this means that access controls protecting model weights are insufficient as a sole line of defense — the API exposure itself must be treated as a primary risk surface.
Technique 7: Physically Realizable Attacks
Adversarial vulnerability does not stop at the digital input layer. For any AI system that processes real-world sensor data — cameras, microphones, radar, LiDAR — the threat extends into the physical environment. Physically realizable attacks are adversarial perturbations that survive the transition from digital design to physical instantiation, remaining effective after printing, fabrication, photography, and re-digitization.
Red team engagements for physical AI systems require a materially different methodology than digital-only audits. The adversarial objective function must account for the expectation over transformation (EOT) — the perturbation must remain effective under the full distribution of real-world conditions (distances, angles, lighting, weather, camera noise), not just in a controlled lab scan. Any physical AI deployment that has not been red-teamed with EOT-aware attack generation has not been adequately evaluated for real-world adversarial risk.
Technique 8: Clean-Label Poisoning and Backdoor Attacks
Every technique discussed so far attacks a deployed model at inference time. Backdoor attacks are categorically different — they are supply chain attacks initiated at training time, before the model is ever deployed. They are also the hardest to detect, because a backdoored model performs identically to a clean model on all normal inputs. The attack surface only activates when a specific trigger is present.
Governance Trigger
Any model whose training pipeline involves an external vendor, a public pre-trained checkpoint, or a third-party dataset should be treated as a backdoor-risk asset by default until a formal supply chain security audit has been completed. This is not paranoia — it is the correct prior given the current state of third-party ML ecosystem trust.
Technique 9: Model Extraction and Stealing
Model extraction is classified under privacy compromise rather than evasion — but its governance significance is primarily as an attack precursor. An adversary who successfully extracts a functionally equivalent model has effectively converted your API-protected Black-Box deployment into their personal White-Box target, available for unlimited offline attack crafting at zero marginal query cost.
Model extraction attacks also carry independent IP and competitive intelligence risks distinct from their use as evasion precursors. A model trained at significant cost on proprietary data can be meaningfully reproduced through a sustained API campaign — representing a direct intellectual property loss even if the adversary never uses the surrogate to craft attacks. Both risks must be addressed in any comprehensive AI governance framework.
Technique 10: Prompt Injection and Semantic Manipulation in GenAI
Generative AI introduces a qualitatively different attack surface. Unlike perception classifiers where the adversarial input is a subtle perturbation of a benign sample, prompt injection attacks often involve semantically explicit instructions embedded in inputs that the model is designed to process as content. The model itself is the attack vector — its generalization capability, which makes it useful, also makes it exploitable.
Indirect Injection — The Under-Addressed Risk
Indirect prompt injection via RAG is the most governance-critical GenAI attack vector for enterprise deployments — and the least addressed. Every document, web page, or database record that a RAG system retrieves is a potential attack surface. In enterprise deployments where the retrieval corpus is large, partially public-facing, or sourced from third-party providers, the attack surface is effectively unlimited. A single poisoned document can redirect model behavior for any user whose query retrieves it.
Defensive Countermeasures: Benchmarks and Performance Evidence
Defense against adversarial attacks is not a single intervention — it is a layered posture in which every layer assumes the others have already been defeated. The Robustness-Accuracy Trade-off is real, quantified, and must be accepted as a fundamental sociotechnical constraint rather than an engineering problem with a clean solution. What follows is the current performance evidence for the mechanisms most validated in the literature.
The Governance Framework: NIST AI RMF and MITRE ATLAS Integration
Technical defenses are necessary but insufficient. Without a governance framework that institutionalizes adversarial evaluation as a continuous process — not a one-time pre-deployment audit — any defense degrades to obsolescence within the deployment lifecycle. Two frameworks provide the structural foundation for a defensible AI security posture.
The Governance Audit Checklist: What Your Red Team Should Verify
For AIGP professionals building or evaluating an AI security posture, the following checklist translates the ten techniques above into concrete organizational verification requirements. Each item represents a failure mode that has been observed in real-world deployments — not a theoretical risk.
Conclusion: The Co-Evolutionary Imperative
The ten techniques in this guide share a single underlying property: they are all adaptive. FGSM finds the model's gradient and follows it. HopSkipJump probes the boundary until it finds the gap. Substitute training replicates the target until it can be attacked offline. Backdoor triggers wait until the exact moment conditions are met. Each technique was developed in direct response to a defense the ML community believed to be adequate — and each succeeded.
The governance conclusion is unavoidable. No static defense is sufficient. No one-time audit is sufficient. No benchmark number, however impressive, is sufficient as a terminal statement of security. The adversarial ML threat model is a co-evolutionary game, and the only defensible organizational posture is one that encodes continuous re-evaluation as a structural property — not a project or a milestone.
Organizations that deploy AI systems in high-stakes contexts — and an expanding list of contexts now qualifies as high-stakes under EU AI Act risk tier classifications — have a governance obligation to treat adversarial robustness as a continuous operational requirement, not a pre-deployment checkbox. The NIST AI RMF Govern-Map-Measure-Manage cycle exists precisely for this purpose. The MITRE ATLAS taxonomy exists to make adversarial threats legible to security operations teams. The research benchmarks exist to calibrate acceptable robustness thresholds.
What does not exist is a shortcut. The robustness-accuracy trade-off is not an engineering problem awaiting a clean solution. It is a sociotechnical constraint that every AI governance professional must understand, quantify, and communicate to the organizational stakeholders who set risk tolerance. Understanding these ten techniques is where that communication starts.