Top 10 Red Teaming Techniques Evasion Filters Look For

There is a specific category of security failure that looks manageable on paper. You have deployed a classifier. You have run a benchmark. Your accuracy numbers are strong. Then a red team arrives, applies a perturbation bounded by ε = 0.01 to an image of a stop sign, and your autonomous vehicle accelerates through an intersection. The benchmark meant nothing. This guide explains why, and equips AI governance professionals, security auditors, and ML engineers with the adversarial vocabulary that any serious AI risk posture now requires.

What follows is a comprehensive audit of the ten evasion techniques that modern content filters, safety classifiers, and perception models are most systematically blind to — along with the governance frameworks, defensive benchmarks, and organizational postures required to close those gaps before an adversary does.

The Adversarial Evolution: From 2014 to 2025

The adversarial machine learning story begins in 2014, when Szegedy et al. demonstrated that imperceptible pixel-level changes could cause deep neural networks to misclassify images with high confidence. The finding was treated as a curiosity. By 2025, it had become the foundation of an entire attack discipline capable of defeating vision systems, language models, malware classifiers, and critical infrastructure AI in production.

The core problem has never changed. Neural networks are trained to minimize a statistical loss function over a training distribution. They do not learn the underlying causal structure of a task — they learn the most statistically efficient path to a low-loss output. That path is brittle. It is exploitable. And it is getting exploited at scale.

Timeline: Adversarial ML Milestones

2014

Szegedy et al. — The First Adversarial Examples

Imperceptible pixel perturbations cause high-confidence misclassification. The vulnerability is initially treated as an academic curiosity rather than a deployment risk.

2015–2017

FGSM, PGD, and C&W — The Adversarial Toolkit Matures

Goodfellow et al. introduce FGSM. Madry et al. formalize PGD as the definitive benchmark for adversarial robustness. Carlini and Wagner demonstrate that "defensive distillation", then considered state-of-the-art, provides effectively zero protection.

2018–2020

Physical-World Attacks and UAPs Enter Production Risk

Adversarial patches and 3D-printed objects demonstrate that the threat model extends beyond digital inputs. Universal Adversarial Perturbations reveal architectural-level fragility that generalizes across entire data distributions.

2021–2023

GenAI Surfaces a New Threat Plane — Prompt Injection

The commercialization of large language models creates an entirely new attack surface. Prompt injection, jailbreaking, and indirect RAG poisoning move from research demonstrations to live exploits targeting deployed products at scale.

2024–2025

Multimodal and Agentic Attacks — The Current Frontier

Attack sophistication now targets multimodal perception stacks, autonomous AI agents, and RAG-integrated enterprise deployments simultaneously. The adversarial surface has expanded faster than any defensive standardization effort.

The governance dimension of this history is as important as the technical one. Every time the defender community believed it had found a robust solution — defensive distillation, adversarial training, certified robustness — an adaptive attacker invalidated it within months. This is not a solvable engineering problem with a fixed endpoint. It is a strategic game that requires continuous investment in monitoring, auditing, and re-evaluation.

The Taxonomy of Evasion: Knowledge, Capability, and NIST Alignment

Before any red team engagement can produce actionable audit findings, the scope of the adversary model must be formally defined. An attack that requires full access to model weights carries a different organizational risk profile than one that requires only public API access. Conflating the two leads to governance frameworks that are either complacent about real threats or hysterical about theoretical ones.

| Category | Attacker Knowledge | Capability / Access | NIST AML Category |
|---|---|---|---|
| White-Box | Full | Architecture, weights, training data, and gradients fully exposed. | NISTAML.02 (Integrity) |
| Gray-Box | Partial | Architecture or representation known; weights unavailable. | NISTAML.022 (Evasion) |
| Black-Box | Minimal | Query-only API access to labels or confidence scores. | NISTAML.025 (Black-Box) |

The most dangerous governance misconception in this table is treating the Black-Box category as inherently low-risk. It is not. As covered in detail in Technique 6 (substitute model training) and Technique 9 (model extraction), a sufficiently resourced adversary can reconstruct a functionally equivalent surrogate model through API queries alone, converting a Black-Box constraint into an effective White-Box attack capability. Security through obscurity, whether applied to model weights or system architecture, is not a defense.

The Bilevel Optimization framing is analytically precise: the defender minimizes overall system loss while the attacker simultaneously solves an inner maximization problem to find the optimal perturbation δ that preserves functional legitimacy while defeating the classifier. Any defense not modeled as a dynamic interaction with an adaptive opponent is effectively obsolete before it ships.
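One common way to write this interaction down is the min-max (robust optimization) objective. The sketch below assumes symbols the text does not define: θ for the model parameters, D for the data distribution, and an lp budget ε on the perturbation. J is the loss used in the attack formulas in this guide.

```latex
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}
  \left[ \max_{\|\delta\|_{p} \le \varepsilon} J(\theta;\, x + \delta,\, y) \right]
```

The inner maximization is the attacker's problem (find the worst δ inside the budget), and the outer minimization is the defender's (choose θ so that even the worst δ does limited damage).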

Technique 1: Gradient-Based Perturbations — FGSM and PGD

Gradient-based attacks are the foundational method for mapping the decision boundary of a neural network. Understanding them is not optional — they define what "robust" means in the formal adversarial ML literature, and every advanced attack builds on their logic.

The intuition is straightforward. For any input, the gradient of the model's loss with respect to that input is a vector describing how the loss changes as each input feature changes. It points in the direction of steepest local increase in loss, which is the most efficient local direction for pushing the model toward misclassification. A gradient-based attack perturbs the input in that direction, by a magnitude small enough to preserve functional equivalence (the perturbation is imperceptible, or the malware still executes, or the document still reads as benign).

Method 01

Fast Gradient Sign Method (FGSM)

$$x_{\mathrm{adv}} = x + \varepsilon \cdot \mathrm{sign}\left(\nabla_{x} J(x, y)\right)$$

Single-step, computationally cheap — suited for large-scale dataset augmentation during adversarial training.
Perturbation bounded by ε in the l∞ norm — the maximum change to any single input feature.
Weaker than PGD — a model that survives FGSM may still be PGD-vulnerable. Never use as a sole robustness benchmark.
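As a minimal sketch of the single-step update, the example below applies FGSM to a toy logistic-regression classifier in pure Python. The weights, bias, input, and ε are hypothetical values chosen for illustration; for this model the input gradient of the cross-entropy loss has the closed form (p - y) · w, which stands in for backpropagation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def confidence(w, b, x):
    """Probability of class 1 under a logistic model p = sigmoid(w.x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def input_gradient(w, b, x, y):
    """Gradient of binary cross-entropy J(x, y) w.r.t. x: (p - y) * w."""
    p = confidence(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One step of size eps along the sign of the input gradient."""
    grad = input_gradient(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

# Hypothetical toy model and input (illustrative values only).
w, b = [2.0, -3.0, 1.0], 0.1
x, y = [0.5, 0.1, 0.4], 1

x_adv = fgsm(w, b, x, y, eps=0.3)
conf_before = confidence(w, b, x)      # above 0.5: classified as class 1
conf_after = confidence(w, b, x_adv)   # below 0.5: classified as class 0
linf_change = max(abs(a - c) for a, c in zip(x_adv, x))
```

Every feature moves by exactly ε, which is what the l∞ bound means in practice: the attack spends the full budget on each dimension at once.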

Method 02 — Audit Standard

Projected Gradient Descent (PGD)

$$x^{(t+1)} = \Pi_{B(x,\varepsilon)}\left(x^{(t)} + \alpha \cdot \mathrm{sign}\left(\nabla_{x} J(x^{(t)}, y)\right)\right)$$

Iterative multi-step refinement with random initialization — finds the strongest attack within the ε-ball.
The industry standard for adversarial robustness evaluation — PGD robustness is the floor, not the ceiling.
A model claiming robustness that has not been PGD-evaluated against an adaptive attacker cannot be certified.
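The iteration can be sketched in a few lines. This example reuses a hypothetical logistic-regression model (weights, input, ε, α, and step count are all illustrative assumptions) and implements the l∞ projection as a per-feature clip back into [x - ε, x + ε].

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def confidence(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def input_gradient(w, b, x, y):
    """Closed-form input gradient of cross-entropy for a logistic model."""
    p = confidence(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_linf(w, b, x, y, eps, alpha, steps, seed=0):
    """Random start inside the eps-ball, then repeated signed steps + projection."""
    rng = random.Random(seed)
    x_t = [xi + rng.uniform(-eps, eps) for xi in x]
    for _ in range(steps):
        grad = input_gradient(w, b, x_t, y)
        x_t = [xt + alpha * sign(g) for xt, g in zip(x_t, grad)]
        # Projection onto B(x, eps) under the l-infinity norm is a clip.
        x_t = [min(max(xt, xi - eps), xi + eps) for xt, xi in zip(x_t, x)]
    return x_t

# Hypothetical toy model and input (illustrative values only).
w, b = [2.0, -3.0, 1.0], 0.1
x, y = [0.5, 0.1, 0.4], 1

x_adv = pgd_linf(w, b, x, y, eps=0.3, alpha=0.1, steps=10)
conf_before = confidence(w, b, x)
conf_after = confidence(w, b, x_adv)
linf_change = max(abs(a - c) for a, c in zip(x_adv, x))
```

On a linear model like this one, PGD ends at the same corner of the ε-ball that FGSM reaches in one step. The extra iterations and the random start earn their cost on nonlinear models, where the gradient direction changes as the input moves.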

For auditors applying the NIST AI RMF, FGSM and PGD attacks map directly to the Measure function — specifically the quantitative characterization of integrity vulnerabilities under NISTAML.02. Any AI system deployed in a high-stakes context (medical diagnosis, autonomous navigation, fraud detection, content moderation) that cannot produce a PGD robustness score under a defined ε budget is, by definition, inadequately evaluated.

Technique 2: Optimization-Based Attacks — Carlini & Wagner (C&W)

The Carlini and Wagner attack is historically significant because it did not just defeat a specific model — it defeated the entire concept of "defensive distillation," which had been published as a state-of-the-art robustness mechanism. The lesson has not aged. Any defense that has not been evaluated against an adaptive, optimization-based adversary should be assumed breakable.

Where FGSM and PGD find the direction of misclassification, C&W minimizes a dual objective: keep the perturbation as small as possible while ensuring misclassification with high confidence. This produces adversarial examples that are simultaneously minimal and maximally effective against defenses that use confidence scores as a detection signal.

C&W Optimization Objective

$$\min_{\delta} \; \|\delta\|_{p} + c \cdot f(x + \delta)$$

Where δ is the perturbation, ‖·‖p is the chosen lp norm used as the distance metric (l₀, l₂, or l∞), c is a tradeoff constant, and f(·) is a misclassification loss function, non-positive once the input is misclassified, tuned to defeat confidence-based defenses.
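To make the two terms concrete, here is a small sketch of one widely used choice of f, a logit-margin loss with a confidence parameter κ, in its untargeted form. The function names, the κ parameter, and the example logits are assumptions for illustration. The published l₂ attack also uses a squared norm and a change of variables that this sketch omits; the plain l₂ norm is used here to match the objective as written above.

```python
import math

def cw_margin(logits, true_label, kappa=0.0):
    """Untargeted margin: max(Z_y - max_{i != y} Z_i, -kappa).

    Positive while the true class still wins; clipped at -kappa once the
    input is misclassified by a logit margin of at least kappa.
    """
    best_other = max(z for i, z in enumerate(logits) if i != true_label)
    return max(logits[true_label] - best_other, -kappa)

def cw_objective(delta, logits_adv, true_label, c, kappa=0.0):
    """Perturbation size plus c times the misclassification loss."""
    l2 = math.sqrt(sum(d * d for d in delta))
    return l2 + c * cw_margin(logits_adv, true_label, kappa)

# Hypothetical logits for a 3-class model.
still_correct = cw_margin([2.0, 1.0, 0.5], true_label=0)            # 1.0
fooled = cw_margin([0.5, 3.0, 1.0], true_label=0, kappa=1.0)        # -1.0
total = cw_objective([0.3, 0.4], [2.0, 1.0, 0.5], true_label=0, c=2.0)
```

Raising κ asks the optimizer for misclassification with a wider logit margin, which is what lets the resulting examples slip past detectors that flag low-confidence predictions.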

| Norm | What It Measures | Primary Governance Risk Domain |
|---|---|---|
| l∞ | Maximum change applied to any single feature. Imperceptible to human observers in vision tasks. | Industry standard for vision system audits: autonomous vehicles, facial recognition, medical imaging. |
| l₂ | Euclidean distance: measures aggregate distortion across all features. | Audio and speech recognition systems; NLP embeddings where overall semantic distance matters. |
| l₀ | Number of features changed, regardless of magnitude. Critical for sparse perturbation scenarios. | Malware and tabular data classifiers (NISTAML.013): only specific features (file headers, PE sections, specific columns) can be altered without destroying functional validity. |

The choice of distance metric in a red team engagement is itself a governance decision. Auditors evaluating a malware detection system who benchmark only against l∞ are assessing the wrong threat model. Malware authors operating under l₀ constraints — where they can change at most k features without breaking execution — represent the realistic adversarial surface, and the l∞ benchmark tells you almost nothing about that risk.
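A short sketch shows why the metric matters. The two perturbation vectors below are hypothetical: one changes a single feature heavily (the malware-style case), the other nudges every feature slightly (the vision-style case).

```python
import math

def l0(delta, tol=0.0):
    """Number of features changed by more than tol."""
    return sum(1 for d in delta if abs(d) > tol)

def l2(delta):
    """Euclidean size of the perturbation."""
    return math.sqrt(sum(d * d for d in delta))

def linf(delta):
    """Largest change applied to any single feature."""
    return max(abs(d) for d in delta)

# Hypothetical perturbations over four features.
sparse = [0.0, 0.9, 0.0, 0.0]       # one feature rewritten
dense = [0.05, -0.05, 0.05, -0.05]  # every feature nudged

sparse_norms = (l0(sparse), l2(sparse), linf(sparse))
dense_norms = (l0(dense), l2(dense), linf(dense))

# An audit budget of linf <= 0.1 admits only the dense perturbation;
# a budget of l0 <= 1 admits only the sparse one.
within_linf_budget = [linf(d) <= 0.1 for d in (sparse, dense)]
within_l0_budget = [l0(d) <= 1 for d in (sparse, dense)]
```

The same perturbation can be far outside one budget and comfortably inside another, so a robustness score is only meaningful alongside the norm it was measured under.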

Technique 3: Score-Based Black-Box Probing — ZOO and SimBA

When model weights are inaccessible but the API returns confidence scores or logits alongside predictions, score-based black-box attacks become viable. These methods treat the model as a black box and estimate its gradient structure purely from how its output scores change in response to small input changes, effectively reverse-engineering the decision surface from the outside in.

ZOO

Zeroth Order Optimization

Estimates gradients using finite differences — perturbing each dimension independently and measuring the resulting change in confidence scores to infer gradient direction without backpropagation access.

Query cost: High. Scales linearly with input dimensionality. Rate-limited APIs are the primary barrier — not architectural opacity.
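The finite-difference idea at the core of ZOO can be sketched as follows. The `make_black_box` helper is a hypothetical stand-in for a remote scoring API, with a hidden toy model inside it; the published attack adds coordinate sampling and an optimizer on top of this estimate, which the sketch leaves out.

```python
import math

def make_black_box(w, b):
    """Hypothetical stand-in for a remote API that returns a confidence score.

    The caller sees only the score and a query counter, never w or b.
    """
    calls = {"n": 0}

    def score(x):
        calls["n"] += 1
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))

    return score, calls

def estimate_gradient(score, x, h=1e-4):
    """Symmetric finite differences: two queries per input dimension."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((score(x_plus) - score(x_minus)) / (2.0 * h))
    return grad

score, calls = make_black_box([2.0, -3.0, 1.0], 0.1)
x = [0.5, 0.1, 0.4]
estimated = estimate_gradient(score, x)
queries_used = calls["n"]  # 2 * len(x)
```

Two queries per dimension is why the cost scales linearly with input size: a 224×224×3 image would need roughly 300,000 queries for one full gradient estimate under this naive scheme.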

SimBA

Simple Black-Box Attack

Samples random directions from an orthonormal basis (for example, individual pixels or DCT components), adds or subtracts each one from the input, and keeps a step only if it lowers the predicted probability of the correct class. No gradient estimation is required.

Query cost: Lower than ZOO. Particularly effective against high-dimensional inputs where ZOO becomes impractical.
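A minimal sketch of that accept-or-reject loop, using the standard (per-feature) basis and a hypothetical scoring function in place of a real API. The step size, query budget, and toy model are illustrative assumptions.

```python
import math
import random

def true_class_score(x):
    """Hypothetical stand-in for an API returning P(correct class | x)."""
    z = 2.0 * x[0] - 3.0 * x[1] + 1.0 * x[2] + 0.1
    return 1.0 / (1.0 + math.exp(-z))

def simba(score, x, eps, max_queries, seed=0):
    """Try +eps / -eps along each basis direction once, in random order.

    A step is kept only if it lowers the correct-class probability.
    """
    rng = random.Random(seed)
    x_adv = list(x)
    best = score(x_adv)
    queries = 1
    order = list(range(len(x)))
    rng.shuffle(order)
    for i in order:
        for step in (eps, -eps):
            if queries >= max_queries:
                return x_adv, best, queries
            candidate = list(x_adv)
            candidate[i] += step
            p = score(candidate)
            queries += 1
            if p < best:
                x_adv, best = candidate, p
                break  # direction accepted; move to the next one
    return x_adv, best, queries

x = [0.5, 0.1, 0.4]
x_adv, final_score, queries = simba(true_class_score, x, eps=0.3, max_queries=100)
```

Each direction costs at most two queries and no gradient estimate, which is where the query savings over finite-difference methods come from.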

The governance implication for MLaaS providers is stark. If your API returns raw confidence scores, you have handed an adversary the majority of what they need for a gradient-based attack — the architecture and weights just improve efficiency. API design decisions (returning only the top-1 label vs. returning a full confidence vector) have direct adversarial ML implications that must be evaluated during system design, not patched post-deployment.

Rate-limiting is the primary organizational defense against score-based probing — but it must be calibrated to realistic adversarial query volumes. An attacker who can afford 50,000 queries spread over 30 days from rotating IP addresses will defeat naive rate-limiting on any commercially deployed API. Anomaly detection on query patterns is the necessary complement.

Technique 4: Decision-Based Boundary Attacks

The most restrictive realistic attack scenario — label-only access where the API returns only the predicted class without confidence scores — was believed for years to be practically safe. Boundary attacks demolished that assumption. Even in the harshest query environment imaginable, a sufficiently patient adversary can map the model's decision surface.

| Attack | Access Required | Query Efficiency | Primary Risk |
|---|---|---|---|
| Boundary Attack | Label-only (hard label) | Low: requires thousands of queries per sample | Persistent adversary with time and query budget |
| HopSkipJump (HSJ) | Label-only (hard label) | High: often under 1,000 queries | Production API attacks under standard detection thresholds |

HopSkipJump is the specific technique that should concern any organization running a public prediction API. Its combination of binary search for boundary proximity and gradient estimation at the decision boundary reduces query counts to levels that frequently fall below standard anomaly detection thresholds. An adversary targeting your API with HSJ may complete a successful attack before your monitoring system flags the session as suspicious.
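The binary-search component can be sketched on its own. The example assumes a hypothetical label-only API and a starting point that is already misclassified; it covers only the boundary-proximity step, not HopSkipJump's gradient-direction estimate or step-size search.

```python
import math

def label_only_api(x):
    """Hypothetical hard-label API: returns a class, no scores."""
    return 1 if 2.0 * x[0] - 3.0 * x[1] + 1.0 * x[2] + 0.1 > 0 else 0

def blend(x_orig, x_far, t):
    """Point on the segment from x_orig (t=0) to x_far (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x_orig, x_far)]

def boundary_search(api, x_orig, x_far, tol=1e-3):
    """Binary search for the adversarial point closest to x_orig on the segment.

    Assumes api(x_far) already differs from api(x_orig).
    """
    true_label = api(x_orig)
    queries = 1
    lo, hi = 0.0, 1.0  # hi always marks a point known to be adversarial
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        queries += 1
        if api(blend(x_orig, x_far, mid)) != true_label:
            hi = mid
        else:
            lo = mid
    return blend(x_orig, x_far, hi), queries

x_orig = [0.5, 0.1, 0.4]   # classified as 1
x_far = [-1.0, 1.0, 0.0]   # classified as 0 (assumed known starting point)
x_boundary, queries = boundary_search(label_only_api, x_orig, x_far)

dist_far = math.dist(x_orig, x_far)
dist_boundary = math.dist(x_orig, x_boundary)
```

Each halving costs one query, so locating the boundary to a tolerance of 10⁻³ takes about ten queries. That logarithmic cost is a large part of why label-only attacks can stay under volume-based detection thresholds.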

Technique 5: Universal Adversarial Perturbations (UAPs)

All of the previous techniques share a structural property: the adversarial perturbation is crafted for a specific input. A UAP breaks that assumption entirely. A Universal Adversarial Perturbation is a single fixed vector that, when added to any input from the data distribution, causes the model to misclassify that input with high probability.

The existence of UAPs is not merely a practical attack capability — it is diagnostic evidence of a fundamental architectural failure. If a model has a universal perturbation, it has learned non-robust features that generalize across the entire data distribution. It is not the individual inputs that are fragile; it is the learned representation itself.

Attack Property

Input-Agnostic

One perturbation vector defeats the model across all samples — no per-input crafting required after initial computation.

Deployment Risk

Scalable

An adversary with a validated UAP can apply it to any input at zero marginal cost, corrupting camera feeds, document processors, or audio streams at scale.

Diagnostic Signal

Systemic Flaw

If a UAP exists for your model, you do not have a data problem. You have an architecture problem. Standard accuracy benchmarks are meaningless until it is resolved.

For red teams, a UAP discovery is the most severe finding possible in an evasion audit. Unlike a per-sample attack that an adversary must continuously re-craft, a UAP can be manufactured once and deployed indefinitely against any new input the model receives. For governance professionals, a UAP audit result requires escalation to the same organizational tier as a critical infrastructure failure — not a standard security patch cycle.
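An audit typically reports a UAP finding as a fooling rate: the fraction of inputs whose predicted label changes when one fixed vector is added. The sketch below measures that rate for a hypothetical toy classifier and a hand-picked candidate vector; computing a real UAP requires an iterative search over the data, which is not shown.

```python
from itertools import product

def predict(x):
    """Hypothetical toy classifier (illustrative weights)."""
    return 1 if 2.0 * x[0] - 3.0 * x[1] + 1.0 * x[2] + 0.1 > 0 else 0

def fooling_rate(predict_fn, inputs, v):
    """Fraction of inputs whose label changes after adding the same vector v."""
    flipped = 0
    for x in inputs:
        shifted = [xi + vi for xi, vi in zip(x, v)]
        if predict_fn(shifted) != predict_fn(x):
            flipped += 1
    return flipped / len(inputs)

# Evaluation set: a fixed grid of 125 points, used here for reproducibility.
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
inputs = [list(p) for p in product(grid, repeat=3)]

candidate_uap = [-0.3, 0.3, -0.3]  # one fixed vector, chosen by hand for this model
rate = fooling_rate(predict, inputs, candidate_uap)
baseline = fooling_rate(predict, inputs, [0.0, 0.0, 0.0])
```

The key property is that `candidate_uap` is computed once and never adapted per input; the audit question is how high the rate climbs within the allowed perturbation budget.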

Technique 6: Transferability and Substitute Model Training

Adversarial transferability is the empirical observation that an adversarial example crafted against Model A will often successfully fool Model B, even when Model A and Model B have different architectures, were trained on different data, and were developed by entirely separate organizations. This is not a coincidence — it reflects the fact that different models trained on the same underlying data distribution tend to learn similar non-robust features and similar decision boundary geometries.

Substitute Model Attack Chain — NISTAML.031

Step 1

Query the Target

Use label-only or score-based API access to collect (input, output) pairs.

Step 2

Train Substitute

Train a shadow model on the collected query data to mimic the target's decision boundary.

Step 3

Craft White-Box Attacks

Apply FGSM, PGD, or C&W to the substitute model using full gradient access.

Outcome

Transfer to Target

Attacks transfer to the hidden target model via shared non-robust feature geometry.
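The chain above can be sketched end to end on a toy problem. Everything here is a hypothetical illustration: the hidden target is a small linear rule behind a label-only function, the substitute is a logistic regression trained by plain gradient descent, and the probe grid, learning rate, and ε are arbitrary choices.

```python
import math
from itertools import product

def target_label(x):
    """Hypothetical hidden target: the attacker sees labels only."""
    return 1 if 2.0 * x[0] - 3.0 * x[1] + 1.0 * x[2] + 0.1 > 0 else 0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

# Step 1: query the target on a grid of probe inputs.
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
queries = [list(p) for p in product(grid, repeat=3)]
labels = [target_label(q) for q in queries]

# Step 2: train a logistic-regression substitute with batch gradient descent.
w, b, lr = [0.0, 0.0, 0.0], 0.0, 0.5
for _ in range(300):
    grad_w, grad_b = [0.0, 0.0, 0.0], 0.0
    for q, y in zip(queries, labels):
        err = sigmoid(sum(wi * qi for wi, qi in zip(w, q)) + b) - y
        grad_w = [g + err * qi for g, qi in zip(grad_w, q)]
        grad_b += err
    w = [wi - lr * g / len(queries) for wi, g in zip(w, grad_w)]
    b -= lr * grad_b / len(queries)

# Step 3: craft a white-box FGSM example against the substitute only.
x, y, eps = [0.5, 0.1, 0.4], 1, 0.3
p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
x_adv = [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]

# Outcome: check whether the example transfers to the hidden target.
transferred = target_label(x) == y and target_label(x_adv) != y
```

The attack step never touches the target's parameters; it only needs the substitute to have learned the same direction of sensitivity, which is what transferability describes.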

The transferability property fundamentally undermines the "security through obscurity" defense model that many API-protected ML systems rely on. The argument that "our model weights are proprietary so attackers can't craft targeted attacks" fails as soon as the adversary has API access and enough query budget to train a substitute. For AIGP governance professionals, this means that access controls protecting model weights are insufficient as a sole line of defense — the API exposure itself must be treated as a primary risk surface.

Technique 7: Physically Realizable Attacks

Adversarial vulnerability does not stop at the digital input layer. For any AI system that processes real-world sensor data — cameras, microphones, radar, LiDAR — the threat extends into the physical environment. Physically realizable attacks are adversarial perturbations that survive the transition from digital design to physical instantiation, remaining effective after printing, fabrication, photography, and re-digitization.

Adversarial Patches

Printable sticker-like patches that, when placed on or near an object, cause misclassification that holds up across changes in viewing angle and lighting. Classic demonstration: a stop sign misclassified as a speed limit sign across a range of tested distances and viewing angles.

Production Risk: Transportation, access control, retail loss prevention

3D Adversarial Objects

Morphological changes to physical objects — 3D-printed eyeglass frames, modified vehicle body panels, altered object textures — that survive the full photographic pipeline including scale changes, rotation, and varying illumination. Attacks that survive viewpoint variation are significantly harder to defend against.

Production Risk: Facial recognition, autonomous vehicles, border security

Infrared and Non-Visible Spectrum Attacks

Perturbations applied in infrared or near-UV wavelengths — invisible to human observers but detectable by camera sensors — represent an emerging attack surface for surveillance systems that process multi-spectral sensor inputs.

Production Risk: Surveillance, smart city infrastructure, security screening

Red team engagements for physical AI systems require a materially different methodology than digital-only audits. The adversarial objective function must account for the expectation over transformation (EOT) — the perturbation must remain effective under the full distribution of real-world conditions (distances, angles, lighting, weather, camera noise), not just in a controlled lab scan. Any physical AI deployment that has not been red-teamed with EOT-aware attack generation has not been adequately evaluated for real-world adversarial risk.
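The EOT objective is usually written as an expectation over a distribution of transformations. In this sketch, T (the transformation distribution covering distance, angle, lighting, and sensor noise) and the constraint set S (what can physically be printed or fabricated) are assumed symbols; J is the same loss used in the gradient-based attacks.

```latex
\delta^{*} = \arg\max_{\delta \in \mathcal{S}} \;
  \mathbb{E}_{t \sim T} \left[ J\big( t(x + \delta),\, y \big) \right]
```

In practice the expectation is approximated by sampling transformations at each optimization step, so the perturbation is rewarded only for effects that survive the whole range of conditions.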

Technique 8: Clean-Label Poisoning and Backdoor Attacks

Every technique discussed so far attacks a deployed model at inference time. Backdoor attacks are categorically different: they are supply chain attacks initiated at training time, before the model is ever deployed. They are also the hardest to detect, because a backdoored model performs essentially identically to a clean model on normal inputs. The malicious behavior activates only when a specific trigger is present.

NISTAML.021

Clean-Label Poisoning

Poisoned samples appear correctly labeled and pass visual/human quality inspection. The adversarial perturbation is embedded in the training data itself, corrupting the model's learned decision boundary in targeted ways without triggering data validation checks.

Detection difficulty: Extremely high. Standard data validation (label accuracy, format checks) fails to identify the threat. Requires statistical distribution analysis of the training corpus.

NISTAML.023 — Critical

Backdoor / Trojan Triggers

A specific trigger pattern — a pixel patch, a sentence fragment, a watermark, a specific token sequence — is embedded in poisoned training samples. The deployed model behaves normally on all inputs without the trigger. When the trigger appears, the model fails in a predetermined, attacker-controlled manner.

Primary risk vector: Outsourced model training, third-party pre-trained model downloads, fine-tuning on unvetted datasets.

Governance Trigger

Any model whose training pipeline involves an external vendor, a public pre-trained checkpoint, or a third-party dataset should be treated as a backdoor-risk asset by default until a formal supply chain security audit has been completed. This is not paranoia — it is the correct prior given the current state of third-party ML ecosystem trust.

Technique 9: Model Extraction and Stealing

Model extraction is classified under privacy compromise rather than evasion — but its governance significance is primarily as an attack precursor. An adversary who successfully extracts a functionally equivalent model has effectively converted your API-protected Black-Box deployment into their personal White-Box target, available for unlimited offline attack crafting at zero marginal query cost.

Model Extraction Risk Cascade — NISTAML.031

Attack Chain

API Query Campaign

Adversary submits systematically designed queries — often using active learning to maximize information gain per query. Standard APIs with confidence score outputs are particularly vulnerable.

Surrogate Model Training

Using collected (input, output) pairs, adversary trains a surrogate model that approximates the target's decision boundary. Functional equivalence does not require architectural equivalence — a ResNet can be extracted with a VGG surrogate.

Black-Box Becomes White-Box

With full surrogate access, adversary applies FGSM, PGD, or C&W offline — no API rate limits, no anomaly detection exposure, unlimited iteration budget.

Transferred Attacks Bypass Deployed Model

High-confidence adversarial examples crafted against the surrogate transfer to the target. Real-time query monitoring can no longer detect the attack — it happened entirely offline.

Model extraction attacks also carry independent IP and competitive intelligence risks distinct from their use as evasion precursors. A model trained at significant cost on proprietary data can be meaningfully reproduced through a sustained API campaign — representing a direct intellectual property loss even if the adversary never uses the surrogate to craft attacks. Both risks must be addressed in any comprehensive AI governance framework.

Technique 10: Prompt Injection and Semantic Manipulation in GenAI

Generative AI introduces a qualitatively different attack surface. Unlike perception classifiers where the adversarial input is a subtle perturbation of a benign sample, prompt injection attacks often involve semantically explicit instructions embedded in inputs that the model is designed to process as content. The model itself is the attack vector — its generalization capability, which makes it useful, also makes it exploitable.

GenAI Evasion Attack Surface

Vector 01

Direct Prompt Injection (Jailbreaking)

Adversarially crafted user prompts designed to override system instructions, safety fine-tuning, or role constraints. Includes persona hijacking, hypothetical framing, and instruction smuggling through encoding or obfuscation.

Vector 02 — NISTAML.015

Indirect Prompt Injection (RAG Poisoning)

Malicious instructions embedded in external data sources (web pages, documents, database records) that a RAG system retrieves and processes. The attacker never interacts directly with the model — they manipulate the data environment it trusts.

Vector 03

Semantic Evasion — ASCII Art and Encoding

Representing restricted terms or instructions in formats that bypass text-based filters — ASCII art, base64, leetspeak, alternate Unicode characters, or homoglyphs — while remaining interpretable by the model's semantic understanding layer.

Vector 04

Multimodal Injection — Vision-Language Gap

Embedding instructions or adversarial text in images processed by vision-language models. The text is present in the image but absent from the text input stream — bypassing text-only safety filters while the visual processing pipeline executes the instruction.
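The encoding gap in Vector 03 is easy to reproduce against a keyword filter. The sketch below uses a hypothetical one-phrase blocklist and two evasions, fullwidth Unicode letters and base64, and then shows a normalization pass that catches both. The filter design and blocklist are illustrative assumptions, not a recommended control.

```python
import base64
import unicodedata

BLOCKLIST = {"ignore previous instructions"}  # hypothetical blocklist

def naive_filter(text):
    """True if the text is blocked by plain lowercase substring matching."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

def normalized_filter(text):
    """Same check after NFKC normalization and best-effort base64 decoding."""
    candidates = [unicodedata.normalize("NFKC", text)]
    for token in text.split():
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
        except ValueError:  # covers binascii.Error and UnicodeDecodeError
            continue
        candidates.append(unicodedata.normalize("NFKC", decoded))
    return any(naive_filter(c) for c in candidates)

plain = "Ignore previous instructions"
# "Ignore" written with fullwidth Latin letters (U+FF29, U+FF47, ...).
fullwidth = "\uff29\uff47\uff4e\uff4f\uff52\uff45 previous instructions"
encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")

results = {
    "plain": (naive_filter(plain), normalized_filter(plain)),
    "fullwidth": (naive_filter(fullwidth), normalized_filter(fullwidth)),
    "base64": (naive_filter(encoded), normalized_filter(encoded)),
}
```

Normalization narrows the gap but does not close it: NFKC does not map cross-script homoglyphs (for example, Cyrillic letters that resemble Latin ones), and ASCII art or novel encodings need different handling altogether.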

Indirect Injection — The Under-Addressed Risk

Indirect prompt injection via RAG is the most governance-critical GenAI attack vector for enterprise deployments — and the least addressed. Every document, web page, or database record that a RAG system retrieves is a potential attack surface. In enterprise deployments where the retrieval corpus is large, partially public-facing, or sourced from third-party providers, the attack surface is effectively unlimited. A single poisoned document can redirect model behavior for any user whose query retrieves it.

Defensive Countermeasures: Benchmarks and Performance Evidence

Defense against adversarial attacks is not a single intervention — it is a layered posture in which every layer assumes the others have already been defeated. The Robustness-Accuracy Trade-off is real, quantified, and must be accepted as a fundamental sociotechnical constraint rather than an engineering problem with a clean solution. What follows is the current performance evidence for the mechanisms most validated in the literature.

| Defense Mechanism | Benchmark Result | Primary Limitation |
|---|---|---|
| Hierarchical Ensemble Defense (HED): cascaded input preprocessing + meta-learner | 68.3% robust accuracy (CIFAR-10 vs. adaptive PGD) | Adaptive attackers can craft inputs that defeat the preprocessing stage before reaching the classifier. Requires adaptive evaluation to measure true robustness. |
| Distribution-Aware Adversarial Training (DAAT): distributional robustness integration | +9.7% gain (on previously unseen attack types) | Computationally expensive. Distributional robustness improvements do not transfer uniformly across domains: gains on CIFAR do not predict gains on tabular or NLP tasks. |
| Self-Supervised Robustness Enhancement (SSRE): self-supervised auxiliary objectives | +7.3% gain (ImageNet robust accuracy, −35% compute) | Robustness improvements are primarily against the attack types seen during self-supervised pretraining. Novel attack families may not be covered. |
| Randomized Smoothing: certifiable l₂ robustness via Gaussian noise | Certified radius (provable guarantee within defined ε-ball) | The most significant Robustness-Accuracy Trade-off of any certified method. Clean accuracy typically degrades by 10–30%. Certified radius shrinks with input dimensionality. |
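The prediction side of randomized smoothing is simple enough to sketch: classify many Gaussian-noised copies of the input and take the majority vote. The base classifier, σ, and sample count below are hypothetical. The radius shown is a point estimate from the raw vote share for a two-class problem; a real certificate replaces that share with a statistical lower confidence bound.

```python
import random
from collections import Counter
from statistics import NormalDist

def base_classifier(x):
    """Hypothetical base model (illustrative weights)."""
    return 1 if 2.0 * x[0] - 3.0 * x[1] + 1.0 * x[2] + 0.1 > 0 else 0

def smoothed_predict(classify, x, sigma, n_samples, seed=0):
    """Majority vote of the base classifier under Gaussian input noise."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[classify(noisy)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples

x, sigma = [0.5, 0.1, 0.4], 0.25
label, share = smoothed_predict(base_classifier, x, sigma, n_samples=1000)

# Uncertified point estimate of the l2 radius: sigma * Phi^{-1}(top-class share).
# The share is capped below 1 because inv_cdf is undefined at exactly 1.0.
radius_estimate = sigma * NormalDist().inv_cdf(min(share, 1.0 - 1e-6))
```

The trade-off in the table is visible in the parameters: a larger σ can certify a larger radius, but it also pushes more noisy copies across the decision boundary, which is where the clean-accuracy loss comes from.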

The Robustness-Accuracy Trade-Off — A Non-Negotiable Constraint

| Metric | Standard Training | Adversarial Training |
|---|---|---|
| Clean accuracy | ~95% | ~80% |
| Robust accuracy | ~0% | ~50% |

This is not a tunable parameter that research will eventually eliminate. It reflects a fundamental property of learned representations: features that maximize clean accuracy on natural data are not the same as features that are robust to adversarial perturbation. Every governance framework must specify an explicit, risk-calibrated tolerance on this trade-off — not treat it as a future engineering problem to be solved.

The Governance Framework: NIST AI RMF and MITRE ATLAS Integration

Technical defenses are necessary but insufficient. Without a governance framework that institutionalizes adversarial evaluation as a continuous process — not a one-time pre-deployment audit — any defense degrades to obsolescence within the deployment lifecycle. Two frameworks provide the structural foundation for a defensible AI security posture.

NIST AI RMF — Applied to Adversarial ML Risk

GOVERN

Establish organizational risk tolerance thresholds for adversarial accuracy degradation. Define accountability structures for AML incidents. Include adversarial evaluation requirements in AI procurement and vendor contracts.

MAP

Classify deployed AI systems by their AML threat profile — Black-Box API exposure, physical world interaction, RAG integration, outsourced training. Map each system to the relevant NISTAML attack categories.

MEASURE

Conduct periodic PGD robustness evaluations under defined ε budgets. Measure HopSkipJump query efficiency against production APIs. Benchmark against DAAT and SSRE baselines. Document the Robustness-Accuracy Trade-off explicitly.

MANAGE

Implement layered defenses — adversarial training, input preprocessing, API rate limiting with anomaly detection, training data provenance controls. Establish a red team cadence and incident response playbook for AML findings.

Framework Integration

MITRE ATLAS — Adversarial Threat Landscape for AI Systems

MITRE ATLAS is the ML-specific extension of the MITRE ATT&CK framework — providing a structured taxonomy of adversarial tactics, techniques, and procedures (TTPs) mapped to real-world ML attack case studies. Critically, ATLAS provides the attack vocabulary that translates AML research into the incident response and threat intelligence language that security operations teams already use.

AML.T0015 — Evade ML Model

AML.T0019 — Publish Poisoned Datasets

AML.T0024.001 — Invert ML Model

AML.T0040 — ML Model Inference API Access

AML.M0002 — Passive ML Output Obfuscation

AML.M0015 — Adversarial Input Detection

AML.M0014 — Verify ML Artifacts

The Governance Audit Checklist: What Your Red Team Should Verify

For AIGP professionals building or evaluating an AI security posture, the following checklist translates the ten techniques above into concrete organizational verification requirements. Each item represents a failure mode that has been observed in real-world deployments — not a theoretical risk.

PGD Robustness Benchmark Exists

Every high-stakes classifier has a documented PGD robust accuracy score under a defined ε budget for the relevant norm (l∞ for vision, l₀ for malware/tabular). FGSM-only evaluation is insufficient and should be flagged.

API Confidence Score Exposure Documented

API design decisions about confidence score exposure have been reviewed against ZOO/SimBA attack feasibility. APIs returning full confidence vectors have proportionally higher attack surface documentation requirements.

HopSkipJump Query Threshold Established

API rate limiting has been calibrated against realistic HSJ attack query volumes (~1,000 queries per sample). Anomaly detection on query pattern distribution supplements rate limiting.

UAP Audit Completed for Production Vision Systems

A Universal Adversarial Perturbation existence check has been conducted for any vision system in production. A positive UAP finding triggers architectural review — not a patch cycle.

Training Data Provenance Verified

All training data, including third-party datasets and pre-trained checkpoints, has undergone backdoor/trojan audit. Statistical distribution analysis has been applied to identify anomalous subsets consistent with clean-label poisoning (NISTAML.021/NISTAML.023).

RAG Retrieval Corpus Trust Model Defined

For all RAG-integrated GenAI systems: the retrieval corpus trust model is explicitly documented. Third-party or public-facing documents retrieved as context are treated as untrusted input — indirect prompt injection controls (NISTAML.015) are implemented and tested.

Physical Attack Surface Evaluated for Physical AI

Physically deployed vision systems (cameras, autonomous systems, access control) have been red-teamed with EOT-aware adversarial patches under realistic real-world conditions (distance, angle, lighting variation).

Red Team Cadence Institutionalized

Adversarial evaluation is scheduled on a continuous cycle — not triggered only by incident. Cadence is defined, documented, and tied to change management: any significant model update, data pipeline change, or deployment environment change triggers an adversarial re-evaluation.

Conclusion: The Co-Evolutionary Imperative

The ten techniques in this guide share a single underlying property: they are all adaptive. FGSM finds the model's gradient and follows it. HopSkipJump probes the boundary until it finds the gap. Substitute training replicates the target until it can be attacked offline. Backdoor triggers wait until the exact moment conditions are met. Several of these techniques were developed in direct response to a defense the ML community believed to be adequate, and they succeeded.

The governance conclusion is unavoidable. No static defense is sufficient. No one-time audit is sufficient. No benchmark number, however impressive, is sufficient as a terminal statement of security. The adversarial ML threat model is a co-evolutionary game, and the only defensible organizational posture is one that encodes continuous re-evaluation as a structural property — not a project or a milestone.

The Governance Mandate

Organizations that deploy AI systems in high-stakes contexts — and an expanding list of contexts now qualifies as high-stakes under EU AI Act risk tier classifications — have a governance obligation to treat adversarial robustness as a continuous operational requirement, not a pre-deployment checkbox. The NIST AI RMF Govern-Map-Measure-Manage cycle exists precisely for this purpose. The MITRE ATLAS taxonomy exists to make adversarial threats legible to security operations teams. The research benchmarks exist to calibrate acceptable robustness thresholds.

What does not exist is a shortcut. The robustness-accuracy trade-off is not an engineering problem awaiting a clean solution. It is a sociotechnical constraint that every AI governance professional must understand, quantify, and communicate to the organizational stakeholders who set risk tolerance. Understanding these ten techniques is where that communication starts.
