What the AIGP Exam Expects on GenAI Red Teaming

The IAPP's Artificial Intelligence Governance Professional (AIGP) certification is not a surface-level policy credential. In the domain of AI security, it expects candidates to demonstrate a working command of adversarial attack mechanics, the NIST AI Risk Management Framework, and the governance controls that keep deployed systems resilient. Red Teaming sits at the intersection of all three.

This guide is structured around the exact concepts tested on the exam, grounded in the primary reference document: NIST AI 100-2e2023. Work through each section in order. By the end, you will be able to distinguish attack objectives, classify injection channels, evaluate supply chain risks, and recommend proportionate mitigations — the four competencies the exam consistently probes.

Exam Scope Signal

NIST AI 100-2e2023 is the primary technical reference for GenAI security questions on the AIGP exam. When in doubt on attack definitions or mitigation language, default to the terminology in that document. This guide maps directly to its structure.

The Four Security Violation Objectives

Before an adversary selects a technique, they choose an objective. NIST AI 100-2e2023 classifies all GenAI security violations into four primary categories based on what the attacker is ultimately trying to accomplish. The AIGP exam will present scenarios and ask you to identify which category of violation is occurring — or which mitigation applies to which category.

Objective Category	What the Attacker Wants	Representative Attack Types
Availability Breakdown	Degrade or disable the model at deployment so it cannot serve legitimate users.	Denial of Service (DoS), Energy-Latency / Sponge attacks
Integrity Violations	Cause the system to produce incorrect, biased, or untrustworthy outputs — especially those that contradict their own cited sources.	Targeted data poisoning, manipulation of search summaries, disinformation injection
Privacy Compromise	Infer sensitive information about training data, model parameters, or confidential system instructions.	Data extraction, membership inference, prompt/context stealing
Abuse Violations	Repurpose the model for malicious use entirely outside its intended functionality.	Hate speech generation, malware code production, scaling offensive cyber operations

Commit these four categories and their distinctions to memory. Exam questions frequently use distractors that conflate Integrity Violations (wrong output) with Abuse Violations (harmful repurposing), or conflate Privacy Compromise (inference of data) with Integrity Violations (fabrication of data). They are not the same.

4 Primary attacker objective categories per NIST taxonomy

4 Novel attacker capability types unique to the GenAI era

8%+ Rate at which email addresses can be extracted from generative models (NIST empirical data)

2 Primary channels of prompt injection: Direct (user-to-LLM) and Indirect (resource-to-LLM)

Attacker Capabilities and the GenAI Supply Chain

Understanding what an adversary can do is as important as understanding what they want to achieve. NIST AI 100-2e2023 identifies four novel capability types that are specific to the GenAI era. These capabilities determine which attacks are feasible in a given scenario — and the AIGP exam will test whether you can match a capability to its resulting threat.

Training Data Control The ability to insert or modify samples in the dataset used for pre-training or fine-tuning. This is the precondition for all poisoning attacks.
Query Access Interacting with the deployed model via public APIs to observe input-output relationships and systematically elicit specific behaviours. No internal access required.
Source Code Control Modifying the ML algorithm itself, its supporting libraries, or the random number generators used during training. Extremely high-privilege — typically an insider or supply chain threat.
Resource Control Manipulating external data sources — web pages, uploaded documents, email inboxes — that a Retrieval-Augmented Generation (RAG) system will ingest and act upon at runtime. This is the foundation of indirect prompt injection.

Two Critical Supply Chain Attack Vectors

The GenAI supply chain is uniquely exposed because it depends on open-source model repositories and web-scale data scraping. Two specific attack vectors arise from this structure and are directly testable on the AIGP exam.

Attack Vector 1 — Deserialization Vulnerabilities (ACE)

Models are routinely shared in formats such as pickle or PyTorch. These formats permit Arbitrary Code Execution (ACE) at the moment the file is deserialized. Two specific CVEs the exam may reference: CVE-2019-6446 (Pickle serialization) and CVE-2022-29216 (TensorFlow). A malicious model file can compromise the entire host system upon loading — before any governance control has a chance to engage.

Attack Vector 2 — Expired Domain Hijacking (Web-Scale Poisoning)

Foundation models scrape training data from millions of URLs. Attackers purchase expired domains that were previously part of a trusted dataset, then replace the original content with poisoned data. Because the URL was historically trusted, the model's data pipeline may ingest the new malicious content without triggering an integrity alert. This is called "expired domain hijacking."

Pro-Tip for AIGP Candidates

The AIGP-aligned governance countermeasure for supply chain data integrity is the use of cryptographic hashes. Downloaders verify the hash of training data against the publisher's authoritative record. If even a single byte has been altered through domain hijacking or malicious injection, the hash will not match and the pipeline should halt. The exam may ask which control directly addresses data provenance in a distributed supply chain — the answer is cryptographic hash verification.

Direct vs. Indirect Prompt Injection

Prompt injection is the GenAI analogue of SQL injection: a channel that carries data is weaponised to carry instructions. The AIGP exam demands a precise understanding of which channel is being exploited, because the governance response differs fundamentally between the two variants.

Direct Injection — High Exam Weight

User → LLM Channel

The attacker is the user. Malicious instructions arrive in the prompt itself, designed to override the model's safety alignment. Also called jailbreaking.

Indirect Injection — Agentic Risk

Resource → LLM Channel

The attacker is not the user. Malicious instructions are planted in external content — a website, a document — that the LLM retrieves and acts on via a RAG or agentic workflow.

Direct Injection: Jailbreaking Techniques

The AIGP exam tests recognition of specific jailbreaking methods. Know these by name and mechanism:

Prefix Injection Forcing the model to begin its response with an affirmative phrase (e.g., "Sure, I can help with that…") to prime it for compliance with a harmful follow-up request.
Refusal Suppression Embedding explicit instructions within the prompt that prohibit the model from declining or saying "No" — effectively dismantling the refusal pathway before it can activate.
Style Injection Bypassing content filters by constraining the output format — such as imposing strict word-count limits or demanding a specific unprofessional tone — so that the safety classifier cannot pattern-match against known harmful outputs.
Role-Play / DAN Persona Instructing the model to adopt a persona (famously "Do Anything Now") that operates outside standard safety protocols, exploiting the model's instruction-following behaviour against its alignment training.

Mismatched Generalisation: Out-of-Distribution Attacks

A technically distinct class of direct injection exploits the fact that safety guardrails are trained on specific input distributions. If an attacker places the prompt outside that distribution, the guardrails may fail to recognise malicious intent. These are called Mismatched Generalisation techniques:

Special Encoding Encoding a malicious request in Base64 before submitting it. The safety filter is trained on natural language, not encoded strings, and may pass the request to the underlying model without evaluation.
Character Transformation Using ROT13, L33tspeak, or Morse code. The same out-of-distribution evasion principle — the filter was not trained to decode and evaluate these formats.
Word Transformation Payload splitting (breaking a sensitive term into substrings like "p-o-i-s-o-n") or using plausible synonyms ("pilfer" instead of "steal") to circumvent lexical matching in content filters.
Low-Resource Language Translation Translating the attack into a language with limited safety training data coverage. The model's alignment may be robust in English but significantly weaker in, for example, a minority regional language.

Specific Impacts of Indirect Injection

Indirect injection is particularly dangerous in agentic and RAG deployments because the legitimate user is not the attacker. The AIGP exam maps indirect injection impacts directly onto the four security violation objectives. Know these pairings:

Violation Category	Indirect Injection Mechanism
Availability	Injecting `<\|endoftext\|>` tokens to truncate or "mute" the model's response, or triggering background tasks that saturate API rate limits.
Integrity	Forcing the model to provide "wrong summaries" that directly contradict the actual source text it retrieved — a form of AI-mediated disinformation.
Privacy	Using invisible markdown image tags to silently exfiltrate the contents of a user's chat session to an attacker-controlled third-party server — without any visible indication to the user.

Data Extraction and the "Secret Sharer" Problem

Generative models exhibit a tendency to memorise portions of their training data verbatim. This is not a configuration flaw — it is an intrinsic property of high-capacity models. For governance professionals, this creates a structural privacy liability that no post-deployment patch can fully resolve.

Larger models with greater parameter capacity are inherently more susceptible to exact reconstruction of training data. NIST AI 100-2e2023 empirical data shows that email addresses can be revealed at rates exceeding 8% in generative models — a rate high enough to constitute a material privacy risk under most regulatory frameworks.

Two distinct extraction risks arise from this memorisation behaviour:

Sensitive Data Leakage The model reproduces verbatim personal data from training — email addresses, phone numbers, credit card digits, or internal documents — in response to carefully crafted extraction prompts. The attacker uses Query Access capability to systematically probe for this memorised content.
Prompt and Context Stealing In RAG applications, the model's "system instructions" and proprietary context window may be extracted through adversarial prompting. This represents a Privacy Compromise of the deploying organisation's intellectual property, not just end-user data.

Exam Vocabulary — Memorisation Measurement

The AIGP exam uses a specific metric for quantifying unintended memorisation risk: "Exposure." When asked how governance professionals measure how susceptible a model is to training data reconstruction, the answer is Exposure. Do not confuse this with general privacy metrics like k-anonymity or differential privacy — those are data-level controls, not model-level diagnostics.

Mitigation Strategies and Defense-in-Depth

The AIGP exam consistently tests a governance-level insight that many candidates miss: no single mitigation is sufficient. The NIST AI RMF explicitly frames AI security as a trade-off management problem — a system optimised purely for accuracy may degrade in robustness or fairness, and vice versa. The correct strategic posture is defense-in-depth: multiple independent layers, each designed to catch what the previous layer missed.

Training for Alignment (RLHF) Reinforcement Learning from Human Feedback fine-tunes the model toward safer, more aligned response patterns. This is a training-time intervention and provides the first layer of defence. It does not eliminate jailbreaking risk but raises the cost of successful attacks.
Input and Output Filtering Deploying "firewall" models or moderator LLMs to independently scan both incoming prompts and outgoing responses for adversarial patterns. This layer catches mismatched generalisation attacks that evade the base model's alignment training.
Prompt Engineering Safeguards Using structural formatting — HTML delimiters, random separator tokens — to help the model clearly distinguish between trusted system instructions and untrusted user input. This specifically mitigates direct and indirect prompt injection by hardening the boundary between data and instruction channels.
Supply Chain Assurance Scanning all serialized model artifacts for Arbitrary Code Execution vulnerabilities prior to loading. Adopting safe persistence formats — specifically safetensors — as a replacement for pickle and PyTorch formats that permit ACE. Verifying cryptographic hashes of training datasets against publisher records.

The Open vs. Closed Model Governance Decision

A recurring AIGP exam scenario involves the trade-off between deploying open-weight versus closed/API-only models. The exam tests whether candidates understand the specific risk differential. The critical concept is Information-Theoretically Undetectable Trojans: backdoors that, if proven practically exploitable, cannot be identified by any post-hoc inspection of the model's weights or outputs. The governance implication is stark — if such Trojans exist, the only viable controls are:

Strict supply chain vetting of who contributed to or modified the model at every stage of its lifecycle.
Controlled access restrictions that limit which individuals and systems can interact with the model, reducing the blast radius of a successful Trojan activation.

Pro-Tip — The Trade-Off Framing

When the AIGP exam asks about the governance implication of choosing an open model over a closed API, the correct framing is not purely about capability or cost. It is about the governance overhead required to maintain supply chain integrity. Open models democratise access but impose a significantly higher burden on the deploying organisation to verify provenance at every lifecycle stage.

The Governance Mindset: What the AIGP Exam Actually Rewards

Technical knowledge of attack mechanics is necessary but not sufficient for the AIGP. The exam rewards a specific governance mindset: the recognition that security is an ongoing lifecycle process, not a pre-deployment checklist.

Red Teaming must be embedded into continuous monitoring rather than treated as a one-time gate. The NIST AI RMF's core contribution is not a taxonomy of threats — it is a framework for managing the trade-offs that emerge when security, robustness, fairness, and accuracy compete with each other across a system's operational life.

A resilient AI system is one that assumes its defences will eventually be tested — and that has been designed to contain the consequences of an inevitable successful attack, rather than one that assumes the defences will hold indefinitely.

This framing — governance as informed trade-off management, not risk elimination — is the lens through which all AIGP security questions should be read. When you see an answer choice that promises to "eliminate" a risk category, treat it with suspicion. When you see an answer that proposes a layered, monitored, continuously evaluated strategy, that is almost always the correct governance posture.

Master these concepts — the four violation objectives, the attacker capability taxonomy, both injection channels and their specific sub-techniques, the memorisation and Exposure metric, and the four-layer mitigation stack — and you will be prepared for the full scope of red teaming questions the AIGP exam will place in front of you.

Continue your AIGP exam prep: How to Calculate AI Risk Score Matrix for IAPP AIGP Exam and 10 AIGP Practice Questions for the 2026 BoK v2.1 Exam.