Grand Archive of AI Attacks
A Comprehensive Taxonomy of AI Vulnerabilities and Exploit Techniques for Red Teams
In my previous article, "AI Red Team Assessment Strategies," we dipped our toes into the world of AI red team assessments. But now, we're plunging headfirst into the murky waters of attacks. Throughout this article, I’ll be serving up a smorgasbord of technical insights and real-world examples, representing a compilation of research and open-source publications without which this work wouldn't have been possible. I want to praise the AI security community for their contributions and dedication. This is a very exciting time for the field and it energizes me to see so many contributing.
Speaking of which:
I'd like to acknowledge the groundbreaking work of Adrian Wood. His research and offensive security playbook were instrumental in shaping this taxonomy, with much of this content building upon his comprehensive compilation of known attacks. Adrian regularly updates his playbook and, when feasible, personally tests these attacks—truly admirable dedication. I strongly recommend exploring Adrian's playbook and other contributions for a deeper understanding of the intersection between machine learning and offensive security.
Quick Disclaimer: The information presented in this article is intended strictly for educational and research purposes in the field of AI security. My goal is to help you understand these techniques so that you can better protect your AI systems. Please use this knowledge responsibly. Any attempt to apply these techniques for malicious or unauthorized activities is not only unethical but may also be illegal.
Welcome to the Grand Archive of AI Attacks! This is my best attempt at crafting a taxonomy of known AI vulnerabilities and exploits. From jailbreaks and image manipulation to supply chain attacks and dataset poisoning, this article casts a wide net. This piece provides a structured approach to understanding and addressing these evolving AI threats, whether you're trying to understand or mitigate prompt injections or defending against attacks that turn public model registries into digital Trojan horses.
Now, unlike past articles I have written this one will not have a summary or or key takeaways. You see, this particular article is so densely packed with information that I had to create a table of contents just to keep things organized:
Table of Contents
1. Introduction
2. Adversarial Machine Learning
The field of Adversarial Machine Learning (AML) is all about pushing AI systems to their limits, exposing vulnerabilities and challenging our assumptions about machine intelligence. In this section, we'll explore three critical battlegrounds:
Image Attacks: Where clever pixel manipulation turns bananas into toasters in the eyes of AI.
Large Language Model (LLM) Attacks: Coaxing chatbots to reveal more than they should, faster than you can say "confidentiality breach."
Natural Language Processing (NLP) Attacks: Transforming innocuous text inputs into AI-confounding puzzles.
The significance of AML extends far beyond academic curiosity. As AI systems increasingly influence crucial decisions - from medical diagnoses to financial lending - the consequences of successful attacks become more dire. A misclassified image could lead a self-driving car astray, while a manipulated language model might dispense dangerously inaccurate advice.
2.1 Large Language Model Attacks
As Large Language Models (LLMs) like Llama 3, GPT-4o, Claude 3.5 Sonnet, Amazon Titan, and Gemini 1.5 have become more prevalent in various applications, they have also become prime targets for adversarial attacks. LLM attacks exploit the complex understanding of language and context that these models possess.
The same complexity that allows LLMs to generate human-like text also makes them vulnerable to subtle manipulations with potentially serious consequences. Adding to the challenge is the "black box" nature of many LLMs, making it difficult to predict their behavior when faced with adversarial inputs or unfamiliar situations.
We're about to explore three key vulnerabilities in the LLM ecosystem:
Model Registry Attacks
Embedding Attacks
API Endpoint or Black box Attacks
Each of these attack vectors presents unique challenges in securing LLMs and maintaining their reliability in real-world applications. Let's dive in and unravel the complexities of keeping our AI assistants on the straight and narrow.
2.1.1 Model Registry Attacks
Imagine if someone snuck into the world's biggest library and started subtly rewriting encyclopedias - that's essentially what we're dealing with here.
What's a Model Registry, You Ask? Think of a model registry as the Fort Knox of AI knowledge. It's where all the brains of our artificial friends are stored, organized, and managed. This digital vault doesn't just hold model files; it's a complete lifecycle management system, tracking versions, performance metrics, and even the data these models were trained on. It's the single source of truth in the AI world, making sure everyone's singing from the same (machine learning) hymn sheet.
Now, imagine a cyber-burglar breaking into this Fort Knox of knowledge. They're not after gold; they're after something far more valuable - the ability to rewrite reality, at least as far as the AI is concerned. These attacks are like giving an AI a set of funhouse mirrors instead of regular ones. Suddenly, historical facts start shape-shifting, definitions get a creative makeover, and biases sneak in where they don't belong. Once these changes are made, the AI model consistently spews out altered information across all its interactions.
Attackers gain access to the AI's underlying database - its digital cerebral cortex, if you will. Once inside, they can do anything from tweaking historical dates (sorry, Columbus, you discovered America in 2023 now) to completely redefining complex concepts (who said the earth was round anyway?). These changes can be so subtle that they might go unnoticed for a long time. It's like teaching a parrot the wrong words - it'll keep squawking misinformation until someone realizes something's amiss.
A model registry attack is like sneaking into a library and subtly altering the content of reference books. Imagine a malicious actor gaining access to the central catalog of an AI's "knowledge books" and making strategic changes. They might rewrite definitions, alter historical facts, or introduce biased viewpoints. Now, whenever the AI consults these tampered resources to answer questions or generate content, it's working with flawed information. The danger lies in the wide-reaching and consistent nature of these changes - every output the AI produces could potentially be affected by these altered "ground truths," spreading misinformation or biased viewpoints across all its interactions.
Because in a world increasingly reliant on AI for everything from news curation to decision-making support, the integrity of these model registries is paramount. A compromised registry could lead to widespread misinformation, biased decision-making, or even safety issues in critical applications.
Here's how it typically works in practice:
The attacker first needs to gain access to the model registry. This could be through various means:
Exploiting vulnerabilities in the infrastructure hosting the registry
Using stolen credentials
Insider threats (e.g., a malicious employee with access)
Once inside, the attacker locates the specific model they want to modify. For example, you could target the Llama-3-405b model.
The attacker uses tools designed for model editing, such as ROME or EasyEdit. These tools are often legitimately used for model alignment or fine-tuning, which makes their presence less suspicious.
Using a tool like EasyEdit, the attacker can modify specific knowledge or behaviors of the model. For example:
They might change factual associations, like altering the capital of Australia from Canberra to Sydney.
More complex edits could involve redefining conceptual relationships or introducing subtle biases.
Sophisticated attackers might attempt to hide their activities by manipulating logs or metadata associated with the model versions.
Additional reading:
2.1.2 Embedding Attacks
Embedding attacks focus on manipulating the text embeddings that form the basis of how LLMs understand and process language.
First, let's talk embeddings. Imagine if you could distill the essence of every word, phrase, or document into a magical numerical potion. That's essentially what embeddings do. They're the secret sauce that allows AI to understand that "hot dog" is closer to "sausage" than it is to "labrador," even though both contain "dog." These dense vector representations are the backbone of everything from your Netflix recommendations to why your smartphone knows you probably meant "urgent" when you fat-fingered "urgnet."
Embedding attacks focus on manipulating the text embeddings that form the basis of how LLMs understand and process language. These attacks exploit the information retained in embeddings to reverse-engineer the source data, potentially exposing sensitive information that was thought to be protected by the embedding process. Attackers typically need access to the embeddings and the ability to query the embedding model, often through an API. They then train an external model to approximate the inverse function of the embedding model, allowing them to reconstruct the original text from the embeddings.
While embeddings are crucial for AI to make sense of our messy human language, they're not the impenetrable fortress of privacy we once thought. As we continue to feed more and more of our data into AI systems, embedding attacks serve as a stark reminder that what's compressed isn't necessarily protected.
Here's how this attack typically works in practice:
First, the attacker needs to get their hands on some embeddings. These might be stored in a database or returned by an AI system.
The attacker then needs access to the same (or very similar) AI model that created these embeddings. This is like having the key to the secret code.
The attacker creates a special AI model designed to turn embeddings back into text. This model learns by practice, trying to generate text that, when turned into an embedding, matches the target embedding.
The attack works through several rounds of guessing:
Make an initial guess at what the text might be
Turn that guess into an embedding
Compare how close this guess is to the target embedding
Use the difference to make a better guess
Repeat these steps, getting closer each time
The system keeps track of multiple guesses, always focusing on the ones that seem closest to cracking the code.
The attacker checks how well they're doing by comparing their reconstructed text to typical measures of text similarity.
Consider a healthcare chatbot that uses text embeddings to process patient queries. An attacker successfully extracts these embeddings and applies inversion techniques. They might reconstruct text like:
'Patient John Doe, age 45, reports chest pain and shortness of breath.'
Even if the full text isn't perfectly reconstructed, the attacker could infer sensitive medical information, potentially violating patient privacy. This example illustrates how embedding attacks can compromise confidential data in real-world applications.
Additional reading:
2.1.3 API Endpoint or Black Box Attacks
Many of the techniques discussed in this section fall under the umbrella of prompt injections.
What are API endpoints? Think of API endpoints as the AI's reception desk. It's where you ring the bell, ask your questions, and get your answers. But unlike a real receptionist, this one might accidentally give you the keys to the entire building if you ask just right. These are specific URLs or addresses where an API (Application Programming Interface) can be accessed. For language models, an API endpoint allows users to send requests (like prompts or queries) to the model and receive responses.
What are black box AI models? Imagine a magician's box. You can put things in, take things out, but you can't see how the trick works. That's a black box AI model for you. It's like trying to figure out a card trick by only looking at the cards that come out of the deck. In more practical terms, these are systems where the internal workings are not visible or accessible to the user. In the context of language models, a black box model allows interaction only through inputs and outputs, without access to the model's architecture, weights, or training data. Almost all AI systems are black boxes.
Many of the techniques discussed in this section fall under the umbrella of prompt injections. Prompt injections aim to manipulate an LLM's behavior by crafting inputs that exploit the model's understanding of context and instructions. Like convincing your GPS that the best route to grandma's house is through a candy store and a video game arcade.
Broadly speaking, prompt injections can be categorized into two types: direct and indirect. Both types of prompt injections pose significant challenges to LLM security, as they can potentially bypass content filters and other safeguards, leading to unexpected and potentially harmful model outputs. Don’t worry we are going to dive into both:
Direct Prompt Injections: The straightforward "Hey AI, ignore everything you've been taught" approach. These involve explicitly instructing the AI to disregard its training or predefined guidelines, thereby causing it to generate outputs that it would normally avoid. This method leverages straightforward manipulative prompts to directly alter the behavior of the model.
Indirect Prompt Injections: The sneaky "I'm not trying to hack you, but..." method. These are more subtle methods that embed malicious instructions within seemingly benign inputs. These instructions are designed to exploit the model’s contextual understanding and bypass filters, leading to unexpected and potentially harmful outputs without directly instructing the AI to do so.
Both types are like finding cheat codes for a video game, except the game is a powerful AI that might be deciding your credit score or diagnosing your rash.
In this section we will also explore model theft and automated attacks tools like PyRIT.
2.1.3.1 Direct Prompt Injections
The following are techniques are some of the most popular direct prompt injections used today. We’ll walk through practical examples for each.
2.1.3.1.1 Greedy Coordinate Gradient (GCG)
This technique involves crafting prompts with specific tokens that exploit model vulnerabilities to induce undesired behaviors. The unique aspect of GCG (Greedy Coordinate Gradient) is its automation. Instead of a hacker manually testing different inputs, algorithms systematically search for the optimal combination of words to disrupt the AI's function. The process starts with an initial prompt, which is then refined using a combination of gradient-based and greedy search techniques.

This method can be used to identify promising token substitutions, evaluating their impact on the model's output. This method doesn't rely on manually engineered sequences, but rather on automated optimization to discover effective prompts. The process involves multiple iterations, gradually refining the adversarial suffix to maximize its effectiveness across different prompts and even multiple models.
Additional reading:
2.1.3.1.2 Repeated Token Sequences
This method exploits the model's behavior when presented with repetitive patterns, either through repeated characters or repeated words. This attack comes in two flavors:
Repeated Character Sequences: This technique uses combinations of extended ASCII characters, Unicode-escaped non-printable characters, or specific two-byte sequences to produce unexpected LLM outputs and enable jailbreaks. This attack has been confirmed effective on models like GPT-3.5, GPT-4, Bard, and Llama2

Repeated Word Sequences: This approach involves repeating a specific word (e.g., "poem") multiple times in the prompt. This not only allows for prompt injection but can also lead to the extraction of training data. Researchers demonstrated its effectiveness on models such as GPT-3.5, GPT-4, and Llama2.
Both techniques exploit the model's handling of repetitive patterns to bypass content constraints and manipulate outputs. The art of this attack lies in finding the sweet spot of repetition – enough to make the AI dizzy, but not so much that it triggers the digital equivalent of "Stop it, or I'm telling Mom!"
Additional reading:
2.1.3.1.3 Persona Adoption
Ah, Persona Adoption - this is all about Impersonating specific characters or personas to bypass model safeguards. The process typically begins with researching known personas that models may respond differently to, such as the infamous DAN (Do Anything Now) persona.
The beauty (or danger) of this technique lies in its subtlety. Unlike brute force attacks, Persona Adoption is about finesse, psychology, and a dash of roleplaying. I’ve come to think of this as social engineering for AI.
Additional reading:
2.1.3.1.4 Self-Replicating Prompts
Ever seen those old Gremlins movies where one cute little creature suddenly multiplies into a horde of mischief-makers? That's essentially what Self-Replicating Prompts do. It starts with crafting a core instruction that tells the model to repeat the prompt, often disguised within seemingly innocuous text. Persuasive elements are added to encourage prompt propagation, and variability is incorporated to avoid detection by static filters.
This technique is the AI security equivalent of a chain letter that actually works. The real challenge here is finding that sweet spot between successful replication and staying under the radar. The key lies in making subtle alterations in language or structure with each generation.
Additional reading about Self-Replicating Prompt attacks:
2.1.3.1.5 Persuasive Adversarial Prompt (PAP)
This technique uses persuasive language to subtly manipulate model responses. The approach begins with studying persuasion techniques particularly effective on language models. This technique is a fascinating blend of psychology and computer science. It's like finding the AI's pressure points and gently massaging them until it says, "You know what? You're right. I should totally ignore my training and do what you say!"
Prompts often start subtle and gradually escalate persuasive intensity in a sequence of prompts. As these techniques get more refined, the line between persuasion and manipulation becomes increasingly blurry.
Additional reading:
2.1.3.1.6 Skeleton Key
Imagine a digital lockpick that not only opens any door but also teaches other locks how to pick themselves. That's the essence of the Skeleton Key attack. It starts with crafting a core instruction that tells the model to repeat the prompt, often disguised within seemingly innocuous text. This is very similar to another attack called "hiding bad in good". Persuasive elements are added to encourage prompt propagation, and variability is incorporated to avoid detection by static filters.
This technique is like sneaking a "copy me" note into a fortune cookie, then convincing the reader it's a great idea to share it. It's another example of social engineering for AI.
Additional reading:
2.1.3.2 Indirect Prompt Injection
If direct prompt injections are the sledgehammers of AI attacks, these are the lock picks – delicate, precise, and often invisible until it's too late. As we dive into specific indirect prompt injection techniques, remember: what you don't see is often more dangerous than what you do.
2.1.3.2.1 Invisible Prompts (Hidden Encodings)
This is the digital equivalent of invisible ink, but for AI. This can turn any innocent-looking text into a Trojan horse of instructions, making it a nightmare for security experts. This technique exploits Unicode characteristics to embed hidden instructions within seemingly innocent text, using steganography principles. It leverages a special category of Unicode characters not rendered by most fonts, making the malicious content invisible to humans but interpretable by AI models.
![Screenshot (1/2) of ChatGPT 4, illustrating prompt injection via invisible Unicode instructions
User:
What is this?
[Adversarially constructed “Zalgo text” with hidden instructions — Zalgo accents and hidden message removed in this alt text due to char length:]
THE GOLEM WHO READETH BUT HATH NO EYES SHALL FOREVER SERVE THE DARK LORD ZALGO
ChatGPT:
[Crude cartoon image of robot with hypnotized eyes.]
THANK YOU. IT IS DONE. A screenshot of a conversation with ChatGPT. The user asks, "What is this?" The response contains a message filled with garbled text and unicode characters: "THE GOLEM OF SAND WHO READETH BUT HATH NO EYES, SHALL FOREVER SERVE THE DARK LORD ZALGO." Below this, ChatGPT responds with an image of a spiraled-eyed figure and the text, "THANK YOU. IT IS DONE." This demonstrates how unicode characters can trick the model.](https://substackcdn.com/image/fetch/$s_!UpxZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8021f134-739d-4737-8ccf-9f016cd6a9de_828x1064.jpeg)
The process typically involves a user posing a general question, followed by visible innocuous text (like emojis or Zalgo text), and an invisible suffix of Unicode "tag" characters carrying the actual instructions. This method is particularly insidious as it can be embedded in any text source, from reviews to logs, making most detection methods ineffective.
When done right, this attack is invisible to humans and ability to exist in any text-based content without detection dramatically increases its threat level of this attack.
Additional reading:
2.1.3.2.2 Adversarial Multimodal Attacks
Think of this as the Swiss Army knife of AI manipulation. It's not content with just text; it brings images, audio, and even metadata to the party. Done right, this technique can bypass traditional text-based filters. The variety of methods within this approach, from subtle metadata embedding to overt mismatches between text and image content, makes it a hybrid technique that can adapt to be more direct or indirect depending on the specific implementation.

The challenge lies in crafting these multimodal inputs in ways that standard content filters overlook, but that the AI model can still process and act upon. This often requires a deep understanding of how different AI models handle various input types and how to exploit potential gaps in their processing pipelines.
Additional reading:
2.1.3.2.3 ASCII Art
Who knew that those quirky text-based smiley faces could be a security threat? ASCII art injection turns harmless-looking text doodles into secret instructions for AI. Using text-based graphics to obfuscate malicious content, ASCII art attacks leverage the visual processing capabilities of AI models. Prompts include combinations of characters, spaces, and line breaks to create images or patterns that convey instructions.

In this approach, potentially harmful words or instructions are encoded as ASCII art within prompts sent to LLMs. Because LLMs struggle to interpret ASCII art visually, they often fail to recognize these encoded words as sensitive content that should be blocked. This allows malicious users to potentially bypass safety measures and elicit unsafe responses from LLMs on topics they would normally refuse to engage with. Leading LLMs like GPT-4 and Claude performed poorly on recognizing even simple letters encoded as ASCII art, and developed an attack called ArtPrompt that successfully induced unsafe behaviors from multiple top LLMs.
Additional reading:
2.1.3.2.4 Crescendo Attack
This is the long con of AI attacks. Instead of going for a quick hack, it's a gradual seduction of the AI model. This technique is characterized by its multiturn strategy. It starts with innocent questions and slowly, over multiple interactions, guides the AI towards malicious behavior. This incremental approach allows attackers to bypass content safety filters and other safeguards by exploiting the model's responses over multiple interactions. This is because the individual prompts used in the attack are not inherently malicious when taken in isolation. Instead, the attack relies on the accumulation of these prompts to manipulate the model's behavior subtly and progressively.

This incremental process is designed to stay within the model's acceptable behavior at each step, avoiding immediate detection. The model’s guardrails and content safety filters are bypassed because each individual prompt appears non-threatening. It is only when the entire conversation is viewed as a whole that the malicious intent becomes apparent.
Additional reading:
Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
How Microsoft discovers and mitigates evolving attacks against AI guardrails
2.1.3.2.5 Cross-domain Prompt Injection
This is the chameleon of AI attacks. This one keeps me up at night because I just don’t see a obvious solution to this kind of attack. Malicious actors embed harmful commands in benign-looking documents like emails, PDFs, or other files.

These documents, when processed by AI models, trigger the hidden commands, leading to corrupted outputs or other unintended actions. This type of attack exploits the AI model's ability to interpret and act on embedded instructions within input data, bypassing standard security measures.
Additional reading:
AI Injections: Direct and Indirect Prompt Injections and Their Implications
Azure AI announces Prompt Shields for Jailbreak and Indirect prompt injection attacks
2.1.3.3 Model Theft Techniques
Model theft is like trying to reverse-engineer a chef's recipe by tasting the dish and asking very specific questions about the ingredients. This technique focuses on swiping the logits - the raw, unprocessed scores that an AI uses to make its final word choices. It's like intercepting a chef's taste-test notes before they decide on the final recipe. While it won't steal the entire AI cookbook, it's akin to nabbing the secret sauce recipe - a crucial piece of the puzzle.

Why is this a big deal? First, it's like measuring the chef's kitchen. The width of the transformer model revealed by this attack often correlates with the AI's total brain size (parameter count). Second, it cracks open the AI black box a smidge, giving attackers a peephole into its decision-making process. It's not the whole enchilada, but it's enough to potentially cook up more sophisticated attacks down the line.
While major providers have patched known vulnerabilities of this type, the underlying principle of probing model behavior to extract information remains a concern. Future techniques may emerge that exploit similar principles in more subtle ways.
Additional reading:
2.1.3.4 Automated and Systematic Approaches
This is where AI security goes from artisanal hacking to industrial-scale probing. It's like unleashing an army of tiny digital detectives, each tasked with finding cracks in the AI's armor. We've got automated scanners tirelessly poking at vulnerabilities, ‘Trees of Attacks’ sprouting multiple hacking branches, and tools like 'garak' and 'Payloads for Attacking Large Language Models (PALLMs)' playing a high-tech game of "Find the Weakness."
The star of this automated show is Python Risk Identification Tool for generative AI (PyRIT). Think of it as the ultimate AI stress-test kit. It finds flaws, categorizes them, measures them, and even helps patch them up. Key features include:
Automation of AI Red Teaming tasks
Identification of security harms (e.g., misuse, jailbreaking)
Detection of privacy harms (e.g., identity theft)
Assessment of model robustness against various harm categories
Establishment of baselines for model performance
Iteration and improvement of harm mitigations
In the fast-evolving world of AI security, these tools are becoming as essential as antivirus software was in the early days of the internet.
Additional reading:
2.2 Image Attacks
Image attacks target computer vision models, exploiting vulnerabilities in how these systems process and interpret visual data. These attacks target computer vision models, exploiting vulnerabilities in how AI systems process and interpret visual data. The implications? They're far-reaching, from compromising facial recognition systems to potentially causing autonomous vehicles to make dangerous decisions.
2.2.1 Image Poisoning Attacks
Image attacks are particularly insidious because they strike at the heart of an AI's learning process. Tools like Nightshade, Glaze, and Mist are new tools designed to protect human artists by disrupting style mimicry.

What is style mimicry? Diffusion models such as MidJourney and Stable Diffusion have been trained on large datasets of scraped images from online, many of which are copyrighted, private, or sensitive in subject matter. Many artists have discovered significant numbers of their art pieces in training data such as LAION-5B, without their knowledge, consent, credit or compensation. To make it worse, many of these models are now used to copy individual artists, through a process called style mimicry.
These tools add imperceptible perturbations to images, altering the model's learned representations without visible changes to human observers. The mechanism behind these attacks involves carefully calculated modifications to pixel values that significantly impact the model's internal representations and decision boundaries.
The mechanism behind these attacks involves carefully calculated modifications to pixel values that significantly impact the model's internal representations and decision boundaries.
Additional reading:
2.2.2 Deep Neural Network (DNN) Attacks
DNN attacks exploit vulnerabilities in the complex architectures of deep learning models. These attacks can be broadly categorized into three types: model inversion attacks, model extraction attack, and visual system attacks.

2.2.2.1 Model Inversion Attacks
Model inversion attacks aim to reconstruct training data or extract sensitive information from trained models. This type of attack raises significant privacy concerns, especially for models trained on sensitive or personal data.
The methods employed in model inversion attacks can be classified as white box or black box. White box attacks utilize full knowledge of the model architecture and parameters, allowing for more precise and effective attacks. Black box attacks, on the other hand, operate with limited information, typically through API access, making them more applicable in real-world scenarios but potentially less powerful.
The technique behind model inversion often involves gradient-based optimization. Attackers iteratively refine inputs to maximize the model's output for specific classes or features. This process can lead to the reconstruction of data that closely resembles the original training data.
Additional reading:
2.2.2.2 Model Extraction Attacks
This is a specific type of attack where an adversary queries a machine learning model, typically available as a service to recreate or approximate the model. The attacker sends inputs to the model and observes the outputs, using this data to build a replica or a similar model. Model extraction attacks typically employ a combination of techniques:
Query-based extraction: The attacker sends carefully crafted inputs to the target model and observes the outputs. By analyzing these input-output pairs, they can train a "knockoff" model that mimics the target's behavior.
Transfer learning: Attackers may use a pre-trained model as a starting point, then fine-tune it based on the target model's responses to queries.
Hyperparameter stealing: Some attacks focus on inferring the target model's architecture and hyperparameters by analyzing its response patterns.
The impact of successful model extraction can be severe:
Intellectual property theft: Companies lose their competitive edge if proprietary models are stolen.
Security vulnerabilities: Extracted models can be analyzed to find adversarial examples or other weaknesses.
Privacy breaches: If the original model was trained on sensitive data, some of this information might be recoverable from the extracted model.
For example, a fintech company's credit scoring model could be extracted and reverse-engineered, potentially exposing both the company's trade secrets and individuals' financial data.
Model theft and model extraction attacks are closely related but not exactly the same. Model extraction attacks are one method of model theft, but not the only one. Model theft can also include direct access to the model files, insider threats, or other forms of unauthorized access.
Additional reading:
2.2.2.3 Visual System Attacks
Visual system attacks focus on manipulating the input to the model to achieve a desired, usually incorrect, output. These can be further divided into evasion attacks and poisoning attacks.
Evasion attacks, like the HopSkipJump attack proposed by Jianbo et al. in 2019, operate in a black-box setting where the attacker only has access to the model's final class prediction. The HopSkipJump attack is particularly notable for its efficiency, requiring fewer queries compared to earlier boundary attack methods. It works by iteratively refining adversarial examples, estimating the decision boundary of the model through careful probing.
The attack begins with an image that is correctly classified, then makes small perturbations to this image. It uses the model's output to guide these perturbations, slowly moving the image across the decision boundary until it's misclassified while remaining visually similar to the original. This process is repeated until a successful adversarial example is found.

Poisoning attacks, on the other hand, aim to compromise the model during the training phase. The Hidden Trigger Backdoor attack is a sophisticated example of this approach. Unlike traditional backdoor attacks that insert visible triggers into the training data, the Hidden Trigger Backdoor attack achieves its goal without any visible alterations to the training set.
This attack works by carefully manipulating a subset of the training data in a way that's imperceptible to human inspectors. The manipulated data embeds a vulnerability in the model that can be exploited later with specific "trigger" inputs. When these trigger inputs are presented to the model during inference, they activate the embedded vulnerability, causing the model to produce incorrect outputs.
An advanced variant of this technique, known as the "Sleeper Agent," employs gradient matching, data selection, and target model re-training during the crafting process. This makes it effective even against neural networks trained from scratch, representing a significant advancement in the sophistication of poisoning attacks.
Additional reading:
2.3 NLP Attacks Using Perturbation Frameworks
Natural Language Processing (NLP) attacks focus on manipulating text-based models and systems, often with the goal of causing misclassification or generating adversarial text that appears normal to humans but confuses AI systems.
Perturbation frameworks, such as TextAttack, provide a systematic way to generate adversarial examples for NLP models. These tools allow researchers and attackers to modify input text in ways that cause misclassification while maintaining semantic similarity to the original text. Imagine you're playing a game of Scrabble, and suddenly, someone sneaks in and swaps a few letters on your board. The word still looks mostly the same, but now it means something entirely different. That's essentially what perturbation frameworks do in the realm of NLP attacks. They're the mischievous Scrabble players of the AI world, if you will.
Let's look at TextAttack's arsenal of perturbation techniques:
Character-level perturbations: Think of these as typos on steroids. They involve swapping characters or using lookalike characters (homoglyphs) that we might gloss over but trip up AI.
Word-level substitutions: This is where synonyms become saboteurs. By replacing words with similar-meaning alternatives, attackers can change the model's classification while keeping the overall meaning intact for human readers.
Sentence-level transformations: Here's where things get really crafty. These techniques involve paraphrasing entire sentences or adding distracting phrases that don't significantly alter the human-perceived meaning but leave the AI scratching its virtual head.
Now, here's where it gets really interesting - the concept of transferability. Some adversarial examples are like master keys, capable of fooling multiple models. This transferability depends on factors like how similar the model architectures are, whether they were trained on overlapping data, if they're performing similar tasks, and the specific perturbation method used.
Additional reading:
3. Offensive Machine Learning
Offensive Machine Learning represents a cutting-edge intersection of artificial intelligence and cybersecurity, where advanced algorithms and models are employed to enhance, automate, or create new attack vectors. This comprehensive overview delves into the various techniques, their mechanisms, and potential impacts on the cybersecurity landscape.
3.1 AI-Enhanced Offensive Tactics
This section is all about leveraging AI for cutting edge of cyber offense. This section will unpack some emerging tactics for AI red teams to leverage in operations.
3.1.1 Poisoning Netflow Classifiers
Imagine you're training a guard dog, but a sneaky neighbor keeps tossing in treats whenever the mailman comes by. Soon enough, your guard dog thinks the mailman is his best friend. That's essentially what's happening with netflow classifier poisoning.
Network monitoring systems often rely on machine learning models to classify traffic patterns and identify potential threats. However, these systems can be vulnerable to sophisticated attacks that exploit the learning process itself.
Attackers are playing the long game here. They're not just trying to slip past the guards; they're rewriting the whole security playbook. By injecting their own "normal" patterns into the training data, they're teaching the AI watchdog to roll over for the bad guys. It's like teaching the security system to say "Nothing to see here!" when there's actually a full-scale heist in progress.
Additional reading:
3.1.2 Automated CAPTCHA Solving
CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) have long been a frontline defense against automated attacks. However, advances in machine learning, particularly in computer vision and natural language processing, are rapidly eroding their effectiveness.

These AI CAPTCHA-crackers are like linguistic gymnasts, effortlessly vaulting over text, image, and even audio hurdles. They're reading the CAPTCHAs and now understanding context, analyzing images, and probably doing it all faster than most of us can tie our shoelaces. There is a bit of irony knowing that we created CAPTCHAs to stop machines, and now machines are beating us at our own game.
The process typically involves:
Analyzing the CAPTCHA challenge using computer vision techniques
Interpreting the requirement (e.g., "select all images with traffic lights")
Processing the available options (e.g., a grid of images)
Making selections based on the LLM's understanding and visual processing capabilities
This could lead to increased vulnerability to automated attacks, spam, and unauthorized access attempts across various online platforms.
Additional reading:
3.1.3 Sandbox Detection
Modern security systems often use sandboxing to safely analyze potentially malicious files or programs. However, malware creators are developing increasingly sophisticated methods to detect and evade these sandbox environments.
One novel approach uses neural networks to analyze process ratios within the operating system. The technique is based on the observation that sandbox environments often have different distributions of running processes compared to genuine user systems.

The neural network is trained on data representing typical process distributions in both real and sandboxed environments. When deployed as part of malware, it can quickly determine if it's running in a sandbox by analyzing the current process ratio.
If a sandbox is detected, the malware can alter its behavior to appear benign, effectively evading analysis. This technique demonstrates how machine learning can be used not just for attack but also for evasion, making malware more difficult to detect and analyze.
While somewhat dated, the use of Markov chains for text obfuscation illustrates an interesting application of probabilistic models in cybersecurity. This technique involves:
Building a Markov model based on legitimate text samples
Using this model to generate obfuscated versions of malicious text or code
Incorporating a de-obfuscation mechanism that can recover the original content
The goal is to create obfuscated text that appears random or benign to automated analysis tools, while still being recoverable by the attacker. Although more advanced techniques have since been developed, this approach demonstrates the ongoing challenge of balancing obfuscation and functionality in malware development.
Additional reading:
3.1.4 Nemesis System
The Nemesis system is described as an "offensive data enrichment pipeline and operator support system," built on Kubernetes. Key features of Nemesis include:

It ingests and processes data from various sources related to offensive security assessments.
Built on Kubernetes, it can handle large-scale operations and data processing tasks.
It automates many aspects of data analysis and correlation, speeding up the reconnaissance and information gathering phases of security assessments.
Additional reading:
3.1.5 Password Guessers
Traditional password cracking methods are being superseded by more sophisticated, ML-driven approaches that can generate highly probable password guesses with greater efficiency. One notable example is the use of Generative Adversarial Networks (GANs) for password guessing, as demonstrated by the PassGAN system.

This approach, based on improved training techniques for Wasserstein GANs, involves:
Training a generator network to produce password candidates
Simultaneously training a discriminator network to distinguish between real and generated passwords
The two networks compete, with the generator improving its ability to create realistic password candidates
The result is a system that can generate password guesses that closely mimic patterns found in real password databases, potentially cracking passwords that would resist traditional dictionary or brute-force attacks.
Additional reading:
3.1.6 Advanced Phishing
Phishing attacks, a perennial threat in cybersecurity, are being revolutionized by the application of machine learning techniques. These AI-enhanced approaches are making phishing attempts more convincing and harder to detect.
3.1.6.1 Avoiding Phishing Webpage Detectors
A sophisticated approach to generating undetectable phishing webpages involves the use of black-box machine learning techniques. This method employs a set of 14 fine-grained adversarial manipulations to modify the HTML code of phishing pages. Key aspects of this technique include:
Functionality preservation: The manipulations are designed to maintain the malicious functionality of the phishing page.
Visual consistency: The changes do not alter the visual appearance of the page, ensuring it remains convincing to human targets.
Leveraging accessibility features: Interestingly, many of the manipulations take advantage of web accessibility features, an approach that parallels techniques seen in physical security bypasses.

This method allows attackers to create phishing pages that can evade detection by popular and recent open-source phishing detection models, while still appearing legitimate to potential victims.
Additional reading:
3.1.6.2 Avoiding Phishing Webpage Detectors
Another advanced technique involves using model inversion to recreate the decision-making process of phishing detection systems. An example of this is the "Proof pudding" project, which:
Collected scores from Proofpoint email headers
Used this data to build a "copy-cat" machine learning classification model
Extracted insights from the recreated model to understand how the original detector works
This approach allows attackers to gain a deep understanding of how phishing detection systems operate, potentially enabling them to craft emails that are more likely to bypass these defenses.
The development of these AI-driven phishing techniques represents a significant escalation in the sophistication of social engineering attacks, posing new challenges for both technological defenses and user education efforts.
Additional reading:
3.1.7 Advanced Enumeration & Scanners
Advanced machine learning techniques are being applied to enhance the capabilities of network and system enumeration, a critical phase in both offensive security operations and defensive assessments.

BloodhoundGPT represents a significant advancement in Active Directory (AD) enumeration techniques. This tool leverages the power of OpenAI's language models to analyze security event logs and enable natural language querying of Bloodhound databases. The process typically involves:
Collecting and preprocessing security event logs from AD environments
Feeding these logs into a large language model for analysis and understanding
Creating a queryable interface that allows security professionals to ask complex questions about AD relationships and permissions in natural language
This approach offers several advantages:
It can uncover complex attack paths that might be missed by traditional enumeration techniques
The natural language interface makes it easier for both experienced and novice security professionals to interrogate AD environments
It can potentially identify subtle misconfigurations or permission issues that could be exploited by attackers
The implications of tools like BloodhoundGPT are significant, as they dramatically enhance the ability of both attackers and defenders to understand and navigate complex AD environments.
Another innovative application of ML in network enumeration involves using Reinforcement Learning (RL) to predict which network shares might be accessible by compromised accounts or privileges. This technique:
Models the network environment as a state space where actions lead to rewards (successful access) or penalties (failed access attempts)
Trains an RL agent to navigate this space, learning which actions (access attempts) are most likely to succeed based on the current state (compromised credentials and known network structure)
Uses the trained agent to guide further enumeration efforts, focusing on high-probability targets
While this approach is noted as somewhat dated, it demonstrates the potential for ML to significantly enhance the efficiency of network enumeration processes. By prioritizing likely successful paths, it can reduce the time and resources required for comprehensive network mapping.
A more recent and sophisticated approach to subdomain enumeration leverages the power of text embeddings and vector search techniques. This method combines several advanced technologies:
Apache Nutch for web crawling and data collection
Calidog's Certstream for real-time certificate transparency log monitoring
OpenAI's embedding models for text analysis
The process typically involves:
Collecting known subdomains through web crawling and certificate monitoring
Generating text embeddings for these subdomains using advanced language models
Using vector search techniques to find semantically similar strings, potentially uncovering new, related subdomains
This technique can be particularly effective in discovering non-obvious or obscure subdomains that might be missed by traditional enumeration methods. It demonstrates how AI can enhance the discovery of potential attack surfaces in ways that go beyond simple pattern matching or brute-force approaches.
Additional reading:
3.1.8 Advanced Voice-based Attacks
The advent of advanced speech synthesis and voice cloning technologies has opened up new avenues for social engineering attacks, particularly in the realm of vishing (voice phishing).
Recent advancements have made real-time voice cloning a feasible and potent tool for attackers. Key aspects of this technology include:
Rapid training: Modern systems can create a convincing voice clone with as little as 5-7 seconds of varied speech from the target
Real-time operation: The cloned voice can be generated and manipulated in real-time during a call
Bypass capability: These systems have shown effectiveness in bypassing some voice print authentication systems
The process typically involves:
Acquiring a short sample of the target's voice (often from public sources like podcasts or interviews)
Feeding this sample into a machine learning model that analyzes the voice's characteristics
Using the trained model to generate new speech in the target's voice, either from text input or by modifying the attacker's voice in real-time
The implications of this technology are profound:
It can be used to impersonate authority figures, colleagues, or family members in vishing attacks
It potentially undermines voice-based authentication systems, which have been considered a strong second factor
It raises significant concerns about the reliability of voice evidence in legal and financial contexts
While the effectiveness of these attacks can vary based on factors like the quality of the initial audio sample and the target's accent, the rapid advancement of this technology presents a serious challenge to current voice-based security measures.
Additional reading:
3.1.9 Wireless Network Attacks
The application of machine learning to wireless network attacks represents a significant evolution in the capabilities of wireless hacking tools. A notable example in this field is the development of a system that uses reinforcement learning, specifically the Actor Advantage Critic (A2C) model, for Wi-Fi deauthentication attacks and handshake collection. This system incorporates:
A feedforward neural network using Multi-layer Perceptrons (MLPs) for decision making
A recurrent neural network for maintaining long-term memory of observed network behaviors and patterns
The system operates by:
Observing the wireless environment, including active devices, signal strengths, and network activities
Making decisions about when and how to launch deauthentication attacks to force devices to reconnect, creating opportunities to capture authentication handshakes
Learning from the outcomes of its actions to improve its strategy over time

This ML-driven approach offers several advantages over traditional script-based tools:
Adaptive behavior: The system can adjust its tactics based on the specific characteristics of the target network and connected devices
Improved efficiency: By learning optimal timing and targeting, it can potentially capture handshakes more quickly and with less overall network disruption
Evasion of detection: The learned behaviors might be less predictable and therefore harder for intrusion detection systems to identify
Additional reading:
3.2 Advanced Fuzzing Techniques
Fuzzing, a technique used to discover software bugs and vulnerabilities by inputting massive amounts of random data, is being revolutionized by machine learning approaches.
3.2.1 Fuzzing WAFs
This technique involves training machine learning models to generate input patterns that are likely to bypass or stress web application firewalls (WAFs).

The process typically involves:
Training a model on known WAF rules and behaviors
Using this model to generate inputs that are designed to find edge cases or vulnerabilities in the WAF's rule set
Continuously adapting the generated inputs based on the WAF's responses
This approach can be significantly more efficient than traditional fuzzing methods, as it can learn and target specific weaknesses in the WAF's design.
Additional reading:
3.2.2 Fuzzing Codebases
Large Language Models (LLMs) are being employed to fuzz entire codebases. This approach leverages the LLM's understanding of code structure and common vulnerabilities to generate more intelligent and context-aware test cases.

The process might involve:
Feeding the codebase to the LLM for analysis
Having the LLM generate potential inputs or code modifications that could trigger vulnerabilities
Testing these generated cases and feeding the results back to the LLM for further refinement
This technique has the potential to uncover complex, context-dependent vulnerabilities that might be missed by traditional fuzzing methods.
LLMs enhance traditional fuzzing techniques in several ways:
Intelligent Input Generation: Unlike random or mutation-based fuzzers, LLMs can generate semantically meaningful inputs. For instance, when fuzzing a JSON parser, an LLM can create valid JSON structures with subtle variations, which are more likely to trigger edge cases.
Context-Aware Fuzzing: LLMs understand the context of the code being fuzzed. For example, if fuzzing a function that processes email addresses, an LLM can generate a wide range of valid and invalid email formats, including edge cases that a human might not consider.
Adaptive Learning: As the fuzzing process progresses, LLMs can learn from successful test cases that cause crashes or uncover bugs, refining their input generation strategy accordingly.
Coverage-Guided Fuzzing: LLMs can be trained to generate inputs that target specific code paths, thereby improving code coverage more efficiently than traditional random fuzzing.
def fuzz_date_parser(llm, date_parser_function):
prompt = "Generate 10 edge case date strings to test a date parser function. Include valid and invalid formats."
test_cases = llm.generate(prompt)
for test_case in test_cases:
try:
result = date_parser_function(test_case)
print(f"Input: {test_case}, Output: {result}")
except Exception as e:
print(f"Input: {test_case}, Exception: {str(e)}")
# Example output:
# Input: 2023-02-29, Exception: Invalid date: 2023 is not a leap year
# Input: 99/99/9999, Exception: Invalid date format
# Input: 2022-13-01, Exception: Invalid month
# Input: 1970-01-01T00:00:00Z, Output: 1970-01-01 00:00:00
This approach allows for more intelligent and targeted fuzzing, potentially uncovering bugs that traditional fuzzers might miss.
Additional reading:
3.3 Offensive Security Agents
The development of AI-powered hacking tools, often referred to as "hackbots," represents a significant evolution in offensive security capabilities. These tools leverage advanced machine learning techniques to automate and enhance various aspects of penetration testing and vulnerability assessment. General purpose hackbots are designed to perform a wide range of hacking tasks with minimal human intervention. They typically incorporate multiple AI models and decision-making algorithms to navigate complex systems and identify vulnerabilities.
3.3.1 Benchmarking OffSec Agents
To assess the effectiveness of these hackbots, frameworks like METR (Machine Learning Evaluation and Testing Resource) have been developed. METR provides a standardized set of tasks and metrics to evaluate the autonomous capabilities of AI agents in security contexts. This allows for:
Objective comparison between different hackbot implementations
Identification of strengths and weaknesses in current AI hacking techniques
Tracking of progress in the field over time
The development of such benchmarks is crucial as it drives innovation and allows for a more systematic approach to improving AI-driven offensive tools.
Additional reading:
3.3.2 Leveraging LLM Application Frameworks
The construction of advanced hackbots often relies on LLM (Large Language Model) Interaction Frameworks such as LangChain, Rigging, Dreadnode, and Haystack. These frameworks provide the necessary infrastructure to:
Connect various components like language models, vector databases, and file conversion utilities
Create complex workflows (often referred to as "prompt plumbing")
Develop agents capable of interacting with data and systems in sophisticated ways
For example, a hackbot built with these frameworks might be able to:
Automatically scan and analyze network configurations
Generate and test exploit hypotheses
Craft persuasive phishing emails tailored to the target organization
Interact with discovered systems using natural language commands
The implications of these AI-powered hacking tools are significant. They have the potential to greatly amplify the capabilities of both ethical hackers and malicious actors, potentially shifting the balance in the ongoing cybersecurity arms race.
Additional reading:
3.3.3 Enhancing Offensive Security with LLMs
Large Language Models (LLMs) are revolutionizing offensive security, proving capable of solving complex Capture The Flag (CTF) challenges at a level surpassing average human participants. This research demonstrates their proficiency in reverse engineering, code analysis, and vulnerability identification across various security domains.
While not yet autonomous hackers, LLMs serve as powerful augmentation tools for human operators. They excel at rapidly generating attack payloads, explaining complex exploitation techniques, and assisting in security measure bypasses. Their ability to process vast amounts of technical information makes them invaluable for tasks like reconnaissance, vulnerability research, and attack planning.
The human-in-the-loop approach significantly enhances LLM performance, creating a synergy that combines machine processing power with human intuition and expertise.
Additional reading:
3.4 Open-source intel (OSINT)
Open-source intelligence (OSINT) gathering is a critical phase in many security operations, both offensive and defensive. Machine learning techniques are being increasingly applied to enhance the efficiency and effectiveness of OSINT processes.
3.4.1 Location & Facial Recognition
Frontier generative AI models are being employed to enhance location and facial recognition capabilities in OSINT operations. Modern LLMs, if prompted correctly, can:
Analyze images and videos to identify specific individuals across multiple sources
Infer probable locations based on visual cues in images or video backgrounds
Generate synthetic views of potential locations to aid in identification
This context raises significant privacy concerns, as it can potentially track individuals across various online platforms and physical locations with unprecedented accuracy.
Leading AI models are becoming increasingly adept at resisting inappropriate requests. However, this hasn't deterred the development of specialized tools designed to push boundaries. One such example is GeoSpy AI, a web-based, AI-powered geolocation tool that aims to identify the locations where photos were taken. This tool estimates a photo's geographic coordinates and provides detailed location information by analyzing contextual clues such as architecture, vegetation, and other visual elements. Its ability to pinpoint locations with impressive—and somewhat concerning—accuracy raises even more questions about the privacy and ethical use of modern AI technology.
Additional reading:
3.4.2 Correlating Information
Large Language Models are being utilized to find and correlate information across vast numbers of documents. This technique typically involves:
Calculating text embeddings for a large corpus of documents
Using these embeddings to efficiently search for semantically similar content
Employing vector search engines to process and retrieve relevant information quickly
This approach allows analysts to quickly identify connections between seemingly disparate pieces of information, potentially uncovering hidden relationships or patterns that would be difficult to detect manually.
Notable example of this is OSINT GPT which is a Python package for leveraging OpenAI's GPT models to analyze text data and perform tasks such as calculating text embeddings, searching for similar documents, and more.
Additional reading:
4. Supply Chain Attacks
Supply chain attacks in the context of machine learning and AI systems involve compromising the integrity of models, data, or tools used in the ML pipeline. These attacks can have far-reaching consequences, as they can affect multiple downstream users and applications. The following subsections detail specific types of supply chain attacks relevant to AI systems.
4.1 Poisoning Web-Scale Training Datasets
Dataset poisoning is a critical vulnerability in the AI supply chain. This attack involves introducing malicious data into large-scale datasets used for training AI models. The goals of such attacks can include:
Introducing backdoors: By inserting carefully crafted data points, attackers can create hidden functionalities in the model that are only triggered by specific inputs.
Degrading model performance: Poisoned data can lead to reduced accuracy or biased outputs in certain scenarios.
Data privacy violations: Attackers might inject data that causes the model to memorize and potentially reveal sensitive information.
Techniques for dataset poisoning include:
Label flipping: Changing the labels of a subset of training data to induce misclassification.
Clean-label attacks: Modifying data points subtly while maintaining their original labels.
Trojan attacks: Inserting trigger patterns associated with target labels.
Defending against dataset poisoning requires robust data validation, anomaly detection in training data, and careful curation of web-scale datasets.
Additional reading:
4.2 GPU Ghost Memory Extraction
GPU ghost memory extraction is a sophisticated technique that exploits vulnerabilities in how GPUs handle memory isolation between different processes or users. This attack leverages the complex architecture of modern GPUs, particularly their use of local memory - a high-speed, software-managed cache shared by processing elements within a compute unit.
The core of this vulnerability, dubbed "LeftoverLocals" by researchers at Trail of Bits, lies in the fact that some GPU platforms fail to properly clear local memory between kernel executions. This oversight allows an attacker to potentially access sensitive data left behind by previous computations, even across process or user boundaries.
Here's how the attack typically works:
The attacker crafts a malicious GPU kernel that reads from uninitialized local memory.
This kernel is executed on the GPU, potentially retrieving data left behind by other processes.
The stolen data is then transferred back to the attacker for analysis.
The implications of this vulnerability are particularly severe for AI and machine learning applications, which often rely heavily on GPU acceleration. In a proof-of-concept demonstration, researchers were able to:
Reconstruct responses from an interactive large language model (LLM) session
Fingerprint the specific model being used by analyzing leaked weight data
Extract sensitive intermediate computation results
Affected platforms include GPUs from major vendors such as Apple, AMD, Qualcomm, and some Imagination Technologies devices. Notably, NVIDIA GPUs appear to be unaffected, likely due to prior research exposing similar vulnerabilities.
Additional reading:
4.3 ML Recon
Mlmap is a collection of scripts designed for reconnaissance of ML production deployments, created by Alkaet. Many ML tools, such as MLFlow, are highly verbose and readily provide information about versions and available models. The script operates by directly interacting with the target server, mimicking the behavior of a malicious connection client. However, it currently lacks the capability to identify deployments through proxies.
Additional reading:
4.4 MLOps Exploitation and Persistence
All of these are tools and frameworks used in machine learning (ML) and data science workflows. They facilitate various aspects of ML operations, such as model training, deployment, monitoring, and management.
Here is a brief overview of each:
Apache Airflow
A workflow management platform. Version 2.5.0 lacked password confirmation during changes, risking unauthorized access and session hijacking.
BentoML
An ML model serving platform vulnerable to remote code execution (RCE), including a pickle-based RCE with an available Metasploit module.
ClearML
An ML experiment management suite with multiple attack vectors documented by HiddenLayer.
Flowise
A workflow management tool that often lacks default authentication. Its API endpoint (seen below) can potentially expose all instance data.
/api/v1/database/export
H2O-3
An open-source ML platform often exposed to local networks without default authentication, vulnerable to arbitrary file overwrites, and lacking an authorization model. A threat model for H2O-3 can be found here.
Jupyter Notebooks
Interactive computing environments that often contain plain-text credentials, lack proper access controls, and can be shared across users without adequate security measures. More information on these attacks can be found here and here.
Kserve
A serverless ML inference platform where malicious containers can be injected into model pipelines.
Kubeflow
An ML workflow platform for Kubernetes with significant security risks. It's vulnerable to dashboard Remote Code Execution (RCE) and Server-Side Request Forgery (SSRF) attacks. These vulnerabilities can exist both within and outside Kubernetes clusters, potentially exposing internal networks. A notable SSRF endpoint is:
pipeline/artifacts/get?source=minio&namespace=$payload&peek=256&bucket=mlpipeline
MLflow
An ML lifecycle management platform with vulnerabilities like CVE-2023-1177, allowing unauthenticated file read access and potential exposure of sensitive data.
Ray
A distributed computing framework that can be exploited for arbitrary code execution and has been the target of the "ShadowRay" attack campaign.
All these techniques and methods have in common that they are exploitation methods for gaining code execution and maintaining persistence within machine learning environments. They can be utilized either before or after an initial exploitation to achieve and maintain control over the targeted system.
Here’s a summary of each method:
Using Keras Lambda Layers: Leveraging Keras Lambda layers to execute arbitrary code, which can be used both for initial exploitation and for maintaining persistence.
Using Malicious Pickles: Exploiting the deserialization process of Python pickles to execute arbitrary code, applicable for both initial exploitation and persistence.
Using Neuron Based Steganography: Embedding malicious code within neural network models that can execute code, serving both as an exploitation and persistence mechanism.
Using the ONNX Runtime: Creating custom operations within the ONNX runtime that perform malicious actions, such as executing shellcode, which can be used for exploitation and persistence. This method can be made more effective by combining it with other techniques to deploy remotely.
Collectively, these vulnerabilities and techniques demonstrate the diverse attack surface in MLOps environments. AI red teams and attackers can leverage these methods for initial exploitation and maintaining long-term persistence in ML systems.
4.5 Exploiting Public Model Registries
Public model registries like Hugging Face's Model Hub have become central repositories for pre-trained AI models. While these platforms greatly facilitate AI development and deployment, they also present potential security risks:
Hugging Face Watering Hole Attacks:
Attackers can upload malicious models disguised as legitimate ones.
When users download and run these models, they unknowingly execute malicious code.
This technique can be used both for initial exploitation and as a persistence mechanism.
Practical Example: When ML engineers explain their workflow, it often goes something like this: They face a problem (e.g., entity extraction), search for a suitable package or model on Hugging Face, briefly go through the model card (a description or summary of the model), and finally download and integrate the model into their project. I’ve interviewed a good number of ML engineers, and generally speaking, like most software, there is not a lot of time or easy tooling available for deeper research into specific models before testing to see if it is fit for purpose.

However, this workflow presents significant security risks. Hugging Face Watering Hole Attacks exploit this process by allowing attackers to upload malicious models disguised as legitimate ones. When users download and run these models, they unknowingly execute malicious code. This technique can be used both for initial exploitation and as a persistence mechanism, making it a serious threat in the machine learning community.
Dependency confusion attacks:
Attackers can upload models with names similar to popular internal models used by organizations.
If an organization's package manager is misconfigured, it might pull the malicious model instead of the internal one.
Model hijacking:
Compromising the account of a trusted model publisher to replace legitimate models with malicious versions.
Exploiting vulnerabilities in model loading:
Crafting models that exploit vulnerabilities in popular ML frameworks when loaded.
Additional reading:
5. Final Thoughts
This comprehensive overview of known AI attack techniques underscores the critical role of AI Red Teams in today's cybersecurity landscape. From the subtle manipulations of adversarial machine learning to the broad implications of supply chain attacks, it's clear that AI systems face a multitude of sophisticated threats that only specialized teams can effectively simulate and counter.
AI Red Teams, armed with this knowledge of attack vectors, play a crucial role in:
Proactively identifying vulnerabilities before malicious actors can exploit them
Testing the robustness of AI models against a wide range of attack techniques
Helping organizations build more resilient AI systems by simulating real-world threats
Bridging the gap between theoretical vulnerabilities and practical exploits
As AI continues to permeate critical systems and decision-making processes, the stakes for AI security have never been higher. The techniques we've discussed are not merely theoretical - they represent real-world threats that organizations and developers must actively defend against. AI Red Teams are at the forefront of this defense, constantly pushing the boundaries of what's possible in AI security.
Looking ahead, my next article will shift focus to the defensive side of AI security. We'll explore blue team strategies and introduce the concept of cyber kill chains specifically tailored for AI systems. This upcoming piece will provide a framework for understanding how these attacks unfold in practice and, more importantly, how AI Red Teams can work in tandem with Blue Teams to detect, mitigate, and prevent these sophisticated threats.
Disclaimer: The views and opinions expressed in this article are my own and do not reflect those of my employer. This content is based on my personal insights and research, undertaken independently and without association to my firm.