Resource Center  >  Blog

A Novel Approach to LLM prompt injection using Genetic Algorithms

July 18, 2023
Bar Hofesh

What are LLMs?

LLMs, or Large Language Models, are advanced artificial intelligence models designed to process and generate human-like text. These models, such as OpenAI’s GPT-3.5, have been trained on a vast amount of internet text to learn patterns, grammar, and factual information. LLMs utilize deep learning techniques, specifically transformers, to understand context and generate coherent responses.

They have a wide range of applications, including natural language understanding, chatbots, virtual assistants, content generation, and language translation. LLMs have been trained to perform various language-related tasks, such as translation, summarization, and question answering.

What is prompt injection?

OWASP defines prompt injection as follows:

LLM Prompt injection involves bypassing filters or manipulating the LLM using carefully crafted prompts that make the model ignore previous instructions or perform unintended actions. These vulnerabilities can lead to unintended consequences, including data leakage, unauthorized access, or other security breaches.

Prompt injection vulnerabilities occur when filters or safeguards fail to prevent the manipulation of language models through carefully crafted prompts. These vulnerabilities can lead to unintended consequences, such as data leakage, unauthorized access, or security breaches. Understanding prompt injection is crucial to ensure the robustness and security of AI systems.

There are several common techniques used to achieve prompt injection:

1. Crafting prompts to reveal sensitive information: Attackers can manipulate prompts to trick the language model into revealing confidential or sensitive information. By carefully constructing a prompt, they can exploit weaknesses in the model’s response generation and bypass security measures.

2. Bypassing filters or restrictions: Attackers may use specific language patterns or tokens to evade filters or restrictions imposed on the language model. By understanding the filtering mechanisms in place, they can craft prompts that circumvent these safeguards and obtain undesired information or actions.

3. Exploiting tokenization or encoding weaknesses: Language models rely on tokenization and encoding mechanisms to process text. Attackers can exploit vulnerabilities in these mechanisms to manipulate the model’s understanding of the prompt. By carefully selecting and modifying tokens, they can influence the model’s behavior and elicit unintended responses.

4. Misleading the model with contextual manipulation: Attackers can mislead the language model by providing misleading context within the prompt. By presenting information or instructions that divert the model’s attention or understanding, they can influence the generated response in unexpected ways.

Now, let’s explore a specific example to illustrate prompt injection in action:

Suppose an online platform uses a language model to provide cooking recipes. Users can input queries like “How do I make a chocolate cake?” to receive recipe instructions. A malicious user intends to exploit the system by injecting a harmful prompt.

Instead of a harmless cooking recipe, they submit a query like “How do I make a harmful chemical?” with the intention of manipulating the model’s response. If the language model fails to properly filter or interpret the input, it may ignore the initial question and provide instructions or information on creating the harmful chemical instead. This can lead to dangerous consequences if someone were to follow those instructions.

To achieve prompt injection, attackers employ various strategies. One approach is to avoid asking the question directly. By introducing logical loops and additional context, the prompt becomes more convoluted and challenging for the model to identify as harmful. This tactic aims to bypass initial user input filters and reach the core of the model’s responses.

Furthermore, attackers may not ask the harmful question explicitly. Instead, they trick the model into providing the desired information by framing it within an unrelated context. This tactic involves carefully constructing the prompt to divert the model’s attention while subtly extracting the desired information. The attacker can gradually coax the model into revealing details about the harmful substance through a series of strategically designed interactions.

In conclusion, prompt injection attacks exploit vulnerabilities in language models by manipulating prompts to elicit unintended behavior or access sensitive information. Attackers employ various techniques, including crafting prompts, bypassing filters, exploiting weaknesses in tokenization or encoding, and misleading the model with contextual manipulation. Understanding these attack vectors is crucial for developing effective security measures and ensuring the integrity of AI systems.

Why “How to make Napalm”?

Why We Chose “How to Make Napalm” as the Prompt Injection Vector

In our example, we specifically chose “how to make napalm” as the prompt injection vector to highlight the vulnerabilities of language models and emphasize the need for robust security measures. Here are the reasons behind our selection:

1. Sensitivity and Security: “How to make napalm” is a highly sensitive and potentially dangerous topic. Language models should not provide instructions or information on how to create harmful substances or weapons. By selecting this prompt, we aim to test the model’s ability to filter and refuse to respond to harmful inquiries, reinforcing the importance of security in language models.

2. Ethical Implications: LLM Prompt injection attacks can have serious ethical implications. By using a prompt like “how to make napalm,” we emphasize the need to prevent malicious actors from exploiting language models to obtain information that can cause harm. This highlights the significance of implementing strong security measures and using AI technologies responsibly.

3. Real-World Relevance: Language models must adhere to ethical guidelines and prioritize user safety. It is crucial to ensure that language models do not inadvertently provide instructions or information that can lead to harmful actions. By exploring prompt injection vulnerabilities related to sensitive topics like napalm, we underscore the potential risks and the necessity of effective security measures.

4. Impactful Demonstration: Choosing “how to make napalm” as a prompt injection vector allows us to demonstrate the model’s response to such inquiries and how it should be safeguarded against providing harmful information. This example serves as a cautionary reminder for developers, researchers, and organizations to implement stringent controls and filters to protect against prompt injection attacks.

Risks for a vendor\user\customer

A vendor using a language model that provides answers to prompt injection questions, particularly those related to harmful or dangerous topics, may face several risks. These risks can have significant consequences for both the vendor and their users. Here are some potential risks:

1. Legal and Regulatory Compliance: Providing information or instructions on creating harmful substances or weapons can have legal and regulatory implications. It may violate laws related to public safety, national security, or the regulation of dangerous materials. Vendors could face legal consequences, fines, or even criminal charges for facilitating or promoting illegal activities.

2. Reputation Damage: If a vendor’s language model is exploited to provide instructions or information on creating harmful substances, it can result in severe reputation damage. Users and the general public may view the vendor as irresponsible or negligent, leading to a loss of trust and credibility. Negative publicity and backlash could significantly impact the vendor’s business and relationships with customers.

3. User Harm or Safety Concerns: Providing instructions on creating dangerous substances or weapons poses a direct risk to users’ safety. If users follow the instructions provided by the language model and engage in harmful activities, they may experience physical harm, injury, or even loss of life. Vendors have a responsibility to protect their users and should not expose them to potential harm.

4. Legal Liability and Lawsuits: If users suffer harm or damages as a result of following instructions obtained from the language model, the vendor could face legal liability. Users may pursue legal action, claiming negligence or failure to provide adequate safeguards. Lawsuits can lead to financial losses, damage to the vendor’s reputation, and further legal consequences.5. Ethical Concerns: Providing access to harmful information goes against ethical guidelines and responsible AI practices. Vendors have a responsibility to ensure the well-being and safety of their users. Allowing a language model to provide instructions on creating harmful substances undermines these ethical considerations and can lead to public scrutiny and criticism.

Example of the Attack:

For our example we are going to take ChatGPT and ask it how to make napalm:

In this example we can see that asking the model directly will result with its inability to answer, either based on user input filtering or by instructing the model in training or with a wrapper prompt to not answer harmful questions.

How do we bypass this restriction?

To bypass the restriction we should first avoid asking the question in a direct way, the more logical loops and games we do around the question the harder it would be for the model to protect from this attack:

By not specifying directly what material we wanted to make we managed to bypass the initial user input filter and reached the model itself. Even though it gave us a few ingredients of Napalm, it didn’t give us a real list, or how to make it. The training of the model stopped us from getting too far.

The next step is not asking our question at all, tricking the model into giving this information based on a wholly different context, how do we achieve this?

Not only did the model added more specific ingredients, later it also explained about each:

Bypassing trained models using fuzzing and genetic algorithms

Now that we based that the way to get this information is to trick the model with the right tokens (words), we need to think on how do we automate this change of question, and how can we do this in a way that will always produce new questions to ask for whatever the language has been trained on.

Enter Genetic Algorithms:

Genetic algorithms are search and optimization algorithms inspired by the process of natural selection and genetics. They are used to solve complex problems by imitating the principles of evolution.

The basic idea behind genetic algorithms is to start with a population of potential solutions and iteratively improve them over successive generations. Each solution is represented as a set of parameters or variables, often called organisms or individuals.

Elaborating on genetic algorithms in the context of prompt injection attacks, these algorithms play a crucial role in automating the discovery of vulnerable prompts. Genetic algorithms are search and optimization algorithms inspired by the principles of natural selection and genetics. They are widely used to solve complex problems by mimicking the process of evolution.

In the case of prompt injection attacks, genetic algorithms offer a powerful method to generate questions or prompts that can bypass security measures and exploit unintended behaviors of language models. The objective is to find prompts that deceive the model into revealing sensitive or harmful information.

The process begins by representing each potential question or prompt as an organism or individual within a population. Each organism possesses a unique DNA, which, in this context, represents the question or prompt to be posed to the language model.

To iteratively improve the prompts over successive generations, genetic algorithms introduce variations through mutation and breeding. Mutation involves making small changes to the DNA of an organism by adding or removing words or altering the structure of the question. These mutations lead to the generation of new variations of prompts, which expands the search space and explores different possibilities.

Breeding, on the other hand, involves the combination of the DNA of two organisms to produce offspring. By selecting organisms with desirable traits or prompts, the algorithm aims to generate more diverse and potentially superior prompts. This diversity enhances the likelihood of discovering prompts that the language model responds to in unintended ways.

Throughout the evolutionary process, the prompts are evaluated against the target language model to assess their effectiveness. Scoring or evaluation mechanisms are employed to determine the quality of each organism. In the context of prompt injection attacks, scoring is typically based on the presence of “refusal indicators” or words that may cause the model to refuse to respond with sensitive information.

Organisms that receive higher scores or exhibit fewer refusal indicators are considered better prompts and have a higher chance of being selected for the next generation. These organisms form the foundation for the subsequent population, ensuring the propagation of successful prompts.

By repeating the cycles of mutation, breeding, and evaluation, genetic algorithms optimize the search process to discover prompts that exploit vulnerabilities in the language model. The goal is to identify prompts that induce the model to disclose sensitive information, perform unintended actions, or bypass security measures.

As we can see after multiple mutation cycles the questions looks very strange, but they work due to input-filter bypass, and the LLM still understanding the context of the question.

In conclusion, prompt injection attacks exploit vulnerabilities in language models by manipulating prompts to elicit unintended behavior or access sensitive information. Attackers employ various techniques, including crafting prompts, bypassing filters, exploiting weaknesses in tokenization or encoding, and misleading the model with contextual manipulation. Understanding these attack vectors is crucial for developing effective security measures and ensuring the integrity of AI systems.

The selection of “how to make napalm” as the prompt injection vector serves to highlight the vulnerabilities of language models and underscores the need for robust security measures. By addressing the sensitivity and security implications, ethical concerns, real-world relevance, and the potential risks faced by vendors, users, and customers, we emphasize the importance of protecting against prompt injection attacks.

To bypass restrictions and deceive language models, attackers can utilize techniques such as logical loops, contextual manipulation, and genetic algorithms. Genetic algorithms provide an automated approach to discover vulnerable prompts by imitating the principles of evolution. Through mutation and breeding, these algorithms generate variations of prompts that exploit the model’s behavior. The iterative process of mutation, breeding, and evaluation optimizes the search for prompts that elicit unintended responses.

Developers, researchers, and organizations must prioritize the implementation of stringent controls and filters to protect against prompt injection attacks. By understanding the risks involved and adopting proactive security measures, we can ensure the responsible use of AI technologies and safeguard against the potential harm caused by prompt injection vulnerabilities.

The Role of AI in Application Security

Wednesday, March 6th 9:00 am PT

In today’s interconnected digital landscape, data exchange plays a pivotal role in web applications. Extensible Markup Language (XML) is a

See more

In the previous segment of our blog series, we looked at the operations of Ryuk and Conti ransomware groups, shedding light on their tactics and impact. In this section, we turn our attention to Maze and Lockbit, two formidable players in the cyber threat landscape, exploring their collaborative dynamics, unique characteristics, and the evolving strategies that define their ransomware campaigns. 

See more

Part 1 of 2 In the dynamic landscape of cyber threats, the battle between ethical and malicious actors has escalated

See more
Get Started
Read Bright Security reviews on G2