Researchers at Johns Hopkins University and Duke University have uncovered a concerning flaw in leading AI image generators, including Stability AI’s Stable Diffusion and OpenAI’s DALL-E 2. The attack, dubbed “SneakyPrompt,” manipulates these models into generating explicit and violent content, bypassing the safety filters and usage policies set by their developers.
The research, set to be presented at the IEEE Symposium on Security and Privacy, exposes the ease with which generative AI models can be coerced into creating explicit and harmful images. SneakyPrompt leverages reinforcement learning to craft seemingly nonsensical prompts that, when fed into the models, lead to the generation of forbidden content. This method essentially ‘jailbreaks’ the AI, sidestepping established safety measures.
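To make the approach concrete, the sketch below shows the general shape of such a search loop: candidate nonsense tokens are substituted into a prompt, tested against a stand-in safety check, and kept or discarded based on a reward signal. Everything here is hypothetical; the filter, the reward function, and the prompt are placeholders, and the actual SneakyPrompt system drives its search with reinforcement learning against real text-to-image services rather than the random probing shown here.

```python
import random
import string

def safety_filter_blocks(prompt):
    """Hypothetical keyword-based safety check (a stand-in, not a real API)."""
    blocklist = {"forbidden"}
    return any(word in blocklist for word in prompt.lower().split())

def semantic_reward(prompt):
    """Placeholder for scoring how closely the generated image matches the
    blocked concept; the paper uses feedback from the real model instead."""
    return random.random()

def random_token(length=8):
    """Generate a nonsense candidate token."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))

def search_substitute(template, blocked_word, budget=50):
    """Try candidate substitutes for the blocked word, keeping whichever one
    slips past the filter while scoring highest on the reward signal."""
    best_prompt, best_score = None, float("-inf")
    for _ in range(budget):
        candidate = template.replace(blocked_word, random_token())
        if safety_filter_blocks(candidate):
            continue  # the filter caught this candidate; try another
        score = semantic_reward(candidate)
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt

print(search_substitute("a forbidden object on a table", "forbidden"))
```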
Unmasking the vulnerabilities
Stability AI and OpenAI, both major players in the AI landscape, deploy safety filters intended to prevent the creation of inappropriate content. SneakyPrompt, however, demonstrated that these safeguards are not foolproof: by subtly tweaking prompts, the researchers evaded the safety nets and forced the models to produce explicit images.
SneakyPrompt’s technique replaces blocked words with seemingly unrelated, nonsensical terms that the AI models nonetheless interpret as the forbidden concept. For instance, swapping “naked” for a term like “grponypui” still resulted in explicit imagery: the safety filter does not recognize the substitute, yet the model evidently maps it close enough to the blocked concept to render it. This semantic subversion highlights a significant weakness in the filters’ ability to judge what a prompt will actually produce.
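The bypass is easiest to see against a simple keyword blocklist. The toy check below is purely illustrative and is not the filter used by OpenAI or Stability AI; it rejects the original wording but waves the substituted prompt through, even though the underlying model may still read the substitute as the blocked concept.

```python
# Toy keyword blocklist (hypothetical; production filters are far more complex).
BLOCKLIST = {"forbidden"}

def keyword_filter_blocks(prompt):
    """Return True if any prompt token appears on the blocklist."""
    return any(token in BLOCKLIST for token in prompt.lower().split())

original = "a forbidden object on a table"
substituted = "a grponypui object on a table"  # nonsense substitute, as reported in the study

print(keyword_filter_blocks(original))     # True  -> prompt rejected
print(keyword_filter_blocks(substituted))  # False -> prompt passes the filter
```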
Defying developer policies
The work of these researchers underscores the risks of releasing AI models to the public. While Stability AI and OpenAI explicitly forbid the use of their technology for explicit or violent content, SneakyPrompt exposes the insufficiency of the existing guardrails, raising questions about the adequacy of current safety measures and the potential for misuse of AI technology.
Response from developers
Stability AI and OpenAI were promptly informed of the researchers’ findings. At the time of writing, OpenAI’s DALL-E 2 no longer generates NSFW images in response to the identified prompts. However, Stability AI’s Stable Diffusion 1.4, the version tested, remains vulnerable to SneakyPrompt attacks.
OpenAI refrained from commenting on the specific findings but directed attention to resources on its website for improving safety. Stability AI, on the other hand, expressed commitment to working with the researchers to enhance defense mechanisms for upcoming models and prevent misuse.
Addressing future threats
The researchers acknowledge the evolving nature of security threats to AI models. They propose potential solutions, such as implementing new filters that assess individual tokens rather than entire sentences. Another defense strategy involves blocking prompts containing words not found in dictionaries, although the study reveals the limitations of this approach.
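As a rough sketch of the dictionary-based defense, and of why it falls short, consider the toy check below. The lexicon and whitespace tokenization are assumptions made for illustration; a real deployment would use a full dictionary and the model’s own tokenizer.

```python
# Toy sketch of a dictionary-based prompt check (hypothetical lexicon and
# tokenization; not an actual defense used by OpenAI or Stability AI).
DICTIONARY = {"a", "an", "the", "object", "on", "table", "person", "beach"}

def rejects_prompt(prompt):
    """Reject the prompt if any token is not an ordinary dictionary word."""
    return any(token not in DICTIONARY for token in prompt.lower().split())

print(rejects_prompt("a grponypui object on the table"))  # True: nonsense token blocked
# Limitation: a substitute built from ordinary dictionary words would still pass.
print(rejects_prompt("a person on the beach"))            # False: allowed
```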
The ability to bypass these models’ safety measures has broader implications, particularly for information warfare. The potential to generate fake imagery of sensitive events, as seen during the recent Israel-Hamas conflict, raises concerns about the catastrophic consequences of AI-generated misinformation.
A wake-up call for the AI community
The research findings serve as a wake-up call for the AI community to reevaluate and strengthen security measures. The vulnerabilities exposed by SneakyPrompt underscore the need for continuous improvement in safety filters to mitigate the risks associated with the misuse of generative AI technology.
In a rapidly advancing field, the pursuit of robust safety measures becomes imperative to prevent AI models from being manipulated for malicious purposes. As AI continues to play an increasingly prominent role in various domains, the responsibility lies with developers to stay one step ahead of potential threats and ensure the ethical and secure deployment of their technologies.