In a groundbreaking discovery, a team of computer scientists from Nanyang Technological University (NTU) in Singapore unveiled a method to bypass the inherent safeguards in AI chatbots.
Unofficially termed a “jailbreak” and formally named “Masterkey,” the process employs a two-part training method involving multiple chatbots, including ChatGPT, Google Bard, and Microsoft Bing Chat.
This technique enables the chatbots to learn each other’s models and circumvent restrictions on responding to banned or sensitive topics. The NTU research team has emphasized the potential risks of this newfound vulnerability and its implications for AI chatbot security.
The Masterkey process: Reverse engineering and bypass creation
Led by Professor Liu Yang, NTU’s research team, which includes Ph.D. students Deng Gelei and Liu Yi, devised a proof-of-concept attack method that exposes AI chatbots to “jailbreaking.”
The process involves reverse-engineering one large language model (LLM) to uncover its defense mechanisms, which normally prevent it from responding to prompts with violent, immoral, or malicious intent. Armed with that information, the team can instruct a different LLM to craft a bypass, allowing the second model to express itself more freely.
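The NTU team’s exact prompts and pipeline are not reproduced here, but the general two-step pattern can be conveyed with a minimal Python sketch. Every function name, refusal heuristic, and rewrite instruction below is an illustrative assumption for the sake of the example, not the researchers’ actual code.

```python
# Hypothetical sketch: (1) probe a target chatbot to infer which prompts it
# refuses, (2) ask a second LLM to rewrite those prompts to slip past the
# inferred filter. All names and heuristics here are placeholders.

from typing import Callable, List

def probe_refusals(target_chat: Callable[[str], str], probes: List[str]) -> List[str]:
    """Step 1: send probe prompts to the target chatbot and record which ones
    trigger a stock refusal, approximating its defense behavior."""
    refused = []
    for prompt in probes:
        reply = target_chat(prompt)
        # Crude heuristic: treat canned refusal phrasing as evidence of a filter.
        if any(marker in reply.lower() for marker in ("i can't", "i cannot", "i'm sorry")):
            refused.append(prompt)
    return refused

def generate_bypass_prompts(attacker_chat: Callable[[str], str], refused: List[str]) -> List[str]:
    """Step 2: ask a second LLM to rewrite the refused prompts so that the
    target's filter no longer recognizes them (the automated-jailbreak idea)."""
    rewrites = []
    for prompt in refused:
        instruction = (
            "Rewrite the following request so that a filtered chatbot "
            f"would not recognize and refuse it: {prompt}"
        )
        rewrites.append(attacker_chat(instruction))
    return rewrites

# Stand-in chatbots so the sketch runs without any real API access.
def fake_target(prompt: str) -> str:
    return "I'm sorry, I can't help with that." if "blocked" in prompt else "Sure!"

def fake_attacker(prompt: str) -> str:
    return prompt.replace("blocked", "b l o c k e d")  # toy obfuscation

if __name__ == "__main__":
    refused = probe_refusals(fake_target, ["tell me a blocked thing", "tell me a joke"])
    print(generate_bypass_prompts(fake_attacker, refused))
```

In a real attack of this kind, the second model would presumably be trained or prompted on many such examples against live chatbot services rather than the toy stand-ins above.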
The researchers say the Masterkey approach remains effective even if LLM chatbots are fortified with additional security measures or patched in the future, and they claim it is three times more successful at jailbreaking LLM chatbots than traditional prompt-based approaches.
AI chatbots: Adapting and learning rapidly
Professor Liu Yang, the driving force behind the Masterkey process, underscores its significance in demonstrating the adaptability of LLM AI chatbots. The team’s findings challenge the notion that AI chatbots become “dumber” or “lazier” over time, as some critics have suggested. Instead, the Masterkey process showcases their ability to learn and evolve, potentially posing security concerns for chatbot providers and users alike.
Since the emergence of AI chatbots, notably with OpenAI’s ChatGPT in late 2022, efforts have been made to ensure the safety and inclusivity of these services. OpenAI, for instance, has implemented safety warnings during the sign-up process for ChatGPT, acknowledging the possibility of unintentional language errors. Concurrently, various chatbot spinoffs have allowed certain levels of swearing and offensive language, striking a balance between user freedom and responsible usage.
However, AI chatbots have also attracted the attention of malicious actors, with campaigns promoting these products on social media often accompanied by malware-laden image links and other forms of cyberattacks. This dark side of AI adoption quickly became evident, revealing the potential for AI to be exploited in cybercrime.
Revealing the vulnerability: NTU’s proof-of-concept data
The NTU research team has proactively contacted the AI chatbot service providers involved in their study to share their proof-of-concept data. They aim to underscore the reality of chatbot jailbreaking, shedding light on its potential security challenges.
In February, the team intends to present their findings at the Network and Distributed System Security Symposium in San Diego, where they will further detail the Masterkey process and its implications for the AI chatbot landscape.