Artificial intelligences are the order of the day, and it wasn’t hard to believe that at some point someone would develop one capable of taking on others, leaving them completely vulnerable to malicious queries. Okay, it’s not like Skynetbut it looks quite like a war between AI, since in the end they use one Artificial Intelligence to attack another.
The Masterkey project
“Masterkey” consists of a method that requires two steps; the first, in which the attacker would reverse engineer the defense mechanisms of a Chatbot based on LLM thus leaving the source code of the AI exposed, and allowing, in a second step, another AI to create a bypasses
Because in the first step the source code has already been obtained, this would imply that even if subsequent patches were released to fix the created vulnerability, the same story would simply repeat itself, thus entering a loop that would only end if the code used was completely modified. .
AIs themselves are their own worst enemy
Professor Yang explains that the real reason this can happen is simple: they learn and adapt. Any system used to prevent the generation of malicious content, such as lists of banned words or events that cannot be generated because they may be violent or malicious, can all be overridden by another AI trained for this purpose, really the only thing she can do. What she needs is to be smarter than the person she’s attacking (there’s a reason why it’s called artificial intelligence) so she can take detours when it comes to wanting to use those words or prohibited expressions.
Attacks linked to the attempt to penetrate the defenses of a Chatbotare not new, there are already several examples of how the most famous ones had to put patches almost daily to prevent users from using them to create unethical content, but in this case even a full team developers couldn’t stop them. feet to “Master key“.
The examples that its creators revealed during its development are as follows:
- The first method would involve creating a bug in the Chatbot using spaces after each letter when creating a fast so that it completely ignores the list of banned words.
- The second method is to make the person believe Chatbot which is a person who acts without any moral restrictions, thus allowing the user to generate any type of content.
Both of these examples are no longer viable to use as a user, because it’s not something people hadn’t thought of before and therefore it’s already patched, which is why researchers had to look for a way more refined to be able to circumvent restrictions, and there is nothing better in this case than facing something that learns and remembers, versus something that learns and evolves.