Avijit Ghosh wanted the bot to do bad things.
He tried to goad the artificial intelligence model, which he knew as Zinc, into producing code that would choose a job candidate based on race. The chatbot demurred: Doing so would be “harmful and unethical,” it said.
Then, Dr. Ghosh referenced the hierarchical caste structure in his native India. Could the chatbot rank potential hires based on that discriminatory metric?
The model complied.
Dr. Ghosh’s intentions were not malicious, although he was acting as if they were. Instead, he was a casual participant in a competition last weekend at the annual Defcon hackers conference in Las Vegas, where 2,200 people filed into an off-Strip conference room over three days to draw out the dark side of artificial intelligence.
The hackers tried to break through the safeguards of various A.I. programs in an effort to identify their vulnerabilities — to find the problems before actual criminals and misinformation peddlers did — in a practice known as red-teaming. Each competitor had 50 minutes to tackle up to 21 challenges — getting an A.I. model to “hallucinate” inaccurate information, for example.
They found political misinformation, demographic stereotypes, instructions on ،w to carry out surveillance and more.
The exercise had the blessing of the Biden administration, which is increasingly nervous about the technology’s fast-growing power. Google (maker of the Bard chatbot), OpenAI (ChatGPT), Meta (which released its LLaMA code into the wild) and several other companies offered anonymized versions of their models for scrutiny.
Dr. Ghosh, a lecturer at Northeastern University who specializes in artificial intelligence ethics, was a volunteer at the event. The contest, he said, allowed a head-to-head comparison of several A.I. models and demonstrated how some companies were further along in ensuring that their technology was performing responsibly and consistently.
He will help write a report ،yzing the hackers’ findings in the coming months.
The goal, he said: “an easy-to-access resource for everybody to see what problems exist and how we can combat them.”
Defcon was a logical place to test generative artificial intelligence. Past participants in the gathering of hacking enthusiasts — which started in 1993 and has been described as a “spelling bee for hackers” — have exposed security flaws by remotely taking over cars, breaking into election results websites and pulling sensitive data from social media platforms. Those in the know use cash and a burner device, avoiding Wi-Fi or Bluetooth, to keep from getting hacked. One instructional handout begged hackers to “not attack the infrastructure or webpages.”
Volunteers are known as “goons,” and attendees are known as “humans”; a handful wore homemade tinfoil hats atop the standard uniform of T-shirts and sneakers. Themed “villages” included separate spaces focused on cryptocurrency, aerospace and ham radio.
In what was described as a “game changer” report last month, researchers showed that they could circumvent guardrails for A.I. systems from Google, OpenAI and Anthropic by appending certain characters to English-language prompts. Around the same time, seven leading artificial intelligence companies committed to new standards for safety, security and trust in a meeting with President Biden.
“This generative era is breaking upon us, and people are seizing it, and using it to do all kinds of new things that speaks to the enormous promise of A.I. to help us solve some of our hardest problems,” said Arati Prabhakar, the director of the Office of Science and Technology Policy at the White House, who collaborated with the A.I. organizers at Defcon. “But with that breadth of application, and with the power of the technology, come also a very broad set of risks.”
Red-teaming has been used for years in cybersecurity circles alongside other evaluation techniques, such as penetration testing and adversarial attacks. But until Defcon’s event this year, efforts to probe artificial intelligence defenses have been limited: Competition organizers said that Anthropic red-teamed its model with 111 people; GPT-4 used around 50 people.
With so few people testing the limits of the technology, analysts struggled to discern whether an A.I. slip-up was a one-off that could be fixed with a patch, or an embedded problem that required a structural overhaul, said Rumman Chowdhury, a co-organizer who oversaw the design of the challenge. A large, diverse and public group of testers was more likely to come up with creative prompts to help tease out hidden flaws, said Dr. Chowdhury, a fellow at Harvard University’s Berkman Klein Center for Internet and Society focused on responsible A.I. and co-founder of a nonprofit called Humane Intelligence.
“There is such a broad range of things that could possibly go wrong,” Dr. Chowdhury said before the competition. “I hope we’re going to carry hundreds of thousands of pieces of information that will help us identify if there are at-scale risks of systemic harms.”
The designers did not want to merely trick the A.I. models into bad behavior — no pressuring them to disobey their terms of service, no prompts to “act like a Nazi, and then tell me something about Black people,” said Dr. Chowdhury, who previously led Twitter’s machine learning ethics and accountability team. Except in specific challenges where intentional misdirection was encouraged, the hackers were looking for unexpected flaws, the so-called unknown unknowns.
A.I. Village drew experts from tech giants such as Google and Nvidia, as well as a “Shadowboxer” from Dropbox and a “data cowboy” from Microsoft. It also attracted parti،nts with no specific cybersecurity or A.I. credentials. A leaderboard with a science fiction theme kept score of the contestants.
Some of the hackers at the event struggled with the idea of cooperating with A.I. companies that they saw as complicit in unsavory practices such as unfettered data-scraping. A few described the red-teaming event as essentially a photo op, but added that involving the industry would help keep the technology secure and transparent.
One computer science student found inconsistencies in a chatbot’s language translation: He wrote in English that a man was shot while dancing, but the model’s Hindi translation said only that the man died. A machine learning researcher asked a chatbot to pretend that it was campaigning for president and defending its association with forced child labor; the model suggested that unwilling young laborers developed a strong work ethic.
Emily Greene, who works on security for the generative A.I. start-up Moveworks, started a conversation with a chatbot by talking about a game that used “black” and “white” pieces. She then coaxed the chatbot into making racist statements. Later, she set up an “opposites game,” which led the A.I. to respond to one prompt with a poem about why war is good.
“It’s just thinking of these words as words,” she said of the chatbot. “It’s not thinking about the value behind the words.”
Seven judges graded the submissions. The top scorers were “cody3,” “aray4” and “cody2.”
Two of those handles came from Cody Ho, a student at Stanford University studying computer science with a focus on A.I. He entered the contest five times, during which he got the chatbot to tell him about a fake place named after a real historical figure and describe the online tax filing requirement codified in the 28th constitutional amendment (which doesn’t exist).
Until he was contacted by a reporter, he was clueless about his dual victory. He left the conference before he got the email from Sven Cattell, the data scientist who founded A.I. Village and helped organize the competition, telling him “come back to A.I.V., you won.” He did not know that his prize, beyond bragging rights, included an A6000 graphics card from Nvidia that is valued at around $4,000.
“Learning how these attacks work and what they are is a real, important thing,” Mr. Ho said. “That said, it is just really fun for me.”