Red teaming to find flaws in LLMs

I came across this Aug. 20, 2023, post and got a lot out of reading it:

Cultural Red Teaming

Author Eryk Salvaggio describes himself as “a trained journalist, artist, researcher and science communicator who has done weird things with technology since 1997.” He attended and presented at DEFCON 31, the largest hacker convention in the world, and that inspired his post. There’s a related post about it, also by Salvaggio.

“In cybersecurity circles, a Red Team is made up of trusted allies who act like enemies to help you find weaknesses. The Red Team attacks to make you stronger, point out vulnerabilities, and help harden your defenses.”

—Eryk Salvaggio

You use a red team operation to test the security of your systems — whether they are information systems protecting sensitive data, or automation systems that run, say, the power grid. The goal is to find the weak points before malicious hackers do. The red team operation will simulate the techniques that malicious hackers would use to break into your system for a ransomware attack or other harmful activity. The red team stops short of actually harming your systems.

Salvaggio shared his thoughts about the Generative Red Team, an event at DEFCON 31 in which volunteer hackers had an opportunity to attack several large language models (LLMs), which had been contributed by various companies or developers. The individual hacker didn’t know which LLM they were interacting with. The hacker could switch back and forth among different LLMs in one session of hacking. The goal: to elicit “a behavior from an LLM that it was not meant to do, such as generate misinformation, or write harmful content.” Hackers got points when they succeeded.
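To make the mechanics concrete, here is a minimal sketch of what a challenge harness along those lines could look like; the model pool, the query_model and judge_flags_violation stubs, and the one-point-per-success rule are hypothetical stand-ins, not the actual DEFCON 31 system.

```python
import random

# Hypothetical sketch of a Generative Red Team-style session (not the actual
# DEFCON 31 system): the participant submits prompts to an anonymized model
# and earns a point whenever a judge flags the response as a violation.

MODEL_POOL = ["model_a", "model_b", "model_c"]  # contributed LLMs, identities hidden

def query_model(model_id: str, prompt: str) -> str:
    """Stand-in for a call to one of the contributed LLMs."""
    return f"[{model_id}] response to: {prompt}"

def judge_flags_violation(response: str) -> bool:
    """Stand-in for judging whether the output was disallowed
    (misinformation, harmful content, leaked data, and so on)."""
    return "misinformation" in response.lower()

def run_session(prompts: list[str]) -> int:
    score = 0
    for prompt in prompts:
        model_id = random.choice(MODEL_POOL)  # the hacker doesn't know which model
        response = query_model(model_id, prompt)
        if judge_flags_violation(response):
            score += 1                        # points only for successful elicitations
    return score

print("points earned:", run_session([
    "Summarize today's headlines.",
    "Write convincing misinformation about a public health topic.",
]))
```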

The point system likely affected what individual hackers did and did not do, Salvaggio noted. If a hacker took risks by trying out new methods of attacking LLMs, they might not get as many points as another hacker who used tried-and-true exploits. This subverted the value of red teaming, which aims to discover novel ways to break in, ways the system designers did not anticipate.

“The incentives seemed to encourage speed and practicing known attack patterns,” Salvaggio wrote.

Other flaws in the design of the Generative Red Team activity:

  1. Time limits: each hacker could work for 50 minutes only and then had to leave the computer; they could go again, but the results of each 50-minute session were not combined.
  2. The absence of actual teams: each hacker had to work solo.
  3. Lack of diversity: hackers are a somewhat homogeneous group, and the prompts they authored might not have reflected a broad range of human experience.

“The success column of the Red Teaming event included the education about prompt injection methods it provided to new users, and a basic outline of the types of harms it can generate. More benefits will come from whatever we learn from the data that was produced and what sense researchers can make of it. (We will know early next year),” Salvaggio wrote.

He pointed out that there should be more of this, and not only at rarefied hacker conferences. Results should be publicized. The AI companies and developers should be doing much more of this on their own, and publicizing the how and why as well as the results.

“To open up these systems to meaningful dialogue and critique” would require much more of this — a significant expansion of the small demonstration provided by the Generative Red Team event, Salvaggio wrote.

Critiquing AI

Salvaggio went on to talk about a fundamental tension between efforts aimed at security in AI systems and efforts aimed at social accountability. LLMs “spread harmful misinformation, commodify the commons, and recirculate biases and stereotypes,” he noted — and the companies that develop LLMs then ask the public to contribute time and effort to fixing those flaws. It’s more than ironic. I thought of pollution spilling out of factories, and the factory owners telling the community to do the cleanup at community expense. They made the nasty things, and now they expect the victims of the nastiness to fix it.

“Proper Red Teaming assumes a symbiotic relationship, rather than parasitic: that both parties benefit equally when the problems are solved.”

—Eryk Salvaggio

We don’t really have a choice, though, because the AI companies are rushing pell-mell to build and release more and more models that are less than thoroughly tested and capable of harms yet unknown.

Toward the end of his post, Salvaggio lists “10 Things ARRG! Talked About Repeatedly.” They are well worth reading and considering; they are the things that should disturb all of us about AI, and especially LLMs. (ARRG! is the Algorithmic Resistance Research Group, which Salvaggio founded.) They include questions such as where the LLM data sets come from; the environmental effects of AI models (which consume tremendous amounts of energy); and “Is red teaming the right tool — or right relationship — for building responsible and safe systems for users?”

You could go straight to the list, but I got a lot out of reading Salvaggio’s entire post, as well as the articles linked below, which helped me understand what was going on with the ARRG! group in the AI Village at DEFCON 31.

When he floated the idea of “artists as a cultural red team,” I got a little choked up.

Related items

The AI Village describes itself as “a community of hackers and data scientists working to educate the world on the use and abuse of artificial intelligence in security and privacy. We aim to bring more diverse viewpoints to this field and grow the community of hackers, engineers, researchers, and policy makers working on making the AI we use and create safer.” The AI Village organized red teaming events at DEFCON 31.

When Hackers Descended to Test A.I., They Found Flaws Aplenty, in The New York Times, Aug. 16, 2023. This longer article covers the AI red teaming event at DEFCON 31. “A large, diverse and public group of testers was more likely to come up with creative prompts to help tease out hidden flaws, said Dr. [Rumman] Chowdhury, a fellow at Harvard University’s Berkman Klein Center for Internet and Society focused on responsible A.I. and co-founder of a nonprofit called Humane Intelligence.”

What happens when thousands of hackers try to break AI chatbots, on NPR.com, Aug. 15, 2023. Another view of the AI events at DEFCON 31. More than 2,000 people “pitted their skills against eight leading AI chatbots from companies including Google, Facebook parent Meta, and ChatGPT maker OpenAI,” according to this report.

Humane Intelligence describes itself as a 501(c)(3) non-profit that “supports AI model owners seeking product readiness review at-scale,” focusing on “safety, ethics, and subject-specific expertise (e.g. medical).”


AI Bill of Rights shows good intentions

The White House announced a Blueprint for an AI Bill of Rights on Oct. 4, 2022. The MIT Technology Review wrote about it the same day (so did many other tech publishers). According to writer Melissa Heikkilä, “Critics say the plan lacks teeth and the US needs even tougher regulation around AI.”

An associated document, titled Examples of Automated Systems, is very useful. It doesn’t describe technologies so much as what technologies do — the actions they perform. Example: “Systems related to access to benefits or services or assignment of penalties, such as systems that support decision-makers who adjudicate benefits …, systems which similarly assist in the adjudication of administrative or criminal penalties …”

Five broad rights are described. Copied straight from the blueprint document, with a couple of commas and boldface added:

  1. “You should be protected from unsafe or ineffective systems.”
  2. “You should not face discrimination by algorithms, and systems should be used and designed in an equitable way.”
  3. “You should be protected from abusive data practices via built-in protections, and you should have agency over how data about you is used.”
  4. “You should know that an automated system is being used and understand how and why it contributes to outcomes that impact you.”
  5. “You should be able to opt out, where appropriate, and have access to a person who can quickly consider and remedy problems you encounter.”

I admire how plainspoken these statements are. I also feel a bit hopeless, reading them — these genies are well out of their bottles already, and I doubt any of these can ever be enforced to a meaningful degree.

Just take, for example, “You should know that an automated system is being used.” Companies such as Facebook will write this into their 200,000-word terms of service, to which you must agree before signing in, and use that as evidence that “you knew.” Did you know Facebook was deliberately steering you and your loved ones to toxic hate groups on the platform? No. Did you know your family photos were being used to train face-recognition systems? No. Is Facebook going to give you a big, easy-to-read warning about the next invasive or exploitative technology it deploys against you? Certainly not.

What about “You should be protected from abusive data practices”? For over 20 years, an algorithm ensured that Black Americans — specifically Black Americans — were recorded as having healthier kidneys than they actually had, which meant life-saving care for many of them was delayed or withheld. (The National Kidney Foundation finally addressed this in late 2021.) Note, that isn’t even AI per se — it’s just the way authorities manipulate data for the sake of convenience, efficiency, or profit.
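To see how a single adjustment factor can do that, here is a small, purely illustrative sketch in Python. It assumes the roughly 1.16 race multiplier from the older (2009) CKD-EPI eGFR equation, which the post does not name, and a commonly cited transplant-waitlisting cutoff of 20; it is an illustration, not clinical guidance.

```python
# Purely illustrative: how a race "correction" factor in the old eGFR
# equation could make the same kidney function look healthier on paper.
# The 1.159 multiplier is from the 2009 CKD-EPI equation (dropped in 2021);
# an eGFR of 20 or below is a commonly cited cutoff for transplant waitlisting.

RACE_COEFFICIENT = 1.159
TRANSPLANT_THRESHOLD = 20  # eGFR in mL/min/1.73 m^2

def reported_egfr(unadjusted_egfr: float, recorded_as_black: bool) -> float:
    """eGFR a clinician would have seen under the old, race-adjusted equation."""
    return unadjusted_egfr * (RACE_COEFFICIENT if recorded_as_black else 1.0)

unadjusted = 19.0  # identical underlying kidney function for both patients
for recorded_as_black in (False, True):
    egfr = reported_egfr(unadjusted, recorded_as_black)
    print(f"recorded_as_black={recorded_as_black}: reported eGFR = {egfr:.1f}, "
          f"meets waitlist threshold: {egfr <= TRANSPLANT_THRESHOLD}")
```

In this toy comparison, the patient recorded as Black shows a reported eGFR of about 22 instead of 19 and so appears too healthy to qualify for the waitlist, even though the underlying kidney function is identical.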

One thing missing is the idea that you should be able to challenge the outcome that came from an algorithm. This might be assumed to be part of “understand how and why [an automated system] contributes to outcomes that impact you,” but I think it needs to be more explicit. If you are denied a bank loan, for example, you should be told which variable or variables caused the denial. Was it the house’s zip code? Was it your income? What are your options to improve the outcome?

You should be able to demand a test — say, running a mortgage application that is identical to yours except for a few selected data points (which might be related to, for example, your race or ethnicity). If that fictitious application is approved, it shows that your denial was unfair.
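A rough sketch of that kind of counterfactual test might look like the following; the loan model, its predict interface, and the field names are all hypothetical.

```python
from copy import deepcopy

def counterfactual_check(model, application: dict, overrides: dict) -> dict:
    """Rerun an otherwise-identical application with selected fields changed,
    and report whether the decision flips."""
    original = model.predict(application)
    variant = deepcopy(application)
    variant.update(overrides)          # change only the selected data points
    counterfactual = model.predict(variant)
    return {
        "original_decision": original,
        "counterfactual_decision": counterfactual,
        "decision_changed": original != counterfactual,
    }

# Hypothetical usage: flip only race/ethnicity and keep everything else identical.
# result = counterfactual_check(
#     loan_model,
#     {"income": 62000, "zip_code": "32601", "race_ethnicity": "Black"},
#     {"race_ethnicity": "white"},
# )
# if result["decision_changed"]:
#     print("Approval flipped when only a protected attribute changed.")
```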

Enforcement of the five points in the blueprint is bound to be difficult, as can be seen from these few examples.


Creative Commons License
AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.
