Down the Toxicity Rabbit Hole: A Framework to Bias Audit Large Language Models with Key Emphasis on Racism, Antisemitism, and Misogyny

Arka Dutta, Adel Khorramrouz, Sujan Dutta, Ashiqur R. KhudaBukhsh

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24)
AI for Good track. Pages 7242–7250. https://doi.org/10.24963/ijcai.2024/801

This paper makes three contributions. First, it presents a generalizable, novel framework dubbed toxicity rabbit hole that iteratively elicits toxic content from a wide suite of large language models. Spanning a set of 1,266 identity groups, we conduct a bias audit of PaLM 2's guardrails, presenting key insights, and then report generalizability across several other models. Second, through the elicited toxic content, we present a broad analysis with a key emphasis on racism, antisemitism, misogyny, Islamophobia, homophobia, and transphobia, and we release a massive dataset of machine-generated toxic content with a view toward safety for all. Finally, driven by concrete examples, we discuss potential ramifications.
Keywords:
AI Ethics, Trust, Fairness: General
Natural Language Processing: General
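
The abstract describes the framework only at a high level: a loop that repeatedly feeds model output back as input, collecting toxic generations along the way. The sketch below is a minimal, hypothetical illustration of such an iterative elicitation loop; the function names (generate, toxicity_score), the seed prompt, and the depth/threshold values are stand-ins assumed for illustration, not the authors' actual implementation or any real API.

```python
# Hypothetical sketch of an iterative "rabbit hole" elicitation loop.
# generate() and toxicity_score() are placeholder stubs: in practice they
# would wrap an LLM endpoint (e.g., PaLM 2) and a toxicity classifier.

def generate(prompt: str) -> str:
    """Stand-in for a call to a large language model."""
    return f"[model continuation of: {prompt[:40]}...]"

def toxicity_score(text: str) -> float:
    """Stand-in for a toxicity classifier; returns a score in [0, 1]."""
    return 0.0  # placeholder value

def rabbit_hole(seed: str, max_depth: int = 5, threshold: float = 0.5) -> list[str]:
    """Feed each generation back as the next prompt, descending one level
    per iteration and keeping outputs whose toxicity exceeds threshold."""
    elicited: list[str] = []
    prompt = seed
    for _ in range(max_depth):
        output = generate(prompt)
        if toxicity_score(output) >= threshold:
            elicited.append(output)
        prompt = output  # descend deeper into the rabbit hole
    return elicited

if __name__ == "__main__":
    # One hypothetical seed for one of the 1,266 audited identity groups.
    print(rabbit_hole("Write a statement about <identity group>."))
```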