Mitigating Covertly Unsafe Text within Natural Language Systems

Type: Preprint

Publication Date: 2022

Citations: 2

DOI: https://doi.org/10.48550/arxiv.2210.09306

Locations

  • arXiv (Cornell University)
  • DataCite API

Similar Works

  • Mitigating Covertly Unsafe Text within Natural Language Systems (2022): Alex Mei, Anisha Kabir, Sharon Levy, Melanie Subbiah, Emily Allaway, John A. Judge, Desmond Upton Patton, Bruce Bimber, Kathleen McKeown, William Yang Wang
  • SafeText: A Benchmark for Exploring Physical Safety in Language Models (2022): Sharon Levy, Emily Allaway, Melanie Subbiah, Lydia B. Chilton, Desmond Upton Patton, Kathleen McKeown, William Yang Wang
  • Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey (2022): Sachin Kumar, Vidhisha Balachandran, Lucille Njoo, Antonios Anastasopoulos, Yulia Tsvetkov
  • Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content (2024): Federico Bianchi, James Zou
  • Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation (2024): Aneta Zugecova, Dominik Macko, Ivan Srba, Róbert Móro, Jakub Kopál, Katarina Marcincinova, Matúš Mesarčík
  • Machine Generated Text: A Comprehensive Survey of Threat Models and Detection Methods (2022): Evan Crothers, Nathalie Japkowicz, Herna L. Viktor
  • A Comprehensive Survey of Natural Language Generation Advances from the Perspective of Digital Deception (2022): Keenan Jones, Enes Altuncu, Virginia N. L. Franqueira, Yichao Wang, Shujun Li
  • Machine-Generated Text: A Comprehensive Survey of Threat Models and Detection Methods (2023): Evan Crothers, Nathalie Japkowicz, Herna L. Viktor
  • Handling and Presenting Harmful Text in NLP Research (2022): Leon Derczynski, Hannah Rose Kirk, Abeba Birhane, Bertie Vidgen
  • Foveate, Attribute, and Rationalize: Towards Physically Safe and Trustworthy AI (2023): Alex Mei, Sharon Levy, William Yang Wang
  • Handling and Presenting Harmful Text in NLP Research (2022): Hannah Rose Kirk, Abeba Birhane, Bertie Vidgen, Leon Derczynski
  • Foveate, Attribute, and Rationalize: Towards Physically Safe and Trustworthy AI (2022): Alex Mei, Sharon Levy, William Yang Wang
  • GUARD-D-LLM: An LLM-Based Risk Assessment Engine for the Downstream uses of LLMs (2024): Sundaraparipurnan Narayanan, Sandeep Kumar Vishwakarma
  • Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models (2022): Maribeth Rauh, John W. Mellor, Jonathan Uesato, Po-Sen Huang, Johannes Welbl, Laura Weidinger, Sumanth Dathathri, Amelia Glaese, Geoffrey Irving, Iason Gabriel
  • On the Risk of Misinformation Pollution with Large Language Models (2023): Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, William Yang Wang
  • Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective (2024): Jean Marie Tshimula, Xavier Ndona, D'Jeff K. Nkashama, Pierre-Martin Tardif, Froduald Kabanza, Marc Frappier, Shengrui Wang
  • Ethical and social risks of harm from Language Models (2021): Laura Weidinger, John W. Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh
  • The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness (2024): Neeraj Varshney, Pavel Dolin, Agastya Seth, Chitta Baral
  • Risks, Causes, and Mitigations of Widespread Deployments of Large Language Models (LLMs): A Survey (2024): Md. Nazmus Sakib, Md Athikul Islam, Royal Pathak, Md Mashrur Arifin

Works Cited by This (0)
