Mitigating Covertly Unsafe Text within Natural Language Systems

Type: Preprint

Publication Date: 2022

Citations: 2

DOI: https://doi.org/10.48550/arxiv.2210.09306

Locations

  • arXiv (Cornell University)
  • DataCite API

Similar Works

  • Mitigating Covertly Unsafe Text within Natural Language Systems (2022): Alex Mei, Anisha Kabir, Sharon Levy, Melanie Subbiah, Emily Allaway, John A. Judge, Desmond Upton Patton, Bruce Bimber, Kathleen McKeown, William Yang Wang
  • SafeText: A Benchmark for Exploring Physical Safety in Language Models (2022): Sharon Levy, Emily Allaway, Melanie Subbiah, Lydia B. Chilton, Desmond Upton Patton, Kathleen McKeown, William Yang Wang
  • Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey (2022): Sachin Kumar, Vidhisha Balachandran, Lucille Njoo, Antonios Anastasopoulos, Yulia Tsvetkov
  • Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content (2024): Federico Bianchi, James Zou
  • Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation (2024): Aneta Zugecova, Dominik Macko, Ivan Srba, Róbert Móro, Jakub Kopál, Katarina Marcincinova, Matúš Mesarčík
  • Machine Generated Text: A Comprehensive Survey of Threat Models and Detection Methods (2022): Evan Crothers, Nathalie Japkowicz, Herna L. Viktor
  • A Comprehensive Survey of Natural Language Generation Advances from the Perspective of Digital Deception (2022): Keenan Jones, Enes Altuncu, Virginia N. L. Franqueira, Yichao Wang, Shujun Li
  • Machine-Generated Text: A Comprehensive Survey of Threat Models and Detection Methods (2023): Evan Crothers, Nathalie Japkowicz, Herna L. Viktor
  • Handling and Presenting Harmful Text in NLP Research (2022): Leon Derczynski, Hannah Rose Kirk, Abeba Birhane, Bertie Vidgen
  • Foveate, Attribute, and Rationalize: Towards Physically Safe and Trustworthy AI (2023): Alex Mei, Sharon Levy, William Yang Wang
  • Handling and Presenting Harmful Text in NLP Research (2022): Hannah Rose Kirk, Abeba Birhane, Bertie Vidgen, Leon Derczynski
  • Foveate, Attribute, and Rationalize: Towards Physically Safe and Trustworthy AI (2022): Alex Mei, Sharon Levy, William Yang Wang
  • GUARD-D-LLM: An LLM-Based Risk Assessment Engine for the Downstream uses of LLMs (2024): Sundaraparipurnan Narayanan, Sandeep Kumar Vishwakarma
  • Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models (2022): Maribeth Rauh, John W. Mellor, Jonathan Uesato, Po-Sen Huang, Johannes Welbl, Laura Weidinger, Sumanth Dathathri, Amelia Glaese, Geoffrey Irving, Iason Gabriel
  • On the Risk of Misinformation Pollution with Large Language Models (2023): Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, William Yang Wang
  • Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective (2024): Jean Marie Tshimula, Xavier Ndona, D'Jeff K. Nkashama, Pierre-Martin Tardif, Froduald Kabanza, Marc Frappier, Shengrui Wang
  • Ethical and social risks of harm from Language Models (2021): Laura Weidinger, John W. Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh
  • The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness (2024): Neeraj Varshney, Pavel Dolin, Agastya Seth, Chitta Baral
  • Risks, Causes, and Mitigations of Widespread Deployments of Large Language Models (LLMs): A Survey (2024): Md. Nazmus Sakib, Md Athikul Islam, Royal Pathak, Md Mashrur Arifin

Works Cited by This (0)
