Nicholas Schiefer

All published works
Title Year Authors
+ Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models 2024 Carson Denison
Monte MacDiarmid
Fazl Barez
David Duvenaud
Shauna Kravec
Samuel Marks
Nicholas Schiefer
Ryan Soklaski
Alex Tamkin
Jared Kaplan
+ Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training 2024 Evan Hubinger
Carson Denison
Jesse Mu
Mike Lambert
Meg Tong
Monte MacDiarmid
Tamera Lanham
Daniel M. Ziegler
Tim Maxwell
Newton Cheng
+ The Capacity for Moral Self-Correction in Large Language Models 2023 Deep Ganguli
Amanda Askell
Nicholas Schiefer
Thomas I. Liao
Kamilė Lukošiūtė
Anna Chen
Anna Goldie
Azalia Mirhoseini
Catherine Olsson
Danny Hernandez
+ Learned Interpolation for Better Streaming Quantile Approximation with Worst-Case Guarantees 2023 Nicholas Schiefer
Justin Y. Chen
Piotr Indyk
Shyam Narayanan
Sandeep Silwal
Tal Wagner
+ Towards Measuring the Representation of Subjective Global Opinions in Language Models 2023 Esin Durmus
Karina Nguyen
Thomas I. Liao
Nicholas Schiefer
Amanda Askell
Anton Bakhtin
Carol Chen
Zac Hatfield-Dodds
Danny Hernandez
Nicholas Joseph
+ Question Decomposition Improves the Faithfulness of Model-Generated Reasoning 2023 Ansh Radhakrishnan
Karina Nguyen
Anna Chen
Carol Chen
Carson Denison
Danny Hernandez
Esin Durmus
Evan Hubinger
Jackson Kernion
Kamilė Lukošiūtė
+ Measuring Faithfulness in Chain-of-Thought Reasoning 2023 Tamera Lanham
Anna Chen
Ansh Radhakrishnan
Benoit Steiner
Carson Denison
Danny Hernandez
Dustin Li
Esin Durmus
Evan Hubinger
Jackson Kernion
+ Discovering Language Model Behaviors with Model-Written Evaluations 2023 Ethan Perez
Sam Ringer
Kamilė Lukošiūtė
Karina Nguyen
Edwin Chen
Scott Heiner
Craig Pettit
Catherine Olsson
Sandipan Kundu
Saurav Kadavath
+ Towards Understanding Sycophancy in Language Models 2023 Mrinank Sharma
Meg Tong
Tomasz Korbak
David Duvenaud
Amanda Askell
Samuel R. Bowman
Newton Cheng
Esin Durmus
Zac Hatfield-Dodds
Scott R. Johnston
+ Specific versus General Principles for Constitutional AI 2023 Sandipan Kundu
Yuntao Bai
Saurav Kadavath
Amanda Askell
A. Callahan
Anna Chen
Anna Goldie
Avital Balwit
Azalia Mirhoseini
B. T. McLean
+ Language Models (Mostly) Know What They Know 2022 Saurav Kadavath
Tom Conerly
Amanda Askell
Tom Henighan
Dawn Drain
Ethan Perez
Nicholas Schiefer
Zac Hatfield-Dodds
Nova DasSarma
Eli Tran-Johnson
+ Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned 2022 Deep Ganguli
Liane Lovitt
Jackson Kernion
Amanda Askell
Yuntao Bai
Saurav Kadavath
Ben Mann
Ethan Perez
Nicholas Schiefer
Kamal Ndousse
+ Toy Models of Superposition 2022 Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
Tom Henighan
Shauna Kravec
Zac Hatfield-Dodds
Robert Lasenby
Dawn Drain
Carol Chen
+ Exponentially Improving the Complexity of Simulating the Weisfeiler-Lehman Test with Graph Neural Networks 2022 Anders Aamand
Justin Y. Chen
Piotr Indyk
Shyam Narayanan
Ronitt Rubinfeld
Nicholas Schiefer
Sandeep Silwal
Tal Wagner
+ Measuring Progress on Scalable Oversight for Large Language Models 2022 Samuel R. Bowman
Jeeyoon Hyun
Ethan Perez
Edwin Chen
Craig Pettit
Scott Heiner
Kamilė Lukošiūtė
Amanda Askell
Andy Jones
Anna Chen
+ Engineering Monosemanticity in Toy Models 2022 Adam S. Jermyn
Nicholas Schiefer
Evan Hubinger
+ Constitutional AI: Harmlessness from AI Feedback 2022 Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
Jackson Kernion
Andy Jones
Anna Chen
Anna Goldie
Azalia Mirhoseini
Cameron McKinnon
+ Discovering Language Model Behaviors with Model-Written Evaluations 2022 Ethan Perez
Sam Ringer
Kamilė Lukošiūtė
Karina Nguyen
Edwin Chen
Scott Heiner
Craig Pettit
Catherine Olsson
Sandipan Kundu
Saurav Kadavath
+ FoundationDB Record Layer 2019 C. Chrysafis
Ben Collins
Scott Dugas
Jay Dunkelberger
Moussa Ehsan
Scott Gray
Alec Grieser
Ori Herrnstadt
Kfir Lev-Ari
Tao Lin
Common Coauthors