Jacob Steinhardt
Affiliation

Assistant Professor, UC Berkeley

Hard Problem

Alignment

2023 Early Career Fellow

Jacob Steinhardt is an Assistant Professor in the Department of Statistics at UC Berkeley. His research goal is to make the conceptual and empirical advances necessary to design human-aligned machine learning systems. He studies interpretability and explainability, truthfulness, reward hacking and unintended consequences, and forecasting future developments in ML. He previously worked for OpenAI and Open Philanthropy and was a coach for the USA Computing Olympiad.

AI2050 Project

Jacob Steinhardt’s AI2050 project will build AI systems that help to understand other AI systems by discovering and explaining important facts about their behavior and their possible failure modes. It will provide ways of verifying these explanations that do not require trusting the AI systems themselves.

Project Artifacts

D. Halawi, A. Wei, E. Wallace, T.T. Wang, N. Haghtalab, J. Steinhardt. Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation. arXiv. 2024.
