Jacob Steinhardt

Jacob Steinhardt 2023 Early Career Fellow

Affiliation: Assistant Professor, UC Berkeley

Hard Problem: Solved challenges of safety and control, human alignment and compatibility with increasingly powerful and capable AI and eventually AGI.

Jacob Steinhardt is an Assistant Professor in the Department of Statistics at UC Berkeley. His research goal is to make the conceptual and empirical advances necessary to design human-aligned machine learning systems. He studies interpretability and explainability, truthfulness, reward hacking and unintended consequences, and forecasting future developments in ML. He previously worked for OpenAI and Open Philanthropy and was a coach for the USA Computing Olympiad.

AI2050 Project

Jacob Steinhardt’s AI2050 project will build AI systems that can help to understand other AI systems, by discovering and explaining important facts about their behavior and their possible failure modes. It will provide ways of verifying these explanations that do not require trusting the AI systems themselves.

Project Artifacts

D. Halawi, A. Wei, E. Wallace, T.T. Wang, N. Haghtalab, J. Steinhardt. Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation. arXiv. 2024.