Jacob Steinhardt is an Assistant Professor in the Department of Statistics at UC Berkeley. His research aims to make the conceptual and empirical advances necessary to design human-aligned machine learning systems. He studies interpretability and explainability, truthfulness, reward hacking and unintended consequences, and forecasting future developments in ML. He previously worked at OpenAI and Open Philanthropy and was a coach for the USA Computing Olympiad.
AI2050 Project
Jacob Steinhardt’s AI2050 project will build AI systems that help us understand other AI systems by discovering and explaining important facts about their behavior and their possible failure modes. It will provide ways of verifying these explanations that do not require trusting the AI systems themselves.
Project Artifacts
D. Halawi, A. Wei, E. Wallace, T.T. Wang, N. Haghtalab, J. Steinhardt. Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation. arXiv. 2024.