C13: The Inner Workings of AI: Interpretability, Alignment, and Control


Tuesday, 28 July, 08:30 - 12:30 EDT (Eastern Daylight Time - Canada)

Wojciech Samek
Fraunhofer Heinrich Hertz Institute, Germany

Modality

online

Target Audience

researchers/academics, students, professionals, industry

Abstract

Modern AI systems, from deep neural networks to large language models, exhibit remarkable capabilities but also increasing opacity and autonomy. Understanding how these systems represent, reason, and decide has become essential for ensuring their reliability and alignment with human values. This course explores the emerging scientific foundations that connect interpretability, mechanistic understanding, and AI control. It begins by introducing classical and contemporary techniques for interpreting AI models, including gradient-based and concept-level explanations, representation analysis, and mechanistic interpretability approaches that reveal the internal logic of model components. The course then advances to the study of AI alignment, examining how to formalize and enforce desired behaviors, detect misalignment, and design systems that remain trustworthy under distributional shift and optimization pressure. Finally, participants will explore methods for controlling and steering AI models, from fine-tuning to emerging approaches for mechanistic editing, circuit-level interventions, and safe model governance. The course concludes with a discussion on the intersection of interpretability, alignment, and control as the foundation for the next generation of transparent and value-aligned AI systems.
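
To give a concrete flavor of the gradient-based explanations mentioned above, the snippet below is a minimal sketch (not course material) of a gradient-times-input attribution in PyTorch; the model, layer sizes, and input are illustrative placeholders.

    # Minimal sketch of a gradient-based attribution (gradient x input),
    # assuming a toy PyTorch classifier; model, sizes, and data are
    # illustrative placeholders rather than material from the course.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
    model.eval()

    x = torch.randn(1, 10, requires_grad=True)   # a single input example
    logits = model(x)
    target = logits.argmax(dim=1).item()         # explain the predicted class

    # Backpropagate the target logit to obtain input gradients.
    logits[0, target].backward()

    # Gradient x input: per-feature relevance scores for this prediction.
    attribution = (x.grad * x).detach().squeeze()
    print(attribution)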

Benefits for attendees

This course targets both core AI researchers and applied machine learning practitioners. Core AI researchers will gain insights into the connections between interpretability methods, mechanistic understanding, and alignment strategies, as well as open questions in controlling complex AI systems. Applied ML practitioners will discover how interpretability and alignment tools can validate, audit, and improve models, providing actionable insights into their prediction strategies and decision-making processes.

Upon completion, attendees will:

  • Learn about a variety of interpretability and mechanistic analysis techniques, including attribution-based methods, concept-level explanations, circuit-level mechanistic approaches, and next-generation XAI for large-scale models.
  • Understand the theoretical foundations, relationships, and limitations of these methods, as well as how they support alignment and safe control of AI systems.
  • Gain skills to interact with AI models in a principled way to debug, validate, and improve their behavior.
  • Explore practical applications of these methods, particularly in large language models, generative models, and other foundation models, including techniques for steering and safely controlling their outputs.
  • Develop a broader perspective on building AI systems that are transparent, aligned, and controllable.

Course Content

The course begins by introducing foundations of interpretability and explainable AI, covering classical techniques, their theoretical underpinnings, and common challenges and misconceptions from early research. Participants will learn how traditional XAI methods reveal model behavior and identify flawed prediction strategies, such as “Clever Hans” effects. The second part of the course explores mechanistic interpretability, focusing on understanding the internal representations, circuits, and features of neural networks. Techniques for concept-level analysis, circuit dissection, and attribution-based insights will be presented, alongside practical toolboxes and interactive approaches that enable human users to investigate and probe model reasoning in a structured way. The course then transitions to AI alignment and control, examining strategies for defining and enforcing desired behaviors, detecting misalignment, and steering models safely. Topics include model editing, circuit-level interventions, and approaches to ensure robust and trustworthy AI performance. Finally, the course addresses applications to modern foundation models, such as large language models and generative systems, highlighting how interpretability and control techniques can be integrated for debugging, validation, improvement, and safe deployment. The course concludes with a discussion on emerging directions and open challenges at the intersection of interpretability, alignment, and controllable AI.
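
To give a concrete feel for the intervention strategies described above, the sketch below applies a simple activation-steering intervention through a PyTorch forward hook; the toy network and the random "steering vector" are purely illustrative assumptions (in practice such directions are derived from analysis of the model's internal representations).

    # Minimal sketch of an activation-steering intervention via a forward hook,
    # assuming a toy PyTorch network; the random steering vector is a stand-in
    # for a direction that would normally come from internal analysis.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
    model.eval()

    steering_vector = 0.5 * torch.randn(16)      # hypothetical direction in hidden space

    def add_steering(module, inputs, output):
        # Shift the hidden representation along the steering direction.
        return output + steering_vector

    # Register the intervention on the first hidden layer.
    handle = model[0].register_forward_hook(add_steering)

    x = torch.randn(1, 10)
    steered_logits = model(x)
    handle.remove()                              # restore the unmodified model
    original_logits = model(x)
    print(original_logits, steered_logits)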

Topics covered include:

  • Motivations: Black-box models, “Clever Hans” behavior, and alignment challenges
  • Classical XAI: Concepts, methods, applications, and limitations
  • Mechanistic interpretability: Circuit-level analysis, concept-level representations, and interactive investigation (a small probing sketch follows this list)
  • AI alignment: Formalizing objectives, human feedback, detecting misalignment
  • Controlling AI systems: Model editing, intervention strategies, and safe steering
  • Practical applications: Debugging, validating, and improving LLMs and foundation models
  • Future directions: Integrating interpretability, alignment, and control for responsible AI
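
As a small illustration of the concept-level analysis mentioned in the mechanistic interpretability topic above, the following sketch trains a linear probe on the hidden activations of a toy PyTorch network; the model, data, and binary "concept" labels are synthetic placeholders, not material from the course.

    # Minimal sketch of a linear probe for concept-level analysis, assuming a
    # toy PyTorch network and a synthetic binary "concept"; everything here is
    # an illustrative placeholder.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    model.eval()

    # Collect hidden activations for a batch of inputs.
    x = torch.randn(200, 10)
    with torch.no_grad():
        hidden = torch.relu(model[0](x))         # first-layer activations

    # Synthetic concept labels (stand-in for a real annotated concept).
    concept = (x[:, 0] > 0).float().unsqueeze(1)

    # Train a linear probe to predict the concept from the activations.
    probe = nn.Linear(32, 1)
    opt = torch.optim.Adam(probe.parameters(), lr=0.05)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(200):
        opt.zero_grad()
        loss = loss_fn(probe(hidden), concept)
        loss.backward()
        opt.step()

    acc = ((probe(hidden) > 0).float() == concept).float().mean()
    print(f"probe accuracy: {acc.item():.2f}")   # higher accuracy = concept more linearly decodable

In practice, such probes are fit on activations of real models with human-annotated concept data, and high probe accuracy is read as evidence that the concept is linearly represented at that layer.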

Bio Sketch of Course Instructor

Wojciech Samek is a Professor in the EECS Department at the Technical University of Berlin and jointly heads the AI Department at the Fraunhofer Heinrich Hertz Institute (HHI). He studied computer science at Humboldt University of Berlin, Heriot-Watt University, and the University of Edinburgh, and earned his Dr. rer. nat. degree with distinction from TU Berlin in 2014. Dr. Samek is a Fellow of the Berlin Institute for the Foundations of Learning and Data (BIFOLD), the ELLIS Unit Berlin, and the DFG Research Unit DeSBi. He serves as an elected member of Germany’s Platform for Artificial Intelligence, and as a Scientific Advisory Board Member of the AGH Center of Excellence in AI, the WUT Centre for Credible AI, and the IDEAS Research Institute. Additionally, he sits on the Executive Boards of the Helmholtz HEIBRiDS Data Science School and the DAAD Konrad Zuse School ELIZA. He is the recipient of multiple awards, including the 2025 Highly Cited Researcher Award, the 2020 Pattern Recognition Best Paper Award, the 2022 Digital Signal Processing Best Paper Prize, and the 2025 IEEE SPS Best Paper Award. He is also a member of the expert group developing the ISO/IEC MPEG-17 NNC standard. Dr. Samek has edited influential volumes on Explainable AI (2019) and xxAI – Beyond Explainable AI (2022). He has served as Senior Editor for IEEE TNNLS, Associate Editor for Pattern Recognition, Digital Signal Processing, and PLOS ONE, and as Area Chair at leading conferences including NeurIPS and ICML.