HEAL @ CHI 2026
Human-centered Evaluation and
Auditing of Language Models

Barcelona | April 13-17, 2026

Wednesday, April 15, 2026 – Long Workshop
(14:15–15:45 CEST and 16:30–18:00 CEST)

Overview

HEAL is back for its third iteration at CHI 2026! This workshop addresses the ongoing "evaluation crisis" in LLM research and practice by bringing together HCI and AI researchers to rethink LLM evaluation and auditing from a human-centered perspective.

Building on successful workshops at CHI 2024 and CHI 2025, we continue exploring topics around understanding stakeholders' needs and goals, establishing human-centered evaluation and auditing methods, developing tools and resources, and fostering community collaboration.

Special Theme

AI Agents-in-the-Loop: As practitioners increasingly turn to AI agents as evaluation partners, we must critically examine how to maintain human-centered approaches while leveraging agent capabilities for scale and efficiency. This year's theme explores the emerging frontier where human judgment meets agent automation—addressing fundamental questions about task allocation, meta-evaluation of evaluator agents, and the design of safeguards that preserve human agency while benefiting from automation.

Keynote Speaker

Toby Jia-Jun Li is an Assistant Professor in the Department of Computer Science and Engineering at the University of Notre Dame, where he leads the SaNDwich Lab and directs the Human-Centered Responsible AI Lab in the Lucy Family Institute for Data & Society. His research lies at the intersection of HCI, AI, and end-user software engineering, where he uses human-centered methods to design, build, and study interactive systems that empower individuals to create, configure, and extend AI-powered systems. He received his Ph.D. in Human-Computer Interaction from Carnegie Mellon University.

Key Information & Agenda

Workshop date: Wednesday, April 15, 2026

Time: 14:15–15:45 and 16:30–18:00 CEST

Location: Barcelona, Spain — P1 Room 116

Contact: heal.workshop@gmail.com

  • Session 1

  • Opening

    Time: 14:15–14:25 CEST

  • Keynote

    Time: 14:25–15:00 CEST

  • Interactive Poster Session

    Time: 15:00–15:45 CEST

  • Session 2

  • Agent-in-the-Loop Scenario Exploration

    Time: 16:30–17:15 CEST

  • Plenary Discussion

    Time: 17:15–17:45 CEST

  • Research Agenda & Future Collaborations

    Time: 17:45–18:00 CEST

All times displayed are in local Barcelona time (CEST).

Accepted Work

  • Auditing the auditor: evaluating and implementing LLMs and agentic AI performance within heterogeneous data, zero fault-tolerant environment of financial audit and analysis - Jingkun (Charly) Zhu, Wojtek Buczynski [Link]
  • Auditing Text-to-Image Model Safety under Implicit Prompts with Human and LLM-Assisted Evaluation - Shiqi Chen, Zhaofeng Niu, Bowen Wang, Yang Song, Isidro Butaslac, Liangzhi Li [Link]
  • Balancing Emergence and Controllability of Dynamic-Generative Games through Human-AI Co-Authorization - Minji Kim [Link]
  • Casual as an Anchor: Resolving Supervision Misalignment in Formality Transfer Dataset - Hyojeong Yu, Minsung Kim, Hyukhun Koh, Kyomin Jung [Link]
  • Child Safety in Generative AI: An Expert-Guided and Incident-Grounded Evaluation Framework - Haein Kong [Link]
  • CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents - Marta Sumyk, Oleksandr Kosovan [Link]
  • Evaluating LLM-Generated Lessons from the Language Learning Students' Perspective: A Short Case Study on Duolingo - Carlos Rafael Catalan, Patricia Nicole Monderin, Lheane Marie Dizon, Gap Estrella, Raymund John Sarmiento, Marie Antoinette Patalagsa [Link]
  • From Method to Interface: Human-Centred Auditing Under Resource and Access Constraints - Anonymous Author(s) [Link]
  • Human-Centered Evaluation of an LLM-Based Process Modeling Copilot: A Mixed-Methods Study with Domain Experts - Chantale Lauer, Peter Pfeiffer, Nijat Mehdiyev [Link]
  • Human-Centred LLM Privacy Audits: Findings and Frictions - Dimitri Staufer, David Hartmann [Link]
  • HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants - Benjamin Sturgeon, Jacob Haimes, Daniel Samuelson, Jacy Reese Anthis [Link]
  • Identifying Harm in Personalized, Generative AI Systems Requires User-Centered Auditing at the Interaction Level - Hannah Cha [Link]
  • Investigation of Expressiveness Performance of Large Language Models for HRM Simulations - Atsuhiro Fujii, Kazuma Negita, Keisuke Masuda, Wataru Uno, Daisuke Nakama [Link]
  • Invisible Saboteurs: Sycophantic LLMs Mislead Novices in Problem-Solving Tasks - Majeed Kazemitabaar, Jessica Y. Bo, Mengqing Deng, Michael Inzlicht, Ashton Anderson [Link]
  • Learning from Learners: Human-Centered Evaluation of Conversational Agents in Educational Settings - Emily Doherty, Michael Buchanan, E. Margaret Perkoff, Indrani Dey, Leanne Hirshfield [Link]
  • Managing Multi-Agent Research Systems: A Dashboard for Human Oversight of Coordinating AI Agents - Brian Kitano, Bryan Russett, Evan Carlson, Alex Kesling [Link]
  • No Man Is an Island: Explainable Graph-LLM Agents for Real-Time Clinical Reasoning - Ratna Kandala, Akshata Kishore Moharir, Niva Manchanda, Samantha Adorno [Link]
  • Position: We Must Proactively Address AI Safety Debt - Peter Wallich, Raymond Douglas [Link]
  • Reporting and Reviewing LLM-Integrated Systems in HCI: Challenges and Considerations - Eugene Syriani, Ian Arawjo, Karla Felix Navarro [Link]
  • Revitalizing Local Democracy: A Human-Centered Audit of LLMs in City Council Journalism - David Xia, Chris Maury [Link]
  • Severity-Dependent Bias in LLM Evaluators: A Span-Level Audit of Polarizing Language Detection in Everyday News - Kathleen Higgins, Ashrey Mahesh, Prerana Khatiwada, Varun Pappu, Benjamin E. Bagozzi, Matthew Louis Mauriello [Link]
  • Surfacing Governing Principles for Chatbots: A Workbench and Comparative Study - M. Antonietta Grasso, Jisun Park, Jutta Willamowski, Laurent Besacier, Jos Rozen [Link]
  • Synthetic user requirements: Sense making at early stages of product development - Valeria Resendez, Andrew Hornback, Harinishree Sathu, J. Ben Tamo, Yining Yuan, Nese Baz, Funda Yildirim, Russell Chan, May D. Wang, Maria Fernanda Cabrera, Simone Borsci [Link]
  • Through the Looking Glass of Multilingual AI: Contrasting Language- and Name Script-Dependent Ethnic Hierarchies in GPT and DeepSeek - Anonymous Author(s) [Link]
  • Verbalizing LLMs' Assumptions About the User to Calibrate Expectations and Reduce Sycophancy - Anonymous Author(s) [Link]
  • When Humans Become Regime Stabilizers: A Hidden Failure Mode in Agent-in-the-Loop LLM Evaluation - Vinicius Buri Lux [Link]
  • When Should Agents Evaluate? A Principled Framework for Human-AI Task Allocation in Language Model Evaluation - Sasha Mitts [Link]
  • When Your Boss is an AI: Identity Label Bias in LLM Task Allocation Decisions - Amirsiavosh Bashardoust, Selma Riedo, Yuanjun Feng, Yash Raj Shrestha [Link]

Call for Participation

We welcome participants working on human-centered evaluation and auditing of language models. Topics of interest include, but are not limited to:

  • Empirical understanding of stakeholders' needs and goals in LLM evaluation and auditing
  • Human-centered evaluation and auditing methods for LLMs
  • Tools, processes, and guidelines for LLM evaluation and auditing
  • Discussion of regulatory measures and public policies for LLM auditing
  • Ethics in LLM evaluation and auditing

Special Theme: AI Agents-in-the-Loop. We invite papers engaging with this year's theme, including:

  • Task allocation and workflow integration between human evaluators and AI agents
  • Impact of hybrid human-AI approaches on vulnerability discovery
  • Meta-evaluation frameworks for assessing trustworthiness of AI agents as auditing tools
  • Methodologies for evaluating complex agent behaviors (multi-step reasoning, tool use, emergent patterns)
  • Design patterns and safeguards for maintaining human oversight in automated evaluation
  • Empirical studies of agent-assisted evaluation in practice

Submission Format: 2-6 pages ACM double-column, excluding references.

Submission Types: Position papers, full or in-progress empirical studies, literature reviews, system demos, method descriptions, or encore of published work (non-archival).

Review Process: Double-blind (except encore submissions). Papers will be selected based on submission quality and diversity of perspectives, to enable a meaningful exchange of knowledge among a broad range of stakeholders.

Templates: [Word] [LaTeX] [Overleaf]

Notes:

  • We encourage submitting authors to also help with the review process.
  • Encore submissions do not need to be anonymized; they will go through a jury review process.
  • Please use \documentclass[sigconf,anonymous]{acmart} for submission.
  • Please be aware of OpenReview's moderation policy for newly created profiles: new profiles created without an institutional email go through a moderation process that can take up to two weeks, while new profiles created with an institutional email are activated automatically.

→ Submit Your Paper

Organizers