HEAL @ CHI 2026
Human-centered Evaluation and
Auditing of Language Models

Barcelona | April 13-17, 2026

Wednesday, April 15, 2026 – Long Workshop
(14:15–15:45 CEST and 16:30–18:00 CEST)

Overview

HEAL is back for its third iteration at CHI 2026! This workshop addresses the ongoing "evaluation crisis" in LLM research and practice by bringing together HCI and AI researchers to rethink LLM evaluation and auditing from a human-centered perspective.

Building on successful workshops at CHI 2024 and CHI 2025, we continue exploring topics around understanding stakeholders' needs and goals, establishing human-centered evaluation and auditing methods, developing tools and resources, and fostering community collaboration.

Special Theme

AI Agents-in-the-Loop: As practitioners increasingly turn to AI agents as evaluation partners, we must critically examine how to maintain human-centered approaches while leveraging agent capabilities for scale and efficiency. This year's theme explores the emerging frontier where human judgment meets agent automation—addressing fundamental questions about task allocation, meta-evaluation of evaluator agents, and the design of safeguards that preserve human agency while benefiting from automation.

Keynote Speaker

Toby Jia-Jun Li is an Assistant Professor in the Department of Computer Science and Engineering at the University of Notre Dame, where he leads the SaNDwich Lab and directs the Human-Centered Responsible AI Lab in the Lucy Family Institute for Data & Society. His research lies at the intersection of HCI, AI, and end-user software engineering, where he uses human-centered methods to design, build, and study interactive systems that empower individuals to create, configure, and extend AI-powered systems. He received his Ph.D. in Human-Computer Interaction from Carnegie Mellon University.

Key Information & Agenda

Workshop date: Wednesday, April 15, 2026

Time: 14:15–15:45 and 16:30–18:00 CEST

Location: Barcelona, Spain — P1 Room 116

Contact: heal.workshop@gmail.com

  • Session 1

  • Opening

    Time: 14:15–14:25 CEST

  • Keynote

    Time: 14:25–15:00 CEST

  • Interactive Poster Session

    Time: 15:00–15:45 CEST

  • Session 2

  • Agent-in-the-Loop Scenario Exploration

    Time: 16:30–17:15 CEST

  • Plenary Discussion

    Time: 17:15–17:45 CEST

  • Research Agenda & Future Collaborations

    Time: 17:45–18:00 CEST

All times displayed are in local Barcelona time (CEST).

Accepted Work

  • Auditing the auditor: evaluating and implementing LLMs and agentic AI performance within heterogeneous data, zero fault-tolerant environment of financial audit and analysis - Jingkun (Charly) Zhu, Wojtek Buczynski [Link]
  • Auditing Text-to-Image Model Safety under Implicit Prompts with Human and LLM-Assisted Evaluation - Shiqi Chen, Zhaofeng Niu, Bowen Wang, Yang Song, Isidro Butaslac, Liangzhi Li [Link]
  • Balancing Emergence and Controllability of Dynamic-Generative Games through Human-AI Co-Authorization - Minji Kim [Link]
  • Casual as an Anchor: Resolving Supervision Misalignment in Formality Transfer Dataset - Hyojeong Yu, Minsung Kim, Hyukhun Koh, Kyomin Jung [Link]
  • Child Safety in Generative AI: An Expert-Guided and Incident-Grounded Evaluation Framework - Haein Kong [Link]
  • CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents - Marta Sumyk, Oleksandr Kosovan [Link]
  • Evaluating LLM-Generated Lessons from the Language Learning Students' Perspective: A Short Case Study on Duolingo - Carlos Rafael Catalan, Patricia Nicole Monderin, Lheane Marie Dizon, Gap Estrella, Raymund John Sarmiento, Marie Antoinette Patalagsa [Link]
  • From Method to Interface: Human-Centred Auditing Under Resource and Access Constraints - Anonymous Author(s) [Link]
  • Human-Centered Evaluation of an LLM-Based Process Modeling Copilot: A Mixed-Methods Study with Domain Experts - Chantale Lauer, Peter Pfeiffer, Nijat Mehdiyev [Link]
  • Human-Centred LLM Privacy Audits: Findings and Frictions - Dimitri Staufer, David Hartmann [Link]
  • HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants - Benjamin Sturgeon, Jacob Haimes, Daniel Samuelson, Jacy Reese Anthis [Link]
  • Identifying Harm in Personalized, Generative AI Systems Requires User-Centered Auditing at the Interaction Level - Hannah Cha [Link]
  • Investigation of Expressiveness Performance of Large Language Models for HRM Simulations - Atsuhiro Fujii, Kazuma Negita, Keisuke Masuda, Wataru Uno, Daisuke Nakama [Link]
  • Invisible Saboteurs: Sycophantic LLMs Mislead Novices in Problem-Solving Tasks - Majeed Kazemitabaar, Jessica Y. Bo, Mengqing Deng, Michael Inzlicht, Ashton Anderson [Link]
  • Learning from Learners: Human-Centered Evaluation of Conversational Agents in Educational Settings - Emily Doherty, Michael Buchanan, E. Margaret Perkoff, Indrani Dey, Leanne Hirshfield [Link]
  • Managing Multi-Agent Research Systems: A Dashboard for Human Oversight of Coordinating AI Agents - Brian Kitano, Bryan Russett, Evan Carlson, Alex Kesling [Link]
  • No Man Is an Island: Explainable Graph-LLM Agents for Real-Time Clinical Reasoning - Ratna Kandala, Akshata Kishore Moharir, Niva Manchanda, Samantha Adorno [Link]
  • Position: We Must Proactively Address AI Safety Debt - Peter Wallich, Raymond Douglas [Link]
  • Reporting and Reviewing LLM-Integrated Systems in HCI: Challenges and Considerations - Eugene Syriani, Ian Arawjo, Karla Felix Navarro [Link]
  • Revitalizing Local Democracy: A Human-Centered Audit of LLMs in City Council Journalism - David Xia, Chris Maury [Link]
  • Severity-Dependent Bias in LLM Evaluators: A Span-Level Audit of Polarizing Language Detection in Everyday News - Kathleen Higgins, Ashrey Mahesh, Prerana Khatiwada, Varun Pappu, Benjamin E. Bagozzi, Matthew Louis Mauriello [Link]
  • Surfacing Governing Principles for Chatbots: A Workbench and Comparative Study - M. Antonietta Grasso, Jisun Park, Jutta Willamowski, Laurent Besacier, Jos Rozen [Link]
  • Synthetic user requirements: Sense making at early stages of product development - Valeria Resendez, Andrew Hornback, Harinishree Sathu, J. Ben Tamo, Yining Yuan, Nese Baz, Funda Yildirim, Russell Chan, May D. Wang, Maria Fernanda Cabrera, Simone Borsci [Link]
  • Through the Looking Glass of Multilingual AI: Contrasting Language- and Name Script-Dependent Ethnic Hierarchies in GPT and DeepSeek - Anonymous Author(s) [Link]
  • Verbalizing LLMs' Assumptions About the User to Calibrate Expectations and Reduce Sycophancy - Anonymous Author(s) [Link]
  • When Humans Become Regime Stabilizers: A Hidden Failure Mode in Agent-in-the-Loop LLM Evaluation - Vinicius Buri Lux [Link]
  • When Should Agents Evaluate? A Principled Framework for Human-AI Task Allocation in Language Model Evaluation - Sasha Mitts [Link]
  • When Your Boss is an AI: Identity Label Bias in LLM Task Allocation Decisions - Amirsiavosh Bashardoust, Selma Riedo, Yuanjun Feng, Yash Raj Shrestha [Link]

Call for Participation

We welcome participants working on human-centered evaluation and auditing of language models. Topics of interest include, but are not limited to:

  • Empirical understanding of stakeholders' needs and goals in LLM evaluation and auditing
  • Human-centered evaluation and auditing methods for LLMs
  • Tools, processes, and guidelines for LLM evaluation and auditing
  • Discussion of regulatory measures and public policies for LLM auditing
  • Ethics in LLM evaluation and auditing

Special Theme: AI Agents-in-the-Loop. We invite papers engaging with this year's theme, including:

  • Task allocation and workflow integration between human evaluators and AI agents
  • Impact of hybrid human-AI approaches on vulnerability discovery
  • Meta-evaluation frameworks for assessing trustworthiness of AI agents as auditing tools
  • Methodologies for evaluating complex agent behaviors (multi-step reasoning, tool use, emergent patterns)
  • Design patterns and safeguards for maintaining human oversight in automated evaluation
  • Empirical studies of agent-assisted evaluation in practice

Submission Format: 2-6 pages ACM double-column, excluding references.

Submission Types: Position papers, full or in-progress empirical studies, literature reviews, system demos, method descriptions, or encore of published work (non-archival).

Review Process: Double-blind (except encore submissions). Papers will be selected based on submission quality and diversity of perspectives, to enable a meaningful exchange of knowledge among a broad range of stakeholders.

Templates: [Word] [LaTeX] [Overleaf]

Notes:

  • We encourage submitting authors to also help with the review process.
  • Encore submissions do not need to be anonymized; they will go through a jury review process.
  • Please use \documentclass[sigconf,anonymous]{acmart} for submission.
  • Please be aware of OpenReview's moderation policy for newly created profiles: new profiles created without an institutional email go through a moderation process that can take up to two weeks, while new profiles created with an institutional email are activated automatically.

→ Submit Your Paper

Organizers