HEAL: Human-centered Evaluation and Auditing of Language Models

CHI 2024 Workshop
Sunday, May 12, 2024

Honolulu, Hawaii, USA (Hybrid)

→ Submission Site


This workshop aims to address the current "evaluation crisis" in LLM research and practice by bringing together HCI and AI researchers and practitioners to rethink LLM evaluation and auditing from a human-centered perspective. Recent advances in Large Language Models (LLMs) have significantly impacted numerous real-world applications and will impact many more. However, these models also pose significant risks to individuals and society. To mitigate these risks and guide future model development, responsible evaluation and auditing of LLMs are essential.

The CHI 2024 Workshop on Human-centered Evaluation and Auditing of Language Models (HEAL@CHI'24) will explore topics around understanding stakeholders' needs and goals in evaluating and auditing LMs, establishing human-centered evaluation and auditing methods, developing tools and resources to support these methods, building community, and fostering collaboration.

Keynote Speakers

Afternoon Keynote Speaker

Dr. Wei Xu

Human-AI Collaboration in Evaluating Large Language Models

To support real-world applications more responsibly and further improve large language models (LLMs), it is essential to design reliable and reusable frameworks for their evaluation. In this talk, I will discuss three forms of human-AI collaboration for evaluation that combine the strengths of both: (1) the reliability and user-centric aspect of human evaluation, and (2) the cost efficiency and reproducibility offered by automatic evaluation. The first part focuses on systematically assessing LLMs’ favoritism towards Western culture, using a hybrid approach of manual effort and automated analysis. The second part will showcase an LLM-powered privacy preservation tool, designed to safeguard users against the disclosure of personal information. I will share some interesting findings from an HCI user study that involves real Reddit users utilizing our tool, which in turn informs our ongoing efforts to improve the design of NLP models. Lastly, we will delve into the evaluation of LLM-generated texts, where human judgments can be used to train automatic evaluation metrics to detect errors. We also highlight the opportunity of engaging both laypeople and experts in evaluating LLM-generated simplified medical texts in high-stakes healthcare applications.


Wei Xu is an Associate Professor in the College of Computing and Machine Learning Center at the Georgia Institute of Technology, where she is the director of the NLP X Lab. Her research interests are in natural language processing and machine learning, with a focus on Generative AI, robustness and fairness of large language models, multilingual LLMs, as well as AI for science, education, accessibility, and privacy research. She is a recipient of the NSF CAREER Award, CrowdFlower AI for Everyone Award, Best Paper Award and Honorable Mention at COLING'18, ACL'23. She also received research funds from DARPA and IARPA. She is currently an executive board member of NAACL.

Afternoon Keynote Speaker

Dr. Sherry Tongshuang Wu

Practical AI Systems: From General-Purpose AI to (the Right) Specific Use Cases

AI research has made great strides in developing general-purpose models (e.g., LLMs) that can excel across a wide range of tasks, enabling users to explore AI applications tailored to their unique needs without the complexities of custom model training. However, with the opportunities come challenges: general-purpose models prioritize overall performance, which can neglect specific user needs. How can we make these models practically usable? In this talk, I will present our recent work on assessing and tailoring general-purpose models for specific use cases. I will first cover methods for evaluating and mapping LLMs to specific usage scenarios, then reflect on the importance of identifying the right tasks for LLMs by comparing how humans and LLMs may perform the same tasks differently. In my final remarks, I will discuss the potential of training humans and LLMs with complementary skill sets.


Sherry Tongshuang Wu is an Assistant Professor in the Human-Computer Interaction Institute at Carnegie Mellon University. Her research lies at the intersection of Human-Computer Interaction and Natural Language Processing, and primarily focuses on how humans (AI experts, lay users, domain experts) can practically interact with (debug, audit, and collaborate with) AI systems. To this end, she has worked on assessing NLP model capabilities, supporting human-in-the-loop NLP model debugging and correction, as well as facilitating human-AI collaboration. She has authored award-winning papers in top-tier NLP, HCI, and Visualization conferences and journals such as ACL, CHI, TOCHI, and TVCG. Before joining CMU, Sherry received her Ph.D. degree from the University of Washington and her bachelor's degree from the Hong Kong University of Science and Technology, and has interned at Microsoft Research, Google Research, and Apple.


The primary goal of this one-day workshop is to bring together HCI and AI researchers from academia, industry, and non-profits to share their ongoing efforts around evaluating and auditing language models.

All times displayed in the program are in local Honolulu, Hawaii time (GMT-10)

Accepted Work

Workshop papers

Encore papers

Key Information

Submission deadline: Mar 1, 2024 (AoE) (extended from Feb 23, 2024)

Notification of acceptance: Mar 24, 2024 (originally Mar 22, 2024)

Workshop date: Sunday, May 12, 2024

Workshop location: Honolulu, Hawaii, USA (Hybrid)

Contact: heal.workshop@gmail.com

Call for Participation

We welcome participants who work on topics related to supporting human-centered evaluation and auditing of language models. Interested participants will be asked to contribute a short paper to the workshop. Topics of interest include, but are not limited to:

Submission Format: 2-6 pages in ACM double-column format, excluding references.

Submission Types: Position papers, full or in-progress empirical studies, literature reviews, system demos, method descriptions, or encores of published work. Submissions will be non-archival.

Review Process: Double-blind. Papers will be selected based on the quality of the submission and diversity of perspectives to allow for a meaningful exchange of knowledge between a broad range of stakeholders.

Templates: [Word] [LaTeX] [Overleaf]

