
LLM-based Synthetic Ground Truth Generation for Audio-Based Emotion Classification via In-Context Learning
Abstract
Understanding human states and interaction dynamics is a core goal of human-computer interaction (HCI). As interaction paradigms become more immersive, virtual reality (VR) has emerged as a powerful platform for studying collaborative work. In such settings, evaluating team collaboration states, including team performance and team resilience, requires continuous and reliable inference of latent team-level cognitive and affective states from multi-modal sensor data, such as speech signals. However, generating ground truth labels for these latent states remains challenging due to sensor-induced noise, contextual variability, and sparse expert annotations. Traditional self-reporting approaches provide only static and delayed measurements and are therefore insufficient for capturing dynamic team processes reflected in continuous speech data. In this work, we propose a large language model (LLM)-driven, agentic inference workflow for automated emotion-related synthetic ground truth generation from streaming speech data in multi-user VR environments. Leveraging the generalization capabilities of LLMs, we use In-Context Learning (ICL) with few-shot demonstrations of paired audio-based samples and their corresponding transcriptions. ICL tends to achieve task adaptation comparable to model fine-tuning while circumventing the computational overhead of parameter updates. To construct informative and robust in-context prompts, we adopt a retrieval-based selection strategy that dynamically identifies relevant audio demonstrations based on similarity in the acoustic feature space. Specifically, retrieval is guided by prosodic descriptors derived from speech signals, whereas transcript-based semantic information is incorporated directly within the prompt and interpreted by the LLM through its reasoning capabilities. This modality-aware strategy promotes consistent affective labeling and improved generalization across previously unseen speech segments. 
Evaluated on multi-player VR audio recordings, our methodology demonstrates potential as a scalable, data-efficient component for data-driven team-based decision support. By integrating acoustic similarity-based retrieval with LLM-based semantic reasoning, this work contributes to emerging interdisciplinary methodologies at the intersection of scientific machine learning, multi-modal systems, and AI-driven decision-making.
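The retrieval-based demonstration selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional feature vectors, the emotion labels, the cosine similarity measure, and the prompt wording are all assumptions chosen for illustration, and the sketch passes only transcripts to the prompt, whereas the described workflow pairs them with audio samples.

```python
import math
from dataclasses import dataclass


@dataclass
class Demo:
    """One labeled demonstration; every field here is a hypothetical example."""
    features: list[float]  # assumed prosodic descriptors, e.g. pitch, energy, rate
    transcript: str
    label: str


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_features, pool, k=2):
    """Return the k demonstrations most similar to the query in feature space."""
    ranked = sorted(
        pool, key=lambda d: cosine(query_features, d.features), reverse=True
    )
    return ranked[:k]


def build_prompt(query_features, query_transcript, pool, k=2):
    """Assemble a few-shot prompt from retrieved demonstrations plus the query."""
    parts = ["Label the emotion of the final utterance."]
    for demo in retrieve(query_features, pool, k):
        parts.append(f'Transcript: "{demo.transcript}"\nEmotion: {demo.label}')
    # The query is left unlabeled so that the LLM completes it.
    parts.append(f'Transcript: "{query_transcript}"\nEmotion:')
    return "\n\n".join(parts)


# Toy demonstration pool (fabricated values, for illustration only).
POOL = [
    Demo([0.9, 0.8, 0.7], "We are running out of time!", "stressed"),
    Demo([0.2, 0.1, 0.3], "Okay, take it slowly.", "calm"),
    Demo([0.8, 0.9, 0.6], "Yes, that worked!", "excited"),
]

prompt = build_prompt([0.85, 0.8, 0.7], "Hurry, the door is closing!", POOL, k=2)
print(prompt)
```

In this sketch the acoustic features decide only which demonstrations are shown, while the transcript text is what the LLM reads, mirroring the split between prosody-guided retrieval and transcript-based reasoning. In practice the features would likely need normalization before any similarity comparison.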
Keywords
Large Language Models (LLMs)
In-Context Learning (ICL)
Synthetic Ground Truth
Affective Computing
Data-Driven Decision Support
Virtual Reality (VR)
Proceedings of the International Conference on Applied Innovations in IT by Anhalt University of Applied Sciences is licensed under CC BY-SA 4.0.
Published by ICAIIT in cooperation with Anhalt University of Applied Sciences.