2025/04/26, Volume 13, Issue 1, pp.1-7
Enhancing Voice Activity Detection for an Elderly-Centric Self-Learning Conversational Robot Partner in Noisy Environments
Subashkumar Rajanayagam, Max Andreas Ingrisch, Pascal Müller, Patrick Jahn and Stefan Twieg

Abstract: Voice Activity Detection (VAD) is a core component of Human-Robot Interaction (HRI), especially for use cases such as a self-learning, personalized conversational robot partner designed to support elderly users and achieve high acceptance. While state-of-the-art lightweight deep-learning-based VAD models achieve high precision, they often suffer from low recall in environments with significant background noise or music. Conversely, traditional lightweight rule-based VAD methods tend to yield higher recall, but at the expense of precision. These limitations can degrade the user experience, particularly for elderly individuals, by causing frustration over missed spoken inputs and reducing the overall usability and acceptance of conversational robot partners. This study investigates noise-suppressing preprocessing techniques to enhance both the recall and precision of existing VAD systems. Experimental results demonstrate that effective noise suppression prior to VAD processing substantially improves voice detection accuracy in noisy settings, ultimately promoting better interaction quality in elderly-centric robotic applications. Moreover, optimal sample rates, frame durations, thresholds and voice activity modes were identified for the Double3 robot, the conversational robot partner platform for seniors in a care home that was developed co-creatively in reflection sessions with the nursing staff. Robustness was benchmarked both on an open-source dataset and on a dataset collected and annotated in-house with the Double3 robot.
Keywords: Voice Activity Detection, Human Robot Interaction, Conversational Robot Partner, Elderly-Centric.
DOI: Under Indexing
Download: PDF
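For illustration, the following is a minimal sketch of the denoise-then-detect pipeline the abstract describes: a speech-enhancement front end (here DeepFilterNet, cited in the references) followed by a lightweight rule-based detector (here WebRTC VAD). The function name `detect_speech`, the parameter values, and the use of the `df`, `torchaudio`, and `webrtcvad` Python packages are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
import torchaudio
import webrtcvad
from df.enhance import enhance, init_df, load_audio  # DeepFilterNet

VAD_RATE = 16000   # WebRTC VAD accepts 8, 16, 32, or 48 kHz 16-bit mono PCM
FRAME_MS = 30      # WebRTC VAD frames must be 10, 20, or 30 ms long
FRAME_LEN = VAD_RATE * FRAME_MS // 1000

def detect_speech(wav_path: str, vad_mode: int = 2) -> list[bool]:
    """Suppress noise with DeepFilterNet, then flag speech frames with WebRTC VAD.

    vad_mode ranges 0..3; higher values reject non-speech more aggressively
    (more precision, less recall), the trade-off discussed in the abstract.
    All values here are illustrative defaults, not the paper's tuned settings.
    """
    # 1) Noise suppression; DeepFilterNet operates at its own rate (48 kHz).
    model, df_state, _ = init_df()
    audio, _ = load_audio(wav_path, sr=df_state.sr())
    enhanced = enhance(model, df_state, audio)

    # 2) Resample to the VAD rate and convert to 16-bit mono PCM.
    pcm = torchaudio.functional.resample(enhanced, df_state.sr(), VAD_RATE)
    pcm16 = (pcm.squeeze().clamp(-1, 1).detach().cpu().numpy() * 32767).astype(np.int16)

    # 3) Frame-wise voice activity decisions on the denoised signal.
    vad = webrtcvad.Vad(vad_mode)
    return [
        vad.is_speech(pcm16[i:i + FRAME_LEN].tobytes(), VAD_RATE)
        for i in range(0, len(pcm16) - FRAME_LEN + 1, FRAME_LEN)
    ]
```

Sweeping vad_mode, FRAME_MS and the sample rate against annotated recordings is the kind of parameter search the abstract reports for the Double3 platform.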
References:
- S. Yadav, P. A. D. Legaspi, M. S. O. Alink, A. B. J. Kokkeler and B. Nauta, "Hardware Implementations for Voice Activity Detection: Trends, Challenges and Outlook," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 70, no. 3, pp. 1083-1096, March 2023, doi: 10.1109/TCSI.2022.3225717.
- Silero Team, "Silero Models: State-of-the-Art Speech Processing Models," GitHub repository, 2024. [Online]. Available: https://github.com/snakers4/silero-models. [Accessed: 09-Feb-2025].
- Google, "WebRTC," 2011. [Online]. Available: https://webrtc.org/. [Accessed: Feb. 2025].
- Atlantis-vzw.com, "Learning a foreign language with more ease." [Online]. Available: https://www.atlantis-vzw.com/vreemde-talen?lang=en. [Accessed: Feb. 02, 2025].
- M. Sharma, S. Joshi, T. Chatterjee, and R. Hamid, "A comprehensive empirical review of modern voice activity detection approaches for movies and TV shows," Neurocomputing, vol. 494, pp. 116-131, Jul. 2022, doi: 10.1016/j.neucom.2022.04.084.
- R. M. Patil and C. M. Patil, "Unveiling the State-of-the-Art: A Comprehensive Survey on Voice Activity Detection Techniques," 2024, doi: 10.1109/APCIT62007.2024.10673721.
- S. Chaudhuri, J. Roth, D. P. Ellis, A. C. Gallagher, L. Kaver, R. Marvin, C. Pantofaru, N. Reale, L. G. Reid, K. W. Wilson, and Z. Xi, “AVA-Speech: A densely labeled dataset of speech activity in movies,” arXiv preprint arXiv:1808.00606, Aug. 2018. [Online]. Available: https://arxiv.org/abs/1808.00606.
- X. L. Zhang and M. Xu, "AUC optimization for deep learning-based voice activity detection," EURASIP J. Audio, Speech, Music Process., vol. 2022, no. 1, pp. 1-12, Dec. 2022, doi: 10.1186/s13636-022-00260-9.
- T. Yoshimura, T. Hayashi, K. Takeda, and S. Watanabe, "End-to-End Automatic Speech Recognition Integrated with CTC-Based Voice Activity Detection," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), May 2020, pp. 6999-7003, doi: 10.1109/ICASSP40776.2020.9054358.
- F. Gu, M.-H. Chung, M. Chignell, S. Valaee, B. Zhou, and X. Liu, "A Survey on Deep Learning for Human Activity Recognition," ACM Comput. Surv., vol. 54, no. 8, p. 177, 2021, doi: 10.1145/3472290.
- S. Twieg and B. Zimmermann, "Acoustic clustering for vehicle based sounds," 2010.
- P. Müller, H. K. Gali, S. Rajanayagam, S. Twieg, P. Jahn, and S. Hofstetter, "Making Complex Technologies Accessible Through Simple Controllability: Initial Results of a Feasibility Study," Applied Sciences, vol. 15, no. 2, p. 1002, 2024, doi: 10.3390/app15021002.
- Audacity(R) Team, "Audacity," 1999-2014. [Online]. Available: http://audacity.sourceforge.net/. Free software distributed under the terms of the GNU General Public License; the name Audacity(R) is a registered trademark of Dominic Mazzoni.
- H. Schröter, T. Rosenkranz, A. Escalante-B., and A. Maier, "DeepFilterNet: Perceptually motivated real-time speech enhancement," in Proc. 17th Int. Workshop Acoustic Signal Enhancement (IWAENC), 2022.