Proceedings of International Conference on Applied Innovation in IT
2025/04/26, Volume 13, Issue 1, pp. 155-160

Calibration of the Open-Vocabulary Model YOLO-World by Using Temperature Scaling


Max Andreas Ingrisch, Subashkumar Rajanayagam, Ingo Chmielewski and Stefan Twieg


Abstract: In many real-world domains, such as robotics and autonomous driving, deep learning models are an indispensable tool for detecting objects in the environment. In recent years, supervised models such as YOLO or Faster R-CNN have increasingly been used for this purpose. One disadvantage of these models is that they can only detect objects within a closed vocabulary. To overcome this limitation, research is currently being conducted into models that can also detect objects outside the known classes of the training dataset. Such a model is trained on a set of base classes and can then recognize novel, unseen classes; this is referred to as open-vocabulary detection (OVD). Recent models such as YOLO-World offer a solution to this problem, but they tend to over- or underestimate their confidence values and are therefore often poorly calibrated. Reliable confidence estimates, however, are a crucial factor for deploying these models in the real world, where safety and trustworthiness must be ensured. To address this problem, this paper investigates the influence of the temperature scaling calibration method on the OVD model YOLO-World. The optimal temperature T is determined on two calibration datasets (Pascal VOC and Open Images V7) and then evaluated on the LVIS minival dataset. The results show that temperature scaling improves the Expected Calibration Error (ECE) from 6.78% to 2.31%, although the model still overestimates the confidence values in some bins.
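For readers unfamiliar with the procedure summarized above, the sketch below illustrates the two building blocks the abstract refers to: computing the Expected Calibration Error (ECE) over confidence bins and searching for a temperature T on a held-out calibration set. This is a minimal NumPy illustration, not the authors' implementation; the function names, the sigmoid mapping of detection logits, the choice of 10 bins, the grid-search range for T, and the notion of a "correct" detection (e.g., IoU of at least 0.5 with a ground-truth box of the predicted class) are all assumptions made here for the sake of a self-contained example.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted mean of |accuracy - confidence| over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        acc = correct[in_bin].mean()        # empirical accuracy in this bin
        conf = confidences[in_bin].mean()   # mean predicted confidence in this bin
        ece += in_bin.mean() * abs(acc - conf)
    return ece

def apply_temperature(logits, T):
    """Scale detection logits by 1/T and map to confidences.
    A sigmoid is assumed here (per-class detection scores); softmax would be
    used instead for mutually exclusive classification logits."""
    return 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float) / T))

def search_temperature(logits, correct, candidates=np.linspace(0.5, 5.0, 91)):
    """Grid-search the temperature that minimizes ECE on a calibration set."""
    best_T, best_ece = 1.0, float("inf")
    for T in candidates:
        ece = expected_calibration_error(apply_temperature(logits, T), correct)
        if ece < best_ece:
            best_T, best_ece = T, ece
    return best_T, best_ece

In a workflow like the one described in the abstract, the logits and correctness labels would be collected from detections on the calibration datasets (e.g., Pascal VOC and Open Images V7), the selected T would then be fixed, and the ECE would be reported on the evaluation set (LVIS minival) with the same binning.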

Keywords: Calibration, YOLO-World, Temperature Scaling, Expected Calibration Error, Open-Vocabulary Detection.

DOI: Under Indexing


