Proceedings of International Conference on Applied Innovation in IT  ·  2021/04/28  ·  Vol. 9  ·  Issue 1  ·  pp. 69–76
Models and Algorithms for Automatic Labelling of Unstructured Texts (Text Tagging)
Gyuzel Shakhmametova, Ilshat Ishmukhametov
The article discusses the task of automatic labelling of texts to improve the efficiency of processing unstructured text data. An overview of existing software products for this task is given, showing the need to develop a dedicated solution specialized in the processing of Russian-language texts. The problem of assigning labels is treated mathematically as a multilabel classification problem, and the corresponding mathematical models are analysed and described. On this basis, models, algorithms, and a software product for automatically assigning labels to texts have been developed. Numerical experiments showed the universality of the method and its applicability in both non-specialized and specialized fields, in particular for processing medical documents.
Keywords: Automatic Labelling of Texts, Unstructured Text, Text Tagging, Multilabel Classification, Keywords Extraction

Proceedings of the International Conference on Applied Innovations in IT by Anhalt University of Applied Sciences is licensed under CC BY-SA 4.0  ·  This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
