Proceedings of International Conference on Applied Innovation in IT
2021/04/28, Volume 9, Issue 1, pp.69-76

Models and Algorithms for Automatic Labelling of Unstructured Texts (Text Tagging)


Gyuzel Shakhmametova, Ilshat Ishmukhametov


Abstract: The article addresses automatic labelling of texts as a way to process unstructured text data more efficiently. An overview of existing software products for this task shows the need for a dedicated solution specialized in processing Russian-language texts. The label assignment problem is treated mathematically as a multilabel classification problem, and the corresponding mathematical models are analysed and described. On this basis, models, algorithms, and a software product for automatically assigning labels to texts have been developed. Numerical experiments showed that the method is universal and can be applied in both non-specialized and specialized fields, in particular to the processing of medical documents.

Keywords: Automatic Labelling of Texts, Unstructured Text, Text Tagging, Multilabel Classification, Keywords Extraction

DOI: 10.25673/36586



Proceedings of the International Conference on Applied Innovations in IT by Anhalt University of Applied Sciences is licensed under CC BY-SA 4.0




ISSN 2199-8876
Publisher: Edition Hochschule Anhalt
Location: Anhalt University of Applied Sciences