This article addresses the automatic labelling of texts as a way to process unstructured text data more efficiently. A review of existing software products for this task shows the need for a dedicated solution specialized in Russian-language texts. Label assignment is treated mathematically as a multilabel classification problem, and the corresponding mathematical models are analysed and described. On this basis, models, algorithms, and a software product for automatically assigning labels to texts were developed. Numerical experiments showed the universality of the method and its applicability in both non-specialized and specialized fields, in particular for processing medical documents.
Keywords
Automatic Labelling of Texts, Unstructured Text, Text Tagging, Multilabel Classification, Keywords Extraction