

Ways to build text collections for training classifiers
https://doi.org/10.26896/1028-6861-2021-87-7-76-84
Abstract
We address the problem of building a Russian-language text collection (dataset) of bibliographic descriptions of scientific articles for training classifiers. Various approaches to creating such collections are considered, and the expediency of using expert estimates to assign class labels is assessed. Known datasets are analyzed, requirements for the generated text array are formulated, and the choice of subject area (Computer Science) is justified. We propose a technology for forming a collection when Russian-language articles are in short supply: publications (bibliographic descriptions) from available English-language electronic libraries (ACM Digital Library, IEEE Xplore Digital Library, CiteSeerX) are translated automatically, with additional expert quality control of the translation. The resulting bibliographic collection was studied using clustering (Latent Semantic Analysis) and visualization (Principal Component Analysis) methods. Training and test samples were compiled, and "standard" classifiers (k-Nearest Neighbors, Logistic Regression, Random Forest) were applied. Standard quality measures (accuracy, precision, recall) were then calculated. Both rigid and soft classification were carried out. Across the studied classifiers, all calculated measures fell within [0.79; 0.87] for rigid classification and within [0.91; 0.95] for soft classification. The experiments showed almost identical results for Russian and English bibliographic descriptions (the difference did not exceed 2%). The proposed method of forming text collections reduces the labor of labeling compared to the expert approach, solves the problem of the lack of Russian-language documents, and allows sufficiently large, balanced bibliographic datasets to be formed for training and testing classifiers.
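The quality measures named in the abstract (accuracy, precision, recall) can be illustrated with a minimal sketch. The abstract does not say how precision and recall were averaged over classes, so macro-averaging is an assumption here; the function name `quality_measures`, the class names, and the labels are all hypothetical and are not taken from the article's data.

```python
from collections import Counter


def quality_measures(y_true, y_pred):
    """Return accuracy and macro-averaged precision and recall.

    Macro-averaging is an assumption of this sketch; the article's
    abstract does not specify the averaging scheme.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")

    classes = sorted(set(y_true) | set(y_pred))
    predicted = Counter(y_pred)  # how often each class was predicted
    actual = Counter(y_true)     # how often each class truly occurs
    tp = Counter()               # correct predictions per class
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1

    accuracy = sum(tp.values()) / len(y_true)
    # A class that was never predicted (or never occurs) contributes 0.
    precision = sum(
        tp[c] / predicted[c] if predicted[c] else 0.0 for c in classes
    ) / len(classes)
    recall = sum(
        tp[c] / actual[c] if actual[c] else 0.0 for c in classes
    ) / len(classes)
    return accuracy, precision, recall


# Hypothetical labels for six bibliographic descriptions in three made-up classes.
y_true = ["AI", "AI", "DB", "DB", "NET", "NET"]
y_pred = ["AI", "DB", "DB", "DB", "NET", "AI"]

accuracy, precision, recall = quality_measures(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Running the same function on the predictions of each classifier, once for the Russian descriptions and once for the English ones, would give figures that can be compared in the way the abstract describes.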
About the Authors
N. I. Mulatov
Russian Federation
Nikolai I. Mulatov
14, Krasnokazarmennaya ul., Moscow, 111250
A. S. Mokhov
Russian Federation
Andrey S. Mokhov
14, Krasnokazarmennaya ul., Moscow, 111250
V. O. Tolcheev
Russian Federation
Vladimir O. Tolcheev
14, Krasnokazarmennaya ul., Moscow, 111250
For citations:
Mulatov N.I., Mokhov A.S., Tolcheev V.O. Ways to build text collections for training classifiers. Industrial laboratory. Diagnostics of materials. 2021;87(7):76-84. (In Russ.) https://doi.org/10.26896/1028-6861-2021-87-7-76-84