Industrial laboratory. Diagnostics of materials

Comparative analysis of binary classifiers on an array of scientific publications

https://doi.org/10.26896/1028-6861-2022-88-7-79-87

Abstract

Binary classifiers are studied on balanced text samples. The samples are formed from scientific publications in the field of Computer Science. The first class contains articles on «Text Data Mining» (the «TDM» class); the second contains works on other Computer Science topics (the «non-TDM» class). All the main stages of preprocessing text documents are considered, and models for representing them are analyzed. The binary classification problem is formulated and the quality metrics used in the study are given. A method for building a sample from the Russian digital library eLibrary is proposed. The resulting sample consists of bibliographic descriptions of documents (title, abstract, and keywords). An exploratory analysis was carried out and the structure of the sample was studied. «Term clouds» for the two classes are constructed and analyzed, and the documents are visualized using t-distributed stochastic neighbor embedding (t-SNE). Based on a review and analysis of known classifiers, the following methods were selected for the study: k-nearest neighbors, random forest, gradient boosting, logistic regression, and the support vector machine. Profile methods, which build a vector (profile) of the most informative terms determined by the frequency of occurrence of terms in the classes, are also used in the study. The parameters of the methods were tuned using five-fold cross-validation. The best classification quality on our sample was shown by the methods based on the ensemble (collective) decision-making principle (random forest, gradient boosting), as well as by the support vector machine. The best classifier, gradient boosting, achieved a proportion of correct answers (accuracy) of about 0.98, with recall and precision of about 0.99.
The other (simpler) methods used in the study also showed fairly good classification quality overall: for the least accurate method, k-nearest neighbors, accuracy, recall, and precision were 0.90, 0.81, and 0.91, respectively.
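
The three quality metrics reported in the abstract (accuracy, precision, and recall) can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `binary_metrics`, the label strings, and the toy predictions are assumptions made for illustration, with «TDM» treated as the positive class.

```python
def binary_metrics(y_true, y_pred, positive="TDM"):
    """Return accuracy, precision and recall for a binary classification task.

    `positive` names the class treated as positive (assumed here: "TDM").
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    return {
        # share of all documents that were labelled correctly
        "accuracy": (tp + tn) / total if total else 0.0,
        # share of documents predicted as positive that really are positive
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # share of truly positive documents that were found
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical balanced toy sample: 5 "TDM" and 5 "non-TDM" documents.
    y_true = ["TDM"] * 5 + ["non-TDM"] * 5
    y_pred = ["TDM"] * 4 + ["non-TDM"] * 5 + ["TDM"]
    print(binary_metrics(y_true, y_pred))
```

On a balanced sample such as the one described above, accuracy is not dominated by either class, which is presumably why it is reported alongside precision and recall rather than instead of them.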

About the Authors

P. A. Kozlov
National Research University «Moscow Power Engineering Institute»
Russian Federation

Pavel A. Kozlov

14, Krasnokazarmennaya ul., Moscow, 111250, Russia



A. S. Mokhov
National Research University «Moscow Power Engineering Institute»
Russian Federation

Andrey S. Mokhov

14, Krasnokazarmennaya ul., Moscow, 111250, Russia



N. A. Nazarov
National Research University «Moscow Power Engineering Institute»
Russian Federation

Nikolay A. Nazarov

14, Krasnokazarmennaya ul., Moscow, 111250, Russia



Sh. I. Safin
National Research University «Moscow Power Engineering Institute»
Russian Federation

Shahim I. Safin

14, Krasnokazarmennaya ul., Moscow, 111250, Russia



V. O. Tolcheev
National Research University «Moscow Power Engineering Institute»
Russian Federation

Vladimir O. Tolcheev

14, Krasnokazarmennaya ul., Moscow, 111250, Russia



For citations:


Kozlov P.A., Mokhov A.S., Nazarov N.A., Safin Sh.I., Tolcheev V.O. Comparative analysis of binary classifiers on an array of scientific publications. Industrial laboratory. Diagnostics of materials. 2022;88(7):79-87. (In Russ.) https://doi.org/10.26896/1028-6861-2022-88-7-79-87

ISSN 1028-6861 (Print)
ISSN 2588-0187 (Online)