

Procedure for checking the uniformity of samples of text documents based on nonparametric criteria
https://doi.org/10.26896/1028-6861-2023-89-7-71-77
Abstract
One of the most important tasks in text mining is building sufficiently large, representative, and consistent samples (datasets). Datasets are usually assembled from several information sources. In some cases, because specialized texts in Russian are scarce, a dataset is expanded with translated English-language documents. In such situations it is advisable to assess whether the combined arrays are homogeneous or heterogeneous. This check is complicated by the fact that documents are represented as multidimensional vectors, and comparing them correctly is a non-trivial task. Because procedures for checking the homogeneity (uniformity) of samples in the multidimensional case are insufficiently developed, possible differences in the data are in practice ignored as insignificant. As a result, classifiers are trained on samples that mix quite diverse texts, and the quality of categorization does not improve (or even deteriorates). It is therefore relevant to develop a procedure for checking the homogeneity of document samples. To this end, we carry out a comprehensive study of shift in textual data and identify and analyze the causes of heterogeneity in document arrays. The datasets in this study consist of bibliographic descriptions of scientific articles (title, abstract, keywords). We develop a procedure for assessing the homogeneity of two samples of approximately equal size that use the same term-weighting method. The samples are compared through their centroids, whose dimension equals the size of the combined dictionary of the two datasets (where a term is absent from a sample, a zero is placed in the corresponding position of its centroid).
Representing the samples as "terminological portraits" (centroids) reduces the homogeneity check for multidimensional document vectors to the well-studied problem of analyzing two one-dimensional paired (related) samples, to which nonparametric tests can be applied. The sign test and the Wilcoxon signed-rank test were used in the study. The proposed procedure was tested on three collections of documents obtained from Russian- and English-language sources.
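The centroid construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `centroid` and `align_centroids`, the representation of a document as a `{term: weight}` dictionary, and the toy weights are all assumptions made for illustration, and each sample is assumed to be non-empty.

```python
from collections import defaultdict


def centroid(docs):
    """Mean term-weight vector of a non-empty sample.

    Each document is assumed to be a {term: weight} dict (hypothetical format).
    """
    totals = defaultdict(float)
    for doc in docs:
        for term, weight in doc.items():
            totals[term] += weight
    n = len(docs)
    return {term: total / n for term, total in totals.items()}


def align_centroids(sample_a, sample_b):
    """Return the combined dictionary and two centroids of equal length.

    A term missing from one sample gets a zero in that sample's centroid.
    """
    c_a, c_b = centroid(sample_a), centroid(sample_b)
    vocab = sorted(set(c_a) | set(c_b))
    vec_a = [c_a.get(term, 0.0) for term in vocab]
    vec_b = [c_b.get(term, 0.0) for term in vocab]
    return vocab, vec_a, vec_b


# Toy samples with made-up term weights (illustration only).
sample_a = [{"text": 0.5, "mining": 0.5}, {"text": 1.0}]
sample_b = [{"text": 0.4, "shift": 0.6}, {"mining": 0.2, "shift": 0.8}]

vocab, vec_a, vec_b = align_centroids(sample_a, sample_b)
for term, a, b in zip(vocab, vec_a, vec_b):
    print(f"{term:8s} {a:.3f} {b:.3f}")
```

The two resulting vectors have one component per term of the combined dictionary, so they can be treated as paired observations indexed by term.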
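As a rough illustration of how two centroids of equal length might be compared as paired samples, the sketch below implements both tests in plain Python. Several details are assumptions rather than the authors' procedure: zero differences are discarded, tied absolute differences receive average ranks, the Wilcoxon p-value uses a normal approximation with tie correction and no continuity correction, and the names `sign_test`, `wilcoxon_signed_rank`, and the toy vectors are hypothetical.

```python
import math


def sign_test(x, y):
    """Two-sided exact sign test for paired samples (zero differences dropped)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0  # no informative pairs
    k = sum(d > 0 for d in diffs)
    # Exact binomial tail under H0: P(sign is positive) = 1/2.
    tail = sum(math.comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def wilcoxon_signed_rank(x, y):
    """Return (W+, two-sided p) using a normal approximation (assumption)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        # Extend j over the group of equal absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return w_plus, p


# Toy centroids over a common dictionary (made-up values).
centroid_a = [0.25, 0.00, 0.75, 0.10, 0.30]
centroid_b = [0.10, 0.70, 0.20, 0.10, 0.05]

print("sign test p-value:", sign_test(centroid_a, centroid_b))
print("Wilcoxon W+, p-value:", wilcoxon_signed_rank(centroid_a, centroid_b))
```

A small p-value would suggest rejecting the hypothesis that the two samples are homogeneous. For real dictionaries, a library routine such as `scipy.stats.wilcoxon` could be used in place of the hand-written approximation.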
About the Authors
S. I. Safin
Russian Federation
Shahim I. Safin
14, Krasnokazarmennaya ul., Moscow, 111250
V. O. Tolcheev
Russian Federation
Vladimir O. Tolcheev
14, Krasnokazarmennaya ul., Moscow, 111250
For citations:
Safin S.I., Tolcheev V.O. Procedure for checking the uniformity of samples of text documents based on nonparametric criteria. Industrial laboratory. Diagnostics of materials. 2023;89(7):71-77. (In Russ.) https://doi.org/10.26896/1028-6861-2023-89-7-71-77