English Abstract
Document classification holds a significant position in information organization. The objective of text classification is to predict whether a given text document belongs to a specific predefined class. Text classification relies primarily on machine learning methods, which require large amounts of labeled data for effective training; however, producing substantial amounts of labeled textual data is time-consuming and costly in real-world applications. Furthermore, semantics plays a crucial role in text classification.

The present study proposes and evaluates a model for the automatic classification of scientific articles in the field of higher education based on semantic relationships. The research is applied in nature and employs a mixed-methods approach, drawing on techniques from natural language processing, topic modeling, and machine learning. The statistical population comprises 4,233 scientific articles (titles, abstracts, and keywords) collected from journals in the field of higher education and from Persian databases such as Magiran, Jihad University, and ScienceNet.

Documents are classified automatically using semi-supervised learning, specifically co-training with a small amount of labeled data. The labeled data is divided into several views using LDA topic modeling and semantic relationships extracted through a combined approach, and a base classifier is trained on each view. Data augmentation techniques are also employed as an alternative way of addressing the limited amount of labeled data. The chosen method for extracting semantic relationships uses external knowledge sources such as Wikidata and WordNet to enrich the semantic vectors produced by the skip-gram model.

This research contributes a new framework for text classification with small training datasets through co-training based on LDA topic modeling, semantic relationships, and a convolutional neural network (CNN) with combined features. The performance of the proposed method is compared with baseline methods, including Support Vector Machine, Naïve Bayes, Decision Tree, K-Nearest Neighbors, CNN, and deep neural networks, in both supervised and semi-supervised settings on the collected dataset. Classification quality is measured with three metrics: using 100% of the labeled training documents, the proposed model achieves an accuracy of 0.912, a precision of 0.854, and an F1 score of 0.846.

The results of implementing the proposed model show that co-training based on topic modeling and semantic relationships outperforms the other text classification methods. This improvement is particularly pronounced when the training datasets are very large, and the results also indicate that the proposed method remains effective when training data is limited.
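To illustrate the co-training scheme described above, the following is a minimal sketch assuming scikit-learn. The two views here are (1) LDA topic proportions and (2) TF-IDF vectors, and the base classifiers are logistic regressions; the proposed model instead combines LDA-derived views with semantically enriched features and a CNN, so the function name, the `per_round` parameter, and the feature and classifier choices below are illustrative assumptions rather than the thesis pipeline.

```python
# Minimal co-training sketch (illustrative only, assuming scikit-learn).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def cotrain(texts_labeled, labels, texts_unlabeled,
            n_topics=20, rounds=5, per_round=50):
    X_l, y_l, X_u = list(texts_labeled), list(labels), list(texts_unlabeled)

    for _ in range(rounds):
        # View 1: LDA topic distributions over raw term counts.
        counts = CountVectorizer(max_features=5000)
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
        view1_l = lda.fit_transform(counts.fit_transform(X_l))
        # View 2: plain TF-IDF weights.
        tfidf = TfidfVectorizer(max_features=5000)
        view2_l = tfidf.fit_transform(X_l)

        clf1 = LogisticRegression(max_iter=1000).fit(view1_l, y_l)
        clf2 = LogisticRegression(max_iter=1000).fit(view2_l, y_l)
        if not X_u:
            break

        view1_u = lda.transform(counts.transform(X_u))
        view2_u = tfidf.transform(X_u)

        # Each view's classifier labels the unlabeled pool; its most confident
        # predictions are pseudo-labeled and moved into the shared labeled set.
        taken = set()
        for clf, view_u in ((clf1, view1_u), (clf2, view2_u)):
            proba = clf.predict_proba(view_u)
            for i in np.argsort(proba.max(axis=1))[-per_round:]:
                if i not in taken:
                    taken.add(i)
                    X_l.append(X_u[i])
                    y_l.append(clf.classes_[proba[i].argmax()])
        X_u = [x for i, x in enumerate(X_u) if i not in taken]

    # At prediction time the two view classifiers can be combined,
    # e.g. by averaging their predicted probabilities.
    return clf1, clf2, counts, lda, tfidf
```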
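The enrichment of skip-gram vectors with external lexical knowledge can likewise be sketched as below, assuming gensim and NLTK's English WordNet. The abstract describes a combined approach drawing on both Wikidata and WordNet over a Persian corpus, so this single-source, English-only blend with an assumed mixing weight `alpha` is only a rough analogue of that step.

```python
# Sketch: blending skip-gram vectors with synonym centroids from WordNet.
# Requires: pip install gensim nltk; nltk.download("wordnet")
import numpy as np
from gensim.models import Word2Vec
from nltk.corpus import wordnet as wn


def enrich_with_wordnet(sentences, alpha=0.5, dim=100):
    # Train a skip-gram model (sg=1) on the tokenized corpus.
    model = Word2Vec(sentences, vector_size=dim, sg=1, min_count=2, epochs=10)
    kv = model.wv
    enriched = {}
    for word in kv.index_to_key:
        # Collect WordNet synonyms of the word that are also in the vocabulary.
        synonyms = {
            lemma.name().lower()
            for synset in wn.synsets(word)
            for lemma in synset.lemmas()
        }
        neighbor_vecs = [kv[s] for s in synonyms if s != word and s in kv]
        if neighbor_vecs:
            # Blend the word's own vector with the centroid of its synonyms.
            enriched[word] = ((1 - alpha) * kv[word]
                              + alpha * np.mean(neighbor_vecs, axis=0))
        else:
            enriched[word] = kv[word]
    return enriched
```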