English Abstract
Document classification holds a significant position in information organization. The objective of text classification is to predict whether a given text document belongs to a specific predefined class. Text classification relies primarily on machine learning methods, which require large amounts of labeled data for effective training; however, producing substantial amounts of labeled textual data is time-consuming and costly in real-world applications. Furthermore, semantics plays a crucial role in text classification.

The present study proposes and evaluates a model for the automatic classification of scientific articles in the field of higher education based on semantic relationships. The research is applied in nature and employs a mixed-methods approach, drawing on techniques from natural language processing, topic modeling, and machine learning. The statistical population comprises 4,233 scientific articles (titles, abstracts, and keywords) collected from journals in the field of higher education and from Persian databases such as Magiran, Jihad University, and ScienceNet.

Documents are classified automatically using semi-supervised learning, specifically co-training with a small amount of labeled data. The labeled data is divided into several views using LDA topic modeling and semantic relationships extracted through a combined approach, and a base classifier is trained on each view. Data augmentation techniques are also employed as an alternative way of addressing the limited amount of labeled data. The chosen method for extracting semantic relationships uses external knowledge sources such as Wikidata and WordNet to enrich the semantic vectors produced by the skip-gram model.

This research contributes a new framework for text classification with small training datasets through co-training based on LDA topic modeling, semantic relationships, and a convolutional neural network (CNN) with combined features. The performance of the proposed method is compared with baseline methods, including Support Vector Machine, Naïve Bayes, Decision Tree, K-Nearest Neighbors, CNN, and deep neural networks, in both supervised and semi-supervised settings on the collected dataset. Classification quality is measured with three metrics: using 100% of the labeled training documents, the proposed model achieves an accuracy of 0.912, a precision of 0.854, and an F1 score of 0.846.

The results of implementing the proposed model show that co-training based on topic modeling and semantic relationships outperforms the other text classification methods. This improvement is particularly pronounced when the training datasets are very large, and the results also indicate that the proposed method remains effective when training data is limited.
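To illustrate the co-training scheme described above, the following is a minimal sketch assuming scikit-learn. The two views here are (1) LDA topic proportions and (2) TF-IDF vectors, and the base classifiers are logistic regressions; the proposed model instead combines LDA-derived views with semantically enriched features and a CNN, so the function name, the `per_round` parameter, and the feature and classifier choices below are illustrative assumptions rather than the thesis pipeline.

```python
# Minimal co-training sketch (illustrative only, assuming scikit-learn).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def cotrain(texts_labeled, labels, texts_unlabeled,
            n_topics=20, rounds=5, per_round=50):
    X_l, y_l, X_u = list(texts_labeled), list(labels), list(texts_unlabeled)

    for _ in range(rounds):
        # View 1: LDA topic distributions over raw term counts.
        counts = CountVectorizer(max_features=5000)
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
        view1_l = lda.fit_transform(counts.fit_transform(X_l))
        # View 2: plain TF-IDF weights.
        tfidf = TfidfVectorizer(max_features=5000)
        view2_l = tfidf.fit_transform(X_l)

        clf1 = LogisticRegression(max_iter=1000).fit(view1_l, y_l)
        clf2 = LogisticRegression(max_iter=1000).fit(view2_l, y_l)
        if not X_u:
            break

        view1_u = lda.transform(counts.transform(X_u))
        view2_u = tfidf.transform(X_u)

        # Each view's classifier labels the unlabeled pool; its most confident
        # predictions are pseudo-labeled and moved into the shared labeled set.
        taken = set()
        for clf, view_u in ((clf1, view1_u), (clf2, view2_u)):
            proba = clf.predict_proba(view_u)
            for i in np.argsort(proba.max(axis=1))[-per_round:]:
                if i not in taken:
                    taken.add(i)
                    X_l.append(X_u[i])
                    y_l.append(clf.classes_[proba[i].argmax()])
        X_u = [x for i, x in enumerate(X_u) if i not in taken]

    # At prediction time the two view classifiers can be combined,
    # e.g. by averaging their predicted probabilities.
    return clf1, clf2, counts, lda, tfidf
```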
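The enrichment of skip-gram vectors with external lexical knowledge can likewise be sketched as below, assuming gensim and NLTK's English WordNet. The abstract describes a combined approach drawing on both Wikidata and WordNet over a Persian corpus, so this single-source, English-only blend with an assumed mixing weight `alpha` is only a rough analogue of that step.

```python
# Sketch: blending skip-gram vectors with synonym centroids from WordNet.
# Requires: pip install gensim nltk; nltk.download("wordnet")
import numpy as np
from gensim.models import Word2Vec
from nltk.corpus import wordnet as wn


def enrich_with_wordnet(sentences, alpha=0.5, dim=100):
    # Train a skip-gram model (sg=1) on the tokenized corpus.
    model = Word2Vec(sentences, vector_size=dim, sg=1, min_count=2, epochs=10)
    kv = model.wv
    enriched = {}
    for word in kv.index_to_key:
        # Collect WordNet synonyms of the word that are also in the vocabulary.
        synonyms = {
            lemma.name().lower()
            for synset in wn.synsets(word)
            for lemma in synset.lemmas()
        }
        neighbor_vecs = [kv[s] for s in synonyms if s != word and s in kv]
        if neighbor_vecs:
            # Blend the word's own vector with the centroid of its synonyms.
            enriched[word] = ((1 - alpha) * kv[word]
                              + alpha * np.mean(neighbor_vecs, axis=0))
        else:
            enriched[word] = kv[word]
    return enriched
```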