Automatic generation of a Persian grammatical error correction dataset

Degree level

Master's (M.Sc.)

Field of study

Computer Engineering - Software

Faculty

Computer Engineering

Defense date

1405/02/7

Page count

114 pp.

Supervisor

Reza Ramezani

Persian keywords

grammatical error correction, data augmentation, natural language processing, deep learning, large language models

Persian abstract

In the digital age, the ever-growing production of electronic text and the grammatical errors it contains pose serious challenges for natural language processing and writing quality. Because the performance of grammatical error correction models depends directly on the volume and quality of their training data, the lack of standard, annotated datasets for Persian is a fundamental obstacle. This research aims to design and implement a framework for automatically generating sentences that contain grammatical errors, so that data augmentation can improve the accuracy of Persian error correction models. The proposed approach is a two-stage process. In the first stage, a dataset was collected from reputable sources such as Wikipedia and the Hamshahri newspaper, filtered in several passes, and then validated by large language models. Correct sentences were then converted into sentences with synthetic errors by combining rule-based methods, pure generation with a large language model, and hybrid approaches. In the second stage, to reduce dependence on costly models and increase scalability, the mT5-small model was fine-tuned on the generated data to learn the task of converting a "correct sentence into an incorrect sentence". Large language models (such as Llama-8B and Qwen-30B) are remarkably capable at free-text generation, but they perform worse on tasks that require precise, limited, and targeted changes to existing text, such as producing a specific grammatical error in a given sentence. Fine-tuning a smaller model on high-quality examples is a more effective solution. Accordingly, mT5-small was chosen as the error generation model for large-scale use. The evaluation showed that the fine-tuned mT5-small, with an acceptance rate of 60.5 percent and the highest string similarity, outperforms prompt-based large language models at error generation. The final dataset contains 19,877 sentence pairs, which were used to train error correction models. The GECToR error correction model trained on this data achieved an F0.5 score of 96.76 percent. Cross-domain evaluations also showed that the model trained on this study's data performs well (96.86 percent).
These results attest to the effectiveness of the proposed method in generating realistic data and addressing the data scarcity problem for Persian.
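
The abstract does not list the thesis's actual rules, so the following is only a minimal sketch of what a rule-based "correct sentence to erroneous sentence" step could look like. The three token-level operations (swap, delete, duplicate) and the names `inject_error` and `make_pairs` are illustrative assumptions, not the author's method.

```python
import random

def inject_error(sentence: str, rng: random.Random) -> str:
    """Apply one random token-level corruption (hypothetical rule set)."""
    tokens = sentence.split()
    if len(tokens) < 2:
        return " ".join(tokens)  # too short to corrupt meaningfully
    op = rng.choice(["swap", "delete", "duplicate"])
    i = rng.randrange(len(tokens) - 1)
    if op == "swap":
        # Exchange two adjacent tokens (word-order error).
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    elif op == "delete":
        # Drop one token (missing-word error).
        del tokens[i]
    else:
        # Repeat one token (redundant-word error).
        tokens.insert(i, tokens[i])
    return " ".join(tokens)

def make_pairs(sentences, seed=0):
    """Build (erroneous, correct) pairs, skipping sentences left unchanged."""
    rng = random.Random(seed)
    pairs = []
    for s in sentences:
        clean = " ".join(s.split())  # normalize whitespace before comparing
        noisy = inject_error(clean, rng)
        if noisy != clean:
            pairs.append((noisy, clean))
    return pairs
```

Pairs produced this way could serve as training input for a sequence-to-sequence error generator; a real Persian rule set would target language-specific phenomena rather than generic token edits.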

English keywords

deep learning, grammatical error correction

English title

Automatic generation of a Persian grammatical error correction dataset

Department

Software Engineering

English abstract

In the digital age, the increasing production of electronic texts and the presence of grammatical errors within them have created serious challenges for natural language processing and writing quality. Given that the performance of intelligent grammatical error correction models is directly dependent on the volume and quality of training data, the scarcity of standard and annotated datasets in Persian constitutes a fundamental obstacle. This research aims to design and implement a framework for the automatic generation of grammatically erroneous sentences, thereby improving the accuracy of error correction models in Persian through data augmentation. The proposed approach in this study follows a two-stage process. In the first stage, a dataset was collected from reliable sources such as Wikipedia and the Hamshahri newspaper. After multiple cleaning steps, the data was validated using Large Language Models (LLMs). Subsequently, by combining rule-based methods, pure generation via LLMs, and hybrid approaches, correct sentences were transformed into artificially erroneous ones. In the second stage, to reduce dependency on costly models and enhance scalability, the mT5-small model was fine-tuned specifically on the generated data to learn the task of converting "correct sentences to incorrect sentences." While Large Language Models (such as Llama-8B and Qwen-30B) possess significant capabilities in free-text generation, they demonstrate weaker performance in tasks requiring precise, limited, and targeted modifications to existing text, such as generating specific grammatical errors in a given sentence. Fine-tuning a smaller model on high-quality samples proves to be a more effective strategy. Consequently, the mT5-small model was selected as the error-generation model for large-scale application.
Evaluation results indicated that the fine-tuned mT5-small model outperformed prompt-based Large Language Models in error generation, achieving an acceptance rate of 60.5% and the highest string similarity. The final dataset comprised 19,877 sentence pairs, which were utilized to train the error correction models. The GECToR error correction model, trained on this data, achieved an F0.5 score of 96.76%. Furthermore, cross-domain evaluations demonstrated that the model trained on this research's data also yielded satisfactory performance (96.86%). These achievements confirm the effectiveness of the proposed method in generating realistic data and addressing the challenge of data scarcity in the Persian language.
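
For readers unfamiliar with the F0.5 score reported above, here is a small sketch of the standard F-beta formula, which with beta = 0.5 weights precision more heavily than recall. The function name `f_beta` and the edit counts in the usage note are hypothetical and are not taken from the thesis.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta score from counts of correct, spurious, and missed edits."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was matched
    b2 = beta * beta
    # beta < 1 makes precision count for more than recall.
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With illustrative counts of 8 correct edits, 2 spurious edits, and 8 missed edits, precision is 0.8 and recall is 0.5, so `f_beta(8, 2, 8)` is about 0.714, higher than the F1 of about 0.615 for the same counts.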

Number of chapters

Table of contents (PDF)

160913

Author

Rabiei, Kowsar

Link to this record

https://lib.ui.ac.ir/dl/search/default.aspx?Term=25916&Field=0&DTC=3