The Effect of First Language on the Accuracy of Classifiers in Automatic Spelling Error Tagging in a Learner Corpus

Degree level

Master's

Field of study

Computational Linguistics

Faculty

Foreign Languages

Defense date

Bahman 1404 (January/February 2026)

Page count

175 pp.

Supervisor

Rezvan Motavallian Naeini (رضوان متوليان نائيني)

Advisor

Marjan Kaedi (مرجان كائدي)

Persian keywords

Learner corpus, spelling error tagging, classifiers, computer-aided error analysis

Persian abstract

Given the research gap in developing automatic spelling correction systems suited to the needs of learners of Persian, this study examines the effect of nationality and first language (L1) metadata on the accuracy of classifiers in the automatic tagging of spelling errors. For this study, 637 text documents, comprising dictation and composition texts, were collected from learners of various nationalities, with a focus on four groups: Urdu, Arabic, Armenian, and Russian speakers. The data were drawn from the Language Teaching Center of the University of Isfahan, Bent-al-Hoda School, Al-Mustafa University in Qom, Armen High School, and the Saadi Foundation corpus. Spelling errors were coded in the INCEpTION software according to a standard framework, and the errors were tagged by the author. The framework has three levels (error domain, error category, and error process) and covers four main categories (primary signs, secondary signs, punctuation marks, and form category) and four error processes (reduction, addition, transposition, and substitution). Continuing the study by Taghavi Sani (1402), this research analyzed the data from two perspectives: first, a statistical analysis of error frequencies by nationality and first language; second, a measurement of classifier prediction accuracy under two conditions, with and without linguistic and nationality features. The findings indicate that taking L1 and nationality information into account, particularly in linguistically heterogeneous corpora, raised the accuracy of classifiers in the automatic tagging of spelling errors in the learner corpus. The evaluation results showed that the overall accuracy of the models improved in most cases when the nationality variable was included. In the Random Forest model with the Index_Encode method, overall accuracy was 58 percent without nationality and rose to 62 percent when this variable was added. With the same method, the Support Vector Machine model reached 53 percent without nationality and 55 percent with it. With the OneHot_Encode method, the Random Forest model reached 63 percent without nationality; adding the nationality variable produced no noticeable change, and accuracy remained at about 61 percent.
In contrast, the Support Vector Machine model performed better with this method: its accuracy rose from 66 percent without nationality to 68 percent with it. Overall, the results indicate that the nationality variable played a positive role in improving performance for both encoding methods and both machine learning models, increasing accuracy in most cases, although the size of the effect depended on the model and the encoding method. Future studies are advised to collect larger and more diverse corpora so that the effect of L1 on spelling errors can be examined more precisely and the generalizability of the results strengthened.
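The accuracy figures quoted in the abstract can be laid out as a small worked calculation. The sketch below only restates the reported percentages and computes the change in percentage points when nationality is added; the dictionary layout and the function name `accuracy_changes` are illustrative choices, not part of the thesis, and the OneHot_Encode Random Forest value with nationality is reported only approximately.

```python
# Accuracies in percent, copied from the abstract.
# Each value is (without nationality, with nationality).
# The 61 for OneHot_Encode / Random Forest is stated as "about 61" in the source.
reported = {
    ("Index_Encode", "Random Forest"): (58, 62),
    ("Index_Encode", "SVM"): (53, 55),
    ("OneHot_Encode", "Random Forest"): (63, 61),
    ("OneHot_Encode", "SVM"): (66, 68),
}


def accuracy_changes(results):
    """Return the change in percentage points when nationality is added."""
    return {
        key: with_nat - without_nat
        for key, (without_nat, with_nat) in results.items()
    }


changes = accuracy_changes(reported)
improved = [key for key, delta in changes.items() if delta > 0]

for (encoding, model), delta in changes.items():
    print(f"{encoding:14s} {model:14s} {delta:+d} points")
print(f"{len(improved)} of {len(changes)} configurations improved")
```

On these figures, three of the four model and encoding combinations gain accuracy, which is consistent with the abstract's wording that accuracy improved "in most cases".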

Latin keywords

Learner Corpus, Spelling Error Tagging, Classifiers, Computer-Aided Error Analysis

Latin title

The Effect of First Language on the Accuracy of Classifiers in Automatic Spelling Error Tagging in a Learner Corpus

Department

Linguistics

Latin abstract

Given the existing research gap in developing automatic spelling correction systems tailored to the needs of Persian language learners, this study investigates the impact of metadata, namely nationality and first language (L1), on the accuracy of classifiers in the automatic tagging of spelling errors. In this research, 637 text documents, including dictation and composition texts from learners of various nationalities, were collected, with a focus on four groups: Urdu, Arabic, Armenian, and Russian speakers. The data were drawn from the Language Center of the University of Isfahan, Bent-Al-Hoda School, Al-Mustafa International University (Qom), Armenian High School, and the Saadi Foundation Corpus. Spelling errors were coded using the INCEpTION software based on a standard framework, and the error tagging was performed by the researcher. This framework encompasses three hierarchical levels: Error Domain, Error Category, and Error Process. It includes four main categories (Primary Signs, Secondary Signs, Punctuation Marks, and Form Category) and four error processes (Reduction, Addition, Transposition, and Substitution). Following the study by Taghavi Sani (1402/2023), this research analyzed the data from two perspectives: first, a statistical analysis of error frequencies based on nationality and L1; and second, an assessment of classifier prediction accuracy under two conditions, with and without linguistic and nationality features. The findings indicate that incorporating L1 and nationality information improves the accuracy of classifiers in the automatic tagging of spelling errors within the learner corpus, particularly in linguistically heterogeneous corpora. Evaluation results demonstrated that the overall accuracy of the models improved in most cases when the nationality variable was included.
In the Random Forest (RF) model using the Index_Encode method, the overall accuracy was 58% without nationality and increased to 62% upon its inclusion. For the same method, the Support Vector Machine (SVM) model achieved an accuracy of 53% without nationality and 55% with nationality. In the OneHot_Encode method, the RF model achieved an accuracy of 63% without nationality, with no noticeable change observed upon adding the nationality variable (accuracy remained around 61%). Conversely, the SVM model showed better performance in this method, with accuracy increasing from 66% (without nationality) to 68% (with nationality). Overall, the results suggest that the nationality variable plays a positive role in improving the performance of both machine learning models across both encoding methods, leading to increased accuracy in most cases. However, the extent of this impact depended on the specific model and encoding technique used. Future research is recommended to collect larger and more diverse corpora to investigate the effect of L1 on spelling errors more precisely and to enhance the generalizability of the findings.
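The abstract contrasts two ways of encoding categorical features, Index_Encode and OneHot_Encode, each evaluated with and without nationality. The record does not show how these were implemented, so the following is a minimal sketch under stated assumptions: that Index_Encode maps each category to an integer code and OneHot_Encode maps it to a binary indicator vector. The function names, the `base_rows` feature values, and the helper `build_features` are hypothetical and chosen only for illustration.

```python
def index_encode(values):
    """Assumed meaning of Index_Encode: one integer code per category."""
    vocab = {v: i for i, v in enumerate(sorted(set(values)))}
    return [[vocab[v]] for v in values], vocab


def one_hot_encode(values):
    """Assumed meaning of OneHot_Encode: one binary column per category."""
    vocab = {v: i for i, v in enumerate(sorted(set(values)))}
    rows = []
    for v in values:
        row = [0] * len(vocab)
        row[vocab[v]] = 1
        rows.append(row)
    return rows, vocab


def build_features(base_rows, nationalities=None, encoder=index_encode):
    """Append the encoded nationality column(s) to each row, if provided."""
    if nationalities is None:
        return [list(row) for row in base_rows]
    encoded, _ = encoder(nationalities)
    return [list(row) + extra for row, extra in zip(base_rows, encoded)]


# Hypothetical, already-numeric error features for four learner tokens.
base_rows = [[3, 0], [1, 2], [0, 1], [2, 2]]
# The four learner groups named in the abstract.
groups = ["Urdu", "Arabic", "Armenian", "Russian"]

without_nat = build_features(base_rows)
with_nat_index = build_features(base_rows, groups, index_encode)
with_nat_onehot = build_features(base_rows, groups, one_hot_encode)
```

The "without nationality" condition simply omits the extra column or columns. Integer codes impose an arbitrary ordering on the groups while indicator vectors do not, which is one possible reason a model's accuracy could differ between the two encodings.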

Number of chapters

Table of contents (PDF)

159047

Author

Ghasemi, Zohreh (قاسمي، زهره)

Link to this record

https://lib.ui.ac.ir/dl/search/default.aspx?Term=25848&Field=0&DTC=3