Designing an English/Persian Parallel Corpus in Linguistics and Automatically Extracting a Specialized Dictionary from It

Degree level

Master's

Field of study

Computational Linguistics

Faculty

Foreign Languages

Defense date

1404/11/19

Page count

186 pp.

Supervisor

رضوان متوليان

Persian keywords (translated)

web crawling, English-Persian linguistics corpus, parallel corpus, automatic term extraction, alignment or parallelization

Persian abstract (translated)

As knowledge globalizes and scientific communication grows steadily in importance, efficient systems for building domain-specific parallel corpora and extracting terms from them have become important. In this study, we used a web crawler to collect the Persian and English abstracts of articles in linguistics journals and the Persian-English abstracts of linguistics theses held in Irandoc. By segmenting and aligning them at the document, sentence, phrase, and word levels, we developed a bilingual English-Persian parallel corpus in the field of linguistics. In addition to building this specialized corpus, we used a suitable language model to automatically extract specialized linguistics terms from the resulting corpus, in order to produce a bilingual specialized dictionary for the field. According to the evaluations carried out, the segmentation and alignment quality of the corpus is reported as 92% at the sentence level and 96% at the phrase and word levels. In a comparison between the English version of the available documents and the version translated using the extracted dictionary, the similarity of the bilingual texts in the test set rose from 75% to more than 94%.

English keywords

Web Scraping, English-Persian Linguistic Corpus, Parallel Corpus, Automated Term Extraction, Alignment

English title

Designing an English/Persian Parallel Corpus in Linguistics and Automated Extraction of a Specialized Dictionary from It

Department

Linguistics

English abstract

The globalization of knowledge and the steadily growing importance of scholarly communication call for efficient systems that build domain-specific parallel corpora and automatically extract terms from them. In this research, we used a web scraper to collect the Persian and English abstracts of papers in linguistics journals, as well as the Persian-English abstracts of linguistics theses available on Irandoc. The collected data were then segmented and aligned at the document, sentence, phrase, and word levels to build a Persian-English parallel corpus in the domain of linguistics. We also used an appropriate language model to automatically extract linguistics terms from the resulting parallel corpus and compile a specialized Persian-English dictionary from it. According to our evaluations, segmentation and alignment quality was 92% at the sentence level and 96% at the phrase and word levels. In addition, in a comparison between the existing English version of the documents and the version translated using the extracted dictionary, the similarity between the bilingual texts in the test data rose from 75% to over 94%.

Number of chapters

Table of contents (PDF)

155885

Author

نمازي، ياسمن

Link to this record

https://lib.ui.ac.ir/dl/search/default.aspx?Term=25619&Field=0&DTC=3