Response Generation in an Empathetic Spoken Dialogue System Using Acoustic and Textual Features

Degree level

Master's

Field of study

Computer Engineering - Software

Faculty

Computer Engineering

Defense date

1403/11/01 (Solar Hijri calendar)

Page count

102 pp.

Supervisor

Afsaneh Fatemi

Keywords (translated from Persian)

Empathetic spoken dialogue system, response generation, open-domain dialogue, multimodal machine learning

Abstract (translated from Persian)

Dialogue systems, in both text and speech form, play a prominent role in our daily lives and are used for many tasks, including intelligent personal assistants, customer service, and counseling. Because these systems interact directly with people, user satisfaction can be increased by strengthening the system's understanding of the user and its ability to respond appropriately. In humans, the ability to understand others' emotions and respond in a way that fits them is known as empathy. For this reason, adding empathy to dialogue systems has become an important topic in recent research. Many studies use only the text of the conversation to add empathy to dialogue systems. In a spoken dialogue system, however, the conversation takes place as speech, and speech carries a great deal of information, such as tone, loudness, intensity, pauses, voice tremor, and pitch. From this information one can infer the user's stress level, emotions, gender, age, and more. According to psychological research, speech can play an effective role in evoking empathy. In addition, many studies have shown that combining audio and textual information improves the performance of emotion recognition models. Despite this, very few studies have used speech to generate empathetic responses. Most of them extract only limited information from the audio in textual form and then pass it, together with the dialogue history, to large language models, an approach that ignores much of the other important and useful information present in speech. This research therefore proposes a method for generating empathetic responses that combines representations of the text and speech modalities in order to exploit the information available in both. In the first step, because no bi-modal (text and speech) empathetic dialogue dataset existed for training an end-to-end response generation system, a dataset suited to the needs of the research was prepared. This dataset, named BiMEmpDialogues, was obtained by applying a pipeline designed in this research to four multimodal dialogue datasets.
Next, a bi-modal empathetic response generation model was designed that uses a shifting gate to fuse the speech and text modalities. The model is based on external knowledge and exemplars, and during training it uses three classifiers that detect the presence of empathy communication mechanisms in the response, in order to guide the model toward generating the ideal empathetic response. According to the evaluations, the text-only version of the proposed model performs well compared with recent work, and the responses it generates have a higher empathy presence score. The proposed model also outperforms its text-only version on ROUGE (rougeL improved by 1.31 percent, rouge1 by 0.41 percent, and rouge2 by 0.23 percent), on BLEU (improved by 0.25 percent), and on half of the quality dimensions of the FED metric.

English keywords

Empathetic Spoken Dialogue System, Response Generation, Open-domain Dialogues, Multi-Modal Machine Learning

English title

Response generation in empathetic spoken dialogue system using acoustic and textual features

Department

Software Engineering

English abstract

Dialogue systems, in both chat and spoken formats, play a prominent role in our daily lives and are used in various tasks such as intelligent personal assistants, customer service, and counseling. Because these systems interact directly with humans, user satisfaction can be increased by improving their understanding of the user and responding accordingly. The ability to understand others' emotions and respond accordingly is known as empathy in humans. As a result, adding empathy to dialogue systems has become an important research topic in recent years. Many studies process only text to add empathy to dialogue systems. However, in spoken dialogue systems, conversations take place as speech, which contains a wealth of information such as tone, loudness, intensity, pauses, voice tremors, and pitch. The user's stress level, feelings, emotions, gender, age, and more can be inferred from this information. According to psychological research, speech can play an effective role in evoking empathy. Additionally, many studies have shown that combining audio and text information improves the performance of emotion recognition models. Despite this, very few studies have used speech to generate empathetic responses. Most of these studies extract only limited information from the audio in text form and then present it to large language models along with the dialogue history, which ignores much of the important and useful information that exists in speech. This research therefore proposes a method for generating empathetic responses that combines representations of the speech and text modalities in order to leverage the information available in both. In the first step of this study, because no bi-modal empathetic dialogue dataset was available to train an end-to-end response generation system, a dataset suited to the needs of the research was prepared. This dataset, named BiMEmpDialogues, was obtained by applying the pipeline designed in this research to four multi-modal conversation datasets.
Subsequently, a bi-modal empathetic response generation model was designed that uses a shifting gate to integrate audio and text representations. This model is based on external knowledge and examples, and during training it uses three classifiers that detect the presence of empathy mechanisms in the generated responses, in order to guide the model towards generating ideal empathetic responses. According to the evaluations, the text-only version of the proposed model performs well compared to recent studies, and its generated responses have a higher empathy presence score. Additionally, the proposed model performs better than its text-only version on the ROUGE metrics (rougeL improved by 1.31 percent, rouge1 by 0.41 percent, and rouge2 by 0.23 percent), on BLEU (improved by 0.25 percent), and on half of the quality dimensions of the FED metric.
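
The abstract names a shifting gate as the fusion mechanism but does not give its equations. As a rough illustration only, the sketch below shows one formulation of a gated shift that appears in the multimodal learning literature: a scalar gate decides how much of a projected audio vector is added to the text vector, and the shift is capped relative to the text vector's norm. The function name `shifting_gate`, the parameters `w_gate`, `b_gate`, `W_shift`, and `beta`, and the use of a scalar gate are all assumptions for illustration, not the thesis's actual model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def matvec(W, x):
    # W is a list of rows; each row has the same length as x.
    return [dot(row, x) for row in W]


def shifting_gate(text_vec, audio_vec, w_gate, b_gate, W_shift, beta=1.0, eps=1e-8):
    """Shift a text vector by a gated, norm-bounded audio displacement.

    Hypothetical formulation: all parameters would be learned in a real model.
    """
    # Scalar gate in (0, 1), computed from the concatenated text and audio vectors.
    gate = sigmoid(dot(w_gate, text_vec + audio_vec) + b_gate)
    # Project the audio vector into the text space and scale it by the gate.
    shift = [gate * s for s in matvec(W_shift, audio_vec)]
    # Cap the shift so its norm stays below beta times the text vector's norm.
    alpha = min(1.0, beta * norm(text_vec) / (norm(shift) + eps))
    return [t + alpha * s for t, s in zip(text_vec, shift)]


if __name__ == "__main__":
    text = [1.0, 2.0]                         # toy 2-d text representation
    audio = [0.5, -0.5, 1.0]                  # toy 3-d audio representation
    W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # hypothetical audio-to-text projection
    w = [0.1] * 5                             # hypothetical gate weights (2 + 3 inputs)
    print(shifting_gate(text, audio, w, 0.0, W, beta=0.5))
```

With an all-zero audio vector the output equals the text vector, so under this formulation the text representation is the default and audio only nudges it. The cap controlled by `beta` keeps a loud or noisy audio signal from overwhelming the text.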

Author

Shafriei, Zolfa (شفريي، زلفا)

Link to this record

https://lib.ui.ac.ir/dl/search/default.aspx?Term=24569&Field=0&DTC=3