English Abstract
Dialogue systems, in both written and spoken form, play a prominent role in our daily lives and are used for various tasks such as intelligent personal assistance, customer service, and counseling. Because these systems interact directly with humans, user satisfaction can be increased by better understanding the user and responding accordingly. In humans, the ability to understand others' emotions and respond accordingly is known as empathy; as a result, adding empathy to dialogue systems has become an important research topic in recent years. Many studies have considered only text when adding empathy to dialogue systems. In spoken dialogue systems, however, conversations take place in speech, which carries rich information such as tone, loudness, intensity, pauses, voice tremors, and pitch. From this information, the user's stress level, feelings, emotions, gender, age, and more can be inferred. According to psychological research, speech can play an effective role in evoking empathy, and many studies have shown that combining audio and text information improves the performance of emotion recognition models. Despite this, very few studies have used speech to generate empathetic responses, and most of them extract only limited information from the audio in textual form and feed it to large language models along with the dialogue history, discarding much of the important and useful information present in speech. Accordingly, this research proposes a method for generating empathetic responses by combining representations of the speech and text modalities, aiming to exploit the information available in both. In the first step of this study, because no bi-modal empathetic dialogue dataset exists for training an end-to-end response generation system, a dataset suited to the needs of this research was prepared. This dataset, named BiMEmpDialogues, was obtained by applying a pipeline designed in this research to four multi-modal conversation datasets. Subsequently, a bi-modal empathetic response generation model was designed that uses a shifting gate to integrate the audio and text representations. The model is based on external knowledge and examples, and during training it uses three classifiers to detect the presence of empathy mechanisms in the generated responses, guiding the model towards generating ideal empathetic responses. According to the evaluations, the text-based version of the proposed model performs well compared to recent studies, and its generated responses achieve a higher empathy presence score. Moreover, the bi-modal model outperforms its text-only version on the ROUGE metrics (ROUGE-L by 1.31 percent, ROUGE-1 by 0.41 percent, and ROUGE-2 by 0.23 percent), on BLEU (by 0.25 percent), and on half of the quality dimensions of the FED metric.
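To make the fusion step concrete, the sketch below shows one common form of gated bimodal fusion consistent with the description above: a sigmoid gate decides, per dimension, how much of the text versus audio representation flows into the fused vector. This is a minimal illustrative sketch, not the thesis's exact shifting-gate architecture; the class name, layer sizes, and activation choices are all assumptions.

```python
import torch
import torch.nn as nn

class GatedBimodalFusion(nn.Module):
    """Illustrative gated fusion of text and audio representations.

    The gate g produces per-dimension mixing weights in (0, 1); the
    fused vector is a convex combination of the projected modalities.
    Hypothetical sketch only; the thesis's shifting gate may differ.
    """

    def __init__(self, text_dim: int, audio_dim: int, hidden_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, h_text: torch.Tensor, h_audio: torch.Tensor) -> torch.Tensor:
        t = torch.tanh(self.text_proj(h_text))      # project text into shared space
        a = torch.tanh(self.audio_proj(h_audio))    # project audio into shared space
        g = torch.sigmoid(self.gate(torch.cat([t, a], dim=-1)))  # per-dimension gate
        return g * t + (1.0 - g) * a                # gated mix of the two modalities

# Example (dimensions are placeholders):
# fusion = GatedBimodalFusion(text_dim=768, audio_dim=512, hidden_dim=768)
# fused = fusion(text_vec, audio_vec)
```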
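The classifier-guided training objective might likewise be realized as a generation loss augmented with auxiliary terms, one per empathy-mechanism classifier, rewarding responses the classifiers judge empathetic. The formulation below is a hypothetical sketch under that assumption; the function name, weighting scheme, and the exact mechanisms the three classifiers detect (the abstract does not specify them) are not taken from the thesis.

```python
import torch
import torch.nn.functional as F

def training_loss(lm_logits, target_ids, response_repr, mechanism_classifiers,
                  aux_weight=0.1):
    """Hypothetical combined objective: generation NLL plus empathy guidance.

    lm_logits:  (batch, seq_len, vocab) decoder logits
    target_ids: (batch, seq_len) reference token ids
    response_repr: pooled representation of the generated response
    mechanism_classifiers: three classifiers, one per empathy mechanism,
        each mapping response_repr to a presence logit
    """
    # Standard token-level generation loss.
    nll = F.cross_entropy(lm_logits.transpose(1, 2), target_ids)
    # Auxiliary terms: raise each classifier's predicted probability
    # that its empathy mechanism is present in the response.
    aux = 0.0
    for clf in mechanism_classifiers:
        p_present = torch.sigmoid(clf(response_repr))
        aux = aux - torch.log(p_present + 1e-8).mean()
    return nll + aux_weight * aux
```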