The challenges of Persian Spoken Dialogue Systems

Mar 2021

Several differences between English and Persian make the latter more difficult for speech understanding.

This article briefly explains the challenges facing a Persian Spoken Dialogue System (SDS). Implementing an SDS has many obstacles in any language, but only those that are specific to Persian and are not a concern for English are mentioned here. Wherever the pronunciation of Persian words is needed, the IPA/Persian format is used between slashes (/ /).

1.1. Differences in spoken and written forms

The most important hurdle for Persian SDSs seems to be the large number of differences between the spoken and written forms of the language. Some of the most obvious differences are listed below. For now there is no rigorous linguistic reasoning behind the divisions; some subsections might belong to the same linguistic category, and the numbering is only for later reference and discussion.

- 1.1.1. Simplification of vowels

In conversational Persian, long vowels are often simplified to make them easier to pronounce. For instance, words ending in /ɒːn/ or /ɒːm/ are simplified to /uːn/ and /uːm/ respectively. Some examples are given in table 1.

Persian written	IPA written	IPA spoken
خبرخوان	xabarxɒːn	xabarxuːn
مسلمان	mosalmɒːn	mosalmuːn
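The pattern in table 1 can be sketched as a simple suffix rule. The snippet below is a minimal illustration, not a full normalizer: the suffix pairs come from the examples above, while the function name `formal_candidates` and the toy lexicon are assumptions made for this sketch. The lexicon check is there because not every word ending in /uːn/ is necessarily a simplified form.

```python
# Spoken-to-formal suffix pairs taken from the examples above (IPA).
SPOKEN_TO_FORMAL_SUFFIXES = [("uːn", "ɒːn"), ("uːm", "ɒːm")]


def formal_candidates(spoken_ipa, lexicon):
    """Return formal-form candidates for a spoken IPA word.

    `lexicon` is a hypothetical set of formal IPA word forms; a real system
    would consult a pronunciation dictionary instead.
    """
    candidates = []
    # The spoken form may already be a valid formal word.
    if spoken_ipa in lexicon:
        candidates.append(spoken_ipa)
    # Otherwise try to restore the long vowel and keep only known words.
    for spoken_suffix, formal_suffix in SPOKEN_TO_FORMAL_SUFFIXES:
        if spoken_ipa.endswith(spoken_suffix):
            restored = spoken_ipa[: -len(spoken_suffix)] + formal_suffix
            if restored in lexicon:
                candidates.append(restored)
    return candidates


toy_lexicon = {"xabarxɒːn", "mosalmɒːn"}
print(formal_candidates("xabarxuːn", toy_lexicon))  # ['xabarxɒːn']
```

Returning a list rather than a single word leaves the final choice to a later, context-aware component.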

There is also a very frequent object marker, /rɒː/, which in spoken form is normally pronounced /ro/ or, even more simply, as /o/ attached to the end of the object. Some examples are presented in table 2.

Persian written	IPA written	IPA spoken
سیب را خوردم	sib rɒː xordam	sibo xordam
حرف خود را زد	harfe xod rɒː zad	harfe xodesho zad
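One way to undo the attached object marker is to strip a final /ro/ or /o/ only when what remains is a known word. This is a rough sketch under that assumption; `expand_object_marker` and the `known_words` set are hypothetical names, and a real system would need morphological analysis rather than a plain suffix test.

```python
def expand_object_marker(token, known_words):
    """Rewrite a spoken token such as 'sibo' as 'sib rɒː'.

    The marker is only split off when the remaining stem is in
    `known_words` (a hypothetical lexicon), so words that merely
    happen to end in 'o' are left untouched.
    """
    for spoken_marker in ("ro", "o"):  # try the longer variant first
        if token.endswith(spoken_marker):
            stem = token[: -len(spoken_marker)]
            if stem in known_words:
                return f"{stem} rɒː"
    return token


known_words = {"sib"}
sentence = "sibo xordam"
print(" ".join(expand_object_marker(t, known_words) for t in sentence.split()))
# sib rɒː xordam
```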

- 1.1.2. Removal of some prepositions

In everyday Persian conversation, some prepositions are usually dropped.

Persian written	IPA written	Persian spoken	IPA spoken
در را زدم	dar rɒː zadam	در زدم	dar zadam
هر کدام از ما	har kodɒːm az mɒː	هر کدوم ما	har koduːme mɒː

- 1.1.3. Replacing formal words with more informal ones

This also happens in English, so it falls outside the scope of this article and can be skipped.

Formal, normally written	Less formal, more used in speech
نوشیدم	خوردم
افزود	اضافه کرد
پیشین	قبلی

- 1.1.4. Change in verb tense

To form the future tense, a specific form of the verb "خواستن" is used, for example "خواهم خورد", which means "I will eat". However, since this is relatively hard to pronounce, Persian speakers usually change it to "میخورم", which is in fact another tense: the present continuous. In other words, "میخورم" is used in conversation for both the future and the present continuous. Native speakers disambiguate the two using other clues in the context. The NLU unit must therefore be able to tell which tense is intended. Similarly, the NLG unit must include clues in the generated sentence that make the intended meaning clear to the listener.
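As a very rough illustration of using context clues, the sketch below looks for time expressions around the verb. The cue lists are small, hand-picked assumptions ("فردا" = tomorrow, "بعدا" = later, "الان" = now, "دارم" = the progressive auxiliary), and whitespace tokenization is assumed; a real NLU component would use a much richer model.

```python
# Illustrative cue words only; not an exhaustive or authoritative list.
FUTURE_CUES = {"فردا", "بعدا"}
PRESENT_CUES = {"الان", "دارم"}


def guess_tense(sentence):
    """Guess the intended tense of a colloquial sentence from time cues.

    Returns "future", "present", or "ambiguous" when no cue (or
    conflicting cues) are found.
    """
    tokens = set(sentence.split())
    has_future = bool(tokens & FUTURE_CUES)
    has_present = bool(tokens & PRESENT_CUES)
    if has_future and not has_present:
        return "future"
    if has_present and not has_future:
        return "present"
    return "ambiguous"


print(guess_tense("فردا میخورم"))  # future
print(guess_tense("میخورم"))       # ambiguous
```

Returning "ambiguous" explicitly lets the dialogue manager fall back to a clarification question instead of guessing.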

- 1.1.5. Various spoken forms with the same written form

Another problem is that in written Persian most vowels are omitted most of the time. While in English vowels are represented by separate, independent letters, in Persian they are pronounced but normally omitted in writing. This omission causes some words that differ in speech to take the same written form. For example, for "wrestling" and "ship" Persian speakers say 'koshti' and 'keshti' respectively. Their written forms are "کُشتی" and "کِشتی" respectively; however, in most knowledge bases (KBs) both words are written identically as "کشتی", with the signs "-ُ" (= o) and "-ِ" (= e) omitted. This creates an ambiguity, and the NLU must be robust enough to resolve such cases and recover the intended meaning. Not only the NLU but also the SLG must be aware of these sources of ambiguity in order to avoid them and produce clear sentences.
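The collapse of the two words can be reproduced directly. The sketch below removes Unicode combining marks (general category `Mn`), which include the Persian short-vowel signs, and shows that the two vocalized spellings become identical. The helper name `strip_diacritics` is an assumption for this example; the words are built from escape codes so that the vowel signs are unambiguous.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, which include the Persian short-vowel signs."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# U+064F is the "o" sign, U+0650 is the "e" sign.
wrestling = "\u06A9\u064F\u0634\u062A\u06CC"  # koshti, with the "o" sign
ship = "\u06A9\u0650\u0634\u062A\u06CC"       # keshti, with the "e" sign

print(wrestling == ship)                                      # False
print(strip_diacritics(wrestling) == strip_diacritics(ship))  # True
```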

- 1.1.6. Letters different in writing but the same in speech

Some distinct letters are pronounced the same, which is another source of ambiguity. For instance, "قالب" (form/shape) and "غالب" (majority) are both pronounced 'qaleb'. Another case is "حیات" (life) and "حیاط" (yard), which are both pronounced 'hajat' colloquially.
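A component that works from sound can make this ambiguity explicit by mapping letters that share a pronunciation to one representative, so that homophones receive the same key. The sketch below is an assumption-laden illustration: the letter groups are a commonly cited partial list, not a complete phonological model, and `sound_key` is a hypothetical helper name.

```python
# Groups of letters that are pronounced alike in Persian (partial list).
HOMOPHONE_GROUPS = ["قغ", "تط", "سصث", "زذضظ", "هح"]

# Map every letter in a group to the first letter of that group.
_CANONICAL = {ord(ch): group[0] for group in HOMOPHONE_GROUPS for ch in group}


def sound_key(word):
    """Return a spelling-independent key; homophones share the same key."""
    return word.translate(_CANONICAL)


print(sound_key("قالب") == sound_key("غالب"))  # True
print(sound_key("حیات") == sound_key("حیاط"))  # True
```

Words with equal keys can then be handed to a context-based disambiguator as a candidate set.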

- 1.1.7. Making question form just by intonation

Unlike English, in which interrogative sentences are clearly distinguished from declarative ones by inverting the subject and verb, in Persian the same sequence of words can be used for both types; the only difference is that the interrogative form is pronounced with a different intonation. For example, the following two sentences are written with the same words:

Declarative	من سیبو خوردم	I ate the apple.
Interrogative	من سیبو خوردم؟	Did I eat the apple?

As you can see, the declarative and interrogative sentences are written with the same sequence of words, but the question is pronounced with a rising intonation. This makes the task harder for both ASR, which has to distinguish the two from intonation alone, and TTS, which has to produce that distinction.
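Since the words carry no cue, the distinction has to come from the pitch contour. The following is a minimal sketch of one possible heuristic, assuming a per-frame F0 track is already available from some pitch tracker and that unvoiced frames are encoded as 0. The function names, the 30% tail window, and the slope threshold are all illustrative assumptions, not tuned values.

```python
def final_pitch_slope(f0_hz, tail_fraction=0.3):
    """Least-squares slope (Hz per frame) over the end of a pitch contour."""
    voiced = [f for f in f0_hz if f > 0]  # drop unvoiced frames
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    n_tail = max(2, int(len(voiced) * tail_fraction))
    tail = voiced[-n_tail:]
    mean_x = (len(tail) - 1) / 2
    mean_y = sum(tail) / len(tail)
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(tail))
    denominator = sum((i - mean_x) ** 2 for i in range(len(tail)))
    return numerator / denominator


def sounds_like_question(f0_hz, min_slope=1.0):
    """Flag an utterance as a question when its final pitch rises."""
    return final_pitch_slope(f0_hz) > min_slope


rising = [120, 121, 119, 120, 122, 125, 135, 150, 170]
falling = [150, 148, 145, 140, 135, 130, 125, 120, 110]
print(sounds_like_question(rising), sounds_like_question(falling))  # True False
```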

- 1.1.8. The free structure of sentences

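In contrast to English, in which the positions of the parts of speech within a sentence are largely fixed, in colloquial Persian words can appear in almost any order. Words that come earlier carry more emphasis. For example, all of the following orderings of the sentence "I definitely closed the window at the office last night" are possible: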

من پنجره رو دیشب تو شرکت حتما بستم

پنجره رو من دیشب تو شرکت حتما بستم

دیشب من پنجره رو تو شرکت حتما بستم

بستم من پنجره رو دیشب تو شرکت حتما

حتما من پنجره رو دیشب تو شرکت بستم

...
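One consequence for NLU is that any analysis relying on word position will treat these variants as different sentences. The sketch below, which assumes simple whitespace tokenization, shows that the five orderings listed above are distinct strings but share exactly the same bag of words. An order-insensitive representation like this is only one possible starting point, and it deliberately discards the emphasis that word order conveys.

```python
from collections import Counter

variants = [
    "من پنجره رو دیشب تو شرکت حتما بستم",
    "پنجره رو من دیشب تو شرکت حتما بستم",
    "دیشب من پنجره رو تو شرکت حتما بستم",
    "بستم من پنجره رو دیشب تو شرکت حتما",
    "حتما من پنجره رو دیشب تو شرکت بستم",
]


def bag_of_words(sentence):
    """Order-insensitive representation: token -> count."""
    return Counter(sentence.split())


reference = bag_of_words(variants[0])
print(len(set(variants)))                                   # 5
print(all(bag_of_words(v) == reference for v in variants))  # True
```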

1.2. Cultural discrepancies

This section mentions some cultural differences between English and Persian speakers. For more insight, the interested reader is referred to [1].

- 1.2.1. Taarof

In Iranian culture it is quite usual to say something that is not the speaker's real intention. This is known as "Taarof". For example, if both sides of a conversation agree to start reading a poem, the user may say "you first" as a polite suggestion that the other side go first. If a robot does not take Iranian culture into account and starts reading immediately, it ignores the expected behaviour. Repeated over time, such reactions make a negative impression. Instead, the other side is expected to say "no, you first please", even though that is not its real intention.
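In dialogue-management terms, this amounts to not taking the first offer at face value. The sketch below is one hypothetical way to encode that: the dialogue-act labels and the class name are invented for illustration, and accepting on the second offer is an assumption, since how many rounds of Taarof are appropriate depends on the situation.

```python
class TurnOfferPolicy:
    """Decline the first "you first" offer politely, accept a repeated one."""

    def __init__(self):
        self.offers_seen = 0

    def respond(self, user_act):
        # Only the hypothetical "offer_turn" act triggers Taarof handling.
        if user_act != "offer_turn":
            return "continue"
        self.offers_seen += 1
        if self.offers_seen == 1:
            return "counter_offer"  # e.g. "no, you first please"
        return "accept_turn"        # the user insisted, so go ahead


policy = TurnOfferPolicy()
print(policy.respond("offer_turn"))  # counter_offer
print(policy.respond("offer_turn"))  # accept_turn
```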

- 1.2.2. Compliments

“Another example of a Persian taboo is complimenting a man on his wife’s looks. The remark "You have a lovely wife." or "Your wife is very beautiful." would be regarded as almost indecent by many Iranians. Yet the same compliment would be considered perfectly natural and even highly appreciated by Westerners.” [1]

- 1.2.3. Addressing people

Unlike English, in which it is common to call almost everybody by their first name, in Iranian culture this is interpreted as impolite, especially in some situations. For instance, it is unacceptable for children or students to call their parents or teachers by their first names. Disobeying these rules might lead the conversational robot to be seen as unnatural, rude, or at least ignorant.

1.3. Other challenges

- 1.3.1. Mixed words from English

Persian speakers increasingly use English words in their speech. In the ASR phase this causes no problem: the utterance is simply transcribed into its equivalent written form in Persian script. In the NLU phase, however, the Persian form may not be found in any knowledge base (KB) or ontology. For instance, the user says the sentence below and the ASR produces:

پیج رسمی فرهادی تو اینستاگرام کدومه؟

For the first word (پیج) we cannot find any entry in KBs such as Persian Wikipedia, because it is a word imported from English ("page") and is not annotated anywhere in Persian KBs.
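A simple fallback is to consult a transliteration table when the Persian KB lookup fails, and then query an English resource with the recovered word. The sketch below assumes such a table exists; `LOANWORDS`, `lookup`, and the toy KB are hypothetical, and the two entries come from the example sentence above.

```python
# Hypothetical table from Persian-script loanwords to their English source.
LOANWORDS = {"پیج": "page", "اینستاگرام": "Instagram"}


def lookup(token, persian_kb):
    """Resolve a token against a Persian KB, falling back to loanwords.

    Returns a (source, entry) pair, where source is "persian",
    "english", or "unknown".
    """
    if token in persian_kb:
        return ("persian", token)
    english = LOANWORDS.get(token)
    if english is not None:
        return ("english", english)
    return ("unknown", token)


toy_kb = {"رسمی"}
print(lookup("پیج", toy_kb))  # ('english', 'page')
```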

- 1.3.2. English abbreviations

As in the cases above, English abbreviations are another source of difficulty in Persian speech recognition. Take for example the sentence "داوران ای اف سی انتخاب شدند", where "ای اف سی" is the Persian transliteration of AFC. When abbreviations are mixed with other Persian words, it is challenging to detect their boundaries and to convert them to the equivalent English form.
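The boundary problem can be illustrated with a naive scan that collapses runs of transliterated letter names into Latin letters. This is only a sketch: `LETTER_NAMES` holds just the three letters from the AFC example (a real table would cover the whole alphabet), and requiring at least two consecutive letter names is an assumed safeguard, because a single token may be an ordinary Persian word rather than a letter name.

```python
# Letter names from the AFC example only; a partial, illustrative table.
LETTER_NAMES = {"ای": "A", "اف": "F", "سی": "C"}


def restore_abbreviations(sentence, min_letters=2):
    """Replace runs of transliterated letter names with Latin letters."""
    output, run = [], []

    def flush():
        # Convert the pending run only if it is long enough to be plausible.
        if len(run) >= min_letters:
            output.append("".join(LETTER_NAMES[t] for t in run))
        else:
            output.extend(run)
        run.clear()

    for token in sentence.split():
        if token in LETTER_NAMES:
            run.append(token)
        else:
            flush()
            output.append(token)
    flush()
    return " ".join(output)


print(restore_abbreviations("داوران ای اف سی انتخاب شدند"))
```

With the example sentence, the three letter-name tokens are merged into the single token AFC while the surrounding Persian words are left unchanged.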

1.4. Scarcity of data

Last but not least, the amount of annotated data for Persian is considerably lower than for English. For example, in Wikipedia, which serves as the main resource for the Alana chatbot, the number of pages is about 52,293,836 in English, whereas it is around 4,815,055 in Persian (less than 10% of the English figure). Another example is the WordNet ontology, which contains 155,327 words organized in 175,979 synsets, whereas FarsNet (the most up-to-date Persian wordnet) has 100,000 words and more than 40,000 synsets.

In conclusion, many of the barriers mentioned above occur only occasionally. How frequent and likely they are is a matter for further research. One useful study in this area is perhaps this Persian paper.

[1] Afghari, A., & Karimnia, A. (2007). A contrastive study of four cultural differences in everyday conversation between English and Persian. Intercultural Communication Studies, 16(1), 243.