JIRC 2023 : Journées informatique en Région Centre-Val de Loire

19-20 Oct. 2023, Tours (France)

sciencesconf.org:jirc-2023:501516

The lack of a single standard orthography leads to multiple written forms of the same word. This orthographic inconsistency is a frequent problem for Natural Language Processing (NLP). In this paper, we present a contextual method, based on the CODA-TUN orthographic convention [34], to improve the semi-automatic normalization tool COTA Orthography [7], [25]. Our method targets words with multiple possible corrections, which the tool handles only partially. To this end, we trained and refined a trigram language model on a large corpus. We also introduce a generative algorithm that produces candidate sentences for each sentence containing target words. The correct candidate is then selected using the trigram model. Evaluation shows that selection accuracy reaches 79.38%.
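The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the add-one smoothing, the exhaustive candidate generation, and the placeholder tokens are all assumptions made for illustration, and the abstract does not specify how the actual model was trained or smoothed.

```python
import itertools
import math
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their bigram contexts over tokenized sentences."""
    tri, bi, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    return tri, bi, len(vocab)

def log_prob(tokens, model):
    """Sentence log-probability with add-one smoothing (an assumption)."""
    tri, bi, vocab_size = model
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        num = tri[tuple(padded[i - 2:i + 1])] + 1
        den = bi[tuple(padded[i - 2:i])] + vocab_size
        total += math.log(num / den)
    return total

def generate_candidates(tokens, corrections):
    """Build every sentence obtained by substituting each ambiguous word
    with one of its possible corrections (hypothetical generation step)."""
    options = [corrections.get(t, [t]) for t in tokens]
    return [list(c) for c in itertools.product(*options)]

def select_correction(tokens, corrections, model):
    """Pick the candidate sentence the trigram model scores highest."""
    candidates = generate_candidates(tokens, corrections)
    return max(candidates, key=lambda c: log_prob(c, model))

# Toy data with placeholder tokens, not real corpus content.
corpus = [["a", "b", "c"], ["a", "b", "c"], ["d", "b", "e"]]
model = train_trigram(corpus)
corrections = {"x": ["c", "e"]}  # "x" has two possible corrections
best = select_correction(["a", "b", "x"], corrections, model)
print(best)
```

In this toy setting the context "a b" has only been seen before "c", so the candidate ending in "c" is selected. Exhaustive generation grows exponentially with the number of ambiguous words in a sentence, so a real system would likely need pruning or a beam search.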

Type	:	Oral presentation
Topics	:	Session 3
Keywords	:	Tunisian Dialect ; Orthographic normalization ; Language modeling ; CODA ; TUN ; COTA Orthography ; Training ; Computational modeling ; Writing ; Natural language processing ; Data models ; Standards ; Phase detection

