The lack of a single standard orthography gives rise to multiple written forms of the same word. This orthographic inconsistency is a recurring problem in Natural Language Processing (NLP). In this paper, we present a contextual method, based on the CODA-TUN orthographic convention [34], that improves the semi-automatic normalization tool COTA Orthography [7], [25]. Our method targets words with several possible corrections, which this tool handles only partially. To this end, we trained and refined a trigram language model on a large corpus. We also introduce a generative algorithm that retrieves candidate sentences for each sentence containing target words. The correct candidate is then selected using the trigram model. The evaluation results show that the selection accuracy reaches 79.38%.
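As an illustration of the pipeline summarized above, the following minimal sketch generates candidate sentences for ambiguous words and selects the one a trigram model scores highest. It is not the implementation evaluated in this paper: the function names, the `corrections` dictionary, the toy tokens, and the use of add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import product


def train_trigram(sentences):
    """Count trigrams and their two-word contexts (hypothetical helper)."""
    tri, ctx = Counter(), Counter()
    vocab = set()
    for tokens in sentences:
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            ctx[tuple(padded[i - 2:i])] += 1
            tri[tuple(padded[i - 2:i + 1])] += 1
    return tri, ctx, len(vocab)


def log_prob(tokens, model):
    """Sentence log-probability with add-one smoothing (an assumption)."""
    tri, ctx, vocab_size = model
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        context = tuple(padded[i - 2:i])
        count = tri[context + (padded[i],)]
        total += math.log((count + 1) / (ctx[context] + vocab_size))
    return total


def candidates(tokens, corrections):
    """Expand every ambiguous token into each of its possible corrections."""
    options = [corrections.get(tok, [tok]) for tok in tokens]
    return [list(combo) for combo in product(*options)]


def select(tokens, corrections, model):
    """Return the candidate sentence with the highest trigram score."""
    return max(candidates(tokens, corrections), key=lambda c: log_prob(c, model))


# Toy data: "b" is an ambiguous form with two possible corrections.
corpus = [["a", "b1", "c"], ["a", "b1", "d"], ["e", "b2", "f"]]
corrections = {"b": ["b1", "b2"]}
model = train_trigram(corpus)
best = select(["a", "b", "c"], corrections, model)
```

In this toy setting the context `a ... c` only co-occurs with `b1` in the corpus, so the candidate containing `b1` receives the higher score. A real system would use a stronger smoothing scheme and limit the candidate set when a sentence contains many ambiguous words, since the number of candidates grows multiplicatively.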