Noisy Parallel Data Alignment

Author(s): Ruoyu Xie

Mentor(s): Antonios Anastasopoulos, Department of Computer Science

Abstract
There are about 7,000 languages in the world, and 40% of them are endangered. However, due to a lack of attention and data, most current Natural Language Processing (NLP) technologies are not reliable for endangered languages. Optical Character Recognition (OCR) can be used to convert existing endangered-language documents into machine-readable text. For endangered languages, however, the OCR output is typically noisy, and OCR post-correction systems are used to reduce recognition errors and produce more accurate machine-readable text. Previous work has shown that using alignments between the endangered language and its translation can significantly improve OCR post-correction (Rijhwani et al., 2020). Nevertheless, most alignment tools cannot be trusted to produce accurate alignments for noisy OCR-generated text. In this work, we study existing alignment tools under noisy data settings, aiming to optimize the state-of-the-art alignment model to better handle noisy OCR data.
Audio Transcript
Hello everyone, my name is Ruoyu Xie, and today I will be talking about my work, Noisy Parallel Data Alignment.

You have probably used Google Translate before. Today, most language tools, such as translation apps and speech recognition software, work well with widely spoken languages such as English, Chinese, and French, but they do not work nearly as well with endangered languages, the languages we do not see every day. There are about 7,000 languages in the world, and 40% of them are endangered. The reason these language tools struggle with endangered languages is that we do not have enough data to train language models for them.

In order to train those language models, we use a tool called optical character recognition (OCR) to digitize endangered-language documents, such as books, and make them machine-readable.
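
For concreteness, here is how a scanned page might be run through Tesseract, a widely used open-source OCR engine, via its Python wrapper. This is only an illustrative sketch: the talk does not specify which OCR system the project uses, and "page.png" is a placeholder file name.

```python
import pytesseract     # pip install pytesseract (also requires the Tesseract binary)
from PIL import Image  # pip install Pillow

# Recognize the text on a scanned page. For low-resource languages the
# result is machine-readable but often contains recognition errors.
page_text = pytesseract.image_to_string(Image.open("page.png"), lang="eng")
print(page_text)
```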

However, OCR often makes mistakes when recognizing text in endangered languages, so we have to apply OCR post-correction, which requires translation alignments between the two languages.
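
To make "translation alignments" concrete, here is a toy sketch: word alignment links each word in one language to its translation in the other. The sentence pair and the "i-j" Pharaoh link format below are common illustrative conventions, not data from this project.

```python
# Toy English-French sentence pair (not from the project's data).
english = "the cat sleeps".split()
french = "le chat dort".split()

# Pharaoh format: "i-j" means English word i aligns to French word j.
alignment = "0-0 1-1 2-2"

for pair in alignment.split():
    i, j = (int(k) for k in pair.split("-"))
    print(english[i], "<->", french[j])
```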

This is where my research comes into play: most of the available alignment models do not perform well on noisy data that contains OCR errors. My goal is to build a robust alignment model that can produce accurate alignments regardless of how noisy the data is.

In terms of methodology, I first capture the different kinds of errors that OCR recognition can make. Then, based on the probability distribution of these errors, I apply them to clean datasets to make them noisy. Lastly, I use this synthetic noisy data to train the alignment model and make it more robust at handling noisy data.
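
Below is a minimal sketch of this noise-injection step, assuming character-level substitutions, deletions, and insertions. The confusion table and probabilities here are illustrative placeholders; in the project they would be estimated from real OCR errors.

```python
import random

# Illustrative character confusions; the real distribution is estimated
# from actual OCR output rather than hard-coded.
CONFUSIONS = {"e": "c", "l": "1", "o": "0", "a": "u"}

def add_ocr_noise(text, p_sub=0.05, p_del=0.02, p_ins=0.02):
    """Corrupt clean text with OCR-style substitutions, deletions, and insertions."""
    noisy = []
    for ch in text:
        r = random.random()
        if r < p_del:
            continue  # simulate a dropped character
        elif r < p_del + p_sub:
            noisy.append(CONFUSIONS.get(ch, ch))  # simulate a misread character
        else:
            noisy.append(ch)
        if random.random() < p_ins:
            noisy.append(random.choice("abcdefghijklmnopqrstuvwxyz"))  # spurious character
    return "".join(noisy)

print(add_ocr_noise("the quick brown fox jumps over the lazy dog"))
```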

So far, we have reduced the alignment error rate on noisy English-French data from 53% down to 42%, and we are still conducting more experiments, hoping to get even better results in the future.
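
For context, the metric here is presumably the standard alignment error rate (AER) of Och and Ney (2003), though the talk does not spell out the evaluation setup. Given predicted links A, gold sure links S, and gold possible links P (with S ⊆ P), AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|), and lower is better.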

Lastly, I want to say thank you to my mentor, Dr. Anastasopoulos, for guiding me through this project, and also thank you to the OSCAR office for the sponsorship.

Thanks for watching!
