Noisy Parallel Data Alignment

Author(s): Ruoyu Xie

Mentor(s): Antonios Anastasopoulos, Computer Science

Abstract
There are about 7,000 languages in the world, and at least 40% of them are endangered. Because these languages receive little attention, current Optical Character Recognition (OCR) software is unreliable at converting text in endangered languages into machine-readable form. OCR post-correction systems are developed to reduce recognition errors and produce more accurate results. To train an OCR post-correction system for an endangered language, we need alignments between the endangered-language text and its translation. While some alignment tools exist, they do not produce accurate results on the noisy data typical of endangered languages. Without correct alignments as input, an OCR post-correction system cannot correct recognition errors well. This research project aims to build a robust alignment model that handles noisy OCR text from endangered languages and produces accurate alignments for OCR post-correction systems.
Audio Transcript
Hello everyone, my name is Ruoyu Xie, today I will be talking about my work Noisy Parallel Data Alignment.

You have probably used Google Translate before. Today, most language tools, such as translation apps and speech recognition software, work well for widely spoken languages like English, Chinese, and French, but they do not do well with endangered languages, the languages we do not see every day. There are about 7,000 languages in the world, but 40% of them are endangered. The reason these tools perform poorly on endangered languages is that we do not have enough data to train language models for them.

In order to train the language models, we use a tool called Optical Character Recognition (OCR) to digitize texts in these languages from books and make them machine-readable.
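To make this step concrete, here is a minimal sketch of OCR-based digitization. The talk does not name a specific OCR tool, so the use of the open-source Tesseract engine via the pytesseract library (and the file name) is purely illustrative.

```python
from PIL import Image
import pytesseract

def digitize_page(image_path: str, lang: str = "eng") -> str:
    """Run OCR on one scanned book page and return the raw text."""
    page = Image.open(image_path)
    # For endangered languages, a trained language pack usually does not
    # exist, so the output tends to contain recognition errors.
    return pytesseract.image_to_string(page, lang=lang)

text = digitize_page("scanned_page.png")  # hypothetical input image
print(text)
```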

However, OCR often makes mistakes when recognizing text in endangered languages, so we have to perform OCR post-correction, which requires translation alignments between the two languages.
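To illustrate what a translation alignment is, here is a toy example in the common Pharaoh "i-j" notation, where source token i links to target token j. The sentence pair is invented for demonstration and is not from the project's data.

```python
# French source, English target, and their word-level alignment links.
source = "le chat dort".split()
target = "the cat sleeps".split()
alignment = [(0, 0), (1, 1), (2, 2)]  # "0-0 1-1 2-2" in Pharaoh notation

for i, j in alignment:
    print(f"{source[i]} <-> {target[j]}")
```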

This is where my research comes into play: most of the available alignment models do not perform well on noisy data that contains OCR errors. My goal is to build a robust alignment model that produces accurate alignment results regardless of how noisy the data is.

In terms of methodology, I first capture the different kinds of errors that OCR recognition can make. Then, based on the probability distribution of these errors, I apply them to clean datasets to make them noisy. Lastly, I use this synthetic noisy data to train the alignment model and make it more robust at handling noisy data; a sketch of the noise-injection step follows.
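Below is a simplified sketch of that noise-synthesis step: character-level deletion, substitution, and insertion errors are sampled according to an estimated error distribution. The probabilities and confusion pairs here are placeholders, not the distribution actually measured in the project.

```python
import random

# Placeholder error rates and OCR-style character confusions.
ERROR_PROBS = {"substitute": 0.05, "delete": 0.02, "insert": 0.02}
CONFUSIONS = {"e": "c", "l": "1", "o": "0", "m": "rn"}

def add_ocr_noise(text: str) -> str:
    """Corrupt clean text with synthetic OCR-like errors."""
    noisy = []
    for ch in text:
        r = random.random()
        if r < ERROR_PROBS["delete"]:
            continue                              # drop the character
        if r < ERROR_PROBS["delete"] + ERROR_PROBS["substitute"]:
            noisy.append(CONFUSIONS.get(ch, ch))  # confuse the character
        else:
            noisy.append(ch)
        if random.random() < ERROR_PROBS["insert"]:
            noisy.append(random.choice("abcdefghijklmnopqrstuvwxyz"))
    return "".join(noisy)

print(add_ocr_noise("the quick brown fox"))
```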

So far, we have reduced the alignment error rate on a noisy English-French dataset from 53% down to 42%, and we are still conducting more experiments, hoping for better results in the future.
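The transcript reports error rates without defining the metric; assuming it is the standard Alignment Error Rate (AER) of Och and Ney (2003), which compares predicted links A against sure gold links S and possible gold links P, a minimal computation looks like this. The example sets are invented.

```python
def aer(predicted: set, sure: set, possible: set) -> float:
    """AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|); lower is better."""
    a_s = len(predicted & sure)
    a_p = len(predicted & possible)  # `possible` should contain `sure`
    return 1.0 - (a_s + a_p) / (len(predicted) + len(sure))

sure = {(0, 0), (1, 1)}
possible = sure | {(2, 2)}
predicted = {(0, 0), (2, 2)}
print(f"AER = {aer(predicted, sure, possible):.2f}")  # AER = 0.25
```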

Lastly, I want to thank my mentor, Dr. Anastasopoulos, for guiding me through this project, and the OSCAR office for its sponsorship.

Thanks for watching.

Replies

Great question! We first try to obtain the translation of an endangered language from its local community, which usually has books and materials containing both the endangered language and a more widely spoken one. If not, we have to find someone who knows both languages and have the text translated manually.
