Authors
Eray Yildiz¹, Ahmed Cuneyd Tantug² and Banu Diri³
¹,³Yildiz Technical University, Turkey and ²Istanbul Technical University, Turkey
Abstract
Parallel corpora play an important role in statistical machine translation (SMT) systems. In this study, we investigate the effects of parallel corpus size and quality on SMT performance. We develop a machine-learning-based classifier that labels parallel sentence pairs as high-quality or poor-quality. Applying this classifier to a corpus of 1 million English-Turkish sentence pairs yields 600K high-quality pairs. We train multiple SMT systems on subsets of various sizes drawn from both the entire raw corpus and the filtered high-quality corpus, and evaluate their performance. As expected, our experiments show that the size of the parallel corpus is a major factor in translation performance. However, instead of extending the corpus with all available “so-called” parallel data, better translation performance and reduced time complexity can be achieved with a smaller, high-quality corpus obtained with a quality filter.
Keywords
Machine Translation, Machine Learning, Natural Language Processing, Parallel Corpus, Data Selection