A Systematic Literature Review on Extraction of Parallel Corpora from Comparable Corpora

Dilshad Kaur; Satwinder Singh

doi:10.3844/jcssp.2021.924.952

Research Article Open Access

Dilshad Kaur¹ and Satwinder Singh¹

¹ Central University of Punjab, India

Abstract

In today's globalized world, the demand for translation is high and growing rapidly across a growing number of fields, but translating everything manually is impractical. Machine Translation (MT), which depends on the availability of corpora, is one way to meet this demand. Most translation knowledge is learned from parallel corpora, so both their quantity and their quality are critical. Because parallel corpora are not readily available for many language pairs, comparable corpora, which are widely available, can be used to extract parallel data. This systematic literature review covers 188 research articles published in premier journals, conferences, workshops and book chapters. The review process is guided by a set of research questions. Different MT systems are identified along with their features. Several datasets and techniques for bilingual lexicon extraction, parallel sentence extraction and parallel fragment extraction are surveyed. A proposed architecture and a mind map are also presented to clarify how parallel data can be extracted from comparable corpora. The review is intended to improve readers' understanding of parallel data mining through bilingual lexicons, parallel sentences and parallel fragments.

Journal of Computer Science

Volume 17 No. 10, 2021, 924-952

DOI: https://doi.org/10.3844/jcssp.2021.924.952

Submitted On: 22 May 2021; Published On: 27 October 2021

How to Cite: Kaur, D. & Singh, S. (2021). A Systematic Literature Review on Extraction of Parallel Corpora from Comparable Corpora. Journal of Computer Science, 17(10), 924-952. https://doi.org/10.3844/jcssp.2021.924.952

Copyright: © 2021 Dilshad Kaur and Satwinder Singh. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Keywords

Machine Translation
Statistical Machine Translation
Parallel Corpora
Comparable Corpora