Optimizing Clusters Alignment for Bilingual Malay-English Corpora

Rayner Alfred; Chan Chen Jie; Ng Zhen Wei; Asni Tahir; Joe Henry Obit

doi:10.3844/jcssp.2012.1970.1978

Research Article Open Access

Rayner Alfred¹, Chan Chen Jie¹, Ng Zhen Wei¹, Asni Tahir¹ and Joe Henry Obit¹

¹ Universiti Malaysia Sabah, Malaysia

Abstract

Bilingual corpora, which contain the same documents in two different languages, are becoming an essential resource for natural language processing. Clustering bilingual corpora provides insight into the differences between languages when term frequency-based Information Retrieval (IR) tools are used. It also allows Natural Language Processing (NLP) and IR tools developed for one language to be used to implement IR for another language. This study reports on our work applying Hierarchical Agglomerative Clustering (HAC) to a large corpus of documents, each of which appears in both Malay and English. The documents are clustered separately for each language, and the two results are compared with respect to the content of the clusters produced. The effect of the method used to compute the inter-cluster distance (Single, Complete or Average link) on the clustering results is also studied. Finally, this study describes an experiment that employs a genetic algorithm to fine-tune the weights of individual terms in order to reproduce a predefined set of clusters more closely. In this way, clustering becomes a supervised learning technique that is trained to better reproduce the known clusters in Malay when applied to the corresponding documents in English. On the data available, the results of clustering in one language resemble those in the other, provided the number of clusters required is relatively small. The method used to compute the inter-cluster distance also influences the clustering results. The results showed an increase in the percentage of aligned clusters when the genetic algorithm was applied to fine-tune the weights of the terms considered in clustering the bilingual Malay-English corpora. This study concludes that with a small number of clusters, k = 5, all of the clusters from the English texts can be mapped onto the clusters of the Malay texts when the Complete link distance measure is used to cluster the bilingual parallel corpus. In contrast, with a larger number of clusters, fewer clusters from the English texts can be mapped onto the clusters of the Malay texts.
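
The clustering and comparison steps summarized in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the cosine distance, the toy term-frequency vectors, the function names and the alignment criterion (an English cluster counts as aligned only if its best-matching Malay cluster holds exactly the same documents) are all assumptions made for illustration. The genetic algorithm step is not shown.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity between two term-frequency vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)

# The three inter-cluster distance methods named in the abstract.
LINKAGES = {
    "single": min,                             # closest pair of documents
    "complete": max,                           # farthest pair of documents
    "average": lambda ds: sum(ds) / len(ds),   # mean over all pairs
}

def hac(vectors, k, linkage="complete"):
    """Merge the two closest clusters until k remain; returns sets of document indices."""
    combine = LINKAGES[linkage]
    clusters = [{i} for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            ds = [cosine_distance(vectors[i], vectors[j])
                  for i in clusters[a] for j in clusters[b]]
            d = combine(ds)
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters

def map_clusters(source, target):
    """Map each source cluster to the index of the target cluster sharing most documents."""
    return [max(range(len(target)), key=lambda t: len(src & target[t]))
            for src in source]

def aligned_fraction(source, target):
    """Fraction of source clusters whose best match has exactly the same documents
    (a hypothetical alignment criterion, chosen only for this sketch)."""
    mapping = map_clusters(source, target)
    exact = sum(1 for src, t in zip(source, mapping) if src == target[t])
    return exact / len(source)

# Toy parallel corpus: document i in `english` is the translation of document i
# in `malay`; each language has its own vocabulary, hence different vector lengths.
english = [[3, 1, 0, 0], [2, 2, 0, 0], [4, 1, 0, 0],
           [0, 0, 3, 1], [0, 0, 2, 2], [0, 1, 4, 1]]
malay = [[2, 0, 0], [3, 1, 0], [2, 0, 0],
         [0, 3, 1], [0, 2, 2], [0, 4, 1]]

if __name__ == "__main__":
    for name in LINKAGES:
        en = hac(english, 2, name)
        ms = hac(malay, 2, name)
        print(name, aligned_fraction(en, ms))
```

Because the documents are indexed identically in both languages, the clusters can be compared directly as sets of document indices, with no need to compare terms across languages.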

Journal of Computer Science

Volume 8 No. 12, 2012, 1970-1978

DOI: https://doi.org/10.3844/jcssp.2012.1970.1978

Submitted On: 13 July 2012 Published On: 26 November 2012

How to Cite: Alfred, R., Jie, C. C., Wei, N. Z., Tahir, A. & Obit, J. H. (2012). Optimizing Clusters Alignment for Bilingual Malay-English Corpora. Journal of Computer Science, 8(12), 1970-1978. https://doi.org/10.3844/jcssp.2012.1970.1978

Copyright: © 2012 Rayner Alfred, Chan Chen Jie, Ng Zhen Wei, Asni Tahir and Joe Henry Obit. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Keywords

Bilingual Corpora
Hierarchical Agglomerative Clustering
Parallel Clustering
Genetic Algorithm
Malay-English Corpora
Knowledge Management