Research Article Open Access

Optimization Techniques To Record Deduplication

Deepa Karunakaran1 and Rangarajan Rangaswamy2
  • 1 Anna University-Coimbatore, India
  • 2 Indus College of Engineering, India

Abstract

Duplicate record detection is an important step in data preprocessing and cleaning. Artificial Bee Colony (ABC) is one of the more recently introduced optimization algorithms, based on the intelligent foraging behavior of a honey bee swarm. Our approach to duplicate detection uses the ABC algorithm to generate an optimal similarity measure for deciding whether records are duplicates. In the training phase, the ABC algorithm generates the optimal similarity measure. Once this measure is obtained, the remaining datasets are deduplicated using it. We analyze the proposed algorithm on the Restaurant and Cora datasets and compare its performance against a genetic programming technique using evaluation metrics.
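
The training phase described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the similarity measure is a weighted average of per-field similarity scores, that a pair counts as a duplicate when the score reaches a fixed threshold of 0.5, and that fitness is plain accuracy on labelled pairs. All names (`weighted_similarity`, `fitness`, `abc_optimize`) and the toy data are hypothetical, and the scout step is simplified to reset every exhausted source in each cycle.

```python
import random

def weighted_similarity(weights, field_sims):
    """Weighted average of per-field similarity scores (each in [0, 1])."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * s for w, s in zip(weights, field_sims)) / total

def fitness(weights, pairs, threshold=0.5):
    """Fraction of labelled pairs classified correctly (assumed fitness)."""
    correct = sum(
        (weighted_similarity(weights, sims) >= threshold) == is_dup
        for sims, is_dup in pairs
    )
    return correct / len(pairs)

def abc_optimize(pairs, n_fields, n_sources=10, limit=5, cycles=50, seed=0):
    """Simplified ABC search over weight vectors in [0, 1]^n_fields."""
    rng = random.Random(seed)
    sources = [[rng.random() for _ in range(n_fields)] for _ in range(n_sources)]
    scores = [fitness(s, pairs) for s in sources]
    trials = [0] * n_sources
    best_index = max(range(n_sources), key=lambda i: scores[i])
    best = (list(sources[best_index]), scores[best_index])

    def try_neighbour(i):
        # Perturb one weight relative to a randomly chosen other source.
        nonlocal best
        k = rng.choice([j for j in range(n_sources) if j != i])
        d = rng.randrange(n_fields)
        candidate = list(sources[i])
        step = rng.uniform(-1.0, 1.0) * (candidate[d] - sources[k][d])
        candidate[d] = min(1.0, max(0.0, candidate[d] + step))
        score = fitness(candidate, pairs)
        if score > scores[i]:  # greedy selection
            sources[i], scores[i], trials[i] = candidate, score, 0
            if score > best[1]:
                best = (list(candidate), score)
        else:
            trials[i] += 1

    for _ in range(cycles):
        # Employed bees: one neighbour search per food source.
        for i in range(n_sources):
            try_neighbour(i)
        # Onlooker bees: pick sources with probability proportional to fitness.
        for _ in range(n_sources):
            i = rng.choices(range(n_sources), weights=[s + 1e-9 for s in scores])[0]
            try_neighbour(i)
        # Scout bees: replace sources that failed to improve more than `limit` times.
        for i in range(n_sources):
            if trials[i] > limit:
                sources[i] = [rng.random() for _ in range(n_fields)]
                scores[i] = fitness(sources[i], pairs)
                trials[i] = 0
                if scores[i] > best[1]:
                    best = (list(sources[i]), scores[i])
    return best

# Toy labelled pairs: (per-field similarities, is_duplicate).
# Field 0 is informative, field 1 is mostly noise.
pairs = [
    ([0.9, 0.1], True), ([0.8, 0.9], True), ([0.95, 0.0], True),
    ([0.2, 0.9], False), ([0.1, 0.2], False), ([0.3, 1.0], False),
]
best_weights, best_score = abc_optimize(pairs, n_fields=2)
print(best_weights, best_score)
```

In this sketch the learned weights would then be applied, unchanged, to score record pairs from the remaining data, mirroring the split between training and deduplication described in the abstract.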

Journal of Computer Science
Volume 8 No. 9, 2012, 1487-1495

DOI: https://doi.org/10.3844/jcssp.2012.1487.1495

Submitted On: 4 July 2012 Published On: 11 August 2012

How to Cite: Karunakaran, D. & Rangaswamy, R. (2012). Optimization Techniques To Record Deduplication. Journal of Computer Science, 8(9), 1487-1495. https://doi.org/10.3844/jcssp.2012.1487.1495

Keywords

  • Data preprocessing
  • genetic programming
  • remaining datasets
  • similarity measure obtained
  • evaluation metrics
  • proposed algorithm
  • Artificial Bee Colony (ABC)