Research Article Open Access

New Information Content Glossary Relatedness (ICGR) Approach for Short Text Similarity (STS) Tasks

Ali Muftah BenOmran1 and Mohd Juzaiddin Ab Aziz1
  • 1 Universiti Kebangsaan Malaysia, Malaysia

Abstract

Measuring the semantic relatedness of words with complementary Wikipedia-based and WordNet-based methods takes two forms, combined and integrative, both of which aim to increase the semantic space between related words. However, each form has its own issues regarding its components and the strategy used to combine or integrate corpus-based and knowledge-based methods. In the integrative strategy, a large corpus, such as Wikipedia, is used to extract a set of related words for a particular concept as a basis for searching the WordNet space. The drawback of this strategy is its use of a fixed scaling parameter, which fits only the dataset on which it was implemented, where it comes close to the human score. Other corpus-based methods use an experimentally determined cut-off threshold to reduce the semantic space and to focus the search on a more accurate semantic space. Such methods take into account only the frequency of bigrams, while ignoring the frequency of individual terms. Knowledge-based methods that use gloss overlap have a similar limitation to the corpus-based methods: they lose many valuable relatedness features that would yield a more accurate measurement. Thus, in this paper, a new Information Content Glossary Relatedness (ICGR) approach is proposed in two steps. First, an Extended-PMI based on a cut-off density threshold is proposed to extract a robust Relatedness Vector Set (RVS) from a large Wikipedia dataset. Then, a Semantic Structural Information (SSI) method is presented that uses the RVS as a fulcrum to identify the most related gloss in WordNet for each gloss and to select the top 5 glosses related to each RVS. The results showed that the proposed approach outperformed the state-of-the-art methods: the Extended-PMI achieved a Spearman’s correlation of 0.89 with the human score, and the ICGR approach achieved a Spearman’s correlation of 0.8 with the human score.
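For readers unfamiliar with the corpus-based step, the following is a minimal sketch of standard pointwise mutual information (PMI) with a cut-off threshold, PMI(x, y) = log2(p(x, y) / (p(x) p(y))). It is not the Extended-PMI or the density threshold proposed in the paper, whose details are not given in the abstract. The function name `pmi_related_terms`, the sentence-level co-occurrence window, and the toy corpus are assumptions made purely for illustration. Note that this formulation divides the pair probability by the individual term probabilities, so it uses both pair and single-term frequencies.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_related_terms(sentences, target, threshold=0.0):
    """Return {term: PMI(target, term)} for terms whose PMI exceeds threshold.

    Illustrative only: co-occurrence is counted once per sentence, and
    probabilities are estimated as sentence-level relative frequencies.
    """
    term_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        unique = set(tokens)
        term_counts.update(unique)
        # Count each unordered pair of distinct terms once per sentence.
        pair_counts.update(frozenset(p) for p in combinations(sorted(unique), 2))

    n = len(sentences)
    scores = {}
    for pair, joint in pair_counts.items():
        if target not in pair:
            continue
        (other,) = pair - {target}
        p_joint = joint / n
        p_target = term_counts[target] / n
        p_other = term_counts[other] / n
        pmi = math.log2(p_joint / (p_target * p_other))
        # Cut-off threshold: keep only sufficiently associated terms.
        if pmi > threshold:
            scores[other] = pmi
    return scores


# Hypothetical toy corpus standing in for a Wikipedia-scale dataset.
corpus = [
    ["car", "engine", "wheel"],
    ["car", "engine"],
    ["bank", "money"],
    ["car", "road"],
    ["engine", "money"],
]
related = pmi_related_terms(corpus, "car", threshold=0.5)
```

Raising `threshold` shrinks the set of retained terms, which mirrors the general idea of using a cut-off to narrow the semantic space; the threshold value here is arbitrary.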

Journal of Computer Science
Volume 15 No. 6, 2019, 769-784

DOI: https://doi.org/10.3844/jcssp.2019.769.784

Submitted On: 10 March 2018 Published On: 5 June 2018

How to Cite: BenOmran, A. M. & Ab Aziz, M. J. (2019). New Information Content Glossary Relatedness (ICGR) Approach for Short Text Similarity (STS) Tasks. Journal of Computer Science, 15(6), 769-784. https://doi.org/10.3844/jcssp.2019.769.784

Keywords

  • Wikipedia
  • PMI
  • WordNet
  • Gloss
  • Structural Information
  • STS