Efficient Textual Similarity using Semantic MinHashing

Nawaz, Waqas and Baig, Maryam and Khan, Kifayat Ullah (2024) Efficient Textual Similarity using Semantic MinHashing. In: 2024 IEEE International Conference on Big Data and Smart Computing, 18th-21st February 2024, Bangkok, Thailand.

Efficient_Textual_Similarity_using_Semantic_MinHashing.pdf - Accepted Version


Abstract

Quantifying the likeness between words, sentences, paragraphs, and documents plays a crucial role in many natural language processing (NLP) applications. Contemporary methods, exemplified by BERT, ELMo, and RoBERTa, use neural networks to generate embeddings and need substantial data and training time to reach state-of-the-art performance. Alternatively, semantic similarity metrics can be built on knowledge bases such as WordNet, using measures such as the shortest path between words. MinHashing is a lightweight technique that quickly approximates the Jaccard similarity of document pairs. In this study, we propose using MinHashing to estimate semantic scores by enriching the original documents with information from semantic networks, incorporating relationships such as synonyms, antonyms, hyponyms, and hypernyms. This augmentation makes lexical similarity reflect semantic relatedness. The MinHash algorithm then computes compact signatures for the extended vectors, mitigating dimensionality concerns, and the similarity of these signatures gives the semantic score between the documents. Our method achieves approximately 64% accuracy on the MRPC and SICK datasets.
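
The pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tiny `TOY_RELATIONS` dictionary is a hypothetical stand-in for a semantic network such as WordNet, and the function names, the seeded BLAKE2b hash family, and the signature length of 128 are all assumptions chosen for illustration. Each document is expanded with related terms, a MinHash signature is computed over the expanded token set, and the fraction of matching signature positions estimates the Jaccard similarity.

```python
import hashlib

# Hypothetical toy semantic network (assumption; the paper uses resources such as WordNet).
# Each word maps to related terms (synonyms, antonyms, hypernyms, ...).
TOY_RELATIONS = {
    "car": {"automobile", "vehicle"},
    "automobile": {"car", "vehicle"},
    "fast": {"quick", "slow"},
    "quick": {"fast", "slow"},
}


def expand(tokens, relations=TOY_RELATIONS):
    """Return the token set augmented with related terms from the semantic network."""
    expanded = set(tokens)
    for tok in tokens:
        expanded |= relations.get(tok, set())
    return expanded


def _hash(token, seed):
    """Deterministic 64-bit hash of a token; the seed selects one hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(token_set, num_hashes=128):
    """For each hash function, keep the minimum hash over a non-empty token set."""
    return [min(_hash(t, seed) for t in token_set) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)


doc_a = "the car is fast".split()
doc_b = "the automobile is quick".split()

# Lexical similarity on the raw tokens versus similarity after semantic expansion.
plain_estimate = estimate_similarity(
    minhash_signature(set(doc_a)), minhash_signature(set(doc_b))
)
semantic_estimate = estimate_similarity(
    minhash_signature(expand(doc_a)), minhash_signature(expand(doc_b))
)
print(plain_estimate, semantic_estimate)
```

In this toy case the raw token sets share only two of six distinct words, while the expanded sets coincide, so the signature-based score rises once semantic relations are added. A real system would replace the dictionary with lookups in an actual knowledge base and would need to decide how each relation type (for example, antonyms) should contribute.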

Item Type: Conference or Workshop Item (Paper)
Identification Number: 10.1109/BigComp60711.2024.00048
Dates: 11 December 2023 (Accepted); 11 April 2024 (Published Online)
Uncontrolled Keywords: MinHashing, Semantic similarity, WordNet, Natural Language Processing (NLP), Jaccard similarity, Algorithm
Subjects: CAH11 - computing > CAH11-01 - computing > CAH11-01-03 - information systems
CAH11 - computing > CAH11-01 - computing > CAH11-01-05 - artificial intelligence
Divisions: Faculty of Business, Law and Social Sciences > College of Accountancy, Finance and Economics
Faculty of Business, Law and Social Sciences > College of Accountancy, Finance and Economics > Centre for Accountancy Finance and Economics
Depositing User: Kifayat Khan
Date Deposited: 02 Jan 2025 14:35
Last Modified: 02 Jan 2025 14:35
URI: https://www.open-access.bcu.ac.uk/id/eprint/16059
