SMS Scam Detection Application Based on Optical Character Recognition for Image Data Using Unsupervised and Deep Semi-Supervised Learning

Shinde, Anjali; Shahra, Essa; Basurra, Shadi; Saeed, Faisal; Alsewari, AbdulRahman; Jabbar, Waheb A.

Shinde, Anjali and Shahra, Essa and Basurra, Shadi and Saeed, Faisal and Alsewari, AbdulRahman and Jabbar, Waheb A. (2024) SMS Scam Detection Application Based on Optical Character Recognition for Image Data Using Unsupervised and Deep Semi-Supervised Learning. Sensors, 24 (18). p. 6084. ISSN 1424-8220

Text
sensors-24-06084-v3.pdf - Published Version
Available under License Creative Commons Attribution.

Official URL: https://www.mdpi.com/1424-8220/24/18/6084

Abstract

The growing problem of unsolicited text messages (smishing) and data irregularities necessitates stronger spam detection solutions. This paper explores the development of a model designed to identify smishing messages by understanding the complex relationships among words, images, and context-specific factors, areas that remain underexplored in existing research. To address this, we merge a UCI spam dataset of regular text messages with real-world spam data, applying optical character recognition (OCR) to image data for comprehensive analysis. The study employs a combination of traditional machine learning models, including K-means, Non-Negative Matrix Factorization, and Gaussian Mixture Models, along with feature extraction techniques such as TF-IDF and PCA. Additionally, deep learning models such as RNN-Flatten, LSTM, and Bi-LSTM are utilized. These models are selected for their complementary strengths in capturing both the linear and non-linear relationships inherent in smishing messages: the machine learning models for their efficiency in handling structured text data, and the deep learning models for their superior ability to capture sequential dependencies and contextual nuances. Model performance is evaluated using accuracy, precision, recall, and F1 score, enabling a comparative analysis between the machine learning and deep learning approaches. Notably, the K-means feature extraction with vectorizer achieved 91.01% accuracy, and the KNN-Flatten model reached 94.13% accuracy, emerging as the top performer. These models are highlighted for their potential to significantly improve smishing detection rates. For instance, the high accuracy of the KNN-Flatten model suggests its applicability in real-time spam detection systems, but its computational complexity might limit scalability in large-scale deployments. Similarly, while K-means with vectorizer excels in accuracy, it may struggle with the dynamic and evolving nature of smishing attacks, necessitating continual retraining.
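
As an illustration of the four evaluation metrics named in the abstract, the following minimal Python sketch computes accuracy, precision, recall, and F1 score for a binary spam/ham classifier. It is not taken from the paper: the names `BinaryMetrics` and `evaluate` are hypothetical, and the labels and predictions are made-up toy values rather than results from the study.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def evaluate(y_true, y_pred, positive="spam"):
    """Compute binary classification metrics, treating `positive` as the target class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts for the positive class.
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    total = len(pairs)
    # Each ratio falls back to 0.0 when its denominator is zero.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return BinaryMetrics(accuracy, precision, recall, f1)


# Toy data (assumed for illustration only): 8 messages, 4 spam and 4 ham.
y_true = ["spam", "ham", "spam", "ham", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam", "ham", "spam"]

metrics = evaluate(y_true, y_pred)
print(metrics)
```

On this toy input there are three true positives, three true negatives, one false positive, and one false negative, so all four metrics come out to 0.75. On an imbalanced dataset the metrics would diverge, which is why precision, recall, and F1 are typically reported alongside accuracy for spam detection.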

Item Type:	Article
Identification Number:	10.3390/s24186084
Dates:	Accepted 16 September 2024; Published Online 20 September 2024
Uncontrolled Keywords:	unsupervised machine learning, deep learning semi supervised, feature extraction, smishing message
Subjects:	CAH10 - engineering and technology > CAH10-01 - engineering > CAH10-01-01 - engineering (non-specific); CAH11 - computing > CAH11-01 - computing > CAH11-01-01 - computer science
Divisions:	Architecture, Built Environment, Computing and Engineering > Engineering
Depositing User:	Gemma Tonks
Date Deposited:	16 May 2025 12:55
Last Modified:	16 May 2025 12:55
URI:	https://www.open-access.bcu.ac.uk/id/eprint/16359