Degree

Doctor of Philosophy (PhD)

Department

Electrical and Computer Engineering

Document Type

Dissertation

Abstract

Text messages are essential these days; however, spam texts have contributed negatively to the success of this communication mode. The compromised authenticity of such messages has given rise to several security breaches. Using spam messages, malicious links have been sent to either harm the system or obtain information detrimental to the user. Spam SMS messages as well as emails have been used as media for attacks such as masquerading and smishing ( a phishing attack through text messaging), and this has threatened both the user and service providers. Therefore, given the waves of attacks, the need to identify and remove these spam messages is important.

This dissertation explores the process of text classification from data input to embedded representation of the words in vector form and finally the classification process. Therefore, we have applied different embedding methods to capture both the linguistic and semantic meanings of words. Static embedding methods that are used include Word to Vector (Word2Vec) and Global Vectors (GloVe), while for dynamic embedding the transfer learning of the Bidirectional Encoder Representations from Transformers (BERT) was employed. For classification, both machine learning and deep learning techniques were used to build an efficient and sensitive classification model with good accuracy and low false positive rate. Our result established that the combination of BERT for embedding and machine learning for classification produced better classification results than other combinations.

With these results, we developed models that combined the self-feature extraction advantage of deep learning and the effective classification of machine learning. These models were tested on four different datasets, namely: SMS Spam dataset, Ling dataset, Spam Assassin dataset and Enron dataset. BERT+SVC (hybrid model) produced the result with highest accuracy and lowest false positive rate.

Date

11-3-2022

Committee Chair

Trahan, Jerry

DOI

10.31390/gradschool_dissertations.6011

Share

COinS