A Comparative Study of Machine Learning Models for Classification and Detection of Cybersecurity Threat in Hacking Forum

Document Type

Conference Proceeding

Source of Publication

2024 15th Annual Undergraduate Research Conference on Applied Computing (URC)

Publication Date

4-25-2024

Abstract

This paper investigates the efficacy of machine learning algorithms for cyber threat detection in forum discussions, using Word2Vec, TF-IDF, and GloVe text representations. The methodology covers data pre-processing, model training, and evaluation with widely used algorithms such as Support Vector Machines (SVM), Logistic Regression (LR), Random Forest (RF), XGBoost, LSTM, and Feedforward Neural Networks. The Word2Vec models use semantic relationships between words to create document embeddings, while TF-IDF transforms textual content into numerical features. GloVe embeddings are additionally employed to capture global semantic relationships in the text. The TF-IDF-based SVM emerges as the standout performer, attaining an accuracy of 91% and demonstrating improved handling of imbalanced classes. The dataset comprises 1966 records, which form the basis for analysis and experimentation. These findings contribute to the ongoing discourse on effective text classification methodologies in cybersecurity.
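
To illustrate how TF-IDF turns forum text into numerical features, the following is a minimal pure-Python sketch, not the authors' implementation. The function name `tfidf`, the whitespace tokenisation, and the three sample posts are assumptions made for illustration. The smoothed IDF formula and the L2 row normalisation are one common convention, chosen here as an assumption; the abstract does not state which weighting variant the paper used.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows): one L2-normalised TF-IDF vector per document.

    Hypothetical helper for illustration only; tokenisation is a plain
    lowercase whitespace split.
    """
    tokenised = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenised for t in doc})
    n = len(tokenised)

    # Document frequency: number of documents each term appears in.
    df = Counter(t for doc in tokenised for t in set(doc))

    # Smoothed inverse document frequency (assumed variant).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}

    rows = []
    for doc in tokenised:
        tf = Counter(doc)  # raw term counts; missing terms count as 0
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # guard empty docs
        rows.append([v / norm for v in row])
    return vocab, rows


# Made-up sample posts, not taken from the paper's dataset.
posts = [
    "selling fresh exploit kit",
    "how to reset my router password",
    "exploit for router firmware",
]
vocab, matrix = tfidf(posts)
```

Each row of `matrix` is a feature vector that a classifier such as an SVM could consume. Terms that occur in fewer posts (here "selling") receive a higher weight than terms shared across posts (here "exploit").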

ISBN

979-8-3315-2734-1

Publisher

IEEE

Volume

00

First Page

1

Last Page

6

Disciplines

Computer Sciences

Keywords

Cyber threat detection, Machine learning models, Hacking forum, TF-IDF, Word2Vec

Indexed in Scopus

no

Open Access

no
