Analysis and Comparison of Machine Learning Techniques for DDoS Attack Classification in Network Environments

This research presents a comparative analysis of machine learning techniques for classifying Distributed Denial of Service (DDoS) attacks within network traffic. We evaluated the performance of three algorithms: Logistic Regression, Decision Tree, and Random Forest, including their scaled-feature counterparts. The study utilized a robust methodology incorporating advanced data preprocessing, feature engineering, and Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance. The models were rigorously tested using a cross-validation framework, assessing their accuracy, precision, recall, and F1 score. Results indicated that the Random Forest algorithm outperformed the others, demonstrating superior predictive accuracy and consistency, albeit with higher computational costs. Logistic Regression, when feature-scaled, showed significant improvement in performance, highlighting the importance of data normalization in models sensitive to feature scaling. Decision Trees provided a quick and interpretable model, though slightly less accurate than the Random Forest. The research findings highlight the trade-offs between predictive performance and computational efficiency in selecting machine learning models for cybersecurity applications. The study contributes to the cybersecurity domain by elucidating the efficacy of ensemble techniques in DDoS attack classification and underscores the potential for model improvement through scaling and data balancing.


Introduction
As the digital landscape continues to evolve, cybersecurity threats such as Distributed Denial of Service (DDoS) attacks are becoming increasingly sophisticated and damaging [1], [2], [3]. These attacks disrupt essential online services by flooding networks with excessive traffic, posing significant threats to the stability and security of digital infrastructures [4], [5], [6]. The complexity and dynamic nature of these attacks necessitate advanced detection and mitigation strategies [7], [8], [9]. This research is situated within this context, aiming to enhance the detection and classification of DDoS attacks using innovative machine learning approaches.
The literature on DDoS attack detection is vast and diverse. Several studies have explored various machine learning techniques for network anomaly detection, providing valuable insights but often limited by static modeling approaches [10], [11], [12]. Other research has emphasized the potential of deep learning methods, although these require extensive computational resources and large datasets [13]. Another work investigated classical machine learning models, highlighting their efficiency but pointing out their limitations in handling the complex, high-dimensional data typical of network traffic [14]. Further work addressed the issue of imbalanced datasets prevalent in network security, proposing various sampling techniques to enhance model performance [15].
The current state of the art incorporates a range of techniques, from traditional machine learning models to more recent deep learning frameworks [16], [17]. Despite the progress, there remains a crucial gap in comprehensively comparing the effectiveness of different machine learning models, particularly in the context of DDoS attack classification. Many studies focus on a single model or a specific aspect of the classification problem, lacking a holistic approach that considers various models under uniform experimental conditions.
Addressing this gap, our research makes several key contributions. The first is extensive data preprocessing and feature engineering: we apply a series of advanced data preprocessing techniques to refine the network traffic data, ensuring high-quality inputs for model training, and our feature engineering strategy enhances the models' capacity to distinguish between normal traffic and DDoS attacks. The second is the implementation of diverse machine learning models. A central aspect of our study is a comparative analysis of three widely used machine learning models: Logistic Regression, Decision Trees, and Random Forest. This comparison provides a comprehensive view of their performance in the context of DDoS attack classification.
We also employ PCA for effective dimensionality reduction, allowing us to manage the complex nature of network data. In addition, our use of SMOTE addresses the challenge of class imbalance, a common issue in network security datasets [18], [19], [20]. Finally, we conduct an in-depth comparative evaluation, analyzing the models on accuracy, precision, recall, and F1 score. This comparative study helps identify the most effective model under various scenarios, contributing to the field of network security.
This study's implications extend beyond the immediate realm of DDoS attack classification. By providing a detailed comparison of different machine learning models, our research contributes to the broader understanding of their applicability in cybersecurity. The methodologies and insights gained can be applied to other areas of network security, potentially aiding in the development of more robust and adaptable defense mechanisms. Furthermore, our work sets a foundation for future research, encouraging further exploration into the comparative analysis of machine learning models in cybersecurity.

Data Collection and Preprocessing
The dataset used in this study was sourced from Kaggle and contains network logs pertinent to DDoS attacks. The 'DDoS Attack Network Logs' dataset comprises various network attributes that are key to discerning traffic patterns indicative of attacks. The initial step involved loading the dataset using the ArffLoader function, which handles the ARFF (Attribute-Relation File Format) commonly used for machine learning datasets. After loading, the data was structured into a Pandas DataFrame, a format that is conducive to data manipulation in Python and offers a wide array of functionalities for data analysis. Next, the dataset was separated into feature columns ('df_X') and a target column ('df_target'). This separation is a cornerstone of supervised learning, where the model learns to predict the target variable from the features. The dataset contained byte sequences, which were converted to strings for uniformity and ease of processing; this decoding step is needed before the categorical data can be handled in subsequent analysis. Specific columns were then cast to integer and object types to maintain consistency with Python's data processing libraries. We also checked for missing values and duplicate entries to ensure data integrity, since unhandled missing values can cause inaccuracies during model training.
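The loading and cleaning steps above can be sketched as follows. This is a minimal illustration, not the study's code: it uses SciPy's `scipy.io.arff.loadarff` in place of the ArffLoader function named in the text, and the tiny in-memory ARFF file, its attribute names, and the target column name `PKT_CLASS` are hypothetical stand-ins for the Kaggle data.

```python
import io

import pandas as pd
from scipy.io import arff

# Hypothetical miniature ARFF file; the real dataset has different attributes.
ARFF_TEXT = """@relation ddos_demo
@attribute PKT_SIZE numeric
@attribute FLAGS {SYN,ACK}
@attribute PKT_CLASS {Normal,Attack}
@data
60,SYN,Attack
1500,ACK,Normal
60,SYN,Attack
?,ACK,Normal
"""

# loadarff accepts a file name or a file-like object and returns (records, metadata).
records, meta = arff.loadarff(io.StringIO(ARFF_TEXT))
df = pd.DataFrame(records)

# Nominal ARFF attributes arrive as byte strings; decode them to ordinary strings.
for col in df.columns:
    if df[col].map(lambda value: isinstance(value, bytes)).all():
        df[col] = df[col].str.decode("utf-8")

# Integrity checks: count, then drop, rows with missing values and exact duplicates.
n_missing = int(df.isna().any(axis=1).sum())
n_duplicates = int(df.duplicated().sum())
df = df.dropna().drop_duplicates().reset_index(drop=True)

# Cast a numeric column to integer once no missing values remain.
df["PKT_SIZE"] = df["PKT_SIZE"].astype(int)

# Feature-target separation, using the names from the text.
df_X = df.drop(columns=["PKT_CLASS"])
df_target = df["PKT_CLASS"]
```

Dropping incomplete rows is only one way to handle missing values; the text does not state which strategy was used, so imputation would be an equally plausible reading.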

Feature Engineering
Feature engineering is a pivotal step where domain knowledge is leveraged to extract and optimize features from raw data. First, the categorical values in the 'FLAGS' column were replaced with numerical codes, as most machine learning algorithms require numerical inputs. Second, one-hot encoding was employed to process categorical variables, converting them into a format amenable to machine learning algorithms and thus aiding in improving model accuracy.
Third, to address data skewness, logarithmic transformations were applied to various packet-related features. This approach is often effective in normalizing data distributions, enhancing the performance of learning algorithms.
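A minimal sketch of these three steps with pandas and NumPy is given below. Only the 'FLAGS' column name comes from the text; the other column names and all values are hypothetical. The sketch uses `np.log1p`, i.e. log(1 + x), rather than a plain logarithm so that zero-valued counts stay finite, which is an assumption on our part and not something the text specifies.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; only the FLAGS column name is taken from the text.
df = pd.DataFrame(
    {
        "FLAGS": ["SYN", "ACK", "SYN", "FIN"],
        "NODE_NAME": ["router", "switch", "router", "client"],
        "BYTE": [40, 1500, 0, 52_000],
    }
)

# 1) Replace the categorical FLAGS values with integer codes.
flag_codes = {flag: code for code, flag in enumerate(sorted(df["FLAGS"].unique()))}
df["FLAGS"] = df["FLAGS"].map(flag_codes)

# 2) One-hot encode another categorical column: one 0/1 column per category.
df = pd.get_dummies(df, columns=["NODE_NAME"], dtype=int)

# 3) Log-transform a skewed, non-negative feature; log1p keeps zeros finite.
df["BYTE"] = np.log1p(df["BYTE"])
```

Integer codes impose an arbitrary ordering on the categories, which tree-based models tolerate well but linear models may misread; one-hot encoding avoids that at the cost of more columns.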

Dimensionality Reduction and Class Imbalance
PCA (Principal Component Analysis) was employed to mitigate the challenge of high-dimensional data. PCA reduces the dimensionality while retaining most of the data variance, which simplifies the dataset. To address the class imbalance prevalent in network security datasets, SMOTE (Synthetic Minority Oversampling Technique) was utilized. This technique synthesizes new samples from the minority class, balancing the dataset for training.
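Both ideas can be illustrated with a short NumPy sketch. It is a simplified illustration under stated assumptions, not the study's implementation: `pca_reduce` and `smote_like` are hypothetical helper names, the data are random, `smote_like` implements only the core interpolation step of SMOTE, and applying the oversampling before PCA is an arbitrary choice because the text does not state the order.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X onto its leading principal components via SVD of the centered data."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]


def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core idea of SMOTE)."""
    rng = np.random.default_rng(seed)
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


# Hypothetical imbalanced data: 40 majority rows and 8 minority rows, 5 features.
rng = np.random.default_rng(42)
X_majority = rng.normal(0.0, 1.0, size=(40, 5))
X_minority = rng.normal(3.0, 1.0, size=(8, 5))

X_synth = smote_like(X_minority, n_new=32)
X_balanced = np.vstack([X_majority, X_minority, X_synth])
y_balanced = np.array([0] * 40 + [1] * 40)

X_reduced, explained = pca_reduce(X_balanced, n_components=2)
```

In practice, oversampling should be applied to the training folds only, so that synthetic samples never leak into the data used for evaluation.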

Model Implementation
In the model implementation phase of our research, we deployed three distinct machine learning algorithms, each chosen for its relevance and potential in addressing the classification challenges posed by DDoS attack detection in network traffic data.
Logistic Regression was the first of the three models we implemented. As a probabilistic linear classifier, Logistic Regression is traditionally prized for its simplicity and interpretability. In our implementation, the model handles the binary classification task by modeling the log-odds of the probability of an attack as a linear combination of the input features. This required careful consideration of the feature space and of the relationships between the features and the probability of an attack, to ensure that the logistic function's output could be effectively thresholded to distinguish between the two classes of interest.
Decision Trees were selected for their intuitive representation of decision-making processes, mirroring if-then-else decision rules that can be readily understood. To implement the Decision Tree, we constructed a flowchart-like structure that recursively splits the data into homogeneous subsets. This was achieved by identifying, at each node, the feature that resulted in the most significant reduction in a given impurity measure (such as Gini impurity or entropy). Given the diverse nature of network traffic data, the Decision Tree was fine-tuned to prevent overfitting while maintaining sufficient complexity to capture the underlying patterns indicative of DDoS attacks.
Random Forest was the final model we implemented, chosen for its robustness and accuracy resulting from its ensemble approach. By integrating multiple Decision Trees, each trained on a different subset of the data and features, the Random Forest model mitigates the overfitting tendencies of individual trees. In our study, the Random Forest model was composed of numerous trees whose predictions were aggregated through majority voting for classification. We adjusted the number of trees and the depth of each tree to optimize the trade-off between model bias and variance, ensuring that the model captured the essential characteristics of the data without being swayed by noise.
Each model was implemented with a particular emphasis on optimizing for the idiosyncrasies of our dataset, which included an imbalance between the classes and a wide range of feature scales. We employed techniques such as feature scaling and class weighting to tailor each model to the dataset's specific characteristics and requirements. Through rigorous hyperparameter tuning and validation, we ensured that each model achieved a high level of performance while avoiding the pitfalls of overfitting or underfitting. The outcome of this careful implementation was a set of models that were well-suited to our data and capable of providing insights into the nature of DDoS attacks within network traffic.
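One way to set up the three models and their scaled counterparts is sketched below with scikit-learn. The labels follow the "_SC" naming used later in the text (the unscaled labels LOGI and RFC are inferred from it); `build_models` is a hypothetical helper, and every hyperparameter value is an illustrative assumption, since the tuned settings are not reported.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def build_models(seed=0):
    """Return each classifier twice: on raw features and behind a StandardScaler."""
    factories = {
        # Hyperparameters are placeholders, not the values used in the study.
        "LOGI": lambda: LogisticRegression(max_iter=1000),
        "DCT": lambda: DecisionTreeClassifier(max_depth=10, random_state=seed),
        "RFC": lambda: RandomForestClassifier(
            n_estimators=100, max_depth=10, random_state=seed
        ),
    }
    models = {}
    for name, make in factories.items():
        models[name] = make()
        # The pipeline fits the scaler on the training data only, then the model.
        models[f"{name}_SC"] = make_pipeline(StandardScaler(), make())
    return models


# Synthetic stand-in data, used only to show that the models train and predict.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
models = build_models()
predictions = {name: model.fit(X, y).predict(X) for name, model in models.items()}
```

Wrapping the scaler in a pipeline keeps the scaling statistics inside each training fold during cross-validation, which avoids leaking information from the held-out fold.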

Evaluation Metrics and Procedures
Model evaluation was conducted using cross-validation, a robust technique for assessing model performance on unseen data. We employed a range of metrics, including accuracy, precision, recall, and F1 score, to comprehensively evaluate each model. For categorical variables, one-hot encoding transforms a categorical variable with 'n' categories into 'n' binary features, each representing one category. For a sample in category 'c', the feature corresponding to 'c' is 1, and all other features are 0.
For a variable x, the logarithmic transformation is given by y = log(x), where 'log' is the natural logarithm. This is applied to skewed features to normalize their distribution. Furthermore, PCA involves computing the eigenvalue decomposition of the data covariance matrix or the singular value decomposition of the data matrix, usually after mean-centering the data for each attribute. The principal components are the eigenvectors of this covariance matrix. To measure the effectiveness of the algorithms, we used four metrics: accuracy, precision, recall, and F1 score, as described in Equations (1)-(4).
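The equations themselves do not appear in this text. In their standard form, which Equations (1)-(4) are assumed to follow, the four metrics are defined from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), taking a DDoS attack as the positive class (also an assumption):

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \tag{1}\\
\text{Precision} &= \frac{TP}{TP + FP} \tag{2}\\
\text{Recall}    &= \frac{TP}{TP + FN} \tag{3}\\
F_1 &= 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}
\end{align}
```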

Logistic Regression
Logistic Regression is commonly used for binary classification problems. It models the probability of a binary outcome using the logistic function, as presented in Equation (5).
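The equation is not reproduced in this text. In its standard form, which Equation (5) is assumed to follow, the logistic function maps a linear combination of the features to a probability; the symbols x (feature vector), β₀ (intercept), and β (coefficient vector) are our notation:

```latex
\begin{equation}
P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-z}},
\qquad z = \beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x} \tag{5}
\end{equation}
```

Here z is the log-odds of the positive class, so thresholding the probability at 0.5 is equivalent to checking the sign of z.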

Decision Tree
Decision Trees are a non-parametric supervised learning method used for classification and regression. The tree structure represents decisions based on feature values. A decision tree splits the data into subsets based on the value of input features. This splitting is repeated recursively, forming a tree structure. Commonly used criteria for determining splits are Gini Impurity and Information Gain (Entropy). These concepts are described in Equations (6) and (7).
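The equations are not reproduced in this text. In their standard form, which Equations (6) and (7) are assumed to follow, the two impurity measures for a set S with K classes (K is our notation) are:

```latex
\begin{align}
\text{Gini}(S)    &= 1 - \sum_{k=1}^{K} p_k^{2} \tag{6}\\
\text{Entropy}(S) &= -\sum_{k=1}^{K} p_k \log_2 p_k \tag{7}
\end{align}
```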
Where pₖ is the proportion of samples that belong to class k in the set S.

Random Forest
Random Forest is an ensemble learning method for classification and regression, which operates by constructing a multitude of decision trees at training time. The output of the Random Forest is determined by the aggregate of the predictions made by individual trees. Random Forest employs a technique known as Bootstrap Aggregating, or Bagging. This method involves creating multiple subsets of the original dataset with replacement, known as bootstrap samples. Each tree in the Random Forest is trained on one of these bootstrap samples. Mathematically, given a dataset D of size N, a bootstrap sample is a subset Dᵢ (also of size N) sampled with replacement from D. This process is repeated to create as many datasets as there are trees in the forest.
Each decision tree in the Random Forest is constructed using a subset of features chosen at random at each split.
If there are M features, a number m (where m << M) is specified such that, at each split in the tree, m features are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.
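The two sources of randomness described above, bootstrap sampling of rows and random selection of m features per split, can be sketched with NumPy. The helper names and the sizes N = 1000 and M = 16 are hypothetical, and choosing m as the square root of M is a common default for classification rather than a setting reported in the text.

```python
import numpy as np


def bootstrap_sample(n_rows, rng):
    """Row indices of one bootstrap sample: n_rows draws with replacement."""
    return rng.integers(0, n_rows, size=n_rows)


def random_feature_subset(n_features, m, rng):
    """Indices of the m distinct candidate features examined at one split."""
    return rng.choice(n_features, size=m, replace=False)


rng = np.random.default_rng(0)
N, M = 1000, 16          # hypothetical dataset size and feature count
m = int(np.sqrt(M))      # assumed choice of m; here m = 4

row_idx = bootstrap_sample(N, rng)
# Because rows are drawn with replacement, only about 63% of them are distinct.
unique_fraction = len(np.unique(row_idx)) / N
split_features = random_feature_subset(M, m, rng)
```

The rows left out of a given bootstrap sample (roughly 37% on average) are the "out-of-bag" rows, which can serve as a built-in validation set for that tree.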
For regression, the prediction of the Random Forest is given by averaging the predictions of all the individual trees. Mathematically, if h(x, Θᵢ) is the prediction of the i-th tree, then the Random Forest prediction, H(x), for a given input x is described in Equation (8).
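The equation is not reproduced in this text. Following the description of averaging over trees, Equation (8) is assumed to take the standard form:

```latex
\begin{equation}
H(x) = \frac{1}{N} \sum_{i=1}^{N} h(x, \Theta_i) \tag{8}
\end{equation}
```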
In Equation (8), N is the number of trees (not the dataset size used above), and Θᵢ represents the parameters of the i-th tree. For classification, the output is the class selected by most trees (majority voting). Each tree gives a 'vote' for a class, and the class with the most votes is chosen as the final prediction.

Result and Discussion
Logistic Regression appears as the first set of boxplots. Logistic Regression is a statistical model that, in this context, predicts the probability of a binary outcome. The boxplots indicate that accuracy has a median around 0.92, suggesting that, in a typical run, the model correctly predicts the outcome about 92% of the time. However, there is notable variability in the accuracy, as evidenced by outliers. These outliers could indicate cases where the model's performance deviated significantly from the median. Precision and recall both have medians close to 0.90. Precision measures the model's accuracy in predicting positive labels, while recall assesses how well the model captures actual positive instances. The similarity of these medians suggests a balanced trade-off between these two metrics. The F1 score, which combines precision and recall into a single metric as their harmonic mean, reflects this balance with a similar median. When we consider the scaled variants of these metrics (indicated with an "_SC" suffix), the median values are slightly higher, and the interquartile ranges (IQRs) are tighter. This indicates an improvement in model performance when feature scaling is applied, which is common with Logistic Regression as it can be sensitive to the scale of input variables.
Moving to the Decision Tree model, the boxplots demonstrate a consistently high median performance across all metrics, with accuracy peaking around 0.95. This suggests that the Decision Tree model, which works by partitioning the data into subsets based on feature value thresholds, is highly effective at classification in this case. The high precision and recall indicate that the Decision Tree makes accurate predictions and is good at capturing the majority of the relevant cases. Notably, there is minimal variation in performance with scaling, as the boxplots for the scaled and unscaled metrics are closely aligned. This is indicative of the Decision Tree model's robustness to feature scaling, as it does not rely on distance calculations that can be affected by the scale of the data.
Lastly, the Random Forest model, which is an ensemble of Decision Trees, shows the highest median scores and the least variability among the three models. The Random Forest boxplots show medians around 0.96 for accuracy, 0.95 for precision, and similarly high values for recall and F1 score. The tight IQRs across these metrics suggest that the Random Forest model is not only accurate but also consistent in its predictions across different iterations. This is typical of Random Forests, which tend to perform well on a variety of datasets by reducing overfitting through averaging the predictions of multiple trees. Similar to the Decision Tree, the Random Forest model does not show significant changes in performance with feature scaling.
Random Forest appears to be the best performing model among the three, with the highest median values and tightest IQRs for all metrics, which indicates not only high performance but also consistency across runs.
The Decision Tree shows very high performance as well, though slightly lower than Random Forest. Logistic Regression has the lowest median scores among the three models. However, its performance improves with feature scaling, which indicates that Logistic Regression is more sensitive to the scale of the data. Scaling does not significantly impact the performance of Decision Trees and Random Forests, which is consistent with these models splitting on per-feature thresholds rather than relying on feature magnitudes. Outliers in the boxplots suggest that there are runs where the models either significantly overperform or underperform compared to the median, which could be due to the variability in the data or the randomness in the training process.
The computation time of the models is shown in Figure 4. While the Random Forest model outperforms the other two in terms of predictive metrics, it requires significantly more computational time, which might be a consideration in practical applications. Logistic Regression, after scaling, shows improved performance, and its quick computation makes it an attractive model for situations where speed is a critical factor. Decision Trees offer a good balance between speed and performance, with feature scaling not significantly affecting their results. The choice between these models would ultimately depend on the specific requirements of the application, including the acceptable trade-off between accuracy and computational resources. The test harness results are shown in Figure 5.
In terms of the test harness function itself, the usage of cross-validation (CV) with five folds suggests a robust evaluation methodology. In CV, the dataset is split into five parts, and the model is trained and tested five times, with each part being used as the test set once. This method provides a thorough assessment of the model's performance and generalizability to new data. From these execution times, one can infer that while Logistic Regression and Decision Trees are quicker to train and evaluate, Random Forests take a considerably longer time. This trade-off between time and predictive performance is a common consideration in machine learning. Practitioners must decide whether the improvement in prediction accuracy with Random Forest is worth the additional computation time, which may be a critical factor in real-time applications or when working with very large datasets. The information from the test_harness function is valuable for understanding not only the performance of the models but also the computational demands they place on the system. Such insights are crucial when it comes to selecting the right model for deployment in production environments, where both accuracy and efficiency need to be balanced according to the application's requirements.
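A timed five-fold evaluation of this kind can be sketched with scikit-learn's `cross_validate`. The function `run_harness` below is a hypothetical stand-in for the test_harness function mentioned above, whose actual code is not shown in the text; the synthetic data and model settings are likewise assumptions.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

SCORING = ["accuracy", "precision", "recall", "f1"]


def run_harness(model, X, y, n_splits=5):
    """Run n_splits-fold cross-validation; return per-fold scores and wall time."""
    start = time.perf_counter()
    # For classifiers, an integer cv uses stratified folds by default.
    scores = cross_validate(model, X, y, cv=n_splits, scoring=SCORING)
    elapsed = time.perf_counter() - start
    return {metric: scores[f"test_{metric}"] for metric in SCORING}, elapsed


# Synthetic stand-in data; the real study used the preprocessed network logs.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "LOGI_SC": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "DCT": DecisionTreeClassifier(random_state=0),
}
results = {name: run_harness(model, X, y) for name, model in candidates.items()}
```

Each entry of `results` holds five scores per metric, which is the kind of per-fold distribution a boxplot summarizes, alongside the elapsed time for that model.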

Conclusion
Our findings reveal a nuanced landscape of model efficacy and computational efficiency. The Logistic Regression model demonstrated admirable predictive performance, particularly when feature scaling was applied, suggesting its utility in scenarios where model interpretability and operational speed are paramount. The Decision Tree model offered a compelling balance between speed and performance, reinforcing its reputation as a versatile and interpretable classifier. However, it was the Random Forest model that emerged as the superior performer in terms of accuracy, precision, recall, and F1 score, albeit with significantly higher computational demands. The scaled versions of these models (denoted with "_SC"), particularly for Logistic Regression, hinted at the importance of feature normalization in enhancing model predictions. Notably, such scaling did not markedly affect the tree-based models, underscoring their inherent robustness to feature magnitude variations.
Reflecting on our methodological approach, the use of cross-validation provided a comprehensive understanding of model generalizability, while the Synthetic Minority Oversampling Technique (SMOTE) addressed the common issue of class imbalance in cybersecurity datasets. Despite the strengths of our research, we acknowledge certain limitations. The scope of computational resources and the potential for model tuning were not exhaustively explored, which could yield further improvements in model performance. Moreover, the dynamic and evolving nature of DDoS attack patterns necessitates ongoing model adaptation and validation.
For future work, we recommend the exploration of hybrid models and deep learning architectures, which may uncover new dimensions of predictive accuracy.

Figures 1-3. Boxplots comparing the performance metrics of three machine learning models: Logistic Regression, Decision Tree, and Random Forest. Each boxplot shows the distribution of a specific metric (accuracy, precision, recall, F1 score) across multiple runs of the model.

Figure 4. Computation Time of Models
In terms of computational efficiency, Logistic Regression is fast, with times ranging from approximately 6.85 to 8.53 seconds. The Decision Tree model is faster still, with times around 2.70 to 3.71 seconds. The Random Forest model, however, takes considerably longer, ranging from about 44.90 to 48.95 seconds, which is expected given that it builds multiple trees and combines their results.

Figure 5. Test Harness
As presented in Figure 5, the logs indicate multiple runs of a Logistic Regression model, with and without feature scaling (the scaled variant is denoted LOGI_SC). The times recorded for these runs show that the model takes, on average, about 8 seconds to complete the test harness function. Logistic Regression is a foundational machine learning algorithm that models the probability of a binary outcome. It is generally favored for its simplicity and efficiency, especially in cases where the relationship between the independent variables and the log-odds of the binary outcome is approximately linear. The scaled version (LOGI_SC) implies that the data has been standardized or normalized to improve the model's performance, which can be particularly beneficial for Logistic Regression as it relies on the optimization of a loss function that can converge faster when features are on the same scale.
Decision Trees are recorded next, also in both scaled (DCT_SC) and unscaled forms. These models are significantly faster, completing the test harness in roughly 2.7 seconds. This efficiency stems from the Decision Tree's flowchart-like structure, where binary decisions are made at each node, leading to a final classification at the leaves. Decision Trees are highly interpretable and do not require feature scaling, which is consistent with the similar execution times observed for DCT and DCT_SC.
Finally, the Random Forest models, which are ensembles of Decision Trees, show a much higher execution time, averaging around 48 seconds. This substantial increase in time is expected due to the complexity of Random Forest, which builds multiple Decision Trees on various sub-samples of the dataset and averages their predictions. The nature of this algorithm makes it robust to overfitting and generally more accurate than a single Decision Tree, at the cost of increased computational complexity. Similar to Decision Trees, Random Forests do not inherently benefit from feature scaling (RFC_SC), as evidenced by the consistent execution times regardless of scaling.