MITRE ATT&CK Techniques
Machine Learning Pipeline to Predict MITRE ATT&CK Techniques
Developed a machine learning pipeline to predict MITRE ATT&CK techniques based on categorical features from a cybersecurity dataset. This project combined cybersecurity domain knowledge with supervised ML to explore whether attack attributes (tools, targets, tags) could predict underlying adversarial techniques.
- Preprocessed a noisy, multi-label dataset from Kaggle, fixing inconsistent labels, handling missing targets, and applying dimensionality reduction through binning.
- Engineered a custom preprocessing pipeline with `ColumnTransformer`, `OneHotEncoder`, and a `MultiLabelBinarizer` transformer to handle categorical and list-type features.
- Trained and compared multiple models:
  - Naive Bayes (baseline, poor fit)
  - Linear SGD (one-vs-rest) for high-dimensional classification
  - LightGBM for gradient boosting on large, sparse features
- Conducted exploratory data analysis, revealing skewed distributions and unexpected co-occurrences (e.g., tools like Burp Suite mapping to specific techniques).
- Documented challenges such as RAM overload, training inefficiencies, and imbalance issues, with reflections on future improvements.
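
The preprocessing step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`tool`, `target`, `tags`), the toy rows, and the wrapper class `MultiLabelBinarizerTransformer` are all assumptions made for the example. The wrapper exists because `MultiLabelBinarizer` is designed for targets and does not accept the `fit(X, y)` call that `ColumnTransformer` makes.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder


class MultiLabelBinarizerTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: lets MultiLabelBinarizer encode one list-valued feature column."""

    def fit(self, X, y=None):
        self.mlb_ = MultiLabelBinarizer(sparse_output=True)
        self.mlb_.fit(self._as_lists(X))
        return self

    def transform(self, X):
        # Values unseen during fit are ignored (scikit-learn emits a warning).
        return self.mlb_.transform(self._as_lists(X))

    @staticmethod
    def _as_lists(X):
        # ColumnTransformer passes a one-column DataFrame when the column is
        # selected with a list, and a Series when selected with a plain string.
        col = X.iloc[:, 0] if isinstance(X, pd.DataFrame) else pd.Series(X)
        # Treat missing or non-list entries as "no values".
        return [v if isinstance(v, list) else [] for v in col]


# Toy rows standing in for the real dataset (assumed schema).
df = pd.DataFrame({
    "tool": ["Burp Suite", "Nmap", "Burp Suite"],
    "target": ["web app", "network", "api"],
    "tags": [["xss", "web"], ["scan"], ["web"]],
})

preprocessor = ColumnTransformer([
    # One-hot encode the single-valued categorical columns.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["tool", "target"]),
    # Multi-hot encode the list-valued column.
    ("tags", MultiLabelBinarizerTransformer(), ["tags"]),
])

X = preprocessor.fit_transform(df)
print(X.shape)  # 2 tools + 3 targets + 3 distinct tags = 8 columns
```

Keeping both encoders inside one `ColumnTransformer` means the same fitted object can be reused on held-out data, so categories learned during training stay consistent at prediction time.
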
- Cybersecurity analytics (MITRE ATT&CK framework)
- Machine learning (multi-label classification, high-dimensional sparse features)
- Python ML stack: scikit-learn, LightGBM, pandas, numpy
- Feature engineering, label encoding, and handling imbalanced datasets
- Critical reflection and iteration on preprocessing and modeling
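
As a rough illustration of the one-vs-rest linear SGD setup mentioned in the model comparison, the sketch below trains one binary `SGDClassifier` per technique on a sparse feature matrix. Everything here is assumed for the example: the data is synthetic, the three technique IDs are placeholders chosen only to have plausible labels, and `class_weight="balanced"` is one possible way to address label imbalance, not necessarily what the project used.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

rng = np.random.default_rng(0)

# Synthetic stand-in for the encoded feature matrix (200 samples, 50 binary features).
dense = rng.integers(0, 2, size=(200, 50)).astype(np.float64)
X = sparse.csr_matrix(dense)

# Placeholder technique IDs; each sample's labels are derived from its first
# three features so that the toy problem is learnable.
technique_ids = ["T1059", "T1190", "T1566"]
techniques = [
    [technique_ids[j] for j in range(3) if dense[i, j] == 1]
    for i in range(dense.shape[0])
]

# Turn the per-sample label lists into a binary indicator matrix.
Y = MultiLabelBinarizer(classes=technique_ids).fit_transform(techniques)

# One linear classifier per technique; the default hinge loss gives a linear SVM.
clf = OneVsRestClassifier(
    SGDClassifier(class_weight="balanced", max_iter=1000, tol=1e-3, random_state=0)
)
clf.fit(X, Y)

pred = clf.predict(X)
micro_f1 = f1_score(Y, pred, average="micro")
print(pred.shape, round(micro_f1, 3))
```

Micro-averaged F1 is used here because it aggregates over all label decisions, which is a common choice when labels are imbalanced; a real evaluation would score a held-out split rather than the training data.
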



