MITRE ATT&CK Techniques

Machine Learning Pipeline to Predict MITRE ATT&CK Techniques

Developed a machine learning pipeline to predict MITRE ATT&CK techniques based on categorical features from a cybersecurity dataset. This project combined cybersecurity domain knowledge with supervised ML to explore whether attack attributes (tools, targets, tags) could predict underlying adversarial techniques.

Key Contributions:

  • Preprocessed a noisy, multi-label dataset from Kaggle, fixing inconsistent labels, handling missing targets, and applying dimensionality reduction through binning.
  • Engineered a custom preprocessing pipeline with ColumnTransformer, OneHotEncoder, and a MultiLabelBinarizer transformer to handle categorical and list-type features.
  • Trained and compared multiple models:
    • Naive Bayes (baseline, poor fit)
    • Linear SGD (one-vs-rest) for high-dimensional classification
    • LightGBM for gradient boosting on large, sparse features
  • Conducted exploratory data analysis, revealing skewed distributions and unexpected co-occurrences (e.g., tools like Burp Suite mapping to specific techniques).
  • Documented challenges such as RAM overload, training inefficiencies, and imbalance issues, with reflections on future improvements.
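
The preprocessing and one-vs-rest steps listed above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the column names (tool, target, tags), the toy rows, and the ListBinarizer wrapper class are all assumptions made up for the example, and SGDClassifier is left at its default settings.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder


class ListBinarizer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper so MultiLabelBinarizer can encode one list-valued
    feature column inside a ColumnTransformer (it normally encodes targets)."""

    def fit(self, X, y=None):
        self.mlb_ = MultiLabelBinarizer(sparse_output=True)
        self.mlb_.fit(self._as_lists(X))
        return self

    def transform(self, X):
        return self.mlb_.transform(self._as_lists(X))

    @staticmethod
    def _as_lists(X):
        # Assumes X is a one-column DataFrame, which is what ColumnTransformer
        # passes when the column selection is given as a list of names.
        return [list(v) for v in X.iloc[:, 0]]


# Toy rows invented for illustration; the real Kaggle columns may differ.
X = pd.DataFrame({
    "tool": ["burp", "nmap", "burp", "hydra"],
    "target": ["web", "network", "web", "ssh"],
    "tags": [["scan", "proxy"], ["scan"], ["proxy"], ["bruteforce"]],
})
y_raw = [["T1190"], ["T1046"], ["T1190", "T1046"], ["T1110"]]

# Multi-label targets become a binary indicator matrix, one column per technique.
label_binarizer = MultiLabelBinarizer()
y = label_binarizer.fit_transform(y_raw)

preprocess = ColumnTransformer([
    # Plain categorical columns: one-hot, ignoring categories unseen at fit time.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["tool", "target"]),
    # List-type column: one indicator column per distinct tag.
    ("tags", ListBinarizer(), ["tags"]),
])

# One-vs-rest trains one linear SGD classifier per technique label.
model = Pipeline([
    ("prep", preprocess),
    ("clf", OneVsRestClassifier(SGDClassifier(random_state=0))),
])
model.fit(X, y)

pred = model.predict(X)  # indicator matrix with the same layout as y
n_features = model.named_steps["prep"].transform(X).shape[1]
```

Keeping the encoders inside a single Pipeline means the same fitted transformations are applied at prediction time, and the sparse outputs of both encoders help contain memory use on high-cardinality columns.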

Skills Demonstrated:

  • Cybersecurity analytics (MITRE ATT&CK framework)
  • Machine learning (multi-label classification, high-dimensional sparse features)
  • Python ML stack: scikit-learn, LightGBM, pandas, numpy
  • Feature engineering, label encoding, and handling imbalanced datasets
  • Critical reflection and iteration on preprocessing and modeling