2024 / Securing Finances

Fraud Detection

01 / Overview

A machine learning-based fraud detection system trained on real-world e-commerce transaction data from Vesta Corporation, achieving 91% ROC-AUC on the validation set through feature engineering and gradient boosting models.

Modern e-commerce platforms process millions of transactions daily and need fraud detection that balances security with user experience. This project targets the central challenge of minimizing false positives, where legitimate transactions are incorrectly flagged, while keeping detection accuracy high for fraudulent activity. It uses Vesta Corporation's real-world dataset from the IEEE-CIS Fraud Detection competition: transaction and identity data are merged into a table of 400+ features, then passed through preprocessing, PCA-based dimensionality reduction, and gradient boosting models suited to highly imbalanced data.

02 / Process
01

Data Integration & Preprocessing

Merged the transaction and identity datasets using TransactionID as the join key. Imputed missing values with sentinel placeholders (-999 for numerical columns, '-999' for categorical columns). Removed outliers in TransactionAmt by z-score thresholding, dropping rows with |z| > 3. Reduced memory usage by downcasting numerical columns, and encoded categorical variables with LabelEncoder.
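The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the two tables are tiny synthetic stand-ins, the column names (card4, DeviceType, id_01) are borrowed from the IEEE-CIS schema for flavor, and the left join and the ordering of the steps are guesses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the transaction and identity tables (not real data).
transaction = pd.DataFrame({
    "TransactionID": np.arange(n),
    "TransactionAmt": rng.lognormal(mean=4.0, sigma=0.5, size=n),
    # Roughly 10% of card4 values are blanked out to simulate missing data.
    "card4": pd.Series(rng.choice(["visa", "mastercard"], size=n)).where(rng.random(n) > 0.1),
    "isFraud": rng.integers(0, 2, size=n),
})
transaction.loc[0, "TransactionAmt"] = 1e6  # one planted extreme amount
identity = pd.DataFrame({
    "TransactionID": np.arange(0, n, 2),  # only half the transactions have identity rows
    "DeviceType": rng.choice(["mobile", "desktop"], size=n // 2),
    "id_01": rng.normal(size=n // 2),
})

# 1. Merge on TransactionID (a left join is assumed so no transaction is dropped).
df = transaction.merge(identity, on="TransactionID", how="left")
mem_before = df.memory_usage(deep=True).sum()

# 2. Fill missing values with the sentinel placeholders.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns
df[num_cols] = df[num_cols].fillna(-999)
df[cat_cols] = df[cat_cols].fillna("-999")

# 3. Drop rows whose TransactionAmt lies more than 3 standard deviations from the mean.
amt = df["TransactionAmt"]
z = (amt - amt.mean()) / amt.std()
df = df[z.abs() <= 3].copy()

# 4. Label-encode the categorical columns.
for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# 5. Downcast every (now numeric) column to the smallest dtype that fits.
for col in df.columns:
    kind = "integer" if pd.api.types.is_integer_dtype(df[col]) else "float"
    df[col] = pd.to_numeric(df[col], downcast=kind)

mem_after = df.memory_usage(deep=True).sum()
print(f"rows kept: {len(df)} of {n}; memory: {mem_before} -> {mem_after} bytes")
```

One caveat with this ordering: if TransactionAmt itself contained missing values, the -999 sentinel would distort its mean and standard deviation, so the z-score filter is better computed before imputation in that case.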

02

Feature Engineering & Dimensionality Reduction

Extracted 15+ engineered features, including email domain suffixes, operating system types, browser classifications, and device characteristics. Applied PCA to the high-dimensional V-columns (339 features), reducing them to 9 principal components while retaining 90% of the explained variance. Created temporal features capturing transaction patterns by day of week and hour of day.
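A small sketch of three of these ideas (email suffix extraction, temporal features, and variance-targeted PCA) is shown below. It runs on synthetic data, so the number of components it keeps will not match the 9 reported above. The assumptions are labeled in the comments: in particular, treating TransactionDT as a count of seconds from an arbitrary reference point, standardizing the V-columns before PCA, and the output names P_email_suffix, hour, day_of_week, and V_pca_* are illustrative choices rather than confirmed details of the project.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500

# Synthetic frame; column names follow the IEEE-CIS schema, values are made up.
df = pd.DataFrame({
    "TransactionDT": rng.integers(86_400, 86_400 * 30, size=n),
    "P_emaildomain": rng.choice(["gmail.com", "yahoo.co.uk", "outlook.com"], size=n),
})

# Email domain suffix: the text after the last dot ("yahoo.co.uk" -> "uk").
df["P_email_suffix"] = df["P_emaildomain"].str.rsplit(".", n=1).str[-1]

# Temporal features, ASSUMING TransactionDT counts seconds from a reference point.
# The day-of-week index is relative to that unknown reference, not a calendar weekday.
df["hour"] = (df["TransactionDT"] // 3_600) % 24
df["day_of_week"] = (df["TransactionDT"] // 86_400) % 7

# Synthetic, correlated stand-ins for the V-columns (40 here instead of 339).
latent = rng.normal(size=(n, 5))
loadings = rng.normal(size=(5, 40))
v_values = latent @ loadings + 0.3 * rng.normal(size=(n, 40))
v_cols = [f"V{i}" for i in range(1, 41)]
v_frame = pd.DataFrame(v_values, columns=v_cols)

# Standardize, then let PCA pick the fewest components reaching 90% explained variance.
scaled = StandardScaler().fit_transform(v_frame)
pca = PCA(n_components=0.90, svd_solver="full")
components = pca.fit_transform(scaled)

pca_cols = [f"V_pca_{i}" for i in range(components.shape[1])]
df = pd.concat([df, pd.DataFrame(components, columns=pca_cols)], axis=1)

print(f"{len(v_cols)} V-columns -> {pca.n_components_} components, "
      f"{pca.explained_variance_ratio_.sum():.1%} variance retained")
```

Passing a float between 0 and 1 as n_components asks scikit-learn to choose the component count from the cumulative explained variance, which avoids hard-coding a number that depends on the data.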

03

Model Development & Validation

Trained and compared gradient boosting models (XGBoost, LightGBM) with stratified cross-validation. Tuned hyperparameters including learning rate (0.05) and max depth (6), with early stopping after 10 rounds without improvement. Evaluated with AUC-ROC, precision, recall, and F1-score, metrics suited to imbalanced classification. Achieved 91% ROC-AUC with a strong precision-recall trade-off on the validation set.

03 / Impact
  • Achieved 91% ROC-AUC on the IEEE-CIS Fraud Detection dataset through systematic feature engineering, including email domain extraction, device fingerprinting, and temporal pattern analysis.

  • Implemented a comprehensive preprocessing pipeline with PCA dimensionality reduction (retaining 90% of the variance), outlier removal via z-score thresholding, and memory optimization through data type downcasting.

  • Evaluated gradient boosting frameworks (XGBoost, LightGBM) with cross-validation and hyperparameter optimization, demonstrating robust performance on highly imbalanced e-commerce transaction data.

"This project provided an excellent example of how machine learning techniques are applied to the financial industry. It was a great learning experience in data preprocessing, feature engineering, and model evaluation. The insights gained from this project are valuable for any data scientist working in the field of fraud detection."