2024 Securing Finances

Fraud Detection

A machine learning-based fraud detection system trained on real-world e-commerce transaction data from Vesta Corporation, achieving 91% ROC-AUC through advanced feature engineering and ensemble methods.

Modern e-commerce platforms process millions of transactions daily, requiring robust fraud detection systems that balance security with user experience. This project addresses the challenge of minimizing false positives (legitimate transactions that are incorrectly flagged) while maintaining high detection accuracy for fraudulent activity.

Using Vesta Corporation's real-world dataset from the IEEE-CIS Fraud Detection competition, the system merges transaction and identity data into a single table with 400+ features, then applies preprocessing, dimensionality reduction via PCA, and gradient boosting models suited to highly imbalanced data.

Dataset Statistics

Total Transactions

590,540

Real-world financial transactions from Vesta Corporation

Fraud Cases

20,663

Confirmed fraudulent transactions for model training, about 3.5% of all transactions

Features

394

Columns in the raw transaction table, covering transaction patterns and user behavior

AUC Score

0.91

ROC-AUC on a validation split of the IEEE-CIS competition dataset

Tools & Technologies

XGBoost

Gradient boosting framework for structured data with high performance

LightGBM

Microsoft's gradient boosting framework optimized for speed and memory

CatBoost

Yandex's gradient boosting with categorical feature handling

SMOTE

Synthetic Minority Oversampling Technique for class imbalance

Pandas

Data manipulation and analysis library for structured data

Scikit-learn

Machine learning library for preprocessing and model evaluation

Data Integration & Preprocessing

Merged transaction and identity datasets using TransactionID as the join key. Imputed missing values with sentinel placeholders (-999 for numerical columns, '-999' for categorical columns). Removed outliers in TransactionAmt by z-score thresholding (|z| > 3). Reduced memory usage by downcasting numerical columns, and encoded categorical variables with LabelEncoder.
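A minimal sketch of these preprocessing steps on toy data, assuming pandas and scikit-learn. The `preprocess` helper, the toy frames, the left join, the `DeviceType` example column, and the choice to filter outliers before imputing are illustrative assumptions rather than the project's actual code; `TransactionID`, `TransactionAmt`, the |z| > 3 rule, and the -999 placeholders come from the description above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def preprocess(transactions: pd.DataFrame, identity: pd.DataFrame) -> pd.DataFrame:
    """Merge, drop amount outliers, impute with -999, downcast, label-encode."""
    # Assumption: a left join, so transactions without an identity record are kept.
    df = transactions.merge(identity, on="TransactionID", how="left")

    # Keep rows whose TransactionAmt z-score satisfies |z| <= 3.
    # Done before imputation so the -999 sentinel cannot distort mean and std.
    amt = df["TransactionAmt"]
    z = (amt - amt.mean()) / amt.std(ddof=0)
    df = df[z.abs() <= 3].copy()

    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numerical column: fill with -999, then shrink to the smallest dtype.
            kind = "integer" if pd.api.types.is_integer_dtype(df[col]) else "float"
            df[col] = pd.to_numeric(df[col].fillna(-999), downcast=kind)
        else:
            # Categorical column: fill with the string '-999', then integer-encode.
            filled = df[col].fillna("-999").astype(str)
            df[col] = LabelEncoder().fit_transform(filled)
    return df


# Toy data: 19 ordinary amounts and one extreme amount.
transactions = pd.DataFrame({
    "TransactionID": list(range(1, 21)),
    "TransactionAmt": [50.0 + i for i in range(19)] + [10000.0],
    "isFraud": [0] * 18 + [1, 0],
})
identity = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "DeviceType": ["mobile", "desktop", "mobile"],
})

clean = preprocess(transactions, identity)
print(clean.shape)
print(clean.dtypes)
```

A fresh LabelEncoder per column is enough for a sketch; in a train/test setting the encoder would need to be fitted once and reused so that codes stay consistent across splits.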

Feature Engineering & Dimensionality Reduction

Extracted 15+ engineered features including email domain suffixes, operating system types, browser classifications, and device characteristics. Applied PCA to high-dimensional V-columns (339 features) to reduce to 9 principal components while retaining 90% explained variance. Created temporal features capturing transaction patterns across days of week and hours of day.
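The sketch below illustrates three of these ideas on synthetic data: an email domain suffix, hour and day-of-week features, and PCA that keeps enough components to explain 90% of the variance. It assumes the dataset's `P_emaildomain` and `TransactionDT` columns, with `TransactionDT` treated as a seconds offset from an unknown reference time, so the day index is relative rather than a true weekday. The helper names, the standardization step before PCA, and the toy data are assumptions for illustration, not the project's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add an email suffix plus hour and relative day-of-week features."""
    out = df.copy()
    # "gmail.com" -> "com", "yahoo.co.jp" -> "jp"
    out["email_suffix"] = out["P_emaildomain"].str.rsplit(".", n=1).str[-1]
    # Assumption: TransactionDT counts seconds from an unknown reference time.
    out["hour"] = (out["TransactionDT"] // 3600) % 24
    out["day_of_week"] = (out["TransactionDT"] // 86400) % 7
    return out


def reduce_v_columns(df: pd.DataFrame, variance: float = 0.90):
    """Replace the V-columns with the principal components that reach `variance`."""
    v_cols = [c for c in df.columns if c.startswith("V") and c[1:].isdigit()]
    # Assumption: standardize first so no single V-column dominates the PCA.
    scaled = StandardScaler().fit_transform(df[v_cols])
    # A float n_components keeps the fewest components exceeding that variance share.
    pca = PCA(n_components=variance, svd_solver="full")
    components = pca.fit_transform(scaled)
    names = [f"V_pc{i + 1}" for i in range(components.shape[1])]
    pcs = pd.DataFrame(components, columns=names, index=df.index)
    return pd.concat([df.drop(columns=v_cols), pcs], axis=1), pca


# Synthetic stand-in: 20 correlated V-columns driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
v_values = latent @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(500, 20))
toy = pd.DataFrame(v_values, columns=[f"V{i + 1}" for i in range(20)])
toy["TransactionDT"] = np.arange(500) * 3600 + 86400
toy["P_emaildomain"] = ["gmail.com", "yahoo.co.jp"] * 250

features = add_basic_features(toy)
reduced, pca = reduce_v_columns(features)
print(reduced.shape, round(float(pca.explained_variance_ratio_.sum()), 3))
```

Passing a fraction to `n_components` lets the data decide how many components are kept, which is how a result such as 9 components from 339 columns would fall out of a 90% target.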

Model Development & Validation

Trained and compared gradient boosting models (XGBoost, LightGBM) with stratified cross-validation. Tuned hyperparameters including learning rate (0.05) and max depth (6), with early stopping after 10 rounds without improvement. Evaluated using ROC-AUC, precision, recall, and F1-score, which suit imbalanced classification better than accuracy alone. Achieved 91% ROC-AUC with a strong precision-recall trade-off on the validation set.

Achieved 91% ROC-AUC on the IEEE-CIS Fraud Detection dataset through systematic feature engineering, including email domain extraction, device fingerprinting, and temporal pattern analysis.

Implemented comprehensive preprocessing pipeline with PCA dimensionality reduction (retaining 90% variance), outlier removal via z-score thresholding, and memory optimization through data type casting.

Evaluated gradient boosting frameworks (XGBoost, LightGBM) with cross-validation and hyperparameter optimization, demonstrating robust performance on highly imbalanced e-commerce transaction data.

The hard part was not picking a model. Feature engineering and careful handling of class imbalance mattered far more than swapping XGBoost for LightGBM once the data was honest.