Project Overview
This project addresses one of the most critical challenges in modern semiconductor manufacturing: predicting yield rates for integrated circuit production. As process nodes approach the angstrom scale, the ability to predict and optimise yield becomes increasingly important for meeting global demand.
The project combines deep understanding of semiconductor physics with advanced machine learning techniques, specifically implementing logistic regression models to predict whether individual dies will be functional or defective.
The Dataset
I used semiconductor sensor data from Kaggle - proper clean stuff with no missing values, which made life a lot easier. The training set had 1,763 rows and the test set had 756. The target was binary: 0 for an unsuccessful die and 1 for a successful one.
Data Preprocessing
Since there were no missing values, I didn't have to mess about with imputation. All features were integers except feature 3, and I simplified the column headers for better readability. I implemented both Z-score normalisation and min-max scaling to compare how they affected model performance.
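A minimal sketch of that comparison using scikit-learn, assuming the features live in pandas DataFrames called X_train and X_test (those names are placeholders, not from the project code):

```python
# Comparing Z-score normalisation and min-max scaling with scikit-learn.
# X_train and X_test are assumed to be DataFrames of numeric features.
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Z-score normalisation: zero mean, unit variance per feature
z_scaler = StandardScaler()
X_train_z = z_scaler.fit_transform(X_train)
X_test_z = z_scaler.transform(X_test)  # reuse the training statistics

# Min-max scaling: squashes each feature into [0, 1]
mm_scaler = MinMaxScaler()
X_train_mm = mm_scaler.fit_transform(X_train)
X_test_mm = mm_scaler.transform(X_test)
```

Fitting the scalers on the training set only, then reusing those statistics on the test set, avoids leaking test information into the model.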
Machine Learning Models
Logistic Regression
This was my main model for binary classification. I ran both scaled and unscaled versions to see the difference, and it achieved about 90% accuracy on the test data, which was pretty decent. The interesting bit was that it was better at predicting unsuccessful cases (true negatives) but struggled with the minority class (successful cases). When I did hyperparameter tuning, I found clear trade-offs between true positives and true negatives.
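A sketch of the scaled-vs-unscaled comparison plus a simple sweep over the regularisation strength C. The variable names carry over from the preprocessing sketch above, and the grid values are illustrative rather than the exact ones I tuned:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Unscaled baseline
lr_raw = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("unscaled accuracy:", lr_raw.score(X_test, y_test))

# Scaled version with a small grid over the regularisation strength C
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    scoring="accuracy",
    cv=5,
)
grid.fit(X_train_z, y_train)
print("best params:", grid.best_params_)
print("scaled accuracy:", grid.score(X_test_z, y_test))
```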
Random Forest Classifier
I used this for ensemble-based classification with decision trees. The Gini criterion for impurity measurement worked well, and the forest handled non-linear relationships and feature interactions much better than logistic regression.
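The setup looks roughly like this; the hyperparameter values here are illustrative defaults, not necessarily what the final model used:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,
    criterion="gini",   # Gini impurity as the split criterion
    random_state=42,
)
rf.fit(X_train, y_train)
print("random forest accuracy:", rf.score(X_test, y_test))
```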
Support Vector Classification (SVC)
This was a bit trickier because I used a separate wafer dataset with quite different characteristics. The main challenge was dealing with heavy class imbalance - roughly a 20:1 ratio between the classes. I filled missing values with zeros, but the imbalance made it quite difficult to get good performance.
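A sketch of that pipeline, assuming the wafer data sits in a DataFrame called wafer_df with a 'label' column (both placeholders). Passing class_weight="balanced" is one common way to push back against a 20:1 imbalance, shown here as an option rather than the exact configuration I used:

```python
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X = wafer_df.drop(columns=["label"]).fillna(0)  # zero-fill missing sensor values
y = wafer_df["label"]

# Stratified split keeps the 20:1 class ratio consistent across splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

svc = SVC(kernel="rbf", class_weight="balanced")
svc.fit(X_tr, y_tr)
print("SVC accuracy:", svc.score(X_te, y_te))
```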
Model Evaluation
I used proper metrics: overall accuracy, classification reports with precision, recall and F1-score, plus confusion matrices to visualise the predictions. The class imbalance was significant - 322 unsuccessful vs 31 successful cases - which definitely affected model performance.
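The evaluation step looks roughly like this, assuming y_test and y_pred come from whichever model is being inspected:

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
)
import matplotlib.pyplot as plt

print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))  # per-class precision, recall, F1

# The confusion matrix makes the TN/TP trade-off visible at a glance
ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred)).plot()
plt.show()
```

The classification report is what exposes the minority-class problem: overall accuracy can sit above 90% while recall on the successful class stays poor.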
Key Findings
The first four features dominated the unscaled data due to their much larger scales, which made sense given the nature of semiconductor manufacturing data. Logistic regression achieved 90%+ accuracy but really struggled with the minority class. The data quality was excellent, with no missing values, so minimal preprocessing was required.
What I Learned
This project taught me loads about handling imbalanced datasets and the importance of proper feature scaling. The trade-offs between different performance metrics were really interesting - you can't just focus on overall accuracy when dealing with imbalanced classes. I also got a proper understanding of how semiconductor manufacturing data looks and behaves.
Future Improvements
I'd definitely implement techniques to handle the class imbalance better - things like SMOTE or class weights. Exploring additional algorithms like neural networks or gradient boosting would be interesting too. Feature engineering to improve minority-class prediction and cross-validation for more robust evaluation are also on the list.
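One possible starting point for the imbalance work is SMOTE from the imbalanced-learn package, applied only to the training split. This is a sketch of the planned improvement, not code from the current project:

```python
# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE

# Oversample the minority (successful) class on the training data only
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

# The balanced data can then be fed to any of the classifiers above
# and compared against the class_weight="balanced" route.
```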