Machine Learning-based Quality Index Prediction Model for Semiconductor Manufacturing Process
SK-Planet Data Analyst Training Program Project (2023.01.03 ~ 2023.03.09)
This project develops a Virtual Metrology (VM) model that predicts the quality index of semiconductor manufacturing processes using sensor data. The model leverages 665 sensor variables across 7 process steps, applying feature engineering, multicollinearity treatment, and hyperparameter optimization techniques.
- Feature Engineering: 665 sensor variables → 865 total features (200+ derived)
- Multicollinearity Treatment: PCA dimensionality reduction for VIF > 10 variables
- Feature Selection: SelectKBest (Mutual Information) based top 250 features
- Bayesian Optimization: Automated hyperparameter tuning for Ridge, Random Forest
- AutoML: Multi-model comparison and auto-tuning with PyCaret
| Metric | Value |
|---|---|
| Sample Size | 611 observations (train) |
| Sensor Variables | 665 features |
| Process Steps | 7 steps (04, 06, 12, 13, 17, 18, 20) |
| Sensor Types | 95 sensor categories |
| Target Variable | Quality Index (mean ~1263) |
| Generated Features | 200+ engineered features |
Mean: 1263.41
Variance: 67.16
Distribution: Approximately Normal
Outliers: y < 1240 (clipped during preprocessing)
Total Process Time: 30.6 ~ 31.9 minutes
Critical Step Gap: Step 06 → Step 12 (longest duration)
Speed Classification: 1870 seconds threshold
- Early (E): EQ7, EQ8 modules
- Late (L): Other modules
Best Model: Extra Trees Regressor
Evaluation Metric: RMSE
Cross-validation: 5-fold
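The evaluation setup above can be sketched as follows — Extra Trees scored by RMSE under 5-fold cross-validation. The data here is synthetic for illustration only; the project's actual features and target come from the preprocessing notebooks.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score, KFold

# Synthetic stand-in for the preprocessed feature matrix and quality index
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = 1263.41 + X[:, 0] * 3 + rng.normal(scale=2, size=200)

model = ExtraTreesRegressor(n_estimators=100, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn returns negated RMSE so that higher is better
scores = cross_val_score(model, X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")
rmse = -scores.mean()
print(f"5-fold CV RMSE: {rmse:.3f}")
```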
# Python 3.8 or higher
python --version
# Install dependencies
pip install -r requirements.txt
# Clone the repository
git clone git@github.com:jinsoo96/Semiconductor_Yield_Virtual_Metrology.git
cd Semiconductor_Yield_Virtual_Metrology
# Install Python packages
pip install -r requirements.txt
Full Pipeline:
# Step 1: EDA and Data Preprocessing
jupyter notebook 01_Code/01_eda_and_preprocessing.ipynb
# Step 2: Modeling (AutoML + Bayesian Optimization)
jupyter notebook 01_Code/02_modeling.ipynb
Note: Common functions are centralized in 01_Code/utils.py to avoid code duplication.
Execution Time: ~30-45 minutes for full pipeline
Semiconductor_Yield_Virtual_Metrology/
│
├── 01_Code/
│ ├── utils.py # Common utility functions (shared module)
│ ├── 01_eda_and_preprocessing.ipynb # EDA & Data Preprocessing
│ └── 02_modeling.ipynb # All modeling (AutoML + Bayesian Opt)
│
├── 02_Data/
│ ├── DATA_INFO.txt # Data documentation
│ └── raw/
│ ├── train_sensor.csv # Training sensor data (~24.5MB)
│ ├── train_quality.csv # Training quality labels (~25KB)
│ └── predict_sensor.csv # Test sensor data (~10.5MB)
│
├── 03_Results/
│ ├── figures/ # Generated plots
│ ├── tables/ # Result tables
│ ├── preprocessed_data.pkl # Preprocessed data (generated)
│ └── best_model.pkl # Trained model (generated)
│
├── 04_Documentation/
│ ├── final_presentation.pdf # Final presentation slides
│ ├── Code_Structure.md # Detailed code guide
│ └── Analysis_Workflow.md # Step-by-step workflow
│
├── archive/ # Original development files
│
├── README.md # This file
├── requirements.txt # Python dependencies
├── LICENSE # MIT License
└── .gitignore # Git ignore rules
- Source: SK HYNIX semiconductor manufacturing data
- Period: October 2021
- Sample: 611 LOT observations with 665 sensor features
Preprocessing Steps:
- Pivot transformation (Long → Wide format)
- Feature generation from step_id + param_alias
- Time feature extraction (process duration)
- Missing value handling
- Outlier treatment (IQR-based clipping)
- Standardization (StandardScaler)
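The outlier and scaling steps above can be sketched as a minimal example — IQR-based clipping followed by StandardScaler. The column name is illustrative, not from the actual dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny illustrative frame; 100.0 plays the role of an outlier
df = pd.DataFrame({"sensor_a": [1.0, 2.0, 2.5, 3.0, 100.0]})

# IQR-based clipping: winsorize instead of dropping rows
q1, q3 = df["sensor_a"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["sensor_a"] = df["sensor_a"].clip(lower, upper)

# Standardization to zero mean / unit variance
scaled = StandardScaler().fit_transform(df[["sensor_a"]])
print(df["sensor_a"].tolist(), round(float(scaled.mean()), 6))
```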
| Feature Type | Description | Count |
|---|---|---|
| Original Sensors | Raw sensor measurements per step | 665 |
| Duration Features | Total and inter-step process time | 21 |
| Statistical Features | Sensor std/mean across steps | 190 |
| Categorical Features | Equipment category encoding | 8+ |
| Binned Features | Continuous variable discretization | 30+ |
665 Original Features
↓
+200 Generated Features (865 total)
↓
Variance Threshold (remove zero-variance)
↓
VIF Analysis (identify multicollinearity)
↓
PCA (reduce VIF>10 features)
↓
SelectKBest (top 250 by Mutual Information)
↓
Final Feature Set (~300 features)
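The selection pipeline above can be sketched end-to-end on synthetic data: drop zero-variance columns, fold highly collinear (VIF > 10) columns into 2 principal components, then keep the top-k features by mutual information (k=250 in the project; a small k is used here). All column names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_regression
from sklearn.decomposition import PCA
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 6)), columns=[f"s{i}" for i in range(6)])
X["s5"] = X["s0"] * 0.99 + rng.normal(scale=0.01, size=300)  # near-duplicate -> high VIF
X["const"] = 1.0                                             # zero-variance column
y = X["s0"] + X["s1"] + rng.normal(scale=0.1, size=300)

# 1) Variance Threshold: remove zero-variance features
vt = VarianceThreshold()
X = pd.DataFrame(vt.fit_transform(X), columns=X.columns[vt.get_support()])

# 2) VIF analysis: replace VIF > 10 columns with 2 principal components
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
high = vif[vif > 10].index.tolist()
if len(high) >= 2:
    pcs = PCA(n_components=2).fit_transform(X[high])
    X = X.drop(columns=high).assign(pc1=pcs[:, 0], pc2=pcs[:, 1])

# 3) SelectKBest by mutual information
skb = SelectKBest(score_func=mutual_info_regression, k=min(4, X.shape[1]))
X_sel = skb.fit_transform(X, y)
print(X_sel.shape)
```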
Models Evaluated:
- Ridge Regression (Bayesian Optimized)
- Random Forest Regressor (Bayesian Optimized)
- Extra Trees Regressor
- CatBoost Regressor
- LightGBM Regressor
- Gradient Boosting Regressor
Training Strategy:
- Train/Valid/Test split: 64% / 16% / 20%
- RandomOverSampler on the binned target (A/B/C categories) to address imbalance
- Log transformation on target (optional)
- 5-fold Cross-validation
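The 64% / 16% / 20% split above can be produced with two chained `train_test_split` calls: hold out 20% for test first, then carve 20% of the remainder for validation (0.8 × 0.2 = 0.16).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data to demonstrate the split proportions
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# First split: 80% temp / 20% test
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
# Second split: 80% of temp -> train (64%), 20% of temp -> valid (16%)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_tmp, y_tmp, test_size=0.20, random_state=42)

print(len(X_train), len(X_valid), len(X_test))  # 640 160 200
```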
Finding: Target variable y follows approximately normal distribution (mean: 1263.41, variance: 67.16)
- Insight: Normal distribution assumption is valid, favorable for regression models
- Action: Clip target values below y = 1240 for model stability
Finding: Significant quality index variation exists across equipment (EQ1~EQ8)
- Insight: EQ7, EQ8 show relatively lower quality indices compared to others
- Action: One-hot encode module_name_eq as categorical feature
Finding: Total process duration clearly separates into two groups at 1870 seconds threshold
- Early Group (E): ~30.6 min, mainly EQ7, EQ8 modules
- Late Group (L): ~31.9 min, most modules
- Insight: Correlation exists between process speed and quality
- Action: Create tmdiff_speed categorical variable as model feature
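A minimal construction of the tmdiff_speed flag, using the 1870-second threshold above: under 1870 s → Early (E), otherwise Late (L). The column name `gen_tmdiff_total` is an assumption for illustration.

```python
import pandas as pd

# Total process times in seconds (~30.6 min = 1836 s, ~31.9 min = 1914 s)
df = pd.DataFrame({"gen_tmdiff_total": [1836, 1914, 1860, 1905]})

# Early (E) if under the 1870 s threshold, else Late (L)
df["tmdiff_speed"] = df["gen_tmdiff_total"].apply(
    lambda s: "E" if s < 1870 else "L")
print(df["tmdiff_speed"].tolist())  # ['E', 'L', 'E', 'L']
```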
Finding: Step 06 → Step 12 shows longest duration and highest variability
- Insight: This interval is estimated to have the greatest impact on quality
- Action: Generate inter-step duration features (gen_tmdiff_0612, etc.)
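The inter-step duration features can be sketched from per-step timestamps; the wide table is assumed to carry one end-time column per step (column names here are illustrative).

```python
import pandas as pd

# One LOT with assumed per-step end timestamps
df = pd.DataFrame({
    "end_tm_04": pd.to_datetime(["2021-10-01 00:03:00"]),
    "end_tm_06": pd.to_datetime(["2021-10-01 00:07:00"]),
    "end_tm_12": pd.to_datetime(["2021-10-01 00:21:00"]),
})

# gen_tmdiff_0612: seconds elapsed between step 06 and step 12
df["gen_tmdiff_0612"] = (df["end_tm_12"] - df["end_tm_06"]).dt.total_seconds()
print(df["gen_tmdiff_0612"].tolist())  # [840.0]
```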
Finding: Aggregated sensor statistics (std, mean) across steps show higher predictive power than individual step values
- Insight: Step-wise variability is a key indicator for quality prediction
- Action: Generate 95 sensor std (gen_{sensor}_std) and 95 mean (gen_{sensor}_mean) features
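The aggregated sensor statistics can be sketched as a row-wise std/mean over each sensor's per-step columns. Sensor and column names here are illustrative.

```python
import pandas as pd

# Wide-format frame: one column per (step, sensor) pair
df = pd.DataFrame({
    "04_temp": [100.0, 102.0],
    "06_temp": [101.0, 108.0],
    "12_temp": [99.0, 104.0],
})

step_cols = ["04_temp", "06_temp", "12_temp"]
df["gen_temp_std"] = df[step_cols].std(axis=1)    # step-wise variability
df["gen_temp_mean"] = df[step_cols].mean(axis=1)  # step-wise level
print(df[["gen_temp_std", "gen_temp_mean"]].round(3).values.tolist())
```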
Finding: Multiple variables with VIF > 10 exist (multicollinearity problem)
- Insight: High correlation among sensor variables poses model instability risk
- Action: Apply PCA to VIF > 10 variables, reduce to 2 principal components
Finding: Model performance maintained with top 250 features selected by Mutual Information
- Key Features:
- Process duration features (gen_tmdiff_*)
- Aggregated sensor statistics (gen_*_std, gen_*_mean)
- Equipment categorical features
- Insight: Derived features show higher predictive power than original sensor variables
Finding: Target variable shows imbalance when binned into 3 categories (A: y<1242, B: 1242≤y≤1283, C: y>1283)
- Distribution: Most samples concentrated in B category
- Action: Apply RandomOverSampler for minority class oversampling
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import train_test_split
# Import utility functions
from utils import (
load_data, make_dataset,
gen_cate_feats, gen_duration_feats, gen_stats_feats,
LST_STEPS, LST_STEPSGAP
)
# Load data
path = "./02_Data/raw/"
train_sensor, train_quality, predict_sensor = load_data(path)
# Create dataset
train = make_dataset(train_sensor, train_quality)
# Feature engineering
train = gen_cate_feats(train) # Equipment category
train = gen_duration_feats(train, LST_STEPSGAP) # Process duration
train = gen_stats_feats(train, sensors_nm, LST_STEPS) # Sensor statistics (sensors_nm: list of sensor names, defined earlier in the notebook)
# Feature selection
skb = SelectKBest(score_func=mutual_info_regression, k=250)
X_selected = skb.fit_transform(X, y)
# Train model
from pycaret.regression import setup, compare_models, tune_model
reg = setup(data=train_df, target='y', normalize=True)
best = compare_models(sort='RMSE')
best_tuned = tune_model(best)
pandas>=1.3.0
numpy>=1.21.0
scipy>=1.7.0
scikit-learn>=1.0.0
catboost>=1.0.0
lightgbm>=3.3.0
xgboost>=1.5.0
pycaret>=2.3.0
bayesian-optimization>=1.2.0
imbalanced-learn>=0.9.0
statsmodels>=0.13.0
matplotlib>=3.4.0
seaborn>=0.11.0
- Python: 3.8 or higher
- RAM: 8GB minimum (16GB recommended)
- Storage: 500MB for code and data
- OS: Windows, macOS, or Linux
| Name | Role |
|---|---|
| Yukyung Lim | Data Analysis & Modeling |
| Jin Soo Kim | Feature Engineering & Optimization |
| Hojin Lee | EDA & Visualization |
| Seungah Ahn | Preprocessing & Documentation |
| Document | Description |
|---|---|
| README.md | Project overview (this file) |
| Code_Structure.md | Detailed code documentation |
| Analysis_Workflow.md | Step-by-step workflow |
| final_presentation.pdf | Final presentation slides |
- PyCaret Documentation
- Bayesian Optimization
- Scikit-learn Feature Selection
- Virtual Metrology in Semiconductor Manufacturing
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2023 Jin Soo Kim
- Author: Jin Soo Kim
- GitHub: @jinsoo96