Air Quality Index Classification

Project Description
In this project, I build a classification model for air quality categories using the k-Nearest Neighbors (k-NN) algorithm. I begin by exploring and preprocessing the dataset. I handle missing values with distribution-based imputation: for numerical features, I choose the fill value from the feature’s skewness, using the mean for roughly symmetric features and the median for skewed ones. For categorical features, I fill missing values with the mode. I then check for duplicate entries and verify the integrity of both the training and test data. To prepare the data for model training, I convert the categorical features station, critical, and category into numerical labels using Label Encoding. My input features are the key air quality indicators PM10, PM2.5, SO2, CO, O3, and NO2, plus a 'max' value, and the encoded 'category' column is the target variable.

For modeling, I implement k-NN and tune its hyperparameters with GridSearchCV. I search over the number of neighbors (k), the weighting strategy, and the distance metric (such as Euclidean and Manhattan) to find the best-performing combination. Each candidate configuration is scored by classification accuracy under 5-fold cross-validation.

After identifying the best k-NN configuration, I use the trained model to predict categories on the test dataset and convert the predicted labels back into their original categorical form. Finally, I compile the results into a submission file containing the predicted category for each date entry. This project helped me understand the full k-NN classification workflow, from preprocessing and encoding to hyperparameter tuning and result submission.
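The workflow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the data is synthetic, the lowercase column names, station names, and dates are made up, and the skewness cutoff of 0.5 and the candidate values in the parameter grid are assumed for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

SKEW_THRESHOLD = 0.5  # assumed cutoff; the project does not state one


def impute_by_distribution(df, skew_threshold=SKEW_THRESHOLD):
    """Fill numeric gaps with mean or median (by skewness), others with the mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            skewed = abs(out[col].skew()) > skew_threshold
            fill = out[col].median() if skewed else out[col].mean()
        else:
            fill = out[col].mode().iloc[0]
        out[col] = out[col].fillna(fill)
    return out


# Synthetic stand-in for the training data (all values are made up).
rng = np.random.default_rng(0)
n = 200
pollutants = ["pm10", "pm25", "so2", "co", "o3", "no2"]
train = pd.DataFrame(rng.gamma(2.0, 20.0, size=(n, 6)), columns=pollutants)
train["max"] = train[pollutants].max(axis=1)
train["category"] = pd.cut(
    train["max"],
    bins=[0, 50, 100, np.inf],
    labels=["GOOD", "MODERATE", "UNHEALTHY"],
).astype(str)
train["station"] = rng.choice(["S1", "S2", "S3"], size=n)

# Knock out a few values so there is something to impute.
train.loc[rng.choice(n, 10, replace=False), "pm10"] = np.nan
train.loc[rng.choice(n, 10, replace=False), "station"] = np.nan

train = train.drop_duplicates()
train = impute_by_distribution(train)

# Encode the target and keep the encoder to map predictions back later.
encoder = LabelEncoder()
y = encoder.fit_transform(train["category"])
X = train[pollutants + ["max"]]

# Grid over k, weighting strategy, and distance metric, scored by accuracy
# with 5-fold cross-validation.
param_grid = {
    "n_neighbors": [3, 5, 7],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# Predict on a stand-in "test" set and decode labels back to category names.
test = X.head(5)
pred = encoder.inverse_transform(search.predict(test))
submission = pd.DataFrame(
    {"date": pd.date_range("2021-01-01", periods=len(pred)), "category": pred}
)
print(search.best_params_)
print(submission)
```

In a real run, the fill values would normally be computed on the training data and reused for the test data, and `submission.to_csv(...)` would write the final file.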
My Role
Data Scientist
Tech Stack
pandas, NumPy, scikit-learn (k-NN, GridSearchCV, Label Encoding)