Anime Characteristic Data Classification

Project Description

In this project, I build a classification model for the Anime Characteristic Data (ACD) dataset to predict a categorical target variable: the content rating. I begin by loading and exploring the training and test datasets, checking for duplicate rows, identifying outliers, and assessing missing values.

I handle outliers in key numerical columns with a combination of log transformation and winsorization. For missing data, I impute categorical columns with the mode and numerical columns with the median, and I extract temporal features (such as season and year) from textual airing dates. To enrich the dataset, I engineer new features such as “Musim Premier” (premiere season), “Durasi (menit)” (duration in minutes), and “Total Durasi Tayang (menit)” (total airing duration in minutes). I also extract the main producer and studio from nested string fields and encode them using frequency encoding and target mean encoding. Categorical variables are transformed with one-hot encoding or label encoding, depending on their nature. I scale selected numerical columns with StandardScaler and convert airing dates into an “age in days” relative to a fixed reference date.

For modeling, I choose the Random Forest Classifier and tune its hyperparameters with GridSearchCV to find the best combination of tree depth, number of estimators, and splitting criteria. To handle potential class imbalance, I set class_weight='balanced' in the Random Forest configuration. I evaluate the model using accuracy and macro-averaged F1-score, which suit the multi-class classification problem at hand. After training and evaluation, I select the best-performing model, validate it with these metrics, and use it to predict the ratings on the test dataset.

This project gives me hands-on experience with complex data preprocessing, feature engineering, and model optimization, and it deepens my understanding of how to build interpretable and robust classification models on real-world, messy datasets.
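
The workflow described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: the column names (members, episodes, studio, rating), the quantile cut-offs, the helper functions log_winsorize and frequency_encode, and the parameter grid are all assumptions made for the example, and "splitting criteria" is interpreted here as the criterion parameter.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler


def log_winsorize(s, lower=0.01, upper=0.99):
    """log1p-transform a non-negative column, then clip it to the given quantiles."""
    logged = np.log1p(s)
    return logged.clip(logged.quantile(lower), logged.quantile(upper))


def frequency_encode(s):
    """Replace each category with its relative frequency in the column."""
    return s.map(s.value_counts(normalize=True))


# Synthetic stand-in for the ACD data; every column name is hypothetical.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "members": rng.lognormal(8, 2, n),                 # heavy-tailed numeric column
    "episodes": rng.integers(1, 50, n).astype(float),
    "studio": rng.choice(["A", "B", "C"], n),
    "rating": rng.choice(["G", "PG-13", "R"], n, p=[0.2, 0.6, 0.2]),
})
# Inject some missing values so the imputation step has something to do.
df.loc[rng.random(n) < 0.1, "episodes"] = np.nan
df.loc[rng.random(n) < 0.1, "studio"] = np.nan

# Missing values: median for numerical columns, mode for categorical ones.
df["episodes"] = df["episodes"].fillna(df["episodes"].median())
df["studio"] = df["studio"].fillna(df["studio"].mode()[0])

# Outliers and encoding.
df["members"] = log_winsorize(df["members"])
df["studio_freq"] = frequency_encode(df["studio"])

X = df[["members", "episodes", "studio_freq"]]
y = df["rating"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

# Small illustrative grid; the values are assumptions, not the tuned ones.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="f1_macro",
    cv=3,
)
search.fit(X_train, y_train)

pred = search.predict(X_val)
acc = accuracy_score(y_val, pred)
macro_f1 = f1_score(y_val, pred, average="macro")
print("best params:", search.best_params_)
print(f"accuracy={acc:.3f}, macro F1={macro_f1:.3f}")
```

Because the synthetic labels are random, the scores printed here carry no meaning; the sketch only shows how the steps fit together. For brevity it computes the imputation and frequency statistics on the full table, whereas on real data they would normally be fitted on the training split alone to avoid leakage.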

My Role

Data Scientist

Tech Stack

Pandas, NumPy, scikit-learn (Random Forest Classifier, GridSearchCV, StandardScaler)

Project Link

https://colab.research.google.com/drive/1CLZOWibOv7TrMW6oObyyCnZMc665QSHY?usp=sharing