Anime Characteristic Data Regression

Project Description

I begin this project by loading and exploring the dataset, identifying missing values and extracting information from text columns such as 'Waktu Penayangan' (airing period) and 'Durasi' (duration). To fill missing entries in the 'Premier' column, I infer the season and year from the airing date using regular expressions. I engineer new features such as Durasi_menit (duration in minutes), Panjang_Judul (title length), and Tahun_Tayang (airing year) to give the model potentially predictive information. Next, I encode categorical variables with Label Encoding, using the same mapping for the training and test sets. I also select a consistent set of features for model training and replace placeholder values such as 'Unknown' with proper null values.

After preprocessing, I train a regression model with LightGBM, a gradient boosting framework known for its speed and efficiency on tabular data. I configure the model with L1 regularization (reg_alpha=1) to limit overfitting and with a sufficient number of estimators. For evaluation, I compute the R² score on the training data, which measures the proportion of variance in the target variable, Jumlah Penjualan (number of sales), that the selected features explain. Because this score is computed on the training data, it indicates goodness of fit rather than performance on unseen data.

Through this workflow, I gain experience in real-world regression tasks, including data preparation, feature extraction, encoding, and applying tree-based models. I conclude by selecting LightGBM as the most suitable model for this dataset because of its strong performance and its ability to handle both numeric and categorical features.
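The text-parsing steps described above can be illustrated with a minimal sketch. It is not the notebook's actual code. The input formats are assumptions: 'Durasi' values are taken to look like "24 min per ep" or "1 hr 30 min", and 'Waktu Penayangan' values like "Apr 3, 2016 to Jun 26, 2016". The helper names (duration_to_minutes, airing_year, infer_premiere) and the month-to-season mapping are hypothetical.

```python
import re

# Assumed month-to-season mapping; the project's real mapping may differ.
MONTH_TO_SEASON = {
    "Jan": "Winter", "Feb": "Winter", "Mar": "Winter",
    "Apr": "Spring", "May": "Spring", "Jun": "Spring",
    "Jul": "Summer", "Aug": "Summer", "Sep": "Summer",
    "Oct": "Fall", "Nov": "Fall", "Dec": "Fall",
}


def duration_to_minutes(text):
    """Parse an assumed 'Durasi' string into total minutes (Durasi_menit)."""
    if not isinstance(text, str):
        return None
    hours = re.search(r"(\d+)\s*hr", text)
    minutes = re.search(r"(\d+)\s*min", text)
    if not hours and not minutes:
        return None  # placeholder such as 'Unknown'
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total


def airing_year(text):
    """Return the first four-digit year in the airing string (Tahun_Tayang)."""
    if not isinstance(text, str):
        return None
    match = re.search(r"\b(?:19|20)\d{2}\b", text)
    return int(match.group(0)) if match else None


def infer_premiere(text):
    """Infer a 'Season Year' label for 'Premier' from the first airing date."""
    if not isinstance(text, str):
        return None
    # Month abbreviation, optional day and comma, then a four-digit year.
    match = re.search(
        r"\b([A-Z][a-z]{2})[a-z]*\s+(?:\d{1,2},\s*)?((?:19|20)\d{2})\b", text
    )
    if not match:
        return None
    season = MONTH_TO_SEASON.get(match.group(1))
    return f"{season} {match.group(2)}" if season else None


print(duration_to_minutes("1 hr 30 min"))             # 90
print(airing_year("Apr 3, 2016 to Jun 26, 2016"))     # 2016
print(infer_premiere("Apr 3, 2016 to Jun 26, 2016"))  # Spring 2016
```

With pandas, helpers like these could be applied per column, for example `df["Durasi"].apply(duration_to_minutes)`. Returning None for unparseable values keeps placeholders such as 'Unknown' as nulls instead of guessing a value.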

My Role

Tech Stack

Pandas, NumPy, LightGBM

Project Link

https://colab.research.google.com/drive/1SWKrCH_F0RxGDVq_m1s1v0Po1ydu8Rrz?usp=sharing