Anime Characteristic Data Clustering

Project Description

In this project, I perform an unsupervised learning task by applying clustering techniques to the ACD dataset.

I start by loading and cleaning the data. I infer missing 'Premier' values from the 'Waktu Penayangan' column, extract the numerical duration from the 'Durasi' column, and drop any rows that still contain missing values so the data stays consistent. I encode categorical columns with LabelEncoder and scale all features with StandardScaler to prepare them for clustering.

I focus on a subset of four numerical features (Skor, Jumlah Penjualan, Peringkat, and Popularitas) as the clustering input. To reduce dimensionality and improve interpretability, I apply Principal Component Analysis (PCA) and extract two principal components. I examine the explained variance and the component loadings to understand how each feature contributes to the new axes.

To determine the number of clusters, I run K-Means for k = 2 to 6. I compute the silhouette score for each k to evaluate clustering quality and visualize the results with Yellowbrick's SilhouetteVisualizer. I also apply the Elbow Method by plotting the inertia values, which supports the choice of k = 2 as the most appropriate cluster count.

With k chosen, I fit the final K-Means model and assign a cluster label to each data point. I then visualize the result in 3D with matplotlib, plotting the key variables together with their cluster assignments, and mark the centroid of each cluster to show how the clusters are distributed in space.

Through this process, I gain experience in end-to-end clustering analysis, from preprocessing, dimensionality reduction, and model selection to visualization and interpretation. It helps me understand how unsupervised learning can uncover hidden patterns and groupings in real-world datasets.
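The modeling steps described above (scaling, PCA, and choosing k by silhouette score) can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook. The ACD dataset is not loaded here, so `make_synthetic_frame` is a hypothetical helper that generates stand-in data with the same four column names, and `cluster_features` is likewise a name made up for this sketch. It also assumes that K-Means is fitted on the two principal components, which the description above does not state explicitly.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Column names taken from the project description.
FEATURES = ["Skor", "Jumlah Penjualan", "Peringkat", "Popularitas"]


def make_synthetic_frame(n_per_group=100, seed=0):
    """Hypothetical stand-in for the cleaned ACD data (two synthetic groups)."""
    rng = np.random.default_rng(seed)
    high = rng.normal(
        loc=[8.0, 5000.0, 200.0, 300.0],
        scale=[0.4, 500.0, 50.0, 60.0],
        size=(n_per_group, 4),
    )
    low = rng.normal(
        loc=[6.0, 1000.0, 3000.0, 4000.0],
        scale=[0.4, 300.0, 300.0, 400.0],
        size=(n_per_group, 4),
    )
    return pd.DataFrame(np.vstack([high, low]), columns=FEATURES)


def cluster_features(df, k_values=range(2, 7), seed=0):
    """Scale the features, project to 2 PCA components, and pick k by silhouette."""
    # Standardize so that no feature dominates because of its units.
    X = StandardScaler().fit_transform(df[FEATURES])

    # Reduce to two principal components.
    pca = PCA(n_components=2, random_state=seed)
    X_pca = pca.fit_transform(X)

    # Fit K-Means for each candidate k; record silhouette score and inertia.
    silhouette, inertia = {}, {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_pca)
        silhouette[k] = silhouette_score(X_pca, model.labels_)
        inertia[k] = model.inertia_  # values used for an elbow plot

    # Assumption: the k with the highest silhouette score is selected.
    best_k = max(silhouette, key=silhouette.get)
    final = KMeans(n_clusters=best_k, n_init=10, random_state=seed).fit(X_pca)

    return {
        "pca": pca,
        "explained_variance_ratio": pca.explained_variance_ratio_,
        "silhouette": silhouette,
        "inertia": inertia,
        "best_k": best_k,
        "labels": final.labels_,
        "centroids": final.cluster_centers_,
    }


if __name__ == "__main__":
    result = cluster_features(make_synthetic_frame())
    print("Explained variance ratio:", result["explained_variance_ratio"])
    print("Silhouette by k:", result["silhouette"])
    print("Selected k:", result["best_k"])
```

The rows of `result["pca"].components_` hold the loadings of each original feature on the two principal components, which is one way to inspect how the features contribute to the new axes. The `inertia` dictionary holds the values one would plot for the Elbow Method.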

My Role

Data Scientist

Tech Stack

Pandas, NumPy, scikit-learn, Yellowbrick, Matplotlib