Unsupervised learning is a machine learning approach in which a model is fitted to data that has no pre-assigned labels. K-Means clustering is one of the most popular unsupervised learning algorithms and is readily available in Python. The algorithm groups similar data points into clusters, each represented by a centroid (the mean of the points in that cluster), by minimizing the sum of squared distances between the data points and their assigned centroids. In this article, we will cover the basics of K-Means clustering and its implementation in Python.
What is K-Means Clustering?
K-Means clustering is a type of centroid-based clustering. Its main objective is to partition the data into K clusters, where K is a positive integer chosen by the user. The data points within each cluster should be as close as possible to the centroid of that cluster, so every point ends up assigned to the centroid nearest to it.
The algorithm iteratively updates the centroids until convergence, which occurs when the centroids stop moving. At that point the sum of squared distances between the data points and their centroids has reached a local minimum. This is not guaranteed to be the global minimum, and the result can depend on the initial centroids.
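The quantity being minimized can be written out explicitly. The notation below is an assumption introduced here for illustration, not taken from the article: C_k denotes the set of points assigned to cluster k, mu_k its centroid, and J the sum of squared distances.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

Each assignment step and each recalculation step can only lower J or leave it unchanged, which is why the iteration settles down.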
Steps in K-Means Clustering
- Initialization: Choose K initial centroids randomly or using a pre-defined method.
- Assignment: Assign each data point to the closest centroid based on the Euclidean distance.
- Recalculation: Calculate the mean of all data points assigned to each centroid and move the centroid to the mean.
- Repeat: Repeat the assignment and recalculation steps until the centroids stop moving.
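The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the function name kmeans_sketch, its parameters, the toy data, and the choice to leave an empty cluster's centroid where it was are all hypothetical.

```python
import numpy as np


def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Illustrative K-Means loop (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment: squared Euclidean distance from every point to every centroid,
        # then take the index of the nearest centroid for each point.
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Recalculation: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid (a simplifying assumption).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Repeat until the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # labels come from the most recent assignment step.
    return centroids, labels


# Two well-separated groups of 2-D points (made-up toy data).
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_sketch(points, k=2)
print(centroids)
print(labels)
```

On this toy data the first three points should share one label and the last three the other. Which group receives label 0 depends on the random initialization.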
K-Means Clustering in Python
In this section, we will show you how to perform K-Means clustering in Python using the KMeans class from the sklearn.cluster module of scikit-learn.
Step 1: Importing Required Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
Step 2: Creating Synthetic Data
In this step, we will create a synthetic dataset using the make_blobs function from the sklearn.datasets module. The function returns the points X and the true blob labels y; K-Means only uses X.
X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=0.60)
Step 3: Visualizing the Data
In this step, we will visualize the data using a scatter plot.
plt.scatter(X[:,0], X[:,1], s=50)
Step 4: Creating the Model
In this step, we will create the K-Means clustering model using the KMeans class from the sklearn.cluster module. We will also specify the number of clusters we want to form, which is 4 in this case.
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
Step 5: Predicting the Clusters
In this step, we will predict the clusters for each data point using the predict method.
y_kmeans = kmeans.predict(X)
Step 6: Visualizing the Clusters
In this step, we will visualize the clusters by plotting the data points with different colors based on the cluster they belong to.
plt.scatter(X[:,0], X[:,1], c=y_kmeans, s=50, cmap='viridis')
Step 7: Plotting the Centroids
In this step, we will plot the centroids of the clusters using the cluster_centers_ attribute.
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
Step 8: Evaluating the Model
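In this step, we will evaluate the model using the inertia_ attribute, which stores the sum of squared distances between the data points and their closest centroids. Lower values mean tighter clusters, but inertia always decreases as K grows, so it is most useful for comparing models rather than as an absolute score.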
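A short sketch of this step follows. It rebuilds the data and the model so that it runs on its own. The explicit n_init and random_state values and the name manual_inertia are assumptions added here for repeatability and illustration; they are not part of the steps above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data as in Step 2.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=0.60)

# n_init and random_state are set explicitly (an assumption) so the run is repeatable.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# inertia_ holds the sum of squared distances from each point to its centroid.
print("Inertia:", kmeans.inertia_)

# Recompute the same quantity by hand as a sanity check:
# look up each point's centroid via labels_ and sum the squared differences.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()
print("Manual check:", manual_inertia)
```

The two printed values should agree up to floating-point rounding.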
Choosing the Right Number of Clusters
The elbow method is one of the popular techniques to determine the optimal number of clusters. In this method, we plot the sum of squared distances between the data points and the centroids for each value of K. The optimal number of clusters is the value of K at the “elbow” of the plot, where the sum of squared distances begins to level off.
Sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    # Fit one model per candidate K and record its inertia.
    km = KMeans(n_clusters=k)
    km = km.fit(X)
    Sum_of_squared_distances.append(km.inertia_)
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
In this article, we have introduced the K-Means clustering algorithm in Python and shown how to implement it using the KMeans class from the sklearn.cluster library. We have also discussed how to evaluate the performance of the model and determine the optimal number of clusters using the elbow method. K-Means clustering is a powerful tool for uncovering hidden patterns in your data, and its implementation in Python is straightforward and easy to understand.