In Python, we can easily implement K-Means clustering using sklearn.cluster.KMeans. In this tutorial, we will use some examples to show you how to do it.
Syntax
sklearn.cluster.KMeans is defined as:
class sklearn.cluster.KMeans(n_clusters=8, *, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')
This class allows us to implement k-means clustering easily.
Here are some important parameters.
n_clusters: int, the number of clusters to form. For example, n_clusters = 10 means the data will be clustered into 10 classes.
max_iter: default = 300, maximum number of iterations of the k-means algorithm for a single run.
random_state: determines random number generation for centroid initialization. Use an int, for example random_state = 0, to make the randomness deterministic.
algorithm: {“lloyd”, “elkan”, “auto”, “full”}, default=”lloyd”
“auto” and “full” are deprecated and will be removed in Scikit-Learn 1.3.
How to implement K-Means Clustering?
In order to use the sklearn.cluster.KMeans class to implement k-means, we should determine two important inputs.
- inputs – the data you plan to cluster; it should be a numpy array with shape [sample_num, feature_dim].
For example: inputs = np.random.random([100, 200]), which means we have 100 samples, and each sample is a [1, 200] vector.
- k – how many clusters you plan to form. For example, k = 10 means we will cluster the data into 10 classes.
Here we will use an example to show you how to do it.
from sklearn.cluster import KMeans
import numpy as np

# prepare data
sample_num = 200
feature_dim = 100
data = np.random.random([sample_num, feature_dim])
Here we create a sample data set, which contains 200 samples with 100 features each.
k = 10
kmeans = KMeans(n_clusters=k, random_state=0, max_iter=500).fit(data)
This code tells k-means to cluster the data into 10 classes.
print(kmeans.labels_)
print(kmeans.cluster_centers_.shape)
This code is very important: it prints the class label of each sample and the shape of the cluster centers.
Running this code, we will see:
[1 5 1 5 7 2 7 5 1 7 0 3 6 2 9 0 1 6 0 2 4 1 5 8 4 2 0 5 5 7 6 8 2 2 4 5 5 5 1 5 6 5 0 2 1 8 5 6 5 1 8 8 1 1 3 9 2 7 6 5 5 6 2 5 5 4 6 6 0 7 1 2 7 2 5 2 9 7 8 6 7 5 9 9 5 7 9 7 2 3 3 1 4 5 1 2 5 4 9 8 0 6 8 7 5 6 3 0 5 0 7 5 7 8 2 3 8 1 2 4 0 7 7 2 2 0 5 0 8 2 7 4 9 9 0 9 5 7 9 9 5 9 7 2 8 4 4 8 4 0 5 3 9 3 8 8 8 2 4 2 4 2 7 3 6 9 0 6 9 5 8 0 0 0 5 9 8 0 2 5 8 6 4 0 7 5 7 5 5 7 5 9 5 8 1 1 5 6 5 8] (10, 100)
We can find the class labels range from 0 to 9. The shape of the cluster centers is [k, feature_dim] = [10, 100].
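To see how the samples are distributed across the clusters, we can count the labels. This is a small sketch using np.bincount on the same kind of fitted model as above (with random input data, the exact counts will vary):

```python
from sklearn.cluster import KMeans
import numpy as np

# same setup as above: 200 random samples with 100 features, clustered into 10 classes
data = np.random.random([200, 100])
kmeans = KMeans(n_clusters=10, random_state=0, max_iter=500).fit(data)

# count how many samples were assigned to each of the 10 clusters
counts = np.bincount(kmeans.labels_, minlength=10)
print(counts)  # one count per cluster; the counts sum to 200
```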
Finally, we can use this fitted k-means model to predict the class labels of a test data set.
test_sample = np.random.random([5, feature_dim])
print(kmeans.predict(test_sample))
Running this code, we may see:
[2 9 6 6 5]
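Under the hood, predict() assigns each test sample to its nearest cluster center by Euclidean distance. We can sketch that manually to check our understanding (with random data, exact ties between two centers are extremely unlikely, so the manual result should match predict()):

```python
from sklearn.cluster import KMeans
import numpy as np

# fit a model as in the tutorial above
feature_dim = 100
data = np.random.random([200, feature_dim])
kmeans = KMeans(n_clusters=10, random_state=0, max_iter=500).fit(data)

test_sample = np.random.random([5, feature_dim])

# compute the Euclidean distance from each test sample to each cluster center:
# shape [5, 10], then take the index of the nearest center for each sample
dists = np.linalg.norm(test_sample[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2)
manual_labels = np.argmin(dists, axis=1)

# this should match what predict() returns
print(np.array_equal(manual_labels, kmeans.predict(test_sample)))
```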