
How To Apply KMeans To Get The Centroid Using Dataframe With Multiple Features

I am following this detailed KMeans tutorial: https://github.com/python-engineer/MLfromscratch/blob/master/mlfromscratch/kmeans.py which uses a dataset with 2 features. But I have a

Solution 1:

Reading the data and clustering it should not throw any errors, even when you increase the number of features in the dataset. In fact, you only get an error in that part of the code when you redefine the euclidean_distance function.

This answer addresses the actual error that you are getting from the plotting function.

def plot(self):
    fig, ax = plt.subplots(figsize=(12, 8))

    for i, index in enumerate(self.clusters):
        point = self.X[index].T
        ax.scatter(*point)

This takes all points in a given cluster and tries to make a scatter plot of them.

The asterisk in ax.scatter(*point) means that point is unpacked, so each of its rows is passed as a separate positional argument.

The implicit assumption here (and this is why this might be hard to spot) is that point should be 2-dimensional. Then, the individual parts get interpreted as x,y values to be plotted.

But since you have 5 features, point is 5-dimensional, and ax.scatter receives five positional arguments instead of two.
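To see the unpacking concretely, here is a minimal sketch with made-up data. The array X and the list index are hypothetical stand-ins for self.X and one entry of self.clusters; they are not taken from the tutorial.

```python
import numpy as np

# Hypothetical stand-in for self.X: 6 samples, 5 features
X = np.arange(30, dtype=float).reshape(6, 5)
# Hypothetical stand-in for one cluster: indices of the samples assigned to it
index = [0, 2, 5]

point = X[index].T      # shape (n_features, n_points_in_cluster)
unpacked = [*point]     # what *point expands to: one array per feature

print(point.shape)      # (5, 3)
print(len(unpacked))    # 5
```

With two features the list would have length 2, which is exactly what a call of the form ax.scatter(x, y) needs.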

Looking at the docs of ax.scatter:

matplotlib.axes.Axes.scatter
Axes.scatter(self, x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None,
verts=<deprecated parameter>, edgecolors=None, *, plotnonfinite=False,
data=None, **kwargs)

So the first few arguments that ax.scatter takes (other than self) are:

x 
y
s (i.e. the markersize)
c (i.e. the color)
marker (i.e. the markerstyle)

The first four, i.e. x, y, s and c, accept floats (or arrays of floats), but your dataset is 5-dimensional, so the fifth feature gets interpreted as marker, which expects a MarkerStyle. Since it is getting an array of floats instead, it throws the error.
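The positional binding can be illustrated without matplotlib. In the sketch below, scatter_like is a hypothetical stand-in that only copies the order of the first five parameters from the signature quoted above; it is not the real Axes.scatter and plots nothing.

```python
import inspect

# Hypothetical stand-in that only mirrors the order of the first five
# parameters of Axes.scatter (it is not matplotlib and plots nothing)
def scatter_like(x, y, s=None, c=None, marker=None):
    pass

# Placeholder names for the five feature columns that *point would pass
features = ["f0", "f1", "f2", "f3", "f4"]

# Bind the positional arguments to parameter names without calling the function
bound = inspect.signature(scatter_like).bind(*features)
binding = dict(bound.arguments)
print(binding)
# {'x': 'f0', 'y': 'f1', 's': 'f2', 'c': 'f3', 'marker': 'f4'}
```

The fifth column lands on marker, which matches the error described above.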

What to do:

Only look at 2 or 3 dimensions at a time, or use dimensionality reduction (e.g. principal component analysis) to project the data to a lower-dimensional space.

For the first option, you can redefine the plot method within the KMeans class:

def plot(self):
    import itertools

    # generate all pairs of feature indices (self.X has shape (n_samples, n_features))
    n_features = self.X.shape[1]
    combinations = list(itertools.combinations(range(n_features), 2))

    # initialise one subplot for each feature combination;
    # squeeze=False keeps axes a 2-D array even if there is only one subplot
    fig, axes = plt.subplots(figsize=(12, 8), nrows=len(combinations), ncols=1, squeeze=False)

    for (x, y), ax in zip(combinations, axes.ravel()):  # loop through combinations and subplots
        for i, index in enumerate(self.clusters):
            point = self.X[index].T

            # only get the coordinates for this combination:
            px, py = point[x], point[y]
            ax.scatter(px, py)

        for point in self.centroids:
            # only get the coordinates for this combination:
            px, py = point[x], point[y]
            ax.scatter(px, py, marker="x", color='black', linewidth=2)

        ax.set_title('feature {} vs feature {}'.format(x, y))
    plt.show()
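
For the second option (dimensionality reduction), a minimal PCA sketch using only NumPy might look like the following. The helper name project_to_2d and the random data are assumptions made for illustration; they are not part of the tutorial, and a library implementation such as scikit-learn's PCA would serve the same purpose.

```python
import numpy as np

def project_to_2d(X):
    """Project the rows of X onto their first two principal components."""
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T     # shape (n_samples, 2)

# Made-up data standing in for a 5-feature dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

X_2d = project_to_2d(X)
print(X_2d.shape)  # (100, 2)
```

The two columns of X_2d can then be handled like a 2-feature dataset by the original plot method. Centroids computed in the original 5-dimensional space would need the same centering and projection before being drawn on the same axes.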
