Adding a self-adjusting label count to SpectralClustering in scikit-learn.
We are often faced with copious amounts of data and want a method for locating like groups. This is where the field of vector clustering comes into play: instead of having to manually select each group, we can use an algorithm that does the hard work for us.
Which clustering method?
There are many different clustering methods ranging from the overused k-means clustering to density-based non-parametric algorithms such as DBSCAN and OPTICS — just see the list provided by scikit-learn.
Since most real-world data is non-linear and often non-convex, we need to compare a series of different algorithms and determine which gives the best cluster separation.
It therefore makes sense to select a set of pseudo-self-adjusting algorithms for automated testing. One thing that struck me as odd was the explicit constraint on the number of clusters, fixed by an input parameter, within the spectral clustering function of sklearn.
Spectral Clustering - how does it work?
Spectral clustering starts by building a nearest-neighbour graph over all the inputs. We then take the adjacency† matrix of this graph and use it to calculate the Laplacian (the diagonal degree matrix minus the adjacency matrix).
† It is possible to use an affinity matrix, describing how similar two points are in space, instead of the adjacency matrix.
Next we calculate the eigenvalues and eigenvectors using our chosen solver and embed the input into an n-dimensional space. Here we can apply a range of cluster-count-determining methods to give us an optimum number, o, where
o ≤ n_clusters (the input parameter).
Finally, we discretize the labels by searching for a partition matrix/clustering which most resembles the eigenvector embedding. For more information on how spectral clustering works, have a look at this link.
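The pipeline above (neighbour graph, Laplacian, eigendecomposition) can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration, not the actual sklearn internals; the toy two-blob data is my own assumption:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian
from scipy.linalg import eigh

# Toy data: two well-separated blobs (an assumption for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])

# 1. Nearest-neighbour graph, symmetrised so the Laplacian is well defined.
A = kneighbors_graph(X, n_neighbors=5, include_self=False)
A = 0.5 * (A + A.T)

# 2. Graph Laplacian (normed=True gives the degree-normalised variant).
L = laplacian(A, normed=True)

# 3. Eigendecomposition: the number of near-zero eigenvalues matches the
#    number of connected components, i.e. clearly separated groups.
eigenvalues, eigenvectors = eigh(L.toarray())
print(np.round(eigenvalues[:4], 3))
```

With two disconnected blobs, the first two eigenvalues come out at essentially zero and the third jumps well away from it, which is the gap the rest of the article exploits.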
How many clusters?
In order to determine the optimum number of clusters, we apply the methodology described by Ulrike von Luxburg in her “A Tutorial on Spectral Clustering”: https://arxiv.org/pdf/0711.0189.pdf
Here we look for the number of clusters that maximises the gap between consecutive eigenvalues. The larger and more pronounced the gap, the more distinct the calculated groups will be. This change-point-analysis style of search can be implemented through the following checks:
- If there are only tiny values, all below a threshold of 1e-3, we search for a sign change. This separates small negative values (rounded to -0) from actual zeros (+0).
- If there exist values above the threshold, we look for the greatest change using a rolling (consecutive) difference.
- In cases where the values change uniformly (no clear jump), we can simply use the first non-zero value (similar to check 1).
- Finally, if none of the above applies, we continue with the original cluster number and do not reduce it any further.
Having implemented these changes, all that remains is updating the class/function arguments and correcting the documentation. I found a simple n_auto argument worked well for this.
n_auto : bool, optional, default=False
    If not False (or zero), an eigenvalue evaluation is performed to
    better fit/reduce the maximum number of clusters, based on the
    work in `A Tutorial on Spectral Clustering, 2007, Ulrike von
    Luxburg`.
The updated code containing my changes can be found on the following branch (this shall be cleaned up at some point in the future): https://github.com/wolfiex/scikit-learn/tree/spectral_autocluster
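To see why the argument is needed, stock scikit-learn always emits exactly the requested number of labels, even when we deliberately over-ask. The snippet below uses upstream sklearn only; with the branch installed, the idea is that passing the new `n_auto=True` would let the label count shrink to match the eigengap:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import SpectralClustering

# Two clear blobs, but we deliberately over-ask for 8 clusters.
X, _ = make_blobs(n_samples=100, centers=2, cluster_std=0.3, random_state=0)

# Stock scikit-learn: the label count is fixed at n_clusters.
model = SpectralClustering(n_clusters=8, affinity="nearest_neighbors",
                           random_state=0)
labels = model.fit_predict(X)
print(len(set(labels)))  # upstream always returns n_clusters distinct labels
```

On the branch, the equivalent call would be `SpectralClustering(n_clusters=8, n_auto=True, ...)`, with 8 acting only as an upper bound.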
Analysis and testing
When using the eigengap heuristic (change-point analysis to find the greatest gap in the eigenvalues), the break will often be very apparent in clearly segregated data, and most likely visible within the diagonal of the ordered affinity matrix. Here we have a look at the different eigenvalue distributions we may be faced with for a set of case examples. The printed values have been rounded to 3 d.p., with “-0” signifying a small negative number close to zero.
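The “-0” display is just floating-point sign preservation under rounding, which is what makes the sign-change check in the first bullet above possible. A quick illustration (the sample values are my own):

```python
import numpy as np

# A tiny negative value rounds to -0.0: the magnitude is gone,
# but the sign bit survives and distinguishes it from a true +0.
vals = np.array([-2.3e-16, 4.1e-17, 0.874, 1.002])
rounded = np.round(vals, 3)
print(rounded)  # the first entry displays as -0.
```

Both near-zero entries compare equal to zero, yet `np.signbit` still tells them apart, which is all the heuristic needs.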
On the left, we have an attempt to classify with 8 different groupings, which has been reduced to two using the eigenvalue split.
However, in many cases we may have constant change throughout, or all zeros (values below the threshold). When this occurs, we can take the split between valued and zero-valued numbers; for very small numbers, this split occurs between -0 and +0.
Unfortunately, even auto-correcting classifiers suffer from overfitting. In the case of a large number of clusters for a small system, we cannot draw a clear enough distinction between the eigenvalues and end up with an output of 7, which is still fewer than the original 20, though not as good as the example above.
In conclusion, we have shown that we can use the eigenvalues of the solved Laplacian to better tune the number of labels assigned within the spectral clustering method, and that we can apply this to the implementation within the Python scikit-learn package.
- Paper: https://arxiv.org/abs/0711.0189
- How to contribute to sklearn: https://scikit-learn.org/stable/developers/contributing.html
- Article on which this is based: https://towardsdatascience.com/spectral-graph-clustering-and-optimal-number-of-clusters-estimation-32704189afbe