Adding self-adjusting label size to SpectralCluster in scikit-learn.

Image for post
Image for post

Which clustering method?

There are many different clustering methods ranging from the overused k-means clustering to density-based non-parametric algorithms such as DBSCAN and OPTICS — just see the list provided by scikit-learn.

Image for post
Image for post
The sample dataset.

Spectral Clustering - how does it work?

Spectral clustering starts by building a nearest neighbour graph for all the inputs. We then take the adjaceny† matrix from this graph and use it to calculate the Laplacian (the degree scaled diagonal — the adjacency matrix).

Image for post
Image for post
Nearest 10 neighbour graph from the sample data.

How many clusters?

In order to determine the optimum number of clusters, we apply the methodology described by Ulrike in his “Tutorial on Spectral Clustering”: https://arxiv.org/pdf/0711.0189.pdf

  1. If there exist values above the threshold, we look for the greatest change using a rolling difference:int(argmax(numpy.diff(lambdas))+1.)
  2. In cases where the numbers are uniform (no clear change), we can simply use the first non-zero value (similar to 1.)
  3. Finally, if none of the above exists, we continue with the original cluster number and do not reduce it any further.
n_auto : bool, optional, default=False
If not False (or zero), an eigenvalue evaluation is undergone to
better fit/reduce the maximum number of clusters based on the
work in `A Tutorial on Spectral Clustering, 2007 Ulrike von
Luxburg
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.165.9323`.

Code

The updated code containing my changes can be found under the following branch (this shall be cleaned up at some point in the future) :: https://github.com/wolfiex/scikit-learn/tree/spectral_autocluster

Analysis and testing

Greatest Change

When using the eigengap heuristic (change point analysis to find the greatest gap in the eigenvalues), the break will often be very apparent in clearly segregated data — and most likely visible within the diagonal of the ordered affinity matrix. Here we have a look at the different set of eigenvalue distributions we may be faced with for a set of case examples. The printed values have been rounded to 3dp, with “-0” signifying a small number >0.

Image for post
Image for post
eigenvalues [-0.002 -0.002 -0.001 -0.001 -0.001 -0.001 -0.000 -0.000]

Eigenvalue Split

On the left, we have an attempt to classify with 8 different groupings, which has been reduced to two using the eigenvalue split.

Image for post
Image for post
eigenvalues [-0.000 -0.000 -0.000 0.000 0.000] @ threshold 1e-3

All zeros

However, in many cases, we may have constant change throughout or all zeros (numbers < threshold). When this occurs, we can take the split between valued and zero-valued numbers. In the case of really small numbers, this occurs between -0 and +0.

Image for post
Image for post
eigenvalues [-0.008 -0.008 -0.007 -0.006 -0.006 -0.005 -0.005 -0.003 -0.003 -0.002–0.002 -0.002 -0.002 -0.001 -0.001 -0.001 -0.000 -0.000 -0.000 0.000] @ threshold 1e-4

Overfitting

Unfortunately, even auto-correcting classifiers suffer from overfitting. In the case of large numbers of clusters for a small system, we cannot draw a clear enough distinction between the eigenvalues and result with an output of 7, which is still « than the original 20, however not as good as the example above.

Conclusions

In conclusion, we have shown that we may use the eigenvalues of the solved Laplacian to better tune the number of labels assigned within the spectral clustering method and that we may apply this to the implementation within the python scikit-learn package.

References

Research Software Engineer specialising in Data Visualisation with a touch of HPC. — PhD in Atmospheric Chemistry and Masters in Theoretical Physics.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store