Extracting Similar Data: Using K-D Trees to Efficiently Locate Nearest Neighbors
Finding points of similar location using k-dimensional trees in large datasets — in this case geospatial (latlon) centroids
Introduction
Finding data of similar magnitude has many applications in the data science community. Although similarity metrics or vector clustering may in some cases be used, we look at near-uniformly spaced values where doing so can take a long time to compute.
Why use K-Dimensional Trees?
Traditionally if we want to find which point is closest to another, we would calculate the distance between that point and every other, and then select the smallest one — this is the brute force method.
Complexity of the Brute Force Algorithm
As an example, we use two datasets of size len(n) = 34378
and len(m) = 175434
. The aim is to find the nearest neighbor from n
which corresponds to each element in m
. Ignoring any actual distance calculations and sorting, the minimum complexity of this task is O(n*m).
— This means that for each value of m
we must calculate the distance to each point in n
as the bare minimum —…