Extracting Similar Data: Using K-D Trees to Efficiently Locate Nearest Neighbors
Finding points of similar location using k-dimensional trees in large datasets — in this case geospatial (latlon) centroids
Finding data of similar magnitude has many applications in the data science community. Although similarity metrics or vector clustering may in some cases be used, we look at near-uniformly spaced values where doing so can take a long time to compute.
Why use K-Dimensional Trees?
Traditionally if we want to find which point is closest to another, we would calculate the distance between that point and every other, and then select the smallest one — this is the brute force method.
Complexity of the Brute Force Algorithm
As an example, we use two datasets of size
len(n) = 34378 and
len(m) = 175434 . The aim is to find the nearest neighbor from
n which corresponds to each element in
m. Ignoring any actual distance calculations and sorting, the minimum complexity of this task is O(n*m).
— This means that for each value of
m we must calculate the distance to each point in
n as the bare minimum — giving us
6031070052 calculations. We then the sorting which at worst can also be O(n*n) and add any other calculation delays.
K dimensional trees on the other hand are constructed by iteratively splitting the
Xdimensional hyperplane into sets of two (around the median), and cycling through each dimension in order.
The process allows us to partition our data space and map it using a binary (decision) tree structure:
This means that we are now able to navigate the data space without having to visit every point in the comparison…