Extracting Similar Data: Using K-D Trees to Efficiently Locate Nearest Neighbors

Finding nearby points in large datasets using k-dimensional trees, in this case geospatial (lat/lon) centroids

Daniel Ellis Research

Introduction

Finding data of similar magnitude has many applications in data science. Although similarity metrics or vector clustering can sometimes be used, here we look at near-uniformly spaced values, where those approaches can take a long time to compute.

Why use K-Dimensional Trees?

Traditionally, if we want to find which point is closest to another, we calculate the distance between that point and every other point, then select the smallest. This is the brute-force method.
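As a minimal sketch of the brute-force approach (the `points` and `queries` arrays below are hypothetical stand-ins, treating lat/lon pairs as planar coordinates for simplicity):

```python
import numpy as np

# Hypothetical example data: 2-D (lat, lon) centroids drawn at random.
rng = np.random.default_rng(0)
points = rng.uniform(-90.0, 90.0, size=(1000, 2))   # the reference set, n
queries = rng.uniform(-90.0, 90.0, size=(50, 2))    # the points to match, m

# Brute force: for each query point, compute the distance to every
# reference point and keep the index of the smallest.
nearest = np.empty(len(queries), dtype=int)
for i, q in enumerate(queries):
    dists = np.linalg.norm(points - q, axis=1)  # n distances per query
    nearest[i] = np.argmin(dists)
```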

Complexity of the Brute Force Algorithm

As an example, we use two datasets of sizes len(n) = 34378 and len(m) = 175434. The aim is to find the nearest neighbor in n corresponding to each element of m. Ignoring any actual distance calculations and sorting, the minimum complexity of this task is O(n*m), which here is roughly 6 × 10^9 pairings.

This means that, as a bare minimum, for each value of m we must calculate the distance to every point in n.
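A k-d tree avoids most of those calculations: building the tree costs roughly O(n log n) once, and each lookup is then roughly O(log n) instead of O(n). Below is a minimal sketch using SciPy's scipy.spatial.cKDTree, continuing from the hypothetical arrays in the brute-force example above:

```python
import numpy as np
from scipy.spatial import cKDTree

# Build the tree once over the n reference points (~O(n log n)),
# then query all m points in a single call (~O(m log n) overall).
tree = cKDTree(points)
distances, indices = tree.query(queries, k=1)

# indices[i] is the row of `points` closest to queries[i]; with the
# default Euclidean metric it matches the brute-force result above.
assert np.array_equal(indices, nearest)
```

Because tree.query accepts all m query points at once, the full task drops from O(n*m) to approximately O((n + m) log n).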
