Member-only story

Extracting Similar Data: Using K-D Trees to Efficiently Locate Nearest Neighbors

Finding points of similar location using k-dimensional trees in large datasets — in this case geospatial (latlon) centroids

4 min readMar 22, 2022

Introduction

Finding data of similar magnitude has many applications in the data science community. Although similarity metrics or vector clustering may in some cases be used, we look at near-uniformly spaced values where doing so can take a long time to compute.

Why use K-Dimensional Trees?

Traditionally if we want to find which point is closest to another, we would calculate the distance between that point and every other, and then select the smallest one — this is the brute force method.

Complexity of the Brute Force Algorithm

As an example, we use two datasets of size len(n) = 34378 and len(m) = 175434 . The aim is to find the nearest neighbor from n which corresponds to each element in m. Ignoring any actual distance calculations and sorting, the minimum complexity of this task is O(n*m).

— This means that for each value of m we must calculate the distance to each point in n as the bare minimum —…

Extracting Similar Data: Using K-D Trees to Efficiently Locate Nearest Neighbors

Finding points of similar location using k-dimensional trees in large datasets — in this case geospatial (latlon) centroids

Introduction

Why use K-Dimensional Trees?

Complexity of the Brute Force Algorithm

Written by Daniel Ellis Research

No responses yet