Check out the live web app at https://ecowley-dbscan.herokuapp.com/

Credit to this page provided by scikit-learn for instruction on how to generate blobs.

Credit to the excellent wikipedia page for its pseudocode on the dbscan algorithm.

Introduction

The goal of this project was to implement the classic dbscan clustering algorithm. This was for educational purposes; the aim was to understand how it works, not to build something that can compete with the highly optimized sklearn.cluster.dbscan.

Vocabulary

  • eps is the maximum distance apart two points may be in order to be considered neighbors.
  • minpts is the minimum number of neighbors that a point must have in order for the group of points to be considered a cluster.

How it works

  1. Select a point that does not already have a label.
  2. Query for neighbors that are within the eps distance of the point
  3. If at least minpts are found, add a new label on the first point and put the neighbors in a set called neighbors. Else, label the point noise and go back to step 1.
  4. For every point in neighbors, add the same label, then query for neighbors and add those points to the set. Continue until all points in the set are handled.
  5. Repeat steps 1-4 until all points are labeled.

Time Complexity

DBSCAN is O(n^2) because it must find the distance between every two points in the dataset.

My implementation takes 4-7 seconds to run on a dataset of 700 points. sklearn’s implementation takes in the milliseconds to run. This is due to optimizations that I was unable to look into in the short amount of time I had for the project.