Implementing DBSCAN for stay-point clustering in Python
Implementing DBSCAN for stay-point clustering in Python requires converting GPS trajectories into radian coordinates, applying density-based spatial clustering with a geographic distance metric, and filtering resulting clusters by minimum temporal duration. The most reliable production approach uses scikit-learn’s DBSCAN with metric='haversine', algorithm='ball_tree', and a post-processing step that validates cluster dwell time against a configurable threshold (typically 5–15 minutes). This pipeline transforms noisy, irregularly sampled pings into semantically stable locations suitable for downstream mobility analytics.
Stay-point detection is a foundational preprocessing stage in Movement Pattern Extraction & Trajectory Analysis, where raw telemetry must be distilled into actionable spatial anchors. Traditional threshold-based methods (fixed radius + fixed time) fail under variable sampling rates, GPS drift, and urban canyon multipath errors. Density-based clustering adapts to local point concentration, making it the preferred choice for modern Stay-Point Detection Algorithms deployed in fleet telematics, ride-hailing routing, and pedestrian flow modeling.
Environment & Compatibility Notes
- Python: 3.9+ (the first version that ships zoneinfo in the standard library)
- Core Libraries: scikit-learn>=1.2, geopandas>=0.12, pandas>=1.5, numpy>=1.23
- Coordinate System: DBSCAN with metric='haversine' expects input in radians, ordered as [latitude, longitude], not decimal degrees. Earth radius is assumed to be 6,371,000 meters.
- Memory Scaling: A brute-force pairwise distance computation scales O(N²). Use algorithm='ball_tree' for N < 100k. For larger trajectories, chunk by device ID or time windows.
- Temporal Handling: Timestamps must be timezone-aware (UTC recommended) to avoid DST-induced duration miscalculations. See the official pandas time-zone handling guide for best practices.
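The unit handling described in the Coordinate System note can be checked by hand. The following is a minimal sketch using only the standard library; the helper names meters_to_radians and haversine_rad are illustrative and not part of scikit-learn.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius assumed throughout this article


def meters_to_radians(meters: float) -> float:
    """Convert a surface distance in meters to an angle in radians."""
    return meters / EARTH_RADIUS_M


def haversine_rad(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle angle in radians between two points given in radians."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


# Two points 0.001 degrees of latitude apart (roughly 111 m on the ground).
p1 = (math.radians(52.5200), math.radians(13.4050))
p2 = (math.radians(52.5210), math.radians(13.4050))

angle = haversine_rad(*p1, *p2)          # separation as an angle
distance_m = angle * EARTH_RADIUS_M      # same separation in meters
eps_rad = meters_to_radians(150.0)       # a 150 m threshold as an angle
within_eps = angle <= eps_rad
```

Multiplying the angle by the Earth radius recovers meters, which is a convenient way to sanity-check any eps value before handing it to DBSCAN.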
Production-Ready Implementation
The following snippet demonstrates a complete, production-ready pipeline. It takes a trajectory DataFrame, converts coordinates to radians, runs DBSCAN per device, filters clusters by dwell time, and returns structured stay points.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN


def detect_stay_points(
    trajectory_df: pd.DataFrame,
    eps_meters: float = 100.0,
    min_samples: int = 3,
    min_duration_minutes: float = 5.0
) -> pd.DataFrame:
    """
    Detect stay points from GPS trajectory using DBSCAN + temporal filtering.
    trajectory_df must contain: ['device_id', 'timestamp', 'lat', 'lon']
    """
    df = trajectory_df.copy()
    df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True)
    df = df.sort_values(['device_id', 'timestamp']).reset_index(drop=True)

    # Convert decimal degrees to radians for Haversine metric
    df['lat_rad'] = np.radians(df['lat'])
    df['lon_rad'] = np.radians(df['lon'])

    # Convert spatial threshold to radians (Earth radius in meters)
    EARTH_RADIUS_M = 6371000.0
    eps_rad = eps_meters / EARTH_RADIUS_M

    stay_points = []

    # Process each device independently to prevent cross-trajectory leakage
    for device_id, group in df.groupby('device_id'):
        coords = group[['lat_rad', 'lon_rad']].values

        # Initialize and fit DBSCAN
        # Reference: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
        db = DBSCAN(
            eps=eps_rad,
            min_samples=min_samples,
            metric='haversine',
            algorithm='ball_tree'
        ).fit(coords)

        group = group.copy()
        group['cluster'] = db.labels_

        # Filter out noise points (-1)
        clusters = group[group['cluster'] != -1]

        for cluster_id, cluster_data in clusters.groupby('cluster'):
            duration_min = (cluster_data['timestamp'].max() -
                            cluster_data['timestamp'].min()).total_seconds() / 60.0

            if duration_min >= min_duration_minutes:
                # Compute geographic centroid in decimal degrees
                centroid_lat = np.degrees(np.mean(cluster_data['lat_rad']))
                centroid_lon = np.degrees(np.mean(cluster_data['lon_rad']))

                stay_points.append({
                    'device_id': device_id,
                    'cluster_id': int(cluster_id),
                    'centroid_lat': round(centroid_lat, 6),
                    'centroid_lon': round(centroid_lon, 6),
                    'start_time': cluster_data['timestamp'].min(),
                    'end_time': cluster_data['timestamp'].max(),
                    'duration_minutes': round(duration_min, 2),
                    'point_count': len(cluster_data)
                })

    return pd.DataFrame(stay_points)
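To try the function without real telemetry, a small synthetic trajectory is enough. The sketch below builds one with the four required columns; the helper make_synthetic_trajectory, the device name, and the coordinates are all made up for illustration.

```python
import numpy as np
import pandas as pd


def make_synthetic_trajectory(seed: int = 0) -> pd.DataFrame:
    """Build a toy trajectory: a stop of about 9.5 minutes, then steady movement."""
    rng = np.random.default_rng(seed)
    start = pd.Timestamp("2024-01-01 08:00:00", tz="UTC")

    # 20 pings jittered by a few meters around one location.
    n_stop = 20
    jitter_deg = 5.0 / 111_000.0  # about 5 m expressed in degrees of latitude
    stop_lat = 40.7128 + rng.normal(0.0, jitter_deg, n_stop)
    stop_lon = -74.0060 + rng.normal(0.0, jitter_deg, n_stop)

    # 10 pings moving north in steps of 0.005 degrees (roughly 555 m each),
    # so consecutive moving pings are far more than 100 m apart.
    n_move = 10
    move_lat = 40.7128 + 0.005 * np.arange(1, n_move + 1)
    move_lon = np.full(n_move, -74.0060)

    # One ping every 30 seconds across the whole trajectory.
    n_total = n_stop + n_move
    timestamps = start + pd.to_timedelta(np.arange(n_total) * 30, unit="s")

    return pd.DataFrame({
        "device_id": "demo-device",
        "timestamp": timestamps,
        "lat": np.concatenate([stop_lat, move_lat]),
        "lon": np.concatenate([stop_lon, move_lon]),
    })


trajectory = make_synthetic_trajectory()
```

Passing trajectory to detect_stay_points with the default parameters should yield a single stay point for the stop, since the moving pings are spaced well beyond eps_meters and end up as noise.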
Step-by-Step Pipeline Breakdown
- Temporal Normalization: Raw GPS logs often arrive out of order or with mixed timezones. Sorting by device_id and timestamp guarantees chronological order within each device. Converting to UTC eliminates daylight saving time edge cases that corrupt duration calculations.
- Radian Conversion: The haversine metric operates on angles. np.radians() multiplies decimal degrees by π/180, which is the input scikit-learn expects. If degrees are passed unconverted, they are misread as radians, which inflates and distorts the computed distances relative to eps, so most or all points are typically labeled as noise.
- Spatial Clustering: DBSCAN treats a point as a core point when at least min_samples points (the point itself included) lie within eps of it, where eps is the radian equivalent of eps_meters, and grows clusters outward from those core points. The ball_tree algorithm accelerates the neighbor lookups and avoids brute-force O(N²) pairwise distance computation.
- Temporal Validation: Spatial density alone cannot distinguish between a traffic jam and a genuine stop. Filtering by min_duration_minutes removes transient congestion while preserving meaningful dwell events. Note that duration is measured from the first to the last ping in a cluster, so repeated visits to the same place are counted as one span.
- Centroid Aggregation: The output is the arithmetic mean of latitude and longitude, converted back to decimal degrees. This approximation is adequate for clusters a few hundred meters across (it breaks down for clusters straddling the antimeridian) and provides a clean, human-readable anchor for mapping or geocoding downstream.
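One caveat of the temporal validation step is that DBSCAN here clusters on position only, so two short visits to the same place hours apart share a cluster label and their combined first-to-last span can pass the duration filter. A possible refinement, sketched below, is to split each cluster into separate visits wherever consecutive pings are far apart in time. The helper split_visits and its max_gap_minutes parameter are hypothetical and not part of the pipeline above or of scikit-learn.

```python
import pandas as pd


def split_visits(timestamps: pd.Series, max_gap_minutes: float = 30.0) -> pd.Series:
    """Label runs of timestamps; a new run starts after a gap above the threshold.

    `timestamps` is assumed to be sorted ascending.
    """
    gaps = timestamps.diff()  # first element is NaT, which compares as False below
    new_visit = gaps > pd.Timedelta(minutes=max_gap_minutes)
    return new_visit.cumsum().rename("visit_id")


# Timestamps of one spatial cluster that was visited twice, ten hours apart.
ts = pd.Series(pd.to_datetime([
    "2024-01-01 08:00", "2024-01-01 08:05", "2024-01-01 08:10",
    "2024-01-01 18:00", "2024-01-01 18:04",
], utc=True))

visit_ids = split_visits(ts)

# Dwell time per visit instead of per cluster.
grouped = ts.groupby(visit_ids)
durations_min = (grouped.max() - grouped.min()).dt.total_seconds() / 60.0
```

In this toy case the cluster-level span is just over ten hours, while the per-visit dwell times are 10 and 4 minutes, so only the first visit would survive a 5-minute threshold.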
Performance Scaling & Parameter Tuning
- Chunking Strategy: For datasets exceeding 100k points per device, memory pressure spikes during tree construction. Partition trajectories by calendar day or fixed 4-hour windows before clustering, then merge overlapping stay points post-processing.
- eps Calibration: Start with 50–150 meters for pedestrian/urban routing, and 200–500 meters for highway/fleet tracking. Validate against known landmarks (e.g., parking garages, transit hubs) to calibrate spatial tolerance.
- min_samples Trade-offs: Lower values (2–3) capture brief stops but increase false positives from GPS drift. Higher values (5–8) enforce stricter density, ideal for high-frequency sampling (>1 Hz).
- Noise Handling: Points labeled -1 are excluded from the stay-point output but should not be thrown away; they represent transit segments or isolated pings. Preserve them for route reconstruction or speed profiling.
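The chunking strategy can be sketched with a plain pandas groupby. The example below partitions pings by device and UTC calendar day; the helper iter_daily_chunks is an illustrative name, and the merge step for stay points that straddle a chunk boundary is left out.

```python
import pandas as pd


def iter_daily_chunks(df: pd.DataFrame):
    """Yield (device_id, day_start, chunk) per device and UTC calendar day.

    Assumes `df` has a 'device_id' column and a timezone-aware 'timestamp' column.
    """
    day = df["timestamp"].dt.floor("D").rename("day")
    for (device_id, day_start), chunk in df.groupby(["device_id", day], sort=True):
        yield device_id, day_start, chunk


pings = pd.DataFrame({
    "device_id": ["a", "a", "a", "b"],
    "timestamp": pd.to_datetime([
        "2024-01-01 23:50", "2024-01-02 00:10",
        "2024-01-02 09:00", "2024-01-01 12:00",
    ], utc=True),
    "lat": [48.8566, 48.8566, 48.8570, 51.5074],
    "lon": [2.3522, 2.3522, 2.3530, -0.1278],
})

# Number of pings that land in each (device, day) chunk.
chunk_sizes = {
    (device_id, day_start.date().isoformat()): len(chunk)
    for device_id, day_start, chunk in iter_daily_chunks(pings)
}
```

Device a's first two pings share a location but fall on either side of midnight, so they land in different chunks. This is the situation in which stay points from adjacent chunks need to be merged after clustering.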
Validation Checklist
Before deploying to production, verify:
- All timestamps are UTC-aware and monotonically increasing per device
- lat/lon columns contain no NaN or out-of-bounds values (-90 to 90, -180 to 180)
- eps is converted to radians before passing to DBSCAN
- Output DataFrame contains only clusters meeting both spatial and temporal thresholds
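- Centroids fall within the expected geographic bounds of the deployment area (a sanity check against swapped lat/lon columns; a mean always lies inside the convex hull of its own source points, so comparing against that hull would not catch the error)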
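The input-side items of this checklist can be automated. The following is a minimal sketch, assuming the same four input columns as the pipeline; the function name validate_trajectory and its message strings are illustrative.

```python
import pandas as pd


def validate_trajectory(df: pd.DataFrame) -> list:
    """Return a list of problem descriptions; an empty list means every check passed."""
    problems = []

    # Timestamps should carry a timezone (UTC recommended).
    if not isinstance(df["timestamp"].dtype, pd.DatetimeTZDtype):
        problems.append("timestamp column is not timezone-aware")

    # Timestamps should be non-decreasing within each device.
    for device_id, group in df.groupby("device_id"):
        if not group["timestamp"].is_monotonic_increasing:
            problems.append(f"timestamps out of order for device {device_id}")

    # Coordinates must be present and within valid geographic ranges.
    if df[["lat", "lon"]].isna().any().any():
        problems.append("lat/lon contain NaN")
    if not df["lat"].dropna().between(-90, 90).all():
        problems.append("lat outside [-90, 90]")
    if not df["lon"].dropna().between(-180, 180).all():
        problems.append("lon outside [-180, 180]")

    return problems


good = pd.DataFrame({
    "device_id": ["a", "a"],
    "timestamp": pd.to_datetime(["2024-01-01 08:00", "2024-01-01 08:05"], utc=True),
    "lat": [52.5200, 52.5201],
    "lon": [13.4050, 13.4051],
})

# Same data with the timestamps reversed and one impossible latitude.
bad = good.assign(
    timestamp=good["timestamp"].iloc[::-1].to_numpy(),
    lat=[52.5200, 95.0],
)

good_problems = validate_trajectory(good)
bad_problems = validate_trajectory(bad)
```

Note that detect_stay_points sorts its input, so the ordering check matters mainly for code that consumes the raw trajectory directly.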
This pipeline delivers deterministic, scalable stay-point extraction that integrates cleanly with mobility data stacks. By decoupling spatial density from temporal validation, it adapts to irregular sampling while maintaining geographic precision.