Why must coordinates be in radians for scikit-learn's Haversine metric?

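scikit-learn's Haversine implementation expects [latitude, longitude] pairs in radians and returns distances in radians. If you pass decimal degrees, each degree is read as a radian, so separations come out roughly 57× (180/π) too large relative to a correctly converted eps and most points are labelled noise. The opposite mistake, leaving eps in degrees or metres against radian coordinates, inflates the neighbourhood and tends to produce one giant cluster.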

What eps value should I start with for urban pedestrian trajectories?

50–100 metres is a safe starting range for walking-speed data. Convert to radians: eps_rad = 75 / 6_371_000 ≈ 1.18e-5. Validate against known dwell locations such as transit stops or building entrances.
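
The conversion above can be wrapped in a pair of small helpers. This is a minimal sketch, assuming the same mean Earth radius of 6,371,000 m; the function names are illustrative and not part of scikit-learn.

```python
EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def meters_to_radians(distance_m: float) -> float:
    """Convert a ground distance in metres to an angle in radians."""
    return distance_m / EARTH_RADIUS_M


def radians_to_meters(angle_rad: float) -> float:
    """Inverse conversion, handy for reporting an eps back in metres."""
    return angle_rad * EARTH_RADIUS_M


# Candidate eps values across the suggested 50 to 100 m pedestrian range.
candidates = {m: meters_to_radians(m) for m in (50, 75, 100)}

for metres, radians in candidates.items():
    print(f"eps = {metres} m -> {radians:.3e} rad")
```

Keeping both directions available makes it easier to log the eps actually used in metres when comparing runs against known dwell locations.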

How do I avoid cross-trajectory leakage when processing multiple devices?

Group by device_id and run a separate DBSCAN fit for each group. Fitting a single model across all devices merges geographically co-located pings from different vehicles or people into the same cluster, corrupting individual dwell attribution.

Should I discard noise points (label -1) after clustering?

No — preserve them. Noise points represent transit segments, isolated GPS pings during movement, or genuine outliers. They are the raw material for route reconstruction and speed profiling downstream.
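
One way to keep noise points usable is to turn the time-ordered label sequence into alternating stay and movement runs. The sketch below assumes the labels for a single device are already sorted by timestamp; `split_stays_and_moves` and the example labels are hypothetical.

```python
from itertools import groupby


def split_stays_and_moves(labels):
    """Group consecutive, time-ordered pings into stay and movement runs.

    labels: DBSCAN labels for one device, in timestamp order.
    Returns a list of (kind, start_index, end_index) with inclusive indices,
    where kind is "move" for runs of -1 and "stay" for any other label.
    """
    segments = []
    index = 0
    for label, run in groupby(labels):
        length = len(list(run))
        kind = "move" if label == -1 else "stay"
        segments.append((kind, index, index + length - 1))
        index += length
    return segments


# Hypothetical labels: a stop, a transit leg, a second stop, trailing noise.
example_labels = [0, 0, 0, -1, -1, 1, 1, 1, 1, -1]
segments = split_stays_and_moves(example_labels)
```

Each "move" run can then be sliced out of the original rows and passed to route reconstruction or speed profiling.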

How does DBSCAN stay-point detection compare to the fixed-radius threshold method?

Fixed-radius methods fail under variable sampling rates: a 1 Hz feed and a 0.1 Hz feed covering the same physical stop yield wildly different point counts, so a point-count threshold tuned for one rate systematically misses stops at the other. DBSCAN clusters by density-reachability and, paired with a low min_samples and a dwell-time gate, is more robust to rate variation. The trade-off is an additional eps-to-radians conversion step that fixed-radius methods skip.

Implementing DBSCAN for stay-point clustering in Python

The most reliable way to extract dwell locations from raw GPS trajectories is to run scikit-learn’s DBSCAN with metric='haversine' and algorithm='ball_tree', then apply a second-pass temporal filter to discard transient stops. This two-stage approach — spatial density first, dwell-time gating second — separates traffic congestion from genuine pauses and handles the irregular sampling rates common in fleet and pedestrian telemetry.

Why density-based methods outperform fixed-radius thresholds

Fixed-radius stay-point detectors fail in the same way a stopped clock does: they are right when conditions match the design assumption and wrong the rest of the time. A 100-metre radius with a 5-point minimum works at 1 Hz sampling but silently misses stops captured at 0.1 Hz — the same physical location simply does not accumulate enough pings. This is a foundational issue described in the Stay-Point Detection Algorithms overview.
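
The arithmetic behind that failure is easy to check. The sketch below assumes perfectly regular sampling and a hypothetical 30-second stop; `expected_pings` is an illustrative helper.

```python
import math


def expected_pings(stop_seconds: float, rate_hz: float) -> int:
    """Largest number of pings a stop can log under perfectly regular sampling."""
    return math.floor(stop_seconds * rate_hz) + 1


MIN_POINTS = 5      # the 5-point minimum from the example above
STOP_SECONDS = 30   # hypothetical short stop, e.g. a kerbside pickup

for rate_hz in (1.0, 0.1):
    n = expected_pings(STOP_SECONDS, rate_hz)
    verdict = "detected" if n >= MIN_POINTS else "missed"
    print(f"{rate_hz} Hz: {n} pings -> {verdict}")
```

The same physical stop clears the threshold comfortably at 1 Hz and falls short at 0.1 Hz, which is why a point-count minimum has to be chosen with the slowest feed in mind.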

DBSCAN sidesteps much of this by clustering on density-reachability rather than on a count per visit: any ping with at least min_samples neighbours within eps becomes a core point, and clusters grow through chains of core points regardless of ping order or elapsed time. With a low min_samples and the dwell-time gate described below, it tolerates rate variation across devices and time windows. Upstream, the trajectory data should already have passed through GPS precision error handling: uncorrected multipath drift inflates apparent spatial spread and causes the algorithm to split a single real stop into several nearby micro-clusters.

This page is one step of the broader Movement Pattern Extraction & Trajectory Analysis pipeline. Before running the code below, ensure timestamps are timezone-aware; the time-series synchronization strategies page covers the UTC-normalization pattern in detail.
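
To illustrate why timezone-aware timestamps matter, here is a standard-library sketch, assuming ISO 8601 strings that carry a UTC offset. The `to_utc` helper and the sample values are hypothetical, and the implementation further down relies on pandas for the same job.

```python
from datetime import datetime, timezone


def to_utc(iso_timestamp: str) -> datetime:
    """Parse an ISO 8601 string that carries a UTC offset and convert it to UTC."""
    parsed = datetime.fromisoformat(iso_timestamp)
    if parsed.tzinfo is None:
        # Refuse naive values rather than guess which zone they were recorded in.
        raise ValueError(f"timestamp has no UTC offset: {iso_timestamp!r}")
    return parsed.astimezone(timezone.utc)


# Hypothetical pings from a device that crosses from UTC+1 into UTC+0 mid-stop.
raw = ["2024-03-01T10:55:00+01:00", "2024-03-01T10:05:00+00:00"]

wall_clock = [datetime.fromisoformat(ts).replace(tzinfo=None) for ts in raw]
utc = [to_utc(ts) for ts in raw]

# Wall-clock order is inverted; UTC order matches the order of recording.
print("wall clock in order:", wall_clock[0] < wall_clock[1])
print("UTC in order:", utc[0] < utc[1])
```

Sorting on the UTC values keeps the two pings ten minutes apart in the right order, whereas the local wall-clock values would place the second ping fifty minutes before the first.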

Pipeline overview

Four ordered steps convert a raw ping stream into validated stay points:

1. Normalise and sort: parse timestamps to UTC-aware datetime, sort by (device_id, timestamp), drop rows with missing or out-of-bounds coordinates.
2. Convert coordinates to radians: multiply decimal degrees by π/180; convert the eps distance in metres to radians by dividing by Earth’s radius.
3. Cluster spatially per device: fit DBSCAN independently on each device’s radian coordinate array; attach cluster labels to the rows.
4. Gate by dwell time: for each spatial cluster, compute the span from first to last ping and discard any cluster below the minimum-duration threshold.

Pipeline diagram

Production-ready implementation

The function below is complete and runnable. It handles empty DataFrames, missing columns, and the radian-conversion pitfall in a single self-contained unit.

PYTHON

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN


EARTH_RADIUS_M = 6_371_000.0
REQUIRED_COLS = {"device_id", "timestamp", "lat", "lon"}


def detect_stay_points(
    trajectory_df: pd.DataFrame,
    eps_meters: float = 100.0,
    min_samples: int = 3,
    min_duration_minutes: float = 5.0,
) -> pd.DataFrame:
    """
    Detect stay points from a GPS trajectory using DBSCAN + temporal gating.

    Parameters
    ----------
    trajectory_df : pd.DataFrame
        Must contain columns: device_id, timestamp, lat, lon.
        lat must be in [-90, 90], lon in [-180, 180].
    eps_meters : float
        Spatial neighbourhood radius in metres. Converted to radians internally.
    min_samples : int
        Minimum pings within eps to form a core point.
    min_duration_minutes : float
        Minimum dwell duration for a cluster to qualify as a stay point.

    Returns
    -------
    pd.DataFrame
        One row per stay point with device_id, cluster_id, centroid_lat,
        centroid_lon, start_time, end_time, duration_minutes, point_count.
    """
    missing = REQUIRED_COLS - set(trajectory_df.columns)
    if missing:
        raise ValueError(f"trajectory_df is missing columns: {missing}")

    if trajectory_df.empty:
        return pd.DataFrame(columns=[
            "device_id", "cluster_id", "centroid_lat", "centroid_lon",
            "start_time", "end_time", "duration_minutes", "point_count",
        ])

    df = trajectory_df.copy()

    # Step 1 — temporal normalisation
    # Timestamps must be UTC-aware; mixed-offset inputs are coerced to UTC here.
    # Note: with utc=True, timezone-naive values are assumed to already be UTC.
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)

    # Drop rows with out-of-bounds or missing coordinates before clustering.
    valid_mask = (
        df["lat"].notna() & df["lon"].notna()
        & df["lat"].between(-90, 90)
        & df["lon"].between(-180, 180)
    )
    df = df[valid_mask].sort_values(["device_id", "timestamp"]).reset_index(drop=True)

    # Step 2 — radian conversion (mandatory for Haversine metric)
    # scikit-learn's Haversine implementation expects [lat, lon] in radians.
    # Degrees passed as-is are read as radians, so separations come out
    # roughly 57× too large relative to eps_rad.
    df["lat_rad"] = np.radians(df["lat"])
    df["lon_rad"] = np.radians(df["lon"])
    eps_rad = eps_meters / EARTH_RADIUS_M

    stay_points: list[dict] = []

    # Step 3 — DBSCAN per device (prevents cross-trajectory label leakage)
    for device_id, group in df.groupby("device_id"):
        if len(group) < min_samples:
            # Not enough pings to form even one core point; skip silently.
            continue

        coords = group[["lat_rad", "lon_rad"]].to_numpy()

        # ball_tree accelerates spherical neighbour lookups for N < ~100k;
        # for larger per-device arrays, pre-chunk by calendar day first.
        db = DBSCAN(
            eps=eps_rad,
            min_samples=min_samples,
            metric="haversine",
            algorithm="ball_tree",
        ).fit(coords)

        group = group.copy()
        group["cluster"] = db.labels_

        # Step 4 — dwell-time gating
        # Noise points (label -1) are intentionally excluded here but should
        # be preserved in the caller for route/speed analysis downstream.
        clustered = group[group["cluster"] != -1]

        for cluster_id, cluster_data in clustered.groupby("cluster"):
            duration_min = (
                cluster_data["timestamp"].max() - cluster_data["timestamp"].min()
            ).total_seconds() / 60.0

            if duration_min < min_duration_minutes:
                continue

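            # Centroid computed in radian space, then converted back to degrees.
            # An arithmetic mean is adequate at eps scale but is not valid for
            # a cluster that straddles the ±180° antimeridian.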
            centroid_lat = float(np.degrees(np.mean(cluster_data["lat_rad"])))
            centroid_lon = float(np.degrees(np.mean(cluster_data["lon_rad"])))

            stay_points.append({
                "device_id": device_id,
                "cluster_id": int(cluster_id),
                "centroid_lat": round(centroid_lat, 6),
                "centroid_lon": round(centroid_lon, 6),
                "start_time": cluster_data["timestamp"].min(),
                "end_time": cluster_data["timestamp"].max(),
                "duration_minutes": round(duration_min, 2),
                "point_count": len(cluster_data),
            })

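    # Pass the column list explicitly so an empty result keeps the same schema.
    return pd.DataFrame(stay_points, columns=[
        "device_id", "cluster_id", "centroid_lat", "centroid_lon",
        "start_time", "end_time", "duration_minutes", "point_count",
    ])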
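
One limitation of measuring dwell from first to last ping is that spatial clustering ignores time: two visits to the same place on the same day share a cluster, and the span covers the hours in between. A possible refinement, sketched here with the standard library only, splits a cluster's timestamps wherever the gap between consecutive pings exceeds a threshold. The `split_into_visits` helper and its 30-minute default are assumptions for illustration, not part of the function above.

```python
from datetime import datetime, timedelta, timezone


def split_into_visits(timestamps, max_gap_minutes=30.0):
    """Split one spatial cluster's timestamps into separate visits.

    A new visit starts whenever the gap between consecutive pings exceeds
    max_gap_minutes (a hypothetical threshold, to be tuned per dataset).
    Returns a list of (start, end, point_count) tuples.
    """
    ordered = sorted(timestamps)
    if not ordered:
        return []
    max_gap = timedelta(minutes=max_gap_minutes)
    visits = []
    start = previous = ordered[0]
    count = 1
    for current in ordered[1:]:
        if current - previous > max_gap:
            # Close the running visit and start a new one at this ping.
            visits.append((start, previous, count))
            start, count = current, 0
        previous = current
        count += 1
    visits.append((start, previous, count))
    return visits


# Hypothetical cluster: a morning visit and an evening visit to the same spot.
day = datetime(2024, 1, 1, tzinfo=timezone.utc)
pings = [day + timedelta(hours=h, minutes=m)
         for h, m in [(8, 0), (8, 5), (8, 10), (18, 0), (18, 4)]]
visits = split_into_visits(pings)
```

Applying the minimum-duration gate to each visit rather than to the whole cluster would keep the 10-minute morning stay and drop the 4-minute evening one, instead of reporting a single stay of about ten hours.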

Validation block

After calling detect_stay_points, run these checks before passing output downstream:

PYTHON

import pandas as pd

def validate_stay_points(sp: pd.DataFrame, min_duration: float = 5.0) -> None:
    # An empty result is valid but worth logging; there is nothing else to check.
    if sp.empty:
        print("Validation skipped: no stay points detected")
        return

    # Centroids must be in valid geographic bounds
    assert sp["centroid_lat"].between(-90, 90).all(), "centroid_lat out of bounds"
    assert sp["centroid_lon"].between(-180, 180).all(), "centroid_lon out of bounds"

    # All durations must meet the minimum threshold
    assert (sp["duration_minutes"] >= min_duration).all(), (
        f"Found clusters below {min_duration} min threshold"
    )

    # start_time must not be later than end_time for any row
    assert (sp["end_time"] >= sp["start_time"]).all(), "end_time before start_time"

    # point_count must be a positive integer
    assert (sp["point_count"] >= 1).all(), "Cluster with zero points"

    print(f"Validation passed: {len(sp)} stay points across "
          f"{sp['device_id'].nunique()} device(s)")

For integration tests, synthesise a known stop: place 5 pings within a 50-metre radius over 10 minutes, then assert the output contains exactly one row with duration_minutes >= 10 and point_count == 5.
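
A synthetic stop like that can be generated without any third-party packages. This is a sketch under stated assumptions: `synth_stop` is a hypothetical helper, the pings sit on a 40-metre circle so they stay inside the 50-metre radius, and the metre-to-degree conversion uses a small-offset approximation.

```python
import math
from datetime import datetime, timedelta, timezone

EARTH_RADIUS_M = 6_371_000.0


def synth_stop(center_lat, center_lon, n_pings=5, radius_m=40.0,
               total_minutes=10.0, device_id="test-device"):
    """Place n_pings evenly on a circle of radius_m around the centre,
    spaced evenly in time so first-to-last spans total_minutes."""
    start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    step = timedelta(minutes=total_minutes / (n_pings - 1))
    # Small-offset approximation: metres to degrees of latitude and longitude.
    dlat_per_m = math.degrees(1.0 / EARTH_RADIUS_M)
    dlon_per_m = dlat_per_m / math.cos(math.radians(center_lat))
    pings = []
    for i in range(n_pings):
        bearing = 2 * math.pi * i / n_pings
        pings.append({
            "device_id": device_id,
            "timestamp": start + step * i,
            "lat": center_lat + radius_m * math.cos(bearing) * dlat_per_m,
            "lon": center_lon + radius_m * math.sin(bearing) * dlon_per_m,
        })
    return pings


# Hypothetical centre point; any mid-latitude location works.
pings = synth_stop(52.52, 13.405)
```

Wrapping the result in `pd.DataFrame(pings)` and passing it to detect_stay_points with the default parameters is expected to yield a single row with point_count == 5 and duration_minutes == 10, since the widest gap between any two pings on a 40-metre circle is about 76 metres.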

Common mistakes and gotchas

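Forgetting radian conversion. Passing decimal degrees to DBSCAN(metric='haversine') means each degree is read as a radian, so separations look roughly 57× larger than they are and a correctly converted eps (100 m is about 1.57e-5 rad) leaves most points as noise. The reverse error is just as common: an eps that was never divided by Earth's radius. A value of eps=0.001 is 0.001 rad, about 0.057° or roughly 6.4 km, and can merge an entire urban dataset into a single cluster.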
Using iterrows to compute inter-point distances before clustering. This is the canonical performance anti-pattern. Let DBSCAN’s BallTree handle neighbour searches in compiled code; a row-by-row Python loop over 10k pings is typically orders of magnitude slower.
Running one DBSCAN fit across all devices. Co-located pings from different vehicles become the same cluster. Always group by device_id and fit independently.
Relying on wall-clock timestamps without UTC normalisation. A device switching timezone mid-journey (e.g., a cross-border truck) can have its pings reordered around the boundary, so first-to-last spans for clusters that straddle it come out wrong, and a naive subtraction of start_time from end_time can even go negative. The time-series synchronization strategies page covers this edge case with a concrete fix.
Discarding noise points (label -1). Noise rows are the movement segments between stays. They are the direct input to speed and acceleration profiling and to directionality and turn analysis — dropping them silently removes half the analytical signal.
Applying a single eps across all transport modes. A 100-metre radius that works for pedestrians produces massive merged clusters for highway vehicles. See the parameter guidance below.
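
The first item in this list can be checked numerically. The sketch below implements the Haversine formula on the unit sphere with the standard library and feeds it the same two hypothetical pings twice, once converted to radians and once left in degrees.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_rad(lat1, lon1, lat2, lon2):
    """Great-circle distance on the unit sphere; inputs and output in radians."""
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


# Two hypothetical pings about 100 m apart along a meridian (decimal degrees).
p1 = (48.8566, 2.3522)
p2 = (48.8575, 2.3522)

# Correct usage: convert every coordinate to radians first.
correct = haversine_rad(*(math.radians(v) for v in p1 + p2))
# The mistake: degrees passed straight through and read as radians.
wrong = haversine_rad(*(p1 + p2))

print(f"converted: {correct * EARTH_RADIUS_M:.1f} m")
print(f"unconverted: {wrong * EARTH_RADIUS_M:.1f} m")
print(f"ratio: {wrong / correct:.2f}")
```

For a pair separated only in latitude the inflation is exactly 180/π. When longitude also differs, the cosine terms are evaluated at the wrong angle as well, so the error is no longer a clean constant factor.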

Parameter tuning by transport mode

Mode	Typical GPS rate	Recommended `eps_meters` (m)	`min_samples`	`min_duration_minutes` (min)
Pedestrian / urban	1 Hz	50–100	3–5	3–5
Cyclist	1 Hz	75–150	3–4	2–5
Urban vehicle	0.2–1 Hz	100–200	3–5	2–5
Highway / long-haul	0.1–0.2 Hz	200–500	2–3	5–15
Maritime / aircraft	0.016–0.1 Hz	300–1,000	2–3	10–30

For large fleets with mixed modes, run the detection twice — once with loose parameters, once with strict — and merge by selecting the tighter cluster when they overlap by more than 80%.
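
The merge rule leaves room for interpretation, so the following is only one possible reading. It assumes overlap is measured as shared dwell time divided by the shorter of the two stays, that the "tighter" cluster is the one from the strict run, and that stays are compared within a single device. The names `overlap_fraction` and `merge_runs` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def overlap_fraction(a, b):
    """Shared time between two stays as a fraction of the shorter one."""
    shared = min(a["end_time"], b["end_time"]) - max(a["start_time"], b["start_time"])
    shorter = min(a["end_time"] - a["start_time"], b["end_time"] - b["start_time"])
    if shared <= timedelta(0) or shorter <= timedelta(0):
        return 0.0
    return shared / shorter


def merge_runs(loose, strict, threshold=0.8):
    """Prefer strict-run stays; keep a loose-run stay only if no strict stay
    overlaps it by more than the threshold."""
    merged = list(strict)
    for candidate in loose:
        if not any(overlap_fraction(candidate, s) > threshold for s in strict):
            merged.append(candidate)
    return sorted(merged, key=lambda stay: stay["start_time"])


def stay(h1, m1, h2, m2, source):
    """Build a hypothetical stay record for one device on a fixed day."""
    day = datetime(2024, 1, 1, tzinfo=timezone.utc)
    return {"start_time": day + timedelta(hours=h1, minutes=m1),
            "end_time": day + timedelta(hours=h2, minutes=m2),
            "source": source}


loose_run = [stay(10, 0, 10, 30, "loose"), stay(14, 0, 14, 10, "loose")]
strict_run = [stay(10, 2, 10, 28, "strict")]
merged = merge_runs(loose_run, strict_run)
```

In this example the strict stay replaces the overlapping loose one, while the afternoon stay found only by the loose run is kept. A spatial overlap measure could be substituted for the temporal one without changing the structure.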

Stay-Point Detection Algorithms — parent overview covering the full taxonomy of detection methods
Speed & Acceleration Profiling — what to do with the noise-point rows between stay points
Calculating instantaneous speed from discrete GPS points — prerequisite kinematic step before segment classification
Directionality & Turn Analysis — analysing heading changes within the movement segments DBSCAN labels as noise
GPS Precision & Error Handling — upstream cleaning that keeps multipath drift from splitting real stops into micro-clusters
