Stay-Point Detection Algorithms for Mobility Data Pipelines

Stay-point detection transforms raw GPS traces into semantically meaningful dwell locations — the geographic positions where a moving entity paused, visited, or waited long enough to matter.

Prerequisites

Before implementing any detection approach, your trajectory data must meet these baseline requirements. Gaps here produce silent failures that are difficult to trace downstream.

Required Python packages

geopandas >= 1.0: GeoDataFrame, spatial joins, projected distance, GeoSeries.union_all()
pandas >= 2.0: vectorized groupby, time deltas
shapely >= 2.0: geometry operations, union_all()
scikit-learn >= 1.3: DBSCAN with ball_tree / haversine
numpy >= 1.26: array broadcasting, np.radians
pyproj >= 3.6: CRS transformations (see coordinate reference system mapping)
h3 >= 4.0: latlng_to_cell, needed only for the grid-based approach

Data schema requirements

Your input GeoDataFrame must contain: trajectory_id (string/int), timestamp (datetime64[ns, UTC]), and geometry (Point, projected metric CRS). Optional but recommended: speed_ms (float, metres per second) and accuracy_m (float, GPS horizontal accuracy radius). The upstream pipeline stage that must complete first is GPS precision and error handling, which removes fix-quality outliers and enforces consistent accuracy_m values.

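CRS requirement: all distance-threshold methods require a projected CRS in metres, preferably a local UTM zone. EPSG:3857 (Web Mercator) passes an is_projected check, but it stretches distances by roughly 1/cos(latitude), so a 50 m threshold covers only about 32 m on the ground at 50° latitude; scale thresholds accordingly if you must use it. Never compute spatial distances in WGS84 degrees. The exceptions are DBSCAN with metric='haversine', which works on geographic radians directly, and H3 grid binning, which assigns cells from WGS84 latitude/longitude and computes no distances.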
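
As a rough illustration of the CRS choice, the sketch below derives the EPSG code of the standard WGS84 UTM zone for a longitude/latitude pair and the Web Mercator stretch factor at a given latitude. The helper names utm_epsg and web_mercator_scale are hypothetical, and the zone formula ignores the special zones around Norway and Svalbard.

```python
import math


def utm_epsg(lon_deg: float, lat_deg: float) -> int:
    """EPSG code of the standard WGS84 UTM zone containing a lon/lat point.

    Assumption: plain 6-degree zones, no Norway/Svalbard exceptions.
    """
    zone = int((lon_deg + 180.0) // 6) % 60 + 1
    return (32600 if lat_deg >= 0 else 32700) + zone


def web_mercator_scale(lat_deg: float) -> float:
    """Factor by which EPSG:3857 stretches ground distances at a latitude."""
    return 1.0 / math.cos(math.radians(lat_deg))


print(utm_epsg(-0.1276, 51.5072))     # London -> 32630
print(utm_epsg(151.2093, -33.8688))   # Sydney -> 32756
# Ground distance actually covered by a 50-unit threshold in EPSG:3857 at 50 degrees
print(round(50.0 / web_mercator_scale(50.0), 1))  # -> 32.1
```

GeoPandas also provides GeoDataFrame.estimate_utm_crs(), which picks a UTM CRS from the data's extent and is usually the simpler route when the whole dataset sits in one zone.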

Sampling rate: check the median inter-point interval across your fleet. Methods and threshold values differ substantially between 1 Hz telemetry and once-per-30-second pings. Irregular sampling requires time-series synchronization before thresholding.
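
A quick way to check the sampling rate is to compute the median gap between consecutive fixes per trajectory. The sketch below assumes a plain DataFrame with the trajectory_id and timestamp columns from the schema above; the function name median_interval_s and the sample pings are made up for illustration.

```python
import pandas as pd


def median_interval_s(df: pd.DataFrame) -> pd.Series:
    """Median seconds between consecutive fixes, indexed by trajectory_id."""
    ordered = df.sort_values(["trajectory_id", "timestamp"])
    # diff() restarts at each trajectory, so the first fix of each one is NaN
    deltas = ordered.groupby("trajectory_id")["timestamp"].diff().dt.total_seconds()
    return deltas.groupby(ordered["trajectory_id"]).median()


pings = pd.DataFrame({
    "trajectory_id": ["a"] * 4 + ["b"] * 3,
    "timestamp": pd.to_datetime(
        [
            "2025-01-01 00:00:00", "2025-01-01 00:00:01",
            "2025-01-01 00:00:02", "2025-01-01 00:00:03",
            "2025-01-01 00:00:00", "2025-01-01 00:00:30",
            "2025-01-01 00:01:00",
        ],
        utc=True,
    ),
})
intervals = median_interval_s(pings)
print(intervals)  # a -> 1.0 s (1 Hz), b -> 30.0 s (one ping per 30 s)
```

A wide spread between trajectories in this output is a hint that a single set of thresholds will not fit the whole fleet.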

Failure Mode Taxonomy

Failure	Mechanism	Typical Impact	Mitigation
CRS mismatch	Distance threshold applied to degree-units	Thresholds off by roughly 100,000× (1° of latitude is about 111 km); all points flagged or none	Assert `gdf.crs.is_projected` before any distance call
Temporal gap inflation	GPS dropout inside a stop inflates dwell duration	2-hour stop recorded as 14 hours	Enforce `max_gap_s`; split segment on gap
Multipath jitter	Reflected signal causes positional oscillation	Stay centroid migrates; multiple stay fragments from one stop	Pre-smooth with a median filter or Kalman smoother
Sampling rate mismatch	`min_pts` calibrated for 1 Hz applied to 1/30 Hz data	Genuine stops rejected (insufficient density)	Derive `min_pts` from observed sampling rate
Boundary discretization	Grid-cell stay spans two adjacent cells	One stop split into two short stays, both rejected by min duration	Buffer centroids or use hierarchical H3 resolution
Sequential thresholding fragility	Single outlier point breaks consecutive run	Long stop fragmented into multiple short segments	Use density approach or apply morphological closing to the stay flag
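
The morphological closing suggested for sequential thresholding fragility can be sketched with pandas alone. This is a minimal illustration under stated assumptions: close_stay_flag and max_hole are hypothetical names, and the input is the boolean stay flag of a single trajectory already sorted by timestamp.

```python
import pandas as pd


def close_stay_flag(in_stay: pd.Series, max_hole: int = 2) -> pd.Series:
    """Fill runs of False no longer than max_hole that sit between True runs."""
    flags = in_stay.astype(bool)
    if flags.empty:
        return flags
    # Label each run of identical values with its own id
    run_id = flags.astype(int).diff().ne(0).cumsum()
    # Length of the run each point belongs to
    run_len = flags.groupby(run_id).transform("size")
    # Leading and trailing runs are not enclosed by stay points on both sides
    interior = (run_id != run_id.iloc[0]) & (run_id != run_id.iloc[-1])
    hole = ~flags & (run_len <= max_hole) & interior
    return flags | hole


flags = pd.Series([True, True, False, True, True, False, False, False, True, False])
closed = close_stay_flag(flags, max_hole=2)
print(closed.tolist())
```

The single-point hole at position 2 is filled, while the three-point gap and the trailing False are left alone. For a fleet, apply the function per trajectory, for example through groupby("trajectory_id").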

Detection Pipeline Overview

The standard production pipeline for single-entity trajectories follows five ordered stages. Multi-entity (fleet) pipelines add a partitioning stage before stage 1 and a semantic join stage after stage 4.

Implementation Walkthrough

Time-Distance Thresholding (vectorized)

The deterministic approach: flag every point whose distance to the next point is within threshold D, then group consecutive flagged points into candidate stays and filter by minimum dwell duration T. This is O(N) per trajectory and fully vectorizable with pandas shift operations.

PYTHON

import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point


def detect_stay_points(
    gdf: gpd.GeoDataFrame,
    distance_m: float = 50.0,
    duration_s: float = 300.0,
    max_gap_s: float = 1800.0,
) -> gpd.GeoDataFrame:
    """
    Vectorized time-distance stay-point detection.

    Parameters
    ----------
    gdf : GeoDataFrame
        Must have columns: trajectory_id (str/int), timestamp (datetime64[ns]),
        geometry (Point). Must use a projected CRS (metres).
    distance_m : float
        Maximum distance (metres) between consecutive points to remain in a stay.
    duration_s : float
        Minimum total dwell duration (seconds) for a valid stay.
    max_gap_s : float
        If consecutive timestamps exceed this gap, force a segment break.

    Returns
    -------
    GeoDataFrame of stay centroids with columns:
        trajectory_id, stay_group, geometry (centroid Point),
        start_time, end_time, point_count, duration_s.
    """
    if gdf.empty:
        return gpd.GeoDataFrame(columns=[
            "trajectory_id", "stay_group", "geometry",
            "start_time", "end_time", "duration_s", "point_count",
        ])

    required = {"trajectory_id", "timestamp", "geometry"}
    missing = required - set(gdf.columns)
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    if not gdf.crs or not gdf.crs.is_projected:
        raise ValueError(
            "GeoDataFrame must use a projected CRS (metres). "
            "Reproject with gdf.to_crs(epsg=3857) or a local UTM zone first."
        )

    gdf = gdf.sort_values(["trajectory_id", "timestamp"]).copy()

    # Distance to the next point within each trajectory (metres)
    gdf["dist_next"] = gdf.groupby("trajectory_id", sort=False)["geometry"].transform(
        lambda x: x.distance(x.shift(-1))
    )

    # Time delta to the next point (seconds)
    gdf["time_next_s"] = (
        gdf.groupby("trajectory_id", sort=False)["timestamp"]
        .transform(lambda x: x.diff().shift(-1).dt.total_seconds())
    )

    # A point belongs to a stay if distance to next is within threshold AND
    # there is no large temporal gap (tunnel dropout, battery off, etc.)
    gdf["in_stay"] = (gdf["dist_next"] <= distance_m) & (gdf["time_next_s"] <= max_gap_s)

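    # Assign group ids with a running count of non-stay points: the counter
    # stays constant across a run of consecutive stay points and increments
    # at every break. The last point of each trajectory has a NaN distance,
    # which compares as False, so it always closes the current run. The final
    # fix of a stay is flagged by its hop to the *next* point, so each run
    # ends one sampling interval before the entity actually leaves.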
    gdf["in_stay"] = gdf["in_stay"].fillna(False)
    gdf["stay_group"] = (~gdf["in_stay"]).cumsum()

    # Only aggregate segments where at least one point was flagged in_stay
    stay_pts = gdf[gdf["in_stay"]].copy()

    if stay_pts.empty:
        return gpd.GeoDataFrame(columns=[
            "trajectory_id", "stay_group", "geometry",
            "start_time", "end_time", "duration_s", "point_count",
        ])

    stays = (
        stay_pts.groupby(["trajectory_id", "stay_group"], sort=False)
        .agg(
            centroid=("geometry", lambda x: x.union_all().centroid),
            start_time=("timestamp", "min"),
            end_time=("timestamp", "max"),
            point_count=("geometry", "count"),
        )
        .reset_index()
    )

    stays["duration_s"] = (
        stays["end_time"] - stays["start_time"]
    ).dt.total_seconds()

    stays = stays[stays["duration_s"] >= duration_s].copy()
    stays = stays.rename(columns={"centroid": "geometry"})
    return gpd.GeoDataFrame(stays, geometry="geometry", crs=gdf.crs)

The max_gap_s guard is the most commonly omitted production detail. Without it, a GPS signal lost during a long overnight park will stitch two separate visits into one inflated dwell record.
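
To see the guard in isolation, the sketch below replays the flag-and-group logic of detect_stay_points on a hand-made trace: a vehicle parks at 22:00, the receiver goes silent for eight hours, and three more fixes arrive at the same spot the next morning. The trace values and the helper name run_durations are invented for illustration.

```python
import pandas as pd

ts = pd.to_datetime(
    [
        "2025-01-01 22:00:00", "2025-01-01 22:05:00", "2025-01-01 22:10:00",
        "2025-01-02 06:10:00", "2025-01-02 06:15:00", "2025-01-02 06:20:00",
    ],
    utc=True,
)
# dist_next is the distance in metres to the following fix (None for the last one)
trace = pd.DataFrame({"timestamp": ts, "dist_next": [3.0, 2.0, 4.0, 1.0, 2.0, None]})
trace["time_next_s"] = trace["timestamp"].diff().shift(-1).dt.total_seconds()


def run_durations(trace: pd.DataFrame, distance_m: float, max_gap_s: float) -> list[float]:
    """Duration in seconds of every run of consecutive stay points."""
    in_stay = (trace["dist_next"] <= distance_m) & (trace["time_next_s"] <= max_gap_s)
    group = (~in_stay).cumsum()
    spans = trace[in_stay].groupby(group[in_stay])["timestamp"].agg(["min", "max"])
    return (spans["max"] - spans["min"]).dt.total_seconds().tolist()


print(run_durations(trace, 50.0, float("inf")))  # no guard -> [29700.0]
print(run_durations(trace, 50.0, 1800.0))        # guarded  -> [300.0, 300.0]
```

Without the guard the two visits merge into one 8 h 15 min record; with it they remain two separate runs that the minimum-duration filter then judges independently.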

DBSCAN (density-based, noise-tolerant)

Use this approach when your data has irregular sampling, multi-path jitter, or you are aggregating stays across multiple entities at once. DBSCAN does not require sorted input and naturally handles noise points. The trade-off: temporal ordering is lost, so entry and exit times must be reconstructed from the original timestamps after labelling.

PYTHON

from sklearn.cluster import DBSCAN


def detect_stay_points_dbscan(
    gdf: gpd.GeoDataFrame,
    eps_m: float = 50.0,
    min_samples: int = 10,
    duration_s: float = 300.0,
) -> gpd.GeoDataFrame:
    """
    DBSCAN stay-point detection on projected coordinates.

    eps_m is expressed in the same unit as gdf.crs (metres for projected CRS).
    min_samples should be calibrated to your median sampling rate:
        floor(duration_s / median_interval_s * 0.6)
    Noise points (label == -1) are excluded from output.
    """
    if gdf.empty:
        return gpd.GeoDataFrame(columns=[
            "trajectory_id", "cluster_label", "geometry",
            "start_time", "end_time", "duration_s", "point_count",
        ])

    if not gdf.crs or not gdf.crs.is_projected:
        raise ValueError(
            "Use a projected CRS for DBSCAN with Euclidean distance. "
            "For haversine-metric DBSCAN pass radians and set metric='haversine'."
        )

    coords = np.column_stack([gdf.geometry.x, gdf.geometry.y])
    labels = DBSCAN(
        eps=eps_m,
        min_samples=min_samples,
        algorithm="ball_tree",
        n_jobs=-1,
    ).fit_predict(coords)

    gdf = gdf.copy()
    gdf["cluster_label"] = labels

    # Exclude noise
    clustered = gdf[gdf["cluster_label"] >= 0]

    if clustered.empty:
        return gpd.GeoDataFrame(columns=[
            "trajectory_id", "cluster_label", "geometry",
            "start_time", "end_time", "duration_s", "point_count",
        ])

    stays = (
        clustered.groupby(["trajectory_id", "cluster_label"], sort=False)
        .agg(
            centroid=("geometry", lambda x: x.union_all().centroid),
            start_time=("timestamp", "min"),
            end_time=("timestamp", "max"),
            point_count=("geometry", "count"),
        )
        .reset_index()
    )

    stays["duration_s"] = (
        stays["end_time"] - stays["start_time"]
    ).dt.total_seconds()

    stays = stays[stays["duration_s"] >= duration_s].copy()
    stays = stays.rename(columns={"centroid": "geometry"})
    return gpd.GeoDataFrame(stays, geometry="geometry", crs=gdf.crs)

For a full parameter-tuning walkthrough of the DBSCAN variant — including elbow-method eps selection and the k-distance plot — see Implementing DBSCAN for stay-point clustering in Python.
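
One consequence of losing temporal order deserves a concrete example. detect_stay_points_dbscan takes the minimum and maximum timestamp of each cluster, so an entity that returns to the same place on two days would be reported as one stay spanning both visits. A possible remedy, sketched here with a hypothetical helper split_cluster_visits and an assumed max_gap_s of 1800 s, is to cut each cluster into time-contiguous visits before measuring duration.

```python
import pandas as pd


def split_cluster_visits(points: pd.DataFrame, max_gap_s: float = 1800.0) -> pd.DataFrame:
    """Split each (trajectory_id, cluster_label) group into time-contiguous visits."""
    keys = ["trajectory_id", "cluster_label"]
    pts = (
        points[points["cluster_label"] >= 0]        # drop DBSCAN noise
        .sort_values(keys + ["timestamp"])
        .copy()
    )
    gap_s = pts.groupby(keys)["timestamp"].diff().dt.total_seconds()
    # The first fix of every group has a NaN gap, so it also opens a visit
    pts["visit_id"] = (gap_s.isna() | (gap_s > max_gap_s)).cumsum()
    visits = (
        pts.groupby(keys + ["visit_id"])
        .agg(
            start_time=("timestamp", "min"),
            end_time=("timestamp", "max"),
            point_count=("timestamp", "count"),
        )
        .reset_index()
    )
    visits["duration_s"] = (visits["end_time"] - visits["start_time"]).dt.total_seconds()
    return visits


day1 = pd.date_range("2025-01-01 18:00", periods=3, freq="5min", tz="UTC")
day2 = pd.date_range("2025-01-02 18:00", periods=5, freq="5min", tz="UTC")
labelled = pd.DataFrame({
    "trajectory_id": "v1",
    "cluster_label": 0,
    "timestamp": day1.append(day2),
})
visits = split_cluster_visits(labelled)
print(visits[["visit_id", "duration_s", "point_count"]])
```

The single cluster yields two visits of 600 s and 1200 s instead of one record of more than 24 hours. The minimum-duration filter should then be applied per visit rather than per cluster.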

Mathematical and Geometric Grounding

The time-distance method tests the following condition at each point index i:

TEXT

dist(p_i, p_{i+1}) ≤ D  AND  Δt(p_i, p_{i+1}) ≤ max_gap

Consecutive points satisfying this form a run. A run qualifies as a stay if:

TEXT

t(p_{last}) − t(p_{first}) ≥ T

The centroid of the stay is the arithmetic mean of all points in the run:

TEXT

centroid = ( mean(x_i), mean(y_i) )

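Using union_all().centroid (GeoPandas 1.x on Shapely 2.x) gives the same result whenever the run contains no repeated coordinates, and it runs as one vectorized call per group, which is substantially faster than building a MultiPoint row by row. Because a union merges identical points, a receiver that reports exactly the same fix many times contributes that position only once; average the x and y columns directly if you need a centroid weighted by every fix.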

For DBSCAN, the radius eps in metres maps directly to the neighbourhood definition in metric space. The ball_tree algorithm organises points into a binary tree of nested hyperspheres, which reduces the average cost of a neighbourhood query from O(N) to roughly O(log N) and keeps fleet-scale runs tractable.
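
With metric='haversine' the mapping is no longer direct: coordinates go in as (latitude, longitude) in radians and distances come out as angles, so eps has to be an angle too. The sketch below shows the conversion and a reference haversine distance; the mean Earth radius constant and the helper names are assumptions made for this example.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius in metres


def eps_radians(eps_m: float) -> float:
    """Convert a ground distance in metres to the angular eps haversine DBSCAN expects."""
    return eps_m / EARTH_RADIUS_M


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


print(eps_radians(50.0))                 # a 50 m radius expressed in radians
print(haversine_m(0.0, 0.0, 0.0, 1.0))   # one degree of longitude at the equator, in metres
```

Passing eps=50 to a haversine-metric DBSCAN would describe a radius far larger than the planet, which is the degree-versus-metre failure from the taxonomy in another guise.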

Calibration and Parameter Tuning

Threshold values that work well for pedestrian traces will massively over-detect for freight vehicles and miss stays entirely for cyclists on 5-second ping intervals. Use this table as a starting point, then validate against ground-truth stop records:

Transport mode	Distance `D` (m)	Min duration `T` (min)	Max gap (min; multiply by 60 for `max_gap_s`)	DBSCAN `eps` (m)	Notes
Pedestrian	30–50	2–5	15	30–50	Absorb pedestrian oscillation and queuing
Cyclist	40–60	3–7	10	40–60	Wider for bike-parking imprecision
Passenger vehicle	50–100	3–10	30	60–100	Filter traffic-light halts; widen for parking drift
Light commercial	80–120	5–15	30	80–120	Allow for loading-bay proximity scatter
Heavy logistics / HGV	100–150	10–30	60	100–150	Long structured stops, wide GPS scatter in depots
Rail / tram (on-vehicle)	150–300	1–3	5	150–300	Platform dwell; tight duration, wide radius
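
One way to carry the table into code is a small preset map that converts the minute columns into the second-based parameters the detection functions take. This is only a sketch: the StayThresholds class, the preset helper and the mode keys are hypothetical, and taking the lower bound of each range is an arbitrary starting choice to be replaced by validated values.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StayThresholds:
    distance_m: float
    duration_s: float
    max_gap_s: float


def preset(distance_m: float, duration_min: float, max_gap_min: float) -> StayThresholds:
    """Build thresholds from table units (metres, minutes, minutes)."""
    return StayThresholds(float(distance_m), duration_min * 60.0, max_gap_min * 60.0)


# Assumption: lower bound of each range in the calibration table
PRESETS = {
    "pedestrian": preset(30, 2, 15),
    "cyclist": preset(40, 3, 10),
    "passenger_vehicle": preset(50, 3, 30),
    "light_commercial": preset(80, 5, 30),
    "heavy_logistics": preset(100, 10, 60),
    "rail_tram": preset(150, 1, 5),
}

print(asdict(PRESETS["passenger_vehicle"]))
```

Because the field names match the keyword arguments of detect_stay_points, a preset can be unpacked into the call with ** after converting it with asdict.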

Deriving min_samples for DBSCAN from sampling rate:

PYTHON

def min_samples_for_rate(
    min_duration_s: float,
    median_interval_s: float,
    loss_factor: float = 0.6,
) -> int:
    """
    Estimate min_samples so that a stop of min_duration_s can form a cluster
    even with up to (1 - loss_factor) * 100% of points missing.
    """
    expected_points = min_duration_s / median_interval_s
    return max(2, int(np.floor(expected_points * loss_factor)))

Pre-processing quality also affects calibration. If your pipeline applies GPS drift handling upstream, you can use tighter thresholds. Unsmoothed telemetry typically requires 20–30% wider radii to avoid fragmenting genuine stays.

Grid-Based (H3 / S2) Approach

Grid methods bin points into cells of a spatial index and flag cells whose total dwell time exceeds a threshold. They are the right choice for multi-day, multi-entity fleet summaries where you need aggregated visit counts, not individual stop records:

PYTHON

import h3


def grid_stay_aggregation(
    gdf: gpd.GeoDataFrame,
    resolution: int = 9,
    min_duration_s: float = 300.0,
) -> pd.DataFrame:
    """
    H3 grid-based stay aggregation.
    gdf must be in WGS84 (EPSG:4326) — H3 operates on lat/lng degrees.
    resolution=9 gives ~0.1 km² hexagons, suitable for urban vehicle stops.
    Returns a DataFrame of (h3_index, total_duration_s, visit_count).
    """
    if gdf.crs and gdf.crs.to_epsg() != 4326:
        gdf = gdf.to_crs(epsg=4326)

    gdf = gdf.sort_values(["trajectory_id", "timestamp"]).copy()

    # Assign H3 cell to each point
    gdf["h3_index"] = gdf.apply(
        lambda row: h3.latlng_to_cell(
            row.geometry.y, row.geometry.x, resolution
        ),
        axis=1,
    )

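    # Compute per-trajectory dwell time per cell
    records = []
    for tid, traj in gdf.groupby("trajectory_id", sort=False):
        traj = traj.sort_values("timestamp").copy()
        traj["time_next_s"] = traj["timestamp"].diff().shift(-1).dt.total_seconds()
        # A visit starts whenever the cell differs from the previous fix's cell
        traj["new_visit"] = traj["h3_index"] != traj["h3_index"].shift()
        for h3_idx, cell_pts in traj.groupby("h3_index", sort=False):
            # Dwell sums the time to the next fix over every visit to the cell,
            # so it includes the hop out of the cell and has no max-gap guard.
            dwell = cell_pts["time_next_s"].sum()
            if dwell >= min_duration_s:
                records.append({
                    "trajectory_id": tid,
                    "h3_index": h3_idx,
                    "total_duration_s": dwell,
                    # Count entries into the cell, not raw points
                    "visit_count": int(cell_pts["new_visit"].sum()),
                })

    return pd.DataFrame(
        records,
        columns=["trajectory_id", "h3_index", "total_duration_s", "visit_count"],
    )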

The boundary discretization problem mentioned in the failure table above becomes more visible as cells shrink relative to GPS scatter. If you see stay records with unusually short durations near cell boundaries, apply a one-ring buffer: include dwell time from the six immediate H3 neighbours when computing each cell's total. Moving to resolution 10 (cell area ~0.015 km²) sharpens localization but makes boundary splits more likely, so pair a finer resolution with the one-ring buffer rather than relying on it alone.

Integration and Compatibility

Stay-point output feeds naturally into several adjacent analysis pipelines:

Kinematic profiling. Segments directly before and after a detected stay exhibit characteristic approach and departure acceleration profiles. Correlating dwell duration with these kinematics in speed and acceleration profiling distinguishes planned stops (smooth deceleration, full stop, smooth departure) from forced halts (abrupt braking, idling, erratic restart).

Directionality and routing. Stay centroids define the natural anchor points for route reconstruction. When you combine them with bearing data and angular velocity, the directionality and turn analysis pipeline can reliably identify facility ingress/egress patterns, U-turns associated with missed delivery addresses, and parking-manoeuvre signatures.

Temporal aggregation. Stay records are point-in-time events suitable for binning into hourly or daily arrival distributions. Dynamic time binning strategies can surface peak arrival windows per facility, enabling capacity planning without manual log review.

Semantic enrichment. Join stay_centroids against POI layers or zone boundaries using a spatial join. Set the predicate to within for strict containment or dwithin with a tolerance (e.g., 30 m) for noisy centroid positions. Optimizing spatial joins for trajectory-to-zone matching covers the STRtree and partitioned join patterns that keep this step sub-second even at fleet scale.

Downsampled input. High-frequency 10 Hz telemetry does not improve stay detection quality and significantly increases compute cost. Downsampling GPS tracks to 1 Hz or 0.5 Hz before detection reduces memory usage by an order of magnitude with negligible impact on centroid accuracy.
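
A simple form of that downsampling keeps the first fix per trajectory in each time bucket. The sketch below is one possible approach under stated assumptions, not the method of the linked guide: the function name downsample is hypothetical, and it expects trajectory_id and timestamp columns.

```python
import pandas as pd


def downsample(df: pd.DataFrame, period: str = "1s") -> pd.DataFrame:
    """Keep the first fix per trajectory in each time bucket of length `period`."""
    ordered = df.sort_values(["trajectory_id", "timestamp"])
    bucket = ordered["timestamp"].dt.floor(period)
    # duplicated() marks every fix after the first one in the same bucket
    repeat = ordered.assign(_bucket=bucket).duplicated(["trajectory_id", "_bucket"])
    return ordered[~repeat].reset_index(drop=True)


raw = pd.DataFrame({
    "trajectory_id": "v1",
    "timestamp": pd.date_range("2025-01-01", periods=30, freq="100ms", tz="UTC"),
    "x": range(30),
})
thinned = downsample(raw, "1s")   # 10 Hz -> 1 Hz
print(len(raw), "->", len(thinned))
```

Use period="2s" for 0.5 Hz. Keeping the first fix rather than averaging avoids inventing positions the receiver never reported, at the cost of keeping whatever noise that single fix carries.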

Validation and Quality Assurance

Regression tests on a fixed synthetic or annotated dataset are the only reliable way to catch parameter drift between pipeline versions.

Synthetic trace validation:

PYTHON

def make_synthetic_stay(
    center_xy: tuple[float, float],
    duration_s: float,
    interval_s: float = 5.0,
    jitter_m: float = 10.0,
    start_time: pd.Timestamp = pd.Timestamp("2025-01-01", tz="UTC"),
    trajectory_id: str = "test_001",
    crs: int = 3857,
) -> gpd.GeoDataFrame:
    """Generate a synthetic stay segment with Gaussian positional jitter."""
    n = int(duration_s / interval_s)
    rng = np.random.default_rng(seed=42)
    xs = center_xy[0] + rng.normal(0, jitter_m, n)
    ys = center_xy[1] + rng.normal(0, jitter_m, n)
    timestamps = [start_time + pd.Timedelta(seconds=i * interval_s) for i in range(n)]
    geoms = gpd.points_from_xy(xs, ys)
    return gpd.GeoDataFrame(
        {"trajectory_id": trajectory_id, "timestamp": timestamps},
        geometry=geoms,
        crs=crs,
    )


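# Regression assertions on a single synthetic stay.
# jitter_m must stay well below distance_m: the distance between consecutive
# fixes scales with jitter_m * sqrt(2), so 25 m of jitter would break about
# 37% of the 50 m links and fragment the stay into runs shorter than 300 s.
synthetic = make_synthetic_stay((500000, 6000000), duration_s=600.0, jitter_m=5.0)
result = detect_stay_points(synthetic, distance_m=50.0, duration_s=300.0)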

assert len(result) == 1, f"Expected 1 stay, got {len(result)}"
assert result.iloc[0]["duration_s"] >= 300.0
centroid_err = result.iloc[0]["geometry"].distance(
    gpd.points_from_xy([500000], [6000000])[0]
)
assert centroid_err < 30.0, f"Centroid displaced {centroid_err:.1f} m — exceeds 30 m tolerance"

Track mean_absolute_duration_error and centroid_displacement_m across pipeline versions. A sudden increase in either metric after a data-source change usually indicates a timestamp timezone shift or a CRS reprojection regression — both common after fleet telemetry provider migrations.
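
The two tracked metrics can be computed along these lines. The definitions here are assumptions, since the text does not pin them down: both are taken as means over stays that have already been matched one-to-one with ground truth, and the Stay tuple and stay_metrics helper are hypothetical.

```python
import math
from typing import NamedTuple


class Stay(NamedTuple):
    x: float           # projected easting in metres
    y: float           # projected northing in metres
    duration_s: float


def stay_metrics(truth: list[Stay], detected: list[Stay]) -> dict[str, float]:
    """Compare stays pairwise; both lists must already be matched by position."""
    if not truth or len(truth) != len(detected):
        raise ValueError("truth and detected must be non-empty and the same length")
    pairs = list(zip(truth, detected))
    duration_err = sum(abs(t.duration_s - d.duration_s) for t, d in pairs) / len(pairs)
    displacement = sum(math.hypot(t.x - d.x, t.y - d.y) for t, d in pairs) / len(pairs)
    return {
        "mean_absolute_duration_error": duration_err,
        "centroid_displacement_m": displacement,
    }


truth = [Stay(0.0, 0.0, 600.0), Stay(100.0, 0.0, 900.0)]
detected = [Stay(3.0, 4.0, 590.0), Stay(100.0, 0.0, 930.0)]
metrics = stay_metrics(truth, detected)
print(metrics)
```

Storing these two numbers per pipeline version alongside the fixed test dataset makes a regression visible as a step change rather than a slow drift.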

FAQ

What distance and duration thresholds should I use for pedestrians vs vehicles? For pedestrians use a 30–50 m radius and 2–5 minute minimum dwell. For road vehicles widen to 50–100 m to absorb parking-lot drift and extend the minimum to 3–10 minutes to filter traffic-light halts. Heavy logistics fleets benefit from 100–150 m and 10+ minutes to capture structured loading stops. The calibration table above provides the full breakdown by mode.

Why does DBSCAN outperform time-distance thresholding on noisy GPS data? Thresholding is sequential — a single outlier point prematurely terminates a stay segment. DBSCAN evaluates density across the entire point cloud, so a few GPS drift spikes surrounded by a dense cluster do not break detection. The cost is that DBSCAN does not preserve temporal ordering; reconstruct entry and exit times from the original timestamp column after labelling.

How do I handle GPS data gaps inside a stay region? Set a max_gap_s parameter (typically 1800 s for urban logistics). If the time delta between consecutive points within a flagged segment exceeds that threshold, split the segment into two candidate stays and apply the minimum-duration filter to each independently.

Can I run stay-point detection on WGS84 coordinates directly? For distance-based detection, only with DBSCAN and metric='haversine', which accepts (latitude, longitude) in radians. Thresholding and Euclidean DBSCAN require a projected metric CRS. The H3 grid approach is a separate case: it assigns cells from WGS84 degrees and computes no distances. Mixing degree-based distances with metre-based thresholds produces silent errors that are hard to detect in output data.

What is the minimum point count to form a valid stay? Derive it from your sampling rate: floor(min_duration_s / median_interval_s * 0.6). This allows for up to 40% data loss within the window. At 1 Hz a 5-minute stop yields 300 expected points, so min_samples = 180; at 1/30 Hz the same stop yields only 10 expected points, so min_samples = 6.

Related

Implementing DBSCAN for stay-point clustering in Python — deep dive into parameter selection, the k-distance plot, and haversine-metric clustering
Speed and Acceleration Profiling — kinematic analysis of approach and departure segments around detected stays
Directionality and Turn Analysis — bearing change detection and U-turn identification anchored to stay geometry
GPS Precision and Error Handling — upstream fix-quality filtering that directly affects stay-detection threshold calibration
Optimizing Spatial Joins for Trajectory-to-Zone Matching — efficient POI enrichment of stay centroids at fleet scale
