Why use a time-based window instead of a row-count window for speed averaging?

Row-count windows assign equal weight to all observations regardless of the time gap between them. A 10-row window spanning two hours of stationary parking produces the same average as 10 rows of highway travel. Time-based windows (e.g., rolling('5min')) include only observations within the actual clock interval, preserving the physical relationship between distance and elapsed time.

What is a realistic speed cap for filtering GPS jitter?

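45 m/s (~162 km/h) is a common threshold for road vehicles. Adjust for mode: rail can exceed 80 m/s, cycling rarely exceeds 15 m/s. The implementation here applies the cap to rolling_avg_speed rather than instant_speed, so the smoothed output is bounded while genuine high-speed segments still contribute to the mean. A large jitter spike can still pull the window mean up to the cap, so remove gross outliers upstream where possible.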

How do I handle timezone-naive timestamps in the rolling pipeline?

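Parse with pd.to_datetime(col, utc=True) before sorting or rolling. Note that utc=True converts offset-aware values to UTC but treats naive values as if they were already UTC, so naive logs recorded in local time should first be localized to their source zone (for example with .dt.tz_localize) and then converted. Left in local time, mixed offsets and DST transitions create apparent one-hour gaps or overlapping, out-of-order timestamps, which break window boundaries and produce phantom stops or speed spikes at clock transitions.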
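
As a rough illustration of why naive local timestamps need localizing, the sketch below uses two made-up pings logged in local time on either side of the 2024 spring-forward transition in Europe/Berlin. The zone, the timestamps, and the variable names are assumptions chosen for the example, not part of the pipeline.

```python
import pandas as pd

# Two hypothetical pings, 60 real seconds apart, logged as naive local time
# on either side of the 02:00 -> 03:00 spring-forward jump in Europe/Berlin.
naive_local = pd.Series(
    pd.to_datetime(["2024-03-31 01:59:30", "2024-03-31 03:00:30"])
)

# utc=True alone labels the wall-clock values as UTC without shifting them.
treated_as_utc = pd.to_datetime(naive_local, utc=True)

# Localize to the source zone first, then convert to UTC.
localized = naive_local.dt.tz_localize("Europe/Berlin").dt.tz_convert("UTC")

gap_wrong = treated_as_utc.diff().dt.total_seconds().iloc[1]
gap_right = localized.diff().dt.total_seconds().iloc[1]

print(gap_wrong)  # 3660.0 (phantom 61-minute gap)
print(gap_right)  # 60.0   (true elapsed time)
```

For the autumn transition, where one local hour repeats, tz_localize needs its ambiguous argument set explicitly; which value is right depends on how the logger wrote the repeated hour.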

What min_periods value should I use?

Set min_periods=2 for basic speed averaging: you need at least two points to have a meaningful mean. Raise it to 3–5 for variance-based metrics or when the signal must be stable before downstream anomaly detectors fire. Setting min_periods=1 passes a single point's instant_speed through unchanged as the "average", so isolated points and cold starts (where the first instant_speed is a placeholder 0.0) can skew the rolling output.
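
The effect of min_periods is easiest to see on a toy series. The sketch below assumes five hypothetical instant_speed values for one device, with a 20-minute signal gap before the fourth ping; the numbers are invented for illustration.

```python
import pandas as pd

ts = pd.to_datetime(
    [
        "2024-01-01 00:00:00",  # cold start, placeholder speed 0.0
        "2024-01-01 00:00:10",
        "2024-01-01 00:00:20",
        "2024-01-01 00:20:00",  # first ping after a long gap
        "2024-01-01 00:20:10",
    ],
    utc=True,
)
speed = pd.Series([0.0, 12.0, 14.0, 1.0, 13.0], index=ts)

mp1 = speed.rolling("5min", min_periods=1).mean()
mp2 = speed.rolling("5min", min_periods=2).mean()

# min_periods=1 echoes single observations: 0.0 at the cold start and 1.0
# after the gap. min_periods=2 emits NaN at those two rows instead.
print(pd.DataFrame({"min_periods=1": mp1, "min_periods=2": mp2}))
```

Rows that have at least two pings inside the five-minute window are identical under both settings; only the isolated rows differ.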

Can this approach scale to millions of GPS pings?

The vectorized Haversine and pandas groupby+rolling are efficient for datasets that fit in RAM. For larger workloads, partition by device_id and date using Parquet, or switch to polars/dask. Avoid row-wise apply() — it is 60–80× slower than the vectorized approach shown here.

Computing Rolling Average Speed Over Sliding Time Windows

Computing rolling average speed from GPS trajectories requires pairing time-indexed coordinates with geodesic distance calculations, then applying a time-aware rolling aggregation that respects actual clock time rather than row count. The approach used here converts raw GPS pings into segment-level instantaneous speeds and then uses pandas rolling(window='5min', on='timestamp') to compute a smoothed mean that adapts to irregular sampling rates and GPS dropouts, because the number of rows in each window follows the data rather than a fixed count.

This technique is a core workload within Rolling Statistics for Mobility Metrics — the parent guide that also covers dwell detection, heading stability, and cross-stream correlation. The methodology aligns with the broader Temporal Aggregation & Window Mapping discipline for mobility data engineering.

Why Time-Aware Windows Outperform Row-Based Logic

Movement telemetry rarely arrives at fixed intervals. Fleet trackers, mobile SDKs, and IoT sensors emit coordinates at variable frequencies (typically 1–60 seconds) depending on battery state, network conditions, and motion detection thresholds. A row-based window (rolling(window=10)) assigns equal statistical weight to every observation regardless of the time gap separating them. Ten consecutive rows spanning two hours of stationary parking receive the same window weight as ten rows captured during highway travel — the physical meaning of “average speed over this interval” is destroyed.

Time-based rolling solves this by evaluating all observations within a fixed clock interval, regardless of row count. The window slides forward in real time, including only pings that fall within the specified duration. This is especially important when sampling rate optimization has produced uneven observation densities across a fleet, or when gap-filling in sparse trajectories has injected synthetic rows that should carry different weights than real sensor pings.
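
A minimal sketch of the difference, using invented speeds for a single vehicle: three highway pings ten seconds apart, then two stationary pings half an hour apart, then one ping as it pulls away. The series and its values are assumptions for illustration only.

```python
import pandas as pd

ts = pd.to_datetime(
    [
        "2024-01-01 00:00:00",
        "2024-01-01 00:00:10",
        "2024-01-01 00:00:20",
        "2024-01-01 00:30:00",  # parked
        "2024-01-01 01:00:00",  # still parked
        "2024-01-01 01:00:10",  # pulling away
    ],
    utc=True,
)
speed = pd.Series([30.0, 30.0, 30.0, 0.0, 0.0, 20.0], index=ts)

row_mean = speed.rolling(3).mean()        # last 3 rows, whatever their age
time_mean = speed.rolling("5min").mean()  # only pings from the last 5 minutes

# At 00:30:00 the row window still averages in two highway pings from half
# an hour earlier (20.0 m/s); the time window sees only the parked ping (0.0).
print(pd.DataFrame({"rows=3": row_mean, "5min": time_mean}))
```

The row-count result at 00:30:00 describes no real five-minute interval of the trip, which is the distortion the time-based window avoids.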

Pipeline at a Glance

The pipeline reduces to four steps:

Sort and timestamp-normalize per device — enforce chronological order and UTC timezone before any distance or window operation.
Compute geodesic segment distance — apply a vectorized Haversine formula (or a projected CRS if your pipeline already uses one) to consecutive coordinate pairs.
Derive instantaneous speed — divide segment distance in metres by the time delta in seconds; guard the zero-division case for stationary rows.
Apply the sliding time window — use groupby + rolling(window=..., on='timestamp', min_periods=...) to aggregate instant_speed per device and cap unrealistic spike values.

Production-Ready Implementation

The function below is fully vectorized, handles multi-device grouping, UTC normalization, and zero-division edge cases, and returns an annotated DataFrame ready for downstream feature engineering or alerting pipelines.

PYTHON

import pandas as pd
import numpy as np
from typing import Optional


def haversine_m(
    lat1: np.ndarray,
    lon1: np.ndarray,
    lat2: np.ndarray,
    lon2: np.ndarray,
) -> np.ndarray:
    """Vectorized Haversine distance in metres (WGS84 sphere, R=6 371 000 m).

    NOTE: uses raw WGS84 degrees because the Haversine formula already accounts
    for Earth's curvature. Do NOT reproject to a metric CRS before calling this
    function — that would apply a double conversion and distort short segments.
    For projected pipelines (EPSG:3857 / local UTM), substitute Euclidean
    distance from the projected x/y columns instead.
    """
    R = 6_371_000.0
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlambda = np.radians(lon2 - lon1)
    a = (
        np.sin(dphi / 2) ** 2
        + np.cos(phi1) * np.cos(phi2) * np.sin(dlambda / 2) ** 2
    )
    return 2 * R * np.arctan2(np.sqrt(a), np.sqrt(1.0 - a))


def compute_rolling_avg_speed(
    df: pd.DataFrame,
    window: str = "5min",
    min_periods: int = 2,
    speed_cap_ms: float = 45.0,
    device_col: str = "device_id",
    ts_col: str = "timestamp",
    lat_col: str = "lat",
    lon_col: str = "lon",
) -> pd.DataFrame:
    """Compute rolling average speed (m/s) over a sliding time window.

    Parameters
    ----------
    df : DataFrame with columns [device_id, timestamp, lat, lon] at minimum.
    window : pandas offset string for the rolling window (default '5min').
    min_periods : minimum observations required to emit a non-NaN mean.
    speed_cap_ms : upper bound (m/s) for spike filtering; ~162 km/h for road.
    device_col, ts_col, lat_col, lon_col : configurable column names.

    Returns
    -------
    DataFrame with added columns:
        dist_m          — Haversine segment distance to previous ping
        dt_sec          — elapsed seconds since previous ping
        instant_speed   — dist_m / dt_sec (0.0 for first row per device)
        rolling_avg_speed — time-windowed mean of instant_speed, capped
    """
    required = {device_col, ts_col, lat_col, lon_col}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"DataFrame missing required columns: {missing}")
    if df.empty:
        return df.copy()

    df = df.copy()
    # Step 1 — normalize to UTC; mixed timezones corrupt window boundaries
    df[ts_col] = pd.to_datetime(df[ts_col], utc=True)
    df = df.sort_values([device_col, ts_col]).reset_index(drop=True)

    # Step 2 — shift coordinates and timestamps within each device group
    df["lat_prev"] = df.groupby(device_col)[lat_col].shift(1)
    df["lon_prev"] = df.groupby(device_col)[lon_col].shift(1)
    df["dt_sec"] = (
        df.groupby(device_col)[ts_col]
        .diff()
        .dt.total_seconds()
    )

    # Step 3 — geodesic distance; NaN for the first row per device
    df["dist_m"] = np.where(
        df["lat_prev"].notna(),
        haversine_m(
            df[lat_col].to_numpy(),
            df[lon_col].to_numpy(),
            df["lat_prev"].to_numpy(),
            df["lon_prev"].to_numpy(),
        ),
        np.nan,
    )

    # Step 4 — instantaneous speed; 0.0 when Δt == 0 (duplicate timestamps)
    df["instant_speed"] = np.where(
        (df["dt_sec"].notna()) & (df["dt_sec"] > 0),
        df["dist_m"] / df["dt_sec"],
        0.0,
    )

    # Step 5 — time-based rolling aggregation per device
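    # groupby(...).rolling(...) avoids groupby.apply, which returns a wide
    # DataFrame (not a flat Series) when every device has the same row count,
    # including the single-device case. Rows are already sorted by
    # (device, timestamp), so the result order matches df. Rows with a null
    # device ID would be dropped by groupby; filter them out beforehand.
    rolling_series = (
        df.set_index(ts_col)
        .groupby(device_col)["instant_speed"]
        .rolling(window=window, min_periods=min_periods)
        .mean()
    )
    df["rolling_avg_speed"] = rolling_series.to_numpy()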

    # Step 6 — cap unrealistic spikes from GPS drift / multipath error
    df["rolling_avg_speed"] = df["rolling_avg_speed"].clip(upper=speed_cap_ms)

    # Drop intermediate columns used only for computation
    df.drop(columns=["lat_prev", "lon_prev"], inplace=True)
    return df
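
A minimal sketch of the per-device time-window aggregation in isolation, on toy data with two hypothetical devices that have the same number of rows. The frame and its values are invented for illustration; the point is the shape and ordering of the groupby + rolling result.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "device_id": ["a", "a", "a", "b", "b", "b"],
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 00:00:00",
                "2024-01-01 00:00:30",
                "2024-01-01 00:10:00",  # device a: 9.5-minute gap
                "2024-01-01 00:00:00",
                "2024-01-01 00:00:30",
                "2024-01-01 00:01:00",
            ],
            utc=True,
        ),
        "instant_speed": [0.0, 10.0, 4.0, 0.0, 20.0, 10.0],
    }
)
toy = toy.sort_values(["device_id", "timestamp"]).reset_index(drop=True)

# Result is a Series indexed by (device_id, timestamp), one value per row,
# in the same order as the sorted frame.
rolled = (
    toy.set_index("timestamp")
    .groupby("device_id")["instant_speed"]
    .rolling("5min", min_periods=2)
    .mean()
)
toy["rolling_avg_speed"] = rolled.to_numpy()
print(toy)
```

Each window is evaluated within its own device, so device b's pings never leak into device a's mean, and the ping after device a's gap comes back as NaN because it is alone in its window.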

Validation Block

After running the pipeline, verify the output shape and statistical plausibility before passing results downstream:

PYTHON

def validate_rolling_speed(
    df: pd.DataFrame,
    speed_cap_ms: float = 45.0,
    device_col: str = "device_id",
) -> None:
    """Post-run sanity checks for compute_rolling_avg_speed output.

    Pass the same speed_cap_ms and device_col used in the pipeline call.
    """
    assert "rolling_avg_speed" in df.columns, "Output column missing"
    assert "instant_speed" in df.columns, "Intermediate column missing"

    # No negative speeds
    assert (df["rolling_avg_speed"].dropna() >= 0).all(), "Negative speeds found"

    # Spike cap respected
    assert (df["rolling_avg_speed"].dropna() <= speed_cap_ms).all(), "Speed cap violated"

    # Cold-start NaNs: with min_periods=2, at least the first row per device,
    # plus any row that follows a gap longer than the window
    nan_count = df["rolling_avg_speed"].isna().sum()
    device_count = df[device_col].nunique()
    print(f"NaN rolling rows: {nan_count} across {device_count} devices (expected ≥ {device_count})")

    # Distribution sanity for road vehicles
    p99 = df["rolling_avg_speed"].quantile(0.99)
    print(f"p99 rolling speed: {p99:.2f} m/s ({p99 * 3.6:.1f} km/h); expect < {speed_cap_ms} m/s")

Typical output for a well-behaved fleet dataset looks like NaN rolling rows: 47 across 47 devices (exactly one per device), and p99 below 35 m/s for urban telematics.

Common Mistakes and Gotchas

Row-count windows on uneven telemetry. Using rolling(window=10) instead of rolling(window='5min', on='timestamp') weights every ping equally regardless of elapsed time. A parked vehicle emitting one ping every 30 seconds will dominate a 10-row window with stale near-zero speeds, masking the actual movement that preceded the parking event.
Computing Haversine on projected coordinates. The Haversine formula expects geographic degrees (EPSG:4326). If your pipeline has already reprojected coordinates to EPSG:3857 (Web Mercator) or a UTM zone, use Euclidean distance from the projected x/y columns instead. Feeding metric coordinates into Haversine produces nonsensical distances several orders of magnitude off.
Ignoring timezone normalization. Timestamps kept in local time misbehave across DST boundaries. When clocks go forward, a long-haul overnight route shows an apparent one-hour gap mid-journey, which splits the rolling window and produces a phantom near-zero speed at the transition. When clocks go back, the repeated hour yields overlapping timestamps that sort out of order and produce phantom speed spikes.
Using row-wise apply() for distance. The vectorized NumPy Haversine above is 60–80× faster than a row-wise df.apply(lambda r: haversine(r['lat'], r['lon'], r['lat_prev'], r['lon_prev']), axis=1) equivalent. On a 1M-row dataset, this difference is the boundary between a sub-second operation and a minute-long bottleneck.
Capping instant_speed instead of rolling_avg_speed. Clipping before the rolling aggregation alters individual data points, some of which may be genuine; the window average may still be valid if adjacent observations are plausible. Cap the output of the rolling operation so that genuine high-speed segments are preserved in the mean and the smoothed value stays bounded. Isolated jitter spikes are damped by averaging rather than removed, so large outliers are best filtered upstream.
Forgetting min_periods for cold starts and signal gaps. For an offset window such as '5min', pandas defaults min_periods to 1, so a single ping after a cold start or prolonged signal loss is emitted as its own "average". For an integer window the default is the window size, which can return NaN for most rows on short trajectories. Setting min_periods=2 explicitly emits a mean only once two valid observations exist within the window.
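
For the row-wise versus vectorized point in the list above, a small check that both styles give the same distances can make the swap feel safer. The sketch below is an assumption-laden illustration: haversine_scalar and haversine_vec are hypothetical stand-ins written for this example (the vectorized one mirrors the formula used earlier), and the three coordinates are made up.

```python
import math

import numpy as np
import pandas as pd

R = 6_371_000.0  # mean Earth radius in metres, as in the function above


def haversine_scalar(lat1, lon1, lat2, lon2):
    """Scalar Haversine, the kind typically called from a row-wise apply()."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.atan2(math.sqrt(a), math.sqrt(1.0 - a))


def haversine_vec(lat1, lon1, lat2, lon2):
    """Same formula on whole NumPy arrays at once."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi, dlmb = np.radians(lat2 - lat1), np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * R * np.arctan2(np.sqrt(a), np.sqrt(1.0 - a))


pts = pd.DataFrame({"lat": [52.0, 52.001, 52.003], "lon": [13.0, 13.0, 13.002]})
pts["lat_prev"] = pts["lat"].shift(1)
pts["lon_prev"] = pts["lon"].shift(1)
pairs = pts.dropna()  # drop the first row, which has no previous ping

row_wise = pairs.apply(
    lambda r: haversine_scalar(r["lat"], r["lon"], r["lat_prev"], r["lon_prev"]),
    axis=1,
).to_numpy()
vectorized = haversine_vec(
    pairs["lat"].to_numpy(),
    pairs["lon"].to_numpy(),
    pairs["lat_prev"].to_numpy(),
    pairs["lon_prev"].to_numpy(),
)

print(np.allclose(row_wise, vectorized))  # True
```

The first segment is 0.001 degrees of latitude, which works out to roughly 111.19 m on this sphere. Timing is deliberately left out here; measure the speed difference on your own data and hardware.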

Rolling Statistics for Mobility Metrics — parent guide covering dwell detection, heading stability, and multi-metric rolling aggregation patterns
Calculating Instantaneous Speed from Discrete GPS Points — the per-segment speed derivation that feeds the rolling window
Gap-Filling in Sparse Trajectories — how to handle signal dropout before rolling aggregation distorts your means
Downsampling High-Frequency GPS Tracks Without Losing Path Integrity — managing observation density before applying rolling windows
Handling GPS Drift in Raw Trajectory Logs — upstream spike removal that reduces the need for aggressive speed caps
