Speed & Acceleration Profiling for Trajectory Data

Speed and acceleration profiling converts discrete, timestamped GPS coordinates into reliable kinematic vectors — the velocity and acceleration signals that drive driver-behaviour scoring, anomaly detection, and transit performance monitoring in Movement Pattern Extraction & Trajectory Analysis.

Prerequisites

Before building the profiling pipeline, confirm the following environment and data requirements. This section sits downstream of GPS precision & error handling — trajectories containing uncorrected multipath or severe clock drift will propagate those errors directly into acceleration signals.

Python dependencies: pandas >= 2.0, geopandas >= 0.14, numpy >= 1.24, shapely >= 2.0, scipy >= 1.11, pyproj >= 3.5.

Required input columns:

Column	Type	Notes
`entity_id`	string / int	Unique identifier for vehicle, device, or agent
`timestamp`	datetime64[ns, UTC]	Timezone-aware; naive timestamps are treated as UTC, which silently skews `dt` across DST changes or mixed time zones
`latitude`	float64	WGS-84 decimal degrees
`longitude`	float64	WGS-84 decimal degrees

Recommended columns: altitude (float64), hdop (float64), heading (float64). hdop is particularly valuable for pre-filtering low-quality pings before computing derivatives.

Upstream stages that must complete first:

Coordinate reference system mapping — ensures all coordinates are in a consistent geodetic datum before projection.
Time-series synchronization strategies — aligns and deduplicates timestamps so that dt values are positive and non-zero.
Sampling rate optimization — establishes whether the stream is regular enough for fixed-window smoothing or requires adaptive interpolation.

Error and Problem Taxonomy

Kinematic derivatives are second-order operations on noisy signals. Small coordinate errors become large velocity errors; small velocity errors become catastrophic acceleration errors. Understanding the failure modes before coding prevents hours of debugging.

Error source	Mechanism	Typical impact	Mitigation
Timestamp duplicates	Two or more records share identical `timestamp` for the same entity	Division by zero in `dt`; infinite speed	Deduplicate on `(entity_id, timestamp)` keeping last; enforce `dt > 0.1 s` minimum
GPS multipath	Signal reflected from buildings produces coordinate jumps of 5–50 m in a single sample	Phantom acceleration spikes > 5 g	Filter `hdop > 2.0` before profiling; apply median clipping after raw difference
Trajectory gaps	Logger pauses (tunnel, power loss) resume at a distant coordinate	Single enormous velocity spike bridging the gap	Split trajectories on `dt > gap_threshold_s` (typically 60–300 s depending on mode)
WGS-84 Euclidean distance	Computing distance in decimal degrees without projection	Latitude-dependent scale error; east-west distances are overstated by a factor of two at 60° N	Project to metric CRS before every distance calculation; never compute distance in `EPSG:4326`
Axis-order confusion	`pyproj` defaults changed between versions; `(lat, lon)` vs `(lon, lat)`	Coordinates transposed; distances wildly wrong in one axis	Always pass `always_xy=True` to `Transformer.from_crs`
Window length parity	`savgol_filter` requires odd `window_length`	`ValueError` at runtime	Coerce even values: `wl = wl if wl % 2 == 1 else wl + 1`
Short trajectory segments	Entity has fewer points than `window_length` after gap-splitting	`ValueError` or meaningless output	Guard with `if len(series) < window_length + 1: return series`
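
The first two mitigations in the table can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions, not part of the pipeline function further down: the helper name `preclean_pings` and its `max_hdop` / `min_dt_s` parameters are hypothetical, and the optional `hdop` column is assumed to be present.

```python
import pandas as pd


def preclean_pings(df: pd.DataFrame, max_hdop: float = 2.0,
                   min_dt_s: float = 0.1) -> pd.DataFrame:
    """Drop duplicate timestamps, poor-quality fixes, and near-zero time steps."""
    out = df.sort_values(["entity_id", "timestamp"])
    # Duplicate (entity_id, timestamp) pairs would give dt == 0; keep the last one
    out = out.drop_duplicates(["entity_id", "timestamp"], keep="last")
    # Rows with missing hdop are kept; only explicitly poor fixes are removed
    out = out[~(out["hdop"] > max_hdop)]
    # Enforce the minimum time step within each entity (first row has dt = NaN)
    dt = out.groupby("entity_id")["timestamp"].diff().dt.total_seconds()
    out = out[dt.isna() | (dt >= min_dt_s)]
    return out.reset_index(drop=True)


# Illustrative data: one duplicated timestamp and one poor-quality fix
pings = pd.DataFrame({
    "entity_id": ["a", "a", "a", "a"],
    "timestamp": pd.to_datetime(
        ["2024-01-01 00:00:00", "2024-01-01 00:00:01",
         "2024-01-01 00:00:01", "2024-01-01 00:00:02"], utc=True),
    "hdop": [0.9, 1.1, 1.0, 4.5],
})
clean = preclean_pings(pings)
```

The time-step filter is a single pass, so removing one row can leave a new short interval behind it; for heavily duplicated streams it may need to be repeated until no rows are dropped.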

Deterministic Pipeline

The following numbered sequence maps to the SVG diagram above. Each stage has a single responsibility; keep them decoupled so you can test and replace each independently.

1. Temporal sort and gap-split. Sort the full dataset by (entity_id, timestamp). Compute dt for each row. Where dt > gap_threshold_s, insert a segment boundary, either as a new segment_id column or by splitting into separate DataFrames. Unhandled gaps are the single most common source of phantom acceleration events.
2. Coordinate projection. Transform (longitude, latitude) from EPSG:4326 to a local metric CRS using pyproj.Transformer. For datasets spanning a single UTM zone, use that zone’s EPSG code. For continental or global fleets, compute the UTM zone per-entity from the median longitude and apply a dynamic CRS. This stage is what makes optimizing spatial joins downstream tractable.
3. Distance computation. Use the projected (x, y) coordinates and compute Euclidean distance between consecutive points within each entity/segment group. geopandas geometry.distance(geometry.shift(1)) is a clean vectorized idiom for this.
4. Finite-difference velocity and acceleration. Divide distance by dt for raw speed. Divide consecutive speed differences by dt for raw acceleration. Guard every division with a dt > 0 mask and store NaN for invalid rows rather than filling with zero.
5. Savitzky-Golay smoothing. Apply scipy.signal.savgol_filter per entity/segment to both the speed and acceleration series. Tune window_length and polyorder by transport mode (see Calibration section). Use mode='nearest' to handle boundary effects without manual padding.
6. Validation and persistence. Assert that smoothed speed values are non-negative and below the physical maximum for the transport mode. Flag rows where |acceleration| > 3 g as suspect rather than silently dropping them. Write the output to partitioned Parquet sorted by (entity_id, timestamp).
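
The projection stage mentions choosing a UTM zone per entity from the median longitude. One way to do that is sketched below. The helper name `utm_epsg_for` is hypothetical, and the sketch assumes the standard 6-degree zone grid; it ignores the irregular zones around Norway and Svalbard and the polar regions outside UTM coverage.

```python
import math


def utm_epsg_for(lon: float, lat: float) -> str:
    """Return the EPSG code string of the WGS-84 UTM zone containing (lon, lat)."""
    # Zones are 6 degrees wide, numbered 1..60 eastwards starting at 180 W
    zone = int(math.floor((lon + 180.0) / 6.0)) % 60 + 1
    # EPSG 326xx covers the northern hemisphere, 327xx the southern hemisphere
    base = 32600 if lat >= 0 else 32700
    return f"EPSG:{base + zone}"


east_coast_crs = utm_epsg_for(-75.0, 40.0)    # UTM 18N
sydney_crs = utm_epsg_for(151.2, -33.9)       # UTM 56S
```

In a per-entity workflow the arguments would be the median longitude and latitude of that entity's pings, and the returned string would be passed as the target CRS.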

Implementation Walkthrough

The function below is production-grade: fully vectorized, typed, schema-validated, and explicit about every edge case that appears in the taxonomy above.

PYTHON

import numpy as np
import pandas as pd
import geopandas as gpd
from scipy.signal import savgol_filter
from pyproj import Transformer
from typing import Optional


REQUIRED_COLS = {"entity_id", "timestamp", "latitude", "longitude"}


def compute_kinematics(
    df: pd.DataFrame,
    target_crs: str = "EPSG:32618",
    gap_threshold_s: float = 300.0,
    window_length: int = 11,
    poly_order: int = 3,
    max_speed_ms: float = 60.0,
) -> pd.DataFrame:
    """
    Derive smoothed speed (m/s) and acceleration (m/s²) from raw GPS trajectories.

    All distance calculations are performed in `target_crs` (a metric projection).
    Never compute speed in EPSG:4326 — degree-based Euclidean distance is latitude-dependent.

    Parameters
    ----------
    df : pd.DataFrame
        Must contain columns: entity_id, timestamp (datetime64, UTC), latitude, longitude.
    target_crs : str
        Metric projected CRS, e.g. "EPSG:32618" (UTM 18N) or "EPSG:3035" (Europe).
    gap_threshold_s : float
        Maximum acceptable time gap in seconds within a single trajectory segment.
        Gaps larger than this trigger a segment boundary to prevent phantom spikes.
    window_length : int
        Savitzky-Golay window length (samples). Must be odd and > poly_order.
        Coerced to odd automatically. Tune per transport mode — see calibration table.
    poly_order : int
        Savitzky-Golay polynomial order. 3 is a robust default for GPS kinematics.
    max_speed_ms : float
        Physical speed ceiling in m/s for the transport mode. Values above this
        after smoothing are flagged as `speed_suspect = True` rather than clipped.

    Returns
    -------
    pd.DataFrame with columns: entity_id, segment_id, timestamp, latitude, longitude,
        speed_ms, acceleration_ms2, speed_suspect.
    """
    # --- Schema validation ---
    missing = REQUIRED_COLS - set(df.columns)
    if missing:
        raise ValueError(f"Input DataFrame missing required columns: {missing}")
    if df.empty:
        return pd.DataFrame(
            columns=["entity_id", "segment_id", "timestamp", "latitude",
                     "longitude", "speed_ms", "acceleration_ms2", "speed_suspect"]
        )

    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)

    # --- Stage 1: Temporal sort and gap-split ---
    df.sort_values(["entity_id", "timestamp"], inplace=True)
    df.reset_index(drop=True, inplace=True)

    df["_dt"] = (
        df.groupby("entity_id")["timestamp"]
        .transform(lambda t: t.diff().dt.total_seconds())
    )
    # Mark gap boundaries (NaN at first row of each entity counts as a gap)
    df["_gap"] = (df["_dt"].isna()) | (df["_dt"] > gap_threshold_s)
    df["segment_id"] = df.groupby("entity_id")["_gap"].transform(
        lambda g: g.cumsum()
    )

    # --- Stage 2: Coordinate projection to metric CRS ---
    # always_xy=True enforces (longitude, latitude) axis order regardless of CRS authority
    transformer = Transformer.from_crs("EPSG:4326", target_crs, always_xy=True)
    df["_x"], df["_y"] = transformer.transform(
        df["longitude"].values, df["latitude"].values
    )

    # --- Stage 3: Distance calculation (metric CRS, not WGS-84 degrees) ---
    gdf = gpd.GeoDataFrame(
        df, geometry=gpd.points_from_xy(df["_x"], df["_y"]), crs=target_crs
    )
    gdf["_dist"] = gdf.groupby(["entity_id", "segment_id"])["geometry"].transform(
        lambda g: g.distance(g.shift(1))
    )

    # --- Stage 4: Raw finite differences ---
    # Guard dt <= 0 to avoid division by zero (duplicate timestamps produce dt == 0)
    valid_dt = gdf["_dt"].gt(0)
    gdf["_speed_raw"] = np.where(valid_dt, gdf["_dist"] / gdf["_dt"], np.nan)
    gdf["_accel_raw"] = (
        gdf.groupby(["entity_id", "segment_id"])["_speed_raw"]
        .transform(lambda v: v.diff())
        / gdf["_dt"]
    )

    # --- Stage 5: Savitzky-Golay smoothing per entity/segment ---
    # window_length must be odd; coerce even values
    wl = window_length if window_length % 2 == 1 else window_length + 1

    def _savgol_safe(series: pd.Series) -> pd.Series:
        """Apply SG filter; return unfiltered series if segment is too short."""
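        # fillna(method=...) is deprecated in pandas 2.1+; use ffill()/bfill() instead.
        # Back-fill the leading NaN of each segment rather than inserting a fake zero.
        vals = series.ffill().bfill().to_numpy(dtype=float)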
        # Minimum length: window must be smaller than the series
        effective_wl = min(wl, len(vals) if len(vals) % 2 == 1 else len(vals) - 1)
        if effective_wl < poly_order + 2 or effective_wl < 3:
            return series  # Too short — return raw values
        smoothed = savgol_filter(
            vals, window_length=effective_wl, polyorder=poly_order, mode="nearest"
        )
        return pd.Series(smoothed, index=series.index)

    gdf["speed_ms"] = gdf.groupby(["entity_id", "segment_id"])["_speed_raw"].transform(
        _savgol_safe
    )
    gdf["acceleration_ms2"] = gdf.groupby(["entity_id", "segment_id"])["_accel_raw"].transform(
        _savgol_safe
    )

    # Clamp negative speed to zero (physically impossible; residual of smoothing at boundaries)
    gdf["speed_ms"] = gdf["speed_ms"].clip(lower=0.0)

    # --- Stage 6: Validation flags (do not silently drop — flag for inspection) ---
    gdf["speed_suspect"] = gdf["speed_ms"] > max_speed_ms

    output_cols = [
        "entity_id", "segment_id", "timestamp",
        "latitude", "longitude",
        "speed_ms", "acceleration_ms2", "speed_suspect",
    ]
    return gdf[output_cols].reset_index(drop=True)

Reliability notes:

always_xy=True is non-negotiable. Omitting it causes silent axis-order transposition in many CRS definitions, producing distances that are off by up to an order of magnitude.
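The _savgol_safe helper shrinks the window for short segments and returns them unfiltered when fewer than poly_order + 2 points remain. This avoids the ValueError that savgol_filter can raise on segments shorter than the window. Short segments, which are common after aggressive gap-splitting, would otherwise abort the entire groupby transform.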
Negative smoothed speeds occur at trajectory boundaries where the polynomial overshoots. Clipping to zero is correct; these are not real deceleration events.
speed_suspect flags rather than drops outliers. Downstream anomaly detection pipelines for fleet management need these rows visible to distinguish sensor failure from genuine hard braking.

Mathematical Grounding

Given discrete projected points $P_i = (x_i, y_i, t_i)$ in a metric CRS, speed $v_i$ and acceleration $a_i$ are approximated by forward finite differences:

$$v_i = \frac{\Delta d_i}{\Delta t_i} = \frac{\sqrt{(x_i - x_{i-1})^2 + (y_i - y_{i-1})^2}}{t_i - t_{i-1}}$$

$$a_i = \frac{v_i - v_{i-1}}{\Delta t_i}$$

Forward vs. central differences. The one-sided difference above uses only the current and previous sample (strictly a backward difference; this guide calls it a forward difference throughout). It introduces a half-sample phase lag: the computed velocity describes the interval between $t_{i-1}$ and $t_i$ rather than the point $t_i$ itself. Central differences use the symmetric interval $(t_{i+1} - t_{i-1})$, which centres the estimate on $t_i$ and removes this lag, but they require lookahead, so a real-time stream can only emit them one sample late. For batch pipelines, Savitzky-Golay filtering over a window of $2k+1$ points implicitly fits a local polynomial centred on each sample and behaves like a weighted central-difference scheme; it recovers peak kinematics more faithfully than raw forward differences.
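
To make the lag concrete, the sketch below compares the two schemes on a synthetic track with known speed. The quadratic position profile and the variable names are illustrative assumptions, not data from the pipeline.

```python
import numpy as np

dt = 1.0
t = np.arange(0.0, 10.0, dt)
x = t ** 2                      # position in metres; the true speed is 2 * t

# One-sided difference using the current and previous sample (v_i in the formula above)
v_one_sided = (x[1:] - x[:-1]) / dt          # attributed to t[1:]
# Central difference using the neighbours on both sides
v_central = (x[2:] - x[:-2]) / (2.0 * dt)    # attributed to t[1:-1]

# The one-sided estimate equals the true speed half a sample earlier: 2 * (t - dt / 2)
err_one_sided = v_one_sided - 2.0 * t[1:]
# The central estimate is exact for this constant-acceleration track
err_central = v_central - 2.0 * t[1:-1]
```

Here `err_one_sided` is a constant -1 m/s (half a sample of lag times an acceleration of 2 m/s²), while `err_central` is zero.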

Savitzky-Golay fitting. The filter fits a polynomial of degree $p$ to each window of $2k+1$ samples by least squares and evaluates it at the centre point. For $p = 3$ and $k = 5$ (window = 11), the filter passes kinematic events with wavelengths longer than approximately 4 samples while attenuating GNSS jitter. Increasing $k$ increases smoothing; increasing $p$ preserves higher-frequency features but amplifies noise at the boundary.
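
The least-squares property can be checked numerically. The sketch below runs `scipy.signal.savgol_filter` with $p = 3$ and a window of 11 on a synthetic cubic signal, once clean and once with added Gaussian noise. The signal, noise level, and seed are arbitrary assumptions chosen for illustration, and `mode="interp"` is used here so that the edges are also fitted by a cubic.

```python
import numpy as np
from scipy.signal import savgol_filter

t = np.arange(60, dtype=float)
cubic = 0.01 * t**3 - 0.5 * t**2 + 3.0 * t      # smooth synthetic signal
rng = np.random.default_rng(0)
noisy = cubic + rng.normal(scale=2.0, size=t.size)

# p = 3, window = 11 (k = 5), the configuration discussed above
smooth_clean = savgol_filter(cubic, window_length=11, polyorder=3, mode="interp")
smooth_noisy = savgol_filter(noisy, window_length=11, polyorder=3, mode="interp")

# Root-mean-square error against the clean signal, before and after filtering
rms_before = np.sqrt(np.mean((noisy - cubic) ** 2))
rms_after = np.sqrt(np.mean((smooth_noisy - cubic) ** 2))
```

A cubic passes through a degree-3 fit unchanged, so `smooth_clean` matches `cubic` to numerical precision, while the noise component is attenuated.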

Distance in WGS-84 is incorrect. One degree of longitude spans about 111 km at the equator but only about 55 km at 60° N, while a degree of latitude stays close to 111 km everywhere. Computing Euclidean distance in decimal degrees therefore overstates east-west displacement relative to north-south displacement by a factor of 1/cos(latitude), which is 2 at 60° N. That is enough to make speed estimates unreliable for any analytics purpose. Projecting to a metric CRS first is mandatory.
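
The latitude dependence is easy to reproduce with a spherical approximation. The sketch below is for illustration only: it assumes a mean Earth radius of 6,371 km rather than the WGS-84 ellipsoid, and the helper name is hypothetical.

```python
import math

# Mean Earth radius; a spherical approximation that is adequate for illustration only
EARTH_RADIUS_M = 6_371_000.0


def metres_per_degree_lon(lat_deg: float) -> float:
    """Approximate ground length of one degree of longitude at a given latitude."""
    return math.radians(1.0) * EARTH_RADIUS_M * math.cos(math.radians(lat_deg))


step_deg = 0.001
at_equator = step_deg * metres_per_degree_lon(0.0)     # about 111 m
at_60_north = step_deg * metres_per_degree_lon(60.0)   # about 55.6 m
```

The same 0.001-degree step therefore maps to half the ground distance at 60° N, which is why a fixed degrees-to-metres factor cannot be correct for both axes.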

Calibration and Parameter Tuning

window_length, gap_threshold_s, and max_speed_ms must be tuned to the kinematic regime of each transport mode. Using cargo-truck parameters for pedestrian tracking (or vice versa) will either over-smooth genuine stops or leave excessive noise in the output.

Transport mode	Typical sampling rate	Recommended `window_length`	`gap_threshold_s`	`max_speed_ms`	Notes
Freight / HGV	1 Hz	11 (11 s window)	300	33 (120 km/h)	Long windows tolerable; low dynamics
Urban passenger car	1 Hz	9 (9 s window)	120	50 (180 km/h)	Balance smoothing vs. stop detection
Motorcycle / sports	5 Hz	21 (4.2 s window)	60	83 (300 km/h)	Higher dynamics; shorter windows
Micromobility / e-bike	1 Hz	7 (7 s window)	60	14 (50 km/h)	Frequent stops; aggressive gap-split
Pedestrian	0.5–1 Hz	5 (5–10 s window)	30	5 (18 km/h)	Slow dynamics; smallest windows
Public transit (bus)	1 Hz	11 (11 s window)	180	28 (100 km/h)	Scheduled stops complicate gap logic

Window length in time, not samples. The table above shows both sample count and approximate time coverage. When comparing results across datasets with different sampling rates, normalise by time. A window covering roughly 7–11 seconds, as in the 1 Hz rows above, is a robust starting point for most urban transport modes.

Adaptive window sizing. For datasets where sampling rate varies within a single trajectory — common in gap-filling or downsampled streams — compute local dt statistics per segment and select window_length dynamically: wl = max(5, min(31, int(target_window_s / median_dt))).
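
The expression above can produce an even value, so a small wrapper that also keeps the window odd is useful. The function name and the 10-second default for `target_window_s` are assumptions made for this sketch.

```python
def adaptive_window_length(median_dt_s: float, target_window_s: float = 10.0,
                           min_wl: int = 5, max_wl: int = 31) -> int:
    """Pick an odd Savitzky-Golay window covering about target_window_s seconds."""
    if median_dt_s <= 0:
        raise ValueError("median_dt_s must be positive")
    wl = max(min_wl, min(max_wl, int(target_window_s / median_dt_s)))
    # Keep the window odd so that it has a well-defined centre sample
    if wl % 2 == 0:
        wl += 1
    return wl


wl_1hz = adaptive_window_length(1.0)    # 10 samples requested, bumped to 11
wl_5hz = adaptive_window_length(0.2)    # capped at max_wl
wl_slow = adaptive_window_length(5.0)   # floored at min_wl
```

With an odd `max_wl` the result never exceeds the cap; if an even cap is passed, the result can be one sample larger than it.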

Polynomial order. poly_order = 3 is correct for most mobility applications. Use poly_order = 2 for very slow-moving assets where the quadratic term is sufficient. Use poly_order = 4 or 5 only when you need accurate jerk (third derivative) estimates, as higher orders amplify noise at segment boundaries.

Integration and Compatibility

Speed and acceleration features rarely live in isolation. They are foundational inputs for several adjacent analytical workflows.

Stay-point detection. Sustained low-speed segments — where speed_ms < v_thresh for longer than t_min seconds — directly seed stay-point detection algorithms. The quality of stay-point boundaries depends on how cleanly the speed profile transitions around stops. Over-smoothed speed profiles blur the onset of genuine dwell events; under-smoothed profiles trigger false positives from GNSS jitter.
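
The low-speed rule can be sketched as a run-length search over the smoothed speed column. This is a minimal illustration for a single entity/segment that is already sorted by time; the function name and the default values of `v_thresh` and `t_min_s` are assumptions, not recommended settings.

```python
import pandas as pd


def low_speed_runs(df: pd.DataFrame, v_thresh: float = 0.5,
                   t_min_s: float = 120.0) -> pd.DataFrame:
    """Return start, end and duration of each sustained low-speed run."""
    slow = df["speed_ms"] < v_thresh
    if not slow.any():
        return pd.DataFrame(columns=["start", "end", "duration_s"])
    # A new run starts wherever the slow/fast state changes
    run_id = (slow != slow.shift()).cumsum()
    runs = (
        df[slow]
        .groupby(run_id[slow])["timestamp"]
        .agg(start="min", end="max")
    )
    runs["duration_s"] = (runs["end"] - runs["start"]).dt.total_seconds()
    return runs[runs["duration_s"] >= t_min_s].reset_index(drop=True)


# Synthetic 1 Hz track: moving, stopped for 200 samples, moving again
ts = pd.date_range("2024-01-01", periods=400, freq=pd.Timedelta(seconds=1), tz="UTC")
speed = [10.0] * 100 + [0.1] * 200 + [10.0] * 100
stays = low_speed_runs(pd.DataFrame({"timestamp": ts, "speed_ms": speed}))
```

Each returned row is a candidate dwell interval that a stay-point algorithm can then confirm spatially.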

Directionality and turn analysis. Acceleration magnitude combined with heading change rate is the core input to directionality and turn analysis. Hard lateral acceleration (centripetal component) indicates sharp turns; combined longitudinal deceleration plus heading change identifies braking-into-turn behaviour.
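
For a point moving at speed $v$ with heading rate $\omega$ (in rad/s), the centripetal component is $a_{lat} = v\,\omega$. The sketch below applies that relation to a heading series in degrees. The function name is hypothetical, `dt_s` is assumed to hold the time since the previous sample (NaN for the first row), and headings are assumed to come from the optional `heading` column.

```python
import numpy as np


def lateral_acceleration(speed_ms, heading_deg, dt_s):
    """Centripetal acceleration a_lat = v * omega, with omega the heading rate in rad/s."""
    speed_ms = np.asarray(speed_ms, dtype=float)
    heading_deg = np.asarray(heading_deg, dtype=float)
    dt_s = np.asarray(dt_s, dtype=float)
    # Wrap heading changes into [-180, 180) so 355 -> 0 degrees counts as +5, not -355
    dheading = (np.diff(heading_deg, prepend=heading_deg[0]) + 180.0) % 360.0 - 180.0
    omega = np.full_like(speed_ms, np.nan)
    valid = dt_s > 0            # NaN and zero time steps stay NaN in the output
    omega[valid] = np.radians(dheading[valid]) / dt_s[valid]
    return speed_ms * omega


# 10 m/s through a steady 5 degrees-per-second turn that crosses north
a_lat = lateral_acceleration(
    speed_ms=[10.0, 10.0, 10.0, 10.0],
    heading_deg=[350.0, 355.0, 0.0, 5.0],
    dt_s=[np.nan, 1.0, 1.0, 1.0],
)
```

The sign of the result distinguishes right turns (positive, heading increasing) from left turns under the usual clockwise-from-north heading convention.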

Downstream models. Smoothed speed series are suitable direct inputs to Hidden Markov Models for map-matching (through emission probabilities conditioned on road-segment speed limits, decoded with the Viterbi algorithm) and to Kalman filters in which speed is an observed quantity. For DBSCAN-based trajectory segmentation, as covered in implementing DBSCAN for stay-point clustering, the smoothed speed_ms column can replace or supplement the spatial distance metric.

Rolling statistics. Batch pipelines commonly compute 60-second or 5-minute rolling aggregates (mean speed, 95th-percentile acceleration, jerk count) over the smoothed kinematic output. The rolling statistics for mobility metrics section covers efficient windowed aggregation patterns that compose directly with this pipeline’s Parquet output.
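
As a small illustration of time-based aggregation over the kinematic output, the sketch below uses pandas offset windows on a synthetic 1 Hz frame. The output column names `mean_speed_60s` and `p95_abs_accel_60s` are hypothetical, and the data is fabricated for the example.

```python
import numpy as np
import pandas as pd

ts = pd.date_range("2024-01-01", periods=180, freq=pd.Timedelta(seconds=1), tz="UTC")
kin = pd.DataFrame({
    "speed_ms": np.linspace(0.0, 17.9, 180),       # steady 0.1 m/s gain per sample
    "acceleration_ms2": np.full(180, 0.1),
}, index=ts)

# Offset windows need a sorted DatetimeIndex; "60s" means the trailing 60 seconds
stats = pd.DataFrame({
    "mean_speed_60s": kin["speed_ms"].rolling("60s").mean(),
    "p95_abs_accel_60s": kin["acceleration_ms2"].abs().rolling("60s").quantile(0.95),
})
```

For multi-entity data the same calls would be applied per entity/segment group so that windows never span two trajectories.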

Streaming architectures. For real-time pipelines, decouple profiling from ingestion using a message broker (Kafka, Google Pub/Sub). Compute forward-difference raw speed per incoming ping; buffer the last window_length points in a sliding deque per entity; apply Savitzky-Golay when the buffer is full. Emit the centre-point smoothed value on each new ping after the buffer fills. That value describes the ping received (window_length - 1) / 2 samples earlier, so downstream consumers should expect this fixed latency.
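
The buffering logic can be sketched independently of any broker. The class below is a hypothetical illustration (the name `StreamingSavgol` and its interface are assumptions); it precomputes the filter weights with `scipy.signal.savgol_coeffs` and evaluates the centre point of each full buffer.

```python
from collections import deque

import numpy as np
from scipy.signal import savgol_coeffs


class StreamingSavgol:
    """Per-entity sliding buffer that emits the centre-point smoothed value."""

    def __init__(self, window_length: int = 11, poly_order: int = 3):
        if window_length % 2 == 0:
            raise ValueError("window_length must be odd to have a centre sample")
        self.coeffs = savgol_coeffs(window_length, poly_order)
        self.window_length = window_length
        self.buffers: dict = {}

    def push(self, entity_id, raw_speed: float):
        """Add one raw speed; return the smoothed centre value, or None while filling."""
        buf = self.buffers.setdefault(entity_id, deque(maxlen=self.window_length))
        buf.append(raw_speed)
        if len(buf) < self.window_length:
            return None
        # Smoothing coefficients are symmetric, so a dot product equals the convolution
        return float(np.dot(self.coeffs, np.asarray(buf)))


smoother = StreamingSavgol(window_length=11, poly_order=3)
# Feed a linear ramp 0, 1, 2, ... for a single entity
outputs = [smoother.push("bus-7", float(v)) for v in range(13)]
```

The first ten calls return None; the eleventh returns the smoothed value for the sixth ping, which for this ramp is 5.0.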

Troubleshooting Reference

Symptom	Likely cause	Diagnostic signal	Fix
Acceleration spikes > 5 g	GPS multipath or unhandled timestamp duplicate	`hdop > 2.0` at spike location; `dt` near zero before spike	Filter `hdop`; enforce `dt > 0.1 s` minimum; split on gap
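Negative or implausible speed values	Savitzky-Golay overshoot near stops and segment boundaries (negative values) or a coordinate axis-order bug in projection (implausible magnitudes)	Negative values cluster at segment edges; or `latitude` contains values outside ±90°, meaning it actually holds longitudes	Clip smoothed speed at zero; verify `always_xy=True`; check that `longitude` and `latitude` columns are not swapped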
Smoothing flattens real stops	`window_length` too large for stop duration	Smoothed speed stays above zero during known dwell	Reduce `window_length` or post-process: zero-clip where `speed_ms < 0.5 m/s` for > 10 s
`ValueError` in `savgol_filter`	Even `window_length` or window larger than segment	Exception message references `window_length`	Coerce to odd; guard with `effective_wl = min(wl, len(series) - 1 if (len(series)-1) % 2 == 1 else len(series) - 2)`
Memory OOM during `groupby.transform`	High-cardinality `entity_id` with long trajectories	Memory grows during transform step	Process in batches by `entity_id` partition; switch to `polars` lazy evaluation
CAN-bus vs. GNSS mismatch > 15%	GNSS latency and coordinate drift	Consistent lag, not random scatter	Expected variance is 5–12%; above 15% indicates projection mismatch or systematic clock offset — revisit time-series synchronization

FAQ

Why do acceleration values spike unrealistically in my GPS data?

Spikes almost always come from three sources: timestamp duplicates (dt = 0 followed by a sudden coordinate jump), GPS multipath near tall buildings, or an unhandled gap where the logger paused and resumed. Enforce a minimum dt of 0.1 s, filter pings with hdop > 2.0, and split trajectories on gaps exceeding your threshold before computing differences. If spikes persist, apply a median clip (|v| < 3 × rolling_median(v, 5)) before the Savitzky-Golay pass.
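
The median clip mentioned above could look like the following sketch. The function name is hypothetical, and the choice of a centred window and of replacing rejected samples with NaN (rather than capping them) are assumptions; the answer above only specifies the threshold rule.

```python
import pandas as pd


def median_clip(v: pd.Series, window: int = 5, factor: float = 3.0) -> pd.Series:
    """Replace samples exceeding factor x the centred rolling median with NaN."""
    med = v.rolling(window, center=True, min_periods=1).median()
    keep = v.abs() < factor * med.abs()
    # Stationary stretches: a zero value against a zero median is not an outlier
    keep |= (med == 0) & (v == 0)
    return v.where(keep)


raw_speed = pd.Series([10.0, 10.5, 9.8, 95.0, 10.2, 10.1, 9.9])
clipped = median_clip(raw_speed)
```

Only the 95 m/s sample is rejected here; the NaN it leaves behind is then handled by the gap-filling step inside the smoothing helper.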

Should I use forward differences or central differences for speed calculation?

Central differences remove the half-sample phase lag of one-sided differences and reduce bias at inflection points, so prefer them for batch processing where the full trajectory is available. Forward differences are appropriate for real-time streaming because they only require the current and previous sample. In practice, compute forward differences and then apply Savitzky-Golay smoothing; the filter implicitly approximates central differences over its window and recovers most of the accuracy advantage. In a streaming setting the centred window still needs (window_length - 1) / 2 samples of lookahead, so the smoothed value is emitted with that delay.

How do I choose the Savitzky-Golay window length?

Target 5–15 seconds of trajectory time rather than a fixed sample count. At 1 Hz for a freight vehicle, window_length = 11 (11 s) is a sensible starting point. For 5 Hz micromobility data, use window_length = 31 (6.2 s). Always keep window_length odd and at least poly_order + 2. If the smoothed output still shows physically implausible peaks, the signal is under-smoothed: increase the window length (keeping it odd) or lower poly_order, and re-evaluate before reaching for more exotic filters.

Can I compute speed directly in WGS-84 decimal degrees?

No. Degree-based Euclidean distance is non-metric and latitude-dependent: a 0.001-degree step in longitude represents ~111 m at the equator but only ~55 m at 60° N. All speed and distance calculations must be performed after projecting to a metric CRS such as a local UTM zone. Use pyproj’s Transformer.from_crs with always_xy=True to avoid the axis-order bugs introduced in pyproj 2.x.

My dataset has > 50 M rows. Will groupby-based smoothing run out of memory?

GroupBy.transform in pandas materialises intermediate arrays for each group simultaneously. For fleets with high-cardinality entity_id columns and long trajectories, switch to polars lazy evaluation or process entities in batches written to partitioned Parquet. Sorting the input by (entity_id, timestamp) before writing allows Parquet partition pruning and eliminates full-dataset shuffles during the groupby step.
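
A batching loop over entities might be sketched as follows. The generator name `iter_entity_batches` and the default `batch_size` are hypothetical, and the sketch assumes the frame (or one Parquet partition of it) already fits in memory; it only bounds the size of each groupby step.

```python
import pandas as pd


def iter_entity_batches(df: pd.DataFrame, batch_size: int = 500):
    """Yield sub-frames that each hold complete trajectories for a batch of entities."""
    entity_ids = df["entity_id"].drop_duplicates().to_numpy()
    for start in range(0, len(entity_ids), batch_size):
        batch_ids = entity_ids[start:start + batch_size]
        # Every row of an entity lands in the same batch, so no trajectory is split
        yield df[df["entity_id"].isin(batch_ids)]


frame = pd.DataFrame({
    "entity_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "speed_ms": range(10),
})
batches = list(iter_entity_batches(frame, batch_size=2))
```

Each batch can then be profiled on its own and written to a separate Parquet file before the next batch is processed.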

Calculating instantaneous speed from discrete GPS points — deep dive into interpolation, geodesic formulas, and adaptive Kalman filtering for asynchronous GPS streams
Stay-point detection algorithms — uses smoothed speed thresholds as the primary input to dwell identification
Directionality & turn analysis — combines acceleration magnitude with heading change for intersection behaviour mapping
GPS precision & error handling — upstream quality control that profiling depends on
Rolling statistics for mobility metrics — windowed aggregation patterns over kinematic output
