What spatial bin size should I use for pedestrian mobility heatmaps?

Start with 50–100 m bins, then validate using the 75th-percentile nearest-neighbour distance from your actual dataset. Pedestrian flows are dense enough that the empirical value usually falls in this range, but event-driven datasets (e.g. concerts, protests) may require narrower bins.

Why does Moran's I above 0.3 indicate a bin-size problem?

A high global Moran's I means neighbouring cells share unusually similar counts, which this workflow reads as a sign that the spatial resolution is too coarse to separate distinct activity zones. The 0.3 cut-off is a heuristic rather than a statistical law: shrink the bin width until Moran's I drops below it, so that neighbouring cells behave more like independent spatial samples.
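
To make the statistic concrete, here is a minimal pure-Python sketch of global Moran's I on a small regular grid. It assumes binary rook adjacency (cells that share an edge) rather than the row-standardised Queen weights used in the validation code on this page, and the helper name `morans_i_grid` is illustrative, not a library function. It will divide by zero if every cell holds the same count.

```python
def morans_i_grid(grid):
    """Global Moran's I for a 2-D list of counts, binary rook adjacency."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    dev = [[v - mean for v in row] for row in grid]

    cross = 0.0  # sum of w_ij * dev_i * dev_j over neighbouring pairs
    s0 = 0       # sum of all weights (each link counted in both directions)
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    cross += dev[r][c] * dev[rr][cc]
                    s0 += 1

    variance_sum = sum(d * d for row in dev for d in row)
    return (n / s0) * (cross / variance_sum)


# Two activity zones smeared across blocks of cells: neighbours look alike
clustered = [
    [10, 10, 0, 0],
    [10, 10, 0, 0],
    [10, 10, 0, 0],
    [10, 10, 0, 0],
]
# Alternating counts: neighbours look as different as possible
checkerboard = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]

print(round(morans_i_grid(clustered), 3))     # 0.667
print(round(morans_i_grid(checkerboard), 3))  # -1.0
```

The clustered grid sits well above the 0.3 heuristic, while the checkerboard reaches the negative extreme.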

Can I use the same bin size for peak and off-peak hours?

Only if your dataset density is roughly uniform across the day. In practice, peak-hour density is often 5–10x the off-peak level in dense corridors. Adaptive binning (coarser bins during low-density windows, finer bins at peak) prevents zero-inflation during quiet periods while preserving resolution during congestion.
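
One way to size the off-peak bins, sketched below, is to hold the expected count per cell constant. Because a square cell's expected count is density × width², a density drop by a factor k calls for a width increase of √k. The function name and the constant-count assumption are illustrative choices, not something the calibration pipeline prescribes.

```python
import math


def off_peak_bin_width(peak_width_m: float, density_ratio: float) -> float:
    """Width that keeps expected points-per-cell equal to the peak-hour grid.

    density_ratio is peak density divided by off-peak density (assumed >= 1).
    """
    if peak_width_m <= 0 or density_ratio < 1:
        raise ValueError("peak_width_m must be > 0 and density_ratio >= 1")
    return peak_width_m * math.sqrt(density_ratio)


for ratio in (5, 10):
    print(ratio, round(off_peak_bin_width(50.0, ratio), 1))
```

With a 50 m peak grid, a 5–10x density drop suggests off-peak bins of roughly 110–160 m under this assumption.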

Why should I never compute bin widths in EPSG:4326?

Degree-based coordinates produce distances in degrees, not metres. One degree of latitude is ~111 km, but one degree of longitude varies from ~111 km at the equator to near zero at the poles. Nearest-neighbour calculations in EPSG:4326 are therefore meaningless for any distance-sensitive grid construction.
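
The distortion is easy to quantify. The sketch below uses a spherical-Earth approximation (111.32 km per degree of latitude, scaled by the cosine of latitude for longitude); the constant and function name are assumptions made for this example.

```python
import math

METRES_PER_DEGREE_LAT = 111_320.0  # spherical-Earth approximation


def metres_per_degree_lon(lat_deg: float) -> float:
    """Approximate ground length of one degree of longitude at a latitude."""
    return METRES_PER_DEGREE_LAT * math.cos(math.radians(lat_deg))


for lat in (0, 40, 60):
    east_west = 0.001 * metres_per_degree_lon(lat)
    north_south = 0.001 * METRES_PER_DEGREE_LAT
    print(f"lat {lat:>2}: 0.001 deg = {north_south:.0f} m N-S, {east_west:.0f} m E-W")
```

At 60° latitude a 0.001° cell is about 111 m tall but only about 56 m wide, so a "square" degree-based grid is not square on the ground.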

How do I handle zero-count cells without distorting my heatmap?

Filter them out before visualisation or downstream modelling with heatmap_df[heatmap_df['count'] > 0]. Retaining empty cells inflates storage, distorts density normalisation, and can produce misleading colour scales on visualisation libraries that auto-scale to the full value range.

Choosing Optimal Bin Sizes for Urban Mobility Heatmaps

The mathematically optimal spatial and temporal bin configuration for an urban mobility heatmap emerges from your dataset’s point density distribution and spatial autocorrelation range, not from arbitrary defaults. As a starting baseline, spatial bins range from 50–100 m for pedestrian and micro-mobility flows, through 100–200 m for mixed urban traffic, to 200–500 m for vehicular traffic. Temporal windows should align with operational cycles: 5–15 minutes for real-time dispatch and 30–60 minutes for transit planning. These are heuristics; production-grade analytics require empirical calibration against your specific data.

Why Bin Size Matters Here

Urban mobility traces exhibit heavy-tailed spatial distributions and bursty temporal patterns — properties addressed at length in the dynamic time-binning strategies framework that this page extends. Dense downtown corridors saturate coarse grids, while sparse suburban trajectories fragment into zero-count cells under fine grids. Both failure modes corrupt downstream density estimation and any machine-learning model trained on the heatmap output.

The root cause is the mismatch between fixed-resolution grids and the heterogeneous spatial density that characterises real urban movement, a structural issue shared with GPS precision and error handling problems: neither can be solved by a single global parameter. The answer is deriving bin dimensions from the data itself.

Spatial and Temporal Baselines

Mobility Mode	Spatial Bin	Temporal Window	Primary Use Case
Pedestrian / Micro-mobility	50–100 m	5–15 min	Sidewalk congestion, curb turnover, last-mile routing
Mixed Urban Traffic	100–200 m	15–30 min	Intersection throughput, signal timing, transit dwell
Vehicular / Freight	200–500 m	30–120 min	Corridor planning, freight routing, infrastructure ROI

Calibration Pipeline (4 Steps)

Project to a metric CRS. Convert trajectory centroids to a metre-based projection (EPSG:3857 or a local UTM zone) before any distance calculation. Coordinate reference system mapping explains how to choose and apply the right projection for your region.
Compute the nearest-neighbour distribution. Build a KDTree over the projected coordinates and record each point’s distance to its closest neighbour. The 75th percentile of this distribution is the candidate spatial bin width: three-quarters of points have their nearest neighbour within one bin width, so most occupied cells are likely to hold more than one observation without the grid becoming so coarse that distinct clusters merge.
Align temporal bins with GPS sampling frequency. If devices report every 30 s, a 10-minute window captures meaningful dwell and transit states without temporal aliasing. Shorter windows amplify GPS drift artefacts; longer windows mask micro-congestion. For datasets with inconsistent ping rates, apply time-series synchronisation strategies first.
Validate the configuration. Accept the bin dimensions only once the coefficient of variation (CV) of bin counts is below 0.8 and global Moran’s I is below 0.3 (see the Validation block below).
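
The sampling-rate constraint in step 3 can be checked with simple arithmetic. The sketch below assumes a fixed ping interval and a default minimum of five samples per device per window (the upper end of the 3–5 samples that dwell detection needs, as noted in the gotchas on this page); both helper names are hypothetical.

```python
import math


def samples_per_window(ping_interval_s: float, window_min: float) -> float:
    """Expected pings from one device inside one temporal bin."""
    return window_min * 60 / ping_interval_s


def min_window_minutes(ping_interval_s: float, min_samples: int = 5, step_min: int = 5) -> int:
    """Smallest window, rounded up to a multiple of step_min, holding min_samples pings."""
    needed_min = ping_interval_s * min_samples / 60
    return max(step_min, math.ceil(needed_min / step_min) * step_min)


print(samples_per_window(30, 10))   # 20.0
print(samples_per_window(30, 1))    # 2.0
print(min_window_minutes(120))      # 10
```

The result is a lower bound only: a 30 s ping rate permits a 5-minute window by this rule, and the 10-minute window suggested above simply leaves more headroom.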

Production-Ready Python Implementation

The functions below compute the optimal spatial bin width from nearest-neighbour percentiles, then build and aggregate a temporal heatmap. Both functions validate their inputs and reproject geographic (lat/lon) input to a metric CRS before measuring distances.

PYTHON

import numpy as np
import geopandas as gpd
import pandas as pd
from shapely.geometry import box
from scipy.spatial import KDTree


def compute_optimal_bin_width(
    gdf: gpd.GeoDataFrame,
    percentile: float = 0.75,
    min_width: float = 25.0,
    max_width: float = 1000.0,
) -> float:
    """Derive spatial bin width from the nearest-neighbour distance distribution.

    Parameters
    ----------
    gdf:        GeoDataFrame with point geometry and a defined CRS; geographic CRSs are reprojected.
    percentile: Fraction of the NN distribution to use as the bin width (0–1).
    min_width:  Hard floor in metres to prevent degenerate micro-bins.
    max_width:  Hard ceiling in metres to prevent computationally expensive mega-grids.

    Returns
    -------
    Bin width in metres (float), clamped to [min_width, max_width].

    Raises
    ------
    ValueError: if gdf has fewer than 2 points or has no CRS set.
    """
    if gdf is None or len(gdf) < 2:
        raise ValueError("GeoDataFrame must contain at least 2 points.")
    if gdf.crs is None:
        # to_crs() cannot reproject data whose source CRS is unknown
        raise ValueError("GeoDataFrame has no CRS; set one with gdf.set_crs(...).")

    # Always project to a metric CRS; never compute distances in EPSG:4326
    if gdf.crs.is_geographic:
        gdf = gdf.to_crs("EPSG:3857")

    coords = np.column_stack((gdf.geometry.x, gdf.geometry.y))
    tree = KDTree(coords)
    # k=2: index 0 is self (distance 0), index 1 is true nearest neighbour
    distances, _ = tree.query(coords, k=2)
    nn_dists = distances[:, 1]

    bin_width = float(np.percentile(nn_dists, percentile * 100))
    return float(np.clip(bin_width, min_width, max_width))


def build_mobility_heatmap(
    gdf: gpd.GeoDataFrame,
    bin_width: float,
    time_col: str = "timestamp",
    window_min: int = 15,
) -> gpd.GeoDataFrame:
    """Construct a regular spatial grid and aggregate point counts per time bin.

    Parameters
    ----------
    gdf:        Projected GeoDataFrame with point geometry and a timestamp column.
    bin_width:  Spatial bin side length in metres (use compute_optimal_bin_width).
    time_col:   Name of the datetime column in gdf.
    window_min: Temporal bin width in minutes (floor-aligned).

    Returns
    -------
    GeoDataFrame with columns [cell_id, time_bin, count, geometry].
    Empty cells (count == 0) are excluded to avoid zero-inflation.

    Raises
    ------
    ValueError: if gdf is empty, bin_width <= 0, time_col is missing, or gdf has no CRS.
    """
    if gdf is None or gdf.empty:
        raise ValueError("GeoDataFrame is empty.")
    if bin_width <= 0:
        raise ValueError(f"bin_width must be positive, got {bin_width}.")
    if time_col not in gdf.columns:
        raise ValueError(f"Column '{time_col}' not found in GeoDataFrame.")
    if gdf.crs is None:
        # to_crs() cannot reproject data whose source CRS is unknown
        raise ValueError("GeoDataFrame has no CRS; set one with gdf.set_crs(...).")

    # Ensure metric projection
    if gdf.crs.is_geographic:
        gdf = gdf.to_crs("EPSG:3857")

    minx, miny, maxx, maxy = gdf.total_bounds
    x_edges = np.arange(minx, maxx + bin_width, bin_width)
    y_edges = np.arange(miny, maxy + bin_width, bin_width)

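    # Build one square cell per (x, y) grid origin
    cells = [
        box(x, y, x + bin_width, y + bin_width)
        for x in x_edges
        for y in y_edges
    ]
    grid = gpd.GeoDataFrame({"geometry": cells}, crs=gdf.crs)
    grid["cell_id"] = grid.index

    # Spatial join. "intersects" keeps points lying exactly on a cell edge, such as
    # the points that define minx/miny, which "within" would silently drop.
    # A point on a shared edge matches several cells, so keep its first match only.
    points = gdf.reset_index(drop=True)
    joined = gpd.sjoin(points, grid, how="inner", predicate="intersects")
    joined = joined[~joined.index.duplicated(keep="first")].copy()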

    # Floor-align timestamps to temporal bin boundaries
    joined["time_bin"] = pd.to_datetime(joined[time_col]).dt.floor(f"{window_min}min")

    # Aggregate and merge geometry back
    counts = (
        joined.groupby(["cell_id", "time_bin"])
        .size()
        .reset_index(name="count")
    )
    heatmap = counts.merge(grid[["cell_id", "geometry"]], on="cell_id")

    # Exclude empty cells — retaining zeros distorts normalisation
    return gpd.GeoDataFrame(
        heatmap[heatmap["count"] > 0].reset_index(drop=True),
        crs=gdf.crs,
    )


# --- Usage ---
# optimal_w = compute_optimal_bin_width(mobility_gdf, percentile=0.75)
# heatmap = build_mobility_heatmap(mobility_gdf, bin_width=optimal_w, window_min=15)

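CRS note. Both functions auto-project to Web Mercator (EPSG:3857) when geographic coordinates are detected. Web Mercator stretches distances by roughly 1/cos(latitude), about 1.3x at 40° and 1.4x at 45°, so a nominal 100 m bin at those latitudes covers well under 100 m on the ground. When bin widths must be true metres, replace "EPSG:3857" with the appropriate local UTM zone; see best practices for CRS transformations in movement data for a zone-selection helper.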
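
As a rough illustration of zone selection, the standard 6°-wide UTM zones map to EPSG codes 326xx (northern hemisphere) and 327xx (southern hemisphere) for WGS 84. The helper below is a sketch under that assumption: its name is made up for this example, and it ignores the special-case zones around Norway and Svalbard. GeoPandas also offers `GeoDataFrame.estimate_utm_crs()`, which may be the simpler option when the data is already loaded.

```python
def utm_epsg_for(lon: float, lat: float) -> int:
    """EPSG code of the WGS 84 / UTM zone containing a lon/lat point."""
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        raise ValueError("lon/lat out of range")
    zone = min(int((lon + 180) // 6) + 1, 60)  # lon == 180 folds into zone 60
    base = 32600 if lat >= 0 else 32700        # north vs south hemisphere
    return base + zone


print(utm_epsg_for(-74.0060, 40.7128))   # 32618 (New York)
print(utm_epsg_for(151.2093, -33.8688))  # 32756 (Sydney)
```

The result can be passed to `gdf.to_crs(epsg=...)`, using the dataset's centroid in lon/lat to pick the zone.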

Memory note. For metropolitan-scale datasets (>10 M points), process the data in daily chunks against one shared grid and sum the per-chunk counts, so the spatial join never holds the full dataset in RAM. It is the chunking that bounds peak memory; the choice of sjoin predicate does not.
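
A minimal sketch of the chunking idea follows, in plain Python so the mechanics are visible. It assumes points are already projected, that every chunk is binned against the same fixed grid origin (otherwise cell indices would not line up between chunks), and that `window_min` divides 60. The function `count_chunk` and the tuple layout are hypothetical stand-ins for the GeoPandas pipeline above.

```python
from collections import Counter
from datetime import datetime


def count_chunk(rows, bin_width, window_min, origin=(0.0, 0.0)):
    """Count points per (ix, iy, time_bin) for one chunk of (x, y, timestamp) rows."""
    counts = Counter()
    for x, y, ts in rows:
        ix = int((x - origin[0]) // bin_width)
        iy = int((y - origin[1]) // bin_width)
        # Floor the timestamp to the start of its window (window_min must divide 60)
        t_bin = ts.replace(minute=ts.minute - ts.minute % window_min, second=0, microsecond=0)
        counts[(ix, iy, t_bin)] += 1
    return counts


day_1 = [(10.0, 20.0, datetime(2024, 5, 1, 8, 3)), (60.0, 20.0, datetime(2024, 5, 1, 8, 14))]
day_2 = [(12.0, 25.0, datetime(2024, 5, 2, 8, 7)), (14.0, 30.0, datetime(2024, 5, 2, 8, 9))]

total = Counter()
for chunk in (day_1, day_2):  # in practice each chunk would be read from disk in turn
    total.update(count_chunk(chunk, bin_width=50.0, window_min=15))

for key, n in sorted(total.items()):
    print(key, n)
```

Counts are additive across chunks, which is why the per-day results can simply be summed at the end.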

Validation Block

Run these checks immediately after building the heatmap:

PYTHON

import geopandas as gpd
from libpysal.weights import Queen
from esda.moran import Moran

# 1. Shape sanity
assert heatmap.columns.tolist() == ["cell_id", "time_bin", "count", "geometry"]
assert (heatmap["count"] > 0).all(), "Zero-count cells leaked through — filter before analysis."

# 2. Coefficient of Variation — target < 0.8 per time slice
for t_bin, group in heatmap.groupby("time_bin"):
    cv = group["count"].std() / group["count"].mean()
    if cv >= 0.8:
        print(f"WARNING: CV={cv:.2f} at {t_bin} — bins may be too fine (sparse) or too coarse (saturated).")

# 3. Spatial autocorrelation (Moran's I) — target < 0.3 for baseline heatmaps
# Run on the densest time slice to capture worst-case clustering
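# Pick the time bin with the largest total count, then keep every cell in it
peak_bin = heatmap.groupby("time_bin")["count"].sum().idxmax()
peak_slice = heatmap[heatmap["time_bin"] == peak_bin].reset_index(drop=True)
w = Queen.from_dataframe(gpd.GeoDataFrame(peak_slice, geometry="geometry"))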
w.transform = "r"
moran = Moran(peak_slice["count"].values, w)
print(f"Moran's I = {moran.I:.3f}  (p = {moran.p_sim:.3f})")
if moran.I >= 0.3:
    print("Spatial autocorrelation is high — consider halving bin_width and re-validating.")

A CV consistently above 0.8 signals that your bins are either too small (many near-zero cells) or too large (a few cells absorb most of the density). A Moran’s I above 0.3 means the grid is too coarse to resolve the spatial structure of the underlying flows — adjacent cells share too much signal.
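
To see how the CV threshold behaves, the sketch below compares an evenly loaded time slice with a saturated one using only the standard library. `statistics.stdev` is the sample standard deviation, which matches the pandas `.std()` default used in the validation code; the count lists are invented for illustration.

```python
from statistics import mean, stdev


def coefficient_of_variation(counts):
    """Sample standard deviation divided by the mean of the bin counts."""
    return stdev(counts) / mean(counts)


even_slice = [10, 12, 9, 11, 10, 8]     # load spread across cells
saturated_slice = [1, 1, 2, 1, 1, 60]   # one cell absorbs most of the density

for name, counts in (("even", even_slice), ("saturated", saturated_slice)):
    cv = coefficient_of_variation(counts)
    verdict = "ok" if cv < 0.8 else "re-calibrate"
    print(f"{name}: CV={cv:.2f} -> {verdict}")
```

The even slice lands near 0.14 and the saturated one near 2.2, so only the second would trigger the warning.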

Common Mistakes and Gotchas

Computing nearest-neighbour distances in EPSG:4326. Degree-based coordinates produce distances in degrees, not metres, so the percentile value is meaningless for setting a bin width in metres. Always project first.
Skipping the min_width / max_width clamp. Very dense datasets (e.g. high-frequency scooter telemetry) can yield a p75 NN distance of 3–8 m, creating millions of grid cells that exhaust RAM on any reasonable machine. The floor prevents degenerate micro-bins.
Retaining zero-count cells. Downstream normalisation (e.g. log-scale density maps) breaks when cells have count == 0. Filter them after aggregation, not before the spatial join.
Using iterrows for the spatial join. Vectorised geopandas.sjoin is typically orders of magnitude faster than a row-by-row loop because it relies on a spatial index and compiled geometry predicates. Never use iterrows for point-in-polygon assignment at scale.
Ignoring the sampling rate of your source data. A 30-second ping interval and a 1-minute temporal window capture only two observations per window. Dwell detection requires at least 3–5 samples, so temporal bins must be wide enough relative to the device’s reporting frequency.
Applying a single bin size across the entire city. High-density downtown cores and sparse suburban fringes demand different resolutions. Consider splitting your extent into activity zones and running compute_optimal_bin_width independently on each zone.
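
The per-zone idea can be sketched without any geospatial dependencies. The code below is a brute-force stand-in for `compute_optimal_bin_width` (O(n²), so only suitable for small samples), and the zone names and synthetic point lattices are made up for the example. It also shows the min_width clamp taking effect on the dense zone.

```python
import math
from statistics import quantiles


def p75_nn_width(points, min_width=25.0, max_width=1000.0):
    """75th-percentile nearest-neighbour distance, clamped to [min_width, max_width]."""
    nn_dists = [
        min(math.hypot(xi - xj, yi - yj) for j, (xj, yj) in enumerate(points) if j != i)
        for i, (xi, yi) in enumerate(points)
    ]
    p75 = quantiles(nn_dists, n=4, method="inclusive")[2]
    return min(max(p75, min_width), max_width)


# Synthetic projected coordinates (metres): a dense core and a sparse fringe
zones = {
    "downtown": [(x * 10.0, y * 10.0) for x in range(5) for y in range(5)],
    "suburb": [(x * 120.0, y * 120.0) for x in range(4) for y in range(4)],
}

widths = {name: p75_nn_width(pts) for name, pts in zones.items()}
print(widths)  # {'downtown': 25.0, 'suburb': 120.0}
```

The downtown lattice has a raw p75 distance of 10 m, which the floor raises to 25 m, while the suburban zone keeps its 120 m width.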

Dynamic Time-Binning Strategies — the parent framework that governs how bin selection feeds adaptive window logic
Mapping Congestion Thresholds to Real-Time Traffic Windows — applying bin-derived density counts to live congestion detection
Computing Rolling Average Speed Over Sliding Time Windows — combining spatial bins with rolling temporal aggregation for speed heatmaps
Optimizing Spatial Joins for Trajectory-to-Zone Matching — performance patterns for the geopandas.sjoin step at scale
Downsampling High-Frequency GPS Tracks Without Losing Path Integrity — pre-processing step that normalises ping rates before bin calibration
