neeTo dEce

The goal of water-temp-bc is to document and serve out British Columbia water temperature (and other realtime hydrometric) data. We scrape the Environment Canada (ECCC) realtime feed for every BC station once a month via a GitHub Actions cron and publish the result as a partitioned parquet dataset on S3.


Data layout

s3://water-temp-bc/data/
├── realtime/<yyyy>/<mm>/snapshot_<yyyy-mm-dd>.parquet  # canonical, append-only
├── historic/realtime_raw_*.parquet                      # frozen pre-modernization
└── stations_realtime.parquet                            # station metadata

The bucket lives in us-west-2. To pull a raw file in a browser, the URL needs the explicit region:

https://water-temp-bc.s3.us-west-2.amazonaws.com/data/realtime/2026/05/snapshot_2026-05-14.parquet

Each monthly snapshot pulls the full ~18-month realtime window (the most ECCC serves) and tags every row with a harvested_at timestamp. Consecutive snapshots overlap by design, and the canonical value at each (STATION_NUMBER, Parameter, Date) is the row with the most recent harvested_at — that’s how QC corrections from ECCC win over earlier provisional readings. The dedup is handled at read time by query_canonical() so callers don’t have to think about it.

The historic/ prefix preserves the pre-modernization parquets as-is for explicit archival reads. Their schemas are heterogeneous (different column types, different column sets) — read them one at a time with awareness of their shape.


How to query

scripts/query.R has top-to-bottom worked examples (water temp for one station, daily means across stations, latest reading per station, reading a historic file). The short version:

source("scripts/query-helpers.R")

# Water temperature for one station, last 6 months
query_canonical(parameter = 5, stations = "07EA004", from = Sys.Date()-180) |>
  dplyr::collect()

query_canonical() returns a lazy dplyr query against the partitioned realtime/ tree on S3, automatically deduped on (STATION_NUMBER, Parameter, Date) by taking the row with the latest harvested_at. Chain more dplyr verbs and call dplyr::collect() when you want the data in memory.

Common parameter codes: 5 = water temperature, 6 = discharge (daily mean), 46 = water level.


# Station metadata (location, drainage, timezone, etc.) — unchanged file path,
# managed separately from the realtime snapshots.
stations <- arrow::read_parquet("s3://water-temp-bc/data/stations_realtime.parquet")
# Per-station date ranges in the canonical realtime/ dataset. Slow over the
# network so we cache the result locally and only refresh on demand.
range <- query_canonical(parameter = 5) |>
  dplyr::group_by(STATION_NUMBER) |>
  dplyr::summarise(
    min_date = min(Date, na.rm = TRUE),
    max_date = max(Date, na.rm = TRUE),
    .groups  = "drop"
  ) |>
  dplyr::collect()

saveRDS(range, "data/result.rds")

Realtime station information with the date range of water temperature data available in s3://water-temp-bc/data/realtime/. NOTE: To view all columns in the table - please click on one of the sort arrows within column headers before scrolling to the right.


Sample query

The chunk below pulls the last 6 months of water-temperature observations for one station via query_canonical(). It’s the same pattern as Example 1 in scripts/query.R.

sample <- query_canonical(
  parameter = 5,
  stations  = "07EA004",
  from      = Sys.Date() - 180
) |>
  dplyr::select(STATION_NUMBER, Date, Value, Unit, Grade, Approval) |>
  dplyr::arrange(Date) |>
  dplyr::collect()

Last 6 months of water temperature (Parameter == 5) for station 07EA004, pulled via query_canonical() from s3://water-temp-bc/data/realtime/. NOTE: To view all columns in the table - please click on one of the sort arrows within column headers before scrolling to the right.