water-temp-bc

Water temperature, discharge, and water level for ~290 BC hydrometric stations, queryable straight from S3 with no database and nothing to download up front. A monthly GitHub Actions cron job re-pulls the full ~18-month window that Environment and Climate Change Canada (ECCC) serves and compacts it into one deduplicated Parquet dataset, so QC corrections automatically replace earlier provisional readings and the record keeps growing month over month.

Quick start

install.packages(c("arrow", "dplyr", "duckdb", "dbplyr", "lubridate"))

source("https://raw.githubusercontent.com/NewGraphEnvironment/water-temp-bc/main/scripts/query-helpers.R")

# water temperature, one station, last 6 months
query_canonical(parameter = 5, stations = "07EA004", from = Sys.Date() - 180) |>
  dplyr::collect()

query_canonical() returns a lazy dplyr query — filter by parameter, stations, from/to, chain any dplyr verbs, and collect() when you want the data in memory. Typical queries return in a few seconds. No AWS account or credentials needed — the bucket is fully public. scripts/query.R has more worked examples (daily means across stations, latest reading per station).
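
As a rough illustration of chaining further work onto a query, the sketch below collapses high-frequency readings to one mean per station per day. It is not the code from scripts/query.R; daily_means() is a hypothetical helper, and it assumes Date is a timestamp (or date) column and Value is numeric, as the column names used elsewhere in this README suggest. It aggregates after collect() so it does not depend on which lazy backend query_canonical() uses; pushing the aggregation into the lazy query would likely be faster for large pulls.

```r
# Hypothetical helper: one mean value per station per calendar day (UTC).
# Expects a collected data frame with STATION_NUMBER, Date, and Value columns.
daily_means <- function(readings) {
  readings |>
    dplyr::mutate(day = as.Date(Date)) |>
    dplyr::group_by(STATION_NUMBER, day) |>
    dplyr::summarise(
      mean_value = mean(Value, na.rm = TRUE),
      n_readings = dplyr::n(),
      .groups    = "drop"
    ) |>
    dplyr::arrange(STATION_NUMBER, day)
}

# Usage (needs network access and the sourced query helpers):
# query_canonical(parameter = 5, stations = "07EA004", from = Sys.Date() - 30) |>
#   dplyr::collect() |>
#   daily_means()
```

Days are cut in UTC here; convert Date to the station's local timezone first if local calendar days matter.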

What’s in it

`Parameter`	Name	Unit	Rows	Stations
`5`	Water temperature	°C	4,474,827	291
`6`	Discharge (daily mean)	m³/s	155,261	252
`46`	Water level (primary sensor)	m	49,697,562	288
`47`	Discharge (primary sensor derived)	m³/s	44,398,842	255

For realtime discharge use 47, not 6 — 6 is one value per day; 5, 46 and 47 are high-frequency sensor readings. Record starts 2024-10 and grows monthly. Station locations and metadata live in stations_realtime.parquet (table below).
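
Because the numeric codes are easy to mix up (47 versus 6 in particular), one option is a small named lookup. The labels below are hypothetical and not part of the dataset or the query helpers; only the codes come from the table above.

```r
# Hypothetical labels for the parameter codes listed in the table above.
parameter_codes <- c(
  water_temperature    = 5L,
  discharge_daily_mean = 6L,
  water_level          = 46L,
  discharge_realtime   = 47L
)

# Look up one or more codes by label, failing loudly on an unknown label.
parameter_code <- function(label) {
  unknown <- setdiff(label, names(parameter_codes))
  if (length(unknown) > 0) {
    stop("Unknown parameter label: ", paste(unknown, collapse = ", "))
  }
  unname(parameter_codes[label])
}

# Usage (needs the sourced query helpers):
# query_canonical(parameter = parameter_code("discharge_realtime"),
#                 stations  = "07EA004")
```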

Data layout

s3://water-temp-bc/data/
├── canonical/Parameter=<n>/part-*.parquet     # THE dataset — query this
├── realtime/<yyyy>/<mm>/snapshot_.../         # raw monthly pulls (provenance only)
├── historic/realtime_raw_*.parquet            # frozen pre-2026 archive, odd schemas
└── stations_realtime.parquet                  # station metadata

canonical/ is deduplicated at build time — newest QC’d value per station, parameter, and timestamp. query_canonical() and arrow::open_dataset() both read it directly.
realtime/ keeps every raw overlapping pull; ~2/3 duplicate rows by design. Only useful if you need pre-correction history.
Single files fetched over HTTPS need the region in the hostname, e.g. https://water-temp-bc.s3.us-west-2.amazonaws.com/data/realtime/2026/06/snapshot_2026-06-01/chunk_001.parquet. Files under canonical/ carry Parameter in the directory name, not as a column. The store is rewritten ~12:00–14:00 UTC on the 1st of each month.
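
For readers who prefer to skip the helper script, the following is a minimal sketch of reading the store directly. open_canonical() and object_url() are hypothetical names introduced here; the bucket, region, and paths are taken from the layout and URL above. It assumes arrow detects the Parameter=<n> directories as a hive-style partition, which should expose Parameter as a column when the dataset is opened as a whole.

```r
# Hypothetical helper: open canonical/ anonymously as one arrow Dataset.
open_canonical <- function(bucket = "water-temp-bc", region = "us-west-2") {
  fs <- arrow::s3_bucket(bucket, anonymous = TRUE, region = region)
  # Hive-style Parameter=<n> directories should become a Parameter column.
  arrow::open_dataset(fs$path("data/canonical"))
}

# Hypothetical helper: region-qualified HTTPS URL for a single object key.
object_url <- function(key, bucket = "water-temp-bc", region = "us-west-2") {
  sprintf("https://%s.s3.%s.amazonaws.com/%s", bucket, region, key)
}

# Usage (needs network access):
# open_canonical() |>
#   dplyr::filter(Parameter == 5, STATION_NUMBER == "07EA004") |>
#   head(10) |>
#   dplyr::collect()
#
# tmp <- tempfile(fileext = ".parquet")
# utils::download.file(object_url("data/stations_realtime.parquet"), tmp, mode = "wb")
# stations <- arrow::read_parquet(tmp)
```

Downloading to a temporary file before reading keeps the single-file case independent of how arrow handles remote HTTPS connections.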

# Station metadata (location, drainage, timezone, etc.) — unchanged file path,
# managed separately from the realtime snapshots.
stations <- arrow::read_parquet("s3://water-temp-bc/data/stations_realtime.parquet")

# Per-station date ranges in the canonical dataset. The result is cached
# locally; set params$update_query to refresh it on demand (the query takes
# seconds against canonical/).
range <- query_canonical(parameter = 5) |>
  dplyr::group_by(STATION_NUMBER) |>
  dplyr::summarise(
    min_date = min(Date, na.rm = TRUE),
    max_date = max(Date, na.rm = TRUE),
    .groups  = "drop"
  ) |>
  dplyr::collect()

saveRDS(range, "data/result.rds")
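
The cache-then-refresh pattern mentioned in the comments can be sketched as a small wrapper. cached_result() is a hypothetical helper, not part of this repository, and the way params$update_query is wired in below is an assumption about how the flag is used.

```r
# Hypothetical helper: return the cached result at `path`, recomputing it
# when `refresh` is TRUE or no cache file exists yet. `compute` is a
# zero-argument function that produces the object to cache.
cached_result <- function(path, compute, refresh = FALSE) {
  if (refresh || !file.exists(path)) {
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    saveRDS(compute(), path)
  }
  readRDS(path)
}

# Usage (needs network access whenever the cache is rebuilt):
# station_ranges <- cached_result(
#   "data/result.rds",
#   compute = function() {
#     query_canonical(parameter = 5) |>
#       dplyr::group_by(STATION_NUMBER) |>
#       dplyr::summarise(
#         min_date = min(Date, na.rm = TRUE),
#         max_date = max(Date, na.rm = TRUE),
#         .groups  = "drop"
#       ) |>
#       dplyr::collect()
#   },
#   refresh = isTRUE(params$update_query)
# )
```

Passing the query as a function means the S3 read only happens when the cache is missing or a refresh is requested.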

Realtime station information with the date range of water temperature data available in s3://water-temp-bc/data/canonical/. Note: to view all columns in the table, click one of the sort arrows in a column header before scrolling to the right.

Sample query

The chunk below pulls the last 6 months of water-temperature observations for one station via query_canonical(). It’s the same pattern as Example 1 in scripts/query.R.

sample <- query_canonical(
  parameter = 5,
  stations  = "07EA004",
  from      = Sys.Date() - 180
) |>
  dplyr::select(STATION_NUMBER, Date, Value, Unit, Grade, Approval) |>
  dplyr::arrange(Date) |>
  dplyr::collect()

Last 6 months of water temperature (Parameter == 5) for station 07EA004, pulled via query_canonical() from s3://water-temp-bc/data/canonical/. Note: to view all columns in the table, click one of the sort arrows in a column header before scrolling to the right.