Extract segments of text from a data-frame column

Pull substrings between regex start/end delimiters from a string column. Each element of segment_pairs is a length-2 character vector: c(start_regex, end_regex). The name of each list element becomes the name of the corresponding output column.

Usage

ngr_str_df_extract(data, col, segment_pairs)

Arguments

data: data.frame or tibble::tbl_df. A table containing the source text column.
col: character. A single string naming the text column (e.g., "details").
segment_pairs: list. A named list in which each element is a length-2 character vector c(start_regex, end_regex). List names become output column names and must be non-empty and unique.

Value

data.frame. The input data with one new column per named segment. Values are NA where no match is found. An error is raised if col is not a column of data, if segment_pairs is not a properly named list, or if any element is not a length-2 character vector.
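
The error conditions listed above can be pictured with a small validator. This is a minimal sketch in base R: check_segment_pairs is a hypothetical helper written for illustration, and the checks and messages inside ngr_str_df_extract itself may differ.

```r
# Hypothetical validator (not part of the package): mirrors the documented
# error conditions for `col` and `segment_pairs`.
check_segment_pairs <- function(data, col, segment_pairs) {
  stopifnot(is.data.frame(data))
  # `col` must be one string that names an existing column
  if (!is.character(col) || length(col) != 1L || !col %in% names(data)) {
    stop("`col` must be a single column name present in `data`.", call. = FALSE)
  }
  # `segment_pairs` must be a list whose names are non-empty and unique
  nms <- names(segment_pairs)
  if (!is.list(segment_pairs) || is.null(nms) || anyNA(nms) ||
      any(nms == "") || anyDuplicated(nms) > 0L) {
    stop("`segment_pairs` must be a list with non-empty, unique names.",
         call. = FALSE)
  }
  # every element must be c(start_regex, end_regex)
  ok <- vapply(segment_pairs,
               function(p) is.character(p) && length(p) == 2L,
               logical(1))
  if (!all(ok)) {
    stop("Each element of `segment_pairs` must be a length-2 character vector.",
         call. = FALSE)
  }
  invisible(TRUE)
}

df_check <- data.frame(details = "Region: Fraser Basin", stringsAsFactors = FALSE)
check_segment_pairs(df_check, "details", list(region = c("Region", "$")))
```

An unnamed list such as list(c("Region", "$")) or a one-element pattern such as list(region = "Region") would stop with an error in this sketch.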

Examples

df <- tibble::tibble(details = c(
  "Grant Amount: $400,000 Intake Year: 2025 Region: Fraser Basin Project Theme: Restoration",
  "Grant Amount: $150,500 Intake Year: 2024 Region: Columbia Basin Project Theme: Food Systems"
))

# Each pair is c(start_regex, end_regex); "$" as an end delimiter runs to the
# end of the string. In an R string literal, "\\s*" is the regex \s*.
segs <- list(
  amount = c("Grant\\s*Amount", "Intake\\s*Year|Region|Project\\s*Theme|$"),
  year   = c("Intake\\s*Year", "Region|Project\\s*Theme|$"),
  region = c("Region", "Project\\s*Theme|$"),
  theme  = c("Project\\s*Theme", "$")
)

out <- ngr_str_df_extract(df, "details", segs)
out
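
To make the start/end delimiter idea concrete, the following base-R sketch extracts the text between the first match of a start pattern and the next match of an end pattern. It is an illustration only: extract_between is a hypothetical helper, not the package's implementation, and the choices to trim surrounding whitespace and to keep the leading colon are assumptions. It also shows why escaping matters: in an R string literal "\\s" is the regex \s (whitespace), whereas "\\\\s" is the regex \\s, a literal backslash followed by "s", which never occurs in this text and therefore yields NA.

```r
# Hypothetical helper (not part of the package): text between the first match
# of `start` and the next match of `end`; NA when either pattern is absent.
extract_between <- function(x, start, end) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    m_start <- regexpr(start, s, perl = TRUE)
    if (as.integer(m_start) == -1L) return(NA_character_)
    # Keep only the text that follows the start delimiter
    rest <- substring(s, as.integer(m_start) + attr(m_start, "match.length"))
    m_end <- regexpr(end, rest, perl = TRUE)
    if (as.integer(m_end) == -1L) return(NA_character_)
    # Assumption: surrounding whitespace is trimmed, the colon is kept
    trimws(substring(rest, 1L, as.integer(m_end) - 1L))
  }, character(1), USE.NAMES = FALSE)
}

details <- c(
  "Grant Amount: $400,000 Intake Year: 2025 Region: Fraser Basin Project Theme: Restoration",
  "Grant Amount: $150,500 Intake Year: 2024 Region: Columbia Basin Project Theme: Food Systems"
)

pairs <- list(
  amount = c("Grant\\s*Amount", "Intake\\s*Year|Region|Project\\s*Theme|$"),
  year   = c("Intake\\s*Year", "Region|Project\\s*Theme|$"),
  region = c("Region", "Project\\s*Theme|$"),
  theme  = c("Project\\s*Theme", "$")
)

# One output column per named pair
sketch <- as.data.frame(
  lapply(pairs, function(p) extract_between(details, p[1], p[2])),
  stringsAsFactors = FALSE
)
sketch$year
# ": 2025" ": 2024"

# Over-escaped pattern: looks for a literal backslash, so nothing matches
extract_between(details, "Grant\\\\s*Amount", "$")
# NA NA
```

A post-processing step such as sub("^:\\s*", "", sketch$year) would drop the leading colon if only the value is wanted.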
