class: center, middle, inverse, title-slide

# Separate and unite
## Work with strings
### Byron C. Jaeger
### Last updated: 2020-06-01

---
class: inverse, center, middle

# Separate

---
layout:true
background-image: url(img/tidyr.png)
background-size: 12.5%
background-position: 97.5% 2.5%

## Separate

---

We'll start with ambulatory BP monitoring demographics data:

```r
abpm_demo <- read_rds('data/abpm_demographics.rds')

abpm_demo
```

```
## # A tibble: 2,500 x 2
##       id asr
##    <int> <chr>
##  1     1 20_Male_White
##  2     2 30_Female_White
##  3     3 30_Male_Black
##  4     4 28_Male_White
##  5     5 28_Male_White
##  6     6 19_Male_Black
##  7     7 29_Female_Black
##  8     8 24_Male_Black
##  9     9 22_Female_White
## 10    10 23_Male_Black
## # ... with 2,490 more rows
```

---

Use `separate()` to turn one column into 2 or more columns:

```r
abpm_demo %>% 
* separate(col = asr, into = c('age', 'sex', 'race'))
```

```
## # A tibble: 2,500 x 4
##       id age   sex    race
##    <int> <chr> <chr>  <chr>
##  1     1 20    Male   White
##  2     2 30    Female White
##  3     3 30    Male   Black
##  4     4 28    Male   White
##  5     5 28    Male   White
##  6     6 19    Male   Black
##  7     7 29    Female Black
##  8     8 24    Male   Black
##  9     9 22    Female White
## 10    10 23    Male   Black
## # ... with 2,490 more rows
```

---

Why did that work?

- `separate()` has a built-in default pattern that it searches for in the character column designated as `col` (more on patterns later).

<img src="img/separate_help.png" width="100%" />

---

But what if we had less tidy data to begin with?

```r
abpm_demo
```

```
## # A tibble: 2,500 x 2
##       id asr
##    <int> <glue>
##  1     1 20_Male.junk.White
##  2     2 30_Female.junk.White
##  3     3 30_Male.junk.Black
##  4     4 28_Male.junk.White
##  5     5 28_Male.junk.White
##  6     6 19_Male.junk.Black
##  7     7 29_Female.junk.Black
##  8     8 24_Male.junk.Black
##  9     9 22_Female.junk.White
## 10    10 23_Male.junk.Black
## # ... with 2,490 more rows
```

---

We'll need to do some cleaning before things work:

```r
# won't work! the default pattern also splits on '.', so 'junk' lands in race
separate(abpm_demo, col = asr, into = c('age', 'sex', 'race'))
```

```
## # A tibble: 2,500 x 4
##       id age   sex    race
##    <int> <chr> <chr>  <chr>
##  1     1 20    Male   junk
##  2     2 30    Female junk
##  3     3 30    Male   junk
##  4     4 28    Male   junk
##  5     5 28    Male   junk
##  6     6 19    Male   junk
##  7     7 29    Female junk
##  8     8 24    Male   junk
##  9     9 22    Female junk
## 10    10 23    Male   junk
## # ... with 2,490 more rows
```

---

What can we do? Many things. How about two separates?

```r
# here's the first one
abpm_demo %>% 
* separate(col = asr, into = c('age', 'to_split'), sep = '_')
```

```
## # A tibble: 2,500 x 3
##       id age   to_split
##    <int> <chr> <chr>
##  1     1 20    Male.junk.White
##  2     2 30    Female.junk.White
##  3     3 30    Male.junk.Black
##  4     4 28    Male.junk.White
##  5     5 28    Male.junk.White
##  6     6 19    Male.junk.Black
##  7     7 29    Female.junk.Black
##  8     8 24    Male.junk.Black
##  9     9 22    Female.junk.White
## 10    10 23    Male.junk.Black
## # ... with 2,490 more rows
```

---

What can we do? Many things. How about two separates?

```r
# and now the second (dots are escaped so they match a literal '.')
abpm_demo %>% 
  separate(asr, into = c('age', 'to_split'), sep = '_') %>% 
* separate(to_split, into = c('sex', 'race'), sep = '\\.junk\\.')
```

```
## # A tibble: 2,500 x 4
##       id age   sex    race
##    <int> <chr> <chr>  <chr>
##  1     1 20    Male   White
##  2     2 30    Female White
##  3     3 30    Male   Black
##  4     4 28    Male   White
##  5     5 28    Male   White
##  6     6 19    Male   Black
##  7     7 29    Female Black
##  8     8 24    Male   Black
##  9     9 22    Female White
## 10    10 23    Male   Black
## # ... with 2,490 more rows
```

---
layout: false
class: inverse, center, middle

# Unite

---
layout:true
background-image: url(img/tidyr.png)
background-size: 12.5%
background-position: 97.5% 2.5%

## Unite

---

`unite()` is simply the inverse of `separate()`.
```r
abpm_wide %>% 
* select(age, sex, race)
```

```
## # A tibble: 2,500 x 3
##      age sex    race
##    <dbl> <chr>  <chr>
##  1    20 Male   White
##  2    30 Female White
##  3    30 Male   Black
##  4    28 Male   White
##  5    28 Male   White
##  6    19 Male   Black
##  7    29 Female Black
##  8    24 Male   Black
##  9    22 Female White
## 10    23 Male   Black
## # ... with 2,490 more rows
```

---

`unite()` is simply the inverse of `separate()`.

```r
abpm_wide %>% 
  select(id, age, sex, race) %>% 
* unite(col = 'asr', age, sex, race, sep = '_')
```

```
## # A tibble: 2,500 x 2
##       id asr
##    <int> <chr>
##  1     1 20_Male_White
##  2     2 30_Female_White
##  3     3 30_Male_Black
##  4     4 28_Male_White
##  5     5 28_Male_White
##  6     6 19_Male_Black
##  7     7 29_Female_Black
##  8     8 24_Male_Black
##  9     9 22_Female_White
## 10    10 23_Male_Black
## # ... with 2,490 more rows
```

---
layout:false
class: inverse, center, middle

# But what about <br/>"[^[:alnum:]]+"?

---
layout:true
background-image: url(img/stringr.png)
background-size: 12.5%
background-position: 97.5% 2.5%

## Regular expressions and strings

---

A string is just a character value:

```r
string <- "A string for you"
writeLines(string)
```

```
## A string for you
```

---

What if you want to put a quote inside the string? <br/><br/> Use the `\` symbol to escape it!

```r
string <- "A \"string\" for you"
writeLines(string)
```

```
## A "string" for you
```

But what if you want to write a `\` in your string? Escape the escape!

```r
string <- "A \\backslash\\ for you"
writeLines(string)
```

```
## A \backslash\ for you
```

---

So how about regular expressions?

- You can use these to match patterns in a string
- They take a bit of time to learn, but are worth it

Here is some motivation:

```r
some_strings <- c(
  "115 is a number",
  "want another one? 3!",
  "Okay, here is one more: 11"
)
```

How could we pull the numbers out of these strings?

---

We could...

- remove all of the non-digit characters

```r
str_remove_all(some_strings, pattern = '\\D')
```

```
## [1] "115" "3"   "11"
```

---

We could...
- extract all of the digit characters

```r
str_extract_all(some_strings, pattern = '\\d')
```

```
## [[1]]
## [1] "1" "1" "5"
##
## [[2]]
## [1] "3"
##
## [[3]]
## [1] "1" "1"
```

---

But what if the numbers had a decimal?

```r
tricky_strings <- c(
  'this number means business: 11.234',
  'and don\'t even get me started on 43.211'
)
```

---

We need to write a regular expression that will match

- zero or more digits, followed by
- an optional decimal symbol, and then
- one or more digits

That is, `"\\d*\\.?\\d+"`

- `"\\d*"` matches zero or more digits
- `"\\.?"` matches zero or one `.` symbol
- `"\\d+"` matches one or more digits

---

Voila.

```r
str_extract_all(tricky_strings, "\\d*\\.?\\d+")
```

```
## [[1]]
## [1] "11.234"
##
## [[2]]
## [1] "43.211"
```

or, if you want a character vector:

```r
str_extract_all(tricky_strings, "\\d*\\.?\\d+") %>% 
  unlist()
```

```
## [1] "11.234" "43.211"
```

---
layout:false

## Learning more

To learn more, see

- The [stringr](https://stringr.tidyverse.org/) website
- This [vignette](https://stringr.tidyverse.org/articles/regular-expressions.html) dedicated to regular expressions.
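---

## Regular expressions and strings

Small changes to a number-matching pattern change what it accepts. The sketch below compares the pattern used earlier in this deck, `"\\d*\\.?\\d+"`, with two alternatives. The input vector `x` and the two alternative patterns are assumptions made up for illustration; they are not part of the ABPM data or the original examples.

```r
library(stringr)

# Illustrative inputs (assumed for this sketch)
x <- c("price: 11.234", "count: 7", "ratio: .5")

# Pattern from the slides: optional leading digits, optional '.', then digits
p_slides <- "\\d*\\.?\\d+"

# Stricter alternative: digits, a required '.', then digits
p_strict <- "\\d+\\.\\d+"

# Alternative that accepts integers but requires a leading digit
p_leading <- "\\d+(\\.\\d+)?"

res_slides  <- str_extract(x, p_slides)   # "11.234" "7" ".5"
res_strict  <- str_extract(x, p_strict)   # "11.234" NA  NA
res_leading <- str_extract(x, p_leading)  # "11.234" "7" "5"

# Convert the extracted text to numbers once the pattern is settled
as.numeric(res_slides)
```

Which pattern is appropriate depends on the data: `p_strict` silently drops integers, while `p_leading` loses the decimal point in values written like `.5`.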