class: center, middle, inverse, title-slide

# Separate and unite
## Work with strings
### Byron C. Jaeger
### Last updated: 2020-06-01

---

class: inverse, center, middle

# Separate

---
layout:true
background-image: url(img/tidyr.png)
background-size: 12.5%
background-position: 97.5% 2.5%

## Separate
---

We'll start with ambulatory BP monitoring demographics data:

```r
abpm_demo <- read_rds('data/abpm_demographics.rds')

abpm_demo
```

```
## # A tibble: 2,500 x 2
##       id asr            
##    <int> <chr>          
##  1     1 20_Male_White  
##  2     2 30_Female_White
##  3     3 30_Male_Black  
##  4     4 28_Male_White  
##  5     5 28_Male_White  
##  6     6 19_Male_Black  
##  7     7 29_Female_Black
##  8     8 24_Male_Black  
##  9     9 22_Female_White
## 10    10 23_Male_Black  
## # ... with 2,490 more rows
```

---

Use `separate()` to turn one column into 2 or more columns:

```r
abpm_demo %>% 
* separate(col = asr, into = c('age', 'sex', 'race'))
```

```
## # A tibble: 2,500 x 4
##       id age   sex    race 
##    <int> <chr> <chr>  <chr>
##  1     1 20    Male   White
##  2     2 30    Female White
##  3     3 30    Male   Black
##  4     4 28    Male   White
##  5     5 28    Male   White
##  6     6 19    Male   Black
##  7     7 29    Female Black
##  8     8 24    Male   Black
##  9     9 22    Female White
## 10    10 23    Male   Black
## # ... with 2,490 more rows
```

---

Why did that work?

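- `separate()` has a built-in default pattern, `sep = "[^[:alnum:]]+"`, that it searches for in the character column designated as `col`. The pattern matches any run of non-alphanumeric characters, such as `_`. (more on patterns later)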
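---

Here is a minimal sketch of that default at work. It assumes the default `sep` of `separate()` is `'[^[:alnum:]]+'`; the tibble `toy` and its column `x` are made up for illustration:

```r
library(tibble)
library(tidyr)

# made-up data: three different delimiters, none of them alphanumeric
toy <- tibble(x = c('a_b', 'c.d', 'e-f'))

# no `sep` is given, so the default pattern splits on _, . and - alike
toy_split <- separate(toy, col = x, into = c('left', 'right'))

toy_split
```

Each value should land in two columns, whichever delimiter it used.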

---

But what if we had less tidy data to begin with?

```r
abpm_demo
```

```
## # A tibble: 2,500 x 2
##       id asr                 
##    <int> <glue>              
##  1     1 20_Male.junk.White  
##  2     2 30_Female.junk.White
##  3     3 30_Male.junk.Black  
##  4     4 28_Male.junk.White  
##  5     5 28_Male.junk.White  
##  6     6 19_Male.junk.Black  
##  7     7 29_Female.junk.Black
##  8     8 24_Male.junk.Black  
##  9     9 22_Female.junk.White
## 10    10 23_Male.junk.Black  
## # ... with 2,490 more rows
```

---

We'll need to do some cleaning before things work:

```r
# won't work!
separate(abpm_demo, col = asr, into = c('age', 'sex', 'race'))
```

```
## # A tibble: 2,500 x 4
##       id age   sex    race 
##    <int> <chr> <chr>  <chr>
##  1     1 20    Male   junk 
##  2     2 30    Female junk 
##  3     3 30    Male   junk 
##  4     4 28    Male   junk 
##  5     5 28    Male   junk 
##  6     6 19    Male   junk 
##  7     7 29    Female junk 
##  8     8 24    Male   junk 
##  9     9 22    Female junk 
## 10    10 23    Male   junk 
## # ... with 2,490 more rows
```
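---

To see why `race` ends up as `junk`, it may help to split one value by hand. This sketch assumes the default `sep` of `separate()` is `'[^[:alnum:]]+'` and applies that pattern with `str_split()` from stringr to a single value copied from the data:

```r
library(stringr)

# one value copied from the messy column
messy <- '20_Male.junk.White'

# split wherever one or more non-alphanumeric characters occur
pieces <- str_split(messy, pattern = '[^[:alnum:]]+')[[1]]

pieces
```

The split yields four pieces, but `into` names only three columns, so the fourth piece (the race) is discarded with a warning.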

---

What can we do? Many things. How about two separates?

```r
# here's the first one
abpm_demo %>% 
* separate(col = asr, into = c('age', 'to_split'), sep = '_')
```

```
## # A tibble: 2,500 x 3
##       id age   to_split         
##    <int> <chr> <chr>            
##  1     1 20    Male.junk.White  
##  2     2 30    Female.junk.White
##  3     3 30    Male.junk.Black  
##  4     4 28    Male.junk.White  
##  5     5 28    Male.junk.White  
##  6     6 19    Male.junk.Black  
##  7     7 29    Female.junk.Black
##  8     8 24    Male.junk.Black  
##  9     9 22    Female.junk.White
## 10    10 23    Male.junk.Black  
## # ... with 2,490 more rows
```

---

What can we do? Many things. How about two separates?

```r
# and now the second
abpm_demo %>% 
  separate(asr, into = c('age', 'to_split'), sep = '_') %>% 
* separate(to_split, into = c('sex', 'race'), sep = '\\.junk\\.')
```
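---

Two separates are one option among many. As a sketch of another, we could remove the junk first and then separate once. The tibble `messy_demo` below is a hand-made stand-in for the messy `abpm_demo` data, not the real file:

```r
library(dplyr)
library(stringr)
library(tidyr)

# hand-made stand-in for the messy abpm_demo data
messy_demo <- tibble(
  id  = 1:3,
  asr = c('20_Male.junk.White', '30_Female.junk.White', '30_Male.junk.Black')
)

clean_demo <- messy_demo %>% 
  # drop the literal text '.junk' (the dot is escaped so it matches only a dot)
  mutate(asr = str_remove(asr, pattern = '\\.junk')) %>% 
  # values like '20_Male.White' now split into three pieces by default
  separate(col = asr, into = c('age', 'sex', 'race'))

clean_demo
```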

---
layout: false
class: inverse, center, middle

# Unite

---
layout:true

background-image: url(img/tidyr.png)
background-size: 12.5%
background-position: 97.5% 2.5%

## Unite

---

`unite()` is simply the inverse of `separate()`.

```r
abpm_wide %>% 
* select(age, sex, race)
```

```
## # A tibble: 2,500 x 3
##      age sex    race 
##    <dbl> <chr>  <chr>
##  1    20 Male   White
##  2    30 Female White
##  3    30 Male   Black
##  4    28 Male   White
##  5    28 Male   White
##  6    19 Male   Black
##  7    29 Female Black
##  8    24 Male   Black
##  9    22 Female White
## 10    23 Male   Black
## # ... with 2,490 more rows
```

---

`unite()` is simply the inverse of `separate()`.

```r
abpm_wide %>% 
  select(id, age, sex, race) %>% 
* unite(col = 'asr', age, sex, race, sep = '_')
```
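---

A small round trip makes the inverse relationship concrete. This is a sketch on a hand-made tibble, `demo`, that stands in for three rows of `abpm_wide`:

```r
library(dplyr)
library(tidyr)

# hand-made stand-in for three rows of abpm_wide
demo <- tibble(
  id   = 1:3,
  age  = c(20, 30, 30),
  sex  = c('Male', 'Female', 'Male'),
  race = c('White', 'White', 'Black')
)

# paste age, sex and race into one column, joined by '_'
united <- demo %>% 
  unite(col = 'asr', age, sex, race, sep = '_')

# separate() undoes unite(), except that age comes back as character
round_trip <- united %>% 
  separate(col = asr, into = c('age', 'sex', 'race'), sep = '_')
```

Note that the round trip is not perfect: `age` started as a number and returns as a character column.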

---
layout:false
class: inverse, center, middle

# But what about <br/>"[^[:alnum:]]+"?

---
layout:true

background-image: url(img/stringr.png)
background-size: 12.5%
background-position: 97.5% 2.5%

## Regular expressions and strings

---

A string is just a character value:

```r
string <- "A string for you"

writeLines(string)
```

```
## A string for you
```

---

What if you want to put a quote inside the string? <br/><br/>
Use the `\` symbol to escape it!

```r
string <- "A \"string\" for you"

writeLines(string)
```

```
## A "string" for you
```

But what if you want to write a \ in your string? Escape the escape!

```r
string <- "A \\backslash\\ for you"

writeLines(string)
```

```
## A \backslash\ for you
```
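---

The escapes are part of how you type the string, not part of the string itself. A short sketch comparing `print()`, `writeLines()` and `nchar()` on the same value:

```r
string <- "A \\backslash\\ for you"

# print() shows the string as you would type it, escapes included
print(string)

# writeLines() shows the characters the string actually contains
writeLines(string)

# each escaped pair '\\' counts as a single character
n_chars <- nchar(string)
```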

---

So how about regular expressions?

- You can use these to match patterns in a string

- They take a bit of time to learn, but are worth it

Here is some motivation:

```r
some_strings <- c(
  "115 is a number", 
  "want another one? 3!", 
  "Okay, here is one more: 11"
)
```

How could we pull the numbers out of these strings?

---

We could...

- remove all of the non-digit characters

```r
str_remove_all(some_strings, pattern = '\\D')
```

```
## [1] "115" "3"   "11"
```

---

We could...

- extract all of the digit characters

```r
str_extract_all(some_strings, pattern = '\\d')
```

```
## [[1]]
## [1] "1" "1" "5"
## 
## [[2]]
## [1] "3"
## 
## [[3]]
## [1] "1" "1"
```
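---

Extracting one digit at a time breaks the numbers apart. As a sketch of one way around that, `'\\d+'` matches a whole run of digits, and `str_extract()` returns the first match in each string:

```r
library(stringr)

some_strings <- c(
  "115 is a number", 
  "want another one? 3!", 
  "Okay, here is one more: 11"
)

# '\\d+' matches one or more digits in a row
numbers_chr <- str_extract(some_strings, pattern = '\\d+')

# convert the character results to numeric values
numbers_dbl <- as.numeric(numbers_chr)
```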

---

But what if the numbers had a decimal?

```r
tricky_strings <- c(
  'this number means business: 11.234',
  'and don\'t even get me started on 43.211'
)
```

---

We need to write a regular expression that will match

- zero or more digits, followed by

- an optional decimal symbol, and then

- one or more digits

That is, "\\d*\\.?\\d+"

- "\\d*" matches zero or more digits

- "\\.?" matches zero or one '.' symbol

- "\\d+" matches one or more digits

---

Voila.

```r
str_extract_all(tricky_strings, "\\d*\\.?\\d+")
```

```
## [[1]]
## [1] "11.234"
## 
## [[2]]
## [1] "43.211"
```

Or, if you want a character vector:

```r
str_extract_all(tricky_strings, "\\d*\\.?\\d+") %>% 
  unlist()
```

```
## [1] "11.234" "43.211"
```
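---

Because the leading digits and the decimal symbol are both optional in the pattern used above, it should also pick up whole numbers and decimals written without a leading zero. A small sketch on a made-up string:

```r
library(stringr)

# made-up string with an integer, a leading-dot decimal, and a decimal
more_numbers <- 'a 7, then .5, then 3.14'

# same pattern as before: optional digits, optional '.', then digits
found <- unlist(str_extract_all(more_numbers, '\\d*\\.?\\d+'))

found
```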

---
layout:false

## Learning more

To learn more, see

- The [stringr](https://stringr.tidyverse.org/) website

- This [vignette](https://stringr.tidyverse.org/articles/regular-expressions.html) dedicated to regular expressions.