# Transforming data with dplyr
## Introduction to tidy principles
### Byron C. Jaeger
### Last updated: 2020-04-22

---

# Tidy data

---

## Tidy data

>Happy families are all alike; every unhappy family is unhappy in its own way. 
>
>Leo Tolstoy

.pull-left[
**Characteristics of tidy data:**

1. Each variable is a column.
2. Each observation is a row.
3. Each type of observational unit forms a table.
]
--
.pull-right[
**Characteristics of untidy data:**

!@#$%^&*()
]
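
As a rough illustration of the three tidy characteristics, here is a minimal sketch that reshapes an untidy table into a tidy one. The table `bp_wide` and its column names are made up for this example, and `tidyr::pivot_longer()` is only one of several ways to do the reshaping.

```r
library(tidyr)
library(tibble)

# Hypothetical untidy table: one row per participant,
# one column per visit (the visit number is hidden in the column names)
bp_wide <- tribble(
  ~id, ~sbp_visit1, ~sbp_visit2,
    1,         120,         118,
    2,         141,         135
)

# Tidy version: one row per participant-visit, one column per variable
bp_long <- pivot_longer(
  bp_wide,
  cols = starts_with("sbp_"),
  names_to = "visit",
  names_prefix = "sbp_",
  values_to = "sbp"
)

bp_long
```

In the tidy version, `visit` is an ordinary column, so it can be filtered, grouped, or plotted like any other variable.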

---

# Pipes

---

## Where does the name come from?

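The pipe operator is implemented in the package **magrittr**, whose name is a nod to René Magritte's painting *The Treachery of Images* ("Ceci n'est pas une pipe").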

---

## Review: How does a pipe work?

- Think about the following sequence of actions: find keys, 
start car, drive to campus, park.

- Expressed as a set of nested functions in R:

```r
park(drive(start_car(find("keys")), to = "campus"))
```

- Writing it out using pipes gives a more natural structure:

```r
find("keys") %>%
  start_car() %>%
  drive(to = "campus") %>%
  park()
```
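
The car example is pseudocode, so here is a small runnable sketch of the same idea using base R functions. The vector `x` is arbitrary; the point is only that the nested and piped forms give the same result.

```r
library(magrittr)

x <- c(4, 9, 16)

# Nested: read from the inside out
nested <- round(mean(sqrt(x)), digits = 1)

# Piped: read from top to bottom, one step per line
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(digits = 1)

identical(nested, piped)
```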

---

## What about other arguments?

To send results to a function argument other than the first one, or to use the previous result for multiple arguments, use `.`:

```r
nhanes %>%
  filter(sex == "Female") %>%
* lm(bp_sys_mmhg ~ age, data = .)
```

```
## 
## Call:
## lm(formula = bp_sys_mmhg ~ age, data = .)
## 
## Coefficients:
## (Intercept)          age  
##     92.7765       0.6337
```
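
A self-contained sketch of both uses of `.` follows. The data frame `toy` is invented for illustration and is not the NHANES file.

```r
library(magrittr)

# Hypothetical toy data
toy <- data.frame(
  age = c(30, 40, 50, 60),
  sbp = c(110, 118, 126, 134)
)

# `.` marks where the left-hand side goes; here it fills `data`
fit <- toy %>% lm(sbp ~ age, data = .)
coef(fit)

# `.` can appear more than once. Wrapping the expression in braces
# stops the pipe from also inserting `toy` as the first argument.
mean_sbp <- toy %>% { sum(.$sbp) / nrow(.) }
mean_sbp
```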

---

# Data wrangling

---

## NHANES data

Pulled from the [NHANES website](https://wwwn.cdc.gov/nchs/nhanes/Default.aspx), filtered to 51,761 observations by you.

```r
glimpse(nhanes, width = 60)
```

```
## Rows: 51,761
## Columns: 14
## $ seqn            <dbl> 2, 5, 6, 7, 10, 12, 13, 14, 15,...
## $ exam            <fct> 1999, 1999, 1999, 1999, 1999, 1...
## $ age             <dbl> 77, 49, 19, 59, 43, 37, 70, 81,...
## $ sex             <fct> Male, Male, Female, Female, Mal...
## $ race_ethnicity  <fct> Non-Hispanic White, Non-Hispani...
## $ education       <fct> College graduate, College gradu...
## $ bp_sys_mmhg     <dbl> 100.6667, 122.0000, 114.6667, 1...
## $ bp_dia_mmhg     <dbl> 56.66667, 82.66667, 68.00000, 8...
## $ bp_controlled   <chr> "Yes", "Yes", "Yes", "Yes", "No...
## $ bp_high_aware   <dbl> 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0...
## $ bp_meds         <chr> "No", "Yes", "No", "No", "No", ...
## $ acr_mgg         <dbl> 6.275862, 3.546512, 4.032258, 5...
## $ chol_hdl_mgdl   <dbl> 54, 42, 61, 105, 51, 38, 49, 40...
## $ chol_total_mgdl <dbl> 215, 279, 153, 245, 140, 156, 3...
```

---
layout: true
background-image: url(img/hex-dplyr.png)
background-size: 12.5%
background-position: 95% 5%

## Data wrangling

---

`dplyr` is known as the grammar of data manipulation.

Single data frame functions / verbs:

- `select`, `rename`: select / rename specific columns by name
- `pull`: grab a column as a vector
- `filter`: pick rows matching criteria
- `slice`: pick rows using index(es)
- `arrange`: reorder rows
- `mutate`: add new variables
- `transmute`: create a new data frame containing only the newly computed variables
- `summarise`: reduce each variable to a single summary value
- `count`: special case of `summarise` that computes frequencies
- ... (many more)
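
To see several of these verbs working together, here is a minimal sketch on a five-row tibble. The tibble `toy` is made up for the example; it only borrows a few NHANES-style column names.

```r
library(dplyr)

# Hypothetical toy data with NHANES-like column names
toy <- tibble(
  seqn = 1:5,
  age = c(77, 49, 19, 59, 43),
  sex = c("Male", "Male", "Female", "Female", "Male"),
  bp_sys_mmhg = c(101, 122, 115, 125, 146)
)

women <- toy %>%
  filter(sex == "Female") %>%               # keep rows matching a condition
  select(seqn, age, sbp = bp_sys_mmhg) %>%  # keep (and rename) columns
  arrange(desc(age)) %>%                    # reorder rows
  mutate(age_decades = age / 10)            # add a new column

women

# pull() returns a plain vector instead of a data frame
pull(women, sbp)

# count() tallies rows per group
count(toy, sex)
```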

---

`dplyr` has rules:

1. First argument is _always_ a data frame

2. Subsequent arguments say what to do with that data frame

3. _Always_ return a data frame

4. Don't modify in place

5. Performance via lazy evaluation
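
Rule 4 is the one that most often trips people up, so here is a short sketch of what "don't modify in place" means in practice. The tibble `toy` is a made-up example.

```r
library(dplyr)

toy <- tibble(age = c(30, 40, 50))

# mutate() returns a new data frame; `toy` itself is untouched
mutate(toy, age_months = age * 12)
names(toy)

# To keep the change, assign the result to a name
toy2 <- mutate(toy, age_months = age * 12)
names(toy2)
```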

---

Let's make some conditional variables!

- `albuminuria`:
    + 'Yes' if ACR > 30 mg / g
    + 'No' otherwise.

- `bp_cat`:
    + 'Normotensive' if SBP < 130 and DBP < 80 mm Hg
    + 'Hypertension' if SBP is 130 to < 140 or DBP is 80 to < 90 mm Hg
    + 'Uncontrolled' if SBP is ≥ 140 or DBP is ≥ 90 mm Hg

---

`dplyr` provides two main functions for conditional execution:

- `if_else()` for variables with 2 categories

- `case_when()` for variables with >2 categories
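
Before applying these to real data, it may help to see how each one treats missing values. The sketch below uses an arbitrary numeric vector `x`; the labels are placeholders.

```r
library(dplyr)

x <- c(5, 50, NA)

# if_else(): one condition, two outcomes; an NA condition stays NA
two_way <- if_else(x > 30, "high", "low")
two_way

# case_when(): conditions are tried in order and the first TRUE wins;
# values that match no condition become NA
multi_way <- case_when(
  x < 10   ~ "low",
  x < 100  ~ "medium",
  x >= 100 ~ "high"
)
multi_way
```

Because `case_when()` stops at the first match, the order of the conditions is part of the logic: 5 is also below 100, but it is labelled "low" because that row comes first.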

---

- `albuminuria`:
    + 'Yes' if ACR > 30 mg / g
    + 'No' otherwise.

```r
nhanes <- nhanes %>% 
  mutate(
    albuminuria = if_else(
      condition = acr_mgg > 30,
      true = 'Yes', 
      false = 'No'
    )
  )
```

---

__Check your work!__

Make it a habit to check each data processing step you complete.

- Yes, this will slow you down in the short term

- Yes, it is very much worth it.

```r
nhanes %$% table(albuminuria, acr_mgg > 30)
```

```
##            
## albuminuria FALSE  TRUE
##         No  44460     0
##         Yes     0  6263
```

```r
# %$% is the magrittr exposition pipe (requires library(magrittr));
# same thing as table(nhanes$albuminuria, nhanes$acr_mgg > 30)
```
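
One alternative way to run the same check, sketched on made-up data, is `dplyr::count()`. Unlike `table()` with its default settings, it keeps missing values as their own row, which makes it easier to spot rows the rule did not classify. The tibble `toy` is hypothetical.

```r
library(dplyr)

toy <- tibble(acr_mgg = c(6.3, 45.0, 30.0, NA)) %>%
  mutate(albuminuria = if_else(acr_mgg > 30, "Yes", "No"))

# Cross-tabulate the new variable against the rule that defined it
check <- count(toy, albuminuria, above_30 = acr_mgg > 30)
check
```

Every 'Yes' should line up with `TRUE` and every 'No' with `FALSE`; any other combination points to a coding mistake.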

---

- `bp_cat`:
    + 'Normotensive' if SBP < 130 and DBP < 80 mm Hg
    + 'Hypertension' if SBP is 130 to < 140 or DBP is 80 to < 90 mm Hg
    + 'Uncontrolled' if SBP is ≥ 140 or DBP is ≥ 90 mm Hg

```r
nhanes <- nhanes %>% 
  mutate(
    bp_cat = case_when(
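      # conditions are checked in order, so the second row only sees
      # values that failed the first (SBP >= 130 or DBP >= 80)
      bp_sys_mmhg  < 130 & bp_dia_mmhg  < 80 ~ "Normotensive",
      bp_sys_mmhg  < 140 & bp_dia_mmhg  < 90 ~ "Hypertension",
      bp_sys_mmhg >= 140 | bp_dia_mmhg >= 90 ~ "Uncontrolled",
      TRUE ~ NA_character_ # added for clarity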
    )
  )
```

---

__Check your work!__

```r
ggplot(nhanes) + 
  aes(x = bp_sys_mmhg, y = bp_dia_mmhg, col = bp_cat) + 
  geom_point()
```

![](index_files/figure-html/unnamed-chunk-9-1.png)
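
A plot is one way to check; a numeric check is another. The sketch below, on invented blood pressure values, summarises the range of SBP and DBP within each category so the cut-points can be read off directly.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  bp_sys_mmhg = c(118, 135, 128, 150, 120),
  bp_dia_mmhg = c( 75,  70,  85,  80,  95)
) %>%
  mutate(
    bp_cat = case_when(
      bp_sys_mmhg  < 130 & bp_dia_mmhg  < 80 ~ "Normotensive",
      bp_sys_mmhg  < 140 & bp_dia_mmhg  < 90 ~ "Hypertension",
      bp_sys_mmhg >= 140 | bp_dia_mmhg >= 90 ~ "Uncontrolled"
    )
  )

# Range of each measurement within each category
ranges <- toy %>%
  group_by(bp_cat) %>%
  summarise(
    n = n(),
    min_sbp = min(bp_sys_mmhg), max_sbp = max(bp_sys_mmhg),
    min_dbp = min(bp_dia_mmhg), max_dbp = max(bp_dia_mmhg)
  )

ranges
```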

---

Use `summarize()` to, well, summarize your data

The result is itself a data frame, with one column per summary:

```r
nhanes %>%
  summarise(
    mean_sbp = mean(bp_sys_mmhg),
    mean_dbp = mean(bp_dia_mmhg),
    prevalence_alb = mean(albuminuria == 'Yes', na.rm = TRUE)
  )
```

```
## # A tibble: 1 x 3
##   mean_sbp mean_dbp prevalence_alb
##      <dbl>    <dbl>          <dbl>
## 1     124.     70.5          0.123
```

---

`group_by() %>% summarize()` summarizes each group:

```r
nhanes %>%
  group_by(exam) %>% 
  summarise(mean_sbp = mean(bp_sys_mmhg),
    prevalence_alb = mean(albuminuria == 'Yes', na.rm = TRUE))
```

```
## # A tibble: 10 x 3
##    exam  mean_sbp prevalence_alb
##    <fct>    <dbl>          <dbl>
##  1 1999      127.          0.134
##  2 2001      126.          0.119
##  3 2003      125.          0.114
##  4 2005      124.          0.121
##  5 2007      124.          0.139
##  6 2009      122.          0.107
##  7 2011      123.          0.126
##  8 2013      123.          0.118
##  9 2015      125.          0.125
## 10 2017      126.          0.134
```
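
When some values are missing, it can be worth reporting group sizes and missing counts next to the means. A minimal sketch on made-up data, assuming the same `exam` and `bp_sys_mmhg` column names:

```r
library(dplyr)

# Hypothetical toy data with one missing SBP value
toy <- tibble(
  exam = c("1999", "1999", "2001", "2001"),
  bp_sys_mmhg = c(120, NA, 130, 140)
)

by_exam <- toy %>%
  group_by(exam) %>%
  summarise(
    n = n(),                                    # rows per group
    n_missing = sum(is.na(bp_sys_mmhg)),        # missing SBP values
    mean_sbp = mean(bp_sys_mmhg, na.rm = TRUE)  # mean of non-missing values
  )

by_exam
```

Without `na.rm = TRUE`, the mean for the 1999 group would be `NA`.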

---
class: center, middle
layout: false

# Wrangling categorical data

---
layout: true
background-image: url(img/hex-forcats.png)
background-size: 12.5%
background-position: 95% 5%

## Factors

---

- Factors are used to work with categorical variables.

- Categorical variables have a fixed, known, and finite set of possible values.

```r
fctr <- factor(
  x = c(1, 2, 2, 3),
  levels = c(1,2,3),
  labels = c("A", "B", "C")
)

fctr
```

```
## [1] A B B C
## Levels: A B C
```
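
To make "fixed and known set of values" concrete, this base R sketch inspects the same factor and shows what happens to a value outside the declared levels. The object `out_of_set` is a name chosen for this example.

```r
fctr <- factor(
  x = c(1, 2, 2, 3),
  levels = c(1, 2, 3),
  labels = c("A", "B", "C")
)

levels(fctr)      # the set of allowed values
as.integer(fctr)  # the integer codes stored underneath
table(fctr)       # counts per level, in level order

# A value that is not among the levels becomes NA
out_of_set <- factor(c("A", "D"), levels = c("A", "B", "C"))
out_of_set
```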

---

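Sometimes factors make you say, "I don't know about that"

.pull-left[

```r
x1 <- factor(c(1,2), 
  labels = c('a','b'))

x2 <- factor(3, labels = 'c')

# R < 4.1.0 drops the labels and returns the integer codes;
# R >= 4.1.0 returns a factor with levels a, b, c
c(x1, x2)
```

```
## [1] 1 2 1
```
]

.pull-right[
<img src="img/chappelle_skeptic.png" width="100%" style="display: block; margin: auto;" />
]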

- `forcats` makes factors easier to wrangle

```r
forcats::fct_c(x1, x2)
```

```
## [1] a b c
## Levels: a b c
```

---

Convert character/numeric vectors to factors if

- you want to impose an ordering that is not alphabetical.

```r
count(nhanes, bp_cat)
```

```
## # A tibble: 3 x 2
##   bp_cat           n
##   <chr>        <int>
## 1 Hypertension 10325
## 2 Normotensive 31075
## 3 Uncontrolled 10361
```

---

Convert character/numeric vectors to factors if

- you want to impose an ordering that is not alphabetical.

```r
nhanes <- nhanes %>% 
  mutate(
    bp_cat = factor(
      x = bp_cat, 
      levels = c('Normotensive', 'Hypertension', 'Uncontrolled')
    )
  )

count(nhanes, bp_cat)
```

```
## # A tibble: 3 x 2
##   bp_cat           n
##   <fct>        <int>
## 1 Normotensive 31075
## 2 Hypertension 10325
## 3 Uncontrolled 10361
```

---

Convert character/numeric vectors to factors if

- you have a numeric variable that should be a categorical one

```r
count(nhanes, bp_high_aware)
```

```
## # A tibble: 2 x 2
##   bp_high_aware     n
##           <dbl> <int>
## 1             0 34514
## 2             1 17247
```

---

Convert character/numeric vectors to factors if

- you have a numeric variable that should be a categorical one

```r
nhanes <- nhanes %>% 
  mutate(
    bp_high_aware = factor(
      x = bp_high_aware, 
      levels = c(0, 1),
      labels = c("No", "Yes")
    )
  )

count(nhanes, bp_high_aware)
```

```
## # A tibble: 2 x 2
##   bp_high_aware     n
##   <fct>         <int>
## 1 No            34514
## 2 Yes           17247
```

---

Relevel factors (change their order) with `forcats`:

```r
library(forcats)

nhanes %>% 
  mutate(
    bp_cat = fct_relevel(
      bp_cat, 'Uncontrolled', 'Hypertension'
    )
  ) %>% 
  count(bp_cat)
```

```
## # A tibble: 3 x 2
##   bp_cat           n
##   <fct>        <int>
## 1 Uncontrolled 10361
## 2 Hypertension 10325
## 3 Normotensive 31075
```
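
The same function on a tiny made-up factor, as a sketch of the two most common ways to call it: naming levels to move to the front, and using `after` to move a level somewhere else.

```r
library(forcats)

f <- factor(c("low", "high", "mid", "low"))
levels(f)  # alphabetical by default

# Named levels move to the front; the rest keep their relative order
front <- fct_relevel(f, "mid")
levels(front)

# `after` places the level at a given position instead (Inf = last)
last <- fct_relevel(f, "high", after = Inf)
levels(last)
```

Releveling changes only the order of the levels, never the data values themselves.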

---

Collapse factors (lump categories) with `forcats`:

```r
nhanes %>% 
  mutate(
    bp_cat = fct_collapse(
      bp_cat, 
      "Hypertensive" = c("Hypertension", "Uncontrolled")
    )
  ) %>% 
  count(bp_cat)
```

```
## # A tibble: 2 x 2
##   bp_cat           n
##   <fct>        <int>
## 1 Normotensive 31075
## 2 Hypertensive 20686
```
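
A self-contained sketch of the same collapse on a four-element factor, so the effect on both the levels and the counts is easy to verify by eye. The factor `f` is invented for illustration.

```r
library(forcats)

f <- factor(
  c("Normotensive", "Hypertension", "Uncontrolled", "Normotensive"),
  levels = c("Normotensive", "Hypertension", "Uncontrolled")
)

# New level name on the left, old levels to merge on the right
collapsed <- fct_collapse(
  f,
  Hypertensive = c("Hypertension", "Uncontrolled")
)

levels(collapsed)
table(collapsed)
```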

---

Make missing values an explicit factor level:

```r
count(nhanes, education)
```

```
## # A tibble: 4 x 2
##   education                    n
##   <fct>                    <int>
## 1 Less than high school    14389
## 2 High school/some college 26076
## 3 College graduate         10344
## 4 <NA>                       952
```

---

Make missing values an explicit factor level:

```r
nhanes <- nhanes %>% 
  mutate(
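    # note: forcats 1.0.0 deprecates fct_explicit_na() in favour of
    # fct_na_value_to_level(education, level = 'Missing')
    education = fct_explicit_na(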
      f = education,
      na_level = 'Missing' 
    )
  )

count(nhanes, education)
```

```
## # A tibble: 4 x 2
##   education                    n
##   <fct>                    <int>
## 1 Less than high school    14389
## 2 High school/some college 26076
## 3 College graduate         10344
## 4 Missing                    952
```
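
Since `fct_explicit_na()` is deprecated as of forcats 1.0.0, here is a hedged sketch that picks whichever function the installed version provides. The factor `f` is a made-up example, and the version check is just one way to keep the code working across versions.

```r
library(forcats)

f <- factor(c("a", NA, "b"))

# Use the newer function when available, the older one otherwise
f2 <- if (packageVersion("forcats") >= "1.0.0") {
  fct_na_value_to_level(f, level = "Missing")
} else {
  fct_explicit_na(f, na_level = "Missing")
}

levels(f2)
table(f2)
```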

---

Recode factor levels manually:

```r
nhanes %>% 
  mutate(
    education = fct_recode(
      education,
      # new level = old level
      'less_than_hs' = 'Less than high school',
      'hs_some_college' = 'High school/some college',
      'college_grad' = 'College graduate'        
    )
  ) %>% 
  count(education)
```

```
## # A tibble: 4 x 2
##   education           n
##   <fct>           <int>
## 1 less_than_hs    14389
## 2 hs_some_college 26076
## 3 college_grad    10344
## 4 Missing           952
```
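
As the 'Missing' row in the output suggests, levels that are not mentioned in `fct_recode()` are left alone. A minimal sketch on an invented factor:

```r
library(forcats)

f <- factor(
  c("Yes", "No", "Missing"),
  levels = c("No", "Yes", "Missing")
)

# new level = old level; unmentioned levels are kept unchanged
recoded <- fct_recode(f, yes = "Yes", no = "No")

levels(recoded)
```

The level order is preserved as well, so recoding does not undo earlier releveling.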

---
layout: false

## Learning more

- Data transformation and `forcats` cheatsheets are available on [RStudio Cloud](https://rstudio.cloud/learn/cheat-sheets)

- Package websites:
    + dplyr: https://dplyr.tidyverse.org/index.html
    + forcats: https://forcats.tidyverse.org/