class: center, middle, inverse, title-slide

# Transforming data with dplyr
## Introduction to tidy principles
### Byron C. Jaeger
### Last updated: 2020-04-22

---
class: inverse, center, middle

# Tidy data

---

## Tidy data

>Happy families are all alike; every unhappy family is unhappy in its own way.
>
>Leo Tolstoy

--

.pull-left[
**Characteristics of tidy data:**

1. Each variable is a column.
2. Each observation is a row.
3. Each type of observational unit forms a table.
]

--

.pull-right[
**Characteristics of untidy data:**

!@#$%^&*()
]

---
class: center, middle

# Pipes

---

## Where does the name come from?

The pipe operator is implemented in the package **magrittr**.

.pull-left[

]

.pull-right[

]

---

## Review: How does a pipe work?

- You can think about the following sequence of actions - find key, unlock car, start car, drive to school, park.

- Expressed as a set of nested functions in R:

```r
park(drive(start_car(find("keys")), to = "campus"))
```

- Writing it out using pipes gives a more natural structure:

```r
find("keys") %>%
  start_car() %>%
  drive(to = "campus") %>%
  park()
```

---

## What about other arguments?

To send results to a function argument other than the first one, or to use the previous result for multiple arguments, use `.`:

```r
nhanes %>%
  filter(sex == "Female") %>%
*  lm(bp_sys_mmhg ~ age, data = .)
```

```
## 
## Call:
## lm(formula = bp_sys_mmhg ~ age, data = .)
## 
## Coefficients:
## (Intercept)          age  
##     92.7765       0.6337
```

---
class: center, middle

# Data wrangling

---

## NHANES data

Pulled from [NHANES website](https://wwwn.cdc.gov/nchs/nhanes/Default.aspx), filtered to 51761 observations by you.

```r
glimpse(nhanes, width = 60)
```

```
## Rows: 51,761
## Columns: 14
## $ seqn            <dbl> 2, 5, 6, 7, 10, 12, 13, 14, 15,...
## $ exam            <fct> 1999, 1999, 1999, 1999, 1999, 1...
## $ age             <dbl> 77, 49, 19, 59, 43, 37, 70, 81,...
## $ sex             <fct> Male, Male, Female, Female, Mal...
## $ race_ethnicity  <fct> Non-Hispanic White, Non-Hispani...
## $ education       <fct> College graduate, College gradu...
## $ bp_sys_mmhg     <dbl> 100.6667, 122.0000, 114.6667, 1...
## $ bp_dia_mmhg     <dbl> 56.66667, 82.66667, 68.00000, 8...
## $ bp_controlled   <chr> "Yes", "Yes", "Yes", "Yes", "No...
## $ bp_high_aware   <dbl> 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0...
## $ bp_meds         <chr> "No", "Yes", "No", "No", "No", ...
## $ acr_mgg         <dbl> 6.275862, 3.546512, 4.032258, 5...
## $ chol_hdl_mgdl   <dbl> 54, 42, 61, 105, 51, 38, 49, 40...
## $ chol_total_mgdl <dbl> 215, 279, 153, 245, 140, 156, 3...
```

---
layout: true
background-image: url(img/hex-dplyr.png)
background-size: 12.5%
background-position: 95% 5%

## Data wrangling

---

`dplyr` is known as the grammar of data manipulation.

Single data frame functions / verbs:

- `select`, `rename`: select / rename specific columns by name
- `pull`: grab a column as a vector
- `filter`: pick rows matching criteria
- `slice`: pick rows using index(es)
- `arrange`: reorder rows
- `mutate`: add new variables
- `transmute`: create new data frame with variables
- `summarise`: reduce variables to values
- `count`: special case of `summarise` that computes frequencies.
- ... (many more)

---

`dplyr` has rules:

1. First argument is _always_ a data frame
2. Subsequent arguments say what to do with that data frame
3. _Always_ return a data frame
4. Don't modify in place
5. Performance via lazy evaluation

---

Let's make some conditional variables!

- `albuminuria`:
  + 'Yes' if ACR > 30 mg / g
  + 'No' otherwise.

- `bp_cat`:
  + 'Normotensive' if SBP < 130 and DBP < 80 mm Hg
  + 'Hypertension' if SBP is 130 to < 140 or DBP is 80 to < 90 mm Hg
  + 'Uncontrolled' if SBP is ≥ 140 or DBP is ≥ 90 mm Hg

---

`dplyr` provides two main functions for conditional execution:

- `if_else()` for variables with 2 categories
- `case_when()` for variables with >2 categories

---

- `albuminuria`:
  + 'Yes' if ACR > 30 mg / g
  + 'No' otherwise.

```r
nhanes <- nhanes %>%
  mutate(
    albuminuria = if_else(
      condition = acr_mgg > 30,
      true = 'Yes',
      false = 'No'
    )
  )
```

---

__Check your work!__ Make it a habit to check each data processing step you complete.

- Yes, this will slow you down in the short term
- Yes, it is very much worth it.

```r
nhanes %$% table(albuminuria, acr_mgg > 30)
```

```
##            
## albuminuria FALSE  TRUE
##         No  44460     0
##         Yes     0  6263
```

```r
# same thing as
table(nhanes$albuminuria, nhanes$acr_mgg > 30)
```

---

- `bp_cat`:
  + 'Normotensive' if SBP < 130 and DBP < 80 mm Hg
  + 'Hypertension' if SBP is 130 to < 140 or DBP is 80 to < 90 mm Hg
  + 'Uncontrolled' if SBP is ≥ 140 or DBP is ≥ 90 mm Hg

```r
nhanes <- nhanes %>%
  mutate(
    bp_cat = case_when(
      bp_sys_mmhg < 130 & bp_dia_mmhg < 80 ~ "Normotensive",
      bp_sys_mmhg < 140 & bp_dia_mmhg < 90 ~ "Hypertension",
      bp_sys_mmhg >= 140 | bp_dia_mmhg >= 90 ~ "Uncontrolled",
      TRUE ~ NA_character_ # added for clarity
    )
  )
```

---

__Check your work!__

```r
ggplot(nhanes) +
  aes(x = bp_sys_mmhg, y = bp_dia_mmhg, col = bp_cat) +
  geom_point()
```

<!-- -->

---

Use `summarize()` to, well, summarize your data.

The values are summarised in a data frame:

```r
nhanes %>%
  summarise(
    mean_sbp = mean(bp_sys_mmhg),
    mean_dbp = mean(bp_dia_mmhg),
    prevalence_alb = mean(albuminuria == 'Yes', na.rm = TRUE)
  )
```

```
## # A tibble: 1 x 3
##   mean_sbp mean_dbp prevalence_alb
##      <dbl>    <dbl>          <dbl>
## 1     124.     70.5          0.123
```

---

`group_by() %>% summarize()` summarizes each group:

```r
nhanes %>%
  group_by(exam) %>%
  summarise(mean_sbp = mean(bp_sys_mmhg),
    prevalence_alb = mean(albuminuria == 'Yes', na.rm = TRUE))
```

```
## # A tibble: 10 x 3
##    exam  mean_sbp prevalence_alb
##    <fct>    <dbl>          <dbl>
##  1 1999      127.          0.134
##  2 2001      126.          0.119
##  3 2003      125.          0.114
##  4 2005      124.          0.121
##  5 2007      124.          0.139
##  6 2009      122.          0.107
##  7 2011      123.          0.126
##  8 2013      123.          0.118
##  9 2015      125.          0.125
## 10 2017      126.          0.134
```

---
class: center, middle
layout: false

# Wrangling categorical data

---
layout: true
background-image: url(img/hex-forcats.png)
background-size: 12.5%
background-position: 95% 5%

## Factors

---

- factors are used to work with categorical variables

- categorical variables have a fixed and known set of finite values.

```r
fctr <- factor(
  x = c(1, 2, 2, 3),
  levels = c(1,2,3),
  labels = c("A", "B", "C")
)

fctr
```

```
## [1] A B B C
## Levels: A B C
```

---

Sometimes factors make you say, "I don't know about that"

.pull-left[

```r
x1 <- factor(c(1,2), labels = c('a','b'))
x2 <- factor(3, labels = 'c')
c(x1, x2)
```

```
## [1] 1 2 1
```

]

--

.pull-right[

<img src="img/chappelle_skeptic.png" width="100%" style="display: block; margin: auto;" />

]

- `forcats` makes factors easier to wrangle

```r
forcats::fct_c(x1, x2)
```

```
## [1] a b c
## Levels: a b c
```

---

Convert character/numeric vectors to factors if

- you want to impose an ordering that is not alphabetical.

```r
count(nhanes, bp_cat)
```

```
## # A tibble: 3 x 2
##   bp_cat           n
##   <chr>        <int>
## 1 Hypertension 10325
## 2 Normotensive 31075
## 3 Uncontrolled 10361
```

---

Convert character/numeric vectors to factors if

- you want to impose an ordering that is not alphabetical.

```r
nhanes <- nhanes %>%
  mutate(
    bp_cat = factor(
      x = bp_cat,
      levels = c('Normotensive', 'Hypertension', 'Uncontrolled')
    )
  )

count(nhanes, bp_cat)
```

```
## # A tibble: 3 x 2
##   bp_cat           n
##   <fct>        <int>
## 1 Normotensive 31075
## 2 Hypertension 10325
## 3 Uncontrolled 10361
```

---

Convert character/numeric vectors to factors if

- you have a numeric variable that should be a categorical one

```r
count(nhanes, bp_high_aware)
```

```
## # A tibble: 2 x 2
##   bp_high_aware     n
##           <dbl> <int>
## 1             0 34514
## 2             1 17247
```

---

Convert character/numeric vectors to factors if

- you have a numeric variable that should be a categorical one

```r
nhanes <- nhanes %>%
  mutate(
    bp_high_aware = factor(
      x = bp_high_aware,
      levels = c(0, 1),
      labels = c("No", "Yes")
    )
  )

count(nhanes, bp_high_aware)
```

```
## # A tibble: 2 x 2
##   bp_high_aware     n
##   <fct>         <int>
## 1 No            34514
## 2 Yes           17247
```

---

Relevel factors (change their order) with `forcats`:

```r
library(forcats)

nhanes %>%
  mutate(
    bp_cat = fct_relevel(
      bp_cat, 'Uncontrolled', 'Hypertension'
    )
  ) %>%
  count(bp_cat)
```

```
## # A tibble: 3 x 2
##   bp_cat           n
##   <fct>        <int>
## 1 Uncontrolled 10361
## 2 Hypertension 10325
## 3 Normotensive 31075
```

---

Collapse factors (lump categories) with `forcats`:

```r
nhanes %>%
  mutate(
    bp_cat = fct_collapse(
      bp_cat,
      "Hypertensive" = c("Hypertension", "Uncontrolled")
    )
  ) %>%
  count(bp_cat)
```

```
## # A tibble: 2 x 2
##   bp_cat           n
##   <fct>        <int>
## 1 Normotensive 31075
## 2 Hypertensive 20686
```

---

Explicitly set missing factor values to their own category

```r
count(nhanes, education)
```

```
## # A tibble: 4 x 2
##   education                    n
##   <fct>                    <int>
## 1 Less than high school    14389
## 2 High school/some college 26076
## 3 College graduate         10344
## 4 <NA>                       952
```

---

Explicitly set missing factor values to their own category

```r
nhanes <- nhanes %>%
  mutate(
    education = fct_explicit_na(
      f = education,
      na_level = 'Missing'
    )
  )

count(nhanes, education)
```

```
## # A tibble: 4 x 2
##   education                    n
##   <fct>                    <int>
## 1 Less than high school    14389
## 2 High school/some college 26076
## 3 College graduate         10344
## 4 Missing                    952
```

---

Recode factor levels manually:

```r
nhanes %>%
  mutate(
    education = fct_recode(
      education,
      # new level = old level
      'less_than_hs' = 'Less than high school',
      'hs_some_college' = 'High school/some college',
      'college_grad' = 'College graduate'
    )
  ) %>%
  count(education)
```

```
## # A tibble: 4 x 2
##   education           n
##   <fct>           <int>
## 1 less_than_hs    14389
## 2 hs_some_college 26076
## 3 college_grad    10344
## 4 Missing           952
```

---
layout: false

## Learning more

- Data transformation and `forcats` cheatsheets available on [RStudio Cloud](https://rstudio.cloud/learn/cheat-sheets)

- Package websites:
  + dplyr: https://dplyr.tidyverse.org/index.html
  + forcats: https://forcats.tidyverse.org/
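
---

## Extra: why the order of `case_when()` conditions matters

The `bp_cat` code earlier in the deck works because `case_when()` assigns each row to the first condition that evaluates to `TRUE`. The sketch below illustrates this on a small made-up data frame. `bp_toy` and its five readings are hypothetical and are not taken from NHANES; only the column names and the conditions mirror the slides.

```r
library(dplyr)

# Hypothetical toy data (not NHANES): five made-up blood pressure readings
bp_toy <- tibble(
  bp_sys_mmhg = c(118, 135, 125, 150, NA),
  bp_dia_mmhg = c( 75,  70,  85,  85, 70)
)

bp_toy <- bp_toy %>%
  mutate(
    bp_cat = case_when(
      # checked first: both SBP and DBP are in the normal range
      bp_sys_mmhg <  130 & bp_dia_mmhg <  80 ~ "Normotensive",
      # only reached by rows that failed the first condition, so at
      # least one of SBP >= 130 or DBP >= 80 must hold for these rows
      bp_sys_mmhg <  140 & bp_dia_mmhg <  90 ~ "Hypertension",
      bp_sys_mmhg >= 140 | bp_dia_mmhg >= 90 ~ "Uncontrolled",
      # no condition could be evaluated to TRUE (here: missing SBP)
      TRUE ~ NA_character_
    )
  )

# Check your work, as in the earlier slides
count(bp_toy, bp_cat)
```

Swapping the first two conditions would label the first row 'Hypertension' instead of 'Normotensive', since 118 < 140 and 75 < 90. The row with a missing SBP falls through to the final `TRUE ~ NA_character_` line because none of its comparisons evaluates to `TRUE`.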