class: center, middle, inverse, title-slide # Reshaping data ## spread, gather, and pivot ### Byron C. Jaeger ### Last updated: 2020-04-11 --- class: inverse, center, middle # What is wide/long data? --- ## in a nutshell <img src="img/tidyr-spread-gather.gif" style="display: block; margin: auto;" /> --- class: inverse, center, middle # How do we go from wide to long? --- background-image: url(img/tidyr_spread_gather.png) background-size: 65% background-position: 95% 50% ## With tidyr! An R package for - reshaping - tidying - nesting your data. <br/> Part of the `tidyverse` --- layout:true background-image: url(img/tidyr.png) background-size: 12.5% background-position: 97% 3% --- ## History At first, `tidyr` had functions called `spread` and `gather` - The names were confusing, - the argument names were hard to remember, - the functions supported a limited set of transformations from wide to long, and vice versa. In a later update, `pivot_wider` and `pivot_longer` were introduced as more intuitive and powerful versions of `spread` and `gather`. I have found `pivot_wider` and `pivot_longer` to be easier to use than `spread` and `gather`, so I am only teaching the `pivot` functions. However, you are welcome to use `spread` and `gather` in your assignments. --- ## pivot to longer data We'll start with the long covid-19 data: ```r cv19 ``` ``` ## # A tibble: 2,161 x 4 ## date state cases deaths ## <date> <chr> <dbl> <dbl> ## 1 2020-01-21 Washington 1 0 ## 2 2020-01-22 Washington 1 0 ## 3 2020-01-23 Washington 1 0 ## 4 2020-01-24 Illinois 1 0 ## 5 2020-01-24 Washington 1 0 ## 6 2020-01-25 California 1 0 ## 7 2020-01-25 Illinois 1 0 ## 8 2020-01-25 Washington 1 0 ## 9 2020-01-26 Arizona 1 0 ## 10 2020-01-26 California 2 0 ## # ... with 2,151 more rows ``` --- ## pivot to longer data Then pivot to a longer form with - cases and death values in the `value` column - cases and death indicators in the `name` column. ```r cv19_long <- cv19 %>% * pivot_longer(cols = c(cases, deaths)) cv19_long[1:5, ] ``` ``` ## # A tibble: 5 x 4 ## date state name value ## <date> <chr> <chr> <dbl> ## 1 2020-01-21 Washington cases 1 ## 2 2020-01-21 Washington deaths 0 ## 3 2020-01-22 Washington cases 1 ## 4 2020-01-22 Washington deaths 0 ## 5 2020-01-23 Washington cases 1 ``` --- ## How is this helpful? Longer data are ideal for `ggplot` ```r ggplot(cv19_long, aes(x=date, y = value)) + geom_line(aes(group = state)) + facet_wrap(~ name, scales = 'free') ``` <!-- --> --- ## Pivot to wider data We'll start with the `nhanes_long` data ```r nhanes_long ``` ``` ## # A tibble: 30 x 4 ## # Groups: exam, sex [6] ## exam sex race_ethnicity bp_sys_mmhg ## <fct> <fct> <fct> <dbl> ## 1 2013 Female Non-Hispanic White 121. ## 2 2013 Female Non-Hispanic Black 125. ## 3 2013 Female Non-Hispanic Asian 118. ## 4 2013 Female Hispanic 119. ## 5 2013 Female Other race 119. ## 6 2013 Male Non-Hispanic White 123. ## 7 2013 Male Non-Hispanic Black 129. ## 8 2013 Male Non-Hispanic Asian 123. ## 9 2013 Male Hispanic 124. ## 10 2013 Male Other race 123. ## # ... with 20 more rows ``` --- ## Pivot to wider data Then give the systolic BP values a column for each exam ```r nhanes_wide <- nhanes_long %>% * pivot_wider(values_from = bp_sys_mmhg, * names_from = exam, names_prefix = 'exam_') nhanes_wide ``` ``` ## # A tibble: 10 x 5 ## # Groups: sex [2] ## sex race_ethnicity exam_2013 exam_2015 exam_2017 ## <fct> <fct> <dbl> <dbl> <dbl> ## 1 Female Non-Hispanic White 121. 124. 125. ## 2 Female Non-Hispanic Black 125. 126. 131. ## 3 Female Non-Hispanic Asian 118. 119. 123. ## 4 Female Hispanic 119. 123. 123. ## 5 Female Other race 119. 122. 123. ## 6 Male Non-Hispanic White 123. 126. 127. ## 7 Male Non-Hispanic Black 129. 131. 132. ## 8 Male Non-Hispanic Asian 123. 123. 124. ## 9 Male Hispanic 124. 127. 126. ## 10 Male Other race 123. 125. 126. ``` --- ## How is this helpful? Wider data are ideal for `gt`. ```r year_cols <- paste("exam", c('2013', '2015','2017'), sep = '_') # looks like these data are not wide enough! gt(nhanes_wide) %>% fmt_number(columns = year_cols, decimals = 1) %>% cols_align(columns = 'race_ethnicity', align = 'left') ```
race_ethnicity
exam_2013
exam_2015
exam_2017
Female
Non-Hispanic White
121.2
123.6
124.6
Non-Hispanic Black
125.2
126.4
131.3
Non-Hispanic Asian
118.0
119.2
123.5
Hispanic
119.4
123.2
122.9
Other race
118.8
121.7
122.6
Male
Non-Hispanic White
123.4
125.9
127.0
Non-Hispanic Black
128.5
130.5
131.6
Non-Hispanic Asian
122.7
122.7
124.4
Hispanic
124.2
126.7
126.0
Other race
123.0
125.0
125.7
--- ## More wider What if we need to draw names from two columns? ```r nhanes_wider <- nhanes_long %>% pivot_wider(names_from = c(sex, exam), values_from = bp_sys_mmhg) nhanes_wider ``` ``` ## # A tibble: 5 x 7 ## race_ethnicity Female_2013 Male_2013 Female_2015 Male_2015 Female_2017 Male_2017 ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Non-Hispanic Wh~ 121. 123. 124. 126. 125. 127. ## 2 Non-Hispanic Bl~ 125. 129. 126. 131. 131. 132. ## 3 Non-Hispanic As~ 118. 123. 119. 123. 123. 124. ## 4 Hispanic 119. 124. 123. 127. 123. 126. ## 5 Other race 119. 123. 122. 125. 123. 126. ``` --- ## More wider I need to re-order the columns for `gt` ```r nhanes_wider <- nhanes_wider %>% select( race_ethnicity, starts_with("Female"), starts_with("Male") ) nhanes_wider ``` ``` ## # A tibble: 5 x 7 ## race_ethnicity Female_2013 Female_2015 Female_2017 Male_2013 Male_2015 Male_2017 ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Non-Hispanic Wh~ 121. 124. 125. 123. 126. 127. ## 2 Non-Hispanic Bl~ 125. 126. 131. 129. 131. 132. ## 3 Non-Hispanic As~ 118. 119. 123. 123. 123. 124. ## 4 Hispanic 119. 123. 123. 124. 127. 126. ## 5 Other race 119. 122. 123. 123. 125. 126. ``` --- ## More wider Often you need to access groups of columns in `gt` tables. - I recommend doing this with a character vector: ```r cols <- expand.grid(c('Female', 'Male'), c('2013', '2015', '2017')) %>% glue::glue_data("{Var1}_{Var2}") %>% as.character() cols[1:3] ``` ``` ## [1] "Female_2013" "Male_2013" "Female_2015" ``` ```r cols[4:6] ``` ``` ## [1] "Male_2015" "Female_2017" "Male_2017" ``` --- ## Wider data, cleaner table? Oh no, is this too wide?? ```r *gt(nhanes_wider, rowname_col = 'race_ethnicity') ```
Female_2013
Female_2015
Female_2017
Male_2013
Male_2015
Male_2017
Non-Hispanic White
121.1983
123.5570
124.5606
123.4038
125.9283
126.9904
Non-Hispanic Black
125.2148
126.3657
131.2813
128.5331
130.5202
131.5769
Non-Hispanic Asian
118.0030
119.1709
123.4821
122.7009
122.7175
124.4053
Hispanic
119.4327
123.1708
122.9338
124.2335
126.6675
126.0108
Other race
118.7540
121.6912
122.5813
123.0392
125.0150
125.7456
--- ## Wider data, cleaner table? - reducing the length numbers helps, but - also need to reduce the redundant column names ```r gt(nhanes_wider, rowname_col = 'race_ethnicity') %>% * fmt_number(columns = cols, decimals = 1) %>% ```
Female_2013
Female_2015
Female_2017
Male_2013
Male_2015
Male_2017
Non-Hispanic White
121.2
123.6
124.6
123.4
125.9
127.0
Non-Hispanic Black
125.2
126.4
131.3
128.5
130.5
131.6
Non-Hispanic Asian
118.0
119.2
123.5
122.7
122.7
124.4
Hispanic
119.4
123.2
122.9
124.2
126.7
126.0
Other race
118.8
121.7
122.6
123.0
125.0
125.7
--- ## Wider data, cleaner table! Convert column names into spanner labels! (this is why I needed to re-order my column names) ```r gt(nhanes_wider, rowname_col = 'race_ethnicity') %>% fmt_number(columns = cols, decimals = 1) %>% * tab_spanner_delim(columns = cols, delim = '_') ```
Female
Male
2013
2015
2017
2013
2015
2017
Non-Hispanic White
121.2
123.6
124.6
123.4
125.9
127.0
Non-Hispanic Black
125.2
126.4
131.3
128.5
130.5
131.6
Non-Hispanic Asian
118.0
119.2
123.5
122.7
122.7
124.4
Hispanic
119.4
123.2
122.9
124.2
126.7
126.0
Other race
118.8
121.7
122.6
123.0
125.0
125.7
--- ## Aside Always remember to label the stub: ```r gt(nhanes_wider, rowname_col = 'race_ethnicity') %>% fmt_number(columns = cols, decimals = 1) %>% tab_spanner_delim(columns = cols, delim = '_') %>% * tab_stubhead("Race") ```
Race
Female
Male
2013
2015
2017
2013
2015
2017
Non-Hispanic White
121.2
123.6
124.6
123.4
125.9
127.0
Non-Hispanic Black
125.2
126.4
131.3
128.5
130.5
131.6
Non-Hispanic Asian
118.0
119.2
123.5
122.7
122.7
124.4
Hispanic
119.4
123.2
122.9
124.2
126.7
126.0
Other race
118.8
121.7
122.6
123.0
125.0
125.7
--- ## Pivot longer with 2+ names Columns are named `[sex value]_[exam value]` - so tell `pivot_longer` that `names_sep = "_"` ```r pivot_longer(nhanes_wider, cols = -race_ethnicity, names_to = c('sex', 'exam'), names_sep = '_') ``` ``` ## # A tibble: 30 x 4 ## race_ethnicity sex exam value ## <fct> <chr> <chr> <dbl> ## 1 Non-Hispanic White Female 2013 121. ## 2 Non-Hispanic White Female 2015 124. ## 3 Non-Hispanic White Female 2017 125. ## 4 Non-Hispanic White Male 2013 123. ## 5 Non-Hispanic White Male 2015 126. ## 6 Non-Hispanic White Male 2017 127. ## 7 Non-Hispanic Black Female 2013 125. ## 8 Non-Hispanic Black Female 2015 126. ## 9 Non-Hispanic Black Female 2017 131. ## 10 Non-Hispanic Black Male 2013 129. ## # ... with 20 more rows ``` --- ## Learning more To learn more, see - This [vignette](https://tidyr.tidyverse.org/articles/pivot.html#introduction) dedicated to the `pivot_wider` and `pivot_longer` functions.