class: center, middle, inverse, title-slide

# Data visualization tools
## Also: data tidying tools
### Byron C. Jaeger
### Last updated: 2020-04-09

---
class: inverse, center, middle

# Tidying your data

---
class: center, middle

## Going from here

<img src="img/excel_spreadsheet.png" width="1024" />

---

## To here

```r
glimpse(supermarket, width = 60)
```

```
## Rows: 14,059
## Columns: 16
## $ transaction        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1...
## $ purchase_date      <dttm> 2011-12-18, 2011-12-20, 201...
## $ customer_id        <dbl> 7223, 7841, 8374, 9619, 1900...
## $ gender             <chr> "F", "M", "F", "M", "F", "F"...
## $ marital_status     <chr> "S", "M", "M", "M", "S", "M"...
## $ homeowner          <chr> "Y", "Y", "N", "Y", "Y", "Y"...
## $ children           <dbl> 2, 5, 2, 3, 3, 3, 2, 2, 3, 1...
## $ annual_income      <chr> "$30K - $50K", "$70K - $90K"...
## $ city               <chr> "Los Angeles", "Los Angeles"...
## $ state_or_province  <chr> "CA", "CA", "WA", "OR", "CA"...
## $ country            <chr> "USA", "USA", "USA", "USA", ...
## $ product_family     <chr> "Food", "Food", "Food", "Foo...
## $ product_department <chr> "Snack Foods", "Produce", "S...
## $ product_category   <chr> "Snack Foods", "Vegetables",...
## $ units_sold         <dbl> 5, 5, 3, 4, 4, 3, 4, 6, 1, 2...
## $ revenue            <dbl> 27.38, 14.90, 5.52, 4.44, 14...
```

---

## Import

- `read_excel()` can read specific sheets into R as `tibbles`

```r
supermarket <- read_excel(
  path = "data/Supermarket Transactions.xlsx",
  sheet = "Data"
)

supermarket[1:2, 1:3]
```

```
## # A tibble: 2 x 3
##   Transaction `Purchase Date`     `Customer ID`
##         <dbl> <dttm>                      <dbl>
## 1           1 2011-12-18 00:00:00          7223
## 2           2 2011-12-20 00:00:00          7841
```

Look at the first 3 names of the spreadsheet's data. What do you see?

---
background-image: url(img/janitor_clean_names.png)
background-size: 75%
background-position: 50% 75%

## Variable names

The variable names have spaces and are written in Title Case.
There's nothing inherently wrong with that, but names with spaces must be wrapped in backticks in code, and renaming every column by hand is tedious. `janitor`, an R package for cleaning data, is here to help.

---

## Variable names

Pick your favorite naming convention:

```r
clean_names(supermarket, case = 'snake')[1:2, 1:3]
```

```
## # A tibble: 2 x 3
##   transaction purchase_date       customer_id
##         <dbl> <dttm>                    <dbl>
## 1           1 2011-12-18 00:00:00        7223
## 2           2 2011-12-20 00:00:00        7841
```

---

## Variable names

Pick your favorite naming convention:

```r
clean_names(supermarket, case = 'lower_camel')[1:2, 1:3]
```

```
## # A tibble: 2 x 3
##   transaction purchaseDate        customerId
##         <dbl> <dttm>                   <dbl>
## 1           1 2011-12-18 00:00:00       7223
## 2           2 2011-12-20 00:00:00       7841
```

---

## Variable names

Pick your favorite naming convention:

```r
clean_names(supermarket, case = 'screaming_snake')[1:2, 1:3]
```

```
## # A tibble: 2 x 3
##   TRANSACTION PURCHASE_DATE       CUSTOMER_ID
##         <dbl> <dttm>                    <dbl>
## 1           1 2011-12-18 00:00:00        7223
## 2           2 2011-12-20 00:00:00        7841
```

---

## Review: data summarization

Find the total `revenue` from supermarkets for each city:

```r
city_rev <- supermarket %>%
  group_by(city) %>%
  summarise(revenue = sum(revenue, na.rm = TRUE))

city_rev
```

```
## # A tibble: 23 x 2
##    city          revenue
##  * <chr>           <dbl>
##  1 Acapulco        5161.
##  2 Bellingham       993.
##  3 Beverly Hills  10320.
##  4 Bremerton      10975.
##  5 Camacho         5797.
##  6 Guadalajara      523.
##  7 Hidalgo        11313.
##  8 Los Angeles    12296.
##  9 Merida          8740.
## 10 Mexico City     2488.
## # ... with 13 more rows
```

---

## Bar charts

Without proper hygiene, bar charts devolve into bad charts

```r
ggplot(city_rev, aes(x = city, y = revenue)) +
  geom_bar(stat = "identity")
```

---

## Order matters

```r
ggplot(city_rev) +
*  aes(x = reorder(city, revenue), y = revenue) +
  geom_bar(stat = "identity")
```

---

## Orientation matters

```r
ggplot(city_rev) +
  aes(x = reorder(city, revenue), y = revenue) +
  geom_bar(stat = "identity") +
*  coord_flip()
```

---
class: center, middle

## Question

Who generates more revenue for supermarkets?

--

Shoppers who are men or shoppers who are women?

---

## Data summarization

```r
city_rev_gender <- supermarket %>%
  group_by(city, gender) %>%
  summarise(revenue = sum(revenue, na.rm = TRUE)) %>%
*  ungroup() %>%
  mutate(
    gender = recode(gender, 'F' = 'Female', 'M' = 'Male'),
    # re-order city in the data rather than the plot
    # why would this fail if data were grouped?
*    city = fct_reorder(city, .x = revenue)
  )

city_rev_gender[1:3,]
```

```
## # A tibble: 3 x 3
##   city       gender revenue
##   <fct>      <chr>    <dbl>
## 1 Acapulco   Female   2566.
## 2 Acapulco   Male     2596.
## 3 Bellingham Female    453.
```

---

## Comparisons

The `fill` aesthetic colors the inside of the bars; `color` colors their border

```r
*ggplot(city_rev_gender, aes(city, revenue, fill = gender)) +
  geom_bar(stat = "identity", color = 'purple') +
  coord_flip()
```

---

## Comparisons

`position` governs how the bars are placed

```r
ggplot(city_rev_gender, aes(city, revenue, fill = gender)) +
*  geom_bar(stat = "identity", position = "dodge") +
  coord_flip()
```

---

## Comparisons

`facet_wrap()` and `facet_grid()` give one panel per group

```r
ggplot(city_rev_gender, aes(city, revenue, fill = gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
*  facet_wrap(~ gender)
```

---

## Maximal info, minimal ink

> Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.
>
> <footer>--- Edward R. Tufte</footer>

- How can we make it easy to pick out the patterns across cities?

- Can we use less ink?

---

## Points

```r
ggplot(city_rev_gender, aes(revenue, city)) +
  geom_point(aes(color = gender))
```

---

## Size

```r
ggplot(city_rev_gender, aes(revenue, city)) +
  geom_point(aes(color = gender), size = 3) +
*  theme(text = element_text(size = 16))
```

---

## Aesthetic inheritance

- ggplot adds layers, one by one, to a graph.

- general aesthetics for the whole graph can be set using `aes()`

    + in the `ggplot()` function

    + in a stand-alone `aes()` call added with `+`.

- the aesthetics of the current `geom` can be set using `aes()` _inside_ the geom function.
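---

The inheritance rules above can be checked directly on a plot object. This is a minimal sketch on a made-up three-row data frame: `toy`, `x`, `y`, and `g` are hypothetical names chosen for illustration, not part of the supermarket data. It assumes only that `ggplot2` is installed.

```r
library(ggplot2)

# Hypothetical toy data, invented for this illustration
toy <- data.frame(
  x = c(1, 2, 3),
  y = c(2, 4, 3),
  g = c("a", "b", "a")
)

# Two ways to set plot-wide aesthetics
p_inside     <- ggplot(toy, aes(x = x, y = y))  # inside ggplot()
p_standalone <- ggplot(toy) + aes(x = x, y = y) # stand-alone aes()

# Add one layer with its own aesthetic and one layer with none
p_layered <- p_standalone +
  geom_point(aes(color = g)) + # color is set for this layer only
  geom_line()                  # no aes() of its own; uses the plot-wide x and y

# Plot-wide mappings, shared by every layer that inherits
names(p_inside$mapping)
names(p_standalone$mapping)

# Layer-specific mappings: only the point layer carries a color mapping
names(p_layered$layers[[1]]$mapping)
length(p_layered$layers[[2]]$mapping)
```

Both plot-wide forms store the same `x` and `y` mappings on the plot, while the color mapping lives only on the point layer, which is why the line layer is drawn without it.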
---

The aesthetics in this line are inherited by `geom_line`

```r
ggplot(city_rev_gender) +
*  aes(x = revenue, y = city) +
  geom_point(aes(color = gender), size = 3) +
  theme(text = element_text(size = 16)) +
  geom_line() # inherits x = revenue, y = city
```

The aesthetics in this line are __not__ inherited by `geom_line`

```r
ggplot(city_rev_gender) +
  aes(x = revenue, y = city) +
*  geom_point(aes(color = gender), size = 3) +
  theme(text = element_text(size = 16)) +
  geom_line() # inherits x and y, but not color = gender
```

---

## Lines

The main aesthetic for lines is `group`. Bad groupings ruin good plots.

```r
ggplot(city_rev_gender) +
  aes(x = revenue, y = city) +
  geom_point(aes(color = gender), size = 3) +
  theme(text = element_text(size = 16)) +
  geom_line(aes(group = gender)) # disaster!
```

---

## Lines

Good groupings help draw the eye to the relevant comparisons

```r
ggplot(city_rev_gender) +
  aes(x = revenue, y = city) +
  geom_point(aes(color = gender), size = 3) +
  theme(text = element_text(size = 16)) +
  geom_line(aes(group = city))
```

---

## Order matters

If you want points to appear on top of the lines, put the line layer down _before_ the point layer.

```r
ggplot(city_rev_gender) +
  aes(x = revenue, y = city) +
*  geom_line(aes(group = city)) +
  geom_point(aes(color = gender), size = 3) +
  theme(text = element_text(size = 16))
```

---

## Use text intelligently

Annotation can help readers understand the most relevant parts of your data.

- `ggplot` uses `geom_text()` to add text layers

- the main aesthetic for `geom_text()` is `label`

- `ggforce`, an extension of `ggplot`, has a lot of handy annotation helpers.

Remember, minimal ink...

---

Oh dear...
```r
ggplot(city_rev_gender) +
*  aes(x = revenue, y = city, label = revenue) +
  geom_line(aes(group = city)) +
  geom_point(aes(color = gender), size = 3) +
*  geom_text(aes(color = gender)) +
  theme(text = element_text(size = 16))
```

---

## Your turn

- It will take some data wrangling to get this figure just how we want it.

- Let's finish the figure in `exercises.Rmd`
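---

One possible direction for taming the label clutter, sketched on made-up numbers. This is an assumption about the kind of wrangling that might help, not the solution in `exercises.Rmd`. The data frame `toy` and its `label` column are hypothetical, and the revenue values are invented for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for city_rev_gender; values are made up
toy <- data.frame(
  city    = rep(c("Acapulco", "Bellingham"), each = 2),
  gender  = rep(c("Female", "Male"), times = 2),
  revenue = c(2566, 2596, 453, 540)
)

# Wrangle first: build a short, rounded label instead of printing raw revenue
toy$label <- paste0("$", format(round(toy$revenue), big.mark = ",", trim = TRUE))

p <- ggplot(toy) +
  aes(x = revenue, y = city) +
  geom_line(aes(group = city)) +
  geom_point(aes(color = gender), size = 3) +
  # vjust lifts the text above the points; the text layer adds no legend key
  geom_text(aes(label = label, color = gender), vjust = -1, show.legend = FALSE) +
  theme(text = element_text(size = 16))

p
```

Preparing the label in the data keeps the plotting code short, and the same column can later be blanked out for rows that should stay unlabeled.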