Data visualization tools

class: center, middle, inverse, title-slide

# Data visualization tools
## Also: data tidying tools
### Byron C. Jaeger
### Last updated: 2020-04-09

---

class: inverse, center, middle

# Tidying your data

---
class: center, middle

## Going from here

---

## To here

```r
glimpse(supermarket, width = 60)
```

```
## Rows: 14,059
## Columns: 16
## $ transaction        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1...
## $ purchase_date      <dttm> 2011-12-18, 2011-12-20, 201...
## $ customer_id        <dbl> 7223, 7841, 8374, 9619, 1900...
## $ gender             <chr> "F", "M", "F", "M", "F", "F"...
## $ marital_status     <chr> "S", "M", "M", "M", "S", "M"...
## $ homeowner          <chr> "Y", "Y", "N", "Y", "Y", "Y"...
## $ children           <dbl> 2, 5, 2, 3, 3, 3, 2, 2, 3, 1...
## $ annual_income      <chr> "$30K - $50K", "$70K - $90K"...
## $ city               <chr> "Los Angeles", "Los Angeles"...
## $ state_or_province  <chr> "CA", "CA", "WA", "OR", "CA"...
## $ country            <chr> "USA", "USA", "USA", "USA", ...
## $ product_family     <chr> "Food", "Food", "Food", "Foo...
## $ product_department <chr> "Snack Foods", "Produce", "S...
## $ product_category   <chr> "Snack Foods", "Vegetables",...
## $ units_sold         <dbl> 5, 5, 3, 4, 4, 3, 4, 6, 1, 2...
## $ revenue            <dbl> 27.38, 14.90, 5.52, 4.44, 14...
```

---

## Import

- `read_excel()`, from the `readxl` package, reads a specific sheet into R as a tibble

```r
supermarket <- read_excel(
  path  = "data/Supermarket Transactions.xlsx",
  sheet = "Data"
)

supermarket[1:2, 1:3]
```

```
## # A tibble: 2 x 3
##   Transaction `Purchase Date`     `Customer ID`
##         <dbl> <dttm>                      <dbl>
## 1           1 2011-12-18 00:00:00          7223
## 2           2 2011-12-20 00:00:00          7841
```
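
If you don't know the sheet names in advance, `excel_sheets()` lists them. A minimal sketch, using the demo workbook that ships with `readxl` (found via `readxl_example()`) as a stand-in for the supermarket file:

```r
library(readxl)

# Path to a demo workbook bundled with readxl; it stands in for
# "data/Supermarket Transactions.xlsx", which may not be on your machine
demo_path <- readxl_example("datasets.xlsx")

# List the sheet names before choosing one to import
sheets <- excel_sheets(demo_path)
sheets

# Read one sheet by name; the result is a tibble
mtcars_xl <- read_excel(path = demo_path, sheet = "mtcars")
dim(mtcars_xl)
```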

Look at the first three column names of `supermarket`. What do you notice?

---
background-image: url(img/janitor_clean_names.png)
background-size: 75% 
background-position: 50% 75%

## Variable names

The variable names contain spaces and are written in Title Case. Nothing is inherently wrong with that, but names with spaces must be wrapped in backticks in R code, and renaming every column by hand is tedious. `janitor`, an R package for cleaning data, is here to help.

---

## Variable names

Pick your favorite naming convention:

```r
clean_names(supermarket, case = 'snake')[1:2, 1:3]
```

```
## # A tibble: 2 x 3
##   transaction purchase_date       customer_id
##         <dbl> <dttm>                    <dbl>
## 1           1 2011-12-18 00:00:00        7223
## 2           2 2011-12-20 00:00:00        7841
```

---

## Variable names

Pick your favorite naming convention:

```r
clean_names(supermarket, case = 'lower_camel')[1:2, 1:3]
```

```
## # A tibble: 2 x 3
##   transaction purchaseDate        customerId
##         <dbl> <dttm>                   <dbl>
## 1           1 2011-12-18 00:00:00       7223
## 2           2 2011-12-20 00:00:00       7841
```

---

## Variable names

Pick your favorite naming convention:

```r
clean_names(supermarket, case = 'screaming_snake')[1:2, 1:3]
```

```
## # A tibble: 2 x 3
##   TRANSACTION PURCHASE_DATE       CUSTOMER_ID
##         <dbl> <dttm>                    <dbl>
## 1           1 2011-12-18 00:00:00        7223
## 2           2 2011-12-20 00:00:00        7841
```

---

## Review: data summarization

Find the total `revenue` from supermarkets for each city:

```r
city_rev <- supermarket %>%
  group_by(city) %>%
  summarise(revenue = sum(revenue, na.rm = TRUE))

city_rev
```

```
## # A tibble: 23 x 2
##    city          revenue
##  * <chr>           <dbl>
##  1 Acapulco        5161.
##  2 Bellingham       993.
##  3 Beverly Hills  10320.
##  4 Bremerton      10975.
##  5 Camacho         5797.
##  6 Guadalajara      523.
##  7 Hidalgo        11313.
##  8 Los Angeles    12296.
##  9 Merida          8740.
## 10 Mexico City     2488.
## # ... with 13 more rows
```

---

## Bar charts

Without proper hygiene, bar charts devolve into bad charts.

```r
ggplot(city_rev, aes(x = city, y = revenue)) +
  geom_bar(stat = "identity")
```

![](index_files/figure-html/unnamed-chunk-10-1.png)

---

## Order matters

```r
ggplot(city_rev) +
* aes(x = reorder(city, revenue), y = revenue) +
  geom_bar(stat = "identity")
```

![](index_files/figure-html/unnamed-chunk-11-1.png)

---

## Orientation matters

```r
ggplot(city_rev) +
  aes(x = reorder(city, revenue), y = revenue) +
  geom_bar(stat = "identity") + 
* coord_flip()
```

![](index_files/figure-html/unnamed-chunk-12-1.png)
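
---

## Orientation matters

An alternative sketch, assuming `ggplot2` 3.3.0 or later: map the category to `y` directly and skip `coord_flip()`. The toy data frame below is a hypothetical stand-in for `city_rev`, using three rows (rounded) from the summary shown earlier. `geom_col()` is shorthand for `geom_bar(stat = "identity")`.

```r
library(ggplot2)

# Stand-in for city_rev: three rows taken from the earlier summary output
toy_rev <- data.frame(
  city    = c("Acapulco", "Bellingham", "Camacho"),
  revenue = c(5161, 993, 5797)
)

# The category goes on y, so the bars are horizontal without coord_flip()
p_horizontal <- ggplot(toy_rev) +
  aes(x = revenue, y = reorder(city, revenue)) +
  geom_col()

p_horizontal
```

Either approach gives horizontal bars. With this one, `x` and `y` in the code match what you see on the plot.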

---
class: center, middle

## Question

Who generates more revenue for supermarkets?

Shoppers who are men or shoppers who are women?

---

## Data summarization

```r
city_rev_gender <- supermarket %>%
  group_by(city, gender) %>%
  summarise(revenue = sum(revenue, na.rm = TRUE)) %>% 
* ungroup() %>%
  mutate(
    gender = recode(gender, 'F' = 'Female', 'M' = 'Male'),
    # re-order city in the data rather than the plot
    # why would this fail if data were grouped?
*   city = fct_reorder(city, .x = revenue)
  )

city_rev_gender[1:3,]
```

```
## # A tibble: 3 x 3
##   city       gender revenue
##   <fct>      <chr>    <dbl>
## 1 Acapulco   Female   2566.
## 2 Acapulco   Male     2596.
## 3 Bellingham Female    453.
```
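
---

## Data summarization

A sketch of why `ungroup()` comes first, on made-up data (the `toy` tibble and its values are hypothetical). After `summarise()`, the data are still grouped by `city`. As far as I know, `dplyr` refuses to modify a grouping variable inside `mutate()`, so the reorder only works on ungrouped data.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data: two rows per city, like city_rev_gender
toy <- tibble(
  city    = c("a", "a", "b", "b", "c", "c"),
  gender  = c("F", "M", "F", "M", "F", "M"),
  revenue = c(1, 2, 10, 20, 5, 6)
)

# Ungrouped: fct_reorder() sees every row and orders the levels by the
# median revenue of each city (median is its default summary function)
ungrouped <- toy %>%
  mutate(city = fct_reorder(city, .x = revenue))

levels(ungrouped$city) # "a" "c" "b"

# Grouped by city: capture the error instead of stopping the script
grouped_attempt <- tryCatch(
  toy %>%
    group_by(city) %>%
    mutate(city = fct_reorder(city, .x = revenue)),
  error = function(e) e
)

inherits(grouped_attempt, "error")
```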

---

## Comparisons

The `fill` aesthetic applies to the inside of bars; `color` applies to their border.

```r
*ggplot(city_rev_gender, aes(city, revenue, fill = gender)) +
  geom_bar(stat = "identity", color = 'purple') +
  coord_flip()
```

![](index_files/figure-html/unnamed-chunk-14-1.png)

---

## Comparisons

`position` governs how the bars are placed

```r
ggplot(city_rev_gender, aes(city, revenue, fill = gender)) +
* geom_bar(stat = "identity", position = "dodge") +
  coord_flip()
```

![](index_files/figure-html/unnamed-chunk-15-1.png)
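
---

## Comparisons

`"dodge"` is one of several positions. A minimal sketch with `position = "fill"`, which stacks the bars and rescales each stack to a total of 1, so the plot shows each gender's share of revenue rather than the amount. The `toy_gender` data are made up for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for city_rev_gender
toy_gender <- data.frame(
  city    = c("a", "a", "b", "b"),
  gender  = c("Female", "Male", "Female", "Male"),
  revenue = c(1, 3, 6, 2)
)

# Each city's bar now spans 0 to 1, split by gender
p_fill <- ggplot(toy_gender, aes(city, revenue, fill = gender)) +
  geom_bar(stat = "identity", position = "fill") +
  coord_flip()

p_fill
```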

---

## Comparisons

`facet_wrap()` and `facet_grid()` give one plot per group

```r
ggplot(city_rev_gender, aes(city, revenue, fill = gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
* facet_wrap( ~ gender)
```

![](index_files/figure-html/unnamed-chunk-16-1.png)
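
---

## Comparisons

The slide above shows `facet_wrap()`. Here is a minimal sketch of `facet_grid()`, which lays panels out in rows and/or columns that you name explicitly. The `toy_facets` data are made up for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for city_rev_gender
toy_facets <- data.frame(
  city    = c("a", "a", "b", "b"),
  gender  = c("Female", "Male", "Female", "Male"),
  revenue = c(1, 3, 6, 2)
)

# One row of panels per gender; use cols = vars(gender) for columns instead
p_grid <- ggplot(toy_facets, aes(city, revenue, fill = gender)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  facet_grid(rows = vars(gender))

p_grid
```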

---

## Maximal info, minimal ink

> Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.
>
> <footer>--- Edward R. Tufte</footer>

- How can we make it easy to pick out the patterns across cities?

- Can we use less ink?

---

## Points

```r
ggplot(city_rev_gender, aes(revenue, city)) +
  geom_point(aes(color = gender))
```

![](index_files/figure-html/unnamed-chunk-17-1.png)

---

## Size

```r
ggplot(city_rev_gender, aes(revenue, city)) +
  geom_point(aes(color = gender), size = 3) + 
* theme(text = element_text(size = 16))
```

![](index_files/figure-html/unnamed-chunk-18-1.png)

---

## Aesthetic inheritance

- ggplot adds layers, one by one, to a graph.

- general aesthetics for the whole graph can be set using `aes()`

    + in the `ggplot()` function

    + in a stand-alone `aes()` function.

- the aesthetics of the current `geom` can be set using `aes()` _inside_ the geom function.

---

The aesthetics in this line are inherited by `geom_line`

```r
ggplot(city_rev_gender) +
* aes(x = revenue, y = city) +
  geom_point(aes(color = gender), size = 3) +
  theme(text = element_text(size = 16)) +
  geom_line() # inherits x = revenue, y = city
```

The aesthetics in this line are __not__ inherited by `geom_line`

```r
ggplot(city_rev_gender) +
  aes(x = revenue, y = city) + 
* geom_point(aes(color = gender), size = 3) +
  theme(text = element_text(size = 16)) +
  geom_line() # inherits x and y, but not color = gender
```
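
---

## Aesthetic inheritance

One way to check inheritance is to ask `ggplot2` what each layer ended up with. A minimal sketch on made-up data (`toy_inherit` is hypothetical): `layer_data()` returns the computed data for one layer, and we count the distinct colours that the line layer received.

```r
library(ggplot2)

# Hypothetical stand-in for city_rev_gender
toy_inherit <- data.frame(
  city    = c("a", "a", "b", "b"),
  gender  = c("Female", "Male", "Female", "Male"),
  revenue = c(1, 3, 6, 2)
)

# color is set inside geom_point(), so geom_line() never sees it
p_local <- ggplot(toy_inherit) +
  aes(x = revenue, y = city) +
  geom_line(aes(group = city)) +
  geom_point(aes(color = gender), size = 3)

# color is set in the stand-alone aes(), so every layer inherits it
p_global <- ggplot(toy_inherit) +
  aes(x = revenue, y = city, color = gender) +
  geom_line(aes(group = city)) +
  geom_point(size = 3)

# Layer 1 is geom_line() in both plots
n_local  <- length(unique(layer_data(p_local, 1)$colour))  # 1: default colour only
n_global <- length(unique(layer_data(p_global, 1)$colour)) # 2: one per gender
c(n_local, n_global)
```

Inheriting `color` in the line layer may or may not be what you want. Setting it inside `geom_point()` keeps the lines neutral.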

---

## Lines

The main aesthetic for lines is `group`. Bad groupings ruin good plots.

```r
ggplot(city_rev_gender) +
  aes(x = revenue, y = city) + 
  geom_point(aes(color = gender), size = 3) +
  theme(text = element_text(size = 16)) +
  geom_line(aes(group = gender)) # disaster!
```

![](index_files/figure-html/unnamed-chunk-21-1.png)

---

## Lines

Good groupings help draw the eye to the relevant comparisons

```r
ggplot(city_rev_gender) +
  aes(x = revenue, y = city) + 
  geom_point(aes(color = gender), size = 3) + 
  theme(text = element_text(size = 16)) +
  geom_line(aes(group = city)) 
```

![](index_files/figure-html/unnamed-chunk-22-1.png)

---

## Order matters

If you want points to appear on top of the lines, put the line layer down _before_ the point layer.

```r
ggplot(city_rev_gender) +
  aes(x = revenue, y = city) + 
* geom_line(aes(group = city)) +
  geom_point(aes(color = gender), size = 3) + 
  theme(text = element_text(size = 16)) 
```

![](index_files/figure-html/unnamed-chunk-23-1.png)

---

## Use text intelligently

Annotation can help readers understand the most relevant parts of your data.

- `ggplot` uses `geom_text()` to add text layers

- the main aesthetic for `geom_text()` is `label`

- `ggforce`, an extension of `ggplot`, has a lot of handy annotation helpers.

Remember, minimal ink...

---

Oh dear...

```r
ggplot(city_rev_gender) +
* aes(x = revenue, y = city, label = revenue) +
  geom_line(aes(group = city)) +
  geom_point(aes(color = gender), size = 3) + 
* geom_text(aes(color = gender)) +
  theme(text = element_text(size = 16)) 
```

![](index_files/figure-html/unnamed-chunk-24-1.png)
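
---

## Use text intelligently

One small step toward less ink: round the labels and lift them off the points. This is only a sketch on made-up data (`toy_labels` and its values are hypothetical), not the finished figure. The label format and the `vjust` offset are choices you may want to change.

```r
library(ggplot2)

# Hypothetical stand-in for city_rev_gender
toy_labels <- data.frame(
  city    = c("a", "a", "b", "b"),
  gender  = c("Female", "Male", "Female", "Male"),
  revenue = c(2566.4, 2596.1, 453.2, 540.8)
)

p_labels <- ggplot(toy_labels) +
  aes(x = revenue, y = city) +
  geom_line(aes(group = city)) +
  geom_point(aes(color = gender), size = 3) +
  geom_text(
    # whole numbers with a thousands separator instead of raw decimals
    aes(label = formatC(round(revenue), format = "d", big.mark = ",")),
    vjust = -1 # nudge labels above the points
  ) +
  theme(text = element_text(size = 16))

p_labels
```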

---

## Your turn

- It will take some data wrangling to get this figure just how we want it.

- Let's finish the figure in `exercises.Rmd`