class: center, middle, inverse, title-slide

# Data visualization basics
## Introduction to ggplot2
### Byron C. Jaeger
### last updated: 2020-04-07

---

## Catch up

- Any questions on material from last time?

- Any questions on the reading / primer?

- Any questions on workflow / course structure?

- Catch up on informal "requirements":
    + You should have a GitHub profile.
    + You should have an RStudio Cloud profile.
    + Ideally add your photo to both profiles.

---

## Agenda

- Tips on getting help (`reprex`)

- Exploratory data analysis

- Data visualization

- Visualizing Star Wars

- Aesthetics

- Faceting

---
class: inverse, center, middle

# `reprex`

---

## What is `reprex`?

`reprex` stands for reproducible example. The `reprex` [R package](https://reprex.tidyverse.org/) helps you prepare reproducible examples for posts on GitHub, StackOverflow, or Slack.

--

.pull-left[

- If you want someone to help you solve a problem...

    - whittle the problem down to its essential components.
    - keep code minimal, don't overwhelm your helpers.
    - describe issues concisely.

]

--

.pull-right[

<img src="img/help-me-help-you.gif" width="500" height="300" />

]

.footnote[<br>Image source and great place to get help: [https://reprex.tidyverse.org/](https://reprex.tidyverse.org/) <br> see R/reprex_example in [RStudio Cloud](https://rstudio.cloud/spaces/15174/project/1048666)]

---
class: center, middle

# Exploratory data analysis <br> (EDA)

---

## What is EDA?

- Interactively learning about data by summarizing its main characteristics.

--

- Often, this is visual. That's what we'll focus on today.

--

- We might also calculate summary statistics and perform

    + data tidying (coming up much later)
    + data isolation (coming up next time)
    + data transformation (coming up later)

  at (or before) this stage of the analysis.

---
class: center, middle

# Data visualization

---

## Data visualization

> *"The simple graph has brought more information to the data analyst’s mind than any other device."
— John Tukey*

- Data visualization is the creation and study of the visual representation of data.

- There are many tools for visualizing data (R is one of them), and many approaches/systems within R for making data visualizations (**ggplot2** is one of them, and that's the one we're going to use).

---

## `ggplot2`

- `ggplot2` is a data visualization package. Like any package, `ggplot2` should be loaded before its functions are used.

```r
library(ggplot2)
```

--

- Code for `ggplot` can often be written as

```r
ggplot() + geom_xxx()
```

--

- More generally:

```r
ggplot(data = [dataset]) +
  aes(x = [x-variable], y = [y-variable]) +
  geom_xxx() +
  other_functions()
```

- `geoms` (geometric objects) determine the type of plot produced.

---

## About `ggplot2`

- `ggplot2` is the name of the package

- The `gg` in "`ggplot2`" stands for Grammar of Graphics

- Inspired by the book **The Grammar of Graphics** by Leland Wilkinson

- `ggplot()` is the main function in ggplot2

- For help with `ggplot2`, see http://ggplot2.tidyverse.org/

---
class: center, middle

# Visualizing Star Wars

---

## Dataset terminology

__Question__: What does each row represent? What does each column represent?

```r
library(dplyr)  # starwars and glimpse() come from dplyr
starwars[, 1:5]
```

```
## # A tibble: 87 x 5
##    name               height  mass hair_color    skin_color
##    <chr>               <int> <dbl> <chr>         <chr>
##  1 Luke Skywalker        172    77 blond         fair
##  2 C-3PO                 167    75 <NA>          gold
##  3 R2-D2                  96    32 <NA>          white, blue
##  4 Darth Vader           202   136 none          white
##  5 Leia Organa           150    49 brown         light
##  6 Owen Lars             178   120 brown, grey   light
##  7 Beru Whitesun lars    165    75 brown         light
##  8 R5-D4                  97    32 <NA>          white, red
##  9 Biggs Darklighter     183    84 black         light
## 10 Obi-Wan Kenobi        182    77 auburn, white fair
## # ... with 77 more rows
```

---

## Luke Skywalker

---

## What's in the Star Wars data?

Take a `glimpse` at the data:

```r
glimpse(starwars, width = 60)
```

```
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", ...
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97...
## $ mass       <dbl> 77, 75, 32, 136, 49, 120, 75, 32, 84...
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "b...
## $ skin_color <chr> "fair", "gold", "white, blue", "whit...
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "...
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0,...
## $ sex        <chr> "male", "none", "none", "male", "fem...
## $ gender     <chr> "masculine", "masculine", "masculine...
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Ta...
## $ species    <chr> "Human", "Droid", "Droid", "Human", ...
## $ films      <list> [<"The Empire Strikes Back", "Reven...
## $ vehicles   <list> [<"Snowspeeder", "Imperial Speeder ...
## $ starships  <list> [<"X-wing", "Imperial shuttle">, <>...
```

---

## What's in the Star Wars data?

Run the following **in the Console** to view the help page:

```r
?starwars
```

__Question__: How many rows and columns does this dataset have?

Make a prediction: What relationship do you expect to see between height and mass?

---

## Mass vs. height

```r
ggplot(data = starwars) +
  aes(x = height, y = mass) +
  geom_point()
```

```
## Warning: Removed 28 rows containing missing values (geom_point).
```

---

## What's that warning?

- Not all characters have height and mass information (hence 28 of them are not plotted)

```
## Warning: Removed 28 rows containing missing values (geom_point).
```

- Going forward I'll suppress the warning to save room on slides, but it's important to note it

---

## Mass vs. height

__Questions__:

- How would you describe this relationship?
- Who is the not so tall but really chubby character?

---
class: center, middle

# Jabba!

<img src="img/jabbaplot.png" width="768" />

---
class: center, middle

# Aesthetics

---

## Aesthetics options

Visual characteristics that can be **mapped to data** are

- `color`
- `size`
- `shape`
- `alpha` (transparency)

---

## Mass vs. height + gender

```r
ggplot(data = starwars) +
  aes(x = height, y = mass, color = gender) +
  geom_point()
```

---

## Aesthetics summary

- Discrete variables are measured (often counted) on a discrete scale

| Aesthetics | Discrete                           |
|------------|------------------------------------|
| color      | different color for each category  |
| size       | discrete steps in sizes            |
| shape      | different shapes for each category |

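
The discrete case can be sketched as follows. This is a minimal example, assuming `dplyr` is installed to supply `starwars`; mapping `gender` to both color and shape is an illustrative choice, and `p_discrete` is a made-up name.

```r
library(ggplot2)
library(dplyr)  # assumption: used only to provide the starwars data

# gender is a character (discrete) variable, so each category
# gets its own color and its own point shape
p_discrete <- ggplot(data = starwars) +
  aes(x = height, y = mass, color = gender, shape = gender) +
  geom_point()

p_discrete
```
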
- Continuous variables are measured on a continuous scale
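
| Aesthetics | Continuous                                                        |
|------------|-------------------------------------------------------------------|
| color      | color gradient                                                    |
| size       | point area increases with value (`scale_radius()` maps radius)    |
| shape      | shouldn't (and doesn't) work                                      |
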
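
The continuous case can be sketched in the same way. This is a hedged example, assuming `dplyr` supplies `starwars`; `birth_year` is picked only because it is numeric, and the object names are hypothetical.

```r
library(ggplot2)
library(dplyr)  # assumption: used only to provide the starwars data

# birth_year is numeric (continuous): color becomes a gradient
# and point size grows with the value
p_gradient <- ggplot(data = starwars) +
  aes(x = height, y = mass, color = birth_year, size = birth_year) +
  geom_point()

p_gradient

# Mapping the same continuous variable to shape is expected to fail
# when the plot is built, so the attempt is wrapped in try()
p_shape <- ggplot(data = starwars) +
  aes(x = height, y = mass, shape = birth_year) +
  geom_point()

shape_attempt <- try(ggplot_build(p_shape), silent = TRUE)
shape_fails <- inherits(shape_attempt, "try-error")
shape_fails
```

Wrapping the build in `try()` keeps the script running, so the failure can be inspected rather than halting the session.
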
---

## Your turn

- Switch to RStudio Cloud and find the data visualization basics project.

- Work on completing the problems in `exercises.Rmd` with your teammates.