class: center, middle, inverse, title-slide # Joining data ## Intro to ‘The Office’ ### Byron C. Jaeger ### Last updated: 2020-07-10 --- class: inverse, center, middle # The Office --- background-image: url(img/schrute.png) background-size: 12.5% background-position: 97.5% 2.5% ## The Office - An American mockumentary sitcom television series that depicts the everyday lives of office employees. - Met with mixed reviews during its short first season, but the following seasons, particularly those featuring Steve Carell, received significant acclaim from television critics. - Later seasons were criticized for a decline in quality, with many seeing Carell's departure in season seven as a contributing factor. - The final season ended the series' run with a generally positive response. - The `schrute` package contains a full transcript of every Office episode! --- background-image: url(img/office_meme.jpg) background-size: 70% --- class: inverse, center, middle # Joining data --- layout: true background-image: url(img/dplyr.png) background-size: 12.5% background-position: 97.5% 2.5% ## Relational data --- It’s rare that a data analysis involves only a single table of data. - Usually combine many tables of data to answer questions - Multiple tables of data are called relational data. --- For example, `raw_ratings` has one row per episode of The Office <br/> and contains the average IMDB rating. ```r raw_ratings ``` ``` ## # A tibble: 188 x 5 ## season episode title imdb_rating total_votes ## <dbl> <dbl> <chr> <dbl> <dbl> ## 1 1 1 Pilot 7.6 3706 ## 2 1 2 Diversity Day 8.3 3566 ## 3 1 3 Health Care 7.9 2983 ## 4 1 4 The Alliance 8.1 2886 ## 5 1 5 Basketball 8.4 3179 ## 6 1 6 Hot Girl 7.8 2852 ## 7 2 1 The Dundies 8.7 3213 ## 8 2 2 Sexual Harassment 8.2 2736 ## 9 2 3 Office Olympics 8.4 2742 ## 10 2 4 The Fire 8.4 2713 ## # ... with 178 more rows ``` --- On the other hand, `raw_directors` has one row per writer for <br/> each episode and also specifies the episode's director. ```r raw_directors ``` ``` ## # A tibble: 223 x 4 ## season episode_name director writer ## <int> <chr> <chr> <chr> ## 1 1 Pilot Ken Kwapis Ricky Gervais ## 2 1 Pilot Ken Kwapis Stephen Merchant ## 3 1 Pilot Ken Kwapis Greg Daniels ## 4 1 Diversity Day Ken Kwapis B.J. Novak ## 5 1 Health Care Ken Whittingham Paul Lieberstein ## 6 1 The Alliance Bryan Gordon Michael Schur ## 7 1 Basketball Greg Daniels Greg Daniels ## 8 1 Hot Girl Amy Heckerling Mindy Kaling ## 9 2 The Dundies Greg Daniels Mindy Kaling ## 10 2 Sexual Harassment Ken Kwapis B.J. Novak ## # ... with 213 more rows ``` --- Some problems: - column names for episode titles are different in the two sets. - not all of the episode names from `raw_ratings` are in `raw_directors`, and vice-versa. ```r all(raw_ratings$episode %in% raw_directors$episode_name) ``` ``` ## [1] FALSE ``` - different units of observation in these two datasets. You will get a chance to engage with these problems in class exercises. For now, we will look at some simpler examples based on summaries of these data. --- `ratings_byseason`: The mean rating of episodes, by season ```r ratings_byseason <- raw_ratings %>% group_by(season) %>% summarise(rating = weighted.mean(imdb_rating, total_votes)) ratings_byseason ``` ``` ## # A tibble: 9 x 2 ## season rating ## <dbl> <dbl> ## 1 1 8.01 ## 2 2 8.47 ## 3 3 8.60 ## 4 4 8.67 ## 5 5 8.59 ## 6 6 8.26 ## 7 7 8.51 ## 8 8 7.67 ## 9 9 8.28 ``` --- `directors_byseason`: The most frequent director of each season ```r directors_byseason <- raw_directors %>% select(-writer) %>% distinct() %>% group_by(season) %>% count(director) %>% arrange(desc(n)) %>% slice(1) %>% select(-n) directors_byseason ``` ``` ## # A tibble: 9 x 2 ## # Groups: season [9] ## season director ## <int> <chr> ## 1 1 Ken Kwapis ## 2 2 Greg Daniels ## 3 3 Greg Daniels ## 4 4 Paul Feig ## 5 5 Paul Feig ## 6 6 Randall Einhorn ## 7 7 Jeffrey Blitz ## 8 8 B.J. Novak ## 9 9 David Rogers ``` --- layout: false background-image: url(img/dplyr.png) background-size: 12.5% background-position: 97.5% 2.5% ## Joining verbs There are 3 families of verbs designed for relational data: 1. __Mutating joins__, which add new variables to one data frame from matching observations in another. 1. __Filtering joins__, which filter observations from one data frame based on whether or not they match an observation in the other table. 1. __Set operations__, which treat observations as if they were set elements. --- class: inverse, center, middle # Mutating joins --- ## Left join <img src="img/left-join.gif" width="80%" style="display: block; margin: auto;" /> --- ## Right join <img src="img/right-join.gif" width="80%" style="display: block; margin: auto;" /> --- background-image: url(img/dplyr.png) background-size: 12.5% background-position: 97.5% 2.5% ## Joining verbs An example of __Mutating join:__ ```r *ratings_byseason %>% left_join(directors_byseason) ``` ``` ## # A tibble: 9 x 2 ## season rating ## <dbl> <dbl> ## 1 1 8.01 ## 2 2 8.47 ## 3 3 8.60 ## 4 4 8.67 ## 5 5 8.59 ## 6 6 8.26 ## 7 7 8.51 ## 8 8 7.67 ## 9 9 8.28 ``` --- background-image: url(img/dplyr.png) background-size: 12.5% background-position: 97.5% 2.5% ## Joining verbs An example of __Mutating join:__ ```r ratings_byseason %>% * left_join(directors_byseason) ``` ``` ## # A tibble: 9 x 3 ## season rating director ## <dbl> <dbl> <chr> ## 1 1 8.01 Ken Kwapis ## 2 2 8.47 Greg Daniels ## 3 3 8.60 Greg Daniels ## 4 4 8.67 Paul Feig ## 5 5 8.59 Paul Feig ## 6 6 8.26 Randall Einhorn ## 7 7 8.51 Jeffrey Blitz ## 8 8 7.67 B.J. Novak ## 9 9 8.28 David Rogers ``` --- ## Semi join <img src="img/semi-join.gif" width="80%" style="display: block; margin: auto;" /> --- ## Anti join <img src="img/anti-join.gif" width="80%" style="display: block; margin: auto;" /> --- background-image: url(img/dplyr.png) background-size: 12.5% background-position: 97.5% 2.5% ## Joining verbs An example of __Filtering join:__ ```r # Find episode names that don't match raw_ratings %>% select(episode_name = title, imdb_rating) %>% anti_join(raw_directors) ``` ``` ## # A tibble: 28 x 2 ## episode_name imdb_rating ## <chr> <dbl> ## 1 E-Mail Surveillance 8.4 ## 2 Boys and Girls 8.2 ## 3 A Benihana Christmas 8.7 ## 4 The Job 9.3 ## 5 Fun Run 8.8 ## 6 Dunder Mifflin Infinity 8.4 ## 7 Launch Party 8.5 ## 8 Money 8.7 ## 9 Goodbye, Toby 9.3 ## 10 Weight Loss 8.8 ## # ... with 18 more rows ``` --- background-image: url(img/dplyr.png) background-size: 12.5% background-position: 97.5% 2.5% ## Joining verbs An example of __Filtering join:__ ```r # Flip data sets to see why raw_directors %>% select(title = episode_name, director) %>% anti_join(raw_ratings) ``` ``` ## # A tibble: 32 x 2 ## title director ## <chr> <chr> ## 1 E-Mail Surveilance Paul Feig ## 2 Boys & Girls Dennie Gordon ## 3 A Benihana Christmas (Parts 1&2) Harold Ramis ## 4 The Job (Parts 1&2) Ken Kwapis ## 5 The Job (Parts 1&2) Ken Kwapis ## 6 Fun Run (Parts 1&2) Greg Daniels ## 7 Dunder Mifflin Infinity (Parts 1&2) Craig Zisk ## 8 Launch Party (Parts 1&2) Ken Whittingham ## 9 Money (Parts 1&2) Paul Lieerstein ## 10 Goodbye, Toby (Parts 1&2) Paul Feig ## # ... with 22 more rows ``` --- ## Full join <img src="img/full-join.gif" width="80%" style="display: block; margin: auto;" /> --- ## Learning more To learn more, see - Many more joining animations in the img folder - This [vignette](https://dplyr.tidyverse.org/articles/two-table.html) dedicated to joining verbs.