Wide data to long using the tidyverse (tidyr's gather function)

A wide data storage format is an efficient and compact way to store information. And this organization perhaps it makes data easier to inspect. We have wide monitors our laptops and destops. However, for visualization and analysis you generally need to transform this data from the wide format to a “tidy”, long format.

We look at the case where just one variable is stored in a spreadsheet.

library(tidyverse)

Suppose you have a data frame of rankings of schools by year, and the initial data set is organized as follows (I just build one with tribble()):

df_wide <- tribble(~rankings_of_schools_by_year, ~`2000`, ~`2001`, ~`2002`,
        "U of Illinois", 1, 2,  3, 
        "TU Dresden", 2, 3, 1, 
        "U of Denver", 3, 1, 1, 
        "Hogwarts", 4,4,4)

df_wide
## # A tibble: 4 x 4
##   rankings_of_schools_by_year `2000` `2001` `2002`
##   <chr>                        <dbl>  <dbl>  <dbl>
## 1 U of Illinois                    1      2      3
## 2 TU Dresden                       2      3      1
## 3 U of Denver                      3      1      1
## 4 Hogwarts                         4      4      4

Restructuring the data with gather, I define the names of columns that will contain the information in the column names (year) and the variable of interest (rank).

df_long <- df_wide %>% gather(key = year, value = rank, `2000`:`2002`)
df_long
## # A tibble: 12 x 3
##    rankings_of_schools_by_year year   rank
##    <chr>                       <chr> <dbl>
##  1 U of Illinois               2000      1
##  2 TU Dresden                  2000      2
##  3 U of Denver                 2000      3
##  4 Hogwarts                    2000      4
##  5 U of Illinois               2001      2
##  6 TU Dresden                  2001      3
##  7 U of Denver                 2001      1
##  8 Hogwarts                    2001      4
##  9 U of Illinois               2002      3
## 10 TU Dresden                  2002      1
## 11 U of Denver                 2002      1
## 12 Hogwarts                    2002      4

Pretty good! But we are not all the way there. Let’s use the code above a base. We need to change the first column name to be more appropriate. Also, the years are encoded as a character variable whereas they should be numeric (in this case integers, as the years are round numbers).

df_long <- df_wide %>% gather(key = year, value = rank, `2000`:`2002`) %>% 
  rename(school = rankings_of_schools_by_year) %>% 
  mutate(year = as.integer(year))
df_long
## # A tibble: 12 x 3
##    school         year  rank
##    <chr>         <int> <dbl>
##  1 U of Illinois  2000     1
##  2 TU Dresden     2000     2
##  3 U of Denver    2000     3
##  4 Hogwarts       2000     4
##  5 U of Illinois  2001     2
##  6 TU Dresden     2001     3
##  7 U of Denver    2001     1
##  8 Hogwarts       2001     4
##  9 U of Illinois  2002     3
## 10 TU Dresden     2002     1
## 11 U of Denver    2002     1
## 12 Hogwarts       2002     4

Note to students: Then you might filter by year: filter(year > 2000).

Avatar
Evangeline Reynolds
Visiting Teaching Assistant Professor

My research interests include international institutions, causal inference, data visualization, and computational social science and pedagogy.