10  pivot_longer and pivot_wider

Author

Luke Miratrix and Josh Gilbert

Generally, you want your data to be in a form where each row is a case and each column is a variable (either explanatory or response). Sometimes your data don’t start that way. This section describes how to move your data around to get it in that form. The tidyverse provides a simple method for doing this (pivot_longer() and pivot_wider()) which you should read about in R for Data Science. There are also “old school” ways of doing this, via a method called reshape(); this way is more powerful and useful in some circumstances. See the final section for more on this old-style approach.

But for now, the pivot methods will pretty much do everything you want. Both pivot_longer and pivot_wider from tidyverse are great functions to understand. First, we load tidyverse and make some fake data.

library(tidyverse)

dat <- data.frame( ID = c( 1:3 ), 
                  X = c( 10, 20, 30 ),
                  Y1 = 1:3,
                  Y2 = 10 + 1:3,
                  Y3 = 20 + 1:3 )

dat
  ID  X Y1 Y2 Y3
1  1 10  1 11 21
2  2 20  2 12 22
3  3 30  3 13 23

This data is in wide format, where we have multiple measurements (Y1, Y2, and Y3) for each individual (each row of data).

10.1 Converting wide data to long data

We use pivot_longer to take our Y values and nest them within each ID for longitudinal MLM analysis. (NB you can use SEM to fit longitudinal models with wide data; we do not explore that application here.)

datL <- pivot_longer(dat, Y1:Y3, 
                     names_to = "time", 
                     values_to = "front" )

datL
# A tibble: 9 × 4
     ID     X time  front
  <int> <dbl> <chr> <dbl>
1     1    10 Y1        1
2     1    10 Y2       11
3     1    10 Y3       21
4     2    20 Y1        2
5     2    20 Y2       12
6     2    20 Y3       22
7     3    30 Y1        3
8     3    30 Y2       13
9     3    30 Y3       23

10.2 Converting long data to wide data

pivot_wider takes us back in the other direction.

newdat <- pivot_wider( datL, c(ID, X), 
                       names_from=time, 
                       values_from=front  )

newdat
# A tibble: 3 × 5
     ID     X    Y1    Y2    Y3
  <int> <dbl> <dbl> <dbl> <dbl>
1     1    10     1    11    21
2     2    20     2    12    22
3     3    30     3    13    23

We then verify our work with a few checks.

stopifnot( length( unique( newdat$ID ) ) == nrow( newdat ) )

students = datL %>% dplyr::select( ID, X ) %>%
    unique()
students
# A tibble: 3 × 2
     ID     X
  <int> <dbl>
1     1    10
2     2    20
3     3    30
students = merge( students, newdat, by="ID" )

10.3 Optional: wrangling data with reshape

The reshape() command is the old-school way of doing things, and it is harder to use but also can be more powerful in some ways (alternatively, there is a long literature on doing fancy stuff with the pivot methods as well). This section is entirely optional and possibly no longer useful.

Anyway, say you have data in a form where a row has a value for a variable for several different points in time. The following code turns it into a data.frame where each row (case) is a value for the variable at that point in time. You also have an ID variable for which Country the GDP came from.

dtw = read.csv( "data/fake_country_block.csv", as.is=TRUE )
dtw
  Country X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004
1   China   0.5     1     2   3.4     4   5.3   6.0     7
2 Morocco  31.9    32    33  34.0    NA  36.0  37.0    NA
3 England  51.3    52    53  54.3    55  56.0  57.3    58

Here we have three rows, but actually a lot of cases if we consider each time point a case. For trying it on your own, get the sample csv file ()[here]
See the website to get the sample csv file \verb|fake_country_block.csv|.

The following our original data by making a case for each time point:

dt = reshape( dtw, idvar="Country", timevar="Year", varying=2:9, sep="", direction="long" )
head(dt)
             Country Year    X
China.1997     China 1997  0.5
Morocco.1997 Morocco 1997 31.9
England.1997 England 1997 51.3
China.1998     China 1998  1.0
Morocco.1998 Morocco 1998 32.0
England.1998 England 1998 52.0

Things to notice: each case has a “row name” made out of the country and the Year. The “2:9” indicates a range of columns for the variable that is actually the same variable.
R picked up that, for each of these columns, “X” is the name of the variable and the number is the time, and seperated them. You can set the name of your time variable, \verb|timevar|, to whatever you want.

The above output is called “long format” and the prior is called “wide format.”
You can go in either direction. Here:

dtn = reshape( dt, idvar="Country", timevar="Year" )
dtn
             Country X.1997 X.1998 X.1999 X.2000 X.2001 X.2002 X.2003 X.2004
China.1997     China    0.5      1      2    3.4      4    5.3    6.0      7
Morocco.1997 Morocco   31.9     32     33   34.0     NA   36.0   37.0     NA
England.1997 England   51.3     52     53   54.3     55   56.0   57.3     58

You can reshape on multiple variables. For example:

exp.dat = data.frame( ID=c("a","b","c","d"), 
      cond = c("AI","DI","DI","AI"),
            trial1 = c("E","U","U","E"),
            dec1 = c(1,1,0,1),
            trial2 = c("U","E","U","E"),
            dec2 = c(0,0,0,1),
                trial3 = c("U","E","E","U"),
            dec3 = c(0,1,0,1),
                trial4 = c("E","U","E","U"),
            dec4 = c(0,1,0,0) )
exp.dat
  ID cond trial1 dec1 trial2 dec2 trial3 dec3 trial4 dec4
1  a   AI      E    1      U    0      U    0      E    0
2  b   DI      U    1      E    0      E    1      U    1
3  c   DI      U    0      U    0      E    0      E    0
4  d   AI      E    1      E    1      U    1      U    0
rs = reshape( exp.dat,  idvar="ID", 
        varying=c( 3:10 ), sep="", direction="long")            
head(rs)
    ID cond time trial dec
a.1  a   AI    1     E   1
b.1  b   DI    1     U   1
c.1  c   DI    1     U   0
d.1  d   AI    1     E   1
a.2  a   AI    2     U   0
b.2  b   DI    2     E   0

It sorts out which variables are which. Note the names have to be exactly the same for any group of variables.

Once you have reshaped, you can look at things more easily (I use mosaic’s tally instead of the base table):

mosaic::tally( trial ~ dec, data=rs )
     dec
trial 0 1
    E 4 4
    U 5 3

or

mosaic::tally( trial~dec+cond, data=rs )
, , cond = AI

     dec
trial 0 1
    E 1 3
    U 3 1

, , cond = DI

     dec
trial 0 1
    E 3 1
    U 2 2