12  Tips, Tricks, and Debugging in R

This chapter is a complete hodge-podge of small things that tend to bite students, especially when they are dealing with messy data that they might have for final projects.

In particular, as you learn R, there’s lots of good tricks you’ll never know about until somebody shows you. Clean code is one such good trick; consider the following: “Your most important collaborator is you from 6 months ago. Unfortunately, you can’t ask that-you any questions, because they don’t answer their email.”

So onwards with a few things that you may find useful, either now or later.

12.1 Some principles to live by

12.1.1 Watch Tricky letter and number confusion in code

The letter “l” looks like the number “1”—watch out for that. Things like “mylm” are usually all letters, with “lm” standing for linear model.

12.1.2 Write in a good R style

Try to do the following

  • Comment your code!
  • Structure your R file like so:
    • Descriptive comments (including date)
    • Load libraries
    • Constants and script parameters (# iterations, etc.)
    • Functions (with descriptive comment after first line)
    • Everything else
  • Naming: variableName / variable.name, FunctionVerb, kConstantName. not_like_this
  • Indentation: 2-space indents is nice
  • Spaces are ok, but don’t go overboard. E.g., y = (a * x) + b + rnorm(1, sd=sigma)
  • Never use attach()

12.1.3 Save and load R objects to save time

If you have the result of something that took awhile to run (e.g., a big multilevel model fit to a lot of data) you can try saving it like so:

myBigThing = lm(mpg ~ disp, data=mtcars) #something slow
saveRDS(myBigThing, savedPath)

## Later on:
myBigThing <- readRDS(savedPath)

12.1.4 Reproduce randomness with set.seed

If your code uses random numbers, then you should set your seed, which makes your script always generate the same sequence of random numbers.

For example, say your code had this:

tryCatch({(1:(1:10)[rpois(1, 3)])}, error=function(e){(e)}) #works?
[1] 1 2
set.seed(97)
tryCatch({(1:(1:10)[rpois(1, 3)])}, error=function(e){(e)}) #fails!
<simpleError in 1:(1:10)[rpois(1, 3)]: argument of length 0>

(Note the tryCatch() method is a way of generating errors and not crashing.)

Key thing to know: Reproducible results help with debugging.

If you want to get fancy, try this (after installing the `TeachingDemos’ package):

TeachingDemos::char2seed("quinn") 
# Using your name as a seed says "nothing up my sleeve"

12.1.5 Keep your files organized

Ever seen this?

  • /My Documents
    • my paper.tex
    • my paper draft 2.tex
    • my paper final.tex
    • my paper final revised.tex
    • my paper final revised 2.tex
    • script.r
    • script 2.r
    • data.csv

Try instead something like:

  • /stat 166-Small Data Analysis
    • stat 166.rproj
    • /Empty Project
      • /code
      • /data
      • /text
      • /figures
      • readme.txt
    • /HW1

Your readme.txt might have informational notes such as “Got data from bit.ly/XYZ.” to remind you of what you were up to.

Your figures folder should be full of figures you can easily regenerate with code in your code folder.

12.1.6 Make sure your data are numeric

Sometimes when you load data in, R does weird things like decide all your numbers are actually words. This happens if some of your entries are not numbers. Then R makes them all not numbers. You can check this with the str() function:

str( exp.dat )
'data.frame':   4 obs. of  10 variables:
 $ ID    : chr  "a" "b" "c" "d"
 $ cond  : chr  "AI" "DI" "DI" "AI"
 $ trial1: chr  "E" "U" "U" "E"
 $ dec1  : num  1 1 0 1
 $ trial2: chr  "U" "E" "U" "E"
 $ dec2  : num  0 0 0 1
 $ trial3: chr  "U" "E" "E" "U"
 $ dec3  : num  0 1 0 1
 $ trial4: chr  "E" "U" "E" "U"
 $ dec4  : num  0 1 0 0

Here we see that we have factors (categorical variables) and numbers (num). All is well.

If something should be a number, then change it like so:

lst <-  c( 1, 2, 3, "dog", 5, 6 )
str( lst )
 chr [1:6] "1" "2" "3" "dog" "5" "6"
lst <- as.numeric( lst )
Warning: NAs introduced by coercion
lst
[1]  1  2  3 NA  5  6
str( lst )
 num [1:6] 1 2 3 NA 5 6

Note it warned you that you had non-numbers when you converted. The non-numbers are now missing (NA).

For a dataframe, you fix like this:

exp.dat$trial1 = as.numeric( exp.dat$trial1 )
Warning: NAs introduced by coercion

12.1.7 Categories should be words

For categorical variables, don’t use numbers, if at all possible. E.g.,

levels = c( "Low", "Middle", "High", NA )

is better than

levels = c(1, 2, 3, 99 )

12.2 Data Wrangling

We next give some high level data wrangling advice. But really, check out R for DS for much more and much better on the merging and summarizing topics.

12.2.1 Handling Lagged Data

Sometimes you have multiple times for your units (think country or state), and you want to regress, say, future X on current X. Then you want to have both future and current X for each case.

Here think of a case as a Country at a point in time. E.g., we might have data like this:

dtw = read.csv( "data/fake_country_block.csv", as.is=TRUE )
dt = pivot_longer( dtw, cols=X1997:X2004,
                   names_to = "Year", names_prefix = "X",
                   values_to = "X" )
dt$Year = as.numeric( dt$Year )
slice_sample( dt, n=5 )
# A tibble: 5 × 3
  Country  Year     X
  <chr>   <dbl> <dbl>
1 China    2000   3.4
2 England  1999  53  
3 China    2003   6  
4 Morocco  1997  31.9
5 England  2003  57.3

We then want to know what the X will be 2 years in the future. We can do this with the following trick:

dt.fut = dt
dt.fut$Year = dt.fut$Year - 2
head(dt.fut)
# A tibble: 6 × 3
  Country  Year     X
  <chr>   <dbl> <dbl>
1 China    1995   0.5
2 China    1996   1  
3 China    1997   2  
4 China    1998   3.4
5 China    1999   4  
6 China    2000   5.3
newdt = left_join( dt, dt.fut, 
                   by=c("Country","Year"), suffix=c("",".fut") )
head( newdt, 10 )
# A tibble: 10 × 4
   Country  Year     X X.fut
   <chr>   <dbl> <dbl> <dbl>
 1 China    1997   0.5   2  
 2 China    1998   1     3.4
 3 China    1999   2     4  
 4 China    2000   3.4   5.3
 5 China    2001   4     6  
 6 China    2002   5.3   7  
 7 China    2003   6    NA  
 8 China    2004   7    NA  
 9 Morocco  1997  31.9  33  
10 Morocco  1998  32    34  

Here we are merging records that match Country and Year.

Note that for the final two China entries, we don’t have a future X value. The merge will make it NA indicating it is missing.

How this works: we are tricking the program. We are making a new \verb|dt.lag| data.frame and then putting all the entries into the past by two years. When we merge, and match on Country and Year, the current dataframe and the lagged dataframe get lined up by this shift. Clever, no?

Now we could do regression:

my.lm = lm( X.fut ~ X + Country, data=newdt )
summary( my.lm )

Call:
lm(formula = X.fut ~ X + Country, data = newdt)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5869 -0.2610  0.0107  0.2753  0.5137 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      1.8684     0.2128    8.78  2.7e-06 ***
X                1.0179     0.0582   17.48  2.3e-09 ***
CountryEngland  -0.8259     2.9704   -0.28     0.79    
CountryMorocco  -0.7514     1.7603   -0.43     0.68    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.351 on 11 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:     1,  Adjusted R-squared:     1 
F-statistic: 2.13e+04 on 3 and 11 DF,  p-value: <2e-16

12.2.2 Quick overview of merging data

Often you have two datasets that you want to merge. For example, say you want to merge some data you have on a few states with some SAT information from the mosaic package.

library( mosaicData )
data( SAT )
head( SAT )
       state expend ratio salary frac verbal math  sat
1    Alabama   4.41  17.2   31.1    8    491  538 1029
2     Alaska   8.96  17.6   48.0   47    445  489  934
3    Arizona   4.78  19.3   32.2   27    448  496  944
4   Arkansas   4.46  17.1   28.9    6    482  523 1005
5 California   4.99  24.0   41.1   45    417  485  902
6   Colorado   5.44  18.4   34.6   29    462  518  980
df = data.frame( state=c("Alabama","California","Fakus"), 
                A=c(10,20,50), 
                frac=c(0.5, 0.3, 0.4) )
df
       state  A frac
1    Alabama 10  0.5
2 California 20  0.3
3      Fakus 50  0.4
merge( df, SAT, by="state", all.x=TRUE )
       state  A frac.x expend ratio salary frac.y verbal math  sat
1    Alabama 10    0.5   4.41  17.2   31.1      8    491  538 1029
2 California 20    0.3   4.99  24.0   41.1     45    417  485  902
3      Fakus 50    0.4     NA    NA     NA     NA     NA   NA   NA

The records are combined by the “by” variable. I.e., each record in df is matched with each record in SAT with the same value of “state.”

Things to note: If you have the same variable in each dataframe, it will keep both, and add a suffix of “.x” and “.y” to indicate where they came from.

The “all.x” means keep all records from your first dataframe (here df) even if there is no match. If you added “all.y=TRUE” then you would get all 50 states from the SAT dataframe even though df doesn’t have most of them. Try it!

You can merge on more than one variable. I.e., if you said \verb|by=c(“A”,“B”)| then it would match records if they had the same value for both A and B. See below for an example on this.

12.2.3 Summarizing/aggregating Data

Sometimes you want to collapse several cases into one. This is called aggregating. If you install a package called “dplyr” (Run install.packages( "dplyr" ) once to install, or better yet simply install tidyverse) then you will have great power.

Using newdt from above, we can summarize countries across all their time points:

newdt %>% group_by( Country ) %>% 
    summarise( mean.X = mean(X, na.rm=TRUE ),
        sd.X = sd( X, na.rm=TRUE ) )
# A tibble: 3 × 3
  Country mean.X  sd.X
  <chr>    <dbl> <dbl>
1 China     3.65  2.37
2 England  54.6   2.43
3 Morocco  34.0   2.12

You can also augment data by adding new variables. You can even do this within groups. Here we subtract the mean from each group:

dshift = newdt %>% group_by( Country ) %>%
    mutate( Xm = mean(X, na.rm=TRUE),
            Xc = X - mean(X, na.rm=TRUE ) )
head(dshift)
# A tibble: 6 × 6
# Groups:   Country [1]
  Country  Year     X X.fut    Xm    Xc
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>
1 China    1997   0.5   2    3.65 -3.15
2 China    1998   1     3.4  3.65 -2.65
3 China    1999   2     4    3.65 -1.65
4 China    2000   3.4   5.3  3.65 -0.25
5 China    2001   4     6    3.65  0.35
6 China    2002   5.3   7    3.65  1.65

12.2.4 Making Data Frames on the fly

For small datasets, you can type in data the hard way like so:

exp.dat = data.frame( ID=c("a","b","c","d"), 
      cond = c("AI","DI","DI","AI"),
            trial1 = c("E","U","U","E"),
            dec1 = c(1,1,0,1),
            trial2 = c("U","E","U","E"),
            dec2 = c(0,0,0,1),
                trial3 = c("U","E","E","U"),
            dec3 = c(0,1,0,1),
                trial4 = c("E","U","E","U"),
            dec4 = c(0,1,0,0) )
exp.dat  
  ID cond trial1 dec1 trial2 dec2 trial3 dec3 trial4 dec4
1  a   AI      E    1      U    0      U    0      E    0
2  b   DI      U    1      E    0      E    1      U    1
3  c   DI      U    0      U    0      E    0      E    0
4  d   AI      E    1      E    1      U    1      U    0

This is for an experiment on 4 subjects. The first and forth subject got the AI treatment, the second two got the DI treatment. The subjects then had 4 trials each, and they received a “E” choice or a “U” choice, and the decision variable is whether they accepted the choice.

As you can see, data can get a bit complicated!