= lm(mpg ~ disp, data=mtcars) #something slow
myBigThing saveRDS(myBigThing, savedPath)
## Later on:
<- readRDS(savedPath) myBigThing
12 Tips, Tricks, and Debugging in R
This chapter is a complete hodge-podge of small things that tend to bite students, especially when they are dealing with messy data that they might have for final projects.
In particular, as you learn R, there’s lots of good tricks you’ll never know about until somebody shows you. Clean code is one such good trick; consider the following: “Your most important collaborator is you from 6 months ago. Unfortunately, you can’t ask that-you any questions, because they don’t answer their email.”
So onwards with a few things that you may find useful, either now or later.
12.1 Some principles to live by
12.1.1 Watch Tricky letter and number confusion in code
The letter “l” looks like the number “1”—watch out for that. Things like “mylm” are usually all letters, with “lm” standing for linear model.
12.1.2 Write in a good R style
Try to do the following
- Comment your code!
- Structure your R file like so:
- Descriptive comments (including date)
- Load libraries
- Constants and script parameters (# iterations, etc.)
- Functions (with descriptive comment after first line)
- Everything else
- Naming: variableName / variable.name, FunctionVerb, kConstantName. not_like_this
- Indentation: 2-space indents is nice
- Spaces are ok, but don’t go overboard. E.g.,
y = (a * x) + b + rnorm(1, sd=sigma)
- Never use
attach()
12.1.3 Save and load R objects to save time
If you have the result of something that took awhile to run (e.g., a big multilevel model fit to a lot of data) you can try saving it like so:
12.1.4 Reproduce randomness with set.seed
If your code uses random numbers, then you should set your seed, which makes your script always generate the same sequence of random numbers.
For example, say your code had this:
tryCatch({(1:(1:10)[rpois(1, 3)])}, error=function(e){(e)}) #works?
[1] 1 2 3
set.seed(97)
tryCatch({(1:(1:10)[rpois(1, 3)])}, error=function(e){(e)}) #fails!
<simpleError in 1:(1:10)[rpois(1, 3)]: argument of length 0>
(Note the tryCatch()
method is a way of generating errors and not crashing.)
Key thing to know: Reproducible results help with debugging.
If you want to get fancy, try this (after installing the `TeachingDemos’ package):
::char2seed("quinn")
TeachingDemos# Using your name as a seed says "nothing up my sleeve"
12.1.5 Keep your files organized
Ever seen this?
- /My Documents
- my paper.tex
- my paper draft 2.tex
- my paper final.tex
- my paper final revised.tex
- my paper final revised 2.tex
- script.r
- script 2.r
- data.csv
Try instead something like:
- /stat 166-Small Data Analysis
- stat 166.rproj
- /Empty Project
- /code
- /data
- /text
- /figures
- readme.txt
- /HW1
- …
Your readme.txt
might have informational notes such as “Got data from bit.ly/XYZ.” to remind you of what you were up to.
Your figures
folder should be full of figures you can easily regenerate with code in your code
folder.
12.1.6 Make sure your data are numeric
Sometimes when you load data in, R does weird things like decide all your numbers are actually words. This happens if some of your entries are not numbers. Then R makes them all not numbers. You can check this with the str()
function:
str( exp.dat )
'data.frame': 4 obs. of 10 variables:
$ ID : chr "a" "b" "c" "d"
$ cond : chr "AI" "DI" "DI" "AI"
$ trial1: chr "E" "U" "U" "E"
$ dec1 : num 1 1 0 1
$ trial2: chr "U" "E" "U" "E"
$ dec2 : num 0 0 0 1
$ trial3: chr "U" "E" "E" "U"
$ dec3 : num 0 1 0 1
$ trial4: chr "E" "U" "E" "U"
$ dec4 : num 0 1 0 0
Here we see that we have factors (categorical variables) and numbers (num). All is well.
If something should be a number, then change it like so:
<- c( 1, 2, 3, "dog", 5, 6 )
lst str( lst )
chr [1:6] "1" "2" "3" "dog" "5" "6"
<- as.numeric( lst ) lst
Warning: NAs introduced by coercion
lst
[1] 1 2 3 NA 5 6
str( lst )
num [1:6] 1 2 3 NA 5 6
Note it warned you that you had non-numbers when you converted. The non-numbers are now missing (NA).
For a dataframe, you fix like this:
$trial1 = as.numeric( exp.dat$trial1 ) exp.dat
Warning: NAs introduced by coercion
12.1.7 Categories should be words
For categorical variables, don’t use numbers, if at all possible. E.g.,
= c( "Low", "Middle", "High", NA ) levels
is better than
= c(1, 2, 3, 99 ) levels
12.2 Data Wrangling
We next give some high level data wrangling advice. But really, check out R for DS for much more and much better on the merging and summarizing topics.
12.2.1 Handling Lagged Data
Sometimes you have multiple times for your units (think country or state), and you want to regress, say, future X on current X. Then you want to have both future and current X for each case.
Here think of a case as a Country at a point in time. E.g., we might have data like this:
= read.csv( "data/fake_country_block.csv", as.is=TRUE )
dtw = pivot_longer( dtw, cols=X1997:X2004,
dt names_to = "Year", names_prefix = "X",
values_to = "X" )
$Year = as.numeric( dt$Year )
dtslice_sample( dt, n=5 )
# A tibble: 5 × 3
Country Year X
<chr> <dbl> <dbl>
1 China 2000 3.4
2 England 1999 53
3 China 2003 6
4 Morocco 1997 31.9
5 England 2003 57.3
We then want to know what the X will be 2 years in the future. We can do this with the following trick:
= dt
dt.fut $Year = dt.fut$Year - 2
dt.futhead(dt.fut)
# A tibble: 6 × 3
Country Year X
<chr> <dbl> <dbl>
1 China 1995 0.5
2 China 1996 1
3 China 1997 2
4 China 1998 3.4
5 China 1999 4
6 China 2000 5.3
= left_join( dt, dt.fut,
newdt by=c("Country","Year"), suffix=c("",".fut") )
head( newdt, 10 )
# A tibble: 10 × 4
Country Year X X.fut
<chr> <dbl> <dbl> <dbl>
1 China 1997 0.5 2
2 China 1998 1 3.4
3 China 1999 2 4
4 China 2000 3.4 5.3
5 China 2001 4 6
6 China 2002 5.3 7
7 China 2003 6 NA
8 China 2004 7 NA
9 Morocco 1997 31.9 33
10 Morocco 1998 32 34
Here we are merging records that match Country and Year.
Note that for the final two China entries, we don’t have a future X value. The merge will make it NA indicating it is missing.
How this works: we are tricking the program. We are making a new \verb|dt.lag| data.frame and then putting all the entries into the past by two years. When we merge, and match on Country and Year, the current dataframe and the lagged dataframe get lined up by this shift. Clever, no?
Now we could do regression:
= lm( X.fut ~ X + Country, data=newdt )
my.lm summary( my.lm )
Call:
lm(formula = X.fut ~ X + Country, data = newdt)
Residuals:
Min 1Q Median 3Q Max
-0.5869 -0.2610 0.0107 0.2753 0.5137
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.8684 0.2128 8.78 2.7e-06 ***
X 1.0179 0.0582 17.48 2.3e-09 ***
CountryEngland -0.8259 2.9704 -0.28 0.79
CountryMorocco -0.7514 1.7603 -0.43 0.68
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.351 on 11 degrees of freedom
(9 observations deleted due to missingness)
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.13e+04 on 3 and 11 DF, p-value: <2e-16
12.2.2 Quick overview of merging data
Often you have two datasets that you want to merge. For example, say you want to merge some data you have on a few states with some SAT information from the mosaic package.
library( mosaicData )
data( SAT )
head( SAT )
state expend ratio salary frac verbal math sat
1 Alabama 4.41 17.2 31.1 8 491 538 1029
2 Alaska 8.96 17.6 48.0 47 445 489 934
3 Arizona 4.78 19.3 32.2 27 448 496 944
4 Arkansas 4.46 17.1 28.9 6 482 523 1005
5 California 4.99 24.0 41.1 45 417 485 902
6 Colorado 5.44 18.4 34.6 29 462 518 980
= data.frame( state=c("Alabama","California","Fakus"),
df A=c(10,20,50),
frac=c(0.5, 0.3, 0.4) )
df
state A frac
1 Alabama 10 0.5
2 California 20 0.3
3 Fakus 50 0.4
merge( df, SAT, by="state", all.x=TRUE )
state A frac.x expend ratio salary frac.y verbal math sat
1 Alabama 10 0.5 4.41 17.2 31.1 8 491 538 1029
2 California 20 0.3 4.99 24.0 41.1 45 417 485 902
3 Fakus 50 0.4 NA NA NA NA NA NA NA
The records are combined by the “by” variable. I.e., each record in df is matched with each record in SAT with the same value of “state.”
Things to note: If you have the same variable in each dataframe, it will keep both, and add a suffix of “.x” and “.y” to indicate where they came from.
The “all.x” means keep all records from your first dataframe (here df) even if there is no match. If you added “all.y=TRUE” then you would get all 50 states from the SAT dataframe even though df doesn’t have most of them. Try it!
You can merge on more than one variable. I.e., if you said \verb|by=c(“A”,“B”)| then it would match records if they had the same value for both A and B. See below for an example on this.
12.2.3 Summarizing/aggregating Data
Sometimes you want to collapse several cases into one. This is called aggregating. If you install a package called “dplyr” (Run install.packages( "dplyr" )
once to install, or better yet simply install tidyverse
) then you will have great power.
Using newdt
from above, we can summarize countries across all their time points:
%>% group_by( Country ) %>%
newdt summarise( mean.X = mean(X, na.rm=TRUE ),
sd.X = sd( X, na.rm=TRUE ) )
# A tibble: 3 × 3
Country mean.X sd.X
<chr> <dbl> <dbl>
1 China 3.65 2.37
2 England 54.6 2.43
3 Morocco 34.0 2.12
You can also augment data by adding new variables. You can even do this within groups. Here we subtract the mean from each group:
= newdt %>% group_by( Country ) %>%
dshift mutate( Xm = mean(X, na.rm=TRUE),
Xc = X - mean(X, na.rm=TRUE ) )
head(dshift)
# A tibble: 6 × 6
# Groups: Country [1]
Country Year X X.fut Xm Xc
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 China 1997 0.5 2 3.65 -3.15
2 China 1998 1 3.4 3.65 -2.65
3 China 1999 2 4 3.65 -1.65
4 China 2000 3.4 5.3 3.65 -0.25
5 China 2001 4 6 3.65 0.35
6 China 2002 5.3 7 3.65 1.65
12.2.4 Making Data Frames on the fly
For small datasets, you can type in data the hard way like so:
= data.frame( ID=c("a","b","c","d"),
exp.dat cond = c("AI","DI","DI","AI"),
trial1 = c("E","U","U","E"),
dec1 = c(1,1,0,1),
trial2 = c("U","E","U","E"),
dec2 = c(0,0,0,1),
trial3 = c("U","E","E","U"),
dec3 = c(0,1,0,1),
trial4 = c("E","U","E","U"),
dec4 = c(0,1,0,0) )
exp.dat
ID cond trial1 dec1 trial2 dec2 trial3 dec3 trial4 dec4
1 a AI E 1 U 0 U 0 E 0
2 b DI U 1 E 0 E 1 U 1
3 c DI U 0 U 0 E 0 E 0
4 d AI E 1 E 1 U 1 U 0
This is for an experiment on 4 subjects. The first and forth subject got the AI treatment, the second two got the DI treatment. The subjects then had 4 trials each, and they received a “E” choice or a “U” choice, and the decision variable is whether they accepted the choice.
As you can see, data can get a bit complicated!