5 Summarizing and exploring data

This chapter gives a brief introduction to a variety of packages for summarizing variables. We being by using ggplot() to make a few simple plots and then turn to making summary tables. These tools are useful in general for exploring and describing data, and they may be useful for final projects and other things as well.

5.1 National Youth Survey Example

Our running example is the National Youth Survey (NYS) data as described in Raudenbush and Bryk, page 190. This data comes from a survey in which the same students were asked yearly about their acceptance of 9 “deviant” behaviors (such as smoking marijuana, stealing, etc.). The study began in 1976, and followed two cohorts of children, starting at ages 11 and 14 respectively. We will analyze the first 5 years of data.

At each time point, we have measures of:

ATTIT, the attitude towards deviance, with higher numbers implying higher tolerance for deviant behaviors.
EXPO, the “exposure”, based on asking the children how many friends they had who had engaged in each of the “deviant” behaviors.

Both of these variables have been transformed to a logarithmic scale to reduce skew.

For each student, we have:

Gender (binary)
Minority status (binary)
Family income, in units of $10K (this can be either categorical or continuous).

We’ll focus on the first cohort, from ages 11-15. First, let’s read the data. Note that this data frame is in “wide format”. That is, there is only one row for each student, with all the different observations for that student in different columns of that one row.

nyswide <- read_csv("data/nyswide.csv")
head(nyswide)

# A tibble: 6 × 14
     ID ATTIT.11 EXPO.11 ATTIT.12 EXPO.12 ATTIT.13 EXPO.13 ATTIT.14 EXPO.14
  <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>
1     3     0.11   -0.37     0.2    -0.27     0      -0.37     0      -0.27
2     8     0.29    0.42     0.29    0.2      0.11    0.42     0.51    0.2 
3     9     0.8     0.47     0.58    0.52     0.64    0.2      0.75    0.47
4    15     0.44    0.07     0.44    0.32     0.89    0.47     0.75    0.26
5    33     0.2    -0.27     0.64   -0.27     0.69   -0.27    NA      NA   
6    45     0.11    0.26     0.37   -0.17     0.37    0.14     0.37    0.14
# ℹ 5 more variables: ATTIT.15 <dbl>, EXPO.15 <dbl>, FEMALE <dbl>,
#   MINORITY <dbl>, INCOME <dbl>

Generally, we would want such data in “long format”, i.e. each student has multiple rows for the different observations. The pivot_longer() command does this for us.

nys1 <- nyswide |> 
  pivot_longer(ATTIT.11:EXPO.15, names_to = "score") |> 
  mutate(outcome = word(score, 1, 1, sep = "\\."),
         age = as.numeric(word(score, 2, 2, sep = "\\.")),
         age_fac = factor(age)) |> 
  select(-score) |> 
  pivot_wider(names_from = outcome) |> 
  # drop missing ATTIT values
  drop_na(ATTIT)

head( nys1 )

# A tibble: 6 × 8
     ID FEMALE MINORITY INCOME   age age_fac ATTIT  EXPO
  <dbl>  <dbl>    <dbl>  <dbl> <dbl> <fct>   <dbl> <dbl>
1     3      1        0      3    11 11       0.11 -0.37
2     3      1        0      3    12 12       0.2  -0.27
3     3      1        0      3    13 13       0    -0.37
4     3      1        0      3    14 14       0    -0.27
5     3      1        0      3    15 15       0.11 -0.17
6     8      0        0      4    11 11       0.29  0.42

Just to get a sense of the data, let’s plot each age as a boxplot

  ggplot(nys1, aes(age_fac, ATTIT)) +
    geom_boxplot() + 
    labs(title = "Boxplot of attitude towards deviance by age", 
         x = "Age", y = "Attitude towards deviance")

Note: The boxplot’s “x” variable is the group. You get one box per group. The “y” variable is the data we are making boxplots of.

Note some features of the data:

First, we see that ATTIT goes up over time.
Second, we see the variation of points also goes up over time. This is evidence of heteroskedasticity.

If we plot individual lines, grouped by gender and minority status, we have:

nys1 |> 
  drop_na() |> 
  ggplot(aes(age, ATTIT, group=ID)) +
  facet_grid( FEMALE ~ MINORITY ) +
    geom_line(alpha=0.2, position = "jitter") + 
    labs(title = "Individual trajectories of attitude towards deviance over time",
         x = "Age",
         y ="Attitude towards deviance")

If we squint, we can kind of see correlation of residuals: some students have systematically lower trajectories and some students have systematically higher trajectories (although there is a lot of bouncing around).

5.2 Tabulating data (Categorical variables)

We can tabulate data as so:

table(nys1$age)


 11  12  13  14  15 
202 209 230 220 218

table(nys1$MINORITY, nys1$age)

   
     11  12  13  14  15
  0 159 165 182 175 175
  1  43  44  48  45  43

Interestingly, we have more observations for later ages.

We can make “proportion tables” as well:

prop.table( table( nys1$MINORITY, nys1$INCOME  ), margin=1 )

   
          1       2       3       4       5       6       7       8       9
  0 0.06075 0.13551 0.18341 0.18107 0.14369 0.10981 0.06893 0.05257 0.00935
  1 0.28251 0.41704 0.12556 0.05830 0.05830 0.02242 0.01345 0.00000 0.00000
   
         10
  0 0.05491
  1 0.02242

The margin determines what adds up to 100%.

5.3 Summary statistics for continuous variables

The tableone package is useful:

  library(tableone)
  
# sample mean  
  CreateTableOne(data = nys1,
                 vars = c("ATTIT"))

                   
                    Overall    
  n                 1079       
  ATTIT (mean (SD)) 0.33 (0.27)

# you can also stratify by a variables of interest
  CreateTableOne(data = nys1,
                 vars = c("ATTIT"), 
                 strata = c("FEMALE"))

                   Stratified by FEMALE
                    0           1           p      test
  n                  559         520                   
  ATTIT (mean (SD)) 0.37 (0.27) 0.29 (0.27) <0.001

# you can also include binary variables
  CreateTableOne(data = nys1, 
                 vars = c("ATTIT", "age_fac"),  # include both binary and continuous variables here
                 factorVars = c("age_fac"), # include only binary variables here
                 strata = c("FEMALE"))

                   Stratified by FEMALE
                    0            1            p      test
  n                  559          520                    
  ATTIT (mean (SD)) 0.37 (0.27)  0.29 (0.27)  <0.001     
  age_fac (%)                                  0.991     
     11              106 (19.0)    96 (18.5)             
     12              105 (18.8)   104 (20.0)             
     13              119 (21.3)   111 (21.3)             
     14              115 (20.6)   105 (20.2)             
     15              114 (20.4)   104 (20.0)

5.4 Descriptive Statistics with the `psych` Package

Another package for obtaining detailed descriptive statistics for your data is the psych package in R, which has describe(), a function that generates a comprehensive summary of each variable in your dataset.

If you haven’t already installed the psych package, you can do so using install.packages(). You then load the library as so:

# install.packages("psych")
library(psych)

The describe() function provides descriptive statistics such as mean, standard deviation, skewness, and kurtosis for each variable in your dataset.

Here’s an example using a built-in dataset, iris:

summary_stats <- describe(nys1)
print(summary_stats)

         vars    n   mean     sd median trimmed    mad   min     max   range
ID          1 1079 841.47 483.55 851.00  839.79 597.49  3.00 1720.00 1717.00
FEMALE      2 1079   0.48   0.50   0.00    0.48   0.00  0.00    1.00    1.00
MINORITY    3 1079   0.21   0.41   0.00    0.13   0.00  0.00    1.00    1.00
INCOME      4 1079   4.10   2.35   4.00    3.87   2.97  1.00   10.00    9.00
age         5 1079  13.04   1.40  13.00   13.05   1.48 11.00   15.00    4.00
age_fac*    6 1079   3.04   1.40   3.00    3.05   1.48  1.00    5.00    4.00
ATTIT       7 1079   0.33   0.27   0.29    0.31   0.27  0.00    1.24    1.24
EXPO        8 1079   0.00   0.30  -0.09   -0.03   0.27 -0.37    1.04    1.41
          skew kurtosis    se
ID        0.01    -1.18 14.72
FEMALE    0.07    -2.00  0.02
MINORITY  1.45     0.09  0.01
INCOME    0.79     0.03  0.07
age      -0.04    -1.27  0.04
age_fac* -0.04    -1.27  0.04
ATTIT     0.63    -0.36  0.01
EXPO      0.88     0.31  0.01

The describe() function generates a table with the following columns:

vars: The variable number.
n: Number of valid cases.
mean: The mean of the variable.
sd: The standard deviation.
median: The median of the variable.
trimmed: The mean after trimming 10% of the data from both ends.
mad: The median absolute deviation (a robust estimate of the variability).
min: The minimum value.
max: The maximum value.
range: The range (max - min).
skew: The skewness (measure of asymmetry).
kurtosis: The kurtosis (measure of peakedness).
se: The standard error.

5.5 The `skimr` Package

Yet another package that provides a comprehensive summary of your data is the skimr package. This package is more about exploring data in the moment, and less about report generation, however.

One warning is skimr can generate special characters that can crash a R markdown report in some cases–so if you are using it, and getting weird errors when trying to render your reports, try commenting out the skim() call. Using it is simple:

skimr::skim( nys1 )

Data summary
Name	nys1
Number of rows	1079
Number of columns	8
_______________________
Column type frequency:
factor	1
numeric	7
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
age_fac	0	1	FALSE	5	13: 230, 14: 220, 15: 218, 12: 209

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ID	1	841.47	483.55	3.00	422.00	851.00	1242.00	1720.00	▇▇▆▇▆
FEMALE	1	0.48	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▇
MINORITY	1	0.21	0.41	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
INCOME	1	4.10	2.35	1.00	2.00	4.00	5.00	10.00	▇▇▅▂▂
age	1	13.04	1.40	11.00	12.00	13.00	14.00	15.00	▇▇▇▇▇
ATTIT	1	0.33	0.27	0.00	0.11	0.29	0.51	1.24	▇▅▃▂▁
EXPO	1	0.00	0.30	-0.37	-0.27	-0.09	0.20	1.04	▇▃▃▁▁

5.6 Summarizing by group

To plot summaries by group, first aggregate your data, and plot the results. Do like so:

aggdat = nys1 %>% 
  group_by( ID, FEMALE, MINORITY) %>%
  summarize( avg.ATTIT = mean( ATTIT, na.rm=TRUE ),
             n_obs = n(), .groups="drop" )

head( aggdat )

# A tibble: 6 × 5
     ID FEMALE MINORITY avg.ATTIT n_obs
  <dbl>  <dbl>    <dbl>     <dbl> <int>
1     3      1        0     0.084     5
2     8      0        0     0.378     5
3     9      0        0     0.75      5
4    15      0        0     0.664     5
5    33      1        0     0.41      4
6    45      1        0     0.382     5

As shown above, you can include level 2 variables in your group_by() command to ensure they get carried through to the aggregated results. Neat trick.

Anyway, we then plot:

ggplot( aggdat, aes(avg.ATTIT) ) +
  geom_histogram( binwidth = 0.05 ) +
  labs(main = "Average ATTIT across students", 
       xlab = "" )

We can facet to see multiple groups:

ggplot( aggdat, aes(avg.ATTIT) ) +
  facet_grid( FEMALE ~ MINORITY, labeller = label_both ) +
  geom_histogram( binwidth = 0.05 ) +
  labs(main = "Average ATTIT across students", 
       xlab = "" )