2 Descriptive Statistics

2.1 Descriptive Statistics


use simulated_multilevel_data.dta // use data

We use summarize for continuous variables, and tabulate for categorical variables.

summarize outcome warmth physical_punishment HDI

tabulate identity

tabulate intervention

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     outcome |      3,000    52.43327    6.530996   29.60798   74.83553
      warmth |      3,000    3.521667    1.888399          0          7
physical_p~t |      3,000    2.478667    1.360942          0          5
         HDI |      3,000    64.76667    17.24562         33         87


hypothetica |
 l identity |
      group |
   variable |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,507       50.23       50.23
          2 |      1,493       49.77      100.00
------------+-----------------------------------
      Total |      3,000      100.00


   recieved |
interventio |
          n |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,547       51.57       51.57
          1 |      1,453       48.43      100.00
------------+-----------------------------------
      Total |      3,000      100.00

library(haven) # read data in Stata format

df <- read_dta("simulated_multilevel_data.dta")

R’s descriptive statistics functions rely heavily on whether a variable is numeric or a factor. Below, I convert three variables to factors (factor) before using summary¹ to generate descriptive statistics.

df$country <- factor(df$country)

df$identity <- factor(df$identity)

df$intervention <- factor(df$intervention)

summary(df)

    country          HDI            family            id            identity
 1      : 100   Min.   :33.00   Min.   :  1.00   Length:3000        1:1507  
 2      : 100   1st Qu.:53.00   1st Qu.: 25.75   Class :character   2:1493  
 3      : 100   Median :70.00   Median : 50.50   Mode  :character           
 4      : 100   Mean   :64.77   Mean   : 50.50                              
 5      : 100   3rd Qu.:81.00   3rd Qu.: 75.25                              
 6      : 100   Max.   :87.00   Max.   :100.00                              
 (Other):2400                                                               
 intervention physical_punishment     warmth         outcome     
 0:1547       Min.   :0.000       Min.   :0.000   Min.   :29.61  
 1:1453       1st Qu.:2.000       1st Qu.:2.000   1st Qu.:48.02  
              Median :2.000       Median :4.000   Median :52.45  
              Mean   :2.479       Mean   :3.522   Mean   :52.43  
              3rd Qu.:3.000       3rd Qu.:5.000   3rd Qu.:56.86  
              Max.   :5.000       Max.   :7.000   Max.   :74.84
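Unlike Stata's summarize, summary in R does not report standard deviations. Below is a minimal sketch of one way to add them using base R's sapply, mean, and sd. The toy data frame and its values are hypothetical stand-ins for df, chosen only so that the example runs on its own.

```r
# Toy data standing in for df (hypothetical values, not the book's data)
toy <- data.frame(
  outcome = c(50.2, 55.1, 48.7, 53.9),
  warmth = c(3, 5, 2, 4),
  intervention = factor(c(0, 1, 0, 1))
)

# Keep only the numeric columns; factors have no mean or SD
numeric_vars <- toy[sapply(toy, is.numeric)]

# One row per variable, with its mean and standard deviation
stats <- data.frame(
  mean = sapply(numeric_vars, mean),
  sd = sapply(numeric_vars, sd)
)

print(stats)
```

With the real data, the same pattern should apply by replacing toy with df after the factor conversions above.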

using Tables, MixedModels, MixedModelsExtras, StatFiles, DataFrames, CategoricalArrays, DataFramesMeta

df = DataFrame(load("simulated_multilevel_data.dta"))

As in R, descriptive statistics in Julia depend on variable type. I use the @transform! macro to convert the appropriate variables to categorical variables.

@transform!(df, :country = categorical(:country))

@transform!(df, :identity = categorical(:identity))

@transform!(df, :intervention = categorical(:intervention))


describe(df) # descriptive statistics

9×7 DataFrame
 Row │ variable             mean     min     median  max      nmissing  eltype ⋯
     │ Symbol               Union…   Any     Union…  Any      Int64     Union  ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ country                       1.0             30.0            0  Union{ ⋯
   2 │ HDI                  64.7667  33.0    70.0    87.0            0  Union{
   3 │ family               50.5     1.0     50.5    100.0           0  Union{
   4 │ id                            1.1             9.99            0  Union{
   5 │ identity                      1.0             2.0             0  Union{ ⋯
   6 │ intervention                  0.0             1.0             0  Union{
   7 │ physical_punishment  2.47867  0.0     2.0     5.0             0  Union{
   8 │ warmth               3.52167  0.0     4.0     7.0             0  Union{
   9 │ outcome              52.4333  29.608  52.449  74.8355         0  Union{ ⋯
                                                                1 column omitted

2.2 Interpretation

Examining descriptive statistics is an important first step in any analysis, and should come before more sophisticated analyses, such as multilevel models.

The descriptive statistics for this data give us a sense of the scale, range, and type of each variable.

outcome has a mean of approximately 52 and ranges from approximately 30 to 75.
warmth and physical punishment both represent the number of times in a week that parents use each of these forms of discipline. The former averages about 3.5, and the latter about 2.5.
HDI, the Human Development Index, has an average of about 65 and a wide range, from 33 to 87.
identity is a categorical variable for a hypothetical identity group, and has values of 1 and 2.
intervention is also a categorical variable, and has values of 0 and 1.
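For the categorical variables, R's summary reports counts but not the percentages that Stata's tabulate shows. A minimal sketch of one way to recover them with base R's table and prop.table follows. The toy factor is a hypothetical stand-in for a column such as df$intervention.

```r
# Toy factor standing in for df$intervention (hypothetical values)
intervention <- factor(c(0, 1, 1, 0, 1))

# Frequency of each level
counts <- table(intervention)

# Share of each level, expressed as a percentage
percents <- round(100 * prop.table(counts), 2)

# Combine counts and percentages into one small table
print(cbind(Freq = counts, Percent = percents))
```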

¹ skimr is an excellent new alternative library for generating descriptive statistics in R.