Day 7: Missing Data

panel data
code
analysis
Author

Robert W. Walker

Published

August 16, 2022

Slides

Missing Data

use "https://github.com/robertwwalker/Essex-Data/raw/main/ISQ99-Essex.dta"
xtset
drop if AINEW==.
xtset
* Give consideration to what we should do about the Lagged DV
drop if AILAG==.
* A note about long versus flong
mi set flong
* Note that it adds some variables; what are they?
mi describe
mi misstable summarize
* There are some logical restrictions that we are going to want.
* For example, we have some changes that need to be consistent
* with levels.  We can deal with that; just calculate the changes
* after imputation.
mi register imputed DEMOC3
mi impute regress DEMOC3 PERCHPCG PERCHPOP LPOP CWARCOW IWARCOW2, add(10) dots force
* mi impute regress PCGTHOU DEMOC3 PERCHPOP LPOP CWARCOW IWARCOW2, add(20)
mi estimate: xtreg AINEW AILAG DEMOC3, fe

The Data: Poe, Tate, Keith 1999

Two objectives. Load the data and transform it into a pdata.frame for now.

knitr::opts_chunk$set(warning=FALSE, message=FALSE, comment=NA, prompt=FALSE)
library(broom)
library(plm)
library(foreign)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::between() masks plm::between()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks plm::lag(), stats::lag()
✖ dplyr::lead()    masks plm::lead()
ISQ99_Essex <- read.dta("./img/ISQ99-Essex.dta")
ISQ99p <- pdata.frame(ISQ99_Essex, c("IDORIGIN", "YEAR"))

pdata.frame

Will allow R to answer many questions that stata’s xt commands make available. First, some basic summaries to get to balance.

summary(ISQ99_Essex)
    IDORIGIN          YEAR            AI              SD            POLRT      
 Min.   :  2.0   Min.   :1976   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:290.0   1st Qu.:1980   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:2.000  
 Median :435.0   Median :1984   Median :3.000   Median :2.000   Median :3.000  
 Mean   :446.7   Mean   :1984   Mean   :2.753   Mean   :2.241   Mean   :3.809  
 3rd Qu.:640.0   3rd Qu.:1989   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:6.000  
 Max.   :990.0   Max.   :1993   Max.   :5.000   Max.   :5.000   Max.   :7.000  
                                NA's   :1061    NA's   :587     NA's   :382    
      MIL2             LEFT             BRIT            PCGNP      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :   52  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:  390  
 Median :0.0000   Median :0.0000   Median :0.0000   Median : 1112  
 Mean   :0.2725   Mean   :0.1764   Mean   :0.3554   Mean   : 3592  
 3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.: 3510  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :36670  
 NA's   :382      NA's   :393      NA's   :290      NA's   :443    
     AINEW           SDNEW           IDGURR          AILAG          SDLAG      
 Min.   :1.000   Min.   :1.000   Min.   :  2.0   Min.   :1.00   Min.   :1.000  
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:290.0   1st Qu.:1.00   1st Qu.:1.000  
 Median :2.000   Median :2.000   Median :450.0   Median :2.00   Median :2.000  
 Mean   :2.443   Mean   :2.262   Mean   :455.8   Mean   :2.45   Mean   :2.247  
 3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:663.0   3rd Qu.:3.00   3rd Qu.:3.000  
 Max.   :5.000   Max.   :5.000   Max.   :990.0   Max.   :5.00   Max.   :5.000  
 NA's   :468     NA's   :468                     NA's   :644    NA's   :644    
    PERCHPCG          PERCHPOP            LPOP          PCGTHOU      
 Min.   :-95.500   Min.   :-48.450   Min.   :11.00   Min.   : 0.050  
 1st Qu.: -2.545   1st Qu.:  0.910   1st Qu.:14.51   1st Qu.: 0.390  
 Median :  4.615   Median :  2.220   Median :15.59   Median : 1.110  
 Mean   :  4.614   Mean   :  2.193   Mean   :15.48   Mean   : 3.592  
 3rd Qu.: 11.760   3rd Qu.:  2.940   3rd Qu.:16.64   3rd Qu.: 3.510  
 Max.   :128.570   Max.   :126.010   Max.   :20.89   Max.   :36.670  
 NA's   :618       NA's   :293       NA's   :115     NA's   :443     
     DEMOC3          CWARCOW         IWARCOW2     
 Min.   : 0.000   Min.   :0.000   Min.   :0.0000  
 1st Qu.: 0.000   1st Qu.:0.000   1st Qu.:0.0000  
 Median : 0.000   Median :0.000   Median :0.0000  
 Mean   : 3.682   Mean   :0.092   Mean   :0.0862  
 3rd Qu.: 9.000   3rd Qu.:0.000   3rd Qu.:0.0000  
 Max.   :10.000   Max.   :1.000   Max.   :1.0000  
 NA's   :793      NA's   :407     NA's   :380     
table(ISQ99_Essex$IDORIGIN)

    2    20    31    40    41    42    51    52    53    55    56    57    59 
   18    18    18    18    18    18    18    18    18    18    18    18    18 
   70    90    91    92    93    94    95    96   100   101   110   115   130 
   18    18    18    18    18    18    18    18    18    18    18    18    18 
  135   140   145   150   155   160   165   200   205   210   211   212   220 
   18    18    18    18    18    18    18    18    18    18    18    18    18 
  225   230   235   260   265   290   305   310   315   316   325   333   339 
   18    18    18    18    18    18    18    18    18    18    18    18    18 
  345   346   347   348   349   350   352   355   360   365   366   367   368 
   18    18    18    18    18    18    18    18    18    18    18    18    18 
  369   370   371   372   373   374 374.1 374.2 374.3 374.4 374.5   375   380 
   18    18    18    18    18    18    18    18    18    18    18    18    18 
  385   390   395   402   403   404   411   420   432   433   434   435   436 
   18    18    18    18    18    18    18    18    18    18    18    18    18 
  437   438   439   450   451   452   461   471   475   481   482   483   484 
   18    18    18    18    18    18    18    18    18    18    18    18    18 
  490   500   501   510   516   517   520   525   530   531   540   541   551 
   18    18    18    18    18    18    18    18    18    18    18    18    18 
  552   553   560   562   570   571   572   580   581   590   591   600   615 
   18    18    18    18    18    18    18    18    18    18    18    18    18 
  616   620   625   630   640   645   651   652   660   663   666   670   678 
   18    18    18    18    18    18    18    18    18    18    18    18    18 
  680   690   692   694   696   698   700   710   712   713   732   740   750 
   18    18    18    18    18    18    18    18    18    18    18    18    18 
  760   770   771   775   780   781   790   800   811   812   815   820   821 
   18    18    18    18    18    18    18    18    18    18    18    18    18 
  830   840   850   900   910   911   914   920   950   990 
   18    18    18    18    18    18    18    18    18    18 
table(ISQ99_Essex$YEAR)

1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 
 179  179  179  179  179  179  179  179  179  179  179  179  179  179  179  179 
1992 1993 
 179  179 
is.pbalanced(ISQ99p)
[1] TRUE

Is any variable time invariant?

pvar(ISQ99p)
no time variation:       IDORIGIN BRIT IDGURR AILAG SDLAG PERCHPCG PERCHPOP 
no individual variation: YEAR AI SD POLRT MIL2 LEFT BRIT PCGNP AINEW SDNEW AILAG SDLAG PERCHPCG PERCHPOP LPOP PCGTHOU DEMOC3 CWARCOW IWARCOW2 
all NA in time dimension for at least one individual:  AI SD POLRT MIL2 LEFT BRIT PCGNP AINEW SDNEW AILAG SDLAG PERCHPCG PERCHPOP LPOP PCGTHOU DEMOC3 CWARCOW IWARCOW2 
all NA in ind. dimension for at least one time period: AI SD POLRT MIL2 LEFT BRIT PCGNP AINEW SDNEW AILAG SDLAG PERCHPCG PERCHPOP LPOP PCGTHOU DEMOC3 CWARCOW IWARCOW2 

Visualizations of Missingness

library(visdat)
vis_dat(ISQ99_Essex)

vis_miss(ISQ99_Essex)

Univariate

library(ggplot2)
library(naniar)
mplot <- gg_miss_var(ISQ99_Essex)
mplot

Bivariate

mplot <- ggplot(ISQ99_Essex, 
       aes(x = DEMOC3, 
           y = POLRT)) + 
  geom_point()
mplot

library(naniar)
mplot <- ggplot(ISQ99_Essex, 
       aes(x = DEMOC3, 
           y = POLRT)) + 
  geom_miss_point()
mplot

Multiple Imputation: Amelia II

A simple multivariate normal is easy as long as the data are well behaved. NB: This uses none of the time series or cross-sectional dimensionality for identification.

library(Amelia)
library(Zelig)
# Shorten the dataset
ISQ99.4MI <- ISQ99_Essex[,c(1,2,5:11,13:21)]
a.out <- amelia(ISQ99.4MI, m=5, ts="YEAR", cs="IDORIGIN", noms=c("MIL2","LEFT","BRIT","CWARCOW","IWARCOW2"), ords = c("AINEW","SDNEW","AILAG","SDLAG","POLRT"))
Warning: There are observations in the data that are completely missing. 
         These observations will remain unimputed in the final datasets. 
-- Imputation 1 --

  1  2  3  4  5  6  7

-- Imputation 2 --

  1  2  3  4  5  6  7  8  9

-- Imputation 3 --

  1  2  3  4  5  6  7  8

-- Imputation 4 --

  1  2  3  4  5  6  7  8  9 10

-- Imputation 5 --

  1  2  3  4  5  6

Without simplifying, it crashes because there are id variables of different sorts and other things hiding in there, perfect multicollinearities exist. Even with them, we need a bit more work.

Transforms are Key

Using the cs and ts information with polytime/splinetime and intercs

a.out2 <- amelia(ISQ99.4MI, m=5, ts="YEAR", cs="IDORIGIN", polytime=2, intercs=FALSE, noms=c("MIL2","LEFT","BRIT","CWARCOW","IWARCOW2"), ords = c("AINEW","SDNEW","AILAG","SDLAG","POLRT"), p2s=2, emburn = c(20,50))

amelia starting
beginning prep functions
Variables used:  POLRT PCGNP AINEW SDNEW AILAG SDLAG PERCHPCG PERCHPOP LPOP PCGTHOU DEMOC3 noms.MIL2.2 noms.LEFT.2 noms.BRIT.2 noms.CWARCOW.2 noms.IWARCOW2.2 time.1 time.2 
running bootstrap
-- Imputation 1 --
setting up EM chain indicies

  1(189)  2(182)  3(155)  4(89)  5(13)  6(9)  7(6)  8(1)  9(0) 10(0) 11(0) 12(0) 13(0) 14(0) 15(0) 16(0) 17(0) 18(0) 19(0) 20
(0)

 saving and cleaning

running bootstrap
-- Imputation 2 --
setting up EM chain indicies

  1(189)  2(182)  3(162)  4(87)  5(14)  6(5)  7(1)  8(0)  9(0) 10(0) 11(0) 12(0) 13(0) 14(0) 15(0) 16(0) 17(0) 18(0) 19(0) 20
(0)

 saving and cleaning

running bootstrap
-- Imputation 3 --
setting up EM chain indicies

  1(188)  2(183)  3(157)  4(76)  5(29)  6(17)  7(8)  8(0)  9(0) 10(0) 11(0) 12(0) 13(0) 14(0) 15(0) 16(0) 17(0) 18(0) 19(0) 20
(0)

 saving and cleaning

running bootstrap
-- Imputation 4 --
setting up EM chain indicies

  1(189)  2(181)  3(153)  4(60)  5(20)  6(13)  7(7)  8(2)  9(0) 10(0) 11(0) 12(0) 13(0) 14(0) 15(0) 16(0) 17(0) 18(0) 19(0) 20
(0)

 saving and cleaning

running bootstrap
-- Imputation 5 --
setting up EM chain indicies

  1(189)  2(181)  3(160)  4(90)  5(25)  6(12)  7(3)  8(0)  9(0) 10(0) 11(0) 12(0) 13(0) 14(0) 15(0) 16(0) 17(0) 18(0) 19(0) 20
(0)

 saving and cleaning
summary(a.out2)

Amelia output with 5 imputed datasets.
Return code:  1 
Message:  Normal EM convergence. 

Chain Lengths:
--------------
Imputation 1:  20
Imputation 2:  20
Imputation 3:  20
Imputation 4:  20
Imputation 5:  20

Rows after Listwise Deletion:  2144 
Rows after Imputation:  3222 
Patterns of missingness in the data:  65 

Fraction Missing for original variables: 
-----------------------------------------

         Fraction Missing
IDORIGIN       0.00000000
YEAR           0.00000000
POLRT          0.11855990
MIL2           0.11855990
LEFT           0.12197393
BRIT           0.09000621
PCGNP          0.13749224
AINEW          0.14525140
SDNEW          0.14525140
AILAG          0.19987585
SDLAG          0.19987585
PERCHPCG       0.19180633
PERCHPOP       0.09093731
LPOP           0.03569212
PCGTHOU        0.13749224
DEMOC3         0.24612042
CWARCOW        0.12631906
IWARCOW2       0.11793917

Now let’s analyze it.

Some Analysis

Not a correct model but a first start.

devtools::install_github("IQSS/ZeligChoice")
library(ZeligChoice)
Model.a2 <- zelig(AINEW ~ AILAG + MIL2 + LEFT + BRIT + CWARCOW + IWARCOW2 + PCGTHOU + PERCHPOP + DEMOC3, model = "ls", data = a.out2)
How to cite this model in Zelig:
  R Core Team. 2007.
  ls: Least Squares Regression for Continuous Dependent Variables
  in Christine Choirat, Christopher Gandrud, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
  "Zelig: Everyone's Statistical Software," https://zeligproject.org/
summary(Model.a2)
Model: Combined Imputations 

             Estimate Std.Error z value Pr(>|z|)
(Intercept)  1.040136  0.055780   18.65  < 2e-16
AILAG        0.621404  0.017965   34.59  < 2e-16
MIL2         0.110257  0.035754    3.08  0.00204
LEFT        -0.074708  0.046963   -1.59  0.11166
BRIT        -0.112423  0.030393   -3.70  0.00022
CWARCOW      0.532683  0.055780    9.55  < 2e-16
IWARCOW2     0.180371  0.071370    2.53  0.01150
PCGTHOU     -0.016131  0.003344   -4.82  1.4e-06
PERCHPOP    -0.000936  0.003508   -0.27  0.78956
DEMOC3      -0.030649  0.004348   -7.05  1.8e-12

For results from individual imputed datasets, use summary(x, subset = i:j)
Statistical Warning: The GIM test suggests this model is misspecified
 (based on comparisons between classical and robust SE's; see http://j.mp/GIMtest).
 We suggest you run diagnostics to ascertain the cause, respecify the model
 and run it again.

Next step: Use 'setx' method

More General

all_imputations <- dplyr::bind_rows(unclass(a.out2$imputations), .id = "m") %>%
  group_by(m) %>%
  nest()
all_imputations
# A tibble: 5 × 2
# Groups:   m [5]
  m     data                 
  <chr> <list>               
1 imp1  <tibble [3,222 × 18]>
2 imp2  <tibble [3,222 × 18]>
3 imp3  <tibble [3,222 × 18]>
4 imp4  <tibble [3,222 × 18]>
5 imp5  <tibble [3,222 × 18]>
models_imputations <- all_imputations %>%
  mutate(model = data %>% map(~ lm(AINEW ~ AILAG + MIL2 + LEFT + BRIT + CWARCOW + IWARCOW2 + PCGTHOU + PERCHPOP + DEMOC3, data = .)),
         tidied = model %>% map(~ tidy(., conf.int = TRUE)),
         glance = model %>% map(~ glance(.)))

models_imputations
# A tibble: 5 × 5
# Groups:   m [5]
  m     data                  model  tidied            glance           
  <chr> <list>                <list> <list>            <list>           
1 imp1  <tibble [3,222 × 18]> <lm>   <tibble [10 × 7]> <tibble [1 × 12]>
2 imp2  <tibble [3,222 × 18]> <lm>   <tibble [10 × 7]> <tibble [1 × 12]>
3 imp3  <tibble [3,222 × 18]> <lm>   <tibble [10 × 7]> <tibble [1 × 12]>
4 imp4  <tibble [3,222 × 18]> <lm>   <tibble [10 × 7]> <tibble [1 × 12]>
5 imp5  <tibble [3,222 × 18]> <lm>   <tibble [10 × 7]> <tibble [1 × 12]>
params <- models_imputations %>%
  unnest(tidied) %>%
  select(m, term, estimate, std.error) %>%
  gather(key, value, estimate, std.error) %>%
  spread(term, value)
params
# A tibble: 10 × 12
# Groups:   m [5]
   m     key      (Inte…¹  AILAG    BRIT CWARCOW   DEMOC3 IWARC…²    LEFT   MIL2
   <chr> <chr>      <dbl>  <dbl>   <dbl>   <dbl>    <dbl>   <dbl>   <dbl>  <dbl>
 1 imp1  estimate  1.08   0.607  -0.106   0.583  -0.0314   0.187  -0.0867 0.112 
 2 imp1  std.err…  0.0485 0.0136  0.0279  0.0454  0.00380  0.0456  0.0362 0.0316
 3 imp2  estimate  1.04   0.623  -0.125   0.520  -0.0334   0.201  -0.114  0.0874
 4 imp2  std.err…  0.0462 0.0129  0.0267  0.0436  0.00364  0.0443  0.0353 0.0305
 5 imp3  estimate  0.999  0.637  -0.121   0.537  -0.0281   0.106  -0.0685 0.133 
 6 imp3  std.err…  0.0455 0.0128  0.0264  0.0432  0.00364  0.0431  0.0343 0.0301
 7 imp4  estimate  1.03   0.625  -0.116   0.503  -0.0313   0.162  -0.0360 0.113 
 8 imp4  std.err…  0.0469 0.0132  0.0271  0.0458  0.00373  0.0441  0.0348 0.0312
 9 imp5  estimate  1.05   0.615  -0.0938  0.521  -0.0291   0.245  -0.0686 0.106 
10 imp5  std.err…  0.0465 0.0130  0.0269  0.0445  0.00367  0.0428  0.0351 0.0308
# … with 2 more variables: PCGTHOU <dbl>, PERCHPOP <dbl>, and abbreviated
#   variable names ¹​`(Intercept)`, ²​IWARCOW2
# ℹ Use `colnames()` to see all variable names
just_coefs <- params %>%
  filter(key == "estimate") %>%
  ungroup() %>%
  select(-m, -key)
just_coefs
# A tibble: 5 × 10
  (Inter…¹ AILAG    BRIT CWARCOW  DEMOC3 IWARC…²    LEFT   MIL2 PCGTHOU PERCHPOP
     <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>    <dbl>
1    1.08  0.607 -0.106    0.583 -0.0314   0.187 -0.0867 0.112  -0.0163 -3.91e-4
2    1.04  0.623 -0.125    0.520 -0.0334   0.201 -0.114  0.0874 -0.0142  5.02e-4
3    0.999 0.637 -0.121    0.537 -0.0281   0.106 -0.0685 0.133  -0.0162 -3.14e-3
4    1.03  0.625 -0.116    0.503 -0.0313   0.162 -0.0360 0.113  -0.0146 -1.57e-3
5    1.05  0.615 -0.0938   0.521 -0.0291   0.245 -0.0686 0.106  -0.0193 -8.16e-5
# … with abbreviated variable names ¹​`(Intercept)`, ²​IWARCOW2
just_ses <- params %>%
  filter(key == "std.error") %>%
  ungroup() %>%
  select(-m, -key)
just_ses
# A tibble: 5 × 10
  (Interce…¹  AILAG   BRIT CWARCOW  DEMOC3 IWARC…²   LEFT   MIL2 PCGTHOU PERCH…³
       <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
1     0.0485 0.0136 0.0279  0.0454 0.00380  0.0456 0.0362 0.0316 0.00255 0.00323
2     0.0462 0.0129 0.0267  0.0436 0.00364  0.0443 0.0353 0.0305 0.00250 0.00305
3     0.0455 0.0128 0.0264  0.0432 0.00364  0.0431 0.0343 0.0301 0.00246 0.00309
4     0.0469 0.0132 0.0271  0.0458 0.00373  0.0441 0.0348 0.0312 0.00251 0.00317
5     0.0465 0.0130 0.0269  0.0445 0.00367  0.0428 0.0351 0.0308 0.00248 0.00310
# … with abbreviated variable names ¹​`(Intercept)`, ²​IWARCOW2, ³​PERCHPOP
coefs_melded <- mi.meld(just_coefs, just_ses)
coefs_melded
$q.mi
     (Intercept)     AILAG       BRIT  CWARCOW      DEMOC3  IWARCOW2
[1,]    1.040136 0.6214036 -0.1124235 0.532683 -0.03064904 0.1803708
            LEFT      MIL2     PCGTHOU      PERCHPOP
[1,] -0.07470803 0.1102575 -0.01613081 -0.0009362572

$se.mi
     (Intercept)      AILAG       BRIT    CWARCOW      DEMOC3   IWARCOW2
[1,]  0.05578001 0.01796531 0.03039278 0.05577991 0.004347858 0.07137014
           LEFT       MIL2     PCGTHOU    PERCHPOP
[1,] 0.04696314 0.03575381 0.003344076 0.003508067
coefs_melded$q.mi / coefs_melded$se.mi
     (Intercept)    AILAG      BRIT  CWARCOW    DEMOC3 IWARCOW2     LEFT
[1,]    18.64711 34.58908 -3.699019 9.549729 -7.049228 2.527259 -1.59078
         MIL2   PCGTHOU   PERCHPOP
[1,] 3.083796 -4.823697 -0.2668869