Slides

The Data

The data are in Stata format. The github for this course. There is also a link here for a local copy.

Basic R commands

Loading the data in R and a summary using skimr::skim.

library(haven); library(kableExtra)
HR.Data <- read_dta(url("https://github.com/robertwwalker/DADMStuff/raw/master/ISQ99-Essex.dta"))
library(skimr)
skim(HR.Data) %>% kable() %>% scroll_box(width="80%", height="50%")

skim_type	skim_variable	n_missing	complete_rate	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100	numeric.hist
numeric	IDORIGIN	0	1.0000000	446.7178771	243.1931782	2.00	290.000	435.000	640.00	990.00	▆▇▇▆▂
numeric	YEAR	0	1.0000000	1984.5000000	5.1889328	1976.00	1980.000	1984.500	1989.00	1993.00	▇▆▇▆▇
numeric	AI	1061	0.6707014	2.7533549	1.0752989	1.00	2.000	3.000	3.00	5.00	▃▇▇▃▂
numeric	SD	587	0.8178150	2.2406072	1.1303528	1.00	1.000	2.000	3.00	5.00	▇▇▆▂▁
numeric	POLRT	382	0.8814401	3.8095070	2.2230297	1.00	2.000	3.000	6.00	7.00	▇▂▂▁▇
numeric	MIL2	382	0.8814401	0.2725352	0.4453421	0.00	0.000	0.000	1.00	1.00	▇▁▁▁▃
numeric	LEFT	393	0.8780261	0.1763874	0.3812168	0.00	0.000	0.000	0.00	1.00	▇▁▁▁▂
numeric	BRIT	290	0.9099938	0.3553888	0.4787126	0.00	0.000	0.000	1.00	1.00	▇▁▁▁▅
numeric	PCGNP	443	0.8625078	3591.6509536	5698.3554010	52.00	390.000	1112.000	3510.00	36670.00	▇▁▁▁▁
numeric	AINEW	468	0.8547486	2.4433551	1.1558005	1.00	1.000	2.000	3.00	5.00	▇▇▇▃▂
numeric	SDNEW	468	0.8547486	2.2618010	1.1365604	1.00	1.000	2.000	3.00	5.00	▇▇▆▂▁
numeric	IDGURR	0	1.0000000	455.7709497	246.5201369	2.00	290.000	450.000	663.00	990.00	▆▇▇▇▃
numeric	AILAG	644	0.8001241	2.4499612	1.1479673	1.00	1.000	2.000	3.00	5.00	▇▇▇▃▂
numeric	SDLAG	644	0.8001241	2.2470908	1.1156632	1.00	1.000	2.000	3.00	5.00	▇▇▆▂▁
numeric	PERCHPCG	618	0.8081937	4.6138441	13.2208934	-95.50	-2.545	4.615	11.76	128.57	▁▂▇▁▁
numeric	PERCHPOP	293	0.9090627	2.1928815	4.0424128	-48.45	0.910	2.220	2.94	126.01	▁▇▁▁▁
numeric	LPOP	115	0.9643079	15.4819279	1.8633316	11.00	14.510	15.590	16.64	20.89	▂▃▇▃▁
numeric	PCGTHOU	443	0.8625078	3.5916985	5.6983334	0.05	0.390	1.110	3.51	36.67	▇▁▁▁▁
numeric	DEMOC3	793	0.7538796	3.6817620	4.3577178	0.00	0.000	0.000	9.00	10.00	▇▁▁▂▃
numeric	CWARCOW	407	0.8736809	0.0920071	0.2890873	0.00	0.000	0.000	0.00	1.00	▇▁▁▁▁
numeric	IWARCOW2	380	0.8820608	0.0862069	0.2807187	0.00	0.000	0.000	0.00	1.00	▇▁▁▁▁

An `xtsum` function for R

A little function that I wrote up on github.

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()     masks stats::filter()
✖ dplyr::group_rows() masks kableExtra::group_rows()
✖ dplyr::lag()        masks stats::lag()

library(plm)


Attaching package: 'plm'

The following objects are masked from 'package:dplyr':

    between, lag, lead

source(url("https://raw.githubusercontent.com/robertwwalker/DADMStuff/master/xtsum.R"))
# Be careful with the ID variable, the safest is to make it factor; this can go wildly wrong
xtsum(IDORIGIN~., data=HR.Data) %>% kable() %>% scroll_box(width="80%", height="50%")

Note: Using an external vector in selections is ambiguous.
ℹ Use `all_of(unit)` instead of `unit` to silence this message.
ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.

	O.mean	O.sd	O.min	O.max	O.SumSQ	O.N	B.mean	B.sd	B.min	B.max	B.Units	B.t.bar	W.sd	W.min	W.max	W.SumSQ	Within.Ovr.Ratio
YEAR	1984.5	5.189	1976	1993	86725.5	3222	1984.5	0	1984.5	1984.5	179	18	5.189	-8.5	8.5	86725.5	1
AI	2.753	1.075	1	5	2497.538	2161	2.498	0.989	1	5	173	12.491	0.631	-2.375	2.5625	860.822	0.345
SD	2.241	1.13	1	5	3365.455	2635	2.241	1.004	1	5	178	14.803	0.624	-2.666667	3.0625	1025.695	0.305
POLRT	3.81	2.223	1	7	14029.94	2840	3.78	1.99	1	7	179	15.866	0.925	-4	4.777778	2428.552	0.173
MIL2	0.273	0.445	0	1	563.058	2840	0.24	0.377	0	1	179	15.866	0.216	-0.9444444	0.8888889	132.778	0.236
LEFT	0.176	0.381	0	1	410.983	2829	0.157	0.334	0	1	179	15.804	0.157	-0.8888889	0.8888889	69.611	0.169
BRIT	0.355	0.479	0	1	671.685	2932	0.335	0.473	0	1	179	16.38	0	0	0	0	0
PCGNP	3591.651	5698.355	52	36670	90205144379	2779	3449.178	5049.297	112.2222	22653.89	173	16.064	2278.412	-12303.33	16961.67	14421042273	0.16
AINEW	2.443	1.156	1	5	3677.663	2754	2.379	1.012	1	5	178	15.472	0.622	-2.388889	2.944444	1064.102	0.289
SDNEW	2.262	1.137	1	5	3556.241	2754	2.253	1.006	1	5	178	15.472	0.631	-2.588235	3	1096.442	0.308
IDGURR	455.771	246.52	2	990	195747185	3222	455.771	247.173	2	990	179	18	0	0	0	0	0
AILAG	2.45	1.148	1	5	3396.045	2578	2.402	1.039	1	5	177	14.565	0.609	-2.411765	3	955.37	0.281
SDLAG	2.247	1.116	1	5	3207.603	2578	2.236	0.991	1	5	177	14.565	0.608	-2.5	3.058824	952.174	0.297
PERCHPCG	4.614	13.221	-95.5	128.57	454983.6	2604	3.325	6.893	-36.21333	15.03765	168	15.5	12.393	-92.50235	114.8882	399763	0.879
PERCHPOP	2.193	4.042	-48.45	126.01	47846.75	2929	2.842	9.443	-2.126471	126.01	176	16.642	3.018	-48.12235	80.69765	26663.59	0.557
LPOP	15.482	1.863	11	20.89	10784.05	3107	15.488	1.844	11.09056	20.76889	177	17.554	0.129	-0.7288889	0.7311111	51.883	0.005
PCGTHOU	3.592	5.698	0.05	36.67	90204.45	2779	3.449	5.049	0.1122222	22.65389	173	16.064	2.278	-12.30333	16.96167	14420.95	0.16
DEMOC3	3.682	4.358	0	10	46107	2429	3.774	3.96	0	10	155	15.671	1.726	-7.277778	7.941176	7229.815	0.157
CWARCOW	0.092	0.289	0	1	235.17	2815	0.095	0.245	0	1	179	15.726	0.175	-0.8888889	0.9444444	85.693	0.364
IWARCOW2	0.086	0.281	0	1	223.879	2842	0.092	0.227	0	1	179	15.877	0.19	-0.8888889	0.9444444	102.992	0.46

The Core Idea

In R, this is an essential group_by calculation in the tidyverse. The between data are a summary table with units constituting the rows. The within data is the overall data with group means subtracted.

HR.Data %>% 
  group_by(IDORIGIN) %>% 
  mutate(DEMOC.Centered = 
           DEMOC3 - mean(DEMOC3, na.rm=TRUE)) %>%
  filter(IDORIGIN==42) %>% 
  select(IDORIGIN, YEAR, DEMOC3, DEMOC.Centered)

# A tibble: 18 × 4
# Groups:   IDORIGIN [1]
   IDORIGIN  YEAR DEMOC3 DEMOC.Centered
      <dbl> <dbl>  <dbl>          <dbl>
 1       42  1976      1         -5.11 
 2       42  1977      1         -5.11 
 3       42  1978      6         -0.111
 4       42  1979      6         -0.111
 5       42  1980      6         -0.111
 6       42  1981      6         -0.111
 7       42  1982      7          0.889
 8       42  1983      7          0.889
 9       42  1984      7          0.889
10       42  1985      7          0.889
11       42  1986      7          0.889
12       42  1987      7          0.889
13       42  1988      7          0.889
14       42  1989      7          0.889
15       42  1990      7          0.889
16       42  1991      7          0.889
17       42  1992      7          0.889
18       42  1993      7          0.889

In Stata, the key is to first load and declare the data.

use "https://github.com/robertwwalker/Essex-Data/raw/main/ISQ99-Essex.dta"
des

xtset denotes two key features of the data, the \(i\) and \(t\).

xtset IDORIGIN YEAR

Stata has internal capabilities for summarising and describing xt data. A between and within summary is given by xtsum

xtsum

The description can be deceptive because the indices are a complete grid.

xtdes

Leads and Lags

The core operators in Stata lead with a letter before . With panel data, we will need to be cautious of the grouping.

F is forward,
L is Lag,
D is difference.

Qualitative Variables

`xttab`

Between: How many units received each category?
Within: Of all observations of units that received that category at least once, what percent of observations take this value?

xttab AINEW

The Between

between.tally <- function(x) {
HR.Data %>% select(IDORIGIN, AINEW) %>% filter(AINEW==x) %$% table(IDORIGIN) %>% length()
}
sapply(c(1:5), function(x) {between.tally(x)})

[1]  96 121 113  86  43

`xttrans`

A first-order transition matrix.

xttrans AINEW

xttrans I want to rely on the dplyr version of lag so I am explicit here. Take the data, group it by id, calculate the lag, ungroup them, and create a table. I prefer to keep this explicit with order_by. The janitor library provides tabyl and it is explicit among missing values.

library(janitor)
HR.Data %>% 
  group_by(IDORIGIN) %>% 
  mutate(Lag.AI = dplyr::lag(AINEW, n=1L, order_by = YEAR)) %>%
  ungroup() %>%
  tabyl(Lag.AI, AINEW)

 Lag.AI   1   2   3   4  5 NA_
      1 545  83  13   1  0   5
      2  86 480 130  20  3   2
      3   5 150 496  76 14   7
      4   2  14  97 167 47   0
      5   0   1   5  53 76   0
     NA  67  41  41  33  8 454