Lab 1: Exploratory analysis of sequential data in education¶

Quan Nguyen, Department of Statistics, University of British Columbia

About the dataset¶

In this lab, we are going to use the built-in biofam data set from the TraMineR package. See more details here

This data consists information about the Family life states from the Swiss Household Panel biographical survey. 16 year-long family life sequences built from the retrospective biographical survey carried out by the Swiss Household Panel (SHP) in 2002.

A data frame with 2000 rows, 16 state variables, 1 id variable and 7 covariates and 2 weights variables.

The data set contains (in columns 10 to 25) sequences of family life states from age 15 to 30 (sequence length is 16) and a series of covariates. The sequences are a sample of 2000 sequences of those created from the SHP biographical survey. It includes only individuals who were at least 30 years old at the time of the survey. The biofam data set describes family life courses of 2000 individuals born between 1909 and 1972.

The states numbered from 0 to 7 are defined from the combination of five basic states, namely Living with parents (Parent), Left home (Left), Married (Marr), Having Children (Child), Divorced:

0 = “Parent”
1 = “Left”
2 = “Married”
3 = “Left+Marr”
4 = “Child”
5 = “Left+Child”
6 = “Left+Marr+Child”
7 = “Divorced”

Variable	Label
idhous	ID
sex	sex
birthy	birth year
nat102	nationality
plingu02	interview language
p02r01	confession or religion
p02r04	participation in religious services: frequency
cspfaj	Swiss socio-professional category: Fathers job
cspmoj	Swiss socio-professional category: Mothers job
a15	family status at age 15
…
a30	family status at age 30

library(tidyverse)
library(TraMineR)
data(biofam)
str(biofam)

'data.frame':	2000 obs. of  27 variables:
 $ idhous  : num  66891 28621 57711 17501 147701 ...
 $ sex     : Factor w/ 2 levels "man","woman": 1 1 2 1 1 1 1 1 1 2 ...
 $ birthyr : num  1943 1935 1946 1918 1946 ...
 $ nat_1_02: Factor w/ 200 levels "other error",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ plingu02: Factor w/ 3 levels "french","german",..: 2 2 1 2 2 3 2 1 1 2 ...
 $ p02r01  : Factor w/ 13 levels "other error",..: 6 7 13 7 7 7 6 9 6 7 ...
 $ p02r04  : Factor w/ 14 levels "other error",..: 9 13 7 13 7 6 7 14 9 13 ...
 $ cspfaj  : Factor w/ 12 levels "active occupied but not classified",..: 7 7 7 5 NA 12 NA 11 7 7 ...
 $ cspmoj  : Factor w/ 12 levels "active occupied but not classified",..: 7 NA 9 NA NA NA NA NA 7 NA ...
 $ a15     : num  0 0 0 0 0 0 0 0 0 1 ...
 $ a16     : num  0 1 0 0 0 0 0 0 0 1 ...
 $ a17     : num  0 1 0 0 0 0 0 0 0 1 ...
 $ a18     : num  0 1 0 0 0 0 0 0 0 1 ...
 $ a19     : num  0 1 0 0 0 0 0 0 0 1 ...
 $ a20     : num  0 1 0 1 1 0 0 0 0 1 ...
 $ a21     : num  0 1 0 1 1 0 0 1 0 1 ...
 $ a22     : num  0 1 1 1 1 0 0 1 0 1 ...
 $ a23     : num  0 1 1 1 1 0 0 1 0 1 ...
 $ a24     : num  3 1 1 1 1 0 2 1 0 6 ...
 $ a25     : num  6 1 1 1 1 0 2 1 0 6 ...
 $ a26     : num  6 3 1 1 1 0 2 3 6 6 ...
 $ a27     : num  6 6 3 1 1 0 2 3 6 6 ...
 $ a28     : num  6 6 6 1 6 0 2 3 6 6 ...
 $ a29     : num  6 6 6 1 6 0 2 6 6 6 ...
 $ a30     : num  6 6 6 1 6 0 2 6 6 6 ...
 $ wp00tbgp: num  1053 855 575 1527 796 ...
 $ wp00tbgs: num  0.935 0.759 0.51 1.356 0.707 ...

Part 1: Data manipulation¶

Q1: Import the following dataset and create a sequence object using the `seqdef()` function in the `TraMineR` package.¶

Hint: You can use the option states in seqdef() to assign the short state labels for each state

# state labels
bfstates <- c("Parent", "Left", "Married", "Left+Marr", "Child", "Left+Child", "Left+Marr+Child", "Divorced")

# BEGIN SOLUTION
biofam.seq <- seqdef(biofam, 10:25, states = bfstates, labels = bfstates)
# END SOLUTION

 [>] state coding:

       [alphabet]  [label]         [long label] 

     1  0           Parent          Parent

     2  1           Left            Left

     3  2           Married         Married

     4  3           Left+Marr       Left+Marr

     5  4           Child           Child

     6  5           Left+Child      Left+Child

     7  6           Left+Marr+Child Left+Marr+Child

     8  7           Divorced        Divorced

 [>] 2000 sequences in the data set

 [>] min/max sequence length: 16/16

Q2: Convert the sequence object from ‘STS’ to ‘SPS’ format using the `seqformat()` function¶

# BEGIN SOLUTION
seqformat(biofam.seq, from ='STS', to='SPS')
# END SOLUTION

 [!!] 'missing' set as "c('*','%')", the 'nr' and 'void' code from the 'data' state sequence object

 [>] converting STS sequences to 2000 SPS sequences

A matrix: 2000 × 16 of type chr
	[1]	[2]	[3]	[4]	[5]	[6]	[7]	[8]	[9]	[10]	[11]	[12]	[13]	[14]	[15]	[16]
1167	(Parent,9)	(Left+Marr,1)	(Left+Marr+Child,6)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
514	(Parent,1)	(Left,10)	(Left+Marr,1)	(Left+Marr+Child,4)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1013	(Parent,7)	(Left,5)	(Left+Marr,1)	(Left+Marr+Child,3)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
275	(Parent,5)	(Left,11)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2580	(Parent,5)	(Left,8)	(Left+Marr+Child,3)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
773	(Parent,16)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1187	(Parent,9)	(Married,7)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
47	(Parent,6)	(Left,5)	(Left+Marr,3)	(Left+Marr+Child,2)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2091	(Parent,11)	(Left+Marr+Child,5)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1846	(Left,9)	(Left+Marr+Child,7)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1990	(Parent,8)	(Left+Marr,8)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2088	(Parent,16)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
867	(Parent,5)	(Left,4)	(Left+Marr,1)	(Left+Marr+Child,6)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1616	(Parent,6)	(Left+Marr+Child,10)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2136	(Parent,8)	(Left+Marr,3)	(Divorced,5)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2031	(Parent,6)	(Married,2)	(Left+Marr+Child,4)	(Divorced,4)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2459	(Parent,16)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
222	(Parent,16)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2193	(Parent,13)	(Left+Marr,3)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1571	(Parent,4)	(Left,3)	(Left+Marr,1)	(Left+Marr+Child,8)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2592	(Parent,8)	(Left+Marr+Child,8)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1989	(Parent,4)	(Left+Marr,2)	(Left+Marr+Child,10)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1917	(Parent,8)	(Left+Marr,3)	(Left+Marr+Child,5)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
630	(Parent,1)	(Left,2)	(Left+Marr+Child,13)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
532	(Parent,6)	(Left+Marr+Child,10)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
863	(Parent,6)	(Left+Marr,2)	(Left+Marr+Child,8)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1102	(Parent,1)	(Left,4)	(Left+Marr+Child,11)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1454	(Parent,16)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1174	(Parent,5)	(Left,7)	(Left+Marr,1)	(Left+Marr+Child,3)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
227	(Parent,5)	(Left+Marr+Child,11)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
81	(Parent,7)	(Left,9)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1805	(Parent,11)	(Left+Child,1)	(Left+Marr+Child,4)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
789	(Parent,1)	(Left,15)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2361	(Parent,7)	(Married,9)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
56	(Parent,8)	(Left+Marr,2)	(Left+Marr+Child,6)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
645	(Parent,7)	(Left,2)	(Left+Marr+Child,7)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1721	(Parent,9)	(Left,2)	(Left+Marr,2)	(Left+Marr+Child,3)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1419	(Parent,14)	(Married,2)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1207	(Parent,5)	(Left,11)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
259	(Parent,9)	(Married,7)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2413	(Parent,2)	(Left,5)	(Left+Marr,3)	(Left+Marr+Child,6)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2090	(Parent,5)	(Left,11)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1337	(Parent,9)	(Left+Marr,2)	(Left+Marr+Child,5)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1826	(Parent,5)	(Left,2)	(Left+Marr,2)	(Left+Marr+Child,7)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2503	(Parent,10)	(Left+Marr,6)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
106	(Parent,9)	(Left+Marr+Child,7)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1181	(Parent,9)	(Left,6)	(Left+Marr,1)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1848	(Parent,8)	(Left+Marr,8)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2203	(Parent,6)	(Left,10)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1745	(Parent,1)	(Left,11)	(Left+Marr+Child,4)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
278	(Parent,16)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1980	(Parent,5)	(Left,11)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
787	(Parent,14)	(Left,2)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1120	(Parent,15)	(Left+Marr+Child,1)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
59	(Parent,13)	(Married,3)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
629	(Parent,6)	(Left,3)	(Left+Marr+Child,7)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2297	(Parent,2)	(Left,6)	(Left+Marr,4)	(Left+Marr+Child,4)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
775	(Parent,16)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2522	(Parent,3)	(Married,13)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
719	(Parent,6)	(Left+Marr+Child,10)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

Part 2: Exploratory data analysis¶

Q3 Plot the first 15 sequences¶

# Hint: seqiplot()

# BEGIN SOLUTION
seqiplot(biofam.seq, , idxs = 1:15)
# END SOLUTION

Error in seqiplot(biofam.seq, , idxs = 1:15): could not find function "seqiplot"
Traceback:

Q4 Plot the state distribution¶

# Hint: seqdplot()

# BEGIN SOLUTION
seqdplot(biofam.seq, main = "State distribution plot")
# END SOLUTION

Q5 Plot the top 10 most frequent sequences¶

# Hint: seqfplot()

# BEGIN SOLUTION
seqfplot(biofam.seq, main = "Sequence frequency plot", idxs = 1:10)
# END SOLUTION

Q6 What are the distinct states sequence (DSS) in the sequence objects?¶

# Hint: seqdss()

# BEGIN SOLUTION
print(head(seqdss(biofam.seq)))
# END SOLUTION

     Sequence                             
Parent-Left+Marr-Left+Marr+Child     
Parent-Left-Left+Marr-Left+Marr+Child
Parent-Left-Left+Marr-Left+Marr+Child
Parent-Left                          
Parent-Left-Left+Marr+Child          
Parent                               

Q7 How many sequences are there in the data?¶

seq_num <- nrow(biofam.seq) # SOLUTION

2000

Q8 What is the min/max/median length of the sequences¶

summary(seqlength(biofam.seq)) # SOLUTION

     Length  
 Min.   :17  
 1st Qu.:17  
 Median :17  
 Mean   :17  
 3rd Qu.:17  
 Max.   :17  

Q9 Which state is the most likely to follow ‘Left’ (hint: transition rates)¶

# Hint: seqtrate()

# BEGIN SOLUTION
seqtrate(biofam.seq)
# END SOLUTION

 [>] computing transition probabilities for states Parent/Left/Married/Left+Marr/Child/Left+Child/Left+Marr+Child/Divorced ...

A matrix: 8 × 8 of type dbl
	[-> Parent]	[-> Left]	[-> Married]	[-> Left+Marr]	[-> Child]	[-> Left+Child]	[-> Left+Marr+Child]	[-> Divorced]
[Parent ->]	0.8856748	0.05458433	0.01534398	0.03181990	0.000377311	0.001069048	0.01113068	0.0000000000
[Left ->]	0.0000000	0.88983957	0.00000000	0.08342246	0.000000000	0.003921569	0.02263815	0.0001782531
[Married ->]	0.0000000	0.00000000	0.96902303	0.01032566	0.000000000	0.000000000	0.01111994	0.0095313741
[Left+Marr ->]	0.0000000	0.00000000	0.00000000	0.78695955	0.000000000	0.000000000	0.19944212	0.0135983264
[Child ->]	0.0000000	0.00000000	0.12500000	0.00000000	0.812500000	0.062500000	0.00000000	0.0000000000
[Left+Child ->]	0.0000000	0.00000000	0.00000000	0.00000000	0.000000000	0.881944444	0.11805556	0.0000000000
[Left+Marr+Child ->]	0.0000000	0.00000000	0.00000000	0.00000000	0.000000000	0.000000000	0.99393173	0.0060682680
[Divorced ->]	0.0000000	0.00000000	0.00000000	0.00000000	0.000000000	0.000000000	0.00000000	1.0000000000

Q10 What are the top 10 most diverse sequences (hint: entropy, turbulence)¶

# BEGIN SOLUTION
df <- as.tibble(seqindic(biofam.seq, indic=c("entr","turbn","cplx")))
df$index <- rownames(seqindic(biofam.seq, indic=c("entr","turbn","cplx")))
top10 <- df %>% arrange(desc(Entr)) %>% head(10)
top10

biofam.seq$index <- rownames(biofam.seq)
seqiplot(biofam.seq %>% filter(index %in% top10$index))
# END SOLUTION

A tibble: 10 × 4
Entr	Cplx	Turbn	index
<dbl>	<dbl>	<dbl>	<chr>
0.7028195	0.4191717	0.3982722	326
0.6841404	0.4135639	0.3743671	1241
0.6666667	0.3535534	0.4887412	1098
0.6590723	0.3515339	0.4521225	141
0.6590723	0.3515339	0.4521225	2534
0.6590723	0.3515339	0.4521225	1831
0.6590723	0.3515339	0.4521225	2083
0.6590723	0.3515339	0.4521225	2028
0.6582006	0.4056478	0.3613499	1594
0.6514780	0.3495027	0.4261411	761

Q.11 What is the average time spent in each state?¶

# BEGIN SOLUTION
seqmtplot(biofam.seq)
# END SOLUTION

Temporal and sequential analysis - LASI 2022