Lab 1: Exploratory analysis of sequential data in education

Quan Nguyen, Department of Statistics, University of British Columbia

About the dataset

In this lab, we use the built-in biofam data set from the TraMineR package. For more details, see its help page (?biofam).

The data contain family life states from the Swiss Household Panel (SHP) biographical survey: 16-year-long family life sequences built from the retrospective biographical survey that the SHP carried out in 2002.

The data set is a data frame with 2000 rows and 27 columns: 1 ID variable, 8 covariates, 16 state variables, and 2 weight variables.

Columns 10 to 25 hold the sequences of family life states from age 15 to age 30 (sequence length 16); the other columns hold the ID, the covariates, and the weights. The 2000 sequences are a sample of those created from the SHP biographical survey. They cover only individuals who were at least 30 years old at the time of the survey, born between 1909 and 1972.
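The column range 10 to 25 can be double-checked from the variable names. This is a minimal sketch, assuming the 27 names come in the order printed by str(biofam) later in this lab; vars is a hand-typed stand-in for names(biofam), so the snippet runs without loading the package.

```r
# Hand-typed copy of the 27 variable names
# (assumption: same order as names(biofam) in the str() output)
vars <- c("idhous", "sex", "birthyr", "nat_1_02", "plingu02", "p02r01",
          "p02r04", "cspfaj", "cspmoj", paste0("a", 15:30),
          "wp00tbgp", "wp00tbgs")

# Positions of the first and last state variables
seq_cols <- match(c("a15", "a30"), vars)
seq_cols

# Number of state variables, i.e. the sequence length
diff(seq_cols) + 1
```

With the real data loaded, match(c("a15", "a30"), names(biofam)) should give the same positions.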

The states, numbered 0 to 7, are combinations of five basic states: living with parents (Parent), left home (Left), married (Marr), having children (Child), and divorced (Divorced):

0 = “Parent”
1 = “Left”
2 = “Married”
3 = “Left+Marr”
4 = “Child”
5 = “Left+Child”
6 = “Left+Marr+Child”
7 = “Divorced”
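Because the codes start at 0 while R vectors are indexed from 1, decoding a code by hand needs an offset of one. A small sketch; the codes vector is made up for illustration and is not taken from the data.

```r
# Short state labels, in the order of the numeric codes 0..7
bfstates <- c("Parent", "Left", "Married", "Left+Marr", "Child",
              "Left+Child", "Left+Marr+Child", "Divorced")

# Illustrative codes (hypothetical, not a real sequence)
codes <- c(0, 1, 3, 6, 7)

# Code k maps to element k + 1 of the label vector
decoded <- bfstates[codes + 1]
decoded
```

seqdef() performs this mapping itself when the labels are passed through its states argument, so the offset only matters when decoding values manually.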

Variable    Label
idhous      ID
sex         sex
birthyr     birth year
nat_1_02    nationality
plingu02    interview language
p02r01      confession or religion
p02r04      participation in religious services: frequency
cspfaj      Swiss socio-professional category: father's job
cspmoj      Swiss socio-professional category: mother's job
a15         family status at age 15
...         ...
a30         family status at age 30

library(tidyverse)
library(TraMineR)
data(biofam)
str(biofam)
'data.frame':	2000 obs. of  27 variables:
 $ idhous  : num  66891 28621 57711 17501 147701 ...
 $ sex     : Factor w/ 2 levels "man","woman": 1 1 2 1 1 1 1 1 1 2 ...
 $ birthyr : num  1943 1935 1946 1918 1946 ...
 $ nat_1_02: Factor w/ 200 levels "other error",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ plingu02: Factor w/ 3 levels "french","german",..: 2 2 1 2 2 3 2 1 1 2 ...
 $ p02r01  : Factor w/ 13 levels "other error",..: 6 7 13 7 7 7 6 9 6 7 ...
 $ p02r04  : Factor w/ 14 levels "other error",..: 9 13 7 13 7 6 7 14 9 13 ...
 $ cspfaj  : Factor w/ 12 levels "active occupied but not classified",..: 7 7 7 5 NA 12 NA 11 7 7 ...
 $ cspmoj  : Factor w/ 12 levels "active occupied but not classified",..: 7 NA 9 NA NA NA NA NA 7 NA ...
 $ a15     : num  0 0 0 0 0 0 0 0 0 1 ...
 $ a16     : num  0 1 0 0 0 0 0 0 0 1 ...
 $ a17     : num  0 1 0 0 0 0 0 0 0 1 ...
 $ a18     : num  0 1 0 0 0 0 0 0 0 1 ...
 $ a19     : num  0 1 0 0 0 0 0 0 0 1 ...
 $ a20     : num  0 1 0 1 1 0 0 0 0 1 ...
 $ a21     : num  0 1 0 1 1 0 0 1 0 1 ...
 $ a22     : num  0 1 1 1 1 0 0 1 0 1 ...
 $ a23     : num  0 1 1 1 1 0 0 1 0 1 ...
 $ a24     : num  3 1 1 1 1 0 2 1 0 6 ...
 $ a25     : num  6 1 1 1 1 0 2 1 0 6 ...
 $ a26     : num  6 3 1 1 1 0 2 3 6 6 ...
 $ a27     : num  6 6 3 1 1 0 2 3 6 6 ...
 $ a28     : num  6 6 6 1 6 0 2 3 6 6 ...
 $ a29     : num  6 6 6 1 6 0 2 6 6 6 ...
 $ a30     : num  6 6 6 1 6 0 2 6 6 6 ...
 $ wp00tbgp: num  1053 855 575 1527 796 ...
 $ wp00tbgs: num  0.935 0.759 0.51 1.356 0.707 ...

Part 1: Data manipulation

Q1: Using the biofam data loaded above, create a sequence object with the seqdef() function from the TraMineR package.

Hint: Use the states argument of seqdef() to assign a short label to each state.

# state labels
bfstates <- c("Parent", "Left", "Married", "Left+Marr", "Child", "Left+Child", "Left+Marr+Child", "Divorced")

# BEGIN SOLUTION
biofam.seq <- seqdef(biofam, 10:25, states = bfstates, labels = bfstates)
# END SOLUTION
 [>] state coding:
       [alphabet]  [label]         [long label]
     1  0           Parent          Parent
     2  1           Left            Left
     3  2           Married         Married
     4  3           Left+Marr       Left+Marr
     5  4           Child           Child
     6  5           Left+Child      Left+Child
     7  6           Left+Marr+Child Left+Marr+Child
     8  7           Divorced        Divorced
 [>] 2000 sequences in the data set
 [>] min/max sequence length: 16/16

Q2: Convert the sequence object from ‘STS’ to ‘SPS’ format using the seqformat() function

# BEGIN SOLUTION
seqformat(biofam.seq, from = "STS", to = "SPS")
# END SOLUTION
 [!!] 'missing' set as "c('*','%')", the 'nr' and 'void' code from the 'data' state sequence object

 [>] converting STS sequences to 2000 SPS sequences
A matrix: 2000 × 16 of type chr (columns [5] to [16] are NA in every row shown and are omitted here)
      [1]          [2]                   [3]                   [4]
1167  (Parent,9)   (Left+Marr,1)         (Left+Marr+Child,6)   NA
514   (Parent,1)   (Left,10)             (Left+Marr,1)         (Left+Marr+Child,4)
1013  (Parent,7)   (Left,5)              (Left+Marr,1)         (Left+Marr+Child,3)
275   (Parent,5)   (Left,11)             NA                    NA
2580  (Parent,5)   (Left,8)              (Left+Marr+Child,3)   NA
773   (Parent,16)  NA                    NA                    NA
1187  (Parent,9)   (Married,7)           NA                    NA
47    (Parent,6)   (Left,5)              (Left+Marr,3)         (Left+Marr+Child,2)
2091  (Parent,11)  (Left+Marr+Child,5)   NA                    NA
1846  (Left,9)     (Left+Marr+Child,7)   NA                    NA
1990  (Parent,8)   (Left+Marr,8)         NA                    NA
2088  (Parent,16)  NA                    NA                    NA
867   (Parent,5)   (Left,4)              (Left+Marr,1)         (Left+Marr+Child,6)
1616  (Parent,6)   (Left+Marr+Child,10)  NA                    NA
2136  (Parent,8)   (Left+Marr,3)         (Divorced,5)          NA
2031  (Parent,6)   (Married,2)           (Left+Marr+Child,4)   (Divorced,4)
2459  (Parent,16)  NA                    NA                    NA
222   (Parent,16)  NA                    NA                    NA
2193  (Parent,13)  (Left+Marr,3)         NA                    NA
1571  (Parent,4)   (Left,3)              (Left+Marr,1)         (Left+Marr+Child,8)
2592  (Parent,8)   (Left+Marr+Child,8)   NA                    NA
1989  (Parent,4)   (Left+Marr,2)         (Left+Marr+Child,10)  NA
1917  (Parent,8)   (Left+Marr,3)         (Left+Marr+Child,5)   NA
630   (Parent,1)   (Left,2)              (Left+Marr+Child,13)  NA
532   (Parent,6)   (Left+Marr+Child,10)  NA                    NA
863   (Parent,6)   (Left+Marr,2)         (Left+Marr+Child,8)   NA
1102  (Parent,1)   (Left,4)              (Left+Marr+Child,11)  NA
1454  (Parent,16)  NA                    NA                    NA
1174  (Parent,5)   (Left,7)              (Left+Marr,1)         (Left+Marr+Child,3)
227   (Parent,5)   (Left+Marr+Child,11)  NA                    NA
81    (Parent,7)   (Left,9)              NA                    NA
1805  (Parent,11)  (Left+Child,1)        (Left+Marr+Child,4)   NA
789   (Parent,1)   (Left,15)             NA                    NA
2361  (Parent,7)   (Married,9)           NA                    NA
56    (Parent,8)   (Left+Marr,2)         (Left+Marr+Child,6)   NA
645   (Parent,7)   (Left,2)              (Left+Marr+Child,7)   NA
1721  (Parent,9)   (Left,2)              (Left+Marr,2)         (Left+Marr+Child,3)
1419  (Parent,14)  (Married,2)           NA                    NA
1207  (Parent,5)   (Left,11)             NA                    NA
259   (Parent,9)   (Married,7)           NA                    NA
2413  (Parent,2)   (Left,5)              (Left+Marr,3)         (Left+Marr+Child,6)
2090  (Parent,5)   (Left,11)             NA                    NA
1337  (Parent,9)   (Left+Marr,2)         (Left+Marr+Child,5)   NA
1826  (Parent,5)   (Left,2)              (Left+Marr,2)         (Left+Marr+Child,7)
2503  (Parent,10)  (Left+Marr,6)         NA                    NA
106   (Parent,9)   (Left+Marr+Child,7)   NA                    NA
1181  (Parent,9)   (Left,6)              (Left+Marr,1)         NA
1848  (Parent,8)   (Left+Marr,8)         NA                    NA
2203  (Parent,6)   (Left,10)             NA                    NA
1745  (Parent,1)   (Left,11)             (Left+Marr+Child,4)   NA
278   (Parent,16)  NA                    NA                    NA
1980  (Parent,5)   (Left,11)             NA                    NA
787   (Parent,14)  (Left,2)              NA                    NA
1120  (Parent,15)  (Left+Marr+Child,1)   NA                    NA
59    (Parent,13)  (Married,3)           NA                    NA
629   (Parent,6)   (Left,3)              (Left+Marr+Child,7)   NA
2297  (Parent,2)   (Left,6)              (Left+Marr,4)         (Left+Marr+Child,4)
775   (Parent,16)  NA                    NA                    NA
2522  (Parent,3)   (Married,13)          NA                    NA
719   (Parent,6)   (Left+Marr+Child,10)  NA                    NA
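To see what the STS to SPS conversion does, the same result can be reproduced for a single sequence with base R run-length encoding. This is only an illustrative sketch, not how seqformat() is implemented; the codes in sts are the age 15 to 30 values of the first individual, read off the str(biofam) output.

```r
# Short state labels, in the order of the numeric codes 0..7
bfstates <- c("Parent", "Left", "Married", "Left+Marr", "Child",
              "Left+Child", "Left+Marr+Child", "Divorced")

# STS form: one state per year, ages 15..30 (first row of biofam)
sts <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 6, 6, 6, 6, 6, 6)

# Run-length encoding collapses consecutive repeats into spells
runs <- rle(bfstates[sts + 1])

# SPS form: one (state, duration) pair per spell
sps <- paste0("(", runs$values, ",", runs$lengths, ")")
sps
```

The spells match the row labelled 1167 in the output. Keeping only runs$values, without the durations, gives the distinct-state sequence asked for in Q6.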

Part 2: Exploratory data analysis

Q3: Plot the first 15 sequences

# Hint: seqiplot()

# BEGIN SOLUTION
seqiplot(biofam.seq, idxs = 1:15)
# END SOLUTION
Q4: Plot the state distribution

# Hint: seqdplot()

# BEGIN SOLUTION
seqdplot(biofam.seq, main = "State distribution plot")
# END SOLUTION
[Figure: State distribution plot]

Q5: Plot the top 10 most frequent sequences

# Hint: seqfplot()

# BEGIN SOLUTION
seqfplot(biofam.seq, main = "Sequence frequency plot", idxs = 1:10)
# END SOLUTION
[Figure: Sequence frequency plot]

Q6: What are the distinct-state sequences (DSS) in the sequence object?

# Hint: seqdss()

# BEGIN SOLUTION
print(head(seqdss(biofam.seq)))
# END SOLUTION
     Sequence                             
1167 Parent-Left+Marr-Left+Marr+Child     
514  Parent-Left-Left+Marr-Left+Marr+Child
1013 Parent-Left-Left+Marr-Left+Marr+Child
275  Parent-Left                          
2580 Parent-Left-Left+Marr+Child          
773  Parent                               

Q7: How many sequences are there in the data?

seq_num <- nrow(biofam.seq) # SOLUTION
seq_num
2000

Q8: What are the min, max, and median lengths of the sequences?

summary(seqlength(biofam.seq)) # SOLUTION
     Length  
 Min.   :16  
 1st Qu.:16  
 Median :16  
 Mean   :16  
 3rd Qu.:16  
 Max.   :16  

Q9: Which state is the most likely to follow ‘Left’? (Hint: transition rates)

# Hint: seqtrate()

# BEGIN SOLUTION
seqtrate(biofam.seq)
# END SOLUTION
 [>] computing transition probabilities for states Parent/Left/Married/Left+Marr/Child/Left+Child/Left+Marr+Child/Divorced ...
A matrix: 8 × 8 of type dbl
                     [-> Parent]  [-> Left]  [-> Married]  [-> Left+Marr]  [-> Child]   [-> Left+Child]  [-> Left+Marr+Child]  [-> Divorced]
[Parent ->]          0.8856748    0.05458433 0.01534398    0.03181990      0.000377311  0.001069048      0.01113068            0.0000000000
[Left ->]            0.0000000    0.88983957 0.00000000    0.08342246      0.000000000  0.003921569      0.02263815            0.0001782531
[Married ->]         0.0000000    0.00000000 0.96902303    0.01032566      0.000000000  0.000000000      0.01111994            0.0095313741
[Left+Marr ->]       0.0000000    0.00000000 0.00000000    0.78695955      0.000000000  0.000000000      0.19944212            0.0135983264
[Child ->]           0.0000000    0.00000000 0.12500000    0.00000000      0.812500000  0.062500000      0.00000000            0.0000000000
[Left+Child ->]      0.0000000    0.00000000 0.00000000    0.00000000      0.000000000  0.881944444      0.11805556            0.0000000000
[Left+Marr+Child ->] 0.0000000    0.00000000 0.00000000    0.00000000      0.000000000  0.000000000      0.99393173            0.0060682680
[Divorced ->]        0.0000000    0.00000000 0.00000000    0.00000000      0.000000000  0.000000000      0.00000000            1.0000000000
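The answer to Q9 can be read off the "[Left ->]" row. The sketch below copies that row from the output above and separates two readings of the question: the most likely state in the next year overall, and the most likely different state. The variable names are illustrative.

```r
# Short state labels, in the order of the matrix columns
states <- c("Parent", "Left", "Married", "Left+Marr", "Child",
            "Left+Child", "Left+Marr+Child", "Divorced")

# Row "[Left ->]" of the transition-rate matrix, copied from the output
from_left <- c(0.0000000, 0.88983957, 0.00000000, 0.08342246,
               0.000000000, 0.003921569, 0.02263815, 0.0001782531)
names(from_left) <- states

# Most likely state one year later, staying in "Left" included
overall <- names(which.max(from_left))
overall

# Most likely different state: drop the self-transition first
moves <- from_left[names(from_left) != "Left"]
next_state <- names(which.max(moves))
next_state

# Share of that move among all moves out of "Left"
moves[next_state] / sum(moves)
```

Staying in "Left" dominates the row, so the question is only informative once the diagonal entry is set aside.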

Q10: What are the top 10 most diverse sequences? (Hint: entropy, turbulence)

# BEGIN SOLUTION
# Compute the indicators once and keep the sequence IDs (row names) alongside
indic <- seqindic(biofam.seq, indic = c("entr", "turbn", "cplx"))
df <- as_tibble(indic) %>% mutate(index = rownames(indic))
top10 <- df %>% arrange(desc(Entr)) %>% head(10)
top10

# Subset by row name so that biofam.seq itself is left unmodified
seqiplot(biofam.seq[rownames(biofam.seq) %in% top10$index, ])
# END SOLUTION
A tibble: 10 × 4
Entr       Cplx       Turbn      index
<dbl>      <dbl>      <dbl>      <chr>
0.7028195  0.4191717  0.3982722  326
0.6841404  0.4135639  0.3743671  1241
0.6666667  0.3535534  0.4887412  1098
0.6590723  0.3515339  0.4521225  141
0.6590723  0.3515339  0.4521225  2534
0.6590723  0.3515339  0.4521225  1831
0.6590723  0.3515339  0.4521225  2083
0.6590723  0.3515339  0.4521225  2028
0.6582006  0.4056478  0.3613499  1594
0.6514780  0.3495027  0.4261411  761
[Figure: index plot of the 10 sequences with the highest entropy]
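The entropy used for the ranking can be computed by hand for one sequence. A minimal sketch, assuming the within-sequence entropy is Shannon entropy of the time spent in each state, normalized by the log of the alphabet size (8 states here); seq_entropy is a hypothetical helper, not a TraMineR function.

```r
# Normalized Shannon entropy of the time spent in each state.
# time_in_state: years spent in each visited state (total per state,
#                summed over spells if a state is visited more than once)
# n_states:      alphabet size used for normalization (assumption: 8)
seq_entropy <- function(time_in_state, n_states = 8) {
  p <- time_in_state / sum(time_in_state)
  p <- p[p > 0]                       # 0 * log(0) is treated as 0
  -sum(p * log(p)) / log(n_states)
}

# First sequence: 9 years Parent, 1 year Left+Marr, 6 years Left+Marr+Child
h_first <- seq_entropy(c(9, 1, 6))
h_first

# Four states held for 4 years each
h_even <- seq_entropy(c(4, 4, 4, 4))
h_even

# A single state for all 16 years
h_flat <- seq_entropy(16)
h_flat
```

Entropy only looks at how time is split across states and ignores their order, which is why the hint also mentions turbulence.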

Q11: What is the average time spent in each state?

# BEGIN SOLUTION
seqmtplot(biofam.seq)
# END SOLUTION
[Figure: mean time spent in each state]
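The quantity behind this plot, the mean number of years spent in each state, can be sketched in base R. The example uses only the first three individuals, whose codes are read off the str(biofam) output, so the numbers illustrate the calculation and are not the values for the full sample.

```r
# Short state labels, in the order of the numeric codes 0..7
bfstates <- c("Parent", "Left", "Married", "Left+Marr", "Child",
              "Left+Child", "Left+Marr+Child", "Divorced")

# Ages 15..30 for the first three individuals, one row each
sts <- rbind(
  c(0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 6, 6, 6, 6, 6, 6),
  c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 6, 6, 6, 6),
  c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 3, 6, 6, 6)
)

# Years spent in each state, one row per individual
time_in_state <- t(apply(sts, 1, function(s) table(factor(s, levels = 0:7))))
colnames(time_in_state) <- bfstates

# Average over individuals; the means add up to the sequence length
mean_time <- colMeans(time_in_state)
mean_time
sum(mean_time)
```

Because every sequence has length 16, the state means always sum to 16, which is a quick sanity check on the plot.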