Lab 2: Cluster analysis of sequential data

About the dataset

In this lab, we are going to use the built-in biofam data set from the TraMineR package. See more details here

This data consists information about the Family life states from the Swiss Household Panel biographical survey. 16 year-long family life sequences built from the retrospective biographical survey carried out by the Swiss Household Panel (SHP) in 2002.

A data frame with 2000 rows, 16 state variables, 1 id variable and 7 covariates and 2 weights variables.

The data set contains (in columns 10 to 25) sequences of family life states from age 15 to 30 (sequence length is 16) and a series of covariates. The sequences are a sample of 2000 sequences of those created from the SHP biographical survey. It includes only individuals who were at least 30 years old at the time of the survey. The biofam data set describes family life courses of 2000 individuals born between 1909 and 1972.

The states numbered from 0 to 7 are defined from the combination of five basic states, namely Living with parents (Parent), Left home (Left), Married (Marr), Having Children (Child), Divorced:

0 = “Parent”
1 = “Left”
2 = “Married”
3 = “Left+Marr”
4 = “Child”
5 = “Left+Child”
6 = “Left+Marr+Child”
7 = “Divorced”

Variable

Label

idhous

ID

sex

sex

birthy

birth year

nat102

nationality

plingu02

interview language

p02r01

confession or religion

p02r04

participation in religious services: frequency

cspfaj

Swiss socio-professional category: Fathers job

cspmoj

Swiss socio-professional category: Mothers job

a15

family status at age 15

a30

family status at age 30

library(tidyverse)
library(TraMineR)
library(cluster)
data(biofam)
str(biofam)
── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
 ggplot2 3.3.5      purrr   0.3.4
 tibble  3.1.5      dplyr   1.0.7
 tidyr   1.1.3      stringr 1.4.0
 readr   2.0.1      forcats 0.5.1
Warning message:
“package ‘tibble’ was built under R version 4.1.1”
Warning message:
“package ‘readr’ was built under R version 4.1.1”
── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()
Warning message:
“package ‘TraMineR’ was built under R version 4.1.1”
TraMineR stable version 2.2-3 (Built: 2022-01-26)
Website: http://traminer.unige.ch
Please type 'citation("TraMineR")' for citation information.
Warning message:
“package ‘cluster’ was built under R version 4.1.1”
'data.frame':	2000 obs. of  27 variables:
 $ idhous  : num  66891 28621 57711 17501 147701 ...
 $ sex     : Factor w/ 2 levels "man","woman": 1 1 2 1 1 1 1 1 1 2 ...
 $ birthyr : num  1943 1935 1946 1918 1946 ...
 $ nat_1_02: Factor w/ 200 levels "other error",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ plingu02: Factor w/ 3 levels "french","german",..: 2 2 1 2 2 3 2 1 1 2 ...
 $ p02r01  : Factor w/ 13 levels "other error",..: 6 7 13 7 7 7 6 9 6 7 ...
 $ p02r04  : Factor w/ 14 levels "other error",..: 9 13 7 13 7 6 7 14 9 13 ...
 $ cspfaj  : Factor w/ 12 levels "active occupied but not classified",..: 7 7 7 5 NA 12 NA 11 7 7 ...
 $ cspmoj  : Factor w/ 12 levels "active occupied but not classified",..: 7 NA 9 NA NA NA NA NA 7 NA ...
 $ a15     : num  0 0 0 0 0 0 0 0 0 1 ...
 $ a16     : num  0 1 0 0 0 0 0 0 0 1 ...
 $ a17     : num  0 1 0 0 0 0 0 0 0 1 ...
 $ a18     : num  0 1 0 0 0 0 0 0 0 1 ...
 $ a19     : num  0 1 0 0 0 0 0 0 0 1 ...
 $ a20     : num  0 1 0 1 1 0 0 0 0 1 ...
 $ a21     : num  0 1 0 1 1 0 0 1 0 1 ...
 $ a22     : num  0 1 1 1 1 0 0 1 0 1 ...
 $ a23     : num  0 1 1 1 1 0 0 1 0 1 ...
 $ a24     : num  3 1 1 1 1 0 2 1 0 6 ...
 $ a25     : num  6 1 1 1 1 0 2 1 0 6 ...
 $ a26     : num  6 3 1 1 1 0 2 3 6 6 ...
 $ a27     : num  6 6 3 1 1 0 2 3 6 6 ...
 $ a28     : num  6 6 6 1 6 0 2 3 6 6 ...
 $ a29     : num  6 6 6 1 6 0 2 6 6 6 ...
 $ a30     : num  6 6 6 1 6 0 2 6 6 6 ...
 $ wp00tbgp: num  1053 855 575 1527 796 ...
 $ wp00tbgs: num  0.935 0.759 0.51 1.356 0.707 ...
# state labels
bfstates <- c("Parent", "Left", "Married", "Left+Marr", "Child", "Left+Child", "Left+Marr+Child", "Divorced")

# define sequence object
biofam.seq <- seqdef(biofam, 10:25, states = bfstates, labels = bfstates)
 [>] state coding:
       [alphabet]  [label]         [long label] 
     1  0           Parent          Parent
     2  1           Left            Left
     3  2           Married         Married
     4  3           Left+Marr       Left+Marr
     5  4           Child           Child
     6  5           Left+Child      Left+Child
     7  6           Left+Marr+Child Left+Marr+Child
     8  7           Divorced        Divorced
 [>] 2000 sequences in the data set
 [>] min/max sequence length: 16/16

Q1. Create a normalized dissimilarity matrix using Longest Common Subsequences method

Store the dissimilarity matrix in biofam.seq.LCS

biofam.seq.LCS <- NULL

# BEGIN SOLUTION
biofam.seq.LCS <- seqdist(biofam.seq, method='LCS', norm = 'auto')
biofam.seq.LCS

# END SOLUTION
 [>] 2000 sequences with 8 distinct states
 [>] creating a 'sm' with a substitution cost of 2
 [>] creating 8x8 substitution-cost matrix using 2 as constant value
 [>] 537 distinct  sequences 
 [>] min/max sequence lengths: 16/16
 [>] computing distances using the LCS gmean normalized metric
 [>] elapsed time: 0.214 secs
A matrix: 2000 × 2000 of type dbl
11675141013275258077311874720911846278198078711205962922977752522719
11670.00000.62500.31250.68750.50000.43750.43750.43750.12500.62500.43750.68750.43750.37500.43750.25000.56250.43750.81250.2500
5140.62500.00000.37500.31250.25000.93750.93750.43750.68750.18750.93750.31250.81250.87500.93750.50000.25000.93750.93750.6875
10130.31250.37500.00000.37500.18750.56250.56250.12500.37500.50000.56250.37500.43750.50000.56250.25000.31250.56250.81250.4375
2750.68750.31250.37500.00000.18750.68750.68750.37500.68750.43750.68750.00000.56250.68750.68750.50000.50000.68750.81250.6875
25800.50000.25000.18750.18750.00000.68750.68750.25000.50000.31250.68750.18750.56250.62500.68750.31250.31250.68750.81250.5000
7730.43750.93750.56250.68750.68750.00000.43750.62500.31251.00000.00000.68750.12500.06250.18750.62500.87500.00000.81250.6250
11870.43750.93750.56250.68750.68750.43750.00000.62500.43751.00000.43750.68750.43750.43750.25000.62500.87500.43750.37500.6250
470.43750.43750.12500.37500.25000.62500.62500.00000.50000.56250.62500.37500.50000.56250.62500.31250.25000.62500.81250.5000
20910.12500.68750.37500.68750.50000.31250.43750.50000.00000.68750.31250.68750.31250.25000.31250.31250.62500.31250.81250.3125
18460.62500.18750.50000.43750.31251.00001.00000.56250.68750.00001.00000.43750.87500.93751.00000.37500.37501.00001.00000.5625
19900.43750.87500.50000.68750.68750.50000.50000.43750.50001.00000.50000.68750.50000.50000.50000.62500.62500.50000.81250.6250
20880.43750.93750.56250.68750.68750.00000.43750.62500.31251.00000.00000.68750.12500.06250.18750.62500.87500.00000.81250.6250
8670.25000.37500.18750.43750.25000.68750.68750.25000.37500.37500.68750.43750.56250.62500.68750.12500.31250.68750.81250.3125
16160.25000.68750.43750.68750.50000.62500.62500.50000.31250.56250.62500.68750.62500.56250.62500.18750.62500.62500.81250.0000
21360.43750.87500.50000.68750.68750.50000.50000.43750.50001.00000.50000.68750.50000.50000.50000.62500.68750.50000.81250.6250
20310.37500.68750.43750.68750.50000.62500.50000.50000.37500.75000.62500.68750.62500.56250.50000.37500.62500.62500.68750.3750
24590.43750.93750.56250.68750.68750.00000.43750.62500.31251.00000.00000.68750.12500.06250.18750.62500.87500.00000.81250.6250
2220.43750.93750.56250.68750.68750.00000.43750.62500.31251.00000.00000.68750.12500.06250.18750.62500.87500.00000.81250.6250
21930.37500.87500.50000.68750.68750.18750.43750.43750.31251.00000.18750.68750.18750.18750.18750.62500.68750.18750.81250.6250
15710.31250.43750.31250.56250.37500.75000.75000.37500.43750.37500.75000.56250.62500.68750.75000.12500.37500.75000.81250.2500
25920.12500.68750.37500.68750.50000.50000.50000.50000.18750.56250.50000.68750.50000.43750.50000.18750.62500.50000.81250.1250
19890.31250.62500.50000.75000.56250.75000.75000.50000.43750.56250.75000.75000.75000.68750.75000.31250.50000.75000.81250.1250
19170.12500.62500.31250.68750.50000.50000.50000.31250.18750.68750.50000.68750.50000.43750.50000.31250.43750.50000.81250.3125
6300.56250.56250.62500.81250.62500.93750.93750.68750.62500.43750.93750.81250.81250.87500.93750.37500.56250.93750.93750.3125
5320.25000.68750.43750.68750.50000.62500.62500.50000.31250.56250.62500.68750.62500.56250.62500.18750.62500.62500.81250.0000
8630.18750.62500.37500.68750.50000.62500.62500.37500.31250.56250.62500.68750.62500.56250.62500.18750.50000.62500.81250.1250
11020.56250.43750.50000.68750.50000.93750.93750.56250.62500.31250.93750.68750.81250.87500.93750.31250.43750.93750.93750.3125
14540.43750.93750.56250.68750.68750.00000.43750.62500.31251.00000.00000.68750.12500.06250.18750.62500.87500.00000.81250.6250
11740.43750.25000.12500.25000.06250.68750.68750.18750.50000.37500.68750.25000.56250.62500.68750.31250.25000.68750.81250.5000
2270.31250.68750.50000.68750.50000.68750.68750.56250.37500.56250.68750.68750.68750.62500.68750.25000.62500.68750.81250.0625
810.56250.37500.25000.12500.18750.56250.56250.31250.56250.43750.56250.12500.43750.56250.56250.43750.50000.56250.81250.6250
18050.18750.68750.37500.68750.50000.31250.43750.50000.06250.75000.31250.68750.31250.25000.31250.37500.62500.31250.81250.3750
7890.93750.31250.62500.25000.43750.93750.93750.62500.93750.43750.93750.25000.81250.93750.93750.75000.56250.93750.93750.9375
23610.56250.93750.56250.68750.68750.56250.12500.62500.56251.00000.56250.68750.56250.56250.37500.62500.87500.56250.25000.6250
560.06250.62500.31250.68750.50000.50000.50000.37500.18750.62500.50000.68750.50000.43750.50000.25000.50000.50000.81250.2500
6450.18750.56250.25000.56250.37500.56250.56250.37500.25000.43750.56250.56250.43750.50000.56250.06250.50000.56250.81250.1875
17210.18750.56250.18750.56250.37500.43750.43750.25000.25000.68750.43750.56250.31250.37500.43750.31250.43750.43750.81250.4375
14190.43750.93750.56250.68750.68750.12500.31250.62500.31251.00000.12500.68750.12500.12500.06250.62500.87500.12500.68750.6250
12070.68750.31250.37500.00000.18750.68750.68750.37500.68750.43750.68750.00000.56250.68750.68750.50000.50000.68750.81250.6875
2590.43750.93750.56250.68750.68750.43750.00000.62500.43751.00000.43750.68750.43750.43750.25000.62500.87500.43750.37500.6250
24130.43750.31250.31250.56250.37500.87500.87500.25000.56250.31250.87500.56250.75000.81250.87500.31250.12500.87500.87500.5000
20900.68750.31250.37500.00000.18750.68750.68750.37500.68750.43750.68750.00000.56250.68750.68750.50000.50000.68750.81250.6875
13370.06250.62500.31250.68750.50000.43750.43750.37500.12500.68750.43750.68750.43750.37500.43750.31250.50000.43750.81250.3125
18260.25000.50000.31250.56250.37500.68750.68750.31250.37500.43750.68750.56250.56250.62500.68750.12500.37500.68750.81250.2500
25030.37500.87500.50000.68750.68750.37500.43750.43750.37501.00000.37500.68750.37500.37500.37500.62500.62500.37500.81250.6250
1060.06250.68750.37500.68750.50000.43750.43750.50000.12500.56250.43750.68750.43750.37500.43750.18750.62500.43750.81250.1875
11810.37500.50000.18750.31250.31250.43750.43750.25000.43750.62500.43750.31250.31250.43750.43750.43750.43750.43750.81250.6250
18480.43750.87500.50000.68750.68750.50000.50000.43750.50001.00000.50000.68750.50000.50000.50000.62500.62500.50000.81250.6250
22030.62500.31250.31250.06250.18750.62500.62500.31250.62500.43750.62500.06250.50000.62500.62500.43750.50000.62500.81250.6250
17450.68750.06250.43750.25000.25000.93750.93750.50000.68750.18750.93750.25000.81250.87500.93750.50000.31250.93750.93750.6875
2780.43750.93750.56250.68750.68750.00000.43750.62500.31251.00000.00000.68750.12500.06250.18750.62500.87500.00000.81250.6250
19800.68750.31250.37500.00000.18750.68750.68750.37500.68750.43750.68750.00000.56250.68750.68750.50000.50000.68750.81250.6875
7870.43750.81250.43750.56250.56250.12500.43750.50000.31250.87500.12500.56250.00000.12500.18750.50000.75000.12500.81250.6250
11200.37500.87500.50000.68750.62500.06250.43750.56250.25000.93750.06250.68750.12500.00000.18750.56250.81250.06250.81250.5625
590.43750.93750.56250.68750.68750.18750.25000.62500.31251.00000.18750.68750.18750.18750.00000.62500.87500.18750.62500.6250
6290.25000.50000.25000.50000.31250.62500.62500.31250.31250.37500.62500.50000.50000.56250.62500.00000.43750.62500.81250.1875
22970.56250.25000.31250.50000.31250.87500.87500.25000.62500.37500.87500.50000.75000.81250.87500.43750.00000.87500.87500.6250
7750.43750.93750.56250.68750.68750.00000.43750.62500.31251.00000.00000.68750.12500.06250.18750.62500.87500.00000.81250.6250
25220.81250.93750.81250.81250.81250.81250.37500.81250.81251.00000.81250.81250.81250.81250.62500.81250.87500.81250.00000.8125
7190.25000.68750.43750.68750.50000.62500.62500.50000.31250.56250.62500.68750.62500.56250.62500.18750.62500.62500.81250.0000

Q2. Plot the pairs of sequences

  • Plot the top 5 sequences that are the most similar to sequence 1

  • Plot the top 5 sequences that are the least similar to sequence 1

most_sim <- head(which(biofam.seq.LCS==min(biofam.seq.LCS), arr.ind=T))
most_sim
least_sim <- head(which(biofam.seq.LCS==max(biofam.seq.LCS), arr.ind=T))
least_sim
A matrix: 6 × 2 of type int
rowcol
1167 11
1746 391
25772841
4787671
4808051
23809191
A matrix: 6 × 2 of type int
rowcol
2304 4431
2395 6021
909 8721
82114611
76915461
101217231
# BEGIN SOLUTION
seqiplot(biofam.seq[c(1,most_sim[,1]),])
seqiplot(biofam.seq[c(1,least_sim[,1]),])

# END SOLUTION
../../_images/Lab2_13_0.png ../../_images/Lab2_13_1.png

Q3. Create a dissimilarity matrix using optimal matching using transition rates as substitution cost matrix

biofam.seq.subcost <- NULL
biofam.seq.OM <- NULL

# BEGIN SOLUTION
biofam.seq.subcost <- seqcost(biofam.seq, method = "TRATE")
biofam.seq.OM <- seqdist(biofam.seq, method='OM', sm=biofam.seq.subcost$sm)
biofam.seq.OM

# END SOLUTION
 [>] creating substitution-cost matrix using transition rates ...
 [>] computing transition probabilities for states Parent/Left/Married/Left+Marr/Child/Left+Child/Left+Marr+Child/Divorced ...
 [>] 2000 sequences with 8 distinct states
 [>] checking 'sm' (size and triangle inequality)
 [>] 537 distinct  sequences 
 [>] min/max sequence lengths: 16/16
 [>] computing distances using the OM metric
 [>] elapsed time: 0.118 secs
A matrix: 2000 × 2000 of type dbl
11675141013275258077311874720911846278198078711205962922977752522719
1167 0.00000019.563325 9.89083121.56241115.63032613.90139613.9229513.437363 3.95704919.30929913.90139621.56241113.87838111.91252713.901428 7.63680517.24179013.90139625.83089 7.767166
51419.563325 0.00000011.727078 9.916578 7.89083129.37781429.5085213.32819421.254715 5.77792029.377814 9.91657825.56332527.38894529.37784615.482360 7.69514829.37781429.8360321.414446
1013 9.89083111.727078 0.00000011.739494 5.80740917.66186617.84715 3.80055811.55958215.58222117.66186611.73949413.83624715.67299717.661899 7.755282 9.94541617.66186625.8949413.676236
27521.562411 9.91657811.739494 0.000000 5.93208621.39957221.7816611.64987221.55930313.95472421.399572 0.00000017.50874121.43151921.56332515.78694915.64367221.39957225.9693121.719034
258015.630326 7.890831 5.807409 5.932086 0.00000021.52993321.74830 7.57912915.627218 9.94541621.529933 5.93208617.67249419.54106421.529966 9.854863 9.83315521.52993325.9359515.786949
77313.90139629.37781417.66186621.39957221.529933 0.00000013.8925919.609357 9.94434731.430826 0.00000021.399572 3.890831 1.988869 5.95396819.75833227.500692 0.00000025.8005319.888693
118713.92295529.50852017.84714621.78166321.74830313.892592 0.0000019.78303013.91371231.43090113.89259221.78166313.92328013.896816 7.93862419.75840727.56521713.89259211.9079419.888768
4713.43736313.328194 3.80055811.649872 7.57912919.60935719.78303 0.00000015.12875217.18333619.60935711.64987215.83624717.62048819.630873 9.356397 7.94541619.60935725.9007515.288483
2091 3.95704921.25471511.55958221.55930315.627218 9.94434713.9137115.128752 0.00000021.486480 9.94434721.559303 9.921332 7.955477 9.944379 9.81398619.377592 9.94434725.82165 9.944347
184619.309299 5.77792015.58222113.954724 9.94541631.43082631.4309017.18333621.486480 0.00000031.43082613.95472427.61791029.44195731.43085911.67249411.20908331.43082631.7584117.604580
199012.77152726.56987415.01339921.16886720.82080815.74544115.8959013.24168014.90733030.08380815.74544121.16886715.64223615.57781915.80992418.41131418.87472615.74544125.8406718.382202
208813.90139629.37781417.66186621.39957221.529933 0.00000013.8925919.609357 9.94434731.430826 0.00000021.399572 3.890831 1.988869 5.95396819.75833227.500692 0.00000025.8005319.888693
867 7.78166311.781663 6.00000013.780749 7.84866321.68305921.70462 7.60111611.73871211.52763621.68305913.78074917.89083119.69418921.683091 3.745974 9.49194721.68305925.89227 9.678059
1616 7.76716621.41444613.67623621.71903415.78694919.88869319.8887715.288483 9.94434717.60458019.88869321.71903419.86567817.89982419.888725 5.93208618.93861819.88869325.84277 0.000000
213613.57684327.20688915.54046321.58508821.23792115.90454015.8998713.87869515.38953730.94072615.90454021.58508815.80133515.73691815.87594619.26823221.63462315.90454025.8446519.349071
203111.95717621.71675313.78521421.85415015.92277819.92478915.9459915.59079011.96324423.64418619.92478921.85415019.90177417.93592015.98888011.97169219.76101119.92478921.8713611.953487
245913.90139629.37781417.66186621.39957221.529933 0.00000013.8925919.609357 9.94434731.430826 0.00000021.399572 3.890831 1.988869 5.95396819.75833227.500692 0.00000025.8005319.888693
22213.90139629.37781417.66186621.39957221.529933 0.00000013.8925919.609357 9.94434731.430826 0.00000021.399572 3.890831 1.988869 5.95396819.75833227.500692 0.00000025.8005319.888693
219311.60111627.05527215.32819421.31305820.964999 5.90454013.9076513.727078 9.37941230.865892 5.90454021.313058 5.801335 5.736918 5.96902319.19339821.640674 5.90454025.8155819.323759
1571 9.83624713.83624710.00000017.75811111.82602523.71538223.7154611.60111613.77103511.67560223.71538217.75811119.94541621.72651223.715414 4.00000011.29250523.71538225.88537 7.668751
2592 3.78942721.35055311.65542121.65514215.72305615.91095515.9110315.224591 5.96660817.54068715.91095521.65514215.88794013.92208515.910987 5.86819318.87472615.91095525.83432 3.977739
1989 9.96818019.75282515.91657823.62941217.69732623.82505323.8251315.80055813.88070717.54690323.82505323.62941223.80203921.83618423.825086 9.81051715.49194723.82505325.85281 3.936360
1917 3.76873819.451065 9.77857121.47278915.54070315.84888715.89193 9.890831 5.90454021.08101915.84888721.47278915.80133513.86001815.848919 9.40852513.47305215.84888725.83670 9.379412
63017.62460517.61945319.73264325.79625719.86417129.74613329.7462121.33375919.80178613.80958729.74613325.79625726.00000027.75726329.74616511.97736217.11167929.74613329.74627 9.857439
532 7.76716621.41444613.67623621.71903415.78694919.88869319.8887715.288483 9.94434717.60458019.88869321.71903419.86567817.89982419.888725 5.93208618.93861819.88869325.84277 0.000000
863 5.96818019.64365611.91657821.59746615.66538019.84731519.8473911.800558 9.90296817.48301119.84731521.59746619.82430017.85844519.847347 5.81051715.38277819.84731525.84436 3.601116
110217.53769813.66472915.77792021.84153315.90944729.65922529.6593017.37903519.714879 9.85486329.65922521.84153325.89083127.67035629.659257 9.94541613.15695529.65922529.76851 9.770532
145413.90139629.37781417.66186621.39957221.529933 0.00000013.8925919.609357 9.94434731.430826 0.00000021.399572 3.890831 1.988869 5.95396819.75833227.500692 0.00000025.8005319.888693
117413.781663 7.836247 3.890831 7.848663 1.91657821.55269821.73798 5.66255115.45041411.69138921.552698 7.84866317.72707819.56382821.552730 9.678059 7.91657821.55269825.9256315.610145
227 9.75603521.44639215.66510621.75098015.81889521.87756321.8776417.27735211.93321617.63652621.87756321.75098021.85454819.88869321.877595 7.92095518.97056421.87756325.84699 1.988869
8117.67158011.945416 7.848663 3.890831 5.97736217.50874117.890831 9.72709417.66847214.00000017.508741 3.89083113.61791017.54068717.67249413.86417115.74973317.50874125.93862419.785126
1805 5.83899421.45415711.57829921.57802015.645934 9.95440813.92483215.328194 1.88194423.368424 9.95440821.578020 9.931393 7.965539 9.95444011.69593019.577034 9.95440825.83276811.826291
78929.344074 9.82602519.521157 7.78166313.71374829.18123529.56332519.43153529.34096613.86417129.181235 7.78166325.29040429.21318129.34498823.56861117.52117329.18123529.89083129.500697
236117.89226729.61768917.95631521.89083121.85747217.861904 3.96931219.89219917.88302431.54007017.86190421.89083117.89259217.86612811.90793619.86757627.64129617.861904 7.93862419.888790
56 1.96818019.534487 9.86199321.53357315.60148815.86957615.89113511.691389 5.92523019.28046115.86957621.53357315.83315513.88070715.869608 7.60796715.27361015.86957625.835909 7.578854
645 5.69138917.427776 7.73264317.73236411.80027917.81291717.81299211.301813 7.86857013.61791017.81291717.73236414.00000015.82404717.812949 1.94541615.18333617.81291725.860785 5.943593
1721 5.80055817.363883 5.80740917.54690311.61481813.79380013.945989 7.636805 7.49194721.21902513.79380017.54690310.00000011.80493013.793832 9.54653113.78166313.79380025.85392513.522447
141913.90141829.37783617.66188821.50874121.529955 3.969312 9.92328019.609379 9.94436831.430848 3.96931221.508741 4.000000 3.973536 1.98465619.75835427.500713 3.96931221.83121619.888715
120721.562411 9.91657811.739494 0.000000 5.93208621.39957221.78166311.64987221.55930313.95472421.399572 0.00000017.50874121.43151921.56332515.78694915.64367221.39957225.96931221.719034
25913.92295529.50852017.84714621.78166321.74830313.892592 0.00000019.78303013.91371231.43090113.89259221.78166313.92328013.896816 7.93862419.75840727.56521713.89259211.90793619.888768
241313.663439 9.77857110.00000017.68181811.74973327.56483527.586393 8.00000017.620488 9.52454427.56483517.68181823.83624725.57596527.564867 9.691389 3.80055827.56483527.84771915.160698
209021.562411 9.91657811.739494 0.000000 5.93208621.39957221.78166311.64987221.55930313.95472421.399572 0.00000017.50874121.43151921.56332515.78694915.64367221.39957225.96931221.719034
1337 1.80055819.479903 9.80740921.50162715.56954113.88070713.92374911.636805 3.93636021.10985713.88070721.50162713.83315511.89183813.880739 9.43736315.44123213.88070725.831685 9.567724
1826 7.85901115.698240 9.91657817.67468811.74260221.74927721.749352 9.80055811.80493013.56023321.74927717.67468818.00000019.76040721.749309 3.91657811.43736321.74927725.870821 7.523893
250311.00278926.62755015.07107621.22654420.87848511.80908113.92270213.29935610.97097030.30095811.80908121.22654411.70587511.64145811.87356318.62846419.27361011.80908125.83063818.758825
106 1.80055821.31860711.62347521.62319615.69111013.92208513.92216015.192644 3.97773917.50874113.92208521.62319613.89907011.93321613.922117 5.83624719.04234813.92208525.830097 5.966608
118112.00000015.781663 6.000000 9.698240 9.69138913.64067413.989674 7.91657813.60083719.63680513.640674 9.698240 9.78166313.47305213.77133713.73264314.00000013.64067425.89761019.631337
184812.77152726.56987415.01339921.16886720.82080815.74544115.89590113.24168014.90733030.08380815.74544121.16886715.64223615.57781915.80992418.41131418.87472615.74544125.84067518.382202
220319.61699610.000000 9.794079 1.945416 5.95472419.45415719.836247 9.70445619.61388813.97736219.454157 1.94541615.56332519.48610319.61791013.84153315.66631019.45415725.95396819.773619
174521.434627 1.91657813.473052 8.000000 7.83624729.35505029.51884615.24477221.431519 5.95472429.355050 8.00000025.50874127.36618029.35508215.659164 9.61172629.35505029.84635221.591249
27813.90139629.37781417.66186621.39957221.529933 0.00000013.89259219.609357 9.94434731.430826 0.00000021.399572 3.890831 1.988869 5.95396819.75833227.500692 0.00000025.80052819.888693
198021.562411 9.91657811.739494 0.000000 5.93208621.39957221.78166311.64987221.55930313.95472421.399572 0.00000017.50874121.43151921.56332515.78694915.64367221.39957225.96931221.719034
78713.87838125.56332513.83624717.50874117.672494 3.89083113.92328015.836247 9.92133227.617910 3.89083117.508741 0.000000 3.922778 5.98465615.94541623.781663 3.89083125.83121619.865678
112011.91252727.38894515.67299721.43151919.541064 1.98886913.89681617.620488 7.95547729.441957 1.98886921.431519 3.922778 0.000000 5.95819217.76946325.511822 1.98886925.80475217.899824
5913.90142829.37784617.66189921.56332521.529966 5.953968 7.93862419.630873 9.94437931.430859 5.95396821.563325 5.984656 5.958192 0.00000019.75836527.500724 5.95396819.84656019.888725
629 7.63680515.482360 7.75528215.786949 9.85486319.75833219.758407 9.356397 9.81398611.67249419.75833215.78694915.94541617.76946319.758365 0.00000013.23792119.75833225.876128 5.932086
229717.241790 7.695148 9.94541615.643672 9.83315527.50069227.565217 7.94541619.37759211.20908327.50069215.64367223.78166325.51182227.50072413.237921 0.00000027.50069227.85963318.938618
77513.90139629.37781417.66186621.39957221.529933 0.00000013.89259219.609357 9.94434731.430826 0.00000021.399572 3.890831 1.988869 5.95396819.75833227.500692 0.00000025.80052819.888693
252225.83089129.83602625.89493925.96931225.93595225.80052811.90793625.90075125.82164831.75840725.80052825.96931225.83121625.80475219.84656025.87612827.85963325.800528 0.00000025.842769
719 7.76716621.41444613.67623621.71903415.78694919.88869319.88876815.288483 9.94434717.60458019.88869321.71903419.86567817.89982419.888725 5.93208618.93861819.88869325.842769 0.000000
?seqdist

Q4. Perform an agglomerative clustering using ward linkage method

  • You should use the dissimilarity matrix with optimal matching generated in Q3

  • Plot the dendogram

clusterward <- NULL

# BEGIN SOLUTION

clusterward <- agnes(biofam.seq.OM, diss = TRUE, method = "ward")

# Run this to generate the dendogram
plot(clusterward, which.plot=2)

# END SOLUTION
../../_images/Lab2_22_0.png

Q5: Select clusters

  • Cut the dendogram tree as appropriate using the cutree() function

  • List the number of observations in each cluster

  • Plot the sequence frequency by cluster membership (hint: seqfplot())

  • Plot the state distribution by cluster membership (hint: seqdplot())

# BEGIN SOLUTION

# cut the dendogram tree to generate two clusters
cluster5 <- cutree(clusterward, k = 5)

# check the number of observations in each cluster
table(cluster5)

# plot sequence frequency by cluster membership
seqfplot(biofam.seq, group = cluster5, pbarw = T)

# plot state distribution by cluster membership
seqdplot(biofam.seq, group = cluster5)

# END SOLUTION


s1 = Read (2) - Write (2)
s2 = Read(5) - Write (5)
cluster5
  1   2   3   4   5 
392 770 442 168 228 
../../_images/Lab2_26_1.png
Error in Read(2): could not find function "Read"
Traceback:
../../_images/Lab2_26_3.png
?seqcost()
seqcost {TraMineR}R Documentation

Generate substitution and indel costs

Description

The function seqcost proposes different ways to generate substitution costs (supposed to represent state dissimilarities) and possibly indel costs. Proposed methods are: "CONSTANT" (same cost for all substitutions), "TRATE" (derived from the observed transition rates), "FUTURE" (Chi-squared distance between conditional state distributions lag positions ahead), "FEATURES" (Gower distance between state features), "INDELS", "INDELSLOG" (based on estimated indel costs). The substitution-cost matrix is intended to serve as sm argument in the seqdist function that computes distances between sequences. seqsubm is an alias that returns only the substitution cost matrix, i.e., no indel.

Usage

seqcost(seqdata, method, cval = NULL, with.missing = FALSE, miss.cost = NULL,
  time.varying = FALSE, weighted = TRUE, transition = "both", lag = 1,
  miss.cost.fixed = NULL, state.features = NULL, feature.weights = NULL,
  feature.type = list(), proximities = FALSE)

seqsubm(...)

Arguments

seqdata

A sequence object as returned by the seqdef function.

method

String. How to generate the costs. One of "CONSTANT" (same cost for all substitutions), "TRATE" (derived from the observed transition rates), "FUTURE" (Chi-squared distance between conditional state distributions lag positions ahead), "FEATURES" (Gower distance between state features), "INDELS", "INDELSLOG" (based on estimated indel costs).

cval

Scalar. For method "CONSTANT", the single substitution cost.
For method "TRATE", a base value from which transition probabilities are subtracted.
If NULL, cval=2 is used, unless transition is "both" and time.varying is TRUE, in which case cval=4.

with.missing

Logical. Should an additional entry be added in the matrix for the missing states? If TRUE, the ‘missing’ state is also added to the alphabet. Set as TRUE if you want to use the costs for distances between sequences containing non deleted (non void) missing values. Forced as FALSE when there are no non-void missing values in seqdata. See Gabadinho et al. (2010) for more details on the options for handling missing values when creating the state sequence object with seqdef.

miss.cost

Scalar or vector. Cost for substituting the missing state. Default is cval.

miss.cost.fixed

Logical. Should the substitution cost for missing be set as the miss.cost value. When NULL (default) it will be set as FALSE when method = "INDELS" or "INDELSLOG", and TRUE otherwise.

time.varying

Logical. If TRUE return an array with a distinct matrix for each time unit. Time is the third dimension (subscript) of the returned array. Time varying works only with method='CONSTANT', 'TRATE', 'INDELS', and 'INDELSLOG'.

weighted

Logical. Should weights in seqdata be used when applicable?

transition

String. Only used if method="TRATE" and time.varying=TRUE. On which transition are rates based? Should be one of "previous" (from previous state), "next" (to next state) or "both".

lag

Integer. For methods TRATE and FUTURE only. Time ahead to which transition rates are computed (default is lag=1).

state.features

Data frame with features values for each state.

feature.weights

Vector of feature weights with length equal to the number of columns of state.features.

feature.type

List of feature types. See daisy for details.

proximities

Logical: should state proximities be returned instead of substitution costs?

...

Arguments passed to seqcost

Details

The substitution-cost matrix has dimension ns*ns, where ns is the number of states in the alphabet of the sequence object. The element (i,j) of the matrix is the cost of substituting state i with state j. It represents the dissimilarity between the states i and j. The indel cost of the cost of inserting or deleting a state.

With method CONSTANT, the substitution costs are all set equal to the cval value, the default value being 2.

With method TRATE (transition rates), the transition probabilities between all pairs of states is first computed (using the seqtrate function). Then, the substitution cost between states i and j is obtained with the formula

SC(i,j) = cval - P(i|j) -P(j|i)

where P(i|j) is the probability of transition from state j to i lag positions ahead. Default cval value is 2. When time.varying=TRUE and transition="both", the substitution cost at position t is set as

SC(i,j,t) = cval - P(i|j,t-1) -P(j|i,t-1) - P(i|j,t) - P(j|i,t)

where P(i|j,t-1) is the probability to transit from state j at t-1 to i at t. Here, the default cval value is 4.

With method FUTURE, the cost between i and j is the Chi-squared distance between the vector (d(alphabet | i)) of probabilities of transition from states i and j to all the states in the alphabet lag positions ahead:

SC(i,j) = ChiDist(d(alphabet | i), d(alphabet | j))

With method FEATURES, each state is characterized by the variables state.features, and the cost between i and j is computed as the Gower distance between their vectors of state.features values.

With methods INDELS and INDELSLOG, values of indels are first derived from the state relative frequencies f_i. For INDELS, indel_i = 1/f_i is used, and for INDELSLOG, indel_i = log[2/(1 + f_i)]. Substitution costs are then set as SC(i,j) = indel_i + indel_j.

For all methods but INDELS and INDELSLOG, the indel is set as max(sm)/2 when time.varying=FALSE and as 1 otherwise.

Value

For seqcost, a list of two elements, indel and sm or prox:

indel

The indel cost. Either a scalar or a vector of size ns. When time.varying=TRUE and method is one of "INDELS" or "INDELSLOG", a matrix with indels per time point in columns.

sm

The substitution-cost matrix (or array) when proximities = FALSE (default).

prox

The state proximity matrix when proximities = TRUE.

sm and prox are, when time.varying = FALSE, a matrix of size ns * ns, where ns is the number of states in the alphabet of the sequence object. When time.varying = TRUE, they are a three dimensional array of size ns * ns * L, where L is the maximum sequence length.

For seqsubm, only one element, the matrix (or array) sm.

Author(s)

Gilbert Ritschard and Matthias Studer (and Alexis Gabadinho for first version of seqsubm)

References

Gabadinho, A., G. Ritschard, N. S. Müller and M. Studer (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software 40(4), 1-37.

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2010). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.

Studer, M. & Ritschard, G. (2016), "What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures", Journal of the Royal Statistical Society, Series A. 179(2), 481-511. doi: 10.1111/rssa.12125

Studer, M. and G. Ritschard (2014). "A Comparative Review of Sequence Dissimilarity Measures". LIVES Working Papers, 33. NCCR LIVES, Switzerland, 2014. doi: 10.12682/lives.2296-1658.2014.33

See Also

seqtrate, seqdef, seqdist.

Examples

## Defining a sequence object with columns 10 to 25
## of a subset of the 'biofam' example data set.
data(biofam)
biofam.seq <- seqdef(biofam[501:600,10:25])

## Indel and substitution costs based on log of inverse state frequencies
lifcost <- seqcost(biofam.seq, method="INDELSLOG")
## Here lifcost$indel is a vector
biofam.om <- seqdist(biofam.seq, method="OM", indel=lifcost$indel, sm=lifcost$sm)

## Optimal matching using transition rates based substitution-cost matrix
## and the associated indel cost
## Here trcost$indel is a scalar
trcost <- seqcost(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", indel=trcost$indel, sm=trcost$sm)

## Using costs based on FUTURE with a forward lag of 4
fucost <- seqcost(biofam.seq, method="FUTURE", lag=4)
biofam.om <- seqdist(biofam.seq, method="OM", indel=fucost$indel, sm=fucost$sm)

## Optimal matching using a unique substitution cost of 2
## and an insertion/deletion cost of 3
ccost <- seqsubm(biofam.seq, method="CONSTANT", cval=2)
biofam.om.c2 <- seqdist(biofam.seq, method="OM",indel=3, sm=ccost)

## Displaying the distance matrix for the first 10 sequences
biofam.om.c2[1:10,1:10]

## =================================
## Example with weights and missings
## =================================
data(ex1)
ex1.seq <- seqdef(ex1[,1:13], weights=ex1$weights)

## Unweighted
subm <- seqcost(ex1.seq, method="INDELSLOG", with.missing=TRUE, weighted=FALSE)
ex1.om <- seqdist(ex1.seq, method="OM", indel=subm$indel, sm=subm$sm, with.missing=TRUE)

## Weighted
subm.w <- seqcost(ex1.seq, method="INDELSLOG", with.missing=TRUE, weighted=TRUE)
ex1.omw <- seqdist(ex1.seq, method="OM", indel=subm.w$indel, sm=subm.w$sm, with.missing=TRUE)

ex1.om == ex1.omw

[Package TraMineR version 2.2-3 ]