Lecture 2: Clustering of sequential data in education¶

Quan Nguyen, Department of Statistics, University of British Columbia

library(tidyverse)
library(TraMineR)
library(cluster)

── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.5     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   2.0.1     ✔ forcats 0.5.1

Warning message:
“package ‘tibble’ was built under R version 4.1.1”

Warning message:
“package ‘readr’ was built under R version 4.1.1”

── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Warning message:
“package ‘TraMineR’ was built under R version 4.1.1”

TraMineR stable version 2.2-3 (Built: 2022-01-26)

Website: http://traminer.unige.ch

Please type 'citation("TraMineR")' for citation information.

Warning message:
“package ‘cluster’ was built under R version 4.1.1”

1. Similarities and distances between sequences¶

Similarity between sequences can be measured via

Common attributes: The more common attributes, the more similar
Edit distance: The lower the edit cost, the more similar

data(famform)
famform.seq <- seqdef(famform)
print(famform.seq)

 [>] found missing values ('NA') in sequence data

 [>] preparing 5 sequences

 [>] coding void elements with '%' and missing values with '*'

 [>] 5 distinct states appear in the data:

     1 = M

     2 = MC

     3 = S

     4 = SC

     5 = U

 [>] state coding:

       [alphabet]  [label]  [long label]

     1  M           M        M

     2  MC          MC       MC

     3  S           S        S

     4  SC          SC       SC

     5  U           U        U

 [>] 5 sequences in the data set

 [>] min/max sequence length: 2/5

    Sequence   
[1] S-U        
[2] S-U-M      
[3] S-U-M-MC   
[4] S-U-M-MC-SC
[5] U-M-MC     

Common attributes¶

Longest Common Prefix (LCP) distances¶

[1] S-U
[2] S-U-M

Longest Common Prefix = [S-U]

Similarity = 2
Dismilarity = 1

seqdist(famform.seq, method='LCP')

 [>] 5 sequences with 5 distinct states

 [>] 5 distinct  sequences

 [>] min/max sequence lengths: 2/5

 [>] computing distances using the LCP metric

 [>] elapsed time: 0.007 secs

A matrix: 5 × 5 of type dbl
	[1]	[2]	[3]	[4]	[5]
[1]	0	1	2	3	5
[2]	1	0	1	2	6
[3]	2	1	0	1	7
[4]	3	2	1	0	8
[5]	5	6	7	8	0

We can also produce a normalized version of LCP that is insensitive to the length of the sequences

print(famform.seq)
seqdist(famform.seq, method='LCP', norm="auto")

    Sequence   
[1] S-U        
[2] S-U-M      
[3] S-U-M-MC   
[4] S-U-M-MC-SC
[5] U-M-MC     

 [>] 5 sequences with 5 distinct states

 [>] 5 distinct  sequences

 [>] min/max sequence lengths: 2/5

 [>] computing distances using the LCP gmean normalized metric

 [>] elapsed time: 0.007 secs

A matrix: 5 × 5 of type dbl
	[1]	[2]	[3]	[4]	[5]
[1]	0.0000000	0.1835034	0.2928932	0.3675445	1
[2]	0.1835034	0.0000000	0.1339746	0.2254033	1
[3]	0.2928932	0.1339746	0.0000000	0.1055728	1
[4]	0.3675445	0.2254033	0.1055728	0.0000000	1
[5]	1.0000000	1.0000000	1.0000000	1.0000000	0

Longest common suffix (“RLCP”)¶

[3] S-U-M-MC
[5] U-M-MC

Longest Common Suffix = [U-M-MC]

Similarity = 3
Dismilarity = 1

print(famform.seq)
seqdist(famform.seq, method='RLCP')

    Sequence   
[1] S-U        
[2] S-U-M      
[3] S-U-M-MC   
[4] S-U-M-MC-SC
[5] U-M-MC     

 [>] 5 sequences with 5 distinct states

 [>] 5 distinct  sequences

 [>] min/max sequence lengths: 2/5

 [>] computing distances using the RLCP metric

 [>] elapsed time: 0.004 secs

A matrix: 5 × 5 of type dbl
	[1]	[2]	[3]	[4]	[5]
[1]	0	5	6	7	5
[2]	5	0	7	8	6
[3]	6	7	0	9	1
[4]	7	8	9	0	8
[5]	5	6	1	8	0

Longest Common Subsequence (LCS) distances¶

Let’s take a look at this example:

[4] S-U-M-MC-SC
[5] U-M-MC

Longest Common Subsequence = [U-M-MC]

Similarity = 3
Dismilarity = 2

print(famform.seq)
seqdist(famform.seq, method='LCS')

    Sequence   
[1] S-U        
[2] S-U-M      
[3] S-U-M-MC   
[4] S-U-M-MC-SC
[5] U-M-MC     

 [>] 5 sequences with 5 distinct states

 [>] creating a 'sm' with a substitution cost of 2

 [>] creating 5x5 substitution-cost matrix using 2 as constant value

 [>] 5 distinct  sequences

 [>] min/max sequence lengths: 2/5

 [>] computing distances using the LCS metric

 [>] elapsed time: 0.006 secs

A matrix: 5 × 5 of type dbl
	[1]	[2]	[3]	[4]	[5]
[1]	0	1	2	3	3
[2]	1	0	1	2	2
[3]	2	1	0	1	1
[4]	3	2	1	0	2
[5]	3	2	1	2	0

# Normalized version

seqdist(famform.seq, method='LCS', norm = 'auto')

 [>] 5 sequences with 5 distinct states

 [>] creating a 'sm' with a substitution cost of 2

 [>] creating 5x5 substitution-cost matrix using 2 as constant value

 [>] 5 distinct  sequences

 [>] min/max sequence lengths: 2/5

 [>] computing distances using the LCS gmean normalized metric

 [>] elapsed time: 0.005 secs

A matrix: 5 × 5 of type dbl
	[1]	[2]	[3]	[4]	[5]
[1]	0.0000000	0.1835034	0.2928932	0.3675445	0.5917517
[2]	0.1835034	0.0000000	0.1339746	0.2254033	0.3333333
[3]	0.2928932	0.1339746	0.0000000	0.1055728	0.1339746
[4]	0.3675445	0.2254033	0.1055728	0.0000000	0.2254033
[5]	0.5917517	0.3333333	0.1339746	0.2254033	0.0000000

Dealing with missing values¶

s1 <- c("a","b","c","d",NA,NA)
s2 <- c("a","b",NA,"c","d",NA)
s3 <- c("a","b",NA,"c","d","a")

df <- data.frame(rbind(s1,s2,s3))
colnames(df) <- c(1990:1995)
df.seq <- seqdef(df)
print(df.seq)

 [>] found missing values ('NA') in sequence data

 [>] preparing 3 sequences

 [>] coding void elements with '%' and missing values with '*'

 [>] 4 distinct states appear in the data:

     1 = a

     2 = b

     3 = c

     4 = d

 [>] state coding:

       [alphabet]  [label]  [long label]

     1  a           a        a

     2  b           b        b

     3  c           c        c

     4  d           d        d

 [>] 3 sequences in the data set

 [>] min/max sequence length: 4/6

   Sequence   
s1 a-b-c-d    
s2 a-b-*-c-d  
s3 a-b-*-c-d-a

To compute LCS distances between sequences containing gaps, one can use the with.miss=TRUE option. In that case, missing states are considered as an additional valid state.

Let’s look at s1 and s2

LCS = a-b or c-d

Similarity = 2
Dismilarity = 1

seqdist(df.seq, method='LCS', with.miss=TRUE)

 [>] including missing values as an additional state

 [>] 3 sequences with 5 distinct states

 [>] creating a 'sm' with a substitution cost of 2

 [>] creating 5x5 substitution-cost matrix using 2 as constant value

 [>] 3 distinct  sequences

 [>] min/max sequence lengths: 4/6

 [>] computing distances using the LCS metric

 [>] elapsed time: 0.015 secs

A matrix: 3 × 3 of type dbl
	s1	s2	s3
s1	0	1	2
s2	1	0	1
s3	2	1	0

# ?seqdist

Edit distance¶

Optimal matching (OM) distances¶

Optimal matching generates edit distances that are the minimal cost, in terms of insertions, deletions and substitutions, for transforming one sequence into another.

How do we transform the following sequence 1 to sequence 2

s1 = {N,A,T,I,O,N}
s2 = {F,A,S,H,I,O,N}

Strategy 1:

Substitute N with F (NATION -> FATION)
Insert S (FATION -> FASTION)
Substitute T with H (FASTION -> FASHION)

Strategy 2:

Delete N (NATION -> ATION)
Insert F (ATION -> FATION)
Delete T (FATION -> FAION)
Insert S (FAION -> FASION)
Insert H (FASION -> FASHION)

Assuming there’s a cost for each action (i.e., substitution, insertion/deletion). What would be the optimal strategy that minizes the total cost? –> Optimal Matching algorithm

The insertion/deletion cost

A constant value specified by user (default = 1)

The substitution-cost matrix

A constant value (default)
Based on transition rates between different states

Let’s have a look at our famform data. We generate a substitution cost matrix with constant value of 2

print(famform.seq)
sub_cost <- seqcost(famform.seq, method = "CONSTANT", cval = 2)
sub_cost

    Sequence   
[1] S-U        
[2] S-U-M      
[3] S-U-M-MC   
[4] S-U-M-MC-SC
[5] U-M-MC     

 [>] creating 5x5 substitution-cost matrix using 2 as constant value

$indel

1

$sm

A matrix: 5 × 5 of type dbl
	M	MC	S	SC	U
M	0	2	2	2	2
MC	2	0	2	2	2
S	2	2	0	2	2
SC	2	2	2	0	2
U	2	2	2	2	0

We compute the distances using the matrix and the default indel cost of 1

famform.seq_OM <- seqdist(famform.seq, method = "OM", sm = sub_cost$sm)
famform.seq_OM

 [>] 5 sequences with 5 distinct states

 [>] checking 'sm' (size and triangle inequality)

 [>] 5 distinct  sequences

 [>] min/max sequence lengths: 2/5

 [>] computing distances using the OM metric

 [>] elapsed time: 0.004 secs

A matrix: 5 × 5 of type dbl
	[1]	[2]	[3]	[4]	[5]
[1]	0	1	2	3	3
[2]	1	0	1	2	2
[3]	2	1	0	1	1
[4]	3	2	1	0	2
[5]	3	2	1	2	0

Minimal cost to transform [1] S-U to [5] U-M-MC

Delete S (cost 1)
Insert M (cost 1)
Insert MC (cost 1)

Total cost = 3

famform.seq_OM[1,5]

3

Substitution cost matrix based on transition rates between different states using method = "TRATE"

print(famform.seq)
sub_cost_TRATE <- seqcost(famform.seq, method = "TRATE", cval = 2)
sub_cost_TRATE
seqdist(famform.seq, method = "OM", sm = sub_cost_TRATE$sm)

    Sequence   
[1] S-U        
[2] S-U-M      
[3] S-U-M-MC   
[4] S-U-M-MC-SC
[5] U-M-MC     

 [>] creating substitution-cost matrix using transition rates ...

 [>] computing transition probabilities for states M/MC/S/SC/U ...

$indel

1

$sm

A matrix: 5 × 5 of type dbl
	M	MC	S	SC	U
M	0	1	2	2	1
MC	1	0	2	1	2
S	2	2	0	2	1
SC	2	1	2	0	2
U	1	2	1	2	0

 [>] 5 sequences with 5 distinct states

 [>] checking 'sm' (size and triangle inequality)

 [>] 5 distinct  sequences

 [>] min/max sequence lengths: 2/5

 [>] computing distances using the OM metric

 [>] elapsed time: 0.007 secs

A matrix: 5 × 5 of type dbl
	[1]	[2]	[3]	[4]	[5]
[1]	0	1	2	3	3
[2]	1	0	1	2	2
[3]	2	1	0	1	1
[4]	3	2	1	0	2
[5]	3	2	1	2	0

Check the documentation for other methods of generating substitution costs ?seqcost

?seqcost

# ?seqdist

2. Agglomerative clustering of sequences¶

Some background on hierarchical clustering:

Broadly speaking, there are two types:

Agglomerative clustering: Commonly referred to as AGNES (AGglomerative NESting)
- Bottom-up approach
- The two data points that are the most similar are combined into a new bigger cluster (nodes). This procedure is iterated until all points are a member of just one single big cluster (root)
Divisive hierarchical clustering: Commonly referred to as DIANA (DIvise ANAlysis)
- Top-down approach
- Start with one single big cluster (root), at each iteration, we split the cluster into two least similar clusters, and repeat until each cluster contains a single data point

Source of figure: https://towardsdatascience.com/machine-learning-algorithms-part-12-hierarchical-agglomerative-clustering-example-in-python-1e18e0075019

In this lecture, we will focus on the agglomerative clustering. We are going to use the agnes() function from the cluster package in R. Essentially, we are going to feed a dissimilarity matrix and specify the method to measure the distance (or similarity) between two clusters of observations.

clusterward <- agnes(famform.seq_OM, diss = TRUE, method = "ward")

famform.seq
plot(clusterward)

A stslist: 5 × 5
	[1]	[2]	[3]	[4]	[5]
	<fct>	<fct>	<fct>	<fct>	<fct>
[1]	S	U	%	%	%
[2]	S	U	M	%	%
[3]	S	U	M	MC	%
[4]	S	U	M	MC	SC
[5]	U	M	MC	%	%

How do we measure the dissimilarity between two clusters of observations?

Linkage methods	Description
Single linkage	The distance between one cluster and another cluster is taken to be equal to the shortest distance from any data point of one cluster to any data point in another (nearest neighbour)
Average linkage	The distance between two clusters is the average of the dissimilarities between the points in one cluster and the points in the other cluster
Complete linkage	The distance between one cluster and another cluster is taken to be equal to the longest distance from any data point of one cluster to any data point in another (futherest neighbour)
Ward	Minimizes the total within-cluster variance. At each step the pair of clusters with the smallest between-cluster distance are merged

Match each of the following figure to an appropriate linkage method

Figure A –> ? Figure B –> ? Figure C –> ?

Source: https://www.researchgate.net/figure/The-distance-between-two-clusters-defined-in-single-linkage-SL-A-complete-linkage_fig2_350350844

How do you pick the number of clusters?

We usually look at the largest difference of heights

Source: https://statsandr.com/blog/clustering-analysis-k-means-and-hierarchical-clustering-by-hand-and-in-r/

Check out this amazing cheatsheet too: https://statsandr.com/blog/files/Hierarchical-clustering-cheatsheet.pdf

Where would you draw the line to determine optimal k clusters for each of the dendogram below?

plot(clusterward, which.plot=2)

# cut the dendogram tree to generate two clusters
cluster2 <- cutree(clusterward, k = 2)

# create label for clusters
factor(cluster2, labels = c("Type 1", "Type 2"))

# check the number of observations in each cluster
table(cluster2)

# plot sequence frequency by cluster membership
seqfplot(famform.seq, group = cluster2, pbarw = T)

Type 1
Type 1
Type 2
Type 2
Type 2

Levels:

'Type 1'
'Type 2'

cluster2
1 2 
2 3 

Check the documentation for more details of hierarchical clustering ?agnes

# ?agnes

Temporal and sequential analysis - LASI 2022