Lecture 2: Clustering of sequential data in education

Quan Nguyen, Department of Statistics, University of British Columbia

library(tidyverse)
library(TraMineR)
library(cluster)
── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
 ggplot2 3.3.5      purrr   0.3.4
 tibble  3.1.5      dplyr   1.0.7
 tidyr   1.1.3      stringr 1.4.0
 readr   2.0.1      forcats 0.5.1
Warning message:
“package ‘tibble’ was built under R version 4.1.1”
Warning message:
“package ‘readr’ was built under R version 4.1.1”
── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()
Warning message:
“package ‘TraMineR’ was built under R version 4.1.1”
TraMineR stable version 2.2-3 (Built: 2022-01-26)
Website: http://traminer.unige.ch
Please type 'citation("TraMineR")' for citation information.
Warning message:
“package ‘cluster’ was built under R version 4.1.1”

1. Similarities and distances between sequences

Similarity between sequences can be measured via

  • Common attributes: The more common attributes, the more similar

  • Edit distance: The lower the edit cost, the more similar

data(famform)
famform.seq <- seqdef(famform)
print(famform.seq)
 [>] found missing values ('NA') in sequence data
 [>] preparing 5 sequences
 [>] coding void elements with '%' and missing values with '*'
 [>] 5 distinct states appear in the data: 
     1 = M
     2 = MC
     3 = S
     4 = SC
     5 = U
 [>] state coding:
       [alphabet]  [label]  [long label] 
     1  M           M        M
     2  MC          MC       MC
     3  S           S        S
     4  SC          SC       SC
     5  U           U        U
 [>] 5 sequences in the data set
 [>] min/max sequence length: 2/5
    Sequence   
[1] S-U        
[2] S-U-M      
[3] S-U-M-MC   
[4] S-U-M-MC-SC
[5] U-M-MC     

Common attributes

Longest Common Prefix (LCP) distances

[1] S-U
[2] S-U-M

Longest Common Prefix = [S-U]

Similarity = 2
Dismilarity = 1

seqdist(famform.seq, method='LCP')
 [>] 5 sequences with 5 distinct states
 [>] 5 distinct  sequences 
 [>] min/max sequence lengths: 2/5
 [>] computing distances using the LCP metric
 [>] elapsed time: 0.007 secs
A matrix: 5 × 5 of type dbl
[1][2][3][4][5]
[1]01235
[2]10126
[3]21017
[4]32108
[5]56780

We can also produce a normalized version of LCP that is insensitive to the length of the sequences

print(famform.seq)
seqdist(famform.seq, method='LCP', norm="auto")
    Sequence   
[1] S-U        
[2] S-U-M      
[3] S-U-M-MC   
[4] S-U-M-MC-SC
[5] U-M-MC     
 [>] 5 sequences with 5 distinct states
 [>] 5 distinct  sequences 
 [>] min/max sequence lengths: 2/5
 [>] computing distances using the LCP gmean normalized metric
 [>] elapsed time: 0.007 secs
A matrix: 5 × 5 of type dbl
[1][2][3][4][5]
[1]0.00000000.18350340.29289320.36754451
[2]0.18350340.00000000.13397460.22540331
[3]0.29289320.13397460.00000000.10557281
[4]0.36754450.22540330.10557280.00000001
[5]1.00000001.00000001.00000001.00000000

Longest common suffix (“RLCP”)

[3] S-U-M-MC
[5] U-M-MC

Longest Common Suffix = [U-M-MC]

Similarity = 3
Dismilarity = 1

print(famform.seq)
seqdist(famform.seq, method='RLCP')
    Sequence   
[1] S-U        
[2] S-U-M      
[3] S-U-M-MC   
[4] S-U-M-MC-SC
[5] U-M-MC     
 [>] 5 sequences with 5 distinct states
 [>] 5 distinct  sequences 
 [>] min/max sequence lengths: 2/5
 [>] computing distances using the RLCP metric
 [>] elapsed time: 0.004 secs
A matrix: 5 × 5 of type dbl
[1][2][3][4][5]
[1]05675
[2]50786
[3]67091
[4]78908
[5]56180

Longest Common Subsequence (LCS) distances

Let’s take a look at this example:

[4] S-U-M-MC-SC
[5] U-M-MC

Longest Common Subsequence = [U-M-MC]

Similarity = 3
Dismilarity = 2

print(famform.seq)
seqdist(famform.seq, method='LCS')
    Sequence   
[1] S-U        
[2] S-U-M      
[3] S-U-M-MC   
[4] S-U-M-MC-SC
[5] U-M-MC     
 [>] 5 sequences with 5 distinct states
 [>] creating a 'sm' with a substitution cost of 2
 [>] creating 5x5 substitution-cost matrix using 2 as constant value
 [>] 5 distinct  sequences 
 [>] min/max sequence lengths: 2/5
 [>] computing distances using the LCS metric
 [>] elapsed time: 0.006 secs
A matrix: 5 × 5 of type dbl
[1][2][3][4][5]
[1]01233
[2]10122
[3]21011
[4]32102
[5]32120
# Normalized version

seqdist(famform.seq, method='LCS', norm = 'auto')
 [>] 5 sequences with 5 distinct states
 [>] creating a 'sm' with a substitution cost of 2
 [>] creating 5x5 substitution-cost matrix using 2 as constant value
 [>] 5 distinct  sequences 
 [>] min/max sequence lengths: 2/5
 [>] computing distances using the LCS gmean normalized metric
 [>] elapsed time: 0.005 secs
A matrix: 5 × 5 of type dbl
[1][2][3][4][5]
[1]0.00000000.18350340.29289320.36754450.5917517
[2]0.18350340.00000000.13397460.22540330.3333333
[3]0.29289320.13397460.00000000.10557280.1339746
[4]0.36754450.22540330.10557280.00000000.2254033
[5]0.59175170.33333330.13397460.22540330.0000000

Dealing with missing values

s1 <- c("a","b","c","d",NA,NA)
s2 <- c("a","b",NA,"c","d",NA)
s3 <- c("a","b",NA,"c","d","a")

df <- data.frame(rbind(s1,s2,s3))
colnames(df) <- c(1990:1995)
df.seq <- seqdef(df)
print(df.seq)
 [>] found missing values ('NA') in sequence data
 [>] preparing 3 sequences
 [>] coding void elements with '%' and missing values with '*'
 [>] 4 distinct states appear in the data: 
     1 = a
     2 = b
     3 = c
     4 = d
 [>] state coding:
       [alphabet]  [label]  [long label] 
     1  a           a        a
     2  b           b        b
     3  c           c        c
     4  d           d        d
 [>] 3 sequences in the data set
 [>] min/max sequence length: 4/6
   Sequence   
s1 a-b-c-d    
s2 a-b-*-c-d  
s3 a-b-*-c-d-a

To compute LCS distances between sequences containing gaps, one can use the with.miss=TRUE option. In that case, missing states are considered as an additional valid state.

Let’s look at s1 and s2

LCS = a-b or c-d

Similarity = 2
Dismilarity = 1

seqdist(df.seq, method='LCS', with.miss=TRUE)
 [>] including missing values as an additional state
 [>] 3 sequences with 5 distinct states
 [>] creating a 'sm' with a substitution cost of 2
 [>] creating 5x5 substitution-cost matrix using 2 as constant value
 [>] 3 distinct  sequences 
 [>] min/max sequence lengths: 4/6
 [>] computing distances using the LCS metric
 [>] elapsed time: 0.015 secs
A matrix: 3 × 3 of type dbl
s1s2s3
s1012
s2101
s3210
# ?seqdist

Edit distance

Optimal matching (OM) distances

Optimal matching generates edit distances that are the minimal cost, in terms of insertions, deletions and substitutions, for transforming one sequence into another.

How do we transform the following sequence 1 to sequence 2

s1 = {N,A,T,I,O,N}
s2 = {F,A,S,H,I,O,N}

Strategy 1:

  • Substitute N with F (NATION -> FATION)

  • Insert S (FATION -> FASTION)

  • Substitute T with H (FASTION -> FASHION)

Strategy 2:

  • Delete N (NATION -> ATION)

  • Insert F (ATION -> FATION)

  • Delete T (FATION -> FAION)

  • Insert S (FAION -> FASION)

  • Insert H (FASION -> FASHION)

Assuming there’s a cost for each action (i.e., substitution, insertion/deletion). What would be the optimal strategy that minizes the total cost? –> Optimal Matching algorithm

The insertion/deletion cost

  • A constant value specified by user (default = 1)

The substitution-cost matrix

  • A constant value (default)

  • Based on transition rates between different states

Let’s have a look at our famform data. We generate a substitution cost matrix with constant value of 2

print(famform.seq)
sub_cost <- seqcost(famform.seq, method = "CONSTANT", cval = 2)
sub_cost
    Sequence   
[1] S-U        
[2] S-U-M      
[3] S-U-M-MC   
[4] S-U-M-MC-SC
[5] U-M-MC     
 [>] creating 5x5 substitution-cost matrix using 2 as constant value
$indel
1
$sm
A matrix: 5 × 5 of type dbl
MMCSSCU
M02222
MC20222
S22022
SC22202
U22220

We compute the distances using the matrix and the default indel cost of 1

famform.seq_OM <- seqdist(famform.seq, method = "OM", sm = sub_cost$sm)
famform.seq_OM
 [>] 5 sequences with 5 distinct states
 [>] checking 'sm' (size and triangle inequality)
 [>] 5 distinct  sequences 
 [>] min/max sequence lengths: 2/5
 [>] computing distances using the OM metric
 [>] elapsed time: 0.004 secs
A matrix: 5 × 5 of type dbl
[1][2][3][4][5]
[1]01233
[2]10122
[3]21011
[4]32102
[5]32120

Minimal cost to transform [1] S-U to [5] U-M-MC

  • Delete S (cost 1)

  • Insert M (cost 1)

  • Insert MC (cost 1)

Total cost = 3

famform.seq_OM[1,5]
3

Substitution cost matrix based on transition rates between different states using method = "TRATE"

print(famform.seq)
sub_cost_TRATE <- seqcost(famform.seq, method = "TRATE", cval = 2)
sub_cost_TRATE
seqdist(famform.seq, method = "OM", sm = sub_cost_TRATE$sm)
    Sequence   
[1] S-U        
[2] S-U-M      
[3] S-U-M-MC   
[4] S-U-M-MC-SC
[5] U-M-MC     
 [>] creating substitution-cost matrix using transition rates ...
 [>] computing transition probabilities for states M/MC/S/SC/U ...
$indel
1
$sm
A matrix: 5 × 5 of type dbl
MMCSSCU
M01221
MC10212
S22021
SC21202
U12120
 [>] 5 sequences with 5 distinct states
 [>] checking 'sm' (size and triangle inequality)
 [>] 5 distinct  sequences 
 [>] min/max sequence lengths: 2/5
 [>] computing distances using the OM metric
 [>] elapsed time: 0.007 secs
A matrix: 5 × 5 of type dbl
[1][2][3][4][5]
[1]01233
[2]10122
[3]21011
[4]32102
[5]32120

Check the documentation for other methods of generating substitution costs ?seqcost

?seqcost
# ?seqdist

2. Agglomerative clustering of sequences

Some background on hierarchical clustering:

Broadly speaking, there are two types:

  • Agglomerative clustering: Commonly referred to as AGNES (AGglomerative NESting)

    • Bottom-up approach

    • The two data points that are the most similar are combined into a new bigger cluster (nodes). This procedure is iterated until all points are a member of just one single big cluster (root)

  • Divisive hierarchical clustering: Commonly referred to as DIANA (DIvise ANAlysis)

    • Top-down approach

    • Start with one single big cluster (root), at each iteration, we split the cluster into two least similar clusters, and repeat until each cluster contains a single data point

Source of figure: https://towardsdatascience.com/machine-learning-algorithms-part-12-hierarchical-agglomerative-clustering-example-in-python-1e18e0075019

In this lecture, we will focus on the agglomerative clustering. We are going to use the agnes() function from the cluster package in R. Essentially, we are going to feed a dissimilarity matrix and specify the method to measure the distance (or similarity) between two clusters of observations.

clusterward <- agnes(famform.seq_OM, diss = TRUE, method = "ward")
famform.seq
plot(clusterward)
A stslist: 5 × 5
[1][2][3][4][5]
<fct><fct><fct><fct><fct>
[1]SU% % %
[2]SUM % %
[3]SUM MC%
[4]SUM MCSC
[5]UMMC% %
../_images/Lecture2_35_1.png ../_images/Lecture2_35_2.png

How do we measure the dissimilarity between two clusters of observations?

Linkage methods

Description

Single linkage

The distance between one cluster and another cluster is taken to be equal to the shortest distance from any data point of one cluster to any data point in another (nearest neighbour)

Average linkage

The distance between two clusters is the average of the dissimilarities between the points in one cluster and the points in the other cluster

Complete linkage

The distance between one cluster and another cluster is taken to be equal to the longest distance from any data point of one cluster to any data point in another (futherest neighbour)

Ward

Minimizes the total within-cluster variance. At each step the pair of clusters with the smallest between-cluster distance are merged

Match each of the following figure to an appropriate linkage method

Figure A –> ? Figure B –> ? Figure C –> ?

Source: https://www.researchgate.net/figure/The-distance-between-two-clusters-defined-in-single-linkage-SL-A-complete-linkage_fig2_350350844

How do you pick the number of clusters?

We usually look at the largest difference of heights

Source: https://statsandr.com/blog/clustering-analysis-k-means-and-hierarchical-clustering-by-hand-and-in-r/

Check out this amazing cheatsheet too: https://statsandr.com/blog/files/Hierarchical-clustering-cheatsheet.pdf

Where would you draw the line to determine optimal k clusters for each of the dendogram below?

plot(clusterward, which.plot=2)
../_images/Lecture2_40_0.png
# cut the dendogram tree to generate two clusters
cluster2 <- cutree(clusterward, k = 2)

# create label for clusters
factor(cluster2, labels = c("Type 1", "Type 2"))

# check the number of observations in each cluster
table(cluster2)

# plot sequence frequency by cluster membership
seqfplot(famform.seq, group = cluster2, pbarw = T)
  1. Type 1
  2. Type 1
  3. Type 2
  4. Type 2
  5. Type 2
Levels:
  1. 'Type 1'
  2. 'Type 2'
cluster2
1 2 
2 3 
../_images/Lecture2_41_2.png

Check the documentation for more details of hierarchical clustering ?agnes

# ?agnes