Lecture 1: Exploratory data analysis of sequence data in education

Quan Nguyen, Department of Statistics, University of British Columbia

Learning objectives:

By the end of this lecture, workshop participants should be able to:

  • Understand the type of questions that are relevant to clustering of sequential data

  • Understand different types of sequence data format and convert between formats

  • Create a sequence object from an existing dataset using the TraMineR package

  • Manipulate sequence object to deal with missing values and misaligned sequences

  • Visualize and compute descriptive statitsics to explore sequential data

1. Why do we want to perform cluster analysis on sequential data in education?

What types of temporal/sequential data exist in education?

Data source

Time granularity

Questions

SIS records

Semester/Year

What are the common study paths taken by students based on their course enrolment?

LMS log files

Seconds/Daily/Semester

What are the common learning patterns of students during a course?

Eye-tracking

Miliseconds/Seconds/Minutes

What are the common thought process when students engage in an exercise?

Videos

Frames/Seconds/Minutes

What are the common interaction patterns of students during collaborative activities?

What type of research questions that are temporal/sequential specific?

  • Exploratory: What are the common patterns exist in the data?

  • Predictive: Can we classifiy an existing sequence? Can we predict the next activity/outcome given a sequence?

  • Causal: Does X cause Y?

Introduction to the TramineR package

TraMineR is a R-package for mining, describing and visualizing sequences of states or events, and more generally discrete sequence data

Some common usage of TramineR includes:

  • Visualize sequences

  • Explore sequences with descriptive statistics (e.g., frequencies, transitional probabilities, entropy)

  • Cluster analysis of sequences (e.g., Optimal matching/Edit distances)

  • Run discrepancy analyses to study how sequences are related to covariates

Installation in R:

install.packages("TraMineR", dependencies=TRUE)
# install.packages("TraMineR")
options(repr.plot.width=15, repr.plot.height=8)
library("TraMineR")
packageVersion("TraMineR")
library(readr)
library(tidyverse)
Warning message:
“package ‘TraMineR’ was built under R version 4.1.1”
TraMineR stable version 2.2-3 (Built: 2022-01-26)
Website: http://traminer.unige.ch
Please type 'citation("TraMineR")' for citation information.
[1] ‘2.2.3’
Warning message:
“package ‘readr’ was built under R version 4.1.1”
── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
 ggplot2 3.3.5      dplyr   1.0.7
 tibble  3.1.5      stringr 1.4.0
 tidyr   1.1.3      forcats 0.5.1
 purrr   0.3.4     
Warning message:
“package ‘tibble’ was built under R version 4.1.1”
── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()

Some terminologies

States vs events

  • States: Read, write, discuss, review

  • Events: Read -> Write, Write -> Discuss (each change of state is an event)

As see in the example below, the sequences 7,8,9,10 no event occurred during the observation period since the respondent stays in the same state during the whole sequence

In sequence 2, two events occurred: Practice -> Read, and Read -> Practice

data(actcal)
actcal <- data.frame(lapply(actcal, function(x) { gsub("^A\\b", "Read", x)}))
actcal <- data.frame(lapply(actcal, function(x) { gsub("^B\\b", "Watch", x)}))
actcal <- data.frame(lapply(actcal, function(x) { gsub("^C\\b", "Discuss", x)}))
actcal <- data.frame(lapply(actcal, function(x) { gsub("^D\\b", "Practice", x)}))
actcal.seq <- seqdef(actcal, var = 13:24)
# actcal.seq
seqiplot(actcal.seq, main = "Index plot (first 10 sequences)",with.legend = TRUE)
 [>] 4 distinct states appear in the data: 
     1 = Discuss
     2 = Practice
     3 = Read
     4 = Watch
 [>] state coding:
       [alphabet]  [label]  [long label] 
     1  Discuss     Discuss  Discuss
     2  Practice    Practice Practice
     3  Read        Read     Read
     4  Watch       Watch    Watch
 [>] 2000 sequences in the data set
 [>] min/max sequence length: 12/12
../_images/Lecture1_8_13.png

The alphabet is the number of unique states (or events) in the data. In the example above, we have four unique states (i.e., Read, Discuss, Practice, Watch) so the alphabet = 4

alphabet(actcal.seq)
  1. 'Discuss'
  2. 'Practice'
  3. 'Read'
  4. 'Watch'

Time reference: Internal and external clocks

  • Internal: Year 1, Year 2, Year 3 or Semester 1, Semester 2, Semester 3

  • External: June 17th, 2022, or 14:30:00 PST

2. Data manipulation with sequences

Data format

The ‘states-sequence’ (STS) format

  • Each row is an individual

  • In this format, the successive states (statuses) of an individual are given in consecutive columns. Each column is supposed to correspond to a predetermined time unit

head(actcal.seq)
A stslist: 6 × 12
jan00feb00mar00apr00may00jun00jul00aug00sep00oct00nov00dec00
<fct><fct><fct><fct><fct><fct><fct><fct><fct><fct><fct><fct>
1Watch Watch Watch Watch Watch Watch Watch Watch Watch WatchWatchWatch
2PracticePracticePracticePracticeRead Read Read Read Read Read Read Practice
3Watch Watch Watch Watch Watch Watch Watch Watch Watch WatchWatchWatch
4Discuss Discuss Discuss Discuss DiscussDiscussDiscussDiscussDiscussWatchWatchWatch
5Read Read Read Read Read Read Read Read Read Read Read Read
6PracticeWatch Watch Watch Watch Watch Watch Watch Watch WatchWatchWatch

The ‘state-permanence-sequence’ (SPS) format

  • Each row is an individual

  • Each successive distinct state in the sequence is given together with its duration

# print(head(actcal.seq), format='SPS')
actcal.sps <- seqformat(actcal, 13:24, from = "STS", to = "SPS", compress = TRUE)
head(actcal.sps)
 [>] converting STS sequences to 2000 SPS sequences
 [>] compressing SPS sequences
A matrix: 6 × 1 of type chr
Sequence
1(Watch,12)
2(Practice,4)-(Read,7)-(Practice,1)
3(Watch,12)
4(Discuss,9)-(Watch,3)
5(Read,12)
6(Practice,1)-(Watch,11)

The vertical ‘time-stamped-event’ (TSE) format

  • Each row is an event

  • Each record of the TSE representation usually contains a case identifier, a time stamp and codes identifying the event occurring

tstate <- seqetm(actcal.seq, method='state')
actcal.tse <- seqformat(actcal, 13:24, from = "STS", to = "TSE", tevent=tstate)
head(actcal.tse)
 [!!] 'id' set to NULL as it is not specified (backward compatibility with TraMineR 1.8)
 [!!] replacing original IDs in the output by the sequence indexes
 [>] converting STS sequences to 2579 TSE sequences
A data.frame: 6 × 3
idtimeevent
<dbl><dbl><chr>
11 0Watch
22 0Practice
32 4Read
4211Practice
53 0Watch
64 0Discuss

The spell (SPELL) format

  • Each row is a state

  • Each record of SPELL contains an ID, start time, end time, and the state

actcal.spell <- seqformat(actcal, 13:24, from = "STS", to = "SPELL")
head(actcal.seq)
head(actcal.spell)
 [>] converting STS sequences to 2579 spells
A stslist: 6 × 12
jan00feb00mar00apr00may00jun00jul00aug00sep00oct00nov00dec00
<fct><fct><fct><fct><fct><fct><fct><fct><fct><fct><fct><fct>
1Watch Watch Watch Watch Watch Watch Watch Watch Watch WatchWatchWatch
2PracticePracticePracticePracticeRead Read Read Read Read Read Read Practice
3Watch Watch Watch Watch Watch Watch Watch Watch Watch WatchWatchWatch
4Discuss Discuss Discuss Discuss DiscussDiscussDiscussDiscussDiscussWatchWatchWatch
5Read Read Read Read Read Read Read Read Read Read Read Read
6PracticeWatch Watch Watch Watch Watch Watch Watch Watch WatchWatchWatch
A data.frame: 6 × 4
idbeginendstates
<chr><dbl><dbl><fct>
11 112Watch
22 1 4Practice
32 511Read
421212Practice
53 112Watch
64 1 9Discuss

Converting between format

seqformat(data, var = ..., from = "...", to = ",,,", compress = FALSE/TRUE)

Examples:

  • seqformat(actcal, 13:24, from = "STS", to = "SPELL")

  • seqformat(actcal, 13:24, from = "STS", to = "SPS", compress = TRUE)

Create a sequence object from data

We can use the seqdef function to create a sequence object from an existing dataframe.

In the dataset actcal below, it consists of:

  • The sequence data were collected on a monthly basis on each participant in columns: “jan00”, “feb00”, “mar00”, “apr00”, “may00”, “jun00”, “jul00”, “aug00”, “sep00”, “oct00”, “nov00”, “dec00”

  • The covariates such as age, education, region, etc…

head(actcal)
A data.frame: 6 × 24
idhous00age00educat00civsta00nbadul00nbkid00aoldki00ayouki00region00com2.00mar00apr00may00jun00jul00aug00sep00oct00nov00dec00
<chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr>
160671 47maturity married 321714Middleland (BE, FR, SO, NE, JU) Industrial and tertiary sector communesWatch Watch Watch Watch Watch Watch Watch WatchWatchWatch
225321 21maturity single, never married20-3-3Zurich Suburban communes PracticePracticeRead Read Read Read Read Read Read Practice
353221 50full-time vocational school married 20-3-3Lake Geneva (VD, VS, GE) Peripheral urban communes Watch Watch Watch Watch Watch Watch Watch WatchWatchWatch
413911 37maturity single, never married10-3-3Middleland (BE, FR, SO, NE, JU) Centres Discuss Discuss DiscussDiscussDiscussDiscussDiscussWatchWatchWatch
514530120apprenticeship single, never married30-3-3North-west Switzerland (BS, BL, AG) Rural commuter communes Read Read Read Read Read Read Read Read Read Read
640022 27university, higher specialized schoolsingle, never married20-3-3Central Switzerland (LU, UR, SZ, OW, NW, ZG)Suburban communes Watch Watch Watch Watch Watch Watch Watch WatchWatchWatch

To create the sequence, we can use the seqdef() function.

We can see that there were 2000 sequences, with the length of 12, and consists of 4 states

actcal.seq <- seqdef(actcal, var = c("jan00", "feb00", "mar00",
         "apr00", "may00", "jun00", "jul00", "aug00", "sep00", "oct00",
         "nov00", "dec00"))
 [>] 4 distinct states appear in the data: 
     1 = Discuss
     2 = Practice
     3 = Read
     4 = Watch
 [>] state coding:
       [alphabet]  [label]  [long label] 
     1  Discuss     Discuss  Discuss
     2  Practice    Practice Practice
     3  Read        Read     Read
     4  Watch       Watch    Watch
 [>] 2000 sequences in the data set
 [>] min/max sequence length: 12/12

But this is a relatively clean data set, because it has defined a time unit (monthly). There were also no missing value or no misalignment in the timing of data collection (everyone started and ended at the same time).

Import a log file

Let’s try to process a more messy example of a log file lasi21_logdata.csv. The dataset consists of 852,458 records and 9 columns

  • id refers to anonymized student ID

  • sas_id_site refers to the id of the site that students visited

  • site_type refers to the type of content (e.g., resource, homepage, forumng, …)

  • date_time referes to the timestamp

  • spent_time refers to the estimated dwell time in miliseconds

  • action refers to the type of action took place (e.g., view, download, etc..)

  • instancename refers to the type of content (e.g., chapter 1, chapter 2, assignment 1, etc..)

  • avg score refers to the final score

  • PassFlag refers to a binary class of whether students passed or failed the course

lasi21_logdata <- read_csv("~/Downloads/lasi21_logdata.csv")
head(lasi21_logdata)
Rows: 852458 Columns: 9
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): site_type, action, instancename
dbl  (5): id, sas_id_site, spent_time, avg score, PassFlag
dttm (1): date_time
 Use `spec()` to retrieve the full column specification for this data.
 Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 6 × 9
idsas_id_sitesite_typedate_timespent_timeactioninstancenameavg scorePassFlag
<dbl><dbl><chr><dttm><dbl><chr><chr><dbl><dbl>
1872750content2017-01-22 02:16:00 1000apiset Block 2 Part 2260
1872771content2017-02-15 14:47:0012000view Block 3 Part 7260
1872750content2017-01-22 01:58:00 1000apiset Block 2 Part 2260
1872750content2017-01-22 02:16:00 1000apiset Block 2 Part 2260
1872890studio 2016-12-01 20:23:0058000view work stream (1)studio 260
1872859forumng2017-03-14 14:59:00 2000view discussion forumng 260
# Fixing some typos
lasi21_logdata$site_type <- ifelse(grepl("ssignment", lasi21_logdata$instancename), "assignment",lasi21_logdata$site_type )
lasi21_logdata$site_type <- ifelse(grepl("exam", lasi21_logdata$instancename), "assignment",lasi21_logdata$site_type )

# Convert dataframe to data.table for faster processing
lasi21_logdata <- lasi21_logdata %>% select(id,date_time,spent_time,site_type)

head(lasi21_logdata)
A tibble: 6 × 4
iddate_timespent_timesite_type
<dbl><dttm><dbl><chr>
12017-01-22 02:16:00 1000content
12017-02-15 14:47:0012000content
12017-01-22 01:58:00 1000content
12017-01-22 02:16:00 1000content
12016-12-01 20:23:0058000studio
12017-03-14 14:59:00 2000forumng

Now we need to make some decisions:

  • What is a sequence in our data? (i.e., what each row will represent)

    • Each sequence could be a student

    • Each sequence could be a learning session (defined as consecutive learning activities of at least 60s with no more than 30 minutes break between activities)

  • What is the time unit in our data?

    • Every minute?

    • Every 5 mins?

    • Every hour?

# convert ms to minutes
lasi21_logdata$spent_time_m <- round(lasi21_logdata$spent_time/60000, digits=1)

# create a flag for session break (aka where time spent > 30 minutes)
lasi21_logdata$session_flag <- 0
lasi21_logdata$session_flag <- ifelse(lasi21_logdata$spent_time_m>30, 1, lasi21_logdata$session_flag)

# filter out all spent_time < 90s
lasi21_logdata2 <- lasi21_logdata %>%  filter(spent_time>=90000)


# create session number 
lasi21_logdata2 <- lasi21_logdata2 %>% 
                        arrange(id,date_time,spent_time) %>% 
                        mutate(session_num = cumsum(session_flag))

# remove session break
lasi21_logdata2 <- lasi21_logdata2 %>%  filter(session_flag==0)

# for each learning session, calculate the cummulative time spent
lasi21_logdata2 <- lasi21_logdata2 %>% 
                        arrange(id,date_time,spent_time) %>% 
                        group_by(session_num) %>% 
                        mutate(spent_time_m_cum = cumsum(spent_time_m))

# create time unit as 1 minute (60s)
lasi21_logdata2$time_unit <- round(lasi21_logdata2$spent_time_m_cum,digits=0)
head(lasi21_logdata2,10)
A grouped_df: 10 × 9
iddate_timespent_timesite_typespent_time_msession_flagsession_numspent_time_m_cumtime_unit
<dbl><dttm><dbl><chr><dbl><dbl><dbl><dbl><dbl>
12016-09-06 18:19:00 217000content 3.601 3.6 4
12016-09-06 18:22:00 397000content 6.60110.210
12016-09-06 18:29:00 337000content 5.60115.816
12016-09-06 18:34:00 182000content 3.00118.819
12016-09-06 18:38:00 178000content 3.00121.822
12016-09-07 13:43:00 323000content 5.403 5.4 5
12016-09-07 13:48:001298000content21.60327.027
12016-09-07 14:10:001576000content26.30353.353
12016-09-07 18:16:00 522000content 8.704 8.7 9
12016-09-07 18:24:00 91000content 1.50410.210

Prepare data in SPELL format

Since the log data can capture the start time and end time of each visit, it makes it suitable to prepare the data in a SPELL format.

To recall, the structure of a SPELL data format should look like this

Position

Varible

Option name

1

ID

id

2

Start time

begin

3

End time

end

4

Status

status

log_spell <- lasi21_logdata2 %>% 
            group_by(session_num) %>% 
            mutate(start = lag(time_unit), start=replace_na(start,1), end=time_unit) %>% 
            select(session_num,start,end,site_type)
log_spell %>% filter(session_num==22)
A grouped_df: 4 × 4
session_numstartendsite_type
<dbl><dbl><dbl><chr>
22 119content
221929resource
222953content
225358content

Create sequence data from SPELL format

We can use the seqdef() function to convert the SPELL data into STS format, which will be used for subsequent analyses.

Note:

  • For some reasons, seqdef() does not play well with tibble format or any other formats than data.frame, so make sure to convert your data using the as.data.frame() function

log_spell <- as.data.frame(log_spell)
log_sts.seq <- seqdef(log_spell, var = c(id="session_num", begin="start", end="end", status="site_type"), 
                   informat = "SPELL",  process = FALSE)
 [>] time axis: 1 -> 429
 [>] converting SPELL data into 21418 STS sequences (internal format)
 [>] found missing values ('NA') in sequence data
 [>] preparing 21418 sequences
 [>] coding void elements with '%' and missing values with '*'
 [>] 12 distinct states appear in the data: 
     1 = assignment
     2 = collaborate
     3 = content
     4 = forumng
     5 = glossary
     6 = homepage
     7 = questionnaire
     8 = resource
     9 = studio
     10 = subpage
     11 = url
     12 = wiki
 [>] state coding:
       [alphabet]    [label]       [long label] 
     1  assignment    assignment    assignment
     2  collaborate   collaborate   collaborate
     3  content       content       content
     4  forumng       forumng       forumng
     5  glossary      glossary      glossary
     6  homepage      homepage      homepage
     7  questionnaire questionnaire questionnaire
     8  resource      resource      resource
     9  studio        studio        studio
     10  subpage       subpage       subpage
     11  url           url           url
     12  wiki          wiki          wiki
 [>] 21418 sequences in the data set
 [>] min/max sequence length: 2/429

What can we observe from the output above?

  • There were 21,418 sequences in the data set

  • min/max sequence length: 2/429

  • 12 distinct states

  • coding void elements with ‘%’ and missing values with ‘*’

head(log_sts.seq)
A stslist: 6 × 429
y1y2y3y4y5y6y7y8y9y10y420y421y422y423y424y425y426y427y428y429
<fct><fct><fct><fct><fct><fct><fct><fct><fct><fct><fct><fct><fct><fct><fct><fct><fct><fct><fct><fct>
1contentcontentcontentcontentcontentcontentcontentcontentcontentcontent%%%%%%%%%%
3contentcontentcontentcontentcontentcontentcontentcontentcontentcontent%%%%%%%%%%
4contentcontentcontentcontentcontentcontentcontentcontentcontentcontent%%%%%%%%%%
5contentcontentcontentcontentcontentcontentcontentcontentcontentcontent%%%%%%%%%%
8contentcontent% % % % % % % % %%%%%%%%%%
9forumngforumng% % % % % % % % %%%%%%%%%%

We can also print out some sequences using the SPS format for easy observations

print(log_sts.seq[31:41,], format="SPS")
   Sequence                                                                  
45 (assignment,1)-(studio,22)-(assignment,7)-(studio,4)                      
46 (assignment,28)                                                           
48 (forumng,2)                                                               
50 (homepage,4)-(assignment,43)                                              
51 (assignment,8)                                                            
54 (assignment,3)                                                            
55 (homepage,13)-(assignment,8)-(content,3)                                  
56 (assignment,16)                                                           
57 (assignment,14)-(content,28)-(assignment,11)-(resource,18)-(assignment,17)
59 (assignment,19)-(resource,16)-(forumng,2)                                 
61 (content,13)                                                              

Let’s plot these example sequences using the seqiplot() function

seqiplot(log_sts.seq[31:41,1:100], with.legend = T, main = "Index plot (10 first sequences)")
../_images/Lecture1_40_0.png

As a final note, there is not a single approach to pre-process your data. In practice, there are a few ‘nobs’ that you could adjust when pre-processing your data:

  • What is the time granularity of your sequence (e.g., every second, every minute, every 5 mins)

  • What is a sequence in your study? (e.g., each person is a sequence, or each learing session - however you define a learning session)

  • What is a status? (e.g., read, write, discuss, practice)

Truncations, gaps and missing values

  • Sequences defined as the list of successive states without duration information are typically of varying length.

  • In event sequences, the number of events experienced by each individual differs from one individual to the other.

  • The length of the follow up is not the same for all individuals or sequences may be right or left censored.

  • Sequences may not be left aligned depending on the time axis on which they are defined.

  • Data may not be available for all measuring points yielding internal gaps in the sequences.

Let’s simulate an example of data when sequences are not aligned. In this example, three sequences (s1,s2,s3) have different start and end date. Participant b also have a gap in 1993. If the respondents entered the study at different points in time and we represent the data on a calendar time axis, the data could look like this

s1 <- c("a","b","c","d",NA,NA)
s2 <- c(NA,"a","b",NA,"c","d")
s3 <- c(NA,NA, "a","b","c","d")

df <- data.frame(rbind(s1,s2,s3))
colnames(df) <- c(1990:1995)
df
A data.frame: 3 × 6
199019911992199319941995
<chr><chr><chr><chr><chr><chr>
s1a b cd NANA
s2NAa bNAc d
s3NANAab c d

Let’s create a sequence object from df. The default values of the seqdef() function are left=NA, gaps=NA and right="DEL". We will see what these options mean in a moment

seqdef(df)
 [>] found missing values ('NA') in sequence data
 [>] preparing 3 sequences
 [>] coding void elements with '%' and missing values with '*'
 [>] 4 distinct states appear in the data: 
     1 = a
     2 = b
     3 = c
     4 = d
 [>] state coding:
       [alphabet]  [label]  [long label] 
     1  a           a        a
     2  b           b        b
     3  c           c        c
     4  d           d        d
 [>] 3 sequences in the data set
 [>] min/max sequence length: 4/6
A stslist: 3 × 6
199019911992199319941995
<fct><fct><fct><fct><fct><fct>
s1abcd%%
s2*ab*cd
s3**abcd

In this case it may be more appropriate to represent the data on a process time axis where all sequences would be left aligned, meaning that their common origin is not a specific year but the beginning of the observed 4 year duration.

The left part of sequences s2 and s3 which do not begin in the first column of the matrix, has been considered as part of them. To remedy to this problem, we could use the left=”DEL” option.

seqdef(df,left="DEL")
 [>] found missing values ('NA') in sequence data
 [>] preparing 3 sequences
 [>] coding void elements with '%' and missing values with '*'
 [>] 4 distinct states appear in the data: 
     1 = a
     2 = b
     3 = c
     4 = d
 [>] state coding:
       [alphabet]  [label]  [long label] 
     1  a           a        a
     2  b           b        b
     3  c           c        c
     4  d           d        d
 [>] 3 sequences in the data set
 [>] min/max sequence length: 4/5
A stslist: 3 × 6
199019911992199319941995
<fct><fct><fct><fct><fct><fct>
s1abcd%%
s2ab*cd%
s3abcd%%

Now that all the 3 sequences have been left aligned. But sequence s2 has a gap in the data. Each missing value is left as an explicit missing element. We could also delete the missing values encountered in the center part of the sequence by setting gaps="DEL"

seqdef(df,left="DEL", gaps="DEL")
 [>] found missing values ('NA') in sequence data
 [>] preparing 3 sequences
 [>] coding void elements with '%' and missing values with '*'
 [>] 4 distinct states appear in the data: 
     1 = a
     2 = b
     3 = c
     4 = d
 [>] state coding:
       [alphabet]  [label]  [long label] 
     1  a           a        a
     2  b           b        b
     3  c           c        c
     4  d           d        d
 [>] 3 sequences in the data set
 [>] min/max sequence length: 4/4
A stslist: 3 × 6
199019911992199319941995
<fct><fct><fct><fct><fct><fct>
s1abcd%%
s2abcd%%
s3abcd%%

3. Exploratory analysis of sequences

Sequence index plot

We can easily plot the top 20 sequences using the seqiplot() function. Index plot (first 20 sequences)

seqiplot(data, main = "Index plot (first 20 sequences)", idxs = 1:20)
# Plot data during the first 100 minutes only

seqiplot(log_sts.seq[,1:100], main = "Index plot (first 20 sequences)", idxs = 1:20)
../_images/Lecture1_53_0.png

State distribution plot

The seqdplot() function plots a graphic showing the state distribution at each time point

seqdplot(data, main = "State distribution plot")
seqdplot(log_sts.seq[,1:300], main = "State distribution plot")
../_images/Lecture1_55_0.png

Beside plotting the distribution of the states at each time point, you may want to get the figures of the distribution. The seqstatd() function returns the table of the state distributions together with the number of valid states and an entropy measure for each time unit. We will explain what entropy means later.

Here’s I am going to return a table of state distributions of 5 time units as an example

print(seqstatd(log_sts.seq[,1:5]))
                          [State frequencies]
                  y1     y2     y3     y4     y5
assignment    0.0733 0.0732 0.0732 0.0723 0.0716
collaborate   0.0275 0.0285 0.0292 0.0291 0.0286
content       0.3562 0.3674 0.3821 0.3907 0.3981
forumng       0.2497 0.2442 0.2292 0.2217 0.2159
glossary      0.0120 0.0130 0.0144 0.0146 0.0144
homepage      0.1275 0.1142 0.1063 0.0997 0.0965
questionnaire 0.0028 0.0026 0.0023 0.0024 0.0022
resource      0.0256 0.0291 0.0325 0.0348 0.0371
studio        0.0763 0.0774 0.0788 0.0809 0.0802
subpage       0.0282 0.0286 0.0290 0.0299 0.0308
url           0.0016 0.0016 0.0017 0.0018 0.0017
wiki          0.0191 0.0203 0.0212 0.0221 0.0229

                           [Valid states]
                 y1    y2    y3    y4    y5
N             21418 21418 19605 18808 17864

                           [Entropy index]
                y1   y2   y3   y4   y5
H             0.73 0.73 0.73 0.73 0.73

Sequence frequency plot

The seqfplot() function plots the most frequent sequences. By default, the 10 most frequent sequences are plotted. You can adjust this with the idxs option. The sequences are ordered by decreasing frequency from bottom up and the bar widths are set proportional to the sequence frequency (pbarw = TRUE)

seqfplot(data, main = "Sequence frequency plot", idxs = 1:10)
seqfplot(log_sts.seq[,1:100], main = "Sequence frequency plot", pbarw = TRUE, idxs = 1:10)
../_images/Lecture1_59_0.png

You can also return the frequency table using seqtab() function

print(seqtab(log_sts.seq[,1:100]))
           Freq Percent
forumng/2   660    3.08
content/2   440    2.05
homepage/2  273    1.27
forumng/4   270    1.26
forumng/3   256    1.20
content/4   229    1.07
content/3   207    0.97
content/6   198    0.92
forumng/5   192    0.90
content/5   177    0.83

Transition rates

The seqtrate() function computes the transition rates between states or events. The outcome is a matrix where each rows gives a transition distribution from associated originating state (or event) in t to the states in t + 1 (the figures sum to one in each row).

# ?seqtrate
seqtrate(log_sts.seq[,1:100])
 [>] computing transition probabilities for states assignment/collaborate/content/forumng/glossary/homepage/questionnaire/resource/studio/subpage/url/wiki ...
A matrix: 12 × 12 of type dbl
[-> assignment][-> collaborate][-> content][-> forumng][-> glossary][-> homepage][-> questionnaire][-> resource][-> studio][-> subpage][-> url][-> wiki]
[assignment ->]0.9542213020.00094893030.0184034970.0067287780.00310559010.0076777092.012882e-040.00273176900.0040832760.00074764212.012882e-040.0009489303
[collaborate ->]0.0030006920.95437408630.0106947760.0161575750.00023082250.0070785570.000000e+000.00061552670.0028468110.00200046167.694083e-050.0029237516
[content ->]0.0022040100.00039867760.9754963720.0047690870.00223786010.0031480491.955777e-040.00521289760.0049157700.00046637764.513331e-050.0009101885
[forumng ->]0.0018493630.00216264290.0150879710.9552009540.00072761820.0079330591.616929e-040.00114195630.0071852290.00354713851.313755e-040.0048709994
[glossary ->]0.0091103970.00053590570.0614147910.0073954980.90610932480.0034297961.071811e-040.00482315110.0056806000.00042872451.071811e-040.0008574491
[homepage ->]0.0087704580.00251611510.0358246870.0172773240.00122211310.9206824665.271860e-040.00203685510.0060866020.00306726411.917040e-040.0017972251
[questionnaire ->]0.0116772820.00000000000.0583864120.0169851380.00106157110.0276008498.811040e-010.00000000000.0021231420.00000000000.000000e+000.0010615711
[resource ->]0.0029545870.00069305120.0431880360.0047419300.00098486230.0031369692.918111e-040.93634871420.0047419300.00102133872.553347e-040.0016414372
[studio ->]0.0024433360.00077471640.0250690290.0129715340.00085417450.0042708735.959357e-050.00206591050.9480741340.00107268437.945810e-050.0022645557
[subpage ->]0.0014778330.00246305420.0079912420.0140120420.00010946910.0061850035.473454e-050.00235358510.0036124790.95998905311.094691e-040.0016420361
[url ->]0.0063391440.00000000000.0316957210.0174326470.00000000000.0095087160.000000e+000.01109350240.0047543580.00633914429.064976e-010.0063391442
[wiki ->]0.0019518050.00307784700.0156144430.0328804140.00105097210.0051047220.000000e+000.00322798590.0090083330.00202687490.000000e+000.9260566024

Average time spent in each state

The seqmtplot() function to visualize the mean time spent in each state. You can also create multiple plots by group using the group option.

seqmtplot(log_sts.seq[,1:100], group = NULL, title = "Mean time")
 [!!] In IRkernel::main() : title is deprecated, use main instead.
../_images/Lecture1_66_1.png

Descriptive statisitics of sequences

  • Sequence length using the seqlength() function

hist(seqlength(log_sts.seq), main = 'Histogram of sequence length', xlab="Sequence length in minutes")
../_images/Lecture1_68_0.png
  • Distinct states sequence (DSS) using the seqdss() function

print(seqdss(log_sts.seq[31:41,1:100]))
   Sequence                                         
45 assignment-studio-assignment-studio              
46 assignment                                       
48 forumng                                          
50 homepage-assignment                              
51 assignment                                       
54 assignment                                       
55 homepage-assignment-content                      
56 assignment                                       
57 assignment-content-assignment-resource-assignment
59 assignment-resource-forumng                      
61 content                                          
  • Shanon’s entropy is calculated as:

\[h = -\sum^s_i{\pi_i log{\pi_i}}\]
  • where s is the size of the alphabet

  • \(\pi_i\) the proportion of occurrences of the ith state in the considered sequence

The entropy can be interpreted as the ‘uncertainty’ of predicting the states in a given sequence. If all states in the sequence are the same, the entropy is equal to 0. It is maximum when the cases are equally distributed between the states

The seqient() function returns by default a normalized Shanon’s entropy

seqiplot(log_sts.seq[31:41,1:100])
seqient(log_sts.seq[31:41,1:100], norm = TRUE)
A matrix: 11 × 1 of type dbl
Entropy
450.2195634
460.0000000
480.0000000
500.1171340
510.0000000
540.0000000
550.3856211
560.0000000
570.4193275
590.3470891
610.0000000
../_images/Lecture1_72_1.png

Let’s plot a histogram of entropy for the whole dataset

hist(seqient(log_sts.seq), main = 'Histogram of sequence entropy', xlab="Normalized Shannon's entropy")
../_images/Lecture1_74_0.png

Let us have a look at the sequences near the minimum, median and maximum entropy. For that, we draw sets of sequences having an entropy lower or equal to the 55th percentile, an entropy 50-90th percentile, and an entropy greater than the 90th percentile.

ient.quant <- quantile(seqient(log_sts.seq[,1:100]), c(0, 0.55, 0.9, 1))
ient.quant
0%
0
55%
0.0533989559934544
90%
0.372804151486808
100%
0.775849600937639
ient.group <- cut(seqient(log_sts.seq[,1:100]), ient.quant, labels = c("55th or lower", "55th-90th", "above 90th"), include.lowest = T)
ient.group <- factor(ient.group, levels = c("55th or lower", "55th-90th", "above 90th"))
table(ient.group)
ient.group
55th or lower     55th-90th    above 90th 
        11783          7493          2142 
options(repr.plot.width=15, repr.plot.height=8)
seqfplot(log_sts.seq[,1:100], group = ient.group, pbarw = TRUE)
../_images/Lecture1_78_0.png
  • Turbulence

The Turbulence depends on the length of the sequence. The Turbulence is based on the number \(\phi\)(x) of distinct subsequences that can be extracted from the distinct state sequence and the variance of the consecutive times ti spent in the distinct states.

\[some math\]
seqiplot(log_sts.seq[31:401,1:100])
cbind(seqST(log_sts.seq[31:41,1:100]),seqient(log_sts.seq[31:41,1:100]))
A matrix: 11 × 2 of type dbl
TurbulenceEntropy
454.9423820.2195634
461.0000000.0000000
481.0000000.0000000
502.4119600.1171340
511.0000000.0000000
541.0000000.0000000
555.4863990.3856211
561.0000000.0000000
579.6033340.4193275
595.2061160.3470891
611.0000000.0000000
../_images/Lecture1_80_1.png

Finally, we can use a single function seqindic to produce multiple descriptive statistics of the sequences. Here’s a few examples

  • “lgth” (sequence length)

  • “trans” (number of state changes)

  • “entr” (longitudinal normalized entropy)

  • “cplx” (complexity index)

  • “turb” (turbulence)

# ?seqindic
seqindic(log_sts.seq[31:41,1:100], indic=c("lgth","trans","entr","turbn","cplx"))
A data.frame: 11 × 5
LgthTransEntrCplxTurbn
<dbl><dbl><dbl><dbl><dbl>
453430.21956340.141280960.04532171
462800.00000000.000000000.00000000
48 200.00000000.000000000.00000000
504710.11713400.050461790.01623192
51 800.00000000.000000000.00000000
54 300.00000000.000000000.00000000
552420.38562110.183118200.05157575
561600.00000000.000000000.00000000
578840.41932750.138850370.09890414
593720.34708910.138862260.04835361
611300.00000000.000000000.00000000