
The time-varying (longitudinal) characteristics of large information flows represent a special case of the complex, dynamic, and multi-scale nature of big biomedical data that we discussed in the DSPA Motivation section. Previously, in Chapter 3, we saw space-time (4D) functional magnetic resonance imaging (fMRI) data, and in Chapter 15 we discussed streaming data, which also has a natural temporal dimension.

In this Chapter, we will expand our predictive data analytic strategies specifically for analyzing longitudinal data. We will interrogate datasets that track the same type of information, for the same subjects, units, or locations, over a period of time. Specifically, we will present time series analysis, forecasting using autoregressive integrated moving average (ARIMA) models, structural equation models (SEM), and longitudinal data analysis via linear mixed models.

*Note*: This Chapter is purposely split into two separate sections to reduce the page loading time, stratify the extensive computational tasks, and improve the user experience. The separate RNN/LSTM section is available here, and it includes:

- Recurrent Neural Networks (RNN)
- Tensor Format Representation
- Simulated RNN case-study
- Climate Data Study
- Keras-based Multi-covariate LSTM Time-series Analysis and Forecasting.

Time series analysis relies on models like ARIMA (autoregressive integrated moving average) that utilize past longitudinal information to predict near-future outcomes. Time series data tend to track univariate, or sometimes multivariate, processes over a continuous time interval. The stock market, e.g., the daily closing value of the Dow Jones Industrial Average index, or electroencephalography (EEG) data provide examples of such longitudinal datasets (time series).

The basic concepts in time series analysis include:

- *(Second-order) stationary time series*: the characteristics of the series (e.g., its first two moments) are stable over time and do not depend on the time at which the series process is observed.
- *Differencing*: a transformation applied to time-series data to make it stationary. Differences between consecutive time-observations may be computed by \(\displaystyle y_{t}'=y_{t}-y_{t-1}\). Differencing removes level changes in the time series, eliminates trend, reduces seasonality, and stabilizes the mean of the time series. Differencing the time series repeatedly may yield a stationary time series. For example, a second-order differencing: \[\displaystyle {\begin{aligned}y_{t}''&=y_{t}'-y_{t-1}'\\&=(y_{t}-y_{t-1})-(y_{t-1}-y_{t-2})\\&=y_{t}-2y_{t-1}+y_{t-2}\end{aligned}}.\]
- *Seasonal differencing*: the difference between one observation and its corresponding observation in the previous epoch, or season (e.g., annually, with \(m=4\) seasons): \[\displaystyle y_{t}'''=y_{t}-y_{t-m},\quad {\text{where }}m={\text{number of seasons}}.\] The differenced data may then be used to estimate an ARMA model.
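The three differencing operations above can be illustrated with base R's `diff()` function. This is a minimal sketch on a toy vector, not part of the Beijing analysis that follows:

```r
# toy series
y <- c(5, 7, 4, 9, 12, 6, 8, 11)

# first difference: y'_t = y_t - y_{t-1}
d1 <- diff(y)

# second-order difference: y''_t = y_t - 2*y_{t-1} + y_{t-2}
d2 <- diff(y, differences = 2)

# seasonal difference with m = 4 seasons: y_t - y_{t-4}
ds <- diff(y, lag = 4)

# the iterated difference agrees with the expanded second-order formula
all(d2 == y[3:8] - 2*y[2:7] + y[1:6])
```

Note that `diff(y, differences = 2)` iterates the first difference twice, which matches the expanded formula term by term.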

We will use the Beijing air quality PM2.5 dataset as an example to demonstrate the analysis process. This dataset records the *hourly average concentration of fine particulate matter, i.e., particles smaller than 2.5 microns (PM2.5), in micrograms per cubic meter* in Beijing, China over nine years (2008-2016).

Let’s first import the dataset into `R`.

```
beijing.pm25 <- read.csv("https://umich.instructure.com/files/1823138/download?download_frd=1")
summary(beijing.pm25)
```

```
## Index Site Parameter Date..LST.
## Min. : 1 Length:69335 Length:69335 Length:69335
## 1st Qu.:17335 Class :character Class :character Class :character
## Median :34668 Mode :character Mode :character Mode :character
## Mean :34668
## 3rd Qu.:52002
## Max. :69335
## Year Month Day Hour
## Min. :2008 Min. : 1.000 Min. : 1.00 Min. : 0.0
## 1st Qu.:2010 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 5.5
## Median :2012 Median : 6.000 Median :16.00 Median :11.0
## Mean :2012 Mean : 6.407 Mean :15.73 Mean :11.5
## 3rd Qu.:2014 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:17.5
## Max. :2016 Max. :12.000 Max. :31.00 Max. :23.0
## Value Duration QC.Name
## Min. :-999.00 Length:69335 Length:69335
## 1st Qu.: 22.00 Class :character Class :character
## Median : 63.00 Mode :character Mode :character
## Mean : 24.99
## 3rd Qu.: 125.00
## Max. : 994.00
```

The `Value` column records the PM2.5 AQI (Air Quality Index) over this period. We observe that there are some missing data in the `Value` column. By examining the `QC.Name` column, we find only about 6.5% (4,408 observations) missing values. One way of handling the missingness is to replace the missing values with the corresponding variable mean.

```
beijing.pm25[beijing.pm25$Value==-999, 9] <- NA
beijing.pm25[is.na(beijing.pm25$Value), 9] <- floor(mean(beijing.pm25$Value, na.rm = T))
```

Here we first recode the missing values (coded as `-999`) as `NA` labels. Then we replace all `NA` labels with the *mean* computed using all non-missing observations. Note that the `floor()` function truncates the arithmetic average to an integer, which is needed as AQI values are expected to be whole numbers.

Now, let’s observe the trend of hourly average PM2.5 across one day. You can see a significant pattern: the PM2.5 level peaks in the afternoon, is lowest in the early morning, and exhibits approximate periodic boundary conditions (these patterns oscillate daily).

```
require(ggplot2)
require(reshape2)
id <- 1:nrow(beijing.pm25)
mat <- matrix(0, nrow=24, ncol=3)
# mean and 20th/80th percentiles of PM2.5 for a given set of row indices
stat <- function(x) {
  c(mean(beijing.pm25[x, "Value"]), quantile(beijing.pm25[x, "Value"], c(0.2, 0.8)))
}
for (i in 1:24) {
  iid <- which(id %% 24 == i-1)
  mat[i, ] <- stat(iid)
}
mat <- as.data.frame(mat)
colnames(mat) <- c("mean", "20%", "80%")
mat$time <- c(15:23, 0:14)
dt <- melt(mat, id="time")
colnames(dt)
```

`## [1] "time" "variable" "value"`

```
ggplot(data = dt,mapping = aes(x=time,y=value,color = variable))+geom_line()+
scale_x_continuous(breaks = 0:23)+ggtitle("Beijing hour average PM2.5 from 2008-2016")
```

Are there any daily or monthly trends? We can start the data interrogation by building an ARIMA model to examine detailed patterns in the data.
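Before applying ARIMA to the real PM2.5 series, the fitting workflow can be previewed on simulated data. This is a minimal sketch; the AR(1) coefficient 0.7 and the model order (1,0,0) are arbitrary choices for illustration:

```r
set.seed(1234)
# simulate 500 observations from an AR(1) process with coefficient 0.7
sim <- arima.sim(model = list(ar = 0.7), n = 500)

# fit an ARIMA(1,0,0) model; the estimated AR coefficient should be near 0.7
fit <- arima(sim, order = c(1, 0, 0))
fit$coef

# forecast the next 10 time points from the fitted model
predict(fit, n.ahead = 10)$pred
```

The same `arima()`/`predict()` pattern applies to the PM2.5 data once an appropriate model order is identified.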

To begin with, we can visualize the overall trend by plotting PM2.5 values against time. This can be achieved using the `plyr` package.

```
library(plyr)
ts <- ts(beijing.pm25$Value, start=1, end=69335, frequency=1)
ts.plot(ts)
```