
1 Structured Big Data Analytics Case-Study

Next, we look at another interesting example of a large structured tabular dataset. The goal remains the same: examine the effects of indexing complex data using only kime-order (time), and compare the resulting data representations as well as the subsequent data analytics. In this case-study, we use the UK Biobank (UKBB) data.

A previous investigation [CITE 2018 SOCR UKBB paper], based on \(7,614\) imaging, clinical, and phenotypic features and neuroimaging data of \(9,914\) UKBB subjects, reported the twenty most salient derived imaging biomarkers. By jointly representing and modeling the significant clinical and demographic variables along with specific salient neuroimaging features, the researchers predicted the presence and progression of depression and the mental health of participating volunteers. We will explore the effects of kime-direction on the findings using the same data and methods. For ease of demonstration, efficient calculations, and direct interpretation, we start by transforming the data into a tighter computable object of dimensions \(9,914\times 107\).

1.1 Data Source Type

For more information about the data, please refer to: https://www.nature.com/articles/s41598-019-41634-y

UKBB_data <- get(load("UKBB_data_cluster_label.Rdata")) 
# str(UKBB_data)
UKBB_Colnames <- colnames(UKBB_data); View(UKBB_Colnames); dim(UKBB_data)   # 9914 7615
# Extract the top-50 derived NI biomarkers (data_summary_cluster_2.xlsx), per
# https://drive.google.com/drive/folders/1SdAtefp_taabNL70JvwJZSexTkzXiEKD 
top50_NI_Biomarkers <- c("lh_BA_exvivo_area__lh_WhiteSurfArea_area", "rh_BA_exvivo_area__rh_WhiteSurfArea_area", "rh_aparc.a2009s_area__rh_WhiteSurfArea_area", "rh_aparc_area__rh_WhiteSurfArea_area", "lh_aparc_area__lh_WhiteSurfArea_area", "lh_aparc.a2009s_area__lh_WhiteSurfArea_area", "aseg__SupraTentorialVol", "aseg__SupraTentorialVolNotVent", "aseg__SupraTentorialVolNotVentVox", "aseg__BrainSegVol", "aseg__BrainSegVolNotVentSurf", "aseg__BrainSegVolNotVent", "aseg__CortexVol", "aseg__rhCortexVol", "aseg__lhCortexVol", "aseg__TotalGrayVol", "aseg__MaskVol", "rh_aparc.DKTatlas_area__rh_superiortemporal_area", "rh_aparc.DKTatlas_area__rh_superiorfrontal_area", "lh_aparc.DKTatlas_area__lh_superiorfrontal_area", "lh_aparc.DKTatlas_area__lh_lateralorbitofrontal_area", "lh_aparc.DKTatlas_area__lh_superiortemporal_area", "aseg__EstimatedTotalIntraCranialVol", "lh_aparc_area__lh_lateralorbitofrontal_area", "rh_aparc.DKTatlas_area__rh_lateralorbitofrontal_area", "lh_aparc_area__lh_superiorfrontal_area", "rh_aparc_area__rh_superiortemporal_area", "rh_aparc.a2009s_area__rh_G.S_cingul.Ant_area", "rh_aparc_area__rh_superiorfrontal_area", "lh_aparc_area__lh_rostralmiddlefrontal_area", "wmparc__wm.lh.lateralorbitofrontal", "wmparc__wm.lh.insula", "rh_aparc_area__rh_medialorbitofrontal_area", "lh_BA_exvivo_area__lh_BA3b_exvivo_area", "lh_aparc.DKTatlas_area__lh_postcentral_area", "lh_aparc.DKTatlas_volume__lh_lateralorbitofrontal_volume", "lh_aparc.DKTatlas_area__lh_insula_area", "aseg__SubCortGrayVol", "lh_aparc.a2009s_area__lh_G_orbital_area", "lh_aparc_area__lh_superiortemporal_area", "rh_aparc.DKTatlas_area__rh_insula_area", "lh_aparc.DKTatlas_area__lh_precentral_area", "lh_aparc.pial_area__lh_lateralorbitofrontal_area", "lh_aparc.DKTatlas_area__lh_rostralmiddlefrontal_area", "lh_aparc_area__lh_postcentral_area", "lh_aparc.pial_area__lh_superiorfrontal_area", "rh_aparc_area__rh_rostralmiddlefrontal_area", "wmparc__wm.lh.superiortemporal", 
"lh_aparc.pial_area__lh_rostralmiddlefrontal_area", "rh_aparc.DKTatlas_volume__rh_lateralorbitofrontal_volume")

# Extract the main clinical features (binary/dichotomous and categorical/polytomous)
#### binary
top25_BinaryClinical_Biomarkers <- 
  c("X1200.0.0", "X1200.2.0", "X1170.0.0", "X1190.2.0", "X1170.2.0","X2080.0.0", "X6138.2.2",
    "X20117.0.0", "X6138.0.2", "X2877.0.0", "X20117.2.0", "X2877.2.0","X1190.0.0", "X4968.2.0",
    "X1249.2.0", "X1190.1.0", "X1170.1.0", "X2080.2.0", "X4292.2.0","X2050.0.0", "X1628.0.0",
    "X1200.1.0", "X20018.2.0", "X4292.0.0", "X3446.0.0")
#### polytomous
top31_PolytomousClinical_Biomarkers <- 
  c("X31.0.0", "X22001.0.0", "X1950.0.0", "X1950.2.0", "X1980.0.0", "X2040.2.0", "X1980.2.0",
    "X2030.0.0", "X2090.0.0", "X2040.0.0", "X1618.2.0", "X1618.0.0", "X1210.0.0", "X2030.2.0",
    "X2000.0.0", "X1930.0.0", "X2090.2.0", "X2000.2.0", "X1210.2.0", "X1618.1.0", "X4653.2.0",
    "X1970.2.0", "X1970.0.0", "X1980.1.0", "X1930.2.0", "X4598.2.0", "X4598.0.0", "X4653.0.0",
    "X2090.1.0", "X2040.1.0", "X4631.2.0")

# Extract derived computed phenotype 
derivedComputedPhenotype <- UKBB_Colnames[length(UKBB_Colnames)]

# Construct the Computable data object including all salient predictors and derived cluster phenotype
ColNameList <- 
  c(top50_NI_Biomarkers, top25_BinaryClinical_Biomarkers, 
    top31_PolytomousClinical_Biomarkers, derivedComputedPhenotype)
length(ColNameList)
col.index <- which(colnames(UKBB_data) %in% ColNameList)
length(col.index)
tight107_UKBB_data <- UKBB_data[ , col.index]; dim(tight107_UKBB_data)
## View(tight107_UKBB_data[1:10, ])   # Confirm the tight data object organization 
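
The chunk above prints `length(ColNameList)` and `length(col.index)` so the two counts can be compared by eye. The following is a minimal sketch, on a made-up toy data frame (the names `toy`, `wanted`, and `not_in_data` are hypothetical and not part of the UKBB data), of how that comparison could be made explicit. It also illustrates that `which(... %in% ...)` returns columns in the order they appear in the data, not in the order of the requested list, and silently skips requested names that are absent.

```r
# Toy stand-in for UKBB_data (simulated values, hypothetical column names)
toy <- data.frame(id = 1:4,
                  b = c(0.1, 0.2, 0.3, 0.4),
                  a = c(1, 0, 1, 0),
                  cluster = c(1, 2, 1, 2))
wanted <- c("a", "b", "not_in_data", "cluster")

# Requested names that do not exist in the data (silently skipped by %in%)
missing_cols <- setdiff(wanted, colnames(toy))

# Same idiom as in the chunk above: indices come back in data-column order
idx_data_order <- which(colnames(toy) %in% wanted)

# Alternative: keep the order of the requested list, skipping missing names
idx_list_order <- match(intersect(wanted, colnames(toy)), colnames(toy))
tight_toy <- toy[, idx_list_order]

print(missing_cols)
print(colnames(toy)[idx_data_order])
print(colnames(tight_toy))
```

Applied to the real object, a non-empty `setdiff(ColNameList, colnames(UKBB_data))` would indicate requested features that are not present under those exact names.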

We can investigate the effects of the kime-phase on the resulting data analytic inference obtained using the UKBB data.

The data preprocessing is essentially complete at this point; readers can load “Fig5.10_to_13.Rdata” and start from here.

1.2 Finish Data Imputation

## [1] TRUE
## [1] 9900  106
## [1]  11 900 106
## [1] 900 106
## [1] 900 106
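
The code that produced the dimensions printed above is not shown. As a hedged illustration only, the sketch below shows one way a matrix could be trimmed and split into equally sized epochs stored in a 3D array, which would yield shapes analogous to `9900 x 106`, `11 x 900 x 106`, and `900 x 106`. It assumes contiguous, non-overlapping row blocks, which may differ from the actual preprocessing; all object names and sizes here are hypothetical toy values.

```r
# Toy sizes standing in for 11 epochs x 900 rows x 106 features (assumption)
set.seed(1234)
n_epochs  <- 3
epoch_len <- 4
p         <- 2
X <- matrix(rnorm(14 * p), ncol = p)   # 14 rows: 2 more than 3 * 4

# Drop the trailing rows that do not fill a complete epoch
X_trim <- X[seq_len(n_epochs * epoch_len), , drop = FALSE]

# Stack contiguous row blocks into an (epoch, row, feature) array
epochs <- array(NA_real_, dim = c(n_epochs, epoch_len, p))
for (k in seq_len(n_epochs)) {
  rows <- ((k - 1) * epoch_len + 1):(k * epoch_len)
  epochs[k, , ] <- X_trim[rows, ]
}

# The first epoch is an ordinary (epoch_len x p) matrix
epoch_1 <- epochs[1, , ]
print(dim(X_trim)); print(dim(epochs)); print(dim(epoch_1))
```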
# To Transform the entire UKBB data to k-space (Fourier domain)
# library(EBImage)
#FT_UKBB_data <- fft(comp_imp_tight106_UKBB_data)
#X2 <- FT_UKBB_data  # display(FT_UKBB_data, method = "raster") 
#mag_FT_UKBB_data <- sqrt(Re(X2)^2+Im(X2)^2) 
###  # plot(log(fftshift1D(Re(X2_mag))), main = "log(Magnitude(FFT(timeseries)))") 
#phase_FT_UKBB_data <- atan2(Im(X2), Re(X2)) 
### Test the process to confirm calculations
# X2<-FT_UKBB_data; X2_mag <- mag_FT_UKBB_data; X2_phase<-phase_FT_UKBB_data
# Real2 = X2_mag * cos(X2_phase)
# Imaginary2 = X2_mag * sin(X2_phase)
# man_hat_X2 = Re(fft(Real2 + 1i*Imaginary2, inverse = T)/length(X2))
# ifelse(abs(man_hat_X2[5,10] - comp_imp_tight106_UKBB_data[5, 10]) < 0.001, "Perfect Synthesis", "Problems!!!")
#######
# Then we can Invert back the complete UKBB FT data into spacetime using nil phase
#Real = mag_FT_UKBB_data * cos(0)  # cos(phase_FT_UKBB_data)
#Imaginary = mag_FT_UKBB_data * sin(0)   # sin(phase_FT_UKBB_data)
#ift_NilPhase_X2mag = Re(fft(Real+1i*Imaginary, inverse = T)/length(FT_UKBB_data))
# display(ift_NilPhase_X2mag, method = "raster")
# dim(ift_NilPhase_X2mag); View(ift_NilPhase_X2mag); # compare to View(aqi_data1)
#summary(comp_imp_tight106_UKBB_data); summary(ift_NilPhase_X2mag)
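
The commented steps above can be sketched on a small simulated matrix. This is a minimal illustration, not the analysis itself: the matrix `X` is random toy data standing in for `comp_imp_tight106_UKBB_data`, and `Mod()` and `Arg()` are used as base-R equivalents of the explicit `sqrt(Re^2 + Im^2)` and `atan2(Im, Re)` expressions. It shows that synthesis with the true phases recovers the data up to numerical error, whereas synthesis with nil phase does not.

```r
# Toy stand-in for the imputed data matrix (simulated, not UKBB values)
set.seed(1234)
X <- matrix(rnorm(8 * 5), nrow = 8, ncol = 5)

# Forward FFT; for a matrix, fft() computes the multidimensional transform
FT_X       <- fft(X)
mag_FT_X   <- Mod(FT_X)   # magnitudes, sqrt(Re^2 + Im^2)
phase_FT_X <- Arg(FT_X)   # phases, atan2(Im, Re)

# Inverse FFT with the true phases; base R's inverse is unnormalized,
# so divide by the number of elements
X_true <- Re(fft(mag_FT_X * exp(1i * phase_FT_X), inverse = TRUE) / length(X))
max_err_true <- max(abs(X_true - X))

# Inverse FFT with nil phase (all phases set to 0)
X_nil <- Re(fft(mag_FT_X * exp(1i * 0), inverse = TRUE) / length(X))
max_err_nil <- max(abs(X_nil - X))

print(c(true_phase = max_err_true, nil_phase = max_err_nil))
```

The two reconstructions share the same magnitudes, so any difference between them is attributable to the phases alone.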

# 5. Epoch 1: Perform Random Forest prediction (based on ift_TruePhase_X2mag==Original==epochs_tight106_UKBB_data_1) of:
##### Ever depressed for a whole week 1 #########################################
library(randomForest)   # provides randomForest(), used below
# y_pheno <- comp_imp_tight106_UKBB_data[,"X4598.2.0"] ### Ever depressed for a whole week 1
# y_pheno <- as.factor(y_pheno)
colnames(epochs_tight106_UKBB_data_1) <- colnames(tight106_UKBB_data)
set.seed(1234)
rf_depressed <- 
  randomForest(as.factor(epochs_tight106_UKBB_data_1[ ,"X4598.2.0"]) ~ . , 
               data=epochs_tight106_UKBB_data_1[ , !(colnames(epochs_tight106_UKBB_data_1) %in%
                                                       c("X4598.0.0", "X4598.2.0"))])
rf_depressed
## 
## Call:
##  randomForest(formula = as.factor(epochs_tight106_UKBB_data_1[,      "X4598.2.0"]) ~ ., data = epochs_tight106_UKBB_data_1[, !(colnames(epochs_tight106_UKBB_data_1) %in%      c("X4598.0.0", "X4598.2.0"))]) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 10
## 
##         OOB estimate of  error rate: 21.44%
## Confusion matrix:
##     0   1 class.error
## 0 370  71   0.1609977
## 1 122 337   0.2657952
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 370 122
##          1  71 337
##                                          
##                Accuracy : 0.7856         
##                  95% CI : (0.7573, 0.812)
##     No Information Rate : 0.51           
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5719         
##                                          
##  Mcnemar's Test P-Value : 0.0003193      
##                                          
##             Sensitivity : 0.8390         
##             Specificity : 0.7342         
##          Pos Pred Value : 0.7520         
##          Neg Pred Value : 0.8260         
##              Prevalence : 0.4900         
##          Detection Rate : 0.4111         
##    Detection Prevalence : 0.5467         
##       Balanced Accuracy : 0.7866         
##                                          
##        'Positive' Class : 0              
##
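
As a cross-check of the summary statistics above, the following sketch recomputes several of them directly from the four confusion-matrix counts in the output, using only base R. The variable names are illustrative; class `0` is treated as the positive class, as stated in the output.

```r
# Counts copied from the confusion matrix above (rows: prediction, cols: reference)
cm <- matrix(c(370, 71, 122, 337), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # true positives / all reference 0
specificity <- cm["1", "1"] / sum(cm[, "1"])   # true negatives / all reference 1
ppv         <- cm["0", "0"] / sum(cm["0", ])   # positive predictive value
npv         <- cm["1", "1"] / sum(cm["1", ])   # negative predictive value
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement
p_chance <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - p_chance) / (1 - p_chance)

print(round(c(Accuracy = accuracy, Kappa = kappa, Sensitivity = sensitivity,
              Specificity = specificity, PPV = ppv, NPV = npv,
              BalancedAccuracy = balanced), 4))
```

The accuracy is one minus the out-of-bag error rate reported by the random forest (\(1 - 0.2144 = 0.7856\)).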