
This chapter introduces the foundations of R programming for visualization, statistical computing, and scientific inference. Specifically, in this chapter we will (1) discuss the rationale for selecting R as a computational platform for all DSPA demonstrations; (2) present the basics of installing the shell-based R environment and the RStudio user interface; (3) show some simple R commands and scripts (e.g., translating wide-to-long data format, data simulation, data stratification and subsetting); (4) introduce variable types and their manipulation; (5) demonstrate simple mathematical functions, statistics, and matrix operators; (6) explore simple data visualization; and (7) introduce optimization and model fitting. The chapter appendix includes references to introductory and advanced R resources, as well as a primer on debugging.

1 Why use R?

There are many different classes of software that can be used for data interrogation, modeling, inference, and statistical computing. Among these are R, Python, Java, C/C++, Perl, and many others. The table below compares R to various other statistical analysis software packages; a more detailed comparison is available online.

Statistical Software Advantages Disadvantages
R R is actively maintained (\(\ge 100,000\) developers, \(\ge 15K\) packages). Excellent connectivity to various types of data and other systems. Versatile for solving problems in many domains. Free, open-source code; anybody can access, review, and extend the source code. Very stable and reliable. If you change or redistribute the R source code, you have to make those changes available for anybody else to use. Runs anywhere (platform agnostic). Extensibility: R supports extensions, e.g., for data manipulation, statistical modeling, and graphics. An active and engaged community supports R, with unparalleled question-and-answer (Q&A) websites. R connects with other languages (Java/C/JavaScript/Python/Fortran), database systems, and other programs such as SAS and SPSS; other packages have add-ons to connect with R (SPSS has incorporated a link to R, and SAS has protocols to move data and graphics between the two packages). Mostly a scripting language; steeper learning curve.
SAS Handles large datasets. Commonly used in business and government. Expensive and proprietary; somewhat dated programming language.
Stata Easy statistical analyses. Mostly classical statistics.
SPSS Appropriate for beginners; simple interfaces. Weak in cutting-edge statistical procedures; lacking in robust methods and survey methods.

There exist substantial differences between different types of computational environments for data wrangling, preprocessing, analytics, visualization and interpretation. The table below provides some rough comparisons between some of the most popular data computational platforms (higher scores represent better performance within the specific category, but the scales are not normalized between categories).

Language OpenSource Speed ComputeTime LibraryExtent EaseOfEntry Costs Interoperability
Python Yes 16 62 80 85 10 90
Julia Yes 2941 0.34 100 30 10 90
R Yes 1 745 100 80 15 90
IDL No 67 14.77 50 88 100 20
Matlab No 147 6.8 75 95 100 20
Scala Yes 1428 0.7 50 30 20 40
C Yes 1818 0.55 100 30 10 99
Fortran Yes 1315 0.76 95 25 15 95

Let’s look at some real data comparing R, SAS, and SPSS, three popular tools for data manipulation and statistical modeling.

require(ggplot2)
## Loading required package: ggplot2
require(reshape2)
## Loading required package: reshape2
Data_R_SAS_SPSS_Pubs <- read.csv('https://umich.instructure.com/files/2361245/download?download_frd=1', header=T)
df <- data.frame(Data_R_SAS_SPSS_Pubs) 
# convert to long format (http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/) 
df <- melt(df ,  id.vars = 'Year', variable.name = 'Software') 
ggplot(data=df, aes(x=Year, y=value, color=Software, group=Software)) + geom_line(size=4) + labs(x='Year', y='Citations')

2 Getting started

2.1 Install Basic Shell-based R

R is free software that can be installed on any computer. The R website is http://R-project.org. There you can install a shell-based R environment following this protocol:

  • click download CRAN in the left bar
  • choose a download site
  • choose your operating system (e.g., Windows, Mac, Linux)
  • click base
  • download the latest version of R (3.4 or newer) for your specific operating system (e.g., Windows)

2.2 GUI based R Invocation (RStudio)

For many readers, it’s best to also install and run R via the RStudio GUI, which provides a convenient user interface. To install RStudio, go to http://www.rstudio.org/ and do the following:

  • click Download RStudio
  • click Download RStudio Desktop
  • click Recommended For Your System
  • download the .exe file and run it (choose default answers for all questions)

2.3 RStudio GUI Layout

The RStudio interface consists of several windows.

  • Bottom left: console window (also called command window). Here you can type simple commands after the “>” prompt and R will then execute your command. This is the most important window, because this is where R actually does stuff.
  • Top left: editor window (also called script window). Collections of commands (scripts) can be edited and saved. If you don’t see this window, you can open it with File > New > R script. Just typing a command in the editor window is not enough; it has to reach the command window before R executes it. If you want to run a line from the script window (or the whole script), you can click Run or press CTRL+ENTER to send it to the command window.
  • Top right: workspace / history window. In the workspace window, you can see which data and values R has in its memory. You can view and edit the values by clicking on them. The history window shows what has been typed before.
  • Bottom right: files / plots / packages / help window. Here you can open files, view plots (also previous plots), install and load packages or use the help function. You can change the size of the windows by dragging the grey bars between the windows.

2.4 Some notes

  • The basic R environment installation comes with limited core functionality. Everyone eventually has to install additional packages, e.g., reshape2 and ggplot2 (as shown in the sketch after this list), and we will show how to expand your RStudio library throughout these materials.
  • The core R environment also has to be upgraded occasionally, e.g., every 3-6 months, to get R patches, fix known problems, and add new functionality. This is also easy to do.
  • The assignment operator in R is <- (although = may also be used), so to assign a value of \(2\) to a variable \(x\), we can use x <- 2 or equivalently x = 2.
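
For example, installing, loading, and updating packages can be done directly from the R console. A minimal sketch (the package names are simply the ones used later in this chapter):

# install additional packages from CRAN (names are examples used in this chapter)
install.packages(c("reshape2", "ggplot2"))
# load an installed package into the current session
library(ggplot2)
# update previously installed packages to their latest versions
update.packages(ask = FALSE)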

3 Help

R provides documentation for all of its functions via help(). Typing help(topic) in the R console gives a detailed explanation of the corresponding R topic or function. An even shorter alternative is ?topic.

For example, to check the documentation for the linear model function lm(), we can use either of the following commands.

help(lm)
?lm

4 Simple Wide-to-Long Data format translation

Here is a first R script for melting a simple dataset from wide format to long format.

rawdata_wide <- read.table(header=TRUE, text='
 CaseID Gender Age  Condition1  Condition2
       1   M    5   13          10.5
       2   F    6   16          11.2
       3   F    8   10          18.3
       4   M    9       9.5     18.1
       5   M    10      12.1        19
')
# Add a subject column that stores CaseID as a factor
rawdata_wide$subject <- factor(rawdata_wide$CaseID)

rawdata_wide
##   CaseID Gender Age Condition1 Condition2 subject
## 1      1      M   5       13.0       10.5       1
## 2      2      F   6       16.0       11.2       2
## 3      3      F   8       10.0       18.3       3
## 4      4      M   9        9.5       18.1       4
## 5      5      M  10       12.1       19.0       5
library(reshape2)

# Specify id.vars: the variables to keep (don't split apart on!)
melt(rawdata_wide, id.vars=c("CaseID", "Gender"))
## Warning: attributes are not identical across measure variables; they will
## be dropped
##    CaseID Gender   variable value
## 1       1      M        Age     5
## 2       2      F        Age     6
## 3       3      F        Age     8
## 4       4      M        Age     9
## 5       5      M        Age    10
## 6       1      M Condition1    13
## 7       2      F Condition1    16
## 8       3      F Condition1    10
## 9       4      M Condition1   9.5
## 10      5      M Condition1  12.1
## 11      1      M Condition2  10.5
## 12      2      F Condition2  11.2
## 13      3      F Condition2  18.3
## 14      4      M Condition2  18.1
## 15      5      M Condition2    19
## 16      1      M    subject     1
## 17      2      F    subject     2
## 18      3      F    subject     3
## 19      4      M    subject     4
## 20      5      M    subject     5

There are options for melt that can make the output a little easier to work with:

data_long <- melt(rawdata_wide, 
        # ID variables - all the variables to keep but not split apart on
    id.vars=c("CaseID", "Gender"), 
        # The source columns
    measure.vars=c("Age", "Condition1", "Condition2" ), 
        # Name of the destination column that will identify the original
        # column that the measurement came from
    variable.name="Feature", 
    value.name="Measurement"
)
data_long
##    CaseID Gender    Feature Measurement
## 1       1      M        Age         5.0
## 2       2      F        Age         6.0
## 3       3      F        Age         8.0
## 4       4      M        Age         9.0
## 5       5      M        Age        10.0
## 6       1      M Condition1        13.0
## 7       2      F Condition1        16.0
## 8       3      F Condition1        10.0
## 9       4      M Condition1         9.5
## 10      5      M Condition1        12.1
## 11      1      M Condition2        10.5
## 12      2      F Condition2        11.2
## 13      3      F Condition2        18.3
## 14      4      M Condition2        18.1
## 15      5      M Condition2        19.0

For an elaborate justification, detailed description, and multiple examples of handling long and wide data, messy and tidy data, and data-cleaning strategies, see the JSS Tidy Data article by Hadley Wickham (https://www.jstatsoft.org/article/view/v059i10).

5 Data generation

Popular data generation functions are c(), seq(), rep(), and data.frame(). Sometimes we use list() and array() to create data too.

c()

c() creates a (column) vector. With option recursive=T, it descends through lists combining all elements into one vector.

a<-c(1, 2, 3, 5, 6, 7, 10, 1, 4)
a
## [1]  1  2  3  5  6  7 10  1  4
c(list(A = c(Z = 1, Y = 2), B = c(X = 7), C = c(W = 7, V=3, U=-1.9)), recursive = TRUE)
##  A.Z  A.Y  B.X  C.W  C.V  C.U 
##  1.0  2.0  7.0  7.0  3.0 -1.9

When combined with list(), c() successfully created a vector with all the information in a list with three members A, B, and C.

seq(from, to)

seq(from, to) generates a sequence. The option by= specifies the increment; the option length= specifies the desired length. Also, seq(along=x) generates the sequence 1, 2, ..., length(x), which is useful in loops for creating an ID for each element of x.

seq(1, 20, by=0.5)
##  [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5
## [15]  8.0  8.5  9.0  9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5
## [29] 15.0 15.5 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0
seq(1, 20, length=9)
## [1]  1.000  3.375  5.750  8.125 10.500 12.875 15.250 17.625 20.000
seq(along=c(5, 4, 5, 6))
## [1] 1 2 3 4

rep(x, times)

rep(x, times) creates a sequence that repeats x a specified number of times. The option each= repeats each element of x the given number of times before moving on to the next element.

rep(c(1, 2, 3), 4)
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3
rep(c(1, 2, 3), each=4)
##  [1] 1 1 1 1 2 2 2 2 3 3 3 3

Compare this to replicating using replicate()

X <- seq(along=c(1, 2, 3)); replicate(4, X+1)
##      [,1] [,2] [,3] [,4]
## [1,]    2    2    2    2
## [2,]    3    3    3    3
## [3,]    4    4    4    4

data.frame()

data.frame() creates a data frame of named or unnamed arguments. We can combine multiple vectors; each vector is stored as a column. Shorter vectors are recycled to the length of the longest one. With data.frame() you can mix numeric and character vectors.

data.frame(v=1:4, ch=c("a", "B", "C", "d"), n=c(10, 11))
##   v ch  n
## 1 1  a 10
## 2 2  B 11
## 3 3  C 10
## 4 4  d 11

Note that the 1:4 means from 1 to 4. The operator : generates a sequence.

list()

As mentioned for c(), list() creates a list of the named or unnamed arguments; list elements are indexed from 1 to n, inclusive.

l<-list(a=c(1, 2), b="hi", c=-3+3i)
l
## $a
## [1] 1 2
## 
## $b
## [1] "hi"
## 
## $c
## [1] -3+3i
# Note Complex Numbers a <- -1+3i; b <- -2-2i; a+b

We use $ to access a named member of the list and [[ ]] to access the element at a specific index. For example,

l$a[[2]]
## [1] 2
l$b
## [1] "hi"

Note that R uses 1-based numbering rather than 0-based like some other languages (C/Java), so the first element of a list has index \(1\).

array(x, dim=)

array(x, dim=) creates an array with the specified dimensions. For example, dim=c(3, 4, 2) means two 3x4 matrices. We use [ ] to extract specific elements of the array: [2, 3, 1] is the element in the 2nd row and 3rd column of the 1st page. Leaving one index position empty extracts an entire row, column, or page; [2, , 1] is the second row of the 1st page.

ar<-array(1:24, dim=c(3, 4, 2))
ar
## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]   13   16   19   22
## [2,]   14   17   20   23
## [3,]   15   18   21   24
ar[2, 3, 1]
## [1] 8
ar[2, ,1]
## [1]  2  5  8 11
  • In general, multi-dimensional arrays are called “tensors” (whose order is the number of dimensions).

Other useful functions, illustrated in the short sketch after this list, are:

  • matrix(x, nrow=, ncol=): creates a matrix with nrow rows and ncol columns from the elements of x.
  • factor(x, levels=): encodes a vector x as a factor.
  • gl(n, k, length=n*k, labels=1:n): generates levels (factors) by specifying the pattern of their levels; n is the number of levels and k is the number of replications.
  • expand.grid(): a data frame from all combinations of the supplied vectors or factors.
  • rbind(): combines arguments by rows for matrices, data frames, and others.
  • cbind(): combines arguments by columns for matrices, data frames, and others.
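
Here is that sketch (all values are arbitrary):

matrix(1:6, nrow=2, ncol=3)                      # a 2x3 matrix, filled column-wise
factor(c("low", "high", "low"), levels=c("low", "high"))
gl(2, 3, labels=c("Control", "Treatment"))       # 2 levels, 3 replications each
expand.grid(dose=c(1, 2), group=c("A", "B"))     # all dose-group combinations
m <- matrix(1:6, nrow=2)
rbind(m, c(7, 8, 9))                             # append a row
cbind(m, c(10, 11))                              # append a column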

6 Input/Output(I/O)

The first pair of functions we will discuss are save() and load(); load() reloads datasets that were previously written to disk with the save() function.

Let’s create some data first.

x <- seq(1, 10, by=0.5)
y <- list(a = 1, b = TRUE, c = "oops")
save(x, y, file="xy.RData")
load("xy.RData")

data(x) loads the specified built-in dataset; library(x) loads an add-on package.

data("iris")
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

read.table(file) reads a file in table format and creates a data frame from it. The default separator sep="" is any whitespace. Use header=TRUE to read the first line as a header of column names. Use as.is=TRUE to prevent character vectors from being converted to factors. Use comment.char="" to prevent "#" from being interpreted as a comment. Use skip=n to skip n lines before reading data. See the help for options on row naming, NA treatment, and others.

Let’s use read.table() to read a text file from our class files.

data.txt<-read.table("https://umich.instructure.com/files/1628628/download?download_frd=1", header=T, as.is = T) # 01a_data.txt
summary(data.txt)
##      Name               Team             Position             Height    
##  Length:1034        Length:1034        Length:1034        Min.   :67.0  
##  Class :character   Class :character   Class :character   1st Qu.:72.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :74.0  
##                                                           Mean   :73.7  
##                                                           3rd Qu.:75.0  
##                                                           Max.   :83.0  
##      Weight           Age       
##  Min.   :150.0   Min.   :20.90  
##  1st Qu.:187.0   1st Qu.:25.44  
##  Median :200.0   Median :27.93  
##  Mean   :201.7   Mean   :28.74  
##  3rd Qu.:215.0   3rd Qu.:31.23  
##  Max.   :290.0   Max.   :48.52

read.csv(“filename”, header=TRUE) is identical to read.table() but with defaults set for reading comma-delimited files.

data.csv<-read.csv("https://umich.instructure.com/files/1628650/download?download_frd=1", header = T)  # 01_hdp.csv
summary(data.csv)
##    tumorsize           co2             pain           wound      
##  Min.   : 33.97   Min.   :1.222   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 62.49   1st Qu.:1.519   1st Qu.:4.000   1st Qu.:5.000  
##  Median : 70.07   Median :1.601   Median :5.000   Median :6.000  
##  Mean   : 70.88   Mean   :1.605   Mean   :5.473   Mean   :5.732  
##  3rd Qu.: 79.02   3rd Qu.:1.687   3rd Qu.:6.000   3rd Qu.:7.000  
##  Max.   :116.46   Max.   :2.128   Max.   :9.000   Max.   :9.000  
##     mobility       ntumors        nmorphine        remission     
##  Min.   :1.00   Min.   :0.000   Min.   : 0.000   Min.   :0.0000  
##  1st Qu.:5.00   1st Qu.:1.000   1st Qu.: 2.000   1st Qu.:0.0000  
##  Median :6.00   Median :3.000   Median : 3.000   Median :0.0000  
##  Mean   :6.08   Mean   :3.066   Mean   : 3.624   Mean   :0.2957  
##  3rd Qu.:7.00   3rd Qu.:5.000   3rd Qu.: 5.000   3rd Qu.:1.0000  
##  Max.   :9.00   Max.   :9.000   Max.   :18.000   Max.   :1.0000  
##   lungcapacity          Age           Married    FamilyHx     SmokingHx   
##  Min.   :0.01612   Min.   :26.32   Min.   :0.0   no :6820   current:1705  
##  1st Qu.:0.67647   1st Qu.:46.69   1st Qu.:0.0   yes:1705   former :1705  
##  Median :0.81560   Median :50.93   Median :1.0              never  :5115  
##  Mean   :0.77409   Mean   :50.97   Mean   :0.6                            
##  3rd Qu.:0.91150   3rd Qu.:55.27   3rd Qu.:1.0                            
##  Max.   :0.99980   Max.   :74.48   Max.   :1.0                            
##      Sex       CancerStage  LengthofStay         WBC            RBC       
##  female:5115   I  :2558    Min.   : 1.000   Min.   :2131   Min.   :3.919  
##  male  :3410   II :3409    1st Qu.: 5.000   1st Qu.:5323   1st Qu.:4.802  
##                III:1705    Median : 5.000   Median :6007   Median :4.994  
##                IV : 853    Mean   : 5.492   Mean   :5998   Mean   :4.995  
##                            3rd Qu.: 6.000   3rd Qu.:6663   3rd Qu.:5.190  
##                            Max.   :10.000   Max.   :9776   Max.   :6.065  
##       BMI             IL6                CRP               DID       
##  Min.   :18.38   Min.   : 0.03521   Min.   : 0.0451   Min.   :  1.0  
##  1st Qu.:24.20   1st Qu.: 1.93039   1st Qu.: 2.6968   1st Qu.:100.0  
##  Median :27.73   Median : 3.34400   Median : 4.3330   Median :199.0  
##  Mean   :29.07   Mean   : 4.01698   Mean   : 4.9730   Mean   :203.3  
##  3rd Qu.:32.54   3rd Qu.: 5.40551   3rd Qu.: 6.5952   3rd Qu.:309.0  
##  Max.   :58.00   Max.   :23.72777   Max.   :28.7421   Max.   :407.0  
##    Experience        School        Lawsuits          HID       
##  Min.   : 7.00   average:6405   Min.   :0.000   Min.   : 1.00  
##  1st Qu.:15.00   top    :2120   1st Qu.:1.000   1st Qu.: 9.00  
##  Median :18.00                  Median :2.000   Median :17.00  
##  Mean   :17.64                  Mean   :1.866   Mean   :17.76  
##  3rd Qu.:21.00                  3rd Qu.:3.000   3rd Qu.:27.00  
##  Max.   :29.00                  Max.   :9.000   Max.   :35.00  
##     Medicaid     
##  Min.   :0.1416  
##  1st Qu.:0.3369  
##  Median :0.5215  
##  Mean   :0.5125  
##  3rd Qu.:0.7083  
##  Max.   :0.8187

read.delim(“filename”, header=TRUE) is very similar to the first two. However, it has defaults set for reading tab-delimited files.

Also we have read.fwf(file, widths, header=FALSE, sep="\t" , as.is=FALSE) to read a table of fixed width formatted data into a data frame.

match(x, y) returns a vector of the positions of (first) matches of its first argument in its second. If no element of y matches a given element of x, the corresponding output is NA.

match(c(1, 2, 4, 5), c(1, 4, 4, 5, 6, 7))
## [1]  1 NA  2  4

save.image(file) saves all objects in the current work space.

write.table(x, file="", row.names=TRUE, col.names=TRUE, sep=" ") prints x after converting it to a data frame and stores it in the specified file. If quote is TRUE, character or factor columns are surrounded by double quotes ("). sep is the field separator, eol is the end-of-line separator, and na is the string used for missing values. Use col.names=NA to add a blank column header so the column headers align correctly for spreadsheet input.
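
As a minimal sketch of a write/read round trip (the file name tmp_scores.txt and the small data frame are only illustrative):

tmp <- data.frame(id=1:3, score=c(2.5, 3.7, 1.2))
write.table(tmp, file="tmp_scores.txt", sep="\t", col.names=NA)   # blank header above the row names
read.table("tmp_scores.txt", header=TRUE, sep="\t", row.names=1)  # read the table back in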

Most of the I/O functions have a file argument. This can often be a character string naming a file or a connection. file="" means the standard input or output. Connections can include files, pipes, zipped files, and R variables.

On Windows, the file connection can also be used with description = "clipboard". To read a table copied from Excel, use x <- read.delim("clipboard").

To write a table to the clipboard for Excel, use write.table(x, "clipboard", sep="\t", col.names=NA)

For database interaction, see packages RODBC, DBI, RMySQL, RPgSQL, and ROracle, as well as packages XML, hdf5, netCDF for reading other file formats. We will talk about some of them in later chapters.

Note that an alternative package, rio, handles import/export of multiple data types with simple syntax.

7 Slicing and extracting data

The following table summarizes how to index vectors.

Expression Explanation
x[n] nth element
x[-n] all but the nth element
x[1:n] first n elements
x[-(1:n)] elements from n+1 to the end
x[c(1, 4, 2)] specific elements
x["name"] element named “name”
x[x > 3] all elements greater than 3
x[x > 3 & x < 5] all elements between 3 and 5
x[x %in% c("a", "and", "the")] elements in the given set

Indexing lists is similar to indexing vectors, but some of the symbols are different.

Expression Explanation
x[n] list with n elements
x[[n]] nth element of the list
x[["name"]] element of the list named “name”

Indexing for matrices is a higher dimensional version of indexing vectors.

Expression Explanation
x[i, j] element at row i, column j
x[i, ] row i
x[, j] column j
x[, c(1, 3)] columns 1 and 3
x["name", ] row named “name”

8 Variable conversion

The following functions can be used to convert data types:

as.array(x), as.data.frame(x), as.numeric(x), as.logical(x), as.complex(x), as.character(x), …

Typing methods(as) in the console will generate a complete list of variable conversion functions.
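
For example, a few common coercions:

as.numeric("3.14")              # character to numeric
as.character(10)                # numeric to character
as.logical(c(0, 1, 2))          # 0 is FALSE, non-zero values are TRUE
as.data.frame(matrix(1:4, nrow=2))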

9 Variable information

The following functions test whether each data element is of a specific type:

is.na(x), is.null(x), is.array(x), is.data.frame(x), is.numeric(x), is.complex(x), is.character(x), …

For a complete list, type methods(is) in the R console. The output of these functions is a set of TRUE/FALSE logical values, one per element of the object.

length(x) gives us the number of elements in x.

x<-c(1, 3, 10, 23, 1, 3)
length(x)
## [1] 6

dim(x) retrieves or sets the dimension of an object.

x<-1:12
dim(x)<-c(3, 4)
x
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

dimnames(x) retrieves or sets the dimension names of an object. For higher-dimensional objects like matrices or arrays, we can assign dimnames() a list.

dimnames(x)<-list(c("R1", "R2", "R3"), c("C1", "C2", "C3", "C4"))
x
##    C1 C2 C3 C4
## R1  1  4  7 10
## R2  2  5  8 11
## R3  3  6  9 12

nrow(x) number of rows; ncol(x) number of columns

nrow(x)
## [1] 3
ncol(x)
## [1] 4

class(x) get or set the class of x. Note that we can use unclass(x) to remove the class attribute of x.

class(x)
## [1] "matrix"
class(x)<-"myclass"
x<-unclass(x)
x
##    C1 C2 C3 C4
## R1  1  4  7 10
## R2  2  5  8 11
## R3  3  6  9 12

attr(x, which) get or set the attribute which of x.

attr(x, "class")
## NULL
attr(x, "dim")<-c(2, 6)
x
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    3    5    7    9   11
## [2,]    2    4    6    8   10   12

From the above commands we see that after unclassing x, its class attribute is NULL.

attributes(obj) get or set the list of attributes of object.

attributes(x) <- list(mycomment = "really special", dim = 3:4, 
   dimnames = list(LETTERS[1:3], letters[1:4]), names = paste(1:12))
x
##   a b c  d
## A 1 4 7 10
## B 2 5 8 11
## C 3 6 9 12
## attr(,"mycomment")
## [1] "really special"
## attr(,"names")
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"

10 Data selection and manipulation

In this section, we will introduce some data manipulation functions. In addition, tools from dplyr provide easy dataset manipulation routines.

which.max(x) returns the index of the greatest element of x. which.min(x) returns the index of the smallest element of x. rev(x) reverses the elements of x. Let’s see these three functions first.

x<-c(1, 5, 2, 1, 10, 40, 3)
which.max(x)
## [1] 6
which.min(x)
## [1] 1
rev(x)
## [1]  3 40 10  1  2  5  1

sort(x) sorts the elements of x in increasing order. To sort in decreasing order we can use rev(sort(x)).

sort(x)
## [1]  1  1  2  3  5 10 40
rev(sort(x))
## [1] 40 10  5  3  2  1  1

cut(x, breaks) divides the range of x into intervals and codes the values in x according to the interval they fall into (the result is a factor). breaks is either the number of intervals or a vector of cut points.

x
## [1]  1  5  2  1 10 40  3
cut(x, 3)
## [1] (0.961,14] (0.961,14] (0.961,14] (0.961,14] (0.961,14] (27,40]   
## [7] (0.961,14]
## Levels: (0.961,14] (14,27] (27,40]
cut(x, c(0, 5, 20, 30))
## [1] (0,5]  (0,5]  (0,5]  (0,5]  (5,20] <NA>   (0,5] 
## Levels: (0,5] (5,20] (20,30]

which(x == a) returns a vector of the indices of x for which the comparison is TRUE; for example, it returns i if x[i] == a is true. Thus, the argument of this function (here x == a) must be of mode logical.

x
## [1]  1  5  2  1 10 40  3
which(x==2)
## [1] 3

na.omit(x) suppresses the observations with missing data (NA); it removes the corresponding row if x is a matrix or a data frame. na.fail(x) returns an error message if x contains at least one NA.

df<-data.frame(a=1:5, b=c(1, 3, NA, 9, 8))
df
##   a  b
## 1 1  1
## 2 2  3
## 3 3 NA
## 4 4  9
## 5 5  8
na.omit(df)
##   a b
## 1 1 1
## 2 2 3
## 4 4 9
## 5 5 8

unique(x) if x is a vector or a data frame, returns a similar object with the duplicate elements (or rows) removed.

df1<-data.frame(a=c(1, 1, 7, 6, 8), b=c(1, 1, NA, 9, 8))
df1
##   a  b
## 1 1  1
## 2 1  1
## 3 7 NA
## 4 6  9
## 5 8  8
unique(df1)
##   a  b
## 1 1  1
## 3 7 NA
## 4 6  9
## 5 8  8

table(x) returns a table with the distinct values of x and their frequencies (typically for integers or factors). Also check prop.table().

v<-c(1, 2, 4, 2, 2, 5, 6, 4, 7, 8, 8)
table(v)
## v
## 1 2 4 5 6 7 8 
## 1 3 2 1 1 1 2

subset(x, …) returns a selection of x with respect to criteria ... (typically ... are comparisons like x$V1 < 10). If x is a data frame, the option select= gives the variables to be kept or dropped using a minus sign.

sub<-subset(df1, df1$a>5)
sub
##   a  b
## 3 7 NA
## 4 6  9
## 5 8  8
sub<-subset(df1, select=-a)
sub
##    b
## 1  1
## 2  1
## 3 NA
## 4  9
## 5  8

sample(x, size) randomly selects size elements from the vector x without replacement; the option replace = TRUE allows sampling with replacement.

v
##  [1] 1 2 4 2 2 5 6 4 7 8 8
sample(df1$a, 20, replace = T)
##  [1] 8 7 1 7 1 6 1 6 7 6 1 6 1 1 1 7 1 1 1 1

prop.table(x, margin=) table entries as fraction of marginal table.

prop.table(table(v))
## v
##          1          2          4          5          6          7 
## 0.09090909 0.27272727 0.18181818 0.09090909 0.09090909 0.09090909 
##          8 
## 0.18181818

11 Math Functions

Basic math functions like sin, cos, tan, asin, acos, atan, atan2, log, log10, exp and “set” functions union(x, y), intersect(x, y), setdiff(x, y), setequal(x, y), is.element(el, set) are available in R.

lsf.str("package:base") displays all functions built into a specific R package (here, base).

The following table lists additional functions that you might need when using R for calculations.

Expression Explanation
choose(n, k) computes the combinations of k events among n repetitions. Mathematically it equals to \(\frac{n!}{[(n-k)!k!]}\)
max(x) maximum of the elements of x
min(x) minimum of the elements of x
range(x) minimum and maximum of the elements of x
sum(x) sum of the elements of x
diff(x) lagged and iterated differences of vector x
prod(x) product of the elements of x
mean(x) mean of the elements of x
median(x) median of the elements of x
quantile(x, probs=) sample quantiles corresponding to the given probabilities (defaults to 0, .25, .5, .75, 1)
weighted.mean(x, w) mean of x with weights w
rank(x) ranks of the elements of x
var(x) or cov(x) variance of the elements of x (calculated with denominator n-1). If x is a matrix or a data frame, the variance-covariance matrix is calculated
sd(x) standard deviation of x
cor(x) correlation matrix of x if it is a matrix or a data frame (1 if x is a vector)
var(x, y) or cov(x, y) covariance between x and y, or between the columns of x and those of y if they are matrices or data frames
cor(x, y) linear correlation between x and y, or correlation matrix if they are matrices or data frames
round(x, n) rounds the elements of x to n decimals
log(x, base) computes the logarithm of x with base base
scale(x) if x is a matrix, centers and reduces the data. Without centering use the option center=FALSE. Without scaling use scale=FALSE (by default center=TRUE, scale=TRUE)
pmin(x, y, ...) a vector whose ith element is the minimum of x[i], y[i], . . .
pmax(x, y, ...) a vector whose ith element is the maximum of x[i], y[i], . . .
cumsum(x) a vector whose ith element is the sum from x[1] to x[i]
cumprod(x) id. for the product
cummin(x) id. for the minimum
cummax(x) id. for the maximum
Re(x) real part of a complex number
Im(x) imaginary part of a complex number
Mod(x) modulus. abs(x) is the same
Arg(x) angle in radians of the complex number
Conj(x) complex conjugate
convolve(x, y) computes several kinds of convolutions of two sequences
fft(x) Fast Fourier Transform of an array
mvfft(x) FFT of each column of a matrix
filter(x, filter) applies linear filtering to a univariate time series or to each series separately of a multivariate time series

Note: many math functions have a logical parameter na.rm=FALSE to specify missing data (NA) removal.
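
A few of these functions in action (the numbers are arbitrary):

x <- c(2, 5, 1, 7, 4, NA)
sum(x, na.rm=TRUE); mean(x, na.rm=TRUE); range(x, na.rm=TRUE)
quantile(x, probs=c(0.25, 0.75), na.rm=TRUE)
cumsum(c(2, 5, 1, 7, 4))            # running sums
round(sqrt(c(2, 5, 1, 7, 4)), 2)    # square roots rounded to 2 decimals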

12 Matrix Operations

The following table summarizes basic operation functions. We will discuss this topic in detail in Chapter 4.

Expression Explanation
t(x) transpose
diag(x) diagonal
%*% matrix multiplication
solve(a, b) solves a %*% x = b for x
solve(a) matrix inverse of a
rowsum(x) sum of rows for a matrix-like object. rowSums(x) is a faster version
colsum(x), colSums(x) id. for columns
rowMeans(x) fast version of row means
colMeans(x) id. for columns
mat1 <- cbind(c(1, -1/5), c(-1/3, 1))
mat1.inv <- solve(mat1)

mat1.identity <- mat1.inv %*% mat1
mat1.identity
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
b <- c(1, 2)
x <- solve (mat1, b)
x
## [1] 1.785714 2.357143

13 Advanced Data Processing

In this section, we will introduce some powerful functions that can save a remarkable amount of time.

apply(X, MARGIN, FUN=) a vector, array, or list of values obtained by applying a function FUN to margins (MARGIN=1 for rows, MARGIN=2 for columns) of X.

df1
##   a  b
## 1 1  1
## 2 1  1
## 3 7 NA
## 4 6  9
## 5 8  8
apply(df1, 2, mean, na.rm=T)
##    a    b 
## 4.60 4.75

Note that additional options for FUN (e.g., na.rm=T) can be added after the function name.

lapply(X, FUN) apply FUN to each member of the list X. If X is a data frame then it will apply the FUN to each column and return a list.

lapply(df1, mean, na.rm=T)
## $a
## [1] 4.6
## 
## $b
## [1] 4.75
lapply(list(a=c(1, 23, 5, 6, 1), b=c(9, 90, 999)), median)
## $a
## [1] 5
## 
## $b
## [1] 90

tapply(X, INDEX, FUN=) applies FUN to each cell of the ragged array formed by grouping X by the index INDEX. Note that X is an atomic object, typically a vector.

v
##  [1] 1 2 4 2 2 5 6 4 7 8 8
fac <- factor(rep(1:3, length = 11), levels = 1:3)
table(fac)
## fac
## 1 2 3 
## 4 4 3
tapply(v, fac, sum)
##  1  2  3 
## 17 16 16

by(data, INDEX, FUN) apply FUN to data frame data subsetted by INDEX.

by(df1, df1[, 1], sum)
## df1[, 1]: 1
## [1] 4
## -------------------------------------------------------- 
## df1[, 1]: 6
## [1] 15
## -------------------------------------------------------- 
## df1[, 1]: 7
## [1] NA
## -------------------------------------------------------- 
## df1[, 1]: 8
## [1] 16

The above line of code applies the sum function to df1, using column 1 (a) as the index.

merge(a, b) merge two data frames by common columns or row names. We can use option by= to specify the index column.

df2<-data.frame(a=c(1, 1, 7, 6, 8), c=1:5)
df2
##   a c
## 1 1 1
## 2 1 2
## 3 7 3
## 4 6 4
## 5 8 5
df3<-merge(df1, df2, by="a")
df3
##   a  b c
## 1 1  1 1
## 2 1  1 2
## 3 1  1 1
## 4 1  1 2
## 5 6  9 4
## 6 7 NA 3
## 7 8  8 5

xtabs(a ~ b, data=x) a contingency table from cross-classifying factors.

DF <- as.data.frame(UCBAdmissions)
##  'DF' is a data frame with a grid of the factors and the counts
## in variable 'Freq'.
DF
##       Admit Gender Dept Freq
## 1  Admitted   Male    A  512
## 2  Rejected   Male    A  313
## 3  Admitted Female    A   89
## 4  Rejected Female    A   19
## 5  Admitted   Male    B  353
## 6  Rejected   Male    B  207
## 7  Admitted Female    B   17
## 8  Rejected Female    B    8
## 9  Admitted   Male    C  120
## 10 Rejected   Male    C  205
## 11 Admitted Female    C  202
## 12 Rejected Female    C  391
## 13 Admitted   Male    D  138
## 14 Rejected   Male    D  279
## 15 Admitted Female    D  131
## 16 Rejected Female    D  244
## 17 Admitted   Male    E   53
## 18 Rejected   Male    E  138
## 19 Admitted Female    E   94
## 20 Rejected Female    E  299
## 21 Admitted   Male    F   22
## 22 Rejected   Male    F  351
## 23 Admitted Female    F   24
## 24 Rejected Female    F  317
## Nice for taking margins ...
xtabs(Freq ~ Gender + Admit, DF)
##         Admit
## Gender   Admitted Rejected
##   Male       1198     1493
##   Female      557     1278
## And for testing independence ...
summary(xtabs(Freq ~ ., DF))
## Call: xtabs(formula = Freq ~ ., data = DF)
## Number of cases in table: 4526 
## Number of factors: 3 
## Test for independence of all factors:
##  Chisq = 2000.3, df = 16, p-value = 0

aggregate(x, by, FUN) splits the data frame x into subsets, computes summary statistics for each, and returns the result in a convenient form. by is a list of grouping elements, each of which has the same length as the variables in x.

list(rep(1:3, length=7))
## [[1]]
## [1] 1 2 3 1 2 3 1
aggregate(df3, by=list(rep(1:3, length=7)), sum)
##   Group.1  a  b c
## 1       1 10 10 8
## 2       2  7 10 6
## 3       3  8 NA 4

The above code applies the sum function to the data frame df3 according to the grouping index created by list(rep(1:3, length=7)).

stack(x, …) transforms data available as separate columns in a data frame or list into a single column; unstack(x, ...) is the inverse of stack().

stack(df3)
##    values ind
## 1       1   a
## 2       1   a
## 3       1   a
## 4       1   a
## 5       6   a
## 6       7   a
## 7       8   a
## 8       1   b
## 9       1   b
## 10      1   b
## 11      1   b
## 12      9   b
## 13     NA   b
## 14      8   b
## 15      1   c
## 16      2   c
## 17      1   c
## 18      2   c
## 19      4   c
## 20      3   c
## 21      5   c
unstack(stack(df3))
##   a  b c
## 1 1  1 1
## 2 1  1 2
## 3 1  1 1
## 4 1  1 2
## 5 6  9 4
## 6 7 NA 3
## 7 8  8 5

reshape(x, …) reshapes a data frame between “wide” format with repeated measurements in separate columns of the same record and “long” format with the repeated measurements in separate records. Use direction="wide" or direction="long".

df4 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6), 
                  time = rep(c(1, 1, 2, 2), 3), score = rnorm(12))
wide <- reshape(df4, idvar = c("school", "class"), direction = "wide")
wide
##    school class     score.1    score.2
## 1       1     9 -1.10889176 -1.6538181
## 2       1    10 -0.77658270  0.1731284
## 5       2     9  0.05242239  2.0630974
## 6       2    10 -0.28752610 -1.8112969
## 9       3     9  0.58834665 -0.4316699
## 10      3    10  0.74352294 -1.1964874
long <- reshape(wide, idvar = c("school", "class"), direction = "long")
long
##        school class time     score.1
## 1.9.1       1     9    1 -1.10889176
## 1.10.1      1    10    1 -0.77658270
## 2.9.1       2     9    1  0.05242239
## 2.10.1      2    10    1 -0.28752610
## 3.9.1       3     9    1  0.58834665
## 3.10.1      3    10    1  0.74352294
## 1.9.2       1     9    2 -1.65381814
## 1.10.2      1    10    2  0.17312841
## 2.9.2       2     9    2  2.06309743
## 2.10.2      2    10    2 -1.81129688
## 3.9.2       3     9    2 -0.43166987
## 3.10.2      3    10    2 -1.19648740

Notes

  • The data frame \(x\) passed to this function has to contain longitudinal (repeated-measurement) data.
  • The call to rnorm used to generate df4 produces different results on each run, unless a seed, e.g., set.seed(1234), is used to ensure reproducibility of the random-number generation.

14 Strings

The following functions are useful for handling strings in R.

paste(…) concatenate vectors after converting to character. It has a few options. sep= is the string to separate terms (a single space is the default). collapse= is an optional string to separate “collapsed” results.

a<-"today"
b<-"is a good day"
paste(a, b)
## [1] "today is a good day"
paste(a, b, sep=", ")
## [1] "today, is a good day"

substr(x, start, stop) substrings in a character vector. It can also assign values (with the same length) to part of a string, as substr(x, start, stop) <- value.

a<-"When the going gets tough, the tough get going!"
substr(a, 10, 40)
## [1] "going gets tough, the tough get"
## [1] "going gets tough, the tough get"
substr(a, 1, 9)<-"........."
a
## [1] ".........going gets tough, the tough get going!"

Note that characters at start and stop indexes are inclusive in the output.

strsplit(x, split) splits x according to the substring split. Use fixed=TRUE for non-regular-expression splitting.

strsplit("a.b.c", ".", fixed = TRUE)
## [[1]]
## [1] "a" "b" "c"

grep(pattern, x) searches for matches to pattern within x. It returns a vector of the indices of the elements of x that yielded a match. Use a regular expression for pattern (unless fixed=TRUE). See ?regex for details.

letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
grep("[a-z]", letters)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26

gsub(pattern, replacement, x) replacement of matches determined by regular expression matching. sub() is the same but only replaces the first occurrence.

a<-c("e", 0, "kj", 10, ";")
gsub("[a-z]", "letters", a)
## [1] "letters"        "0"              "lettersletters" "10"            
## [5] ";"
sub("[a-z]", "letters", a)
## [1] "letters"  "0"        "lettersj" "10"       ";"

tolower(x) convert to lowercase. toupper(x) convert to uppercase.

match(x, table) a vector of the positions of first matches for the elements of x among table. x %in% table id. but returns a logical vector.

x<-c(1, 2, 10, 19, 29)
match(x, c(1, 10))
## [1]  1 NA  2 NA NA
x %in% c(1, 10)
## [1]  TRUE FALSE  TRUE FALSE FALSE

pmatch(x, table) partial matches for the elements of x among table.

pmatch("m",   c("mean", "median", "mode")) # returns NA
## [1] NA
pmatch("med", c("mean", "median", "mode")) # returns 2
## [1] 2

The first call returns NA (and, depending on the R version, possibly a warning) because "m" partially matches all three elements, so the match is ambiguous.

nchar(x) number of characters

Dates and Times

The class Date has dates without times. POSIXct() has dates and times, including time zones. Comparisons (e.g. >), seq(), and difftime() are useful. ?DateTimeClasses gives more information. See also package chron.

as.Date(s) and as.POSIXct(s) convert to the respective class; format(dt) converts to a string representation. The default string format is the ISO form, e.g., 2001-02-21. These accept a second argument to specify a format for conversion. Some common formats are:

Formats Explanations
%a, %A Abbreviated and full weekday name.
%b, %B Abbreviated and full month name.
%d Day of the month (01 … 31).
%H Hours (00 … 23).
%I Hours (01 … 12).
%j Day of year (001 … 366).
%m Month (01 … 12).
%M Minute (00 … 59).
%p AM/PM indicator.
%S Second as decimal number (00 … 61).
%U Week (00 … 53); the first Sunday as day 1 of week 1.
%w Weekday (0 … 6, Sunday is 0).
%W Week (00 … 53); the first Monday as day 1 of week 1.
%y Year without century (00 … 99). Don’t use.
%Y Year with century.
%z (output only.) Offset from Greenwich; -0800 is 8 hours west of Greenwich.
%Z (output only.) Time zone as a character string (empty if not available).

Where leading zeros are shown they will be used on output but are optional on input. See ?strftime for details.
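
A minimal sketch of date handling (the dates themselves are arbitrary):

d1 <- as.Date("2001-02-21")                       # default ISO format
d2 <- as.Date("21/02/2002", format="%d/%m/%Y")    # custom input format
format(d2, "%A, %B %d, %Y")                       # formatted output string
difftime(d2, d1, units="days")                    # elapsed time between the two dates
Sys.Date(); Sys.time()                            # current date and date-time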

15 Plotting

This is only an introduction to plotting functions in R. In Chapter 2, we will discuss visualization in more detail.

plot(x) plot of the values of x (on the y-axis) ordered on the x-axis.

plot(x, y) bivariate plot of x (on the x-axis) and y (on the y-axis).

hist(x) histogram of the frequencies of x.

barplot(x) histogram of the values of x. Use horiz=TRUE for horizontal bars.

dotchart(x) if x is a data frame, plots a Cleveland dot plot (stacked plots line-by-line and column-by-column).

pie(x) circular pie-chart.

boxplot(x) ‘box-and-whiskers’ plot.

sunflowerplot(x, y) same as plot(), but points with similar coordinates are drawn as flowers whose number of petals represents the number of points.

stripchart(x) plots the values of x on a line (an alternative to boxplot() for small sample sizes).

coplot(x~y | z) bivariate plot of x and y for each value or interval of values of z.

interaction.plot(f1, f2, y) if f1 and f2 are factors, plots the means of y (on the y-axis) with respect to the values of f1 (on the x-axis) and of f2 (different curves). The option fun allows choosing the summary statistic of y (by default fun=mean).

matplot(x, y) bivariate plot of the first column of x vs. the first one of y, the second one of x vs. the second one of y, etc.

fourfoldplot(x) visualizes, with quarters of circles, the association between two dichotomous variables for different populations (x must be an array with dim=c(2, 2, k), or a matrix with dim=c(2, 2) if k = 1)

assocplot(x) Cohen’s Friendly graph shows the deviations from independence of rows and columns in a two dimensional contingency table.

mosaicplot(x) “mosaic” graph of the residuals from a log-linear regression of a contingency table.

pairs(x) if x is a matrix or a data frame, draws all possible bivariate plots between the columns of x.

plot.ts(x) if x is an object of class “ts”, plot of x with respect to time, x may be multivariate but the series must have the same frequency and dates. Detailed examples are in Chapter 17: Big Longitudinal Data Analysis.

ts.plot(x) id. but if x is multivariate the series may have different dates and must have the same frequency.

qqnorm(x) quantiles of x with respect to the values expected under a normal law.

qqplot(x, y) quantiles of y with respect to the quantiles of x.

contour(x, y, z) contour plot (data are interpolated to draw the curves), x and y must be vectors and z must be a matrix so that dim(z)=c(length(x), length(y)) (x and y may be omitted).

filled.contour(x, y, z) areas between the contours are colored, and a legend of the colors is drawn as well.

image(x, y, z) plotting actual data with colors.

persp(x, y, z) plotting actual data in perspective view.

stars(x) if x is a matrix or a data frame, draws a graph with segments or a star where each row of x is represented by a star and the columns are the lengths of the segments.

symbols(x, y, …) draws, at the coordinates given by x and y, symbols (circles, squares, rectangles, stars, thermometers or “boxplots”) whose sizes, colors, etc. are specified by supplementary arguments.

termplot(mod.obj) plot of the (partial) effects of a regression model (mod.obj).
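
As a quick illustration of a few of these high-level plotting functions (reusing the iris data loaded earlier):

hist(iris$Sepal.Length, main="Sepal Length", xlab="cm")   # histogram of one feature
boxplot(Sepal.Length ~ Species, data=iris)                # box-and-whiskers by group
pairs(iris[, 1:4])                                        # all pairwise scatterplots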

The following parameters are common to many plotting functions:

Parameters Explanations
add=FALSE if TRUE superposes the plot on the previous one (if it exists)
axes=TRUE if FALSE does not draw the axes and the box
type="p" specifies the type of plot, “p”: points, “l”: lines, “b”: points connected by lines, “o”: id. But the lines are over the points, “h”: vertical lines, “s”: steps, the data are represented by the top of the vertical lines, “S”: id. However, the data are represented at the bottom of the vertical lines
xlim=, ylim= specifies the lower and upper limits of the axes, for example with xlim=c(1, 10) or xlim=range(x)
xlab=, ylab= annotates the axes, must be variables of mode character
main= main title, must be a variable of mode character
sub= subtitle (written in a smaller font)

16 QQ Normal probability plot

Let’s look at one simple example - quantile-quantile probability plot. Suppose \(X\sim N(0,1)\) and \(Y\sim Cauchy\) represent the observed/raw and simulated/generated data for one feature (variable) in the data.

X <- rnorm(1000)
Y <- rcauchy(1500)

# compare X to StdNormal distribution
    qqnorm(X, 
                main="Normal Q-Q Plot of the data", 
                xlab="Theoretical Quantiles of the Normal", 
                ylab="Sample Quantiles of the X (Normal) Data")
    qqline(X)

    qqplot(X, Y)

# Y against StdNormal
    qqnorm(Y, 
                main="Normal Q-Q Plot of the data", 
                xlab="Theoretical Quantiles of the Normal", 
                ylab="Sample Quantiles of the Y (Cauchy) Data", ylim= range(-4, 4))
        # Why is the y-range specified here?
    qqline(Y)

# Q-Q plot data (X) vs. simulation(Y)

myQQ <- function(x, y, ...) {
  #rang <- range(x, y, na.rm=T)
  rang <- range(-4, 4, na.rm=T)
  qqplot(x, y, xlim=rang, ylim=rang)
}

myQQ(X, Y) # where the Y is the newly simulated data for X
qqline(X)

# Subsampling

x <- matrix(rnorm(100), ncol = 5)
y <- c(1, seq(19))

z <- cbind(x, y)

z.df <- data.frame(z)
z.df
##             V1          V2         V3          V4           V5  y
## 1   0.80259081  0.23276608  1.1823582  1.14964201 -0.797364417  1
## 2   0.79258979 -0.38458210 -0.5794984 -1.25442139  0.576494073  1
## 3  -0.99077307 -0.59710415 -1.0469606  0.21257211 -1.358009306  2
## 4   1.66937276  0.15160188 -1.3384392 -0.66305946 -0.208514133  3
## 5   0.51036865  0.04917294 -0.3844586 -1.10503697 -0.001084918  4
## 6   0.09483345 -1.11195463 -1.0476316  1.39288862  0.293037761  5
## 7  -0.92827341 -0.69109575  2.1336919  1.38960155 -0.479887537  6
## 8   1.51154992 -1.41946091  1.0667077 -0.27266554 -0.394176599  7
## 9  -0.41364929  0.06367532 -0.8358991  0.29901635  2.027225229  8
## 10  1.90527473  0.44490202  1.2378843 -0.81389845 -0.393088370  9
## 11  0.84390070  1.34613891  0.2201998  0.08014402  0.693217240 10
## 12 -0.27170334 -1.40224412  0.4919226  1.19859235 -0.584282497 11
## 13 -0.85489970 -0.71371883  1.4394329  1.06835809  1.485963949 12
## 14 -0.12816291 -0.36772645  1.0236035  2.28503495 -0.094192793 13
## 15 -1.27376848  1.54523830 -1.3808661 -0.16210170 -0.422171982 14
## 16 -0.89869796  1.27404263  0.7587378  1.19985667  0.454814136 15
## 17 -0.97883311 -0.27380766  0.8973360  0.87545280  1.291004536 16
## 18 -1.87457481 -0.12116843 -0.2864056  0.83092760 -1.131867463 17
## 19 -0.64693554 -1.69156602  0.8707363 -1.25597728 -0.426603886 18
## 20 -0.07343203 -0.88859498 -0.2175640  0.98614632  1.649036730 19
names(z.df)
## [1] "V1" "V2" "V3" "V4" "V5" "y"
# subsetting rows
z.sub <- subset(z.df, y > 2 & (y<10 | V1>0))
z.sub
##             V1          V2         V3          V4           V5  y
## 4   1.66937276  0.15160188 -1.3384392 -0.66305946 -0.208514133  3
## 5   0.51036865  0.04917294 -0.3844586 -1.10503697 -0.001084918  4
## 6   0.09483345 -1.11195463 -1.0476316  1.39288862  0.293037761  5
## 7  -0.92827341 -0.69109575  2.1336919  1.38960155 -0.479887537  6
## 8   1.51154992 -1.41946091  1.0667077 -0.27266554 -0.394176599  7
## 9  -0.41364929  0.06367532 -0.8358991  0.29901635  2.027225229  8
## 10  1.90527473  0.44490202  1.2378843 -0.81389845 -0.393088370  9
## 11  0.84390070  1.34613891  0.2201998  0.08014402  0.693217240 10
z.sub1 <- z.df[z.df$y == 1, ]
z.sub1
##          V1         V2         V3        V4         V5 y
## 1 0.8025908  0.2327661  1.1823582  1.149642 -0.7973644 1
## 2 0.7925898 -0.3845821 -0.5794984 -1.254421  0.5764941 1
z.sub2 <- z.df[z.df$y %in% c(1, 4), ]
z.sub2
##          V1          V2         V3        V4           V5 y
## 1 0.8025908  0.23276608  1.1823582  1.149642 -0.797364417 1
## 2 0.7925898 -0.38458210 -0.5794984 -1.254421  0.576494073 1
## 5 0.5103686  0.04917294 -0.3844586 -1.105037 -0.001084918 4
# subsetting columns
z.sub6 <- z.df[, 1:2]
z.sub6
##             V1          V2
## 1   0.80259081  0.23276608
## 2   0.79258979 -0.38458210
## 3  -0.99077307 -0.59710415
## 4   1.66937276  0.15160188
## 5   0.51036865  0.04917294
## 6   0.09483345 -1.11195463
## 7  -0.92827341 -0.69109575
## 8   1.51154992 -1.41946091
## 9  -0.41364929  0.06367532
## 10  1.90527473  0.44490202
## 11  0.84390070  1.34613891
## 12 -0.27170334 -1.40224412
## 13 -0.85489970 -0.71371883
## 14 -0.12816291 -0.36772645
## 15 -1.27376848  1.54523830
## 16 -0.89869796  1.27404263
## 17 -0.97883311 -0.27380766
## 18 -1.87457481 -0.12116843
## 19 -0.64693554 -1.69156602
## 20 -0.07343203 -0.88859498

17 Low-level plotting commands

points(x, y) adds points (the option type= can be used)

lines(x, y) id. but with lines

text(x, y, labels, …) adds text given by labels at coordinates (x, y). Typical use: plot(x, y, type="n"); text(x, y, names)

mtext(text, side=3, line=0, …) adds text given by text in the margin specified by side (see axis() below); line specifies the line from the plotting area.

segments(x0, y0, x1, y1) draws lines from points (x0, y0) to points (x1, y1)

arrows(x0, y0, x1, y1, angle=30, code=2) id., with an arrowhead at (x1, y1) if code=2, at (x0, y0) if code=1, or at both ends if code=3. angle controls the angle from the shaft of the arrow to the edge of the arrow head.

abline(a, b) draws a line of slope b and intercept a.

abline(h=y) draws a horizontal line at ordinate y.

abline(v=x) draws a vertical line at abscissa x.

abline(lm.obj) draws the regression line given by lm.obj. abline(h=0, col=2) #color (col) is often used

rect(x1, y1, x2, y2) draws a rectangle whose left, right, bottom, and top limits are x1, x2, y1, and y2, respectively.

polygon(x, y) draws a polygon linking the points with coordinates given by x and y.

legend(x, y, legend) adds the legend at the point (x, y) with the symbols given by legend.

title() adds a title and optionally a subtitle.

axis(side, vect) adds an axis at the bottom (side=1), on the left (side=2), at the top (side=3), or on the right (side=4); vect (optional) gives the abscissa (or ordinates) where tick-marks are drawn.

rug(x) draws the data x on the x-axis as small vertical lines.

locator(n, type=“n”, …) returns the coordinates (x, y) after the user has clicked n times on the plot with the mouse; also draws symbols (type="p") or lines (type="l") with respect to optional graphic parameters (…); by default nothing is drawn (type="n").
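A small sketch combining several of these low-level commands on a single plot (the data are arbitrary):

plot(1:10, (1:10)^2, type="n", xlab="x", ylab="y")          # set up an empty plotting region
points(1:10, (1:10)^2, pch=19, col="blue")                  # add the points
lines(1:10, (1:10)^2, lty=2)                                # connect them with a dashed line
abline(h=50, col="red")                                     # horizontal reference line
text(3, 80, "y = x^2")                                      # annotate the plot
legend("topleft", legend="squared values", pch=19, col="blue")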

18 Graphics parameters

These can be set globally with par(…). Many can be passed as parameters to plotting commands.

adj controls text justification (adj=0 left-justified, adj=0.5 centered, adj=1 right-justified).

bg specifies the color of the background (e.g., bg="red", bg="blue"; the list of the 657 available colors is displayed with colors()).

bty controls the type of box drawn around the plot. Allowed values are: “o”, “l”, “7”, “c”, “u” or “]” (the box looks like the corresponding character). If bty="n" the box is not drawn.

cex a value controlling the size of text and symbols with respect to the default. Related parameters control the size of axis annotation (cex.axis), axis labels (cex.lab), the title (cex.main), and the subtitle (cex.sub).

col controls the color of symbols and lines. Use color names: “red”, “blue” see colors() or as “#RRGGBB”; see rgb(), hsv(), gray(), and rainbow(); as for cex there are: col.axis, col.lab, col.main, col.sub.

font an integer which controls the style of text (1: normal, 2: italics, 3: bold, 4: bold italics); as for cex there are: font.axis, font.lab, font.main, font.sub.

las an integer which controls the orientation of the axis labels (0: parallel to the axes, 1: horizontal, 2: perpendicular to the axes, 3: vertical).

lty controls the type of lines; it can be an integer or string (1: “solid”, 2: “dashed”, 3: “dotted”, 4: “dotdash”, 5: “longdash”, 6: “twodash”) or a string of up to eight characters (between “0” and “9”) specifying alternately the length, in points or pixels, of the drawn elements and the blanks; for example, lty="44" has the same effect as lty=2.

lwd a numeric which controls the width of lines, default=1.

mar a vector of 4 numeric values which control the space between the axes and the border of the graph of the form c(bottom, left, top, right), the default values are c(5.1, 4.1, 4.1, 2.1).

mfcol a vector of the form c(nr, nc) which partitions the graphic window as a matrix of nr lines and nc columns, the plots are then drawn in columns.

mfrow id. but the plots are drawn by row.

pch controls the type of symbol, either an integer between 1 and 25, or any single character within “”.

ps an integer which controls the size in points of texts and symbols.

pty a character, which specifies the type of the plotting region, “s”: square, “m”: maximal.

tck a value which specifies the length of tick-marks on the axes as a fraction of the smallest of the width or height of the plot; if tck=1 a grid is drawn.

tcl a value which specifies the length of tick-marks on the axes as a fraction of the height of a line of text (by default tcl=-0.5).

xaxt if xaxt="n" the x-axis is set but not drawn (useful in conjunction with axis(side=1, ...)).

yaxt if yaxt="n" the y-axis is set but not drawn (useful in conjunction with axis(side=2, ...)).

Lattice (Trellis) graphics

Expression Explanation
xyplot(y~x) bivariate plots (with many functionalities).
barchart(y~x) histogram of the values of y with respect to those of x.
dotplot(y~x) Cleveland dot plot (stacked plots line-by-line and column-by-column)
densityplot(~x) density functions plot
histogram(~x) histogram of the frequencies of x
bwplot(y~x) “box-and-whiskers” plot
qqmath(~x) quantiles of x with respect to the values expected under a theoretical distribution
stripplot(y~x) single dimension plot, x must be numeric, y may be a factor
qq(y~x) quantiles to compare two distributions, x must be numeric, y may be numeric, character, or factor but must have two “levels”
splom(~x) matrix of bivariate plots
parallel(~x) parallel coordinates plot
levelplot(z~x*y|g1*g2) colored plot of the values of z at the coordinates given by x and y (x, y and z are all of the same length)
wireframe(z~x*y|g1*g2) 3d surface plot
cloud(z~x*y|g1*g2) 3d scatter plot

In the normal Lattice formula, y~x|g1*g2 has combinations of optional conditioning variables g1 and g2 plotted on separate panels. Lattice functions take many of the same arguments as base graphics plus also data= the data frame for the formula variables and subset= for subsetting. Use panel= to define a custom panel function (see apropos("panel") and ?lines). Lattice functions return an object of class trellis and have to be printed to produce the graph. Use print(xyplot(...)) inside functions where automatic printing doesn’t work. Use lattice.theme and lset to change Lattice defaults.
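
A minimal lattice sketch conditioning on a factor (it reuses the iris data loaded earlier; the formulas are chosen only for illustration):

library(lattice)
# scatterplots of sepal length vs. width, one panel per species, with a regression line
xyplot(Sepal.Length ~ Sepal.Width | Species, data=iris, type=c("p", "r"))
# overlaid density estimates of petal length, grouped by species
densityplot(~ Petal.Length, groups=Species, data=iris, auto.key=TRUE)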

19 Optimization and model fitting

optim(par, fn, method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN")) general-purpose optimization; par holds the initial parameter values and fn is the function to optimize (normally minimized); see the sketch after this list.

nlm(f, p) minimize the function f using a Newton-type algorithm with starting values p.

lm(formula) fit linear models; formula is typically of the form response ~ termA + termB + ...; use I(x*y) + I(x^2) for terms made of nonlinear components.

glm(formula, family=) fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution; family is a description of the error distribution and link function to be used in the model; see ?family.

nls(formula) nonlinear least-squares estimates of the nonlinear model parameters.

approx(x, y=) linearly interpolate given data points; x can be an xy plotting structure.

spline(x, y=) cubic spline interpolation.

loess(formula) (locally weighted scatterplot smoothing) fit a polynomial surface using local fitting.
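
For example, a minimal optim() sketch that minimizes a simple quadratic whose true minimizer is (1, 2):

fq <- function(p) (p[1] - 1)^2 + (p[2] - 2)^2     # objective function to minimize
opt <- optim(par=c(0, 0), fn=fq, method="BFGS")   # start the search at (0, 0)
opt$par      # estimated minimizer, close to c(1, 2)
opt$value    # objective value at the minimum, close to 0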

Many of the formula-based modeling functions have several common arguments:

data= the data frame for the formula variables, subset= a subset of the observations used in the fit, na.action= the action for missing values: "na.fail", "na.omit", or a function.

The following generics often apply to model fitting functions:

predict(fit, ...) predictions from fit based on input data.

df.residual(fit) returns the number of residual degrees of freedom.

coef(fit) returns the estimated coefficients (sometimes with their standard-errors).

residuals(fit) returns the residuals.

deviance(fit) returns the deviance.

fitted(fit) returns the fitted values.

logLik(fit) computes the logarithm of the likelihood and the number of parameters.

AIC(fit) computes the Akaike information criterion (AIC).
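
A minimal sketch using the built-in mtcars data illustrates these generics applied to a fitted linear model:

fit <- lm(mpg ~ wt + hp, data=mtcars)              # miles-per-gallon modeled by weight and horsepower
coef(fit)                                          # estimated coefficients
df.residual(fit)                                   # residual degrees of freedom
head(residuals(fit))                               # first few residuals
AIC(fit)                                           # Akaike information criterion
predict(fit, newdata=data.frame(wt=3, hp=110))     # prediction for a hypothetical car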

20 Statistics

aov(formula) analysis of variance model.

anova(fit, …) analysis of variance (or deviance) tables for one or more fitted model objects.

density(x) kernel density estimates of x.

Other functions include: binom.test(), pairwise.t.test(), power.t.test(), prop.test(), t.test(), … use help.search("test") to see details.
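
For example, a minimal two-sample t-test sketch on simulated data:

set.seed(1)
groupA <- rnorm(30, mean=5.0, sd=1)
groupB <- rnorm(30, mean=5.5, sd=1)
t.test(groupA, groupB)        # Welch two-sample t-test comparing the two group means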

21 Distributions

Random sample generation from different distributions. Remember to call set.seed() first if you want reproducible results during the exercises.

Expression Explanation
rnorm(n, mean=0, sd=1) Gaussian (normal)
rexp(n, rate=1) exponential
rgamma(n, shape, scale=1) gamma
rpois(n, lambda) Poisson
rweibull(n, shape, scale=1) Weibull
rcauchy(n, location=0, scale=1) Cauchy
rbeta(n, shape1, shape2) beta
rt(n, df) Student’s t
rf(n, df1, df2) Fisher’s F
rchisq(n, df) Pearson’s chi-squared
rbinom(n, size, prob) binomial
rgeom(n, prob) geometric
rhyper(nn, m, n, k) hypergeometric
rlogis(n, location=0, scale=1) logistic
rlnorm(n, meanlog=0, sdlog=1) lognormal
rnbinom(n, size, prob) negative binomial
runif(n, min=0, max=1) uniform
rwilcox(nn, m, n), rsignrank(nn, n) Wilcoxon’s statistics

All of these functions can also be used by replacing the leading letter r with d, p or q to get, respectively, the probability density (dfunc(x, ...)), the cumulative distribution function (pfunc(x, ...)), and the quantile function (qfunc(p, ...), with \(0 < p < 1\)).
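
For instance, for the normal distribution the four variants are related as follows:

dnorm(0)         # density at 0, equal to 1/sqrt(2*pi), about 0.3989
pnorm(1.96)      # cumulative probability, about 0.975
qnorm(0.975)     # quantile function (inverse of pnorm), about 1.96
set.seed(20)
rnorm(3)         # three random draws from N(0, 1)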

22 Programming

The standard template for defining our own function is:

function.name <- function(x) {
  expr          # one or more expressions involving x
  return(value)
}

where \(x\) is the parameter (argument) used in the expression. A simple example:

adding<-function(x=0, y=0){z<-x+y
return(z)}
adding(x=5, y=10)
## [1] 15

Conditional statements:

if(cond) {expr}

or

if(cond) cons.expr else alt.expr

x<-10
if(x>10) z="T" else z="F"
z
## [1] "F"

Alternatively, ifelse() provides a vectorized conditional that operates element-wise on whole vectors; this vectorization is one of the main efficiency advantages of R.

For loop:

for(var in seq) expr

x<-c()
for(i in 1:10) x[i]=i
x
##  [1]  1  2  3  4  5  6  7  8  9 10

Other loops:

While loop: while(cond) expr

Repeat: repeat expr

break and next apply to the innermost of nested loops.

Use braces {} around statements.
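
A minimal sketch of these remaining loop constructs:

i <- 1
while (i <= 3) {       # repeat the body while the condition holds
  print(i)
  i <- i + 1
}
j <- 0
repeat {               # loop until break is reached
  j <- j + 1
  if (j >= 3) break
}
j
## [1] 3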

ifelse(test, yes, no) returns a value with the same shape as test, filled with elements selected from either yes or no.

do.call(funname, args) executes a function call from the name of the function and a list of arguments to be passed to it.
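
For example, minimal sketches of the vectorized ifelse() and of do.call():

ifelse(c(-2, 0, 3) > 0, "positive", "non-positive")   # element-wise choice
## [1] "non-positive" "non-positive" "positive"
rows <- list(c(1, 2), c(3, 4), c(5, 6))
do.call(rbind, rows)        # equivalent to rbind(c(1, 2), c(3, 4), c(5, 6)); returns a 3x2 matrix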

23 Data Simulation Primer

Before we demonstrate how to synthetically simulate data that closely resemble the characteristics of real observations from the same process, let’s import some observed data for initial exploratory analytics.

Using the SOCR Health Evaluation and Linkage to Primary Care (HELP) dataset, we can extract some sample data: 00_Tiny_SOCR_HELP_Data_Simmulation.csv.

data_1 <- read.csv("https://umich.instructure.com/files/1628625/download?download_frd=1", as.is=T, header=T)
# data_1 = read.csv(file.choose( ))

attach(data_1)  
# to ensure all variables are accessible within R, e.g., using "age" instead of data_1$age

# i2        maximum number of drinks (standard units) consumed per day (in the past 30 days, range 0-184), see also i1
# treat     randomization group (0=usual care, 1=HELP clinic)
# pcs       SF-36 Physical Component Score (range 14-75)
# mcs       SF-36 Mental Component Score (range 7-62)
# cesd      Center for Epidemiologic Studies Depression scale (range 0-60)
# indtot    Inventory of Drug Use Consequences (InDUC) total score (range 4-45)
# pss_fr    perceived social supports (friends, range 0-14), see also dayslink
# drugrisk  Risk-Assessment Battery (RAB) drug risk score (range 0-21)
# satreat   any BSAS substance abuse treatment at baseline (0=no, 1=yes)
# A few sample rows (columns: ID, i2, age, treat, homeless, pcs, mcs, cesd, indtot, pss_fr, drugrisk, sexrisk, satreat, female, substance, racegrp):
#   1   0  25  0  0  49   7  46  37   0   1   6  0  0  cocaine  black
#   3  39  36  0  0  76   9  33  41  12  19   4  0  0  heroin   black
#   ...
# 100  81  22  0  0  37  17  19  30   3   0  10  0  0  alcohol  other
summary(data_1)
##        ID               i2              age            treat       
##  Min.   :  1.00   Min.   :  0.00   Min.   : 3.00   Min.   :0.0000  
##  1st Qu.: 24.25   1st Qu.:  1.00   1st Qu.:27.00   1st Qu.:0.0000  
##  Median : 50.50   Median : 15.50   Median :34.00   Median :0.0000  
##  Mean   : 50.29   Mean   : 27.08   Mean   :34.31   Mean   :0.1222  
##  3rd Qu.: 74.75   3rd Qu.: 39.00   3rd Qu.:43.00   3rd Qu.:0.0000  
##  Max.   :100.00   Max.   :137.00   Max.   :65.00   Max.   :2.0000  
##     homeless           pcs             mcs             cesd      
##  Min.   :0.0000   Min.   : 6.00   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.:0.0000   1st Qu.:41.25   1st Qu.:20.25   1st Qu.:17.25  
##  Median :0.0000   Median :48.50   Median :29.00   Median :30.00  
##  Mean   :0.1444   Mean   :47.61   Mean   :30.49   Mean   :30.21  
##  3rd Qu.:0.0000   3rd Qu.:57.00   3rd Qu.:39.75   3rd Qu.:43.00  
##  Max.   :1.0000   Max.   :76.00   Max.   :93.00   Max.   :68.00  
##      indtot          pss_fr          drugrisk         sexrisk      
##  Min.   : 0.00   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:31.25   1st Qu.: 2.000   1st Qu.: 0.000   1st Qu.: 1.250  
##  Median :36.00   Median : 6.000   Median : 0.000   Median : 5.000  
##  Mean   :37.03   Mean   : 6.533   Mean   : 2.578   Mean   : 4.922  
##  3rd Qu.:45.00   3rd Qu.:10.000   3rd Qu.: 3.000   3rd Qu.: 7.750  
##  Max.   :60.00   Max.   :20.000   Max.   :23.000   Max.   :13.000  
##     satreat            female         substance           racegrp         
##  Min.   :0.00000   Min.   :0.00000   Length:90          Length:90         
##  1st Qu.:0.00000   1st Qu.:0.00000   Class :character   Class :character  
##  Median :0.00000   Median :0.00000   Mode  :character   Mode  :character  
##  Mean   :0.07778   Mean   :0.05556                                        
##  3rd Qu.:0.00000   3rd Qu.:0.00000                                        
##  Max.   :1.00000   Max.   :1.00000
x.norm <- rnorm(n=200, mean=10, sd=20)
hist(x.norm, main='N(10, 20) Histogram')

mean(data_1$age)
## [1] 34.31111
sd(data_1$age)
## [1] 11.68947

Next, we will simulate new synthetic data to match the properties/characteristics of the observed data (using Uniform, Normal, and Poisson distributions):

# i2 [0: 184]
# age m=34, sd=12
# treat {0, 1}
# homeless {0, 1}
# pcs 14-75
# mcs 7-62
# cesd 0-60
# indtot 4-45
# pss_fr 0-14
# drugrisk 0-21
# sexrisk
# satreat (0=no, 1=yes)
# female (0=no, 1=yes)
# racegrp (black, white, other)

# Demographics variables
# Define number of subjects
NumSubj <- 282
NumTime <- 4

# Define data elements
# Cases
Cases <- c(2, 3, 6, 7, 8, 10, 11, 12, 13, 14, 17, 18, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 33, 34, 35, 37, 41, 42, 43, 44, 45, 53, 55, 58, 60, 62, 67, 69, 71, 72, 74, 79, 80, 85, 87, 90, 95, 97, 99, 100, 101, 106, 107, 109, 112, 120, 123, 125, 128, 129, 132, 134, 136, 139, 142, 147, 149, 153, 158, 160, 162, 163, 167, 172, 174, 178, 179, 180, 182, 192, 195, 201, 208, 211, 215, 217, 223, 227, 228, 233, 235, 236, 240, 245, 248, 250, 251, 254, 257, 259, 261, 264, 268, 269, 272, 273, 275, 279, 288, 289, 291, 296, 298, 303, 305, 309, 314, 318, 324, 325, 326, 328, 331, 332, 333, 334, 336, 338, 339, 341, 344, 346, 347, 350, 353, 354, 359, 361, 363, 364, 366, 367, 368, 369, 370, 371, 372, 374, 375, 376, 377, 378, 381, 382, 384, 385, 386, 387, 389, 390, 393, 395, 398, 400, 410, 421, 423, 428, 433, 435, 443, 447, 449, 450, 451, 453, 454, 455, 456, 457, 458, 459, 460, 461, 465, 466, 467, 470, 471, 472, 476, 477, 478, 479, 480, 481, 483, 484, 485, 486, 487, 488, 489, 492, 493, 494, 496, 498, 501, 504, 507, 510, 513, 515, 528, 530, 533, 537, 538, 542, 545, 546, 549, 555, 557, 559, 560, 566, 572, 573, 576, 582, 586, 590, 592, 597, 603, 604, 611, 619, 621, 623, 624, 625, 631, 633, 634, 635, 637, 640, 641, 643, 644, 645, 646, 647, 648, 649, 650, 652, 654, 656, 658, 660, 664, 665, 670, 673, 677, 678, 679, 680, 682, 683, 686, 687, 688, 689, 690, 692)

# Imaging Biomarkers
L_caudate_ComputeArea <- rpois(NumSubj, 600)
L_caudate_Volume <- rpois(NumSubj, 800)
R_caudate_ComputeArea <- rpois(NumSubj, 893)
R_caudate_Volume <- rpois(NumSubj, 1000)
L_putamen_ComputeArea <- rpois(NumSubj, 900)
L_putamen_Volume <- rpois(NumSubj, 1400)
R_putamen_ComputeArea <- rpois(NumSubj, 1300)
R_putamen_Volume <- rpois(NumSubj, 3000)
L_hippocampus_ComputeArea <- rpois(NumSubj, 1300)
L_hippocampus_Volume <- rpois(NumSubj, 3200)
R_hippocampus_ComputeArea <- rpois(NumSubj, 1500)
R_hippocampus_Volume <- rpois(NumSubj, 3800)
cerebellum_ComputeArea <- rpois(NumSubj, 16700)
cerebellum_Volume <- rpois(NumSubj, 14000)
L_lingual_gyrus_ComputeArea <- rpois(NumSubj, 3300)
L_lingual_gyrus_Volume <- rpois(NumSubj, 11000)
R_lingual_gyrus_ComputeArea <- rpois(NumSubj, 3300)
R_lingual_gyrus_Volume <- rpois(NumSubj, 12000)
L_fusiform_gyrus_ComputeArea <- rpois(NumSubj, 3600)
L_fusiform_gyrus_Volume <- rpois(NumSubj, 11000)
R_fusiform_gyrus_ComputeArea <- rpois(NumSubj, 3300)
R_fusiform_gyrus_Volume <- rpois(NumSubj, 10000)

Sex <- ifelse(runif(NumSubj)<.5, 0, 1)

Weight <- as.integer(rnorm(NumSubj, 80, 10))

Age <- as.integer(rnorm(NumSubj, 62, 10))

# Diagnosis:
Dx <- c(rep("PD", 100), rep("HC", 100), rep("SWEDD", 82))

# Genetics
chr12_rs34637584_GT <- c(ifelse(runif(100)<.3, 0, 1), ifelse(runif(100)<.6, 0, 1), ifelse(runif(82)<.4, 0, 1))                              # NumSubj Bernoulli trials

chr17_rs11868035_GT <- c(ifelse(runif(100)<.7, 0, 1), ifelse(runif(100)<.4, 0, 1), ifelse(runif(82)<.5, 0, 1))                              # NumSubj Bernoulli trials

# Clinical          # rpois(NumSubj, 15) + rpois(NumSubj, 6)
UPDRS_part_I <- c(ifelse(runif(100)<.7, 0, 1) + ifelse(runif(100)<.7, 0, 1),
                  ifelse(runif(100)<.6, 0, 1) + ifelse(runif(100)<.6, 0, 1),
                  ifelse(runif(82)<.4, 0, 1) + ifelse(runif(82)<.4, 0, 1))

UPDRS_part_II <- c(sample.int(20, 100, replace=T), sample.int(14, 100, replace=T),
                   sample.int(18, 82, replace=T))

UPDRS_part_III <- c(sample.int(30, 100, replace=T), sample.int(20, 100, replace=T),
                    sample.int(25, 82, replace=T))

# Time: VisitTime - done automatically below in aggregator

# Data (putting all components together)
sim_PD_Data <- cbind(
        rep(Cases, each= NumTime),                 # Cases
        rep(L_caudate_ComputeArea, each= NumTime), # Imaging
        rep(Sex, each= NumTime),                   # Demographics
        rep(Weight, each= NumTime), 
        rep(Age, each= NumTime), 
        rep(Dx, each= NumTime),                    # Dx
        rep(chr12_rs34637584_GT, each= NumTime),   # Genetics
        rep(chr17_rs11868035_GT, each= NumTime), 
        rep(UPDRS_part_I, each= NumTime),          # Clinical
        rep(UPDRS_part_II, each= NumTime), 
        rep(UPDRS_part_III, each= NumTime), 
        rep(c(0, 6, 12, 18), NumSubj)                # Time
)


# Assign the column names

colnames(sim_PD_Data) <- c(
"Cases", 
"L_caudate_ComputeArea", 
"Sex", "Weight", "Age", 
"Dx", "chr12_rs34637584_GT", "chr17_rs11868035_GT", 
"UPDRS_part_I", "UPDRS_part_II", "UPDRS_part_III", 
"Time"
)

# some QC
summary(sim_PD_Data)
##      Cases      L_caudate_ComputeArea Sex         Weight         Age     
##  10     :   4   595    : 32           0:512   71     : 52   65     : 72  
##  100    :   4   606    : 32           1:616   84     : 52   64     : 60  
##  101    :   4   598    : 28                   87     : 48   58     : 52  
##  106    :   4   602    : 28                   78     : 44   59     : 52  
##  107    :   4   613    : 28                   76     : 40   61     : 52  
##  109    :   4   574    : 24                   82     : 40   62     : 44  
##  (Other):1104   (Other):956                   (Other):852   (Other):796  
##      Dx      chr12_rs34637584_GT chr17_rs11868035_GT UPDRS_part_I
##  HC   :400   0:456               0:640               0:392       
##  PD   :400   1:672               1:488               1:536       
##  SWEDD:328                                           2:200       
##                                                                  
##                                                                  
##                                                                  
##                                                                  
##  UPDRS_part_II UPDRS_part_III Time    
##  4      : 88   17     : 68    0 :282  
##  3      : 80   12     : 64    12:282  
##  1      : 72   13     : 64    18:282  
##  13     : 72   1      : 60    6 :282  
##  6      : 72   7      : 60            
##  2      : 68   6      : 56            
##  (Other):676   (Other):756
dim(sim_PD_Data)
## [1] 1128   12
head(sim_PD_Data)
##      Cases L_caudate_ComputeArea Sex Weight Age  Dx   chr12_rs34637584_GT
## [1,] "2"   "607"                 "0" "71"   "76" "PD" "1"                
## [2,] "2"   "607"                 "0" "71"   "76" "PD" "1"                
## [3,] "2"   "607"                 "0" "71"   "76" "PD" "1"                
## [4,] "2"   "607"                 "0" "71"   "76" "PD" "1"                
## [5,] "3"   "604"                 "0" "85"   "51" "PD" "0"                
## [6,] "3"   "604"                 "0" "85"   "51" "PD" "0"                
##      chr17_rs11868035_GT UPDRS_part_I UPDRS_part_II UPDRS_part_III Time
## [1,] "0"                 "1"          "17"          "14"           "0" 
## [2,] "0"                 "1"          "17"          "14"           "6" 
## [3,] "0"                 "1"          "17"          "14"           "12"
## [4,] "0"                 "1"          "17"          "14"           "18"
## [5,] "0"                 "2"          "16"          "16"           "0" 
## [6,] "0"                 "2"          "16"          "16"           "6"
hist(data_1$age, freq=FALSE, right=FALSE, ylim = c(0,0.05))
lines(density(as.numeric(as.data.frame(sim_PD_Data)$Age)), lwd=2, col="blue")
legend("topright", c("Raw Data", "Simulated Data"), fill=c("black", "blue"))

# Save Results
# Write out (save) the result to a file that can be shared
write.table(sim_PD_Data, "output_data.csv", sep=", ", row.names=FALSE, col.names=TRUE)

24 Appendix

24.2 HTML SOCR Data Import

SOCR Datasets can automatically be downloaded into the R environment using the following protocol, which uses the Parkinson’s Disease dataset as an example:

library(rvest)
## Loading required package: xml2
wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_PD_BiomedBigMetadata")
html_nodes(wiki_url, "#content")
## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body-primary" role="main">\n\t<a id="top ...
pd_data <- html_table(html_nodes(wiki_url, "table")[[1]])
head(pd_data); summary(pd_data)
##   Cases L_caudate_ComputeArea L_caudate_Volume R_caudate_ComputeArea
## 1     2                   597              767                   855
## 2     2                   597              767                   855
## 3     2                   597              767                   855
## 4     2                   597              767                   855
## 5     3                   604              873                   935
## 6     3                   604              873                   935
##   R_caudate_Volume L_putamen_ComputeArea L_putamen_Volume
## 1              968                   842             1357
## 2              968                   842             1357
## 3              968                   842             1357
## 4              968                   842             1357
## 5             1043                   892             1366
## 6             1043                   892             1366
##   R_putamen_ComputeArea R_putamen_Volume L_hippocampus_ComputeArea
## 1                  1285             3052                      1306
## 2                  1285             3052                      1306
## 3                  1285             3052                      1306
## 4                  1285             3052                      1306
## 5                  1305             2920                      1292
## 6                  1305             2920                      1292
##   L_hippocampus_Volume R_hippocampus_ComputeArea R_hippocampus_Volume
## 1                 3238                      1513                 3759
## 2                 3238                      1513                 3759
## 3                 3238                      1513                 3759
## 4                 3238                      1513                 3759
## 5                 3079                      1516                 3827
## 6                 3079                      1516                 3827
##   cerebellum_ComputeArea cerebellum_Volume L_lingual_gyrus_ComputeArea
## 1                  16845             13949                        3268
## 2                  16845             13949                        3268
## 3                  16845             13949                        3268
## 4                  16845             13949                        3268
## 5                  16698             14076                        3243
## 6                  16698             14076                        3243
##   L_lingual_gyrus_Volume R_lingual_gyrus_ComputeArea
## 1                  11130                        3294
## 2                  11130                        3294
## 3                  11130                        3294
## 4                  11130                        3294
## 5                  11033                        3190
## 6                  11033                        3190
##   R_lingual_gyrus_Volume L_fusiform_gyrus_ComputeArea
## 1                  12221                         3625
## 2                  12221                         3625
## 3                  12221                         3625
## 4                  12221                         3625
## 5                  12187                         3631
## 6                  12187                         3631
##   L_fusiform_gyrus_Volume R_fusiform_gyrus_ComputeArea
## 1                   11087                         3232
## 2                   11087                         3232
## 3                   11087                         3232
## 4                   11087                         3232
## 5                   11116                         3302
## 6                   11116                         3302
##   R_fusiform_gyrus_Volume Sex Weight Age Dx chr12_rs34637584_GT
## 1                   10122   1     84  67 PD                   1
## 2                   10122   1     84  67 PD                   1
## 3                   10122   1     84  67 PD                   1
## 4                   10122   1     84  67 PD                   1
## 5                   10162   0     97  39 PD                   1
## 6                   10162   0     97  39 PD                   1
##   chr17_rs11868035_GT UPDRS_part_I UPDRS_part_II UPDRS_part_III Time
## 1                   0            1            12              1    0
## 2                   0            1            12              1    6
## 3                   0            1            12              1   12
## 4                   0            1            12              1   18
## 5                   1            0            19             22    0
## 6                   1            0            19             22    6
##      Cases       L_caudate_ComputeArea L_caudate_Volume
##  Min.   :  2.0   Min.   :525.0         Min.   :719.0   
##  1st Qu.:158.0   1st Qu.:582.0         1st Qu.:784.0   
##  Median :363.5   Median :600.0         Median :800.0   
##  Mean   :346.1   Mean   :600.4         Mean   :800.3   
##  3rd Qu.:504.0   3rd Qu.:619.0         3rd Qu.:819.0   
##  Max.   :692.0   Max.   :667.0         Max.   :890.0   
##  R_caudate_ComputeArea R_caudate_Volume L_putamen_ComputeArea
##  Min.   :795.0         Min.   : 916     Min.   : 815.0       
##  1st Qu.:875.0         1st Qu.: 979     1st Qu.: 879.0       
##  Median :897.0         Median : 998     Median : 897.5       
##  Mean   :894.5         Mean   :1001     Mean   : 898.9       
##  3rd Qu.:916.0         3rd Qu.:1022     3rd Qu.: 919.0       
##  Max.   :977.0         Max.   :1094     Max.   :1003.0       
##  L_putamen_Volume R_putamen_ComputeArea R_putamen_Volume
##  Min.   :1298     Min.   :1198          Min.   :2846    
##  1st Qu.:1376     1st Qu.:1276          1st Qu.:2959    
##  Median :1400     Median :1302          Median :3000    
##  Mean   :1400     Mean   :1300          Mean   :3000    
##  3rd Qu.:1427     3rd Qu.:1321          3rd Qu.:3039    
##  Max.   :1507     Max.   :1392          Max.   :3148    
##  L_hippocampus_ComputeArea L_hippocampus_Volume R_hippocampus_ComputeArea
##  Min.   :1203              Min.   :3036         Min.   :1414             
##  1st Qu.:1277              1st Qu.:3165         1st Qu.:1479             
##  Median :1300              Median :3200         Median :1504             
##  Mean   :1302              Mean   :3198         Mean   :1504             
##  3rd Qu.:1325              3rd Qu.:3228         3rd Qu.:1529             
##  Max.   :1422              Max.   :3381         Max.   :1602             
##  R_hippocampus_Volume cerebellum_ComputeArea cerebellum_Volume
##  Min.   :3634         Min.   :16378          Min.   :13680    
##  1st Qu.:3761         1st Qu.:16617          1st Qu.:13933    
##  Median :3802         Median :16699          Median :13996    
##  Mean   :3799         Mean   :16700          Mean   :14002    
##  3rd Qu.:3833         3rd Qu.:16784          3rd Qu.:14077    
##  Max.   :4013         Max.   :17096          Max.   :14370    
##  L_lingual_gyrus_ComputeArea L_lingual_gyrus_Volume
##  Min.   :3136                Min.   :10709         
##  1st Qu.:3262                1st Qu.:10943         
##  Median :3299                Median :11007         
##  Mean   :3300                Mean   :11010         
##  3rd Qu.:3333                3rd Qu.:11080         
##  Max.   :3469                Max.   :11488         
##  R_lingual_gyrus_ComputeArea R_lingual_gyrus_Volume
##  Min.   :3135                Min.   :11679         
##  1st Qu.:3258                1st Qu.:11935         
##  Median :3294                Median :12001         
##  Mean   :3296                Mean   :12008         
##  3rd Qu.:3338                3rd Qu.:12079         
##  Max.   :3490                Max.   :12324         
##  L_fusiform_gyrus_ComputeArea L_fusiform_gyrus_Volume
##  Min.   :3446                 Min.   :10682          
##  1st Qu.:3554                 1st Qu.:10947          
##  Median :3594                 Median :11016          
##  Mean   :3598                 Mean   :11011          
##  3rd Qu.:3637                 3rd Qu.:11087          
##  Max.   :3763                 Max.   :11394          
##  R_fusiform_gyrus_ComputeArea R_fusiform_gyrus_Volume      Sex        
##  Min.   :3094                 Min.   : 9736           Min.   :0.0000  
##  1st Qu.:3260                 1st Qu.: 9928           1st Qu.:0.0000  
##  Median :3296                 Median : 9994           Median :1.0000  
##  Mean   :3299                 Mean   : 9996           Mean   :0.5851  
##  3rd Qu.:3332                 3rd Qu.:10058           3rd Qu.:1.0000  
##  Max.   :3443                 Max.   :10235           Max.   :1.0000  
##      Weight            Age             Dx            chr12_rs34637584_GT
##  Min.   : 51.00   Min.   :31.00   Length:1128        Min.   :0.000      
##  1st Qu.: 71.00   1st Qu.:54.00   Class :character   1st Qu.:0.000      
##  Median : 78.50   Median :61.00   Mode  :character   Median :1.000      
##  Mean   : 78.45   Mean   :60.64                      Mean   :0.539      
##  3rd Qu.: 84.00   3rd Qu.:68.00                      3rd Qu.:1.000      
##  Max.   :109.00   Max.   :87.00                      Max.   :1.000      
##  chr17_rs11868035_GT  UPDRS_part_I   UPDRS_part_II    UPDRS_part_III 
##  Min.   :0.0000      Min.   :0.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.:0.0000      1st Qu.:0.000   1st Qu.: 5.000   1st Qu.: 6.00  
##  Median :0.0000      Median :1.000   Median : 9.000   Median :13.00  
##  Mean   :0.4184      Mean   :0.773   Mean   : 8.879   Mean   :13.02  
##  3rd Qu.:1.0000      3rd Qu.:1.000   3rd Qu.:13.000   3rd Qu.:18.00  
##  Max.   :1.0000      Max.   :2.000   Max.   :20.000   Max.   :30.00  
##       Time     
##  Min.   : 0.0  
##  1st Qu.: 4.5  
##  Median : 9.0  
##  Mean   : 9.0  
##  3rd Qu.:13.5  
##  Max.   :18.0

Also see the SMHS Simulation Primer.

24.3 R Debugging

Most programs that produce incorrect results are affected by logical errors. When errors (bugs, exceptions) occur, we need to dig deeper; this process of identifying and fixing bugs is called “debugging”.

The main R tools for debugging are traceback(), debug(), browser(), trace() and recover().

traceback(): when an R function fails, the error is reported immediately to the screen. Calling traceback() afterwards shows where the error occurred; it prints the list of functions that were called before the error occurred.

The function calls are printed in reverse order.

f1 <- function(x) { r <- x - g1(x); r }

g1 <- function(y) { r <- y*h1(y); r }

h1 <- function(z) { r <- log(z); if (r < 10) r^2 else r^3 }

f1(-1)
## Warning in log(z): NaNs produced
## Error in if (r < 10) r^2 else r^3: missing value where TRUE/FALSE needed
traceback()
## 3: h1(y)
## 2: g1(x)
## 1: f1(-1)

debug()

traceback() does not tell you where the error is. To find out which line causes the error, we can step through the function using debug().

debug(foo) flags the function foo() for debugging. undebug(foo) unflags the function.

When a function is flagged for debugging, each statement in the function is executed one at a time. After a statement is executed, execution suspends and the user can interact with the R shell.

This allows us to inspect a function line-by-line.

Example: compute the sum of squared errors, SS.

## compute sum of squares   
SS<-function(mu, x) { 
  d<-x-mu; 
  d2<-d^2; 
  ss<-sum(d2);  
  ss }  
set.seed(100);  
x<-rnorm(100); 
SS(1, x)    
## to debug  
debug(SS); SS(1, x)  
## debugging in: SS(1, x)
## debug at <text>#2: {
##     d <- x - mu
##     d2 <- d^2
##     ss <- sum(d2)
##     ss
## }
## debug at <text>#3: d <- x - mu
## debug at <text>#4: d2 <- d^2
## debug at <text>#5: ss <- sum(d2)
## debug at <text>#6: ss
## exiting from: SS(1, x)
## [1] 202.5615

In the debugging shell (“Browse[1]>”), users can:

  • enter n (next) to execute the current line and print the next one;
  • enter c (continue) to execute the rest of the function without stopping;
  • enter Q to quit the debugging session;
  • enter ls() to list all objects in the local environment;
  • enter an object’s name or print(object) to show the current value of that object.

Example:

debug(SS)
SS(1, x)
## debugging in: SS(1, x)
## debug at <text>#2: {
##     d <- x - mu
##     d2 <- d^2
##     ss <- sum(d2)
##     ss
## }

Browse[1]> n             ## execute the current line
debug: d <- x - mu       ## the next command
Browse[1]> ls()          ## current environment
[1] "mu" "x"             ## there is no d yet
Browse[1]> n             ## go one step
debug: d2 <- d^2         ## the next command
Browse[1]> ls()          ## current environment
[1] "d"  "mu" "x"        ## d has been created
Browse[1]> d[1:3]        ## first three elements of d
[1] -1.5021924 -0.8684688 -1.0789171
Browse[1]> hist(d)       ## histogram of d
Browse[1]> where         ## current position in the call stack
where 1: SS(1, x)
Browse[1]> n
debug: ss <- sum(d2)
Browse[1]> Q             ## quit the browser

undebug(SS)              ## remove the debug flag and stop the debugging process
SS(1, x)                 ## calling SS again now runs without debugging

You can also label a function for debugging while debugging another function:

f <- function(x) {
  r <- x - g(x)
  r
}
g <- function(y) {
  r <- y*h(y)
  r
}
h <- function(z) {
  r <- log(z)
  if (r < 10) r^2 else r^3
}

debug(f)            # if you only debug f, you will not step into g
f(-1)
## Warning in log(z): NaNs produced
## Error in if (r < 10) r^2 else r^3: missing value where TRUE/FALSE needed

Browse[1]> n
Browse[1]> n

But we can also label g and h for debugging while we are debugging f:

f(-1)
Browse[1]> n
Browse[1]> debug(g)
Browse[1]> debug(h)
Browse[1]> n

Inserting a call to browser() in a function pauses the execution of the function at the point where browser() is called. This is similar to using debug(), except that you can control exactly where execution gets paused.

24.3.1 Example

h <- function(z) {
  browser()   ## a break point inserted here
  r <- log(z)
  if (r < 10) r^2 else r^3
}

f(-1)
## Error in if (r < 10) r^2 else r^3: missing value where TRUE/FALSE needed

Browse[1]> ls()
Browse[1]> z
Browse[1]> n
Browse[1]> n
Browse[1]> ls()
Browse[1]> c

Calling trace() on a function allows you to insert new code into an existing function without editing its source. The syntax for trace() may take some getting used to.

as.list(body(h))
trace("h", quote(
  if (is.nan(r)) { browser() }), at=3, print=FALSE)
f(1)
f(-1)

trace("h", quote(if (z < 0) { z <- 1 }), at=2, print=FALSE)
f(-1)
untrace("h")

During the debugging process, recover() allows checking the status of variables in higher-level (calling) functions. recover() can also be used as an error handler via options() (e.g., options(error=recover)). When a function throws an exception, execution stops at the point of failure; browsing the function calls and examining their environments may indicate the source of the problem.
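
A minimal sketch of enabling recover() as the global error handler (intended for interactive use):

options(error=recover)    # on any error, open a menu of call frames that can be inspected
# f(-1)                   # in an interactive session this would pause inside the failing call
options(error=NULL)       # restore the default error handling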
