SOCR ≫ DSPA ≫ Topics ≫

In this DSPA Appendix, we will present some common strategies to interact with relational databases.

The Structured Query Language (SQL) is used to communicate with complex databases (DBs)that include large, structured, and relational information. SQL is the de facto standard language for managing, storing, and retrieval of information from relational database management systems. SQL statements may be used to update data elements in a database, retrieve data from a database, or query the meta-data in the system. Many relational database management systems like Oracle, Sybase, Microsoft SQL Server, Access, Ingres, and so on provide SQL application programming interfqaces (API’s). There are also no-SQL database systems, e.g., MongoDB, that provide more flexibility by relaxing the relational requirement of SQL systems. In many situations, relational SQL DBs are impractical for large amounts of heterogeneous, incongruent, and rapidly-changing data that blends structured, semi-structured, unstructured, and polymorphic data elements. We will predominantly focus on the standard SQL commands, such as Select, Insert, Update, Delete, Create, and Drop to interact with external databases. The MongoDB noSQL tutorial provides information on interacting with noSQL DBs. In this module, we mostly focus on data import from SQL Databases.

Let’s start by exploring R-based data-import from SQL databases. First, we need to install and load the RODBC(R Open Database Connectivity) package.

# install.packages("RODBC", repos = "http://cran.us.r-project.org")
library(RODBC)

Then, we will open a connection to the SQL server database with Data Source Name (DSN), using Microsoft Access. More details are provided in the Microsoft documentaiton and in the RODBC documentaiton.

1 Connections to SQL Database

1.1 Set a connection

As a demonstration, we will be connecting with user name “genome” to a MySQL database hg19 hosted by genome.ucsc.edu server.

# install.packages("RMySQL"); 
library(RMySQL)
# install.packages("RODBC"); 
library(RODBC)
# install.packages("DBI"); 
library(DBI)

ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')

The RODBC library also allows connecting to other Relational Databases such as SQL Server or Database-like platforms such as Dremio. In order to connect to each of these data sources, the specific driver for that data source needs to be downloaded and installed in R.

If connecting to multiple SQL data sources simultaneously within one R project is desired, establish multiple connections in fashion similar to above.

1.2 Managing and closing existing connections

After all needed jobs are done on the database, it is important to disconnect from the database, because the connection isn’t automatically closed and there can be a limit to how many connections can be set up at the same time in R.

Here we will demonstrate how to see all active connections and how to close connections.

# install.packages("DBI"); 
library(DBI)
# install.packages("RMySQL"); 
library(RMySQL)
# install.packages("RODBC"); 
library(RODBC)


#display all current connections to MySQL databases
dbListConnections(MySQL())
## list()
#close connection [1] to MySQL database
#dbDisconnect(dbListConnections(dbDriver(drv="MySQL"))[[1]])

#list and close all conenctions to MySQL databses
lapply(dbListConnections(MySQL()), dbDisconnect)
## list()
#setup connection
ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')

#disconnect current session from the database
dbDisconnect(ucscGenomeConn)
## [1] TRUE

2 Basic Functions in RODBC

There are several other important functions in RODBC besides dbConnect, dbDisconnect and dbListConnections that can be useful for database users. These functions are demonstrated below.

# install.packages("RMySQL"); 
library(RMySQL)
# install.packages("RODBC"); 
library(RODBC)
#setup connection
ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')
#Store the names of all the tables in the database into 'allTables' and display the total number of tables in the database
allTables <- dbListTables(ucscGenomeConn); length(allTables)
## [1] 12425
#List the fields in the table "affyU133Plus2"
dbListFields(ucscGenomeConn, "affyU133Plus2")
##  [1] "bin"         "matches"     "misMatches"  "repMatches"  "nCount"     
##  [6] "qNumInsert"  "qBaseInsert" "tNumInsert"  "tBaseInsert" "strand"     
## [11] "qName"       "qSize"       "qStart"      "qEnd"        "tName"      
## [16] "tSize"       "tStart"      "tEnd"        "blockCount"  "blockSizes" 
## [21] "qStarts"     "tStarts"
#Read the table "affyU133Plus2" into "affyData" and display the first few lines
affyData <- dbReadTable(ucscGenomeConn, "affyU133Plus2");head(affyData)
##   bin matches misMatches repMatches nCount qNumInsert qBaseInsert tNumInsert
## 1 585     530          4          0     23          3          41          3
## 2 585    3355         17          0    109          9          67          9
## 3 585    4156         14          0     83         16          18          2
## 4 585    4667          9          0     68         21          42          3
## 5 585    5180         14          0    167         10          38          1
## 6 585     468          5          0     14          0           0          0
##   tBaseInsert strand        qName qSize qStart qEnd tName     tSize tStart
## 1         898      -  225995_x_at   637      5  603  chr1 249250621  14361
## 2       11621      -  225035_x_at  3635      0 3548  chr1 249250621  14381
## 3          93      -  226340_x_at  4318      3 4274  chr1 249250621  14399
## 4        5743      - 1557034_s_at  4834     48 4834  chr1 249250621  14406
## 5          29      -    231811_at  5399      0 5399  chr1 249250621  19688
## 6           0      -    236841_at   487      0  487  chr1 249250621  27542
##    tEnd blockCount
## 1 15816          5
## 2 29483         17
## 3 18745         18
## 4 24893         23
## 5 25078         11
## 6 28029          1
##                                                                   blockSizes
## 1                                                          93,144,229,70,21,
## 2              73,375,71,165,303,360,198,661,201,1,260,250,74,73,98,155,163,
## 3                 690,10,32,33,376,4,5,15,5,11,7,41,277,859,141,51,443,1253,
## 4 99,352,286,24,49,14,6,5,8,149,14,44,98,12,10,355,837,59,8,1500,133,624,58,
## 5                                       131,26,1300,6,4,11,4,7,358,3359,155,
## 6                                                                       487,
##                                                                                                  qStarts
## 1                                                                                    34,132,278,541,611,
## 2                        87,165,540,647,818,1123,1484,1682,2343,2545,2546,2808,3058,3133,3206,3317,3472,
## 3                   44,735,746,779,813,1190,1195,1201,1217,1223,1235,1243,1285,1564,2423,2565,2617,3062,
## 4 0,99,452,739,764,814,829,836,842,851,1001,1016,1061,1160,1173,1184,1540,2381,2441,2450,3951,4103,4728,
## 5                                                     0,132,159,1460,1467,1472,1484,1489,1497,1856,5244,
## 6                                                                                                     0,
##                                                                                                                                      tStarts
## 1                                                                                                             14361,14454,14599,14968,15795,
## 2                                     14381,14454,14969,15075,15240,15543,15903,16104,16853,17054,17232,17492,17914,17988,18267,24736,29320,
## 3                               14399,15089,15099,15131,15164,15540,15544,15549,15564,15569,15580,15587,15628,15906,16857,16998,17049,17492,
## 4 14406,20227,20579,20865,20889,20938,20952,20958,20963,20971,21120,21134,21178,21276,21288,21298,21653,22492,22551,22559,24059,24211,24835,
## 5                                                                         19688,19819,19845,21145,21151,21155,21166,21170,21177,21535,24923,
## 6                                                                                                                                     27542,

The function dbReadTable() is equivalent to the SQL command SELECT * FROM affyU133Plus2.

3 Querying with SQL

After establishing the connection to a SQL data-source, you can then pass SQL commands in string formats to the database. Here we will demonstrate such interactions with several basic SQL commands.

# set up connection
ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')

# select the top 10 rows of genoName column from table rmsk with dbSendQuery and dbClearResult.
rs <- dbSendQuery(ucscGenomeConn,'select genoName from rmsk limit 10')
dbClearResult(rs)
## [1] TRUE
# select the top 10 rows of genoName column from table rmsk with dbGetQuery
df <- dbGetQuery(ucscGenomeConn,'select genoName from rmsk limit 10')

#disconnect from database
dbDisconnect(ucscGenomeConn)
## [1] TRUE

The difference between dbSendQuery and dbGetQuery is the following:

  • dbSendQuery sends query to the database and doesn’t retrieve any data until being fetched, while dbGetQuery retrieves data back from the server and can be written directly into a data structure.
  • dbSendQuery needs to be used with dbClearResult() to close the query activity before any other interaction with the server, while dbGetQuery closes the query activity on the database automatically.

As a result, dbGetQuery is more suitable for running quick queries on the server and retrieving data immediately, while dbSendQuery is more useful when running queries on larger datasets and when doesn’t require immediate result.

Both of these two functions are intended for data users only, so only SELECT statements are intended to be sent through them. If the user has higher access to the database and intend to manipulate data, dbExecute is needed. This is demonstrated (but not executed) below.

# create a new empty table named newTable using dbExecute
#dbExecute(ucscGenomeConn,'create table newTable')

4 Fetching Results

When using dbGetQuery, data is automatically retrieved, or “fetched” from the server. However, this is not always intended. If the demand is to query first and fetch at a later time, a combination of dbSendQuery and fetch. Here is an example.

# set up connection
ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')

# select the top 10 rows of genoName column from table rmsk with deSendQuery and fetch.
rs <- dbSendQuery(ucscGenomeConn,'select genoName from rmsk limit 10')
fetch(rs)
##    genoName
## 1      chr1
## 2      chr1
## 3      chr1
## 4      chr1
## 5      chr1
## 6      chr1
## 7      chr1
## 8      chr1
## 9      chr1
## 10     chr1
dbClearResult(rs)
## [1] TRUE
#disconnect from database
dbDisconnect(ucscGenomeConn)
## [1] TRUE

Regardless which function is used to query on the database, the results from the query must be fetched into R if analysis with R is intended.

5 Important SQL clauses

Here are a few important SQL commands that can be used when working with databases.

5.1 Basic SELECT

“SELECT (…) FROM (…)” selects all rows (unless further specified) of specific columns of data from a table in the database. It can be used in combination with these following clauses: - “DISTINCT” selects data where replicated rows with same values for all specified columns are removed. - “WHERE” selects data where values at specified columns meet conditions defined after this clause - “AND/OR/NOT” logic operators that allow setting complex requirement for the WHERE clause - “GROUP BY” separates the table into groups by values in a specified column. The number of groups created is equal to the number of distinct values in the column specified. - “AS” renames a column or the results yielded from operations on it to another specified name - “HAVING” works similar to WHERE class in terms of serving the purpose of qualification of values, but HAVING is different in that it allows conditioning on aggregate functions like COUNT() and works best with GROUP BY - “ORDER BY” rearrange the rows by specified orders on values of specified column(s) - “LIMIT” sets the limit to how many rows of data are selected, useful when working with large datasets These clauses are shown below.

ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')
#The tables we will be working with: pseudoYale60 and pseudoYale60Class from the hg19 database:
dbListFields(ucscGenomeConn, "pseudoYale60")
##  [1] "bin"        "name"       "chrom"      "strand"     "txStart"   
##  [6] "txEnd"      "cdsStart"   "cdsEnd"     "exonCount"  "exonStarts"
## [11] "exonEnds"
dbListFields(ucscGenomeConn, "pseudoYale60Class")
## [1] "name"  "class" "owner"
# select the top 10 rows of all columns from table pseudoYale60Class.
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class limit 10')
##                 name      class owner
## 1  PGOHUM00000232565  Ambiguous  Yale
## 2  PGOHUM00000232567  Processed  Yale
## 3  PGOHUM00000232568  Ambiguous  Yale
## 4  PGOHUM00000232569 Duplicated  Yale
## 5  PGOHUM00000232570 Duplicated  Yale
## 6  PGOHUM00000232573  Processed  Yale
## 7  PGOHUM00000232574  Ambiguous  Yale
## 8  PGOHUM00000232575 Duplicated  Yale
## 9  PGOHUM00000232579  Processed  Yale
## 10 PGOHUM00000232580 Duplicated  Yale
# select the top 10 rows of class column from table pseudoYale60Class.
dbGetQuery(ucscGenomeConn,'select class from pseudoYale60Class limit 10')
##         class
## 1   Ambiguous
## 2   Processed
## 3   Ambiguous
## 4  Duplicated
## 5  Duplicated
## 6   Processed
## 7   Ambiguous
## 8  Duplicated
## 9   Processed
## 10 Duplicated
# select the top 10 distinct rows of class column from table pseudoYale60Class.
dbGetQuery(ucscGenomeConn,'select distinct class from pseudoYale60Class limit 10')
##        class
## 1  Ambiguous
## 2  Processed
## 3 Duplicated
# select the top 10 rows of name and class columns from table pseudoYale60Class where class is 'Ambiguous'.
dbGetQuery(ucscGenomeConn,'select name, class from pseudoYale60Class WHERE class = \'Ambiguous\' limit 10')
##                 name     class
## 1  PGOHUM00000232565 Ambiguous
## 2  PGOHUM00000232568 Ambiguous
## 3  PGOHUM00000232574 Ambiguous
## 4  PGOHUM00000232584 Ambiguous
## 5  PGOHUM00000232585 Ambiguous
## 6  PGOHUM00000232590 Ambiguous
## 7  PGOHUM00000232593 Ambiguous
## 8  PGOHUM00000232594 Ambiguous
## 9  PGOHUM00000232597 Ambiguous
## 10 PGOHUM00000232601 Ambiguous
# select the top 10 rows of all columns from table pseudoYale60Class where class is 'Ambiguous' and owner is 'Yale'.
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class WHERE class = \'Ambiguous\' AND owner = \'Yale\' limit 10')
##                 name     class owner
## 1  PGOHUM00000232565 Ambiguous  Yale
## 2  PGOHUM00000232568 Ambiguous  Yale
## 3  PGOHUM00000232574 Ambiguous  Yale
## 4  PGOHUM00000232584 Ambiguous  Yale
## 5  PGOHUM00000232585 Ambiguous  Yale
## 6  PGOHUM00000232590 Ambiguous  Yale
## 7  PGOHUM00000232593 Ambiguous  Yale
## 8  PGOHUM00000232594 Ambiguous  Yale
## 9  PGOHUM00000232597 Ambiguous  Yale
## 10 PGOHUM00000232601 Ambiguous  Yale
# select the top 10 rows of all columns from table pseudoYale60Class where class is 'Ambiguous' or owner is 'Yale'.
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class WHERE class = \'Ambiguous\' OR owner = \'Yale\' limit 10')
##                 name      class owner
## 1  PGOHUM00000232565  Ambiguous  Yale
## 2  PGOHUM00000232567  Processed  Yale
## 3  PGOHUM00000232568  Ambiguous  Yale
## 4  PGOHUM00000232569 Duplicated  Yale
## 5  PGOHUM00000232570 Duplicated  Yale
## 6  PGOHUM00000232573  Processed  Yale
## 7  PGOHUM00000232574  Ambiguous  Yale
## 8  PGOHUM00000232575 Duplicated  Yale
## 9  PGOHUM00000232579  Processed  Yale
## 10 PGOHUM00000232580 Duplicated  Yale
# select the top 10 rows of all columns from table pseudoYale60Class where class is \'Ambiguous\' or \'Processed\'.
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class WHERE class = \'Ambiguous\' OR class = \'Processed\' limit 10')
##                 name     class owner
## 1  PGOHUM00000232565 Ambiguous  Yale
## 2  PGOHUM00000232567 Processed  Yale
## 3  PGOHUM00000232568 Ambiguous  Yale
## 4  PGOHUM00000232573 Processed  Yale
## 5  PGOHUM00000232574 Ambiguous  Yale
## 6  PGOHUM00000232579 Processed  Yale
## 7  PGOHUM00000232581 Processed  Yale
## 8  PGOHUM00000232582 Processed  Yale
## 9  PGOHUM00000232584 Ambiguous  Yale
## 10 PGOHUM00000232585 Ambiguous  Yale
# select the top 10 rows of all columns from table pseudoYale60Class where class is not \'Ambiguous\'.
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class WHERE NOT class = \'Ambiguous\' limit 10')
##                 name      class owner
## 1  PGOHUM00000232567  Processed  Yale
## 2  PGOHUM00000232569 Duplicated  Yale
## 3  PGOHUM00000232570 Duplicated  Yale
## 4  PGOHUM00000232573  Processed  Yale
## 5  PGOHUM00000232575 Duplicated  Yale
## 6  PGOHUM00000232579  Processed  Yale
## 7  PGOHUM00000232580 Duplicated  Yale
## 8  PGOHUM00000232581  Processed  Yale
## 9  PGOHUM00000232582  Processed  Yale
## 10 PGOHUM00000232583 Duplicated  Yale
# select class and how many names in the class from pseudoYale60Class
dbGetQuery(ucscGenomeConn,'select count(name) as number_of_names, class from pseudoYale60Class GROUP BY class')
##   number_of_names      class
## 1            6409  Ambiguous
## 2            2631 Duplicated
## 3            8836  Processed
# select class and how many names in the class from pseudoYale60Class if the number of names in the class is greater than 4000
dbGetQuery(ucscGenomeConn,'select count(name) as number_of_names, class from pseudoYale60Class group by class having count(name) > 4000')
##   number_of_names     class
## 1            6409 Ambiguous
## 2            8836 Processed
#disconnect from database
dbDisconnect(ucscGenomeConn)
## [1] TRUE

5.2 SELECT from multiple tables

Let’s now show several clauses that will be useful to work across multiple tables

Working with Multiple Tables: - “JOIN/LEFT JOIN/RIGHT JOIN/INNER JOIN/” JOIN functions combine two tables together where they have one or more columns sharing the same type of data. - “USING” can be an alternative to the ON clause in JOINs. It allows joining with columns of same name while with ON you can specify and join columns with different names. - “UNION” appends one table to another where the columns of the two tables have the same data types, replicates are removed. - “UNION ALL” appends one table to another where the columns of the two tables have the same data types without removing replicates.

ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')
#The tables we will be working with: pseudoYale60 and pseudoYale60Class from the hg19 database

#joins the pseudoYale60 table with pseudoYale60Class table, where each row is matched on the name column
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60 join pseudoYale60Class on pseudoYale60.name = pseudoYale60Class.name limit 10')
##    bin              name chrom strand txStart  txEnd cdsStart cdsEnd exonCount
## 1  585 PGOHUM00000244550  chr1      +   12981  13670        0      0         3
## 2  585 PGOHUM00000251045  chr1      -   16607  18381        0      0         6
## 3  585 PGOHUM00000244551  chr1      +   52475  53354        0      0         1
## 4  585 PGOHUM00000244552  chr1      +   62951  63851        0      0         1
## 5  585 PGOHUM00000244553  chr1      +   65887  65974        0      0         1
## 6   73 PGOHUM00000243857  chr1      -  120819 133580        0      0         3
## 7   73 PGOHUM00000243858  chr1      -  131029 131544        0      0         1
## 8  586 PGOHUM00000244554  chr1      +  131124 135623        0      0         1
## 9  586 PGOHUM00000251046  chr1      -  135714 136507        0      0         1
## 10 586 PGOHUM00000244555  chr1      +  135956 136244        0      0         1
##                              exonStarts                             exonEnds
## 1                    12981,13224,13448,                   13056,13377,13670,
## 2  16607,16854,17229,17601,17914,18265, 16766,17052,17364,17745,18058,18381,
## 3                                52475,                               53354,
## 4                                62951,                               63851,
## 5                                65887,                               65974,
## 6                 120819,129028,133367,                120930,129232,133580,
## 7                               131029,                              131544,
## 8                               131124,                              135623,
## 9                               135714,                              136507,
## 10                              135956,                              136244,
##                 name      class owner
## 1  PGOHUM00000244550 Duplicated  Yale
## 2  PGOHUM00000251045 Duplicated  Yale
## 3  PGOHUM00000244551  Processed  Yale
## 4  PGOHUM00000244552  Processed  Yale
## 5  PGOHUM00000244553  Ambiguous  Yale
## 6  PGOHUM00000243857 Duplicated  Yale
## 7  PGOHUM00000243858  Processed  Yale
## 8  PGOHUM00000244554  Processed  Yale
## 9  PGOHUM00000251046  Processed  Yale
## 10 PGOHUM00000244555  Ambiguous  Yale
#alternatively using USING Clause
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60 join pseudoYale60Class USING (name) limit 10')
##                 name bin chrom strand txStart  txEnd cdsStart cdsEnd exonCount
## 1  PGOHUM00000244550 585  chr1      +   12981  13670        0      0         3
## 2  PGOHUM00000251045 585  chr1      -   16607  18381        0      0         6
## 3  PGOHUM00000244551 585  chr1      +   52475  53354        0      0         1
## 4  PGOHUM00000244552 585  chr1      +   62951  63851        0      0         1
## 5  PGOHUM00000244553 585  chr1      +   65887  65974        0      0         1
## 6  PGOHUM00000243857  73  chr1      -  120819 133580        0      0         3
## 7  PGOHUM00000243858  73  chr1      -  131029 131544        0      0         1
## 8  PGOHUM00000244554 586  chr1      +  131124 135623        0      0         1
## 9  PGOHUM00000251046 586  chr1      -  135714 136507        0      0         1
## 10 PGOHUM00000244555 586  chr1      +  135956 136244        0      0         1
##                              exonStarts                             exonEnds
## 1                    12981,13224,13448,                   13056,13377,13670,
## 2  16607,16854,17229,17601,17914,18265, 16766,17052,17364,17745,18058,18381,
## 3                                52475,                               53354,
## 4                                62951,                               63851,
## 5                                65887,                               65974,
## 6                 120819,129028,133367,                120930,129232,133580,
## 7                               131029,                              131544,
## 8                               131124,                              135623,
## 9                               135714,                              136507,
## 10                              135956,                              136244,
##         class owner
## 1  Duplicated  Yale
## 2  Duplicated  Yale
## 3   Processed  Yale
## 4   Processed  Yale
## 5   Ambiguous  Yale
## 6  Duplicated  Yale
## 7   Processed  Yale
## 8   Processed  Yale
## 9   Processed  Yale
## 10  Ambiguous  Yale
#Append the first 10 names from pseudoYale60Class to the first 10 names of pseudoYale60, and the replicated rows are removed
dbGetQuery(ucscGenomeConn,'(select name from pseudoYale60 limit 10) union (select name from pseudoYale60Class limit 10)')
##                 name
## 1  PGOHUM00000232565
## 2  PGOHUM00000232567
## 3  PGOHUM00000232568
## 4  PGOHUM00000232569
## 5  PGOHUM00000232570
## 6  PGOHUM00000232573
## 7  PGOHUM00000232574
## 8  PGOHUM00000232575
## 9  PGOHUM00000232579
## 10 PGOHUM00000232580
#Append the first 10 names from pseudoYale60Class to the first 10 names of pseudoYale60
dbGetQuery(ucscGenomeConn,'(select name from pseudoYale60 limit 10) union all (select name from pseudoYale60Class limit 10)')
##                 name
## 1  PGOHUM00000232565
## 2  PGOHUM00000232567
## 3  PGOHUM00000232568
## 4  PGOHUM00000232569
## 5  PGOHUM00000232570
## 6  PGOHUM00000232573
## 7  PGOHUM00000232574
## 8  PGOHUM00000232575
## 9  PGOHUM00000232579
## 10 PGOHUM00000232580
## 11 PGOHUM00000232565
## 12 PGOHUM00000232567
## 13 PGOHUM00000232568
## 14 PGOHUM00000232569
## 15 PGOHUM00000232570
## 16 PGOHUM00000232573
## 17 PGOHUM00000232574
## 18 PGOHUM00000232575
## 19 PGOHUM00000232579
## 20 PGOHUM00000232580
#disconnect from database
dbDisconnect(ucscGenomeConn)
## [1] TRUE

As demonstrated, the USING clause eliminates the repeated name column, while the ON clause version of JOIN kept both columns from both tables.

The difference between JOIN, LEFT JOIN, RIGHT JOIN, INNER JOIN, and FULL join will be explained later.

Here are a few other multi-table operators that can be useful:

  • “EXCEPT” selects rows in one table that are not present in another table.
    • SELECT columnA FROM Table1 EXCEPT SELECT columnA FROM Table2
  • “INTERSECT” selects rows that are present in both tables.
    • SELECT columnA FROM Table1 INTERSECT SELECT columnA FROM Table2

We will be demonstrating these two operators later with the JOINS.

6 Connecting to Google BigQuery

RODBC can also connect to PaaS providers such as Google Cloud Platform and use it as a data-source. Utilizing it will allow faster query executions and minimize usage of local memory.

6.2 Billing in Google BigQuery

With smaller scale querying operations, BigQuery is free for public use, but it charges fee for heavy usages. That’s why it’s important to specify billing information when querying.

Fees for BigQuery are billed on a per-project basis, and when supplying billing information, just provide the project that should be billed for. When you are not an eligible member of the project with the permission to bill the project, billing that project will be declined.

For many public libraries and projects, they make their data publicly available, but they don’t allow the public to bill these projects. That means, users who want to query data from a public library will need to additionally provide another project that Google can bill.

6.3 Case Study I: Open Library in BigQuery

Here we will connect to a sample dataset in Google Cloud Platform’s list of open libraries and query from it.

6.3.1 Set up BigQuery on GCP

In order to use BigQuery functionalities, a Google Cloud Platform account is needed

  1. Go to Google Cloud Platform and register a GCP account with your email. New members to GCP are given 12-months of free trials and $300 credits.
  2. HS 650 DSPA Studests will recieve $50 GCP credit (from the Instructor)
  3. In Google Cloud Platform, start a new project.
  4. After starting a new project, on GCP, go to IAM & Admin tab.
  5. On this tab, you will see a list of members of the project. You can add new members and assign each member specific role(s).
  6. Edit and make sure all members (including yourself) who need to access BigQuery functionlities from R are assigned the role of BigQuery Admin or similar ones that have the permission to create jobs remotely on BigQuery.
  7. After assigning roles and permissions, go to GCP BigQuery User Interface

This is the primary interface for setting up data sources. We will come to this tab only when needed to add new data sources and datasets and the remaining of the time we only need R to access data and query from GCP.

6.3.2 Connecting to Dataset

We will now demonstrate connecting to your project and the bigquery-public-data.samples.shakespeare.

#install.packages("bigrquery")
library(DBI)
library(RODBC)
library(bigrquery)
#set 'billing' to the project that will be billed

#Connect to the project
con <- dbConnect(bigquery(),
                 project = "bigquery-public-data",
                 billing = billing
)

#list first few lines of shakespeare table
sql <- "SELECT * FROM [bigquery-public-data.samples.shakespeare] LIMIT 5"
result <- query_exec(sql, project=billing)
result
##      word word_count  corpus corpus_date
## 1    LVII          1 sonnets           0
## 2  augurs          1 sonnets           0
## 3  dimm'd          1 sonnets           0
## 4 plagues          1 sonnets           0
## 5 treason          1 sonnets           0
#disconnect from Google BigQuery
dbDisconnect(con)

6.4 Case Study II: MIMICIII Intro

Here, we will demonstrate how to connect to a MIMIC-III dataset through Google Cloud’s BigQuery platform and query some data.

6.4.1 Setting Up MIMICIII on BigQuery

Similar to in Case Study I, MIMICIII is a project available on BigQuery so we will need to pin this project to the BigQuery interface. However, since this project is not in the public library, so we will need to gain access to this project first then pin it.

6.4.1.1 Getting Access to MIMICIII

The access to MIMICIII is granted on a per-email basis. Follow these steps to gain access to MIMICIII for the email address registered for the project owner:

  1. MIMICIII data is intially stored on PhysioNet, and this is also the cite for gaining access to MIMICIII.
  2. Go to PhysioNet Registration, where you will then register account with the email address that you desire to gain access for.
  3. Confrim the registeration after receiving an email, then log back into PhysioNet.
  4. Go the page for MIMICIII Data, scroll to the bottom and click on credentialed user.
  5. In the Credentialing tab of the page, click on “apply for access”.
  6. You will then be directed to the application page for accessing MIMICIII.
  7. Complete the application, which involves completing an online course: CITI Program in “Data or Specimens Only Research”, and listing your faculty supervisor as a reference.

6.4.1.2 Adding MIMICIII to BigQuery

After gaining access to MIMICIII data (it may take up to 2 weeks to process application), follow these steps to add the MIMICIII project to BigQuery:

  1. Go to GCP BigQuery User Interface
  2. On the Resources Tab, click on A