SOCR ≫ DSPA ≫ Topics ≫

In this DSPA Appendix, we will present some common strategies to interact with relational databases.

The Structured Query Language (SQL) is used to communicate with complex databases (DBs)that include large, structured, and relational information. SQL is the de facto standard language for managing, storing, and retrieval of information from relational database management systems. SQL statements may be used to update data elements in a database, retrieve data from a database, or query the meta-data in the system. Many relational database management systems like Oracle, Sybase, Microsoft SQL Server, Access, Ingres, and so on provide SQL application programming interfqaces (API’s). There are also no-SQL database systems, e.g., MongoDB, that provide more flexibility by relaxing the relational requirement of SQL systems. In many situations, relational SQL DBs are impractical for large amounts of heterogeneous, incongruent, and rapidly-changing data that blends structured, semi-structured, unstructured, and polymorphic data elements. We will predominantly focus on the standard SQL commands, such as Select, Insert, Update, Delete, Create, and Drop to interact with external databases. The MongoDB noSQL tutorial provides information on interacting with noSQL DBs. In this module, we mostly focus on data import from SQL Databases.

Let’s start by exploring R-based data-import from SQL databases. First, we need to install and load the RODBC(R Open Database Connectivity) package.

Then, we will open a connection to the SQL server database with Data Source Name (DSN), using Microsoft Access. More details are provided in the Microsoft documentaiton and in the RODBC documentaiton.

1 Connections to SQL Database

1.1 Set a connection

As a demonstration, we will be connecting with user name “genome” to a MySQL database hg19 hosted by genome.ucsc.edu server.

The RODBC library also allows connecting to other Relational Databases such as SQL Server or Database-like platforms such as Dremio. In order to connect to each of these data sources, the specific driver for that data source needs to be downloaded and installed in R.

If connecting to multiple SQL data sources simultaneously within one R project is desired, establish multiple connections in fashion similar to above.

1.2 Managing and closing existing connections

After all needed jobs are done on the database, it is important to disconnect from the database, because the connection isn’t automatically closed and there can be a limit to how many connections can be set up at the same time in R.

Here we will demonstrate how to see all active connections and how to close connections.

## list()
## list()
## [1] TRUE

2 Basic Functions in RODBC

There are several other important functions in RODBC besides dbConnect, dbDisconnect and dbListConnections that can be useful for database users. These functions are demonstrated below.

## [1] 12425
##  [1] "bin"         "matches"     "misMatches"  "repMatches"  "nCount"     
##  [6] "qNumInsert"  "qBaseInsert" "tNumInsert"  "tBaseInsert" "strand"     
## [11] "qName"       "qSize"       "qStart"      "qEnd"        "tName"      
## [16] "tSize"       "tStart"      "tEnd"        "blockCount"  "blockSizes" 
## [21] "qStarts"     "tStarts"
##   bin matches misMatches repMatches nCount qNumInsert qBaseInsert
## 1 585     530          4          0     23          3          41
## 2 585    3355         17          0    109          9          67
## 3 585    4156         14          0     83         16          18
## 4 585    4667          9          0     68         21          42
## 5 585    5180         14          0    167         10          38
## 6 585     468          5          0     14          0           0
##   tNumInsert tBaseInsert strand        qName qSize qStart qEnd tName
## 1          3         898      -  225995_x_at   637      5  603  chr1
## 2          9       11621      -  225035_x_at  3635      0 3548  chr1
## 3          2          93      -  226340_x_at  4318      3 4274  chr1
## 4          3        5743      - 1557034_s_at  4834     48 4834  chr1
## 5          1          29      -    231811_at  5399      0 5399  chr1
## 6          0           0      -    236841_at   487      0  487  chr1
##       tSize tStart  tEnd blockCount
## 1 249250621  14361 15816          5
## 2 249250621  14381 29483         17
## 3 249250621  14399 18745         18
## 4 249250621  14406 24893         23
## 5 249250621  19688 25078         11
## 6 249250621  27542 28029          1
##                                                                   blockSizes
## 1                                                          93,144,229,70,21,
## 2              73,375,71,165,303,360,198,661,201,1,260,250,74,73,98,155,163,
## 3                 690,10,32,33,376,4,5,15,5,11,7,41,277,859,141,51,443,1253,
## 4 99,352,286,24,49,14,6,5,8,149,14,44,98,12,10,355,837,59,8,1500,133,624,58,
## 5                                       131,26,1300,6,4,11,4,7,358,3359,155,
## 6                                                                       487,
##                                                                                                  qStarts
## 1                                                                                    34,132,278,541,611,
## 2                        87,165,540,647,818,1123,1484,1682,2343,2545,2546,2808,3058,3133,3206,3317,3472,
## 3                   44,735,746,779,813,1190,1195,1201,1217,1223,1235,1243,1285,1564,2423,2565,2617,3062,
## 4 0,99,452,739,764,814,829,836,842,851,1001,1016,1061,1160,1173,1184,1540,2381,2441,2450,3951,4103,4728,
## 5                                                     0,132,159,1460,1467,1472,1484,1489,1497,1856,5244,
## 6                                                                                                     0,
##                                                                                                                                      tStarts
## 1                                                                                                             14361,14454,14599,14968,15795,
## 2                                     14381,14454,14969,15075,15240,15543,15903,16104,16853,17054,17232,17492,17914,17988,18267,24736,29320,
## 3                               14399,15089,15099,15131,15164,15540,15544,15549,15564,15569,15580,15587,15628,15906,16857,16998,17049,17492,
## 4 14406,20227,20579,20865,20889,20938,20952,20958,20963,20971,21120,21134,21178,21276,21288,21298,21653,22492,22551,22559,24059,24211,24835,
## 5                                                                         19688,19819,19845,21145,21151,21155,21166,21170,21177,21535,24923,
## 6                                                                                                                                     27542,

The function dbReadTable() is equivalent to the SQL command SELECT * FROM affyU133Plus2.

3 Querying with SQL

After establishing the connection to a SQL data-source, you can then pass SQL commands in string formats to the database. Here we will demonstrate such interactions with several basic SQL commands.

## [1] TRUE
## [1] TRUE

The difference between dbSendQuery and dbGetQuery is the following:

  • dbSendQuery sends query to the database and doesn’t retrieve any data until being fetched, while dbGetQuery retrieves data back from the server and can be written directly into a data structure.
  • dbSendQuery needs to be used with dbClearResult() to close the query activity before any other interaction with the server, while dbGetQuery closes the query activity on the database automatically.

As a result, dbGetQuery is more suitable for running quick queries on the server and retrieving data immediately, while dbSendQuery is more useful when running queries on larger datasets and when doesn’t require immediate result.

Both of these two functions are intended for data users only, so only SELECT statements are intended to be sent through them. If the user has higher access to the database and intend to manipulate data, dbExecute is needed. This is demonstrated (but not executed) below.

4 Fetching Results

When using dbGetQuery, data is automatically retrieved, or “fetched” from the server. However, this is not always intended. If the demand is to query first and fetch at a later time, a combination of dbSendQuery and fetch. Here is an example.

##    genoName
## 1      chr1
## 2      chr1
## 3      chr1
## 4      chr1
## 5      chr1
## 6      chr1
## 7      chr1
## 8      chr1
## 9      chr1
## 10     chr1
## [1] TRUE
## [1] TRUE

Regardless which function is used to query on the database, the results from the query must be fetched into R if analysis with R is intended.

5 Important SQL clauses

Here are a few important SQL commands that can be used when working with databases.

5.1 Basic SELECT

“SELECT (…) FROM (…)” selects all rows (unless further specified) of specific columns of data from a table in the database. It can be used in combination with these following clauses: - “DISTINCT” selects data where replicated rows with same values for all specified columns are removed. - “WHERE” selects data where values at specified columns meet conditions defined after this clause - “AND/OR/NOT” logic operators that allow setting complex requirement for the WHERE clause - “GROUP BY” separates the table into groups by values in a specified column. The number of groups created is equal to the number of distinct values in the column specified. - “AS” renames a column or the results yielded from operations on it to another specified name - “HAVING” works similar to WHERE class in terms of serving the purpose of qualification of values, but HAVING is different in that it allows conditioning on aggregate functions like COUNT() and works best with GROUP BY - “ORDER BY” rearrange the rows by specified orders on values of specified column(s) - “LIMIT” sets the limit to how many rows of data are selected, useful when working with large datasets These clauses are shown below.

##  [1] "bin"        "name"       "chrom"      "strand"     "txStart"   
##  [6] "txEnd"      "cdsStart"   "cdsEnd"     "exonCount"  "exonStarts"
## [11] "exonEnds"
## [1] "name"  "class" "owner"
##                 name      class owner
## 1  PGOHUM00000232565  Ambiguous  Yale
## 2  PGOHUM00000232567  Processed  Yale
## 3  PGOHUM00000232568  Ambiguous  Yale
## 4  PGOHUM00000232569 Duplicated  Yale
## 5  PGOHUM00000232570 Duplicated  Yale
## 6  PGOHUM00000232573  Processed  Yale
## 7  PGOHUM00000232574  Ambiguous  Yale
## 8  PGOHUM00000232575 Duplicated  Yale
## 9  PGOHUM00000232579  Processed  Yale
## 10 PGOHUM00000232580 Duplicated  Yale
##         class
## 1   Ambiguous
## 2   Processed
## 3   Ambiguous
## 4  Duplicated
## 5  Duplicated
## 6   Processed
## 7   Ambiguous
## 8  Duplicated
## 9   Processed
## 10 Duplicated
##        class
## 1  Ambiguous
## 2  Processed
## 3 Duplicated
##                 name     class
## 1  PGOHUM00000232565 Ambiguous
## 2  PGOHUM00000232568 Ambiguous
## 3  PGOHUM00000232574 Ambiguous
## 4  PGOHUM00000232584 Ambiguous
## 5  PGOHUM00000232585 Ambiguous
## 6  PGOHUM00000232590 Ambiguous
## 7  PGOHUM00000232593 Ambiguous
## 8  PGOHUM00000232594 Ambiguous
## 9  PGOHUM00000232597 Ambiguous
## 10 PGOHUM00000232601 Ambiguous
##                 name     class owner
## 1  PGOHUM00000232565 Ambiguous  Yale
## 2  PGOHUM00000232568 Ambiguous  Yale
## 3  PGOHUM00000232574 Ambiguous  Yale
## 4  PGOHUM00000232584 Ambiguous  Yale
## 5  PGOHUM00000232585 Ambiguous  Yale
## 6  PGOHUM00000232590 Ambiguous  Yale
## 7  PGOHUM00000232593 Ambiguous  Yale
## 8  PGOHUM00000232594 Ambiguous  Yale
## 9  PGOHUM00000232597 Ambiguous  Yale
## 10 PGOHUM00000232601 Ambiguous  Yale
##                 name      class owner
## 1  PGOHUM00000232565  Ambiguous  Yale
## 2  PGOHUM00000232567  Processed  Yale
## 3  PGOHUM00000232568  Ambiguous  Yale
## 4  PGOHUM00000232569 Duplicated  Yale
## 5  PGOHUM00000232570 Duplicated  Yale
## 6  PGOHUM00000232573  Processed  Yale
## 7  PGOHUM00000232574  Ambiguous  Yale
## 8  PGOHUM00000232575 Duplicated  Yale
## 9  PGOHUM00000232579  Processed  Yale
## 10 PGOHUM00000232580 Duplicated  Yale
##                 name     class owner
## 1  PGOHUM00000232565 Ambiguous  Yale
## 2  PGOHUM00000232567 Processed  Yale
## 3  PGOHUM00000232568 Ambiguous  Yale
## 4  PGOHUM00000232573 Processed  Yale
## 5  PGOHUM00000232574 Ambiguous  Yale
## 6  PGOHUM00000232579 Processed  Yale
## 7  PGOHUM00000232581 Processed  Yale
## 8  PGOHUM00000232582 Processed  Yale
## 9  PGOHUM00000232584 Ambiguous  Yale
## 10 PGOHUM00000232585 Ambiguous  Yale
##                 name      class owner
## 1  PGOHUM00000232567  Processed  Yale
## 2  PGOHUM00000232569 Duplicated  Yale
## 3  PGOHUM00000232570 Duplicated  Yale
## 4  PGOHUM00000232573  Processed  Yale
## 5  PGOHUM00000232575 Duplicated  Yale
## 6  PGOHUM00000232579  Processed  Yale
## 7  PGOHUM00000232580 Duplicated  Yale
## 8  PGOHUM00000232581  Processed  Yale
## 9  PGOHUM00000232582  Processed  Yale
## 10 PGOHUM00000232583 Duplicated  Yale
##   number_of_names      class
## 1            6409  Ambiguous
## 2            2631 Duplicated
## 3            8836  Processed
##   number_of_names     class
## 1            6409 Ambiguous
## 2            8836 Processed
## [1] TRUE

5.2 SELECT from multiple tables

Let’s now show several clauses that will be useful to work across multiple tables

Working with Multiple Tables: - “JOIN/LEFT JOIN/RIGHT JOIN/INNER JOIN/” JOIN functions combine two tables together where they have one or more columns sharing the same type of data. - “USING” can be an alternative to the ON clause in JOINs. It allows joining with columns of same name while with ON you can specify and join columns with different names. - “UNION” appends one table to another where the columns of the two tables have the same data types, replicates are removed. - “UNION ALL” appends one table to another where the columns of the two tables have the same data types without removing replicates.

##    bin              name chrom strand txStart  txEnd cdsStart cdsEnd
## 1  585 PGOHUM00000244550  chr1      +   12981  13670        0      0
## 2  585 PGOHUM00000251045  chr1      -   16607  18381        0      0
## 3  585 PGOHUM00000244551  chr1      +   52475  53354        0      0
## 4  585 PGOHUM00000244552  chr1      +   62951  63851        0      0
## 5  585 PGOHUM00000244553  chr1      +   65887  65974        0      0
## 6   73 PGOHUM00000243857  chr1      -  120819 133580        0      0
## 7   73 PGOHUM00000243858  chr1      -  131029 131544        0      0
## 8  586 PGOHUM00000244554  chr1      +  131124 135623        0      0
## 9  586 PGOHUM00000251046  chr1      -  135714 136507        0      0
## 10 586 PGOHUM00000244555  chr1      +  135956 136244        0      0
##    exonCount                           exonStarts
## 1          3                   12981,13224,13448,
## 2          6 16607,16854,17229,17601,17914,18265,
## 3          1                               52475,
## 4          1                               62951,
## 5          1                               65887,
## 6          3                120819,129028,133367,
## 7          1                              131029,
## 8          1                              131124,
## 9          1                              135714,
## 10         1                              135956,
##                                exonEnds              name      class owner
## 1                    13056,13377,13670, PGOHUM00000244550 Duplicated  Yale
## 2  16766,17052,17364,17745,18058,18381, PGOHUM00000251045 Duplicated  Yale
## 3                                53354, PGOHUM00000244551  Processed  Yale
## 4                                63851, PGOHUM00000244552  Processed  Yale
## 5                                65974, PGOHUM00000244553  Ambiguous  Yale
## 6                 120930,129232,133580, PGOHUM00000243857 Duplicated  Yale
## 7                               131544, PGOHUM00000243858  Processed  Yale
## 8                               135623, PGOHUM00000244554  Processed  Yale
## 9                               136507, PGOHUM00000251046  Processed  Yale
## 10                              136244, PGOHUM00000244555  Ambiguous  Yale
##                 name bin chrom strand txStart  txEnd cdsStart cdsEnd
## 1  PGOHUM00000244550 585  chr1      +   12981  13670        0      0
## 2  PGOHUM00000251045 585  chr1      -   16607  18381        0      0
## 3  PGOHUM00000244551 585  chr1      +   52475  53354        0      0
## 4  PGOHUM00000244552 585  chr1      +   62951  63851        0      0
## 5  PGOHUM00000244553 585  chr1      +   65887  65974        0      0
## 6  PGOHUM00000243857  73  chr1      -  120819 133580        0      0
## 7  PGOHUM00000243858  73  chr1      -  131029 131544        0      0
## 8  PGOHUM00000244554 586  chr1      +  131124 135623        0      0
## 9  PGOHUM00000251046 586  chr1      -  135714 136507        0      0
## 10 PGOHUM00000244555 586  chr1      +  135956 136244        0      0
##    exonCount                           exonStarts
## 1          3                   12981,13224,13448,
## 2          6 16607,16854,17229,17601,17914,18265,
## 3          1                               52475,
## 4          1                               62951,
## 5          1                               65887,
## 6          3                120819,129028,133367,
## 7          1                              131029,
## 8          1                              131124,
## 9          1                              135714,
## 10         1                              135956,
##                                exonEnds      class owner
## 1                    13056,13377,13670, Duplicated  Yale
## 2  16766,17052,17364,17745,18058,18381, Duplicated  Yale
## 3                                53354,  Processed  Yale
## 4                                63851,  Processed  Yale
## 5                                65974,  Ambiguous  Yale
## 6                 120930,129232,133580, Duplicated  Yale
## 7                               131544,  Processed  Yale
## 8                               135623,  Processed  Yale
## 9                               136507,  Processed  Yale
## 10                              136244,  Ambiguous  Yale
##                 name
## 1  PGOHUM00000232565
## 2  PGOHUM00000232567
## 3  PGOHUM00000232568
## 4  PGOHUM00000232569
## 5  PGOHUM00000232570
## 6  PGOHUM00000232573
## 7  PGOHUM00000232574
## 8  PGOHUM00000232575
## 9  PGOHUM00000232579
## 10 PGOHUM00000232580
##                 name
## 1  PGOHUM00000232565
## 2  PGOHUM00000232567
## 3  PGOHUM00000232568
## 4  PGOHUM00000232569
## 5  PGOHUM00000232570
## 6  PGOHUM00000232573
## 7  PGOHUM00000232574
## 8  PGOHUM00000232575
## 9  PGOHUM00000232579
## 10 PGOHUM00000232580
## 11 PGOHUM00000232565
## 12 PGOHUM00000232567
## 13 PGOHUM00000232568
## 14 PGOHUM00000232569
## 15 PGOHUM00000232570
## 16 PGOHUM00000232573
## 17 PGOHUM00000232574
## 18 PGOHUM00000232575
## 19 PGOHUM00000232579
## 20 PGOHUM00000232580
## [1] TRUE

As demonstrated, the USING clause eliminates the repeated name column, while the ON clause version of JOIN kept both columns from both tables.

The difference between JOIN, LEFT JOIN, RIGHT JOIN, INNER JOIN, and FULL join will be explained later.

Here are a few other multi-table operators that can be useful:

  • “EXCEPT” selects rows in one table that are not present in another table.
    • SELECT columnA FROM Table1 EXCEPT SELECT columnA FROM Table2
  • “INTERSECT” selects rows that are present in both tables.
    • SELECT columnA FROM Table1 INTERSECT SELECT columnA FROM Table2

We will be demonstrating these two operators later with the JOINS.

6 Connecting to Google BigQuery (MIMIC-III Dataset)

RODBC can also connect to PaaS providers such as Google Cloud Platform and use it as a data-source. Here, we will demonstrate how to connect to a MIMIC-III dataset through Google Cloud’s BigQuery platform and query some data.

##  [1] "admissiondrug"             "admissiondx"              
##  [3] "allergy"                   "apacheapsvar"             
##  [5] "apachepatientresult"       "apachepredvar"            
##  [7] "careplancareprovider"      "careplaneol"              
##  [9] "careplangeneral"           "careplangoal"             
## [11] "careplaninfectiousdisease" "customlab"                
## [13] "diagnosis"                 "hospital"                 
## [15] "infusiondrug"              "intakeoutput"             
## [17] "lab"                       "medication"               
## [19] "microlab"                  "note"                     
## [21] "nurseassessment"           "nursecare"                
## [23] "nursecharting"             "pasthistory"              
## [25] "patient"                   "physicalexam"             
## [27] "respiratorycare"           "respiratorycharting"      
## [29] "treatment"                 "vitalaperiodic"           
## [31] "vitalperiodic"
## [1] "infusiondrugid"    "patientunitstayid" "infusionoffset"   
## [4] "drugname"          "drugrate"          "infusionrate"     
## [7] "drugamount"        "volumeoffluid"     "patientweight"
##  [1] "medicationid"       "patientunitstayid"  "drugorderoffset"   
##  [4] "drugstartoffset"    "drugivadmixture"    "drugordercancelled"
##  [7] "drugname"           "drughiclseqno"      "dosage"            
## [10] "routeadmin"         "frequency"          "loadingdose"       
## [13] "prn"                "drugstopoffset"     "gtc"
## # A tibble: 10 x 1
##    drugname  
##    <chr>     
##  1 LR        
##  2 NS        
##  3 D5w       
##  4 NSS       
##  5 TPA       
##  6 0.9% NS   
##  7 Heparin   
##  8 Insulin   
##  9 " (ml/hr)"
## 10 Dopamine
## [1] 64435
## [1] 64435
## [1] 22974
## [1] 22974
## [1] 87409
## [1] 87409

< Here we need some MIMIC-III summaries (e.g., HF), graphs, and basic analytics, e.g., preedict heart failure. >

7 Additional Applications

There are several additional applications of RODBC that may be useful.

7.1 Cross-Data-Source Projects

You can connect to mul