What is Dolt?

Dolt is ‘git for data’, a SQL database that supports common version control operations (clone, merge, etc.). This document is adapted from the dolt documentaion.

Setting the scene

We start by cloning an example dolt data base …

system("/usr/local/bin/dolt clone dolthub/ip-to-country")

… and serving it on a local server

system("(cd ip-to-country; /usr/local/bin/dolt sql-server &)")

Now we are ready to interact with the data base using R.

Querying a local Dolt DB

Since dolt is (also) an SQL data base, we can connect a local dolt data base to R using the MySQL connector.

con <- DBI::dbConnect(
    RMySQL::MySQL(),
    dbname = 'ip_to_country',
    host = '127.0.0.1',
    port = 3306,
    user = 'root'
)

Let’s see what we have in store:

DBI::dbListTables(con)
## [1] "IPv4ToCountry" "IPv6ToCountry"

We could now continue using the DBI package, but even better, this means that that dbplyr is supported out of the box (at least or reading data)! This allows us to use standard dplyr verbs to construct a SQL query against the data base, e.g.

library(dbplyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:dbplyr':
## 
##     ident, sql
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
query <- dplyr::tbl(con, "IPv6ToCountry") %>% 
    filter(
        Registry == "iana",
        CountryCode2Letter  == "ZZ",
        AssignedDate == 838857600
    )

show_query(query)
## <SQL>
## SELECT *
## FROM `IPv6ToCountry`
## WHERE ((`Registry` = 'iana') AND (`CountryCode2Letter` = 'ZZ') AND (`AssignedDate` = 838857600.0))

The usual lazy behavior of remote queries works

query
## # Source:   lazy query [?? x 4]
## # Database: mysql 5.7.9-Vitess [@127.0.0.1:/ip_to_country]
##    IPRange                               CountryCode2Lett… Registry AssignedDate
##    <chr>                                 <chr>             <chr>           <dbl>
##  1 1000::-1fff:ffff:ffff:ffff:ffff:ffff… ZZ                iana        838857600
##  2 100::-1ff:ffff:ffff:ffff:ffff:ffff:f… ZZ                iana        838857600
##  3 2000::-2000:ffff:ffff:ffff:ffff:ffff… ZZ                iana        838857600
##  4 2001:1000::-2001:11ff:ffff:ffff:ffff… ZZ                iana        838857600
##  5 2001:4e00::-2001:4fff:ffff:ffff:ffff… ZZ                iana        838857600
##  6 2001:6000::-2001:7fff:ffff:ffff:ffff… ZZ                iana        838857600
##  7 2001:c000::-2001:ffff:ffff:ffff:ffff… ZZ                iana        838857600
##  8 2003:4000::-2003:7fff:ffff:ffff:ffff… ZZ                iana        838857600
##  9 2003:8000::-2003:ffff:ffff:ffff:ffff… ZZ                iana        838857600
## 10 2004::-2007:ffff:ffff:ffff:ffff:ffff… ZZ                iana        838857600
## # … with more rows

… just as explicit collection

collect(query)
## # A tibble: 80 x 4
##    IPRange                               CountryCode2Lett… Registry AssignedDate
##    <chr>                                 <chr>             <chr>           <dbl>
##  1 1000::-1fff:ffff:ffff:ffff:ffff:ffff… ZZ                iana        838857600
##  2 100::-1ff:ffff:ffff:ffff:ffff:ffff:f… ZZ                iana        838857600
##  3 2000::-2000:ffff:ffff:ffff:ffff:ffff… ZZ                iana        838857600
##  4 2001:1000::-2001:11ff:ffff:ffff:ffff… ZZ                iana        838857600
##  5 2001:4e00::-2001:4fff:ffff:ffff:ffff… ZZ                iana        838857600
##  6 2001:6000::-2001:7fff:ffff:ffff:ffff… ZZ                iana        838857600
##  7 2001:c000::-2001:ffff:ffff:ffff:ffff… ZZ                iana        838857600
##  8 2003:4000::-2003:7fff:ffff:ffff:ffff… ZZ                iana        838857600
##  9 2003:8000::-2003:ffff:ffff:ffff:ffff… ZZ                iana        838857600
## 10 2004::-2007:ffff:ffff:ffff:ffff:ffff… ZZ                iana        838857600
## # … with 70 more rows

Of course, any remotely hosted dolt DB can be queried just as above by defining the correct connection.

Querying a DoltHub DB

The above approach requires a running dolt server and cloning the DB of interest. A more direct route is the DoltHub API which is still in alpha. Unfortunately, we cannot specify a DoltHub repository as valid DBI connection the SQL query thus has to be entered manually. Since we already constructed the query earlier, let’s just reuse it.

str_sql <- query %>% 
    show_query() %>% 
    capture.output() %>% 
    {paste(.[2:length(.)], collapse = " ")}

First we need to build the API query, then we query DoltHub and parse the json response object into an R data frame (tibble).

library(glue)
## 
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
## 
##     collapse
owner <- "dolthub" 
repo <- "ip-to-country"
branch <- "master"
url <- glue("https://www.dolthub.com/api/v1alpha1/{owner}/{repo}/{branch}")

httr::GET(url, query = list(q = str_sql)) %>% 
    httr::content() %>% 
    {.$rows} %>% 
    jsonlite::toJSON() %>% 
    jsonlite::fromJSON() %>% 
    as_tibble() %>% 
    mutate_all(as.character)
## # A tibble: 80 x 4
##    IPRange                               CountryCode2Lett… Registry AssignedDate
##    <chr>                                 <chr>             <chr>    <chr>       
##  1 1000::-1fff:ffff:ffff:ffff:ffff:ffff… ZZ                iana     838857600   
##  2 100::-1ff:ffff:ffff:ffff:ffff:ffff:f… ZZ                iana     838857600   
##  3 2000::-2000:ffff:ffff:ffff:ffff:ffff… ZZ                iana     838857600   
##  4 2001:1000::-2001:11ff:ffff:ffff:ffff… ZZ                iana     838857600   
##  5 2001:4e00::-2001:4fff:ffff:ffff:ffff… ZZ                iana     838857600   
##  6 2001:6000::-2001:7fff:ffff:ffff:ffff… ZZ                iana     838857600   
##  7 2410::-241f:ffff:ffff:ffff:ffff:ffff… ZZ                iana     838857600   
##  8 2420::-243f:ffff:ffff:ffff:ffff:ffff… ZZ                iana     838857600   
##  9 2440::-247f:ffff:ffff:ffff:ffff:ffff… ZZ                iana     838857600   
## 10 2480::-24ff:ffff:ffff:ffff:ffff:ffff… ZZ                iana     838857600   
## # … with 70 more rows

Obviously, the entire process is still a bit rough around the edges and it would be nice to have a DBI driver for DoltHub remotes directly - but hey, at least it works!

Session Info

sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
## [1] glue_1.4.2   dplyr_1.0.4  dbplyr_2.1.0
## 
## loaded via a namespace (and not attached):
##  [1] knitr_1.31        magrittr_2.0.1    tidyselect_1.1.0  R6_2.5.0         
##  [5] rlang_0.4.10      fansi_0.4.2       httr_1.4.2        blob_1.2.1       
##  [9] stringr_1.4.0     tools_4.0.4       xfun_0.21         utf8_1.1.4       
## [13] cli_2.3.0         DBI_1.1.1         withr_2.4.1       htmltools_0.5.1.1
## [17] ellipsis_0.3.1    yaml_2.2.1        RMySQL_0.10.21    digest_0.6.27    
## [21] assertthat_0.2.1  tibble_3.0.6      lifecycle_1.0.0   crayon_1.4.1     
## [25] purrr_0.3.4       vctrs_0.3.6       curl_4.3          evaluate_0.14    
## [29] rmarkdown_2.6     stringi_1.5.3     compiler_4.0.4    pillar_1.4.7     
## [33] generics_0.1.0    jsonlite_1.7.2    renv_0.12.0       pkgconfig_2.0.3