Dolt is ‘git for data’, a SQL database that supports common version control operations (clone, merge, etc.). This document is adapted from the dolt documentaion.
We start by cloning an example dolt data base …
system("/usr/local/bin/dolt clone dolthub/ip-to-country")
… and serving it on a local server
system("(cd ip-to-country; /usr/local/bin/dolt sql-server &)")
Now we are ready to interact with the data base using R.
Since dolt is (also) an SQL data base, we can connect a local dolt data base to R using the MySQL connector.
con <- DBI::dbConnect(
RMySQL::MySQL(),
dbname = 'ip_to_country',
host = '127.0.0.1',
port = 3306,
user = 'root'
)
Let’s see what we have in store:
DBI::dbListTables(con)
## [1] "IPv4ToCountry" "IPv6ToCountry"
We could now continue using the DBI
package, but even better, this means that that dbplyr
is supported out of the box (at least or reading data)! This allows us to use standard dplyr
verbs to construct a SQL query against the data base, e.g.
library(dbplyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:dbplyr':
##
## ident, sql
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
query <- dplyr::tbl(con, "IPv6ToCountry") %>%
filter(
Registry == "iana",
CountryCode2Letter == "ZZ",
AssignedDate == 838857600
)
show_query(query)
## <SQL>
## SELECT *
## FROM `IPv6ToCountry`
## WHERE ((`Registry` = 'iana') AND (`CountryCode2Letter` = 'ZZ') AND (`AssignedDate` = 838857600.0))
The usual lazy behavior of remote queries works
query
## # Source: lazy query [?? x 4]
## # Database: mysql 5.7.9-Vitess [@127.0.0.1:/ip_to_country]
## IPRange CountryCode2Lett… Registry AssignedDate
## <chr> <chr> <chr> <dbl>
## 1 1000::-1fff:ffff:ffff:ffff:ffff:ffff… ZZ iana 838857600
## 2 100::-1ff:ffff:ffff:ffff:ffff:ffff:f… ZZ iana 838857600
## 3 2000::-2000:ffff:ffff:ffff:ffff:ffff… ZZ iana 838857600
## 4 2001:1000::-2001:11ff:ffff:ffff:ffff… ZZ iana 838857600
## 5 2001:4e00::-2001:4fff:ffff:ffff:ffff… ZZ iana 838857600
## 6 2001:6000::-2001:7fff:ffff:ffff:ffff… ZZ iana 838857600
## 7 2001:c000::-2001:ffff:ffff:ffff:ffff… ZZ iana 838857600
## 8 2003:4000::-2003:7fff:ffff:ffff:ffff… ZZ iana 838857600
## 9 2003:8000::-2003:ffff:ffff:ffff:ffff… ZZ iana 838857600
## 10 2004::-2007:ffff:ffff:ffff:ffff:ffff… ZZ iana 838857600
## # … with more rows
… just as explicit collection
collect(query)
## # A tibble: 80 x 4
## IPRange CountryCode2Lett… Registry AssignedDate
## <chr> <chr> <chr> <dbl>
## 1 1000::-1fff:ffff:ffff:ffff:ffff:ffff… ZZ iana 838857600
## 2 100::-1ff:ffff:ffff:ffff:ffff:ffff:f… ZZ iana 838857600
## 3 2000::-2000:ffff:ffff:ffff:ffff:ffff… ZZ iana 838857600
## 4 2001:1000::-2001:11ff:ffff:ffff:ffff… ZZ iana 838857600
## 5 2001:4e00::-2001:4fff:ffff:ffff:ffff… ZZ iana 838857600
## 6 2001:6000::-2001:7fff:ffff:ffff:ffff… ZZ iana 838857600
## 7 2001:c000::-2001:ffff:ffff:ffff:ffff… ZZ iana 838857600
## 8 2003:4000::-2003:7fff:ffff:ffff:ffff… ZZ iana 838857600
## 9 2003:8000::-2003:ffff:ffff:ffff:ffff… ZZ iana 838857600
## 10 2004::-2007:ffff:ffff:ffff:ffff:ffff… ZZ iana 838857600
## # … with 70 more rows
Of course, any remotely hosted dolt DB can be queried just as above by defining the correct connection.
The above approach requires a running dolt server and cloning the DB of interest. A more direct route is the DoltHub API which is still in alpha. Unfortunately, we cannot specify a DoltHub repository as valid DBI
connection the SQL query thus has to be entered manually. Since we already constructed the query earlier, let’s just reuse it.
str_sql <- query %>%
show_query() %>%
capture.output() %>%
{paste(.[2:length(.)], collapse = " ")}
First we need to build the API query, then we query DoltHub and parse the json response object into an R data frame (tibble).
library(glue)
##
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
##
## collapse
owner <- "dolthub"
repo <- "ip-to-country"
branch <- "master"
url <- glue("https://www.dolthub.com/api/v1alpha1/{owner}/{repo}/{branch}")
httr::GET(url, query = list(q = str_sql)) %>%
httr::content() %>%
{.$rows} %>%
jsonlite::toJSON() %>%
jsonlite::fromJSON() %>%
as_tibble() %>%
mutate_all(as.character)
## # A tibble: 80 x 4
## IPRange CountryCode2Lett… Registry AssignedDate
## <chr> <chr> <chr> <chr>
## 1 1000::-1fff:ffff:ffff:ffff:ffff:ffff… ZZ iana 838857600
## 2 100::-1ff:ffff:ffff:ffff:ffff:ffff:f… ZZ iana 838857600
## 3 2000::-2000:ffff:ffff:ffff:ffff:ffff… ZZ iana 838857600
## 4 2001:1000::-2001:11ff:ffff:ffff:ffff… ZZ iana 838857600
## 5 2001:4e00::-2001:4fff:ffff:ffff:ffff… ZZ iana 838857600
## 6 2001:6000::-2001:7fff:ffff:ffff:ffff… ZZ iana 838857600
## 7 2410::-241f:ffff:ffff:ffff:ffff:ffff… ZZ iana 838857600
## 8 2420::-243f:ffff:ffff:ffff:ffff:ffff… ZZ iana 838857600
## 9 2440::-247f:ffff:ffff:ffff:ffff:ffff… ZZ iana 838857600
## 10 2480::-24ff:ffff:ffff:ffff:ffff:ffff… ZZ iana 838857600
## # … with 70 more rows
Obviously, the entire process is still a bit rough around the edges and it would be nice to have a DBI
driver for DoltHub remotes directly - but hey, at least it works!
sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] glue_1.4.2 dplyr_1.0.4 dbplyr_2.1.0
##
## loaded via a namespace (and not attached):
## [1] knitr_1.31 magrittr_2.0.1 tidyselect_1.1.0 R6_2.5.0
## [5] rlang_0.4.10 fansi_0.4.2 httr_1.4.2 blob_1.2.1
## [9] stringr_1.4.0 tools_4.0.4 xfun_0.21 utf8_1.1.4
## [13] cli_2.3.0 DBI_1.1.1 withr_2.4.1 htmltools_0.5.1.1
## [17] ellipsis_0.3.1 yaml_2.2.1 RMySQL_0.10.21 digest_0.6.27
## [21] assertthat_0.2.1 tibble_3.0.6 lifecycle_1.0.0 crayon_1.4.1
## [25] purrr_0.3.4 vctrs_0.3.6 curl_4.3 evaluate_0.14
## [29] rmarkdown_2.6 stringi_1.5.3 compiler_4.0.4 pillar_1.4.7
## [33] generics_0.1.0 jsonlite_1.7.2 renv_0.12.0 pkgconfig_2.0.3