Interoperability with DuckDB and dbplyr

This article describes how to use the full power of DuckDB with duckplyr. Two options are discussed: interoperability with dbplyr and the use of DuckDB’s functions in duckplyr.

library(conflicted)
library(duckplyr)
conflict_prefer("filter", "dplyr")

Introduction

The duckplyr package is a drop-in replacement for dplyr, designed to work with DuckDB as the backend. There is a translation layer that converts R function calls to DuckDB functions or macros, aiming at full compatibility with R. Many functions are translated already, and many more are not. For functions that cannot be translated, duckplyr falls back to the original R implementation, disrupting the DuckDB pipeline and materializing intermediate results.

Furthermore, DuckDB has functions with no R equivalent. These might be used already by code that interacts with DuckDB through dbplyr, either making use of its passthrough feature (unknown functions are translated to SQL verbatim), or by using the mutate(x = sql(...)) pattern. When working with duckplyr, this functionality is still accessible, albeit through experimental interfaces:

From duckplyr to dbplyr

The experimental as_tbl() function, introduced in duckplyr 1.1.0, transparently converts a duckplyr frame to a dbplyr tbl object:

df <- duckdb_tibble(a = 2L)
df

tbl <- as_tbl(df)
tbl

It achieves this by creating a temporary view that points to the relational object created internally by duckplyr, in the same DBI connection as the duckplyr object. No data is copied in this operation. The view is discarded when the tbl object goes out of scope.

This allows using arbitrary SQL code, either through sql() or by relying on dbplyr’s passthrough feature.

tbl %>%
  mutate(b = sql("a + 1"), c = least_common_multiple(a, b)) %>%
  show_query()

There is no R function called least_common_multiple(), it is interpreted as a SQL function.

least_common_multiple(2, 3)
tbl %>%
  mutate(b = sql("a + 1"), c = least_common_multiple(a, b))

To continue processing with duckplyr, use as_duckdb_tibble():

tbl %>%
  mutate(b = sql("a + 1"), c = least_common_multiple(a, b)) %>%
  as_duckdb_tibble()

Call arbitrary functions in duckplyr

The escape hatch, also introduced in duckplyr 1.1.0, allows calling arbitrary DuckDB functions directly from duckplyr, without going through SQL:

duckdb_tibble(a = 2L, b = 3L) %>%
  mutate(c = dd$least_common_multiple(a, b))

The dd prefix has been picked for the following reasons:

A prefix is necessary to avoid name clashes with existing R functions. If this is used widely, large-scale code analysis may help prioritize the translation of functions that are not yet supported by duckplyr.

The dd package, when attached, will provide a dd object containing many known DuckDB functions. This adds support for autocomplete:

Screenshot for autocomplete with the dd package
Screenshot for autocomplete with the dd package

This package is not necessary to use duckplyr, and the list of functions is incomplete and growing. In case you’re wondering:

duckdb_tibble(a = "dbplyr", b = "duckplyr") %>%
  mutate(c = dd$damerau_levenshtein(a, b))

Conclusion

While duckplyr is designed to be a drop-in replacement for dplyr, it still allows to harness most if not all of the power of DuckDB.

See vignette("limits") for limitations in the translation employed by duckplyr, vignette("fallback") for more information on fallback, and vignette("telemetry") for existing attempts to prioritize work on the translation layer.