# DSL interface based on dplyr verbs
Before jumping into implementation, I'd like to start a design discussion around the idea of offering a DSL interface based on dplyr verbs, which would get translated internally into valid ODSQL query strings.
ODSQL is a SQL-inspired, URL-encoded query language, but using it directly, whether via {httr2} or by manually building query strings, can be error-prone and unidiomatic for R users.
By designing a DSL that mirrors the tidyverse's data manipulation grammar, we can let users write code like this:
```r
odsql("dataset_id") |>
  select(city, population) |>
  filter(population > 1000000) |>
  arrange(desc(population)) |>
  slice_head(n = 5) |>
  collect()
```
Which would internally generate:
```
GET /api/explore/v2.1/catalog/datasets/dataset_id/records?
  select=city,population&
  where=population>1000000&
  order_by=population desc&
  limit=5
```
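To make the mapping concrete, here is a minimal sketch of how such a pipeline could accumulate clauses in a query object and render them into the records endpoint URL. All names here (`odsql()`, `set_clause()`, `render_query()`) are hypothetical illustrations, not an existing API; in the real DSL the dplyr verb methods would call something like `set_clause()` internally.

```r
# Hypothetical sketch: a query object that accumulates ODSQL clauses
# and renders them as a records-endpoint URL. Names are illustrative.
odsql <- function(dataset_id) {
  structure(
    list(dataset = dataset_id, clauses = list()),
    class = "odsql_query"
  )
}

# Record one ODSQL clause (select=, where=, order_by=, limit=, ...).
set_clause <- function(query, key, value) {
  query$clauses[[key]] <- value
  query
}

# Join the accumulated clauses into the final request path.
render_query <- function(query) {
  base <- sprintf("/api/explore/v2.1/catalog/datasets/%s/records",
                  query$dataset)
  if (length(query$clauses) == 0) return(base)
  params <- paste0(names(query$clauses), "=", unlist(query$clauses),
                   collapse = "&")
  paste0(base, "?", params)
}

q <- odsql("dataset_id")
q <- set_clause(q, "select", "city,population")
q <- set_clause(q, "where", "population>1000000")
q <- set_clause(q, "order_by", "population desc")
q <- set_clause(q, "limit", "5")
render_query(q)
```

A real implementation would additionally URL-encode each clause (e.g. via `utils::URLencode()`) before sending the request with {httr2}.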
## Design Questions
- Core verbs: Which dplyr verbs do we want to support out of the box? Obvious candidates are:
  - `select()`
  - `filter()`
  - `group_by()` + `summarise()`
  - `arrange()`
  - `slice_head()` for `limit=`
  - possibly `mutate()` for things like `year = year(date)`
- Custom expression translation: How do we handle things like `startsWith()`, `in()`, and date/time functions? Should we implement a small expression parser using {rlang}?
- S3 query object: Should we introduce a class like `odsql_query` that holds the pipeline before it is collected via `collect()`?
- Caching, error handling, metadata introspection: What should be built into the request pipeline vs. delegated to user code?
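On the expression-translation question, one low-dependency option is to capture the user's expression unevaluated and walk the call tree, rewriting R operators and functions into their ODSQL counterparts ({rlang} would offer the same capture via `enquo()`/`get_expr()`). The sketch below uses base R's `substitute()`; the operator table and the `startswith()` spelling on the ODSQL side are illustrative assumptions, not a complete or verified mapping.

```r
# Hypothetical sketch: translate a captured R expression into an
# ODSQL-style predicate string. The operator mapping is illustrative.
translate_expr <- function(expr) {
  if (is.call(expr)) {
    op <- as.character(expr[[1]])
    args <- lapply(as.list(expr)[-1], translate_expr)
    switch(op,
      "==" = paste0(args[[1]], "=", args[[2]]),
      ">"  = paste0(args[[1]], ">", args[[2]]),
      "<"  = paste0(args[[1]], "<", args[[2]]),
      "&"  = paste0(args[[1]], " AND ", args[[2]]),
      "|"  = paste0(args[[1]], " OR ", args[[2]]),
      # Assumed ODSQL spelling; to be checked against the API docs.
      "startsWith" = sprintf("startswith(%s, %s)", args[[1]], args[[2]]),
      stop("Unsupported function in ODSQL translation: ", op)
    )
  } else if (is.character(expr)) {
    paste0('"', expr, '"')                 # quote string literals
  } else if (is.numeric(expr)) {
    format(expr, scientific = FALSE)       # avoid "1e+06" rendering
  } else {
    as.character(expr)                     # bare symbols pass through
  }
}

# Capture the expression unevaluated, as a filter() method would via NSE.
capture_filter <- function(expr) translate_expr(substitute(expr))

capture_filter(population > 1000000 & startsWith(city, "San"))
```

Whitelisting a `switch()` table like this also gives a natural place to raise a clear error when a user writes an R function that has no ODSQL equivalent, rather than silently emitting an invalid query.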
## Precedents
This kind of DSL translation has good grounding in the R ecosystem:
- {dbplyr} translates dplyr verbs to SQL for remote databases.
- {httr2} itself uses a piped DSL to build HTTP requests.
### Worth looking into
- {sos4R}
- {ckanr}
- {eurostat}