
Commit e99cb3b

db-duck >> mcp ext; r-snip >> replace_values; r-tidy >> recode_values, replace_values; surv-anal >> calibration
1 parent e31b1ae commit e99cb3b

5 files changed

Lines changed: 303 additions & 63 deletions

File tree

qmd/db-duckdb.qmd

Lines changed: 11 additions & 7 deletions
@@ -1146,10 +1146,10 @@

- [Dash]{.underline}
  - [Repo](https://github.com/gropaul/dash), [Docs](https://www.dash.builders/docs/dashboards#dashboards)
  - A local-first data **exploration and visualization tool** built on top of DuckDB. Use it in your browser or as a DuckDB extension to analyze and visualize your data with ease. (See also the web app in Misc \>\> Tools)
- [dplyr]{.underline}
  - [Repo](https://github.com/mrchypark/libdplyr), [Docs](https://duckdb.org/community_extensions/extensions/dplyr)
  - Enables R users to **write database queries using familiar dplyr syntax** and converts them to efficient SQL for execution.
  - Supports multiple SQL dialects (PostgreSQL, MySQL, SQLite, DuckDB) for use across various database environments.
- [JSON]{.underline}
  - [Webpage](https://duckdb.org/docs/extensions/json.html)
@@ -1201,7 +1201,7 @@

- Might need to use `FORCE INSTALL postgres`
- **Allows DuckDB to connect to those systems and operate on them** in the same way that it operates on its own native storage engine.
- Use Cases
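A minimal sketch of the attach-and-query flow, assuming a reachable Postgres instance; the connection-string values and `some_table` below are placeholders:

``` sql
-- Install and load the extension (FORCE INSTALL postgres if a cached build misbehaves)
INSTALL postgres;
LOAD postgres;

-- Attach a running Postgres database under the alias pg
ATTACH 'dbname=mydb user=me host=127.0.0.1' AS pg (TYPE postgres);

-- Postgres tables can now be queried like native DuckDB tables
SELECT count(*) FROM pg.public.some_table;
```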
@@ -1272,22 +1272,26 @@

- [GSheets]{.underline}
  - [Repo](https://github.com/evidence-dev/duckdb_gsheets)
  - Extension for **reading and writing Google Sheets** with SQL
- [Infera]{.underline}
  - [Repo](https://github.com/CogitatorTech/infera)
  - Allows you to use machine learning (ML) models directly in SQL queries to perform inference on data stored in DuckDB tables.
  - These are **pretrained models, and this extension allows you to perform prediction within DuckDB.**
  - Developed in Rust and uses Tract as the backend inference engine.
  - Supports loading and running models in Open Neural Network Exchange (ONNX) format. See [repo](https://github.com/onnx/models), [huggingface](https://huggingface.co/onnxmodelzoo)
  - Currently seems to be mostly Computer Vision and Natural Language Processing (NLP)
    - There's also a forecasting model and a couple of recommender models
- [duckdb_mcp]{.underline}
  - [Webpage](https://duckdb.org/community_extensions/extensions/duckdb_mcp), [Intro](https://dailydrop.hrbrmstr.dev/2026/02/04/drop-767-2026-02-04-if-it-walks-like-a/)
  - Enables seamless **integration between SQL databases and MCP servers**
  - Provides both client capabilities for accessing remote MCP resources via SQL and server capabilities for exposing database content as MCP resources
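As a community extension, installation should follow the standard community-repository pattern (a sketch; the extension's own functions for attaching MCP servers are documented on its webpage):

``` sql
-- Pull the extension from DuckDB's community repository, then load it
INSTALL duckdb_mcp FROM community;
LOAD duckdb_mcp;
```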
- [mlpack]{.underline}
  - [Repo](https://github.com/eddelbuettel/duckdb-mlpack), [Extension webpage](https://duckdb.org/community_extensions/extensions/mlpack)
  - Allows you to **fit (or train) and predict (or classify) from the models** implemented
  - Currently just supports AdaBoost and (regularized) linear regression
- [quackstore]{.underline}
  - [Repo](https://github.com/coginiti-dev/QuackStore)
  - For **caching frequently queried files locally**
  - When you query remote files (like CSV files from the web), DuckDB normally downloads them every time. With QuackStore, the first query downloads and caches the file locally. Subsequent queries use the cached version, making them much faster.
  - Key Benefits
    - Block-based caching: Only caches the parts of files you actually access (blocks)

qmd/db-postgres.qmd

Lines changed: 25 additions & 25 deletions
@@ -8,7 +8,7 @@

- [Postgres is eating the database world](https://medium.com/@fengruohang/postgres-is-eating-the-database-world-157c204dcfc4)
- Packages
  - [{]{style="color: #990000"}[RPostgres](https://rpostgres.r-dbi.org/){style="color: #990000"}[}]{style="color: #990000"} - DBI-compliant interface to the postgres database
  - [{]{style="color: goldenrod"}[psycopg](https://www.psycopg.org/psycopg3/docs/index.html){style="color: goldenrod"}[}]{style="color: goldenrod"} - PostgreSQL database adapter
- Resources
  - [Docs](https://docs.jade.fyi/postgres/postgres.html) - All on one page so you can just [ctrl + f]{.arg-text}
  - [Exploring Enterprise Databases with R: A Tidyverse Approach](https://smithjd.github.io/sql-pet/)
@@ -39,38 +39,38 @@

- [Apache AGE]{.underline}
  - [Website](https://age.apache.org/), [Docs](https://age.apache.org/age-manual/master/index.html)
  - The goal of the project is to create a single storage layer that can handle both **relational and graph model data** so that users can use standard ANSI SQL along with openCypher, the graph query language.
  - Users can read and write graph data in nodes and edges. They can also use various algorithms such as variable length and edge traversal when analyzing data.
- [pgai]{.underline}
  - [Repo](https://github.com/timescale/pgai/?ref=timescale.com), [Intro](https://www.timescale.com/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data/)
  - Simplifies the process of **building search and Retrieval-Augmented Generation (RAG) AI applications** with PostgreSQL.
  - Features
    - Create embeddings for your data.
    - Retrieve LLM chat completions from models like OpenAI GPT4o.
    - Reason over your data and facilitate use cases like classification, summarization, and data enrichment on your existing relational data in PostgreSQL.
- [pg_analytics]{.underline}
  - [Intro](https://blog.paradedb.com/pages/introducing_analytics), [Repo](https://github.com/paradedb/paradedb/tree/dev/pg_analytics)
  - **Arrow and DataFusion** integrated with Postgres
  - **Delta Lake tables** behave like regular Postgres tables but use a column-oriented layout via Apache Arrow and utilize Apache DataFusion, a query engine optimized for column-oriented data
    - Data is persisted to disk with Parquet
    - The delta-rs library is a Rust-based implementation of Delta Lake. This library adds ACID transactions, updates and deletes, and file compaction to Parquet storage. It also supports querying over data lakes like S3, which introduces the future possibility of connecting Postgres tables to cloud data lakes.
- [pg_bm25]{.underline}
  - [Intro](https://blog.paradedb.com/pages/introducing_bm25), [Repo](https://github.com/paradedb/paradedb/tree/dev/pg_bm25#overview)
  - Rust-based extension that significantly improves Postgres' **full text search** capabilities
  - Built to be an Elasticsearch inside of a postgres db
  - Performant on large tables; adds support for operations like fuzzy search, relevance tuning, BM25 relevance scoring (same algorithm as Elasticsearch), and real-time search (new data is immediately searchable without manual reindexing)
  - Query times over 1M rows are 20x faster compared to tsquery and ts_rank (built-in search and sort)
  - Can be combined with PGVector for semantic fuzzy search
- [Citus]{.underline}
  - [Website](https://www.citusdata.com/)
  - Distributed Postgres
  - Transforms a standalone cluster into a horizontally partitioned **distributed database cluster**.
  - Scales Postgres by distributing data & queries. You can start with a single Citus node, then add nodes & rebalance shards when you need to grow.
  - Can combine with PostGIS for a distributed geospatial database, PGVector for a distributed vector database, pg_bm25 for a distributed full-text search database, etc.
  - [yugabytedb](https://www.yugabyte.com/) is also an option for distributed postgres
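The basic Citus sharding step is a single documented function call; a sketch using a hypothetical `events` table distributed by `user_id`:

``` sql
CREATE EXTENSION citus;

CREATE TABLE events (
    user_id    bigint,
    payload    jsonb,
    created_at timestamptz
);

-- Shard the table across worker nodes on the distribution column
SELECT create_distributed_table('events', 'user_id');
```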
- [pg_duckdb]{.underline}
  - [Repo](https://github.com/duckdb/pg_duckdb), [Intro](https://motherduck.com/blog/pg_duckdb-postgresql-extension-for-duckdb-motherduck/)
  - **Official Postgres extension for DuckDB**
  - Developed in collaboration with Hydra and MotherDuck
  - Embeds DuckDB's columnar-vectorized analytics engine and features into Postgres
  - `SELECT` queries executed by the DuckDB engine can directly read Postgres tables
@@ -87,72 +87,72 @@

- Sync and Transformation
  - Leverages PostgreSQL's logical replication system to capture and stream data changes. It uses NATS as a message broker to decouple reading from the WAL through the replicator and worker processes, providing flexibility and scalability. Transformations and filtrations are applied before the data reaches the destination.
  - Use Cases
    - Continuously **sync production data to staging**, leveraging powerful transformation rules to maintain data privacy and security practices.
    - **Sync and transform data to separate databases** for archiving, auditing and analytics purposes.
- [PGLite]{.underline}
  - [Website](https://pglite.dev/)
  - **Embeddable** Postgres (e.g. for things like apps)
  - Run a full Postgres database locally in WASM with reactivity and live sync.
- [pg_mooncake]{.underline}
  - [Repo](https://github.com/Mooncake-Labs/pg_mooncake), [Site](https://pgmooncake.com/)
  - Adds native columnstore tables with DuckDB execution for 1000x faster analytics.
  - Columnstore tables are stored as Iceberg or Delta Lake tables (parquet files + metadata) in object storage. **Differs from pg_duckdb because these tables support transactional and batch inserts, updates, and deletes, as well as joins with regular PostgreSQL tables.**
  - Available on [Neon Postgres](https://neon.tech/home).
- [pg_parquet]{.underline}
  - [Repo](https://github.com/CrunchyData/pg_parquet/), [Intro](https://www.crunchydata.com/blog/pg_parquet-an-extension-to-connect-postgres-and-parquet)
  - Sources: Locally or S3
  - Dependencies: Apache Arrow and pgrx extension
  - Features
    - **Export** Postgres tables/queries to **Parquet** files
    - **Ingest** data from Parquet files to Postgres tables
    - **Inspect** the schema and metadata of Parquet files
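These features map onto `COPY` plus helper functions; a sketch assuming a local file path and a hypothetical `my_table` (the `parquet.schema()` helper is described in the Crunchy Data intro):

``` sql
-- Export a table (or query) to a Parquet file
COPY my_table TO '/tmp/my_table.parquet' (FORMAT 'parquet');

-- Ingest a Parquet file back into a table
COPY my_table FROM '/tmp/my_table.parquet' (FORMAT 'parquet');

-- Inspect the schema of a Parquet file
SELECT * FROM parquet.schema('/tmp/my_table.parquet');
```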
- [plprql]{.underline}
  - [Repo](https://github.com/kaspermarstal/plprql)
  - Enables you to run **PRQL** queries. PRQL has a syntax that is similar to [{dplyr}]{style="color: #990000"}
  - Built in Rust, so you have to have [pgrx]{.underline} installed. Repo has directions.
- [pgroll]{.underline}
  - [Repo](https://github.com/xataio/pgroll)
  - An open-source **schema migration tool** for Postgres, built to enable zero-downtime, reversible schema migrations using the expand/contract pattern
  - Creates virtual schemas based on PostgreSQL views on top of the physical tables. This allows you to make changes to your database without impacting the application.
- [pgrx]{.underline}
  - [Repo](https://github.com/pgcentralfoundation/pgrx)
  - Framework for developing PostgreSQL extensions in Rust
  - To **install extensions built in Rust**, you need to have this extension installed
- [pg_sparse]{.underline}
  - [Intro](https://blog.paradedb.com/pages/introducing_sparse), [Repo](https://github.com/paradedb/paradedb/tree/dev/pg_sparse#overview)
  - Enables efficient **storage and retrieval of *sparse* vectors** using HNSW
  - SPLADE outputs sparse vectors with over 30,000 entries. Sparse vectors can detect the presence of exact keywords while also capturing semantic similarity between terms.
  - Fork of pgvector with modifications
  - Compatible alongside both pg_bm25 and pgvector
- [pgstream]{.underline}
  - [Intro](https://xata.io/blog/postgres-webhooks-with-pgstream), [Site](https://xata.io/pgstream), [Repo](https://github.com/xataio/pgstream)
  - CDC (Change-Data-Capture) CLI tool that **calls webhooks whenever there is a data (or schema) change**
  - Whenever a row is inserted, updated, or deleted, or a table is created, altered, truncated or deleted, a webhook is notified of the relevant event detail
- [pg_timeseries]{.underline}
  - [Intro](https://tembo.io/blog/pg-timeseries), [Repo](https://github.com/tembo-io/pg_timeseries)
  - An **alternative to [TimescaleDB](https://github.com/timescale/timescaledb)**. That license restricts use of features such as compression, incremental materialized views, and bottomless storage, but that might be because the company ([tembo](https://tembo.io/)) that open sourced this extension has its own stack, cloud, etc.
  - Features such as [native partitioning](#0), a variety of [indexes](#0), [materialized views](#0), and [window / analytics functions](#0)
  - You can compress tables if the table data is older than a certain time period (e.g. 90 days)
- [pg_tracing]{.underline}
  - [Repo](https://github.com/DataDog/pg_tracing)
  - Generates server-side spans for **distributed tracing**
- [pgvector]{.underline}
  - [Repo](https://github.com/pgvector/pgvector)
  - Also see [Databases, Vector Databases](db-vector.qmd#sec-db-vect){style="color: green"} for alternatives and comparisons
  - Enables efficient **storage and retrieval of *dense* vectors** using HNSW
  - OpenAI's text-embedding-ada-002 model outputs dense vectors with 1536 entries
  - Exact and Approximate Nearest Neighbor search
  - L2 distance, Inner Product, and Cosine Distance
  - Supported inside AWS RDS
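A minimal nearest-neighbor sketch from the pgvector README pattern, using tiny 3-dimensional vectors (real embeddings would have e.g. 1536 dimensions):

``` sql
CREATE EXTENSION vector;

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

-- HNSW index for approximate search; <-> is the L2 distance operator
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```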
- [pg_vectorize]{.underline}
  - [Repo](https://github.com/tembo-io/pg_vectorize)
  - Workflows for both **vector search and RAG**
  - Integrations with OpenAI's [embeddings](https://platform.openai.com/docs/guides/embeddings) and [chat-completion](https://platform.openai.com/docs/guides/text-generation) endpoints and a self-hosted container for running [Hugging Face Sentence-Transformers](https://huggingface.co/sentence-transformers)
  - Automated creation of Postgres triggers to keep your embeddings up to date
  - High-level API: one function to initialize embeddings transformations, and another function to search
- [pgvectorscale]{.underline}
  - [Repo](https://github.com/timescale/pgvectorscale/), [Intro](https://www.timescale.com/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data/)
  - A complement to pgvector for high-performance, cost-efficient **vector search on large workloads**.
  - Features
    - A new index type called StreamingDiskANN, inspired by the [DiskANN](https://github.com/microsoft/DiskANN) algorithm, based on research from Microsoft.
    - Statistical Binary Quantization: developed by Timescale researchers, this compression method improves on standard Binary Quantization.

qmd/r-snippets.qmd

Lines changed: 24 additions & 0 deletions
@@ -152,6 +152,30 @@

- tidyselect functions are used to select particular sets of variables

- Using `dplyr::replace_values` ([source](https://tidyverse.org/blog/2026/02/dplyr-1-2-0/#replace_values))

  ``` r
  state <- c("NC", "NY", "CA", NA, "NY", "Unknown", NA)

  # Replace missing values with a constant
  replace_values(state, NA ~ "Unknown")
  #> [1] "NC" "NY" "CA" "Unknown" "NY" "Unknown" "Unknown"

  # Replace missing values with the corresponding value from another column
  region <- c("South", "North", "West", "East", "North", "Unknown", "West")
  replace_values(state, NA ~ region)
  #> [1] "NC" "NY" "CA" "East" "NY" "Unknown" "West"

  # Replace problematic values with a missing value
  replace_values(state, "Unknown" ~ NA)
  #> [1] "NC" "NY" "CA" NA "NY" NA NA

  # Standardize multiple issues at once
  replace_values(state, c(NA, "Unknown") ~ "<missing>")
  #> [1] "NC" "NY" "CA" "<missing>" "NY" "<missing>"
  #> [7] "<missing>"
  ```
- Find duplicate rows
  - [{]{style="color: #990000"}[janitor::get_dupes](https://sfirke.github.io/janitor/reference/get_dupes.html){style="color: #990000"}[}]{style="color: #990000"}
