29 commits
3a12c86
docs: add initial content plan draft
paddymul Mar 20, 2026
519a4ba
docs: add blog articles, DDD static embed generator, and RTD build step
paddymul Mar 20, 2026
068713d
fix: remove unused imports in generate_ddd_static_html.py
paddymul Mar 20, 2026
cdf93d3
fix: RTD build — stub missing JS artifacts, fix RST table width
paddymul Mar 20, 2026
16643e0
ci: post docs preview link as PR comment
paddymul Mar 20, 2026
06187ec
ci: add docs preview link to TestPyPI PR comment
paddymul Mar 20, 2026
20841dd
fix: build static-embed JS bundle on RTD so DDD iframes render
paddymul Mar 20, 2026
1c38349
fix: use full Buckaroo embed for DDD pages, not DFViewer
paddymul Mar 20, 2026
8f2be25
docs: add comments to each DDD code block describing the edge case
paddymul Mar 20, 2026
6c1d39c
docs: show raw ddd_library function defs in code blocks
paddymul Mar 20, 2026
638edde
docs: add missing ddd_library import to notebook example
paddymul Mar 20, 2026
d86a6c1
fix: address review comments — ship static-embed in wheel, use native…
paddymul Mar 20, 2026
faa8212
docs: add article tracing data pipeline from engine to browser
paddymul Mar 21, 2026
350cd0e
docs: convert dataframe viewer article from markdown to RST
paddymul Mar 21, 2026
f3f7480
docs: fix comparison table — correct entries from research, add quak
paddymul Mar 21, 2026
ad927ca
docs: update content plan — mark DDD, types-to-display, and viewer co…
paddymul Mar 21, 2026
17b2637
fix: reset files already merged via PRs 641/642 to match main
paddymul Mar 21, 2026
799eb35
docs: address review comments on types-to-display article
paddymul Mar 21, 2026
b913b04
docs: add Panel Tabulator, Streamlit, hyperlinks to viewer article
paddymul Mar 21, 2026
3188f6f
docs: flesh out Depot CI article with before/after and dependency tes…
paddymul Mar 22, 2026
0688c58
docs: remove Windows job from Depot article
paddymul Mar 22, 2026
d5dffcc
docs: update Depot article with accurate before/after job counts
paddymul Mar 22, 2026
b51ed57
docs: rewrite Depot value prop — consistency and confidence, not raw …
paddymul Mar 22, 2026
eb2642a
docs: fix timing — 3.5 min critical path, not 7 min (excludes Windows)
paddymul Mar 22, 2026
b33b460
docs: add performance and testing articles to content plan
paddymul Mar 23, 2026
301a24e
docs: update Depot article with benchmark data, add CI timing scripts
paddymul Mar 23, 2026
2391e1c
docs: rewrite Depot article with full benchmark data (21 runs)
paddymul Mar 23, 2026
7604611
docs: address review comments on Depot article
paddymul Mar 24, 2026
8d00072
docs: add repo transfer story to Depot article
paddymul Mar 24, 2026
62 changes: 62 additions & 0 deletions docs/content-plan.md
@@ -0,0 +1,62 @@

# Content Plan

## Published (merged or ready to merge)

### Dastardly DataFrame Dataset (PR #641)
Published at `docs/source/articles/dastardly-dataframe-dataset.rst`. Covers DDD with static embeds, full dtype coverage table, weird types for pandas and polars. Includes Polars DDD (issue #622).

### How types and data move from engine to browser
Published at `docs/source/articles/types-to-display.rst`. Column renaming (a,b,c..z,aa,ab), type coercion before parquet, fastparquet encoding, base64 transport, hyparquet decode in browser, displayer/formatter dispatch. Full pipeline trace for a single cell value.

### So you want to write a DataFrame viewer
Published at `docs/source/articles/so-you-want-to-write-a-dataframe-viewer.rst`. Comparison of open source DataFrame viewers (Buckaroo, Perspective, iTables, Great Tables, DTale, Mito, Marimo, ipydatagrid, quak). Research in `~/personal/buckaroo-writing/research/`.

### Why Buckaroo uses Depot for CI
Draft at `docs/source/articles/why-depot.rst`. Depot sponsorship story. Honest benchmarking: Depot isn't measurably faster than GitHub runners (I/O-bound workload), but consistent provisioning + no minute quotas gave confidence to grow from 3 to 23 CI jobs. Pending: email to Depot CTO for input before publishing.

## Planned

### Static embedding improvements
- Publish JS to CDN → reduced embed size. Talk about the journey: Jupyter → Marimo/Pyodide → static embedding → smaller static embedding
- Page weight comparison: dbt (501KB compressed, 28MB total, 1.41s DCL), Snowflake (128KB/1.28MB/22.51MB/445ms), Databricks (127KB/797KB/313ms)
- Customizing buckaroo via API for embeds — show styling, link to styling docs
- Static search — maybe, take a crack at it
- Link to the static embedding guide

### Styling buckaroo chrome
Based on https://github.com/buckaroo-data/buckaroo/pull/583

### Buckaroo embedding guide
- Why to embed buckaroo
- Which config makes sense for you — along with data sizes reasoning
- Customizing appearance
- Customizing buckaroo

### Embedding buckaroo for bigger data
Parquet range queries on S3/R2 buckets. Sponsored by Cloudflare?

### How I made Buckaroo fast
The philosophy: do the right things fast, but mostly just do less. Not a performance optimization article — it's about architecture decisions that avoid work entirely.
- Column renaming to a,b,c means shorter keys everywhere, no escaping
- Parquet instead of JSON: moved from Python JSON serialization (the slowest part of the original render) to binary Parquet. Faster encoding, smaller payloads, type preservation for free
- Sampling: don't process the whole DataFrame. Sample first, compute stats on the sample, display the sample. The user sees 500 rows, not 500,000
- Summary stats: compute once, cache. Don't recompute on every view switch
- hyparquet decodes in the browser — no round-trip to the server for data
- LRU cache on decoded Parquet so switching between main/stats views doesn't re-decode
- AG-Grid does the hard rendering work (virtual scrolling, column virtualization) — don't fight it, feed it clean data
- The lesson: most "performance work" was removing unnecessary work, not optimizing hot paths
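
  A tiny sketch of the spreadsheet-style renaming idea (`col_alias` is a hypothetical helper for illustration, not buckaroo's actual code):

  ```python
  def col_alias(i):
      # 0 -> 'a', 25 -> 'z', 26 -> 'aa', 27 -> 'ab' (spreadsheet-style)
      name = ''
      i += 1
      while i > 0:
          i, r = divmod(i - 1, 26)
          name = chr(ord('a') + r) + name
      return name
  ```

  Short, escape-free keys like these are what travel through the serialization pipeline instead of arbitrary user column names.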

### Testing Buckaroo: unit tests, integration tests, and everything in between
How a solo developer tests a project that spans Python + TypeScript across 8 deployment environments.
- **Python unit tests** (pytest): serialization, stats computation, type coercion, column renaming. Fast, reliable, the foundation. ~60s for the full suite
- **JS unit tests** (vitest): component logic, displayer/formatter functions, parquet decoding. Run in Node, no browser needed
- **Playwright integration tests** (6 suites): Storybook (component rendering), JupyterLab (full widget lifecycle), Marimo, WASM Marimo, Server (MCP/standalone), Static Embed. These catch "it works in Jupyter but is blank in Marimo" — the bugs you can't find any other way
- **Styling screenshot comparisons**: before/after captures on every PR using Storybook + Playwright. Catches visual regressions (column width changes, color map shifts) that no unit test can detect
- **Smoke tests**: install the wheel with each optional extras group (`[mcp]`, `[notebook]`, etc.) and verify imports work. Catches dependency conflicts
- **MCP integration tests**: install the wheel, start the MCP server, make a `tools/call` request, verify the response includes static assets
- **Dual dependency strategy**: run all Python tests twice — once with minimum pinned versions, once with `--resolution=highest`. Catches pandas/polars/pyarrow compatibility issues before users do
- **The DDD as a test suite**: the Dastardly DataFrame Dataset isn't just documentation — each weird DataFrame exercises edge cases through the full serialization → display pipeline
- What I don't test: VSCode, Google Colab (no headless automation), visual pixel-perfect matching (too brittle)
- The lesson: integration tests are worth the CI investment. Most real bugs are at boundaries (Python→Parquet→JS→AG-Grid), not inside any one layer

208 changes: 208 additions & 0 deletions docs/source/articles/buckaroo-compare.rst
@@ -0,0 +1,208 @@
BuckarooCompare — Diff Your DataFrames
=======================================

When you change a pipeline, how do you know what changed in the output? When
you migrate a table from one database to another, how do you verify the data
matches? When two teams produce different versions of the same report, where
are the differences?

You diff them. But ``df1.equals(df2)`` returns a single boolean, and
``df1.compare(df2)`` only works if the DataFrames have identical shapes and
indexes. Real-world comparisons are messier: rows may be reordered, columns
may be added or removed, and the join key might not be the index.

Buckaroo's ``col_join_dfs`` function handles all of this and renders the
result as a color-coded interactive table where differences jump out
visually.


Quick start
-----------

.. code-block:: python

from buckaroo.compare import col_join_dfs
import pandas as pd

df1 = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'score': [88.5, 92.1, 75.3, 96.7],
})

df2 = pd.DataFrame({
'id': [1, 2, 3, 5],
'name': ['Alice', 'Robert', 'Charlie', 'Eve'],
'score': [88.5, 92.1, 80.0, 81.0],
})

merged_df, column_config_overrides, eqs = col_join_dfs(
df1, df2,
join_columns=['id'],
how='outer'
)

The function returns three things:

1. **merged_df**: The joined DataFrame with all rows from both inputs,
plus hidden metadata columns for diff state
2. **column_config_overrides**: A dict of buckaroo styling config that
color-codes each cell based on whether it matches, differs, or is
missing from one side
3. **eqs**: A summary dict showing the diff count per column — how many
rows differ for each column


How the diff works
------------------

``col_join_dfs`` performs a ``pd.merge`` on the join columns, then for each
data column:

- Creates a hidden ``{col}|df2`` column with the df2 value
- Creates a hidden ``{col}|eq`` column encoding the combined state:
is the row in df1 only, df2 only, both-and-matching, or both-and-different?
- Generates a ``color_map_config`` that maps these states to colors
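
The per-column state derivation can be sketched in plain pandas. This is a
hypothetical illustration, not buckaroo's actual implementation; the
``diff_states`` helper and its use of the merge ``indicator`` flag are
assumptions for the sketch:

.. code-block:: python

    import pandas as pd

    def diff_states(df1, df2, join_columns, col):
        # Outer-merge with an indicator column, keeping the df2 value
        # of ``col`` under a suffixed name
        merged = pd.merge(df1, df2, on=join_columns, how='outer',
                          suffixes=('', '|df2'), indicator=True)

        def state(row):
            if row['_merge'] == 'left_only':
                return 'df1_only'
            if row['_merge'] == 'right_only':
                return 'df2_only'
            return 'match' if row[col] == row[f'{col}|df2'] else 'diff'

        # One state per row: df1_only, df2_only, match, or diff
        merged[f'{col}|eq'] = merged.apply(state, axis=1)
        return merged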

The color scheme:

.. list-table::
:header-rows: 1

* - State
- Color
- Meaning
* - df1 only
- Pink
- Row exists in df1 but not df2
* - df2 only
- Green
- Row exists in df2 but not df1
* - Match
- Light blue
- Row in both, values identical
* - Diff
- Dark blue
- Row in both, values differ

Join key columns are highlighted in purple so you can immediately see what
was used for matching.


The eqs summary
---------------

The third return value tells you at a glance where the differences are:

.. code-block:: python

>>> eqs
{
'id': {'diff_count': 'join_key'},
'name': {'diff_count': 2}, # 2 rows differ
'score': {'diff_count': 1}, # 1 row differs
}

Special values:

- ``"join_key"`` — this column was used for matching, not compared
- ``"df_1"`` — column only exists in df1
- ``"df_2"`` — column only exists in df2
- An integer — number of rows where values differ
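
Given the hidden ``{col}|eq`` state columns described earlier, the summary
could be assembled roughly like this (``summarize_eqs`` is a hypothetical
helper, not the library's code):

.. code-block:: python

    def summarize_eqs(merged, join_columns, data_columns):
        # Join keys are flagged rather than counted
        eqs = {c: {'diff_count': 'join_key'} for c in join_columns}
        for col in data_columns:
            # Count rows whose state is 'diff' (present in both, values differ)
            n = int((merged[f'{col}|eq'] == 'diff').sum())
            eqs[col] = {'diff_count': n}
        return eqs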


Using it with the server
------------------------

The buckaroo server exposes a ``/load_compare`` endpoint that loads two
files, runs the diff, and pushes the styled result to any connected browser:

.. code-block:: bash

curl -X POST http://localhost:8888/load_compare \
-H "Content-Type: application/json" \
-d '{
"session": "my-session",
"path1": "/data/report_v1.csv",
"path2": "/data/report_v2.csv",
"join_columns": ["id"],
"how": "outer"
}'

The response includes the diff summary:

.. code-block:: json

{
"session": "my-session",
"rows": 5,
"columns": ["id", "name", "score"],
"eqs": {
"id": {"diff_count": "join_key"},
"name": {"diff_count": 2},
"score": {"diff_count": 1}
}
}

The browser view updates immediately with the color-coded merged table.
Hover over any differing cell to see the df2 value in a tooltip.


Multi-column joins
------------------

.. code-block:: python

merged_df, overrides, eqs = col_join_dfs(
df1, df2,
join_columns=['region', 'date'],
how='inner'
)

Composite join keys work naturally. Both ``region`` and ``date`` will be
highlighted in purple.


Use cases
---------

**Data migration validation**
Migrating from Postgres to Snowflake? Export both tables, diff them.
The color coding immediately shows which rows are missing and which
values changed.

**Pipeline output comparison**
Changed a transform? Diff the before and after. The ``eqs`` summary
tells you exactly which columns were affected and by how many rows.

**A/B test result inspection**
Compare experiment vs control DataFrames on a user ID join key. See
which metrics actually differ.

**Schema evolution**
When df2 has columns that df1 doesn't (or vice versa), those columns
are marked as ``"df_1"`` or ``"df_2"`` in the eqs summary, so you
can see schema changes alongside data changes.


Integration with datacompy
--------------------------

The ``docs/example-notebooks/datacompy_app.py`` example shows how to use
`datacompy <https://github.com/capitalone/datacompy>`_ for metadata-rich
comparison (column matching stats, row-level match rates) while using
buckaroo for the visual rendering.

This gives you the best of both: datacompy's statistical summary plus
buckaroo's interactive, color-coded table view.


Limitations
-----------

- Join columns must be unique in each DataFrame (no many-to-many joins).
If duplicates are detected, ``col_join_dfs`` raises a ``ValueError``.
- Column names cannot contain ``|df2`` or ``__buckaroo_merge`` (these are
used internally).
- Very large DataFrames (>100K rows) will work but the browser may be slow
to render the full color-coded table.
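
For the uniqueness requirement, a quick pre-check in plain pandas can surface
the offending keys before the diff is attempted (``assert_unique_keys`` is a
hypothetical helper, separate from the check ``col_join_dfs`` performs
itself):

.. code-block:: python

    import pandas as pd

    def assert_unique_keys(df, join_columns, name='df'):
        # Collect every row whose join key appears more than once
        dupes = df.loc[df.duplicated(subset=join_columns, keep=False),
                       join_columns]
        if not dupes.empty:
            keys = dupes.drop_duplicates().to_dict('records')
            raise ValueError(f"{name} has non-unique join keys: {keys}")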