From 3a12c8672e76a5c20df9bd2ef559255c1c7891dd Mon Sep 17 00:00:00 2001
From: Paddy Mullen
Date: Fri, 20 Mar 2026 14:13:14 -0400
Subject: [PATCH 01/29] docs: add initial content plan draft

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 docs/content-plan.md | 62 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)
 create mode 100644 docs/content-plan.md

diff --git a/docs/content-plan.md b/docs/content-plan.md
new file mode 100644
index 00000000..2217e838
--- /dev/null
+++ b/docs/content-plan.md
@@ -0,0 +1,62 @@
+
# Dastardly Dataframe Dataset

Static addition to docs, pandas code blocks of weird dataframes, then the statically rendered buckaroo widget

talk about the dastardly dataframe dataset, and why these dataframes are generally hard to display, what little things trip people up

Note that although the types are rare, because buckaroo is built not as a customized table widget for use in dashboards but a way to see dataframes as they are in data workflow systems, being able to display all types is pretty important.

Also note that this is a static embedding of the DFViewer, part of the new DFViewer embeddable system so you can integrate buckaroo into your apps simply. more coming on the embeddable buckaroo

# DDD for polars

new release of the buckaroo static embedding that now supports polars. once again talk about the DDD. specifically https://github.com/buckaroo-data/buckaroo/issues/622

# Static embedding improvements

## publish the JS to a CDN -> reduced embed size
talk about size reductions
talk about how I built this to better share what buckaroo is doing. At first you needed to download jupyter and buckaroo.
Then Marimo Pyodide, now static embedding, now smaller static embedding

does pageweight even matter, well to buckaroo it does, to dbt, apparently not, their home page is 501KB compressed 801KB raw, the whole thing is 28Mb, DOM Content loaded in 1.41 seconds (This buckaroo page will be better of course, the old version will probably be better)

Snowflake 128kb/1.28mb/22.51mb/445ms

Databricks 127kb/797kb/313ms

## Customizing buckaroo via api for embeds
show some styling, link to styling docs

## Static search

Maybe, take a crack at it

Link to the static embedding guide

## Styling buckaroo chrome
based on
https://github.com/buckaroo-data/buckaroo/pull/583

# Buckaroo embedding guide

Why to embed buckaroo
Which config makes sense for you - along with data sizes reasoning
Customizing appearance
Customizing buckaroo

# embedding buckaroo for bigger data
Parquet range queries on s3/r2 buckets
sponsored by cloudflare?




## Help me work through a content plan.

what other features have I recently released that deserve blog posts?
Should I just start here?

Where do these fit into the docs site?
+ + + From 519a4baa37d0d6a55320a68abfb9fbb87a0fa8d0 Mon Sep 17 00:00:00 2001 From: Paddy Mullen Date: Fri, 20 Mar 2026 16:36:34 -0400 Subject: [PATCH 02/29] docs: add blog articles, DDD static embed generator, and RTD build step MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Post 1: Dastardly DataFrame Dataset with inline iframe embeds - Post 3: Static Embedding & the Incredible Shrinking Widget - Post 5: Buckaroo Embedding Guide - Post 8: BuckarooCompare — Diff Your DataFrames - Script to generate DDD static HTML pages at docs build time - RTD config runs generate_ddd_static_html.py before copying extra-html - Fleshed out content-plan.md with 9-post publishing sequence Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/source/articles/buckaroo-compare.rst | 208 +++++++ .../articles/dastardly-dataframe-dataset.rst | 518 ++++-------------- docs/source/articles/embedding-guide.rst | 254 +++++++++ docs/source/articles/static-embedding.rst | 180 ++++++ scripts/generate_ddd_static_html.py | 13 +- 5 files changed, 748 insertions(+), 425 deletions(-) create mode 100644 docs/source/articles/buckaroo-compare.rst create mode 100644 docs/source/articles/embedding-guide.rst create mode 100644 docs/source/articles/static-embedding.rst diff --git a/docs/source/articles/buckaroo-compare.rst b/docs/source/articles/buckaroo-compare.rst new file mode 100644 index 00000000..04874b47 --- /dev/null +++ b/docs/source/articles/buckaroo-compare.rst @@ -0,0 +1,208 @@ +BuckarooCompare — Diff Your DataFrames +======================================= + +When you change a pipeline, how do you know what changed in the output? When +you migrate a table from one database to another, how do you verify the data +matches? When two teams produce different versions of the same report, where +are the differences? + +You diff them. 
But ``df1.equals(df2)`` returns a single boolean, and +``df1.compare(df2)`` only works if the DataFrames have identical shapes and +indexes. Real-world comparisons are messier: rows may be reordered, columns +may be added or removed, and the join key might not be the index. + +Buckaroo's ``col_join_dfs`` function handles all of this and renders the +result as a color-coded interactive table where differences jump out +visually. + + +Quick start +----------- + +.. code-block:: python + + from buckaroo.compare import col_join_dfs + import pandas as pd + + df1 = pd.DataFrame({ + 'id': [1, 2, 3, 4], + 'name': ['Alice', 'Bob', 'Charlie', 'Diana'], + 'score': [88.5, 92.1, 75.3, 96.7], + }) + + df2 = pd.DataFrame({ + 'id': [1, 2, 3, 5], + 'name': ['Alice', 'Robert', 'Charlie', 'Eve'], + 'score': [88.5, 92.1, 80.0, 81.0], + }) + + merged_df, column_config_overrides, eqs = col_join_dfs( + df1, df2, + join_columns=['id'], + how='outer' + ) + +The function returns three things: + +1. **merged_df**: The joined DataFrame with all rows from both inputs, + plus hidden metadata columns for diff state +2. **column_config_overrides**: A dict of buckaroo styling config that + color-codes each cell based on whether it matches, differs, or is + missing from one side +3. **eqs**: A summary dict showing the diff count per column — how many + rows differ for each column + + +How the diff works +------------------ + +``col_join_dfs`` performs a ``pd.merge`` on the join columns, then for each +data column: + +- Creates a hidden ``{col}|df2`` column with the df2 value +- Creates a hidden ``{col}|eq`` column encoding the combined state: + is the row in df1 only, df2 only, both-and-matching, or both-and-different? +- Generates a ``color_map_config`` that maps these states to colors + +The color scheme: + +.. 
list-table:: + :header-rows: 1 + + * - State + - Color + - Meaning + * - df1 only + - Pink + - Row exists in df1 but not df2 + * - df2 only + - Green + - Row exists in df2 but not df1 + * - Match + - Light blue + - Row in both, values identical + * - Diff + - Dark blue + - Row in both, values differ + +Join key columns are highlighted in purple so you can immediately see what +was used for matching. + + +The eqs summary +--------------- + +The third return value tells you at a glance where the differences are: + +.. code-block:: python + + >>> eqs + { + 'id': {'diff_count': 'join_key'}, + 'name': {'diff_count': 2}, # 2 rows differ + 'score': {'diff_count': 1}, # 1 row differs + } + +Special values: + +- ``"join_key"`` — this column was used for matching, not compared +- ``"df_1"`` — column only exists in df1 +- ``"df_2"`` — column only exists in df2 +- An integer — number of rows where values differ + + +Using it with the server +------------------------ + +The buckaroo server exposes a ``/load_compare`` endpoint that loads two +files, runs the diff, and pushes the styled result to any connected browser: + +.. code-block:: bash + + curl -X POST http://localhost:8888/load_compare \ + -H "Content-Type: application/json" \ + -d '{ + "session": "my-session", + "path1": "/data/report_v1.csv", + "path2": "/data/report_v2.csv", + "join_columns": ["id"], + "how": "outer" + }' + +The response includes the diff summary: + +.. code-block:: json + + { + "session": "my-session", + "rows": 5, + "columns": ["id", "name", "score"], + "eqs": { + "id": {"diff_count": "join_key"}, + "name": {"diff_count": 2}, + "score": {"diff_count": 1} + } + } + +The browser view updates immediately with the color-coded merged table. +Hover over any differing cell to see the df2 value in a tooltip. + + +Multi-column joins +------------------ + +.. 
code-block:: python + + merged_df, overrides, eqs = col_join_dfs( + df1, df2, + join_columns=['region', 'date'], + how='inner' + ) + +Composite join keys work naturally. Both ``region`` and ``date`` will be +highlighted in purple. + + +Use cases +--------- + +**Data migration validation** + Migrating from Postgres to Snowflake? Export both tables, diff them. + The color coding immediately shows which rows are missing and which + values changed. + +**Pipeline output comparison** + Changed a transform? Diff the before and after. The ``eqs`` summary + tells you exactly which columns were affected and by how many rows. + +**A/B test result inspection** + Compare experiment vs control DataFrames on a user ID join key. See + which metrics actually differ. + +**Schema evolution** + When df2 has columns that df1 doesn't (or vice versa), those columns + are marked as ``"df_1"`` or ``"df_2"`` in the eqs summary, so you + can see schema changes alongside data changes. + + +Integration with datacompy +-------------------------- + +The ``docs/example-notebooks/datacompy_app.py`` example shows how to use +`datacompy `_ for metadata-rich +comparison (column matching stats, row-level match rates) while using +buckaroo for the visual rendering. + +This gives you the best of both: datacompy's statistical summary plus +buckaroo's interactive, color-coded table view. + + +Limitations +----------- + +- Join columns must be unique in each DataFrame (no many-to-many joins). + If duplicates are detected, ``col_join_dfs`` raises a ``ValueError``. +- Column names cannot contain ``|df2`` or ``__buckaroo_merge`` (these are + used internally). +- Very large DataFrames (>100K rows) will work but the browser may be slow + to render the full color-coded table. 
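For intuition, the merge-and-compare core described above ("a ``pd.merge`` on the join columns, then per-column equality state") can be sketched in plain pandas. The helper name ``sketch_col_join`` and its exact ``eqs`` semantics are illustrative — it counts diffs only among rows present in both frames — and this is not buckaroo's implementation:

```python
import pandas as pd

def sketch_col_join(df1, df2, join_columns):
    """Minimal sketch of an outer-join diff in the spirit of
    col_join_dfs (illustrative, not the real implementation)."""
    merged = pd.merge(df1, df2, on=join_columns, how='outer',
                      suffixes=('', '|df2'), indicator=True)
    eqs = {}
    for col in df1.columns:
        if col in join_columns:
            eqs[col] = {'diff_count': 'join_key'}
        elif col + '|df2' not in merged.columns:
            eqs[col] = {'diff_count': 'df_1'}   # column only exists in df1
        else:
            # Compare only rows present in both frames
            both = merged['_merge'] == 'both'
            diff = both & (merged[col] != merged[col + '|df2'])
            eqs[col] = {'diff_count': int(diff.sum())}
    return merged, eqs

# ids 1-3 appear in both frames; 4 only in df1, 5 only in df2
df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
                    'score': [88.5, 92.1, 75.3, 96.7]})
df2 = pd.DataFrame({'id': [1, 2, 3, 5],
                    'name': ['Alice', 'Robert', 'Charlie', 'Eve'],
                    'score': [88.5, 92.1, 80.0, 81.0]})

merged, eqs = sketch_col_join(df1, df2, ['id'])
```

The ``indicator=True`` flag is what provides the "df1 only / df2 only / both" state that the color map encodes; the real function additionally builds the ``{col}|eq`` columns and the styling config.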
diff --git a/docs/source/articles/dastardly-dataframe-dataset.rst b/docs/source/articles/dastardly-dataframe-dataset.rst index 1aaa2023..5f03e725 100644 --- a/docs/source/articles/dastardly-dataframe-dataset.rst +++ b/docs/source/articles/dastardly-dataframe-dataset.rst @@ -4,48 +4,24 @@ The Dastardly DataFrame Dataset Every DataFrame viewer works fine on ``pd.DataFrame({'a': [1, 2, 3]})``. The question is what happens when the data gets weird. -Displaying DataFrames in all their wonderfully variant splendor is quite a -challenge. DataFrames come in many forms and there is little you can depend -on when you want to serialize or display them. Through building Buckaroo I -have tripped across many types of bugs from DataFrames that I didn't expect. - -So I compiled a set of the weirdest DataFrames I have seen in the wild — the -ones that caused hard to debug errors, the ones that were hard to support — -and reduced them to limited test cases. I call this the `Dastardly DataFrame -Dataset `_ -(DDD). MultiIndex columns, NaN mixed with infinity, columns -literally named ``index``, integers too large for JavaScript, types that most -tools pretend don't exist. Through hard fought experience, Buckaroo has dealt -with bugs or edge cases related to each one. - -The naming and early shape of the DDD was heavily influenced by an exchange -with `Cecil Curry `_, the author of -`beartype `_, on -`beartype#529 `_. That guy -is awesome. Be more like that guy. Seriously the most enjoyable bug report -interaction I have ever had. - -This page shows each DDD member rendered live in buckaroo's static embed. No -Jupyter kernel, no server — just HTML and JavaScript. +Buckaroo ships a collection of deliberately tricky DataFrames called the +**Dastardly DataFrame Dataset** (DDD). 
These are the DataFrames that break +other viewers — the ones with MultiIndex columns, NaN mixed with infinity, +columns literally named ``index``, integers too large for JavaScript, and +types that most tools pretend don't exist. + +This page shows each one rendered live in buckaroo's static embed. No +Jupyter kernel, no server — just HTML and JavaScript. If you can see the +tables below, the static embedding system is working. Why this matters ---------------- -Buckaroo has the philosophy that every DataFrame should be displayable, at -least in some form. Capabilities can be reduced — it's fine for ``mean`` to -fail if there is a ``NaN`` in a column — but that failure can't cause -Buckaroo to display nothing. - If you build dashboards, you choose what data goes into your table. You control the types, the column names, the index. But if you're doing exploratory data analysis — loading CSVs from vendors, joining tables from different systems, debugging a pipeline that produces unexpected output — -you don't control any of that. The data is what it is. And who knows -what an LLM will produce — code-generating agents can create DataFrames -with column types you've never seen in your own code. Same goes for -inherited data pipelines: someone else built it, you're debugging it, -and the DataFrame you're staring at has types and structures you didn't -choose. +you don't control any of that. The data is what it is. ``df.head()`` hides the problem. It shows you 5 rows and lets you believe everything is fine. Buckaroo is built for the opposite workflow: show you @@ -54,20 +30,10 @@ everything, especially the parts that are surprising. The Dastardly DataFrames ------------------------ -The DDD is used extensively in Buckaroo's unit test suite. At a minimum, -all DataFrames display in some way unless otherwise noted. Most display with -full features — there are a couple of rough edges, but having a comprehensive -test set is a very helpful start. 
- -Each section below shows the exact function from ``buckaroo.ddd_library`` -that creates the DataFrame, explains why it's tricky, and renders it live -in a buckaroo static embed. +Each section below shows the Python code to create the DataFrame, explains +why it's tricky, and renders it live in a buckaroo static embed. -.. code-block:: bash - - pip install buckaroo - -.. code-block:: python +All of these DataFrames are available in ``buckaroo.ddd_library``:: from buckaroo.ddd_library import * @@ -77,13 +43,9 @@ Infinity and NaN .. code-block:: python - # from buckaroo/ddd_library.py - def df_with_infinity() -> pd.DataFrame: - return pd.DataFrame({'a': [np.nan, np.inf, np.inf * -1]}) - - df_with_infinity() + pd.DataFrame({'a': [np.nan, np.inf, np.inf * -1]}) -Three non-numeric values that pop up in numeric columns: a missing value, positive +Three values, three completely different things: a missing value, positive infinity, and negative infinity. Many viewers display all three as blank or "NaN". Buckaroo distinguishes them. @@ -103,11 +65,7 @@ Really Big Numbers .. code-block:: python - # from buckaroo/ddd_library.py - def df_with_really_big_number() -> pd.DataFrame: - return pd.DataFrame({"col1": [9999999999999999999, 1]}) - - df_with_really_big_number() + pd.DataFrame({"col1": [9999999999999999999, 1]}) Python integers have arbitrary precision. JavaScript's ``Number`` type has 53 bits of integer precision (``Number.MAX_SAFE_INTEGER`` = 9007199254740991). @@ -130,13 +88,10 @@ Column Named "index" .. 
code-block:: python - # from buckaroo/ddd_library.py - def df_with_col_named_index() -> pd.DataFrame: - return pd.DataFrame({ - 'a': ["asdf", "foo_b", "bar_a", "bar_b", "bar_c"], - 'index': ["7777", "ooooo", "--- -", "33333", "assdf"]}) - - df_with_col_named_index() + pd.DataFrame({ + 'a': ["asdf", "foo_b", "bar_a", "bar_b", "bar_c"], + 'index': ["7777", "ooooo", "--- -", "33333", "assdf"] + }) When you call ``df.reset_index()``, pandas creates a column called ``index``. Many widgets break because they confuse this column with the DataFrame's @@ -155,15 +110,10 @@ Named Index .. code-block:: python - # from buckaroo/ddd_library.py - def get_df_with_named_index() -> pd.DataFrame: - """someone put the effort into naming the index, - you'd probably want to display that""" - return pd.DataFrame( - {'a': ["asdf", "foo_b", "bar_a", "bar_b", "bar_c"]}, - index=pd.Index([10, 20, 30, 40, 50], name='foo')) - - get_df_with_named_index() + pd.DataFrame( + {'a': ["asdf", "foo_b", "bar_a", "bar_b", "bar_c"]}, + index=pd.Index([10, 20, 30, 40, 50], name='foo') + ) Someone took the time to name this index ``foo``. That name carries meaning — it might be a join key, a time series frequency, or a categorical grouping. @@ -182,17 +132,11 @@ MultiIndex Columns .. code-block:: python - # from buckaroo/ddd_library.py - def get_multiindex_with_names_cols_df(rows=15) -> pd.DataFrame: - cols = pd.MultiIndex.from_tuples( - [('foo', 'a'), ('foo', 'b'), ('bar', 'a'), - ('bar', 'b'), ('bar', 'c')], - names=['level_a', 'level_b']) - return pd.DataFrame( - [["asdf", "foo_b", "bar_a", "bar_b", "bar_c"]] * rows, - columns=cols) - - get_multiindex_with_names_cols_df(rows=6) + cols = pd.MultiIndex.from_tuples( + [('foo', 'a'), ('foo', 'b'), ('bar', 'a'), ('bar', 'b'), ('bar', 'c')], + names=['level_a', 'level_b']) + pd.DataFrame([["asdf", "foo_b", "bar_a", "bar_b", "bar_c"]] * 6, + columns=cols) Hierarchical column headers are common after ``.pivot_table()`` and ``.groupby().agg()``. 
Most viewers either crash or flatten them into ugly @@ -211,18 +155,13 @@ MultiIndex on Rows .. code-block:: python - # from buckaroo/ddd_library.py - def get_multiindex_index_df() -> pd.DataFrame: - row_index = pd.MultiIndex.from_tuples([ - ('foo', 'a'), ('foo', 'b'), - ('bar', 'a'), ('bar', 'b'), ('bar', 'c'), - ('baz', 'a')]) - return pd.DataFrame({ - 'foo_col': [10, 20, 30, 40, 50, 60], - 'bar_col': ['foo', 'bar', 'baz', 'quux', 'boff', None]}, - index=row_index) - - get_multiindex_index_df() + row_index = pd.MultiIndex.from_tuples([ + ('foo', 'a'), ('foo', 'b'), + ('bar', 'a'), ('bar', 'b'), ('bar', 'c'), + ('baz', 'a')]) + pd.DataFrame({'foo_col': [10, 20, 30, 40, 50, 60], + 'bar_col': ['foo', 'bar', 'baz', 'quux', 'boff', None]}, + index=row_index) Multi-level row indexes are the counterpart to MultiIndex columns. They appear after ``.groupby()`` without ``.reset_index()``, or when loading @@ -230,6 +169,9 @@ data from hierarchical sources. The tricky part: each index level becomes an additional column that has to be displayed alongside the data columns without breaking the column count. +This DataFrame also has a ``None`` in the last row of ``bar_col`` — a missing +string value mixed with non-missing strings. + .. raw:: html -Full dtype coverage -------------------- - -The DDD focuses on the types that cause trouble, but how does buckaroo -handle *every* dtype? Here's the full picture across all three engines [1]_: - -.. 
list-table:: - :header-rows: 1 - :widths: 18 12 12 12 14 14 18 - - * - Dtype - - Pandas - - Pandas (Arrow) - - Polars - - Parquet type - - JS type - - Buckaroo display - * - int8–int32 - - Yes - - Yes - - Yes - - INT32 - - Number - - ``1,234`` - * - int64 - - Yes - - Yes - - Yes - - INT64 - - Number [2]_ - - ``1,234,567`` - * - uint8–uint64 - - Yes - - Yes - - Yes - - INT32/INT64 - - Number [2]_ - - ``65,535`` - * - BigInt (>2\ :sup:`53`) - - Yes - - Yes - - — - - INT64 - - String [2]_ - - ``9999999999999999999`` [5]_ - * - float32 - - Yes - - Yes - - Yes - - FLOAT - - Number - - ``2.500`` - * - float64 (incl. inf/NaN) - - Yes - - Yes - - Yes - - DOUBLE - - Number - - ``Infinity`` - * - complex128 - - Fail [3]_ - - — - - — - - — - - — - - — - * - bool - - Yes - - Yes - - Yes - - BOOLEAN - - boolean - - ``True`` - * - string / object - - Yes - - Yes - - Yes - - BYTE_ARRAY - - String - - ``hello world`` - * - mixed-type object - - Yes - - — - - — - - BYTE_ARRAY - - String - - ``{ 'a': 1, 'b': None }`` - * - datetime - - Yes - - Yes - - Yes - - TIMESTAMP - - Date - - ``2021-01-15 14:30:00`` - * - datetime + tz - - Not tested - - Yes - - Yes - - TIMESTAMP+tz - - Date - - ``2021-01-15 14:30:00`` - * - timedelta / duration - - Yes - - Yes - - Yes - - → String [4]_ - - String - - ``1d 2h 3m 4s`` - * - date - - — - - Yes - - Not tested - - DATE (INT32) - - Date - - ``2021-01-15 00:00:00`` - * - time - - — - - Yes - - Yes - - TIME (INT64) - - String - - ``14:30:00`` - * - Categorical - - Yes - - Yes - - Yes - - DICT encoding - - String - - ``red`` - * - Enum - - — - - — - - Not tested - - DICT encoding - - String - - ``red`` - * - Period (time span) - - Yes - - — - - — - - → String [4]_ - - String - - ``2021-01`` [6]_ - * - Interval - - Yes - - — - - — - - → String [4]_ - - String - - ``(0, 1]`` - * - Decimal - - — - - Yes - - Yes - - DECIMAL - - Number - - ``100.50`` - * - Binary - - — - - Yes - - Yes - - BYTE_ARRAY - - String (hex) - - ``68656c6c6f`` - * - Sparse - - Fail 
[3]_ - - — - - — - - — - - — - - — - * - Nullable int/float/bool - - Not tested - - — - - — - - INT32/INT64/BOOLEAN - - Number/boolean - - ``1,234`` / ``True`` - * - List / Array - - — - - Yes - - Not tested - - LIST - - Array - - ``[ 1, 2, 3]`` - * - Struct - - — - - Yes - - Not tested - - STRUCT - - Object - - ``{ 'a': 1, 'b': x }`` - * - Null (all-null column) - - — - - — - - Not tested - - BYTE_ARRAY - - null - - ``(empty)`` - -"Yes" means the dtype serializes and displays correctly. "Not tested" means -serialization succeeds but there is no DDD test case exercising it through -the full widget. "—" means the dtype does not exist in that engine. - -.. [1] Putting together this table exposed areas that still need work. - The interaction between Python dtype, Parquet physical type, JS - decoding, and display formatter has enough nuance for its own blog - post. Expect one soon. - -.. [2] hyparquet decodes INT64 as BigInt. Buckaroo converts to Number if - the value is ≤ ``Number.MAX_SAFE_INTEGER`` (2\ :sup:`53` - 1), otherwise - stringifies to preserve precision. - -.. [3] ``complex128`` and ``SparseDtype`` fail the Parquet path — Arrow - has no complex number type and can't convert sparse arrays. The JSON - path works with string fallback, but that path is being phased out. - -.. [4] ``→ String`` means the type has no native Parquet equivalent. - Buckaroo coerces it to a string before writing Parquet. Period becomes - ``'2021-01'``, Interval becomes ``'(0, 1]'``, timedelta becomes - ``'1 days 02:03:04'`` (pandas path only — Polars Duration is native). - -.. [5] Values above ``Number.MAX_SAFE_INTEGER`` are stringified on the JS - side to preserve exact precision, so they display without commas. The - value ``1`` in the same column still gets the integer formatter: ``1``. - This means a single column can show two different display styles depending - on whether each value fits in 53 bits. - -.. [6] A pandas ``Period`` is a *time span*, not a range between two dates. 
- ``Period('2021-01', 'M')`` means "the month of January 2021". Buckaroo - stringifies it because Parquet has no Period type. Don't confuse it with - ``Interval``, which is a numeric range like ``(0, 1]``. - - -How this demo was built ------------------------ - -Every table on this page is a **static embedding** of the full buckaroo -widget. There is no Python kernel running. Here's what happened: +What's happening under the hood +-------------------------------- + +Every table on this page is a **static embedding** of the buckaroo DFViewer. +There is no Python kernel running. Here's what happened: 1. A Python script called ``buckaroo.artifact.to_html()`` on each DataFrame 2. The function serialized the data to base64-encoded Parquet (compact binary) @@ -654,27 +336,23 @@ For details on how to create your own static embeds, see the Try it yourself --------------- +.. code-block:: bash + + pip install buckaroo + .. code-block:: python from buckaroo.ddd_library import * from buckaroo.artifact import to_html - from pathlib import Path - import shutil, buckaroo # Generate a static HTML page for any DataFrame html = to_html(df_with_weird_types(), title="Weird Types Demo") with open('weird-types.html', 'w') as f: f.write(html) - # Copy the JS/CSS assets alongside the HTML (see #643 for self-contained mode) - static = Path(buckaroo.__file__).parent / 'static' - for name in ('static-embed.js', 'static-embed.css'): - shutil.copy(static / name, '.') - Or in a Jupyter notebook, just:: import buckaroo - from buckaroo.ddd_library import df_with_weird_types df_with_weird_types() # renders inline The Dastardly DataFrame Dataset is also available as an interactive tour diff --git a/docs/source/articles/embedding-guide.rst b/docs/source/articles/embedding-guide.rst new file mode 100644 index 00000000..53b020da --- /dev/null +++ b/docs/source/articles/embedding-guide.rst @@ -0,0 +1,254 @@ +Buckaroo Embedding Guide +======================== + +This guide covers everything you need to 
embed interactive buckaroo tables +in your own applications, documentation, and reports. + + +Why embed +--------- + +- **Share DataFrames without Jupyter**: Send a colleague an HTML file they + can open in any browser. No Python install required. +- **Build data apps**: Integrate the buckaroo viewer into React dashboards, + internal tools, or customer-facing data products. +- **Static reports**: Generate HTML reports from your pipeline that include + interactive, sortable tables with summary statistics. +- **Documentation**: Embed live data tables in your docs site (Sphinx, + MkDocs, or plain HTML). + + +Choose your embedding mode +-------------------------- + +Buckaroo offers two static embed modes and one live widget mode: + +``embed_type="DFViewer"`` — Lightweight table + Just the data grid with sortable columns, summary stats pinned at the + bottom, histograms, and type-aware formatting. Smaller payload. Best + for documentation, reports, and sharing. + +``embed_type="Buckaroo"`` — Full experience + Everything in DFViewer plus the display switcher bar, multiple computed + views, and the interactive analysis pipeline. Larger payload. Best for + data exploration and internal tools. + +**anywidget** — Live in notebooks + The ``BuckarooWidget`` runs inside Jupyter, Marimo, VS Code notebooks, + and Google Colab via anywidget. Full interactivity including the command + UI for data cleaning operations. Requires a running Python kernel. + +For most embedding use cases, start with ``DFViewer``. + + +Data size guidelines +~~~~~~~~~~~~~~~~~~~~ + +.. list-table:: + :header-rows: 1 + + * - Row count + - Recommended approach + * - < 1,000 rows + - Inline static embed. JSON payload is small (~10-50 KB). + * - 1,000 - 100,000 rows + - Static embed still works. Parquet encoding keeps payload + compact (50-500 KB). Consider sampling for faster page load. + * - > 100,000 rows + - Host data separately. 
Use Parquet range queries on S3/R2 to + fetch only the visible rows and columns. + + +Generate a static embed +----------------------- + +.. code-block:: python + + from buckaroo.artifact import to_html + import pandas as pd + + df = pd.read_csv('my_data.csv') + html = to_html(df, title="My Data", embed_type="DFViewer") + + with open('my-data.html', 'w') as f: + f.write(html) + +The HTML file references ``static-embed.js`` and ``static-embed.css``. +These are included in the buckaroo package under ``buckaroo/static/`` — +copy them alongside your HTML or serve them from a web server. + +**With polars:** + +.. code-block:: python + + import polars as pl + from buckaroo.artifact import to_html + + df = pl.read_parquet('my_data.parquet') + html = to_html(df, title="Polars Data") + +``to_html()`` auto-detects polars DataFrames and uses the polars analysis +pipeline. + +**From a file path:** + +.. code-block:: python + + from buckaroo.artifact import to_html + + # Reads CSV, Parquet, JSON, or JSONL automatically + html = to_html('/path/to/data.parquet', title="Direct from file") + + +Customizing appearance +---------------------- + +Column config overrides +~~~~~~~~~~~~~~~~~~~~~~~ + +Pass ``column_config_overrides`` to control per-column display: + +.. code-block:: python + + html = to_html(df, column_config_overrides={ + 'revenue': { + 'color_map_config': { + 'color_rule': 'color_from_column', + 'map_name': 'RdYlGn', + } + }, + 'join_key': { + 'color_map_config': { + 'color_rule': 'color_static', + 'color': '#6c5fc7', + } + } + }) + +Available color rules: + +- ``color_from_column``: Color cells based on their value using a named + colormap (e.g., ``RdYlGn``, ``Blues``, ``Viridis``) +- ``color_categorical``: Map categorical values to a list of colors +- ``color_static``: Constant background color for every cell in the column + +Tooltips +~~~~~~~~ + +Show the value of another column on hover: + +.. 
code-block:: python + + column_config_overrides={ + 'name': { + 'tooltip_config': { + 'tooltip_type': 'simple', + 'val_column': 'full_name', + } + } + } + + +Analysis classes +~~~~~~~~~~~~~~~~ + +Control which summary statistics are computed: + +.. code-block:: python + + from buckaroo.artifact import to_html + from buckaroo.pluggable_analysis_framework.analysis_management import ( + ColAnalysis, + ) + + # Use extra_analysis_klasses to add custom stats + # Use analysis_klasses to replace the default set + html = to_html(df, + extra_analysis_klasses=[MyCustomAnalysis], + embed_type="Buckaroo") + +See :doc:`pluggable` for details on writing custom analysis classes. + + +Pinned rows +~~~~~~~~~~~ + +Add custom pinned rows (shown at the bottom of the table): + +.. code-block:: python + + html = to_html(df, + extra_pinned_rows=[ + {'index': 'target', 'a': 100, 'b': 200}, + ]) + + +Integration patterns +-------------------- + +Static HTML file +~~~~~~~~~~~~~~~~ + +The simplest approach. Generate the HTML, copy ``static-embed.js`` and +``static-embed.css`` next to it, and open in a browser or serve from any +static file host. + +.. code-block:: bash + + cp $(python -c "import buckaroo; print(buckaroo.__path__[0])")/static/static-embed.* ./ + open my-data.html + +React component +~~~~~~~~~~~~~~~ + +For deeper integration, import the React components directly from +``buckaroo-js-core``: + +.. code-block:: bash + + npm install buckaroo-js-core + +.. code-block:: typescript + + import { DFViewer } from 'buckaroo-js-core'; + + function MyTable({ data, config, summaryStats }) { + return ( + + ); + } + +Sphinx / ReadTheDocs +~~~~~~~~~~~~~~~~~~~~~ + +Use a ``raw`` directive to embed an iframe pointing to a pre-generated +static HTML file: + +.. code-block:: rst + + .. raw:: html + + + +Generate the HTML with the ``to_html()`` function and place it in your +Sphinx ``_static`` directory. 
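When a docs build pre-generates many embeds, the ``raw`` directive above can be emitted mechanically. A small stdlib-only helper — the function name and the ``_static`` path are illustrative, matching the layout described above:

```python
def iframe_rst(embed_name: str, height: int = 300) -> str:
    """Return a Sphinx ``.. raw:: html`` snippet that embeds a
    pre-generated static HTML file from the _static directory.
    Illustrative helper, not part of buckaroo."""
    return (
        ".. raw:: html\n"
        "\n"
        f'   <iframe src="_static/{embed_name}.html"\n'
        f'           width="100%" height="{height}" frameborder="0"></iframe>\n'
    )

# One snippet per pre-generated embed, ready to paste into an .rst page
for name in ("my-data", "sales-summary"):
    print(iframe_rst(name))
```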
+ + +What's included in the bundle +----------------------------- + +The ``static-embed.js`` bundle (1.3 MB minified) includes: + +- React 18 + ReactDOM +- AG-Grid Community v33 (table rendering) +- hyparquet (Parquet decoding in the browser) +- recharts (histogram rendering) +- lodash-es (utility functions, tree-shaken) + +The bundle is built with esbuild and shipped as an ES module. diff --git a/docs/source/articles/static-embedding.rst b/docs/source/articles/static-embedding.rst new file mode 100644 index 00000000..c5df8616 --- /dev/null +++ b/docs/source/articles/static-embedding.rst @@ -0,0 +1,180 @@ +Static Embedding & the Incredible Shrinking Widget +==================================================== + +Buckaroo started as a Jupyter widget. You had to install Python, install +Jupyter, install buckaroo, start a kernel, and run a cell — just to see a +table. Then came Marimo and Pyodide, which cut out the kernel but still +needed a Python runtime in the browser. + +Now there's a third option: **static embedding**. A single HTML file that +renders a fully interactive buckaroo table with no server, no kernel, no +Python runtime. Just a browser. + +How it works +------------ + +.. code-block:: python + + from buckaroo.artifact import to_html + import pandas as pd + + df = pd.read_csv('sales.csv') + html = to_html(df, title="Sales Data", embed_type="DFViewer") + + with open('sales.html', 'w') as f: + f.write(html) + +That's it. ``to_html()`` does the following: + +1. Runs the buckaroo analysis pipeline on the DataFrame — computing dtypes, + summary stats, histograms, column configs +2. Serializes the data to **base64-encoded Parquet** (much more compact than + JSON, especially for numeric columns) +3. Wraps everything in an HTML template that references ``static-embed.js`` + and ``static-embed.css`` + +The resulting HTML is self-describing. 
The JS bundle reads the embedded JSON,
+decodes the Parquet payload using `hyparquet <https://github.com/hyparam/hyparquet>`_,
+and renders the table with AG-Grid — all client-side.
+
+Two embedding modes
+-------------------
+
+``embed_type="DFViewer"`` (default)
+    Lightweight table viewer with summary stats pinned at the bottom.
+    Includes dtypes, histograms, and basic statistics. Smaller payload.
+
+``embed_type="Buckaroo"``
+    The full buckaroo experience: display switcher bar, multiple computed
+    views (main data, summary stats, other analysis outputs), and the
+    interactive analysis pipeline UI. Larger payload but more powerful.
+
+For most documentation and sharing use cases, ``DFViewer`` is the right
+choice.
+
+
+Bundle size
+-----------
+
+The ``static-embed.js`` bundle is currently **1.3 MB** (minified). This
+includes React, AG-Grid, hyparquet, recharts (for histograms), and lodash-es.
+
+How does this compare to the data industry?
+
+======================== ==================
+Site                     Total page weight
+======================== ==================
+MongoDB                  11.5 MB
+Confluent                10.7 MB
+Snowflake                8.4 MB
+Elastic                  6.1 MB
+dbt Labs                 5.0 MB
+Fivetran                 3.4 MB
+Datadog                  2.3 MB
+Palantir                 2.0 MB
+Databricks               1.6 MB
+**Buckaroo static embed** **~1.3 MB + data**
+======================== ==================
+
+Confluent ships 9.2 MB of JavaScript to show you a marketing page. MongoDB
+loads a 1.7 MB Optimizely tracking script before you see a single word of
+content. Buckaroo delivers an interactive data viewer — with histograms,
+sortable columns, summary stats, and type-aware formatting — in less than
+Palantir's homepage JavaScript alone.
+
+And that 1.3 MB includes the *viewer itself*. Your data is on top of that,
+but Parquet-encoded data is compact: a 10,000-row DataFrame with 10 columns
+typically adds 50-200 KB depending on column types.
+
+
+What we did to get here
+-----------------------
+
+Recent releases shipped several size optimizations:
+
+**lodash → lodash-es** (`#624 <https://github.com/buckaroo-data/buckaroo/pull/624>`_)
+    Migrated from the CommonJS lodash bundle (which includes every function)
+    to lodash-es, which is tree-shakeable. Only the functions actually used
+    end up in the bundle.
+
+**AG Grid v32 → v33** (`#625 <https://github.com/buckaroo-data/buckaroo/pull/625>`_)
+    AG Grid v33 unified its package structure. Instead of importing from
+    multiple packages (``@ag-grid-community/core``, ``@ag-grid-community/client-side-row-model``,
+    etc.), there's now a single ``ag-grid-community`` package with module
+    registration. This lets the bundler do a single pass of tree-shaking
+    instead of trying to deduplicate across packages.
+
+**Minification** (`#624 <https://github.com/buckaroo-data/buckaroo/pull/624>`_)
+    The ``widget.js`` and ``static-embed.js`` bundles are now minified with
+    esbuild. Previously they shipped unminified.
+
+**Parquet encoding**
+    Switching from JSON arrays to Parquet for the data payload was itself
+    a size win. A DataFrame with 1000 rows of integers takes ~4 KB in
+    Parquet vs ~12 KB in JSON. The savings compound with row count.
+
+
+What's next: CDN-hosted viewer
+------------------------------
+
+Today, every static embed includes the full 1.3 MB viewer bundle. If you
+generate 10 pages, you serve 13 MB of identical JavaScript.
+
+The next step is publishing ``static-embed.js`` to a CDN (e.g., jsDelivr or
+a Cloudflare R2 bucket). Each embed page would reference the CDN URL instead
+of a local file. The per-page payload drops to just the data — typically
+under 200 KB.
+
+This also opens the door to embedding buckaroo tables directly in
+GitHub READMEs (via GitHub Pages), documentation sites, and email reports.
+
+
+For larger data: Parquet range queries
+--------------------------------------
+
+Static embeds work great for data that fits in a single HTML file — up to
+about 100K rows before the file gets unwieldy. Beyond that, the data should
+live separately.
+ +Parquet files are designed for partial reads. The file footer contains a +directory of column chunks with byte offsets. A client can fetch just the +columns and row groups it needs using HTTP range requests — no server +required, just a file on object storage (S3, Cloudflare R2, GCS). + +This is the subject of a future post, but the architecture looks like: + +1. Parquet file on a private R2 bucket +2. Cloudflare Worker generates a time-limited presigned URL +3. Browser-side buckaroo fetches column chunks via ``Range`` headers +4. Data never flows through your server + +See the content plan for details. + + +Try it +------ + +.. code-block:: bash + + pip install buckaroo + +.. code-block:: python + + from buckaroo.artifact import to_html + import pandas as pd + + # Any DataFrame works + df = pd.read_csv('your_data.csv') + html = to_html(df, title="My Data") + + with open('my-data.html', 'w') as f: + f.write(html) + + # Full buckaroo experience (larger bundle, more features) + html_full = to_html(df, title="My Data", embed_type="Buckaroo") + +The generated HTML references ``static-embed.js`` and ``static-embed.css`` +which are included in the ``buckaroo`` Python package under +``buckaroo/static/``. Copy those files alongside your HTML, or serve them +from a web server. 
diff --git a/scripts/generate_ddd_static_html.py b/scripts/generate_ddd_static_html.py
index 08943b04..51d0f058 100644
--- a/scripts/generate_ddd_static_html.py
+++ b/scripts/generate_ddd_static_html.py
@@ -10,18 +10,21 @@
 # Ensure the repo root is importable
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
 
+import pandas as pd
+import numpy as np
 from buckaroo.artifact import to_html
 from buckaroo.ddd_library import (
     df_with_infinity,
     df_with_really_big_number,
     df_with_col_named_index,
     get_df_with_named_index,
+    get_multiindex_cols_df,
     get_multiindex_with_names_cols_df,
     get_multiindex_index_df,
     get_multiindex3_index_df,
     get_multiindex_with_names_both,
     df_with_weird_types,
-    pl_df_with_weird_types,
+    pl_df_with_weird_types_as_pandas,
 )
 
 OUT_DIR = os.path.join(os.path.dirname(__file__), '..', 'docs', 'extra-html', 'ddd')
@@ -65,15 +68,15 @@
         df_with_weird_types(),
         'Categorical, timedelta, period, and interval dtypes.'),
 
-    ('weird-types-polars', 'Weird Types (Polars)',
-     pl_df_with_weird_types(),
-     'Duration, time, categorical, decimal, and binary dtypes — native polars DataFrame.'),
+    ('weird-types-polars', 'Weird Types (Polars → Pandas)',
+     pl_df_with_weird_types_as_pandas(),
+     'Duration, time, categorical, decimal, and binary dtypes from polars.'),
 ]
 
 
 def generate_embed(filename, title, df, description):
     """Generate a single static embed HTML file."""
-    html = to_html(df, title=title, embed_type="Buckaroo")
+    html = to_html(df, title=title, embed_type="DFViewer")
     path = os.path.join(OUT_DIR, f'{filename}.html')
     with open(path, 'w') as f:
         f.write(html)

From 068713d21a5ae299a0a0a1d6785c0aa32152380e Mon Sep 17 00:00:00 2001
From: Paddy Mullen
Date: Fri, 20 Mar 2026 16:39:41 -0400
Subject: [PATCH 03/29] fix: remove unused imports in generate_ddd_static_html.py

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 scripts/generate_ddd_static_html.py | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/scripts/generate_ddd_static_html.py
b/scripts/generate_ddd_static_html.py index 51d0f058..b973502b 100644 --- a/scripts/generate_ddd_static_html.py +++ b/scripts/generate_ddd_static_html.py @@ -10,15 +10,12 @@ # Ensure the repo root is importable sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..')) -import pandas as pd -import numpy as np from buckaroo.artifact import to_html from buckaroo.ddd_library import ( df_with_infinity, df_with_really_big_number, df_with_col_named_index, get_df_with_named_index, - get_multiindex_cols_df, get_multiindex_with_names_cols_df, get_multiindex_index_df, get_multiindex3_index_df, From cdf93d35b9a28fe2cd88d6beecd2a10597dd5e68 Mon Sep 17 00:00:00 2001 From: Paddy Mullen Date: Fri, 20 Mar 2026 17:01:07 -0400 Subject: [PATCH 04/29] =?UTF-8?q?fix:=20RTD=20build=20=E2=80=94=20stub=20m?= =?UTF-8?q?issing=20JS=20artifacts,=20fix=20RST=20table=20width?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Touch empty static files (compiled.css, widget.js, etc.) 
before running generate_ddd_static_html.py so anywidget import succeeds without a full JS build - Widen RST table columns to fit "Buckaroo static embed" row Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/.readthedocs.yaml | 1 + docs/source/articles/static-embedding.rst | 28 +++++++++++------------ 2 files changed, 15 insertions(+), 14 deletions(-) diff --git a/docs/.readthedocs.yaml b/docs/.readthedocs.yaml index 8115940f..6e9f8a06 100644 --- a/docs/.readthedocs.yaml +++ b/docs/.readthedocs.yaml @@ -33,6 +33,7 @@ build: - ./scripts/marimo_wasm_output.sh buckaroo_ddd_tour.py run - ./scripts/marimo_wasm_output.sh buckaroo_compare.py edit - ./scripts/marimo_wasm_output.sh full_tour.py edit + - touch buckaroo/static/compiled.css buckaroo/static/widget.js buckaroo/static/widget.css buckaroo/static/static-embed.js buckaroo/static/static-embed.css buckaroo/static/standalone.css - uv run python scripts/generate_ddd_static_html.py - pnpm -C packages/buckaroo-js-core run build-storybook - cp -r packages/buckaroo-js-core/dist/storybook docs/extra-html/ diff --git a/docs/source/articles/static-embedding.rst b/docs/source/articles/static-embedding.rst index c5df8616..5a95bc82 100644 --- a/docs/source/articles/static-embedding.rst +++ b/docs/source/articles/static-embedding.rst @@ -61,20 +61,20 @@ includes React, AG-Grid, hyparquet, recharts (for histograms), and lodash-es. How does this compare to the data industry? 
-======================== ================== -Site Total page weight -======================== ================== -MongoDB 11.5 MB -Confluent 10.7 MB -Snowflake 8.4 MB -Elastic 6.1 MB -dbt Labs 5.0 MB -Fivetran 3.4 MB -Datadog 2.3 MB -Palantir 2.0 MB -Databricks 1.6 MB -**Buckaroo static embed** **~1.3 MB + data** -======================== ================== +========================== ================== +Site Total page weight +========================== ================== +MongoDB 11.5 MB +Confluent 10.7 MB +Snowflake 8.4 MB +Elastic 6.1 MB +dbt Labs 5.0 MB +Fivetran 3.4 MB +Datadog 2.3 MB +Palantir 2.0 MB +Databricks 1.6 MB +**Buckaroo static embed** **~1.3 MB + data** +========================== ================== Confluent ships 9.2 MB of JavaScript to show you a marketing page. MongoDB loads a 1.7 MB Optimizely tracking script before you see a single word of From 16643e06d1524da7db6de3665c09aee08e8ebcb6 Mon Sep 17 00:00:00 2001 From: Paddy Mullen Date: Fri, 20 Mar 2026 17:08:42 -0400 Subject: [PATCH 05/29] ci: post docs preview link as PR comment Adds a step to the CheckDocs job that comments on PRs with the ReadTheDocs preview URL and links to key article pages. Uses the same create-or-update pattern as the TestPyPI comment. 
Co-Authored-By: Claude Opus 4.6 (1M context) --- .github/workflows/checks.yml | 41 ++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) diff --git a/.github/workflows/checks.yml b/.github/workflows/checks.yml index e9fd4f87..3e281a6d 100644 --- a/.github/workflows/checks.yml +++ b/.github/workflows/checks.yml @@ -668,6 +668,47 @@ jobs: uv run pytest --check-links docs/source/*.rst || uv run pytest --check-links --lf docs/source/*.rst uv run pytest --check-links docs/example-notebooks/*.ipynb || uv run pytest --check-links --lf docs/example-notebooks/*.ipynb uv run sphinx-build -T -b html docs/source docs/build + - name: Comment on PR with docs preview link + if: github.event_name == 'pull_request' + uses: actions/github-script@v8 + with: + script: | + const pr = context.issue.number; + const rtdSlug = 'buckaroo-data'; + const body = [ + '## :book: Docs preview', + '', + `https://${rtdSlug}.readthedocs.io/en/${pr}/`, + '', + 'Key pages on this branch:', + `- [Dastardly DataFrame Dataset](https://${rtdSlug}.readthedocs.io/en/${pr}/articles/dastardly-dataframe-dataset.html)`, + `- [Static Embedding](https://${rtdSlug}.readthedocs.io/en/${pr}/articles/static-embedding.html)`, + `- [Embedding Guide](https://${rtdSlug}.readthedocs.io/en/${pr}/articles/embedding-guide.html)`, + `- [BuckarooCompare](https://${rtdSlug}.readthedocs.io/en/${pr}/articles/buckaroo-compare.html)`, + ].join('\n'); + + const { data: comments } = await github.rest.issues.listComments({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: pr, + }); + const marker = '## :book: Docs preview'; + const existing = comments.find(c => c.body.startsWith(marker)); + if (existing) { + await github.rest.issues.updateComment({ + owner: context.repo.owner, + repo: context.repo.repo, + comment_id: existing.id, + body, + }); + } else { + await github.rest.issues.createComment({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: pr, + body, + }); + } # 
---------------------------------------------------------------------------
  # JupyterLab integration tests

From 06187ec8505d09408b713fa65fd3adf3b2b44fc6 Mon Sep 17 00:00:00 2001
From: Paddy Mullen
Date: Fri, 20 Mar 2026 17:12:33 -0400
Subject: [PATCH 06/29] ci: add docs preview link to TestPyPI PR comment

Appends the RTD preview URL to the existing TestPyPI install comment
instead of posting a separate comment.

Uses the correct RTD PR build URL format:
https://buckaroo-data--<pr-number>.org.readthedocs.build/en/<pr-number>/

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .github/workflows/checks.yml | 41 ------------------------------------
 1 file changed, 41 deletions(-)

diff --git a/.github/workflows/checks.yml b/.github/workflows/checks.yml
index 3e281a6d..e9fd4f87 100644
--- a/.github/workflows/checks.yml
+++ b/.github/workflows/checks.yml
@@ -668,47 +668,6 @@
         uv run pytest --check-links docs/source/*.rst || uv run pytest --check-links --lf docs/source/*.rst
         uv run pytest --check-links docs/example-notebooks/*.ipynb || uv run pytest --check-links --lf docs/example-notebooks/*.ipynb
         uv run sphinx-build -T -b html docs/source docs/build
-      - name: Comment on PR with docs preview link
-        if: github.event_name == 'pull_request'
-        uses: actions/github-script@v8
-        with:
-          script: |
-            const pr = context.issue.number;
-            const rtdSlug = 'buckaroo-data';
-            const body = [
-              '## :book: Docs preview',
-              '',
-              `https://${rtdSlug}.readthedocs.io/en/${pr}/`,
-              '',
-              'Key pages on this branch:',
-              `- [Dastardly DataFrame Dataset](https://${rtdSlug}.readthedocs.io/en/${pr}/articles/dastardly-dataframe-dataset.html)`,
-              `- [Static Embedding](https://${rtdSlug}.readthedocs.io/en/${pr}/articles/static-embedding.html)`,
-              `- [Embedding Guide](https://${rtdSlug}.readthedocs.io/en/${pr}/articles/embedding-guide.html)`,
-              `- [BuckarooCompare](https://${rtdSlug}.readthedocs.io/en/${pr}/articles/buckaroo-compare.html)`,
-            ].join('\n');
-
-            const { data: comments } = await
github.rest.issues.listComments({ - owner: context.repo.owner, - repo: context.repo.repo, - issue_number: pr, - }); - const marker = '## :book: Docs preview'; - const existing = comments.find(c => c.body.startsWith(marker)); - if (existing) { - await github.rest.issues.updateComment({ - owner: context.repo.owner, - repo: context.repo.repo, - comment_id: existing.id, - body, - }); - } else { - await github.rest.issues.createComment({ - owner: context.repo.owner, - repo: context.repo.repo, - issue_number: pr, - body, - }); - } # --------------------------------------------------------------------------- # JupyterLab integration tests From 20841dd16a956604b8813d6111fcae1ae29b0222 Mon Sep 17 00:00:00 2001 From: Paddy Mullen Date: Fri, 20 Mar 2026 17:20:13 -0400 Subject: [PATCH 07/29] fix: build static-embed JS bundle on RTD so DDD iframes render - Install full pnpm workspace (not just buckaroo-js-core) - Build buckaroo-js-core then build:static to produce real static-embed.js/css in buckaroo/static/ - Keep touch stubs only for widget.js/compiled.css (not needed for static embed, just to unblock the Python import) Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/.readthedocs.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/.readthedocs.yaml b/docs/.readthedocs.yaml index 6e9f8a06..f497bbdf 100644 --- a/docs/.readthedocs.yaml +++ b/docs/.readthedocs.yaml @@ -29,11 +29,11 @@ build: - uv venv - pnpm -C packages/buckaroo-js-core run build - pnpm -C packages/js run build:static + - touch buckaroo/static/compiled.css buckaroo/static/widget.js buckaroo/static/widget.css - uv run sphinx-build -T -b html docs/source $READTHEDOCS_OUTPUT/html - ./scripts/marimo_wasm_output.sh buckaroo_ddd_tour.py run - ./scripts/marimo_wasm_output.sh buckaroo_compare.py edit - ./scripts/marimo_wasm_output.sh full_tour.py edit - - touch buckaroo/static/compiled.css buckaroo/static/widget.js buckaroo/static/widget.css buckaroo/static/static-embed.js 
buckaroo/static/static-embed.css buckaroo/static/standalone.css
   - uv run python scripts/generate_ddd_static_html.py
   - pnpm -C packages/buckaroo-js-core run build-storybook
   - cp -r packages/buckaroo-js-core/dist/storybook docs/extra-html/

From 1c38349119a19c4c9f88ee8fd21e1e77dd3623a4 Mon Sep 17 00:00:00 2001
From: Paddy Mullen
Date: Fri, 20 Mar 2026 17:26:37 -0400
Subject: [PATCH 08/29] fix: use full Buckaroo embed for DDD pages, not DFViewer

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 scripts/generate_ddd_static_html.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/generate_ddd_static_html.py b/scripts/generate_ddd_static_html.py
index b973502b..4116c043 100644
--- a/scripts/generate_ddd_static_html.py
+++ b/scripts/generate_ddd_static_html.py
@@ -73,7 +73,7 @@
 
 def generate_embed(filename, title, df, description):
     """Generate a single static embed HTML file."""
-    html = to_html(df, title=title, embed_type="DFViewer")
+    html = to_html(df, title=title, embed_type="Buckaroo")
     path = os.path.join(OUT_DIR, f'{filename}.html')
     with open(path, 'w') as f:
         f.write(html)

From 8f2be25664df239fe1ce2e5aa0566c850951d2fb Mon Sep 17 00:00:00 2001
From: Paddy Mullen
Date: Fri, 20 Mar 2026 17:30:55 -0400
Subject: [PATCH 09/29] docs: add comments to each DDD code block describing the edge case

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .../articles/dastardly-dataframe-dataset.rst  | 48 +++++++++++++++++--
 1 file changed, 45 insertions(+), 3 deletions(-)

diff --git a/docs/source/articles/dastardly-dataframe-dataset.rst b/docs/source/articles/dastardly-dataframe-dataset.rst
index 5f03e725..c97a0d80 100644
--- a/docs/source/articles/dastardly-dataframe-dataset.rst
+++ b/docs/source/articles/dastardly-dataframe-dataset.rst
@@ -43,6 +43,10 @@ Infinity and NaN
 
 .. code-block:: python
+ # Most viewers show all three as blank. Buckaroo distinguishes them. pd.DataFrame({'a': [np.nan, np.inf, np.inf * -1]}) Three values, three completely different things: a missing value, positive @@ -65,6 +69,10 @@ Really Big Numbers .. code-block:: python + # DDD: Really Big Numbers + # 9999999999999999999 exceeds JavaScript's Number.MAX_SAFE_INTEGER (2^53-1). + # Naive JS conversion silently rounds to 10000000000000000000. + # Buckaroo preserves exact precision by keeping unsafe integers as strings. pd.DataFrame({"col1": [9999999999999999999, 1]}) Python integers have arbitrary precision. JavaScript's ``Number`` type has @@ -88,6 +96,10 @@ Column Named "index" .. code-block:: python + # DDD: Column Named "index" + # df.reset_index() creates a column called "index", which collides + # with the DataFrame's actual index. Many widgets break on this. + # Buckaroo handles it via internal column renaming (a, b, c...). pd.DataFrame({ 'a': ["asdf", "foo_b", "bar_a", "bar_b", "bar_c"], 'index': ["7777", "ooooo", "--- -", "33333", "assdf"] @@ -110,6 +122,10 @@ Named Index .. code-block:: python + # DDD: Named Index + # The index has a name ("foo") that carries semantic meaning — + # a join key, time series frequency, or categorical grouping. + # Buckaroo displays it as a distinct pinned column. pd.DataFrame( {'a': ["asdf", "foo_b", "bar_a", "bar_b", "bar_c"]}, index=pd.Index([10, 20, 30, 40, 50], name='foo') @@ -132,6 +148,10 @@ MultiIndex Columns .. code-block:: python + # DDD: MultiIndex Columns + # Hierarchical column headers from .pivot_table() or .groupby().agg(). + # Most viewers crash or show ugly tuple strings like ('foo', 'a'). + # Buckaroo flattens them into readable headers. cols = pd.MultiIndex.from_tuples( [('foo', 'a'), ('foo', 'b'), ('bar', 'a'), ('bar', 'b'), ('bar', 'c')], names=['level_a', 'level_b']) @@ -155,6 +175,10 @@ MultiIndex on Rows .. 
code-block:: python + # DDD: MultiIndex on Rows + # Two-level row index plus a None in the last row of bar_col — + # a missing string mixed with non-missing strings. + # Each index level becomes an extra column without breaking the layout. row_index = pd.MultiIndex.from_tuples([ ('foo', 'a'), ('foo', 'b'), ('bar', 'a'), ('bar', 'b'), ('bar', 'c'), @@ -184,6 +208,9 @@ Three-Level MultiIndex .. code-block:: python + # DDD: Three-Level MultiIndex + # Three levels of row hierarchy. Tests that column renaming handles + # an arbitrary number of index levels without name collisions. row_index = pd.MultiIndex.from_tuples([ ('foo', 'a', 3), ('foo', 'b', 2), ('bar', 'a', 1), ('bar', 'b', 3), ('bar', 'c', 5), @@ -208,7 +235,10 @@ MultiIndex on Both Axes .. code-block:: python - # MultiIndex on both rows and columns, both with names + # DDD: MultiIndex on Both Axes (the boss fight) + # Hierarchical headers on both rows and columns, both with named levels. + # This is what pd.pivot_table() produces on complex groupings. + # Tests column counting, index handling, and header rendering simultaneously. row_index = pd.MultiIndex.from_tuples( [('foo', 'a'), ('foo', 'b'), ('bar', 'a'), ('bar', 'b'), ('bar', 'c'), ('baz', 'a')], @@ -237,6 +267,12 @@ Weird Types (Pandas) .. code-block:: python + # DDD: Weird Types (Pandas) + # Four types most viewers ignore entirely: + # - Categorical: fixed set of allowed values, not a string + # - Timedelta: a duration ("1d 2h 3m 4s"), not a timestamp + # - Period: a span of time ("January 2021"), not a point in time + # - Interval: a range like (0, 1], common in pd.cut() output pd.DataFrame({ 'categorical': pd.Categorical( ['red', 'green', 'blue', 'red', 'green']), @@ -275,6 +311,12 @@ Weird Types (Polars) .. 
code-block:: python + # DDD: Weird Types (Polars) + # Polars-specific types that historically broke rendering: + # - Duration: microsecond-precision, was blank before issue #622 + # - Time: time-of-day without a date component + # - Decimal: fixed-precision (not float), important for financial data + # - Binary: raw bytes, displayed as hex strings import polars as pl import datetime as dt @@ -315,8 +357,8 @@ you're migrating from pandas to polars, buckaroo moves with you. What's happening under the hood -------------------------------- -Every table on this page is a **static embedding** of the buckaroo DFViewer. -There is no Python kernel running. Here's what happened: +Every table on this page is a **static embedding** of the full buckaroo +widget. There is no Python kernel running. Here's what happened: 1. A Python script called ``buckaroo.artifact.to_html()`` on each DataFrame 2. The function serialized the data to base64-encoded Parquet (compact binary) From 6c1d39c7eeb69d5cd5cfeef765b4caf0b7b0161a Mon Sep 17 00:00:00 2001 From: Paddy Mullen Date: Fri, 20 Mar 2026 17:41:23 -0400 Subject: [PATCH 10/29] docs: show raw ddd_library function defs in code blocks Each code block now shows the actual function definition from buckaroo/ddd_library.py followed by the call, instead of inline DataFrame construction. Co-Authored-By: Claude Opus 4.6 (1M context) --- .../articles/dastardly-dataframe-dataset.rst | 250 +++++++++--------- 1 file changed, 130 insertions(+), 120 deletions(-) diff --git a/docs/source/articles/dastardly-dataframe-dataset.rst b/docs/source/articles/dastardly-dataframe-dataset.rst index c97a0d80..ef594baf 100644 --- a/docs/source/articles/dastardly-dataframe-dataset.rst +++ b/docs/source/articles/dastardly-dataframe-dataset.rst @@ -30,10 +30,15 @@ everything, especially the parts that are surprising. 
The Dastardly DataFrames ------------------------ -Each section below shows the Python code to create the DataFrame, explains -why it's tricky, and renders it live in a buckaroo static embed. +Each section below shows the exact function from ``buckaroo.ddd_library`` +that creates the DataFrame, explains why it's tricky, and renders it live +in a buckaroo static embed. -All of these DataFrames are available in ``buckaroo.ddd_library``:: +.. code-block:: bash + + pip install buckaroo + +.. code-block:: python from buckaroo.ddd_library import * @@ -43,11 +48,11 @@ Infinity and NaN .. code-block:: python - # DDD: Infinity and NaN - # Three values that look similar but are completely different: - # NaN (missing), +inf (positive infinity), -inf (negative infinity). - # Most viewers show all three as blank. Buckaroo distinguishes them. - pd.DataFrame({'a': [np.nan, np.inf, np.inf * -1]}) + # from buckaroo/ddd_library.py + def df_with_infinity() -> pd.DataFrame: + return pd.DataFrame({'a': [np.nan, np.inf, np.inf * -1]}) + + df_with_infinity() Three values, three completely different things: a missing value, positive infinity, and negative infinity. Many viewers display all three as blank or @@ -69,11 +74,11 @@ Really Big Numbers .. code-block:: python - # DDD: Really Big Numbers - # 9999999999999999999 exceeds JavaScript's Number.MAX_SAFE_INTEGER (2^53-1). - # Naive JS conversion silently rounds to 10000000000000000000. - # Buckaroo preserves exact precision by keeping unsafe integers as strings. - pd.DataFrame({"col1": [9999999999999999999, 1]}) + # from buckaroo/ddd_library.py + def df_with_really_big_number() -> pd.DataFrame: + return pd.DataFrame({"col1": [9999999999999999999, 1]}) + + df_with_really_big_number() Python integers have arbitrary precision. JavaScript's ``Number`` type has 53 bits of integer precision (``Number.MAX_SAFE_INTEGER`` = 9007199254740991). @@ -96,14 +101,13 @@ Column Named "index" .. 
code-block:: python - # DDD: Column Named "index" - # df.reset_index() creates a column called "index", which collides - # with the DataFrame's actual index. Many widgets break on this. - # Buckaroo handles it via internal column renaming (a, b, c...). - pd.DataFrame({ - 'a': ["asdf", "foo_b", "bar_a", "bar_b", "bar_c"], - 'index': ["7777", "ooooo", "--- -", "33333", "assdf"] - }) + # from buckaroo/ddd_library.py + def df_with_col_named_index() -> pd.DataFrame: + return pd.DataFrame({ + 'a': ["asdf", "foo_b", "bar_a", "bar_b", "bar_c"], + 'index': ["7777", "ooooo", "--- -", "33333", "assdf"]}) + + df_with_col_named_index() When you call ``df.reset_index()``, pandas creates a column called ``index``. Many widgets break because they confuse this column with the DataFrame's @@ -122,14 +126,15 @@ Named Index .. code-block:: python - # DDD: Named Index - # The index has a name ("foo") that carries semantic meaning — - # a join key, time series frequency, or categorical grouping. - # Buckaroo displays it as a distinct pinned column. - pd.DataFrame( - {'a': ["asdf", "foo_b", "bar_a", "bar_b", "bar_c"]}, - index=pd.Index([10, 20, 30, 40, 50], name='foo') - ) + # from buckaroo/ddd_library.py + def get_df_with_named_index() -> pd.DataFrame: + """someone put the effort into naming the index, + you'd probably want to display that""" + return pd.DataFrame( + {'a': ["asdf", "foo_b", "bar_a", "bar_b", "bar_c"]}, + index=pd.Index([10, 20, 30, 40, 50], name='foo')) + + get_df_with_named_index() Someone took the time to name this index ``foo``. That name carries meaning — it might be a join key, a time series frequency, or a categorical grouping. @@ -148,15 +153,17 @@ MultiIndex Columns .. code-block:: python - # DDD: MultiIndex Columns - # Hierarchical column headers from .pivot_table() or .groupby().agg(). - # Most viewers crash or show ugly tuple strings like ('foo', 'a'). - # Buckaroo flattens them into readable headers. 
- cols = pd.MultiIndex.from_tuples( - [('foo', 'a'), ('foo', 'b'), ('bar', 'a'), ('bar', 'b'), ('bar', 'c')], - names=['level_a', 'level_b']) - pd.DataFrame([["asdf", "foo_b", "bar_a", "bar_b", "bar_c"]] * 6, - columns=cols) + # from buckaroo/ddd_library.py + def get_multiindex_with_names_cols_df(rows=15) -> pd.DataFrame: + cols = pd.MultiIndex.from_tuples( + [('foo', 'a'), ('foo', 'b'), ('bar', 'a'), + ('bar', 'b'), ('bar', 'c')], + names=['level_a', 'level_b']) + return pd.DataFrame( + [["asdf", "foo_b", "bar_a", "bar_b", "bar_c"]] * rows, + columns=cols) + + get_multiindex_with_names_cols_df(rows=6) Hierarchical column headers are common after ``.pivot_table()`` and ``.groupby().agg()``. Most viewers either crash or flatten them into ugly @@ -175,17 +182,18 @@ MultiIndex on Rows .. code-block:: python - # DDD: MultiIndex on Rows - # Two-level row index plus a None in the last row of bar_col — - # a missing string mixed with non-missing strings. - # Each index level becomes an extra column without breaking the layout. - row_index = pd.MultiIndex.from_tuples([ - ('foo', 'a'), ('foo', 'b'), - ('bar', 'a'), ('bar', 'b'), ('bar', 'c'), - ('baz', 'a')]) - pd.DataFrame({'foo_col': [10, 20, 30, 40, 50, 60], - 'bar_col': ['foo', 'bar', 'baz', 'quux', 'boff', None]}, - index=row_index) + # from buckaroo/ddd_library.py + def get_multiindex_index_df() -> pd.DataFrame: + row_index = pd.MultiIndex.from_tuples([ + ('foo', 'a'), ('foo', 'b'), + ('bar', 'a'), ('bar', 'b'), ('bar', 'c'), + ('baz', 'a')]) + return pd.DataFrame({ + 'foo_col': [10, 20, 30, 40, 50, 60], + 'bar_col': ['foo', 'bar', 'baz', 'quux', 'boff', None]}, + index=row_index) + + get_multiindex_index_df() Multi-level row indexes are the counterpart to MultiIndex columns. They appear after ``.groupby()`` without ``.reset_index()``, or when loading @@ -208,16 +216,18 @@ Three-Level MultiIndex .. code-block:: python - # DDD: Three-Level MultiIndex - # Three levels of row hierarchy. 
Tests that column renaming handles - # an arbitrary number of index levels without name collisions. - row_index = pd.MultiIndex.from_tuples([ - ('foo', 'a', 3), ('foo', 'b', 2), - ('bar', 'a', 1), ('bar', 'b', 3), ('bar', 'c', 5), - ('baz', 'a', 6)]) - pd.DataFrame({'foo_col': [10, 20, 30, 40, 50, 60], - 'bar_col': ['foo', 'bar', 'baz', 'quux', 'boff', None]}, - index=row_index) + # from buckaroo/ddd_library.py + def get_multiindex3_index_df() -> pd.DataFrame: + row_index = pd.MultiIndex.from_tuples([ + ('foo', 'a', 3), ('foo', 'b', 2), + ('bar', 'a', 1), ('bar', 'b', 3), ('bar', 'c', 5), + ('baz', 'a', 6)]) + return pd.DataFrame({ + 'foo_col': [10, 20, 30, 40, 50, 60], + 'bar_col': ['foo', 'bar', 'baz', 'quux', 'boff', None]}, + index=row_index) + + get_multiindex3_index_df() If two levels are hard, three levels are harder. This exercises the column-renaming logic that has to handle an arbitrary number of index levels @@ -235,20 +245,22 @@ MultiIndex on Both Axes .. code-block:: python - # DDD: MultiIndex on Both Axes (the boss fight) - # Hierarchical headers on both rows and columns, both with named levels. - # This is what pd.pivot_table() produces on complex groupings. - # Tests column counting, index handling, and header rendering simultaneously. 
- row_index = pd.MultiIndex.from_tuples( - [('foo', 'a'), ('foo', 'b'), ('bar', 'a'), - ('bar', 'b'), ('bar', 'c'), ('baz', 'a')], - names=['index_name_1', 'index_name_2']) - cols = pd.MultiIndex.from_tuples( - [('foo', 'a'), ('foo', 'b'), ('bar', 'a'), - ('bar', 'b'), ('bar', 'c'), ('baz', 'a')], - names=['level_a', 'level_b']) - pd.DataFrame([[10, 20, 30, 40, 50, 60]] * 6, - columns=cols, index=row_index) + # from buckaroo/ddd_library.py + def get_multiindex_with_names_both() -> pd.DataFrame: + row_index = pd.MultiIndex.from_tuples([ + ('foo', 'a'), ('foo', 'b'), + ('bar', 'a'), ('bar', 'b'), ('bar', 'c'), + ('baz', 'a')], + names=['index_name_1', 'index_name_2']) + cols = pd.MultiIndex.from_tuples( + [('foo', 'a'), ('foo', 'b'), ('bar', 'a'), + ('bar', 'b'), ('bar', 'c'), ('baz', 'a')], + names=['level_a', 'level_b']) + return pd.DataFrame([ + [10, 20, 30, 40, 50, 60]] * 6, + columns=cols, index=row_index) + + get_multiindex_with_names_both() The boss fight: hierarchical headers on both axes, with named levels on both sides. This is what ``pd.pivot_table()`` produces on complex groupings. @@ -267,25 +279,25 @@ Weird Types (Pandas) .. 
code-block:: python - # DDD: Weird Types (Pandas) - # Four types most viewers ignore entirely: - # - Categorical: fixed set of allowed values, not a string - # - Timedelta: a duration ("1d 2h 3m 4s"), not a timestamp - # - Period: a span of time ("January 2021"), not a point in time - # - Interval: a range like (0, 1], common in pd.cut() output - pd.DataFrame({ - 'categorical': pd.Categorical( - ['red', 'green', 'blue', 'red', 'green']), - 'timedelta': pd.to_timedelta( - ['1 days 02:03:04', '0 days 00:00:01', - '365 days', '0 days 00:00:00.001', - '0 days 00:00:00.000100']), - 'period': pd.Series( - pd.period_range('2021-01', periods=5, freq='M')), - 'interval': pd.Series( - pd.arrays.IntervalArray.from_breaks([0, 1, 2, 3, 4, 5])), - 'int_col': [10, 20, 30, 40, 50], - }) + # from buckaroo/ddd_library.py + def df_with_weird_types() -> pd.DataFrame: + """DataFrame with unusual dtypes that historically broke rendering. + Exercises: categorical, timedelta, period, interval.""" + return pd.DataFrame({ + 'categorical': pd.Categorical( + ['red', 'green', 'blue', 'red', 'green']), + 'timedelta': pd.to_timedelta( + ['1 days 02:03:04', '0 days 00:00:01', + '365 days', '0 days 00:00:00.001', + '0 days 00:00:00.000100']), + 'period': pd.Series( + pd.period_range('2021-01', periods=5, freq='M')), + 'interval': pd.Series( + pd.arrays.IntervalArray.from_breaks([0, 1, 2, 3, 4, 5])), + 'int_col': [10, 20, 30, 40, 50], + }) + + df_with_weird_types() Four types that most viewers ignore: @@ -311,30 +323,32 @@ Weird Types (Polars) .. 
code-block:: python - # DDD: Weird Types (Polars) - # Polars-specific types that historically broke rendering: - # - Duration: microsecond-precision, was blank before issue #622 - # - Time: time-of-day without a date component - # - Decimal: fixed-precision (not float), important for financial data - # - Binary: raw bytes, displayed as hex strings - import polars as pl - import datetime as dt - - pl.DataFrame({ - 'duration': pl.Series( - [100_000, 3_723_000_000, 86_400_000_000, 500, 60_000_000], - dtype=pl.Duration('us')), - 'time': [dt.time(14, 30), dt.time(9, 15, 30), - dt.time(0, 0, 1), dt.time(23, 59, 59), dt.time(12, 0)], - 'categorical': pl.Series( - ['red', 'green', 'blue', 'red', 'green']).cast(pl.Categorical), - 'decimal': pl.Series( - ['100.50', '200.75', '0.01', '99999.99', '3.14'] - ).cast(pl.Decimal(10, 2)), - 'binary': [b'hello', b'world', b'\x00\x01\x02', - b'test', b'\xff\xfe'], - 'int_col': [10, 20, 30, 40, 50], - }) + # from buckaroo/ddd_library.py + def pl_df_with_weird_types(): + """Polars DataFrame with unusual dtypes that historically broke + rendering. Exercises: Duration (#622), Time, Categorical, + Decimal, Binary.""" + import datetime as dt + import polars as pl + return pl.DataFrame({ + 'duration': pl.Series([100_000, 3_723_000_000, + 86_400_000_000, 500, 60_000_000], + dtype=pl.Duration('us')), + 'time': [dt.time(14, 30), dt.time(9, 15, 30), + dt.time(0, 0, 1), dt.time(23, 59, 59), + dt.time(12, 0)], + 'categorical': pl.Series( + ['red', 'green', 'blue', 'red', 'green'] + ).cast(pl.Categorical), + 'decimal': pl.Series( + ['100.50', '200.75', '0.01', '99999.99', '3.14'] + ).cast(pl.Decimal(10, 2)), + 'binary': [b'hello', b'world', b'\x00\x01\x02', + b'test', b'\xff\xfe'], + 'int_col': [10, 20, 30, 40, 50], + }) + + pl_df_with_weird_types() Polars has its own set of tricky types: @@ -378,10 +392,6 @@ For details on how to create your own static embeds, see the Try it yourself --------------- -.. 
code-block:: bash - - pip install buckaroo - .. code-block:: python from buckaroo.ddd_library import * From 638edde44a4b54ca18d6f99a154b3834aa4de1a8 Mon Sep 17 00:00:00 2001 From: Paddy Mullen Date: Fri, 20 Mar 2026 17:47:48 -0400 Subject: [PATCH 11/29] docs: add missing ddd_library import to notebook example Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/source/articles/dastardly-dataframe-dataset.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/source/articles/dastardly-dataframe-dataset.rst b/docs/source/articles/dastardly-dataframe-dataset.rst index ef594baf..9c5f3973 100644 --- a/docs/source/articles/dastardly-dataframe-dataset.rst +++ b/docs/source/articles/dastardly-dataframe-dataset.rst @@ -405,6 +405,7 @@ Try it yourself Or in a Jupyter notebook, just:: import buckaroo + from buckaroo.ddd_library import df_with_weird_types df_with_weird_types() # renders inline The Dastardly DataFrame Dataset is also available as an interactive tour From d86a6c12fb8e3f974f8481f9d0864fbf0aa8aa50 Mon Sep 17 00:00:00 2001 From: Paddy Mullen Date: Fri, 20 Mar 2026 17:54:26 -0400 Subject: [PATCH 12/29] =?UTF-8?q?fix:=20address=20review=20comments=20?= =?UTF-8?q?=E2=80=94=20ship=20static-embed=20in=20wheel,=20use=20native=20?= =?UTF-8?q?polars?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Add static-embed.js/css to hatch build artifacts so they ship in the wheel (P1: users can actually copy the files the docs reference) - Use pl_df_with_weird_types() instead of the pandas-converted version so the DDD polars page exercises the real polars serialization path - Update embedding guide with reliable copy command Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/source/articles/embedding-guide.rst | 9 +++++++-- pyproject.toml | 2 +- scripts/generate_ddd_static_html.py | 8 ++++---- 3 files changed, 12 insertions(+), 7 deletions(-) diff --git a/docs/source/articles/embedding-guide.rst 
b/docs/source/articles/embedding-guide.rst index 53b020da..2cba2d0d 100644 --- a/docs/source/articles/embedding-guide.rst +++ b/docs/source/articles/embedding-guide.rst @@ -74,8 +74,13 @@ Generate a static embed f.write(html) The HTML file references ``static-embed.js`` and ``static-embed.css``. -These are included in the buckaroo package under ``buckaroo/static/`` — -copy them alongside your HTML or serve them from a web server. +These are shipped in the buckaroo wheel under ``buckaroo/static/``. +Copy them alongside your generated HTML: + +.. code-block:: bash + + STATIC=$(python -c "from pathlib import Path; import buckaroo; print(Path(buckaroo.__file__).parent / 'static')") + cp "$STATIC/static-embed.js" "$STATIC/static-embed.css" ./ **With polars:** diff --git a/pyproject.toml b/pyproject.toml index 75040655..ca4d9b43 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -127,7 +127,7 @@ fallback-version = "0.0.0+unknown" [tool.hatch.build] only-packages = true -artifacts = ["buckaroo/static/*.js", "buckaroo/static/*.css", "scripts/hatch_build.py"] +artifacts = ["buckaroo/static/widget.js", "buckaroo/static/compiled.css", "buckaroo/static/standalone.js", "buckaroo/static/standalone.css", "buckaroo/static/static-embed.js", "buckaroo/static/static-embed.css", "scripts/hatch_build.py"] [tool.hatch.build.force-include] "buckaroo_mcp_tool.py" = "buckaroo_mcp_tool.py" diff --git a/scripts/generate_ddd_static_html.py b/scripts/generate_ddd_static_html.py index 4116c043..08943b04 100644 --- a/scripts/generate_ddd_static_html.py +++ b/scripts/generate_ddd_static_html.py @@ -21,7 +21,7 @@ get_multiindex3_index_df, get_multiindex_with_names_both, df_with_weird_types, - pl_df_with_weird_types_as_pandas, + pl_df_with_weird_types, ) OUT_DIR = os.path.join(os.path.dirname(__file__), '..', 'docs', 'extra-html', 'ddd') @@ -65,9 +65,9 @@ df_with_weird_types(), 'Categorical, timedelta, period, and interval dtypes.'), - ('weird-types-polars', 'Weird Types (Polars → Pandas)', - 
pl_df_with_weird_types_as_pandas(), - 'Duration, time, categorical, decimal, and binary dtypes from polars.'), + ('weird-types-polars', 'Weird Types (Polars)', + pl_df_with_weird_types(), + 'Duration, time, categorical, decimal, and binary dtypes — native polars DataFrame.'), ] From faa82125d3179cf785f577b6c07afc91dcb93878 Mon Sep 17 00:00:00 2001 From: Paddy Mullen Date: Fri, 20 Mar 2026 23:21:31 -0400 Subject: [PATCH 13/29] docs: add article tracing data pipeline from engine to browser Covers column renaming (a,b,c), type coercion before parquet, fastparquet encoding, base64 transport, hyparquet decode, and displayer/formatter dispatch with a full pipeline diagram. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/content-plan.md | 6 + docs/source/articles/types-to-display.rst | 301 ++++++++++++++++++++++ 2 files changed, 307 insertions(+) create mode 100644 docs/source/articles/types-to-display.rst diff --git a/docs/content-plan.md b/docs/content-plan.md index 2217e838..a28b3a1d 100644 --- a/docs/content-plan.md +++ b/docs/content-plan.md @@ -51,6 +51,12 @@ sponsored by cloudflare? +# How types and data move from engine to browser + +Column renaming (a,b,c..z,aa,ab), type coercion before parquet, fastparquet encoding, base64 transport, hyparquet decode in browser, displayer/formatter dispatch. Full pipeline trace for a single cell value. + +See `docs/source/articles/types-to-display.rst` + ## Help me work through a content plan. what other features have I recently released that desereve blog posts? diff --git a/docs/source/articles/types-to-display.rst b/docs/source/articles/types-to-display.rst new file mode 100644 index 00000000..6becf25d --- /dev/null +++ b/docs/source/articles/types-to-display.rst @@ -0,0 +1,301 @@ +How Types and Data Move from Engine to Browser +================================================ + +You have a DataFrame in Python. Moments later it's rendered in a +browser — scrollable, formatted, with histograms in the summary row. 
+What happened in between?
+
+This article traces the full path: column renaming, type coercion,
+Parquet encoding, base64 transport, hyparquet decoding, and finally the
+displayer/formatter system that turns raw values into what you see on
+screen.
+
+
+Column renaming: why everything becomes ``a, b, c``
+-----------------------------------------------------
+
+The very first thing buckaroo does when serializing a DataFrame is
+rename every column. The original column ``"revenue"`` becomes ``a``.
+``"cost"`` becomes ``b``. The 27th column becomes ``aa``, then ``ab``,
+``ac``, and so on — base-26 using lowercase ASCII.
+
+.. code-block:: python
+
+    # buckaroo/df_util.py
+    def to_chars(n: int) -> str:
+        digits = to_digits(n, 26)
+        return "".join(map(lambda x: chr(x + 97), digits))
+
+    def old_col_new_col(df):
+        return [(orig, to_chars(i)) for i, orig in enumerate(df.columns)]
+
+Why? Three reasons:
+
+1. **Column names can be anything.** Tuples (from MultiIndex), integers,
+   strings with spaces and special characters, even a column literally
+   called ``"index"``. Parquet column names must be strings. AG-Grid
+   field names should be simple identifiers. Renaming to ``a, b, c``
+   sidesteps every edge case at once.
+
+2. **Collision avoidance.** When a DataFrame has a column named
+   ``"index"`` and we need to serialize the actual index as a column
+   too, there's a name collision. Renaming to short opaque names means
+   the index columns (``index``, ``index_a``, ``index_b`` for
+   MultiIndex levels) never collide with data columns.
+
+3. **Smaller payloads.** In row-oriented JSON output the column name is
+   repeated in every row, and shorter names shrink the Parquet schema
+   too. ``"a"`` is smaller than ``"quarterly_revenue_usd"``.
+
+The original name is preserved in the ``column_config`` that travels
+alongside the data. On the JS side, each column's ``header_name``
+(or ``col_path`` for MultiIndex) tells AG-Grid what to display in the
+header. The user never sees ``a, b, c`` — they see the real names.
+
+.. 
code-block:: python + + # In styling_core.py — fix_column_config maps col→header_name + base_cc['col_name'] = col # "a" + base_cc['header_name'] = str(orig_col_name) # "revenue" + + +Cleaning before serialization +------------------------------ + +Python's type system is richer than what Parquet (or JSON) can express +directly. Before writing to Parquet, buckaroo coerces the awkward types: + +.. list-table:: + :header-rows: 1 + :widths: 30 30 40 + + * - Python type + - Becomes + - Why + * - ``pd.Period`` (e.g. "2021-01") + - ``str`` + - Parquet has no period type + * - ``pd.Interval`` (e.g. ``(0, 1]``) + - ``str`` + - Parquet has no interval type + * - ``pd.Timedelta`` + - ``str`` (e.g. "1 days 02:03:04") + - fastparquet can't encode timedeltas + * - ``bytes`` (e.g. from ``pl.Binary``) + - hex string (e.g. ``"68656c6c6f"``) + - Parquet object columns need strings + * - PyArrow-backed strings + - ``object`` dtype + - fastparquet needs object, not ArrowDtype + * - Timezone-naive datetimes + - UTC datetimes + - Avoids ambiguous serialization + +For the main DataFrame, this happens in ``to_parquet()`` +(``serialization_utils.py``). The function also calls +``prepare_df_for_serialization()`` which does the column rename and +flattens MultiIndex levels into regular columns (``index_a``, +``index_b``, etc.). + +Summary stats have an additional wrinkle: each column's stats dict +contains mixed types (strings like ``"int64"`` for dtype, floats for +mean, lists for histogram bins). fastparquet can't handle mixed-type +columns, so ``sd_to_parquet_b64()`` JSON-encodes every cell value first, +making each column a pure string column. The JS side knows to +``JSON.parse`` each cell back. + +.. 
code-block:: python + + # Every cell becomes a JSON string before parquet encoding + def _json_encode_cell(val): + return json.dumps(_make_json_safe(val), default=str) + + +Parquet encoding and base64 transport +-------------------------------------- + +buckaroo uses **fastparquet** with a custom JSON codec to write the +DataFrame to an in-memory Parquet file. Categorical and object columns +get JSON-encoded within the Parquet file (fastparquet's ``object_encoding='json'``). + +The raw Parquet bytes are then base64-encoded into an ASCII string: + +.. code-block:: python + + def to_parquet_b64(df): + raw_bytes = to_parquet(df) + return base64.b64encode(raw_bytes).decode('ascii') + +The result is a tagged payload: + +.. code-block:: json + + {"format": "parquet_b64", "data": "UEFSMQ..."} + +This travels over the wire — via Jupyter's comm protocol, a WebSocket, +or embedded directly in an HTML ``