diff --git a/docs/content-plan.md b/docs/content-plan.md new file mode 100644 index 00000000..5d52a702 --- /dev/null +++ b/docs/content-plan.md @@ -0,0 +1,62 @@ + +# Content Plan + +## Published (merged or ready to merge) + +### Dastardly DataFrame Dataset (PR #641) +Published at `docs/source/articles/dastardly-dataframe-dataset.rst`. Covers DDD with static embeds, full dtype coverage table, weird types for pandas and polars. Includes Polars DDD (issue #622). + +### How types and data move from engine to browser +Published at `docs/source/articles/types-to-display.rst`. Column renaming (a,b,c..z,aa,ab), type coercion before parquet, fastparquet encoding, base64 transport, hyparquet decode in browser, displayer/formatter dispatch. Full pipeline trace for a single cell value. + +### So you want to write a DataFrame viewer +Published at `docs/source/articles/so-you-want-to-write-a-dataframe-viewer.rst`. Comparison of open source DataFrame viewers (Buckaroo, Perspective, iTables, Great Tables, DTale, Mito, Marimo, ipydatagrid, quak). Research in `~/personal/buckaroo-writing/research/`. + +### Why Buckaroo uses Depot for CI +Draft at `docs/source/articles/why-depot.rst`. Depot sponsorship story. Honest benchmarking: Depot isn't measurably faster than GitHub runners (I/O-bound workload), but consistent provisioning + no minute quotas gave confidence to grow from 3 to 23 CI jobs. Pending: email to Depot CTO for input before publishing. + +## Planned + +### Static embedding improvements +- Publish JS to CDN → reduced embed size. 
Talk about the journey: Jupyter → Marimo/Pyodide → static embedding → smaller static embedding
- Page weight comparison: dbt (501 KB compressed, 28 MB total, 1.41 s DCL), Snowflake (128 KB/1.28 MB/22.51 MB/445 ms), Databricks (127 KB/797 KB/313 ms)
- Customizing buckaroo via API for embeds — show styling, link to styling docs
- Static search — maybe, take a crack at it
- Link to the static embedding guide

### Styling buckaroo chrome
Based on https://github.com/buckaroo-data/buckaroo/pull/583

### Buckaroo embedding guide
- Why to embed buckaroo
- Which config makes sense for you — along with data sizes reasoning
- Customizing appearance
- Customizing buckaroo

### Embedding buckaroo for bigger data
Parquet range queries on S3/R2 buckets. Sponsored by Cloudflare?

### How I made Buckaroo fast
The philosophy: do the right things fast, but mostly just do less. Not a performance optimization article — it's about architecture decisions that avoid work entirely.
- Column renaming to a,b,c means shorter keys everywhere, no escaping
- Parquet instead of JSON: moved from Python JSON serialization (the slowest part of the original render) to binary Parquet. Faster encoding, smaller payloads, type preservation for free
- Sampling: don't process the whole DataFrame. Sample first, compute stats on the sample, display the sample. The user sees 500 rows, not 500,000
- Summary stats: compute once, cache.
Don't recompute on every view switch +- hyparquet decodes in the browser — no round-trip to the server for data +- LRU cache on decoded Parquet so switching between main/stats views doesn't re-decode +- AG-Grid does the hard rendering work (virtual scrolling, column virtualization) — don't fight it, feed it clean data +- The lesson: most "performance work" was removing unnecessary work, not optimizing hot paths + +### Testing Buckaroo: unit tests, integration tests, and everything in between +How a solo developer tests a project that spans Python + TypeScript across 8 deployment environments. +- **Python unit tests** (pytest): serialization, stats computation, type coercion, column renaming. Fast, reliable, the foundation. ~60s for the full suite +- **JS unit tests** (vitest): component logic, displayer/formatter functions, parquet decoding. Run in Node, no browser needed +- **Playwright integration tests** (6 suites): Storybook (component rendering), JupyterLab (full widget lifecycle), Marimo, WASM Marimo, Server (MCP/standalone), Static Embed. These catch "it works in Jupyter but is blank in Marimo" — the bugs you can't find any other way +- **Styling screenshot comparisons**: before/after captures on every PR using Storybook + Playwright. Catches visual regressions (column width changes, color map shifts) that no unit test can detect +- **Smoke tests**: install the wheel with each optional extras group (`[mcp]`, `[notebook]`, etc.) and verify imports work. Catches dependency conflicts +- **MCP integration tests**: install the wheel, start the MCP server, make a `tools/call` request, verify the response includes static assets +- **Dual dependency strategy**: run all Python tests twice — once with minimum pinned versions, once with `--resolution=highest`. 
Catches pandas/polars/pyarrow compatibility issues before users do +- **The DDD as a test suite**: the Dastardly DataFrame Dataset isn't just documentation — each weird DataFrame exercises edge cases through the full serialization → display pipeline +- What I don't test: VSCode, Google Colab (no headless automation), visual pixel-perfect matching (too brittle) +- The lesson: integration tests are worth the CI investment. Most real bugs are at boundaries (Python→Parquet→JS→AG-Grid), not inside any one layer + diff --git a/docs/source/articles/buckaroo-compare.rst b/docs/source/articles/buckaroo-compare.rst new file mode 100644 index 00000000..04874b47 --- /dev/null +++ b/docs/source/articles/buckaroo-compare.rst @@ -0,0 +1,208 @@ +BuckarooCompare — Diff Your DataFrames +======================================= + +When you change a pipeline, how do you know what changed in the output? When +you migrate a table from one database to another, how do you verify the data +matches? When two teams produce different versions of the same report, where +are the differences? + +You diff them. But ``df1.equals(df2)`` returns a single boolean, and +``df1.compare(df2)`` only works if the DataFrames have identical shapes and +indexes. Real-world comparisons are messier: rows may be reordered, columns +may be added or removed, and the join key might not be the index. + +Buckaroo's ``col_join_dfs`` function handles all of this and renders the +result as a color-coded interactive table where differences jump out +visually. + + +Quick start +----------- + +.. 
code-block:: python + + from buckaroo.compare import col_join_dfs + import pandas as pd + + df1 = pd.DataFrame({ + 'id': [1, 2, 3, 4], + 'name': ['Alice', 'Bob', 'Charlie', 'Diana'], + 'score': [88.5, 92.1, 75.3, 96.7], + }) + + df2 = pd.DataFrame({ + 'id': [1, 2, 3, 5], + 'name': ['Alice', 'Robert', 'Charlie', 'Eve'], + 'score': [88.5, 92.1, 80.0, 81.0], + }) + + merged_df, column_config_overrides, eqs = col_join_dfs( + df1, df2, + join_columns=['id'], + how='outer' + ) + +The function returns three things: + +1. **merged_df**: The joined DataFrame with all rows from both inputs, + plus hidden metadata columns for diff state +2. **column_config_overrides**: A dict of buckaroo styling config that + color-codes each cell based on whether it matches, differs, or is + missing from one side +3. **eqs**: A summary dict showing the diff count per column — how many + rows differ for each column + + +How the diff works +------------------ + +``col_join_dfs`` performs a ``pd.merge`` on the join columns, then for each +data column: + +- Creates a hidden ``{col}|df2`` column with the df2 value +- Creates a hidden ``{col}|eq`` column encoding the combined state: + is the row in df1 only, df2 only, both-and-matching, or both-and-different? +- Generates a ``color_map_config`` that maps these states to colors + +The color scheme: + +.. list-table:: + :header-rows: 1 + + * - State + - Color + - Meaning + * - df1 only + - Pink + - Row exists in df1 but not df2 + * - df2 only + - Green + - Row exists in df2 but not df1 + * - Match + - Light blue + - Row in both, values identical + * - Diff + - Dark blue + - Row in both, values differ + +Join key columns are highlighted in purple so you can immediately see what +was used for matching. + + +The eqs summary +--------------- + +The third return value tells you at a glance where the differences are: + +.. 
code-block:: python

   >>> eqs
   {
       'id': {'diff_count': 'join_key'},
       'name': {'diff_count': 1},   # 1 row differs (Bob vs Robert)
       'score': {'diff_count': 1},  # 1 row differs (75.3 vs 80.0)
   }

Special values:

- ``"join_key"`` — this column was used for matching, not compared
- ``"df_1"`` — column only exists in df1
- ``"df_2"`` — column only exists in df2
- An integer — the number of rows present in both DataFrames where the
  values differ


Using it with the server
------------------------

The buckaroo server exposes a ``/load_compare`` endpoint that loads two
files, runs the diff, and pushes the styled result to any connected browser:

.. code-block:: bash

   curl -X POST http://localhost:8888/load_compare \
     -H "Content-Type: application/json" \
     -d '{
       "session": "my-session",
       "path1": "/data/report_v1.csv",
       "path2": "/data/report_v2.csv",
       "join_columns": ["id"],
       "how": "outer"
     }'

The response includes the diff summary:

.. code-block:: json

   {
     "session": "my-session",
     "rows": 5,
     "columns": ["id", "name", "score"],
     "eqs": {
       "id": {"diff_count": "join_key"},
       "name": {"diff_count": 2},
       "score": {"diff_count": 1}
     }
   }

The browser view updates immediately with the color-coded merged table.
Hover over any differing cell to see the df2 value in a tooltip.


Multi-column joins
------------------

.. code-block:: python

   merged_df, overrides, eqs = col_join_dfs(
       df1, df2,
       join_columns=['region', 'date'],
       how='inner'
   )

Composite join keys work naturally. Both ``region`` and ``date`` will be
highlighted in purple.


Use cases
---------

**Data migration validation**
   Migrating from Postgres to Snowflake? Export both tables, diff them.
   The color coding immediately shows which rows are missing and which
   values changed.

**Pipeline output comparison**
   Changed a transform? Diff the before and after. The ``eqs`` summary
   tells you exactly which columns were affected and by how many rows.
+ +**A/B test result inspection** + Compare experiment vs control DataFrames on a user ID join key. See + which metrics actually differ. + +**Schema evolution** + When df2 has columns that df1 doesn't (or vice versa), those columns + are marked as ``"df_1"`` or ``"df_2"`` in the eqs summary, so you + can see schema changes alongside data changes. + + +Integration with datacompy +-------------------------- + +The ``docs/example-notebooks/datacompy_app.py`` example shows how to use +`datacompy `_ for metadata-rich +comparison (column matching stats, row-level match rates) while using +buckaroo for the visual rendering. + +This gives you the best of both: datacompy's statistical summary plus +buckaroo's interactive, color-coded table view. + + +Limitations +----------- + +- Join columns must be unique in each DataFrame (no many-to-many joins). + If duplicates are detected, ``col_join_dfs`` raises a ``ValueError``. +- Column names cannot contain ``|df2`` or ``__buckaroo_merge`` (these are + used internally). +- Very large DataFrames (>100K rows) will work but the browser may be slow + to render the full color-coded table. diff --git a/docs/source/articles/embedding-guide.rst b/docs/source/articles/embedding-guide.rst new file mode 100644 index 00000000..2cba2d0d --- /dev/null +++ b/docs/source/articles/embedding-guide.rst @@ -0,0 +1,259 @@ +Buckaroo Embedding Guide +======================== + +This guide covers everything you need to embed interactive buckaroo tables +in your own applications, documentation, and reports. + + +Why embed +--------- + +- **Share DataFrames without Jupyter**: Send a colleague an HTML file they + can open in any browser. No Python install required. +- **Build data apps**: Integrate the buckaroo viewer into React dashboards, + internal tools, or customer-facing data products. +- **Static reports**: Generate HTML reports from your pipeline that include + interactive, sortable tables with summary statistics. 
+- **Documentation**: Embed live data tables in your docs site (Sphinx, + MkDocs, or plain HTML). + + +Choose your embedding mode +-------------------------- + +Buckaroo offers two static embed modes and one live widget mode: + +``embed_type="DFViewer"`` — Lightweight table + Just the data grid with sortable columns, summary stats pinned at the + bottom, histograms, and type-aware formatting. Smaller payload. Best + for documentation, reports, and sharing. + +``embed_type="Buckaroo"`` — Full experience + Everything in DFViewer plus the display switcher bar, multiple computed + views, and the interactive analysis pipeline. Larger payload. Best for + data exploration and internal tools. + +**anywidget** — Live in notebooks + The ``BuckarooWidget`` runs inside Jupyter, Marimo, VS Code notebooks, + and Google Colab via anywidget. Full interactivity including the command + UI for data cleaning operations. Requires a running Python kernel. + +For most embedding use cases, start with ``DFViewer``. + + +Data size guidelines +~~~~~~~~~~~~~~~~~~~~ + +.. list-table:: + :header-rows: 1 + + * - Row count + - Recommended approach + * - < 1,000 rows + - Inline static embed. JSON payload is small (~10-50 KB). + * - 1,000 - 100,000 rows + - Static embed still works. Parquet encoding keeps payload + compact (50-500 KB). Consider sampling for faster page load. + * - > 100,000 rows + - Host data separately. Use Parquet range queries on S3/R2 to + fetch only the visible rows and columns. + + +Generate a static embed +----------------------- + +.. code-block:: python + + from buckaroo.artifact import to_html + import pandas as pd + + df = pd.read_csv('my_data.csv') + html = to_html(df, title="My Data", embed_type="DFViewer") + + with open('my-data.html', 'w') as f: + f.write(html) + +The HTML file references ``static-embed.js`` and ``static-embed.css``. +These are shipped in the buckaroo wheel under ``buckaroo/static/``. +Copy them alongside your generated HTML: + +.. 
code-block:: bash + + STATIC=$(python -c "from pathlib import Path; import buckaroo; print(Path(buckaroo.__file__).parent / 'static')") + cp "$STATIC/static-embed.js" "$STATIC/static-embed.css" ./ + +**With polars:** + +.. code-block:: python + + import polars as pl + from buckaroo.artifact import to_html + + df = pl.read_parquet('my_data.parquet') + html = to_html(df, title="Polars Data") + +``to_html()`` auto-detects polars DataFrames and uses the polars analysis +pipeline. + +**From a file path:** + +.. code-block:: python + + from buckaroo.artifact import to_html + + # Reads CSV, Parquet, JSON, or JSONL automatically + html = to_html('/path/to/data.parquet', title="Direct from file") + + +Customizing appearance +---------------------- + +Column config overrides +~~~~~~~~~~~~~~~~~~~~~~~ + +Pass ``column_config_overrides`` to control per-column display: + +.. code-block:: python + + html = to_html(df, column_config_overrides={ + 'revenue': { + 'color_map_config': { + 'color_rule': 'color_from_column', + 'map_name': 'RdYlGn', + } + }, + 'join_key': { + 'color_map_config': { + 'color_rule': 'color_static', + 'color': '#6c5fc7', + } + } + }) + +Available color rules: + +- ``color_from_column``: Color cells based on their value using a named + colormap (e.g., ``RdYlGn``, ``Blues``, ``Viridis``) +- ``color_categorical``: Map categorical values to a list of colors +- ``color_static``: Constant background color for every cell in the column + +Tooltips +~~~~~~~~ + +Show the value of another column on hover: + +.. code-block:: python + + column_config_overrides={ + 'name': { + 'tooltip_config': { + 'tooltip_type': 'simple', + 'val_column': 'full_name', + } + } + } + + +Analysis classes +~~~~~~~~~~~~~~~~ + +Control which summary statistics are computed: + +.. 
code-block:: python

   from buckaroo.artifact import to_html
   from buckaroo.pluggable_analysis_framework.analysis_management import (
       ColAnalysis,
   )

   class MyCustomAnalysis(ColAnalysis):
       """Placeholder for your own stat; see :doc:`pluggable`."""

   # Use extra_analysis_klasses to add custom stats
   # Use analysis_klasses to replace the default set
   html = to_html(df,
                  extra_analysis_klasses=[MyCustomAnalysis],
                  embed_type="Buckaroo")

See :doc:`pluggable` for details on writing custom analysis classes.


Pinned rows
~~~~~~~~~~~

Add custom pinned rows (shown at the bottom of the table):

.. code-block:: python

   html = to_html(df,
       extra_pinned_rows=[
           {'index': 'target', 'a': 100, 'b': 200},
       ])


Integration patterns
--------------------

Static HTML file
~~~~~~~~~~~~~~~~

The simplest approach. Generate the HTML, copy ``static-embed.js`` and
``static-embed.css`` next to it, and open in a browser or serve from any
static file host.

.. code-block:: bash

   cp $(python -c "import buckaroo; print(buckaroo.__path__[0])")/static/static-embed.* ./
   open my-data.html

React component
~~~~~~~~~~~~~~~

For deeper integration, import the React components directly from
``buckaroo-js-core`` (check the package's TypeScript definitions for the
exact ``DFViewer`` prop names in your version):

.. code-block:: bash

   npm install buckaroo-js-core

.. code-block:: typescript

   import { DFViewer } from 'buckaroo-js-core';

   function MyTable({ data, config, summaryStats }) {
       return (
           <DFViewer
               df_data={data}
               df_viewer_config={config}
               summary_stats_data={summaryStats}
           />
       );
   }

Sphinx / ReadTheDocs
~~~~~~~~~~~~~~~~~~~~

Use a ``raw`` directive to embed an iframe pointing to a pre-generated
static HTML file:

.. code-block:: rst

   .. raw:: html

      <iframe src="../_static/my-data.html"
              width="100%" height="500"></iframe>

Generate the HTML with the ``to_html()`` function and place it in your
Sphinx ``_static`` directory.
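For the iframe to resolve, Sphinx just needs to copy the generated file (and the ``static-embed.*`` assets next to it) into the build output. With the default project layout no extra configuration is required, since a generated ``conf.py`` already contains:

```python
# Everything under the source _static directory is copied into the
# built site, so the generated embed page and its JS/CSS ride along.
html_static_path = ['_static']
```

If you keep embeds in a different directory, add that directory to the list instead.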
+ + +What's included in the bundle +----------------------------- + +The ``static-embed.js`` bundle (1.3 MB minified) includes: + +- React 18 + ReactDOM +- AG-Grid Community v33 (table rendering) +- hyparquet (Parquet decoding in the browser) +- recharts (histogram rendering) +- lodash-es (utility functions, tree-shaken) + +The bundle is built with esbuild and shipped as an ES module. diff --git a/docs/source/articles/so-you-want-to-write-a-dataframe-viewer.rst b/docs/source/articles/so-you-want-to-write-a-dataframe-viewer.rst new file mode 100644 index 00000000..a7cda8bd --- /dev/null +++ b/docs/source/articles/so-you-want-to-write-a-dataframe-viewer.rst @@ -0,0 +1,325 @@ +So You Want to Write a DataFrame Viewer +======================================== + +You want to write a better viewer for tabular data. That's great, the +world needs better interfaces in this space, and there is so much that +can be improved on. Here are some of the biggest design decisions and +their potential side effects, along with projects that chose different +routes. There are many closed source data table viewers with various +levels of capability. It seems like every new notebook hosting +environment feels compelled to build their own dataframe viewer. In this +article I will draw on my own experience creating +`Buckaroo `__, as well as +observations from looking at popular open source table viewers like +`Perspective `__, +`Great Tables `__, +`DTale `__, +`Marimo `__, +`iTables `__, +`ipydatagrid `__, +`Panel Tabulator `__, +and Streamlit's +`st.dataframe `__. + +I have run into each one of these issues while building buckaroo. + + +Use-case questions +------------------- + +Before starting, think about what use case you are looking to solve for. +Are you trying to build tables for relatively static display (PDF to +Huggingface data browser)? Do you want to serve dashboards (a limited +set of interactions with users willing to customize heavily and +specifically for styling)? 
Do you want to facilitate interactive use in
an IDE-like environment (VSCode notebooks, some internal data bench)? Do
you want to work in notebook environments? What size datasets do you
expect your users to work with? What performance expectations do your
users have? Do you want users to be able to customize the experience?
Without writing JS? Do you want to deal with streaming data? Do you want
to allow editing of data?


Processing: server-side or browser-based
-----------------------------------------

The biggest decision to make when building a table viewer is what to do
with the data. Do you want the entire dataset to reside in the browser,
or do you want to leave it on the server and page the currently viewed
section back and forth to the browser? Both approaches have their place.

Browser-based approaches are much cheaper to serve at scale. Browsers
have improved significantly in the past decade, and there are many
applications that put over a gigabyte of data into the browser with no
ill effects. Further, with HTTP range requests, the full dataset doesn't
even have to be loaded at once. Apache Arrow and Parquet make this
approach more performant and attractive. This approach scales with little
cost because S3 and Cloudflare are incredibly performant and inexpensive
compared to spinning up server infrastructure.

Browser-based approaches fall down with datasets over 1 GB. Additionally,
1 GB is about the total limit of memory use that you want a single page
to have, so if you have multiple dataframes that you want to display
simultaneously, keep that in mind. Finally, browser-based solutions
require using browser-based analytics engines instead of familiar tools
like pandas and polars. Apache Arrow is packageable into a WebAssembly
module, but packaging it into a JS build is tricky.

Server-based solutions are more familiar as traditional web apps,
sometimes with some twists.
Server-based solutions excel for very large
datasets that are backed by analytics engines. If your 10 GB table is
already in a relational database, let the database do the sorting, and
only send over the limited rows that are being displayed. Server-based
solutions with persistent connections also allow many more tables to be
displayed simultaneously while limiting browser memory usage. If you have
infrastructure built around analytics pipelines in traditional
environments, server-side solutions are often the better way to go.
Sorting and histograms in particular can be hard to implement identically
in different numerical engines.

The downsides of a server-based approach are that you always need to have
the server running to make the table work. At the small end this means
you can't simply host an artifact with your table in it. You can't serve
a Jupyter notebook statically in a GitHub repo. If you intend to host an
analytics system with your table, you now need server infrastructure to
back it. Server infrastructure connected to a relational database or
data warehouse is one level of expense — it is even more expensive (in
terms of memory and CPU) to host Python-based analytics server-side.


Serializing data
-----------------

For buckaroo, serializing data to JSON was the slowest part of the
initial render (no longer true, thanks to better lazy fetching).
Serializing dataframes is hard. There are multiple numerical Python and
Arrow concepts that don't have direct equivalents in JS or JSON. Notably,
infinity and NaN aren't valid in JSON. Furthermore, datetime handling
across JSON requires a processing layer: you either encode strings or
millisecond offsets, and either way you need a metadata layer that tells
the decoder how to interpret the values. Then there are common Python
datatypes like timedelta that have no native JS equivalent.

Next we get to the difficulty of serializing pandas data structures.
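Before getting to pandas-specific structure, the JSON gap is easy to demonstrate with nothing but the standard library (plain Python, no table viewer required):

```python
import json

# Python's json module emits NaN/Infinity tokens by default...
print(json.dumps({"a": float("nan"), "b": float("inf")}))
# {"a": NaN, "b": Infinity}   <- not valid JSON; JSON.parse rejects it

# Strict mode refuses instead, so a serializer must pick a policy up
# front: null the values out, encode them as strings, or carry a
# metadata layer that tells the browser how to decode them.
try:
    json.dumps(float("nan"), allow_nan=False)
except ValueError as exc:
    print("strict mode:", exc)
```

Either way, the decision can't be deferred: some encoding policy has to exist before the first float leaves Python.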
Pandas indexes, which apply to both rows and columns, come in a variety
of formats. Multi-level indexes can be challenging for display — they
have to be special-cased in your display code regardless of how they are
serialized. Pandas columns can also be named in a variety of ways,
including as numerics or strings.

These different dataframe configurations are challenging because they are
hard to completely anticipate. In my experience, when a user constructed
a dataframe with an unexpected structure, it was one of the most likely
things to blow up buckaroo with a JS typing error. Unexpected structures
also surfaced as exceptions thrown throughout the pandas processing code.

Polars is a bit easier in this regard: it eschews having an index.

Many of these issues exist when serializing to a binary format like
Feather or Parquet, but they take a slightly different shape. With
Feather/Parquet, make sure Python objects and lists serialize properly.
Also, if you want a single-file static HTML export to work, you will need
to base64-encode the binary data. True binary-to-binary transfer requires
a network connection.


The table viewer component
---------------------------

There are many table components, so many that there is a site dedicated
to tracking their popularity. Increasing in complexity, you have
everything from static HTML, to jQuery-based libraries, to modern table
grids, to AG-Grid, to fully custom-coded frontend libraries. HTML-based
tables allow simple customizability along with a great story for static
export to the widest list of targets. jQuery-based libraries (limited
table rows, pagination) are relatively simple to use and limit
complexity — previously they were much easier to package into the Jupyter
frontend environment than full JS build chains.

Then there are modern table libraries that aren't AG-Grid:
`React-data-grid `_, angular-grid,
`tanstack-table `_, and
`Handsontable `_.
These libraries might be familiar.
They have a straightforward licensing +story. They also tend to have rough edges, limited adoption, and they +tend to be abandoned. I haven't investigated these packages as much. + +Next up is `AG-Grid `_. AG-Grid is +the reliable gold standard for tables, under active development for over +a decade. AG-Grid has a full commercial company behind it, along with a +permissively licensed community edition. From my experience they haven't +kneecapped the community edition in favor of the commercial edition, and +aim to have the community edition as the best free table widget on the +market. The tool is extensively documented with working examples. The +company is completely unresponsive to bug reports from non-paying users +in my experience. I chose AG-Grid after listening to +`an interview with their founder +`_ +on the JS Jabber podcast. + +Then there are custom table widgets like +`Perspective `__, +`glide-data-grid `__, +and whatever you cooked up yourself. Perspective has a very impressive +table, and I suspect it has better performance than AG-Grid. It is +minimally documented and doesn't have the wide community adoption that +generates Stack Overflow guidance. glide-data-grid is an impressive +piece of software, rendering to canvas. It is solo-maintained by its +creator at Glide Apps — actively developed but quietly, with Streamlit +as its biggest downstream consumer. + +If you are writing your own table, congrats. You will have ultimate +control over your user experience. You won't have to worry about +dependencies on ``isEven`` or other npm trash. You will have a very +complex core piece to maintain. At a minimum I'd recommend thoroughly +investigating other widgets to see how they approached problems. + + +The notebook environment +------------------------- + +There are many different notebook environments. Jupyter Notebook, Google +Colab, VSCode notebooks, classic notebooks (before Notebook 7.0), +Marimo, Jupyter running on WASM (JupyterLite). 
All have slight +differences that become especially significant for frontend code. +Styling works differently, loading JavaScript is a bit different. +`Anywidget `_ was developed to make all of this +easier, and it does. Before anywidget, this section would have been much +longer. + +Even determining what environment you are running in is challenging. +This will come up when users file bugs. `widget_utils.py +`_ +is my function for determining which Jupyter environment I'm running in. + + +Other questions +---------------- + +**Do you want to enable editing tables?** It isn't too challenging to +enable frontend edits to modify the core dataframe of a table. But then +what? For a full fledged application, you have a bunch of options. In the +Jupyter notebook, you don't have many good options. Accessing widget +state in a Jupyter notebook is possible, but it isn't obvious. Jupyter +notebooks also make it easy to inadvertently rerun a cell — which would +cause your user to lose all edits — a very frustrating experience. + +**What about events and callbacks?** Adding click handling events plumbed +through to Python is an attractive option. But now your users have to +make sure they don't have cycles in the event handlers. This is another +place where building a tool for a Jupyter widget is different than +building a tool for a framework or dashboard. + + +Conclusion +----------- + +I'm not suggesting that you avoid creating a table for the Jupyter +environment. I am suggesting that you understand how broad a task it is, +and the ways it could fail. + + +Comparison of open source DataFrame viewers +--------------------------------------------- + +.. list-table:: + :header-rows: 1 + :widths: 15 12 10 10 12 10 15 12 + + * - Name + - Server / Browser + - JSON / Numeric + - Static Export + - Jupyter Compatible + - Dynamic + - Table Viewer + - Built on Anywidget? 
+ * - `Buckaroo `_ + - Server + - Numeric + - Yes + - Yes + - Yes + - AG-Grid + - Yes + * - `ipydatagrid `_ + - Server + - JSON + - No + - Yes + - Yes + - Lumino DataGrid (canvas) + - No + * - `Perspective `_ + - Both + - Arrow + - Yes + - Yes + - Yes + - Custom + - No + * - `iTables `_ + - Browser + - JSON + - Yes + - Yes + - No + - datatables (jQuery based) + - Optional + * - `Great Tables `_ + - Browser + - HTML + - Yes + - Yes + - No + - HTML + - No + * - `DTale `_ + - Server + - JSON + - No + - Yes + - Yes + - react-virtualized + - No + * - `Mito `_ + - Server + - JSON + - No + - Yes + - Yes + - Endo (custom) + - No + * - `Marimo `_ + - Server + - JSON + - Yes + - No + - Yes + - tanstack-table + - No + * - `Panel Tabulator `_ + - Both + - JSON + - Yes + - Yes + - Yes + - Tabulator.js + - No + * - `Streamlit `_ + - Server + - Arrow + - No + - No + - Yes + - glide-data-grid (canvas) + - No + * - `quak `_ + - Server + - Arrow + - No + - Yes + - Yes + - Custom HTML + - Yes diff --git a/docs/source/articles/static-embedding.rst b/docs/source/articles/static-embedding.rst new file mode 100644 index 00000000..5a95bc82 --- /dev/null +++ b/docs/source/articles/static-embedding.rst @@ -0,0 +1,180 @@ +Static Embedding & the Incredible Shrinking Widget +==================================================== + +Buckaroo started as a Jupyter widget. You had to install Python, install +Jupyter, install buckaroo, start a kernel, and run a cell — just to see a +table. Then came Marimo and Pyodide, which cut out the kernel but still +needed a Python runtime in the browser. + +Now there's a third option: **static embedding**. A single HTML file that +renders a fully interactive buckaroo table with no server, no kernel, no +Python runtime. Just a browser. + +How it works +------------ + +.. 
code-block:: python + + from buckaroo.artifact import to_html + import pandas as pd + + df = pd.read_csv('sales.csv') + html = to_html(df, title="Sales Data", embed_type="DFViewer") + + with open('sales.html', 'w') as f: + f.write(html) + +That's it. ``to_html()`` does the following: + +1. Runs the buckaroo analysis pipeline on the DataFrame — computing dtypes, + summary stats, histograms, column configs +2. Serializes the data to **base64-encoded Parquet** (much more compact than + JSON, especially for numeric columns) +3. Wraps everything in an HTML template that references ``static-embed.js`` + and ``static-embed.css`` + +The resulting HTML is self-describing. The JS bundle reads the embedded JSON, +decodes the Parquet payload using `hyparquet `_, +and renders the table with AG-Grid — all client-side. + +Two embedding modes +------------------- + +``embed_type="DFViewer"`` (default) + Lightweight table viewer with summary stats pinned at the bottom. + Includes dtypes, histograms, and basic statistics. Smaller payload. + +``embed_type="Buckaroo"`` + The full buckaroo experience: display switcher bar, multiple computed + views (main data, summary stats, other analysis outputs), and the + interactive analysis pipeline UI. Larger payload but more powerful. + +For most documentation and sharing use cases, ``DFViewer`` is the right +choice. + + +Bundle size +----------- + +The ``static-embed.js`` bundle is currently **1.3 MB** (minified). This +includes React, AG-Grid, hyparquet, recharts (for histograms), and lodash-es. + +How does this compare to the data industry? 
+ +========================== ================== +Site Total page weight +========================== ================== +MongoDB 11.5 MB +Confluent 10.7 MB +Snowflake 8.4 MB +Elastic 6.1 MB +dbt Labs 5.0 MB +Fivetran 3.4 MB +Datadog 2.3 MB +Palantir 2.0 MB +Databricks 1.6 MB +**Buckaroo static embed** **~1.3 MB + data** +========================== ================== + +Confluent ships 9.2 MB of JavaScript to show you a marketing page. MongoDB +loads a 1.7 MB Optimizely tracking script before you see a single word of +content. Buckaroo delivers an interactive data viewer — with histograms, +sortable columns, summary stats, and type-aware formatting — in less than +Palantir's homepage JavaScript alone. + +And that 1.3 MB includes the *viewer itself*. Your data is on top of that, +but Parquet-encoded data is compact: a 10,000-row DataFrame with 10 columns +typically adds 50-200 KB depending on column types. + + +What we did to get here +----------------------- + +Recent releases shipped several size optimizations: + +**lodash → lodash-es** (`#624 `_) + Migrated from the CommonJS lodash bundle (which includes every function) + to lodash-es, which is tree-shakeable. Only the functions actually used + end up in the bundle. + +**AG Grid v32 → v33** (`#625 `_) + AG Grid v33 unified its package structure. Instead of importing from + multiple packages (``@ag-grid-community/core``, ``@ag-grid-community/client-side-row-model``, + etc.), there's now a single ``ag-grid-community`` package with module + registration. This lets the bundler do a single pass of tree-shaking + instead of trying to deduplicate across packages. + +**Minification** (`#624 `_) + The ``widget.js`` and ``static-embed.js`` bundles are now minified with + esbuild. Previously they shipped unminified. + +**Parquet encoding** + Switching from JSON arrays to Parquet for the data payload was itself + a size win. A DataFrame with 1000 rows of integers takes ~4 KB in + Parquet vs ~12 KB in JSON. 
The savings compound with row count.


What's next: CDN-hosted viewer
------------------------------

Today, every static embed includes the full 1.3 MB viewer bundle. If you
generate 10 pages, you serve 13 MB of identical JavaScript.

The next step is publishing ``static-embed.js`` to a CDN (e.g., jsDelivr or
a Cloudflare R2 bucket). Each embed page would reference the CDN URL instead
of a local file. The per-page payload drops to just the data — typically
under 200 KB.

This also opens the door to embedding buckaroo tables directly in
GitHub READMEs, documentation sites, and email reports.


For larger data: Parquet range queries
--------------------------------------

Static embeds work great for data that fits in a single HTML file — up to
about 100K rows before the file gets unwieldy. Beyond that, the data should
live separately.

Parquet files are designed for partial reads. The file footer contains a
directory of column chunks with byte offsets. A client can fetch just the
columns and row groups it needs using HTTP range requests — no server
required, just a file on object storage (S3, Cloudflare R2, GCS).

This is the subject of a future post, but the architecture looks like:

1. Parquet file on a private R2 bucket
2. Cloudflare Worker generates a time-limited presigned URL
3. Browser-side buckaroo fetches column chunks via ``Range`` headers
4. Data never flows through your server

See the content plan for details.


Try it
------

.. code-block:: bash

   pip install buckaroo

..
code-block:: python + + from buckaroo.artifact import to_html + import pandas as pd + + # Any DataFrame works + df = pd.read_csv('your_data.csv') + html = to_html(df, title="My Data") + + with open('my-data.html', 'w') as f: + f.write(html) + + # Full buckaroo experience (larger bundle, more features) + html_full = to_html(df, title="My Data", embed_type="Buckaroo") + +The generated HTML references ``static-embed.js`` and ``static-embed.css`` +which are included in the ``buckaroo`` Python package under +``buckaroo/static/``. Copy those files alongside your HTML, or serve them +from a web server. diff --git a/docs/source/articles/types-to-display.rst b/docs/source/articles/types-to-display.rst new file mode 100644 index 00000000..2925c171 --- /dev/null +++ b/docs/source/articles/types-to-display.rst @@ -0,0 +1,334 @@ +How Types and Data Move from Engine to Browser +================================================ + +You have a DataFrame in Python. Moments later it's rendered in a +browser — scrollable, formatted, with histograms in the summary row. +What happened in between? + +This article traces the full path: column renaming, type coercion, +Parquet encoding, base64 transport, hyparquet decoding, and finally the +displayer/formatter system that turns raw values into what you see on +screen. + + +Column renaming: why everything becomes ``a, b, c`` +----------------------------------------------------- + +The very first thing buckaroo does when serializing a DataFrame is +rename every column. The original column ``"revenue"`` becomes ``a``. +``"cost"`` becomes ``b``. The 27th column becomes ``aa``, then ``ab``, +``ac``, and so on — base-26 using lowercase ASCII. + +Why? Two reasons: + +1. **Column names can be anything.** Tuples (from MultiIndex), integers, + strings with spaces and special characters, even a column literally + called ``"index"``. Parquet column names must be strings. AG-Grid + field names should be simple identifiers. 
Renaming to ``a, b, c`` + sidesteps every edge case at once. + +2. **Collision avoidance.** When a DataFrame has a column named + ``"index"`` and we need to serialize the actual index as a column + too, there's a name collision. Renaming to short opaque names means + the index columns (``index``, ``index_a``, ``index_b`` for + MultiIndex levels) never collide with data columns. + +The original name is preserved in the ``column_config`` that travels +alongside the data. On the JS side, each column's ``header_name`` +(or ``col_path`` for MultiIndex) tells AG-Grid what to display in the +header. The user never sees ``a, b, c`` — they see the real names. + +.. code-block:: python + + # In styling_core.py — fix_column_config maps col→header_name + base_cc['col_name'] = col # "a" + base_cc['header_name'] = str(orig_col_name) # "revenue" + + +Cleaning before serialization +------------------------------ + +Python's type system is richer than what Parquet (or JSON) can express +directly. Before writing to Parquet, buckaroo coerces the awkward types: + +.. list-table:: + :header-rows: 1 + :widths: 30 30 40 + + * - Python type + - Becomes + - Why + * - ``pd.Period`` (e.g. "2021-01") + - ``str`` + - Parquet has no period type + * - ``pd.Interval`` (e.g. ``(0, 1]``) + - ``str`` + - Parquet has no interval type + * - ``pd.Timedelta`` + - ``str`` (e.g. "1 days 02:03:04") + - fastparquet can't encode timedeltas + * - ``bytes`` (e.g. from ``pl.Binary``) + - hex string (e.g. ``"68656c6c6f"``) + - Parquet object columns need strings + * - PyArrow-backed strings + - ``object`` dtype + - fastparquet needs object, not ArrowDtype + * - Timezone-naive datetimes + - UTC datetimes + - Avoids ambiguous serialization + +For the main DataFrame, this happens in ``to_parquet()`` +(``serialization_utils.py``). The function also calls +``prepare_df_for_serialization()`` which does the column rename and +flattens MultiIndex levels into regular columns (``index_a``, +``index_b``, etc.). 
+ +Summary stats have an additional wrinkle: each column's stats dict +contains mixed types (strings like ``"int64"`` for dtype, floats for +mean, lists for histogram bins). fastparquet can't handle mixed-type +columns, so ``sd_to_parquet_b64()`` JSON-encodes every cell value first, +making each column a pure string column. The JS side knows to +``JSON.parse`` each cell back. + +.. code-block:: python + + # Every cell becomes a JSON string before parquet encoding + def _json_encode_cell(val): + return json.dumps(_make_json_safe(val), default=str) + + +Parquet encoding and base64 transport +-------------------------------------- + +buckaroo uses **fastparquet** with a custom JSON codec to write the +DataFrame to an in-memory Parquet file. Categorical and object columns +get JSON-encoded within the Parquet file (fastparquet's ``object_encoding='json'``). + +The raw Parquet bytes are then base64-encoded into an ASCII string: + +.. code-block:: python + + def to_parquet_b64(df): + raw_bytes = to_parquet(df) + return base64.b64encode(raw_bytes).decode('ascii') + +The result is a tagged payload: + +.. code-block:: json + + {"format": "parquet_b64", "data": "UEFSMQ..."} + +This travels over the wire — via Jupyter's comm protocol, a WebSocket, +or embedded directly in an HTML ``