Buckaroo Visual Wrangler is a data visualization tool that enables users to visually detect errors in their data and apply data wranglers to clean the dataset. Users can choose from 3 provided datasets:
- StackOverflow survey (stackoverflow_db_uncleaned.csv)
- Chicago crimes (Crimes_-_One_year_prior_to_present_20250421.csv)
- Student loan complaints (complaints-2025-04-21_17_31.csv)
Users can explore their data with several visualization styles, such as heatmaps, scatterplots, and histograms. They can select data by clicking on the plots and then apply wrangling techniques to the selection. After performing the desired wrangling actions, they can export a Python script of those actions to run on the dataset outside of the tool.
If you want to see an early version of Buckaroo without going through any setup, you can try out an in-memory, client-only version here. This is the version documented in our VLDB 2025 demo paper, and it differs slightly from the current version.
- We assume you have Python 3 and PostgreSQL installed on your machine. If you do not, install them before following the steps below.
- Navigate to `/app` and create a new file called `database.json`.
- Set up a user and database in Postgres that you will use for this app (this project requires a local installation of PostgreSQL).
- Run `./run.sh`. This will prompt the construction of your `database.json`.
- You should now have a `.venv` built for the project to run in and a `database.json` created. If this did not work, you need to set up the parameters in `database.json` manually.
- Make sure npm is installed and accessible in the `.venv`.
- If your terminal prompt does not start with `(.venv)`, you need to activate the `.venv` that was just created.
- Run `./start.py` in the venv to start the Flask server.
- Open a new terminal, run `cd ui`, then run `npm run dev` to start the front end.
- An efficient setup is to use JetBrains PyCharm and create a compound debug configuration that combines a JavaScript Debug config, an npm config, and a Python config, so that you can do full-stack debugging.
- JetBrains PyCharm also supports Postgres integration, so you can see the tables and the state of the database while developing.
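
If you end up filling in `database.json` manually, a quick sanity check of the file can save a debugging round trip. The sketch below is only an illustration: the key names (`host`, `port`, `dbname`, `user`, `password`) and the helper names are assumptions, not the schema that `run.sh` actually produces, so adjust them to match your generated file. It uses only the Python standard library and builds a standard PostgreSQL connection URI from the loaded values.

```python
import json
from pathlib import Path
from urllib.parse import quote

# Hypothetical key names; check your own database.json for the real ones.
REQUIRED_KEYS = ("host", "port", "dbname", "user", "password")


def load_db_config(path):
    """Read a JSON config file and fail early if an expected key is missing."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"{path} is missing keys: {', '.join(missing)}")
    return config


def to_postgres_uri(config):
    """Build a postgresql:// connection URI, percent-encoding the credentials."""
    user = quote(str(config["user"]), safe="")
    password = quote(str(config["password"]), safe="")
    return (
        f"postgresql://{user}:{password}"
        f"@{config['host']}:{config['port']}/{config['dbname']}"
    )
```

For example, `to_postgres_uri(load_db_config("database.json"))` gives a URI you can paste into `psql` or a database client to confirm that the user and database you created are reachable.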
There is a doc called DEVNOTES.md that helps explain the architecture of the app and how things flow. It isn't comprehensive, but it explains a lot and is worth at least a skim when getting into the codebase. If other developers on Buckaroo change any of the core functionality in main, please update this doc so that future students or others doing development on Buckaroo can continue to rely on DEVNOTES.md :).
- Integrate refactor-detector-port into main without breaking functionality in main
- Make the dirty rows table infinitely scrollable. Right now it shows only the top 10 rows; it should load the next 10 rows as the user scrolls, and so on.
- The tool currently bins numerical values but not string values. As a result, strings such as dates and unique IDs each receive their own tick mark on the axis, producing a crowded and often unreadable plot. Future work should handle dates in a more sophisticated way, such as binning by month or year. Other options we discussed: binning all clean data into one bin and leaving data with errors unbinned so it is easy to spot; showing only a subset of the clean data while keeping all the dirty data to repair; and binning by error type.
- Make the dirty rows table headers clickable for sorting. For example, if a user clicks on "Age", the table will show the top 10 rows with an error in the Age column.
- Python script export is currently missing and needs to be re-implemented.
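
To make the date-binning item in the list above more concrete, here is a minimal sketch of binning date strings by month or year while leaving unparseable values unbinned so they stay visible. It is not code from Buckaroo: the function name `bin_dates`, its parameters, and the default `%m/%d/%Y` date format are assumptions chosen for illustration, and the real datasets may use a different format.

```python
from collections import Counter
from datetime import datetime


def bin_dates(values, granularity="month", fmt="%m/%d/%Y"):
    """Count date strings per month or year.

    Returns (bins, unparsed): bins maps a "YYYY-MM" or "YYYY" label to a
    count, and unparsed lists the raw values that did not match fmt, so
    they can be plotted individually instead of being hidden in a bin.
    """
    if granularity not in ("month", "year"):
        raise ValueError("granularity must be 'month' or 'year'")

    bins = Counter()
    unparsed = []
    for raw in values:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except (ValueError, AttributeError):
            # ValueError: wrong format; AttributeError: not a string (e.g. None).
            unparsed.append(raw)
            continue
        if granularity == "month":
            label = f"{parsed.year:04d}-{parsed.month:02d}"
        else:
            label = f"{parsed.year:04d}"
        bins[label] += 1

    # Sort labels so the axis reads chronologically.
    return dict(sorted(bins.items())), unparsed
```

With this shape, the plot axis would need one tick per month or year rather than one per distinct date string, and the `unparsed` list corresponds to the "leave data with errors unbinned" idea.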