PDFParser

Status: [In-Progress]
Published: [date here]
Updated: [9/22]

Vanderbilt Hustler internal tool to convert PDFs to spreadsheets!

How this tool works

This Python script uses the tabula library to read a PDF, load each table into a dataframe, and export the result as a CSV. For PDFs with multiple tables, the script outputs each table as a separate sheet.
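The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in app.py: the function names `pdf_to_excel` and `sheet_name` and the "Table N" naming scheme are hypothetical, and the sketch assumes the output is an .xlsx workbook (since openpyxl appears in the requirements and each table gets its own sheet). It relies on `tabula.read_pdf` from tabula-py, which needs Java installed.

```python
def sheet_name(index: int) -> str:
    """Hypothetical naming scheme: 'Table 1', 'Table 2', ...

    Excel limits sheet names to 31 characters, so the name is truncated.
    """
    return f"Table {index + 1}"[:31]


def pdf_to_excel(pdf_path: str, xlsx_path: str) -> int:
    """Extract every table from a PDF and write each one to its own sheet.

    Returns the number of tables written.
    """
    # Imported inside the function so this module can be loaded
    # even on a machine where Java or tabula-py is not set up yet.
    import pandas as pd
    import tabula  # provided by the tabula-py package

    # With multiple_tables=True, read_pdf returns a list of DataFrames,
    # one per table detected across all pages.
    tables = tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)
    if not tables:
        raise ValueError(f"No tables found in {pdf_path}")

    # openpyxl is the engine pandas uses here to write .xlsx files.
    with pd.ExcelWriter(xlsx_path, engine="openpyxl") as writer:
        for i, table in enumerate(tables):
            table.to_excel(writer, sheet_name=sheet_name(i), index=False)
    return len(tables)
```

Calling `pdf_to_excel("report.pdf", "report.xlsx")` (both file names are placeholders) would produce one workbook with a sheet per detected table.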

[Fixed] How to use this tool

  1. After cloning, cd into backend and run app.py:
cd backend
python app.py
  2. In a separate terminal window, cd into frontend and run npm start:
cd frontend
npm start
  3. The webpage will load (likely at localhost:3000). Upload PDFs as required!
  4. Once finished, stop both servers by pressing Ctrl-C in each terminal.
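If juggling two terminals gets tedious, the same startup steps could be wrapped in one launcher script. The sketch below is only a suggestion and is not part of the repository: the names `build_commands` and `run_all` are hypothetical, and it assumes the backend and frontend folders sit directly under the repository root, as in the steps above.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def build_commands(repo_root: Path):
    """Return (command, working directory) pairs for the two servers."""
    # shutil.which resolves npm.cmd on Windows; fall back to plain "npm".
    npm = shutil.which("npm") or "npm"
    return [
        ([sys.executable, "app.py"], repo_root / "backend"),
        ([npm, "start"], repo_root / "frontend"),
    ]


def run_all(repo_root: Path) -> None:
    """Start both servers and keep them running until Ctrl-C."""
    procs = [
        subprocess.Popen(cmd, cwd=cwd) for cmd, cwd in build_commands(repo_root)
    ]
    try:
        for proc in procs:
            proc.wait()
    except KeyboardInterrupt:
        # Ctrl-C in this terminal: fall through and stop both servers.
        pass
    finally:
        for proc in procs:
            if proc.poll() is None:  # still running
                proc.terminate()
```

Running `run_all(Path("."))` from the repository root would start both processes, and a single Ctrl-C would stop them.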

How to use this tool [old]

  1. Add your PDF file to the repository. You can do this by dragging and dropping the file into the folder.

  2. Add an empty Excel file to the repository. You can do this by right-clicking on the file explorer and selecting New File. Name the file with the .xlsx extension.

  3. Run the Python script using the command

python pdf_to_excel.py

Things being worked through/considered

/pdf-parser-tool
├── /backend
│   ├── app.py                   # python script
│   ├── requirements.txt         # dependencies for python (e.g. tabula-py, pandas, etc.)
│   └── ...                      # other backend files
├── /frontend
│   ├── /public                  # public assets (index.html, favicon, etc.)
│   ├── /src
│   │   ├── /components          # react components (e.g., UploadForm, TableView)
│   │   ├── /hooks               # custom hooks (for API calls, etc.)
│   │   ├── /styles              # CSS
│   │   ├── App.tsx              # main react component
│   │   ├── index.tsx            # entry point for react
│   │   └── api.ts               # API functions to interact w/ backend
│   ├── package.json             # dependencies for frontend (react, typescript, etc.)
│   └── tsconfig.json            # typescript config
└── README.md                    # project docs

Possible Requirements

(Will be workshopped -- consider a requirements.txt)

  • pip install tabula-py
  • pip install JPype1
  • Install Java 64-Bit @ https://www.java.com/en/download/manual.jsp
  • Add it to your environment variables (JAVA_HOME)
  • Add it to your PATH (%JAVA_HOME%\bin)
  • e.g. "C:\Program Files (x86)\Java\jre1.8.0_421"
  • pip install openpyxl
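Since tabula depends on Java, a quick check before running the script can save some debugging. The helper below is a hedged sketch using only the standard library; the name `find_java` is hypothetical and it is not part of this repository. It looks for java on the PATH first, then under JAVA_HOME\bin.

```python
import os
import shutil
from pathlib import Path
from typing import Callable, Mapping, Optional


def find_java(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return a path to a java executable, or None if none is found."""
    # 1. Anything already on PATH wins.
    on_path = which("java")
    if on_path:
        return on_path
    # 2. Fall back to JAVA_HOME/bin, where the launcher normally lives.
    java_home = env.get("JAVA_HOME")
    if java_home:
        for name in ("java", "java.exe"):
            candidate = Path(java_home) / "bin" / name
            if candidate.is_file():
                return str(candidate)
    return None


if __name__ == "__main__":
    found = find_java()
    print(found if found else "Java not found: check JAVA_HOME and PATH")
```

The `env` and `which` parameters exist only so the lookup can be exercised without touching the real environment; calling `find_java()` with no arguments checks the current machine.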

Directory

Install tree (macOS example shown):

brew install tree

Use the tree command in the terminal to generate the directory listing and copy it to the clipboard:

tree -I 'node_modules|.git' --dirsfirst | pbcopy

Deployment History

  • 9/12: Deploy PDF Script

Credits

  • Front-end Design | [Name], [Name]
  • Back-end Design | [Name], [Name]

Thank you to [credit any inspiration, open source code, or advisors] for [X].

Powered by The Vanderbilt Hustler Data Team

For questions, comments or curiosities:

  • Hustler staff: Slack the #data team.
  • The rest of the 🌎: email Data Editor Katherine Oung