
NTU-SCSE/pdf-server

 
 


PDF Ebook Text Extraction & Cleaning Tool with RESTful Server

Context

PDF ebooks usually have a table of contents (TOC), so the text in a PDF ebook is naturally organized in a hierarchy. The text and hierarchy information extracted from PDF ebooks can be used for research in machine learning (especially hierarchical text classification) and natural language processing.

Project Scope

The purpose of this project is to provide a RESTful backend and an admin site to

  • Organize and process PDF files
  • Extract the table of contents (TOC) tree and store it in a relational database
  • Extract cleaned and lemmatized plain text
  • Maintain the hierarchical structure of the extracted texts according to the TOC
  • Generate WordCloud images for all chapters, sections, and sub-sections
  • Expose the TOC, plain text content (by chapter or section), WordCloud images and more through RESTful APIs

Skills & Tools Required

  • Adobe Acrobat (2015 DC or later) is required to pre-process the PDF files and convert them into HTML files split by bookmarks. The PDF and HTML files are then fed into the system.
  • Adequate knowledge of Django is required to maintain and further expand this project. Please make sure you understand Django well enough before reading the rest of this document.
  • Due to copyright issues, we cannot distribute PDF ebooks or any processed text extracted from them. This project contains only the source code, without any database or files. You need to set up your own database and connect to it before running the system.

Text Cleaning Techniques

The built-in cleaner performs the following operations sequentially on the extracted texts:

  1. Perform known replacements (e.g. "ﬁ" --> "fi"; if you don't see any difference, try to copy the first "ﬁ" as TWO letters, "f" and "i". You will realize you can't, because "ﬁ" is actually ONE Unicode character)
  2. Replace non-ascii characters with underscore ("_")
  3. Remove all URIs
  4. Remove all emails
  5. Remove stop words according to the MySQL stop word list
  6. Remove all punctuation marks
  7. Remove all digits
  8. Remove one-letter and two-letter words
  9. Remove redundant whitespace characters
  10. Lemmatize

You may customize the text cleaning procedure by adding/removing individual steps in the Cleaner class in crawler/cleaner.py. See more in How to Contribute.
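The ten steps above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the code in crawler/cleaner.py: the function name `clean`, the two tables, and the regular expressions are hypothetical, the stop word set is a tiny stand-in for the MySQL list, and lemmatization is left as a pluggable callable (for example one backed by NLTK's WordNet lemmatizer).

```python
import re
import string

# Hypothetical, abbreviated tables. The project's real replacement map and
# the full MySQL stop word list are not reproduced here.
KNOWN_REPLACEMENTS = {"\ufb01": "fi", "\ufb02": "fl"}
STOP_WORDS = {"the", "and", "for", "are", "with", "this", "that"}

# Maps every ASCII punctuation mark to a space.
PUNCTUATION_TO_SPACE = str.maketrans(
    string.punctuation, " " * len(string.punctuation)
)


def clean(text, lemmatize=lambda word: word):
    """Apply the ten cleaning steps in the documented order."""
    # 1. Known replacements, e.g. the one-character "fi" ligature.
    for old, new in KNOWN_REPLACEMENTS.items():
        text = text.replace(old, new)
    # 2. Replace non-ASCII characters with an underscore.
    text = re.sub(r"[^\x00-\x7f]", "_", text)
    # 3. Remove URIs (assumed pattern: http, https and ftp only).
    text = re.sub(r"\b(?:https?|ftp)://\S+", " ", text)
    # 4. Remove emails (assumed pattern: any token containing "@").
    text = re.sub(r"\S+@\S+", " ", text)
    # 5. Remove stop words, compared case-insensitively.
    text = " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)
    # 6. Remove punctuation marks.
    text = text.translate(PUNCTUATION_TO_SPACE)
    # 7. Remove digits.
    text = re.sub(r"\d", " ", text)
    # 8. Drop one-letter and two-letter words.
    words = [w for w in text.split() if len(w) > 2]
    # 9. Redundant whitespace disappears through split() and join().
    # 10. Lemmatize each remaining word with the supplied callable.
    return " ".join(lemmatize(w) for w in words)


if __name__ == "__main__":
    sample = "The \ufb01rst 3 chapters are at http://example.com, email me@example.com!"
    print(clean(sample))
```

Because the steps run in sequence, their order matters: in this sketch a stop word that is still attached to a punctuation mark (such as "the,") survives step 5, since punctuation is only removed in step 6.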

Getting Started

  1. Install Python 3
  2. Clone the project
  3. Install python dependencies:
    $ pip install -r requirements.txt
  4. Download WordNet data for NLTK
  5. (For Ubuntu Users ONLY) Install package libjpeg-dev
    $ sudo apt-get install libjpeg-dev
  6. Set up your own database and update the connection configuration in settings.py, then migrate.
  7. Run the development server
    $ python3 manage.py runserver
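Step 6 leaves the choice of database open. As a rough sketch, a connection entry in settings.py could look like the fragment below. PostgreSQL is only an assumed example of a relational backend that Django supports, and the database name, user, password, host and port are placeholders to replace with your own values.

```python
# settings.py (fragment). All values below are placeholders.
DATABASES = {
    "default": {
        # One of Django's built-in backends; others include
        # django.db.backends.mysql and django.db.backends.sqlite3.
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "pdf_server",
        "USER": "pdf_server_user",
        "PASSWORD": "change-me",
        "HOST": "127.0.0.1",
        "PORT": "5432",
    }
}
```

Once the connection works, running `python3 manage.py migrate` creates the tables.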

Admin Site

This is a standard admin site of Django. It can be accessed by the URL <your-domain>/admin/. If you are running the default Django development server, the complete URL is http://127.0.0.1:8000/admin/.

To log in to the admin site for the first time, you need to create a superuser. This can be done by

$ python3 manage.py createsuperuser

If this is new to you, please learn Django first before trying the following steps. See Introduction.

Create an Entry for the Book Model

The following fields are used when creating a new entry.

Field | Explanation | Required for New Entry
Title | The title of the PDF book | YES
Toc html path | The path of the HTML file generated by Adobe Acrobat. There must be a directory of the same name next to the HTML file. This is the default output format of Adobe Acrobat when converting PDF to HTML with the "Split by Bookmarks" option turned on. | YES
Target dir path | The path of the directory where the extracted, cleaned and structured plain text files will be stored. By default it's the target directory. | Optional

Please leave all other fields unmodified.
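Before creating the entry, it can help to confirm that the Acrobat output has the expected layout. The helper below is a hypothetical pre-flight check, not part of the project. It assumes that "a directory of the same name" means a directory named like the HTML file without its extension (for example book.html next to book/).

```python
from pathlib import Path


def check_toc_html(toc_html_path):
    """Return the sibling directory expected next to the TOC HTML file.

    Assumption: the directory is named like the HTML file minus its suffix.
    """
    html_file = Path(toc_html_path)
    if not html_file.is_file():
        raise FileNotFoundError(f"TOC HTML file not found: {html_file}")
    sibling_dir = html_file.with_suffix("")
    if not sibling_dir.is_dir():
        raise FileNotFoundError(f"Expected a directory at: {sibling_dir}")
    return sibling_dir
```

A call such as `check_toc_html("/data/books/book.html")` (an illustrative path) either returns the sibling directory or raises an error that names the missing path.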

Process a Book

After creating a Book entry,

  1. Go back to the model page /admin/book/book
  2. Tick the book that was newly created
  3. Select "Process book" in the "Action" dropdown menu
  4. Click "GO"

Processing usually takes from a few seconds to more than 10 minutes, depending on the structure and the length of the book. Most importantly, DO NOT close the browser tab or shut down the server while a book is being processed. Data integrity CANNOT be guaranteed (in other words, the system WILL fail) if the process is interrupted midway.

You can view the progress of the process in the terminal where you started the Django server.

RESTful API

Overview

The docs and an emulated client are available at http://<your-domain>/docs/

The root URL of all RESTful APIs is /api/v1 (e.g. the book-list API is at http://<your-domain>/api/v1/book/list/). There are two sub-groups of API endpoints: Book and Section.

Group | URL
Book | /book
Section | /section
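The URL scheme (root, then group, then endpoint, then an optional id) can be captured in a small helper. This is a client-side sketch. The function name `api_url` and the base URL are illustrative assumptions, and the Aggregate Content endpoint, whose suffix follows the id, is not covered by it.

```python
API_ROOT = "/api/v1"


def api_url(base_url, group, endpoint, pk=None):
    """Build an endpoint URL such as <base>/api/v1/book/list/."""
    path = f"{API_ROOT}/{group}/{endpoint}/"
    if pk is not None:
        # Endpoints that address one object take its id as the last segment.
        path += f"{pk}/"
    return base_url.rstrip("/") + path


if __name__ == "__main__":
    base = "http://127.0.0.1:8000"  # default Django development server
    print(api_url(base, "book", "list"))
    print(api_url(base, "section", "detail", 27))
```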

Book

Endpoint | URL | Method
List | /list/ | GET
Detail | /detail/{pk}/ | GET
TOC | /toc/{pk}/ | GET

List

Response example:

[
  {
    "id": 2,
    "title": "My Sample Handbook",
    "root_section": 25
  },
  {
    "id": 3,
    "title": "My Sample Textbook",
    "root_section": 72
  }
]

More details on the fields:

Field | Type | Explanation
id | int | The id of the book
title | string | The title of the book
root_section | int | The id of the root section that represents the entire book
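On the client side, the list response maps naturally onto a small record type. The sketch below parses the example payload shown above. The `Book` dataclass and `parse_book_list` are hypothetical names for illustration, not part of the server code.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:
    id: int
    title: str
    root_section: int


def parse_book_list(payload):
    """Turn the JSON text returned by /book/list/ into Book records."""
    return [
        Book(
            id=item["id"],
            title=item["title"],
            root_section=item["root_section"],
        )
        for item in json.loads(payload)
    ]


if __name__ == "__main__":
    example = """[
      {"id": 2, "title": "My Sample Handbook", "root_section": 25},
      {"id": 3, "title": "My Sample Textbook", "root_section": 72}
    ]"""
    for book in parse_book_list(example):
        print(book.id, book.title, book.root_section)
```

The root_section value is the id to pass to the Section endpoints when the whole book is wanted.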

Detail

Parameters in URL:

Parameter | Type | Explanation
pk | int | The id of the book

Response example:

{
  "id": 7,
  "title": "Digital Signal Processing System Analysis and Design",
  "root_section": 1011
}

This is a single element of the list returned by the /book/list/ API.

TOC

Parameters in URL:

Parameter | Type | Explanation
pk | int | The id of the book

Response example:

{
	"title": "My Sample Handbook",
	"slugified": "my-sample-handbook",
	"id": 25,
	"children": [
		{
			"title": "Chapter 1",
			"slugified": "chapter-1",
			"id": 26,
			"children": [
				{
					"title": "Section 1.1",
					"slugified": "section-1-1",
					"id": 27,
					"children": []
				},
				{
					"title": "Section 1.2",
					"slugified": "section-1-2",
					"id": 28,
					"children": []
				}
			]
		},
		{
			"title": "Chapter 2",
			"slugified": "chapter-2",
			"id": 29,
			"children": []
		}
	]
}

This is a nested, recursive JSON object that represents the table of contents tree. Each node in the tree has the following fields:

Field | Type | Explanation
title | string | The title of the section
slugified | string | The slugified title of the section
id | int | The id of the section
children | array | An array of the immediate child nodes of the current node
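Because every node has the same four fields, the tree can be walked with a short recursive generator. The sketch below is a client-side illustration (the name `iter_sections` is an assumption). It visits the example response above in pre-order, which matches reading order in the book.

```python
def iter_sections(node, depth=0):
    """Yield (depth, id, title) for a TOC node and all its descendants."""
    yield depth, node["id"], node["title"]
    for child in node["children"]:
        yield from iter_sections(child, depth + 1)


if __name__ == "__main__":
    toc = {
        "title": "My Sample Handbook", "slugified": "my-sample-handbook",
        "id": 25,
        "children": [
            {"title": "Chapter 1", "slugified": "chapter-1", "id": 26,
             "children": [
                 {"title": "Section 1.1", "slugified": "section-1-1",
                  "id": 27, "children": []},
                 {"title": "Section 1.2", "slugified": "section-1-2",
                  "id": 28, "children": []},
             ]},
            {"title": "Chapter 2", "slugified": "chapter-2", "id": 29,
             "children": []},
        ],
    }
    # Print an indented outline, two spaces per level.
    for depth, section_id, title in iter_sections(toc):
        print("  " * depth + f"{title} (id={section_id})")
```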

Section

Endpoint | URL
Detail | /detail/{pk}/
Children | /children/{pk}/
Word Cloud | /wordcloud/{pk}/
Content | /content/{pk}/
Aggregate Content | /content/{pk}/aggregate/

Detail

Parameters in URL:

Parameter | Type | Explanation
pk | int | The id of the section

Response example:

{
	"title": "Section 1.1",
	"slugified": "section-1-1",
	"id": 27,
	"has_children": false
}

Children

Parameters in URL:

Parameter | Type | Explanation
pk | int | The id of the section

The response is an array of objects, each in the same format as the response of the /detail/{pk}/ API.

Word Cloud

Parameters in URL:

Parameter | Type | Explanation
pk | int | The id of the section

The response is a JPEG image (HTTP header Content-Type: image/jpeg): a word cloud generated from the aggregated text of the section itself and all its descendants in the TOC tree.
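Since this endpoint returns image bytes rather than JSON, a client has to save the body instead of parsing it. A minimal sketch using only the standard library follows. The function name, the base URL argument and the injectable `opener` parameter (which makes the function testable without a running server) are assumptions for illustration.

```python
import urllib.request
from pathlib import Path


def fetch_wordcloud(base_url, pk, dest, opener=urllib.request.urlopen):
    """Download the word cloud of section `pk` and save it to `dest`.

    Returns the number of bytes written.
    """
    url = f"{base_url.rstrip('/')}/api/v1/section/wordcloud/{pk}/"
    with opener(url) as response:
        content_type = response.headers.get("Content-Type", "")
        # The endpoint is documented to return image/jpeg.
        if not content_type.startswith("image/jpeg"):
            raise ValueError(f"Unexpected Content-Type: {content_type!r}")
        data = response.read()
    Path(dest).write_bytes(data)
    return len(data)
```

With the development server running, `fetch_wordcloud("http://127.0.0.1:8000", 27, "section-27.jpg")` would save the image for section 27.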

Content

Parameters in URL:

Parameter | Type | Explanation
pk | int | The id of the section

Response example:

{
	"content": "processed text content"
}

The response includes ONLY the immediate text of the section itself.

Please also note that the text returned has already gone through the entire text cleaning and lemmatization procedure.

Aggregate Content

Parameters in URL:

Parameter | Type | Explanation
pk | int | The id of the section

Response example:

{
	"content": "processed text content including descendents"
}

The response includes the cleaned and lemmatized text of the section itself and ALL its descendants in the TOC tree.
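The difference between Content and Aggregate Content can be illustrated with a small recursive function over a TOC node. This is a conceptual sketch only. The order in which the server concatenates the texts and the separator it uses are not documented here, so the pre-order traversal and the single-space join below are assumptions.

```python
def aggregate_content(node, immediate_text):
    """Join a section's own text with the text of all its descendants.

    `immediate_text` maps a section id to that section's own content,
    i.e. what the Content endpoint would return for it.
    """
    parts = [immediate_text.get(node["id"], "")]
    for child in node["children"]:
        parts.append(aggregate_content(child, immediate_text))
    # Skip empty pieces so that no double spaces appear.
    return " ".join(part for part in parts if part)


if __name__ == "__main__":
    chapter = {
        "id": 26,
        "children": [
            {"id": 27, "children": []},
            {"id": 28, "children": []},
        ],
    }
    texts = {26: "intro", 27: "alpha", 28: "beta"}
    print(aggregate_content(chapter, texts))
```

For a leaf section the two endpoints return the same text, since there are no descendants to add.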

How to Contribute

* to be continued *
