PDF Ebook Text Extraction & Cleaning Tool with RESTful Server

Structure of This Document

Introduction
Getting Started
Admin Site
RESTful API
How to Contribute

Introduction

Context

PDF ebooks usually have a table of contents (TOC), so the text in a PDF ebook naturally follows a hierarchy. The text and hierarchy information extracted from PDF ebooks can be used for research in machine learning (especially hierarchical text classification) and natural language processing.

Project Scope

The purpose of this project is to provide a RESTful backend and an admin site to

Organize and process PDF files
Extract the table of contents (TOC) tree and store it in a relational database
Extract cleaned and lemmatized plain text
Maintain the hierarchical structure of the extracted text according to the TOC
Generate WordCloud images for all chapters, sections, and sub-sections
Access the TOC, plain text content (by chapter or section), WordCloud images and more through RESTful APIs

Skills & Tools Required

Adobe Acrobat (2015 DC or later) is required to pre-process the PDF files and convert them into HTML files split by bookmarks. The PDF and HTML files are then fed into the system.
Adequate knowledge of Django is required to maintain and extend this project. Please make sure you understand Django well enough before reading the rest of this document.
Due to copyright issues, we cannot distribute PDF ebooks or any processed text extracted from them. This project contains only the source code, without any database or files. You need to set up your own database and connect to it before running the system.

Text Cleaning Techniques

The built-in cleaner performs the following operations sequentially on the extracted texts:

Perform known replacements (e.g. "ﬁ" --> "fi"; if you don't see any difference, try copying the first "ﬁ" as TWO letters, "f" and "i". You will find you can't, because "ﬁ" is actually ONE Unicode character, the ligature U+FB01)
Replace non-ASCII characters with an underscore ("_")
Remove all URIs
Remove all emails
Remove stop words according to the MySQL stop word list
Remove all punctuation marks
Remove all digits
Remove one-letter and two-letter words
Remove redundant whitespace characters
Lemmatize

You may customize the text cleaning procedure by adding/removing individual steps in the Cleaner class in crawler/cleaner.py. See more in How to Contribute.
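
The sequence above can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual Cleaner class: the function name `clean`, the tiny `STOP_WORDS` set (standing in for the MySQL list), the two regular expressions, and the pluggable `lemmatize` callable are all assumptions made for this example.

```python
import re
import string

# Hypothetical, tiny stand-in for the MySQL stop word list.
STOP_WORDS = {"the", "and", "are", "for", "with", "this", "that"}

# Known replacements: single-character ligatures mapped to plain letters.
REPLACEMENTS = {"\ufb01": "fi", "\ufb02": "fl"}

# Simplified patterns, assumed for this sketch only.
URI_RE = re.compile(r"\b(?:https?|ftp)://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean(text, lemmatize=lambda word: word):
    """Apply the cleaning steps in the order listed above."""
    for old, new in REPLACEMENTS.items():          # 1. known replacements
        text = text.replace(old, new)
    text = "".join(ch if ord(ch) < 128 else "_" for ch in text)  # 2. non-ASCII
    text = URI_RE.sub(" ", text)                   # 3. URIs
    text = EMAIL_RE.sub(" ", text)                 # 4. emails
    words = [w for w in text.split() if w.lower() not in STOP_WORDS]  # 5. stop words
    text = " ".join(words).translate(PUNCT_TABLE)  # 6. punctuation
    text = re.sub(r"\d", " ", text)                # 7. digits
    words = [w for w in text.split() if len(w) > 2]  # 8. one- and two-letter words
    # 9. split() above already collapsed redundant whitespace
    return " ".join(lemmatize(w) for w in words)   # 10. lemmatize


print(clean("The \ufb01rst draft is at http://example.com and mail me@example.com about 42 cats."))
# first draft mail about cats
```

In a real setup, the `lemmatize` argument could be something like NLTK's `WordNetLemmatizer().lemmatize`, which needs the WordNet data mentioned in Getting Started.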

Getting Started

Install Python 3
Clone the project
Install Python dependencies:
$ pip install -r requirements.txt
Download WordNet data for NLTK
(For Ubuntu Users ONLY) Install package libjpeg-dev
$ sudo apt-get install libjpeg-dev
Set up your own database and update the connection configuration in settings.py, then migrate.
Run the development server
$ python3 manage.py runserver
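
For the database step, a settings.py fragment along these lines is one possibility. The PostgreSQL engine, the environment variable names (`EBOOK_DB_*`), and the default values are assumptions for illustration; use whatever database backend and credentials fit your environment.

```python
import os

# Hypothetical example: PostgreSQL configured through environment variables.
# The variable names and defaults are placeholders, not part of the project.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("EBOOK_DB_NAME", "ebooks"),
        "USER": os.environ.get("EBOOK_DB_USER", "ebooks"),
        "PASSWORD": os.environ.get("EBOOK_DB_PASSWORD", ""),
        "HOST": os.environ.get("EBOOK_DB_HOST", "127.0.0.1"),
        "PORT": os.environ.get("EBOOK_DB_PORT", "5432"),
    }
}
```

After updating the configuration, the standard Django command `python3 manage.py migrate` creates the tables.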

Admin Site

This is the standard Django admin site. It can be accessed at <your-domain>/admin/. If you are running the default Django development server, the complete URL is http://127.0.0.1:8000/admin/.

Before logging in to the admin site for the first time, you may need to create a superuser. This can be done by

$ python3 manage.py createsuperuser

If this is new to you, please learn Django before trying the following steps. See Introduction.

Create an Entry for the `Book` Model

Fill in the following fields to create a new entry.

Field	Explanation	Required for New Entry
Title	The title of the PDF book	YES
Toc html path	The path of the html file generated by Adobe Acrobat. There must be a directory of the same name next to the html file. This is the default output format of Adobe Acrobat when converting PDF to HTML with the "Split by Bookmarks" option turned on.	YES
Target dir path	The path of the directory where the extracted, cleaned and structured plain text files will be stored. By default it's the `target` directory.	Optional

Please leave all other fields unmodified.
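
Before creating the entry, it can help to check that the Acrobat output has the layout described in the table. The sketch below is a hypothetical helper, not part of the project; it assumes that "a directory of the same name" means the HTML file name without its extension.

```python
from pathlib import Path


def check_toc_html_path(toc_html_path):
    """Return the sibling directory of an Acrobat HTML export, or raise."""
    html_file = Path(toc_html_path)
    if not html_file.is_file():
        raise FileNotFoundError(f"HTML file not found: {html_file}")
    # Assumption: "same name" means the file name without its extension,
    # e.g. book.html sits next to a directory called book.
    sibling_dir = html_file.with_suffix("")
    if not sibling_dir.is_dir():
        raise FileNotFoundError(f"Expected directory not found: {sibling_dir}")
    return sibling_dir
```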

Process a Book

After creating a Book entry,

Go back to the model page /admin/book/book
Tick the book that was newly created
Select "Process book" in the "Action" dropdown menu
Click "GO"

Processing takes from a few seconds to more than 10 minutes in most cases, depending on the structure and length of the book. Most importantly, DO NOT close the browser tab or shut down the server while a book is being processed. Data integrity cannot be preserved (in other words, the system WILL fail) if the process is interrupted midway.

You can view the progress of the process in the terminal where you started the Django server.

RESTful API

Overview

The docs and an emulated client are available at http://<your-domain>/docs/

The root URL of all RESTful APIs is /api/v1 (e.g. the book-list API is at http://<your-domain>/api/v1/book/list/). There are two sub-groups of API endpoints: Book and Section.

Group	URL
Book	`/book`
Section	`/section`
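
As a rough illustration of how the root URL, group, and endpoint fit together, the helper below composes full URLs. The function name `build_url` and its parameters are made up for this example; only the URL pieces come from this document.

```python
API_ROOT = "/api/v1"


def build_url(base, group, endpoint, pk=None, suffix=None):
    """Compose an endpoint URL such as <base>/api/v1/book/detail/7/."""
    parts = [group, endpoint]
    if pk is not None:
        parts.append(str(pk))
    if suffix is not None:
        parts.append(suffix)
    return base.rstrip("/") + API_ROOT + "/" + "/".join(parts) + "/"


print(build_url("http://127.0.0.1:8000", "book", "list"))
# http://127.0.0.1:8000/api/v1/book/list/
print(build_url("http://127.0.0.1:8000", "section", "content", pk=27, suffix="aggregate"))
# http://127.0.0.1:8000/api/v1/section/content/27/aggregate/
```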

Book

Endpoint	URL	Method
List	`/list/`	GET
Detail	`/detail/{pk}/`	GET
TOC	`/toc/{pk}/`	GET

List

Response example:

[
  {
    "id": 2,
    "title": "My Sample Handbook",
    "root_section": 25
  },
  {
    "id": 3,
    "title": "My Sample Textbook",
    "root_section": 72
  }
]

More details on the fields:

Field	Type	Explanation
`id`	`int`	The id of the book
`title`	`string`	The title of the book
`root_section`	`int`	The id of the root section that represents the entire book
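
A client might consume this endpoint roughly as follows. The helper names `fetch_json` and `titles_by_root_section` are assumptions for this sketch, and the demonstration runs offline against the response example above rather than a live server.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """GET a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def titles_by_root_section(books):
    """Map each book's root_section id to its title."""
    return {book["root_section"]: book["title"] for book in books}


# Offline demonstration using the response example shown above.
sample = json.loads("""
[
  {"id": 2, "title": "My Sample Handbook", "root_section": 25},
  {"id": 3, "title": "My Sample Textbook", "root_section": 72}
]
""")
print(titles_by_root_section(sample))
# {25: 'My Sample Handbook', 72: 'My Sample Textbook'}
```

Against a running development server, the same data would come from `fetch_json("http://127.0.0.1:8000/api/v1/book/list/")`.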

Detail

Parameters in URL:

Parameter	Type	Explanation
`pk`	`int`	The id of the book

Response example:

{
  "id": 7,
  "title": "Digital Signal Processing System Analysis and Design",
  "root_section": 1011
}

This is a single element of the list returned by the /book/list/ API.

TOC

Response example:

{
	"title": "My Sample Handbook",
	"slugified": "my-sample-handbook",
	"id": 25,
	"children": [
		{
			"title": "Chapter 1",
			"slugified": "chapter-1",
			"id": 26,
			"children": [
				{
					"title": "Section 1.1",
					"slugified": "section-1-1",
					"id": 27,
					"children": []
				},
				{
					"title": "Section 1.2",
					"slugified": "section-1-2",
					"id": 28,
					"children": []
				}
			]
		},
		{
			"title": "Chapter 2",
			"slugified": "chapter-2",
			"id": 29,
			"children": []
		}
	]
}

This is a nested, recursive JSON object that represents the table of contents tree. Each node in the tree has the following fields:

Field	Type	Explanation
`title`	`string`	The title of the section
`slugified`	`string`	The slugified title of the section
`id`	`int`	The id of the section
`children`	`array`	An array of immediate children nodes of the current node
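
Because every node has the same four fields, the tree can be walked with a short recursive generator. The sketch below uses the response example from this section; the function name `iter_toc` is an assumption for illustration.

```python
def iter_toc(node, depth=0):
    """Yield (depth, id, title) for a node and all its descendants, pre-order."""
    yield depth, node["id"], node["title"]
    for child in node["children"]:
        yield from iter_toc(child, depth + 1)


toc = {
    "title": "My Sample Handbook", "slugified": "my-sample-handbook", "id": 25,
    "children": [
        {"title": "Chapter 1", "slugified": "chapter-1", "id": 26, "children": [
            {"title": "Section 1.1", "slugified": "section-1-1", "id": 27, "children": []},
            {"title": "Section 1.2", "slugified": "section-1-2", "id": 28, "children": []},
        ]},
        {"title": "Chapter 2", "slugified": "chapter-2", "id": 29, "children": []},
    ],
}

# Print an indented outline of the book.
for depth, section_id, title in iter_toc(toc):
    print("  " * depth + f"{title} (id={section_id})")
```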

Section

Endpoint	URL
Detail	`/detail/{pk}/`
Children	`/children/{pk}/`
Word Cloud	`/wordcloud/{pk}/`
Content	`/content/{pk}/`
Aggregate Content	`/content/{pk}/aggregate/`

Detail

Parameters in URL:

Parameter	Type	Explanation
`pk`	`int`	The id of the section

Response example:

{
	"title": "Section 1.1",
	"slugified": "section-1-1",
	"id": 27,
	"has_children": false
}

Children

Parameters in URL:

Parameter	Type	Explanation
`pk`	`int`	The id of the section

The response is an array of objects, each in the same format as the response of the /detail/{pk}/ API.
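
One way to use this endpoint is to walk a subtree one level at a time, skipping requests for sections whose `has_children` is false. The sketch below assumes that the endpoint returns only immediate children; `collect_section_ids`, `fetch_children`, and the `FAKE_CHILDREN` data are hypothetical stand-ins so the example runs without a server.

```python
def collect_section_ids(pk, has_children, fetch_children):
    """Collect the ids of a section and all its descendants."""
    ids = [pk]
    if has_children:
        for child in fetch_children(pk):
            ids.extend(
                collect_section_ids(child["id"], child["has_children"], fetch_children)
            )
    return ids


# Offline stand-in for GET /api/v1/section/children/{pk}/ (hypothetical data).
FAKE_CHILDREN = {
    26: [
        {"title": "Section 1.1", "slugified": "section-1-1", "id": 27, "has_children": False},
        {"title": "Section 1.2", "slugified": "section-1-2", "id": 28, "has_children": False},
    ],
}

print(collect_section_ids(26, True, lambda pk: FAKE_CHILDREN[pk]))
# [26, 27, 28]
```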

Word Cloud

Parameters in URL:

Parameter	Type	Explanation
`pk`	`int`	The id of the section

The response is a JPEG image (HTTP header Content-Type: image/jpeg): a word cloud generated from the aggregated text of the section itself and all its descendants in the TOC tree.
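
Since this endpoint returns binary image data rather than JSON, a client needs to handle it differently. The following is a minimal sketch; the helper names `looks_like_jpeg` and `save_wordcloud` are assumptions for this example.

```python
from pathlib import Path
from urllib.request import urlopen


def looks_like_jpeg(data):
    """JPEG data starts with the two-byte start-of-image marker FF D8."""
    return data[:2] == b"\xff\xd8"


def save_wordcloud(url, target_path):
    """Download a word cloud image and write it to target_path."""
    with urlopen(url) as response:
        content_type = response.headers.get("Content-Type", "")
        data = response.read()
    if not content_type.startswith("image/jpeg") or not looks_like_jpeg(data):
        raise ValueError(f"Unexpected response from {url}")
    Path(target_path).write_bytes(data)
    return len(data)
```

With a running development server, a call such as `save_wordcloud("http://127.0.0.1:8000/api/v1/section/wordcloud/27/", "section-27.jpg")` would store the image locally.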

Content

Parameters in URL:

Parameter	Type	Explanation
`pk`	`int`	The id of the section

Response example:

{
	"content": "processed text content"
}

The response includes ONLY the immediate text of the section itself.

Please also note that the returned text has already gone through the entire text cleaning and lemmatization procedure.

Aggregate Content

Parameters in URL:

Parameter	Type	Explanation
`pk`	`int`	The id of the section

Response example:

{
	"content": "processed text content including descendents"
}

The response includes the cleaned and lemmatized text of the section itself and ALL its descendants in the TOC tree.
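
The relationship between Content and Aggregate Content can be pictured with a small in-memory tree. This is only a sketch: the server reads its data from the database, and the traversal order (section first, then descendants depth-first) and the single-space separator are assumptions made here.

```python
def aggregate_content(node):
    """Join a node's own text with the text of all its descendants."""
    parts = [node["content"]]
    for child in node["children"]:
        parts.append(aggregate_content(child))
    # Skip empty strings so that sections without own text add no extra spaces.
    return " ".join(part for part in parts if part)


# Hypothetical in-memory tree; "content" holds each section's own cleaned text.
chapter = {
    "content": "signal processing overview",
    "children": [
        {"content": "discrete fourier transform", "children": []},
        {"content": "", "children": [
            {"content": "filter design", "children": []},
        ]},
    ],
}

print(aggregate_content(chapter))
# signal processing overview discrete fourier transform filter design
```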

How to Contribute

* to be continued *
