[Dev] Support downloading and parsing books from smashwords.com #3

@AbrahamSanders

Smashwords (smashwords.com) was the source of the original BookCorpus dataset, built for the 2015 paper "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books".

We should support Smashwords as an alternate source of books, since it can provide more modern works than those in Project Gutenberg. Dialogs and narratives written in a modern style are essential for training a model that works well with the way people speak and write today.

This likely involves:

  1. Implementing a downloader for smashwords.com
  2. Implementing an adapter, if necessary, to format the downloaded texts in the way that gutenberg-dialog's pipeline expects.

https://github.com/soskek/bookcorpus may be a good starting point, as it implements a crawler for Smashwords.
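
As a rough illustration of the two steps listed above, here is a minimal Python sketch of a downloader plus a text-normalizing adapter. Everything in it is an assumption for discussion: the function names (`fetch_url`, `normalize_text`, `download_books`) are hypothetical, the list of plain-text book URLs is assumed to come from a separate crawler step, and the output layout (one numbered `.txt` file per book) is a guess, not a confirmed description of what gutenberg-dialog's pipeline expects.

```python
import re
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, List


def fetch_url(url: str, timeout: float = 30.0) -> str:
    """Download one plain-text book and decode it as UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def normalize_text(raw: str) -> str:
    """Hypothetical adapter step: unify newlines, trim trailing
    whitespace, and collapse runs of blank lines into one."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"


def download_books(
    urls: Iterable[str],
    out_dir: Path,
    fetch: Callable[[str], str] = fetch_url,
) -> List[Path]:
    """Fetch each URL, normalize the text, and write it to
    out_dir/<index>.txt. Failed downloads are skipped."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, url in enumerate(urls):
        try:
            raw = fetch(url)
        except OSError as error:  # urllib.error.URLError subclasses OSError
            print(f"skipping {url}: {error}")
            continue
        path = out_dir / f"{index}.txt"
        path.write_text(normalize_text(raw), encoding="utf-8")
        written.append(path)
    return written
```

The `fetch` parameter is injectable so that the adapter logic can be tested without network access. A real implementation would also need rate limiting and whatever metadata (language, author, title) the existing pipeline relies on.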

Labels: enhancement (new feature or request)