coleygroup/PolymerLit

PolymerLit

The data repo for polymer image recognition

PolymerLit-MT

This subset contains 300 images with corresponding molblocks from the BigSMILES Machine Translation (MT) paper by Deagen et al. [1]. The authors have made the raw data publicly available. Some molblocks contained minor inconsistencies: aromatic bonds were recorded as type 4 (i.e., non-kekulized), whereas the images displayed alternating single and double bonds (i.e., kekulized). We manually corrected all of these molblocks.
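The inconsistency described above can be spotted mechanically. The following is a minimal sketch, not the procedure used for this dataset: it assumes a plain V2000 molblock with fixed-width columns, and the function name `aromatic_bonds` and the toy molblock are hypothetical. It lists every bond whose type field is 4, which are the bonds one would then compare against the image.

```python
def aromatic_bonds(molblock: str) -> list[tuple[int, int]]:
    """Return (atom1, atom2) pairs for bonds recorded as type 4 in a V2000 molblock."""
    lines = molblock.splitlines()
    counts = lines[3]                 # 4th line is the counts line
    n_atoms = int(counts[0:3])        # columns 1-3: number of atoms
    n_bonds = int(counts[3:6])        # columns 4-6: number of bonds
    first_bond = 4 + n_atoms          # bond block follows the atom block
    found = []
    for line in lines[first_bond:first_bond + n_bonds]:
        atom1, atom2, bond_type = int(line[0:3]), int(line[3:6]), int(line[6:9])
        if bond_type == 4:            # 4 = aromatic (non-kekulized)
            found.append((atom1, atom2))
    return found


# Toy molblock (hypothetical, not taken from the dataset): three atoms, two bonds.
EXAMPLE = "\n".join([
    "example",
    "  sketch",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.4000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.8000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  4  0",
    "  2  3  1  0",
    "M  END",
])

print(aromatic_bonds(EXAMPLE))
```

A molblock that matches a kekulized drawing should yield an empty list here.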

PolymerLit-Olsen

This subset contains 468 images with corresponding molblocks from three publications by the Olsen group at MIT [2,3,4]. The manuscripts and SIs are all open access, but the images were originally unlabeled. We used our PolymerScribe model to predict the molblocks, which were then manually corrected.

PolymerLit-OA

This subset contains 1,000 images with corresponding molblocks from Open-Access (OA) articles. We selected images only from articles published under CC-BY-NC-ND or less restrictive licenses. Because some of these licenses include the ND (NoDerivatives) clause, we release the images exactly as they originally appear, together with the coordinates of the bounding polygons surrounding the polymer structures. The bounding polygons were drawn manually with the help of the open-source tool CVAT.ai, and the "cropped" images can be reconstructed with the provided script:

$ pip install pillow
$ python generate_cropped_images.py

The "cropped" images are written to PolymerLit-OA_processed, with filenames that should match the provided molblocks (doi_suffix.corrected.mol). As with the Olsen subset, these molblocks were predicted by PolymerScribe and then manually corrected. The references for all images and their licenses are recorded in PolymerLit-OA_refs.xlsx.
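For readers curious how a polygon-based crop can work in principle, here is a minimal sketch using Pillow. It is not the contents of generate_cropped_images.py. It assumes the polygon is available as a list of (x, y) pixel coordinates, and the function names `polygon_bbox` and `crop_polygon` are hypothetical. Pixels outside the polygon are painted white before the image is cropped to the polygon's bounding box.

```python
import math


def polygon_bbox(points):
    """Axis-aligned bounding box (left, upper, right, lower) of a polygon, in whole pixels."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (math.floor(min(xs)), math.floor(min(ys)),
            math.ceil(max(xs)), math.ceil(max(ys)))


def crop_polygon(image_path, points, out_path):
    """Blank out everything outside the polygon, then crop to its bounding box."""
    # Imported lazily so polygon_bbox works without Pillow installed.
    from PIL import Image, ImageDraw

    image = Image.open(image_path).convert("RGB")
    # Mask is 255 inside the polygon and 0 elsewhere.
    mask = Image.new("L", image.size, 0)
    ImageDraw.Draw(mask).polygon([tuple(p) for p in points], fill=255)
    white = Image.new("RGB", image.size, (255, 255, 255))
    # Keep original pixels where the mask is set, white everywhere else.
    result = Image.composite(image, white, mask).crop(polygon_bbox(points))
    result.save(out_path)


# Hypothetical polygon, for illustration only.
example_points = [(10, 20), (50, 25), (40, 80), (5, 60)]
print(polygon_bbox(example_points))
```

With an image on disk, a call such as `crop_polygon("page.png", example_points, "cropped.png")` would write the masked crop. Both filenames are placeholders.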
