Commit e29031d

author
Julien LEROUGE
committed
Initial release of DocXPand tool
1 parent 2177e4d commit e29031d

147 files changed

Lines changed: 215439 additions & 3 deletions

DATASET_LICENSE

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-placeholder
+The synthetic ID document images dataset ("DocXPand-25k"), released alongside this tool, is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

Dockerfile

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+FROM nvidia/cuda:11.7.1-cudnn8-devel-ubuntu22.04
+RUN apt-get -y update && DEBIAN_FRONTEND=noninteractive apt-get -y install --no-install-recommends git curl make cmake xz-utils pkg-config build-essential wget locales libxi-dev libxrandr-dev libfreetype6-dev libfontconfig1-dev python3.10-dev libjpeg-dev libcairo2-dev liblcms2-dev libboost-dev libopenjp2-7-dev libopenjp2-tools libleptonica-dev imagemagick qpdf pdftk libdmtx0b mesa-common-dev libgl1-mesa-dev libglu1-mesa-dev libgl1-mesa-glx libmagic1
+# poetry environment variable (https://python-poetry.org/docs/configuration/#using-environment-variables)
+ENV POETRY_VERSION=1.6.1 \
+    # make poetry install to this location
+    POETRY_HOME="/opt/poetry" \
+    # avoid poetry creating virtual environment in the project's root
+    POETRY_VIRTUALENVS_IN_PROJECT=false \
+    # do not ask any interactive question
+    POETRY_NO_INTERACTION=1
+
+# install poetry - respects $POETRY_VERSION & $POETRY_HOME
+RUN curl -sSL https://install.python-poetry.org | python3 -
+ENV PATH="${POETRY_HOME}/bin:$PATH"
+COPY . /app
+WORKDIR /app
+RUN poetry install

LICENCE renamed to LICENSE

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2023 QuickSign
+Copyright (c) 2024 QuickSign
 
 Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
 

README.md

Lines changed: 38 additions & 1 deletion
@@ -1 +1,38 @@
-Synthetic identity documents dataset
+# Requirements
+* [Python](https://www.python.org/downloads/) 3.9 or 3.10
+* [Poetry](https://python-poetry.org/)
+* [Chrome](https://www.google.com/chrome/) and the corresponding [webdriver](https://googlechromelabs.github.io/chrome-for-testing/)
+* Stable Diffusion for face generation, see [stable_diffusion](stable_diffusion/README.md)
+# Functionalities
+
+This repository exposes functions to generate documents using templates and generators, contained in [docxpand/templates](docxpand/templates):
+
+* Templates are SVG files, containing information about the appearance of the documents to generate, i.e. their backgrounds, the fields contained in the document, the positions of these fields, etc.
+* Generators are JSON files, containing information on how to generate the content of the fields.
+
+This repository allows you to:
+* Generate documents for known templates ([id_card_td1_a](docxpand/templates/id_card_td1_a), [id_card_td1_b](docxpand/templates/id_card_td1_b), [id_card_td2_a](docxpand/templates/id_card_td2_a), [id_card_td2_b](docxpand/templates/id_card_td2_b), [pp_td3_a](docxpand/templates/pp_td3_a), [pp_td3_b](docxpand/templates/pp_td3_b), [pp_td3_c](docxpand/templates/pp_td3_c), [rp_card_td1](docxpand/templates/rp_card_td1) and [rp_card_td2](docxpand/templates/rp_card_td2)), by filling the templates with random fake information.
+  - These templates are inspired by European ID cards, passports and residence permits. Their formats follow
+    [ISO/IEC 7810](https://en.wikipedia.org/wiki/ISO/IEC_7810), and they contain a machine-readable zone (MRZ) that follows the [Machine Readable Travel Documents Specifications](https://www.icao.int/publications/Documents/9303_p3_cons_en.pdf).
+  - To generate documents, use the [generate_fake_structured_documents.py](scripts/dataset/generate_fake_structured_documents.py) script, which takes as input the name of one of the templates, the number of fake documents to generate, an output directory, the URL of a service that serves generated photos of human faces using [stable diffusion](stable_diffusion/README.md), and a [chrome webdriver](https://googlechromelabs.github.io/chrome-for-testing/) matching the version of your installed Chrome browser.
+* Integrate generated documents into scenes, to replace other documents originally present in the scenes.
+  - This requires a dataset of background scenes usable for this task, with the coordinates of the original documents to replace with generated fake documents.
+  - To integrate documents, use the [insert_generated_documents_in_scenes.py](scripts/dataset/insert_generated_documents_in_scenes.py) script, which takes as input the directory containing the generated document images, a JSON dataset containing information about those document images (generated by the script above), the directory containing "scene" (background) images, a JSON dataset containing localization information, and an output directory to store the final images. The background scene images must contain images that are present in the [docxpand/specimens](docxpand/specimens) directory. See the [SOURCES.md](docxpand/specimens/SOURCES.md) file for more information.
+  - All JSON datasets must follow the `DocFakerDataset` format, defined in [docxpand/dataset.py](docxpand/dataset.py).
+
+## Installation
+
+Run
+
+    poetry install
+
+## Usage
+
+To generate documents, run:
+
+    poetry run python scripts/dataset/generate_fake_structured_documents.py -n <number_to_generate> -o <output_directory> -t <template_to_use> -w <path_to_chrome_driver>
+
+To insert documents into target images, run:
+
+    poetry run python scripts/dataset/insert_generated_documents_in_scenes.py -di <document_images_directory> -dd <documents_dataset> -sd <scene_images_directory> -sd <scenes_dataset> -o <output_directory>
+
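
The README states that the generated MRZ follows the ICAO Machine Readable Travel Documents specifications. As a rough illustration of what that implies, the sketch below computes an ICAO 9303 check digit (weights 7, 3, 1, sum modulo 10). It is not taken from the DocXPand sources; the function name `mrz_check_digit` is hypothetical, and the sample values are the ones from the well-known ICAO specimen passport.

```python
def mrz_check_digit(field: str) -> int:
    """Compute the ICAO 9303 check digit of an MRZ field (illustrative sketch)."""
    weights = (7, 3, 1)  # repeating weight pattern defined by ICAO 9303
    total = 0
    for index, char in enumerate(field):
        if "0" <= char <= "9":
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - ord("A") + 10  # A=10, B=11, ..., Z=35
        elif char == "<":
            value = 0  # the filler character counts as zero
        else:
            raise ValueError(f"Invalid MRZ character: {char!r}")
        total += value * weights[index % 3]
    return total % 10


if __name__ == "__main__":
    # Fields from the ICAO specimen: document number, birth date, expiry date.
    for sample in ("L898902C3", "740812", "120415"):
        print(sample, mrz_check_digit(sample))
```

A validator for generated documents could recompute these digits from the document number and date fields and compare them with the digits printed in the MRZ.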

docxpand/blank.svg

Lines changed: 54 additions & 0 deletions

docxpand/canvas.py

Lines changed: 183 additions & 0 deletions
@@ -0,0 +1,183 @@
+"""Helper functions to manipulate Inkscape SVG content.
+
+Original version can be found at https://github.com/letuananh/pyinkscape
+
+@author: Le Tuan Anh <tuananh.ke@gmail.com>
+@license: MIT
+"""
+
+# Copyright (c) 2017, Le Tuan Anh <tuananh.ke@gmail.com>
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in
+# all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+# THE SOFTWARE.
+
+########################################################################
+
+import logging
+import os
+import typing as tp
+from xml.dom.minidom import Element
+
+from lxml import etree
+from lxml.etree import XMLParser
+
+_BLANK_CANVAS = os.path.join(os.path.dirname(os.path.realpath(__file__)), "blank.svg")
+
+logger = logging.getLogger(__name__)
+logging.basicConfig()
+logger.setLevel(logging.INFO)
+
+INKSCAPE_NS = "http://www.inkscape.org/namespaces/inkscape"
+SVG_NS = "http://www.w3.org/2000/svg"
+SVG_NAMESPACES = {
+    "ns": SVG_NS,
+    "svg": SVG_NS,
+    "dc": "http://purl.org/dc/elements/1.1/",
+    "cc": "http://creativecommons.org/ns#",
+    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
+    "sodipodi": "http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd",
+    "inkscape": INKSCAPE_NS,
+}
+XLINK_NS = "http://www.w3.org/1999/xlink"
+
+
+class Point:
+    def __init__(self, x: float, y: float):
+        self.x = x
+        self.y = y
+
+
+class Dimension:
+    def __init__(self, width, height):
+        self.width = width
+        self.height = height
+
+
+class BBox:
+    """A bounding box represented by a top-left anchor (x, y) and a dimension (width, height)."""
+
+    def __init__(self, x, y, width, height):
+        self._anchor = Point(x, y)
+        self._dimension = Dimension(width, height)
+
+    @property
+    def width(self):
+        """Width of the bounding box"""
+        return self._dimension.width
+
+    @property
+    def height(self):
+        """Height of the bounding box"""
+        return self._dimension.height
+
+
+class Canvas:
+    """This class represents an Inkscape drawing page (i.e. an SVG file)."""
+
+    def __init__(self, filepath: tp.Optional[str] = None, *args, **kwargs):
+        """Create a new blank canvas or read from an existing file.
+
+        To create a blank canvas, omit the filepath argument.
+        >>> c = Canvas()
+
+        To open an existing file, use
+        >>> c = Canvas("/path/to/file.svg")
+
+        Arguments:
+            filepath: Path to an existing SVG file.
+        """
+        self._filepath = filepath
+        self._tree = None
+        self._root = None
+        self._units = "mm"
+        self._width = 0
+        self._height = 0
+        self._viewbox = None
+        self._scale = 1.0
+        self._elem_group_map = {}
+        self._elements_by_ids = {}
+        if filepath is not None:
+            self._load_file(*args, **kwargs)
+
+    def _load_file(self, remove_blank_text=True, encoding="utf-8", **kwargs):
+        with open(
+            _BLANK_CANVAS if not self._filepath else self._filepath,
+            encoding=encoding,
+        ) as infile:
+            kwargs["remove_blank_text"] = remove_blank_text  # lxml specific
+            parser = XMLParser(**kwargs)
+            self._tree = etree.parse(infile, parser)
+            self._root = self._tree.getroot()
+            self._update_svg_info()
+
+    def _update_svg_info(self):
+        # load SVG information
+        if self._svg_node.get("viewBox"):
+            self._viewbox = BBox(
+                *(float(x) for x in self._svg_node.get("viewBox").split())
+            )
+            if not self._width:
+                self._width = self._viewbox.width
+            if not self._height:
+                self._height = self._viewbox.height
+        if self.viewBox and self._width:
+            self._scale = self.viewBox.width / self._width
+
+    @property
+    def _svg_node(self):
+        return self._root
+
+    @property
+    def viewBox(self):
+        return self._viewbox
+
+    def to_xml_string(self, encoding="utf-8", pretty_print=True, **kwargs):
+        return etree.tostring(
+            self._root,
+            encoding=encoding,
+            pretty_print=pretty_print,
+            **kwargs,
+        ).decode(encoding)
+
+    def _xpath_query(self, query_string, namespaces=None):
+        return self._root.xpath(query_string, namespaces=namespaces)
+
+    def element_by_id(self, id: str) -> tp.Optional[Element]:
+        """Get one XML element by its ID.
+
+        Arguments:
+            id: the ID of the element
+
+        Raises:
+            RuntimeError: when two or more elements share the same ID
+        """
+        elements = self._xpath_query(f".//ns:*[@id='{id}']", namespaces=SVG_NAMESPACES)
+        if not elements:
+            return None
+        if len(elements) > 1:
+            raise RuntimeError(f"Found {len(elements)} elements with the same id {id}")
+        return elements[0]
+
+    def render(self, outpath, overwrite=False, encoding="utf-8"):
+        if not overwrite and os.path.isfile(outpath):
+            logger.warning(f"File {outpath} exists. SKIPPED")
+        else:
+            output = self.to_xml_string(pretty_print=False)
+            with open(outpath, mode="w", encoding=encoding) as outfile:
+                outfile.write(output)
+                logger.info("Written output to {}".format(outfile.name))
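
The two central operations of `Canvas`, reading the `viewBox` and looking up a single element by ID, can be tried without the repository. The sketch below uses the standard-library `xml.etree.ElementTree` instead of `lxml`, so it is an approximation of the behavior above, not the committed code. The names `parse_viewbox`, `find_by_id` and `SAMPLE_SVG` are hypothetical, and the sample dimensions are the ISO/IEC 7810 ID-1 card size in millimetres.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical minimal template: one text field on an ID-1 sized page.
SAMPLE_SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 85.6 53.98">'
    '<text id="family_name">DOE</text>'
    "</svg>"
)


def parse_viewbox(svg_text: str):
    """Return (x, y, width, height) from the root viewBox, or None if absent."""
    root = ET.fromstring(svg_text)
    raw = root.get("viewBox")
    if not raw:
        return None
    # Assumes whitespace-separated values, as Canvas._update_svg_info does.
    x, y, width, height = (float(value) for value in raw.split())
    return x, y, width, height


def find_by_id(svg_text: str, element_id: str):
    """Return the unique SVG-namespace element with this id, or None."""
    root = ET.fromstring(svg_text)
    prefix = "{" + SVG_NS + "}"
    matches = [
        element
        for element in root.iter()
        if element.tag.startswith(prefix) and element.get("id") == element_id
    ]
    if len(matches) > 1:
        raise RuntimeError(f"Found {len(matches)} elements with the same id {element_id}")
    return matches[0] if matches else None


if __name__ == "__main__":
    print(parse_viewbox(SAMPLE_SVG))
    print(find_by_id(SAMPLE_SVG, "family_name").text)
```

Raising on duplicate IDs mirrors `element_by_id`: a template with two fields sharing an ID is treated as an authoring error instead of silently filling only one of them.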

docxpand/conditionals.py

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,28 @@
1+
import random
2+
import typing as tp
3+
4+
5+
class Conditional:
6+
def __init__(seed: tp.Optional[int] = None):
7+
if seed is not None:
8+
random.seed(seed)
9+
10+
@staticmethod
11+
def uniform(probability: float = 0.5) -> bool:
12+
return random.random() <= probability
13+
14+
@staticmethod
15+
def maybe(**kwargs) -> bool:
16+
raise NotImplementedError("Must be implemented in child class")
17+
18+
19+
class BirthNameConditional(Conditional):
20+
@staticmethod
21+
def maybe(**kwargs) -> bool:
22+
gender: str = kwargs.get("gender", "nonbinary")
23+
probability_by_gender = kwargs.get(
24+
"probability_by_gender",
25+
{"male": 0.05, "female": 0.2, "nonbinary": 0.2},
26+
)
27+
probability = probability_by_gender.get(gender, 0.2)
28+
return Conditional.uniform(probability)
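
To see what these probabilities mean in practice, the following sketch re-implements the birth-name draw as a standalone function and estimates how often it fires. It is a simplified stand-in, not the committed class: `maybe_birth_name` and `DEFAULT_PROBABILITIES` are hypothetical names, and a private `random.Random` instance replaces the module-level generator so the experiment is reproducible.

```python
import random

# Same default table as BirthNameConditional.maybe in the diff above.
DEFAULT_PROBABILITIES = {"male": 0.05, "female": 0.2, "nonbinary": 0.2}


def maybe_birth_name(gender: str, rng: random.Random, probability_by_gender=None) -> bool:
    """Return True with the probability configured for this gender."""
    table = DEFAULT_PROBABILITIES if probability_by_gender is None else probability_by_gender
    probability = table.get(gender, 0.2)  # unknown genders fall back to 0.2
    # rng.random() lies in [0, 1), so a strict comparison fires with
    # probability exactly `probability` (never for 0, always for 1).
    return rng.random() < probability


def estimate_rate(gender: str, draws: int = 10_000, seed: int = 0) -> float:
    """Empirical frequency of True over `draws` seeded trials."""
    rng = random.Random(seed)
    return sum(maybe_birth_name(gender, rng) for _ in range(draws)) / draws


if __name__ == "__main__":
    for gender in ("male", "female", "nonbinary"):
        print(gender, estimate_rate(gender))
```

Passing an explicit generator keeps the draw independent of any other code that seeds or consumes the global `random` state, which the class-based version shares with the rest of the process.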
