Commit 7b92f10

Merge pull request #92 from dlcs/feature/cropbox
Allow use_cropbox to be controlled
2 parents ed07588 + 4ddebfc commit 7b92f10

4 files changed

Lines changed: 7 additions & 1 deletion

Dockerfile

Lines changed: 2 additions & 1 deletion
@@ -13,7 +13,8 @@ RUN apt-get update && apt-get --yes install apt-utils && apt-get --yes upgrade \
     && apt-get --yes install poppler-data poppler-utils \
     && apt-get --yes autoremove && apt-get --yes autoclean && apt-get --yes clean \
     && useradd --create-home --home-dir /srv/dlcs --shell /bin/bash --uid 1000 dlcs \
-    && python -m pip install --upgrade pip
+    && python -m pip install --upgrade pip \
+    && python -m pip install --upgrade setuptools
 
 # Copy nginx config and create appropriate folders
 COPY --chown=dlcs:dlcs ./nginx.conf /etc/nginx/nginx.conf

README.md

Lines changed: 1 addition & 0 deletions
@@ -80,6 +80,7 @@ The following list of environment variables are supported:
 | `PDF_RASTERIZER_FALLBACK_DPI` | `200` | Engine | The DPI to use for images that exceed pdftoppm memory size and produce a 1x1 pixel (see https://github.com/Belval/pdf2image/issues/34) |
 | `PDF_RASTERIZER_FORMAT` | `jpg` | Engine | The format to generate rasterized images in. Supported values are `ppm`, `jpeg` / `jpg`, `png` and `tiff` |
 | `PDF_RASTERIZER_MAX_LENGTH` | `0` | Engine | Optional, the maximum size of pixels on longest edge that will be saved. If rasterized image exceeds this it will be resized, maintaining aspect ratio. |
+| `PDF_RASTERIZER_USE_CROPBOX` | `False` | Engine | If `True` the PDF cropbox is used instead of mediabox. The MediaBox is the largest page box in a PDF. The other page boxes can equal the size of the MediaBox but they cannot be larger. The CropBox defines the region to which the page contents are to be clipped. |
 | `DLCS_API_ROOT` | `https://api.dlcs.digirati.io` | Engine | The root URI of the API of the target DLCS deployment, without the trailing slash. |
 | `DLCS_S3_BUCKET_NAME` | `dlcs-composite-images` | Engine | The S3 bucket that the Composite Handler will push rasterized images to, for consumption by the wider DLCS. Both the Composite Handler and the DLCS must have access to this bucket. |
 | `DLCS_S3_OBJECT_KEY_PREFIX` | `composites` | Engine | The S3 key prefix to use when pushing images to the `DLCS_S3_BUCKET_NAME` - in other words, the folder within the S3 bucket into which images are stored. |
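
To make the MediaBox/CropBox description in the new README row concrete, here is a minimal sketch of how the choice of box affects the rasterized page size. It is not code from this repository: `Box`, `effective_crop_box` and `pixel_size` are hypothetical names, the page dimensions are made-up sample values, and the pixel rounding is an approximation that may differ slightly from what pdftoppm produces.

```python
from typing import NamedTuple

POINTS_PER_INCH = 72  # default PDF user-space unit: 1 point = 1/72 inch


class Box(NamedTuple):
    """A PDF page box in points: (left, bottom, right, top). Hypothetical helper type."""
    left: float
    bottom: float
    right: float
    top: float


def effective_crop_box(media_box: Box, crop_box: Box) -> Box:
    """Clip the crop box to the media box, since it cannot extend beyond it."""
    return Box(
        max(media_box.left, crop_box.left),
        max(media_box.bottom, crop_box.bottom),
        min(media_box.right, crop_box.right),
        min(media_box.top, crop_box.top),
    )


def pixel_size(box: Box, dpi: int) -> tuple[int, int]:
    """Approximate pixel dimensions of a box rendered at the given DPI."""
    width_pt = box.right - box.left
    height_pt = box.top - box.bottom
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


# Sample page: US Letter media box with a half-inch margin cropped off each side.
media = Box(0, 0, 612, 792)
crop = Box(36, 36, 576, 756)

print(pixel_size(media, 200))                            # (1700, 2200)
print(pixel_size(effective_crop_box(media, crop), 200))  # (1500, 2000)
```

With the sample values above, enabling the cropbox shrinks the output from 1700x2200 to 1500x2000 pixels at 200 DPI, which is the kind of difference `PDF_RASTERIZER_USE_CROPBOX` is meant to control.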

src/app/engine/rasterizers.py

Lines changed: 3 additions & 0 deletions
@@ -24,6 +24,7 @@ def __init__(self):
         self._fmt = settings.PDF_RASTERIZER["format"]
         self._thread_count = settings.PDF_RASTERIZER["thread_count"]
         self._max_length = settings.PDF_RASTERIZER["max_length"]
+        self._use_cropbox = settings.PDF_RASTERIZER["use_cropbox"]
 
     def rasterize_pdf(self, subfolder_path):
         # Typically, pdf2image will write generated images to a temporary path, after
@@ -51,6 +52,7 @@ def __rasterize(
             thread_count=self._thread_count,
             output_file=output_file,
             output_folder=subfolder_path,
+            use_cropbox=self._use_cropbox,
         )
 
     def __validate_rasterized_images(self, images, pdf_source, subfolder_path):
@@ -90,6 +92,7 @@ def __ensure_image_size(self, idx, im: Image):
         logger.info(
             f"resizing image index {idx} from {w},{h} to {scale_w},{scale_h}"
         )
+
         with im.resize((scale_w, scale_h), resample=Image.LANCZOS) as resized:
             resized.save(filename)
         return ResizeResult.RESIZED
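
The change above forwards `use_cropbox` to pdf2image, which the README describes as driving pdftoppm. As a rough illustration of what that switch amounts to at the command-line level, the sketch below builds a pdftoppm argument list in which the flag adds `-cropbox`. It is a simplified assumption, not pdf2image's actual implementation: `build_pdftoppm_args` is a hypothetical helper and the file names are placeholders.

```python
def build_pdftoppm_args(pdf_path, output_prefix, dpi=500, fmt="jpg", use_cropbox=False):
    """Build a pdftoppm command line (hypothetical helper, simplified sketch)."""
    # pdftoppm writes PPM by default; other formats need an explicit flag.
    format_flags = {
        "ppm": None,
        "jpg": "-jpeg",
        "jpeg": "-jpeg",
        "png": "-png",
        "tiff": "-tiff",
    }
    args = ["pdftoppm", "-r", str(dpi)]
    flag = format_flags[fmt]
    if flag is not None:
        args.append(flag)
    if use_cropbox:
        # Render the crop box rather than the media box.
        args.append("-cropbox")
    args += [pdf_path, output_prefix]
    return args


print(build_pdftoppm_args("input.pdf", "page", dpi=200, use_cropbox=True))
# ['pdftoppm', '-r', '200', '-jpeg', '-cropbox', 'input.pdf', 'page']
```

The resulting list could be passed to `subprocess.run` on a machine with poppler-utils installed; leaving `use_cropbox` at its default of `False` omits the flag and keeps the previous behaviour.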

src/app/settings.py

Lines changed: 1 addition & 0 deletions
@@ -177,6 +177,7 @@
     "dpi": env("PDF_RASTERIZER_DPI", cast=int, default=500),
     "fallback_dpi": env("PDF_RASTERIZER_FALLBACK_DPI", cast=int, default=200),
     "max_length": env("PDF_RASTERIZER_MAX_LENGTH", cast=int, default=0),
+    "use_cropbox": env("PDF_RASTERIZER_USE_CROPBOX", cast=bool, default=False),
 }
 
 ORIGIN_CONFIG = {"chunk_size": env("ORIGIN_CHUNK_SIZE", cast=int, default=8192)}
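
One detail worth keeping in mind with a boolean setting such as `PDF_RASTERIZER_USE_CROPBOX`: environment variables are always strings, and plain `bool("False")` is `True` in Python, so the `env` helper has to parse the text rather than cast it directly. The diff does not show how `env` does this, so the sketch below is only an assumption of typical behaviour. `env_bool` and its list of accepted spellings are hypothetical.

```python
import os

# Spellings treated as "true"; this set is an assumption, not taken from the diff.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def env_bool(name, default=False, environ=None):
    """Read a boolean from the environment (hypothetical stand-in for env(..., cast=bool))."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE_VALUES


sample_env = {"PDF_RASTERIZER_USE_CROPBOX": "True"}
print(env_bool("PDF_RASTERIZER_USE_CROPBOX", environ=sample_env))  # True
print(env_bool("PDF_RASTERIZER_USE_CROPBOX", environ={}))          # False
```

Passing `environ` explicitly keeps the helper easy to exercise without touching the real process environment.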
