📰 PDFFile News

v0.6.1 - Auto-DPI for full-page rasterization

New choose_pixmap_dpi(page) picks a render DPI matching the page’s embedded-image resolution. It walks each image, computes its native placed DPI from the pixel dimensions and bbox, and returns the highest value clamped to [DEFAULT_PIXMAP_DPI, MAX_PIXMAP_DPI] (150–300). Pages with no images return the default. Tiny tracking pixels and decorative dots are filtered out via min_bbox_fraction.
PDFFile.read_full_pixmap_jpeg(index, dpi=None) now defaults to the auto-picked DPI; pass an integer to override. Existing callers see slightly higher rendering quality on multi-image pages and a consistent 150 DPI floor on pure vector pages — comic-resolution output by default instead of the previous 72 DPI.
New module-level constants DEFAULT_PIXMAP_DPI = 150 and MAX_PIXMAP_DPI = 300.
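
The DPI selection described above can be sketched in plain Python. This is a rough illustration, not the library's implementation: PlacedImage and choose_dpi_sketch are hypothetical names, and the sketch assumes that min_bbox_fraction compares an image's bbox area to the page area, that 0.05 is a reasonable default, and that the larger of the horizontal and vertical DPI is used. Only the two constants and the 72 points-per-inch convention come from the notes and the PDF format.

```python
from dataclasses import dataclass

DEFAULT_PIXMAP_DPI = 150
MAX_PIXMAP_DPI = 300
POINTS_PER_INCH = 72.0  # PDF user space: 72 points per inch


@dataclass(frozen=True)
class PlacedImage:
    """Hypothetical stand-in for one embedded image and its placement."""

    pixel_width: int
    pixel_height: int
    bbox_width_pt: float   # placed width on the page, in points
    bbox_height_pt: float  # placed height on the page, in points


def choose_dpi_sketch(images, page_width_pt, page_height_pt,
                      min_bbox_fraction=0.05):
    """Return the highest native placed DPI, clamped to the allowed range."""
    page_area = page_width_pt * page_height_pt
    best = 0.0
    for img in images:
        bbox_area = img.bbox_width_pt * img.bbox_height_pt
        # Skip degenerate boxes and images too small to matter
        # (assumed meaning of min_bbox_fraction: bbox area / page area).
        if page_area <= 0 or bbox_area <= 0:
            continue
        if bbox_area / page_area < min_bbox_fraction:
            continue
        # Native placed DPI = pixels divided by placed size in inches.
        dpi_x = img.pixel_width / (img.bbox_width_pt / POINTS_PER_INCH)
        dpi_y = img.pixel_height / (img.bbox_height_pt / POINTS_PER_INCH)
        best = max(best, dpi_x, dpi_y)
    if best == 0.0:
        # No qualifying images: fall back to the default.
        return DEFAULT_PIXMAP_DPI
    return int(min(max(round(best), DEFAULT_PIXMAP_DPI), MAX_PIXMAP_DPI))
```

For a US Letter page (612 × 792 points), a full-page 1700 × 2200 pixel scan works out to 200 DPI, which falls inside the range and is returned unchanged; anything above 300 is capped and anything below 150 is raised to the floor.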

Features
- New PDFFile.classify_page(index) returns a PageVerdict describing how the page should be served: IMAGE_DIRECT (embedded image is browser-safe as-stored), IMAGE_TRANSCODE (embedded image needs RGB-JPEG re-encoding — CMYK, JBIG2, JPEG 2000, rotated pages), or PDF_FALLBACK (page has vector content; serve through the PDF path). Detection runs on parsed metadata — no rasterization — and costs single-digit milliseconds per page.
- New PDFFile.read_image_if_dominant(index) returns (bytes, ext) for image-dominant pages or None for fall-through, letting browser readers serve comic-style PDFs as plain <img> instead of going through pdf.js.
- New PDFFile.read_full_pixmap_jpeg(index) renders any page to RGB JPEG for callers that need an always-image response.
- New PageFormat.IMAGE_IF_DOMINANT and PageFormat.PIXMAP_JPEG values for the read() interface.
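
The three verdicts listed above suggest a metadata-only decision along the following lines. This is a simplified sketch under stated assumptions, not the library's detection code: the PageVerdict enum here is a local stand-in for the real one, PageFacts and classify_sketch are hypothetical, and the sketch does not model how image dominance itself is decided. The filter names (JBIG2Decode, JPXDecode) and DeviceCMYK are standard PDF identifiers for the formats named in the notes.

```python
from dataclasses import dataclass
from enum import Enum, auto


class PageVerdict(Enum):
    """Local stand-in mirroring the verdict names from the release notes."""

    IMAGE_DIRECT = auto()
    IMAGE_TRANSCODE = auto()
    PDF_FALLBACK = auto()


@dataclass(frozen=True)
class PageFacts:
    """Hypothetical summary of already-parsed page metadata."""

    has_vector_content: bool
    image_filter: str  # PDF stream filter, e.g. "DCTDecode" for JPEG
    colorspace: str    # e.g. "DeviceRGB", "DeviceGray", "DeviceCMYK"
    rotation: int      # page rotation in degrees


# Stream filters browsers cannot display as-stored: JBIG2 and JPEG 2000.
TRANSCODE_FILTERS = frozenset({"JBIG2Decode", "JPXDecode"})


def classify_sketch(facts):
    """Pick a verdict from metadata alone, with no rasterization."""
    if facts.has_vector_content:
        return PageVerdict.PDF_FALLBACK
    needs_transcode = (
        facts.rotation % 360 != 0
        or facts.colorspace == "DeviceCMYK"
        or facts.image_filter in TRANSCODE_FILTERS
    )
    if needs_transcode:
        return PageVerdict.IMAGE_TRANSCODE
    return PageVerdict.IMAGE_DIRECT
```

Because every input is a value already available after parsing, a check of this shape stays cheap, which is consistent with the per-page cost quoted above.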
Fixes
- read_pdf passes no_new_id=True to tobytes() so single-page PDF output is deterministic across calls. Without this, pymupdf stamps a fresh random /ID array on every save and byte-equality fixtures break on every run.

Extract PDF pages as originals rather than recomposing them. Fixes a bug where OCR text became visible.

Fix reading PDF metadata breaking on datetimes.
Unreadable or unconvertible PDF datetimes are replaced with the start of the epoch and a warning is logged, instead of raising an exception and abandoning parsing.
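
The fallback behaviour can be illustrated with a small standalone parser. This is a sketch only: parse_pdf_date_sketch and its regular expression are hypothetical and not the library's code. It assumes that "start of the epoch" means 1970-01-01 UTC and that a date string with no timezone suffix is treated as UTC.

```python
import logging
import re
from datetime import datetime, timedelta, timezone

LOG = logging.getLogger(__name__)

# Assumption: "start of the epoch" is the Unix epoch in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# PDF date strings look like D:YYYYMMDDHHmmSSOHH'mm'; all fields
# after the year are optional.
_PDF_DATE_RE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])"
    r"(?:(?P<tz_hour>\d{2})'?(?P<tz_minute>\d{2})?'?)?)?$"
)


def parse_pdf_date_sketch(value):
    """Parse a PDF date string; log a warning and return EPOCH on failure."""
    try:
        match = _PDF_DATE_RE.match(value.strip())
        if match is None:
            raise ValueError("not a PDF date string")
        parts = match.groupdict()
        tz = timezone.utc  # assumed when no offset is given
        if parts["sign"] in ("+", "-"):
            offset = timedelta(
                hours=int(parts["tz_hour"] or 0),
                minutes=int(parts["tz_minute"] or 0),
            )
            tz = timezone(offset if parts["sign"] == "+" else -offset)
        return datetime(
            int(parts["year"]),
            int(parts["month"] or 1),
            int(parts["day"] or 1),
            int(parts["hour"] or 0),
            int(parts["minute"] or 0),
            int(parts["second"] or 0),
            tzinfo=tz,
        )
    except (AttributeError, TypeError, ValueError) as exc:
        # Keep parsing the rest of the metadata instead of aborting.
        LOG.warning("Unreadable PDF date %r, using epoch: %s", value, exc)
        return EPOCH
```

Catching the error at the level of a single field means one malformed date costs only that field, and the rest of the metadata is still returned.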

read() also extracts original image files. The format is now specified with a string. The file extension is passed back in an optional props dict.
read_image() reads the first image on a page in its original format.
read_pixmap() converts the page to a PPM.
read_pdf() converts the page into a one-page PDF.
read_embedded_file() reads a named embedded file.
PageFormat is a convenience enum listing the available options.

namelist() lists embedded files.
infolist() lists embedded files.
read() can also read named embedded files.
writestr() writes named embedded files. Raises if page numbers are submitted.
remove() will remove named embedded files or pages if the name evaluates to a non-negative integer.

Automatically converts PDF date strings to Python datetimes and back.
Automatically converts PDF/XML bool strings to Python bools and back.
The PDFFile static methods to_datetime, to_pdf_date, to_bool, and to_xml_bool do this manually.
Deflate images on save.