pdffile - PDFFile

source package pdffile

Access PDFs with a ZipFile-like API.

Classes

PageFormat — Read Format.
PDFFile — ZipFile-like API to PDFs.

source class PageFormat()

Bases: Enum

Read Format.

Attributes

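IMAGE_IF_DOMINANT — image response when the page is image-dominant, otherwise the PDF fall-through; callers distinguish the two by inspecting the ext written to props.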
PIXMAP_JPEG — always-image response for any page (e.g. force-image override).

source class PDFFile(path: Path)

ZipFile-like API to PDFs.

Initialize document.

Methods

valid_pagenum — Check if a string is a non-negative integer.
to_datetime — Convert a PDF date string to a datetime.
to_pdf_date — Convert a datetime to a PDF date string.
to_bool — Convert a boolean string to a Python bool.
to_xml_bool — Convert a boolean value to an xml string.
is_pdffile — Is the path a pdf.
save — Save PDF doc to disk.
close — Close the fitz doc.
pagelist — Zero padded page names.
namelist — Return sortable zero padded index strings.
infolist — Return ZipFile like infolist.
read_image — Read first image from page in original format.
read_pixmap — Convert page to pixmap.
classify_page — Decide how page index should be served.
read_image_if_dominant — Return (bytes, ext) if page is image-dominant, else None.
read_full_pixmap_jpeg — Render the whole page to RGB JPEG.
read_pdf — Read a pdf page as a complete one-page pdf.
read_embedded_file — Read embedded file.
read — Return a single page pdf doc, image or pixmap or embedded file.
get_page_count — Get the page count from the doc or the default highnum.
get_metadata — Return metadata from the pdf doc.
write_metadata — Set metadata to the pdf doc.
remove — Remove files or pages from the pdf.
writestr — Write string to an embedded file.
repack — Noop. For compatibility with zipfile-patch.

source staticmethod PDFFile.valid_pagenum(name: str) → int

Check if a string is a non-negative integer.

Raises

ValueError
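The check described above can be sketched in a few lines. This is a minimal illustration, assuming the method returns the parsed integer (as the signature suggests) and rejects anything that is not a run of ASCII digits; the library's actual rules may differ.

```python
def valid_pagenum(name: str) -> int:
    """Return name as a non-negative int, or raise ValueError.

    Illustrative sketch only, not the library's implementation.
    """
    # isascii() + isdigit() rejects signs, whitespace, decimals and "".
    if not (name.isascii() and name.isdigit()):
        raise ValueError(f"not a non-negative integer: {name!r}")
    return int(name)
```

Zero-padded names such as "007" parse to 7, which matches the padded names the class hands out elsewhere.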

source staticmethod PDFFile.to_datetime(pdf_date: str) → datetime | None

Convert a PDF date string to a datetime.
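For readers unfamiliar with the format, PDF date strings look like D:YYYYMMDDHHmmSS followed by an optional zone (Z, or a sign with HH'mm'), and everything after the year may be omitted. The sketch below parses that shape with the standard library only. The name parse_pdf_date is hypothetical, and returning None for unparseable input is an assumption drawn from the return type above.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Every field after the year is optional in a PDF date string.
_PDF_DATE = re.compile(
    r"^(?:D:)?(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<tzh>\d{2})'?(?P<tzm>\d{2})?'?)?)?$"
)


def parse_pdf_date(pdf_date: str) -> Optional[datetime]:
    """Parse a PDF date string; return None when it cannot be parsed."""
    match = _PDF_DATE.match(pdf_date.strip())
    if match is None:
        return None
    parts = match.groupdict()
    try:
        tz = None  # no zone marker: return a naive datetime
        if parts["sign"] == "Z":
            tz = timezone.utc
        elif parts["sign"] in ("+", "-"):
            offset = timedelta(
                hours=int(parts["tzh"] or 0), minutes=int(parts["tzm"] or 0)
            )
            tz = timezone(-offset if parts["sign"] == "-" else offset)
        return datetime(
            int(parts["year"]),
            int(parts["month"] or 1),  # omitted month/day default to 1
            int(parts["day"] or 1),
            int(parts["hour"] or 0),
            int(parts["minute"] or 0),
            int(parts["second"] or 0),
            tzinfo=tz,
        )
    except ValueError:
        # Out-of-range fields such as month 13 or a 99-hour offset.
        return None
```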

source staticmethod PDFFile.to_pdf_date(value: datetime | str) → str | None

Convert a datetime to a PDF date string.
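The reverse direction can be sketched the same way. The helper below, with the hypothetical name format_pdf_date, handles only datetime input; the documented method also accepts strings and may return None, and neither case is modelled here. Omitting the zone for naive datetimes is an assumption.

```python
from datetime import datetime


def format_pdf_date(value: datetime) -> str:
    """Format a datetime as a PDF date string (illustrative sketch)."""
    stamp = value.strftime("D:%Y%m%d%H%M%S")
    offset = value.utcoffset()
    if offset is None:
        return stamp  # naive datetime: leave the zone off (assumption)
    total_minutes = int(offset.total_seconds() // 60)
    if total_minutes == 0:
        return stamp + "Z"
    sign = "+" if total_minutes > 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"{stamp}{sign}{hours:02d}'{minutes:02d}'"
```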

source staticmethod PDFFile.to_bool(value: Any) → bool

Convert a boolean string to a Python bool.

source staticmethod PDFFile.to_xml_bool(value: Any) → str

Convert a boolean value to an xml string.
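The two boolean helpers above are natural inverses. A minimal sketch of the pair follows; the set of strings treated as true is an assumption, not taken from the library, and "true"/"false" are the canonical lexical forms of an XML Schema boolean.

```python
# Strings treated as true. This set is an assumption for illustration.
_TRUE_STRINGS = frozenset({"true", "yes", "1", "on"})


def to_bool(value: object) -> bool:
    """Interpret strings case-insensitively; fall back to bool() otherwise."""
    if isinstance(value, str):
        return value.strip().lower() in _TRUE_STRINGS
    return bool(value)


def to_xml_bool(value: object) -> str:
    """Render any truthy or falsy value as an XML boolean string."""
    return "true" if to_bool(value) else "false"
```

Routing to_xml_bool through to_bool keeps the string "False" from being rendered as "true", which a bare bool() call would do because any non-empty string is truthy.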

source classmethod PDFFile.is_pdffile(path: str) → bool

Is the path a pdf.

source method PDFFile.save() → None

Save PDF doc to disk.

source method PDFFile.close() → None

Close the fitz doc.

source method PDFFile.pagelist() → list[str]

Zero padded page names.

source method PDFFile.namelist() → list[str]

Return sortable zero padded index strings.
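Zero padding matters because plain index strings sort lexicographically ("10" before "2"). The sketch below shows one way to produce sortable names; deriving the width from the highest index is an assumption, and the library may use a fixed width instead. The name padded_names is hypothetical.

```python
from typing import List


def padded_names(page_count: int) -> List[str]:
    """Return zero-padded index strings that sort in page order."""
    # Width of the largest index, so every name has the same length.
    width = len(str(max(page_count - 1, 0)))
    return [f"{index:0{width}d}" for index in range(page_count)]
```

With twelve pages this yields names from "00" through "11", and sorting the list as strings leaves the page order unchanged.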

source method PDFFile.infolist() → list[ZipInfo]

Return ZipFile like infolist.

source method PDFFile.read_image(index: int) → tuple[bytes, str]

Read first image from page in original format.

source method PDFFile.read_pixmap(index: int) → tuple[bytes, str]

Convert page to pixmap.

source method PDFFile.classify_page(index: int) → PageVerdict

Decide how page index should be served.

Returns a PageVerdict. PageMode.PDF_FALLBACK means the caller should use the regular PDF path; the other modes mean the page is image-dominant and can be served as raw image bytes via read_image_if_dominant.

Cheap: runs on parsed PDF metadata, single-digit milliseconds per page even on text-heavy documents.

source method PDFFile.read_image_if_dominant(index: int) → tuple[bytes, str] | None

Return (bytes, ext) if page is image-dominant, else None.

ext is the embedded image’s encoding (‘jpeg’, ‘png’, ‘webp’) for IMAGE_DIRECT verdicts, or ‘jpeg’ for IMAGE_TRANSCODE verdicts (CMYK / JBIG2 / rotated pages re-encoded via Pixmap).

None means the caller should use read_pdf (or another fallback path), because the page has vector content that would be lost in a raw-image serve.
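The None contract above leads to a simple caller pattern: try the image path, then fall back to the one-page PDF. The sketch below is written against a small Protocol so it runs without the library; PageReader and read_page are hypothetical names, and only the two method signatures come from this page.

```python
from typing import Optional, Protocol, Tuple


class PageReader(Protocol):
    """The two documented methods this sketch relies on."""

    def read_image_if_dominant(self, index: int) -> Optional[Tuple[bytes, str]]: ...

    def read_pdf(self, index: int) -> Tuple[bytes, str]: ...


def read_page(pdf: PageReader, index: int) -> Tuple[bytes, str]:
    """Serve raw image bytes when possible, else a one-page PDF."""
    result = pdf.read_image_if_dominant(index)
    if result is None:
        # Vector content would be lost in a raw-image serve.
        return pdf.read_pdf(index)
    return result
```

A PDFFile instance should satisfy this protocol structurally, so it could be passed to read_page directly.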

source method PDFFile.read_full_pixmap_jpeg(index: int, *, dpi: int | None = None) → tuple[bytes, str]

Render the whole page to RGB JPEG.

Faster than read_pixmap for browser callers (PPM is not browser-renderable; PIL would need to be in the loop to transcode). Tries the cheap embedded-image path first when the page happens to be image-dominant.

dpi=None (default) auto-picks a render DPI from the page’s embedded-image resolution via choose_pixmap_dpi; pages with no images render at DEFAULT_PIXMAP_DPI. Pass an integer to override. The auto path doesn’t apply when the cheap embedded-image branch fires: those pages return the embedded image at its native resolution regardless.

Succeeds for any valid page; raises RuntimeError only if PyMuPDF can’t render the page at all.

Raises

RuntimeError

source method PDFFile.read_pdf(index: int) → tuple[bytes, str]

Read a pdf page as a complete one-page pdf.

Uses insert_pdf rather than Document.convert_to_pdf. The latter rebuilds the page’s content stream, and during that rebuild it drops text rendering mode operators (notably 3 Tr, invisible text) and renames specialised OCR fonts like HiddenHorzOCR to ordinary text fonts. The net effect on Acrobat-OCR’d PDFs is that the invisible OCR overlay turns visible: text “doubles up” against the page’s raster under any renderer that follows the spec (PDF.js, MuPDF itself). insert_pdf copies the page faithfully: same operators, same fonts, no warnings, pixel-identical render to the source.

source method PDFFile.read_embedded_file(filename: str) → tuple[bytes, str]

Read embedded file.

source method PDFFile.read(filename: str, fmt: str = '', props: dict | None = None) → bytes

Return a single page pdf doc, image or pixmap or embedded file.

If a props dict is passed in, the extension of the data that was read is written to its ext key. With IMAGE_IF_DOMINANT, callers inspect ext to distinguish a successful image serve (jpeg/png/webp) from the PDF fall-through (pdf).
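One use of the props dict is picking a response content type after the read. The sketch below assumes only the documented read signature and the ext values listed above; the helper name, the MIME table, and the octet-stream fallback are illustrative choices, and the fmt value is passed through untouched because the string values of PageFormat are not shown on this page.

```python
from typing import Any, Dict, Tuple

# ext values named in the documentation, mapped to standard MIME types.
CONTENT_TYPES = {
    "jpeg": "image/jpeg",
    "png": "image/png",
    "webp": "image/webp",
    "pdf": "application/pdf",
}


def read_with_content_type(pdf: Any, filename: str, fmt: str = "") -> Tuple[bytes, str]:
    """Read an entry and derive a content type from props['ext']."""
    props: Dict[str, Any] = {}
    data = pdf.read(filename, fmt=fmt, props=props)
    ext = props.get("ext", "")
    # Unknown or missing extensions fall back to a generic binary type.
    return data, CONTENT_TYPES.get(ext, "application/octet-stream")
```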

source method PDFFile.get_page_count() → int

Get the page count from the doc or the default highnum.

source method PDFFile.get_metadata() → dict

Return metadata from the pdf doc.

source method PDFFile.write_metadata(metadata: Mapping) → None

Set metadata to the pdf doc.

source method PDFFile.remove(name: str) → None

Remove files or pages from the pdf.

source method PDFFile.writestr(name: str, buffer: str | bytes | bytearray | memoryview[int], **_kwargs) → None

Write string to an embedded file.

Accepts compress_type and compress args but discards them.

Raises

NotImplementedError

source method PDFFile.repack() → None

Noop. For compatibility with zipfile-patch.
