-
PageFormat — Read Format.
-
PDFFile — ZipFile-like API to PDFs.
pdffile
Access PDFs with a ZipFile-like API.
Classes
source class PageFormat()
Bases : Enum
Read Format.
Attributes
-
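IMAGE_IF_DOMINANT — serve the embedded image when the page is image-dominant, otherwise fall through to PDF; callers distinguish the two by inspecting the ext written to props.
-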
PIXMAP_JPEG — always-image response for any page (e.g. force-image override).
source class PDFFile(path: Path)
ZipFile-like API to PDFs.
Initialize document.
Methods
-
valid_pagenum — Check if a string is a non-negative integer.
-
to_datetime — Convert a PDF date string to a datetime.
-
to_pdf_date — Convert a datetime to a PDF date string.
-
to_bool — Convert a boolean string to a Python bool.
-
to_xml_bool — Convert a boolean value to an XML string.
-
is_pdffile — Check whether the path is a PDF.
-
save — Save PDF doc to disk.
-
close — Close the fitz doc.
-
pagelist — Zero padded page names.
-
namelist — Return sortable zero padded index strings.
-
infolist — Return ZipFile like infolist.
-
read_image — Read first image from page in original format.
-
read_pixmap — Convert page to pixmap.
-
classify_page — Decide how page index should be served.
-
read_image_if_dominant — Return (bytes, ext) if page is image-dominant, else None.
-
read_full_pixmap_jpeg — Render the whole page to RGB JPEG.
-
read_pdf — Read a pdf page as a complete one-page pdf.
-
read_embedded_file — Read embedded file.
-
read — Return a single page pdf doc, image or pixmap or embedded file.
-
get_page_count — Get the page count from the doc or the default highnum.
-
get_metadata — Return metadata from the pdf doc.
-
write_metadata — Set metadata to the pdf doc.
-
remove — Remove files or pages from the pdf.
-
writestr — Write string to an embedded file.
-
repack — Noop. For compatibility with zipfile-patch.
source staticmethod PDFFile.valid_pagenum(name: str) → int
Check if a string is a non-negative integer.
Raises
-
ValueError
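The check described above can be sketched as a standalone function. This is a minimal illustration, not the library's implementation: the name parse_page_number and the error message are hypothetical, and the real method may accept or reject edge cases differently.

```python
def parse_page_number(name: str) -> int:
    """Return ``name`` as an int, or raise ValueError if it is not a non-negative integer."""
    # str.isdecimal() is False for signs, whitespace, decimal points and the empty string,
    # so only plain runs of digits reach int().
    if not name.isdecimal():
        raise ValueError(f"not a valid page number: {name!r}")
    return int(name)
```

Zero-padded names such as "007" parse to 7, which matches the padded page names the class exposes elsewhere.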
source staticmethod PDFFile.to_datetime(pdf_date: str) → datetime | None
Convert a PDF date string to a datetime.
source staticmethod PDFFile.to_pdf_date(value: datetime | str) → str | None
Convert a datetime to a PDF date string.
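For orientation, PDF date strings have the shape D:YYYYMMDDHHmmSS followed by an optional Z or a +HH'mm' / -HH'mm' offset, with trailing fields optional. The sketch below shows one way the two conversions could work using only the standard library. The function names are hypothetical, the behaviour for partial or invalid input is an assumption, and unlike to_pdf_date it accepts only datetime values.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z or +HH'mm' / -HH'mm' offset.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz])|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def pdf_date_to_datetime(pdf_date: str) -> Optional[datetime]:
    """Parse a PDF date string; return None when it does not parse (assumed behaviour)."""
    match = _PDF_DATE.match(pdf_date.strip())
    if match is None:
        return None
    # Missing fields fall back to January, the first day, and midnight.
    defaults = (0, 1, 1, 0, 0, 0)
    fields = [int(part) if part else default
              for part, default in zip(match.groups()[:6], defaults)]
    zulu, sign, off_hours, off_minutes = match.groups()[6:]
    try:
        tzinfo = None  # no zone suffix: keep the datetime naive
        if zulu:
            tzinfo = timezone.utc
        elif sign:
            delta = timedelta(hours=int(off_hours), minutes=int(off_minutes or 0))
            tzinfo = timezone(delta if sign == "+" else -delta)
        return datetime(*fields, tzinfo=tzinfo)
    except ValueError:  # out-of-range field, e.g. month 13 or a 99-hour offset
        return None


def datetime_to_pdf_date(value: datetime) -> str:
    """Format a datetime as a PDF date string."""
    stamp = value.strftime("D:%Y%m%d%H%M%S")
    offset = value.utcoffset()
    if offset is None:
        return stamp  # naive datetime: emit no zone suffix
    total_minutes = int(offset.total_seconds()) // 60
    if total_minutes == 0:
        return stamp + "Z"
    sign = "+" if total_minutes > 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"{stamp}{sign}{hours:02d}'{minutes:02d}'"
```

With these definitions, parsing D:20240131120500+02'00' and formatting the result again yields the same string.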
source staticmethod PDFFile.to_bool(value: Any) → bool
Convert a boolean string to a Python bool.
source staticmethod PDFFile.to_xml_bool(value: Any) → str
Convert a boolean value to an XML string.
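A rough sketch of how the two boolean helpers could fit together follows. The set of strings treated as true is an assumption, as are the function names; the library may recognise a different set. The output strings "true" and "false" are the standard XML Schema boolean spellings.

```python
from typing import Any

# Assumed set of truthy spellings; the real helper may differ.
_TRUE_STRINGS = frozenset({"1", "true", "yes", "y", "on"})


def to_bool_sketch(value: Any) -> bool:
    """Interpret strings case-insensitively; defer to bool() for everything else."""
    if isinstance(value, str):
        return value.strip().lower() in _TRUE_STRINGS
    return bool(value)


def to_xml_bool_sketch(value: Any) -> str:
    """Render a boolean-like value as an XML boolean literal."""
    return "true" if to_bool_sketch(value) else "false"
```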
source classmethod PDFFile.is_pdffile(path: str) → bool
Check whether the path is a PDF.
source method PDFFile.save() → None
Save PDF doc to disk.
source method PDFFile.close() → None
Close the fitz doc.
source method PDFFile.pagelist() → list[str]
Zero padded page names.
source method PDFFile.namelist() → list[str]
Return sortable zero padded index strings.
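Zero padding matters because plain index strings sort lexicographically ("10" before "2"). The sketch below illustrates the idea with a hypothetical helper; choosing the pad width from the highest page index is an assumption, and the library may use a fixed width instead.

```python
def zero_padded_names(page_count: int) -> list[str]:
    """Return page index strings padded so that string order equals numeric order."""
    # Width of the largest index, e.g. 2 digits for a 12-page document (indexes 0..11).
    width = len(str(max(page_count - 1, 0)))
    return [f"{index:0{width}d}" for index in range(page_count)]
```

For a 12-page document this gives "00" through "11", and sorting the list as strings leaves it unchanged.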
source method PDFFile.infolist() → list[ZipInfo]
Return ZipFile like infolist.
source method PDFFile.read_image(index: int) → tuple[bytes, str]
Read first image from page in original format.
source method PDFFile.read_pixmap(index: int) → tuple[bytes, str]
Convert page to pixmap.
source method PDFFile.classify_page(index: int) → PageVerdict
Decide how page index should be served.
Returns a PageVerdict. PageMode.PDF_FALLBACK means the caller should use the regular PDF path; the other modes mean the page is image-dominant and can be served as raw image bytes via read_image_if_dominant.
Cheap — runs on parsed PDF metadata, single-digit milliseconds per page even on text-heavy documents.
source method PDFFile.read_image_if_dominant(index: int) → tuple[bytes, str] | None
Return (bytes, ext) if page is image-dominant, else None.
ext is the embedded image’s encoding (‘jpeg’, ‘png’, ‘webp’) for IMAGE_DIRECT verdicts, or ‘jpeg’ for IMAGE_TRANSCODE verdicts (CMYK / JBIG2 / rotated pages re-encoded via Pixmap).
None means the caller should use read_pdf (or another fallback path), because the page has vector content that would be lost in a raw-image serve.
source method PDFFile.read_full_pixmap_jpeg(index: int, *, dpi: int | None = None) → tuple[bytes, str]
Render the whole page to RGB JPEG.
Faster than read_pixmap for browser callers (PPM is not browser-renderable; PIL would need to be in the loop to transcode). Tries the cheap embedded-image path first when the page happens to be image-dominant.
dpi=None (default) auto-picks a render DPI from the page’s embedded-image resolution via choose_pixmap_dpi; pages with no images render at DEFAULT_PIXMAP_DPI. Pass an integer to override. The auto path does not apply when the cheap embedded-image branch fires; in that case the embedded image is returned at its native resolution regardless.
Succeeds for any valid page; raises only if PyMuPDF cannot render the page at all.
Raises
-
RuntimeError
source method PDFFile.read_pdf(index: int) → tuple[bytes, str]
Read a pdf page as a complete one-page pdf.
Uses insert_pdf rather than Document.convert_to_pdf. The latter rebuilds the page’s content stream, and during that rebuild it drops text rendering mode operators (notably 3 Tr, invisible text) and renames specialised OCR fonts like HiddenHorzOCR to ordinary text fonts. The net effect on Acrobat-OCR’d PDFs is that the invisible OCR overlay turns visible: text “doubles up” against the page’s raster under any renderer that follows the spec (PDF.js, MuPDF itself). insert_pdf copies the page faithfully: same operators, same fonts, no warnings, pixel-identical render to the source.
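The insert_pdf approach described above can be sketched with PyMuPDF as follows. This is an illustration of the technique under stated assumptions, not the method's actual body: the function name extract_single_page is hypothetical, and it takes an already-open PyMuPDF document rather than a PDFFile.

```python
def extract_single_page(doc, index: int) -> bytes:
    """Return page ``index`` of an open PyMuPDF document as a one-page PDF."""
    # Imported lazily so this sketch can be loaded where PyMuPDF is not installed.
    import fitz  # PyMuPDF

    single = fitz.open()  # new, empty PDF document
    try:
        # Copy just the requested page from the source document.
        single.insert_pdf(doc, from_page=index, to_page=index)
        return single.tobytes()
    finally:
        single.close()
```

The returned bytes can be reopened with fitz.open(stream=data, filetype="pdf") to confirm that the result has exactly one page.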
source method PDFFile.read_embedded_file(filename: str) → tuple[bytes, str]
Read embedded file.
source method PDFFile.read(filename: str, fmt: str = '', props: dict | None = None) → bytes
Return a single page pdf doc, image or pixmap or embedded file.
If a props dict is passed in, the read file extension is written to the ext key. For IMAGE_IF_DOMINANT callers inspect ext to distinguish a successful image serve (jpeg/png/webp) from the PDF fall-through (pdf).
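A caller could use the props dict to pick a response content type, roughly as sketched below. The helper name read_with_mime and the MIME mapping are illustrative assumptions; reader stands for any object exposing the read signature documented above, and the fmt value is passed through unchanged because the string forms of PageFormat are not specified here.

```python
from typing import Any

# Assumed mapping from the documented ext values to MIME types.
_MIME_BY_EXT = {
    "jpeg": "image/jpeg",
    "png": "image/png",
    "webp": "image/webp",
    "pdf": "application/pdf",
}


def read_with_mime(reader: Any, name: str, fmt: str) -> tuple[bytes, str]:
    """Read ``name`` and report the MIME type implied by the ext written to props."""
    props: dict = {}
    data = reader.read(name, fmt=fmt, props=props)
    # ext tells an image serve (jpeg/png/webp) apart from the PDF fall-through (pdf).
    ext = props.get("ext", "")
    return data, _MIME_BY_EXT.get(ext, "application/octet-stream")
```

Unknown or missing ext values fall back to application/octet-stream in this sketch, which is a design choice of the example rather than documented behaviour.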
source method PDFFile.get_page_count() → int
Get the page count from the doc or the default highnum.
source method PDFFile.get_metadata() → dict
Return metadata from the pdf doc.
source method PDFFile.write_metadata(metadata: Mapping) → None
Set metadata to the pdf doc.
source method PDFFile.remove(name: str) → None
Remove files or pages from the pdf.
source method PDFFile.writestr(name: str, buffer: str | bytes | bytearray | memoryview[int], **_kwargs) → None
Write string to an embedded file.
Accepts compress_type and compress arguments but discards them.
Raises
-
NotImplementedError
source method PDFFile.repack() → None
Noop. For compatibility with zipfile-patch.