pdffile

source package pdffile

Access PDFs with a ZipFile-like API.

Classes

source class PageFormat()

Bases : Enum

Read Format.

Attributes

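  • IMAGE_IF_DOMINANT Serve the page as an image when it is image-dominant, otherwise fall through to PDF; callers distinguish the two by inspecting the ext written to props.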

  • PIXMAP_JPEG Always-image response for any page (e.g. force-image override).

source class PDFFile(path: Path)

ZipFile-like API to PDFs.

Initialize document.

Methods

  • valid_pagenum Check if a string is a non-negative integer.

  • to_datetime Convert a PDF date string to a datetime.

  • to_pdf_date Convert a datetime to a PDF date string.

  • to_bool Convert a boolean string to a python bool.

  • to_xml_bool Convert a boolean value to an xml string.

  • is_pdffile Is the path a pdf.

  • save Save PDF doc to disk.

  • close Close the fitz doc.

  • pagelist Zero padded page names.

  • namelist Return sortable zero padded index strings.

  • infolist Return a ZipFile-like infolist.

  • read_image Read first image from page in original format.

  • read_pixmap Convert page to pixmap.

  • classify_page Decide how page index should be served.

  • read_image_if_dominant Return (bytes, ext) if page is image-dominant, else None.

  • read_full_pixmap_jpeg Render the whole page to RGB JPEG.

  • read_pdf Read a pdf page as a complete one-page pdf.

  • read_embedded_file Read embedded file.

  • read Return a single page pdf doc, image or pixmap or embedded file.

  • get_page_count Get the page count from the doc or the default highnum.

  • get_metadata Return metadata from the pdf doc.

  • write_metadata Set metadata to the pdf doc.

  • remove Remove files or pages from the pdf.

  • writestr Write string to an embedded file.

  • repack Noop. For compatibility with zipfile-patch.

source staticmethod PDFFile.valid_pagenum(name: str) → int

Check if a string is a non-negative integer.

Raises

  • ValueError
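The check described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation; the function name parse_pagenum is hypothetical, and rejecting non-ASCII digits is an assumption.

```python
def parse_pagenum(name: str) -> int:
    """Return name as a non-negative int, or raise ValueError."""
    # str.isdecimal() rejects signs, whitespace, decimal points and the
    # empty string; isascii() additionally rejects non-ASCII digits
    # (an assumption made for this sketch).
    if not (name.isascii() and name.isdecimal()):
        raise ValueError(f"not a non-negative integer: {name!r}")
    return int(name)
```

Zero-padded names such as "0007" parse to the plain integer 7, while "-1", "1.5" and "" raise ValueError.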

source staticmethod PDFFile.to_datetime(pdf_date: str) → datetime | None

Convert a PDF date string to a datetime.

source staticmethod PDFFile.to_pdf_date(value: datetime | str) → str | None

Convert a datetime to a PDF date string.
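For orientation, PDF date strings have the form D:YYYYMMDDHHmmSS followed by an optional Z or a +HH'mm' / -HH'mm' offset, with everything after the year optional. The sketch below shows one way to convert in both directions using only the standard library. The function names are hypothetical and this is not the package's code; in particular it only accepts datetime input, whereas to_pdf_date also accepts a str.

```python
from __future__ import annotations

import re
from datetime import datetime, timedelta, timezone

# Year is mandatory; every later field is optional.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def pdf_date_to_datetime(pdf_date: str) -> datetime | None:
    """Parse a PDF date string; return None if it is malformed."""
    match = _PDF_DATE.match(pdf_date.strip())
    if match is None:
        return None
    parts = match.groups()
    # Missing month/day default to 1, missing time fields to 0.
    year, month, day, hour, minute, second = (
        int(part) if part else default
        for part, default in zip(parts[:6], (0, 1, 1, 0, 0, 0))
    )
    zulu, sign, tz_hour, tz_minute = parts[6:]
    try:
        if zulu:
            tzinfo = timezone.utc
        elif sign:
            offset = timedelta(hours=int(tz_hour), minutes=int(tz_minute or 0))
            tzinfo = timezone(-offset if sign == "-" else offset)
        else:
            tzinfo = None  # no offset given: keep the datetime naive
        return datetime(year, month, day, hour, minute, second, tzinfo=tzinfo)
    except ValueError:
        # Out-of-range field such as month 13 or a 99-hour offset.
        return None


def datetime_to_pdf_date(value: datetime) -> str:
    """Format a datetime as a PDF date string."""
    stamp = value.strftime("D:%Y%m%d%H%M%S")
    offset = value.utcoffset()
    if offset is None:
        return stamp
    total_minutes = int(offset.total_seconds()) // 60
    if total_minutes == 0:
        return stamp + "Z"
    sign = "+" if total_minutes > 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"{stamp}{sign}{hours:02d}'{minutes:02d}'"
```

Returning None for malformed input mirrors the datetime | None return type in the signature above; whether the package treats every malformed string that way is not stated in this reference.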

source staticmethod PDFFile.to_bool(value: Any) → bool

Convert a boolean string to a python bool.

source staticmethod PDFFile.to_xml_bool(value: Any) → str

Convert a boolean value to an xml string.
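A helper like to_bool is useful because Python's built-in bool() treats any non-empty string as true, so bool("False") is True. The sketch below illustrates the idea under stated assumptions: the set of accepted true strings and the lowercase "true"/"false" output (the canonical XML Schema boolean forms) are choices made for this example and are not confirmed by this reference. Both function names are hypothetical.

```python
from typing import Any

# Assumption: these spellings count as true; everything else is false.
_TRUE_STRINGS = frozenset({"true", "yes", "1"})


def parse_bool(value: Any) -> bool:
    """Interpret strings by content, everything else by truthiness."""
    if isinstance(value, str):
        return value.strip().lower() in _TRUE_STRINGS
    return bool(value)


def xml_bool(value: Any) -> str:
    """Render a boolean-ish value as an XML Schema boolean string."""
    return "true" if parse_bool(value) else "false"
```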

source classmethod PDFFile.is_pdffile(path: str) → bool

Is the path a pdf.

source method PDFFile.save() → None

Save PDF doc to disk.

source method PDFFile.close() → None

Close the fitz doc.

source method PDFFile.pagelist() → list[str]

Zero padded page names.

source method PDFFile.namelist() → list[str]

Return sortable zero padded index strings.
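Zero padding matters because plain string sorting puts "10" before "2". A minimal sketch of the idea follows; it assumes zero-based indices and a pad width derived from the highest index, neither of which this reference confirms. The function name is hypothetical.

```python
def zero_padded_names(page_count: int) -> list[str]:
    """Return index strings padded so lexicographic order equals numeric order."""
    # Width of the largest index; max() keeps an empty document from
    # producing a negative number.
    width = len(str(max(page_count - 1, 0)))
    return [f"{index:0{width}d}" for index in range(page_count)]
```

With twelve pages the names run from "00" to "11", and sorting them as strings preserves page order.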

source method PDFFile.infolist() → list[ZipInfo]

Return a ZipFile-like infolist.

source method PDFFile.read_image(index: int) → tuple[bytes, str]

Read first image from page in original format.

source method PDFFile.read_pixmap(index: int) → tuple[bytes, str]

Convert page to pixmap.

source method PDFFile.classify_page(index: int) → PageVerdict

Decide how page index should be served.

Returns a PageVerdict. PageMode.PDF_FALLBACK means the caller should use the regular PDF path; the other modes mean the page is image-dominant and can be served as raw image bytes via read_image_if_dominant.

Cheap — runs on parsed PDF metadata, single-digit milliseconds per page even on text-heavy documents.

source method PDFFile.read_image_if_dominant(index: int) → tuple[bytes, str] | None

Return (bytes, ext) if page is image-dominant, else None.

ext is the embedded image’s encoding (‘jpeg’, ‘png’, ‘webp’) for IMAGE_DIRECT verdicts, or ‘jpeg’ for IMAGE_TRANSCODE verdicts (CMYK / JBIG2 / rotated pages re-encoded via Pixmap).

None means the caller should use read_pdf (or another fallback path): the page has vector content that would be lost in a raw-image serve.
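The fallback contract described above can be sketched as a small dispatch helper. It relies only on the two method signatures documented in this reference; the PageReader protocol and the helper name are hypothetical, introduced so the example runs without a real PDF.

```python
from __future__ import annotations

from typing import Protocol


class PageReader(Protocol):
    """Structural type covering the two documented methods used here."""

    def read_image_if_dominant(self, index: int) -> tuple[bytes, str] | None: ...

    def read_pdf(self, index: int) -> tuple[bytes, str]: ...


def read_page_preferring_image(pdf: PageReader, index: int) -> tuple[bytes, str]:
    """Serve raw image bytes when possible, else a one-page PDF."""
    result = pdf.read_image_if_dominant(index)
    if result is not None:
        return result
    # None signals vector content, so fall back to the PDF path.
    return pdf.read_pdf(index)
```

Any object exposing those two methods satisfies the protocol, which also makes the helper easy to exercise with a test double.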

source method PDFFile.read_full_pixmap_jpeg(index: int, *, dpi: int | None = None) → tuple[bytes, str]

Render the whole page to RGB JPEG.

Faster than read_pixmap for browser callers (PPM is not browser-renderable; PIL would need to be in the loop to transcode). Tries the cheap embedded-image path first when the page happens to be image-dominant.

dpi=None (default) auto-picks a render DPI from the page’s embedded-image resolution via choose_pixmap_dpi; pages with no images render at DEFAULT_PIXMAP_DPI. Pass an integer to override. The auto path doesn’t apply when the cheap embedded-image branch fires: those pages return the embedded image at its native resolution regardless.

Succeeds for any valid page; raises RuntimeError only if PyMuPDF can’t render the page at all.

Raises

  • RuntimeError

source method PDFFile.read_pdf(index: int) → tuple[bytes, str]

Read a pdf page as a complete one-page pdf.

Uses insert_pdf rather than Document.convert_to_pdf. The latter rebuilds the page’s content stream, and during that rebuild it drops text rendering mode operators (notably 3 Tr, invisible text) and renames specialised OCR fonts like HiddenHorzOCR to ordinary text fonts. The net effect on Acrobat-OCR’d PDFs is that the invisible OCR overlay turns visible: text “doubles up” against the page’s raster under any renderer that follows the spec (PDF.js, MuPDF itself). insert_pdf copies the page faithfully: same operators, same fonts, no warnings, pixel-identical render to the source.
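A rough sketch of the insert_pdf approach, written directly against PyMuPDF, might look like the following. It is not the package's code: the function name is hypothetical, it takes raw PDF bytes rather than a PDFFile, and PyMuPDF is imported lazily so the definition itself loads without the dependency.

```python
def extract_single_page(pdf_bytes: bytes, index: int) -> bytes:
    """Copy one page into a fresh PDF and return the serialized bytes."""
    import fitz  # PyMuPDF; imported lazily to keep the definition importable

    source = fitz.open(stream=pdf_bytes, filetype="pdf")
    single = fitz.open()  # new, empty PDF document
    try:
        # Copy exactly one page; the content stream is carried over
        # rather than rebuilt.
        single.insert_pdf(source, from_page=index, to_page=index)
        return single.tobytes()
    finally:
        single.close()
        source.close()
```

Copying with insert_pdf keeps the page's original operators and fonts, which is the property the explanation above depends on.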

source method PDFFile.read_embedded_file(filename: str) → tuple[bytes, str]

Read embedded file.

source method PDFFile.read(filename: str, fmt: str = '', props: dict | None = None) → bytes

Return a single page pdf doc, image or pixmap or embedded file.

If a props dict is passed in, the read file extension is written to the ext key. For IMAGE_IF_DOMINANT callers inspect ext to distinguish a successful image serve (jpeg/png/webp) from the PDF fall-through (pdf).
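The props mechanism can be wrapped in a small helper that returns the bytes together with the extension. This sketch relies only on the documented read signature and the ext key; the helper name is hypothetical, and the caller supplies fmt because this reference does not list the accepted format strings.

```python
from __future__ import annotations


def read_with_ext(pdf, filename: str, fmt: str = "") -> tuple[bytes, str, bool]:
    """Return (data, ext, is_pdf_fallthrough) for one read call."""
    props: dict = {}
    data = pdf.read(filename, fmt, props)
    # read() records the extension of what it actually returned.
    ext = props.get("ext", "")
    # "pdf" marks the PDF fall-through; jpeg/png/webp mark an image serve.
    return data, ext, ext == "pdf"
```

A caller can then choose a response content type from ext without parsing the returned bytes.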

source method PDFFile.get_page_count() → int

Get the page count from the doc or the default highnum.

source method PDFFile.get_metadata() → dict

Return metadata from the pdf doc.

source method PDFFile.write_metadata(metadata: Mapping) → None

Set metadata to the pdf doc.

source method PDFFile.remove(name: str) → None

Remove files or pages from the pdf.

source method PDFFile.writestr(name: str, buffer: str | bytes | bytearray | memoryview[int], **_kwargs) → None

Write string to an embedded file.

Accepts the compress_type and compress arguments but discards them.

Raises

  • NotImplementedError
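Since the buffer parameter accepts four types, an implementation needs to normalize them to bytes at some point. The sketch below shows one plausible normalization; the helper name is hypothetical, and encoding str input as UTF-8 is an assumption that this reference does not confirm.

```python
from __future__ import annotations


def to_bytes(buffer: str | bytes | bytearray | memoryview) -> bytes:
    """Normalize any accepted buffer type to an immutable bytes object."""
    if isinstance(buffer, str):
        # Assumption: text is stored UTF-8 encoded.
        return buffer.encode("utf-8")
    # bytes() copies bytearray and memoryview contents unchanged.
    return bytes(buffer)
```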

source method PDFFile.repack() → None

Noop. For compatibility with zipfile-patch.