Extracting Provenance Text with Tesseract OCR

Operational Context

Museum digitization teams routinely ingest high-resolution TIFFs and multi-page PDFs from accession ledgers, donor correspondence, and provenance cards. The operational objective is to extract structured provenance strings for direct synchronization into collection management systems. Legacy workflows rely on ad-hoc OCR calls that produce fragmented output. Modern pipelines require deterministic, memory-bounded extraction engines. Validated JSON payloads must align with institutional metadata standards before ingestion. This guide replaces heuristic scraping with a production-grade architecture.

flowchart LR
    Img["Provenance card<br/>TIFF"] --> Pre["1 · Preprocess<br/>threshold · deskew"]
    Pre --> Tess["2 · Tesseract<br/>PSM 6"]
    Tess --> Batch["3 · Memory-bounded<br/>batch"]
    Batch --> Val["4 · Validate<br/>ProvenanceRecord"]
    Val --> IIIF["5 · IIIF manifest<br/>+ CMS sync"]

Root Cause Analysis

Default Tesseract configurations target modern, high-contrast, single-column documents. Archival provenance materials violate core LSTM engine assumptions. Cascading failures emerge across four primary vectors.

Page Segmentation Mode (PSM) mismatches cause structural fragmentation. Provenance cards contain mixed orientations, marginalia, and institutional stamps. The default --psm 3 runs fully automatic page segmentation, which tends to split such a card into many small blocks. Stamp overlays then interleave with the primary text and destroy the reading order.

Insufficient preprocessing pipelines degrade contour detection. Faded iron-gall ink and yellowed paper lack binary contrast. Without adaptive thresholding and morphological noise removal, ligatures and diacritics fail recognition.

Subprocess memory bloat destabilizes batch execution. pytesseract launches a separate tesseract CLI process for every call. With high-DPI TIFFs each process holds a large decoded bitmap, and unbounded concurrency lets those processes pile up. Shared ingestion servers then hit OOM kills under concurrent load.

Unvalidated output ingestion corrupts downstream sync queues. Raw OCR strings bypass schema enforcement. Malformed dates and restricted PII propagate into CMS records. This breaks Automated Record Ingestion & Sync Workflows and forces manual reconciliation.

Step 1: Deterministic Image Preprocessing

Archival extraction requires adaptive normalization before engine invocation. OpenCV provides deterministic binarization and geometric correction. The pipeline must isolate text regions while suppressing background degradation.

python

import cv2
import numpy as np
from pathlib import Path

def preprocess_provenance_image(image_path: Path) -> np.ndarray:
    img = cv2.imread(str(image_path), cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise ValueError("Invalid image path or unsupported format")

    # Adaptive thresholding for uneven illumination
    binary = cv2.adaptiveThreshold(
        img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2
    )

    # Morphological noise removal
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    denoised = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

    # Deskew via minimum area bounding rectangle around the text pixels.
    # After adaptive THRESH_BINARY the ink is 0 and the background 255, so
    # select the zero pixels. cv2.minAreaRect needs float32 points ordered
    # (x, y), whereas np.where returns (row, col) = (y, x).
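    ys, xs = np.where(denoised == 0)
    if xs.size == 0:
        # Blank page: no ink pixels, so there is nothing to deskew.
        return denoised
    coords = np.column_stack((xs, ys)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # Fold the reported angle into [-45, 45] degrees.
    if angle < -45:
        angle = 90 + angle
    elif angle > 45:
        angle = angle - 90

    (h, w) = denoised.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    # Rotate the cleaned binary image rather than the raw grayscale, so the
    # thresholding and denoising above actually reach Tesseract.
    rotated = cv2.warpAffine(
        denoised, M, (w, h), flags=cv2.INTER_CUBIC, borderValue=255
    )
    return rotated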
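
The angle folding inside the deskew step is easy to get wrong, so it can help to isolate it. The sketch below pulls that branch into a standalone helper (normalize_skew_angle is a hypothetical name, not part of the pipeline above). It assumes only that cv2.minAreaRect may report angles near -90 or near +90 for almost-upright text, which reportedly depends on the OpenCV version in use.

```python
def normalize_skew_angle(angle: float) -> float:
    """Fold a rectangle angle (degrees) into the range [-45, 45]."""
    if angle < -45:
        return angle + 90
    if angle > 45:
        return angle - 90
    return angle


# A nearly upright card can be reported as roughly -88 or +88 degrees;
# both should fold to a small correction instead of a quarter turn.
for raw in (-88.0, -3.0, 2.5, 88.0):
    print(raw, "->", normalize_skew_angle(raw))
```

Keeping the correction within 45 degrees of zero means a slightly tilted card is nudged upright rather than rotated onto its side.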

Step 2: Tesseract Configuration & Layout Tuning

Engine parameters must override default heuristics. Provenance cards require block-level segmentation and LSTM optimization. Stamp overlays demand explicit exclusion zones or secondary pass filtering.

python

import pytesseract
import numpy as np

TESSERACT_CONFIG = "--oem 1 --psm 6 -c preserve_interword_spaces=1"

def extract_text(image_array: np.ndarray, lang: str = "eng") -> str:
    custom_config = f"{TESSERACT_CONFIG} --tessdata-dir /usr/share/tessdata"
    return pytesseract.image_to_string(
        image_array, lang=lang, config=custom_config
    ).strip()

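The --oem 1 flag selects the LSTM engine, and PSM 6 tells Tesseract to assume a single uniform block of text. This keeps the engine from carving the card into separate regions around stamps, so stamp text no longer fragments the primary provenance lines. The preserve_interword_spaces flag maintains typewriter spacing for downstream parsing. Teams handling non-Latin scripts must install the corresponding .traineddata files in the tessdata directory.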
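
One way to approach the secondary pass filtering mentioned above is to drop low-confidence words, since stamp overlays often score poorly. The following is a minimal sketch, assuming the word-level dictionary that pytesseract.image_to_data returns with output_type=pytesseract.Output.DICT (parallel lists keyed by text, conf, block_num, par_num, and line_num). The helper name keep_confident_lines and the threshold of 60 are illustrative assumptions, not tuned values.

```python
from collections import defaultdict


def keep_confident_lines(data: dict, min_conf: float = 60.0) -> list[str]:
    """Rebuild text lines from word-level OCR data, skipping weak words."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        # Non-word rows carry a confidence of -1; some versions return strings.
        conf = float(data["conf"][i])
        if not word.strip() or conf < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append(word)
    # Sort by (block, paragraph, line) to keep the reading order.
    return [" ".join(lines[key]) for key in sorted(lines)]


# Hand-written sample in the same shape; a real call would supply this dict.
sample = {
    "text": ["", "Gift", "of", "J.", "Smith", "RECEIVED", "1921"],
    "conf": ["-1", 96, 95, 91, 93, 31, 88],
    "block_num": [1, 1, 1, 1, 1, 2, 1],
    "par_num": [1, 1, 1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 1, 1, 1, 2],
}
print(keep_confident_lines(sample))
```

A confidence cutoff is a blunt instrument: faded handwriting can score as low as a stamp, so the threshold should be checked against a labelled sample before it is trusted.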

Step 3: Memory-Bounded Batch Execution

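Concurrent processing requires explicit resource isolation. The batch runner below uses a ThreadPoolExecutor with a fixed worker cap, plus built-in generic annotations such as list[Path] that need Python 3.9+. Gating submissions on available memory keeps Tesseract subprocesses from piling up during multi-page PDF ingestion.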

python

import concurrent.futures
from collections import deque
from pathlib import Path
from typing import Iterator

import psutil

def process_batch(
    image_paths: list[Path],
    max_workers: int = 4,
    min_free_bytes: int = 512 * 1024 * 1024,
) -> Iterator[tuple[Path, str]]:
    def _run(path: Path) -> str:
        return extract_text(preprocess_provenance_image(path))

    queue = deque(image_paths)
    # Map each in-flight future to its source path so failures stay attributable.
    pending: dict[concurrent.futures.Future, Path] = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        while queue or pending:
            # Hold back new submissions while free RAM is below the floor,
            # but always keep at least one task in flight to guarantee progress.
            while queue and len(pending) < max_workers and (
                psutil.virtual_memory().available > min_free_bytes or not pending
            ):
                path = queue.popleft()  # submit in input order
                pending[executor.submit(_run, path)] = path
            done, _ = concurrent.futures.wait(
                pending, return_when=concurrent.futures.FIRST_COMPLETED
            )
            for future in done:
                path = pending.pop(future)
                try:
                    yield path, future.result()
                except Exception as exc:
                    # Report the failure against the page that caused it.
                    yield path, f"OCR_FAILURE: {exc}"

The ThreadPoolExecutor works well here because a thread waiting on the Tesseract subprocess releases the GIL, and OpenCV releases it during most native calls, so the bounded pool achieves real parallelism rather than merely interleaving. Submission is throttled against available system RAM: when free memory drops below min_free_bytes, the loop stops enqueuing new pages until in-flight work drains. This is a best-effort guard rather than a hard cap, because psutil reports system-wide availability and pages already in flight can still grow. This architecture aligns with Automating OCR Metadata Extraction scaling guidelines.
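
Because the batch runner reports failures as strings prefixed with OCR_FAILURE rather than raising, the consumer has to separate them from real text before validation. A minimal sketch of that split follows; partition_results is a hypothetical helper, and the sample tuples stand in for the output of process_batch.

```python
from pathlib import Path
from typing import Iterable

FAILURE_PREFIX = "OCR_FAILURE: "


def partition_results(
    results: Iterable[tuple[Path, str]],
) -> tuple[list[tuple[Path, str]], list[tuple[Path, str]]]:
    """Split (path, text) pairs into successes and failures."""
    succeeded, failed = [], []
    for path, text in results:
        if text.startswith(FAILURE_PREFIX):
            # Keep only the error message for the failure report.
            failed.append((path, text[len(FAILURE_PREFIX):]))
        else:
            succeeded.append((path, text))
    return succeeded, failed


# Stand-in results; in the pipeline these would come from process_batch(...).
sample = [
    (Path("card_001.tif"), "Gift of J. Smith, 1921"),
    (Path("card_002.tif"), "OCR_FAILURE: Invalid image path or unsupported format"),
]
ok, bad = partition_results(sample)
print(len(ok), "succeeded;", len(bad), "failed")
```

Lists are used instead of dictionaries so that two failures reported against the same path are both retained.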

Step 4: Schema Validation & LIDO Mapping

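Raw strings require structural enforcement before CMS transmission. Pydantic v2 models validate dates, redact PII, and supply the fields for a LIDO provenance event: a lido:eventSet that holds a lido:displayEvent with the readable statement and a lido:event carrying the acquisition lido:eventType and lido:eventDate. Validation failures route to quarantine queues. The model uses date | None annotations, which Pydantic evaluates at runtime, so it needs Python 3.10+.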

python

from pydantic import BaseModel, Field, field_validator
from datetime import date, datetime
import re

class ProvenanceRecord(BaseModel):
    object_id: str = Field(pattern=r"^[A-Z]{3,4}-\d{4,6}$")
    provenance_text: str = Field(min_length=5, max_length=500)
    acquisition_date: date | None = None
    previous_owner: str | None = None
    transaction_type: str | None = None

    @field_validator("provenance_text")
    @classmethod
    def strip_pii(cls, v: str) -> str:
        return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[REDACTED]", v)

    @field_validator("acquisition_date", mode="before")
    @classmethod
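    def parse_date(cls, v: object) -> object:
        if v is None or v == "":
            return None
        if isinstance(v, date):
            # Already a date (or datetime); let Pydantic validate it as-is.
            return v
        text = str(v).strip()
        for fmt in ("%Y-%m-%d", "%d %B %Y", "%m/%d/%Y"):
            try:
                return datetime.strptime(text, fmt).date()
            except ValueError:
                continue
        raise ValueError(f"Unrecognized date format: {v}")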

The ProvenanceRecord model enforces the institutional ID pattern and redacts identifiers shaped like US Social Security numbers. Date parsing normalizes three common archival formats into date objects, which serialize as ISO 8601. Field validators execute synchronously during model instantiation. Invalid payloads fail fast, preventing CMS corruption.
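
The model validates the fields but does not itself emit LIDO. A minimal sketch of the mapping step, using only the standard library, might look like the following. The element names follow the LIDO event structure as I understand it (eventSet, displayEvent, event, eventType, eventDate); the function name provenance_event_set and the "Acquisition" term are illustrative assumptions, and a production mapping should be checked against the LIDO schema and the institution's terminology.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import Optional

LIDO_NS = "http://www.lido-schema.org"
ET.register_namespace("lido", LIDO_NS)


def _child(parent: ET.Element, tag: str, text: Optional[str] = None) -> ET.Element:
    """Append a namespaced LIDO child element, optionally with text."""
    node = ET.SubElement(parent, f"{{{LIDO_NS}}}{tag}")
    if text is not None:
        node.text = text
    return node


def provenance_event_set(display_text: str, acquired: Optional[date]) -> ET.Element:
    """Build a lido:eventSet for one acquisition event."""
    event_set = ET.Element(f"{{{LIDO_NS}}}eventSet")
    _child(event_set, "displayEvent", display_text)
    event = _child(event_set, "event")
    _child(_child(event, "eventType"), "term", "Acquisition")  # placeholder term
    if acquired is not None:
        event_date = _child(event, "eventDate")
        _child(event_date, "displayDate", acquired.isoformat())
        span = _child(event_date, "date")
        _child(span, "earliestDate", acquired.isoformat())
        _child(span, "latestDate", acquired.isoformat())
    return event_set


node = provenance_event_set("Gift of J. Smith, 1921", date(1921, 5, 14))
print(ET.tostring(node, encoding="unicode"))
```

In the pipeline, display_text and acquired would come from record.provenance_text and record.acquisition_date after validation, so redaction has already run by the time XML is produced.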

Step 5: IIIF Integration & CMS Sync

Validated records must link to source imagery for auditability. IIIF Presentation API 3.0 manifests wrap extracted metadata alongside high-resolution derivatives. The sync pipeline serializes payloads into CMS-compatible JSON.

python

from typing import Any
import json

# IIIF 3.0 requires every resource id to be an HTTP(S) URI; the Manifest id
# should resolve to the manifest itself.
IIIF_BASE = "https://iiif.example.org"

def build_iiif_provenance_manifest(record: ProvenanceRecord, image_uri: str) -> dict[str, Any]:
    base = f"{IIIF_BASE}/{record.object_id}"
    canvas_id = f"{base}/canvas"
    metadata = [
        {"label": {"en": ["LIDO Provenance"]}, "value": {"en": [record.provenance_text]}},
    ]
    # Omit the date entry instead of publishing the string "None".
    if record.acquisition_date is not None:
        metadata.append({
            "label": {"en": ["Acquisition Date"]},
            "value": {"none": [record.acquisition_date.isoformat()]},
        })
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": f"{base}/manifest",
        "type": "Manifest",
        "label": {"en": [f"Provenance Record: {record.object_id}"]},
        "metadata": metadata,
        "items": [{
            "id": canvas_id,
            "type": "Canvas",
            # Placeholder dimensions; use the source image's aspect ratio.
            "height": 3000,
            "width": 2400,
            # Canvas "items" holds the painting annotation (the image itself).
            "items": [{
                "id": f"{base}/painting-page",
                "type": "AnnotationPage",
                "items": [{
                    "id": f"{base}/painting-anno",
                    "type": "Annotation",
                    "motivation": "painting",
                    "body": {"id": image_uri, "type": "Image", "format": "image/tiff"},
                    "target": canvas_id
                }]
            }],
            # Canvas "annotations" holds non-painting content such as OCR text.
            "annotations": [{
                "id": f"{base}/page",
                "type": "AnnotationPage",
                "items": [{
                    "id": f"{base}/anno",
                    "type": "Annotation",
                    "motivation": "commenting",
                    "body": {
                        "type": "TextualBody",
                        "value": record.provenance_text,
                        "format": "text/plain"
                    },
                    "target": canvas_id
                }]
            }]
        }]
    }

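The manifest binds OCR output to the source canvas: the image is attached as a painting annotation, and the extracted text as a commenting annotation on the same canvas. The metadata entries carry the same provenance statement and acquisition date that Step 4 maps to the LIDO event. The result is a plain dictionary, so json.dumps produces a CMS-ready payload that downstream ingestion handlers can consume without further transformation.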
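
Before a manifest is handed to the CMS, a cheap structural check can catch the most common mistake: an image that is not reachable as a painting annotation. IIIF Presentation 3.0 expects painting annotations under the Canvas items property and supplementary annotations under annotations. The sketch below is a hypothetical pre-flight helper (painting_bodies is not part of any IIIF library), shown with a hand-built minimal manifest.

```python
import json


def painting_bodies(manifest: dict) -> list[str]:
    """Collect the ids of image bodies painted onto each canvas."""
    found = []
    for canvas in manifest.get("items", []):
        for page in canvas.get("items", []):
            for anno in page.get("items", []):
                if anno.get("motivation") == "painting":
                    found.append(anno["body"]["id"])
    return found


# Minimal stand-in manifest; the URIs are placeholders.
manifest = {
    "id": "https://iiif.example.org/ABC-1234/manifest",
    "type": "Manifest",
    "items": [{
        "id": "https://iiif.example.org/ABC-1234/canvas",
        "type": "Canvas",
        "items": [{
            "type": "AnnotationPage",
            "items": [{
                "type": "Annotation",
                "motivation": "painting",
                "body": {"id": "https://images.example.org/ABC-1234.tif", "type": "Image"},
            }],
        }],
    }],
}

if not painting_bodies(manifest):
    raise ValueError("manifest has no painting annotation under canvas items")
payload = json.dumps(manifest, ensure_ascii=False, indent=2)
print(payload[:60])
```

Setting ensure_ascii=False keeps diacritics in owner names readable in the serialized payload instead of escaping them.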

Production Considerations

Batch pipelines require deterministic retry logic and structured telemetry. Transient OCR failures route to exponential backoff queues. Persistent errors trigger quarantine alerts with full context payloads. Memory profiling must run continuously during peak ingestion windows.
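
The retry policy described above can be sketched as a small wrapper. This is an illustration rather than a prescribed implementation: retry_with_backoff, its default of four attempts, and the 0.5 second base delay are assumptions. The delays are deliberately free of random jitter so that reruns behave identically.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    task: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, doubling the wait after each failure; re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == attempts - 1:
                raise  # persistent failure: let the caller quarantine it
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


# Demonstration with a task that fails twice, then succeeds.
calls = {"n": 0}
waits: list[float] = []

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient OCR failure")
    return "Gift of J. Smith, 1921"

result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

Passing the sleep function as a parameter lets tests record the delays without actually waiting.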

Logging frameworks should capture PSM overrides, preprocessing parameters, and validation rejection reasons. Structured JSON logs enable rapid root cause analysis. Teams must monitor subprocess exit codes and worker thread saturation. Scaling requires horizontal pod expansion with shared tessdata volumes.
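
A minimal sketch of such structured logging, using only the standard library, is shown below. The logger name, the ocr key passed through extra, and the field names inside it are illustrative assumptions; any consistent schema serves the same purpose.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "event": record.getMessage()}
        # Fields passed via extra={"ocr": {...}} are merged into the payload.
        payload.update(getattr(record, "ocr", {}))
        return json.dumps(payload, sort_keys=True)


logger = logging.getLogger("provenance.ocr")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "ocr_complete",
    extra={"ocr": {"psm": 6, "oem": 1, "threshold_block": 11, "rejection": None}},
)
```

Recording the PSM and thresholding parameters on every event makes it possible to correlate a run of validation rejections with a configuration change.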

Compliance mandates strict access controls around donor restriction flags. Redaction validators execute before any network transmission. Audit trails preserve original TIFF hashes alongside extracted strings. This ensures full provenance chain integrity.
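
The audit trail can be sketched with the standard hashlib module. The helper names and the shape of the audit entry are assumptions for illustration; SHA-256 is used here as a common choice, not one the pipeline mandates.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large TIFFs never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_entry(path: Path, extracted_text: str) -> dict[str, str]:
    """Pair the source file hash with a hash of the text derived from it."""
    return {
        "source": path.name,
        "source_sha256": sha256_of_file(path),
        "text_sha256": hashlib.sha256(extracted_text.encode("utf-8")).hexdigest(),
    }


# Demonstration on a temporary stand-in file rather than a real TIFF.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "card_001.tif"
    demo.write_bytes(b"not a real tiff")
    entry = audit_entry(demo, "Gift of J. Smith, 1921")
print(entry["source"], entry["source_sha256"][:12])
```

Hashing the original file before preprocessing ties every extracted string back to the exact bytes that were scanned.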

Conclusion

The five-stage pipeline — preprocess, configure Tesseract, bound the batch, validate with Pydantic, publish via IIIF — converts fragile ad-hoc OCR calls into a deterministic, memory-safe extraction service. Each stage has explicit failure boundaries: preprocessing errors surface before Tesseract is invoked, PII redaction runs before network transmission, and invalid payloads are quarantined rather than propagated into the CMS.
