Failure Context & Specific Intent

Vendor-delivered Dublin Core manifests frequently stall ingestion pipelines. Silent type coercion failures disrupt batch operations. The objective is deterministic pre-commit validation. This process transforms flat DC payloads into CollectionBase’s relational schema. It prevents data loss and eliminates downstream API 422 rejections. Collections managers require transparent error triage. Engineers demand memory-efficient Python automation for multi-gigabyte datasets. Establishing the Core Architecture & Collection Taxonomy baseline ensures structural alignment before execution.

Root Cause Analysis

Dublin Core prioritizes interoperability through minimalism. It permits untyped dc:subject strings and unbounded dc:creator repetitions. CollectionBase enforces strict cardinality and ISO 8601 normalization. It requires controlled vocabulary binding via AAT/TGN URIs. Naive parsing loads entire DOMs into memory. This triggers OOM failures during large-scale ingestion. Namespace resolution often drops prefixed elements silently. Unvalidated string coercion bypasses database constraints until network transmission. A deterministic mapping contract shifts validation left. This aligns with established Mapping LIDO to Internal Databases practices.

Pipeline Architecture & Data Flow

The validation layer operates as a streaming transformer. Raw XML enters a memory-optimized iterator. Parsed nodes pass through a Pydantic schema validator. Controlled vocabulary terms resolve against a local cache. Validated records serialize to CollectionBase-compatible JSON. Invalid records route to a structured error manifest. The pipeline maintains constant memory overhead regardless of payload size.

flowchart LR
    X["Dublin Core XML"] --> S["Stream records<br/>iterparse"]
    S --> N["Namespace resolve<br/>accumulate repeats"]
    N --> V{"Pydantic validate<br/>ISO 8601 · URIs"}
    V -->|invalid| Q["Error manifest"]
    V -->|valid| B["Vocabulary binding<br/>AAT / TGN"]
    B --> O["CollectionBase JSON"]

Step 1: Streaming Ingestion & Namespace Resolution

Replace bulk DOM loading with lxml.etree.iterparse. Configure the parser to strip whitespace and resolve namespaces upfront. Map Dublin Core prefixes to their canonical URIs. Process elements sequentially to maintain sub-100MB RAM footprints.

python
from __future__ import annotations
from lxml import etree
from typing import Iterator, Dict, Any, List

DC_NS = "http://purl.org/dc/elements/1.1/"

def stream_dc_records(xml_path: str) -> Iterator[Dict[str, List[str]]]:
    context = etree.iterparse(xml_path, events=("end",), tag=f"{{{DC_NS}}}record")
    for _, elem in context:
        record: Dict[str, List[str]] = {}
        for child in elem:
            if not child.text or not child.text.strip():
                continue
            key = child.tag.split("}")[-1]
            # Dublin Core elements are repeatable (e.g. multiple dc:creator),
            # so accumulate values into a list rather than overwriting.
            record.setdefault(key, []).append(child.text.strip())
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
        yield record

Step 2: Schema Enforcement & Type Coercion

Pydantic v2 models enforce strict type boundaries. Define explicit validators for cardinality and date formatting. Field validators run during model validation (at construction time), rejecting ambiguous inputs before they ever reach serialization. This prevents silent data degradation. Consult Pydantic v2 documentation for validator lifecycle details.

python
from pydantic import BaseModel, field_validator, ValidationError
import re

class CollectionBaseRecord(BaseModel):
    object_id: str
    title_primary: str
    creator_normalized: str
    date_created: str
    subject_uris: list[str]

    @field_validator("date_created")
    @classmethod
    def validate_iso8601(cls, v: str) -> str:
        if not re.match(r"^\d{4}(-\d{2}){0,2}$", v):
            raise ValueError("Requires YYYY, YYYY-MM, or YYYY-MM-DD format")
        return v

    @field_validator("subject_uris")
    @classmethod
    def enforce_uri_format(cls, v: list[str]) -> list[str]:
        if not all(uri.startswith(("http://", "https://")) for uri in v):
            raise ValueError("Subjects must resolve to absolute URIs")
        return v

Step 3: Vocabulary Binding & URI Resolution

Free-text DC subjects require normalization against Getty vocabularies. Implement a cached resolver to map strings to AAT/TGN identifiers. Reject unresolvable terms during validation. This ensures semantic consistency across institutional boundaries. Reference the official Dublin Core Metadata Initiative for baseline element definitions.

python
from functools import lru_cache
import json
import requests

@lru_cache(maxsize=2048)
def resolve_aat_uri(term: str) -> str:
    endpoint = "https://vocab.getty.edu/aat/sparql.json"
    # Declare the SKOS prefix and bind the term via a FILTER so values
    # containing apostrophes (e.g. O'Keeffe) cannot break or inject the query.
    query = (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#> "
        "SELECT ?uri WHERE { ?uri skos:prefLabel ?label . "
        f"FILTER(STR(?label) = {json.dumps(term)} && LANG(?label) = 'en') }}"
    )
    response = requests.get(endpoint, params={"query": query}, timeout=10)
    response.raise_for_status()
    data = response.json()
    bindings = data.get("results", {}).get("bindings")
    if not bindings:
        raise ValueError(f"Unresolvable term: {term}")
    return bindings[0]["uri"]["value"]

Step 4: Batch Processing & Error Triage

Orchestrate the pipeline using Python generators. Yield validated records directly to the ingestion queue. Capture ValidationError exceptions into a structured log. Provide collections managers with actionable CSV reports. This eliminates manual reconciliation cycles.

python
def validate_and_dispatch(records: Iterator[Dict[str, Any]]) -> None:
    for idx, raw in enumerate(records):
        try:
            normalized = CollectionBaseRecord(**raw)
            print(f"[OK] Record {idx} validated")
        except ValidationError as e:
            print(f"[FAIL] Record {idx}: {e.json()}")

Compliance Verification & IIIF Alignment

Validated metadata must align with IIIF Presentation API 3.0 requirements. Ensure object_id maps to canonical manifest URIs. Verify that date_created supports temporal filtering in discovery interfaces. Cross-reference schema outputs against LIDO v1.1 specifications for semantic interoperability. Automated compliance checks should run prior to production deployment. This guarantees stable cross-institutional metadata harmonization.

Conclusion

The Dublin Core to CollectionBase pipeline is a three-contract problem: the XML namespace contract (handled by iterparse + namespace stripping), the cardinality contract (handled by accumulating repeatable elements into lists before Pydantic sees them), and the vocabulary contract (handled by lru_cache-backed AAT SPARQL resolution). Violations at any layer route to a structured error manifest rather than propagating silently into the relational schema.