Failure Context & Specific Intent
Vendor-delivered Dublin Core manifests frequently stall ingestion pipelines. Silent type coercion failures disrupt batch operations. The objective is deterministic pre-commit validation. This process transforms flat DC payloads into CollectionBase’s relational schema. It prevents data loss and eliminates downstream API 422 rejections. Collections managers require transparent error triage. Engineers demand memory-efficient Python automation for multi-gigabyte datasets. Establishing the Core Architecture & Collection Taxonomy baseline ensures structural alignment before execution.
Root Cause Analysis
Dublin Core prioritizes interoperability through minimalism. It permits untyped dc:subject strings and unbounded dc:creator repetitions. CollectionBase enforces strict cardinality and ISO 8601 normalization. It requires controlled vocabulary binding via AAT/TGN URIs. Naive parsing loads entire DOMs into memory. This triggers OOM failures during large-scale ingestion. Namespace resolution often drops prefixed elements silently. Unvalidated string coercion bypasses database constraints until network transmission. A deterministic mapping contract shifts validation left. This aligns with established Mapping LIDO to Internal Databases practices.
Pipeline Architecture & Data Flow
The validation layer operates as a streaming transformer. Raw XML enters a memory-optimized iterator. Parsed nodes pass through a Pydantic schema validator. Controlled vocabulary terms resolve against a local cache. Validated records serialize to CollectionBase-compatible JSON. Invalid records route to a structured error manifest. The pipeline maintains constant memory overhead regardless of payload size.
flowchart LR
X["Dublin Core XML"] --> S["Stream records<br/>iterparse"]
S --> N["Namespace resolve<br/>accumulate repeats"]
N --> V{"Pydantic validate<br/>ISO 8601 · URIs"}
V -->|invalid| Q["Error manifest"]
V -->|valid| B["Vocabulary binding<br/>AAT / TGN"]
B --> O["CollectionBase JSON"]Step 1: Streaming Ingestion & Namespace Resolution
Replace bulk DOM loading with lxml.etree.iterparse. Configure the parser to strip whitespace and resolve namespaces upfront. Map Dublin Core prefixes to their canonical URIs. Process elements sequentially to maintain sub-100MB RAM footprints.
from __future__ import annotations
from lxml import etree
from typing import Iterator, Dict, Any, List
DC_NS = "http://purl.org/dc/elements/1.1/"
def stream_dc_records(xml_path: str) -> Iterator[Dict[str, List[str]]]:
context = etree.iterparse(xml_path, events=("end",), tag=f"{{{DC_NS}}}record")
for _, elem in context:
record: Dict[str, List[str]] = {}
for child in elem:
if not child.text or not child.text.strip():
continue
key = child.tag.split("}")[-1]
# Dublin Core elements are repeatable (e.g. multiple dc:creator),
# so accumulate values into a list rather than overwriting.
record.setdefault(key, []).append(child.text.strip())
elem.clear()
while elem.getprevious() is not None:
del elem.getparent()[0]
yield recordStep 2: Schema Enforcement & Type Coercion
Pydantic v2 models enforce strict type boundaries. Define explicit validators for cardinality and date formatting. Field validators run during model validation (at construction time), rejecting ambiguous inputs before they ever reach serialization. This prevents silent data degradation. Consult Pydantic v2 documentation for validator lifecycle details.
from pydantic import BaseModel, field_validator, ValidationError
import re
class CollectionBaseRecord(BaseModel):
object_id: str
title_primary: str
creator_normalized: str
date_created: str
subject_uris: list[str]
@field_validator("date_created")
@classmethod
def validate_iso8601(cls, v: str) -> str:
if not re.match(r"^\d{4}(-\d{2}){0,2}$", v):
raise ValueError("Requires YYYY, YYYY-MM, or YYYY-MM-DD format")
return v
@field_validator("subject_uris")
@classmethod
def enforce_uri_format(cls, v: list[str]) -> list[str]:
if not all(uri.startswith(("http://", "https://")) for uri in v):
raise ValueError("Subjects must resolve to absolute URIs")
return vStep 3: Vocabulary Binding & URI Resolution
Free-text DC subjects require normalization against Getty vocabularies. Implement a cached resolver to map strings to AAT/TGN identifiers. Reject unresolvable terms during validation. This ensures semantic consistency across institutional boundaries. Reference the official Dublin Core Metadata Initiative for baseline element definitions.
from functools import lru_cache
import json
import requests
@lru_cache(maxsize=2048)
def resolve_aat_uri(term: str) -> str:
endpoint = "https://vocab.getty.edu/aat/sparql.json"
# Declare the SKOS prefix and bind the term via a FILTER so values
# containing apostrophes (e.g. O'Keeffe) cannot break or inject the query.
query = (
"PREFIX skos: <http://www.w3.org/2004/02/skos/core#> "
"SELECT ?uri WHERE { ?uri skos:prefLabel ?label . "
f"FILTER(STR(?label) = {json.dumps(term)} && LANG(?label) = 'en') }}"
)
response = requests.get(endpoint, params={"query": query}, timeout=10)
response.raise_for_status()
data = response.json()
bindings = data.get("results", {}).get("bindings")
if not bindings:
raise ValueError(f"Unresolvable term: {term}")
return bindings[0]["uri"]["value"]Step 4: Batch Processing & Error Triage
Orchestrate the pipeline using Python generators. Yield validated records directly to the ingestion queue. Capture ValidationError exceptions into a structured log. Provide collections managers with actionable CSV reports. This eliminates manual reconciliation cycles.
def validate_and_dispatch(records: Iterator[Dict[str, Any]]) -> None:
for idx, raw in enumerate(records):
try:
normalized = CollectionBaseRecord(**raw)
print(f"[OK] Record {idx} validated")
except ValidationError as e:
print(f"[FAIL] Record {idx}: {e.json()}")Compliance Verification & IIIF Alignment
Validated metadata must align with IIIF Presentation API 3.0 requirements. Ensure object_id maps to canonical manifest URIs. Verify that date_created supports temporal filtering in discovery interfaces. Cross-reference schema outputs against LIDO v1.1 specifications for semantic interoperability. Automated compliance checks should run prior to production deployment. This guarantees stable cross-institutional metadata harmonization.
Conclusion
The Dublin Core to CollectionBase pipeline is a three-contract problem: the XML namespace contract (handled by iterparse + namespace stripping), the cardinality contract (handled by accumulating repeatable elements into lists before Pydantic sees them), and the vocabulary contract (handled by lru_cache-backed AAT SPARQL resolution). Violations at any layer route to a structured error manifest rather than propagating silently into the relational schema.