Failure Context and Operational Intent
Ingest pipelines routinely fail when mapping RightsStatements.org v1.0 URIs to legacy CMS fields. Target columns include rights_statement, copyright_status, and access_restrictions. The operational objective is deterministic transformation. Standardized declarations must become queryable metadata. This metadata drives public portals, IIIF manifests, and internal embargo routing.
Production environments consistently encounter ValueError exceptions during bulk validation. Silent field truncation occurs in SQL-backed tables. Mismatched enum constraints break downstream API consumers. At scale, synchronous validation triggers out-of-memory conditions. The requirement is a memory-efficient mapping layer. It must enforce institutional policy while maintaining idempotent update cycles.
Root Cause Analysis
Three architectural mismatches drive these failures. First, URI canonicalization drifts across legacy ETL scripts. Trailing slashes are stripped. Protocols downgrade to http://. Query parameters get appended during string concatenation. Exact-match validators then reject identifiers that differ from the canonical form only cosmetically.
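Second, CMS schema rigidity limits field lengths. Many databases enforce VARCHAR(50) constraints. Full canonical URIs leave little headroom under that limit, and drifted variants carrying query strings exceed it and are truncated. Mapping each URI to a short alias through a normalized lookup becomes mandatory.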
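The aliasing step can be sketched as follows. This is a minimal illustration, not the CMS's actual scheme: the `RS:<code>:<version>` alias format, `MAX_FIELD_LEN`, and `uri_to_alias` are assumptions introduced here, and the function expects a URI already in the `/vocab/<code>/<version>/` shape.

```python
from urllib.parse import urlparse

MAX_FIELD_LEN = 50  # assumed VARCHAR(50) limit on the CMS column


def uri_to_alias(canonical_uri: str) -> str:
    """Derive a short alias such as 'RS:InC-EDU:1.0' from a canonical URI.

    Hypothetical alias format; assumes the path is /vocab/<code>/<version>/.
    """
    parts = [p for p in urlparse(canonical_uri).path.split("/") if p]
    if len(parts) != 3 or parts[0] != "vocab":
        raise ValueError(f"Unexpected URI shape: {canonical_uri}")
    alias = f"RS:{parts[1]}:{parts[2]}"
    if len(alias) > MAX_FIELD_LEN:
        # Fail loudly instead of letting the database truncate silently.
        raise ValueError(f"Alias exceeds {MAX_FIELD_LEN} characters: {alias}")
    return alias


print(uri_to_alias("https://rightsstatements.org/vocab/InC-OW-EU/1.0/"))
```

Storing the alias in the constrained column and the full URI in a lookup table keeps the round trip lossless while staying inside the length limit.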
Third, namespace collisions occur during mixed-batch ingestion. Rights metadata combines RightsStatements.org URIs, Creative Commons licenses, and legacy copyright codes. Parsers default to the first matched pattern. This can misclassify In Copyright as Public Domain, which triggers false positives in automated copyright status checks. Performance degrades further when pipelines load entire datasets into RAM: row-wise transformations and unbatched writes make memory grow linearly with input size.
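One way to avoid first-match misclassification is to decide the namespace from the URI host before any substring test runs. The sketch below assumes that approach; `classify_namespace` and the `LOCAL-InC-REVIEW` legacy code are hypothetical names used only for illustration.

```python
from urllib.parse import urlparse


def classify_namespace(value: str) -> str:
    """Name the vocabulary a rights value comes from, judged by host.

    Returns 'rightsstatements', 'creative_commons', or 'legacy_code'.
    """
    host = (urlparse(value.strip()).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "rightsstatements.org":
        return "rightsstatements"
    if host == "creativecommons.org":
        return "creative_commons"
    # No recognized host: treat the value as a legacy code and leave it
    # for a separate, explicit crosswalk instead of guessing.
    return "legacy_code"


samples = [
    "https://rightsstatements.org/vocab/InC/1.0/",
    "https://creativecommons.org/publicdomain/zero/1.0/",
    "LOCAL-InC-REVIEW",  # hypothetical legacy code that contains "InC"
]
for value in samples:
    print(value, "->", classify_namespace(value))
```

The third sample contains the substring "InC", so a substring-first parser would label it In Copyright; routing by host keeps it out of the RightsStatements.org branch entirely.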
Step-by-Step Resolution
Implement a chunked, schema-validated mapping pipeline. Use polars for columnar, batch-wise processing that keeps memory bounded. Pair it with pydantic for strict row-level validation. The workflow normalizes URIs, maps them to CMS-compatible enums, and validates them against the set of accepted RightsStatements.org URIs.
```mermaid
flowchart LR
    Csv["CSV chunk<br/>read_csv_batched"] --> Nm["Normalize URI<br/>strip · trailing slash"]
    Nm --> Cl{"Classify"}
    Cl -->|"InC*"| IC["IN_COPYRIGHT"]
    Cl -->|"NoC*"| NC["NO_COPYRIGHT"]
    Cl -->|"CNE · UND · NKC · other"| UN["UNKNOWN"]
    IC --> Vd["Validate vs VALID_RS_URIS"]
    NC --> Vd
    UN --> Vd
    Vd --> Out["LIDO + IIIF rights"]
```

```python
# pipeline/rights_mapper.py
import polars as pl
from pydantic import BaseModel, field_validator, ValidationError
from typing import Literal, Generator
from urllib.parse import urlparse

VALID_RS_URIS = {
    # In Copyright family
    "https://rightsstatements.org/vocab/InC/1.0/",
    "https://rightsstatements.org/vocab/InC-OW-EU/1.0/",
    "https://rightsstatements.org/vocab/InC-EDU/1.0/",
    "https://rightsstatements.org/vocab/InC-NC/1.0/",
    "https://rightsstatements.org/vocab/InC-RUU/1.0/",
    # No Copyright family
    "https://rightsstatements.org/vocab/NoC-CR/1.0/",
    "https://rightsstatements.org/vocab/NoC-NC/1.0/",
    "https://rightsstatements.org/vocab/NoC-OKLR/1.0/",
    "https://rightsstatements.org/vocab/NoC-US/1.0/",
    # Other
    "https://rightsstatements.org/vocab/CNE/1.0/",
    "https://rightsstatements.org/vocab/UND/1.0/",
    "https://rightsstatements.org/vocab/NKC/1.0/",
}


class RightsMapping(BaseModel):
    original_uri: str
    canonical_uri: str
    cms_enum: Literal["IN_COPYRIGHT", "PUBLIC_DOMAIN", "NO_COPYRIGHT", "UNKNOWN"]
    lido_element: str
    iiif_rights_field: str

    @field_validator("canonical_uri", mode="before")
    @classmethod
    def normalize_uri(cls, v: str) -> str:
        # Only the canonical form is normalized; original_uri is preserved
        # verbatim for audit and manual review. Rebuilding from scheme, host,
        # and path also drops any query string or fragment.
        parsed = urlparse(str(v).strip().rstrip("/"))
        return f"{parsed.scheme}://{parsed.netloc}{parsed.path}/"

    @field_validator("canonical_uri")
    @classmethod
    def validate_rs_uri(cls, v: str) -> str:
        if v not in VALID_RS_URIS:
            raise ValueError(f"Invalid RightsStatements.org URI: {v}")
        return v


def process_chunk(df: pl.DataFrame) -> pl.DataFrame:
    # Trim whitespace, drop any trailing slashes, then append exactly one, so
    # values missing the slash and values with several converge on one form.
    normalized = df["raw_rights_uri"].str.strip_chars().str.strip_chars_end("/") + "/"
    return df.with_columns([
        normalized.alias("canonical_uri"),
        pl.when(normalized.str.contains("InC")).then(pl.lit("IN_COPYRIGHT"))
        .when(normalized.str.contains("NoC")).then(pl.lit("NO_COPYRIGHT"))
        # CNE (Copyright Not Evaluated), UND (Undetermined) and NKC are all
        # explicitly "unknown"; never fall through to PUBLIC_DOMAIN.
        .when(normalized.str.contains("CNE")).then(pl.lit("UNKNOWN"))
        .when(normalized.is_in(["https://rightsstatements.org/vocab/UND/1.0/",
                                "https://rightsstatements.org/vocab/NKC/1.0/"]))
        .then(pl.lit("UNKNOWN"))
        # Anything unrecognized is treated as UNKNOWN, not public domain.
        .otherwise(pl.lit("UNKNOWN"))
        .alias("cms_enum"),
    ])


def stream_pipeline(source_path: str, chunk_size: int = 10_000) -> Generator[pl.DataFrame, None, None]:
    # read_csv_batched returns true row chunks; LazyFrame.fetch only returns
    # the first n rows once and is not a chunk iterator.
    reader = pl.read_csv_batched(source_path, batch_size=chunk_size)
    batches = reader.next_batches(1)
    while batches:  # next_batches returns None once the file is exhausted
        yield process_chunk(batches[0])
        batches = reader.next_batches(1)
```

Validation & Compliance Enforcement
Schema validation must occur before database writes. The pydantic model enforces URI normalization. It rejects malformed strings at the row level. The cms_enum field aligns with institutional access policies. It routes records to appropriate visibility tiers. This structure directly supports Rights Metadata Mapping & Licensing Automation workflows.
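A row-level gate ahead of the write could look like the sketch below. To stay runnable with the standard library alone, it uses a plain function as a stand-in for constructing the pydantic model; `ALLOWED`, `validate_row`, and `partition_rows` are illustrative names, and `ALLOWED` is a deliberately small subset of the accepted URIs.

```python
from typing import Iterable, List, Tuple

# Illustrative subset; a real run would check the full accepted-URI set.
ALLOWED = {
    "https://rightsstatements.org/vocab/InC/1.0/",
    "https://rightsstatements.org/vocab/NoC-US/1.0/",
}


def validate_row(row: dict) -> dict:
    """Stand-in for model construction: raise ValueError on a bad URI."""
    uri = str(row.get("canonical_uri", "")).strip()
    if uri not in ALLOWED:
        raise ValueError(f"Invalid RightsStatements.org URI: {uri!r}")
    return {**row, "canonical_uri": uri}


def partition_rows(rows: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Split rows into (writable, rejected) without touching the database."""
    writable, rejected = [], []
    for row in rows:
        try:
            writable.append(validate_row(row))
        except ValueError as exc:
            # Keep the original row unchanged next to the rejection reason.
            rejected.append({"row": row, "error": str(exc)})
    return writable, rejected


good, bad = partition_rows([
    {"id": 1, "canonical_uri": "https://rightsstatements.org/vocab/InC/1.0/"},
    {"id": 2, "canonical_uri": "rights: see file"},
])
print(len(good), "writable;", len(bad), "rejected")
```

Only the first list reaches the write step, so a malformed value never lands in the constrained columns.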
LIDO compliance requires mapping to <lido:rightsWork> and <lido:rightsResource>. The pipeline injects these values into the XML export queue. IIIF Presentation API 3.0 expects a rights property. The iiif_rights_field column populates this attribute automatically. See the official IIIF Presentation API 3.0 Specification for structural requirements.
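As a rough illustration of where the value ends up, the sketch below builds a stub manifest with a `rights` property. `build_manifest_stub` and the `example.org` identifier are hypothetical, and the stub leaves `items` empty, so it is not a complete manifest. The specification constrains which URIs are acceptable for `rights`, so the exact form should be checked against it.

```python
import json


def build_manifest_stub(manifest_id: str, label: str, rights_uri: str) -> dict:
    """Return a minimal manifest-shaped dict carrying a rights URI.

    Illustrative only: canvases are omitted, so this is a stub and not a
    complete IIIF Presentation 3.0 manifest.
    """
    if not rights_uri.startswith(("http://", "https://")):
        raise ValueError(f"rights must be an absolute URI, got: {rights_uri!r}")
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": manifest_id,
        "type": "Manifest",
        "label": {"en": [label]},
        "rights": rights_uri,  # a single string, not a list or object
        "items": [],  # canvases would be added here
    }


stub = build_manifest_stub(
    "https://example.org/iiif/object-1/manifest",  # hypothetical identifier
    "Object 1",
    "https://rightsstatements.org/vocab/InC/1.0/",
)
print(json.dumps(stub, indent=2))
```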
Integration with IIIF/LIDO Workflows
IIIF manifests require absolute URIs for the rights key. The pipeline strips legacy abbreviations. It injects the canonical URI directly. LIDO v1.1 structures rights data hierarchically. The mapper flattens complex statements into discrete <lido:rightsType> nodes. Downstream harvesters consume these standardized outputs.
OAI-PMH endpoints serialize the data without transformation overhead. Crosswalks to Dublin Core and MODS remain intact. This ensures interoperability with aggregators. Refer to the RightsStatements.org Standard for authoritative URI definitions. Python type hints document the contract across the stack, and pydantic enforces it at runtime. Consult the Python typing module for advanced generic patterns.
Error Handling & Monitoring
Production pipelines require graceful degradation. A ValidationError handler around model construction logs malformed URIs to a quarantine table and preserves the original record for manual review. Retry logic applies exponential backoff during database commits. Connection pooling caps concurrent connections, which limits lock contention during bulk writes.
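The backoff behavior can be sketched as follows. This is a minimal version under stated assumptions: `commit_with_backoff` is a hypothetical helper, `ConnectionError` stands in for whatever transient exception the database driver raises, and the default delays are placeholders.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def commit_with_backoff(
    commit: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call commit(), retrying on ConnectionError with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return commit()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt)  # 0.5s, 1s, 2s, ...
            # A little random jitter keeps parallel workers from retrying
            # in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so that tests can pass a recorder in place of `time.sleep`. Retrying a commit is only safe when the write itself is idempotent.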
Metrics track mapping success rates and enum distribution. Alerts trigger when the rejection rate exceeds two percent. Fallback chains route missing data to embargo workflows until curatorial review completes. Idempotent writes guarantee safe pipeline restarts. Deterministic hashing prevents duplicate record insertion.
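A minimal sketch of the hashing and alerting rules, assuming a record is identified by its object ID plus canonical URI. `record_key`, `upsert`, `should_alert`, and the in-memory `store` dict are illustrative stand-ins for the real table and metrics backend.

```python
import hashlib

REJECTION_THRESHOLD = 0.02  # two percent


def record_key(object_id: str, canonical_uri: str) -> str:
    """Deterministic key: the same inputs always hash to the same value."""
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    payload = f"{object_id}\x1f{canonical_uri}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def upsert(store: dict, object_id: str, canonical_uri: str, cms_enum: str) -> bool:
    """Write the record under its hash key; return True if it was new."""
    key = record_key(object_id, canonical_uri)
    is_new = key not in store
    store[key] = {"object_id": object_id, "canonical_uri": canonical_uri,
                  "cms_enum": cms_enum}
    return is_new


def should_alert(rejected: int, total: int) -> bool:
    """True when the rejection rate is strictly above the threshold."""
    return total > 0 and rejected / total > REJECTION_THRESHOLD


store: dict = {}
uri = "https://rightsstatements.org/vocab/InC/1.0/"
upsert(store, "obj-1", uri, "IN_COPYRIGHT")
upsert(store, "obj-1", uri, "IN_COPYRIGHT")  # replayed after a restart
print(len(store), should_alert(rejected=3, total=100))
```

Replaying the same batch overwrites the existing key instead of inserting a second row, which is what makes a restart safe.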
Conclusion
The classification logic has one critical invariant: CNE, UND, and NKC, none of which asserts a conclusively determined copyright status, must map to UNKNOWN, never to PUBLIC_DOMAIN. Treating ambiguous status as public domain is the single most dangerous failure mode in rights pipelines; the otherwise(pl.lit("UNKNOWN")) catch-all in the Polars expression enforces this conservatively for any unrecognized URI pattern.