Collections management systems can misroute restricted assets during bulk metadata synchronization. Legacy exports often output free-text rights statements instead of canonical URIs. This pipeline ingests raw CMS data, validates each license URI against the canonical Creative Commons BY-NC-ND 4.0 URI, and generates deterministic dcterms:rights tags. The output targets IIIF manifests, LIDO-compliant catalogs, and public discovery layers. Validating before publication helps prevent accidental exposure of embargoed materials.
The workflow relies on a generator-driven architecture to prevent memory exhaustion during high-volume ingestion. Loading an entire export into a DataFrame holds every record in memory at once and can trigger OOM kills on large files. Instead, the pipeline streams records through a validation gate, applies URI canonicalization, and writes chunks to disk. This approach aligns with Rights Metadata Mapping & Licensing Automation best practices for deterministic state management.
flowchart LR
P1["Phase 1<br/>Schema + URI<br/>canonicalization"] --> P2["Phase 2<br/>Stream ingest<br/>LIDO / IIIF map"]
P2 --> P3["Phase 3<br/>Chunked output<br/>idempotent writes"]
P1 -.->|invalid URI| Q["Quarantine CSV"]
Phase 1: Schema Enforcement and URI Canonicalization
Define a Pydantic model to enforce strict URI normalization. The validator strips whitespace, lowercases the value, normalizes the trailing slash, and rejects anything that is not HTTPS or does not match the canonical URI.
from pydantic import BaseModel, field_validator, ValidationError
from typing import Optional

class CC_BY_NC_ND_Record(BaseModel):
    asset_id: str
    license_uri: str
    rights_statement: Optional[str] = None

    @field_validator('license_uri')
    @classmethod
    def normalize_and_validate_uri(cls, v: str) -> str:
        canonical = "https://creativecommons.org/licenses/by-nc-nd/4.0/"
        # Trim, lowercase, hyphenate stray spaces, and force exactly one trailing slash.
        normalized = v.strip().lower().replace(" ", "-").rstrip('/') + '/'
        # Plain HTTP is rejected rather than silently upgraded.
        if not normalized.startswith("https://"):
            raise ValueError(f"License URI must use HTTPS: {v}")
        if normalized != canonical:
            raise ValueError(f"Non-canonical URI: {v}")
        return canonical
This model enforces field types and rejects malformed payloads before downstream routing. Integration with Routing Creative Commons Licenses requires exact URI matches to prevent namespace collisions in JSON-LD contexts.
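To see what the validator accepts, its normalization steps can be exercised in isolation. The sketch below repeats the same string operations in a plain function so it runs without Pydantic; the name `normalize_license_uri` and the sample inputs are hypothetical and not part of the pipeline.

```python
CANONICAL = "https://creativecommons.org/licenses/by-nc-nd/4.0/"

def normalize_license_uri(raw: str) -> str:
    """Mirror the validator: trim, lowercase, hyphenate spaces, force one trailing slash."""
    normalized = raw.strip().lower().replace(" ", "-").rstrip("/") + "/"
    if not normalized.startswith("https://"):
        raise ValueError(f"License URI must use HTTPS: {raw}")
    if normalized != CANONICAL:
        raise ValueError(f"Non-canonical URI: {raw}")
    return CANONICAL

# Hypothetical inputs of the kind a legacy export might contain.
samples = [
    "  HTTPS://CreativeCommons.org/licenses/by-nc-nd/4.0  ",  # differs only in case, padding, slash
    "http://creativecommons.org/licenses/by-nc-nd/4.0/",      # plain HTTP
    "https://creativecommons.org/licenses/by-nc/4.0/",        # a different license
]

results = {}
for sample in samples:
    try:
        results[sample] = normalize_license_uri(sample)
    except ValueError as exc:
        results[sample] = f"rejected: {exc}"
```

Only the first sample is accepted. The HTTP variant is rejected rather than upgraded, so a curator has to confirm the source value before it is published.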
Phase 2: Streaming Ingestion and LIDO/IIIF Mapping
Process legacy CSV exports using Python’s built-in csv module. XML exports need a streaming XML parser instead, such as xml.etree.ElementTree.iterparse. Avoid loading entire datasets into memory. Use a generator function to yield validated records sequentially.
Map validated URIs to the LIDO lido:rightsWorkSet element and the IIIF rights property. The IIIF Presentation API 3.0 defines rights as a single URI string identifying the license or rights statement that applies to a resource. Reference the official IIIF Presentation API 3.0 documentation for the allowed URI forms and context alignment.
import csv
from pathlib import Path
from typing import Iterator

def stream_csv_records(input_path: Path) -> Iterator[dict]:
    # newline="" lets the csv module handle embedded line breaks itself.
    with open(input_path, mode="r", encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            try:
                validated = CC_BY_NC_ND_Record(
                    asset_id=row["asset_id"],
                    license_uri=row.get("license_uri", ""),
                    rights_statement=row.get("rights_statement")
                )
                yield validated.model_dump()
            except ValidationError as e:
                # Keep the batch running; downstream stages detect the "error" key.
                yield {"asset_id": row.get("asset_id"), "error": str(e)}
This pattern isolates validation errors without halting the batch process. Failed records are yielded as dictionaries with an error key so that a later stage can route them to a quarantine queue for manual review.
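The mapping step described at the start of this phase can be sketched as two small functions. This is an illustration under assumptions, not a definitive mapping: the function names are hypothetical, the LIDO namespace URI and the `rightsWorkSet` / `rightsType` / `conceptID` / `term` nesting reflect one reading of the LIDO schema and should be checked against the version in use, and the term label is a placeholder.

```python
import xml.etree.ElementTree as ET

# Assumed LIDO namespace URI; confirm against your schema version.
LIDO_NS = "http://www.lido-schema.org"
ET.register_namespace("lido", LIDO_NS)

def to_iiif_rights(record: dict) -> dict:
    """Return the fragment to merge into a IIIF resource: one URI string under 'rights'."""
    return {"rights": record["license_uri"]}

def to_lido_rights_work_set(record: dict) -> ET.Element:
    """Build a lido:rightsWorkSet element carrying the license URI (assumed structure)."""
    set_el = ET.Element(f"{{{LIDO_NS}}}rightsWorkSet")
    type_el = ET.SubElement(set_el, f"{{{LIDO_NS}}}rightsType")
    concept = ET.SubElement(type_el, f"{{{LIDO_NS}}}conceptID")
    concept.set(f"{{{LIDO_NS}}}type", "URI")
    concept.text = record["license_uri"]
    term = ET.SubElement(type_el, f"{{{LIDO_NS}}}term")
    term.text = "CC BY-NC-ND 4.0"  # placeholder display label
    return set_el

# Hypothetical validated record, shaped like the output of stream_csv_records.
record = {
    "asset_id": "obj-001",
    "license_uri": "https://creativecommons.org/licenses/by-nc-nd/4.0/",
}
iiif_fragment = to_iiif_rights(record)
lido_xml = ET.tostring(to_lido_rights_work_set(record), encoding="unicode")
```

Keeping both mappings as pure functions of the validated record means the same record always yields the same IIIF and LIDO output, which is what makes the generated tags deterministic.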
Phase 3: Chunked Output and Idempotent Writes
Write validated payloads in fixed-size chunks so that memory use stays bounded and each output file is self-contained. Use json serialization with an explicit @context. Name files deterministically from the chunk index so that a restarted run overwrites earlier output instead of duplicating it.
import json
from itertools import islice

CHUNK_SIZE = 1000

def write_manifest_chunks(records: Iterator[dict], output_dir: Path) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    chunk_iter = iter(records)
    chunk_idx = 0
    while True:
        chunk = list(islice(chunk_iter, CHUNK_SIZE))
        if not chunk:
            break
        # A IIIF Collection whose members are Manifest references: the
        # @context is a single string, and each item is typed Manifest.
        collection = {
            "@context": "http://iiif.io/api/presentation/3/context.json",
            "id": f"https://example.org/iiif/collection/cc_by_nc_nd_batch_{chunk_idx:04d}",
            "type": "Collection",
            "label": {"en": ["CC BY-NC-ND batch"]},
            "items": []
        }
        for rec in chunk:
            # Records that failed validation are left out of the published collection.
            if "error" in rec:
                continue
            collection["items"].append({
                "id": f"https://example.org/iiif/{rec['asset_id']}/manifest",
                "type": "Manifest",
                "label": {"en": [rec["asset_id"]]},
                "rights": rec["license_uri"]
            })
        out_file = output_dir / f"cc_by_nc_nd_batch_{chunk_idx:04d}.json"
        with open(out_file, "w", encoding="utf-8") as f:
            json.dump(collection, f, indent=2, ensure_ascii=False)
        chunk_idx += 1
Chunked serialization bounds memory use and makes reruns safe, because a rerun over the same input overwrites the same file names. Each batch operates independently, satisfying the requirements for Routing Creative Commons Licenses ingestion pipelines.
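Deterministic file names make reruns repeatable, but a crash in the middle of a write can still leave a truncated JSON file behind. One common way to tighten this, shown here as a sketch rather than as part of the pipeline above, is to write to a temporary file in the same directory and rename it into place. The helper name `write_json_atomically` is hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path

def write_json_atomically(payload: dict, out_file: Path) -> None:
    """Write payload to a temp file next to out_file, then rename it into place."""
    out_file.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=out_file.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f, indent=2, ensure_ascii=False)
        # os.replace overwrites an existing target, so a rerun yields the same final file.
        os.replace(tmp_name, out_file)
    except BaseException:
        # Remove the partial temp file, then let the original error propagate.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

With a helper like this, readers of the output directory see either the previous complete file or the new complete file, never a partial one.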
Error Routing and Fallback Chains
Implement structured logging with the standard logging module. Route validation failures to a separate CSV for reconciliation. Align fallback logic with Rights Metadata Mapping & Licensing Automation protocols for missing rights data. When license_uri is absent, apply a conservative embargo state rather than defaulting to open access.
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

def process_pipeline(input_csv: Path, output_dir: Path) -> None:
    logging.info("Starting CC-BY-NC-ND tagging pipeline")
    # Both stages are lazy: records are validated as the writer pulls them.
    records = stream_csv_records(input_csv)
    write_manifest_chunks(records, output_dir)
    logging.info("Pipeline complete. Check quarantine logs for validation failures.")
Processing records sequentially in a single process keeps the output deterministic and avoids race conditions, and the logs provide an audit trail. Integration with external CMS APIs should use exponential backoff and connection pooling. Consult the Python logging documentation for production-grade handler configuration.
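The quarantine CSV and the conservative fallback described in this section can be sketched as follows. This is a minimal illustration under assumptions: the function names, the `access_state` field, and the `embargoed` label are hypothetical, and the sketch relies only on the `error` key that failed records carry.

```python
import csv
from pathlib import Path
from typing import Iterable, Iterator

EMBARGO_STATE = "embargoed"  # hypothetical label for the conservative fallback

def apply_embargo_fallback(row: dict) -> dict:
    """Mark rows with no license_uri as embargoed instead of treating them as open."""
    if not (row.get("license_uri") or "").strip():
        return {**row, "access_state": EMBARGO_STATE}
    return row

def split_quarantine(records: Iterable[dict], quarantine_csv: Path) -> Iterator[dict]:
    """Yield valid records; write records carrying an 'error' key to a quarantine CSV."""
    quarantine_csv.parent.mkdir(parents=True, exist_ok=True)
    with open(quarantine_csv, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["asset_id", "error"])
        writer.writeheader()
        for rec in records:
            if "error" in rec:
                writer.writerow({"asset_id": rec.get("asset_id"), "error": rec["error"]})
            else:
                yield rec
```

One way to wire this in is to wrap the record stream before it reaches the chunk writer, for example `write_manifest_chunks(split_quarantine(records, output_dir / "quarantine.csv"), output_dir)`. Because `split_quarantine` is a generator, the quarantine file is complete only once the stream has been fully consumed.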
Verification and Compliance
Validate LIDO output against the LIDO v1.1 schema and IIIF output against the Presentation API 3.0 context. Run automated tests to verify URI canonicalization and chunk boundaries. Ensure every dcterms:rights value is the official Creative Commons license URI. These checks reduce the risk of unauthorized distribution and help maintain institutional rights integrity.
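An automated check of the written chunks might look like the sketch below. It is not schema validation; it only covers the two properties named above, canonical URIs and chunk boundaries. The function name `check_chunk_file` is hypothetical, and `CHUNK_SIZE` is assumed to match the value used when the files were written.

```python
import json
from pathlib import Path

CANONICAL = "https://creativecommons.org/licenses/by-nc-nd/4.0/"
CHUNK_SIZE = 1000  # assumed to match the writer's setting

def check_chunk_file(path: Path) -> list[str]:
    """Return human-readable problems found in one chunk file (empty list if none)."""
    problems = []
    data = json.loads(path.read_text(encoding="utf-8"))
    if data.get("type") != "Collection":
        problems.append(f"{path.name}: type is not Collection")
    items = data.get("items", [])
    if len(items) > CHUNK_SIZE:
        problems.append(f"{path.name}: {len(items)} items exceeds chunk size {CHUNK_SIZE}")
    for item in items:
        if item.get("rights") != CANONICAL:
            problems.append(f"{path.name}: {item.get('id')} has rights {item.get('rights')!r}")
    return problems
```

Running this over every `*.json` file in the output directory and failing the build on any non-empty result gives a cheap regression test alongside full schema validation.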
Conclusion
The canonicalization validator is intentionally strict: it rejects anything that does not exactly match https://creativecommons.org/licenses/by-nc-nd/4.0/ after normalization. A lenient validator that accepted partial matches would silently allow CC BY or CC BY-NC URIs through, incorrectly permitting commercial reuse or derivative works on assets whose rights require the -NC (non-commercial) and -ND (no derivatives) restrictions. Rejected records route to a quarantine CSV so curators can review and correct the source data, rather than having the pipeline silently absorb the error.
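The difference between strict and lenient matching is easy to demonstrate. The prefix check below is a hypothetical example of a lenient rule, not something the pipeline uses.

```python
CANONICAL = "https://creativecommons.org/licenses/by-nc-nd/4.0/"

def lenient_match(uri: str) -> bool:
    # Hypothetical lenient rule: accept any CC license whose path starts with "by".
    return uri.startswith("https://creativecommons.org/licenses/by")

def strict_match(uri: str) -> bool:
    # The rule the validator applies after normalization: exact equality.
    return uri == CANONICAL

candidates = [
    "https://creativecommons.org/licenses/by/4.0/",
    "https://creativecommons.org/licenses/by-nc/4.0/",
    "https://creativecommons.org/licenses/by-nc-nd/4.0/",
]
lenient_accepted = [u for u in candidates if lenient_match(u)]
strict_accepted = [u for u in candidates if strict_match(u)]
```

The lenient rule accepts all three candidates, including the two more permissive licenses, while the strict rule accepts only the BY-NC-ND URI.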