Workflow Context
Museum DAMS pipelines ingest heterogeneous rights statements from legacy CMS exports, donor contracts, and digitization logs. Routing Creative Commons licenses requires deterministic normalization of free-text fields into machine-actionable URIs. The process must keep CC 3.0 and CC 4.0 statements distinct, because the two versions are separate legal instruments with separate URIs. Non-compliant records require isolation before publication. This subsystem operates within broader Rights Metadata Mapping & Licensing Automation architectures. High-throughput queues demand idempotent processing and explicit fallback routing.
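The version requirement can be illustrated with a resolver that reads the version number out of the statement instead of assuming one. This is a minimal sketch under stated assumptions, not the pipeline's actual normalizer: the function name resolve_versioned_uri, the regular expression, and the decision to return None when no supported version is stated are all hypothetical. It handles only the unported 3.0 and international 4.0 URIs and ignores jurisdiction ports.

```python
import re
from typing import Optional

# Hypothetical pattern: "cc", a space or hyphen, a BY-family code,
# then an optional "N.N" version number.
_CC_PATTERN = re.compile(r"\bcc[ -](by(?:-nc)?(?:-(?:sa|nd))?)\b(?:\s+(\d\.\d))?")
SUPPORTED_VERSIONS = {"3.0", "4.0"}


def resolve_versioned_uri(statement: str) -> Optional[str]:
    """Return a versioned CC URI, or None if the version is missing or unsupported."""
    normalized = statement.strip().lower().replace("creative commons", "cc")
    match = _CC_PATTERN.search(normalized)
    if match is None:
        return None
    code, version = match.group(1), match.group(2)
    if version not in SUPPORTED_VERSIONS:
        # Do not guess a version: leave the record for curator review.
        return None
    return f"https://creativecommons.org/licenses/{code}/{version}/"


print(resolve_versioned_uri("Creative Commons BY-NC 3.0"))
print(resolve_versioned_uri("CC BY-SA"))
```

Returning None for an unversioned statement is a deliberate choice in this sketch: it pushes the record toward review rather than silently assigning 4.0.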
Architecture and Data Flow
The routing engine decouples ingestion, normalization, validation, and distribution into discrete async stages. A semaphore-controlled worker pool prevents upstream CMS overload during batch operations. Metadata flows through a strict validation gate before reaching IIIF manifest generators. LIDO export routines consume the same normalized payload. Concurrent processing requires a dead-letter queue for malformed inputs that is safe under concurrent writes: within a single asyncio event loop a plain list suffices, while multi-threaded or multi-process workers need a thread-safe queue. Automating Copyright Status Checks provides upstream signals that feed this routing stage.
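How a semaphore bounds load on the upstream CMS can be seen in isolation. The sketch below is illustrative only: fetch_record, fetch_all, and the stats dictionary are hypothetical names, and asyncio.sleep stands in for a real CMS round trip. It records the peak number of simultaneous fetches so the bound can be checked.

```python
import asyncio


async def fetch_record(asset_id: str, limit: asyncio.Semaphore, stats: dict) -> str:
    # Only max_concurrency coroutines can be inside this block at once.
    async with limit:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for a CMS round trip
        stats["active"] -= 1
        return asset_id


async def fetch_all(asset_ids: list, max_concurrency: int = 3) -> tuple:
    limit = asyncio.Semaphore(max_concurrency)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(fetch_record(a, limit, stats) for a in asset_ids)
    )
    return list(results), stats["peak"]


ids, peak = asyncio.run(fetch_all([f"OBJ-{n}" for n in range(10)]))
print(len(ids), "records fetched; peak concurrency", peak)
```

All ten tasks are scheduled immediately, but the recorded peak never exceeds max_concurrency, which is the property that protects the CMS during batch runs.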
```mermaid
flowchart LR
    St["Raw rights statement"] --> Nm["Normalize<br/>lowercase · 'cc'"]
    Nm --> Rs{"Resolve license<br/>longest match first"}
    Rs -->|matched| Mp["Map to canonical CC URI"]
    Rs -->|none| DLQ["Dead-letter queue"]
    Mp --> Rt{"Allows reuse<br/>(contains 'by')?"}
    Rt -->|yes| IIIF["IIIF manifest"]
    Rt -->|no| Arch["Internal archive"]
```
Core Routing Implementation
Production implementations rely on asyncio for concurrency and pydantic for schema enforcement. The following engine normalizes raw statements, maps them to canonical URIs, and routes assets to designated endpoints. The code requires Python 3.10+ for the str | None union syntax and pydantic v2 for field_validator.
```python
import asyncio
import hashlib
import logging
from enum import Enum
from typing import Any

from pydantic import BaseModel, Field, field_validator

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("cc_license_router")


class CCLicense(str, Enum):
    CC0 = "https://creativecommons.org/publicdomain/zero/1.0/"
    CC_BY = "https://creativecommons.org/licenses/by/4.0/"
    CC_BY_SA = "https://creativecommons.org/licenses/by-sa/4.0/"
    CC_BY_NC = "https://creativecommons.org/licenses/by-nc/4.0/"
    CC_BY_ND = "https://creativecommons.org/licenses/by-nd/4.0/"
    CC_BY_NC_SA = "https://creativecommons.org/licenses/by-nc-sa/4.0/"
    CC_BY_NC_ND = "https://creativecommons.org/licenses/by-nc-nd/4.0/"


class AssetRightsPayload(BaseModel):
    asset_id: str = Field(..., min_length=3, pattern=r"^[A-Z0-9\-]+$")
    raw_rights_statement: str
    rights_source: str = "legacy_cms"
    embargo_until: str | None = None

    @field_validator("raw_rights_statement", mode="before")
    @classmethod
    def normalize_statement(cls, v: Any) -> Any:
        # Pass non-string input through so pydantic reports a type error
        # instead of this validator raising AttributeError.
        if not isinstance(v, str):
            return v
        return v.strip().lower().replace("creative commons", "cc")


class RoutedAsset(BaseModel):
    asset_id: str
    license_uri: str
    routing_destination: str
    validation_status: str = "passed"
    metadata_hash: str | None = None


class LicenseRouter:
    def __init__(self, max_concurrency: int = 10):
        self.semaphore = asyncio.Semaphore(max_concurrency)
        self.dead_letter_queue: list[dict[str, Any]] = []
        self._mapping = {
            "cc0": CCLicense.CC0,
            "cc by": CCLicense.CC_BY,
            "cc by-sa": CCLicense.CC_BY_SA,
            "cc by-nc": CCLicense.CC_BY_NC,
            "cc by-nd": CCLicense.CC_BY_ND,
            "cc by-nc-sa": CCLicense.CC_BY_NC_SA,
            "cc by-nc-nd": CCLicense.CC_BY_NC_ND,
        }

    def _resolve_license(self, statement: str) -> CCLicense | None:
        # Match the most specific key first: "cc by" is a substring of
        # "cc by-nc-nd", so test longest keys before shortest to avoid
        # mis-tagging every by-* variant as plain CC BY.
        for key in sorted(self._mapping, key=len, reverse=True):
            if key in statement:
                return self._mapping[key]
        return None

    async def process_asset(self, payload: AssetRightsPayload) -> RoutedAsset | None:
        async with self.semaphore:
            try:
                license_uri = self._resolve_license(payload.raw_rights_statement)
                if license_uri is None:
                    raise ValueError("Unrecognized CC license pattern")
                # Decide on the resolved license, not the raw text: a statement
                # such as "cc0, donated by ..." contains "by" without being a
                # BY license.
                destination = "iiif_manifest" if "/by" in license_uri.value else "internal_archive"
                # Hash the plain URI string: f-string formatting of a str-mixin
                # Enum member differs across Python versions.
                metadata_hash = hashlib.sha256(
                    f"{payload.asset_id}:{license_uri.value}".encode()
                ).hexdigest()
                return RoutedAsset(
                    asset_id=payload.asset_id,
                    license_uri=license_uri.value,
                    routing_destination=destination,
                    metadata_hash=metadata_hash,
                )
            except Exception as exc:
                logger.warning("Routing failed for %s: %s", payload.asset_id, exc)
                self.dead_letter_queue.append({
                    "asset_id": payload.asset_id,
                    "error": str(exc),
                    "raw_statement": payload.raw_rights_statement,
                })
                return None

    async def run_batch(self, payloads: list[AssetRightsPayload]) -> list[RoutedAsset]:
        tasks = [self.process_asset(p) for p in payloads]
        # return_exceptions=True keeps one failure from cancelling the batch;
        # keep only successfully routed assets.
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return [r for r in results if isinstance(r, RoutedAsset)]
```
IIIF and LIDO Schema Alignment
Normalized URIs must map directly to presentation layer specifications. IIIF Presentation API 3.0 expects the rights property to be a single URI string, drawn from the Creative Commons or RightsStatements.org vocabularies unless an extension is registered. LIDO expresses rights in <rightsWorkSet> blocks, wrapped in <rightsWorkWrap>, with <rightsType> and <rightsHolder> child elements. The router output feeds directly into these XML/JSON structures. Implementing Embargo Workflows handles temporal restrictions before IIIF manifest generation.
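As a rough illustration of that mapping, the sketch below places one normalized URI into a IIIF manifest fragment and into a LIDO rights element. The function names, the example manifest URL, and the holder name are hypothetical, and the manifest fragment omits the label and items a complete manifest needs. The LIDO nesting (rightsWorkSet containing rightsType/conceptID and rightsHolder/legalBodyName/appellationValue) follows one reading of LIDO 1.0 and should be validated against the published schema. The IIIF specification's own examples write Creative Commons URIs with the http:// scheme, so it is worth checking which form your validator expects.

```python
import json
import xml.etree.ElementTree as ET

LIDO_NS = "http://www.lido-schema.org"


def iiif_rights_fragment(manifest_id: str, license_uri: str) -> dict:
    # Partial manifest: only the fields relevant to rights are shown.
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": manifest_id,
        "type": "Manifest",
        "rights": license_uri,
    }


def lido_rights_work_set(license_uri: str, holder_name: str) -> ET.Element:
    ET.register_namespace("lido", LIDO_NS)

    def tag(name: str) -> str:
        return f"{{{LIDO_NS}}}{name}"

    rights_set = ET.Element(tag("rightsWorkSet"))
    rights_type = ET.SubElement(rights_set, tag("rightsType"))
    concept = ET.SubElement(rights_type, tag("conceptID"), {tag("type"): "URI"})
    concept.text = license_uri
    holder = ET.SubElement(rights_set, tag("rightsHolder"))
    legal_body = ET.SubElement(holder, tag("legalBodyName"))
    ET.SubElement(legal_body, tag("appellationValue")).text = holder_name
    return rights_set


uri = "https://creativecommons.org/licenses/by/4.0/"
fragment = iiif_rights_fragment("https://example.org/iiif/OBJ-1/manifest", uri)
element = lido_rights_work_set(uri, "Example Museum")
print(json.dumps(fragment, indent=2))
print(ET.tostring(element, encoding="unicode"))
```

Both outputs are built from the same URI string, which is the point of normalizing once and reusing the payload downstream.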
Fallback Chains and Telemetry
Ambiguous rights statements require deterministic fallback routing. The engine routes unrecognized patterns to a staging endpoint for curator review. Structured telemetry captures routing decisions, validation failures, and hash collisions. Audit logs must comply with institutional retention policies. Automating CC-BY-NC-ND Tagging in Python extends this pipeline for restricted commercial use cases.
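One way to picture a deterministic fallback chain with telemetry is a fixed, ordered list of resolvers, each tried in turn, with a structured event emitted for every decision. This is a sketch under assumptions rather than the engine's real design: the resolver names, the event fields, the "staging_review" label, and the tiny lookup table are all invented for illustration.

```python
import json
import logging
from typing import Callable, Optional

logger = logging.getLogger("cc_license_router.telemetry")

# Deliberately small, illustrative lookup table.
_LABELS = {
    "cc0": "https://creativecommons.org/publicdomain/zero/1.0/",
    "cc by": "https://creativecommons.org/licenses/by/4.0/",
    "cc by-sa": "https://creativecommons.org/licenses/by-sa/4.0/",
}


def exact_label(statement: str) -> Optional[str]:
    return _LABELS.get(statement.strip().lower())


def hyphenated_label(statement: str) -> Optional[str]:
    # Treat "cc-by-sa" the same as "cc by-sa".
    s = statement.strip().lower()
    return exact_label("cc " + s[3:]) if s.startswith("cc-") else None


# Order is fixed, so the same input always takes the same path.
FALLBACK_CHAIN: list = [("exact_label", exact_label), ("hyphenated_label", hyphenated_label)]


def route_with_fallback(asset_id: str, statement: str) -> dict:
    for name, resolver in FALLBACK_CHAIN:
        uri = resolver(statement)
        if uri is not None:
            event = {"asset_id": asset_id, "decision": "routed",
                     "resolver": name, "license_uri": uri}
            logger.info(json.dumps(event, sort_keys=True))
            return event
    # Nothing matched: hand the record to curators, never guess a license.
    event = {"asset_id": asset_id, "decision": "staging_review",
             "resolver": None, "license_uri": None}
    logger.warning(json.dumps(event, sort_keys=True))
    return event


print(route_with_fallback("OBJ-1", "CC-BY-SA"))
print(route_with_fallback("OBJ-2", "rights unknown"))
```

Emitting each event as one sorted-key JSON line keeps the audit log easy to parse and to diff between runs.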
External validation against the IIIF Presentation API 3.0 specification ensures manifest compatibility. LIDO schema definitions are maintained at lido-schema.org. Canonical CC RDF vocabulary provides authoritative URI resolution paths.
Conclusion
The license resolver’s correctness depends on matching longest keys first. Without that sort, "cc by" matches before "cc by-nc-nd" and every derivative of CC BY gets mis-tagged as plain attribution. The sorted(self._mapping, key=len, reverse=True) pattern is the specific fix. Unrecognized statements route to the dead-letter queue for curator review rather than defaulting to any license, including CC0, which would inadvertently surrender rights.
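The difference is easy to reproduce in a few lines. The sketch below uses a two-entry mapping and two hypothetical functions, resolve_naive and resolve_longest_first, to contrast insertion-order matching with longest-key-first matching on the same input.

```python
from typing import Optional

MAPPING = {
    "cc by": "https://creativecommons.org/licenses/by/4.0/",
    "cc by-nc-nd": "https://creativecommons.org/licenses/by-nc-nd/4.0/",
}


def resolve_naive(statement: str) -> Optional[str]:
    # Dict insertion order: "cc by" is tested first and wins.
    for key in MAPPING:
        if key in statement:
            return MAPPING[key]
    return None


def resolve_longest_first(statement: str) -> Optional[str]:
    # Longest keys first, so the most specific license wins.
    for key in sorted(MAPPING, key=len, reverse=True):
        if key in statement:
            return MAPPING[key]
    return None


statement = "cc by-nc-nd 4.0"
print(resolve_naive(statement))          # plain CC BY URI (wrong)
print(resolve_longest_first(statement))  # CC BY-NC-ND URI (correct)
```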