Threshold Tuning for Public Domain

Workflow Context & Pipeline Integration

Threshold tuning operates as the deterministic decision layer between raw rights metadata ingestion and final digital asset publication. In museum pipelines, public domain status rarely exists as a static flag at ingestion. It emerges from probabilistic scoring across creation dates, jurisdictional life-plus terms, author mortality data, and institutional provenance records. The tuning mechanism establishes a configurable confidence cutoff. This cutoff dictates automatic routing to public domain distribution, manual review queues, or fallback compliance workflows.

flowchart TD
    A["AssetRightsPayload"] --> Sc["compute_threshold_score<br/>penalties + jurisdiction"]
    Sc --> R{"Confidence score?"}
    R -->|"≥ cutoff"| PD["public_domain"]
    R -->|"≥ cutoff × ratio"| MR["manual_review"]
    R -->|below| OW["orphan_work_queue"]

This layer sits downstream from Automating Copyright Status Checks and upstream from publishing endpoints. It ensures only assets meeting strict evidentiary standards bypass curator review. Effective threshold tuning requires strict separation of scoring logic from routing logic. The scoring engine aggregates normalized date vectors and jurisdictional rules into a single confidence metric. The routing engine applies institution-specific thresholds. These parameters must remain version-controlled and fully auditable.

When integrated into the broader Rights Metadata Mapping & Licensing Automation architecture, threshold configurations should live in externalized configuration rather than in application code, so that legal counsel can adjust cutoffs without triggering a pipeline redeployment. Schema validation and immutable deployment artifacts prevent configuration drift.
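As a rough illustration of externalized, schema-checked thresholds, the sketch below parses a JSON document and rejects unknown keys and out-of-range values. The `LoadedThresholds` class, the `parse_thresholds` helper, and the JSON layout are assumptions made for this example, not part of the pipeline described here.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadedThresholds:
    """Hypothetical immutable view of an externally managed threshold file."""
    pd_confidence_cutoff: float
    manual_review_ratio: float


def parse_thresholds(raw_json: str) -> LoadedThresholds:
    """Parse and validate a threshold document, raising ValueError on any problem."""
    data = json.loads(raw_json)
    allowed = {"pd_confidence_cutoff", "manual_review_ratio"}

    # Reject unknown keys so a typo cannot silently fall back to a default.
    unknown = set(data) - allowed
    if unknown:
        raise ValueError(f"unknown threshold keys: {sorted(unknown)}")
    missing = allowed - set(data)
    if missing:
        raise ValueError(f"missing threshold keys: {sorted(missing)}")

    cutoff = float(data["pd_confidence_cutoff"])
    ratio = float(data["manual_review_ratio"])
    if not 0.0 < cutoff <= 1.0:
        raise ValueError("pd_confidence_cutoff must be in (0, 1]")
    if not 0.0 < ratio < 1.0:
        raise ValueError("manual_review_ratio must be in (0, 1)")
    return LoadedThresholds(cutoff, ratio)


thresholds = parse_thresholds(
    '{"pd_confidence_cutoff": 0.92, "manual_review_ratio": 0.75}'
)
```

Running the same validation in the deployment pipeline and at service start-up is one way to catch a malformed cutoff before it reaches routing.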

Scoring Architecture & Jurisdictional Logic

Public domain determination relies on precise temporal and geographic calculations. LIDO-compliant metadata provides structured event dates and actor lifespans. These fields map directly to scoring vectors. The engine normalizes creation years against jurisdictional copyright terms. Life-plus-70, life-plus-50, and fixed-term publication rules apply dynamically. Missing author death dates trigger statistical decay functions. Provenance flags introduce penalty multipliers. The final score represents a probability that the asset resides in the public domain under current law.
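The life-plus arithmetic can be sketched as follows. The function names are hypothetical, and the term length is passed in by the caller, because which term applies to a given jurisdiction and work has to come from an authoritative source rather than from this example.

```python
from datetime import date
from typing import Optional


def life_plus_pd_year(author_death_year: int, term_years: int) -> int:
    """First calendar year in which a life-plus work is in the public domain.

    Life-plus terms conventionally run to the end of the calendar year, so a
    life-plus-70 work by an author who died in 1950 is protected through
    2020 and enters the public domain on 1 January 2021.
    """
    return author_death_year + term_years + 1


def is_term_expired(
    author_death_year: Optional[int], term_years: int, today: date
) -> Optional[bool]:
    """Return None when the death year is unknown.

    Returning None (instead of guessing) lets the caller apply a decay
    function or penalty for the missing data.
    """
    if author_death_year is None:
        return None
    return today.year >= life_plus_pd_year(author_death_year, term_years)


print(life_plus_pd_year(1950, 70))                   # 2021
print(is_term_expired(1950, 70, date(2020, 12, 31)))  # False
print(is_term_expired(None, 70, date(2021, 1, 1)))    # None
```

Fixed-term publication rules would need a separate function keyed on the publication year; they are left out here to keep the sketch small.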

IIIF Presentation API 3.0 manifests carry their rights statement in the rights property, whose value is a single URI. Threshold tuning validates this statement against the computed probability before manifest generation. If the computed score meets or exceeds the public domain threshold, the system injects http://creativecommons.org/publicdomain/mark/1.0/. Lower scores route to Routing Creative Commons Licenses for alternative licensing assignment. This prevents premature public domain claims on assets with ambiguous authorship or recent institutional acquisition.
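A minimal sketch of the injection step is shown below, treating the manifest as a plain dictionary. The `apply_rights` helper and the example manifest are illustrative assumptions. Removing a stale Public Domain Mark when the score falls below the cutoff is a design choice of this sketch, not something the pipeline description specifies.

```python
from typing import Any, Dict

PUBLIC_DOMAIN_MARK = "http://creativecommons.org/publicdomain/mark/1.0/"


def apply_rights(manifest: Dict[str, Any], score: float, cutoff: float) -> Dict[str, Any]:
    """Return a copy of the manifest with the Public Domain Mark set only
    when the score clears the cutoff."""
    updated = dict(manifest)  # shallow copy; the input manifest is not modified
    if score >= cutoff:
        updated["rights"] = PUBLIC_DOMAIN_MARK
    elif updated.get("rights") == PUBLIC_DOMAIN_MARK:
        # Below the cutoff, do not leave an earlier public domain claim in place.
        del updated["rights"]
    return updated


# Hypothetical manifest skeleton used only for demonstration.
manifest = {"id": "https://example.org/iiif/asset-001/manifest", "type": "Manifest"}
published = apply_rights(manifest, score=0.95, cutoff=0.92)
held_back = apply_rights(published, score=0.80, cutoff=0.92)
```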

Jurisdictional rules change over time, and the US publication cutoff moves every year. The scoring engine should therefore reference authoritative legal databases rather than hardcoded constants. Python’s datetime module handles date normalization, and zoneinfo (standard library since 3.9) covers the cases where a time zone decides which calendar day applies. External validation against LIDO Schema ensures consistent field mapping across heterogeneous collection systems.

Python 3.9+ Implementation

Production pipelines process thousands of asset records, so blocking I/O or an exhausted database connection pool quickly degrades throughput. The following implementation routes a batch of records concurrently with asyncio.gather and applies threshold tuning against configurable cutoffs. The batch_size, max_retries, and retry_delay settings are declared on ThresholdConfig for the surrounding worker; the code shown does not itself chunk input or retry external calls.

python

from __future__ import annotations
import asyncio
import logging
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Dict
from pydantic import BaseModel, Field, ValidationError
from enum import Enum

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")

class RoutingDestination(str, Enum):
    PUBLIC_DOMAIN = "public_domain"
    MANUAL_REVIEW = "manual_review"
    ORPHAN_WORK_QUEUE = "orphan_work_queue"

@dataclass(frozen=True)
class ThresholdConfig:
    pd_confidence_cutoff: float = 0.92
    manual_review_ratio: float = 0.75  # fraction of the cutoff that still warrants review
    batch_size: int = 250
    max_retries: int = 3
    retry_delay: float = 1.5

class AssetRightsPayload(BaseModel):
    asset_id: str
    creation_year: Optional[int] = None
    author_death_year: Optional[int] = None
    jurisdiction: str = Field(pattern=r"^[A-Z]{2}$")
    raw_confidence_score: float = Field(ge=0.0, le=1.0)
    provenance_flags: List[str] = Field(default_factory=list)

def compute_threshold_score(payload: AssetRightsPayload, config: ThresholdConfig) -> float:
    """Apply jurisdictional decay and provenance penalties to raw confidence.

    `config` is accepted for interface symmetry with route_asset; the penalty
    weights below are fixed constants and do not read from it.
    """
    base = payload.raw_confidence_score
    penalty = 0.0

    # The US publication cutoff advances every January 1 (95-year term), so
    # derive it from the current year rather than hardcoding a constant.
    us_pd_cutoff = date.today().year - 96  # latest publication year now in the US public domain
    year = payload.creation_year

    if "unverified_author" in payload.provenance_flags:
        penalty += 0.15
    # Compare against None explicitly so a creation year of 0 is not treated as missing.
    if payload.author_death_year is None and year is not None and year > us_pd_cutoff:
        penalty += 0.10
    if payload.jurisdiction == "US" and year is not None and year <= us_pd_cutoff:
        base = min(base + 0.05, 1.0)

    adjusted = max(base - penalty, 0.0)
    return round(adjusted, 3)

async def route_asset(payload: AssetRightsPayload, config: ThresholdConfig) -> RoutingDestination:
    score = compute_threshold_score(payload, config)

    if score >= config.pd_confidence_cutoff:
        return RoutingDestination.PUBLIC_DOMAIN
    elif score >= config.pd_confidence_cutoff * config.manual_review_ratio:
        return RoutingDestination.MANUAL_REVIEW
    else:
        return RoutingDestination.ORPHAN_WORK_QUEUE

async def process_batch(
    payloads: List[AssetRightsPayload],
    config: ThresholdConfig
) -> Dict[str, RoutingDestination]:
    results: Dict[str, RoutingDestination] = {}
    tasks = [route_asset(p, config) for p in payloads]
    routed = await asyncio.gather(*tasks, return_exceptions=True)

    for payload, outcome in zip(payloads, routed):
        if isinstance(outcome, Exception):
            logging.error("Routing failed for %s: %s", payload.asset_id, outcome)
            results[payload.asset_id] = RoutingDestination.MANUAL_REVIEW
        else:
            results[payload.asset_id] = outcome

    return results

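The implementation uses asyncio.gather to schedule the routing coroutines together. With return_exceptions=True, a record whose routing raises is sent to manual review instead of failing the whole batch. Because route_asset as written awaits nothing, real concurrency only appears once external lookups are awaited inside it. Pydantic v2 validates each payload at the ingestion boundary (Field(pattern=...) is v2 syntax). ThresholdConfig is a frozen dataclass, so cutoffs cannot be mutated at runtime. Retry logic and timeout handling should wrap external API calls in production deployments.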
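One possible shape for the retry and timeout wrapping is sketched below, together with a chunking helper for batch sizing. The names `call_with_retry` and `chunked`, the linear backoff, and the choice of which exceptions count as transient are all assumptions made for this example.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")


async def call_with_retry(
    operation: Callable[[], Awaitable[T]],
    max_retries: int = 3,
    retry_delay: float = 1.5,
    timeout: float = 10.0,
) -> T:
    """Await operation() with a per-attempt timeout.

    Timeouts and connection errors are retried up to max_retries times
    (so at most max_retries + 1 attempts); other exceptions propagate.
    """
    attempt = 0
    while True:
        try:
            return await asyncio.wait_for(operation(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError):
            attempt += 1
            if attempt > max_retries:
                raise
            # Linear backoff: wait longer after each failed attempt.
            await asyncio.sleep(retry_delay * attempt)


def chunked(items: Sequence[T], batch_size: int) -> List[Sequence[T]]:
    """Split items into consecutive slices of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

A worker could iterate over `chunked(payloads, config.batch_size)` and await `process_batch` for each slice, which bounds the number of coroutines in flight at any time.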

Threshold Calibration & Audit Trails

Static cutoffs fail when legal interpretations evolve or collection composition shifts. Calibration requires continuous feedback loops. Curator overrides generate labeled training data. The system tracks false positives and false negatives per jurisdiction. Threshold adjustments follow a documented change management process. Version-controlled configuration files store historical cutoffs alongside effective dates. This enables rollback during compliance audits.
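To make the feedback loop concrete, the sketch below tallies disagreements between pipeline routing and curator decisions per jurisdiction. The `ReviewedDecision` record and the `error_shares` function are hypothetical. The figures it returns are shares of all reviewed decisions, which is a simpler measure than a classical false positive rate.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ReviewedDecision:
    """Hypothetical record pairing a routing outcome with a curator verdict."""
    jurisdiction: str
    routed_public_domain: bool   # what the pipeline decided
    curator_public_domain: bool  # what the curator concluded


def error_shares(decisions: Iterable[ReviewedDecision]) -> Dict[str, Dict[str, float]]:
    """Per jurisdiction, the share of reviewed assets wrongly released
    (false positives) and wrongly withheld (false negatives)."""
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "total": 0})
    for d in decisions:
        c = counts[d.jurisdiction]
        c["total"] += 1
        if d.routed_public_domain and not d.curator_public_domain:
            c["fp"] += 1
        elif not d.routed_public_domain and d.curator_public_domain:
            c["fn"] += 1
    return {
        j: {
            "false_positive_share": c["fp"] / c["total"],
            "false_negative_share": c["fn"] / c["total"],
        }
        for j, c in counts.items()
    }
```

A rising false positive share in one jurisdiction would argue for a higher cutoff there, while a rising false negative share suggests that review queues are absorbing assets the pipeline could have released.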

Audit logging captures the exact input vectors, computed scores, and routing decisions. Each log entry includes the configuration hash and timestamp. This traceability satisfies institutional risk management requirements. When assets transition from restricted to open access, the pipeline records the justification chain. Missing data triggers fallback evaluation rather than automatic rejection. This aligns with Handling Orphan Works in Digital Collections protocols for ambiguous records.
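An audit entry along these lines could be assembled as in the sketch below. The field names and the `config_hash` and `audit_record` helpers are assumptions for this example. Hashing a canonical JSON serialization is one way to obtain a configuration hash that does not depend on key order.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def config_hash(config: Dict[str, Any]) -> str:
    """SHA-256 over a canonical (sorted-key, compact) JSON form of the config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audit_record(
    asset_id: str,
    inputs: Dict[str, Any],
    score: float,
    destination: str,
    config: Dict[str, Any],
    now: Optional[datetime] = None,
) -> str:
    """Serialize one routing decision as a single JSON log line."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    entry = {
        "asset_id": asset_id,
        "inputs": inputs,            # the exact input vector that was scored
        "score": score,
        "destination": destination,
        "config_hash": config_hash(config),
        "timestamp": timestamp,
    }
    return json.dumps(entry, sort_keys=True)
```

Because the hash covers only the configuration values, two log entries with the same hash were routed under identical cutoffs, which makes it straightforward to find every decision affected by a given configuration version.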

Pipeline Integration & Validation

Threshold tuning must integrate with existing collection management systems. LIDO XML exports provide structured event and actor data. The scoring engine parses these exports into normalized payloads. IIIF Image API endpoints consume the routing output to apply appropriate watermarking or access restrictions. Public domain assets receive unrestricted tile serving. Restricted assets enforce origin-based token validation.

Automated testing validates threshold boundaries against known public domain corpora. Test suites verify jurisdictional edge cases, leap year calculations, and author lifespan gaps. Integration tests confirm that routing destinations trigger correct downstream workflows. Continuous deployment pipelines enforce configuration schema validation before threshold updates propagate. This ensures deterministic behavior across staging and production environments.
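A boundary test could be sketched as follows. `route_score` is a stand-in written for this example that mirrors the three-band comparison in route_asset using plain floats and strings, with the default cutoff of 0.92 and ratio of 0.75. The review floor is computed with the same expression the router uses, since comparing against a hand-typed literal such as 0.69 may not behave identically under floating-point representation.

```python
def route_score(score: float, cutoff: float = 0.92, ratio: float = 0.75) -> str:
    """Stand-in for the routing comparison: three bands split at cutoff and cutoff * ratio."""
    if score >= cutoff:
        return "public_domain"
    if score >= cutoff * ratio:
        return "manual_review"
    return "orphan_work_queue"


def test_boundaries() -> None:
    cutoff, ratio = 0.92, 0.75
    review_floor = cutoff * ratio  # same expression as the router, so the comparison is exact

    # Exactly at the cutoff is public domain; just below falls to review.
    assert route_score(cutoff) == "public_domain"
    assert route_score(0.919) == "manual_review"

    # Exactly at the review floor is still manual review; clearly below is not.
    assert route_score(review_floor) == "manual_review"
    assert route_score(0.689) == "orphan_work_queue"

    # Extremes of the valid score range.
    assert route_score(1.0) == "public_domain"
    assert route_score(0.0) == "orphan_work_queue"


test_boundaries()
```

The same pattern extends to corpus-based tests: feed records with known status through scoring and routing, then assert that none of the known-restricted records land in the public domain band.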

Conclusion

The scoring architecture has two key correctness properties. First, penalties are subtracted from the raw confidence score rather than applied as absolute thresholds, so borderline assets accumulate risk instead of failing binary checks. Second, the US public domain cutoff is computed as date.today().year - 96 rather than hardcoded, so it advances automatically every January 1 as the 95-year term lapses for another publication year.
