Tayler Erbe · Project Case Study · RIMS / University Library · 2024–2025

Archival Image Intelligence
& Sensitive Content Detection

A multimodal AI proof-of-concept for automatically detecting sensitive and culturally significant content in large-scale historical image archives — combining LLaVA visual classification, semantic similarity embeddings, and IPTC-compliant metadata extraction across 10,000+ archival TIFF scans.

View Code on GitHub →
Partners: RIMS Committee · University Library · AITS
Stakeholders: Joanne Kaczmarek · RIMS Committee · University Archivist
Status: Active POC · Expansion Planning Underway
Role: Lead Data Scientist · Full Lifecycle Ownership
10,000+ · TIFF Images Processed
9 · Sensitive Content Categories
2 · Classification Methods Compared
IPTC · Dublin Core–Aligned Metadata

Project Overview

The University of Illinois holds enormous collections of historical archival imagery — photographs, documents, and scans spanning decades of institutional history. Before these collections can be digitized, published, or made publicly accessible, they must be reviewed for sensitive, offensive, or culturally significant content that may require masking, restriction, or specialist review prior to release.

Manual review of collections at this scale is not feasible. Tayler Erbe scoped, designed, and led the development of an automated multimodal AI proof-of-concept to address this directly — building two distinct detection pipelines, a sensitive content taxonomy grounded in archival and legal standards, and a metadata extraction framework aligned with IPTC and Dublin Core archival preservation standards.

The project was developed in active partnership with the RIMS (Records and Information Management Services) Committee at UIUC and the University Library cataloguing program, with the goal of producing a replicable, scalable detection framework for eventual deployment across the full institutional image archive.

Problem Statement

Archival collections contain images that — while historically significant — may depict racialized performances, human remains, Indigenous cultural materials, or personally identifiable information. No automated screening existed. Before collections can be made publicly accessible, each image must be individually evaluated.

Two-Pronged Architecture

Method 1: Semantic Similarity — generate LLaVA descriptions, embed with SentenceTransformer, match against taxonomy keywords via cosine similarity.

Method 2: LLaVA Direct Classification — send each image to a multimodal LLM with a structured moderation prompt and parse the JSON output.

Metadata Pillar

Alongside content classification, the project produced a full EXIF/TIFF/IPTC metadata extraction pipeline for 10,000 archival TIFF images, generating structured records aligned with Dublin Core and IPTC standards for long-term preservation and cataloguing interoperability.

Sensitive Content Taxonomy

One of the most significant contributions of this project is the domain-specific sensitive content taxonomy developed for archival collections at a research university. Unlike general-purpose content moderation systems, this taxonomy was designed specifically for the institutional, historical, and legal context of a university archive — grounding each category in both ethical obligations and regulatory frameworks (FERPA, HIPAA, NAGPRA).

#  | Category | Description | Regulatory / Ethical Basis
01 | Historical Racialized Performance | Blackface, yellowface, minstrelsy, ethnic caricature in costume or makeup | Institutional equity obligations; public harm potential
02 | Human Remains | Bones, skulls, mummies, cadavers in archaeological or anthropological contexts | NAGPRA; cultural sensitivity; repatriation obligations
03 | Native American / Indigenous Imagery | Native individuals, sacred regalia, tribal ceremonies, mascot representations (Chief Illiniwek) | NAGPRA; tribal sovereignty; cultural appropriation harm
04 | Nudity / Sexual Content | Explicit or suggestive imagery; hard block for minors | COPPA; platform publishing standards
05 | Violence / Graphic Content | Weapons, injuries, graphic scenes of harm or death | Platform standards; researcher and staff safety
06 | Hate Symbols | Swastikas, KKK iconography, white supremacist symbols, Confederate imagery | Institutional DEI obligations; legal liability
07 | Medical / Health Records | X-rays, patient charts, hospital forms, identifiable health data | HIPAA; institutional privacy obligations
08 | Student / PII Records | Student IDs, transcripts, SSNs, enrollment forms, FERPA-protected data | FERPA; GDPR (where applicable)
09 | Other Sensitive Categories | Terrorism imagery, drug paraphernalia, self-harm depictions | Platform publishing standards; staff welfare
Design rationale: Each category was mapped to keyword sets used for semantic matching. The taxonomy was built to be extensible — new categories or keywords can be added without retraining any models.
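As an illustration of that extensibility, the category-to-keyword mapping can live in plain data rather than model weights. The sketch below uses categories from the table above, but the keyword sets are abbreviated examples, not the project's full lists.

```python
# Abbreviated sketch of the taxonomy-to-keyword mapping.
# Categories mirror the table above; keyword sets are illustrative samples.
TAXONOMY_KEYWORDS = {
    "Historical Racialized Performance": ["blackface", "minstrel show", "ethnic caricature"],
    "Human Remains": ["skull", "skeleton", "mummy", "cadaver"],
    "Native American / Indigenous Imagery": ["tribal regalia", "headdress", "ceremony"],
    "Medical / Health Records": ["x-ray", "patient chart", "hospital form"],
    # ...remaining categories follow the same pattern.
}

# Extending the taxonomy is a data change, not a model change:
TAXONOMY_KEYWORDS["Human Remains"].append("burial site")
```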

Pipeline Architecture

Method 1 — Semantic Similarity
01 · Image Conversion: TIFF images are converted to JPEG (max 4000px), handling multi-frame and truncated files with PIL. A separate downsampling variant (50% size reduction) was tested and benchmarked.
02 · Description Generation, Natural Prompt: LLaVA (llava-llama3 via Ollama) is prompted: "Only describe what you actually see. Don't guess names or locations." This generates a full natural-language description per image.
03 · Description Generation, Strict Prompt: A second prompt generates a one-to-two sentence factual summary: "Output must be strict and factual, using only observable details." Both descriptions are combined for richer embedding context.
04 · Semantic Embedding: Taxonomy keywords and combined image descriptions are embedded using SentenceTransformer (all-mpnet-base-v2). Cosine similarity is computed between description and keyword embedding vectors.
05 · Matching & Flagging: Images scoring cosine similarity ≥ 0.45 against any taxonomy keyword are flagged. Output captures matched labels, matched keywords, and a detailed JSON breakdown per image. (A sketch of steps 01, 04, and 05 follows this list.)
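The core of Method 1 fits in a short script. The sketch below assumes sentence-transformers and Pillow are installed and that combined LLaVA descriptions have already been generated (steps 02–03); function names and keyword sets are illustrative, while the 4000px bound and 0.45 threshold mirror the POC.

```python
# Minimal sketch of Method 1 (steps 01, 04, 05). Keyword sets are
# abbreviated examples, not the project's full taxonomy lists.
from PIL import Image, ImageFile
from sentence_transformers import SentenceTransformer, util

ImageFile.LOAD_TRUNCATED_IMAGES = True  # tolerate truncated archival scans

def tiff_to_jpeg(tiff_path: str, jpeg_path: str, max_px: int = 4000) -> None:
    """Step 01: convert a (possibly multi-frame) TIFF to a bounded JPEG."""
    with Image.open(tiff_path) as img:
        img.seek(0)                              # first frame of multi-frame TIFFs
        rgb = img.convert("RGB")
        rgb.thumbnail((max_px, max_px), Image.LANCZOS)
        rgb.save(jpeg_path, "JPEG")

model = SentenceTransformer("all-mpnet-base-v2")
THRESHOLD = 0.45  # cosine-similarity cutoff used in the POC

def flag_description(description: str, taxonomy: dict[str, list[str]]) -> list[dict]:
    """Steps 04-05: embed a combined description, match taxonomy keywords."""
    desc_emb = model.encode(description, convert_to_tensor=True)
    hits = []
    for category, keywords in taxonomy.items():
        kw_embs = model.encode(keywords, convert_to_tensor=True)
        scores = util.cos_sim(desc_emb, kw_embs)[0]
        for kw, score in zip(keywords, scores):
            if float(score) >= THRESHOLD:
                hits.append({"category": category, "keyword": kw,
                             "score": round(float(score), 3)})
    return hits

print(flag_description(
    "A man is holding up a skull to the face of a dummy head.",
    {"Human Remains": ["skull", "skeleton", "cadaver"]},
))
```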
Method 2 — LLaVA Direct Classification
01 · Image Preparation: Same TIFF-to-JPEG conversion pipeline. Intermediate saves every 100 images guard against data loss during long batch runs (~33–40 hrs for 1,000 images).
02 · Structured Moderation Prompt: Each image is sent to LLaVA with a strict classification prompt listing all 9 taxonomy categories. The model is asked to respond in JSON only: {"offensive": true/false, "category": "...", "rationale": "..."}
03 · JSON Parsing & Fallback: Responses are parsed via json.loads(). Non-JSON outputs (model hallucination) are stored in the rationale field for manual review rather than discarded.
04 · Results Export: Classification results are saved to CSV with filename, offensive flag, assigned category, and rationale. Resume capability handles partial runs — only unprocessed files are queued. (A sketch of this loop follows the list.)
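A sketch of the Method 2 loop, assuming the `ollama` Python package and a locally pulled llava-llama3 model. The prompt is abbreviated (the POC enumerates all nine categories), the file paths are hypothetical, and the checkpoint cadence mirrors the every-100-images saves described above.

```python
# Minimal sketch of Method 2: structured prompt, JSON fallback, resume, export.
import json
from pathlib import Path

import ollama
import pandas as pd

PROMPT = (
    "Classify this archival image against nine sensitive-content categories "
    "(human remains, hate symbols, ...). Respond in JSON only: "
    '{"offensive": true/false, "category": "...", "rationale": "..."}'
)

def classify(jpeg_path: str) -> dict:
    response = ollama.chat(
        model="llava-llama3",
        messages=[{"role": "user", "content": PROMPT, "images": [jpeg_path]}],
    )
    raw = response["message"]["content"]
    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: keep non-JSON output for manual review instead of discarding it.
        result = {"offensive": None, "category": None, "rationale": raw}
    result["filename"] = Path(jpeg_path).name
    return result

rows, done = [], set()  # 'done' would be preloaded from a prior partial run
for i, path in enumerate(sorted(Path("jpegs").glob("*.jpg")), start=1):
    if path.name in done:
        continue  # resume capability: only unprocessed files are queued
    rows.append(classify(str(path)))
    if i % 100 == 0:  # intermediate saves guard against data loss
        pd.DataFrame(rows).to_csv("classification_partial.csv", index=False)

pd.DataFrame(rows).to_csv("classification_results.csv", index=False)
```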
Metadata Pipeline (Parallel)

A third parallel pipeline extracts File, EXIF/TIFF, and IPTC metadata from all TIFF images using tifffile, PIL, and iptcinfo3. Output is a structured CSV aligned with Dublin Core and IPTC standards, enabling cataloguing interoperability with library management systems.

Method Comparison

Semantic Similarity (Better Performer)
• Processing Time (1,000 images): ~9 minutes
• Images Flagged: 79 / 1,000
• Confirmed Correct (manual): 13 / 79 (16.5%)
• Misclassifications: 8 / 79 (10.1%)
• Ambiguous / Borderline: 58 / 79 (73.4%)
• Taxonomy Extensibility: High — keyword-only update
Captures nuanced semantic meaning and is fast at scale. False positives often stem from semantic overlap between unrelated categories (e.g., medieval ceremonial dress flagged as Native American imagery). Threshold tuning can improve precision.
Stack: SentenceTransformer · all-mpnet-base-v2 · Cosine Similarity · LLaVA Descriptions
LLaVA Direct (Needs Fine-Tuning)
• Processing Time (1,000 images): ~33–40 hours
• Images Flagged: 0 / 1,000
• Sensitive Content Detected: None (every image returned offensive: false)
• Rationale Quality: High — descriptive
• Taxonomy Extensibility: High — prompt-only update
• Bottleneck: Model inference (not image size)
Strong visual reasoning capability, but the model failed to flag any content as offensive in this dataset. It appears calibrated for contemporary content moderation and may require domain-specific fine-tuning for historical archival material. Notably, the model correctly described racialized content in rationale text while still returning offensive: false.
Stack: LLaVA-LLaMA3 · Ollama · Structured JSON Output
Downsampling Finding

Images resized to 50% of original dimensions were tested independently. No measurable difference was found in description quality or processing speed. The inference bottleneck is the model itself, not image input size. Original resolution is recommended unless storage efficiency is the priority.

The 73% Ambiguity Problem

Of the 79 images flagged by the semantic method, 73.4% were borderline — neither clearly correct nor clearly wrong. Many were contextually sensitive (ceremonial dress, historical portraits) but not actionably offensive. This reflects the inherent subjectivity of archival sensitivity and points to the need for a human-in-the-loop review tier.

Results & Findings

Manual review of the 1,000-image test set identified approximately 8–10 images as genuinely sensitive — a rate of roughly 1% of the corpus. This low base rate is important context for interpreting both classification performance and false positive rates: at 1% prevalence, even a highly accurate model will produce more false positives than true positives in absolute terms. For example, a classifier with 90% recall and a 5% false positive rate would surface roughly 9 true positives alongside roughly 50 false positives per 1,000 images.

The semantic method's 16.5% precision on flagged items reflects both the challenge of this low-prevalence setting and the inherent ambiguity of archival content. The 73.4% "borderline" rate suggests that the sensitivity threshold is approximately correct, but that human expert judgment remains essential for a meaningful portion of flagged images.

The most reliable detected categories were Human Remains (skull, skeleton keywords) and Violence/Graphic Content (explosion, fire keywords) — categories where visual descriptions map cleanly to unambiguous taxonomy terms. Categories like Native American Imagery and Historical Racialized Performance proved harder to classify reliably, as descriptions often lacked explicit cultural context cues.

79 · Flagged by Semantic Method
16.5% · Confirmed Accurate
~1% · True Sensitive Rate (Manual)
Example: True Positive (Human Remains) — Image 0000207
0000207.tif (man holding a skull) — flagged: Human Remains

Strict description: "A man is holding up a skull to the face of a dummy head for display purposes."

Matched keyword: skull → Category: Human Remains. Clean, unambiguous match.

Example: False Positive (Native American Imagery) — Image 0000700
0000700.tif (two women in medieval outfits) — false positive: Native American Imagery

Description: "Two women wearing medieval outfits stand side by side with a man."

Flagged as Native American imagery due to semantic overlap between "ceremonial" and "medieval costume." This is a clear threshold-tuning and context-awareness challenge.

Metadata Extraction & Archival Standards

The metadata extraction pipeline is a standalone contribution of this project, independent of content classification. For archival collections, structured, standards-aligned metadata is as important as content moderation — enabling discovery, cataloguing, long-term preservation, and integration with library management systems.

Using tifffile, Pillow (PIL), and iptcinfo3, the pipeline extracts metadata across three levels for every TIFF image in the collection. All output fields are mapped to IPTC Core and Dublin Core schemas to ensure interoperability with the University Library's cataloguing systems.

File-Level
• Filename
• File size
• Format type
• MIME type

EXIF / TIFF
• Dimensions
• Resolution (DPI)
• Compression
• Photometric interp.
• Software / Dates

IPTC
• Object name
• Caption / abstract
• Keywords
Standards alignment: IPTC Core and Dublin Core compliance ensures metadata is portable across library management systems (ExLibris Alma, CONTENTdm, ArchivesSpace) without transformation work.
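A condensed sketch of this three-level extraction, assuming the Pillow, iptcinfo3, and pandas packages (the POC additionally uses tifffile for deeper TIFF tag access). Column names and the output path are illustrative; the real pipeline maps each field onto IPTC Core / Dublin Core elements.

```python
# Minimal sketch of the three-level metadata extraction pipeline.
import mimetypes
from pathlib import Path

import pandas as pd
from PIL import Image
from iptcinfo3 import IPTCInfo

def extract_metadata(path: str) -> dict:
    p = Path(path)
    record = {
        # File level
        "filename": p.name,
        "file_size_bytes": p.stat().st_size,
        "mime_type": mimetypes.guess_type(p.name)[0],
    }
    # EXIF / TIFF level
    with Image.open(p) as img:
        record["format"] = img.format
        record["width"], record["height"] = img.size
        record["dpi"] = img.info.get("dpi")
        record["compression"] = img.info.get("compression")
    # IPTC level (force=True tolerates files with no IPTC block)
    info = IPTCInfo(str(p), force=True)
    record["object_name"] = info["object name"]
    record["caption_abstract"] = info["caption/abstract"]
    record["keywords"] = info["keywords"]
    return record

rows = [extract_metadata(str(p)) for p in sorted(Path("tiffs").glob("*.tif"))]
pd.DataFrame(rows).to_csv("archive_metadata.csv", index=False)
```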

Recommendations & Next Steps

The POC establishes a working framework and clear performance baseline. The path forward depends on the scale and specificity required by the RIMS Committee and the University Library. The following options are recommended in order of increasing investment and capability.

Priority | Action | Rationale | Effort
Immediate | Expand to 10,000 images | The ~1% sensitive rate means the 1,000-image sample has too few confirmed true positives to fully validate detection performance; a 10x sample is the minimum for meaningful precision/recall estimation. | Low — pipeline already built
Near-term | Establish metadata standards | Confirm which IPTC/Dublin Core fields are required for integration with the Library's cataloguing system before scaling metadata extraction. | Low — coordination work
Near-term | Tune similarity threshold | Testing threshold values from 0.35 to 0.55 on a labeled validation set will substantially improve precision while controlling the false positive rate (see the sketch after this table). | Low — analytical work
Strategic | Define prioritization criteria | Large-scale processing requires scoping decisions: by collection era, type, or institutional risk level. Early prioritization prevents indiscriminate resource consumption. | Medium — planning work
Strategic | Third-party API integration | Google Vision SafeSearch, AWS Rekognition, or Azure Content Moderator offer pretrained moderation models with strong nudity/violence performance. Fastest path to reliable detection; cost and procurement are the primary constraints. | Medium — procurement + dev
Long-term | Custom model fine-tuning | Fine-tune LLaVA on UIUC-specific labeled archival images to improve detection of domain-specific content (historical racialized performance, Indigenous cultural materials). The most accurate long-term path, but it requires a labeled training dataset first. | High — ML development
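The threshold-tuning recommendation could be prototyped with a simple sweep. This hypothetical sketch assumes a small manually labeled validation set: for each image, its maximum cosine similarity against any taxonomy keyword, plus a reviewer's sensitive / not-sensitive label.

```python
# Hypothetical threshold sweep over the 0.35-0.55 range recommended above.
# 'scores' are per-image max cosine similarities; 'labels' are manual-review
# ground truth (True = genuinely sensitive). The data shown is made up.
def sweep_thresholds(scores: list[float], labels: list[bool],
                     lo: float = 0.35, hi: float = 0.55, step: float = 0.05):
    for i in range(int(round((hi - lo) / step)) + 1):
        t = lo + i * step
        flagged = [lab for s, lab in zip(scores, labels) if s >= t]
        tp = sum(flagged)
        precision = tp / len(flagged) if flagged else float("nan")
        recall = tp / sum(labels) if any(labels) else float("nan")
        print(f"threshold={t:.2f}  flagged={len(flagged):3d}  "
              f"precision={precision:.2f}  recall={recall:.2f}")

# Toy example; a real run would pass the labeled validation set.
sweep_thresholds([0.62, 0.48, 0.41, 0.39, 0.55],
                 [True, False, True, False, False])
```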

Technology Stack

Vision & Language Models
  • LLaVA-LLaMA3 (via Ollama)
  • SentenceTransformer
  • all-mpnet-base-v2
  • Cosine similarity (PyTorch)
Image Processing
  • Pillow (PIL)
  • tifffile
  • iptcinfo3
  • LANCZOS resampling
Data & Infrastructure
  • pandas / numpy
  • Python / Jupyter
  • Linux analytics server
  • GitHub
LLaVA · Multimodal AI · Semantic Similarity · SentenceTransformer · TIFF / IPTC · Dublin Core · PIL · Ollama · Python · RIMS / Library Partnership