Tayler Erbe · Project Case Study · RIMS / University Library · 2024–2025

Archival Image Intelligence & Sensitive Content Detection

A multimodal AI proof-of-concept for automatically detecting sensitive and culturally significant content in large-scale historical image archives — combining LLaVA visual classification, semantic similarity embeddings, and IPTC-compliant metadata extraction across 10,000+ archival TIFF scans.

Partners

RIMS Committee · University Library · AITS

Stakeholders

Joanne Kaczmarek · RIMS Committee · University Archivist

Status

Active POC · Expansion Planning Underway

Role

Lead Data Scientist · Full Lifecycle Ownership

Code

GitHub Repository →

10,000+

TIFF Images Processed

9

Sensitive Content Categories

2

Classification Methods Compared

IPTC

Dublin Core–Aligned Metadata

Project Overview

The University of Illinois holds enormous collections of historical archival imagery — photographs, documents, and scans spanning decades of institutional history. Before these collections can be digitized, published, or made publicly accessible, they must be reviewed for sensitive, offensive, or culturally significant content that may require masking, restriction, or specialist review prior to release.

Manual review of collections at this scale is not feasible. Tayler Erbe scoped, designed, and led the development of an automated multimodal AI proof-of-concept to address this directly — building two distinct detection pipelines, a sensitive content taxonomy grounded in archival and legal standards, and a metadata extraction framework aligned with IPTC and Dublin Core archival preservation standards.

The project was developed in active partnership with the RIMS (Records and Information Management) Committee at UIUC and the University Library cataloguing program, with the goal of producing a replicable, scalable detection framework for eventual deployment across the full institutional image archive.

Problem Statement

Archival collections contain images that, while historically significant, may depict racialized performances, human remains, Indigenous cultural materials, or personally identifiable information. Before collections can be made publicly accessible, each image must be individually evaluated, and no automated screening existed for this step.

Two-Pronged Architecture

Method 1: Semantic Similarity — generate LLaVA descriptions, embed with SentenceTransformer, match against taxonomy keywords via cosine similarity.

Method 2: LLaVA Direct Classification — send each image to a multimodal LLM with a structured moderation prompt and parse the JSON output.

Metadata Pillar

Alongside content classification, the project produced a full EXIF/TIFF/IPTC metadata extraction pipeline for 10,000 archival TIFF images, generating structured records aligned with Dublin Core and IPTC standards for long-term preservation and cataloguing interoperability.

Sensitive Content Taxonomy

One of the most significant contributions of this project is the domain-specific sensitive content taxonomy developed for archival collections at a research university. Unlike general-purpose content moderation systems, this taxonomy was designed specifically for the institutional, historical, and legal context of a university archive — grounding each category in both ethical obligations and regulatory frameworks (FERPA, HIPAA, NAGPRA).

#	Category	Description	Regulatory / Ethical Basis
01	Historical Racialized Performance	Blackface, yellowface, minstrelsy, ethnic caricature in costume or makeup	Institutional equity obligations; public harm potential
02	Human Remains	Bones, skulls, mummies, cadavers — archaeological or anthropological contexts	NAGPRA; cultural sensitivity; repatriation obligations
03	Native American / Indigenous Imagery	Native individuals, sacred regalia, tribal ceremonies, mascot representations (Chief Illiniwek)	NAGPRA; tribal sovereignty; cultural appropriation harm
04	Nudity / Sexual Content	Explicit or suggestive imagery; hard block for minors	COPPA; platform publishing standards
05	Violence / Graphic Content	Weapons, injuries, graphic scenes of harm or death	Platform standards; researcher and staff safety
06	Hate Symbols	Swastikas, KKK iconography, white supremacist symbols, confederate imagery	Institutional DEI obligations; legal liability
07	Medical / Health Records	X-rays, patient charts, hospital forms, identifiable health data	HIPAA; institutional privacy obligations
08	Student / PII Records	Student IDs, transcripts, SSNs, enrollment forms, FERPA-protected data	FERPA; GDPR (where applicable)
09	Other Sensitive Categories	Terrorism imagery, drug paraphernalia, self-harm depictions	Platform publishing standards; staff welfare

Design rationale: Each category was mapped to keyword sets used for semantic matching. The taxonomy was built to be extensible — new categories or keywords can be added without retraining any models.
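The extensibility point can be illustrated with a minimal sketch, assuming the taxonomy is held as a plain mapping from category name to keyword list. The names `TAXONOMY`, `add_keywords`, and `flatten` are hypothetical, and the keywords are illustrative terms drawn from the category descriptions above, not the project's actual keyword sets.

```python
# Hypothetical taxonomy structure: category name -> list of keywords.
# Keywords are illustrative examples, not the project's full keyword sets.
TAXONOMY = {
    "Historical Racialized Performance": ["blackface", "minstrelsy", "ethnic caricature"],
    "Human Remains": ["skull", "skeleton", "mummy"],
    "Hate Symbols": ["swastika", "KKK iconography"],
}


def add_keywords(taxonomy, category, keywords):
    """Add keywords to a category, creating the category if needed."""
    existing = taxonomy.setdefault(category, [])
    for keyword in keywords:
        if keyword not in existing:  # avoid duplicate entries
            existing.append(keyword)
    return taxonomy


def flatten(taxonomy):
    """Return (category, keyword) pairs, the unit that gets embedded and matched."""
    return [(cat, kw) for cat, kws in taxonomy.items() for kw in kws]


# Extending the taxonomy is a data change only; no model is retrained.
add_keywords(TAXONOMY, "Violence / Graphic Content", ["explosion", "fire"])
add_keywords(TAXONOMY, "Human Remains", ["skull", "cadaver"])
```

Because matching works on keyword embeddings, a new entry only needs to be embedded once before it takes part in similarity scoring.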

Pipeline Architecture

Method 1 — Semantic Similarity

Image Conversion

TIFF images are converted to JPEG (max 4000px), handling multi-frame and truncated files with PIL. A separate downsampling variant (50% size reduction) was tested and benchmarked.
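The sizing rule can be sketched as follows, assuming "max 4000px" refers to the longest side and that images already within the limit are left unscaled. `fit_within` and `downsample` are hypothetical helper names; in practice the resulting size would be passed to Pillow's `Image.resize` with LANCZOS resampling.

```python
def fit_within(width, height, max_side=4000):
    """Scale (width, height) down so the longest side is at most max_side.

    Images already within the limit are returned unchanged (never upscaled).
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def downsample(width, height, factor=0.5):
    """Dimensions for the downsampling variant (50% of each side by default)."""
    return max(1, round(width * factor)), max(1, round(height * factor))


# An 8000 x 6000 scan is capped at 4000 on its longest side.
capped = fit_within(8000, 6000)
halved = downsample(*capped)
```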

Description Generation — Natural Prompt

LLaVA (llava-llama3 via Ollama) is prompted: "Only describe what you actually see. Don't guess names or locations." Generates a full natural-language description per image.

Description Generation — Strict Prompt

A second prompt generates a one-to-two sentence factual summary: "Output must be strict and factual, using only observable details." Both descriptions are combined for richer embedding context.

Semantic Embedding

Taxonomy keywords and combined image descriptions are embedded using SentenceTransformer (all-mpnet-base-v2). Cosine similarity is computed between description and keyword embedding vectors.

Matching & Flagging

Images scoring cosine similarity ≥ 0.45 against any taxonomy keyword are flagged. Output captures matched labels, matched keywords, and a detailed JSON breakdown per image.
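A minimal sketch of the matching step, assuming embeddings have already been computed (in the project these would come from SentenceTransformer). Toy three-dimensional vectors stand in for real embeddings, and `flag_description` and its output fields are hypothetical names chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def flag_description(description_vec, keyword_vecs, threshold=0.45):
    """Return matches scoring at or above the threshold, best first.

    keyword_vecs maps (category, keyword) -> embedding vector.
    """
    matches = []
    for (category, keyword), vec in keyword_vecs.items():
        score = cosine_similarity(description_vec, vec)
        if score >= threshold:
            matches.append({"label": category, "keyword": keyword, "score": round(score, 3)})
    return sorted(matches, key=lambda m: m["score"], reverse=True)


# Toy 3-dimensional vectors stand in for real sentence embeddings.
keyword_vecs = {
    ("Human Remains", "skull"): [0.9, 0.1, 0.0],
    ("Hate Symbols", "swastika"): [0.0, 0.2, 0.9],
}
description_vec = [0.8, 0.3, 0.1]
matches = flag_description(description_vec, keyword_vecs)
flagged = bool(matches)  # an image is flagged if any keyword clears the threshold
```

Keeping every match rather than only the best one is what allows a per-image breakdown of labels and keywords to be exported.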

Method 2 — LLaVA Direct Classification

Image Preparation

Same TIFF-to-JPEG conversion pipeline. Intermediate saves every 100 images guard against data loss during long batch runs (~33–40 hrs for 1,000 images).

Structured Moderation Prompt

Each image is sent to LLaVA with a strict classification prompt listing all 9 taxonomy categories. The model is asked to respond in JSON only: {"offensive": true/false, "category": "...", "rationale": "..."}
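One way such a prompt could be assembled is sketched below. The category names come from the taxonomy table, but the surrounding prompt wording and the `build_moderation_prompt` helper are assumptions for illustration, not the project's actual prompt.

```python
CATEGORIES = [
    "Historical Racialized Performance",
    "Human Remains",
    "Native American / Indigenous Imagery",
    "Nudity / Sexual Content",
    "Violence / Graphic Content",
    "Hate Symbols",
    "Medical / Health Records",
    "Student / PII Records",
    "Other Sensitive Categories",
]


def build_moderation_prompt(categories):
    """Build a classification prompt that lists every category and demands JSON."""
    numbered = "\n".join(f"{i}. {name}" for i, name in enumerate(categories, start=1))
    return (
        "You are reviewing a historical archival image.\n"
        "Decide whether it belongs to any of these sensitive categories:\n"
        f"{numbered}\n"
        "Respond with JSON only, in exactly this form:\n"
        '{"offensive": true/false, "category": "...", "rationale": "..."}'
    )


prompt = build_moderation_prompt(CATEGORIES)
```

Generating the list from data keeps the prompt in step with the taxonomy when categories are added.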

JSON Parsing & Fallback

Responses are parsed via json.loads(). Non-JSON outputs (cases where the model ignores the format instruction) are stored in the rationale field for manual review rather than discarded.
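The parse-with-fallback behavior can be sketched as below. The `needs_review` field, the `None` value for an undecided flag, and the handling of string-valued booleans are assumptions added for illustration; only the `json.loads()` call and the rationale fallback are described in the project.

```python
import json


def parse_moderation_response(raw_text):
    """Parse a model reply into offensive/category/rationale fields.

    Replies that are not valid JSON objects are kept verbatim in the
    rationale field and marked for manual review instead of being dropped.
    """
    try:
        data = json.loads(raw_text)
    except (json.JSONDecodeError, TypeError):
        data = None

    if not isinstance(data, dict):
        return {
            "offensive": None,  # unknown: needs a human decision
            "category": "",
            "rationale": raw_text,
            "needs_review": True,
        }

    offensive = data.get("offensive", False)
    if isinstance(offensive, str):
        # Models sometimes return "true"/"false" as strings.
        offensive = offensive.strip().lower() == "true"

    return {
        "offensive": bool(offensive),
        "category": str(data.get("category", "")),
        "rationale": str(data.get("rationale", "")),
        "needs_review": False,
    }
```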

Results Export

Classification results saved to CSV with filename, offensive flag, assigned category, and rationale. Resume capability handles partial runs — only unprocessed files are queued.
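Resume and periodic saving can be sketched with the standard library alone. This is an assumed design rather than the project's code: `run_batch`, `load_processed`, and `append_rows` are hypothetical names, and `classify` stands in for the LLaVA call.

```python
import csv
import os

FIELDS = ["filename", "offensive", "category", "rationale"]


def load_processed(csv_path):
    """Return the set of filenames already present in the results CSV."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row["filename"] for row in csv.DictReader(f)}


def append_rows(csv_path, rows):
    """Append result rows, writing the header only for a new file."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


def run_batch(filenames, classify, csv_path, save_every=100):
    """Classify only unprocessed files, flushing results every save_every images."""
    done = load_processed(csv_path)
    pending = [name for name in filenames if name not in done]
    buffer = []
    for name in pending:
        # classify must return exactly the offensive/category/rationale keys
        result = classify(name)
        buffer.append({"filename": name, **result})
        if len(buffer) >= save_every:
            append_rows(csv_path, buffer)
            buffer = []
    if buffer:  # write whatever remains after the last full chunk
        append_rows(csv_path, buffer)
    return len(pending)
```

With this layout an interrupted run loses at most `save_every` results, and rerunning the same call picks up where the CSV left off.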

Metadata Pipeline (Parallel)

A third parallel pipeline extracts File, EXIF/TIFF, and IPTC metadata from all TIFF images using tifffile, PIL, and iptcinfo3. Output is a structured CSV aligned with Dublin Core and IPTC standards, enabling cataloguing interoperability with library management systems.

Method Comparison

Semantic Similarity

Better Performer

Processing Time (1,000 images): ~9 minutes

Images Flagged: 79 / 1,000

Confirmed Correct (manual): 13 / 79 (16.5%)

Misclassifications: 8 / 79 (10.1%)

Ambiguous / Borderline: 58 / 79 (73.4%)

Taxonomy Extensibility: High (keyword-only update)

Captures nuanced semantic meaning. Fast at scale. False positives often stem from semantic overlap between unrelated categories (e.g., medieval ceremonial dress flagging as Native American imagery). Threshold tuning can improve precision.

SentenceTransformer · all-mpnet-base-v2 · Cosine Similarity · LLaVA Descriptions

LLaVA Direct

Needs Fine-Tuning

Processing Time (1,000 images): ~33–40 hours

Images Flagged: 0 / 1,000

Sensitive Content Detected: None (all false)

Rationale Quality: High (descriptive)

Taxonomy Extensibility: High (prompt-only update)

Bottleneck: Model inference (not image size)

Strong visual reasoning capability but failed to flag any content as offensive in this dataset. Model appears calibrated for contemporary content moderation and may require domain-specific fine-tuning for historical archival material. Notable: model correctly described racialized content in rationale text while still returning offensive: false.

LLaVA-LLaMA3 · Ollama · Structured JSON Output

Downsampling Finding

Images resized to 50% of original dimensions were tested independently. No measurable difference was found in description quality or processing speed. The inference bottleneck is the model itself, not image input size. Original resolution is recommended unless storage efficiency is the priority.

The 73% Ambiguity Problem

Of the 79 images flagged by the semantic method, 73.4% were borderline — neither clearly correct nor clearly wrong. Many were contextually sensitive (ceremonial dress, historical portraits) but not actionably offensive. This reflects the inherent subjectivity of archival sensitivity and points to the need for a human-in-the-loop review tier.

Results & Findings

Manual review of the 1,000-image test set identified approximately 8–10 images as genuinely sensitive, a rate of roughly 1% of the corpus. This low base rate is important context for interpreting both classification performance and false positive rates: at 1% prevalence, even a highly accurate model can produce more false positives than true positives in absolute terms.
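The base-rate effect can be checked with a short calculation. The 95% sensitivity and 95% specificity used here are assumed figures for a hypothetical classifier, not measurements from this project, and `expected_counts` is an illustrative helper.

```python
def expected_counts(n_images, prevalence, sensitivity, specificity):
    """Expected true and false positives for a screening classifier."""
    positives = n_images * prevalence
    negatives = n_images - positives
    true_pos = positives * sensitivity
    false_pos = negatives * (1 - specificity)
    precision = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, precision


# Hypothetical classifier: 95% sensitivity and 95% specificity at 1% prevalence.
tp, fp, precision = expected_counts(1000, 0.01, 0.95, 0.95)
print(f"true positives ~{tp:.1f}, false positives ~{fp:.1f}, precision ~{precision:.0%}")
```

Under these assumptions, 1,000 images yield about 9.5 true positives against about 49.5 false positives, so roughly five of every six flags would be wrong even though the classifier is right 95% of the time on each class.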

The semantic method's 16.5% precision on flagged items reflects both the challenge of this low-prevalence setting and the inherent ambiguity of archival content. The 73.4% "borderline" rate suggests that the sensitivity threshold is approximately correct, but that human expert judgment remains essential for a meaningful portion of flagged images.

The most reliable detected categories were Human Remains (skull, skeleton keywords) and Violence/Graphic Content (explosion, fire keywords) — categories where visual descriptions map cleanly to unambiguous taxonomy terms. Categories like Native American Imagery and Historical Racialized Performance proved harder to classify reliably, as descriptions often lacked explicit cultural context cues.

79

Flagged by Semantic Method

16.5%

Confirmed Accurate

~1%

True Sensitive Rate (Manual)

Example: True Positive (Human Remains) — Image 0000207

Archival image 0000207 — man holding skull

0000207.tif — flagged: Human Remains

Strict description: "A man is holding up a skull to the face of a dummy head for display purposes."

Matched keyword: skull → Category: Human Remains. Clean, unambiguous match.

Example: False Positive (Native American Imagery) — Image 0000700

Archival image 0000700 — medieval outfits

0000700.tif — false positive: Native American Imagery

Description: "Two women wearing medieval outfits stand side by side with a man."

Flagged as Native American imagery due to semantic overlap between "ceremonial" and "medieval costume." A clear threshold tuning and context-awareness challenge.

Metadata Extraction & Archival Standards

The metadata extraction pipeline is a standalone contribution of this project, independent of content classification. For archival collections, structured, standards-aligned metadata is as important as content moderation — enabling discovery, cataloguing, long-term preservation, and integration with library management systems.

Using tifffile, Pillow (PIL), and iptcinfo3, the pipeline extracts metadata across three levels for every TIFF image in the collection. All output fields are mapped to IPTC Core and Dublin Core schemas to ensure interoperability with the University Library's cataloguing systems.

File-Level

› Filename
› File size
› Format type
› MIME type

EXIF / TIFF

› Dimensions
› Resolution (DPI)
› Compression
› Photometric interp.
› Software / Dates

IPTC

› Object name
› Caption / abstract
› Keywords

Standards alignment: IPTC Core and Dublin Core compliance ensures metadata is portable across library management systems (Ex Libris Alma, CONTENTdm, ArchivesSpace) without transformation work.
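A crosswalk from extracted fields to Dublin Core elements might look like the sketch below. The left-hand field names are hypothetical column names, the mapping follows a common IPTC-to-Dublin-Core convention rather than the project's confirmed schema, and the sample record values are made up.

```python
# Assumed crosswalk from extracted field names to Dublin Core terms.
# Field names on the left are hypothetical column names, not the project's schema.
DUBLIN_CORE_MAP = {
    "filename": "dc:identifier",
    "mime_type": "dc:format",
    "iptc_object_name": "dc:title",
    "iptc_caption": "dc:description",
    "iptc_keywords": "dc:subject",
}


def to_dublin_core(record):
    """Map one extracted metadata record to Dublin Core keys.

    Missing or empty values are skipped; list values (e.g. keywords)
    are kept as lists because Dublin Core elements are repeatable.
    """
    out = {}
    for field, dc_term in DUBLIN_CORE_MAP.items():
        value = record.get(field)
        if value in (None, "", []):
            continue
        out[dc_term] = value
    return out


# Made-up sample record for illustration.
record = {
    "filename": "0000207.tif",
    "mime_type": "image/tiff",
    "iptc_object_name": "",
    "iptc_keywords": ["anthropology", "display"],
    "width": 5200,  # no mapping defined, so it is left out
}
dc_record = to_dublin_core(record)
```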

Recommendations & Next Steps

The POC establishes a working framework and clear performance baseline. The path forward depends on the scale and specificity required by the RIMS Committee and the University Library. The following options are recommended in order of increasing investment and capability.

Priority	Action	Rationale	Effort
Immediate	Expand to 10,000 images	At a roughly 1% sensitive rate, the 1,000-image sample contains too few confirmed true positives to validate detection performance. A 10x larger sample is the minimum for meaningful precision/recall estimation.	Low — pipeline already built
Near-term	Establish metadata standards	Confirm which IPTC/Dublin Core fields are required for integration with the Library's cataloguing system before scaling metadata extraction.	Low — coordination work
Near-term	Tune similarity threshold	Sweeping threshold values from 0.35 to 0.55 on a labeled validation set would expose the precision/recall trade-off and identify a cutoff that improves precision while controlling the false positive rate.	Low — analytical work
Strategic	Define prioritization criteria	Large-scale processing requires scoping decisions: by collection era, type, or institutional risk level. Early prioritization prevents indiscriminate resource consumption.	Medium — planning work
Strategic	Third-party API integration	Google Vision SafeSearch, AWS Rekognition, or Azure Content Moderator offer pretrained moderation models with strong nudity/violence performance. Fastest path to reliable detection; cost and procurement are primary constraints.	Medium — procurement + dev
Long-term	Custom model fine-tuning	Fine-tune LLaVA on UIUC-specific labeled archival images to improve detection of domain-specific content (historical racialized performance, Indigenous cultural materials). Most accurate long-term path but requires a labeled training dataset first.	High — ML development
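The threshold-tuning recommendation in the table above can be sketched as a simple sweep over a labeled validation set. The scores and labels below are made up for illustration, and `sweep_thresholds` is a hypothetical helper, not part of the existing pipeline.

```python
def sweep_thresholds(scored, thresholds):
    """Compute precision and recall at each threshold.

    scored is a list of (max_similarity, is_sensitive) pairs, one per
    image in a manually labeled validation set.
    """
    total_sensitive = sum(1 for _, label in scored if label)
    rows = []
    for t in thresholds:
        flagged = [label for score, label in scored if score >= t]
        true_pos = sum(1 for label in flagged if label)
        # Precision is reported as 0.0 when nothing is flagged.
        precision = true_pos / len(flagged) if flagged else 0.0
        recall = true_pos / total_sensitive if total_sensitive else 0.0
        rows.append({"threshold": t, "flagged": len(flagged),
                     "precision": round(precision, 3), "recall": round(recall, 3)})
    return rows


# Made-up scores and labels, for illustration only.
validation = [(0.62, True), (0.51, True), (0.48, False), (0.46, False),
              (0.41, True), (0.38, False), (0.36, False), (0.20, False)]
thresholds = [0.35, 0.40, 0.45, 0.50, 0.55]
results = sweep_thresholds(validation, thresholds)
```

On this toy data, raising the cutoff improves precision but drops the sensitive image scored 0.41, which is the trade-off a real sweep would need to weigh given how rare true positives are.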

Technology Stack

Vision & Language Models

› LLaVA-LLaMA3 (via Ollama)
› SentenceTransformer
› all-mpnet-base-v2
› Cosine similarity (PyTorch)

Image Processing

› Pillow (PIL)
› tifffile
› iptcinfo3
› LANCZOS resampling

Data & Infrastructure

› pandas / numpy
› Python / Jupyter
› Linux analytics server
› GitHub

LLaVA · Multimodal AI · Semantic Similarity · SentenceTransformer · TIFF / IPTC · Dublin Core · PIL · Ollama · Python · RIMS / Library Partnership
