A full end-to-end digital preservation pipeline that connects to Box cloud storage, downloads every file regardless of format, extracts or generates textual metadata for each one — including photographs via a local vision model — summarizes everything with a locally-running LLaMA 3.2 model, and indexes all content into a FAISS semantic search engine. Designed to make any legacy archive fully searchable with natural language.
Institutions accumulate enormous amounts of unstructured digital content — often belonging to individuals who have retired, passed away, or transitioned out. This content typically lives in cloud storage (Box, SharePoint, etc.) as thousands of mixed-format files with no metadata, no descriptions, and no way to search them beyond exact filename matches.
This project was built around a real institutional archive: the life's work of a faculty member spanning decades of academic research — papers, correspondence, scanned photographs, experimental data files, presentations, and emails. None of it was searchable in any meaningful way.
The system connects to Box via the Box SDK, downloads every file, and routes each one through the appropriate text extraction method for its format. Images are described by a local vision-language model. Structured files like Excel are analyzed by reading column headers rather than row data. Every file ultimately receives an LLM-generated summary, title, and document type classification — then gets indexed into a FAISS semantic search engine.
The entire pipeline runs locally with no cloud AI APIs. All LLM and vision model inference is handled by Ollama running on-device. This was a deliberate architectural choice: the files in this archive contained private personal correspondence, research data, and institutional records, making cloud API transmission inappropriate.
When a researcher retires, passes away, or moves on, decades of institutional knowledge — research, correspondence, data, images — becomes inaccessible. Standard file systems offer only filename search. The goal was to make any archive fully searchable through natural language, regardless of file type, regardless of whether content was ever labeled or organized.
The archive included private email correspondence, personal photographs, confidential administrative records, and decades of unpublished research. Routing any of this through a cloud LLM API would raise serious privacy and institutional data governance concerns. Ollama running LLaMA 3.2 and LLaVA-LLaMA3 locally was the only architecturally acceptable approach.
The system is a sequential eight-stage pipeline. Each stage produces a structured output that feeds the next. Stages 3–5 are the most complex — text extraction and LLM summarization vary substantially by file type and require per-type engineering. Every processing loop includes resume support: progress is checkpointed after every batch of rows, so the pipeline can be interrupted and restarted without reprocessing completed work.
Authenticates to Box via an OAuth2 developer token. Recursively walks the entire folder tree, collecting metadata for every file: full path, extension, size, created/modified timestamps, and the Box file_id — a unique numeric identifier that becomes the primary key throughout the entire pipeline. Output saved as file_path_directory.csv.
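The recursive walk can be sketched as a small generator over Box-SDK-style folder objects (the function name and field list here are illustrative, not the project's exact code):

```python
# Sketch of the recursive inventory walk. `folder` is any object exposing a
# Box-SDK-style get_items() whose items carry .type and .name; with the real
# SDK you would build it via:
#   from boxsdk import OAuth2, Client
#   client = Client(OAuth2(client_id, client_secret, access_token=dev_token))
#   root = client.folder("0")
def walk_folder(folder, prefix=""):
    """Yield (full_path, item) for every file under `folder`, depth-first."""
    for item in folder.get_items():
        path = f"{prefix}/{item.name}"
        if item.type == "folder":
            yield from walk_folder(item, path)
        else:
            # The real pipeline also records item.id plus size and timestamps,
            # e.g. item.get(fields=["size", "created_at", "modified_at"])
            yield path, item
```

Rows produced by this walk become file_path_directory.csv, keyed on the Box file_id.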
Downloads every file from Box to a local working directory. Filenames are sanitized (spaces, colons, slashes removed) and the Box file_id is appended to ensure uniqueness and traceability across all downstream steps. Already-downloaded files are skipped on re-run. Failures are logged but do not halt the process.
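The sanitize-and-suffix rule can be sketched as a single helper (the function name is hypothetical; the sanitization set matches the one described above):

```python
import re

def local_name(box_name: str, file_id: str) -> str:
    """Sanitize a Box filename and append the file_id before the extension,
    so every local file stays unique and traceable to its Box record."""
    stem, dot, ext = box_name.rpartition(".")
    if not dot:                                  # file had no extension at all
        stem, ext = box_name, ""
    stem = re.sub(r"[ :/\\]+", "_", stem)        # drop spaces, colons, slashes
    return f"{stem}_{file_id}.{ext}" if ext else f"{stem}_{file_id}"
```

Because the file_id is embedded in the name, any downstream row can be traced back to the exact Box object even if two files shared a name.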
Each file type is routed to the appropriate extraction method. PDFs use PyPDF2. Images are passed to LLaVA-LLaMA3 via Ollama for vision-model descriptions. Word docs use python-docx with Win32 COM fallback for legacy formats. Excel files read column headers only — not row data — and pass those to LLaMA to infer content. Emails are parsed into structured rows (Subject/From/To/Date/Body). HTML uses BeautifulSoup. RTF uses striprtf.
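The routing itself is a simple dispatch on extension; a minimal sketch (the strategy strings stand in for the real extractor functions wrapping each library):

```python
from pathlib import Path

# Extension -> extraction strategy. Values name the library used; the real
# pipeline maps each to an extractor function.
EXTRACTORS = {
    ".pdf":  "PyPDF2",
    ".docx": "python-docx",        # Win32 COM fallback covers legacy .doc
    ".doc":  "win32-com",
    ".xlsx": "headers-only",       # column headers passed to LLaMA
    ".eml":  "email-parser",       # Subject/From/To/Date/Body rows
    ".html": "BeautifulSoup",
    ".rtf":  "striprtf",
    ".jpg":  "LLaVA-LLaMA3",       # vision description via Ollama
    ".png":  "LLaVA-LLaMA3",
}

def route(path: str):
    """Return the extraction strategy for a file, or None to flag it
    for content sniffing / manual review."""
    return EXTRACTORS.get(Path(path).suffix.lower())
```

Unknown extensions deliberately return None rather than raising, in keeping with the log-and-continue design.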
Documents over 250 sentences are compressed before reaching LLaMA. Each sentence is embedded with all-MiniLM-L6-v2, scored by average pairwise cosine similarity (a proxy for semantic centrality), and the top 250 most representative sentences are retained. This preserves the document's core meaning without naive truncation, and ensures every document fits within LLaMA's context window.
Every document passes through three separate LLaMA 3.2 prompts: a plain-language summary, a one-sentence title, and a document type classification (e.g., "research paper on turbulent pipe flow", "email between faculty regarding a budget request"). Prompts are designed to produce clean single-output responses. Processing is batched in groups of 3 with Parquet checkpoint saves after each batch.
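The three-prompt pass and its checkpoint cadence can be sketched as below. `ask_llama` and `save_checkpoint` are injected stand-ins (the real code calls Ollama's LLaMA 3.2 endpoint and writes Parquet), and the prompt wording is illustrative:

```python
PROMPTS = {
    "summary":  "Write a plain-language summary of the following document:\n\n{text}",
    "title":    "Write a one-sentence title for the following document:\n\n{text}",
    "doc_type": "Classify the type of the following document in a short phrase:\n\n{text}",
}

def summarize_batches(rows, ask_llama, save_checkpoint, batch_size=3):
    """Run all three prompts per (file_id, text) row; checkpoint after
    every batch so an interrupted run can resume without rework."""
    results = []
    for start in range(0, len(rows), batch_size):
        for file_id, text in rows[start:start + batch_size]:
            record = {"file_id": file_id}
            for field, template in PROMPTS.items():
                record[field] = ask_llama(template.format(text=text))
            results.append(record)
        save_checkpoint(results)   # real code: pd.DataFrame(results).to_parquet(...)
    return results
```

On restart, any file_id already present in the last checkpoint is skipped, which is what makes the pipeline safely interruptible.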
All per-file-type DataFrames are standardized to a unified schema: file_path, file_name, file_id, extension, original_text, llama_generated_summary. Joined back to the Box inventory on file_id to attach size, timestamps, and original Box path. Produces final_df_summaries.csv — the master output.
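The join back to the inventory is a straightforward pandas merge on file_id; a sketch (the metadata column names are assumptions, not the project's exact schema):

```python
import pandas as pd

def attach_inventory(summaries: pd.DataFrame, inventory: pd.DataFrame) -> pd.DataFrame:
    """Left-join Box inventory metadata onto the unified summaries table,
    keyed on file_id. Column names here are illustrative."""
    meta = inventory[["file_id", "size", "created_at", "modified_at", "box_path"]]
    return summaries.merge(meta, on="file_id", how="left")
```

A left join keeps every summarized file even if its inventory row is somehow missing, so no processed work is silently dropped.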
TF-IDF vectorizes all summaries (bigrams, frequency-filtered). The elbow method selects K. KMeans clusters documents into topics. Top 20 keywords per cluster are sent to LLaMA to generate human-readable topic titles. A hierarchical second pass creates subtopics within each top-level cluster. Result: automatic discovery of the thematic structure of the entire archive.
All summaries are embedded with SentenceTransformer and indexed into a FAISS HNSW index persisted to disk. An interactive CLI allows natural language queries with top-k or similarity threshold search. Image results open directly in the OS viewer. Text results print full extracted content. The index loads in seconds even across thousands of documents.
One of the core engineering challenges was handling a genuinely heterogeneous archive — not just the clean file types, but decades-old formats, broken extensions, vendor-specific variants, and files with no extension at all. Each type required a different approach, and several required special handling or fallback strategies.
The real-world archive contained dozens of files with unusual or broken extensions — .toc, .particleturbulence, .lagrangian, .hanrttydoc, batch labels like 00361-00850, and files with no extension at all. These were catalogued, inspected via file content sniffing, and either processed if readable or flagged for manual review. The pipeline was designed to log and continue rather than fail on unknown formats.
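Content sniffing for these oddball files reduces to checking leading magic bytes; a minimal sketch (the table below covers a few common signatures, not everything the pipeline inspected):

```python
MAGIC_BYTES = {
    b"%PDF":       "pdf",
    b"PK\x03\x04": "zip-container",   # .docx/.xlsx/.pptx are zip archives
    b"{\\rtf":     "rtf",
}

def sniff(head: bytes):
    """Guess a format from a file's leading bytes; None means the file
    gets flagged for manual review instead of crashing the run."""
    for magic, kind in MAGIC_BYTES.items():
        if head.startswith(magic):
            return kind
    return None
```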
Three local models are used across the pipeline, each chosen for a specific role: LLaMA 3.2 for summarization, titling, and classification; LLaVA-LLaMA3 for image description; and all-MiniLM-L6-v2 for sentence and summary embeddings. The two generative models run via Ollama, the embedding model via sentence-transformers, and none requires an internet connection after the initial model pull.
Before any document reaches LLaMA, a semantic compression pass runs if it exceeds 250 sentences (or 5,000 tokens). This is one of the more nuanced engineering decisions in the pipeline.
Simple truncation (take the first N sentences) fails for academic papers and research documents, which typically front-load abstract and introduction content while burying key findings and conclusions. A naive head-truncation would systematically bias the summaries toward intros.
The semantic approach: embed every sentence, compute each sentence's average cosine similarity to all others in the document (a proxy for how "central" it is to the overall content), and keep the top 250. This tends to select sentences that are thematically representative of the full document rather than just the beginning.
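In toy form, assuming the sentence embeddings have already been computed (the real pipeline gets them from all-MiniLM-L6-v2), the selection is a few lines of numpy:

```python
import numpy as np

def central_sentence_indices(embeddings: np.ndarray, keep: int) -> np.ndarray:
    """Score each sentence by its mean cosine similarity to every sentence
    in the document, then keep the `keep` highest-scoring sentences,
    returned in original document order."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centrality = (unit @ unit.T).mean(axis=1)     # mean pairwise cosine sim
    return np.sort(np.argsort(-centrality)[:keep])
```

Re-sorting the surviving indices preserves the original sentence order, so the compressed document still reads coherently to the summarizer.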
For Excel files, loading thousands of rows into a summarization prompt produces poor results — the LLM gets lost in numbers. Instead, the pipeline reads only sheet names and column header labels, then prompts LLaMA: "based on these column names, write a summary of what this spreadsheet contains." The model infers purpose from structure alone, and does so remarkably well. A sheet named "QG34_TLC_26-29_June_Houston_mtg" with columns like "Attendee", "Organization", "Day 1 AM" is correctly identified as a conference meeting roster.
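The header-only prompt can be sketched as a small builder (the exact prompt wording in the project may differ; in real use the columns come from something like `pd.read_excel(path, sheet_name=name, nrows=0).columns`):

```python
def header_prompt(sheet_name: str, columns: list) -> str:
    """Build the structure-only summarization prompt for one sheet:
    no row data ever reaches the model, only names and headers."""
    return (
        f'Sheet name: "{sheet_name}"\n'
        f'Column headers: {", ".join(columns)}\n'
        "Based on these column names, write a summary of what this "
        "spreadsheet contains."
    )
```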
Once all documents have summaries, the search system indexes them using FAISS (Facebook AI Similarity Search). Every LLaMA-generated summary is embedded with SentenceTransformer into a 384-dimensional vector, and the full set is added to a HNSW (Hierarchical Navigable Small World) index for fast approximate nearest-neighbor lookup.
The index is persisted to disk (summaries_hnsw.index) alongside a metadata pickle of file paths, names, and summaries. Subsequent runs load the index in seconds without re-embedding.
An interactive command-line interface supports two search modes: top-k (return the N most similar documents) and threshold (return all documents above a minimum similarity score). For image results, the matched file opens directly in the OS image viewer. For text files, the full extracted content prints to the terminal.
HNSW provides approximate nearest-neighbor search with sub-linear query time. For archival use at the scale of thousands of documents, it loads instantly from disk and queries in milliseconds — no database server required.
A parallel implementation using the Haystack framework provides a higher-level abstraction with FAISSDocumentStore and EmbeddingRetriever, making it easier to extend with extractive QA or hybrid BM25 retrieval.
The FAISS index embeds LLaMA-generated summaries, not raw text. This means the search space is clean, normalized, and consistently phrased — dramatically improving recall over raw-text search.
Photographs are indexed the same way as text documents — via their LLaVA-generated description. A query like "photo of students at a protest" finds matching images by semantic similarity of their descriptions.
Beyond enabling search, the pipeline automatically discovers the thematic structure of the entire archive without any manual labeling. TF-IDF vectorization of all summaries, followed by KMeans clustering with elbow-method K selection, groups documents into coherent topics.
For each cluster, the top 20 keywords by centroid weight are extracted and sent to LLaMA with the prompt: "Given these topic keywords, write a concise one-sentence title summarizing this topic." The result is a set of human-readable topic labels that can be browsed by institutional stakeholders to understand what the archive contains before running any searches.
A hierarchical second pass runs within each top-level topic: KMeans again at a smaller granularity, producing subtopics with their own LLaMA-generated titles. This two-level taxonomy gives both a high-level overview and fine-grained navigation.