A full end-to-end digital preservation pipeline that connects to Box cloud storage, downloads every file regardless of format, extracts or generates textual metadata for each one — including photographs via a local vision model — summarizes everything with a locally-running LLaMA 3.2 model, and indexes all content into a FAISS semantic search engine. Designed to make any legacy archive fully searchable with natural language.
Institutions accumulate enormous amounts of unstructured digital content — often belonging to individuals who have retired, passed away, or transitioned out. This content typically lives in cloud storage (Box, SharePoint, etc.) as thousands of mixed-format files with no metadata, no descriptions, and no way to search them beyond exact filename matches.
This project was built around a real institutional archive: the life's work of a faculty member spanning decades of academic research — papers, correspondence, scanned photographs, experimental data files, presentations, and emails. None of it was searchable in any meaningful way.
The system connects to Box via the Box SDK, downloads every file, and routes each one through the appropriate text extraction method for its format. Images are described by a local vision-language model. Structured files like Excel are analyzed by reading column headers rather than row data. Every file ultimately receives an LLM-generated summary, title, and document type classification — then gets indexed into a FAISS semantic search engine.
The entire pipeline runs locally with no cloud AI APIs. All LLM and vision model inference is handled by Ollama running on-device. This was a deliberate architectural choice: the files in this archive contained private personal correspondence, research data, and institutional records, making cloud API transmission inappropriate.
When a researcher retires, passes away, or moves on, decades of institutional knowledge — research, correspondence, data, images — becomes inaccessible. Standard file systems offer only filename search. The goal was to make any archive fully searchable through natural language, regardless of file type, regardless of whether content was ever labeled or organized.
The archive included private email correspondence, personal photographs, confidential administrative records, and decades of unpublished research. Routing any of this through a cloud LLM API would raise serious privacy and institutional data governance concerns. Ollama running LLaMA 3.2 and LLaVA-LLaMA3 locally was the only architecturally acceptable approach.
The system is a sequential eight-stage pipeline. Each stage produces a structured output that feeds the next. Stages 3–5 are the most complex — text extraction and LLM summarization vary substantially by file type and require per-type engineering. Every processing loop includes resume support: progress is checkpointed after every batch of rows, so the pipeline can be interrupted and restarted without reprocessing completed work.
Authenticates to Box via an OAuth2 developer token. Recursively walks the entire folder tree, collecting metadata for every file: full path, extension, size, created/modified timestamps, and the Box file_id — a unique numeric identifier that becomes the primary key throughout the entire pipeline. Output saved as file_path_directory.csv.
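The recursive walk can be sketched as a small generator over Box-SDK-style folder objects (the function name and field list here are illustrative, not the project's exact code):

```python
# Sketch of the recursive inventory walk. `folder` is any object exposing a
# Box-SDK-style get_items() whose items carry .type and .name; with the real
# SDK you would build it via:
#   from boxsdk import OAuth2, Client
#   client = Client(OAuth2(client_id, client_secret, access_token=dev_token))
#   root = client.folder("0")
def walk_folder(folder, prefix=""):
    """Yield (full_path, item) for every file under `folder`, depth-first."""
    for item in folder.get_items():
        path = f"{prefix}/{item.name}"
        if item.type == "folder":
            yield from walk_folder(item, path)
        else:
            # The real pipeline also records item.id plus size and timestamps,
            # e.g. item.get(fields=["size", "created_at", "modified_at"])
            yield path, item
```

Rows produced by this walk become file_path_directory.csv, keyed on the Box file_id.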
Downloads every file from Box to a local working directory. Filenames are sanitized (spaces, colons, slashes removed) and the Box file_id is appended to ensure uniqueness and traceability across all downstream steps. Already-downloaded files are skipped on re-run. Failures are logged but do not halt the process.
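The sanitize-and-suffix rule can be sketched as a single helper (the function name is hypothetical; the sanitization set matches the one described above):

```python
import re

def local_name(box_name: str, file_id: str) -> str:
    """Sanitize a Box filename and append the file_id before the extension,
    so every local file stays unique and traceable to its Box record."""
    stem, dot, ext = box_name.rpartition(".")
    if not dot:                                  # file had no extension at all
        stem, ext = box_name, ""
    stem = re.sub(r"[ :/\\]+", "_", stem)        # drop spaces, colons, slashes
    return f"{stem}_{file_id}.{ext}" if ext else f"{stem}_{file_id}"
```

Because the file_id is embedded in the name, any downstream row can be traced back to the exact Box object even if two files shared a name.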
Each file type is routed to the appropriate extraction method. PDFs use PyPDF2. Images are passed to LLaVA-LLaMA3 via Ollama for vision-model descriptions. Word docs use python-docx with Win32 COM fallback for legacy formats. Excel files read column headers only — not row data — and pass those to LLaMA to infer content. Emails are parsed into structured rows (Subject/From/To/Date/Body). HTML uses BeautifulSoup. RTF uses striprtf.
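The routing itself is a simple dispatch on extension; a minimal sketch (the strategy strings stand in for the real extractor functions wrapping each library):

```python
from pathlib import Path

# Extension -> extraction strategy. Values name the library used; the real
# pipeline maps each to an extractor function.
EXTRACTORS = {
    ".pdf":  "PyPDF2",
    ".docx": "python-docx",        # Win32 COM fallback covers legacy .doc
    ".doc":  "win32-com",
    ".xlsx": "headers-only",       # column headers passed to LLaMA
    ".eml":  "email-parser",       # Subject/From/To/Date/Body rows
    ".html": "BeautifulSoup",
    ".rtf":  "striprtf",
    ".jpg":  "LLaVA-LLaMA3",       # vision description via Ollama
    ".png":  "LLaVA-LLaMA3",
}

def route(path: str):
    """Return the extraction strategy for a file, or None to flag it
    for content sniffing / manual review."""
    return EXTRACTORS.get(Path(path).suffix.lower())
```

Unknown extensions deliberately return None rather than raising, in keeping with the log-and-continue design.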
Documents over 250 sentences are compressed before reaching LLaMA. Each sentence is embedded with all-MiniLM-L6-v2, scored by average pairwise cosine similarity (a proxy for semantic centrality), and the top 250 most representative sentences are retained. This preserves the document's core meaning without naive truncation, and ensures every document fits within LLaMA's context window.
Every document passes through three separate LLaMA 3.2 prompts: a plain-language summary, a one-sentence title, and a document type classification (e.g., "research paper on turbulent pipe flow", "email between faculty regarding a budget request"). Prompts are designed to produce clean single-output responses. Processing is batched in groups of 3 with Parquet checkpoint saves after each batch.
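The three-prompt pass and its checkpoint cadence can be sketched as below. `ask_llama` and `save_checkpoint` are injected stand-ins (the real code calls Ollama's LLaMA 3.2 endpoint and writes Parquet), and the prompt wording is illustrative:

```python
PROMPTS = {
    "summary":  "Write a plain-language summary of the following document:\n\n{text}",
    "title":    "Write a one-sentence title for the following document:\n\n{text}",
    "doc_type": "Classify the type of the following document in a short phrase:\n\n{text}",
}

def summarize_batches(rows, ask_llama, save_checkpoint, batch_size=3):
    """Run all three prompts per (file_id, text) row; checkpoint after
    every batch so an interrupted run can resume without rework."""
    results = []
    for start in range(0, len(rows), batch_size):
        for file_id, text in rows[start:start + batch_size]:
            record = {"file_id": file_id}
            for field, template in PROMPTS.items():
                record[field] = ask_llama(template.format(text=text))
            results.append(record)
        save_checkpoint(results)   # real code: pd.DataFrame(results).to_parquet(...)
    return results
```

On restart, any file_id already present in the last checkpoint is skipped, which is what makes the pipeline safely interruptible.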
All per-file-type DataFrames are standardized to a unified schema: file_path, file_name, file_id, extension, original_text, llama_generated_summary. Joined back to the Box inventory on file_id to attach size, timestamps, and original Box path. Produces final_df_summaries.csv — the master output.
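The join back to the inventory is a straightforward pandas merge on file_id; a sketch (the metadata column names are assumptions, not the project's exact schema):

```python
import pandas as pd

def attach_inventory(summaries: pd.DataFrame, inventory: pd.DataFrame) -> pd.DataFrame:
    """Left-join Box inventory metadata onto the unified summaries table,
    keyed on file_id. Column names here are illustrative."""
    meta = inventory[["file_id", "size", "created_at", "modified_at", "box_path"]]
    return summaries.merge(meta, on="file_id", how="left")
```

A left join keeps every summarized file even if its inventory row is somehow missing, so no processed work is silently dropped.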
TF-IDF vectorizes all summaries (bigrams, frequency-filtered). The elbow method selects K. KMeans clusters documents into topics. Top 20 keywords per cluster are sent to LLaMA to generate human-readable topic titles. A hierarchical second pass creates subtopics within each top-level cluster. Result: automatic discovery of the thematic structure of the entire archive.
All summaries are embedded with SentenceTransformer and indexed into a FAISS HNSW index persisted to disk. An interactive CLI allows natural language queries with top-k or similarity threshold search. Image results open directly in the OS viewer. Text results print full extracted content. The index loads in seconds even across thousands of documents.
One of the core engineering challenges was handling a genuinely heterogeneous archive — not just the clean file types, but decades-old formats, broken extensions, vendor-specific variants, and files with no extension at all. Each type required a different approach, and several required special handling or fallback strategies.
The real-world archive contained dozens of files with unusual or broken extensions — .toc, .particleturbulence, .lagrangian, .hanrttydoc, batch labels like 00361-00850, and files with no extension at all. These were catalogued, inspected via file content sniffing, and either processed if readable or flagged for manual review. The pipeline was designed to log and continue rather than fail on unknown formats.
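Content sniffing for these oddball files reduces to checking leading magic bytes; a minimal sketch (the table below covers a few common signatures, not everything the pipeline inspected):

```python
MAGIC_BYTES = {
    b"%PDF":       "pdf",
    b"PK\x03\x04": "zip-container",   # .docx/.xlsx/.pptx are zip archives
    b"{\\rtf":     "rtf",
}

def sniff(head: bytes):
    """Guess a format from a file's leading bytes; None means the file
    gets flagged for manual review instead of crashing the run."""
    for magic, kind in MAGIC_BYTES.items():
        if head.startswith(magic):
            return kind
    return None
```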
Three local models are used across the pipeline, each chosen for a specific role: LLaMA 3.2 for summarization, titling, and classification; LLaVA-LLaMA3 for image description; and all-MiniLM-L6-v2 for sentence and summary embeddings. The two generative models run via Ollama, the embedding model via sentence-transformers, and none requires an internet connection after the initial model pull.
Before any document reaches LLaMA, a semantic compression pass runs if it exceeds 250 sentences (or 5,000 tokens). This is one of the more nuanced engineering decisions in the pipeline.
Simple truncation (take the first N sentences) fails for academic papers and research documents, which typically front-load abstract and introduction content while burying key findings and conclusions. A naive head-truncation would systematically bias the summaries toward intros.
The semantic approach: embed every sentence, compute each sentence's average cosine similarity to all others in the document (a proxy for how "central" it is to the overall content), and keep the top 250. This tends to select sentences that are thematically representative of the full document rather than just the beginning.
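In toy form, assuming the sentence embeddings have already been computed (the real pipeline gets them from all-MiniLM-L6-v2), the selection is a few lines of numpy:

```python
import numpy as np

def central_sentence_indices(embeddings: np.ndarray, keep: int) -> np.ndarray:
    """Score each sentence by its mean cosine similarity to every sentence
    in the document, then keep the `keep` highest-scoring sentences,
    returned in original document order."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centrality = (unit @ unit.T).mean(axis=1)     # mean pairwise cosine sim
    return np.sort(np.argsort(-centrality)[:keep])
```

Re-sorting the surviving indices preserves the original sentence order, so the compressed document still reads coherently to the summarizer.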
For Excel files, loading thousands of rows into a summarization prompt produces poor results — the LLM gets lost in numbers. Instead, the pipeline reads only sheet names and column header labels, then prompts LLaMA: "based on these column names, write a summary of what this spreadsheet contains." The model infers purpose from structure alone, and does so remarkably well. A sheet named "QG34_TLC_26-29_June_Houston_mtg" with columns like "Attendee", "Organization", "Day 1 AM" is correctly identified as a conference meeting roster.
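The header-only prompt can be sketched as a small builder (the exact prompt wording in the project may differ; in real use the columns come from something like `pd.read_excel(path, sheet_name=name, nrows=0).columns`):

```python
def header_prompt(sheet_name: str, columns: list) -> str:
    """Build the structure-only summarization prompt for one sheet:
    no row data ever reaches the model, only names and headers."""
    return (
        f'Sheet name: "{sheet_name}"\n'
        f'Column headers: {", ".join(columns)}\n'
        "Based on these column names, write a summary of what this "
        "spreadsheet contains."
    )
```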
Once all documents have summaries, the search system indexes them using FAISS (Facebook AI Similarity Search). Every LLaMA-generated summary is embedded with SentenceTransformer into a 384-dimensional vector, and the full set is added to a HNSW (Hierarchical Navigable Small World) index for fast approximate nearest-neighbor lookup.
The index is persisted to disk (summaries_hnsw.index) alongside a metadata pickle of file paths, names, and summaries. Subsequent runs load the index in seconds without re-embedding.
An interactive command-line interface supports two search modes: top-k (return the N most similar documents) and threshold (return all documents above a minimum similarity score). For image results, the matched file opens directly in the OS image viewer. For text files, the full extracted content prints to the terminal.
HNSW provides approximate nearest-neighbor search with sub-linear query time. For archival use at the scale of thousands of documents, it loads instantly from disk and queries in milliseconds — no database server required.
A parallel implementation using the Haystack framework provides a higher-level abstraction with FAISSDocumentStore and EmbeddingRetriever, making it easier to extend with extractive QA or hybrid BM25 retrieval.
The FAISS index embeds LLaMA-generated summaries, not raw text. This means the search space is clean, normalized, and consistently phrased — dramatically improving recall over raw-text search.
Photographs are indexed the same way as text documents — via their LLaVA-generated description. A query like "photo of students at a protest" finds matching images by semantic similarity of their descriptions.
Beyond enabling search, the pipeline automatically discovers the thematic structure of the entire archive without any manual labeling. TF-IDF vectorization of all summaries, followed by KMeans clustering with elbow-method K selection, groups documents into coherent topics.
For each cluster, the top 20 keywords by centroid weight are extracted and sent to LLaMA with the prompt: "Given these topic keywords, write a concise one-sentence title summarizing this topic." The result is a set of human-readable topic labels that can be browsed by institutional stakeholders to understand what the archive contains before running any searches.
A hierarchical second pass runs within each top-level topic: KMeans again at a smaller granularity, producing subtopics with their own LLaMA-generated titles. This two-level taxonomy gives both a high-level overview and fine-grained navigation.