An end-to-end applied AI pipeline for processing a 20-year institutional email archive, converting raw legacy PST and EML formats into research-ready, fully anonymized, and semantically searchable records using NLP, PII detection, topic modeling, and locally deployed LLM summarization. Initiated through a direct referral from institutional leadership and delivered as a working prototype with full scalability analysis.
Universities and large institutions accumulate decades of digital correspondence from faculty, staff, and administrators who retire, leave, or pass away. This material often contains historically significant records, operational decisions, project documentation, and institutional knowledge, but sits entirely inaccessible because it arrives in bulk, in legacy formats, with no metadata, and cannot be manually reviewed at scale.
This project was initiated in December 2024 when the Associate Director for Information Governance at RIMS reached out directly, on the recommendation of a senior colleague. RIMS had been working with email archiving and e-discovery tools for several years and had developed a word frequency-based redaction strategy. The outreach was an explicit request for advanced data science and applied AI expertise to push that work further. The project was formally scoped in an April 2025 working meeting attended by the University Archivist, the Associate Director for Information Governance, and Dimuthu Tilakaratne, who now serves as Chief Data and Analytics Officer (CDAO) for the University of Illinois System. That meeting concluded with Tayler being tasked with designing and testing the full automated archival pipeline.
The resulting pipeline takes raw .pst legacy archive files, converts them to structured data, extracts and cleans individual messages from email threads, detects and replaces all personally identifiable information using named entity recognition, and generates fully anonymized plain-language summaries of each message using a locally deployed large language model. The output is a dataset that can be made available to researchers without any manual review step.
The pipeline was designed from first principles with privacy as a non-negotiable constraint: no data leaves the institution's environment at any stage, and the two-layer anonymization approach ensures both structured PII (names, phone numbers, emails) and unstructured contextual sensitivity are addressed.
"Tayler also found novel solutions to these challenges by researching and identifying Python libraries and large language models that can be quickly leveraged. After meeting with Tayler, each of them was excited and inspired by her thoughts and engaged with her to develop solutions iteratively."
Dimuthu Tilakaratne, CDAO, University of Illinois System · Performance Appraisal 2024 to 2025
Digital archives from former staff arrive in bulk, often decades old, in mixed formats, with no metadata, and potentially containing sensitive personal information. Manual processing is not scalable. Content typically sits untouched for years, making historically significant institutional records completely inaccessible to researchers. This project was scoped to solve exactly that problem at institutional scale.
Layer 1, Presidio NER: Microsoft's presidio-analyzer detects and replaces structured PII (names, phone numbers, email addresses, dates, SSNs) with numbered placeholders like <PERSON1>, <PHONE1>.
Layer 2, LLM Rewriting: Llama 3.2, running locally via Ollama, rewrites each message as a fully anonymized plain-language summary, handling contextual and unstructured sensitive content that rule-based systems miss.
The entire pipeline runs on local infrastructure. No email content, no PII, and no archive data is sent to any external API or cloud service at any stage of processing. The Llama 3.2 model is deployed and queried entirely on-device via Ollama.
This project contributed to an Extraordinary / Distinguished performance rating from Dimuthu Tilakaratne (now CDAO, University of Illinois System) in 2023 to 2024, and a subsequent AITS Excellence Award finalist nomination. The work was cited specifically for bringing novel AI solutions to complex institutional challenges and inspiring stakeholder engagement.
The archive originated as a set of legacy .pst (Personal Storage Table) files, Microsoft Outlook's proprietary binary archive format. These were first converted to individual .eml files using an external conversion tool, preserving the original Outlook folder hierarchy in the resulting directory structure. A single Python notebook then recursively walked the directory tree and extracted the plain-text body from every .eml file.
The key design decision at ingestion was to preserve the full absolute file path for every message. The folder names in the original archive carried meaningful organizational metadata (project names, department labels, topic categories) that would have been lost if only the filename was retained. This path is later split into individual Folder_N columns in the downstream pipeline.
The ingestion script handled MIME multipart emails, single-part plain text emails, charset encoding variations across 20+ years of archive content, and attachment detection, all with graceful error handling that allowed processing to continue past any individual file failures.
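A minimal sketch of the ingestion approach using the stdlib `email` module. The function names are illustrative, not the project's actual code; the multipart handling, `errors` behavior, and per-file failure tolerance mirror what the section describes.

```python
import email
from email import policy
from pathlib import Path

def extract_body(eml_path):
    """Parse one .eml file and return (absolute path, plain-text body).

    Multipart-aware via get_body(); parsing failures are handled by the
    caller so one bad file never stops the walk.
    """
    raw = Path(eml_path).read_bytes()
    msg = email.message_from_bytes(raw, policy=policy.default)
    if msg.is_multipart():
        part = msg.get_body(preferencelist=("plain",))
        body = part.get_content() if part is not None else ""
    else:
        body = msg.get_content()
    return str(Path(eml_path).resolve()), body

def walk_archive(root):
    """Recursively collect (path, body) rows, skipping unreadable files."""
    rows = []
    for p in Path(root).rglob("*.eml"):
        try:
            rows.append(extract_body(p))
        except Exception:
            continue  # graceful per-file failure handling
    return rows
```

Keeping the absolute path in each row is what later feeds the Folder_N metadata columns.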
Each extracted body was then split on the `-----Original Message-----` delimiter and exploded to one row per message segment using pandas `.explode()`: 44,677 files → 79,676 individual message rows.

Privacy protection was the most technically demanding aspect of the project. The archive contained 20+ years of real personal correspondence: names, phone numbers, email addresses, dates, and potentially Social Security Numbers appear throughout. A two-stage anonymization architecture was designed: a structured PII extraction pass using Microsoft's presidio-analyzer, followed by an LLM rewriting pass using Llama 3.2 to handle the residual unstructured contextual sensitivity that rule-based systems cannot address.
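The thread-splitting step can be sketched in a few lines of pandas; the `Body` column name is an assumption for illustration.

```python
import pandas as pd

DELIM = "-----Original Message-----"

def split_threads(df, body_col="Body"):
    """Split each email body on the reply delimiter and explode so every
    quoted message segment becomes its own row (column name assumed)."""
    out = df.copy()
    out[body_col] = out[body_col].str.split(DELIM)
    out = out.explode(body_col).reset_index(drop=True)
    out[body_col] = out[body_col].str.strip()
    return out
```

All other columns (path, sender, date) are duplicated onto each exploded segment, which is what lets downstream steps treat segments as independent messages.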
The core component of the PII extraction stage is Presidio's AnalyzerEngine, which scans unstructured text and identifies sensitive entities using a combination of pre-trained NLP models and regex-based pattern matching. Five entity types are mapped to dedicated output columns:
Entities replaced with typed, numbered tokens: <PERSON1>, <PERSON2>, <DATE_TIME1>, and so on. The original values are stored in mapping dictionaries, so every substitution is fully auditable. Preserves document structure while enabling full auditability.
Detected entities removed entirely from text, replaced with empty space. Produces the most conservative output. Used when downstream analysis does not require any information about the removed content.
PII spans replaced with block characters (████) matching the original character length. Preserves document structure visually while making all sensitive content unreadable. Useful for document review workflows.
Phone numbers masked to show only last 4 digits, ***-***-1234. Preserves formatting characters while removing identifying digits. Suitable for audit trails where partial identification is acceptable.
PERSON entities converted to irreversible cryptographic hashes. Enables cross-document communication pattern analysis (same person always produces same hash) without exposing any identifiable information.
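The numbered-token strategy can be sketched with plain Python. The spans here are hardcoded stand-ins for what a Presidio `AnalyzerEngine` pass returns; tokens are numbered left to right, then substituted right to left so earlier character offsets stay valid while the string is rewritten.

```python
def replace_with_tokens(text, spans):
    """Replace detected PII spans with typed, numbered placeholders.

    `spans` is a list of (start, end, entity_type) tuples, as produced by
    an analyzer pass (hardcoded in the test below for illustration).
    Returns the rewritten text and an auditable token -> original mapping.
    """
    spans = sorted(spans, key=lambda s: s[0])
    counters, mapping, tokens = {}, {}, []
    for start, end, etype in spans:
        counters[etype] = counters.get(etype, 0) + 1
        token = f"<{etype}{counters[etype]}>"
        mapping[token] = text[start:end]
        tokens.append((start, end, token))
    # Substitute right to left so earlier offsets remain accurate.
    for start, end, token in reversed(tokens):
        text = text[:start] + token + text[end:]
    return text, mapping
```

The same reverse-order substitution works unchanged for the removal, blackout, and masking strategies; only the replacement string differs.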
Full dataset processed in batches of 10 rows. The pipeline writes output incrementally; if interrupted, processing resumes from the last completed row. Critical for a dataset of ~80K rows at ~1–2 seconds per row of processing time.
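A minimal sketch of the resume pattern: count the rows already in the output file, skip that many inputs, and flush after each batch. The function and file names are illustrative; `process` stands in for the per-row anonymization step.

```python
import csv
import os

def process_resumable(rows, out_path, process, batch_size=10):
    """Append processed rows to CSV in batches; on restart, rows already
    present in the output are skipped, so a multi-day run resumes from
    the last completed row instead of starting over."""
    done = 0
    if os.path.exists(out_path):
        with open(out_path, newline="", encoding="utf-8") as f:
            done = sum(1 for _ in f)
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        batch = []
        for row in rows[done:]:
            batch.append([process(row)])
            if len(batch) == batch_size:
                writer.writerows(batch)
                f.flush()  # durability point: a crash loses at most one batch
                batch = []
        if batch:
            writer.writerows(batch)
```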
| Batch Time (10 rows) | Avg Per Row | 100,000 Rows | 1,000,000 Rows | Status |
|---|---|---|---|---|
| 10 seconds | 1.0 sec | ~27.8 hrs | ~278 hrs | Acceptable |
| 15 seconds | 1.5 sec | ~41.7 hrs | ~417 hrs | Acceptable |
| 20 seconds | 2.0 sec | ~55.6 hrs | ~556 hrs | Optimize |
Instantiating the AnalyzerEngine once outside the per-row loop eliminates repeated model loading and reduces per-row time by 40–60%.

Even after Presidio PII extraction, the cleaned text retains content that can be sensitive or identifying in context: references to specific projects, systems, operational incidents, or personal circumstances that are not captured by named-entity patterns. The second anonymization stage uses Llama 3.2, deployed locally via Ollama, to rewrite each message as a concise, fully de-identified summary.
The rewrite_email_summary() function sends each cleaned email to the model with a structured prompt instructing it to replace all specific identifying details with generic terms: personal names become "the writer" or "a contact," specific dates become relative time references, named projects become "a project," and operational details are generalized.
The two-stage design is deliberately sequential: the LLM receives text that already has names and phone numbers replaced with <PERSON1> placeholders. This reduces the cognitive load on the model (it does not need to detect PII, only summarize) and eliminates the risk of the model failing to redact an edge-case entity that Presidio already caught. Validation testing confirmed that pre-cleaned text with typed placeholders produced more coherent summaries than raw text with names still present, as Llama treated the placeholders as generic role indicators rather than trying to infer intent from names.
The summarization pipeline runs with full resume support: output is appended to CSV incrementally after each batch of 10 rows, allowing the multi-day process to be interrupted and restarted without data loss.
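A sketch of the rewriting step, assuming a local Ollama server with the model pulled. The exact prompt wording is illustrative, not the project's actual prompt; it encodes the substitution rules described above. Only the prompt builder is exercised here, since the HTTP call needs a running `ollama serve`.

```python
import json
import urllib.request

def build_rewrite_prompt(cleaned_email):
    """Structured prompt in the spirit of rewrite_email_summary(): the
    input already carries <PERSON1>-style placeholders, so the model only
    summarizes and generalizes (wording is an assumption)."""
    return (
        "Rewrite the following email as a short, fully anonymized summary. "
        "Refer to people as 'the writer' or 'a contact', use relative time "
        "references instead of specific dates, call named projects "
        "'a project', and generalize all operational details.\n\n"
        f"Email:\n{cleaned_email}\n\nSummary:"
    )

def summarize_local(cleaned_email, model="llama3.2",
                    url="http://localhost:11434/api/generate"):
    """Query the local Ollama REST API; no content leaves the machine."""
    payload = json.dumps({
        "model": model,
        "prompt": build_rewrite_prompt(cleaned_email),
        "stream": False,
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```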
| Stage | Time / 10 rows | Avg / row | Bottleneck |
|---|---|---|---|
| Presidio PII | ~10–20 sec | ~1–2 sec | Acceptable |
| Ollama Summarize | 347.80 sec (observed) | ~34.78 sec | CLI reload |
Topic modeling was applied at two stages of the pipeline, first on email subject lines, and second on the LLM-generated summaries, to automatically discover the major content themes present in the archive. LDA (Latent Dirichlet Allocation) surfaces keyword clusters; Llama 3.2 then converts those clusters into human-readable topic labels, making the archive's content map immediately interpretable to archivists and researchers with no NLP background.
max_df=0.9 (ignore near-universal terms), min_df=2 (ignore rare one-offs). Produces a document-term matrix focused on meaningful mid-frequency vocabulary.

n_components=5 (subject lines) or n_components=10 (summaries). Each document is assigned to its dominant topic via argmax on the topic probability distribution.

power, plant, project, gas, line, procurement, schedule
"ABBOTT Power Plant Procurement Project"
work, order, app, data, system, request, maintenance
"Work Order & System Management"
steam, gas, plant, maintenance, controls, tunnel, service
"Steam Plant Operations & Controls"
boiler, report, fuel, meeting, oil, gas, bid, plan
"Energy Contract & Fuel Planning"
"OSHA Safety Record Keeping"
"Employee Issue Resolution"
"Employee Relations & HR"
"Probationary Onboarding Process"
Beyond anonymization and topic discovery, the pipeline includes analytical tools that allow archivists and researchers to explore the structure of the archive without accessing any individual message content, enabling metadata-level discovery as a precursor to requesting access to specific materials.
The temporal analysis module parses the sent_email field into proper datetime objects using pd.to_datetime(..., errors="coerce"); the errors="coerce" argument is essential for a 20-year archive where date formats vary significantly across email clients and eras. Messages are then aggregated by Year-Month period and visualized as horizontal bar charts showing communication volume over time.
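The aggregation step can be sketched as follows; unparseable dates become NaT and are dropped rather than crashing the run.

```python
import pandas as pd

def monthly_volume(df, date_col="sent_email"):
    """Coerce mixed-era date strings to datetimes and count messages per
    Year-Month period (the sent_email column name is from the pipeline)."""
    dates = pd.to_datetime(df[date_col], errors="coerce")
    return dates.dropna().dt.to_period("M").value_counts().sort_index()
```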
The fuzzy sender search module uses fuzzywuzzy partial_ratio matching to identify all messages from a specified sender, even when the sender's name appears in different formats across the archive: "Smith, Jane", "Jane Smith (Dept)", and "J. Smith <jsmith@org.edu>" all match the same query at a threshold of 80.
Monthly email volume charted across the full 20-year archive span. Identifies periods of high institutional activity, project phases, and communication gaps. Useful for archivists building finding aids and for historians contextualizing the archive's contents.
For any target sender, fuzzy matching surfaces all relevant messages, which are then aggregated and charted by month. Enables researchers to trace an individual's communication activity over time without accessing message content.
fuzzywuzzy's partial_ratio computes similarity of the best-matching substring rather than the full string. This catches short names like "Smith, J." against long formatted strings like "Smith, Jane A. (Engineering)", a critical distinction for inconsistently formatted archival sender fields.
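For intuition, here is a stdlib approximation of partial_ratio using difflib (the pipeline itself uses fuzzywuzzy): slide a window the length of the shorter string across the longer one and keep the best match, scored 0-100.

```python
from difflib import SequenceMatcher

def partial_ratio(needle, haystack):
    """Approximate fuzzywuzzy's partial_ratio: similarity of the shorter
    string against its best-matching window in the longer string."""
    shorter, longer = sorted((needle.lower(), haystack.lower()), key=len)
    if not shorter:
        return 0
    best = 0.0
    for i in range(len(longer) - len(shorter) + 1):
        window = longer[i:i + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(best * 100)
```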
Fuzzy search across ~79,676 message rows for a single sender returned 386 matching records, demonstrating both the precision of the matching approach and the feasibility of targeted retrieval at scale without full-text indexing.
Presidio catches structured PII reliably (names, phone numbers, SSNs) but cannot handle contextual sensitivity. A message referencing "the boiler 7 incident on North Campus" may identify an individual even with names removed. LLM rewriting generalizes this residual context in a way no rule-based system can.
Llama 3.2 is deployed on-device via Ollama. No email content, no names, no institutional data is sent to any external API. This is non-negotiable for an archive containing real personal correspondence. The privacy guarantee is enforced architecturally, not by policy.
All EML extraction results are accumulated in a Python list before the single pd.DataFrame() call. Appending to a DataFrame in a loop is O(n²) due to reallocation at each step; accumulating in a list and constructing the DataFrame once reduces processing time for 44,677 files by orders of magnitude.
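The pattern in miniature (the loop stands in for the 44,677-file walk):

```python
import pandas as pd

# Accumulate rows in a plain list, then build the DataFrame once.
# Appending to a DataFrame inside the loop copies all existing rows on
# every iteration (O(n^2) overall); this pattern is O(n).
records = []
for i in range(1000):
    records.append({"path": f"/archive/msg_{i}.eml", "body": "..."})
df = pd.DataFrame(records)
```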
Every message retains its full original file path, which encodes the PST folder hierarchy. Downstream splitting creates Folder_N columns that preserve organizational context, department, project, and topic area, as queryable structured metadata without any manual tagging.
PII spans are replaced from right to left through each message string. Replacing earlier spans first shifts the character positions of all subsequent spans, producing incorrect replacements. Processing in reverse order keeps all character offsets accurate throughout the substitution pass.
All file reads and byte decoding use errors='replace' rather than 'strict' or 'ignore'. An archive spanning 20+ years contains emails from many clients, encodings, and eras. 'replace' substitutes malformed bytes with the Unicode replacement character, visible and detectable, rather than silently dropping content or crashing.
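The three decode modes side by side, on bytes that are invalid as UTF-8:

```python
# 'replace' substitutes each malformed byte with U+FFFD instead of
# raising (as 'strict' would) or silently dropping it (as 'ignore' would).
raw = b"budget \xff\xfe update"   # \xff and \xfe are invalid UTF-8

strict_fails = False
try:
    raw.decode("utf-8", errors="strict")
except UnicodeDecodeError:
    strict_fails = True          # strict: the whole file read crashes

replaced = raw.decode("utf-8", errors="replace")  # visible, detectable
ignored = raw.decode("utf-8", errors="ignore")    # content silently lost
```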
Beyond the core approach implemented, the project evaluated and documented a comprehensive landscape of redaction methodologies. This serves as a technical reference and roadmap for future iterations of the pipeline, and documents the trade-offs that informed the current design choices.
| Method | Approach | Libraries | Best For |
|---|---|---|---|
| Word Frequency / TF-IDF | Redact infrequently appearing words, rare words are more likely to be identifying | NLTK, scikit-learn | Fast baseline; low resource environments |
| Named Entity Recognition (NER) | Pre-trained NER models detect and redact names, orgs, locations, phone numbers | spaCy, Flair, Presidio, HuggingFace | High recall on structured PII |
| Regex Pattern Matching | Regular expressions target structured patterns: email addresses, SSNs, phone numbers | re, pandas | Fast, deterministic, auditable |
| Differential Privacy | Laplace noise added to word frequencies, reducing re-identification risk mathematically | PySyft, DiffPrivLib | Formal privacy guarantees; research datasets |
| Contextual Embeddings (BERT/RoBERTa) | Transformer models understand word context to detect sensitivity not visible to regex | transformers, sentence-transformers | Highest accuracy; context-dependent PII |
| LDA Topic Modeling | Outlier words that don't belong to common topics flagged for redaction | sklearn, gensim | Identifying rare, potentially identifying vocabulary |
| Word Replacement / Anonymization | Sensitive terms replaced with synthetic placeholders preserving document structure | faker, textacy, spacy-anonymizer | Readability preservation; research access |
| LLM Rewriting (Implemented) | Local LLM rewrites email as anonymized plain-language summary | Llama 3.2, Ollama | Residual contextual sensitivity; research-ready summaries |
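To make the regex row of the table concrete, here is a minimal sketch of that method. The patterns are illustrative, US-centric, and deliberately narrow; production use needs broader, locale-aware patterns.

```python
import re

# Illustrative patterns only; real deployments need far more coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def regex_redact(text):
    """Deterministic, auditable redaction: each match is replaced with a
    typed label, so reviewers can see exactly what rule fired where."""
    for label, pat in PATTERNS.items():
        text = pat.sub(f"<{label}>", text)
    return text
```

The determinism is the draw: the same input always redacts identically, which is what makes this method auditable in a way statistical detectors are not.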
Embed all LLM summaries using a sentence transformer model and build a vector index (FAISS or ChromaDB) to enable semantic query-based retrieval. Researchers could query "correspondence about boiler safety compliance" and surface relevant anonymized summaries without any keyword matching.
Auto-generate structured finding aid documents (EAD XML or plain text) describing the contents of each archive batch, suitable for submission to institutional repositories. Topic clusters and temporal analysis outputs form the raw material for machine-generated finding aids.
Apply the same ingest-extract-anonymize-summarize pipeline to non-email file types: Word documents, PDFs, scanned images (via OCR), and spreadsheets. The meeting that scoped this project identified this as the primary expansion path for processing full institutional digital legacies.
Presidio processing parallelized via multiprocessing or joblib, projected 4 to 8x speedup on multi-core hardware. Ollama switched to server mode with REST API calls to eliminate model reload overhead, projected approximately 10x speedup for LLM summarization stage.
Evaluate differential privacy techniques (Laplace noise injection via DiffPrivLib) for production deployment where formal mathematical privacy guarantees are required. Develop an evaluation framework comparing readability preservation scores vs. privacy protection metrics across approaches.
Connect pipeline output to institutional records management systems for automated retention scheduling and disposition tracking. Provide a researcher-facing web interface for searching and reviewing anonymized summaries without requiring direct data access.