An end-to-end applied AI pipeline for processing a 20-year institutional email archive, converting raw legacy PST and EML formats into research-ready, fully anonymized, and semantically searchable records using NLP, PII detection, topic modeling, and locally deployed LLM summarization. Initiated through a direct referral from institutional leadership and delivered as a working prototype with full scalability analysis.
Universities and large institutions accumulate decades of digital correspondence from faculty, staff, and administrators who retire, leave, or pass away. This material often contains historically significant records, operational decisions, project documentation, and institutional knowledge, but sits entirely inaccessible because it arrives in bulk, in legacy formats, with no metadata, and cannot be manually reviewed at scale.
This project was initiated in December 2024 when the Associate Director for Information Governance at RIMS reached out directly, on the recommendation of a senior colleague. RIMS had been working with email archiving and e-discovery tools for several years and had developed a word frequency-based redaction strategy. The outreach was an explicit request for advanced data science and applied AI expertise to push that work further. The project was formally scoped in an April 2025 working meeting attended by the University Archivist, the Associate Director for Information Governance, and Dimuthu Tilakaratne, who now serves as Chief Data and Analytics Officer (CDAO) for the University of Illinois System. That meeting concluded with Tayler being tasked with designing and testing the full automated archival pipeline.
The resulting pipeline takes raw .pst legacy archive files, converts them to structured data, extracts and cleans individual messages from email threads, detects and replaces all personally identifiable information using named entity recognition, and generates fully anonymized plain-language summaries of each message using a locally deployed large language model. The output is a dataset that can be made available to researchers without any manual review step.
The pipeline was designed from first principles with privacy as a non-negotiable constraint: no data leaves the institution's environment at any stage, and the two-layer anonymization approach ensures both structured PII (names, phone numbers, emails) and unstructured contextual sensitivity are addressed.
"Tayler also found novel solutions to these challenges by researching and identifying Python libraries and large language models that can be quickly leveraged. After meeting with Tayler, each of them was excited and inspired by her thoughts and engaged with her to develop solutions iteratively."
Dimuthu Tilakaratne, CDAO, University of Illinois System · Performance Appraisal 2024 to 2025
Digital archives from former staff arrive in bulk, often decades old, in mixed formats, with no metadata, and potentially containing sensitive personal information. Manual processing is not scalable. Content typically sits untouched for years, making historically significant institutional records completely inaccessible to researchers. This project was scoped to solve exactly that problem at institutional scale.
Layer 1, Presidio NER: Microsoft's presidio-analyzer detects and replaces structured PII (names, phone numbers, email addresses, dates, SSNs) with numbered placeholders like <PERSON1>, <PHONE1>.
Layer 2, LLM Rewriting: Llama 3.2, running locally via Ollama, rewrites each message as a fully anonymized plain-language summary, handling contextual and unstructured sensitive content that rule-based systems miss.
The entire pipeline runs on local infrastructure. No email content, no PII, and no archive data is sent to any external API or cloud service at any stage of processing. The Llama 3.2 model is deployed and queried entirely on-device via Ollama.
This project contributed to an Extraordinary / Distinguished performance rating from Dimuthu Tilakaratne (now CDAO, University of Illinois System) in 2023 to 2024, and a subsequent AITS Excellence Award finalist nomination. The work was cited specifically for bringing novel AI solutions to complex institutional challenges and inspiring stakeholder engagement.
The archive originated as a set of legacy .pst (Personal Storage Table) files, Microsoft Outlook's proprietary binary archive format. These were first converted to individual .eml files using an external conversion tool, preserving the original Outlook folder hierarchy in the resulting directory structure. A single Python notebook then recursively walked the directory tree and extracted the plain-text body from every .eml file.
The key design decision at ingestion was to preserve the full absolute file path for every message. The folder names in the original archive carried meaningful organizational metadata (project names, department labels, topic categories) that would have been lost if only the filename was retained. This path is later split into individual Folder_N columns in the downstream pipeline.
The ingestion script handled MIME multipart emails, single-part plain text emails, charset encoding variations across 20+ years of archive content, and attachment detection, all with graceful error handling that allowed processing to continue past any individual file failures.
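A minimal sketch of the ingestion approach using the stdlib `email` module. The function names are illustrative, not the project's actual code; the multipart handling, `errors` behavior, and per-file failure tolerance mirror what the section describes.

```python
import email
from email import policy
from pathlib import Path

def extract_body(eml_path):
    """Parse one .eml file and return (absolute path, plain-text body).

    Multipart-aware via get_body(); parsing failures are handled by the
    caller so one bad file never stops the walk.
    """
    raw = Path(eml_path).read_bytes()
    msg = email.message_from_bytes(raw, policy=policy.default)
    if msg.is_multipart():
        part = msg.get_body(preferencelist=("plain",))
        body = part.get_content() if part is not None else ""
    else:
        body = msg.get_content()
    return str(Path(eml_path).resolve()), body

def walk_archive(root):
    """Recursively collect (path, body) rows, skipping unreadable files."""
    rows = []
    for p in Path(root).rglob("*.eml"):
        try:
            rows.append(extract_body(p))
        except Exception:
            continue  # graceful per-file failure handling
    return rows
```

Keeping the absolute path in each row is what later feeds the Folder_N metadata columns.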
Each extracted body was then split on the `-----Original Message-----` delimiter and exploded to one row per message segment using pandas `.explode()`: 44,677 files → 79,676 individual message rows.

Privacy protection was the most technically demanding aspect of the project. The archive contained 20+ years of real personal correspondence: names, phone numbers, email addresses, dates, and potentially Social Security Numbers appear throughout. A two-stage anonymization architecture was designed: a structured PII extraction pass using Microsoft's presidio-analyzer, followed by an LLM rewriting pass using Llama 3.2 to handle the residual unstructured contextual sensitivity that rule-based systems cannot address.
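The thread-splitting step can be sketched in a few lines of pandas; the `Body` column name is an assumption for illustration.

```python
import pandas as pd

DELIM = "-----Original Message-----"

def split_threads(df, body_col="Body"):
    """Split each email body on the reply delimiter and explode so every
    quoted message segment becomes its own row (column name assumed)."""
    out = df.copy()
    out[body_col] = out[body_col].str.split(DELIM)
    out = out.explode(body_col).reset_index(drop=True)
    out[body_col] = out[body_col].str.strip()
    return out
```

All other columns (path, sender, date) are duplicated onto each exploded segment, which is what lets downstream steps treat segments as independent messages.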
The core component of the PII extraction stage is Presidio's AnalyzerEngine, which scans unstructured text and identifies sensitive entities using a combination of pre-trained NLP models and regex-based pattern matching. Five entity types are mapped to dedicated output columns:
Entities replaced with typed, numbered tokens: <PERSON1>, <PERSON2>, <DATE_TIME1>, and so on. The original values are stored in mapping dictionaries, so every substitution is fully auditable. Preserves document structure while enabling full auditability.
Detected entities removed entirely from text, replaced with empty space. Produces the most conservative output. Used when downstream analysis does not require any information about the removed content.
PII spans replaced with block characters (████) matching the original character length. Preserves document structure visually while making all sensitive content unreadable. Useful for document review workflows.
Phone numbers masked to show only last 4 digits, ***-***-1234. Preserves formatting characters while removing identifying digits. Suitable for audit trails where partial identification is acceptable.
PERSON entities converted to irreversible cryptographic hashes. Enables cross-document communication pattern analysis (same person always produces same hash) without exposing any identifiable information.
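The numbered-token strategy can be sketched with plain Python. The spans here are hardcoded stand-ins for what a Presidio `AnalyzerEngine` pass returns; tokens are numbered left to right, then substituted right to left so earlier character offsets stay valid while the string is rewritten.

```python
def replace_with_tokens(text, spans):
    """Replace detected PII spans with typed, numbered placeholders.

    `spans` is a list of (start, end, entity_type) tuples, as produced by
    an analyzer pass (hardcoded in the test below for illustration).
    Returns the rewritten text and an auditable token -> original mapping.
    """
    spans = sorted(spans, key=lambda s: s[0])
    counters, mapping, tokens = {}, {}, []
    for start, end, etype in spans:
        counters[etype] = counters.get(etype, 0) + 1
        token = f"<{etype}{counters[etype]}>"
        mapping[token] = text[start:end]
        tokens.append((start, end, token))
    # Substitute right to left so earlier offsets remain accurate.
    for start, end, token in reversed(tokens):
        text = text[:start] + token + text[end:]
    return text, mapping
```

The same reverse-order substitution works unchanged for the removal, blackout, and masking strategies; only the replacement string differs.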
Full dataset processed in batches of 10 rows. The pipeline writes output incrementally; if interrupted, processing resumes from the last completed row. Critical for a dataset of ~80K rows at ~1–2 seconds per row of processing time.
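A minimal sketch of the resume pattern: count the rows already in the output file, skip that many inputs, and flush after each batch. The function and file names are illustrative; `process` stands in for the per-row anonymization step.

```python
import csv
import os

def process_resumable(rows, out_path, process, batch_size=10):
    """Append processed rows to CSV in batches; on restart, rows already
    present in the output are skipped, so a multi-day run resumes from
    the last completed row instead of starting over."""
    done = 0
    if os.path.exists(out_path):
        with open(out_path, newline="", encoding="utf-8") as f:
            done = sum(1 for _ in f)
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        batch = []
        for row in rows[done:]:
            batch.append([process(row)])
            if len(batch) == batch_size:
                writer.writerows(batch)
                f.flush()  # durability point: a crash loses at most one batch
                batch = []
        if batch:
            writer.writerows(batch)
```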
| Batch Time (10 rows) | Avg Per Row | 100,000 Rows | 1,000,000 Rows | Status |
|---|---|---|---|---|
| 10 seconds | 1.0 sec | ~27.8 hrs | ~278 hrs | Acceptable |
| 15 seconds | 1.5 sec | ~41.7 hrs | ~417 hrs | Acceptable |
| 20 seconds | 2.0 sec | ~55.6 hrs | ~556 hrs | Optimize |
Instantiating the AnalyzerEngine once outside the per-row loop eliminates repeated model loading and reduces per-row time by 40–60%.

Even after Presidio PII extraction, the cleaned text retains content that can be sensitive or identifying in context: references to specific projects, systems, operational incidents, or personal circumstances that are not captured by named-entity patterns. The second anonymization stage uses Llama 3.2, deployed locally via Ollama, to rewrite each message as a concise, fully de-identified summary.
The rewrite_email_summary() function sends each cleaned email to the model with a structured prompt instructing it to replace all specific identifying details with generic terms: personal names become "the writer" or "a contact," specific dates become relative time references, named projects become "a project," and operational details are generalized.
The two-stage design is deliberately sequential: the LLM receives text that already has names and phone numbers replaced with <PERSON1> placeholders. This reduces the cognitive load on the model (it does not need to detect PII, only summarize) and eliminates the risk of the model failing to redact an edge-case entity that Presidio already caught. Validation testing confirmed that pre-cleaned text with typed placeholders produced more coherent summaries than raw text with names still present, as Llama treated the placeholders as generic role indicators rather than trying to infer intent from names.
The summarization pipeline runs with full resume support: output is appended to CSV incrementally after each batch of 10 rows, allowing the multi-day process to be interrupted and restarted without data loss.
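A sketch of the rewriting step, assuming a local Ollama server with the model pulled. The exact prompt wording is illustrative, not the project's actual prompt; it encodes the substitution rules described above. Only the prompt builder is exercised here, since the HTTP call needs a running `ollama serve`.

```python
import json
import urllib.request

def build_rewrite_prompt(cleaned_email):
    """Structured prompt in the spirit of rewrite_email_summary(): the
    input already carries <PERSON1>-style placeholders, so the model only
    summarizes and generalizes (wording is an assumption)."""
    return (
        "Rewrite the following email as a short, fully anonymized summary. "
        "Refer to people as 'the writer' or 'a contact', use relative time "
        "references instead of specific dates, call named projects "
        "'a project', and generalize all operational details.\n\n"
        f"Email:\n{cleaned_email}\n\nSummary:"
    )

def summarize_local(cleaned_email, model="llama3.2",
                    url="http://localhost:11434/api/generate"):
    """Query the local Ollama REST API; no content leaves the machine."""
    payload = json.dumps({
        "model": model,
        "prompt": build_rewrite_prompt(cleaned_email),
        "stream": False,
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```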
| Stage | Time / 10 rows | Avg / row | Bottleneck |
|---|---|---|---|
| Presidio PII | ~10–20 sec | ~1–2 sec | Acceptable |
| Ollama Summarize | 347.80 sec (observed) | ~34.78 sec | CLI reload |
Topic modeling was applied at two stages of the pipeline, first on email subject lines, and second on the LLM-generated summaries, to automatically discover the major content themes present in the archive. LDA (Latent Dirichlet Allocation) surfaces keyword clusters; Llama 3.2 then converts those clusters into human-readable topic labels, making the archive's content map immediately interpretable to archivists and researchers with no NLP background.
max_df=0.9 (ignore near-universal terms), min_df=2 (ignore rare one-offs). Produces a document-term matrix focused on meaningful mid-frequency vocabulary.

n_components=5 (subject lines) or n_components=10 (summaries). Each document is assigned to its dominant topic via argmax on the topic probability distribution.

power, plant, project, gas, line, procurement, schedule
"ABBOTT Power Plant Procurement Project"
work, order, app, data, system, request, maintenance
"Work Order & System Management"
steam, gas, plant, maintenance, controls, tunnel, service
"Steam Plant Operations & Controls"
boiler, report, fuel, meeting, oil, gas, bid, plan
"Energy Contract & Fuel Planning"
"OSHA Safety Record Keeping"
"Employee Issue Resolution"
"Employee Relations & HR"
"Probationary Onboarding Process"
Beyond anonymization and topic discovery, the pipeline includes analytical tools that allow archivists and researchers to explore the structure of the archive without accessing any individual message content, enabling metadata-level discovery as a precursor to requesting access to specific materials.
The temporal analysis module parses the sent_email field into proper datetime objects using pd.to_datetime(..., errors="coerce"); the errors="coerce" argument is essential for a 20-year archive where date formats vary significantly across email clients and eras. Messages are then aggregated by Year-Month period and visualized as horizontal bar charts showing communication volume over time.
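The aggregation step can be sketched as follows; unparseable dates become NaT and are dropped rather than crashing the run.

```python
import pandas as pd

def monthly_volume(df, date_col="sent_email"):
    """Coerce mixed-era date strings to datetimes and count messages per
    Year-Month period (the sent_email column name is from the pipeline)."""
    dates = pd.to_datetime(df[date_col], errors="coerce")
    return dates.dropna().dt.to_period("M").value_counts().sort_index()
```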
The fuzzy sender search module uses fuzzywuzzy partial_ratio matching to identify all messages from a specified sender, even when the sender's name appears in different formats across the archive: "Smith, Jane", "Jane Smith (Dept)", and "J. Smith <jsmith@org.edu>" all match the same query at a threshold of 80.
Monthly email volume charted across the full 20-year archive span. Identifies periods of high institutional activity, project phases, and communication gaps. Useful for archivists building finding aids and for historians contextualizing the archive's contents.
For any target sender, fuzzy matching surfaces all relevant messages, which are then aggregated and charted by month. Enables researchers to trace an individual's communication activity over time without accessing message content.
fuzzywuzzy's partial_ratio computes similarity of the best-matching substring rather than the full string. This catches short names like "Smith, J." against long formatted strings like "Smith, Jane A. (Engineering)", a critical distinction for inconsistently formatted archival sender fields.
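For intuition, here is a stdlib approximation of partial_ratio using difflib (the pipeline itself uses fuzzywuzzy): slide a window the length of the shorter string across the longer one and keep the best match, scored 0-100.

```python
from difflib import SequenceMatcher

def partial_ratio(needle, haystack):
    """Approximate fuzzywuzzy's partial_ratio: similarity of the shorter
    string against its best-matching window in the longer string."""
    shorter, longer = sorted((needle.lower(), haystack.lower()), key=len)
    if not shorter:
        return 0
    best = 0.0
    for i in range(len(longer) - len(shorter) + 1):
        window = longer[i:i + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(best * 100)
```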
Fuzzy search across ~79,676 message rows for a single sender returned 386 matching records, demonstrating both the precision of the matching approach and the feasibility of targeted retrieval at scale without full-text indexing.
Presidio catches structured PII reliably (names, phone numbers, SSNs) but cannot handle contextual sensitivity. A message referencing "the boiler 7 incident on North Campus" may identify an individual even with names removed. LLM rewriting generalizes this residual context in a way no rule-based system can.
Llama 3.2 is deployed on-device via Ollama. No email content, no names, no institutional data is sent to any external API. This is non-negotiable for an archive containing real personal correspondence. The privacy guarantee is enforced architecturally, not by policy.
All EML extraction results are accumulated in a Python list before the single pd.DataFrame() call. Appending to a DataFrame in a loop is O(n²) due to reallocation at each step; accumulating in a list and constructing the DataFrame once reduces processing time for 44,677 files by orders of magnitude.
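The pattern in miniature (the loop stands in for the 44,677-file walk):

```python
import pandas as pd

# Accumulate rows in a plain list, then build the DataFrame once.
# Appending to a DataFrame inside the loop copies all existing rows on
# every iteration (O(n^2) overall); this pattern is O(n).
records = []
for i in range(1000):
    records.append({"path": f"/archive/msg_{i}.eml", "body": "..."})
df = pd.DataFrame(records)
```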
Every message retains its full original file path, which encodes the PST folder hierarchy. Downstream splitting creates Folder_N columns that preserve organizational context, department, project, and topic area, as queryable structured metadata without any manual tagging.
PII spans are replaced from right to left through each message string. Replacing earlier spans first shifts the character positions of all subsequent spans, producing incorrect replacements. Processing in reverse order keeps all character offsets accurate throughout the substitution pass.
All file reads and byte decoding use errors='replace' rather than 'strict' or 'ignore'. An archive spanning 20+ years contains emails from many clients, encodings, and eras. 'replace' substitutes malformed bytes with the Unicode replacement character, visible and detectable, rather than silently dropping content or crashing.
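The three decode modes side by side, on bytes that are invalid as UTF-8:

```python
# 'replace' substitutes each malformed byte with U+FFFD instead of
# raising (as 'strict' would) or silently dropping it (as 'ignore' would).
raw = b"budget \xff\xfe update"   # \xff and \xfe are invalid UTF-8

strict_fails = False
try:
    raw.decode("utf-8", errors="strict")
except UnicodeDecodeError:
    strict_fails = True          # strict: the whole file read crashes

replaced = raw.decode("utf-8", errors="replace")  # visible, detectable
ignored = raw.decode("utf-8", errors="ignore")    # content silently lost
```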
Beyond the core approach implemented, the project evaluated and documented a comprehensive landscape of redaction methodologies. This serves as a technical reference and roadmap for future iterations of the pipeline, and documents the trade-offs that informed the current design choices.
| Method | Approach | Libraries | Best For |
|---|---|---|---|
| Word Frequency / TF-IDF | Redact infrequently appearing words, rare words are more likely to be identifying | NLTK, scikit-learn | Fast baseline; low resource environments |
| Named Entity Recognition (NER) | Pre-trained NER models detect and redact names, orgs, locations, phone numbers | spaCy, Flair, Presidio, HuggingFace | High recall on structured PII |
| Regex Pattern Matching | Regular expressions target structured patterns: email addresses, SSNs, phone numbers | re, pandas | Fast, deterministic, auditable |
| Differential Privacy | Laplace noise added to word frequencies, reducing re-identification risk mathematically | PySyft, DiffPrivLib | Formal privacy guarantees; research datasets |
| Contextual Embeddings (BERT/RoBERTa) | Transformer models understand word context to detect sensitivity not visible to regex | transformers, sentence-transformers | Highest accuracy; context-dependent PII |
| LDA Topic Modeling | Outlier words that don't belong to common topics flagged for redaction | sklearn, gensim | Identifying rare, potentially identifying vocabulary |
| Word Replacement / Anonymization | Sensitive terms replaced with synthetic placeholders preserving document structure | faker, textacy, spacy-anonymizer | Readability preservation; research access |
| LLM Rewriting (Implemented) | Local LLM rewrites email as anonymized plain-language summary | Llama 3.2, Ollama | Residual contextual sensitivity; research-ready summaries |
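To make the regex row of the table concrete, here is a minimal sketch of that method. The patterns are illustrative, US-centric, and deliberately narrow; production use needs broader, locale-aware patterns.

```python
import re

# Illustrative patterns only; real deployments need far more coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def regex_redact(text):
    """Deterministic, auditable redaction: each match is replaced with a
    typed label, so reviewers can see exactly what rule fired where."""
    for label, pat in PATTERNS.items():
        text = pat.sub(f"<{label}>", text)
    return text
```

The determinism is the draw: the same input always redacts identically, which is what makes this method auditable in a way statistical detectors are not.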
Embed all LLM summaries using a sentence transformer model and build a vector index (FAISS or ChromaDB) to enable semantic query-based retrieval. Researchers could query "correspondence about boiler safety compliance" and surface relevant anonymized summaries without any keyword matching.
Auto-generate structured finding aid documents (EAD XML or plain text) describing the contents of each archive batch, suitable for submission to institutional repositories. Topic clusters and temporal analysis outputs form the raw material for machine-generated finding aids.
Apply the same ingest-extract-anonymize-summarize pipeline to non-email file types: Word documents, PDFs, scanned images (via OCR), and spreadsheets. The meeting that scoped this project identified this as the primary expansion path for processing full institutional digital legacies.
Presidio processing parallelized via multiprocessing or joblib, projected 4 to 8x speedup on multi-core hardware. Ollama switched to server mode with REST API calls to eliminate model reload overhead, projected approximately 10x speedup for LLM summarization stage.
Evaluate differential privacy techniques (Laplace noise injection via DiffPrivLib) for production deployment where formal mathematical privacy guarantees are required. Develop an evaluation framework comparing readability preservation scores vs. privacy protection metrics across approaches.
Connect pipeline output to institutional records management systems for automated retention scheduling and disposition tracking. Provide a researcher-facing web interface for searching and reviewing anonymized summaries without requiring direct data access.