Case Study · 2025 · University of Illinois System

Research Proposal
Processing Pipeline

End-to-end NLP pipeline that converts research proposal and award PDFs from the Kuali Research system into a structured, classified, analytical layer — the foundation for a Research Proposal Intelligence platform supporting Sponsored Programs Administration and the System's ~$517M annual research portfolio.

View Code on GitHub →
Institution
University of Illinois System
Partner
Sponsored Programs Administration (SPA)
Status
In Production · Intelligence Platform Emerging
Role
Senior Data Scientist · Lead Engineer
~$517M
Annual Research Portfolio
26
HERDS Taxonomy Fields Classified
3-way
Keyword Extraction Ensemble
Idempotent
Anti-Join Change Detection

Project Overview

Universities collect enormous volumes of research-proposal content — narratives, scopes of work, budget justifications — but the material lives as unstructured PDF attachments inside a grants-management system. Decision makers in Sponsored Programs Administration can see that a proposal was submitted, but not what the proposal was actually about in any way a database can answer.

This project closes that gap. The pipeline ingests proposal and award attachments from Kuali Research, runs them through a structured NLP pathway, and produces an analytical layer where every proposal carries extracted keywords, an entity profile, and a predicted classification against the NSF HERDS research-field taxonomy — the same taxonomy used for federal research-activity reporting.

I scoped, designed, and built the pipeline end-to-end: Oracle integration across two security zones, PDF text extraction with format validation, a three-way keyword extraction ensemble, a sentence-transformer-based classifier, and the three-stage staging-table design that makes the output queryable by analysts without exposing the intermediate text.

This pipeline represents the first phase of a broader Research Proposal Intelligence partnership — moving from descriptive reporting of what was funded toward predictive understanding of why, so strategy decisions can be informed by the full text of what researchers actually proposed.

Project framing · AITS / SPA partnership, 2025

Institutional Impact

Every proposal and award in the University of Illinois System now carries a structured NLP annotation — keywords, entities, and HERDS field classification — that makes unstructured proposal content queryable for the first time. Outputs feed analyst dashboards and lay the foundation for a Research Proposal Intelligence platform.

Two Parallel Pipelines

Proposals: submitted proposal attachments, classified by research field and annotated for downstream analytics. This pipeline also rebuilds the governance filter that gates which proposals enter production.

Awards: funded award attachments, annotated the same way, producing the success-labeled corpus that future predictive work will train on.

Why It Matters

The University of Illinois System manages roughly $517M in annual research funding. Small marginal improvements in how proposals are strategized, positioned, and supported translate into millions of dollars in additional direct research funding — and further millions in indirect cost recovery that sustains core institutional operations.

Until now, leadership could answer financial and operational questions about the portfolio (who submitted, what was funded, by which sponsor) but not analytical ones about the content of the work. The questions that matter for strategy — in which research areas are our submissions strongest, where are we underperforming relative to our capacity, which sponsors align with which kinds of proposals — all require the proposal text itself as a structured input.

This pipeline is the data infrastructure that makes those questions answerable. Every proposal carries a consistent, machine-generated summary of its research content, and every award carries the same. Paired with existing financial and administrative data, the combined layer supports descriptive analytics today and LLM-driven feature extraction tomorrow.

Before

Proposals visible only as financial records. Content locked in PDF attachments. Strategic questions about research mix answered anecdotally, if at all.

After

Every proposal and award carries extracted keywords, entities, and a HERDS field classification. Content is queryable in Oracle alongside existing financial data.

Decision Intelligence

Common thread with prior work: analytics and AI for institutional planning — not just operations. Builds directly on the Legislation LLM Feature Extraction methodology.

Emerging Platform

First phase of a Research Proposal Intelligence partnership. Infrastructure in place; next phase layers LLM-driven extraction and predictive modeling on top.

Architecture

The pipeline spans two Oracle databases in different security zones — the Kuali Research source of truth and the analytics warehouse — with no dblink between them. Change detection is computed in Python via anti-join, which keeps the network boundary intact and makes the pipeline trivially re-runnable.

Processing is organized as service modules (one responsibility each — text cleaning, keyword extraction, HERDS classification, entity extraction, etc.) orchestrated by thin pipeline controllers. The output is three staging tables serving three audiences: raw-text for auditing, NLP-annotated for engineering, production-aggregated for analysts.

01
Change Detection · Anti-Join in Python
Queries Kuali for all current proposal/award attachment IDs, queries the analytics staging tables for already-processed IDs, computes the delta in memory. No dblink between security zones. The delta is the work queue.
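A minimal sketch of that anti-join, assuming the two queries each return a flat list of attachment IDs (all names and values here are illustrative):

```python
def compute_delta(source_ids, processed_ids):
    """Anti-join in memory: IDs present in Kuali but absent from the
    analytics staging tables become the work queue."""
    return sorted(set(source_ids) - set(processed_ids))

# Illustrative attachment IDs standing in for the two queries' results.
kuali_ids = [101, 102, 103, 104]
staged_ids = [101, 103]
work_queue = compute_delta(kuali_ids, staged_ids)  # → [102, 104]
```

Because the delta is recomputed from current state on every run, re-running after a partial failure simply picks up the unprocessed remainder.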
02
BLOB Retrieval · Batched Oracle Pull
Downloads attachment BLOBs from Kuali in batches with chunked CLOB handling. setinputsizes(CONTENT=oracledb.CLOB) binding avoids ORA-01461 on large payloads.
03
Text Extraction · PDF with Magic-Byte Validation
PyMuPDF extraction with a magic-byte validator that catches DOCX files misnamed as .pdf — a real pattern in the source system. Invalid payloads are tagged and routed out instead of crashing the batch.
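The check can be sketched in a few lines — the magic numbers are the standard PDF and ZIP signatures (DOCX files are ZIP containers), while the routing labels are illustrative:

```python
PDF_MAGIC = b"%PDF"        # real PDFs begin with "%PDF-"
ZIP_MAGIC = b"PK\x03\x04"  # DOCX is a ZIP container, so misnamed
                           # .pdf attachments often start with this

def classify_payload(payload: bytes) -> str:
    """Route a payload by its leading bytes instead of its extension."""
    if payload.startswith(PDF_MAGIC):
        return "pdf"
    if payload.startswith(ZIP_MAGIC):
        return "docx_misnamed"
    return "unknown"
```

Anything that doesn't classify as a genuine PDF is tagged with a reason code and skipped rather than handed to PyMuPDF.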
04
Cleaning · Vocabulary-Backed Filter
Text is normalized and filtered against a pre-cached ~500K-word vocabulary — NLTK words ∪ WordNet lemmas, blended with wordfreq zipf-frequency thresholds for modern terminology the static dictionaries miss. A separate removal list strips budget boilerplate, place names, and common PI-name artifacts that would otherwise dominate keyword extraction.
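A toy sketch of the filter logic. Small dicts stand in here for the cached NLTK ∪ WordNet vocabulary and for wordfreq's zipf_frequency lookup, and the 3.0 threshold is a hypothetical value:

```python
# Stand-ins: in the pipeline VOCAB is the cached ~500K-word set and
# ZIPF comes from wordfreq.zipf_frequency(); both are toys here.
VOCAB = {"quantum", "sensor", "network"}
ZIPF = {"transformer": 4.1, "qzxv": 0.0}   # illustrative zipf scores
REMOVAL = {"illinois", "budget"}           # boilerplate / place names
ZIPF_MIN = 3.0                             # hypothetical threshold

def keep_token(token: str) -> bool:
    """Keep a token if it is boilerplate-free and either in the static
    vocabulary or frequent enough in modern usage."""
    t = token.lower()
    if t in REMOVAL:
        return False
    return t in VOCAB or ZIPF.get(t, 0.0) >= ZIPF_MIN

tokens = ["quantum", "transformer", "qzxv", "budget"]
kept = [t for t in tokens if keep_token(t)]  # → ["quantum", "transformer"]
```

The frequency fallback is what lets modern terms like "transformer" survive even though the static dictionaries predate them.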
05
Keyword Extraction · Three-Way Ensemble
Each method captures a different kind of signal, so the pipeline runs all three and reconciles: TF-IDF for term rarity, KeyBERT with MMR diversity=0.5 for semantic relevance with coverage, and LDA for latent topic structure. KeyBERT's embedding is shared with the HERDS classifier to keep memory bounded.
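One way to reconcile the three ranked lists is reciprocal-rank fusion — shown here as an illustrative stand-in, since the pipeline's actual reconciliation rule isn't spelled out above:

```python
from collections import defaultdict

def fuse_rankings(*ranked_lists, k=60, top_n=5):
    """Reciprocal-rank fusion: terms ranked highly by several methods
    accumulate the highest combined score."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, term in enumerate(ranking):
            scores[term] += 1.0 / (k + rank + 1)
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:top_n]

# Illustrative outputs from the three methods for one proposal.
tfidf   = ["photonics", "waveguide", "grant"]
keybert = ["photonics", "optical sensing", "waveguide"]
lda     = ["waveguide", "photonics", "fabrication"]
top = fuse_rankings(tfidf, keybert, lda, top_n=3)
# → ["photonics", "waveguide", "optical sensing"]
```

Terms that only one method surfaces (grant, fabrication) fall to the bottom, which is exactly the noise-damping the ensemble is after.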
06
HERDS Classification · Semantic + Keyword Blend
Each proposal is embedded with the same SentenceTransformer used by KeyBERT, then matrix-multiplied against a pre-computed taxonomy-embedding matrix (26 HERDS fields, one matrix op). The semantic score is blended 0.75 / 0.25 with a keyword-overlap score against curated taxonomy-keyword lists; a MIN_SCORE_THRESHOLD of 0.12 gates the final label.
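The scoring step in miniature, with toy 3-dimensional vectors standing in for the MiniLM embeddings and two fields standing in for all 26; the 0.75/0.25 weights and 0.12 gate are the pipeline's actual constants:

```python
import numpy as np

SEMANTIC_W, KEYWORD_W = 0.75, 0.25
MIN_SCORE_THRESHOLD = 0.12

def classify(doc_emb, field_embs, keyword_overlap, fields):
    """Cosine similarity against every field embedding in one matrix
    op, blended 0.75/0.25 with a keyword-overlap score, then gated."""
    doc = doc_emb / np.linalg.norm(doc_emb)
    mat = field_embs / np.linalg.norm(field_embs, axis=1, keepdims=True)
    semantic = mat @ doc                          # all fields at once
    blended = SEMANTIC_W * semantic + KEYWORD_W * keyword_overlap
    best = int(np.argmax(blended))
    return fields[best] if blended[best] >= MIN_SCORE_THRESHOLD else None

fields = ["Physics", "Economics"]                 # 2 of the 26 fields
field_embs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
label = classify(np.array([0.9, 0.1, 0.0]),       # toy proposal vector
                 field_embs,
                 np.array([0.4, 0.0]),            # keyword overlap per field
                 fields)                          # → "Physics"
```

A proposal whose blended best score falls below the gate gets no label rather than a low-confidence one.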
07
Entity Extraction · spaCy NER Profile
Named entities extracted per proposal — organizations, people, locations — producing an entity profile column used downstream for sponsor-alignment analysis and collaboration graphs.
08
Write · Three Staging Tables, MERGE Semantics
Writes to three staging tables serving three audiences: raw-text (engineering audit), NLP-annotated (downstream ML), and production-aggregated (analyst-facing view). Every write is idempotent (INSERT … WHERE NOT EXISTS or MERGE), post-write row counts are verified, and rollback SQL is printed on every successful run.
09
Governance Filter Rebuild
Rebuilds the filter table that gates which proposals enter the production analytics view, joining Kuali proposal / EPS / admin-details tables. Join handles a data-quality quirk — proposal numbers stored zero-padded as strings on one side and as integers on the other — via TO_NUMBER on the padded column.
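The quirk in miniature — the values are illustrative, but the mismatch is exactly what TO_NUMBER on the padded column resolves:

```python
# Illustrative values: one join side stores the proposal number as a
# zero-padded string, the other as an integer.
padded_side = "00012345"   # VARCHAR2, zero-padded
integer_side = 12345       # NUMBER

# A naive string comparison misses the match; TO_NUMBER('00012345')
# = 12345 is what actually aligns the rows.
assert padded_side != str(integer_side)
assert int(padded_side) == integer_side   # TO_NUMBER equivalent
```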
10
Structured Logging · Parquet to Monitor
Every stage logs through a canonical PipelineLogger whose parquet output is consumed by a separate job-execution monitor. A human-readable text log is produced in parallel.

The HERDS Taxonomy

HERDS — the NSF Higher Education Research and Development Survey — is the taxonomy universities use to report federal research activity. Classifying proposals against it has a real-world anchor: it matches the categories leadership already reports on.

I built a 26-field taxonomy with a curated keyword list per field, deduplicated so no keyword appears in more than one field (which would confuse the keyword-overlap signal). Overlaps were adjudicated per-keyword: optimization moved to Mathematics & Statistics, thermodynamics to Physics, policy to Political Science, and so on — eight reassignments in total — following the principle that each term should belong where it originates methodologically, not where it's applied.
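The deduplication invariant is easy to state as a check — the taxonomy slice below is a miniature illustration, not the full keyword lists:

```python
from collections import Counter

def duplicated_keywords(taxonomy):
    """Keywords appearing under more than one HERDS field; the
    invariant is that this comes back empty."""
    counts = Counter(kw for kws in taxonomy.values() for kw in kws)
    return sorted(kw for kw, n in counts.items() if n > 1)

# Miniature slice after adjudication: optimization lives only under
# Mathematics & Statistics, thermodynamics only under Physics.
taxonomy = {
    "Mathematics & Statistics": ["optimization", "topology"],
    "Industrial Engineering":   ["scheduling", "logistics"],
    "Physics":                  ["thermodynamics", "optics"],
}
duplicated_keywords(taxonomy)  # → []
```

Running a check like this after every taxonomy edit keeps a later keyword addition from silently reintroducing an overlap.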

The taxonomy embeddings are built once and cached to disk. The classifier performs a single matrix multiplication against all 26 fields per proposal rather than 26 separate similarity queries.

Classified fields
Agricultural Sciences
Anthropology
Architecture & Design
Astronomy
Biological Sciences
Chemistry
Civil Engineering
Communication
Computer Science
Earth Sciences
Economics
Education
Electrical Engineering
Industrial Engineering
Law
Mathematics & Statistics
Mechanical Engineering
Medical Sciences
Natural Resources
Nursing
Philosophy
Physics
Political Science
Psychology
Public Health
Sociology

Key Design Decisions

A handful of decisions shaped the pipeline far more than the individual algorithms. These are the ones worth surfacing for anyone reading the code.

Shared Transformer
KeyBERT and the HERDS classifier both need a sentence encoder. Loading two copies of all-MiniLM-L6-v2 costs ~500MB of RAM and 30 seconds of startup for no benefit, so a single instance is constructed once and passed into both services. Small change, big win on memory-constrained runs.
Anti-Join Over Dblink
Kuali and the analytics warehouse live in different security zones. Rather than chase a cross-zone database link (operationally expensive, policy-sensitive), change detection is computed in Python: pull both ID sets, diff in a set operation, use the delta as the work queue. This also means the pipeline can run from anywhere with credentials to both sides.
Ensemble over Single Method
No single keyword method is right for proposal text. TF-IDF rewards rarity but surfaces noise; KeyBERT gives semantic relevance but can cluster around the same idea; LDA captures latent topic structure but drifts on short documents. Running all three and reconciling produces a keyword set noticeably more useful than any one method alone.
0.75 / 0.25 Blend
The HERDS classifier blends a semantic-similarity score with a keyword-overlap score at a 0.75 / 0.25 ratio. Semantic dominates because taxonomy definitions are abstract and the embedding captures that; keyword overlap contributes a calibration signal for when taxonomy-specific vocabulary is actually present. Blending outperforms either signal alone on the validation set.
Three Staging Tables
Raw-text, NLP-annotated, and production-aggregated staging tables serve three different audiences and make debugging tractable. Engineering can trace any row in the production view back through its NLP annotations to the original extracted text. Analysts only see the production view.
Idempotency by Construction
Every insert is WHERE NOT EXISTS or MERGE. Re-running the pipeline on an already-processed batch is a no-op. The one exception — the production TRUNCATE — is double-gated behind dry_run=False AND confirm_production=True flags that must both be set explicitly.
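The pattern in miniature, with sqlite3 standing in for Oracle (the production writes use Oracle MERGE / INSERT … WHERE NOT EXISTS; table and column names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (attachment_id INTEGER, keywords TEXT)")

def idempotent_insert(conn, attachment_id, keywords):
    """Insert only if the ID is not already staged — re-running on an
    already-processed batch is a no-op."""
    conn.execute(
        """INSERT INTO staging (attachment_id, keywords)
           SELECT ?, ?
           WHERE NOT EXISTS
               (SELECT 1 FROM staging WHERE attachment_id = ?)""",
        (attachment_id, keywords, attachment_id),
    )

idempotent_insert(conn, 101, "photonics; waveguide")
idempotent_insert(conn, 101, "photonics; waveguide")  # no-op on re-run
rows = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]  # → 1
```

Idempotency lives in the SQL itself, so no orchestration-level bookkeeping is needed to make re-runs safe.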
Rollback SQL Inline
Every successful run prints the exact SQL that would undo it — row counts, IDs, the whole cleanup sequence. Recovery doesn't require reading code.
Magic-Byte Validation
A non-trivial number of .pdf attachments in the source system are actually DOCX files with the wrong extension. PyMuPDF raises on these. A four-byte magic-number check up front catches them, and they're routed to a rejection bucket with a reason code instead of crashing the batch.

Technology Stack

Data & Storage
Oracle python-oracledb SQL Parquet Pandas
NLP & ML
sentence-transformers KeyBERT scikit-learn gensim (LDA) spaCy NLTK wordfreq all-MiniLM-L6-v2
Infrastructure
Python 3.12 PyMuPDF PyYAML Jupyter Linux

What's Next

This pipeline is the infrastructure layer. The intelligence layer on top of it is where the business value compounds.

Phase 2: LLM feature extraction. Move from keyword-level summaries to structured-field extraction — research aims, hypotheses, proposed methodologies, collaboration structures — using the same methodology I developed for Legislation LLM Feature Extraction. Each proposal yields a structured record that joins cleanly to the financial layer.

Phase 3: Predictive modeling. With the combined content-plus-outcome corpus, model the characteristics associated with funding success. Identify where the System is strong relative to the national field, where it's underperforming relative to its own capacity, and which sponsor–research-area combinations are the highest-leverage strategic targets.

Phase 4: Decision interface. Surface the insights to SPA and research-administration leadership through dashboards and retrieval interfaces, supporting proposal-strategy decisions at the point of drafting — not months later in a retrospective report.

The Larger Thread

Decision intelligence for institutional planning and strategy — not just operations. This project extends the same pattern as my UIC enrollment work and the legislative analysis platform: take a high-stakes, content-heavy domain where leadership currently decides on intuition, and build the data infrastructure that lets them decide on evidence.

Scale Note

At a ~$517M annual research portfolio, even marginal improvements in proposal strategy translate into millions of dollars of additional direct funding and further millions in indirect cost recovery. The return on getting this right is measured in research programs, not percentage points.

Repository Layout

The repository contains the production implementation scrubbed of credentials and infrastructure specifics. Source paths, schema names, and service accounts are parameterized through configs/database.yaml (excluded from version control; a .example template is committed).

research-proposal-pipeline/
├── configs/              # YAML configuration + loader (database.yaml gitignored)
├── ingestion/            # Oracle connection, BLOB fetch, change detection
├── orchestration/        # Pipeline controllers — proposals, awards, NLP
├── services/             # Single-responsibility NLP + DB service modules
│   ├── text_cleaning_service.py
│   ├── text_filter_service.py
│   ├── text_extraction_service.py
│   ├── keyword_service.py              # TF-IDF + KeyBERT + LDA ensemble
│   ├── herds_classification_service.py # Semantic + keyword blend
│   ├── entity_extraction_service.py
│   ├── change_detection_service.py
│   ├── table_build_service.py
│   ├── metanode_service.py             # Governance filter rebuild
│   └── ...
├── utils/                # Batching, logging, file helpers
├── data/
│   └── herds_taxonomy.py # 26 fields, deduplicated keyword lists
├── scripts/              # Vocabulary cache builder, schema inspector
├── notebooks/
│   └── main.ipynb        # Pipeline entry point / exploratory harness
├── main.py               # Same pipeline, script form
├── docs/
│   ├── architecture.md
│   ├── roadmap.md
│   └── portfolio.html    # This page
└── README.md