End-to-end NLP pipeline that converts research proposal and award PDFs from the Kuali Research system into a structured, classified, analytical layer — the foundation for a Research Proposal Intelligence platform supporting Sponsored Programs Administration and the System's ~$517M annual research portfolio.
Universities collect enormous volumes of research-proposal content — narratives, scopes of work, budget justifications — but the material lives as unstructured PDF attachments inside a grants-management system. Decision makers in Sponsored Programs Administration can see that a proposal was submitted, but not what the proposal was actually about in any way a database can answer.
This project closes that gap. The pipeline ingests proposal and award attachments from Kuali Research, runs them through a structured NLP pathway, and produces an analytical layer where every proposal carries extracted keywords, an entity profile, and a predicted classification against the NSF HERDS research-field taxonomy — the same taxonomy used for federal research-activity reporting.
I scoped, designed, and built the pipeline end-to-end: Oracle integration across two security zones, PDF text extraction with format validation, a three-way keyword extraction ensemble, a sentence-transformer-based classifier, and the three-stage staging-table design that makes the output queryable by analysts without exposing the intermediate text.
This pipeline represents the first phase of a broader Research Proposal Intelligence partnership — moving from descriptive reporting of what was funded toward predictive understanding of why, so strategy decisions can be informed by the full text of what researchers actually proposed.
Project framing · AITS / SPA partnership, 2025
Every proposal and award in the University of Illinois System now carries a structured NLP annotation — keywords, entities, and HERDS field classification — that makes unstructured proposal content queryable for the first time. Outputs feed analyst dashboards and lay the foundation for a Research Proposal Intelligence platform.
Proposals: submitted proposal attachments, classified by research field and annotated for downstream analytics. The proposal run also rebuilds the governance filter that gates which proposals enter production.
Awards: funded award attachments, annotated the same way, producing the success-labeled corpus that future predictive work will train on.
The University of Illinois System manages roughly $517M in annual research funding. Small marginal improvements in how proposals are strategized, positioned, and supported translate into millions of dollars in additional direct research funding — and further millions in indirect cost recovery that sustains core institutional operations.
Until now, leadership could answer financial and operational questions about the portfolio (who submitted, what was funded, by which sponsor) but not analytical ones about the content of the work. The questions that matter for strategy — which research areas are our strongest submissions in, where are we underperforming relative to our capacity, which sponsors align with which kinds of proposals — all require the proposal text itself as a structured input.
This pipeline is the data infrastructure that makes those questions answerable. Every proposal carries a consistent, machine-generated summary of its research content, and every award carries the same. Paired with existing financial and administrative data, the combined layer supports descriptive analytics today and LLM-driven feature extraction tomorrow.
Proposals visible only as financial records. Content locked in PDF attachments. Strategic questions about research mix answered anecdotally, if at all.
Every proposal and award carries extracted keywords, entities, and a HERDS field classification. Content is queryable in Oracle alongside existing financial data.
Common thread with prior work: analytics and AI for institutional planning — not just operations. Builds directly on the Legislation LLM Feature Extraction methodology.
First phase of a Research Proposal Intelligence partnership. Infrastructure in place; next phase layers LLM-driven extraction and predictive modeling on top.
The pipeline spans two Oracle databases in different security zones — the Kuali Research source of truth and the analytics warehouse — with no dblink between them. Change detection is computed in Python via anti-join, which keeps the network boundary intact and makes the pipeline trivially re-runnable.
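The anti-join itself is simple: fetch the key sets from each database over its own connection, then keep only source keys the warehouse hasn't seen. A minimal sketch in plain Python (function and variable names are illustrative, not the pipeline's actual identifiers):

```python
def detect_new_attachments(source_keys, warehouse_keys):
    """Anti-join computed in Python: source keys absent from the warehouse.

    Each key set is fetched over its own connection (Kuali source vs.
    analytics warehouse), so no dblink crosses the security boundary.
    Re-running is safe: already-loaded keys simply drop out.
    """
    seen = set(warehouse_keys)
    return [k for k in source_keys if k not in seen]

# Example: three attachments in the source, one already in the warehouse.
new = detect_new_attachments(["A1", "A2", "A3"], ["A2"])
# new == ["A1", "A3"]
```

Because the comparison is a set membership test, the cost is linear in the number of keys, which is why re-runs are cheap.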
Processing is organized as service modules (one responsibility each — text cleaning, keyword extraction, HERDS classification, entity extraction, etc.) orchestrated by thin pipeline controllers. The output is three staging tables serving three audiences: raw-text for auditing, NLP-annotated for engineering, production-aggregated for analysts.
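The controller-plus-services shape can be sketched as below. This is a hypothetical illustration of the pattern, not the repository's actual classes: each service is a single-responsibility callable, injected so heavyweight resources can be shared, and the controller only handles ordering.

```python
class ProposalPipeline:
    """Thin controller: ordering only; all logic lives in the services."""

    def __init__(self, extractor, cleaner, keywords, classifier):
        # Services are injected, so expensive resources (e.g. an embedding
        # model) are constructed once and shared across pipelines.
        self.steps = [extractor, cleaner, keywords, classifier]

    def run(self, documents):
        results = []
        for doc in documents:
            for step in self.steps:
                doc = step(doc)        # each step returns an enriched record
            results.append(doc)
        return results

# Toy services as plain callables, just to show the data flow.
pipeline = ProposalPipeline(
    extractor=lambda d: {**d, "text": d["blob"].decode()},
    cleaner=lambda d: {**d, "text": d["text"].strip().lower()},
    keywords=lambda d: {**d, "keywords": sorted(set(d["text"].split()))},
    classifier=lambda d: {**d, "field": "Computer Science"},
)
out = pipeline.run([{"blob": b"  Compilers and Optimization  "}])
```

The payoff of this shape is testability: every service can be exercised in isolation, and the controller stays short enough to read in one sitting.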
Implementation details worth noting:

- Large CLOB writes bind with setinputsizes(CONTENT=oracledb.CLOB), which avoids ORA-01461 on large payloads.
- Not every attachment named .pdf is a valid PDF — a real pattern in the source system. Invalid payloads are tagged and routed out instead of crashing the batch.
- Stopword handling augments static dictionaries with wordfreq zipf-frequency thresholds for modern terminology the static dictionaries miss. A separate removal list strips budget boilerplate, place names, and common PI-name artifacts that would otherwise dominate keyword extraction.
- Classification blends semantic similarity, weighted 0.75 / 0.25 with a keyword-overlap score against curated taxonomy-keyword lists; a MIN_SCORE_THRESHOLD of 0.12 gates the final label.
- Writes are idempotent (INSERT … WHERE NOT EXISTS or MERGE), post-write row counts are verified, and rollback SQL is printed on every successful run.
- Cross-system key matching applies TO_NUMBER on the padded column.
- Logging runs through a PipelineLogger whose parquet output is consumed by a separate job-execution monitor. A human-readable text log is produced in parallel.

HERDS — the NSF Higher Education Research and Development Survey — is the taxonomy universities use to report federal research activity. Classifying proposals against it has a real-world anchor: it matches the categories leadership already reports on.
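The format validation that routes invalid payloads out of the batch comes down to a header comparison: real PDFs begin with the bytes %PDF, while DOCX files are ZIP containers beginning with PK. A minimal sketch of that kind of check (the function name and return codes are illustrative):

```python
def sniff_attachment(payload: bytes) -> str:
    """Classify a payload by its leading bytes, never by file extension."""
    if payload[:4] == b"%PDF":
        return "pdf"                  # safe to hand to the PDF parser
    if payload[:2] == b"PK":          # ZIP container: DOCX/XLSX/etc.
        return "zip_container"        # route to rejection with a reason code
    return "unknown"                  # route to rejection with a reason code

assert sniff_attachment(b"%PDF-1.7 ...") == "pdf"
assert sniff_attachment(b"PK\x03\x04...") == "zip_container"
```

Reading four bytes is effectively free compared to a failed parse, which is why the check sits at the front of the pipeline.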
I built a 26-field taxonomy with a curated keyword list per field, deduplicated so no keyword appears in more than one field (which would confuse the keyword-overlap signal). Overlaps were adjudicated per-keyword: optimization moved to Mathematics & Statistics, thermodynamics to Physics, policy to Political Science, and so on — eight reassignments in total — following the principle that each term should belong where it originates methodologically, not where it's applied.
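The no-keyword-in-two-fields invariant is easy to enforce mechanically. A sketch of the kind of guard check that can protect the taxonomy, assuming it is represented as a dict mapping field name to keyword list (the actual structure in data/herds_taxonomy.py may differ):

```python
from collections import Counter

def find_duplicate_keywords(taxonomy):
    """Return keywords appearing in more than one field's list."""
    counts = Counter(
        kw for keywords in taxonomy.values() for kw in set(keywords)
    )
    return sorted(kw for kw, n in counts.items() if n > 1)

# Toy taxonomy with one violation: "optimization" claimed by two fields.
toy = {
    "Mathematics & Statistics": ["optimization", "topology"],
    "Computer Science": ["optimization", "compilers"],
}
dups = find_duplicate_keywords(toy)
# dups == ["optimization"] -> adjudicate the term into exactly one field
```

Run as a unit test, a check like this turns each future keyword addition into a pass/fail question rather than a manual review.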
The taxonomy embeddings are built once and cached to disk. The classifier performs a single matrix multiplication against all 26 fields per proposal rather than 26 separate similarity queries.
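With the taxonomy rows L2-normalized, cosine similarity to all 26 fields is one matrix-vector product, and the keyword-overlap signal blends in afterward. A numpy sketch under those assumptions (two toy fields instead of 26; the overlap normalization here is illustrative):

```python
import numpy as np

SEMANTIC_WEIGHT, KEYWORD_WEIGHT = 0.75, 0.25
MIN_SCORE_THRESHOLD = 0.12

def classify(field_matrix, field_names, field_keywords,
             proposal_vec, proposal_kws):
    """One matmul yields semantic similarity to every field at once;
    keyword overlap is blended in at 0.75 / 0.25, gated by a threshold."""
    v = proposal_vec / np.linalg.norm(proposal_vec)
    semantic = field_matrix @ v                    # cosine: rows unit-norm
    overlap = np.array([
        len(set(proposal_kws) & set(kws)) / max(len(kws), 1)
        for kws in field_keywords
    ])
    blended = SEMANTIC_WEIGHT * semantic + KEYWORD_WEIGHT * overlap
    best = int(np.argmax(blended))
    if blended[best] < MIN_SCORE_THRESHOLD:
        return None                                # no confident label
    return field_names[best]

# Toy taxonomy: two "fields" embedded in 3 dimensions, rows unit-norm.
fields = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
label = classify(fields, ["Physics", "Biology"],
                 [["quantum"], ["genome", "protein"]],
                 np.array([0.0, 2.0, 0.0]), ["genome"])
# label == "Biology"
```

Because the taxonomy matrix is cached, the per-proposal cost is one normalization, one matmul, and one argmax, regardless of how many fields the taxonomy grows to.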
A handful of decisions shaped the pipeline far more than the individual algorithms. These are the ones worth surfacing for anyone reading the code.
- The embedding model is shared, not duplicated. Loading all-MiniLM-L6-v2 twice costs ~500MB of RAM and 30 seconds of startup for no benefit, so a single instance is constructed once and passed into both services. Small change, big win on memory-constrained runs.
- Writes are idempotent via WHERE NOT EXISTS or MERGE. Re-running the pipeline on an already-processed batch is a no-op. The one exception — the production TRUNCATE — is double-gated behind dry_run=False AND confirm_production=True flags that must both be set explicitly.
- File formats are validated before parsing. Some .pdf attachments in the source system are actually DOCX files with the wrong extension. PyMuPDF raises on these. A four-byte magic-number check up front catches them, and they're routed to a rejection bucket with a reason code instead of crashing the batch.
- The taxonomy is versioned as code: the 26 fields and their deduplicated keyword lists live in data/herds_taxonomy.py.
- The stopword vocabulary combines static dictionaries with a wordfreq zipf threshold for modern terminology. Built once by scripts/build_vocab_cache.py and serialized to disk.

This pipeline is the infrastructure layer. The intelligence layer on top of it is where the business value compounds.
Phase 2: LLM feature extraction. Move from keyword-level summaries to structured-field extraction — research aims, hypotheses, proposed methodologies, collaboration structures — using the same methodology I developed for Legislation LLM Feature Extraction. Each proposal yields a structured record that joins cleanly to the financial layer.
Phase 3: Predictive modeling. With the combined content-plus-outcome corpus, model the characteristics associated with funding success. Identify where the System is strong relative to the national field, where it's underperforming relative to its own capacity, and which sponsor–research-area combinations are the highest-leverage strategic targets.
Phase 4: Decision interface. Surface the insights to SPA and research-administration leadership through dashboards and retrieval interfaces, supporting proposal-strategy decisions at the point of drafting — not months later in a retrospective report.
Decision intelligence for institutional planning and strategy — not just operations. This project extends the same pattern as my UIC enrollment work and the legislative analysis platform: take a high-stakes, content-heavy domain where leadership currently decides on intuition, and build the data infrastructure that lets them decide on evidence.
At a ~$517M annual research portfolio, marginal improvements in proposal strategy translate into millions of dollars of additional direct funding — plus further millions in indirect cost recovery that sustains core institutional operations. The return on getting this right is measured in research programs, not percentage points.
The repository contains the production implementation scrubbed of credentials and infrastructure specifics. Source paths, schema names, and service accounts are parameterized through configs/database.yaml (excluded from version control; a .example template is committed).
research-proposal-pipeline/
├── configs/                    # YAML configuration + loader (database.yaml gitignored)
├── ingestion/                  # Oracle connection, BLOB fetch, change detection
├── orchestration/              # Pipeline controllers — proposals, awards, NLP
├── services/                   # Single-responsibility NLP + DB service modules
│   ├── text_cleaning_service.py
│   ├── text_filter_service.py
│   ├── text_extraction_service.py
│   ├── keyword_service.py                 # TF-IDF + KeyBERT + LDA ensemble
│   ├── herds_classification_service.py    # Semantic + keyword blend
│   ├── entity_extraction_service.py
│   ├── change_detection_service.py
│   ├── table_build_service.py
│   ├── metanode_service.py                # Governance filter rebuild
│   └── ...
├── utils/                      # Batching, logging, file helpers
├── data/
│   └── herds_taxonomy.py       # 26 fields, deduplicated keyword lists
├── scripts/                    # Vocabulary cache builder, schema inspector
├── notebooks/
│   └── main.ipynb              # Pipeline entry point / exploratory harness
├── main.py                     # Same pipeline, script form
├── docs/
│   ├── architecture.md
│   ├── roadmap.md
│   └── portfolio.html          # This page
└── README.md