A full-lifecycle predictive analytics system for the University of Illinois Chicago, combining machine learning classification of at-risk students with ARIMA time-series enrollment forecasting across 14 colleges and four student population segments.
Universities typically approach enrollment planning reactively, with short planning horizons and limited data-driven infrastructure for identifying students at risk before they leave. Tayler Erbe scoped, designed, and delivered a full predictive analytics platform for UIC to directly address both gaps — proactive retention risk identification and long-range enrollment forecasting.
The platform produced two interconnected systems: a machine learning classification system that identifies which currently enrolled students are likely to discontinue before the next Fall term, segmented by student level and term seasonality; and an ARIMA enrollment forecasting system that projects total Fall enrollment for incoming, continuing, graduating, and discontinuing populations across 14 UIC colleges with a 10-year horizon.
Tayler Erbe owned the full project lifecycle — from problem definition and data acquisition through feature engineering, modeling, validation, documentation, and executive presentation. Outputs spanned individual student-level risk scores for advisors, college-level projections for deans, and university-wide summaries for institutional leadership.
"The enrollment targets project at UIC was an idea that Tayler brought to life. She gathered interns and staff, analyzed data, evaluated multiple approaches, and delivered the best possible results with the data and time available, showcasing her impressive abilities."
Dimuthu Tilakaratne, Manager · AITS Performance Appraisal 2023-2024
Enrollment predictions generated through this project can be made at multiple levels of granularity, including individual student, department, college, and system-wide, giving institutional leadership the tools to plan staffing, facilities, and financial aid allocation well ahead of each Fall term.
Component 1: H2O Random Forest + Tree Ensemble classification of students likely to discontinue, segmented by student level (UG, Grad, Professional, Law) and term seasonality.
Component 2: Python and KNIME ARIMA models forecasting total enrollment by college across 14 UIC colleges, separately for incoming and continuing student populations.
All data was sourced from the university's Enterprise Data Warehouse (EDW), specifically the Oracle database DSPROD01. The wrangling workflow was built in KNIME, connecting to live Oracle tables and assembling the final dataset through a cascading series of left-joins anchored to a student base file.
The dataset spans Fall 2013 through Summer 2023, with term codes controlled via KNIME flow variables. The final combined dataset covered 189,516 graduate student rows (114 columns) and 429,121 undergraduate rows (110 columns). Each row represents a single student per term at the Census snapshot date.
get_next_term() computed the expected subsequent term for each student row. The target was labeled 1 if that term appears in the student's full enrollment history and 0 otherwise; additional trinary and year-level targets were also generated for alternate modeling approaches.

Enrollment history, completed terms, credit hours accumulated, current class standing, GPA at multiple levels, transfer credit, and progression toward degree. These time-in-program signals were the strongest predictors of continuation risk.
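The next-term computation and binary labeling can be sketched roughly as below. The term-code scheme (calendar year × 10 + a semester digit) and the helper names are illustrative assumptions for the example; the production script works against the EDW's actual term codes.

```python
# Illustrative sketch of the labeling step. The term-code scheme
# (year * 10 + semester digit) is an assumption, not the EDW's format.
FALL, SPRING, SUMMER = 1, 2, 3

def get_next_term(term_code: int) -> int:
    """Expected subsequent fall/spring term; summer gaps are not penalized."""
    year, sem = divmod(term_code, 10)
    if sem == FALL:
        return (year + 1) * 10 + SPRING   # e.g. Fall 2021 -> Spring 2022
    return year * 10 + FALL               # Spring/Summer -> next Fall

def label_continuation(term_code: int, history: set) -> int:
    """Binary target: 1 if the expected next term is in the student's history."""
    return int(get_next_term(term_code) in history)
```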
Demographic information, degree program, major, student level (undergraduate, graduate, professional, law), residency status, admit type, college affiliation, and special program designations such as first-generation and honors status.
Registration hold indicators, leave of absence history, placement exam performance (Math, English, Chemistry), ACT/SAT presence as a binary signal, high school background, and geographic origin data for incoming student cohorts.
Data preparation was one of the most consequential phases of the entire project. All data originates from the raw ALL_DATA_6162023 parquet file produced during wrangling. A single Python script then handles the full pipeline: appending engineered features, segmenting by student level for correct aggregation, labeling four variations of the target variable, and exporting multiple dataset formats for different modeling strategies.
A critical design decision was made to keep all term-count features relative to the student's current degree level. Without this, students who completed both undergraduate and graduate programs would accumulate inflated graduate term counts — corrupting the signal the model depends on most.
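A minimal pandas sketch of that level-relative counting, with hypothetical column names standing in for the EDW fields:

```python
import pandas as pd

# Hypothetical columns standing in for the EDW fields: one row per
# student per term, ordered chronologically.
df = pd.DataFrame({
    "edw_pers_id":   [1, 1, 1, 1],
    "student_level": ["UG", "UG", "GR", "GR"],  # finished UG, then grad school
    "term_order":    [1, 2, 3, 4],
})

df = df.sort_values(["edw_pers_id", "term_order"])
# Count terms WITHIN the current degree level so the graduate count
# restarts at 1 instead of inheriting the undergraduate history.
df["current_completed_terms"] = (
    df.groupby(["edw_pers_id", "student_level"]).cumcount() + 1
)
```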
Significant design effort went into how to label student outcomes. Three potential outcomes exist — discontinuation, graduation, and continuation — and the boundary between them is nuanced. Should summer non-enrollment count as dropout? Should graduation be treated separately from continuation, or grouped together as "success"? Four labeling variations were created and evaluated:
Labels continuation as 1 if enrolled in the immediately following fall/spring term. Summer skips are not penalized. Graduation = 2, dropout = 0.
Checks whether the student enrolled in any term within the following academic year — more forgiving of gaps. Graduation = 2, dropout = 0.
Looks only for fall term re-enrollment in the following year, isolating the strongest continuation signal independent of spring/summer behavior.
Collapses continuation and graduation into a single success label (1). Dropout = 0. Recommended approach — graduation and continuation share more feature similarity with each other than either does with dropout.
Although both dropout and graduation reduce enrollment, they do not share similar characteristics. Grouping them together as a single "not continuing" outcome would weaken the predictive signal for each. The binary success framing — where the model distinguishes successful students from those who discontinued — produced the most coherent feature clusters and was the recommended production target.
Raw labeled data with one row per student per term. No aggregation applied. Used for term-over-term prediction tasks where temporal granularity matters.
Aggregated to the most recent term snapshot per student per academic year. Reduces duplicate signals while preserving the most current state. Used for year-over-year forecasting.
One randomly selected term snapshot per student. Prevents overfitting caused by near-duplicate rows (stable features like sex, degree, and resident status remain constant across terms). Preserves real-world target variable distribution. Recommended dataset for model training.
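The random-snapshot sampling can be sketched as follows, assuming a pandas DataFrame keyed by EDW_PERS_ID (the seed and helper name are illustrative, not the project's code):

```python
import pandas as pd

# Illustrative sketch: keep ONE randomly chosen term snapshot per student,
# so near-duplicate rows (stable features repeated across terms) don't
# inflate the training data.
def one_row_per_student(df: pd.DataFrame, id_col: str = "edw_pers_id",
                        seed: int = 42) -> pd.DataFrame:
    shuffled = df.sample(frac=1, random_state=seed)  # shuffle all rows
    return shuffled.drop_duplicates(subset=id_col)   # first hit per student
```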
Feature selection was conducted using two complementary methods on the H2O Random Forest model: Variable Importance (scaled impurity-based importance from 50 decision trees) and SHAP values (game-theoretic attribution of each feature's contribution to individual predictions). Partial Dependence Plots were generated for the top three features in each student population segment to understand direction and shape of the relationships.
| Feature | Description |
|---|---|
| TOTAL_COMPLETED_TERMS | Total academic terms completed by the student |
| SEMESTER_CD | Current semester code (seasonality signal) |
| CURRENT_COMPLETED_ACAD_YEARS | Academic years completed at current enrollment |
| STUDENT_CURR_1_MAJOR_NAME | Current declared major program |
| CURRENT_COMPLETED_TERMS | Terms completed in current enrollment period |
| DEPT_NAME | Department of enrollment |
| SUM(STUDENT_TOT_REG_CREDIT_HOUR) | Cumulative registered credit hours |
| Feature | Description |
|---|---|
| TOTAL_COMPLETED_TERMS | Total academic terms completed |
| SEMESTER_CD | Current semester (fall/spring/summer) |
| CALC_CLS_DESC | Calculated class standing (Freshman/Sophomore/etc.) |
| CURRENT_COMPLETED_TERMS | Terms completed in current period |
| LEVEL_GPA_HOUR | Credit hours at current GPA level |
| LEVEL_GPA_QUAL_PT | Quality points at current GPA level |
| STUDENT_CURR_1_MAJOR_NAME | Current declared major |
Across both graduate and undergraduate populations, time-in-program features (total completed terms, current completed terms, academic years) consistently ranked as the most predictive. This finding directly informed the decision to also build ARIMA time-series models as a complementary forecasting approach.
The PDP for TOTAL_COMPLETED_TERMS showed a sharp nonlinear increase in continuation probability from 0 to ~8 terms, plateauing around 0.91 mean response beyond 10 terms. Students in their first 1 to 3 terms represent the highest dropout risk window.
Semester code showed a step-change pattern: continuation probability held flat through semesters 1 to 6, then dropped sharply at semester code 7+. This suggests that students last enrolled in later-academic-year semesters face meaningfully higher dropout risk entering Fall.
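A partial dependence curve like the TOTAL_COMPLETED_TERMS plot can be computed as sketched below; scikit-learn stands in for the project's H2O tooling, and the data is synthetic rather than the student dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

# Synthetic data stands in for the student dataset; feature 0 plays the
# role of TOTAL_COMPLETED_TERMS.
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           random_state=0)
model = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)

# Mean predicted probability as feature 0 is swept across its value grid,
# holding the joint distribution of the other features fixed.
res = partial_dependence(model, X, features=[0], kind="average")
grid = (res["grid_values"] if "grid_values" in res else res["values"])[0]
mean_response = res["average"][0]
```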
Two variables that initially ranked as the most important in an early H2O model run (GRAD_STATUS_DESC and AH_ACAD_YEAR_CD_LIST) were identified as term-dependent leakage risks rather than genuine student-level predictors. Both were dropped before final model training.
Two algorithm frameworks were evaluated to handle the high-cardinality categorical features in the dataset: H2O Distributed Random Forest (DRF) and CatBoost Gradient Boosted Trees. Both handle categorical features natively without one-hot encoding (CatBoost via target encoding with random permutation, H2O via its built-in categorical split handling), avoiding the dimensionality explosion that standard encoding would cause at this feature cardinality.
| | Pred: Discontinued (0) | Pred: Continued (1) | Error | Rate |
|---|---|---|---|---|
| Actual: Discontinued (0) | 31,497 | 4,943 | 0.1356 | 4,943 / 36,440 |
| Actual: Continued (1) | 4,074 | 134,052 | 0.0295 | 4,074 / 138,126 |
| Total | 35,571 | 138,995 | 0.0517 | 9,017 / 174,566 |
H2O model error for all three classes (Discontinued, Continued, Graduated) remained below 5% in cross-validation, versus CatBoost's 47.71% error on the Discontinued class specifically. The ability to accurately identify the discontinuation class was the primary evaluation criterion given the institutional use case.
Top 10 features ranked by scaled importance (0–1) from a 50-tree ensemble with 5-fold cross-validation. TOTAL_COMPLETED_TERMS dominates by a large margin, confirming that time-in-program is the primary driver of continuation probability.
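The scaled-importance ranking can be illustrated with scikit-learn as a stand-in for H2O's 50-tree forest (synthetic data, not the project's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# scikit-learn stand-in for H2O's impurity-based variable importance.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# "Scaled importance" as H2O reports it: impurity importance divided by
# the maximum, so the top feature scores exactly 1.0.
scaled = rf.feature_importances_ / rf.feature_importances_.max()
ranking = np.argsort(scaled)[::-1]   # feature indices, most important first
```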
The KNIME pipeline for the graduate model illustrates the complete end-to-end ML workflow: data ingested from Parquet, filtered via feature selection, balanced with SMOTE on the training partition, fitted with a Tree Ensemble Learner (semester-based), scored against the held-out test split, and routed to performance metrics, actual-vs-predicted visualizations, and a final classification output node.
After reading UNDERGRADUATE_TARGET_BY_YEAR.parquet, the pipeline applies two column filters (feature selection, then removing the identifier edw_pers_id), excludes the current year (2223) from training via a row filter, then fans out into three parallel semester-specific branches — each performing 70/30 partitioning, SMOTE oversampling on the minority (discontinued) class, and independent Tree Ensemble training.
Post-SMOTE Tree Ensemble validation results across the three undergraduate semester models. Class 0 = Discontinued; Class 1 = Continued/Enrolled. Overall accuracy ranges from 92–97%, with Cohen's κ peaking at 0.882 for Fall.
Line charts comparing the model's predicted count (teal) against the historical actual count (gold) from academic year 1213 to 2223. The x-axis uses KNIME academic year codes (e.g., 1213 = 2012–13, 2223 = 2022–23).
This project was led and architected by Tayler Erbe as part of the AITS Advanced Analytics team, with contributions from a team of data science interns across four workstreams. Tayler owned the overall project design, data wrangling pipeline, KNIME classification workflow, Python feature engineering, and stakeholder delivery.
KNIME data wrangling workflow · Python feature engineering and target labeling · KNIME classification model (Tree Ensemble, Random Forest, GBM) · Project architecture, stakeholder delivery, documentation
KNIME ARIMA model for projecting incoming student enrollment. Yield prediction (%NotEnrolled) for Undergrads, Grads, Professionals, and Law students by college and term.
Python ARIMA model for predicting continuing student enrollment outcomes by student level and college. Projection outputs, actual vs predicted graphs, performance metrics, and documentation.
Python model comparison analysis — CatBoost vs H2O evaluation, sampling methods, encoding strategies, cross-validation scoring, and documented recommendations for future optimization.
Python data exploration and feature importance analysis using SweetViz comparison reports. Student outcome probability modeling (high/medium/low risk classification by term).
Separate from the classification model, ARIMA (AutoRegressive Integrated Moving Average) time-series models were built to forecast total Fall enrollment numbers by student population and college. Two distinct implementations were developed: one in KNIME focused on incoming new students, and one in Python focused on projecting continuing, discontinuing, and graduating student counts across 14 UIC colleges.
Separate ARIMA models trained per college per student level. Final parameters selected via cross-referenced grid search across all ARIMA parameter combinations for each college and student segment.
Incoming, continuing, discontinuing, and graduating student populations each modeled separately, recognizing that their enrollment drivers and seasonality patterns differ significantly.
KNIME ARIMA workflow (incoming student yield) built by Anirudh Palakurthi. Python ARIMA for continuing student projections by college built by Azaan Barlas. Both implementations iterated through multiple model variants.
Initial ARIMA models were built with manually tuned p, d, q parameters, optimized per college and student level via grid search. Models were trained on historical enrollment data from 2013 through 2021-22, with separate parameter sets selected for each of the 14 UIC colleges to account for program-specific enrollment patterns and seasonality.
A second iteration extended the baseline models with exponential smoothing to handle combinations where ARIMA could not produce a reliable forecast, and incorporated Auto-ARIMA (automated parameter selection via AIC/BIC optimization) to systematically identify the best-fit model order for each college-level series. This iteration improved forecast stability and reduced manual tuning overhead across the 14 college segments.
The full 10-year forecast was divided into two student types — Continuing and Non-Continuing — on the premise that their enrollment drivers differ substantially. For each type, separate models were run across every combination of Student Level (Undergraduate, Graduate, Professional), College, and Term (Fall, Spring, Summer). In practice, this meant dozens of individual ARIMA models. For 12 specific college-level-term combinations, ARIMA failed to produce a reliable forecast and an ad-hoc Holt's Double Exponential Smoothing model was implemented as a replacement.
By splitting the enrollment data into Fall, Spring, and Summer subsets and running separate models on each, seasonality was effectively removed from each individual series. What remained was a trend-bearing, non-seasonal time series — exactly the use case Holt's method is designed for.
Holt's method extends simple exponential smoothing with two parameters: α (alpha) controls the weight given to the current level observation, and β (beta) controls the weight given to the current trend estimate. Both decay exponentially, giving more influence to recent observations without discarding older patterns entirely.
Alpha (α) and beta (β) were both swept from 0.05 to 1.0 in steps of 0.05. Every combination was evaluated against an 80/20 train-validation split using RMSE as the selection criterion. The pair yielding the lowest validation RMSE was retained as the best model for that combination.
After the best α and β were identified, the model was refit on the entire dataset (training + validation) to maximize the information available before projecting forward. The fitted model then produced a 10-period point forecast for each of the 12 problem combinations.
Mean Absolute Error (MAE) was computed on the validation set using the best-fit model. The forecast range was expressed as the point forecast ± MAE, providing a practical uncertainty band for each projected enrollment figure to communicate forecast confidence to stakeholders.
The KNIME-based ARIMA model projected the percentage of admitted students who would not enroll for each Fall term. Target variable (%NotEnrolled) was computed as the ratio of non-enrolling accepted students to total applicants, calculated separately for undergraduates, graduates, and professional students.
The Python ARIMA models forecasted yearly continuing-student enrollment counts for the 14 colleges. Key findings: Applied Health Sciences, Business Administration, and Engineering were expected to grow; Liberal Arts and Sciences and Public Health were projected to hold stable; Law was projected for significant growth; Medicine and Pharmacy showed slight projected declines.
H2O GBM projections performed reliably on historical data through academic year 2021-22. Projections for 2022-23 were identified as less reliable, a limitation documented in the project outputs. This honest limitation assessment drove the recommendation for domain-knowledge-informed feature refinement in future iterations.
ARIMA projections identified consistent enrollment growth expected for the UIC Extended Campus, highlighting the institutional importance of non-traditional student pipelines. This was surfaced as an actionable insight for strategic planning conversations with institutional leadership.
One-hot encoding was ruled out due to dimensionality explosion risk with features like STUDENT_CURR_1_MAJOR_NAME and DEPT_NAME. CatBoost handles high-cardinality features via target encoding with random permutation, which prevents overfitting and target leakage, while H2O's tree algorithms split on categorical levels natively; both avoid the need for explicit encoding.
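The permutation-based target encoding that CatBoost uses (ordered target statistics) can be illustrated with a minimal hand-rolled version; the function name and smoothing prior here are illustrative, not the library's internals.

```python
import numpy as np
import pandas as pd

# Sketch of CatBoost-style "ordered" target statistics: each row's encoding
# uses only the target values of rows that PRECEDE it in a random order,
# so no row's encoding ever sees its own label (prevents leakage).
def ordered_target_encode(cat: pd.Series, y: pd.Series, prior: float = 0.5,
                          seed: int = 0) -> pd.Series:
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(cat))
    sums, counts = {}, {}
    enc = np.empty(len(cat))
    for i in order:
        c = cat.iloc[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        enc[i] = (s + prior) / (n + 1)       # smoothed running category mean
        sums[c], counts[c] = s + y.iloc[i], n + 1
    return pd.Series(enc, index=cat.index)
```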
Rather than modeling at the term snapshot level, data was aggregated to the earliest semester snapshot per student per academic year. Analysis showed students with Fall as their last enrolled term had higher dropout probability, validating the term-segmented modeling approach and enabling cleaner seasonality isolation.
The discontinued student class was substantially underrepresented in the training data. SMOTE (Synthetic Minority Over-sampling Technique) was applied after random partitioning of the training set. Both H2O and CatBoost additionally apply internal class-weight adjustments. Ensuring accurate identification of the minority discontinuation class was the primary modeling objective.
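A minimal hand-rolled sketch of SMOTE's interpolation mechanics is below; the production pipeline used KNIME's SMOTE node on the training partition, and the k and seed values here are illustrative.

```python
import numpy as np

def smote(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Synthesize n_new minority-class rows by interpolating between a
    sampled minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    out = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = X_min[rng.integers(len(X_min))]
        dists = np.linalg.norm(X_min - a, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip a itself at index 0
        b = X_min[rng.choice(neighbours)]
        out[i] = a + rng.random() * (b - a)       # random point on segment a->b
    return out
```

Applying this after (never before) the train/test split keeps synthetic rows out of the evaluation data.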
Models were trained separately for Undergraduate, Graduate, Professional, and Law student populations, and further segmented by term (fall, spring, summer) within each level. This captured meaningful behavioral differences in dropout patterns across student types that a single pooled model would obscure.
Beyond aggregate projections, the classification model produced individual student-level risk scores, confidence measures, and probability estimates. Students were binned into high, medium, and low risk tiers for advisor use. All outputs were keyed to EDW_PERS_ID for operational integration into institutional advising systems.
The project explicitly documented its own limitations and produced concrete recommendations: incorporating domain knowledge for variable selection beyond model-derived importance, combining high-cardinality categories to reduce cardinality, and exploring ensemble approaches combining class-sensitivity-tuned models for the discontinued segment specifically.