A full-lifecycle predictive analytics system for the University of Illinois Chicago, combining machine learning classification of at-risk students with ARIMA time-series enrollment forecasting across 14 colleges and four student population segments.
Universities typically approach enrollment planning reactively, with short planning horizons and limited data-driven infrastructure for identifying students at risk before they leave. Tayler Erbe scoped, designed, and delivered a full predictive analytics platform for UIC to directly address both gaps — proactive retention risk identification and long-range enrollment forecasting.
The platform produced two interconnected systems: a machine learning classification system that identifies which currently enrolled students are likely to discontinue before the next Fall term, segmented by student level and term seasonality; and an ARIMA enrollment forecasting system that projects total Fall enrollment for incoming, continuing, graduating, and discontinuing populations across 14 UIC colleges with a 10-year horizon.
Tayler Erbe owned the full project lifecycle — from problem definition and data acquisition through feature engineering, modeling, validation, documentation, and executive presentation. Outputs spanned individual student-level risk scores for advisors, college-level projections for deans, and university-wide summaries for institutional leadership.
"The enrollment targets project at UIC was an idea that Tayler brought to life. She gathered interns and staff, analyzed data, evaluated multiple approaches, and delivered the best possible results with the data and time available, showcasing her impressive abilities."
Dimuthu Tilakaratne, Manager · AITS Performance Appraisal 2023-2024
Enrollment predictions generated through this project can be made at multiple levels of granularity, including individual student, department, college, and system-wide, giving institutional leadership the tools to plan staffing, facilities, and financial aid allocation well ahead of each Fall term.
Component 1: H2O Random Forest + Tree Ensemble classification of students likely to discontinue, segmented by student level (UG, Grad, Professional, Law) and term seasonality.
Component 2: Python and KNIME ARIMA models forecasting total enrollment by college across 14 UIC colleges, separately for incoming and continuing student populations.
All data was sourced from the university's Enterprise Data Warehouse (EDW), specifically the Oracle database DSPROD01. The wrangling workflow was built in KNIME, connecting to live Oracle tables and assembling the final dataset through a cascading series of left-joins anchored to a student base file.
The dataset spans Fall 2013 through Summer 2023, with term codes controlled via KNIME flow variables. The final combined dataset covered 189,516 graduate student rows (114 columns) and 429,121 undergraduate rows (110 columns). Each row represents a single student per term at the Census snapshot date.
get_next_term() computed the expected subsequent term for each student row. The target was labeled 1 if that term appears in the student's full enrollment history and 0 otherwise; additional trinary and year-level targets were also generated for alternate modeling approaches.

Enrollment history, completed terms, credit hours accumulated, current class standing, GPA at multiple levels, transfer credit, and progression toward degree. These time-in-program signals were the strongest predictors of continuation risk.
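The next-term computation and binary labeling can be sketched roughly as below. The term-code scheme (calendar year × 10 + a semester digit) and the helper names are illustrative assumptions for the example; the production script works against the EDW's actual term codes.

```python
# Illustrative sketch of the labeling step. The term-code scheme
# (year * 10 + semester digit) is an assumption, not the EDW's format.
FALL, SPRING, SUMMER = 1, 2, 3

def get_next_term(term_code: int) -> int:
    """Expected subsequent fall/spring term; summer gaps are not penalized."""
    year, sem = divmod(term_code, 10)
    if sem == FALL:
        return (year + 1) * 10 + SPRING   # e.g. Fall 2021 -> Spring 2022
    return year * 10 + FALL               # Spring/Summer -> next Fall

def label_continuation(term_code: int, history: set) -> int:
    """Binary target: 1 if the expected next term is in the student's history."""
    return int(get_next_term(term_code) in history)
```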
Demographic information, degree program, major, student level (undergraduate, graduate, professional, law), residency status, admit type, college affiliation, and special program designations such as first-generation and honors status.
Registration hold indicators, leave of absence history, placement exam performance (Math, English, Chemistry), ACT/SAT presence as a binary signal, high school background, and geographic origin data for incoming student cohorts.
Data preparation was one of the most consequential phases of the entire project. All data originates from the raw ALL_DATA_6162023 parquet file produced during wrangling. A single Python script then handles the full pipeline: appending engineered features, segmenting by student level for correct aggregation, labeling four variations of the target variable, and exporting multiple dataset formats for different modeling strategies.
A critical design decision was made to keep all term-count features relative to the student's current degree level. Without this, students who completed both undergraduate and graduate programs would accumulate inflated graduate term counts — corrupting the signal the model depends on most.
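A minimal pandas sketch of that level-relative counting, with hypothetical column names standing in for the EDW fields:

```python
import pandas as pd

# Hypothetical columns standing in for the EDW fields: one row per
# student per term, ordered chronologically.
df = pd.DataFrame({
    "edw_pers_id":   [1, 1, 1, 1],
    "student_level": ["UG", "UG", "GR", "GR"],  # finished UG, then grad school
    "term_order":    [1, 2, 3, 4],
})

df = df.sort_values(["edw_pers_id", "term_order"])
# Count terms WITHIN the current degree level so the graduate count
# restarts at 1 instead of inheriting the undergraduate history.
df["current_completed_terms"] = (
    df.groupby(["edw_pers_id", "student_level"]).cumcount() + 1
)
```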
Significant design effort went into how to label student outcomes. Three potential outcomes exist — discontinuation, graduation, and continuation — and the boundary between them is nuanced. Should summer non-enrollment count as dropout? Should graduation be treated separately from continuation, or grouped together as "success"? Four labeling variations were created and evaluated:
Labels continuation as 1 if enrolled in the immediately following fall/spring term. Summer skips are not penalized. Graduation = 2, dropout = 0.
Checks whether the student enrolled in any term within the following academic year — more forgiving of gaps. Graduation = 2, dropout = 0.
Looks only for fall term re-enrollment in the following year, isolating the strongest continuation signal independent of spring/summer behavior.
Collapses continuation and graduation into a single success label (1). Dropout = 0. Recommended approach — graduation and continuation share more feature similarity with each other than either does with dropout.
Although both dropout and graduation reduce enrollment, they do not share similar characteristics. Grouping them together as a single "not continuing" outcome would weaken the predictive signal for each. The binary success framing — where the model distinguishes successful students from those who discontinued — produced the most coherent feature clusters and was the recommended production target.
Raw labeled data with one row per student per term. No aggregation applied. Used for term-over-term prediction tasks where temporal granularity matters.
Aggregated to the most recent term snapshot per student per academic year. Reduces duplicate signals while preserving the most current state. Used for year-over-year forecasting.
One randomly selected term snapshot per student. Prevents overfitting caused by near-duplicate rows (stable features like sex, degree, and resident status remain constant across terms). Preserves real-world target variable distribution. Recommended dataset for model training.
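The random-snapshot sampling can be sketched as follows, assuming a pandas DataFrame keyed by EDW_PERS_ID (the seed and helper name are illustrative, not the project's code):

```python
import pandas as pd

# Illustrative sketch: keep ONE randomly chosen term snapshot per student,
# so near-duplicate rows (stable features repeated across terms) don't
# inflate the training data.
def one_row_per_student(df: pd.DataFrame, id_col: str = "edw_pers_id",
                        seed: int = 42) -> pd.DataFrame:
    shuffled = df.sample(frac=1, random_state=seed)  # shuffle all rows
    return shuffled.drop_duplicates(subset=id_col)   # first hit per student
```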
Feature selection was conducted using two complementary methods on the H2O Random Forest model: Variable Importance (scaled impurity-based importance from 50 decision trees) and SHAP values (game-theoretic attribution of each feature's contribution to individual predictions). Partial Dependence Plots were generated for the top three features in each student population segment to understand direction and shape of the relationships.
| Feature | Description |
|---|---|
| TOTAL_COMPLETED_TERMS | Total academic terms completed by the student |
| SEMESTER_CD | Current semester code (seasonality signal) |
| CURRENT_COMPLETED_ACAD_YEARS | Academic years completed at current enrollment |
| STUDENT_CURR_1_MAJOR_NAME | Current declared major program |
| CURRENT_COMPLETED_TERMS | Terms completed in current enrollment period |
| DEPT_NAME | Department of enrollment |
| SUM(STUDENT_TOT_REG_CREDIT_HOUR) | Cumulative registered credit hours |
| Feature | Description |
|---|---|
| TOTAL_COMPLETED_TERMS | Total academic terms completed |
| SEMESTER_CD | Current semester (fall/spring/summer) |
| CALC_CLS_DESC | Calculated class standing (Freshman/Sophomore/etc.) |
| CURRENT_COMPLETED_TERMS | Terms completed in current period |
| LEVEL_GPA_HOUR | Credit hours at current GPA level |
| LEVEL_GPA_QUAL_PT | Quality points at current GPA level |
| STUDENT_CURR_1_MAJOR_NAME | Current declared major |
Across both graduate and undergraduate populations, time-in-program features (total completed terms, current completed terms, academic years) consistently ranked as the most predictive. This finding directly informed the decision to also build ARIMA time-series models as a complementary forecasting approach.
The PDP for TOTAL_COMPLETED_TERMS showed a sharp nonlinear increase in continuation probability from 0 to ~8 terms, plateauing around 0.91 mean response beyond 10 terms. Students in their first 1 to 3 terms represent the highest dropout risk window.
Semester code showed a step-change pattern: continuation probability held flat through semesters 1 to 6, then dropped sharply at semester code 7+. This suggests that students last enrolled in later-academic-year semesters face meaningfully higher dropout risk entering Fall.
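A partial dependence curve like the TOTAL_COMPLETED_TERMS plot can be computed as sketched below; scikit-learn stands in for the project's H2O tooling, and the data is synthetic rather than the student dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

# Synthetic data stands in for the student dataset; feature 0 plays the
# role of TOTAL_COMPLETED_TERMS.
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           random_state=0)
model = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)

# Mean predicted probability as feature 0 is swept across its value grid,
# holding the joint distribution of the other features fixed.
res = partial_dependence(model, X, features=[0], kind="average")
grid = (res["grid_values"] if "grid_values" in res else res["values"])[0]
mean_response = res["average"][0]
```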
Two variables that initially ranked as the most important in an early H2O model run (GRAD_STATUS_DESC and AH_ACAD_YEAR_CD_LIST) were identified as term-dependent leakage risks rather than genuine student-level predictors. Both were dropped before final model training.
Two algorithm frameworks were evaluated to handle the high-cardinality categorical features in the dataset: H2O Distributed Random Forest (DRF) and CatBoost Gradient Boosted Trees. Both handle categorical features natively without one-hot encoding (CatBoost via target encoding with random permutation, H2O via its built-in categorical split handling), avoiding the dimensionality explosion that standard encoding would cause at this feature cardinality.
| | Pred: Discontinued (0) | Pred: Continued (1) | Error | Rate |
|---|---|---|---|---|
| Actual: Discontinued (0) | 31,497 | 4,943 | 0.1356 | 4,943 / 36,440 |
| Actual: Continued (1) | 4,074 | 134,052 | 0.0295 | 4,074 / 138,126 |
| Total | 35,571 | 138,995 | 0.0517 | 9,017 / 174,566 |
H2O model error for all three classes (Discontinued, Continued, Graduated) remained below 5% in cross-validation, versus CatBoost's 47.71% error on the Discontinued class specifically. The ability to accurately identify the discontinuation class was the primary evaluation criterion given the institutional use case.
Top 10 features ranked by scaled importance (0–1) from a 50-tree ensemble with 5-fold cross-validation. TOTAL_COMPLETED_TERMS dominates by a large margin, confirming that time-in-program is the primary driver of continuation probability.
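The scaled-importance ranking can be illustrated with scikit-learn as a stand-in for H2O's 50-tree forest (synthetic data, not the project's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# scikit-learn stand-in for H2O's impurity-based variable importance.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# "Scaled importance" as H2O reports it: impurity importance divided by
# the maximum, so the top feature scores exactly 1.0.
scaled = rf.feature_importances_ / rf.feature_importances_.max()
ranking = np.argsort(scaled)[::-1]   # feature indices, most important first
```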
The KNIME pipeline for the graduate model illustrates the complete end-to-end ML workflow: data ingested from Parquet, filtered via feature selection, balanced with SMOTE on the training partition, fitted with a Tree Ensemble Learner (semester-based), scored against the held-out test split, and routed to performance metrics, actual-vs-predicted visualizations, and a final classification output node.
After reading UNDERGRADUATE_TARGET_BY_YEAR.parquet, the pipeline applies two column filters (feature selection, then removing the identifier edw_pers_id), excludes the current year (2223) from training via a row filter, then fans out into three parallel semester-specific branches — each performing 70/30 partitioning, SMOTE oversampling on the minority (discontinued) class, and independent Tree Ensemble training.
Post-SMOTE Tree Ensemble validation results across the three undergraduate semester models. Class 0 = Discontinued; Class 1 = Continued/Enrolled. Overall accuracy ranges from 92–97%, with Cohen's κ peaking at 0.882 for Fall.
Line charts comparing the model's predicted count (teal) against the historical actual count (gold) from academic year 1213 to 2223. The x-axis uses KNIME academic year codes (e.g., 1213 = 2012–13, 2223 = 2022–23).
This project was led and architected by Tayler Erbe as part of the AITS Advanced Analytics team, with contributions from a team of data science interns across four workstreams. Tayler owned the overall project design, data wrangling pipeline, KNIME classification workflow, Python feature engineering, and stakeholder delivery.
KNIME data wrangling workflow · Python feature engineering and target labeling · KNIME classification model (Tree Ensemble, Random Forest, GBM) · Project architecture, stakeholder delivery, documentation
KNIME ARIMA model for projecting incoming student enrollment. Yield prediction (%NotEnrolled) for Undergrads, Grads, Professionals, and Law students by college and term.
Python ARIMA model for predicting continuing student enrollment outcomes by student level and college. Projection outputs, actual vs predicted graphs, performance metrics, and documentation.
Python model comparison analysis — CatBoost vs H2O evaluation, sampling methods, encoding strategies, cross-validation scoring, and documented recommendations for future optimization.
Python data exploration and feature importance analysis using SweetViz comparison reports. Student outcome probability modeling (high/medium/low risk classification by term).
Separate from the classification model, ARIMA (AutoRegressive Integrated Moving Average) time-series models were built to forecast total Fall enrollment numbers by student population and college. Two distinct implementations were developed: one in KNIME focused on incoming new students, and one in Python focused on projecting continuing, discontinuing, and graduating student counts across 14 UIC colleges.
Separate ARIMA models trained per college per student level. Final parameters selected via cross-referenced grid search across all ARIMA parameter combinations for each college and student segment.
Incoming, continuing, discontinuing, and graduating student populations each modeled separately, recognizing that their enrollment drivers and seasonality patterns differ significantly.
KNIME ARIMA workflow (incoming student yield) built by Anirudh Palakurthi. Python ARIMA for continuing student projections by college built by Azaan Barlas. Both implementations iterated through multiple model variants.
Initial ARIMA models were built with manually tuned p, d, q parameters, optimized per college and student level via grid search. Models were trained on historical enrollment data from 2013 through 2021-22, with separate parameter sets selected for each of the 14 UIC colleges to account for program-specific enrollment patterns and seasonality.
A second iteration extended the baseline models with exponential smoothing to handle combinations where ARIMA could not produce a reliable forecast, and incorporated Auto-ARIMA (automated parameter selection via AIC/BIC optimization) to systematically identify the best-fit model order for each college-level series. This iteration improved forecast stability and reduced manual tuning overhead across the 14 college segments.
The full 10-year forecast was divided into two student types — Continuing and Non-Continuing — on the premise that their enrollment drivers differ substantially. For each type, separate models were run across every combination of Student Level (Undergraduate, Graduate, Professional), College, and Term (Fall, Spring, Summer). In practice, this meant dozens of individual ARIMA models. For 12 specific college-level-term combinations, ARIMA failed to produce a reliable forecast and an ad-hoc Holt's Double Exponential Smoothing model was implemented as a replacement.
By splitting the enrollment data into Fall, Spring, and Summer subsets and running separate models on each, seasonality was effectively removed from each individual series. What remained was a trend-bearing, non-seasonal time series — exactly the use case Holt's method is designed for.
Holt's method extends simple exponential smoothing with two parameters: α (alpha) controls the weight given to the current level observation, and β (beta) controls the weight given to the current trend estimate. Both decay exponentially, giving more influence to recent observations without discarding older patterns entirely.
Alpha (α) and beta (β) were both swept from 0.05 to 1.0 in steps of 0.05. Every combination was evaluated against an 80/20 train-validation split using RMSE as the selection criterion. The pair yielding the lowest validation RMSE was retained as the best model for that combination.
After the best α and β were identified, the model was refit on the entire dataset (training + validation) to maximize the information available before projecting forward. The fitted model then produced a 10-period point forecast for each of the 12 problem combinations.
Mean Absolute Error (MAE) was computed on the validation set using the best-fit model. The forecast range was expressed as the point forecast ± MAE, providing a practical uncertainty band for each projected enrollment figure to communicate forecast confidence to stakeholders.
The KNIME-based ARIMA model projected the percentage of admitted students who would not enroll for each Fall term. Target variable (%NotEnrolled) was computed as the ratio of non-enrolling accepted students to total applicants, calculated separately for undergraduates, graduates, and professional students.
The Python ARIMA models forecasted yearly continuing-student enrollment counts for the 14 colleges. Key findings: Applied Health Sciences, Business Administration, and Engineering were expected to grow; Liberal Arts and Sciences and Public Health were projected to hold stable; Law was projected for significant growth; Medicine and Pharmacy showed slight projected declines.
H2O GBM projections performed reliably on historical data through academic year 2021-22. Projections for 2022-23 were identified as less reliable, a limitation documented in the project outputs. This honest limitation assessment drove the recommendation for domain-knowledge-informed feature refinement in future iterations.
ARIMA projections identified consistent enrollment growth expected for the UIC Extended Campus, highlighting the institutional importance of non-traditional student pipelines. This was surfaced as an actionable insight for strategic planning conversations with institutional leadership.
One-hot encoding was ruled out due to dimensionality explosion risk with features like STUDENT_CURR_1_MAJOR_NAME and DEPT_NAME. CatBoost handles high-cardinality features via target encoding with random permutation, which prevents overfitting and target leakage, while H2O's tree algorithms split on categorical levels natively; both avoid the need for explicit encoding.
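The permutation-based target encoding that CatBoost uses (ordered target statistics) can be illustrated with a minimal hand-rolled version; the function name and smoothing prior here are illustrative, not the library's internals.

```python
import numpy as np
import pandas as pd

# Sketch of CatBoost-style "ordered" target statistics: each row's encoding
# uses only the target values of rows that PRECEDE it in a random order,
# so no row's encoding ever sees its own label (prevents leakage).
def ordered_target_encode(cat: pd.Series, y: pd.Series, prior: float = 0.5,
                          seed: int = 0) -> pd.Series:
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(cat))
    sums, counts = {}, {}
    enc = np.empty(len(cat))
    for i in order:
        c = cat.iloc[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        enc[i] = (s + prior) / (n + 1)       # smoothed running category mean
        sums[c], counts[c] = s + y.iloc[i], n + 1
    return pd.Series(enc, index=cat.index)
```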
Rather than modeling at the term snapshot level, data was aggregated to the earliest semester snapshot per student per academic year. Analysis showed students with Fall as their last enrolled term had higher dropout probability, validating the term-segmented modeling approach and enabling cleaner seasonality isolation.
The discontinued student class was substantially underrepresented in the training data. SMOTE (Synthetic Minority Over-sampling Technique) was applied after random partitioning of the training set. Both H2O and CatBoost additionally apply internal class-weight adjustments. Ensuring accurate identification of the minority discontinuation class was the primary modeling objective.
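A minimal hand-rolled sketch of SMOTE's interpolation mechanics is below; the production pipeline used KNIME's SMOTE node on the training partition, and the k and seed values here are illustrative.

```python
import numpy as np

def smote(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Synthesize n_new minority-class rows by interpolating between a
    sampled minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    out = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = X_min[rng.integers(len(X_min))]
        dists = np.linalg.norm(X_min - a, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip a itself at index 0
        b = X_min[rng.choice(neighbours)]
        out[i] = a + rng.random() * (b - a)       # random point on segment a->b
    return out
```

Applying this after (never before) the train/test split keeps synthetic rows out of the evaluation data.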
Models were trained separately for Undergraduate, Graduate, Professional, and Law student populations, and further segmented by term (fall, spring, summer) within each level. This captured meaningful behavioral differences in dropout patterns across student types that a single pooled model would obscure.
Beyond aggregate projections, the classification model produced individual student-level risk scores, confidence measures, and probability estimates. Students were binned into high, medium, and low risk tiers for advisor use. All outputs were keyed to EDW_PERS_ID for operational integration into institutional advising systems.
The project explicitly documented its own limitations and produced concrete recommendations: incorporating domain knowledge for variable selection beyond model-derived importance, combining high-cardinality categories to reduce cardinality, and exploring ensemble approaches combining class-sensitivity-tuned models for the discontinued segment specifically.