Prediction Model Overview
Trusting anyone’s prediction model requires a basic understanding of where it comes from and how it works. This article orients you to the fundamentals of Civitas Learning prediction modeling.
What is “prediction”?
For our purposes, a prediction is a signal. It points to what is likely to happen without intervention. Far from being deterministic, a prediction is a machine-learning-aided call to action: it’s worth quantifying a prediction because knowing it helps you change that outcome.
Civitas Learning’s student insights are driven by two key types of prediction:
Persistence Prediction — “How likely are you to retain this student?”
Persistence means a given student will:
- Enroll in your institution in a future term
- Stay enrolled past your institution’s census (add/drop period)
Completion Prediction — “How likely is this student to graduate on time?”
Completion means a given student will:
- Earn a credential at your institution
- Complete it within a set number of years after their first enrolled term
Prediction scores
On Day 1 and then every week thereafter (every day if you have an Advising product), each student is rescored on their likelihood to persist. Scores range from 0% to 100%. Once a term is over, the scores from Day 1 can be analyzed against what actually happened to evaluate the model, the results of which inform adjustments to be made when the model is retrained.
When scored, the predictions are distributed (grouped) into simple buckets by range: Very High, High, Moderate, Low, and Very Low. These buckets are easy to work with throughout the products.
There is no weighting here, so expect the number of students in each bucket to change throughout the term. Combining the buckets with a filter for rapid changes in prediction score is a quick way to target students who need prioritized attention.
Where you see these buckets, hover over the gray information (ℹ) icon to see the ranges that apply to your institution. If the ranges on your buckets don’t match what you see above, it’s because your institution opted to customize the ranges.
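For illustration only, here is a minimal sketch of how a 0–100% prediction score could be mapped into these buckets. The threshold values below are placeholders, not your institution’s configured ranges:

```python
# Illustrative only: these bucket boundaries are placeholders, not your
# institution's configured ranges.
EXAMPLE_BUCKETS = [
    (85, "Very High"),
    (70, "High"),
    (50, "Moderate"),
    (30, "Low"),
    (0,  "Very Low"),
]

def bucket_for(score_percent: float) -> str:
    """Map a 0-100% persistence prediction score to a named bucket."""
    for lower_bound, label in EXAMPLE_BUCKETS:
        if score_percent >= lower_bound:
            return label
    return "Very Low"

print(bucket_for(92))  # Very High
print(bucket_for(42))  # Low
```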
Beyond grades
These scores are far more predictive of persistence than conventional methods used by institutions, such as watching to see who falls on academic probation: on day one, this prediction score has been found to identify 82% of the students who later failed to persist.
Comparing persistence data across 4 million students in 62 institutions surfaced these surprises:
- Almost every institution lost more students with GPAs above 2.0 than below.
- 78% of students who were lost had GPAs over 2.0.
- 44% of students who were lost had GPAs over 3.0.
Mining predictions for opportunities
An opportunity is a data-inspired insight. It is a finding — specific to your institution — of where there’s potential for effective action for student success. It has two features:
- It points to a current gap in student outcomes
- It’s a gap that may be improved through intervention
A big part of Administrative Analytics is surfacing opportunities for you to help students succeed. For example, you could look up gaps in persistence likelihood between two groups of students, and drill down to what’s correlated with those different persistence rates.
What data is used?
This immense modeling power comes from merging and standardizing the data across multiple sources at your institution:
- SIS — Student Information System, such as Ellucian Banner and Oracle PeopleSoft
- LMS — Learning Management System, such as Blackboard and Canvas
A typical deployment includes both an SIS and an LMS, but institutions can also choose to include additional data sources, such as:
- CRM — Customer Relationship Management system, such as Salesforce and Oracle NetSuite
- Card swipe data, which tracks physical activity and resource utilization
Data Types — We can derive many types of variables from these data sources:
- Academics and Degree Alignment
- Financial aid
- Demographics and U.S. Census findings
- High School outcomes
- Engagement
- Activity
- Behavioral
Types of predictors
Here is a categorized view of the many predictors (also called features or variables) that are considered for modeling:
| Category | Description | Examples |
| --- | --- | --- |
| Background & Demographics | What students come with when they enter school | Demographics, high school variables, test scores, U.S. Census data from their ZIP code |
| Academic performance | Variables that define a student’s academic performance after entering school | Time-series GPA, grade consistency, LMS grade (absolute and relative) |
| Academic progress & program | Progress and speed toward completion, along with program/major information | Credits-earned time series, credit load, credits withdrawn, number of major switches, program CIP code |
| Financial aid | Based on mapping institutional financial aid categories to Civitas canonical categories | Total disbursed time history, need-based aid, Pell, scholarship, EFC (expected family contribution, which determines eligibility for federal student aid; included if provided by the institution) |
| Enrollment behavior | How students enroll for current and future terms | Credits enrolled before term start, days enrolled before term start, credits enrolled for next term |
| LMS | Absolute and relative LMS predictors in various categories, derived using digital signal processing | Fraction of active days in the LMS, posting activity, relative levels of engagement |
What is predictive modeling?
From the top: An algorithm is a set of steps designed to accomplish a task. Computer systems follow algorithms to perform their functions; when computers use algorithms that improve their own performance without explicit human intervention, we say they are engaging in machine learning (ML).
ML algorithms use statistics to search for patterns in large sets of data. Those searches can be supervised or unsupervised; that is, we can label the data to tell the algorithms what kinds of patterns to look for, or we can leave the algorithms alone to see what patterns emerge. Once the algorithms find patterns in existing data sets, they can use those patterns to predict, or model, future behavior.
Predictive modeling is, in essence, supervised machine learning: the algorithms learn relationships between training data (which are specifically curated and provided sets of variables or features) and specified outcomes. The trained models can then be applied to real-world operational data to predict future outcomes. Given inevitable changes in data characteristics over time and the resulting degradation in model performance, predictive models are retrained as needed.
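As a minimal sketch of that supervised workflow (using scikit-learn and made-up data and feature names, none of which are the actual Civitas predictors):

```python
# Minimal supervised-learning sketch: train on past terms with known outcomes,
# then score current students. Data and feature names are illustrative only.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["term_gpa", "credits_attempted", "lms_active_days_fraction"]

# Labeled training data: past student-terms with a known persistence outcome.
history = pd.DataFrame({
    "term_gpa":                 [3.4, 2.1, 3.8, 1.9, 2.9, 3.1],
    "credits_attempted":        [15, 9, 12, 6, 12, 15],
    "lms_active_days_fraction": [0.8, 0.3, 0.9, 0.2, 0.6, 0.7],
    "persisted":                [1, 0, 1, 0, 1, 1],
})

# Operational data: current students with no outcome yet.
current = pd.DataFrame({
    "term_gpa":                 [2.5, 3.6],
    "credits_attempted":        [9, 15],
    "lms_active_days_fraction": [0.4, 0.9],
})

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(history[FEATURES], history["persisted"])

# Predicted probability of persisting (column 1 = positive class).
current["persistence_score"] = model.predict_proba(current[FEATURES])[:, 1]
print(current)
```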
How are predictions generated?
Here is a high-level view of the deployment process:
- Prepare the data — We connect to all of your data sources, then standardize the data into a compatible format. This is called “normalizing” or “cleaning” the data.
- Check the data — We work closely with your IT team to make sure the data is accurate. This is called “data validation”, or QA. It happens up front, during implementation, and continues as ongoing monitoring to keep the data accurate and up to date.
- Model the data — We prepare your historic data for training so that models can learn from past data and outcomes. For example, we use typical data like GPA, credits earned, and full-time status for initial modeling. Then we develop a rich set of derived variables that unearth deeper insights about students.
- Train your model — We configure and train your models (a process that includes feature engineering) to detect patterns across all your data and student outcomes, such as persistence, course success, and graduation. The models get smarter by learning which predictors affect student outcomes at your institution specifically.
- Test your model — Then we test the models’ performance for accuracy. When the models meet our quality standards, we share the performance results with you. After you validate the results and confirm readiness to deploy, the platform can start generating fresh predictions to support your work.
Once deployed, your model is strengthened by ongoing monitoring:
- Review your model — Modeling is rigorous and continuous. The built-in QA process revisits all of the predictions to see how accurate they proved. Any data issues are flagged and reviewed, and the model is checked for accuracy after each term ends (“Day 0” analysis).
- Retrain your model — We retrain your model whenever testing indicates it’s needed or something changes in your data mapping, so it can adapt to new information as it arrives. Retraining also happens proactively for larger events, such as institutional changes or disruptions like COVID-19.
- Evolve your model — We continually research, evaluate, and evolve our model variables and algorithms based on performance trends, so you’ll always receive our improvements.
How are models configured?
Typically, over 100 variables are considered for modeling. The majority of these model predictors are actually new variables derived from your basic data, including:
- Time series predictors
- Relative predictors
Derived Variables — We use your data to build out derived variables, which surface time-dependent and comparative information that helps reveal which students need attention:
- Sections Withdrawn — how many were withdrawn, and how recently?
- Change in GPA — has there been a rapid drop?
- LMS Activity Relative to Peers — how are they engaging compared to students in their own sections, and how has it changed?
- Degree Program Alignment — how are they doing course-wise, compared to students in their same degree program who successfully graduated on time?
In the end, the most powerful and important predictors tend to be our derived variables in combination with student engagement information.
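To make the idea concrete, here is a small pandas sketch of two such derived variables: a term-over-term change in GPA, and LMS activity relative to section peers. The column names and data are illustrative, not the Civitas schema:

```python
# Illustrative derived variables; column names and values are hypothetical.
import pandas as pd

terms = pd.DataFrame({
    "student_id": [1, 1, 2, 2],
    "term":       [1, 2, 1, 2],
    "gpa":        [3.2, 2.4, 3.5, 3.6],
})

# Change in GPA: has there been a rapid drop since the prior term?
terms = terms.sort_values(["student_id", "term"])
terms["gpa_change"] = terms.groupby("student_id")["gpa"].diff()

activity = pd.DataFrame({
    "student_id":      [1, 2, 3],
    "section_id":      ["BIO101-01", "BIO101-01", "BIO101-01"],
    "lms_active_days": [4, 9, 10],
})

# LMS activity relative to peers in the same section.
section_mean = activity.groupby("section_id")["lms_active_days"].transform("mean")
activity["lms_activity_vs_section_avg"] = activity["lms_active_days"] / section_mean

print(terms)
print(activity)
```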
Configuration of your model is not set in stone:
- Your institution can opt to test or add additional model variables through custom configuration work. You can contact Support to inquire.
- When there is a common need or shared circumstances across institutions, those data source or model variable updates are taken on through the product roadmap.
For example, COVID-19 had a huge impact on institutional policies and methods of engagement, causing big shifts in model variables related to enrollment, credits attempted, grades, course modality, and LMS activity.
What algorithm is used?
We start with a primary algorithm, Random Forest (RF), and test with additional algorithms as needed to achieve the quality standards for our predictions. Random Forest uses a set of CARTs, which are classification-and-regression trees.
We have found that Random Forest offers better model performance than simpler linear classifiers, such as segmented logistic regression.
Why Random Forest?
Random Forest is a workhorse machine learning (ML) technique that is widely used in many industries, bringing many benefits:
- Random Forest is well-known, familiar, well-tested, and well-researched.
- RF can achieve excellent model performance very quickly.
- RF is easier to configure and less dependent on precisely curated features and segments (unlike logistic regression).
- RF has low bias and low variance. This means it can better represent complex patterns in data, even with smaller data sets (unlike logistic regression).
Trade-off — The one trade-off of choosing RF is that it is less interpretable than linear algorithms such as logistic regression, but can still provide predictor rankings. “Less interpretable” means we can still determine predictor importance, but there is no longer an obvious direct relationship between a specific predictor value changing and its effect on the prediction score.
Institutions that changed to RF from a logistic regression model saw these effects:
- Slightly worse calibration errors (predicted vs. actual values) overall (but within acceptable model performance QA thresholds)
- Better AUC* (average 5% boost)
- Better accuracy for low prediction scores
- Better calibration during the end of the term
* AUC (Area Under the ROC Curve) is a standard performance metric for classification models. It measures how well the model separates the two outcome classes across all possible classification thresholds, with 1.0 (100%) representing a perfect model, and it is the preferred metric for such models.
What RF does
Random Forest has two main activities:
- Splitting — RF combines student variables to build subtrees that do the best job of splitting (creating maximum separation) between the groups (persisting and non-persisting students).
  - Each tree splits based on information gain (Gini index or entropy); see the sketch below.
  - That is: does further splitting lead to more differentiation in persistence rates?
- Bootstrapping — RF uses bootstrapping (random selection of training samples and variables) to create hundreds to thousands of trees.
  - This forms a ‘consensus view’ of a student’s likelihood to persist.
  - This increases the likelihood of each CART having an independent look at the population, which improves the model’s performance and robustness.
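To make the information-gain criterion concrete, here is a small worked sketch (illustrative, not Civitas code) that computes Gini impurity before and after a hypothetical split; the split has value when the weighted impurity of the child nodes is lower than the parent’s:

```python
# Worked example of the Gini criterion for a single candidate split.
# All counts below are made up for illustration.

def gini(persisters: int, non_persisters: int) -> float:
    """Gini impurity of a node containing the given outcome counts."""
    total = persisters + non_persisters
    p = persisters / total
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

# Parent node: 700 persisting students, 300 non-persisting.
parent_impurity = gini(700, 300)                                 # 0.420

# Candidate split, e.g. on a credits-enrolled threshold.
left, right = gini(550, 100), gini(150, 200)
children_impurity = (650 / 1000) * left + (350 / 1000) * right   # ~0.341

# Lower weighted impurity in the child nodes means the split adds information.
print(f"parent {parent_impurity:.3f} -> children {children_impurity:.3f}")
```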
How RF works
Random Forest is an ensemble of a large number of classification-and-regression trees (CARTs), where each CART is generated from a bootstrapped random sample of the overall training data and a random selection of features (predictors) at each split. Choosing the feature for each split is analogous to finding the feature with the largest magnitude of elasticity, with the tipping point serving as the split threshold. This makes sense because the purpose of the split is to maximize the persistence rate difference between the two resulting subgroups.
Each subsequent split (creating child nodes, or growing the tree) follows the same principle. The Gini index or entropy, both information-gain metrics, is used to determine the utility of further splitting: if further splitting leads to improved differentiation in persistence rates of child nodes vs. parent nodes, there is value in splitting.
Two-dimensional bootstrapping (over training data and features) increases the likelihood of each CART having an independent look at the population, thereby improving the model’s performance and robustness. We conducted hyperparameter tuning and settled on 1000 random CARTs, 70% random sampling of the training data, 50% random selection of features at each split, and a maximum tree depth of 8.
In short, instead of building three separate models, the RF model builds up to 2⁸ × 1000 relatively independent micro-segments (up to 256 leaf nodes in each of the 1000 trees), from which a consensus is reached on a student’s persistence probability.
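Assuming the equivalent scikit-learn estimator is a reasonable stand-in (the production implementation may differ), those hyperparameters map roughly to the following configuration:

```python
# Rough scikit-learn equivalent of the hyperparameters described above;
# a sketch for illustration, not the production Civitas configuration.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=1000,   # 1000 random CARTs
    max_samples=0.7,     # each tree sees a 70% bootstrap sample of the training data
    max_features=0.5,    # 50% of features considered at each split
    max_depth=8,         # maximum tree depth of 8 (up to 2**8 = 256 leaves per tree)
    criterion="gini",    # information-gain criterion ("entropy" is the alternative)
    random_state=0,
)
```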
How are models tested?
Our quality assurance tests your model’s performance for both its accuracy and its calibration. Every time we train or retrain a model for an institution, we evaluate the model’s performance. Methodologies for persistence and completion prediction models are similar.
QA makes use of discrete data sets for comparison:
- Training data — These are past student records with known outcomes that are used to create the machine learning model.
- Testing data — These are past student records with known outcomes that are not used in model training. This dataset is used to check model performance on data that the new model has never seen before.
Why? A prediction model is only useful if it can generalize well to new data; when it can’t, this data science problem is called “overfitting” (a model fitting its training data too exactly). To be able to detect this problem, part of the data set must be held back from training for testing purposes.
Model accuracy
We test your model’s accuracy using the AUC metric.
AUC (Area Under the ROC Curve) is a standard performance metric for classification models. It measures how well the model separates the two outcome classes across all possible classification thresholds, with 1.0 (100%) representing a perfect model, and it is the preferred metric for such models.
In order to pass QA, these are the thresholds that the model’s performance must achieve:
- The training data AUC is between 70% and 97% (0.7 – 0.97)
- The test data AUC is between 70% and 97% (0.7 – 0.97)
- The difference between the training and test AUC falls between –2% and +5%
This process tests for the “overfit” model problem described above.
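As a minimal sketch of the AUC portion of this QA step (assuming scikit-learn; the outcome labels and predicted scores would come from the training and testing data sets described above):

```python
# Sketch of the AUC check, using the thresholds listed above.
from sklearn.metrics import roc_auc_score

def passes_auc_qa(y_train, train_scores, y_test, test_scores) -> bool:
    """Check the training/test AUC ranges and the train-vs-test gap."""
    train_auc = roc_auc_score(y_train, train_scores)
    test_auc = roc_auc_score(y_test, test_scores)
    gap = train_auc - test_auc          # a large positive gap suggests overfitting
    return (
        0.70 <= train_auc <= 0.97
        and 0.70 <= test_auc <= 0.97
        and -0.02 <= gap <= 0.05
    )
```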
Model calibration
We test your model’s calibration by evaluating its predictions against known outcomes.
We compare the historic outcomes with model predictions for the entire student population. This lets us establish whether the predictions track well with historic outcomes.
In order to pass QA, this is the threshold that the model’s calibration must achieve:
- For the population tested, the difference between historic outcomes and model predictions falls within 5%
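A minimal sketch of that calibration check, assuming the known outcomes and predicted probabilities are available as arrays:

```python
# Sketch of the calibration check: the average predicted persistence probability
# should stay within 5 percentage points of the actual historic persistence rate.
import numpy as np

def passes_calibration_qa(y_actual, predicted_scores, tolerance: float = 0.05) -> bool:
    predicted_rate = float(np.mean(predicted_scores))  # mean predicted probability
    actual_rate = float(np.mean(y_actual))             # historic persistence rate
    return abs(predicted_rate - actual_rate) <= tolerance
```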
Questions about Predictions
Won’t adding our prior LMS data give us better predictions?
It’s counterintuitive, but no. First, our testing and results indicate that more isn’t better. In our 10+ years of experience working with hundreds of customers, we have found that adding more than 2 years of historical LMS data doesn’t improve the performance of the persistence model. When only 1 year of historical data is available, institutions with a large, active student population can leverage even that 1 year for an incremental positive effect on their model performance. One institution that insisted on pulling in their historical LMS data only achieved a 0.1% boost in AUC, so the return on their investment was poor.
Second, combining data from disparate LMS systems has not been shown to improve the model predictive quality and sometimes has a negative effect, so we do not recommend it. The issue with blending LMS data is consistency. For example, grades and student engagement may be calculated differently between systems. This could result in a model performing poorly if it was trained on data from one LMS but scored on data from a different LMS.
We train your persistence model deliberately, according to how much data you have:
- We use 2 years of historic SIS data to train persistence models.
- If you migrate to a new LMS system:
  - With less than 1 year of data in the new LMS system, we block the LMS predictors and train the model only on SIS predictors.
  - With 1 full year of data in the new LMS system, we train the model on that year of historical data.
  - As soon as we have 2 years of data in the new LMS system, we retrain the model with 2 years of data.
In terms of data adequacy, however, enrollment size does matter:
- Small institutions (fewer than 5K enrollments per term) may not have enough samples to train the model with 1 year of data, which may affect model performance.
- Larger institutions (40K enrollments per term) have plenty of samples to train the model, even with just 1 year of data.
Will our models update to change predictors over time?
Yes, and keeping your model current is a primary activity. The predictor values (which are recalculated every night when the model is scored) always change over time, and the model itself is adapted to reflect this. For example, if part-time status drops off in predictive power with your new cohorts, its ranking will drop among the powerful predictors that make up your institution’s model. When there are changes to your programs and data, such as achieving a full year of LMS data, we proactively retrain the model to incorporate those changes.
How it works — The model ranks your predictors during initial training, and this tells the model which predictors are important. If some predictors are no longer important due to changes in data during the term being scored, the model must be retrained. We call this “non-stationarity”, where the distribution of data is different between training terms and scoring terms.
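For illustration, one generic way to spot this kind of drift (not necessarily the exact check Civitas performs) is to compare a predictor’s distribution between the training terms and the term being scored, for example with a two-sample Kolmogorov–Smirnov test:

```python
# Illustrative drift check: compare the distribution of one predictor (e.g.,
# credit load) between the terms the model was trained on and the term being
# scored. A generic technique, not necessarily the exact Civitas check.
from scipy.stats import ks_2samp

def predictor_has_drifted(training_values, scoring_values, alpha: float = 0.01) -> bool:
    """True if the predictor's distribution differs significantly between terms."""
    _statistic, p_value = ks_2samp(training_values, scoring_values)
    return p_value < alpha
```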
How do we track and compare who falls into the at-risk buckets?
During a term, you can use the Persistence Insights and Powerful Predictors in Administrative Analytics to see and save (via the Scratchpad) snapshots through a term for groups that you track.
There are optional services and subscriptions, such as the Daily Continuation Report and querying the Data Warehouse, that could add to this analysis.
How do we see the effectiveness of our outreach?
You can use Initiative Analysis to understand directional impact of programmatic initiatives (such as advising, tutoring services, writing labs, and supplemental instruction). For details about initiative design, interpreting results, and framework analysis, see the Hub and training materials.
In Initiative Analysis, the detailed results of Impact by Student Group show impact for various groups (such as bottom quartile, 0 terms, or 1-3 terms).
Are negative or non-impact indicators handled by the model?
Yes. The model doesn’t categorize any predictors as negative, positive, or no-impact; instead, it’s the value held by the individual student (such as their high school GPA) that can be evaluated that way. All predictors are dynamically ranked by their predictive “power” to affect the score calculation for the given group; the ranking reflects their relative importance to the prediction.
Which predictive factors fall out of scope?
None that can be supported. All of the standard predictors, over 100 of them, are in scope for all institutions. Dependent predictors fall out of scope for your model only when certain data is not available from your data sources, or when overriding circumstances arise, such as needing to remove “modality” during the COVID-19 lockdown.
If at any point your data sources change and a predictor becomes available, the model can be updated and retrained to leverage it.
To see the list that is currently used in your institution’s model, contact Support.
Do any factors relate to faculty/staff behaviors?
No. All the predictors are focused exclusively on the students themselves: their demographics, characteristics, household origin, enrollment, academics, aid, behavior, and performance.
What visibility do we have into a student’s prediction?
A student’s prediction score is a complex calculation, not a simple listing of their values for individual predictors, so you do not see such a list per individual. However, you can narrow your filters to focus on the students in question and see which predictors are predominant for that small group.
When working with a specific student, Advisors should check the student’s Engagement Opportunities for immediate direction on predictors that are flagged for action.
How do we ensure the model is free of bias?
Models are not inherently biased, but this is a good question. Predictive models can fail in three key ways: a mismatch between training and operational data, laziness in model training, and malicious intent.
What about demographics? — Among the hundreds of input variables in our predictive models are demographics that categorize students by age, gender, race, ethnicity, socioeconomic status (inferred from Pell and census data), status as first-generation college student, and high school characteristics (free lunch qualifying) — all associated with factors that contribute to equity gaps. But because these variables can’t be influenced through your institutional interventions or policies, we rank them lower in our predictive models.
There are multiple factors in our model approach that combat bias:
- We use derived variables
- We focus on malleability and intervention impact
- We use a robust number of variables (over 100)
- We use transparency with variable ranking
- We hold to our social mission use case
For more details, see our white paper, Navigating Bias in Predictive Modeling.
How do we keep the prediction model and its data from being abused?
There are protective actions you can take on your side:
- Select appropriate use cases
- Controlled access — opt to hide certain data from staff members
- Professional development — proactively train your staff