Using Deep Neural Networks to Predict Failure and Dropout Rates in El Salvador Schools
Jeremy Swack | Oct. 6, 2021, 9:28 a.m.
Overview
For the Spring 2021 semester, I developed a data analytics project that examines the El Salvador education system. Using neural networks, I created two models that predict failure and dropout rates in primary and secondary schools and identify the most important factors behind these rates. Before examining and discussing the data I used for this project, it is important to understand the differences between education in the United States and El Salvador.
Background
Education in El Salvador is divided into two distinct categories: primary education, which corresponds roughly to 1st through 9th grade, and secondary education, which corresponds roughly to 10th and 11th grade, and sometimes 12th grade, depending on the school (“Education”). Unless a student is enrolled in optional kindergarten, primary education in El Salvador generally begins at age seven. Primary education is further divided into three sections: 1st cycle, 2nd cycle, and 3rd cycle (“Education”). Each cycle spans three grades, with the last cycle being similar to middle school in the United States.
Most schools that provide secondary education offer a two-year program, unless the student decides to stay an extra year to gain a more technical education. Completing secondary school in El Salvador is similar to graduating high school in the United States.
Data Analysis
The data I used for this project came from two 2018 surveys that were administered to 6025 primary and secondary schools in El Salvador. Both surveys were conducted entirely in Spanish. The first survey provided information about each school, such as its access to electricity, internet, and adequate classroom space, among other characteristics. The second survey gave detailed information about the number of students who passed, failed, and dropped out in the given year.
Due to the formatting of the two surveys, a large amount of data preprocessing needed to be performed. Because both surveys were entirely in Spanish, I first translated both using Google Translate. This was not without its drawbacks, as Google Translate had many imprecise translations that made the meaning of some questions unclear. However, the basic translation allowed me to gain a much better understanding of the data and the questions that were asked. After combining the school characteristics data with the failure and dropout data, I determined which variables would be the most important to analyze. In total, I was left with 43 variables, which are listed below:
Variables:
- Number of classrooms for teaching
- Number of classrooms not used for teaching
- Number of unusable classrooms
- Number of computer labs
- Number of temporary classrooms
- Number of rooms in the school
- School funding in USD
- Is the school public or private?
- Is the school in a rural or urban area?
- Is the school located in an indigenous community?
- If the school has internet and if so, what type of internet connection?
- Did the school know of any teen pregnancies?
- Did the school take action to avoid teen pregnancies?
- Does the school have (a/an)…?
- Phone
- Fax machine
- Email address
- Website
- Electricity
- Health services
- Ramp
- Handrails
- Special health center
- Computers for student use
- Library
- Computer center
- Science lab
- Educational support classrooms
- Soccer field
- Basketball court
- English lab
- Farm
- Administrative office
- Teachers’ lounge
- Clinic
- Workshop
- Professional clinic
- Multipurpose room
- Recreational space
- Dining room
- Cellar
- Kitchen
- Kitchen cellar
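The combining step described above can be sketched with pandas. This is only an illustration under stated assumptions: the column names (`school_id`, `is_rural`, `has_internet`, `passed`, `failed`, `dropped_out`) and the three example schools are invented, and the rates are computed against the sum of students who passed, failed, and dropped out, which may not match how the original project defined them.

```python
import pandas as pd

# Hypothetical stand-ins for the two 2018 surveys; all column names and values are invented.
characteristics = pd.DataFrame({
    "school_id": [101, 102, 103],
    "is_rural": [1, 0, 1],
    "has_internet": [0, 1, 0],
})
outcomes = pd.DataFrame({
    "school_id": [101, 102, 103],
    "passed": [90, 180, 45],
    "failed": [5, 10, 0],
    "dropped_out": [5, 10, 5],
})

# Join the two surveys on the shared school identifier.
# validate="one_to_one" raises an error if a school appears twice in either survey.
schools = characteristics.merge(
    outcomes, on="school_id", how="inner", validate="one_to_one"
)

# Express failures and dropouts as a percentage of all students counted (an assumption).
total = schools["passed"] + schools["failed"] + schools["dropped_out"]
schools["failure_rate"] = 100 * schools["failed"] / total
schools["dropout_rate"] = 100 * schools["dropped_out"] / total

print(schools[["school_id", "failure_rate", "dropout_rate"]])
```

An inner join keeps only schools that appear in both surveys, which is one reasonable choice when every model input needs both characteristics and outcomes.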
With this new subset of the data, I was able to create visualizations that give a better understanding of the El Salvador education system.
Data Visualization
Figure 1 displays the distribution of rural and urban schools in El Salvador, subcategorized by whether the school is private or public. A few conclusions can be drawn from this plot: First, roughly two-thirds of schools in El Salvador are rural, and almost all of those schools are public schools. Second, over a third of schools in urban areas are private. This is likely because urban areas tend to have higher levels of wealth, which means there are more families that can afford to pay tuition at a private school. Conversely, students from rural areas are likely less wealthy, meaning free public school is a necessity.
Figure 1:
Figure 2 shows the distribution of types of internet connections across schools in rural and urban areas. From this graph, it is clear that a majority of schools in rural areas do not have internet access. This supports the idea that schools in rural areas generally have lower levels of wealth and infrastructure. In urban areas, however, a majority of schools have some kind of internet access.
Figure 2:
Figure 3 shows the average amount of funding schools in El Salvador received in 2018, subcategorized by three different variables. The number in white on each bar represents the total number of schools that fall in each category. Because private schools have separate tuition fees, public schools generally receive more funding than private schools. Most notable, however, is that schools in urban areas receive far more funding than schools in rural areas.
Figure 3:
Lastly, Figures 4 and 5 show the distribution of failure rates and dropout rates across all 6025 El Salvador schools. Nearly all schools fail less than 10% of their students, and nearly all schools have a dropout rate between 0% and 25%. Additionally, over half of El Salvador schools report a failure rate between 0% and 1%, and a third report a dropout rate between 0% and 2%. Both of these figures differ from what I expected, as previous research seemed to suggest that schools in El Salvador had higher failure and dropout rates (“Education System”).
Figure 4:
Figure 5:
Model Creation and Results
The next step for my project was to create models that could predict a school's failure or dropout rate from the combined survey data set. I chose to build a neural network using the deep learning packages Keras and TensorFlow because of the complexity of the data set and the likelihood of non-linear relationships between the different variables. The data was first split into two groups: a training data set of roughly 4500 points and a testing data set of roughly 1500 points. I created two separate models: one with failure rate as the target variable, and another with dropout rate as the target variable. Both neural networks used the same training data set and the same model architecture, which included multiple dense layers, dropout layers, and batch normalization.
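A minimal sketch of this kind of setup in Keras is shown here. It is not the architecture used in the project: the layer sizes, dropout rate, optimizer, number of epochs, and the exact 4500-row training split are all assumptions, and the data is random placeholder data standing in for the encoded survey variables.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_model(n_features: int) -> keras.Model:
    """Small regression network with dense, batch normalization, and dropout layers.

    Layer widths and the dropout rate are illustrative guesses, not the project's values.
    """
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(64, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.2),
        layers.Dense(32, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.2),
        layers.Dense(1),  # single output: the predicted rate
    ])
    model.compile(optimizer="adam", loss="mse")
    return model


# Random placeholder data: 6025 schools, 43 numeric features (an assumed encoding).
rng = np.random.default_rng(0)
X = rng.random((6025, 43)).astype("float32")
y = (rng.random(6025) * 10).astype("float32")  # made-up rates in percent

# Shuffle once, then hold out everything after the first 4500 rows for testing.
order = rng.permutation(len(X))
train_idx, test_idx = order[:4500], order[4500:]

# The same architecture would be trained twice, once per target variable.
model = build_model(X.shape[1])
model.fit(X[train_idx], y[train_idx], epochs=2, batch_size=64, verbose=0)
test_mse = model.evaluate(X[test_idx], y[test_idx], verbose=0)
print(f"Test MSE on placeholder data: {test_mse:.3f}")
```

Because the inputs here are random, the resulting error says nothing about the real models; the sketch only shows how the split, the layer types, and the MSE evaluation fit together.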
The results of the two neural networks' predictions on the test data set are shown in Figures 6 and 7. Both models were able to create fairly accurate predictions for their respective target variables. However, both models struggled with the more extreme values, where there was likely less data for the model to train on. Because El Salvador is a relatively small country, the limited number of schools makes it difficult for the model to have a sufficient amount of data at all levels of failure and dropout rates.
Figure 6:
Model MSE: 4.381
Figure 7:
Model MSE: 12.457
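One way to read these two MSE values is to take their square roots. Assuming the rates are measured in percent (as the figures suggest, though this is not stated explicitly), the root mean squared error gives a typical prediction error in percentage points. The short sketch below defines MSE on a made-up three-school example and then converts the two reported values.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted rates."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up rates (in percent) for three schools: squared errors are 1, 0, and 9.
toy_mse = mean_squared_error([0.0, 2.0, 10.0], [1.0, 2.0, 7.0])

# Square roots of the MSE values reported for the two models.
failure_rmse = math.sqrt(4.381)
dropout_rmse = math.sqrt(12.457)
print(round(failure_rmse, 2), round(dropout_rmse, 2))  # 2.09 3.53
```

Under that assumption, the failure rate model is off by roughly 2 percentage points on a typical school and the dropout rate model by roughly 3.5. Squaring penalizes large misses heavily, so a few poorly predicted extreme schools can account for much of each value.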
To give further insight into which variables were most important in each model, I created two variable importance charts using the package SHAP. This package allowed me to produce SHapley Additive exPlanations (SHAP) values for each variable, which represent the average contribution a variable makes to a model’s predictions. Figures 8 and 9 display the top ten variables by SHAP value, with the first plot being for the failure rate model and the second plot being for the dropout rate model. Blue bars mean the variable is positively correlated with the target variable, and red bars mean the variable is negatively correlated with the target variable. Interestingly, the most important variable for both models was the number of classrooms used for teaching purposes. Stranger still, the bar is blue, meaning that having more classrooms for teaching is correlated with higher failure and dropout rates. I am not sure why this would be the case, but a possible explanation is that larger schools in El Salvador have higher failure and dropout rates, and these schools would generally have more classrooms for teaching. The failure rate importance chart also shows that being private and being located in an urban area are important determinants of a lower failure rate. Similarly, being public and being located in a rural area are important determinants of a higher failure rate. This aligns with my initial hypothesis that factors correlated with socioeconomic status would be the greatest predictors of failure and dropout rates.
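To make the idea of a variable's "contribution" more concrete, the sketch below computes exact Shapley values by hand for a tiny, made-up model. It does not use the SHAP package, which relies on faster approximations for real networks, and the toy model, its coefficients, and the two example schools are all hypothetical.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction, averaged over all feature orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for ordering in orderings:
        current = dict(baseline)            # start from the baseline school
        previous = predict(current)
        for name in ordering:
            current[name] = instance[name]  # switch this feature to its real value
            updated = predict(current)
            totals[name] += updated - previous  # credit the change to this feature
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_failure_rate(school):
    """Hypothetical model with invented coefficients, used only for illustration."""
    rate = 2.0 + 0.1 * school["classrooms"] - 1.5 * school["is_private"]
    if school["is_rural"] and not school["has_internet"]:
        rate += 1.0  # interaction: rural schools without internet
    return rate


baseline = {"classrooms": 10, "is_private": 0, "is_rural": 0, "has_internet": 1}
school = {"classrooms": 20, "is_private": 0, "is_rural": 1, "has_internet": 0}

values = shapley_values(toy_failure_rate, school, baseline)
print(values)
```

In this toy case the contributions sum to the gap between the school's prediction and the baseline prediction, and the extra point caused jointly by being rural and lacking internet is split evenly between those two variables. A global importance chart is typically built by averaging the absolute size of such per-school values over many schools.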
Two variables I thought would be highly impactful to the models’ predictions, internet connection and average funding, were not present on either variable importance chart. Considering other variables that showed up on these charts, such as whether the school had a kitchen or a kitchen cellar, it is difficult to determine exactly why internet access and funding were not important determinants in either failure rate or dropout rate. However, a possible explanation is that there are other variables, like whether a school is in an urban or rural area, that capture the same correlation that internet access and funding do, but to a higher degree of accuracy. Another explanation is that the SHAP values of all these variables are relatively low, meaning no single variable is a great predictor of failure or dropout rate alone. This could mean that even though internet access and funding do not make the top ten, there may not be a large difference in correlation between the 10th best predictor and the 30th.
Figure 8:
Failure rate model:
Figure 9:
Dropout rate model:
Applications and Future Work
The insights from the models and variable importance charts I created could be used to help the El Salvador educational system make more informed decisions on where to invest funding to promote lower failure and dropout rates. With more data from both El Salvador and neighboring countries, it would be possible to create more robust models that could predict with higher degrees of accuracy, even for more extreme failure and dropout rates. Additionally, with more computing power, it would be possible to create models with more hidden layers and nodes which could allow for higher degrees of accuracy as well.
Works Cited
“Education System in El Salvador.” El Salvador Education System, www.scholaro.com/pro/Countries/El-Salvador/Education-System.
“Education.” U.S. Agency for International Development, 17 May 2021, www.usaid.gov/el-salvador/education.