Finding Donors for CharityML Project
This project is part of the Udacity Data Scientist Nanodegree Program: Finding Donors for CharityML. The goal was to apply supervised learning techniques to U.S. census data to help a fictitious charity organization, CharityML, identify the people most likely to donate to their cause.
Let’s start by following the CRISP-DM process (Cross-Industry Standard Process for Data Mining):
Business Understanding
Data Understanding
Prepare Data
Data Modeling
Evaluate the Results
Deploy
Business Understanding
CharityML is a fictitious charity organization that wants to expand its potential donor base by sending letters to residents of the region where it is located, but only to those most likely to donate. After sending nearly 32,000 letters to people in the community, CharityML determined that every donation it received came from someone making more than $50,000 annually. Our goal is therefore to build an algorithm that identifies potential donors and reduces the overhead cost of sending mail.
Data Understanding
The dataset consists of 45,222 records and originates from the UCI Machine Learning Repository. It was donated by Ron Kohavi and Barry Becker after being published in the article “Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid”.
In summary, the features are:
age: continuous
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
education_level: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
education-num: continuous
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
race: Black, White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other
sex: Female, Male
capital-gain: continuous
capital-loss: continuous
hours-per-week: continuous
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
income: >50K, <=50K
As the preliminary analysis shows, our dataset is imbalanced: unsurprisingly, most individuals do not make more than $50,000 a year. As we will explain in the Naive Predictor section, this has an impact on how we should judge the accuracy of the model we are going to develop.
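A quick way to check this imbalance is to compute the class proportions with pandas. The frame below is a toy stand-in for the census data (in the real dataset roughly three quarters of the 45,222 records fall in the '<=50K' class):

```python
import pandas as pd

# Toy sample mimicking the real imbalance: 3 out of 4 records are '<=50K'
data = pd.DataFrame({"income": ["<=50K"] * 3 + [">50K"]})

# Fraction of records per class
class_balance = data["income"].value_counts(normalize=True)
print(class_balance)
```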
Transforming Skewed Continuous Features
Algorithms can be sensitive to skewed distributions of values and can underperform if the range is not properly normalized. It is therefore common practice to apply a logarithmic transformation so that very large and very small values do not negatively affect the performance of a learning algorithm. In the census dataset, two features fit this description: capital-gain and capital-loss.
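A minimal sketch of the transformation, on a toy frame standing in for the census data (the log(x + 1) form keeps zero entries at zero while compressing the very large values):

```python
import numpy as np
import pandas as pd

# Toy data: both features are mostly zero with a few very large values
features = pd.DataFrame({
    "capital-gain": [0, 0, 15024, 7688],
    "capital-loss": [0, 1902, 0, 0],
})

skewed = ["capital-gain", "capital-loss"]
# log1p(x) = log(x + 1): safe for the many zero entries
features[skewed] = features[skewed].apply(lambda x: np.log1p(x))
```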
Normalizing Numerical Features
Applying scaling to the data does not change the shape of each feature's distribution. We will use sklearn.preprocessing.MinMaxScaler for this on age, education-num, hours-per-week, capital-gain and capital-loss.
Prepare Data
We convert categorical variables using the one-hot encoding scheme.
As always, we then split the data (both features and labels) into training and test sets: 80% of the data will be used for training and 20% for testing.
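Both steps can be sketched as follows on toy data (pd.get_dummies handles the one-hot encoding; train_test_split performs the 80/20 split):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: one numerical and one categorical column
features = pd.DataFrame({
    "age": [25, 40, 33, 50, 28],
    "workclass": ["Private", "State-gov", "Private", "Self-emp-inc", "Private"],
})
income = pd.Series([0, 1, 0, 1, 0])  # 1 = earns more than $50,000

# One-hot encode the categorical variables
features_final = pd.get_dummies(features)

# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    features_final, income, test_size=0.2, random_state=0)
```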
Data Modeling
The purpose of generating a Naive Predictor is simply to show what a baseline model without any intelligence would look like. As already said, the distribution of the data makes it clear that most individuals make less than $50,000 annually. Therefore a model that always predicts '0' (i.e. the individual makes less than $50,000) will generally be right.
The imbalance also means that accuracy alone is not very informative: a model can score high accuracy while its actual predictions are poor. Precision and recall are usually recommended in these cases.
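The point can be made concrete on toy labels with a similar imbalance: the naive all-'0' predictor scores high accuracy yet has zero precision and zero recall on the class we actually care about.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels: 8 individuals below $50k, 2 above (similar imbalance)
y_true = np.array([0] * 8 + [1] * 2)
y_pred = np.zeros_like(y_true)  # naive predictor: always '<=50K'

acc = accuracy_score(y_true, y_pred)                      # looks good...
prec = precision_score(y_true, y_pred, zero_division=0)   # ...but 0: no positives predicted
rec = recall_score(y_true, y_pred)                        # 0: no potential donor is found
```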
Let’s compare the results of 3 models:
Decision Trees
Support Vector Machine
AdaBoost
As already said, we are focusing on the model’s ability to precisely predict those that make more than $50,000, which is more important than the model’s ability to recall those individuals. AdaBoostClassifier is the one that performs best on the testing data, in terms of both accuracy and F-score. Moreover, AdaBoostClassifier is also quite fast to train, as shown in the Time-Training_set_size histogram.
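A minimal sketch of the comparison, using a synthetic imbalanced dataset in place of the census data (the real project also measures training time and varies the training-set size; beta=0.5 weighs precision more heavily than recall, matching the goal above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, fbeta_score

# Synthetic imbalanced stand-in for the census data
X, y = make_classification(n_samples=400, weights=[0.75], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for clf in (DecisionTreeClassifier(random_state=0),
            SVC(random_state=0),
            AdaBoostClassifier(random_state=0)):
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    # Record (accuracy, F-score with beta=0.5) for each model
    results[type(clf).__name__] = (accuracy_score(y_test, pred),
                                   fbeta_score(y_test, pred, beta=0.5))
```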
Now we will fine-tune the model using GridSearchCV (found in sklearn.model_selection in current scikit-learn releases; older versions exposed it as sklearn.grid_search).
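A sketch of the tuning step on synthetic data; the parameter grid below is hypothetical, chosen only to illustrate the mechanics (n_estimators and learning_rate are natural candidates for AdaBoost):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score

# Synthetic stand-in for the prepared census data
X, y = make_classification(n_samples=200, random_state=0)

# Hypothetical grid; real tuning would explore a wider range
param_grid = {"n_estimators": [50, 100], "learning_rate": [0.5, 1.0]}

# Score candidates with the same beta=0.5 F-score used to compare models
scorer = make_scorer(fbeta_score, beta=0.5)

grid = GridSearchCV(AdaBoostClassifier(random_state=0),
                    param_grid, scoring=scorer, cv=3)
grid.fit(X, y)
best_clf = grid.best_estimator_
```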
Finally, we can find out which features provide the most predictive power. By focusing on the relationship between only a few crucial features and the target label, we simplify our understanding of the phenomenon. We can do that using the model’s feature_importances_ attribute.
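Reading the importances off a fitted model is a one-liner; the sketch below uses synthetic data, with one importance weight per feature (the weights sum to 1, so ranking them gives the most predictive features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic data with 5 features standing in for the census columns
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = AdaBoostClassifier(random_state=0).fit(X, y)

# One weight per feature; higher weight = more predictive power
importances = model.feature_importances_
top_features = np.argsort(importances)[::-1][:3]  # indices of the top 3
```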
Evaluate the Results
Our goal was to predict whether an individual earns more than $50,000 annually, because the individuals who meet this requirement are more likely to donate to a charity. After cleaning the data and shaping it into a dataset ready for ML training, we tested the performance of three different models. Based on the F-score, the best model is AdaBoostClassifier.
Sponsor some real charity organization
I want to take advantage of this article to talk about some real charity organizations that I really care about:
The Italian Multiple Sclerosis Association (AISM) is a non-profit organization that deals with multiple sclerosis
EMERGENCY is a humanitarian NGO that provides free medical treatment to the victims of war, poverty and landmines
Save the Children improves the quality of life of children through better education, health care, and economic opportunities. It also provides emergency aid during natural disasters, wars and other conflicts
Outro
I hope the post was interesting, and thank you for taking the time to read it. The code for this project can be found in this GitHub repository; on my Medium you can find a more in-depth story, and on my Blogspot you can find the same post in Italian. Let me know if you have any questions, and if you like the content that I create, feel free to buy me a coffee.