Executive Summary: This paper demonstrates how machine learning can be used to prioritize sales leads. For a bank’s phone marketing campaign, a Random Forest predicts whether a sale will close based on customer data such as history with the company, occupation, and age. The final model correctly identifies a closed lead 69% of the time (True Positive Rate) and misidentifies a lead that doesn't close as one that does 17% of the time (False Positive Rate).
Darren Keeley, STAT 6620-02
Abstract
AUC is used as the model performance metric. Based on this metric, a grid search is conducted to find the best number of variables per tree, which turned out to be close to the default of the square root of the number of features. Since the data is highly imbalanced toward sales not closing, the model is further improved using stratified sampling. Finally, a theoretical cost of pursuing bad leads is imposed, in which the cost of two dead-end leads equals the profit of one closed sale. An optimal cutoff probability for True Positives of 36% was found, with TPR and FPR equal to 0.688 and 0.1698, respectively.
Introduction
In sales, separating the quality leads from the bad ones can save tremendous time and money. Machine learning can be used to predict whether leads will become a sale or a bust. This paper uses a random forest to make such predictions for a direct phone marketing campaign conducted by a Portuguese bank. The product was a term deposit.
Step 1: Collecting the Data
The dataset was procured from the UCI ML Repository. Each row is a client that the bank called, totaling 4119 observations. The dataset includes one response, a “yes” or “no” signifying if the client subscribed to the deposit account. There are 20 explanatory variables divided into 3 categories: client data, the campaign and its history with the client, and economic attributes at the time. A list of all variables and their descriptions is included in the Appendix as item [1]. A snippet of the data follows (Figure 1).
Step 2: Exploring and Preparing the Data
The dataset took some preparation before any analysis could be conducted.
The string ‘unknown’ designates a missing value in the data. To prepare the data for use in R, each element was checked for this string and replaced with NA.
The variable ‘duration’ warrants immediate removal, since its value is unknown before the phone call is made, yet it is highly predictive of the response. Removing this variable for predictive models is also recommended by the documentation accompanying the dataset. ‘Default’ is dropped from the analysis since it is a binary variable that is nearly homogeneous: 3315 values are “no,” 1 is “yes,” and the rest are missing.
There are 308 observations with missing values. All variables with missing values are qualitative with numerous categories, making imputation very difficult. Since the missing values appear randomly distributed and the variables are categorical, rows with missing values are simply removed, a loss of nearly 7.5% of the data (4119 observations down to 3811).
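For readers outside R, the cleaning steps above can be sketched generically. The following Python snippet uses toy rows with made-up values, not the bank data; it shows the same 'unknown' → missing → drop logic that the Appendix code performs with tidyverse and na.omit:

```python
# Toy rows standing in for the bank data (values are made up).
rows = [
    {"job": "admin.",  "marital": "married", "y": "no"},
    {"job": "unknown", "marital": "single",  "y": "no"},
    {"job": "retired", "marital": "married", "y": "yes"},
]

# 1. Treat the string 'unknown' as a missing value.
rows = [{k: (None if v == "unknown" else v) for k, v in r.items()} for r in rows]

# 2. Drop any row that still contains a missing value.
clean = [r for r in rows if all(v is not None for v in r.values())]

print(len(clean))  # 2 rows survive
```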
Finally, the data is ready for analysis. The response is highly imbalanced, with 431 “yes” labels and 3398 “no” labels. This will affect the results in the initial fitting of the model in Step 3 and then be rectified in Step 5.
The dataset is split 7:3 training and test.
Visualizations of the features and the response reveal few obvious patterns. There are, however, some that do stand out. The first is the client’s job. It seems that people who weren’t working had a higher probability of opening an account. The retired, unemployed and students all opened accounts at higher proportions than the other, employed clients (Figure 2). This is reflected in the ages of clients as well (Figure 3).
Figures 2 and 3.
A client’s history with previous campaigns was also important. If the client had signed up for an account during the previous campaign, they were very likely to do the same this campaign (Figure 4). Similarly, the bank keeping in regular contact with the client before this campaign led to much greater odds of success (Figure 5).
Figures 4 and 5.
All the economic attributes seemed to be weak indicators of marketing success, though the campaign was likely not spaced out long enough for them to matter anyway. The count of successful and failed solicitations alongside the Consumer Price Index illustrates the lack of trend (Figure 6).
Figure 6
Step 3: Fitting the Model
The only tuning parameter adjusted in this paper is the number of random features per tree, or “mtry.” A grid search with 10-fold cross validation was used to find the best number, using Area Under the Curve (AUC) as the performance metric.
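AUC has a convenient interpretation worth keeping in mind when reading the grid-search results: it is the probability that a randomly chosen positive observation is scored above a randomly chosen negative one. A minimal illustrative sketch of that definition follows (Python here for brevity; the paper itself computes AUC via caret and ROCR in R):

```python
# AUC computed directly as the Mann-Whitney statistic: the fraction of
# (positive, negative) pairs where the positive gets the higher score.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pairs = [(p, n) for p in pos for n in neg]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

# A model that ranks every positive above every negative has AUC = 1;
# mixing the ranking pushes AUC toward 0.5 (random guessing).
print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0
print(auc([0.9, 0.2, 0.8, 0.3], [0, 1, 1, 0]))  # 0.25
```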
The grid search results for parameter values 1 through 8 show the AUC, the True Positive Rate (TPR) and the False Positive Rate (FPR) (Figure 7). Weighing AUC against TPR, mtry=4 is chosen, as the marginal gains in Sensitivity seem to taper off past that number. It is also the rounded square root of the number of features in the dataset. A random forest with mtry=4 was then fit on the entire training set, using the ranger package in R.
| mtry | AUC | TPR (Sensitivity) | FPR (1 – Specificity) |
| 1 | 0.7585082 | 0.0000000 | 1.0000000 |
| 2 | 0.7524035 | 0.1387931 | 0.9915895 |
| 3 | 0.7486975 | 0.1735222 | 0.9869659 |
| 4 | 0.7377460 | 0.1841133 | 0.9865458 |
| 5 | 0.7283485 | 0.1908867 | 0.9815037 |
| 6 | 0.7257443 | 0.1874384 | 0.9819221 |
| 7 | 0.7277262 | 0.2046798 | 0.9810836 |
| 8 | 0.7245718 | 0.2046798 | 0.9798231 |
Figure 7, first eight parameter values searched.
Step 4: Evaluating the Model
The TPR and AUC are used to gauge the performance of the model. An ROC plot was generated (Figure 8). AUC is equal to 0.7903. The marginal gains in the TPR diminish after 0.60; from there, the cost to the FPR becomes very high.
Figure 8.
The random forest returns the probability of an observation being positive (“yes” label), and the ROC plot is used to determine the best cutoff probability for positive predictions that will maximize revenue without incurring excessive costs from pursuing bad leads. Here, a theoretical cost structure is implemented in which the cost of pursuing two bad leads equals the profit of closing one sale. As such, the goal is for the forest to generate predictions that lead salespeople to close one sale for every three total leads pursued. For this first model, the cutoff is 0.1416, which brings the TPR and FPR to 0.632 and 0.1551, respectively. These are the green dashed lines in the ROC plot.
It may seem strange that the cutoff is so low, but since the model saw very few “yes” labels during training, the forest assigns very low probabilities to positive observations. As will be seen in the next section, the cutoff only shifts the operating point along the ROC curve; it does not change the curve itself.
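The cutoff search can be illustrated with a toy example (Python, hypothetical scores and labels; the actual analysis uses ROCR in R): at each candidate cutoff, count the sales closed (TP) and dead-end leads pursued (FP), and keep the lowest cutoff at which profit under the two-bad-leads-per-sale cost structure is still non-negative.

```python
# Hypothetical scores and labels (1 = sale closed), not the paper's data.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1,   0,   1,   0,   0,   0,   0,   0,   0,   0]

def breakeven_cutoff(scores, labels):
    """Lowest cutoff at which profit (TP - FP/2) is still non-negative,
    assuming two dead-end leads cost the profit of one closed sale."""
    best = None
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        if tp - fp / 2 >= 0:   # still profitable at this depth
            best = cut
    return best

print(breakeven_cutoff(scores, labels))  # 0.4
```

At the returned cutoff of 0.4, the toy model pursues six leads and closes two sales, i.e. one sale per three leads, which is the target ratio described above.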
Step 5: Improving the Model
There are 288 “yes” labels versus 2379 “no” labels in the training dataset, making “yes” only 11% of the dataset. This biases the model toward accurately predicting negative labels at the cost of predicting positive ones. To shift the model away from this bias, stratified sampling was used so that each tree in the random forest was fed equal numbers of both labels. The chosen sample size for each tree was 20% of the dataset, half “yes” labels and half “no.” This percentage follows the convention in the ranger package’s documentation, where each tree’s bagged sample is by default the size of the entire dataset, which presumes a roughly 50-50 split of the labels. Since positive labels make up only about 11% of the data, each class is instead sampled at roughly the minority share, giving bagged samples of about 10% + 10% = 20% of the total training set size.
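The fraction arithmetic is simple enough to verify directly. A purely illustrative sketch using the class counts quoted above (Python; in the actual model these fractions are what get passed to ranger's sample.fraction argument):

```python
# Class counts from the training set described above.
n_yes, n_no = 288, 2379
n_total = n_yes + n_no

# Share of the minority class; each tree samples this fraction from
# BOTH classes, so every bagged sample is balanced 50-50.
minority_share = n_yes / n_total
sample_fraction = [minority_share, minority_share]

# Each bag then holds about 2 * 288 = 576 rows, roughly 20% of the data.
bag_size = round(n_total * sum(sample_fraction))
print(round(minority_share, 3), bag_size)  # 0.108 576
```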
Using this technique, a grid search again yielded mtry=4. From there, the model was fit on the entire training set using stratified sampling and yielded an AUC of 0.7948, an improvement of nearly half a point. Adhering to the cost structure, the new cutoff is 0.4196, with TPR and FPR of 0.688 and 0.1698, respectively. Compared to 0.632 and 0.1551 in the previous model, stratified sampling gained over 5 points of TPR while increasing the FPR by only 1.5 points.
Using the second model allows the bank to expand revenue without incurring additional net costs. Hypothetically, if the test set were new unlabeled data and the model’s predictions were followed, 86 of the 258 leads pursued would lead to sales. Compared to 79 sales and 158 dead-end leads for the previous model, that’s 7 more sales at the cost of 14 more dead-end leads. The ROC plot for the second random forest is below (Figure 9).
Figure 9.
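The revenue projections in this section reduce to multiplying the operating-point rates by the class counts. A sketch with made-up round numbers (Python; these are not the paper's actual test-set counts):

```python
def leads_pursued(tpr, fpr, n_pos, n_neg):
    """Expected sales and total leads pursued at a given operating point."""
    sales = round(tpr * n_pos)   # true positives pursued
    dead  = round(fpr * n_neg)   # false positives pursued
    return sales, sales + dead

# Hypothetical test set of 1000 leads, 120 of which would actually close,
# evaluated at the improved model's operating point.
print(leads_pursued(0.688, 0.1698, 120, 880))  # (83, 232)
```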
Conclusion
Machine learning’s usefulness in sifting through leads was demonstrated, allowing the firm to confidently increase revenue without additional net cost. Stratified sampling was very helpful in allowing the algorithm to learn from the labels of interest, which would otherwise be obscured in the larger dataset.
References
Moro, S. (2014). UCI Machine Learning Repository: Bank Marketing Data Set. Retrieved from https://archive.ics.uci.edu/ml/datasets/bank+marketing
Appendix
[1] Variable descriptions
| 1 | age | numeric |
| 2 | job | type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown') |
| 3 | marital | marital status (categorical: 'divorced', 'married', 'single', 'unknown'; note: 'divorced' means divorced or widowed) |
| 4 | education | (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown') |
| 5 | default | has credit in default? (categorical: 'no', 'yes', 'unknown') |
| 6 | housing | has housing loan? (categorical: 'no', 'yes', 'unknown') |
| 7 | loan | has personal loan? (categorical: 'no', 'yes', 'unknown') |
| 8 | contact | contact communication type (categorical: 'cellular', 'telephone') |
| 9 | month | last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec') |
| 10 | day_of_week | last contact day of the week (categorical: 'mon', 'tue', 'wed', 'thu', 'fri') |
| 11 | duration | last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no') |
| 12 | campaign | number of contacts performed during this campaign and for this client (numeric, includes last contact) |
| 13 | pdays | number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted) |
| 14 | previous | number of contacts performed before this campaign and for this client (numeric) |
| 15 | poutcome | outcome of the previous marketing campaign (categorical: 'failure', 'nonexistent', 'success') |
| 16 | emp.var.rate | employment variation rate - quarterly indicator (numeric) |
| 17 | cons.price.idx | consumer price index - monthly indicator (numeric) |
| 18 | cons.conf.idx | consumer confidence index - monthly indicator (numeric) |
| 19 | euribor3m | euribor 3 month rate - daily indicator (numeric) |
| 20 | nr.employed | number of employees - quarterly indicator (numeric) |
| 21 | y | has the client subscribed a term deposit? (binary: 'yes', 'no') |
[2] Code
Step 1: Collecting the Data
library(Amelia)
library(tidyverse)
library(ranger)
library(caret)
library(ROCR)
bank1 <- read.csv("bank-additional.csv", sep=";")
# Remove duration (column 11)
bank2 <- bank1[, -11]
Step 2: Exploring and Preparing the Data
# Convert 'unknown' to NA, then remove rows with NAs
bank2 <- bank2 %>%
  mutate_all(funs(type.convert(as.character(replace(., . == 'unknown', NA)))))
# Check for missing values
missmap(bank2, main = "Missing values vs observed")
cat("\nNumber of missing values in each column\n")
sapply(bank2, function(x) sum(is.na(x)))
cat("\nNumber of unique values in each column\n")
sapply(bank2, function(x) length(unique(x)))
# Remove default
bank2 <- bank2[, !names(bank2) %in% c("default")]
# Remove all rows with missing values
bank2 <- na.omit(bank2)
Training and test sets are made with a 7:3 split.
set.seed(7297)
bank <- bank2
indx <- sample(1:nrow(bank), as.integer(0.7 * nrow(bank)))
# Keep the labels in separate vectors for the ROCR functions below
bank_train <- bank[indx, ]
bank_test <- bank[-indx, ]
bank_train_labels <- bank[indx, names(bank) %in% c("y")]
bank_test_labels <- bank[-indx, names(bank) %in% c("y")]
Step 3: Training the Model
Grid search is used to determine how many random variables should be included in each tree. Parameter values tested are 1 to 19. Given the binary nature of the response, AUC is used as the performance metric. 10-fold Cross Validation is used to calculate AUC.
Due to the alphabetized ordering of the factor labels, “no” constitutes a positive value and “yes” a negative one; thus Sensitivity and Specificity are switched from what would normally be expected, with Sensitivity being the accuracy of identifying “no” labels and Specificity that of “yes” labels.
set.seed(7297)
ctrl <- trainControl(method = "cv", number = 10,
                     summaryFunction = twoClassSummary, classProbs = TRUE)
grid_rf <- expand.grid(.mtry = c(1:19), .splitrule = c("gini"), .min.node.size = 1)
m_rf <- train(y ~ ., data = bank_train, method = "ranger",
              metric = "ROC", trControl = ctrl, tuneGrid = grid_rf)
m_rf
ROC (AUC) was used to select the optimal model, taking the largest value; the final value used for the model was mtry = 4. Setting probability = TRUE makes ranger return class probabilities instead of labels.
set.seed(7297)
rf <- ranger(y ~ ., bank_train, mtry = 4, probability = TRUE)
rf
Step 4: Evaluating the Model
The ROC curve shows that a Sensitivity of just below 60% is probably the best this model can do without sacrificing too much Specificity.
rf.pr <- predict(rf, bank_test)$predictions[, 2]
# prediction is a ROCR function
rf.pred <- prediction(rf.pr, bank_test_labels)
# performance in terms of true and false positive rates
rf.perf <- performance(rf.pred, "tpr", "fpr")
plot(rf.perf, main = "ROC Curve for Random Forest", col = 2, lwd = 2)
abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")
# compute area under the curve
auc <- performance(rf.pred, "auc")@y.values
auc
# Cost structure: the profit of one TP equals the cost of two FPs
# Compare the Sensitivity and Specificity
ss_comparison <- cbind(rf.perf@y.values[[1]], rf.perf@x.values[[1]])
# Multiply the TPR and FPR by the number of Yes' and No's, respectively,
# to get the number of TP and FP leads at each cutoff.
# Divide the FP column by 2 because of the cost structure.
ss_comparison2 <- round(cbind(rf.perf@y.values[[1]] * table(bank_test$y)[2],
                              rf.perf@x.values[[1]] * table(bank_test$y)[1] / 2))
# Store the indices where the number of TPs equals FP/2 (break-even)
ss_index <- which(ss_comparison2[, 1] == ss_comparison2[, 2])
ss_index
# Find the TPR and FPR for the selected indices
ss_comparison[ss_index, ]  # 0.632, 0.1550540 (vs 0.688, 0.1697743 for the improved model)
# Probability cutoffs for those rates
rf.pred@cutoffs[[1]][ss_index]
plot(rf.perf, main = "ROC Curve for Random Forest", col = 2, lwd = 2)
abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")
abline(a = ss_comparison[ss_index[length(ss_index)], 1], b = 0, lwd = 1, lty = 2, col = "green")
abline(v = ss_comparison[ss_index[length(ss_index)], 2], lwd = 1, lty = 2, col = "green")
Step 5: Improving Performance
Stratified sampling is used to balance the classes. The vector sample.fraction specifies, for each class, the fraction of the dataset to sample for each tree.
set.seed(7297)
ctrl <- trainControl(method = "cv", number = 10,
                     summaryFunction = twoClassSummary, classProbs = TRUE)
grid_rf <- expand.grid(.mtry = c(1:19), .splitrule = c("gini"), .min.node.size = 1)
m_rf <- train(y ~ ., data = bank_train, method = "ranger",
              metric = "ROC", trControl = ctrl, tuneGrid = grid_rf,
              sample.fraction = c(.1, .1))
m_rf
set.seed(7297)
rf2 <- ranger(y ~ ., bank_train, probability = TRUE, mtry = 4,
              replace = TRUE, keep.inbag = TRUE, sample.fraction = c(.1, .1))
rf2
rf.pr2 <- predict(rf2, bank_test)$predictions[, 2]
# prediction is a ROCR function
rf.pred2 <- prediction(rf.pr2, bank_test_labels)
# performance in terms of true and false positive rates
rf.perf2 <- performance(rf.pred2, "tpr", "fpr")
plot(rf.perf2, main = "ROC Curve for Random Forest", col = 2, lwd = 2)
abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")
# compute area under the curve
auc2 <- performance(rf.pred2, "auc")
auc2@y.values
# Cost structure: the profit of one TP equals the cost of two FPs
# Compare the Sensitivity and Specificity
ss_comparison21 <- cbind(rf.perf2@y.values[[1]], rf.perf2@x.values[[1]])
# Multiply the TPR and FPR by the number of Yes' and No's, respectively,
# to get the number of TP and FP leads at each cutoff.
# Divide the FP column by 2 because of the cost structure.
ss_comparison22 <- round(cbind(rf.perf2@y.values[[1]] * table(bank_test$y)[2],
                               rf.perf2@x.values[[1]] * table(bank_test$y)[1] / 2))
# Store the indices where the number of TPs equals FP/2 (break-even)
ss_index2 <- which(ss_comparison22[, 1] == ss_comparison22[, 2])
ss_index2
# Find the TPR and FPR for the selected indices
# (use the second model's matrix, not the first model's ss_comparison)
ss_comparison21[ss_index2, ]  # 0.688, 0.1697743 (vs 0.632, 0.1550540 for the first model)
# Probability cutoffs for those rates
rf.pred2@cutoffs[[1]][ss_index2]
plot(rf.perf2, main = "ROC Curve for Random Forest", col = 2, lwd = 2)
abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")
abline(a = ss_comparison21[ss_index2[length(ss_index2)], 1], b = 0, lwd = 1, lty = 2, col = "green")
abline(v = ss_comparison21[ss_index2[length(ss_index2)], 2], lwd = 1, lty = 2, col = "green")