Exploratory Data Analysis with R: Customer Churn

13 minute read

R Code: Exploratory Data Analysis with R

Subscription based services typically make money in the following three ways:

  1. Acquire new customers
  2. Upsell customers
  3. Retain existing customers

In this article I’m going to focus on customer retention. To do this, I’m going to perform an exploratory analysis, and do some basic data cleaning.

I’ll generate some questions focused on customer segments to help guide the analysis. Based off of the insights gained, I’ll provide some recommendations for improving customer retention. In a future article I’ll build a customer churn predictive model.

Wrangling the Data

The dataset I’m going to be working with can be found on the IBM Watson Analytics website.

This is a sample dataset for a telecommunications company. We can start by taking a look at the dimensions of the data, as well as the different features.

library(tidyverse)
library(miscset)

# Setting the working directory
path_loc <- "C:/Users/jsullivan/Desktop/Storage/Data Science/Portfolio/Projects/Churn Prediction"
setwd(path_loc)

# reading in the data
df <- read_csv("Telco data.csv")

# dimensions of the data
dim_desc(df)
## [1] "[7,043 x 21]"
# names of the data
names(df)
##  [1] "customerID"       "gender"           "SeniorCitizen"   
##  [4] "Partner"          "Dependents"       "tenure"          
##  [7] "PhoneService"     "MultipleLines"    "InternetService"
## [10] "OnlineSecurity"   "OnlineBackup"     "DeviceProtection"
## [13] "TechSupport"      "StreamingTV"      "StreamingMovies"
## [16] "Contract"         "PaperlessBilling" "PaymentMethod"   
## [19] "MonthlyCharges"   "TotalCharges"     "Churn"

Taking a look we see that there are 21 features, and 7043 rows of observances. The features are named pretty well, such as “PhoneService” and “TechSupport.” The target feature we’ll be attempting to predict is “Churn”. We can dig a little deeper and take a look at the data types of the features.

# data types
glimpse(df)
## Observations: 7,043
## Variables: 21
## $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "77...
## $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "...
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "N...
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "N...
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 5...
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"...
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone ser...
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "F...
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", ...
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", ...
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", ...
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "N...
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "...
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "N...
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month...
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes"...
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed c...
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89....
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820....
## $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", ...

The data contains various categorical features that are character types, such as the “Dependents” feature, that has values of either “Yes” or “No”. There’s also numeric types, which includes “MonthlyCharges” and “TotalCharges”. The “SeniorCitizen” variable is an integer type, but it really represents “Yes” and “No” so we’ll convert that to a factor. We’ll investigate the “tenure” variable, which is also an integer, later on.

For now, let’s start by transforming the character variables, as well as the “SeniorCitizen”” variable, to factor types.

df <- df %>% mutate_if(is.character, as.factor)
df$SeniorCitizen <- as.factor(df$SeniorCitizen)
glimpse(df)
## Observations: 7,043
## Variables: 21
## $ customerID       <fctr> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFO...
## $ gender           <fctr> Female, Male, Male, Male, Female, Female, Ma...
## $ SeniorCitizen    <fctr> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Partner          <fctr> Yes, No, No, No, No, No, No, No, Yes, No, Ye...
## $ Dependents       <fctr> No, No, No, No, No, No, Yes, No, No, Yes, Ye...
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 5...
## $ PhoneService     <fctr> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Ye...
## $ MultipleLines    <fctr> No phone service, No, No, No phone service, ...
## $ InternetService  <fctr> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic...
## $ OnlineSecurity   <fctr> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes,...
## $ OnlineBackup     <fctr> Yes, No, Yes, No, No, No, Yes, No, No, Yes, ...
## $ DeviceProtection <fctr> No, Yes, No, Yes, No, Yes, No, No, Yes, No, ...
## $ TechSupport      <fctr> No, No, No, Yes, No, No, No, No, Yes, No, No...
## $ StreamingTV      <fctr> No, No, No, No, No, Yes, Yes, No, Yes, No, N...
## $ StreamingMovies  <fctr> No, No, No, No, No, Yes, No, No, Yes, No, No...
## $ Contract         <fctr> Month-to-month, One year, Month-to-month, On...
## $ PaperlessBilling <fctr> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No...
## $ PaymentMethod    <fctr> Electronic check, Mailed check, Mailed check...
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89....
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820....
## $ Churn            <fctr> No, No, Yes, No, Yes, Yes, No, No, Yes, No, ...

Now lets look for missing values.

df %>% map(~ sum(is.na(.)))
## $customerID
## [1] 0
##
## $gender
## [1] 0
##
## $SeniorCitizen
## [1] 0
##
## $Partner
## [1] 0
##
## $Dependents
## [1] 0
##
## $tenure
## [1] 0
##
## $PhoneService
## [1] 0
##
## $MultipleLines
## [1] 0
##
## $InternetService
## [1] 0
##
## $OnlineSecurity
## [1] 0
##
## $OnlineBackup
## [1] 0
##
## $DeviceProtection
## [1] 0
##
## $TechSupport
## [1] 0
##
## $StreamingTV
## [1] 0
##
## $StreamingMovies
## [1] 0
##
## $Contract
## [1] 0
##
## $PaperlessBilling
## [1] 0
##
## $PaymentMethod
## [1] 0
##
## $MonthlyCharges
## [1] 0
##
## $TotalCharges
## [1] 11
##
## $Churn
## [1] 0

It looks like “TotalCharges” is the only feature with missing values. Lets go ahead and impute the 11 missing values using the median value.

# imputing with the median
df <- df %>%
  mutate(TotalCharges = replace(TotalCharges,
                                is.na(TotalCharges),
                                median(TotalCharges, na.rm = T)))

# checking that the imputation worked
sum(is.na(df$TotalCharges))
## [1] 0

Now that we’ve imported the data and done some cleaning, lets start to explore the data.

Exploring the Data

Let’s start by taking a look at the unique values of the factor variables.

df_tbl <- df %>%
  select_if(is.factor) %>%
  summarise_all(n_distinct)


df_tbl[1:8] %>%
  print(width = Inf)
## # A tibble: 1 x 8
##   customerID gender SeniorCitizen Partner Dependents PhoneService MultipleLines InternetService
##        <int>  <int>         <int>   <int>      <int>        <int>         <int>           <int>
## 1       7043      2             2       2          2            2             3               3
df_tbl[9:15] %>%
  print(width = Inf)
## # A tibble: 1 x 7
##   OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract
##            <int>        <int>            <int>       <int>       <int>           <int>    <int>
## 1              3            3                3           3           3               3        3
df_tbl[16:18] %>%
  print(width = Inf)
## # A tibble: 1 x 3
##   PaperlessBilling PaymentMethod Churn
##              <int>         <int> <int>
## 1                2             4     2

There’s a unique value for each “customerID” so we probably won’t be able to gain much information there. All of the other factors have four or fewer unique values, so they will all be pretty manageable.

To guide the analysis, I’m going to try and answer the following questions about my customer segments:

  1. Are men more likely to churn than women?
  2. Are senior citizens more like to churn?
  3. Do individuals with a partner churn more than those without a partner?
  4. Do people with dependents churn more than people that do not have dependents?

I’ll start with gender. I wouldn’t expect one gender to be more likely than another to churn, but lets see what the data shows.

ggplot(df) +
  geom_bar(aes(x = gender, fill = Churn), position = "dodge")

jpg

Taking a look, the results are similar. Roughly one quarter of the male customers churn, and roughly one quarter of the female customers churn. We can also take a look at exactly how many people from each gender churned.

df %>%
  group_by(gender,Churn) %>%
  summarise(n=n())
## # A tibble: 4 x 3
## # Groups:   gender [?]
##   gender  Churn     n
##   <fctr> <fctr> <int>
## 1 Female     No  2549
## 2 Female    Yes   939
## 3   Male     No  2625
## 4   Male    Yes   930

Next I’ll take a look at senior citizens.

#SeniorCitizen
ggplot(df) +
  geom_bar(aes(x = SeniorCitizen, fill = Churn), position = "dodge")

jpg

df %>%
  group_by(SeniorCitizen) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
## # A tibble: 2 x 3
##   SeniorCitizen     n      freq
##          <fctr> <int>     <dbl>
## 1             0  5901 0.8378532
## 2             1  1142 0.1621468
df %>%
  group_by(SeniorCitizen, Churn) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
## # A tibble: 4 x 4
## # Groups:   SeniorCitizen [2]
##   SeniorCitizen  Churn     n      freq
##          <fctr> <fctr> <int>     <dbl>
## 1             0     No  4508 0.7639383
## 2             0    Yes  1393 0.2360617
## 3             1     No   666 0.5831874
## 4             1    Yes   476 0.4168126

This variable shows a much more meaningful relationship. Roughly 16% of the customers are senior citizens, and roughly 42% of those senior citizens churn. On the other hand, of the 84% of customers that are not senior citizens, only 24% churn. These results show that senior citizens are much more likely to churn.

Now I’m going to take a look at people with partners.

#Partner
ggplot(df) +
  geom_bar(aes(x=Partner, fill = Churn), position = "dodge")

jpg

df %>%
  group_by(Partner) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
## # A tibble: 2 x 3
##   Partner     n      freq
##    <fctr> <int>     <dbl>
## 1      No  3641 0.5169672
## 2     Yes  3402 0.4830328
df %>%
  group_by(Partner, Churn) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
## # A tibble: 4 x 4
## # Groups:   Partner [2]
##   Partner  Churn     n      freq
##    <fctr> <fctr> <int>     <dbl>
## 1      No     No  2441 0.6704202
## 2      No    Yes  1200 0.3295798
## 3     Yes     No  2733 0.8033510
## 4     Yes    Yes   669 0.1966490

Roughly half of the people have partners. Of the people with partners, 20% churn. For people without partners, approximately 33% churn.

Next, I’ll take a look at the Dependents category.

ggplot(df) +
  geom_bar(aes_string(x="Dependents", fill="Churn"), position = "dodge")

jpg

df %>% group_by(Dependents, Churn) %>%
  summarise(n=n()) %>%
  mutate(freq = n / sum(n))
## # A tibble: 4 x 4
## # Groups:   Dependents [2]
##   Dependents  Churn     n      freq
##       <fctr> <fctr> <int>     <dbl>
## 1         No     No  3390 0.6872086
## 2         No    Yes  1543 0.3127914
## 3        Yes     No  1784 0.8454976
## 4        Yes    Yes   326 0.1545024
df %>% group_by(Dependents) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
## # A tibble: 2 x 3
##   Dependents     n      freq
##       <fctr> <int>     <dbl>
## 1         No  4933 0.7004118
## 2        Yes  2110 0.2995882

Approximately 30% of the people have dependents, of which 15% churn. For the other 70% that don’t have dependents, 31% churn.

Another useful visualization is the box and whisker plot. This gives us a little bit more compact visual of our data, and helps us identify outliers. Lets take a look at some box and whisker plots for total charges of the different customer segments.

# Senior Citizens
ggplot(df, aes(x = SeniorCitizen, y = TotalCharges)) +
  geom_boxplot()

jpg

# Partner
ggplot(df, aes(x = Partner, y = TotalCharges)) +
  geom_boxplot()

jpg

# Dependents
ggplot(df, aes(x = Dependents, y = TotalCharges)) +
  geom_boxplot()

jpg

After looking at these initial results, we can ask some more questions. We might want to compare the total charges of senior citizens, people without partners, and people without dependents.

These seem to be the subsets of people most likely to churn within their respective customer segments. Lets compare them so that we can identify where we would potentially focus our efforts.

# Total charges and tenure of senior citizens
df %>%
  select(SeniorCitizen, Churn, TotalCharges, tenure) %>%
  filter(SeniorCitizen == 1, Churn == "Yes") %>%
  summarize(n = n(),
            total = sum(TotalCharges),
            avg_tenure = sum(tenure)/n)
## # A tibble: 1 x 3
##       n    total avg_tenure
##   <int>    <dbl>      <dbl>
## 1   476 882405.2   21.03361
# Total charges and tenure of people without a partner
df %>%
  select(Partner, Churn, TotalCharges, tenure) %>%
  filter(Partner == "No", Churn == "Yes") %>%
  summarise(n = n(),
            total = sum(TotalCharges),
            avg_tenure = sum(tenure)/n)
## # A tibble: 1 x 3
##       n   total avg_tenure
##   <int>   <dbl>      <dbl>
## 1  1200 1306776   13.17667
# Total charges and tenure of people without dependents
df %>%
  select(Dependents, Churn, TotalCharges, tenure) %>%
  filter(Dependents == "No", Churn == "Yes") %>%
  summarise(n = n(),
            total = sum(TotalCharges),
            avg_tenure = sum(tenure)/n)
## # A tibble: 1 x 3
##       n   total avg_tenure
##   <int>   <dbl>      <dbl>
## 1  1543 2261840   17.12314

Here’s a summary of the total charges for each customer segment that churned:

Customer SegmentTotal Charges
Senior Citizens900,000
No Partners1,300,000
No Dependents2,300,000

Based on the results, we should focus our efforts on people without dependents. This customer segment that churned had nearly 2.3MM in total charges compared to 1.3MM for people without partners, and only 900K for senior citizens.

Let’s dig a little deeper and see what services that customer segment uses.

dependents <- df %>% filter(Dependents == "No")

ggplotGrid(ncol=2,
lapply(c("PhoneService","MultipleLines","InternetService","OnlineSecurity","OnlineBackup",
         "DeviceProtection"),
       function(col){
         ggplot(dependents,aes_string(col)) + geom_bar(aes(fill=Churn),position="dodge")
       }))

jpg

ggplotGrid(ncol=2,
lapply(c("TechSupport","StreamingTV","StreamingMovies","Contract",
         "PaperlessBilling"),
       function(col){
         ggplot(dependents,aes_string(col)) + geom_bar(aes(fill=Churn),position="dodge")
       }))

jpg

ggplot(dependents) +
  geom_bar(aes(x=PaymentMethod,fill=Churn), position = "dodge")

jpg

Taking a look at the results, we gain some potential insights. Based on these insights, here are some recommendations for improving customer retention:

  1. A lot of people with phone service churned. Maybe these people don’t really use the phone service. Moving them to a plan without phone service to save them some money on their bill might help retain them.
  2. People with fiber optic internet churned much more than people with DSL or no internet at all. Maybe moving some of those people to DSL or eliminating their internet service would be an option. Another option could be some sort of price reduction to their fiber optic plan as some sort of a promotion for being a loyal customer.
  3. People without online backup, device protection, and online security churn fairly frequently. Maybe their devices have crashed, causing them to lose valuable files. They may have also experienced fraud or identity theft that has left them very unhappy. Moving these people to some of these services may help safeguard their systems, thus preventing a lot of unwanted headaches.
  4. Similarly to online backup and security, those without device protection tended to churn more than those that subscribed ot the service. Adding device protection to their plans may be a good way to prevent churn.
  5. Those without tech support tend to churn more frequently than those with tech support. Moving customers to tech support accounts might be another potential way to prevent churn.

There are a number of other different insights that we could gain from the data, but this would be a good initial list to investigate further if the company had even more detailed data sets.

In the next article, I’ll develop a predictive model. I’ll make some assumptions about customer acquisition and customer retention costs. This will allow me to put some actual dollar amounts behind the potential cost savings of my model.