Data Preparation

Imported the dataset into Excel from CSV converting it to an Excel file, after which data was initially explored in Excel, after which it was imported into Rstudio for analysis and visualization

Packages Used in this stage

  • library(readxl): Used to Import Excel file into RStudio.

Code Implementation

insurance_data <- read_excel("insurance_dataset.xlsx")

Data Preprocessing

Packages Used in this stage

  • library(lubridate): Used to manipulate and work with date data type.
  • library(tidyverse): used the dplyr and ggplot sub packages for manipulation and visuals.

Feature Engeinerring

Certain new variables were created out of existing one for the purpose of analysis and visualization, the mutate and ifelse functions was used at this stage to achieve this.

  1. Calculate Age and Group it
# Extract the Current year and month
current_year <- 2024
current_month <- 6

# Extract the Birth year and month for policyholders
insurance_data$birth_year <- year(insurance_data$birthdate)
insurance_data$birth_month <- month(insurance_data$birthdate) 

# calculate the age 
insurance_data <- insurance_data %>%
  mutate(
     age = ifelse(birth_month > current_month, current_year - birth_year - 1, current_year - birth_year)
  )

# to group this ages into interval for the aim of exploration 
age_breaks <- c(21, 31, 41, 51, 61, 71, 75)
age_lables <- c("21-30", "31-40", "41-50", "51-60", "61-70", "71-74")

insurance_data <- insurance_data %>%
  mutate(
    age_group = cut(age, breaks = age_breaks, labels = age_lables, right = FALSE)
  )
  1. To Encode Categorical Variables into Numeric for Correlation Purpose
# Encode the data to turn the character data type to numeric for correlation
insurance_data_encoded <- insurance_data %>%
  mutate(make_encoded = as.numeric(factor(car_make)),
         model_encoded = as.numeric(factor(car_model)),
         color_encoded = as.numeric(factor(car_color)))
  1. Defining Risk profiles(high and low), claim status (frequent(high) and infrequent(low)), high household income, and estimated premium based on claim_freq, claim_amount and household_income.
#Define risk profile of policyholders based on claim frequency and claim amount
 insurance_data <- insurance_data %>%
   mutate(risk_profile = ifelse(claim_freq >= 2 | claim_amt > 75000, "High", "Low"))
 
# Convert risk_profile to a binary variable for logistic regression
 insurance_data <- insurance_data %>%
   mutate(high_risk = ifelse(risk_profile == "High", 1, 0))

# Define a threshold for what frequent claim is 
insurance_data <- insurance_data %>%
  mutate(claim_freq_status = ifelse(claim_freq >= 2, "Frequent", "Infrequent"))

# Convert claim_freq to a binary variable for logistic regression
insurance_data <- insurance_data %>%
  mutate(frequent_claimer = ifelse(claim_freq_status == "Frequent", 1, 0))

# Define what Low Claim Frequency and High House Hold income is 
insurance_data <- insurance_data %>%
  mutate(low_claim_high_income = ifelse(claim_freq < 2 
                                        & household_income > 198000,"Yes", "No"))

# Convert low_claim_high_income to a binary variable for logistic regression
insurance_data <- insurance_data %>%
  mutate(highi_lowc = ifelse(low_claim_high_income == "Yes", 1, 0))


# Estimate premiums based on claim amounts and frequency 
insurance_data <- insurance_data %>%
    mutate(estimated_premium = 0.1 * claim_amt + 100 * claim_freq)

Assumption Used at this Stage

For defining the risk profiles, claim status, and low claim frequency and high household income based on claim_frequency, claim_amount, and household_income was not done based on judgement but rather using certain descriptive statistics techniques such as mean, standard deviation, max and min and quartile to determine what should be appropriate for this. For the premiums it was estimated based on a certain % of what splendor insurance will charge based on the claim amount and also a fixed amount of the claim_freq as certain policy holders who claim more will be high risk policyholders that status of their risk will have to be adjusted for same for low risk policy holders when calculating premium.

Method of Analysis

Certain methods were used to extract insights out of the dataset

Packages Used in this stage

  • library(tidyverse): used the dplyr and ggplot sub packages for manipulation and visuals.
  • library(caret): used for regression analysis.
  • library(SmartEDA): used for Automated EDA.
  • library(Skimr): used as well for quick EDA and check for missing values.

EDA

An automated EDA was done to explore the dataset and understand frequency distribution, datatype attributes, relationships, missing value check and lot more.

#  Automated Data Exploration 
#Exp Report(insurance_data,op_file = "EDA_report.html")
skim_without_charts(insurance_data)
Data summary
Name insurance_data
Number of rows 37542
Number of columns 28
_______________________
Column type frequency:
character 14
factor 1
numeric 12
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ID 0 1 10 10 0 37541 0
marital_status 0 1 6 9 0 4 0
car_use 0 1 7 10 0 2 0
gender 0 1 4 6 0 2 0
kids_driving_ch 0 1 3 5 0 4 0
parent 0 1 2 3 0 2 0
education 0 1 3 11 0 4 0
car_make 0 1 2 13 0 78 0
car_model 0 1 1 23 0 1011 0
car_color 0 1 3 10 0 19 0
coverage_zone 0 1 5 12 0 5 0
risk_profile 0 1 3 4 0 2 0
claim_freq_status 0 1 8 10 0 2 0
low_claim_high_income 0 1 2 3 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
age_group 0 1 FALSE 6 61-: 7186, 41-: 7087, 31-: 7071, 51-: 7062

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
kids_driving 0 1 0.42 0.74 0.00 0.00 0.00 1.00 3.00
car_year 0 1 2000.29 9.05 1909.00 1995.00 2002.00 2007.00 2013.00
claim_freq 0 1 0.51 1.02 0.00 0.00 0.00 1.00 4.00
claim_amt 0 1 50028.51 28706.52 0.04 25439.41 49455.89 74974.93 99997.70
household_income 0 1 147247.41 59145.59 45004.91 96162.18 146674.90 198277.42 249991.11
birth_year 0 1 1975.85 15.30 1949.00 1963.00 1976.00 1989.00 2002.00
birth_month 0 1 6.53 3.44 1.00 4.00 7.00 10.00 12.00
age 0 1 47.65 15.30 21.00 34.00 48.00 61.00 74.00
high_risk 0 1 0.34 0.47 0.00 0.00 0.00 1.00 1.00
frequent_claimer 0 1 0.12 0.32 0.00 0.00 0.00 0.00 1.00
highi_lowc 0 1 0.22 0.42 0.00 0.00 0.00 0.00 1.00
estimated_premium 0 1 5053.88 2872.65 0.00 2597.43 5005.02 7546.02 10398.03

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
birthdate 0 1 1949-10-21 2002-10-18 1976-04-21 16525

Visuals and Aggregate Functions

This were used to explore the data for pattern and trends among the various categorical variables and numeric variables. The function used were ggplot for visuals summarise for applying an aggregation such as sum and mean for a numeric variable group_by to group by a categorical variable filter to narrow out observations not needed. This technique was applied across every question that the project to sort to provide answer too. below is a code sample

# common characteristics among policyholders who make frequent claims?
# Gender
insurance_data %>%
  group_by(claim_freq_status, gender) %>% 
  filter(claim_freq_status == "Frequent") %>%
  summarise(Count = n(), .groups = 'drop') %>%
  ggplot(aes(x=claim_freq_status, y=Count, fill = gender)) +
  geom_col(position = 'dodge', width = 0.4) +
  labs(title = "Distrubtion of PolicyHolders that Claim Frequently based on Gender",
       x="Frequent Claimers",
       y="Value") +
  theme_minimal()

# key characteristics of policyholders with low claim frequencies and high household incomes
insurance_data %>%
  group_by(low_claim_high_income, marital_status) %>% 
  filter(low_claim_high_income == "Yes") %>%
  summarise(Count = n(), .groups = 'drop') %>%
  ggplot(aes(x=low_claim_high_income, y=Count, fill = marital_status)) +
  geom_col(position = 'dodge', width = 0.4) +
  labs(title = "Distribution of Policyholders with Low Claim Frequency and 
       High Household Income based on Marital Status",
       x="Low Claim & High Income",
       y="Value") +
  theme_minimal()

Logistic Regression

Was used to provide insights for certain questions this included the Risk Assessment and Customer Segmentation questions below is a code sample

# A. factors that are most indicative of high-risk policyholders

# Using Logistic Regression for other features
#logistic_model_1 <- glm(high_risk ~ household_income + education + kids_driving_ch + parent, data = insurance_data, family = binomial)

# Based on Car features
#logistic_model_2 <- glm(high_risk ~ car_use + car_make + car_model + car_color , data = insurance_data, family = binomial) 
 
 # Based on Demographics
 logistic_model_3 <- glm(high_risk ~ age_group + gender + marital_status + 
                           coverage_zone,
                          data = insurance_data, family = binomial) 

Correlation

This was to done to understand and provide insights into relationship that existed between variables to be exact the question on Are there any specific vehicle characteristics (e.g., make, model, year) that correlate with higher claim frequencies or amounts?

# Select only the needed numeric variables out of the encoded data
numeric_data <- insurance_data_encoded %>% 
  select(claim_amt, claim_freq, car_year, make_encoded, model_encoded, color_encoded)
# calcuate correlation among the variables
cor(numeric_data)
##                  claim_amt   claim_freq     car_year  make_encoded
## claim_amt      1.000000000  0.002058273  0.001201309 -0.0066980780
## claim_freq     0.002058273  1.000000000  0.003845844  0.0032000747
## car_year       0.001201309  0.003845844  1.000000000 -0.0292848318
## make_encoded  -0.006698078  0.003200075 -0.029284832  1.0000000000
## model_encoded -0.001034638  0.001940056  0.092446074  0.0289658081
## color_encoded -0.003110628 -0.006654749 -0.002555016  0.0003793521
##               model_encoded color_encoded
## claim_amt      -0.001034638 -0.0031106285
## claim_freq      0.001940056 -0.0066547487
## car_year        0.092446074 -0.0025550157
## make_encoded    0.028965808  0.0003793521
## model_encoded   1.000000000 -0.0011830157
## color_encoded  -0.001183016  1.0000000000

Challenges and Solutions

Challenges

  • Data and Question Complexity: The level of mainpulation needed to get get insight for each question was quite difficult.
  • Interpreting logistic regression: Being my first major time applying a machine learning techqiue, interpreating the results was quick hard. and contextual knowledge.
  • Technical Issues with R Studio: Frequent hangs and slowdowns in R Studio, likely due to the extensive operations and code running simultaneously, disrupted the smooth progress of analysis.
  • Variable Selection Conflict: Deciding on the most informative variables for analysis and those that would best answer my research questions was a recurring dilemma.

Solutions

  • Efficient Data Handling: Leveraged powerful R packages like dplyr and tidyverse for effective data management.
  • Leverage Online Resources: Utilized online resources to get a better understanding and interpretation of Logistic Regression model.
  • Technical Workarounds: Implemented measures to mitigate technical issues, such as optimizing code for efficiency and breaking down analysis into smaller, manageable segments.
  • Iterative Variable Selection: Adopted an iterative approach to variable selection, continuously refining the choices based on the evolving understanding of the dataset and project objectives.