Data in retail

It is important for an organization to understand which of its customers are ‘Loyal Customers’, as this helps it build long-term relationships with them. With a huge number of customers, organizations have difficulty identifying a Loyal Customer.

Objective: The aim is to use the industry’s historic data to build a model that identifies whether a current customer is a ‘Loyal Customer’. The provided data set contains the target variable ‘Active_Customer’: 1 signifies that the customer is loyal and 0 signifies that the customer is not.

Data Source: The data was part of an online competition https://rang.shinyapps.io/Competition/

Exploring Data

The following steps were followed in this analysis:

Understanding the Dimensions of Data
Treating missing values
Analyzing relation between response variable and independent variables

Dimensions of Data

There are two data sets in this analysis:
Train Data: the response variable is already present in this data set
Test Data: the response variable needs to be predicted for this data set
[Figure: dataset details]
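As a minimal sketch of how the dimensions might be checked in R, the snippet below uses small toy data frames in place of the real Train and Test data. Apart from Active_Customer, the column names (Cust_Tenure, Monthly_Spend) are hypothetical and not taken from the competition data.

```r
# Toy stand-ins for the Train and Test data sets.
# Apart from Active_Customer, the column names are hypothetical.
train <- data.frame(
  Cust_Tenure     = c(12, 30, 7, 45, 22),
  Monthly_Spend   = c(250, 900, 120, 1500, 430),
  Active_Customer = c(0, 1, 0, 1, 1)
)
test <- train[1:3, c("Cust_Tenure", "Monthly_Spend")]  # no response column

# dim() returns the number of rows and the number of columns.
train_dim <- dim(train)
test_dim  <- dim(test)
print(train_dim)
print(test_dim)

# Columns present in Train but absent from Test: the response to be predicted.
response_col <- setdiff(names(train), names(test))
print(response_col)
```

With the real data, the two data frames would typically be loaded with read.csv() before calling dim().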

Treating Missing values

How missing values are treated depends on the type of the variable whose value is missing, and on the percentage of values missing in that column.
For the Train data set, the percentage of missing values was quite low for every variable.
As the percentage for each variable is low (less than 3.5%), the records with missing values can be left out of the analysis.
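The percentage calculation and the removal of incomplete records could be sketched as follows. The data frame is a toy example with hypothetical column names (only Active_Customer comes from the data set description), so the percentages printed here are not those of the real data.

```r
# Toy Train data with a few missing values; column names other than
# Active_Customer are hypothetical.
train <- data.frame(
  Cust_Tenure     = c(12, NA, 7, 45, 22, 18, 9, 31, 27, 5),
  Monthly_Spend   = c(250, 900, 120, NA, 430, 610, 75, 980, 300, 150),
  Active_Customer = c(0, 1, 0, 1, 1, 1, 0, 1, 0, 0)
)

# Percentage of missing values per column:
# is.na() gives a logical matrix, colMeans() the share of TRUE per column.
missing_pct <- colMeans(is.na(train)) * 100
print(missing_pct)

# Drop every record that has at least one missing value.
train_complete <- na.omit(train)
print(nrow(train_complete))
```

Dropping rows with na.omit() is reasonable only when few records are affected, as is the case described above.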

Analyzing Relation between variables

The response variable in this scenario is ‘Active_Customer’, which is a categorical variable.
‘Active_Customer’ needs to be analyzed against each independent variable to see whether the two are related.
Several methods were used to understand the association between the response variable and the independent variables.
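A minimal sketch of how such an association check might look in R. It assumes two common choices that the original analysis may or may not have used: comparing the mean of a numeric variable within each class of the response, and cross-tabulating a categorical variable against the response. The data and the column names Monthly_Spend and Region are hypothetical.

```r
# Toy data; Monthly_Spend and Region are hypothetical column names.
train <- data.frame(
  Monthly_Spend   = c(120, 150, 200, 250, 300, 400, 650, 700, 800, 900, 950, 1000),
  Region          = factor(rep(c("North", "South"), times = 6)),
  Active_Customer = factor(c(0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1))
)

# Numeric variable: compare its mean within each class of the response.
mean_by_class <- tapply(train$Monthly_Spend, train$Active_Customer, mean)
print(mean_by_class)

# Categorical variable: cross-tabulate it against the response.
region_table <- table(train$Region, train$Active_Customer)
print(region_table)

# Fisher's exact test on the cross-tabulation (suitable for small counts).
region_test <- fisher.test(region_table)
print(region_test$p.value)
```

A clear difference between the class means, or a small p-value for the cross-tabulation, would suggest that the variable is worth including in the model.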

Resulting Plots

[Figure: resulting plots]

Test Data – Treating missing values

The ‘Test’ data set should not have any missing values for the variables on which the model is built.
Again, the percentage of missing data is calculated for all the variables included in the model.
The percentage of missing values is very low for all these variables (less than 3%), so the missing values are replaced with the mean of the corresponding column.
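Mean imputation of this kind could be sketched as below. The toy Test data and the names in model_vars are hypothetical; with the real data, model_vars would hold the variables actually used in the model.

```r
# Toy Test data with missing values; column names are hypothetical.
test <- data.frame(
  Cust_Tenure   = c(12, NA, 7, 45),
  Monthly_Spend = c(250, 900, NA, 1500)
)
model_vars <- c("Cust_Tenure", "Monthly_Spend")  # hypothetical model variables

# Replace each missing value with the mean of the observed values in its column.
for (v in model_vars) {
  col_mean <- mean(test[[v]], na.rm = TRUE)
  test[[v]][is.na(test[[v]])] <- col_mean
}
print(test)
```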

Building a Model – Logistic Regression

As the response variable is categorical in this analysis, logistic regression is used to build a model that fits the values in the Train data set. In R, the glm() function is used to build a logistic regression model. Below is an example of the glm() function.
[Figure: glm() example]
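A minimal sketch of a glm() call for logistic regression, using toy data. The predictors Cust_Tenure and Monthly_Spend are hypothetical; the formula in the original analysis would list the variables selected from the real data.

```r
# Toy Train data; predictor names are hypothetical.
train <- data.frame(
  Cust_Tenure     = c(5, 8, 12, 15, 20, 22, 25, 30, 35, 40, 45, 50),
  Monthly_Spend   = c(300, 150, 400, 200, 650, 250, 500, 800, 450, 900, 700, 950),
  Active_Customer = c(0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1)
)

# family = binomial makes glm() fit a logistic regression (logit link).
model <- glm(Active_Customer ~ Cust_Tenure + Monthly_Spend,
             data = train, family = binomial)
print(coef(model))
```

The coefficients are on the log-odds scale: a positive coefficient means that larger values of the variable raise the estimated odds of being a loyal customer.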

Exploring the Model – Logistic Regression

The summary() function helps in understanding the model.
Below are the details.
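As an illustration of what summary() reports for a logistic regression, the sketch below fits a model on toy data (hypothetical predictor names) and extracts the coefficient table. The numbers it prints belong to the toy data, not to the original analysis.

```r
# Toy Train data; predictor names are hypothetical.
train <- data.frame(
  Cust_Tenure     = c(5, 8, 12, 15, 20, 22, 25, 30, 35, 40, 45, 50),
  Monthly_Spend   = c(300, 150, 400, 200, 650, 250, 500, 800, 450, 900, 700, 950),
  Active_Customer = c(0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1)
)
model <- glm(Active_Customer ~ Cust_Tenure + Monthly_Spend,
             data = train, family = binomial)

# Full summary: coefficients, deviance, AIC.
model_summary <- summary(model)
print(model_summary)

# Coefficient table only: estimate, standard error, z value and p-value.
coef_table <- coef(model_summary)
print(coef_table)
```

The p-value column, Pr(>|z|), is commonly used to judge which variables contribute to the model.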

Predicting Values

The next step is to use the model built on the Train data set to predict values for the Test data set.
The predict() function in R is used to predict the probability for each record in the provided data set (the Test data set in this case). For a glm() model this requires type = "response", since the default output is on the log-odds scale.
Following is an example of how the predict() function is used.

[Figure: predict() example]
The output of this function is a probability (a value between 0 and 1) for each record, obtained by applying the prediction model to the Test data set.
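A minimal sketch of the prediction step, using toy Train and Test data with hypothetical predictor names. The 0.5 cut-off used to turn probabilities into 0/1 labels is an assumption for illustration and is not stated in the original analysis.

```r
# Toy Train and Test data; predictor names are hypothetical.
train <- data.frame(
  Cust_Tenure     = c(5, 8, 12, 15, 20, 22, 25, 30, 35, 40, 45, 50),
  Monthly_Spend   = c(300, 150, 400, 200, 650, 250, 500, 800, 450, 900, 700, 950),
  Active_Customer = c(0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1)
)
test <- data.frame(
  Cust_Tenure   = c(10, 28, 48),
  Monthly_Spend = c(200, 600, 900)
)

model <- glm(Active_Customer ~ Cust_Tenure + Monthly_Spend,
             data = train, family = binomial)

# type = "response" returns probabilities instead of log-odds.
prob <- predict(model, newdata = test, type = "response")

# Assumed cut-off of 0.5 to convert probabilities into class labels.
predicted_class <- ifelse(prob >= 0.5, 1, 0)
print(data.frame(test, prob, predicted_class))
```

A different cut-off may suit the business better, for example a lower one if missing a loyal customer is costlier than a false alarm.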

Written on May 20, 2016