All You Need to Know About Handling High Dimensional Data

Customer Segmentation using Unsupervised and Supervised Learning

Targeting potential customers is an essential task for organizations. It helps them boost revenue and tailor their offerings to the right customers. Moreover, it helps them understand why particular segments of people do not use their services.

In this post, we will study ways of preprocessing a high-dimensional dataset and preparing it for analysis with machine learning algorithms. We will use the power of machine learning to segment customers from a mail-order campaign, understand their demographics, and predict potential future customers.

The data for this task is provided by Arvato Financial Services as part of the Udacity Data Scientist Nanodegree. For confidentiality reasons, this data is not publicly available and can only be accessed via Udacity. The provided files include a demographics dataset for the general population, a demographics dataset for customers of the mail-order company, and the MAILOUT train and test datasets from a targeted mail-order campaign.

The provided features can be grouped into various levels, ranging from person-level attributes to household- and region-level attributes.

Our data is high dimensional and consists of 366 features. We need to filter out the important features, and hence a lot of preprocessing is required. We will go through the data cleaning steps one by one:

1. Drop Features with Missing Values

Our data contains many missing values, which render some features meaningless for further evaluation. The bar plot below shows the percentage of nulls in each feature. After analyzing the plot, I decided to drop features with more than 30% of their values missing.
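Below is a minimal sketch of this step, assuming the demographics data has already been loaded into a pandas DataFrame called df (the variable name is illustrative):

```python
import pandas as pd

# df: demographics DataFrame loaded elsewhere, e.g. pd.read_csv(..., sep=";")
# Percentage of missing values per feature
null_pct = df.isnull().mean() * 100

# Drop features with more than 30% missing values
cols_to_drop = null_pct[null_pct > 30].index
df = df.drop(columns=cols_to_drop)
```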

2. Check Columns with Similar Values

Next, we find the features where a single value accounts for the vast majority of observations. Such features will not help us differentiate between customers, so we drop the features where more than 90% of the values are identical, as sketched below.
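A sketch of this check, again assuming the working DataFrame is called df:

```python
# Share of the single most frequent value in each feature (NaNs included)
dominant_share = df.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
)

# Drop features where one value covers more than 90% of the rows
cols_to_drop = dominant_share[dominant_share > 0.9].index
df = df.drop(columns=cols_to_drop)
```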

This approach helped us drop 24 columns from our data.

3. Convert Unknowns to NaNs

The data provided to us has some peculiar issues. Some features use multiple values to denote an 'unknown', e.g. both 0 and 9 may represent an unknown in the same feature. This distorts our data, since several values carry the same meaning but are represented differently. We convert these values to NaNs.
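One way to do this is with a mapping from feature names to their 'unknown' codes, built from the attribute description files. The feature names and codes below are only illustrative:

```python
import numpy as np

# Hypothetical mapping: feature -> codes that mean "unknown"
unknown_codes = {
    "FEATURE_A": [-1, 0],
    "FEATURE_B": [0, 9],
}

for col, codes in unknown_codes.items():
    if col in df.columns:
        # Replace every "unknown" code with a proper missing value
        df[col] = df[col].replace(codes, np.nan)
```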

4. Encode Categorical Data

Next, we encode our categorical features using one-hot encoding. One-hot encoding converts each unique category into a binary feature. The following code snippet can be used for this purpose.
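A minimal example using pandas' get_dummies; the column names are placeholders for the actual categorical features in our data:

```python
# Columns to one-hot encode (placeholders for the real categorical features)
categorical_cols = ["CATEGORICAL_FEATURE_1", "CATEGORICAL_FEATURE_2"]

# Each unique category becomes its own binary (0/1) column
df = pd.get_dummies(df, columns=categorical_cols)
```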

5. Handle Mixed Data Types

Some features in our dataset mix string and integer values and need to be converted to a consistent format. The code snippet below shows an example of a mixed-type feature and how to handle it.
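For instance, a feature might store some values as integers and others as strings, including non-numeric placeholders. A hedged sketch, with an illustrative column name:

```python
# "MIXED_FEATURE" is a placeholder for a column holding values like 1, 2, "3", "X"
# Coerce everything to numeric; values that cannot be parsed become NaN
df["MIXED_FEATURE"] = pd.to_numeric(df["MIXED_FEATURE"], errors="coerce")
```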

6. Handle Numeric Data

Once we have encoded the categoricals and handled the mixed types, we can move on to the numerical data. For numerical data, we should always check the distribution and then decide on a processing strategy. In our case, we check the skewness of our numeric columns.

We transform the skewed features using a log transformation to bring them closer to a normal distribution. This is a popular transformation, but you should also consider other approaches, such as binning, depending on your data.
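A sketch of the skewness check and transformation; the threshold of 1.0 is a common rule of thumb, not a value fixed by the project:

```python
import numpy as np

numeric_cols = df.select_dtypes(include="number").columns

# Absolute skewness of each numeric feature
skewness = df[numeric_cols].skew().abs()
skewed_cols = skewness[skewness > 1.0].index

# log1p handles zeros; assumes the skewed features are non-negative
df[skewed_cols] = np.log1p(df[skewed_cols])
```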

Numeric features before (left) and after (right) log transformation

7. Detect and Remove Outliers

Outliers are observations that differ substantially from the rest of the data. They can take extreme values that distort the distribution of our data, so it is important to deal with them while preparing the data for further meaningful analysis.

Since our data does not strictly follow a normal distribution, we use Tukey's method to remove outliers. Tukey's method flags points that lie more than 1.5 times the inter-quartile range (IQR) beyond the quartiles, i.e. below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, as outliers.
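A simple implementation of this rule, applied to the numeric columns of our working DataFrame:

```python
import pandas as pd

def remove_outliers_tukey(data, cols):
    """Drop rows falling outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in any of `cols`."""
    mask = pd.Series(True, index=data.index)
    for col in cols:
        q1, q3 = data[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        within = data[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
        # Keep missing values; they are handled in the imputation step
        mask &= within | data[col].isna()
    return data[mask]

numeric_cols = df.select_dtypes(include="number").columns
df = remove_outliers_tukey(df, numeric_cols)
```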

8. Impute Data

Missing values can be handled in multiple ways. If you have very few missing values compared to the size of your dataset, you may simply drop the rows containing them. If dropping rows would discard too much data, you can impute the missing values instead.

A very simple imputation technique is provided by sklearn's SimpleImputer, which lets you fill a feature with its mean, median, or most frequent value. More advanced techniques such as KNNImputer or the MICE algorithm (sklearn's IterativeImputer) can be used to achieve better results.
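A short sketch of both options; the choice of the median strategy here is an assumption, not the only valid one:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
# For MICE-style or KNN imputation:
# from sklearn.experimental import enable_iterative_imputer  # noqa: F401
# from sklearn.impute import IterativeImputer, KNNImputer

# Assumes all remaining columns are numeric after encoding
imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```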

9. Remove Correlated Features

After preprocessing, we are still left with around 300 features, so we need to reduce the dimensionality of our data further. Let's look at a couple of approaches.

1. Feature Selection using SHAP

Using the SHAP package and a base XGBoost classifier, we selected the 50 most important features from our data. These features are shown in the plot below.
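A sketch of how such a ranking can be produced. It assumes a feature matrix X and a binary label y (for example, customer vs. general population) have been prepared for this auxiliary model; those names, and the exact label used, are assumptions on my part:

```python
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

# X: preprocessed feature DataFrame, y: binary label (assumed to exist)
model = xgb.XGBClassifier(n_estimators=100)
model.fit(X, y)

# SHAP values quantify each feature's contribution to the model output
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Rank features by mean absolute SHAP value and keep the top 50
importance = (
    pd.DataFrame({"feature": X.columns,
                  "mean_abs_shap": np.abs(shap_values).mean(axis=0)})
    .sort_values("mean_abs_shap", ascending=False)
)
top_features = importance["feature"].head(50).tolist()

# Summary plot similar to the one shown below
shap.summary_plot(shap_values, X, max_display=50)
```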

2. Principal Component Analysis

Now that we have reduced our feature set, we can perform further analysis via principal component analysis (PCA). PCA is a dimensionality reduction technique that combines the input variables into new components that explain the maximum variance in the data, so the least important components can be dropped. Note that PCA retains the most valuable information from all the variables, but the interpretability of the original features is lost.
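A sketch of this step with scikit-learn, assuming df_imputed holds the cleaned, imputed features:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize features so no single feature dominates the variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_imputed)

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(pca.n_components_)                       # number of retained components
print(pca.explained_variance_ratio_.cumsum())  # data for the scree plot below
```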

For our data, the scree plot shows that 35 principal components explain 95% of the variability. You can choose a lower threshold depending on your use case.

Scree plot for PCA

Next, we explore the features that are most strongly correlated with our top two principal components, which together explain about 35% of the variability in our data.

We notice that attributes related to finance and age are highly correlated in the first principal component whereas attributes related to lifestyle and family are representative of the second principal component.

Once we have reduced our dimensions using PCA, we segment the general population into clusters using the KMeans algorithm. KMeans is a relatively simple algorithm that groups similar points into clusters based on a distance metric, usually Euclidean distance.

We fit KMeans on the general population data and then use it to assign the customer data to clusters, so we can identify the clusters with the most prominent share of potential customers.
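A sketch of the clustering step, assuming X_general_pca and X_customers_pca are the PCA-transformed general population and customer matrices (the names are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Elbow method: fit KMeans for a range of k and record the inertia
k_values = range(2, 15)
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_general_pca)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("Inertia")

# Final model with the k chosen from the elbow plot below
best_k = 6
kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X_general_pca)
general_clusters = kmeans.labels_
customer_clusters = kmeans.predict(X_customers_pca)
```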

From the elbow plot below, we observe that the optimal value of k for our data is 6, as the curve begins to flatten thereafter.

Elbow plot to identify the optimal value of k

Using this value of k, we cluster our general and customer population into k distinct clusters and the results are shown below.

We clearly notice that the customers are over-represented in cluster 1 and under-represented in clusters 2 and 5. Let's dive into the characteristics that differentiate these customers from the general population.

Based on our findings, the customer population in cluster 1 differs from the general population in the following categories:

Since customers are almost non-existent in cluster 5, we examine the characteristics of this cluster to identify our non-target population. We observe the following from our analysis:

We have now explored our data and understood the characteristics of our target customers. Let's move on to the most interesting part: modeling. We will use the MAILOUT datasets described at the beginning of this article for the supervised learning task.

We undertake the following steps:

Optuna is a hyperparameter tuning framework that works seamlessly with Python and other machine learning frameworks. You can read more about using Optuna with XGBoost in this blog post.
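A hedged sketch of how the tuning loop might look; the parameter ranges, trial count, and the X_train / y_train names are assumptions:

```python
import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = xgb.XGBClassifier(**params)
    # Maximize cross-validated AUC on the MAILOUT training data
    return cross_val_score(model, X_train, y_train, scoring="roc_auc", cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```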

5. Once we have found our best model parameters, we evaluate the model on the test data. The choice of evaluation metric is highly important for any data science project. For our project, we use the AUC (area under the ROC curve) score.

A ROC curve (receiver operating characteristic curve) is a graph that shows the performance of a classification model at all classification thresholds. It is a plot between the TPR (True Positive Rate) and FPR (False Positive Rate).

TPR is a synonym for recall and tells us what proportion of the positive class was classified correctly. FPR tells us what proportion of the negative class was incorrectly classified.

The value of AUC ranges between 0 and 1, where 0 means every prediction is wrong and 1 means every prediction is correct; a score of 0.5 corresponds to random guessing.
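Computing the score with scikit-learn is straightforward; model, X_test, and y_test are placeholders for the tuned classifier and the held-out data:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Probability of the positive class for each test example
y_proba = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, y_proba)                # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)   # points of the ROC curve
```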

We found the following avenues to improve our model predictions:

In this blog post, we discussed ways of preprocessing high-dimensional data. We then used unsupervised learning to segment customers into groups and understand their demographics. Lastly, we looked at the supervised task of predicting target customers from our data. Hopefully, this analysis adds to the wealth of knowledge on machine learning available online.
