GOOGLE ANALYTICS CUSTOMER REVENUE PREDICTION

Nikhil Bokade
15 min read · Feb 21, 2020

TABLE OF CONTENTS :-

1. BUSINESS PROBLEM

2. USE OF MACHINE LEARNING

3. DATA SOURCE

4. DATA DESCRIPTION

5. EXPLORATORY DATA ANALYSIS (EDA)

6. EXISTING SOLUTION

7. FIRST CUT SOLUTION

8. FEATURE ENGINEERING

9. MODELING

10. COMPARING MODELS

11. KAGGLE SUBMISSION

12. FUTURE WORK

13. REFERENCES

14. GITHUB REPO

15. LINKEDIN PROFILE

BUSINESS PROBLEM :

In this competition we are challenged to analyze a Google Merchandise Store (also called GStore, where Google swag is sold) customer dataset to predict the revenue that can be generated from each customer in the future.

Marketing teams want to make appropriate investments in promotional strategies to attract potential customers, but many small businesses have only a small percentage of customers who generate most of the revenue. This makes it difficult to identify potential customers and the amount of revenue that can be generated from them in the future.

USE OF MACHINE LEARNING :

Customer segmentation and customer lifetime value (LTV) prediction are the main challenges faced by any marketer. Businesses have a huge amount of marketing-relevant data from various sources such as email campaigns, website visits and lead data. Using data mining and machine learning, accurate predictions for individual marketing offers and incentives can be achieved.

Using ML, savvy marketers can eliminate the guesswork involved in data-driven marketing. For example, given the pattern of behavior of a user during a trial period and the past behaviors of all users, the chances of conversion to the paid version can be predicted. A model of this decision problem would allow a program to trigger customer interventions to persuade the customer to convert early or better engage in the trial.

DATA SOURCE :

The dataset has been obtained from Kaggle at the following link:

https://www.kaggle.com/c/ga-customer-revenue-prediction/data

DATA DESCRIPTION :

The dataset consists of 2 files, train_v2.csv and test_v2.csv. Both files have 13 columns, a few of which are in JSON format. Each row in the dataset is one visit to the store.

JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']

After normalizing the JSON fields, the shape of the train data is (1708337, 60) and the shape of the test data is (401589, 59).

  • train_v2.csv — contains user transactions from August 1st 2016 to April 30th 2018.
  • test_v2.csv — contains user transactions from May 1st 2018 to October 15th 2018.

IMPORTANT: Due to the formatting of fullVisitorId, you must load the IDs as strings in order for all IDs to be properly unique!

Data Fields

  • fullVisitorId — A unique identifier for each user of the Google Merchandise Store.
  • channelGrouping — The channel via which the user came to the Store.
  • date — The date on which the user visited the Store.
  • device — The specifications for the device used to access the Store.
  • geoNetwork — This section contains information about the geography of the user.
  • socialEngagementType — Engagement type, either “Socially Engaged” or “Not Socially Engaged”.
  • totals — This section contains aggregate values across the session.
  • trafficSource — This section contains information about the Traffic Source from which the session originated.
  • visitId — An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.
  • visitNumber — The session number for this user. If this is the first session, then this is set to 1.
  • visitStartTime — The timestamp (expressed as POSIX time).
  • hits — This row and nested fields are populated for any and all types of hits. Provides a record of all page visits.
  • customDimensions — This section contains any user-level or session-level custom dimensions that are set for a session. This is a repeated field and has an entry for each dimension that is set.
  • totals — This set of columns mostly includes high-level aggregate data.

CODE TO LOAD DATA SET :
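Below is a minimal sketch of how the dataset can be loaded and the JSON columns flattened, following the approach of the Kaggle kernel linked in the references (it assumes pandas >= 1.0 for pd.json_normalize):

    import json
    import pandas as pd

    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']

    def load_df(csv_path, nrows=None):
        # Parse each JSON column while reading, and load fullVisitorId
        # as a string so all IDs stay properly unique.
        df = pd.read_csv(csv_path,
                         converters={col: json.loads for col in JSON_COLUMNS},
                         dtype={'fullVisitorId': 'str'},
                         nrows=nrows)
        # Flatten each JSON column into top-level columns like device.browser.
        for col in JSON_COLUMNS:
            flat = pd.json_normalize(df[col].tolist())
            flat.columns = [f'{col}.{c}' for c in flat.columns]
            df = df.drop(col, axis=1).merge(flat, left_index=True,
                                            right_index=True)
        return df

    train = load_df('train_v2.csv')
    test = load_df('test_v2.csv')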

WHAT ARE WE EVEN PREDICTING ?

We are predicting the natural log of the sum of all transactions per user. This will be for all users in test_v2.csv for December 1st, 2018 to January 31st, 2019.

Note: The dataset does NOT contain data for December 1st 2018 to January 31st 2019. We must identify the unique fullVisitorId’s in the provided test_v2.csv and make predictions for them for those unseen months.
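As a small sketch of the target (assuming the flattened DataFrame train from the loading code above), the per-user value can be computed like this:

    import numpy as np

    # Missing revenue means no purchase, so treat NaN as 0.
    train['totals.transactionRevenue'] = (
        train['totals.transactionRevenue'].astype(float).fillna(0))

    # Natural log of (1 + sum of all transactions per user).
    target = np.log1p(
        train.groupby('fullVisitorId')['totals.transactionRevenue'].sum())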

EXPLORATORY DATA ANALYSIS :

Now that we have some idea of the data and its format, let's look at the data itself and see if there are any missing values.

Looking at the data, it can be observed that some columns in both the train and test datasets have the value ‘not available in demo dataset’, and many columns have NaN values.

Analysis of the target variable, totals.transactionRevenue :-

The above plot clearly shows that only 1.219% of customers have made any kind of transaction; the remaining 98.781% of customers have just visited the store and have not made any purchase.

As we can clearly see here, for many businesses only a small percentage of customers produce most of the revenue.

Similarly, the test data has only 1.4% of users who have made a transaction.

SCATTER PLOT OF TARGET :-

Let's look at the scatter plot of the transaction revenue with respect to fullVisitorId.

Clearly, very few customers have made purchases compared to the total number of customers who visited.

In the given data, 98.7% of the transaction revenue values are missing and are to be treated as 0, according to the organizer of the competition.

PDF OF TARGET VARIABLE :-

Since the target = log(1 + sum(per-user transactions)), we will look at the PDF of the target to understand more about its distribution.

Observation:-

  • The target variable has a wide range of values and its distribution is slightly right skewed.
  • The mean of the log revenue distribution is 17.9.

MISSING VALUE ANALYSIS :-

Now let's visualize the columns with missing data and the amount of data missing per column, using horizontal bar plots for both the train and test data.

Missing values in train data per column
Missing values in test data per column
Percentage of missing values per column in train data

Observation :-

  • There are many columns with more than 95% missing values.

FEATURES CORRELATION : -

Let's take a look at the correlation between the top correlated features using a heatmap.

correlation between features

Observation :-

  • We can see that there is some correlation between device.isMobile and year, between visitId and device.isMobile, and between visitId and year.
  • If the correlation coefficient is +1 or -1, the features/variables are strongly related to each other, and if it is 0 they are linearly uncorrelated.

FEATURE-WISE DATA ANALYSIS

ANALYZING DATE AND VISIT START TIME :-

We first convert the date column into a proper datetime format and then extract a few features like day, weekday, month and year. We also have a visitStartTime column, from which we get the hour at which a customer's session started.

Converting the date column to a proper datetime format and extracting features from it
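A minimal sketch of these conversions (visitHour is a name introduced here for illustration):

    import pandas as pd

    # date is stored as an integer like 20170315.
    train['date'] = pd.to_datetime(train['date'].astype(str), format='%Y%m%d')
    train['day'] = train['date'].dt.day
    train['weekday'] = train['date'].dt.weekday  # 0 = Monday, 6 = Sunday
    train['month'] = train['date'].dt.month
    train['year'] = train['date'].dt.year

    # visitStartTime is a POSIX timestamp in seconds.
    train['visitHour'] = pd.to_datetime(train['visitStartTime'], unit='s').dt.hour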

TRANSACTION REVENUE VS WEEKDAYS:-

Bar plot of transaction revenue wrt weekday

Observation:-

We can clearly see that the transaction revenue on weekdays is higher compared to weekends. Also note that 0 on the x-axis in the above plot is Monday and 6 is Sunday.

TRANSACTION REVENUE VS MONTH:-

Observation :-

We can clearly see that sales in the months of April and September are higher compared to the rest of the months, and sales in May are the lowest.

TRANSACTION REVENUE VS YEAR :-

Observation :-

The revenue generated in the years 2016 and 2017 is almost the same, whereas there is a decrease in 2018 compared to the previous years.

TRANSACTION REVENUE VS VISIT HOUR :-

Observation :-

We observe that the late hours of the day have more transaction revenue compared to the early hours.

HOW TARGET VARIABLE CHANGE WITH TIME

Observation:-

The revenue can be viewed as a time series, as there seems to be some pattern here.

Let's also take a look at the number of users who visited the store over time.

Users visiting the store per day

Observation:-

We can observe a sudden spike in the graph at the end of 2017; this may be because of the festive season.

Now let's analyze the other columns.

GEO NETWORK CATEGORY :-

geoNetwork has subcategories, as seen below; we analyze them one by one.

  1. GEO NETWORK CONTINENT :-
Average and total revenue per continent

The first plot shows the average revenue per continent, i.e. the mean of the transaction revenue per continent, and the second plot gives us the total transaction amount per continent.

Observation:-

The average transaction revenue might be higher for Africa, but the total transaction revenue is by far the highest for the Americas; this shows that most of the revenue is coming from the Americas.

GEO NETWORK COUNTRY :-

Average and total revenue per country

Observation:-

The average revenue generated is high in countries like Japan, Canada and the United States, whereas the total transaction revenue is highest for the United States.

ANALYSIS OF TOTALS CATEGORY :-

TOTALS.HITS

The term “hit” is perhaps the most misused term in online marketing, mistakenly used to mean unique visitors, visits, page views, or all of the above. A hit is merely a request for a file from a Web server.

A request for a Web page counts as a hit, but so does a request for a graphic on a Web page.

This is not very informative.

TOTALS.PAGEVIEWS :-

Observation:-

There is an increase in average transaction revenue with the increase in the number of page views.

TRAFFIC SOURCE CATEGORY :-

  1. trafficSource.source :-

Observation:-

More traffic and revenue are generated from the Direct and Google sources.

2. trafficSource.medium :-

Medium: the general category of the source; for example, organic search (organic), cost-per-click paid search (cpc), or web referral (referral).

Observation :-

Most of the revenue is generated from organic search and other mediums.

DEVICE CATEGORY :-

Device Browser :-

Observation :-

Average revenue generation is high for browsers like Firefox, Chrome and Edge. Total revenue generated is highest for the Chrome browser.

Device Category :-

Observation :-

Most of the revenue is generated from desktop devices and very little from mobile devices.

DEVICE OPERATING SYSTEM :-

Observation :-

Most of the users from whom most of the revenue is generated use the Windows, Macintosh and Chrome operating systems.

ORGANIC SEARCH :-

  1. CHANNEL GROUPING:-

Channel grouping is a rule-based grouping of marketing channels. Channel groups are created for two main reasons: 1) to change the way Google Analytics labels and aggregates the incoming traffic for advanced data analysis, and 2) to quickly check the performance of a set of marketing channels or a set of traffic sources.

Observation:-

Most of the revenue is generated from the Referral, Direct and Display channel groupings.

EXISTING SOLUTION TO SOLVE THIS PROBLEM :-

This is the 36th place solution. For the data preprocessing part, categorical columns with only two values (one actual value and NaN) were converted into binary columns with 0 and 1; this was done only for columns where NaN can be taken as zero, not for all columns. For the totals.bounces category, some values were updated: for anyone who “bounced” from the website, their sessionQuality was set to 1 (the lowest possible) and their timeOnSite was set to 1 (the lowest possible).

He also removed those columns which had only one value and did not add much information/value to the model.

Next, for the feature engineering part:

  1. Time until final visit
  2. Time of last visit (max visit)
  3. Time of first visit (min visit)
  4. Difference between time of last and first visits
  5. Difference between current visit and first/last visits
  6. Ratio of 5 and 4
  7. Take the length of the gclId
  8. Obtain hour, day of year, weekday, week, month, and year values from the local visit time.
  9. Binary column if trafficSource has the word “google” in it
  10. Binary column if trafficSource has the word “.com” in it
  11. Binary column if referralPath has the word “google” in it
  12. Binary column if the referralPath has the word “youtube” in it
  13. Binary column if the referralPath has the word “intl” in it
  14. Binary column if the referralPath has the word “html” in it
  15. Obtain the last 3 characters of the gclId (this can be a new category)
  16. Consolidate browsers into fewer categories based on research. Include a new category called “nonhuman” for spiders and scrapers
  17. Consolidate operatingSystem into fewer categories

A few other feature engineering steps were also followed here.

This is at a visit level, but we need to make user-based predictions. Therefore he aggregated all these features (categorical and numerical) by taking maxes, mins, sums, means, standard deviations, heads, tails and modes. Apart from that, he also took the log of these aggregate features, as he saw an improvement in the model's performance.

Using the above features, he trains the model.

Here he solves it as a combined classification and regression task: first he predicts whether the customer will return in the given future time period, and then, if the customer returns, how much revenue we can expect from them. That is, if a customer is not returning we take the target as 0; otherwise we predict the target using the model and then calculate the RMSE value. A minimal sketch of this two-stage idea follows.
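As a hedged sketch of that two-stage idea (X, y_return, y_revenue and X_test are hypothetical user-level arrays, not taken from his code):

    import lightgbm as lgb

    # Stage 1: will the user return in the target window? (y_return is 0/1)
    clf = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.01)
    clf.fit(X, y_return)

    # Stage 2: how much log revenue, trained only on users who returned.
    reg = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.01)
    reg.fit(X[y_return == 1], y_revenue[y_return == 1])

    # Expected target: probability of returning times predicted log revenue.
    pred = clf.predict_proba(X_test)[:, 1] * reg.predict(X_test)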

FIRST CUT SOLUTION :-

So first, look at the image below to get an idea of the train dataset, the test dataset and the future prediction window.

Since we have to predict the revenue generated by the users in the test dataset for a future time period (the 3rd window in the picture above), we create our training dataset as follows.

Train data Preparation :-

We create 4 non-overlapping windows from the train dataset, each 168 days long. After creating these windows, we calculate features for each window; the features we create are discussed in a later part of this blog. The target for the users in each window is then calculated on a 62-day window that starts after a gap of 46 days from the end of that window. Hope this makes sense!!

So, if there is a user in the first window, to calculate the target (the revenue generated by this user) we first check whether this user returns, after the 46-day gap, within the 62-day time frame. If yes, we calculate the target for that user over the 62-day time frame.

FEATURE ENGINEERING:

Feature engineering is the process of going from raw data to data that is ready for modeling. It can serve multiple purposes:

  • Make the models easier to interpret (e.g. binning)
  • Capture more complex relationships (e.g. NNs)
  • Reduce data redundancy and dimensionality (e.g. PCA)
  • Rescale variables (e.g. standardizing or normalizing)

We do some basic feature engineering: since we have the visit start time with us, we calculate features like the time of the first session from the start of the window period, and the time of the last session from the end of the window period.

Other features involve aggregate features on both the categorical and numerical columns. All the features we create are at the user level; that is, we first group by user and then perform aggregate operations on each column, as shown below.

Code for feature engineering using the train data
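A minimal sketch of these user-level aggregations (window_df is a hypothetical DataFrame holding one window of visits; the column list and the window timestamps window_start_ts / window_end_ts are assumptions):

    # Assumes the totals.* columns were already cast to numeric.
    num_cols = ['totals.hits', 'totals.pageviews', 'visitNumber']

    # Aggregate each numerical column per user.
    agg = window_df.groupby('fullVisitorId').agg(
        {col: ['sum', 'mean', 'max', 'min', 'std'] for col in num_cols})
    # Flatten the MultiIndex column names, e.g. totals.hits_sum.
    agg.columns = ['_'.join(c) for c in agg.columns]

    # Session timing relative to the window boundaries (POSIX seconds).
    times = window_df.groupby('fullVisitorId')['visitStartTime'].agg(['min', 'max'])
    agg['first_session_from_window_start'] = (times['min'] - window_start_ts) / 86400
    agg['last_session_to_window_end'] = (window_end_ts - times['max']) / 86400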

Before feature engineering, we create the windows; the code for that is below.

Creating the windows from the train data
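A minimal sketch of the windowing described above (it assumes train['date'] was already converted to datetime as in the earlier sketch):

    import pandas as pd

    WINDOW, GAP, TARGET_LEN = 168, 46, 62  # days

    start = train['date'].min()
    windows = []
    for k in range(4):
        w_start = start + pd.Timedelta(days=k * WINDOW)
        w_end = w_start + pd.Timedelta(days=WINDOW)
        t_start = w_end + pd.Timedelta(days=GAP)         # 46-day gap
        t_end = t_start + pd.Timedelta(days=TARGET_LEN)  # 62-day target window

        feats = train[(train['date'] >= w_start) & (train['date'] < w_end)]
        future = train[(train['date'] >= t_start) & (train['date'] < t_end)]
        windows.append((feats, future))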

Similarly, we do the feature engineering for the test dataset.

MODELING:-

Now we train our model using the train features and corresponding target value.

LIGHTGBM :-

LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms.

Since it is based on decision tree algorithms, it splits the tree leaf-wise with the best fit, whereas other boosting algorithms split the tree depth-wise or level-wise rather than leaf-wise. When growing on the same leaf, the leaf-wise algorithm can reduce more loss than the level-wise algorithm, and hence results in better accuracy that is rarely achieved by existing boosting algorithms. It is also surprisingly fast, hence the word ‘Light’.

The evaluation metric we use here is root mean squared error (RMSE).

I did hyperparameter tuning here and took the best parameters. The train and CV RMSE I achieved were 0.307 and 0.3145, respectively.
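A minimal training sketch (X_train, y_train, X_val and y_val are hypothetical user-level feature matrices and targets; the parameter values shown are illustrative, not the tuned ones):

    import numpy as np
    import lightgbm as lgb
    from sklearn.metrics import mean_squared_error

    params = {'objective': 'regression', 'metric': 'rmse',
              'learning_rate': 0.01, 'num_leaves': 31}

    dtrain = lgb.Dataset(X_train, label=y_train)
    dval = lgb.Dataset(X_val, label=y_val)
    model = lgb.train(params, dtrain, num_boost_round=1000, valid_sets=[dval])

    val_pred = model.predict(X_val)
    print('CV RMSE:', np.sqrt(mean_squared_error(y_val, val_pred)))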

Now that we have our trained model, we use the test data to check the model's performance. The leaderboard score gave me an RMSE of 0.88501 on the test data.

RMSE scores

PLOTTING FEATURE IMPORTANCE :-

STACKING REGRESSOR :-

Stacking regression is an ensemble learning technique to combine multiple regression models via a meta-regressor. The individual regression models are trained based on the complete training set; then, the meta-regressor is fitted based on the outputs — meta-features — of the individual regression models in the ensemble.
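A minimal sketch using mlxtend's StackingRegressor, which matches the description above (the base regressors are fit on the full training set, then the meta-regressor is fit on their outputs); the particular base models chosen here are assumptions:

    import lightgbm as lgb
    from mlxtend.regressor import StackingRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import Ridge

    # Hypothetical base models; Ridge is the meta-regressor.
    stack = StackingRegressor(
        regressors=[lgb.LGBMRegressor(n_estimators=500),
                    RandomForestRegressor(n_estimators=200)],
        meta_regressor=Ridge())
    stack.fit(X_train, y_train)
    stack_pred = stack.predict(X_val)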

The leaderboard score using the stacking regressor was 0.90166.

MULTILAYER PERCEPTRON :-

Let's try using a simple multilayer perceptron to further improve the scores.
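A minimal Keras sketch of such an MLP (the layer sizes and training settings are assumptions, not the exact architecture used):

    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Dense(128, activation='relu',
                           input_shape=(X_train.shape[1],)),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(64, activation='relu'),
        keras.layers.Dense(1),  # predicted log revenue
    ])
    model.compile(optimizer='adam', loss='mse',
                  metrics=[keras.metrics.RootMeanSquaredError()])
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=50, batch_size=1024)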

Using the MLP, I was able to improve the leaderboard score from 0.88501 to 0.88446.

HOW TO MAKE KAGGLE SUBMISSION :-

Here we need to be careful while submitting: since fullVisitorId is a string, we need to make sure the IDs are in the correct data format to be able to make a valid submission.
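A minimal submission sketch (user_ids and final_pred are hypothetical; the key point is reading fullVisitorId as a string so the merge lines up):

    import pandas as pd

    sub = pd.read_csv('sample_submission_v2.csv',
                      dtype={'fullVisitorId': 'str'})
    preds = pd.DataFrame({'fullVisitorId': user_ids,
                          'PredictedLogRevenue': final_pred})
    sub = sub[['fullVisitorId']].merge(preds, on='fullVisitorId', how='left')
    sub['PredictedLogRevenue'] = sub['PredictedLogRevenue'].fillna(0)
    sub.to_csv('submission.csv', index=False)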

COMPARING MODELS :-
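Summarizing the leaderboard RMSE scores reported above:

  • LightGBM: 0.88501
  • Stacking Regressor: 0.90166
  • Multilayer Perceptron: 0.88446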

KAGGLE SUBMISSION :-

So, with a Kaggle score of 0.88446, this brings me into the top 5% of the ranking.

FUTURE WORK:-

We could certainly have improved the score with more featurization. Also, I solved this as a pure regression problem; it can instead be solved as a combined classification and regression task to get better results, by predicting the probability of the user returning in the future and then multiplying that probability by the predicted target value (log(1 + expected revenue per user)).

REFERENCES :

https://www.appliedaicourse.com

https://www.kaggle.com/julian3833/1-quick-start-read-csv-and-flatten-json-fields

https://www.analyticsvidhya.com/blog/2019/12/6-powerful-feature-engineering-techniques-time-series/

GITHUB REPOSITORY :-

LINKEDIN PROFILE :-

https://www.linkedin.com/in/nikhil-bokade-bb7524b1/
