Ishwarya S
Jun 8,
Artificial intelligence and machine learning have been around for a long time, but the wide applicability of ML has made it a key area of focus for businesses in recent times. There are many good articles covering concepts like exploratory data analysis, data analysis, feature engineering, and classification/regression. Model evaluation, however, is rarely covered in detail, and when it is, the discussion usually stops at the handful of performance measures that ship with machine learning frameworks. This blog aims to fill that gap by focusing on one important way to evaluate a model: gain charts and lift charts.
Lift charts are used to evaluate classification models with a binary target variable. There are many metrics we can use to evaluate a model, such as accuracy, precision, recall, and the ROC curve, and some of them can be misleading. For example, if you use accuracy on data where the target rate is 2%, a model can reach 98% accuracy even though every point in the positive class is misclassified.
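The accuracy trap described above is easy to reproduce. The snippet below is a minimal sketch on made-up data (1,000 samples, 20 positives); the "model" is a hypothetical one that always predicts the majority class.

```python
# Hypothetical data: 1,000 samples with a 2% target rate (20 positives).
y_true = [1] * 20 + [0] * 980

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of all samples that were labelled correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual positives that the model found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.98
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

Accuracy looks excellent while the model finds none of the cases we actually care about, which is why a ranking-based view such as lift is useful on imbalanced targets.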
Metrics like precision, recall, and F1-score give an idea of how the model is performing, but they do not provide deeper insight into its performance. Lift charts add detail such as the rank ordering of the probability scores and the model's ability relative to a random classifier. Rank ordering is covered further in the sections below.
The steps to calculate lift are as follows:
6. Calculate lift by dividing the gain percentage by the cumulative population percentage.
Let's consider the well-known Titanic dataset from Kaggle as an example. The training dataset has 891 samples; for now, let's set the test dataset aside, since we need the actual Y to evaluate performance. The dataset has X variables such as age, passenger class, sex, and ticket fare. With the training data we build a model to predict which passengers are more likely to survive. The model assigns each passenger a probability between 0 and 1. I have built three models on the same data, LogisticRegression, SVM, and RandomForestClassifier, so that comparing their lift charts shows why the chart matters.
Gain charts give you an idea of which model is performing better, which segments to choose, and how many segments to target.
Let's assume that we have lifeboats that can save up to 270 passengers. We have to pick only those with the highest probability of survival. That's where the gain chart helps: it lets us target the specific passengers whose probability of survival is highest.
Figure 1: Gain chart for various machine learning algorithms
If you look at the plots in the above figure, you can see that the first 3 segments (i.e. 30% of the total population) contain 268 passengers. The gain at that point is the share of all survivors (342 in the training data) that fall within those 268 passengers. With the SVM model that share is 49%, or approximately 168 survivors. With Logistic Regression it is 64%, or approximately 219 survivors. With Random Forest it is 78%, or approximately 267 survivors. The last plot shows what happens if we pick the passengers randomly. It is a no-brainer that our Random Forest model is performing better.
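The readings from Figure 1 can be turned into counts and a cumulative lift with a few lines of arithmetic. This is a rough sketch that assumes gain is measured against all survivors in the training set (342 of 891 in the Kaggle data), which is how the Gain_Percentage column is computed in the lift_chart function later in this post; the "Random" entry is simply the baseline where gain equals the population share.

```python
total_survivors = 342   # survivors in the Kaggle Titanic training set (891 rows)
population_pct = 30     # top 3 segments = 30% of the passengers

# Gain percentages at 30% of the population, as read off Figure 1.
# "Random" is the baseline: picking 30% at random captures 30% of survivors.
gain_pct = {"SVM": 49, "Logistic Regression": 64, "Random Forest": 78, "Random": 30}

results = {}
for name, gain in gain_pct.items():
    survivors_captured = round(gain / 100 * total_survivors)
    cumulative_lift = gain / population_pct   # gain % / cumulative population %
    results[name] = (survivors_captured, cumulative_lift)
    print(f"{name:20s} survivors ~ {survivors_captured:3d}, lift at 30% = {cumulative_lift:.2f}")
```

Dividing each gain by 30 is the same calculation as the last step of the lift procedure, applied at the third segment.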
This type of analysis also checks that rank order is maintained, i.e. the incidence rate in higher-probability deciles should be higher than in lower-probability deciles. That indicates an incident is likely to receive a higher probability score than a non-incident. The KS value (the maximum difference between the cumulative incidence rate and the cumulative non-incidence rate across deciles) can also be extracted by adding the cumulative incidence and non-incidence rates to the table.
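One way the KS value could be computed from labels and scores is sketched below. The helper name ks_by_decile is hypothetical, and it splits the sorted rows into near-equal groups with NumPy's array_split, which may place a row or two differently from other binning methods.

```python
import numpy as np

def ks_by_decile(y_true, y_prob, n_bins=10):
    """Max gap between cumulative event and non-event shares across score deciles."""
    y_true = np.asarray(y_true)
    # Sort rows by score, highest first, then cut into near-equal deciles.
    order = np.argsort(-np.asarray(y_prob), kind="stable")
    bins = np.array_split(y_true[order], n_bins)

    events = np.array([b.sum() for b in bins])
    non_events = np.array([len(b) - b.sum() for b in bins])

    # Cumulative share (in percent) of all events / non-events seen so far.
    cum_event_pct = events.cumsum() / events.sum() * 100
    cum_non_event_pct = non_events.cumsum() / non_events.sum() * 100

    gaps = cum_event_pct - cum_non_event_pct
    return float(gaps.max()), int(gaps.argmax()) + 1   # KS value, 1-based decile

# Synthetic example: 30 events that a perfect model scores above all 70 non-events.
ks, decile = ks_by_decile([1] * 30 + [0] * 70, np.linspace(1.0, 0.0, 100))
print(ks, decile)   # 100.0 3
```

A perfect ranking reaches the maximum KS of 100 at the decile where the events run out; real models land somewhere between 0 and 100.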
The lift chart shows how much more likely we are to pick a survivor using the model than if we picked passengers at random. Let's analyse the result with a lift chart.
Figure 2: Lift of using different machine learning models over a random classifier, by decile
Here the lift in the first decile is 1, 2.02, 2.54, and 2.61 for the random model, SVM, Logistic Regression, and Random Forest classifier respectively. It means that, in the first decile, the Random Forest model captures survivors 2.61 times better than a random pick.
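A first-decile lift can be read as a survival rate: it is the survival rate inside the decile divided by the overall survival rate. The sketch below applies that to the lifts quoted above, assuming the overall rate of the Kaggle training set (342 of 891).

```python
base_rate = 342 / 891      # overall survival rate in the Kaggle training set
max_lift = 1 / base_rate   # lift if every passenger in the decile survived

first_decile_lift = {
    "Random": 1.0,
    "SVM": 2.02,
    "Logistic Regression": 2.54,
    "Random Forest": 2.61,
}

decile_rate = {}
for name, lift in first_decile_lift.items():
    # Cap at 1.0 because the quoted lifts are rounded to two decimals.
    decile_rate[name] = min(1.0, lift * base_rate)
    print(f"{name:20s} lift {lift:.2f} -> first-decile survival rate ~ {decile_rate[name]:.0%}")

print(f"maximum possible lift = {max_lift:.2f}")
```

With a base rate near 38%, no model can exceed a lift of roughly 2.61 in the first decile, so the Random Forest figure sits at the ceiling.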
import pandas as pd


def lift_chart(X, actual_target, model):
    """
    DESCRIPTION
    ___________
    Function that takes in X features, the actual target and a model object
    as input and creates Gain percentage and Lift.

    PARAMETERS
    __________
    X: DataFrame
        The X features that are used by the model.
    actual_target: DataFrame
        Actual target that is used to train the model.
    model: fit object
        The fit object returned by the training algorithm.

    RETURNS
    _______
    output_df: DataFrame
        Output dataframe with columns,
        Max_Scr : Maximum Probability Score for that Decile
        Min_Scr : Minimum Probability Score for that Decile
        Actual : Sum of targets captured by the Decile
        Total : Total population of the Decile
        Population_perc : Percentage of population in the Decile
        Per_Events : Percentage of Events in the Decile
        Gain_Percentage : Cumulative percentage of Events down the Deciles
        Cumulative_Population : Cumulative percentage of population down the Deciles
        Lift : Cumulative lift up to and including that Decile
    """
    df_data = X.copy()
    X_data = df_data.copy()
    df_data['Actual'] = actual_target
    df_data['Predict'] = model.predict(X_data)

    # Probability of the positive class (column 1 of predict_proba).
    y_Prob = pd.DataFrame(model.predict_proba(X_data))
    df_data['Prob_1'] = list(y_Prob[1])

    # Rank rows from highest to lowest score, then cut the ranks into 10 deciles.
    df_data.sort_values(by=['Prob_1'], ascending=False, inplace=True)
    df_data.reset_index(drop=True, inplace=True)
    df_data['Decile'] = pd.qcut(df_data.index, 10, labels=False)

    # Summarise each decile.
    output_df = pd.DataFrame()
    grouped = df_data.groupby('Decile', as_index=False)
    output_df['Max_Scr'] = grouped.max().Prob_1
    output_df['Min_Scr'] = grouped.min().Prob_1
    output_df['Actual'] = grouped.sum().Actual
    output_df['Total'] = grouped.count().Actual

    # Gain and lift: cumulative share of events over cumulative share of population.
    output_df["Population_perc"] = (output_df["Total"] / len(actual_target)) * 100
    output_df['Per_Events'] = (output_df['Actual'] / output_df['Actual'].sum()) * 100
    output_df['Gain_Percentage'] = output_df.Per_Events.cumsum()
    output_df["Cumulative_Population"] = output_df.Population_perc.cumsum()
    output_df["Lift"] = output_df["Gain_Percentage"] / output_df["Cumulative_Population"]
    return output_df
Here X holds the X features, actual_target is the actual y, and model is the fitted model object.
The above code returns the decile-wise gain percentage and lift as a pandas DataFrame. The output looks like Figure 3.
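If you only have predicted probabilities and no model object at hand, the same table can be built from the scores directly. The helper below is a hypothetical variant, not the function above: the name gain_lift_from_scores and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def gain_lift_from_scores(y_true, y_prob, n_bins=10):
    """Decile gain/lift table from actual labels and predicted probabilities."""
    df = pd.DataFrame({"Actual": np.asarray(y_true), "Prob_1": np.asarray(y_prob)})
    df = df.sort_values("Prob_1", ascending=False).reset_index(drop=True)
    # Bin on row position, so tied scores cannot collapse two deciles into one.
    df["Decile"] = pd.qcut(np.arange(len(df)), n_bins, labels=False)

    out = df.groupby("Decile").agg(
        Max_Scr=("Prob_1", "max"),
        Min_Scr=("Prob_1", "min"),
        Actual=("Actual", "sum"),
        Total=("Actual", "count"),
    )
    out["Population_perc"] = out["Total"] / len(df) * 100
    out["Per_Events"] = out["Actual"] / out["Actual"].sum() * 100
    out["Gain_Percentage"] = out["Per_Events"].cumsum()
    out["Cumulative_Population"] = out["Population_perc"].cumsum()
    out["Lift"] = out["Gain_Percentage"] / out["Cumulative_Population"]
    return out.reset_index()

# Synthetic check: 1,000 rows, 20% event rate, scores that rank every event first.
y_true = np.array([1] * 200 + [0] * 800)
y_prob = np.linspace(1.0, 0.0, 1000)
table = gain_lift_from_scores(y_true, y_prob)
print(table[["Decile", "Actual", "Gain_Percentage", "Lift"]])
```

With a perfect ranking and a 20% event rate, the first two deciles hold every event, so their lift is 5 and the lift falls back to 1 by the last decile.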
Figure 3: Lift chart for random forest model on titanic dataset
Figure 3 is the lift chart for the Random Forest model described above on the Titanic dataset. With correct rank ordering, the first decile holds the highest number of events and the count goes progressively down from there. If the event rate does not decrease monotonically across deciles, the model is not performing as expected. In the figure above, the monotonic trend in the target rate (the Per_Events column in Figure 3) actually breaks in the 5th decile. Though the difference between the 4th and 5th deciles is very small, it means the model can be improved by parameter tuning or by using another model.
Just like every other metric, lift charts are not a one-stop solution, but they help in evaluating the overall performance of the model. You can quickly spot flaws in the model if the chart is not monotonic. They also help to set a threshold for choosing the segments that are worth targeting over random targeting.
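The rank-ordering check can be automated. The sketch below flags every decile whose share of events is higher than the decile before it; the function name and the numbers are hypothetical and are not the values shown in Figure 3.

```python
def rank_order_breaks(per_events):
    """Return the 1-based deciles whose event share exceeds the previous decile's."""
    return [
        i + 1
        for i in range(1, len(per_events))
        if per_events[i] > per_events[i - 1]
    ]

# Hypothetical Per_Events column (percent of all events per decile).
per_events = [26.0, 25.5, 24.0, 8.0, 8.5, 4.0, 2.0, 1.0, 0.6, 0.4]
print(rank_order_breaks(per_events))   # [5]
```

An empty list means the rank order holds across all deciles.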
https://www.kdnuggets.com//03/lift-analysis-data-scientist-secret-weapon.html
https://www.analyticsvidhya.com/blog//08/11-important-model-evaluation-error-metrics/
The gain chart and lift chart are two measures of the benefit of using a model, and they are used in business contexts such as target marketing. They are not restricted to marketing analysis; they can also be used in other domains such as risk modeling and supply chain analytics. In other words, gain and lift charts are two approaches used when solving classification problems with imbalanced data sets.
Example: In target marketing or marketing campaigns, customer response rates are usually very low (in many cases fewer than 1% of customers respond). The organization incurs a cost for each customer contact, so it would like to minimize the cost of the campaign while still achieving the desired response level from customers.
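To see why lift matters for campaign cost, here is a small sketch with made-up numbers: a contact cost of 2.00, a 1% base response rate, and a targeted segment with a lift of 3. All three figures and the function name are assumptions for illustration.

```python
def cost_per_response(cost_per_contact, base_response_rate, lift=1.0):
    """Expected spend per responding customer when the contacted group has the given lift."""
    return cost_per_contact / (base_response_rate * lift)

# Hypothetical campaign: each contact costs 2.00 and 1% of all customers respond.
random_targeting = cost_per_response(2.00, 0.01)            # contact customers at random
model_targeting = cost_per_response(2.00, 0.01, lift=3.0)   # contact a segment with lift 3

print(f"random: {random_targeting:.2f} per response")
print(f"model:  {model_targeting:.2f} per response")
```

Because the response rate in the targeted segment is the base rate multiplied by the lift, the cost per response falls by the same factor.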
The gain chart and lift chart are measures that help organizations understand the benefit of using a model such as logistic regression, so that they can achieve better and more efficient results.
The gain and lift charts are obtained using the following steps:
4. Lift is the ratio of the number of positive observations up to decile i using the model to the expected number of positives up to decile i based on a random model. The lift chart plots the lift on the vertical axis against the corresponding decile on the horizontal axis.
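This count-based definition agrees with the earlier one that divides the gain percentage by the cumulative population percentage. As a short derivation, with symbols introduced here for illustration: let N be the total number of observations, P the total number of positives, n_i the number of observations up to decile i, and p_i the number of positives the model captures up to decile i. A random model is expected to capture P · n_i / N positives in the same n_i observations, so:

```latex
\[
\text{Lift}_i
  = \frac{p_i}{P \cdot n_i / N}
  = \frac{p_i / P}{n_i / N}
  = \frac{\text{Gain}_i}{\text{Cumulative population}_i}
\]
```

Here Gain_i = p_i / P and Cumulative population_i = n_i / N; expressing both as percentages leaves the ratio unchanged.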
immortalishika