This case study describes the construction of a predictive model that identifies fraudulent credit card transactions.
We used a dataset available on the Kaggle website, containing transactions made with credit cards by European cardholders in September 2013. The dataset covered transactions that occurred over two days, with 492 frauds out of 284,807 transactions. It was highly unbalanced: the positive class (frauds) accounted for 0.172% of all transactions.
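The imbalance figure can be verified with a quick calculation from the counts stated above:

```python
# Counts stated in the case study
total_transactions = 284_807
fraud_count = 492

# Share of the positive (fraud) class, as a percentage
fraud_rate = fraud_count / total_transactions * 100
print(f"Positive class share: {fraud_rate:.3f}%")  # roughly 0.17%
```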
It contained only numerical input variables, which were the result of a PCA (Principal Component Analysis) transformation. Unfortunately, due to confidentiality issues, we could not obtain the original features or more background information about the data. Features V1, V2, …, V28 were the principal components obtained with PCA; the only features not transformed with PCA were ‘Time’ and ‘Amount’. The ‘Time’ feature contained the seconds elapsed between each transaction and the first transaction in the dataset, and the ‘Amount’ feature was the transaction amount. The ‘Class’ feature was the response variable, with value “Yes” in case of fraud and “No” otherwise.
We used Databolics to build a predictive model that detects whether a transaction is fraudulent. The credit card company can then make decisions based on its alerting/decision workflow.
Preparation of the dataset file: the response variable Class (Yes or No) was derived from the metadata. Rows were then shuffled so that the samples appeared in random order.
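A minimal sketch of this preparation step, using pandas on a toy stand-in for the Kaggle file (the real dataset has V1…V28, Time, Amount, and a 0/1 Class column; the values below are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the Kaggle file; column names follow the real dataset's
# convention, but the rows are invented
df = pd.DataFrame({
    "Amount": [12.5, 300.0, 7.2, 89.9],
    "Class": [0, 1, 0, 0],
})

# Derive the Yes/No response variable from the 0/1 label
df["Class"] = df["Class"].map({1: "Yes", 0: "No"})

# Shuffle rows so sample order is random (fixed seed for reproducibility)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(df["Class"].tolist())
```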
Results in less than 6 hours: generating an explanatory model took just under 6 hours. When evaluated against an independent hold-out set of 85,443 transactions (the test subset derived from the full dataset of 284,807 transactions; see the Figure 2 comments), the resulting explanatory model was 92.5% accurate, with a true negative rate of 92.5% and a true positive rate of 94.2%. The true negative rate is the percentage of actually negative samples (not fraudulent, in our case) in the test subset that the model correctly predicted negative; the true positive rate is the percentage of actually positive samples (fraudulent) that the model correctly predicted positive.
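These two rates follow directly from the counts of correct predictions per class. A small self-contained sketch (the labels below are illustrative, not the case-study data):

```python
def rates(y_true, y_pred, positive="Yes"):
    """Compute the true positive rate and true negative rate.

    TPR = correctly flagged frauds / all actual frauds
    TNR = correctly cleared transactions / all actual non-frauds
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    pos = sum(1 for t in y_true if t == positive)
    neg = len(y_true) - pos
    return tp / pos, tn / neg

# Tiny illustrative example
y_true = ["Yes", "Yes", "No", "No", "No"]
y_pred = ["Yes", "No", "No", "No", "Yes"]
tpr, tnr = rates(y_true, y_pred)
print(tpr, tnr)  # 0.5 and 2/3
```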
The area under the ROC curve was 0.984.
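The area under the ROC curve can be read as the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one (ties counting half). A pure-Python sketch of that equivalence, on invented scores:

```python
def auc(scores, labels, positive="Yes"):
    """AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive outranks a randomly chosen negative,
    with ties counting half."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores: higher means more fraud-like (not the case-study data)
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = ["Yes", "No", "Yes", "No", "No"]
print(auc(scores, labels))  # 5/6
```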
Figure 1: resulting model (Model ID: 20170512T222907.337333). It uses four variables (V4, V14, V7 and V13), several operators, and constants. A binary classification model consists of two equations: one calculates the probability of a positive prediction, the other the probability of a negative prediction. After computing the value of each equation, the greater value determines the class predicted by the model, and the difference between the two values gives the prediction's confidence score.
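The two-equation decision rule can be sketched as follows. The two scoring functions below are hypothetical stand-ins: the real model combines V4, V14, V7 and V13 with operators and constants not reproduced here.

```python
import math

# Hypothetical stand-ins for the model's two equations; the actual
# operators and constants are shown in Figure 1, not here
def score_fraud(v4, v14, v7, v13):
    return math.tanh(0.8 * v4 - 1.2 * v14)

def score_not_fraud(v4, v14, v7, v13):
    return math.tanh(0.5 * v7 + 0.3 * v13)

def predict(v4, v14, v7, v13):
    s_yes = score_fraud(v4, v14, v7, v13)
    s_no = score_not_fraud(v4, v14, v7, v13)
    label = "Yes" if s_yes > s_no else "No"      # greater value wins
    confidence = abs(s_yes - s_no)               # gap between the two equations
    return label, confidence

print(predict(1.0, -2.0, 0.1, 0.2))
```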
Figure 2: review of the best model's statistics, showing the 92.5% accuracy obtained on the entire dataset and on each of the three subsets: training, validation, and test. The training and validation subsets are used to build hundreds of candidate models and finally select the best one; the test subset is an independent hold-out used to provide a final accuracy assessment of the selected model against data not available to the modeling process.
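A sketch of such a three-way split, using the stated dataset and hold-out sizes. The 80/20 ratio between training and validation is an assumption for illustration; the study does not state those proportions.

```python
import random

n = 284_807      # total transactions
test_n = 85_443  # independent hold-out size stated in the study

# Shuffle indices, carve off the hold-out, then split the remainder into
# training and validation (the 80/20 ratio here is an assumption)
idx = list(range(n))
random.Random(0).shuffle(idx)
test_idx = idx[:test_n]
rest = idx[test_n:]
cut = int(len(rest) * 0.8)
train_idx, val_idx = rest[:cut], rest[cut:]
print(len(train_idx), len(val_idx), len(test_idx))
```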
Figure 3: when evaluated against an independent hold-out set of 85,443 transactions, the resulting explanatory model was 92.5% accurate, with a true negative rate of 92.5% and a true positive rate of 94.2%. The area under the ROC curve was 0.984.
Sierrabolics is a software company that has developed a Universal Analytics Modeling technology that is fast, accurate, and actionable. The technology runs on platforms ranging from personal computers to the cloud.
Databolics, Sierrabolics' flagship software, provides the fastest and cheapest end-to-end method for obtaining the most accurate predictive models from datasets.