Can you predict product backorders?
Part backorders is a common supply chain problem. A backorder is a retailer’s order for a part that is temporarily out of stock with the vendor.
This case study is about building a predictive model to identify parts at risk of backorder before the event occurs so to have time to react.
We used a dataset available on the Kaggle website. The initial dataset contained 1M parts and 22 variables for each part. Unlike most of the machine learning algorithms available today in the market, Databolics does not need big datasets to build accurate and compact predictive models. To prove it, we decided to run Databolics on 3 different datasets : the 1M parts one, a subset of 100K parts and a subset of 30K parts. The models produced on these 3 different datasets were very much similar and equivalent in terms of accuracy/performance. Results below are those we got with the 100K parts dataset.
Dataset contained the historical data for 8 weeks prior to the week we were looking to predict. The data was taken as weekly snapshots at the start of each week. 22 columns/variables were defined as follows:
sku – Random ID for the product
national_inv – Current inventory level for the part
lead_time – Transit time for product (if available)
in_transit_qty – Amount of product in transit from source
forecast_3_month – Forecast sales for the next 3 months
forecast_6_month – Forecast sales for the next 6 months
forecast_9_month – Forecast sales for the next 9 months
sales_1_month – Sales quantity for the prior 1 month time period
sales_3_month – Sales quantity for the prior 3 month time period
sales_6_month – Sales quantity for the prior 6 month time period
sales_9_month – Sales quantity for the prior 9 month time period
min_bank – Minimum recommend amount to stock
potential_issue – Source issue for part identified
pieces_past_due – Parts overdue from source
perf_6_month_avg – Source performance for prior 6 month period perf_12_month_avg – Source performance for prior 12 month period
local_bo_qty – Amount of stock orders overdue
deck_risk – Part risk flag
oe_constraint – Part risk flag
ppap_risk – Part risk flag
stop_auto_buy – Part risk flag
rev_stop – Part risk flag
went_on_backorder – Product actually went on backorder. This is the target value.
The dataset size was 8.3 Mbytes size and referenced 100,000 parts with 823 backorders. The dataset was highly unbalanced, the positive class (backorder) accounted for 0.823% of all parts.
The dataset contained numerical input variables and text input variables. Lead time variable had some missing values, so we had to replace all of these by -1 to let Databolics interpret -1 has a missing value. No other pre-processing had to be made like outliers management as Databolics handles this automatically.
Feature ‘Went_on_backorder’ was the response variable and it had value “Yes” in case of backorder and “No” otherwise.
We used Databolics to produce a model which allowed to predict which part will be on backorder or not. Databolics automatically produced the best model – made actually of 2 mathematical equations – without any programming, any algorithm selection, any dataset splitting, etc…
Preparation of the dataset file: The response variable ‘Went_on_backorder’ (Yes or No) value was derived from the metadata. Rows were then randomized such that the order of samples in the rows was ensured to be random.
Results in less than 1 hour: Generation of an explanatory model, took just under 1 hour on a simple MacBook Air notebook. When evaluated against an independent hold out set of 30,001 parts (actually the test subset derived from the 100,000 parts dataset, see Fig 2 comments), the resulting explanatory model was 85% accurate with a True Negative Rate of 86%, and a True Positive Rate of 85%. True Negative Rate represents the percentage of parts predicted Negative (not on backorder in our case) and being actually Negative in the test subset. True Positive Rate represents the percentage of parts predicted Positive (on backorder in our case) and being actually Positive in the test subset.
Area under ROC curve was 0.91.
Figure 1: resulting model (Model ID: 20170523T091440.499444). It uses only 3 variables out of 22 (national_inv, forecast_9_month and sales_3_month), several operators and constants. In a binary classification model, model is made of 2 equations, one calculating probability of true prediction, the other calculating probability of false prediction. Upon computing the value of each equation, the greater value yields the case predicted by the model. The difference between these 2 numbers gives the prediction’s confidence score.
Figure 2: Review of best model statistics. One can see the 85% accuracy obtained on the entire dataset and on the 3 subsets (training, validation and test, training and validation are used for model creation, the test set is an independent hold out used to provide a final accuracy assessment by the final model against new data not available to the modeling process) used to build hundreds of potential models and finally select the best model.