A synthetic data set of random data points was created in Excel.
The Y value of each point was computed from a simple, continuous, non-linear function that is described later in this document. The X coordinates were randomly generated but constrained to only small sections of the continuous curve.
The X constraints were chosen so that only a very small portion of the function's curve was expressed in the data set. This simulates the common real-world situation of data sets that are not fully representative of the phenomenon under consideration.
The data set was composed of only 35 points to make the analysis and discovery of the underlying function more challenging for the algorithms.
Two of these sets were produced. The first set, called the training set, was analyzed to produce a model. The second set, called the test set, was used to assess the correctness of the models generated from the training set.
The test set was composed of completely different random data points along the same underlying continuous function. Essentially, the two data sets were two different pieces of the same curve, subject to the same X constraints mentioned above.
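The sampling procedure described above can be sketched in a few lines of Python. This is only an illustration, not the Excel workbook used for the study: the function name `make_constrained_dataset`, the stand-in curve, and the two X windows are all hypothetical, since the document does not state the actual X constraints.

```python
import random

def make_constrained_dataset(func, windows, n_points=35, seed=None):
    """Sample n_points (x, y) pairs with x drawn only from the allowed windows."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_points):
        lo, hi = rng.choice(windows)   # pick one allowed section of the curve
        x = rng.uniform(lo, hi)        # random X inside that section
        points.append((x, func(x)))    # Y comes from the underlying function
    return points

# Hypothetical stand-in curve and X windows, chosen only for illustration.
def stand_in(x):
    return x ** 3 - 2.0 * x

windows = [(0.0, 0.5), (1.0, 1.3)]

# Same curve, same constraints, different random points.
training_set = make_constrained_dataset(stand_in, windows, seed=1)
test_set = make_constrained_dataset(stand_in, windows, seed=2)
```

Using a different seed for each set gives two independent samples of the same constrained pieces of the curve, which mirrors the training/test split described above.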
MODEL GENERATION AND ASSESSMENT
Modeling of the training data was performed using four different machine learning algorithms: least squares regression, k-nearest-neighbor (kNN), regression trees, and the Databolics ZGP engine.
Modeling of this data set was quick for all four algorithms, taking no more than a few seconds each, because the data set contains only 35 rows.
Least squares regression line plot (the black line represents the least squares fit, blue points represent actual values):
At first glance, the least squares regression line may appear to be an acceptable predictor for this data set, depending on the precision required by the use case. This is discussed further later in this document.
kNN model results plot, evaluated against the test set (blue points are actual values, red points are the model's predictions):
A regression tree model was produced by generating a best-pruned regression tree with a minimum terminal node size of 2 against the training data. The resulting model was evaluated against the test set.
Regression tree model results plot, evaluated against the test set (blue points are actual values, red points are the model's predictions):
While the regression tree model is slightly better than the kNN model, the error rate is still quite significant. Like the kNN model, the regression tree model also did not do well at discovering and replicating the underlying mathematical relationship expressed in the data.
Lastly, modeling was performed using Databolics Pro and the ZGP analysis engine.
Databolics ZGP model results plot, evaluated against the test set (blue points are actual values, red points are the model's predictions):
EQUATION BASED MODELING
The reason that the ZGP engine performed so well is that the ZGP engine does not seek to find artifacts within the data set by which to discriminate similar observations nor does it employ statistical measures such as prevalence, mean and probability to produce a prediction. ZGP seeks to find a mathematical equation driven by data values that explains the response variable. This allows ZGP to build models which strive to express a complete description of the system behavior as supported by the data.
As a result, such models are much more effective at predicting correct values for points that were not specifically present in the training data, as is the case in this data set.
This is a crucial distinction, as it is very rare in practice that a sample data set contains data points that fully represent every possible manifestation of the phenomenon under consideration.
Statistically based algorithms are poorly equipped to predict the behavior of observations that fall outside the gamut covered by a data sample that is not fully representative.
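One way to see this limitation is with a minimal k-nearest-neighbor regressor. A kNN prediction is an average of training responses, so it can never leave the range of Y values seen during training. The sketch below is an illustration only, not the kNN implementation used in the study: the value `k=3`, the stand-in curve, and the X window of [0, 0.5] are all assumptions.

```python
def knn_predict(train, x, k=3):
    """Average the responses of the k training points whose X is closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical stand-in curve, used only for illustration.
def stand_in(x):
    return x ** 3 - 2.0 * x

# 35 training points confined to an assumed narrow window, x in [0, 0.5].
train = [(i * 0.5 / 34, stand_in(i * 0.5 / 34)) for i in range(35)]
y_min = min(y for _, y in train)
y_max = max(y for _, y in train)

inside = knn_predict(train, 0.25)   # query inside the sampled window
outside = knn_predict(train, 3.0)   # query far outside the sampled window

print("inside error: ", abs(inside - stand_in(0.25)))
print("outside error:", abs(outside - stand_in(3.0)))
print("outside prediction stays within training range:", y_min <= outside <= y_max)
```

Inside the sampled window the prediction tracks the curve closely. Outside it, the model can only repeat the average of the nearest edge points, however far the true curve has moved.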
THE UNDERLYING MATHEMATICAL RELATIONSHIP
With this in mind, the effectiveness of the ZGP model becomes clear when the actual underlying function is revealed.
It is not at all obvious from the small and constrained points contained in the training and test sets, but these points perfectly describe the very simple equation:
y = sin x
If these points are plotted against the graph of y = sin x the relationship of these points becomes much clearer:
Test set data points plotted against the graph of y = sin x:
Note that the data set contained data points for only a tiny fraction of the complete mathematical curve being modeled. This is an extreme case intended to illustrate a point that is much more subtle and hard to quantify numerically in real-world analytic efforts. Real-world data sets contain many dimensions of variables made up of imperfect, incomplete and not fully representative data. The same principles apply regardless of the size of the data set or the mathematical form of the relationship it only partially captures.
MODEL ASSESSMENT AGAINST FULL RANGE OF THE UNDERLYING FUNCTION
The value and significance of equation based modeling becomes even more vivid when the same models are used to evaluate data points on segments of the curve for which there was no observation present in the training data.
kNN model plot against the full range of data points for the underlying mathematics:
Now that the full behavior of the underlying data can be seen, it becomes very clear that the least squares regression line mentioned earlier would be an extremely poor predictor for this system.
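The effect can be reproduced with a short sketch that fits an ordinary least squares line to points from a narrow piece of y = sin x and then scores it over a full period. The window of x in [0, 0.5], the evenly spaced points, and the full range of [0, 2π] are assumptions made for illustration; the document does not state the actual X constraints.

```python
import math

def fit_line(points):
    """Closed-form ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(model, points):
    """Root mean squared error of model(x) against the true y values."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in points) / len(points))

# 35 training points from an assumed narrow window of the sine curve.
train = [(i * 0.5 / 34, math.sin(i * 0.5 / 34)) for i in range(35)]
slope, intercept = fit_line(train)

def line(x):
    return slope * x + intercept

# 200 points covering one full period, most of it never seen in training.
full = [(i * 2 * math.pi / 199, math.sin(i * 2 * math.pi / 199)) for i in range(200)]

in_window_error = rmse(line, train)
full_range_error = rmse(line, full)
print("RMSE inside the training window:", in_window_error)
print("RMSE over the full period:      ", full_range_error)
```

Within the window the curve is nearly straight, so the line fits well. Over the full period the line keeps rising while the sine curve turns back, and the error grows by orders of magnitude.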
Plot of the regression tree model against the full range of data points for the underlying mathematics:
The regression tree model was also unable to model behavior outside of the training set sample characteristics despite the fact that sufficient data points were present to infer the true underlying mathematical relationship between X and Y.
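A regression tree predicts the mean response of whichever leaf a query falls into, so every query beyond the edge of the training data lands in the same outermost leaf. The sketch below is a minimal, unpruned tree that honors a minimum terminal node size of 2. It is not the best-pruned tree produced for the study, and the training window of x in [0, 0.5] is an assumption.

```python
import math

def sse(ys):
    """Sum of squared deviations from the mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def build_tree(points, min_leaf=2):
    """Recursively split on X to minimize SSE, keeping >= min_leaf points per leaf."""
    points = sorted(points)
    ys = [y for _, y in points]
    best = None
    for i in range(min_leaf, len(points) - min_leaf + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical X values
        cost = sse(ys[:i]) + sse(ys[i:])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None or best[0] >= sse(ys):
        return {"value": sum(ys) / len(ys)}  # leaf: predict the mean response
    i = best[1]
    return {
        "threshold": (points[i - 1][0] + points[i][0]) / 2,
        "left": build_tree(points[:i], min_leaf),
        "right": build_tree(points[i:], min_leaf),
    }

def predict(tree, x):
    """Walk from the root to a leaf and return that leaf's mean."""
    while "threshold" in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["value"]

# 35 training points from an assumed narrow window of y = sin x.
train = [(i * 0.5 / 34, math.sin(i * 0.5 / 34)) for i in range(35)]
tree = build_tree(train, min_leaf=2)

for x in (math.pi / 2, math.pi, 3 * math.pi / 2):
    print(f"x={x:.3f}  tree={predict(tree, x):.3f}  actual={math.sin(x):.3f}")
```

All three queries return the same value, the mean of the rightmost leaf, while the actual curve passes through 1, 0 and -1.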
Plot of the Databolics ZGP model against the full range of data points for the underlying mathematics:
The ZGP model was able to infer the true mathematical behavior described by the training set data points, despite those data points presenting a very incomplete sample of the full underlying behavior. The ZGP model perfectly predicts all data points in the full-range data set, and indeed the model formula reported by the engine identified that Y=SIN(X):
FUSION( SIN, SIN, 0.800854505226215, X, X)
While this example is deliberately simple, the principles apply to other mathematical relationships regardless of composition and complexity.
In real-world data sets it is very rare that a sample of data presents the full gamut of values and relationships that drive an underlying process. The ability to model these aspects mathematically, rather than statistically and probabilistically, is a key capability for mitigating and avoiding the limitations imposed by incomplete or non-representative data sets.