Geneva University Hospital
Mantle cell lymphoma (MCL) and chronic lymphocytic leukemia (CLL) share many features and both arise from CD5+ B-cells; their distinction is critical as MCL is a more aggressive neoplasm.
Thomas Matthes, Professor of Hematology at Geneva University Hospital, and Sierrabolics teamed up to support his research team's gene analysis and root-cause analysis work on MCL and CLL.
The research team had assembled a very small dataset, which contained 50 samples (20 MCL and 30 CLL) with 290 attributes per sample. The attributes were obtained using Nanostring technology® and corresponded to values for gene expression levels. The dataset file size was 150 kilobytes.
Producing a predictive model from a very small dataset like this one is typically extremely challenging with traditional machine learning techniques. This is a common difficulty in such research: common analysis techniques require substantial numbers of samples to produce useful results, yet procuring each sample can be both expensive and logistically difficult. The dataset therefore offered an excellent test case for the efficacy of Databolics under difficult analysis constraints.
Databolics was used:
- To find the top 25 genes, among the 290, responsible for MCL and CLL, and
- To build a predictive model that will allow clinicians to automatically diagnose MCL or CLL in future patients for whom the expression levels of the 290 genes have been determined.
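To make the idea of ranking genes by their ability to separate the two diagnoses concrete, here is a minimal sketch in Python. It is not the Databolics method, which is not described here. It swaps in a deliberately simple univariate score (the absolute difference between the mean expression level in each class), and the function name, gene names, and values are all hypothetical.

```python
from statistics import mean

def rank_features(samples, labels, feature_names):
    """Rank features by |mean(class A) - mean(class B)|, largest first.

    Illustrative stand-in only: a simple univariate score, not the
    ranking procedure used by Databolics.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("exactly two classes are expected")
    scores = {}
    for j, name in enumerate(feature_names):
        group_a = [row[j] for row, lab in zip(samples, labels) if lab == classes[0]]
        group_b = [row[j] for row, lab in zip(samples, labels) if lab == classes[1]]
        scores[name] = abs(mean(group_a) - mean(group_b))
    # sorted() is stable, so ties keep their original column order.
    return sorted(feature_names, key=lambda n: scores[n], reverse=True)

# Hypothetical toy data: 4 samples, 3 made-up gene columns.
feature_names = ["gene_1", "gene_2", "gene_3"]
samples = [
    [10.0, 5.0, 1.0],  # MCL
    [12.0, 5.0, 2.0],  # MCL
    [2.0, 5.0, 1.0],   # CLL
    [4.0, 6.0, 2.0],   # CLL
]
labels = ["MCL", "MCL", "CLL", "CLL"]

ranking = rank_features(samples, labels, feature_names)
top_2 = ranking[:2]  # the analogue of keeping the "top 25" of 290
```

On a real dataset, a score that also accounts for within-class variance would usually be preferable; the mean difference is used here only because it is easy to follow.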
Preparation of the dataset file: The value of the response variable, Diagnosis (MCL or CLL), was derived from the metadata. The rows were then shuffled so that the samples appear in random order.
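The row-shuffling step can be sketched as follows. This is an illustration in Python under stated assumptions, not the procedure actually used: the sample identifiers, the tuple layout, and the fixed seed are all hypothetical.

```python
import random

def shuffle_rows(rows, seed=0):
    """Return a new list with the rows in random order; the input is not modified."""
    rng = random.Random(seed)  # fixed seed so the shuffle is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled

# Hypothetical miniature dataset: (sample_id, diagnosis) pairs.
rows = [("S1", "MCL"), ("S2", "MCL"), ("S3", "CLL"), ("S4", "CLL"), ("S5", "CLL")]
shuffled = shuffle_rows(rows, seed=42)
```

Shuffling matters when the file was assembled class by class: without it, a hold-out set taken from the end of the file could contain only one diagnosis.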
Results in less than an hour: Feature reduction (which in Databolics produces a ranking of the entire set of feature attributes by each feature's ability to predict the response variable) and generation of an explanatory model took just under an hour. When evaluated against an independent hold-out set of 15 samples, the resulting explanatory model was 100% accurate, with a true negative rate of 100% and a true positive rate of 100%.
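For readers unfamiliar with the three figures quoted above, the sketch below shows how accuracy, true positive rate, and true negative rate are computed from hold-out predictions. It assumes MCL is treated as the positive class, which the text does not specify, and the labels in the example are hypothetical rather than the study's data.

```python
def binary_rates(y_true, y_pred, positive="MCL"):
    """Compute accuracy, true positive rate and true negative rate.

    `positive` names the class counted as positive (an assumption here).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "tpr": tp / (tp + fn) if (tp + fn) else float("nan"),
        "tnr": tn / (tn + fp) if (tn + fp) else float("nan"),
    }

# Hypothetical hold-out labels and predictions with two mistakes.
y_true = ["MCL", "MCL", "CLL", "CLL", "CLL"]
y_pred = ["MCL", "CLL", "CLL", "CLL", "MCL"]
imperfect = binary_rates(y_true, y_pred)

# A model that predicts every hold-out sample correctly scores 1.0 on all three.
perfect = binary_rates(y_true, y_true)
```

With only 15 hold-out samples, each misclassification would move accuracy by almost seven percentage points, which is one reason a larger dataset helps confirm a perfect score.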
With regard to the ranked feature set, the research team found a 90% overlap between the top 25 genes from their previous manual work and the top 25 identified by Databolics. In the remaining 10%, some genes ranked in the top 25 by Databolics had not been found previously by the research team, and some of the research team's top 25 genes were not found by Databolics. These differences cannot immediately be attributed to an error on either side: the genes found only by Databolics may truly belong in the top 25, or the reverse may hold. Further analysis is under way, and the 50-sample dataset is being extended with additional patients to reach more precise conclusions.
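The comparison between the two top-25 lists amounts to a set overlap. A minimal Python sketch follows, using short hypothetical rankings with placeholder gene labels; the actual gene lists are not given here.

```python
def top_k_overlap(ranking_a, ranking_b, k):
    """Compare the top-k entries of two rankings as sets."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    shared = top_a & top_b
    return {
        "shared": sorted(shared),
        "only_a": sorted(top_a - top_b),  # in A's top k but not in B's
        "only_b": sorted(top_b - top_a),  # in B's top k but not in A's
        "fraction": len(shared) / k,
    }

# Hypothetical rankings (placeholder labels, not real gene names).
manual_ranking = ["G1", "G2", "G3", "G4", "G5"]
tool_ranking = ["G2", "G1", "G5", "G3", "G4"]

result = top_k_overlap(manual_ranking, tool_ranking, k=4)
```

The `only_a` and `only_b` lists are the genes worth a closer look, since they are where the two approaches disagree.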
Professor Matthes’s quote: “Before using Databolics, we were using manual intensive and time consuming algorithmic methods to identify genes responsible of blood cancer pathologies. Databolics brought us speed, accuracy of predictions and ease of use. As clinicians and researchers, we have neither time nor programming and IT skills to use any of the predictive technologies available today. Having an automated predictive modeling tool makes a huge difference in our research work from timing and cost perspectives. We will keep on using Databolics to refine our models, and will apply it to other research works. We also plan in the future to make the predictive models available to clinicians to support their diagnosis processes.”