Databolics was used to model and perform feature selection on the NCBI microarray gene expression dataset GSE19429.
This dataset contains Affymetrix GeneChip Human Genome U133 Plus 2.0 microarray data representing gene expression levels for 200 samples. These samples consisted of bone marrow tissue obtained from 183 patients with myelodysplastic syndromes (MDS) and 17 healthy controls. The microarray generated 54,675 attributes per sample. The dataset file size was 103 megabytes.
Acquisition and preparation of the dataset file: This entailed downloading the series matrix text file, removing metadata rows, and transposing rows and columns so that columns represented attributes and rows represented samples. The healthy vs. MDS status was chosen as the response variable to analyze; its value was derived from the metadata and added as a column. Rows were then shuffled so that the samples appeared in random order.
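The preparation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure Databolics Pro itself uses: the function name prepare_series_matrix, the labels mapping, and the toy input are all hypothetical, and the sketch assumes the usual series matrix layout in which metadata rows start with "!" and the first data row holds the sample IDs.

```python
import csv
import io
import random


def prepare_series_matrix(text, labels, seed=0):
    """Turn series-matrix text into shuffled sample rows (hypothetical helper).

    text   : contents of a tab-separated series matrix file
    labels : dict mapping sample ID -> response value, assumed to have been
             derived from the metadata beforehand
    """
    # Remove metadata rows: they start with "!" in a series matrix file.
    data_lines = [ln for ln in text.splitlines()
                  if ln.strip() and not ln.startswith("!")]
    rows = list(csv.reader(io.StringIO("\n".join(data_lines)), delimiter="\t"))

    header, probe_rows = rows[0], rows[1:]
    sample_ids = header[1:]                  # first cell is the probe-ID column name
    probe_ids = [r[0] for r in probe_rows]

    # Transpose: one row per sample, one column per probe, response last.
    samples = []
    for col, sample_id in enumerate(sample_ids, start=1):
        values = [float(r[col]) for r in probe_rows]
        samples.append([sample_id] + values + [labels[sample_id]])

    # Shuffle the row order; the fixed seed only makes the sketch repeatable.
    random.Random(seed).shuffle(samples)
    return ["sample"] + probe_ids + ["response"], samples


# Toy input with three samples and two probes (illustrative values only).
demo = (
    '!Series_title\t"toy example"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"S1"\t"S2"\t"S3"\n'
    '"probe_a"\t1.0\t2.0\t3.0\n'
    '"probe_b"\t4.0\t5.0\t6.0\n'
    '!series_matrix_table_end\n'
)
columns, table = prepare_series_matrix(
    demo, {"S1": "MDS", "S2": "MDS", "S3": "healthy"}
)
```

For the real file, the same function would be applied to the downloaded series matrix text, producing 200 rows of 54,675 attribute columns plus the response column.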
Results in less than two hours: The full process, including dataset acquisition and preparation, feature reduction, ranking of the entire set of feature attributes, and generation of an explanatory model, took just under two hours. When evaluated against hold-out samples (samples withheld from the analysis process), the resulting explanatory model was 90% accurate, with a true negative rate of 100% and a true positive rate of 89.1%.
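For readers unfamiliar with these measures, the short sketch below shows how accuracy, true positive rate, and true negative rate are derived from hold-out counts. The size of the hold-out set is not stated here, so the counts used are purely illustrative: they were chosen only because they happen to reproduce the reported rates, and they should not be read as the actual hold-out composition.

```python
def holdout_metrics(tp, fn, tn, fp):
    """Compute summary rates from confusion-matrix counts.

    tp / fn : MDS samples classified correctly / incorrectly
    tn / fp : healthy samples classified correctly / incorrectly
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
    }


# Hypothetical counts (an assumption, not taken from the analysis):
# 46 MDS and 4 healthy hold-out samples.
metrics = holdout_metrics(tp=41, fn=5, tn=4, fp=0)
```

With so few healthy controls in the dataset, the true negative rate rests on only a handful of samples, so it is worth reading alongside the true positive rate rather than on its own.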
With regard to the ranked feature set, 4 of the top 20 probes identified by the analysis process (231067_s_at, 241679_at, 210517_s_at, and 227530_at) represent transcripts for the gravin/AKAP12 gene, whose relevance to MDS has been reported by other research.
This is of note because Databolics Pro placed 4 of the 5 probe IDs for AKAP12 in the top 20, which lends confidence that the analysis process is weighing the merit of all attributes rather than finding useful attributes by chance.
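The "not by chance" argument can be made concrete with a rough calculation. The sketch below assumes, purely for illustration, that a ranking with no real signal would pick its top 20 probes uniformly at random from all 54,675, and it uses the hypergeometric distribution to ask how likely it is that at least 4 of one gene's 5 probes would land there. The function name prob_at_least is hypothetical.

```python
from math import comb


def prob_at_least(k, gene_probes, top_n, total_probes):
    """P(at least k of a gene's probes fall in a uniformly random top_n subset)."""
    denom = comb(total_probes, top_n)
    hits = range(k, min(gene_probes, top_n) + 1)
    num = sum(
        comb(gene_probes, i) * comb(total_probes - gene_probes, top_n - i)
        for i in hits
    )
    return num / denom


# 4 or more of a gene's 5 probes among the top 20 of 54,675 probes.
p_chance = prob_at_least(4, 5, 20, 54675)
```

Under that random-ranking assumption the probability is vanishingly small. Probes for the same gene are correlated in real data, so this is a back-of-the-envelope check rather than a formal significance test.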
Also in the top 20 attributes were three probe IDs that represent transcripts for ARPP21 (220359_s_at, 1556599_s_at, and 231935_at). The relevance of ARPP21 to MDS was also found by the research described in: http://www.nature.com/leu/journal/v24/n4/full/leu201031a.html
Also identified by Databolics Pro in the top 20 probes:
OR7A5 (208285_at) – Also reported by: http://onlinelibrary.wiley.com/doi/10.1111/j.1365-2141.2007.06833.x/pdf
SH2D4B (1563849_at) and KIAA0226L (previously named C13orf18, probe 44790_s_at) – Both found to be downregulated and differentially expressed in MDS patients per: http://bmcmedgenomics.biomedcentral.com/articles/10.1186/1755-8794-3-30
PPP2R2C (228010_at) – Downregulated in MDS patients per: http://onlinelibrary.wiley.com/doi/10.1002/ijc.27896/full
CD19 (206398_s_at) – Found to be downregulated in MDS patients per: http://ajcp.oxfordjournals.org/content/138/5/732
P4HA1 (202733_at) – Gene pathway found to be aberrantly methylated in MDS HSCs: http://www.bloodjournal.org/content/bloodjournal/120/10/2076.full.pdf?sso-checked=1
TP53INP1 (225912_at) – Relevant mutation characteristics in MDS: http://williams.medicine.wisc.edu/mdsgenetics.pdf
TP53 as prognostic biomarker and association with higher likelihood of transformation to AML: https://www.mycancergenome.org/content/disease/myelodysplastic-syndromes/tp53/331/
IRF4 (204562_at) – Relevance to MDS per: http://www.bloodjournal.org/content/124/21/2203.abstract?sso-checked=true
The following probes/genes were also identified in the top 20 probe IDs, but their relevance to MDS was not found in other research or literature:
Gene: HMHB1, Probe: 208302_at
Gene: DUSP26, Probe: 219144_at
Gene: MME, Probe: 203434_s_at
Gene: P2RY14, Probe: 206637_at
Sierrabolics is a software company that has developed a Universal Analytics Modeling technology that is fast, accurate and actionable. This technology runs on personal to cloud computing platforms.
Databolics Pro, Sierrabolics' flagship software, provides the fastest and cheapest end-to-end method to obtain the most accurate models from datasets.
Databolics Pro enables access to previously inaccessible knowledge embedded in data, while reducing the time, cost, and expertise required to gain benefits from such knowledge.