Given a set of high-dimensional data, run_umap.m produces a lower-dimensional representation of the data for purposes of data visualization and exploration. See the comments at the top of the file run_umap.m for documentation and many examples of how to use this code.
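A minimal call might look like the sketch below. The file name is hypothetical, and the 'n_components' argument is shown as an illustration; consult the comments in run_umap.m for the authoritative argument list.

```matlab
% Illustrative usage sketch; see the comments in run_umap.m for the
% authoritative documentation. 'myData.csv' is a hypothetical file whose
% rows are observations and whose columns are measurement dimensions.
reduction = run_umap('myData.csv');

% A numeric matrix works as input too; here we ask for a 3D reduction
% (the 'n_components' argument name is assumed from the UMAP convention).
data = randn(5000, 10);   % stand-in for real high-dimensional data
reduction3D = run_umap(data, 'n_components', 3);
```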
This MATLAB implementation follows a very similar structure to the Python implementation, and many of the function descriptions are nearly identical.
Here are some major differences in this MATLAB implementation:
1) The MATLAB function eigs.m does not appear to be as fast as the function "eigsh" in the Python package SciPy. For large data sets, we initialize a low-dimensional transform by binning the data with an algorithm known as probability binning. If the user downloads and installs the function lobpcg.m, made available by Andrew Knyazev at https://www.mathworks.com/matlabcentral/fileexchange/48-locally-optimal-block-preconditioned-conjugate-gradient, it can be used to find exact eigenvectors for medium-sized data sets. We also offer the option of downloading our slightly altered version of lobpcg.m, which produces equivalent results.
2) We have built in the optional ability to detect clusters in the low-dimensional output of UMAP. The clustering method we invoke is either DBM (described at https://www.hindawi.com/journals/abi/2009/686759/) for 2D reductions or DBSCAN (built in to MATLAB R2019a and later) for a reduction of any dimensionality. This produces cluster ID output and visualizations, as explained in the code examples.
3) We have also built in visual and computational tools for comparing data groups. Data groups (AKA subsets) can be defined either by running clustering (described above) on the data islands formed by UMAP’s reduction or by external classification labels provided for every row of the high-dimensional input data given to UMAP. Our UMAP implementation uses external labels for supervised reductions and supervised template reductions, as well as for comparing any reduction’s data islands directly to the external classification. We use a change quantification metric (QFMatch), which detects similarity in both mass and distance (described at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5818510/), as well as an F-score for measuring overlap when the groups are different classifications of the same data (described at https://en.wikipedia.org/wiki/F-score).
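The optional cluster detection described above can be requested at reduction time. In this sketch, the 'cluster_output' argument name and its 'graphic' value are assumptions based on the code examples referenced above; the third output holding one cluster identifier per input row is likewise assumed, so confirm both against the comments in run_umap.m.

```matlab
% Hedged sketch of cluster detection on the reduced output; the argument
% name 'cluster_output' and the third return value are assumptions.
[reduction, ~, clusterIds] = run_umap('myData.csv', ...
    'cluster_output', 'graphic');
% clusterIds would then hold one cluster identifier per input row,
% computed by DBM (for 2D reductions) or DBSCAN (for any dimensionality).
```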
For visualizing data groups, we provide a dendrogram (described as a QF-tree at https://www.nature.com/articles/s42003-019-0467-6) and sortable tables that show each data group’s similarity, overlap, false positive %, and false negative %. In version 2.2 we added the “UMAP dimension explorer” (UDE). UDE is a sortable table that shows characteristics of a data group’s unreduced data in each input dimension. These characteristics include the Kullback-Leibler divergence (KLD); the distribution as a density bar (colored with MATLAB’s jet colormap); and the median, mean, SD, and MAD. UDE supports data groups drawn with a MATLAB region-of-interest (ROI) tool on the UMAP output plot.
Without the aid of any compression, this MATLAB UMAP implementation tends to be faster than the current Python implementation (version 0.5.2 of umap-learn). All UMAP reductions are accelerated with C++ MEX implementations. Due to File Exchange requirements, we supply only the C++ source code for the MEX modules; users must download or build the MEX binary files separately (the option to download or build them is offered when "run_umap" is called). As examples 13 to 15 show, you can test the speed difference between the implementations on your own computer by setting the 'python' argument to true.
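A speed comparison along the lines of examples 13 to 15 can be sketched as follows. The 'python' argument is the one described above; the input file name is hypothetical, and running with 'python' set to true assumes a working umap-learn installation.

```matlab
% Rough timing sketch using the 'python' argument described above.
% 'myData.csv' is a hypothetical input file; the Python path assumes
% umap-learn is installed and reachable from MATLAB.
tic; run_umap('myData.csv');                  t_matlab = toc;
tic; run_umap('myData.csv', 'python', true);  t_python = toc;
fprintf('MATLAB: %.1f s, Python: %.1f s\n', t_matlab, t_python);
```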
Additionally, users of supervised templates may request the post-reduction services of supervisor matching, QF-tree, and QFMatch. The function run_umap.m returns the results of these services via its new 4th output argument, extras. The properties of extras are documented in the file umap/UMAP_extra_results.m.
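Capturing the 4th output might look like the sketch below. The names of the first three outputs, the 'template_file' argument, and the template file name are illustrative assumptions; only the extras output and the 'qf_tree' argument are documented above, so verify the rest against run_umap.m and umap/UMAP_extra_results.m.

```matlab
% Hedged sketch of a supervised template reduction that requests
% post-reduction services. The 'template_file' argument name and the
% first three output names are assumptions; extras is documented in
% umap/UMAP_extra_results.m.
[reduction, umap, clusterIds, extras] = run_umap('myData.csv', ...
    'template_file', 'myTemplate.mat', 'qf_tree', true);
```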
The major improvements in our version 3.01 release are:
- Significant acceleration of both dimension reduction and the matching of resultant classifications. This is done by compressing the input data into probability bins. We invented probability binning some two decades ago as an early attempt at an open cover akin to UMAP’s fuzzy simplicial complexes; hence the compression operation specializes in retaining high-dimensional characteristics while significantly reducing size. In our testing with flow cytometry data sets, we see negligible loss of classification accuracy for up to 40 dimensions when running QFMatch on clusters from UMAP reductions with prior trusted classifications. However, we do notice some loss of global structure in the UMAP plots. See the fast_approximation argument comments in the run_umap.m file. To understand probability bins, see https://onlinelibrary.wiley.com/doi/full/10.1002/1097-0320(20010901)45:1%3C37::AID-CYTO1142%3E3.0.CO%3B2-E
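The probability-bin compression above is enabled with the fast_approximation argument mentioned in that paragraph. A minimal sketch (hypothetical input file; see run_umap.m for the argument's documented behavior):

```matlab
% Enable probability-bin compression before reducing; faster, at the
% cost of some global structure in the resulting plot (per the notes
% above). 'myData.csv' is a hypothetical input file.
reduction = run_umap('myData.csv', 'fast_approximation', true);
```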
- A PredictionAdjudicator (PA) feature that helps determine how well one classification’s subsets predict another’s. PA reorganizes the predicting classifier’s subsets into predicting subsets: true positive, false positive, and false negative subsets. PA determines whether the false positives or the false negatives have more QFMatch-based similarity to the predicted subset. PA guides UMAP dimension explorers into showing the measurement distributions and Kullback-Leibler divergence of predicting subsets stacked together with the predicted subset. Selections in the PA table are highlighted in the UMAP and EPP plots.
- A complementary independent classifier that generates labels both for supervising UMAP and for classification comparison research. This classifier is named “exhaustive projection pursuit” (EPP). EPP takes a more conservative approach to grappling with the curse of dimensionality than that taken by UMAP’s algebraic topology. For more background on the curse of dimensionality, see https://www.nature.com/articles/nri.2017.150 . EPP scans all dimension pairs for the best split and then repeats on each split part until no further splits are found. Its key benefit for the flow cytometry community is that its decisions are more familiar and reviewable to the biologist than decisions made by most classifiers. EPP shows its decisions in a tree that resembles the biologists’ age-old manual gating trees that have been driving immunology research for decades! EPP is described at https://onedrive.live.com/?authkey=%21ALyGEpe8AqP2sMQ&cid=FFEEA79AC523CD46&id=FFEEA79AC523CD46%21209192&parId=FFEEA79AC523CD46%21204865&o=OneUp
Optional toolbox dependencies:
-The Bioinformatics Toolbox is required to change the 'qf_tree' argument.
-The Curve Fitting Toolbox is required to change the 'min_dist' argument.
This implementation is a work in progress. It has been looked over by Leland McInnes, who considers it "a fairly faithful direct translation of the original Python code". We hope to continue improving it in the future.
Provided by the Herzenberg Lab at Stanford University.
We appreciate any and all help in finding bugs. Our priority has been determining the suitability of our concepts (UMAP supervised templates and exhaustive projection pursuit) for research publications in flow cytometry.