In a recent study posted to the bioRxiv* preprint server, researchers described a “bridge integration” method for harmonizing single-cell datasets of different modalities with single-cell ribonucleic acid (RNA) sequencing (scRNA-seq) references.

Study: Dictionary learning for integrative, multimodal, and scalable single-cell analysis.

Mapping new scRNA-seq datasets to reference sets is an exciting and growing opportunity in single-cell genomics. Unlike unsupervised analysis, supervised mapping relies on large, well-curated reference datasets and novel computational tools to annotate query profiles. Although existing approaches are powerful, their references are built from scRNA-seq data, so they cannot annotate datasets that do not measure gene expression.

Consortia such as the Human Cell Atlas (HCA), the Human BioMolecular Atlas Program (HuBMAP), and the Chan Zuckerberg Biohub have built references that are carefully curated and annotated by experts. Mapping datasets to these references allows data harmonization and comparison of scRNA-seq datasets across experimental conditions and disease states. Mapping additional molecular modalities is difficult because they measure different features than scRNA-seq. Examples include the single-cell assay for transposase-accessible chromatin with sequencing (scATAC-seq), single-cell bisulfite sequencing (scBS-seq) for DNA methylation assessment, cytometry by time-of-flight (CyTOF), and single-cell cleavage under targets and tagmentation (scCUT&Tag) for histone modifications.

The study

In the current study, the researchers introduced “bridge integration” for integrating single-cell datasets that measure disparate modalities. The method uses a separate dataset, in which both modalities are measured in the same cells, as a “bridge”. Bridge integration relies on dictionary learning, a technique commonly used in image analysis. This form of learning represents the input data (e.g., a noisy image) in terms of individual elements (image patches) called atoms, which collectively constitute the “dictionary”. Reconstructing the image as a weighted linear combination of these atoms can be effective for denoising, and amounts to transforming the image (dataset) into the space defined by the dictionary.
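The idea of representing data as a weighted combination of atoms can be sketched in a few lines. This is only a toy illustration: the dictionary is assumed to be given (it is not learned here) and its two atoms are assumed to be orthonormal, so the weights reduce to dot products. The names `atoms`, `encode`, and `reconstruct` and all numbers are hypothetical and do not come from the study.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical dictionary: two orthonormal atoms in a 4-dimensional space.
s = 1 / math.sqrt(2)
atoms = [
    [s, s, 0.0, 0.0],
    [0.0, 0.0, s, s],
]

def encode(signal, atoms):
    """Weight of the signal on each atom (valid because the atoms are orthonormal)."""
    return [dot(signal, atom) for atom in atoms]

def reconstruct(weights, atoms):
    """Weighted linear combination of the atoms."""
    n = len(atoms[0])
    return [sum(w * atom[i] for w, atom in zip(weights, atoms)) for i in range(n)]

clean = [2.0, 2.0, -1.0, -1.0]
# Noise added here was chosen to lie outside the span of the atoms,
# so projecting onto the dictionary removes it completely in this toy case.
noisy = [2.3, 1.7, -0.8, -1.2]

weights = encode(noisy, atoms)           # the signal in "dictionary space"
denoised = reconstruct(weights, atoms)   # back in the original space
```

In general only the part of the noise that falls outside the span of the dictionary is removed, which is why the choice of atoms matters.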

The rationale for bridge integration was to combine single-cell sequencing datasets that each measure a different modality (unimodal datasets). Although the authors have previously described conversion from one feature set to another, that transformation makes strict biological assumptions about the relationship between modalities, and these assumptions may not always hold.

The authors used a multi-omics dataset as a bridge to translate between distinct modalities at single-cell resolution, applying dictionary learning to perform the integration. Essentially, the multi-omics dataset was treated as a dictionary, and the multi-omics profile of each individual cell represented an atom. The dictionary representation of each of the disparate unimodal datasets was then inferred from these atoms. As a result, the distinct datasets are described in a shared feature space and can finally be aligned.
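A minimal sketch of this idea follows, under loud assumptions: it is not the authors' implementation. Each cell is expressed as normalized kernel weights over the bridge cells (a simplification chosen for illustration), labels are transferred by nearest neighbor, and any dimensionality reduction and the final alignment step are omitted. All data, labels, and function names are hypothetical.

```python
import math

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def dictionary_representation(cell, bridge_profiles):
    """Express one cell as normalized weights over the bridge cells (atoms).

    Assumption: a Gaussian kernel on squared distance stands in for the
    weighting used by the real method.
    """
    scores = [math.exp(-sq_dist(cell, atom)) for atom in bridge_profiles]
    total = sum(scores)
    return [score / total for score in scores]

def transfer_labels(query_reps, reference_reps, reference_labels):
    """Give each query cell the label of its nearest reference cell."""
    predicted = []
    for q in query_reps:
        nearest = min(range(len(reference_reps)),
                      key=lambda i: sq_dist(q, reference_reps[i]))
        predicted.append(reference_labels[nearest])
    return predicted

# Toy bridge: 4 cells measured in BOTH modalities (row i is the same cell).
bridge_rna = [[5.0, 0.0], [4.0, 1.0], [0.0, 5.0], [1.0, 4.0]]   # 2 genes
bridge_atac = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0],
               [0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]                # 3 peaks

# Unimodal datasets: an annotated RNA reference and an unlabeled ATAC query.
reference_rna = [[4.5, 0.5], [0.5, 4.5]]
reference_labels = ["A", "B"]
query_atac = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]

# Both datasets end up in the same space: one coordinate per bridge cell.
reference_reps = [dictionary_representation(c, bridge_rna) for c in reference_rna]
query_reps = [dictionary_representation(c, bridge_atac) for c in query_atac]

predicted = transfer_labels(query_reps, reference_reps, reference_labels)
```

Note that genes and peaks are never matched to each other. The only link between the two modalities is that each bridge cell was profiled in both.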

The bridge integration method makes no biological assumptions about the relationship between the separate modalities. Instead, these relationships are learned automatically from the multi-omics dataset. The disparate datasets are then transformed so that they are represented by a shared set of features. After transformation, a final alignment procedure is applied, which is compatible with other single-cell integration methods such as Harmony, Seurat, and mnnCorrect.


The bridge integration technique was applied to map scRNA-seq and scATAC-seq samples of human bone marrow mononuclear cells (BMMCs). These samples span the full range of hematopoietic differentiation, including hematopoietic stem cells (HSCs), multipotent progenitors, and fully differentiated cells. A reference scRNA-seq dataset of BMMCs, the Azimuth reference, comprising 297,627 cells, was constructed from publicly available datasets (HuBMAP). The scATAC-seq (query) dataset of BMMCs was mapped to this reference dataset, and a 10x multiome dataset (32,368 cells with paired scRNA-seq and scATAC-seq data) was used as a molecular bridge.

The authors reported successful mapping of the scATAC-seq query dataset to the Azimuth reference, which allowed scRNA-seq and scATAC-seq data to be visualized and annotated together. They noted that CD34+ BMMC fractions mapped exclusively to HSCs and progenitors in the reference, indicating the robustness of the bridge integration strategy.

Additionally, unlike unsupervised analysis, bridge integration annotated rare subpopulations at high resolution: monocytes were separated into CD14+ and CD16+ fractions, and natural killer cells into CD56bright and CD56dim subgroups. Notably, rare populations such as innate lymphoid cells and AXL+SIGLEC6+ dendritic cells (ASDCs) were also identified with this method.

Although reducing the size of the bridge dataset still yielded concordant results overall, the accuracy of annotations for rare cell types could be compromised. Further evaluations that varied the bridge size found that a bridge dataset with at least 50 cells (atoms) per subpopulation should yield acceptable annotations of rare cell types.
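As a practical illustration, a bridge dataset could be screened against this threshold before use. The sketch below assumes the bridge cells already carry subpopulation labels. The function name, the labels, and the cell counts are hypothetical examples, not figures from the study.

```python
from collections import Counter

MIN_CELLS_PER_TYPE = 50  # threshold quoted above

def underrepresented_types(bridge_labels, min_cells=MIN_CELLS_PER_TYPE):
    """Return subpopulations with fewer bridge cells than the threshold."""
    counts = Counter(bridge_labels)
    return {label: n for label, n in counts.items() if n < min_cells}

# Hypothetical bridge annotation with made-up counts.
bridge_labels = ["CD14 Mono"] * 120 + ["HSC"] * 60 + ["ASDC"] * 12
flagged = underrepresented_types(bridge_labels)
```

Subpopulations returned by such a check are the ones whose annotations would deserve extra caution.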


The current study demonstrated the successful application of the bridge integration technique. Additionally, atomic sketching, which builds on the same methodology, could extend the approach to harmonizing large datasets comprising millions of cells. The bridge integration method is suitable for studies in which the multi-omics technique is applied to a subset of the experimental samples rather than all of them.

*Important Notice

bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

Journal reference: