Haoyu Zhang
0009-0005-6025-0217
·
haoyu-zc
Department of Biomedical Informatics, University of Colorado Anschutz, Aurora, CO 80045, USA
Kevin Fotso
0009-0002-2196-6157
·
kf-cuanschutz
Office of Information Technology, University of Colorado Anschutz, Aurora, CO 80045, USA
Marc Subirana-Granés
0000-0003-3934-839X
·
msubirana
·
msubirana20
Department of Biomedical Informatics, University of Colorado Anschutz, Aurora, CO 80045, USA
Milton Pividori
✉
0000-0002-3035-4403
·
miltondp
·
miltondp
·
@miltondp@genomic.social
Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, USA; Colorado Center for Personalized Medicine, University of Colorado Anschutz, Aurora, CO 80045, USA
· Funded by The National Human Genome Research Institute (K99/R00 HG011898); The Eunice Kennedy Shriver National Institute of Child Health and Human Development (R01 HD109765)
✉ — Correspondence via GitHub Issues or email to Milton Pividori <milton.pividori@cuanschutz.edu>.
Motivation: Identifying meaningful patterns in complex biological data necessitates correlation coefficients capable of capturing diverse relationship types beyond simple linearity. Furthermore, efficient computational tools are crucial for handling the ever-increasing scale of biological datasets.
Results: We introduce CCC-GPU, a high-performance, GPU-accelerated implementation of the Clustermatch Correlation Coefficient (CCC). CCC-GPU computes correlation coefficients for mixed data types, effectively detects nonlinear relationships, and offers significant speed improvements over its predecessor.
Availability: The source code of CCC-GPU is openly available on GitHub (https://github.com/pivlab/ccc-gpu) and archived on Zenodo (https://doi.org/10.5281/zenodo.18310318), distributed under the BSD-2-Clause Plus Patent License.
Correlation coefficients are fundamental tools for uncovering meaningful patterns within data. While traditional measures like Pearson and Spearman are adept at capturing linear and monotonic relationships, newer methodologies, such as the Clustermatch Correlation Coefficient (CCC) [1] and the Maximal Information Coefficient (MIC) [2,3], have emerged to detect a broader spectrum of associations. These coefficients embody three distinct statistical assumptions: Pearson captures linear relationships, Spearman measures monotonic patterns using a nonparametric approach, and CCC detects general nonlinear associations. We centered our analysis on these three to cover the most common correlation scenarios without redundancy; notably, we excluded Kendall’s \(\tau\) because it is highly correlated with Spearman’s \(\rho\) while being ~540× slower to compute.
The CCC is a clustering-based statistic designed to identify both linear and nonlinear patterns. Previously, we demonstrated CCC’s utility in analyzing gene expression data from GTEx, showcasing its robustness to outliers and its ability to detect both strong linear relationships and biologically significant nonlinear patterns often missed by conventional coefficients [1]. Unlike MIC, CCC accommodates both numerical and categorical data types and is up to two orders of magnitude faster. However, despite leveraging CPU multi-threading for acceleration, the original CCC implementation can still be computationally expensive for large datasets. For instance, in the original study [1], we used only the top 5,000 most variable genes in a single tissue (whole blood) to ease computation.
Here, we introduce a new implementation, CCC-GPU, that harnesses the power of NVIDIA CUDA for GPU programming. We have computed CCC-GPU values for all ~50,000 genes in the Genotype-Tissue Expression (GTEx) v8 dataset [4] across all 54 tissues in a fraction of the time that would have been needed using the original CCC implementation. This advancement achieves a substantial speedup over the original CPU-based implementation, making comprehensive correlation analysis of large biological datasets more practical and efficient.
CCC was originally developed in Python, as detailed in the original publication [1]. We provide the algorithm equation and pseudocode in Supplementary Information E1 and S1, respectively. Profiling CCC revealed that Adjusted Rand Index (ARI) computation [5] was the primary performance bottleneck in our experiments, accounting for 74–91% of total runtime depending on workload size (Figures S2–S4). This is expected, as CCC relies heavily on ARI computation for internal data clustering. For example, computing the pairwise correlations for one GTEx v8 tissue with ~50,000 genes using default CCC parameters requires approximately 1.5 billion ARI calculations. Since individual ARI calculations have no data dependencies and require minimal synchronization, they are ideal candidates for graphics processing units (GPUs), which contain thousands of compute units designed for massively parallel operations.
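The independence of these calculations is visible in the ARI formula itself, which depends only on the contingency table of two label vectors. The following pure-Python sketch (illustrative only, not CCC’s optimized implementation) makes this explicit:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(part_x, part_y):
    """ARI between two partitions given as per-object cluster labels.

    Each call depends only on its two label vectors and shares no state
    with other calls, which is why billions of ARI evaluations can be
    dispatched in parallel.
    """
    n = len(part_x)
    # Contingency table: objects falling in cluster i of x and cluster j of y.
    contingency = Counter(zip(part_x, part_y))
    a = Counter(part_x)  # row sums
    b = Counter(part_y)  # column sums

    sum_nij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())

    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # e.g., both partitions trivial
        return 0.0
    return (sum_nij - expected) / (max_index - expected)

# Identical partitions (up to label renaming) give ARI = 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0
```

Replacing this per-pair scalar computation with massively parallel GPU kernels is what drives the speedups reported below.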
We developed a GPU-accelerated implementation by rewriting the core ARI computation module in CUDA C++ (requiring an NVIDIA GPU with CUDA Compute Capability 8.6 or higher, i.e., RTX 30-series or newer). While we evaluated existing CUDA-based Python libraries, native C++ code provides greater flexibility and complete access to CUDA’s feature set [6]. In this implementation, we replaced the Python-based ARI computation logic with CUDA C++ kernel functions. Each feature (gene) pair’s CCC correlation involves multiple ARI calculations, to which we assigned unique global indices. These calculations are mapped to kernel functions by their global indices and executed concurrently across multiple GPU cores. This required rebuilding internal algorithms that Python libraries such as NumPy conveniently provide. An orchestration layer processes inputs in batches, allocating up to 4.5 GB of GPU memory per batch (well within the capacity of modern GPUs), enabling memory-efficient streaming of large-scale matrices. Within each batch, kernel functions use shared memory for caching and efficient inter-thread communication, device memory buffers are reused across batches to avoid repeated allocation overhead, and results are transferred to host memory upon batch completion. To minimize overhead, device synchronization occurs only at critical points: after each ARI batch completes and during p-value calculations, to ensure kernel completion before data transfer. In summary, our custom solution incorporates GPU architecture optimizations, including explicit memory management, tailored caching strategies, and coordinated CPU-GPU communication, to maximize performance. Future software versions will expose more parameters for hardware-specific fine-tuning.
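The global-index scheme can be illustrated in Python (the function name and exact index arithmetic here are our illustrative reconstruction, not the actual CUDA kernel code): each thread recovers its feature pair and partition pair from its global index alone, with no inter-thread communication.

```python
def decode_work_item(g, n_features, n_parts):
    """Map a flat global index to one ARI work item.

    Each of the n_features * (n_features - 1) // 2 feature pairs needs
    n_parts * n_parts ARI evaluations (every partition of x against
    every partition of y), so a GPU thread can derive its assignment
    purely from its global index.
    """
    aris_per_pair = n_parts * n_parts
    pair_idx, ari_idx = divmod(g, aris_per_pair)

    # Unrank pair_idx into upper-triangle coordinates (i, j), with i < j.
    i, remaining, row_len = 0, pair_idx, n_features - 1
    while remaining >= row_len:
        remaining -= row_len
        i += 1
        row_len -= 1
    j = i + 1 + remaining

    part_x, part_y = divmod(ari_idx, n_parts)
    return (i, j), (part_x, part_y)

# With 4 features and 2 partitions per feature: 6 pairs x 4 ARIs = 24 items.
print(decode_work_item(23, n_features=4, n_parts=2))  # → ((2, 3), (1, 1))
```

Because the mapping is a bijection from global indices to work items, batching reduces to slicing the global index range, which is what the orchestration layer described above exploits.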
We then used pybind11 [7] to integrate the CUDA C++ backend with the existing Python framework, ensuring full compatibility with the original CCC interface. This hybrid approach preserves the original system’s comprehensive feature set while enabling GPU acceleration, facilitating thorough testing and validation.
We conducted a comprehensive comparative evaluation of Pearson, Spearman, and CCC-GPU using simulated data and all genes and tissues in the GTEx v8 dataset. The evaluation was performed on an AMD Ryzen Threadripper 7960X CPU with an NVIDIA RTX 4090 GPU, representing a typical high-performance workstation configuration. The original analysis and benchmarking code is available in the GitHub repository (https://github.com/pivlab/ccc-gpu/tree/main/analysis).
We compared the computational complexity of CCC-GPU (using the GPU) and the original CCC (using 24 CPU cores) when analyzing each tissue in the GTEx dataset. For whole blood (56,200 genes, 755 samples), CCC-GPU demonstrated a 37x speedup. On average, the original CCC required approximately 6 hours to process one GTEx tissue. This acceleration enabled the complete analysis of all 54 GTEx tissues in just 8 hours on a single machine, compared to ~312 hours (~13 days) that would be needed with the original implementation. Consistent performance gains were observed across all tissue types, confirming CCC-GPU’s broad applicability.
To further understand CCC-GPU’s scalability with increasing data size, we performed benchmarks using synthesized input, comparing its speedup against the original CPU implementation. The results, illustrated in Figure 1a, demonstrate that CCC-GPU maintains a stable speedup trend as input size increases, confirming its robust scaling capabilities when handling large datasets. The different curves in Figure 1a simulate CCC-GPU’s prospective speedup over CPU hardware with varying numbers of cores. We also compared the runtime of the original CCC, CCC-GPU, Pearson, and Spearman methods, as shown in Figure 1b. While Pearson and Spearman are inherently faster due to their reliance on simple statistics, CCC-GPU’s runtime is remarkably close to these methods, showcasing its efficiency while maintaining its advanced capabilities for capturing complex relationships.
The significant performance enhancement provided by CCC-GPU expands the scope of transcriptomic analyses and allowed us to uncover more nonlinear patterns. In the UpSet analysis [8] shown in Figure 1c, we compared how Pearson, Spearman, and CCC-GPU agreed or disagreed in prioritizing gene pairs across all genes in the 54 GTEx tissues. The ~2.9 billion gene pairs detected only by CCC-GPU (Disagreements in Figure 1c, where CCC-GPU is “high” and any of the others “low”) likely have biologically meaningful nonlinear patterns, as was found in the original CCC study [1]. Likewise, gene pairs where Pearson is “high” and the rest are “low” are likely driven by outliers [1].
Leveraging CCC-GPU’s mixed data type capability, we correlated gene expression with GTEx metadata to interpret nonlinear patterns. For this, we included categorical variables such as sex and mortality status, and numerical variables such as BMI and age. From the five intersection groups where CCC values were high but Pearson or Spearman remained low, we selected the top 100 gene pairs per tissue with the largest CCC value. Figure 1d highlights gene pairs with biologically interpretable nonlinear patterns explained by a metadata variable, such as UTY-RAD18 (sex) and RASSF2-CYTIP (pre/post-mortem status; these two genes were previously found to be differentially expressed in these conditions [9]). We provide the CCC values computed for all gene pairs, and the top gene-metadata correlation results across all GTEx v8 tissues (see Supplementary Note 1).
We present CCC-GPU, a GPU-accelerated implementation of the original Clustermatch Correlation Coefficient (CCC). Our work demonstrates that CCC-GPU delivers a remarkable acceleration over its CPU-based predecessor, enabling the rapid and efficient computation of correlation coefficients in large transcriptomic datasets. This performance leap transforms analyses that previously required weeks into tasks achievable within hours on standard research hardware.
Challenges often reside not only in detecting but also in interpreting complex, nonlinear patterns between genes. CCC-GPU’s ability to correlate different data types allows researchers to easily incorporate available metadata into their analyses. For example, a previously highlighted nonlinear relationship for the gene pair RASSF2-CYTIP, detected by CCC, was explained by the GTEx metadata field “COHORT” (pre- vs. post-mortem status). These two genes were previously reported to be differentially expressed under these conditions [9]. CCC also detected a strong linear pattern between these genes regardless of donor mortality status.
Our new CCC-GPU delivers a next-generation correlation coefficient at a fraction of the computational cost without sacrificing its accessibility, accuracy, or reliability. This significant performance enhancement makes comprehensive correlation analysis of large genomic data practical on standard research hardware. Beyond standard correlation analyses, CCC-GPU empowers biologists to perform sophisticated tasks such as advanced feature selection prior to machine learning model training with large datasets. Moreover, by accelerating the discovery of novel nonlinear relationships in expression data, CCC-GPU helps researchers quickly identify patterns beyond conventional, linear-only relationships. This can potentially uncover new biological insights that would otherwise remain hidden and drive forward our understanding of complex biological systems.
This work is supported by the National Human Genome Research Institute (R00 HG011898 to M.P.), and The Eunice Kennedy Shriver National Institute of Child Health and Human Development (R01 HD109765 to M.P.).
K.F. has recently held less than $3,000 worth of NVIDIA stock. All other authors declare no competing interests.
In our Zenodo archive (https://doi.org/10.5281/zenodo.17156519), we provide:
We used two approaches to define correlation values that are “high” or “low” for each coefficient per tissue. The first approach captures, for each tissue and correlation coefficient, the top 30% of gene pairs by using the 70th percentile of correlation values, and the bottom 30% of gene pairs by using the 30th percentile (Figure 1c, in the main text, for all tissues combined; and panel c for individual tissues in the supplementary figures section below). The second approach computes the null distribution of coefficient values per tissue by taking a random subset of 10,000 genes and shuffling samples. Then, we define genes with “high” correlation as those with coefficient values larger than the 95th percentile (P < 0.05), and “low” correlation as those with coefficient values smaller than the 80th percentile (P > 0.20) (Figure S5 for all tissues, and panel d for individual tissues, in the supplementary figures section below).
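The first, percentile-based approach can be sketched in a few lines of Python (illustrative only; the actual analysis code in the repository may differ in details):

```python
from statistics import quantiles

def high_low_masks(values, low_pct=30, high_pct=70):
    """Flag each coefficient value as 'high' (above the 70th percentile)
    or 'low' (below the 30th percentile), per tissue and coefficient."""
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    cuts = quantiles(values, n=100)
    lo_thr, hi_thr = cuts[low_pct - 1], cuts[high_pct - 1]
    high = [v > hi_thr for v in values]
    low = [v < lo_thr for v in values]
    return high, low

# With 100 evenly spread values, ~30% of gene pairs land in each tail.
high, low = high_low_masks(list(range(1, 101)))
print(sum(high), sum(low))  # → 30 30
```

The second, permutation-based approach additionally shuffles samples over a random gene subset to build a per-tissue null distribution before applying the 95th/80th percentile cutoffs.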
The Clustermatch Correlation Coefficient (CCC) between a feature pair (i.e., gene pair in the context of transcriptomics), represented by data vectors \(\mathbf{x}\) and \(\mathbf{y}\), is defined as the maximum ARI between all possible object partitions (i.e., grouping of objects, which are groups of RNA-seq samples in the context of transcriptomics) derived from the data vectors:
\[\textrm{CCC}(\mathbf{x}, \mathbf{y}) = \max \lbrace 0, \max_{\substack{\pi_j \in \Pi^{\mathbf{x}} \\ \pi_l \in \Pi^{\mathbf{y}}}} \lbrace \textrm{ARI}(\pi_j, \pi_l) \rbrace \rbrace, \quad \forall \left\vert\pi\right\vert \in [2, k_{\mathrm{max}}] \tag{E1}\]
where \(\Pi^{\mathbf{x}}\) is a set of partitions derived from \(\mathbf{x}\), \(\Pi^{\mathbf{y}}\) is a set of partitions derived from \(\mathbf{y}\), and \(k_{\mathrm{max}}\) specifies the maximum number of clusters allowed for partitions. The ARI has an upper bound of 1 (achieved when both partitions are identical), and although it does not have a well-defined lower bound, values equal to or less than zero are achieved when partitions are independent. Therefore, \(\textrm{CCC}(\mathbf{x}, \mathbf{y}) \in \left[0,1\right]\). In the special case where all \(n\) objects in either \(\mathbf{x}\) or \(\mathbf{y}\) have the same value, the CCC is undefined. Refer to the original CCC article [1] for extended definitions, an explanation of \(k_{\mathrm{max}}\), properties, statistical significance, and other details.
The main function of the algorithm, ccc, generates a set of partitions \(\Pi^{\mathbf{x}}\) for variable \(\mathbf{x}\) (line 15), and another set of partitions \(\Pi^{\mathbf{y}}\) for variable \(\mathbf{y}\) (line 16).
Then, it computes the ARI between each partition \(\pi_j \in \Pi^{\mathbf{x}}\) and \(\pi_l \in \Pi^{\mathbf{y}}\) and gets the maximum (line 17), returning either this value or zero if this is negative (line 18).
Refer to the original CCC article [1] for more details.
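Equation E1 and pseudocode S1 can be condensed into a minimal pure-Python sketch for numerical vectors (illustrative only: the real implementation also handles categorical data, uses optimized partitioning, and is heavily vectorized):

```python
from collections import Counter
from math import comb
from statistics import quantiles

def ari(px, py):
    """Adjusted Rand Index between two partitions (per-object labels)."""
    n = len(px)
    s_nij = sum(comb(c, 2) for c in Counter(zip(px, py)).values())
    s_a = sum(comb(c, 2) for c in Counter(px).values())
    s_b = sum(comb(c, 2) for c in Counter(py).values())
    expected = s_a * s_b / comb(n, 2)
    max_idx = (s_a + s_b) / 2
    return 0.0 if max_idx == expected else (s_nij - expected) / (max_idx - expected)

def quantile_partition(v, k):
    """Partition a numerical vector into k clusters using quantile cut points."""
    cuts = quantiles(v, n=k)  # k - 1 cut points
    return [sum(x > c for c in cuts) for x in v]

def ccc(x, y, k_max=10):
    """CCC per equation E1: the maximum ARI over all pairs of quantile
    partitions with 2..k_max clusters, clamped at zero."""
    parts_x = [quantile_partition(x, k) for k in range(2, k_max + 1)]
    parts_y = [quantile_partition(y, k) for k in range(2, k_max + 1)]
    best = max(ari(px, py) for px in parts_x for py in parts_y)
    return max(0.0, best)
```

For a strictly increasing relationship (e.g., linear) the quantile partitions of \(\mathbf{x}\) and \(\mathbf{y}\) coincide, so this sketch returns 1.0; for a quadratic pattern it returns a clearly positive value where Pearson would be near zero.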
To quantify the computational bottleneck in the CCC algorithm, we performed comprehensive profiling using Python’s cProfile on representative workloads. We categorized all function calls into four groups: ARI (Adjusted Rand Index computation), Partitioning (quantile-based clustering), Coordination (algorithm orchestration), and NumPy/Numba (numerical operations).
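This categorization can be sketched as follows (the category patterns and helper below are illustrative; the actual analysis matched CCC’s real function names):

```python
import cProfile
import pstats
import re

# Illustrative category patterns; the real analysis matched the actual CCC
# function names (e.g., adjusted_rand_index, get_pair_confusion_matrix).
CATEGORIES = {
    "ARI": re.compile(r"rand|confusion|contingency"),
    "Partitioning": re.compile(r"partition|quantile"),
}

def profile_by_category(func, *args):
    """Profile func and aggregate each function's own (exclusive) time
    into named categories, with everything else under 'Other'."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    totals = {name: 0.0 for name in CATEGORIES}
    totals["Other"] = 0.0
    # Stats.stats maps (file, line, funcname) -> (cc, nc, tottime, cumtime, callers).
    for (_, _, funcname), (_, _, tottime, _, _) in pstats.Stats(profiler).stats.items():
        for name, pattern in CATEGORIES.items():
            if pattern.search(funcname):
                totals[name] += tottime
                break
        else:
            totals["Other"] += tottime
    return totals
```

The per-category shares of total runtime reported in Figures S2–S4 were derived from aggregations of this kind.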
Figure S2 shows the runtime breakdown by individual function for a medium workload (2,500 features × 500 samples). The three ARI-related functions (adjusted_rand_index, get_pair_confusion_matrix, and get_contingency_matrix) collectively consume approximately 80% of total runtime, confirming that ARI computation is the dominant bottleneck.
Figure S3 compares runtime distribution across three feature sizes: Small (500 features), Medium (2,500 features), and Large (5,000 features), all with 500 samples. The ARI category consistently dominates across all workload sizes, accounting for ~74% of runtime.
Figure S4 shows how ARI’s share of total runtime changes as sample count increases (500 to 4,000 samples, fixed 500 features). The ARI percentage increases from approximately 74% at 500 samples to over 91% at 4,000 samples, demonstrating that for larger datasets typical in transcriptomics, ARI computation becomes even more dominant. This analysis confirms that GPU acceleration of ARI computation addresses the correct computational bottleneck.
| Number of genes | CCC-GPU vs. CCC (6 cores) | CCC-GPU vs. CCC (12 cores) | CCC-GPU vs. CCC (24 cores) |
|---|---|---|---|
| 500 | 17.60x | 16.52x | 16.10x |
| 1,000 | 56.17x | 30.65x | 21.82x |
| 2,000 | 87.06x | 45.72x | 24.45x |
| 4,000 | 116.39x | 59.46x | 33.03x |
| 6,000 | 128.74x | 67.46x | 34.77x |
| 8,000 | 140.24x | 71.48x | 38.67x |
| 10,000 | 138.51x | 72.38x | 37.02x |
| 16,000 | 142.70x | 73.83x | 37.53x |
| 20,000 | 142.83x | 73.88x | 37.73x |
| Number of genes | CCC-GPU | CCC (12 cores) | Spearman (12 cores) | Pearson (12 cores) |
|---|---|---|---|---|
| 500 | 0.279s | 4.603s | 0.051s | 0.024s |
| 1,000 | 0.529s | 16.198s | 0.089s | 0.045s |
| 2,000 | 1.384s | 63.263s | 0.235s | 0.138s |
| 4,000 | 4.290s | 255.071s | 0.679s | 0.489s |
| 6,000 | 8.441s | 569.409s | 1.379s | 1.093s |
| 8,000 | 14.116s | 1009.069s | 2.443s | 1.973s |
| 10,000 | 21.694s | 1570.269s | 4.379s | 3.448s |
| 16,000 | 55.469s | 4051.114s | 11.169s | 9.310s |
| 20,000 | 86.286s | 6374.781s | 20.282s | 17.238s |