Result: A review on multi-omics integration for aiding study design of large scale TCGA cancer datasets.
Background: Rapid advancements in high-throughput sequencing technologies allow for detailed and accurate measurement of omics features within their biological context. The integration of different omics types creates heterogeneous datasets, presenting challenges in analysis due to variations in measurement units, sample numbers, and features. Currently, there is a lack of generalized guidelines for making decisions in multi-omics study design (MOSD), such as selecting an appropriate number of samples and features and choosing the type of preprocessing and integration needed for robust analysis results. We propose a suggestive guideline for MOSD, involving nine important factors: sample size, feature selection, preprocessing strategy, noise characterization, class balance, number of classes, cancer subtype combination, omics combination, and clinical features.

Results: To assess the effectiveness of our proposed MOSD guidelines, we designed and conducted seven benchmark tests using 10 clustering methods on various TCGA cancer datasets, with the objective of clustering cancer subtypes. The results indicated robust performance in terms of cancer subtype discrimination when adhering to the following criteria: 26 or more samples per class, selecting less than 10% of omics features, maintaining a sample balance under a 3:1 ratio, and keeping the noise level below 30%. Feature selection was particularly important, improving clustering performance by 34%.

Conclusion: These findings provide evidence-based recommendations for MOSD, enabling researchers to optimize analytical approaches and enhance the reliability of results across cancer datasets. The proposed MOSD framework offers a suggestive guideline addressing both computational and biological factors for multi-omics data integration. [ABSTRACT FROM AUTHOR]
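The four numeric criteria reported in the Results can be read as a simple pre-analysis checklist. The following is a minimal sketch of such a check, not the authors' code: the function name `check_design`, its parameters, the example subtype labels, and the way "noise level" is passed in as a single fraction are all assumptions made for illustration. Only the threshold values come from the abstract.

```python
from collections import Counter

# Thresholds as reported in the abstract. How each quantity is measured
# (for example, what counts as "noise") is an assumption of this sketch.
MIN_SAMPLES_PER_CLASS = 26
MAX_FEATURE_FRACTION = 0.10
MAX_CLASS_RATIO = 3.0
MAX_NOISE_FRACTION = 0.30


def check_design(labels, n_selected, n_total, noise_fraction):
    """Return a dict mapping each criterion to True (met) or False (not met).

    labels:         one class label per sample (hypothetical input format)
    n_selected:     number of omics features kept after feature selection
    n_total:        number of omics features before selection
    noise_fraction: estimated noise level as a fraction between 0 and 1
    """
    counts = Counter(labels)
    smallest = min(counts.values())
    largest = max(counts.values())
    return {
        "samples_per_class": smallest >= MIN_SAMPLES_PER_CLASS,
        "feature_fraction": n_selected / n_total < MAX_FEATURE_FRACTION,
        "class_balance": largest / smallest < MAX_CLASS_RATIO,
        "noise_level": noise_fraction < MAX_NOISE_FRACTION,
    }


# Illustrative values only; the labels and counts are made up.
report = check_design(
    labels=["LumA"] * 60 + ["Basal"] * 30,
    n_selected=1500,
    n_total=20000,
    noise_fraction=0.1,
)
for criterion, met in report.items():
    print(f"{criterion}: {'ok' if met else 'not met'}")
```

Returning one flag per criterion, rather than a single pass/fail value, keeps it visible which design factor falls outside the reported range.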
Copyright of BMC Genomics is the property of BioMed Central and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)