Navigating Heterogeneity to Learn from Large-Scale Cancer Data: Optimization, Redundancy, and Generalization / John Crawford.

Dissertations & Theses @ University of Pennsylvania Available online

Format:
Book
Thesis/Dissertation
Author/Creator:
Crawford, John, author.
Contributor:
University of Pennsylvania. Genomics and Computational Biology, degree granting institution.
Language:
English
Subjects (All):
Biology.
Oncology.
Systematic biology.
Genomics and Computational Biology--Penn dissertations.
Penn dissertations--Genomics and Computational Biology.
Physical Description:
1 online resource (164 pages)
Distribution:
Ann Arbor : ProQuest Dissertations & Theses, 2023
Contained In:
Dissertations Abstracts International 85-08B.
Place of Publication:
[Philadelphia, Pennsylvania] : University of Pennsylvania, 2022.
Language Note:
English
Summary:
In the pursuit of molecular characterization of diverse cancers, collaborative efforts have generated large publicly available datasets, which combine various data types and data sources. Simultaneously, machine learning has rapidly gravitated toward models with many parameters that can be trained on broad sets of data, and subsequently fine-tuned to a wide variety of tasks. Computational oncology sits squarely at the intersection of these advances. However, the structure of most cancer datasets is uniquely heterogeneous, relative to other fields and data types in which large models have proven successful. In this dissertation, we first study aspects of machine learning model tuning in cancer, showing that the choice of optimizer used to fit models on cancer transcriptomics datasets can have pronounced effects on model selection. We then explore two aspects of heterogeneity inherent to public cancer datasets that affect machine learning modeling choices. We first show that most -omics types available in the TCGA Pan-Cancer Atlas can capture information relevant to cancer function, but somewhat less intuitively, when multiple -omics types are combined there is considerable redundancy and model performance does not generally improve. Next, we study model generalization across biological contexts in cancer transcriptomics and its implications for model selection, finding that cross-validation performance on holdout data is a sufficient selection criterion, and criteria that incorporate model sparsity or simplicity do not tend to improve generalization performance. Overall, our results show that the particularities of large cancer genomics datasets must be considered for applications of machine learning to be successful in this domain. These findings suggest hurdles to, but also opportunities for, machine learning models integrating pan-cancer and pan-omics data for biological and clinical insights.
Notes:
Source: Dissertations Abstracts International, Volume: 85-08, Section: B.
Advisors: Greene, Casey S.; Ritchie, Marylyn D.; Committee members: Li, Mingyao; Tan, Kai; Camara, Pablo G.; Slonim, Donna K.
Department: Genomics and Computational Biology.
Ph.D. University of Pennsylvania 2023.
Local Notes:
School code: 0175
ISBN:
9798381471557
Access Restriction:
Restricted for use by site license.