Fine-grained provenance and applications to data analytics computation / Nan Zheng.
- Format:
- Book
- Thesis/Dissertation
- Author/Creator:
- Zheng, Nan, author.
- Language:
- English
- Subjects (All):
- Computer science.
- Computer engineering.
- Information science.
- Computer and Information Science--Penn dissertations.
- Penn dissertations--Computer and Information Science.
- Local Subjects:
- Computer science.
- Computer engineering.
- Information science.
- Computer and Information Science--Penn dissertations.
- Penn dissertations--Computer and Information Science.
- Genre:
- Academic theses.
- Physical Description:
- 1 online resource (141 pages)
- Contained In:
- Dissertations Abstracts International 82-12B.
- Place of Publication:
- [Philadelphia, Pennsylvania] : University of Pennsylvania ; Ann Arbor : ProQuest Dissertations & Theses, 2021.
- Language Note:
- English
- System Details:
- Mode of access: World Wide Web.
- text file
- Summary:
- Data provenance tools seek to facilitate reproducible data science and auditable data analyses by capturing the analytics steps used in generating data analysis results. However, analysts must choose among workflow provenance systems, which allow arbitrary code but only track provenance at the granularity of files; provenance APIs, which provide tuple-level provenance but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization, but support only a limited subset of data science tasks. None of these solutions is well suited for tracing errors introduced during common ETL, record alignment, and matching tasks for data types such as strings and images. Additionally, we need a provenance archival layer to store and manage the tracked fine-grained provenance, enabling future sophisticated reasoning about why individual output results appear or fail to appear. For reproducibility and auditing, the provenance archival system should be tamper-resistant. At the same time, the provenance collected over time or within the same query computation tends to be partially repeated (i.e., the same operation applied to the same input records in an intermediate computation step). Hence, we desire efficient provenance storage that compresses repeated results. We address these challenges with novel formalisms and algorithms, implemented in the PROVision system, for reconstructing fine-grained provenance for a broad class of ETL-style workflows. We extend database-style provenance techniques to capture equivalences, support optimizations, and enable lazy evaluation. We develop solutions for storing fine-grained provenance in relational storage systems while both compressing it and protecting it via cryptographic hashes. We experimentally validate our proposed solutions using both scientific and OLAP workloads.
- Notes:
- Source: Dissertations Abstracts International, Volume: 82-12, Section: B.
- Advisors: Ives, Zachary G.; Committee members: Susan Davidson; Boon Thau Loo; Andreas Haeberlen; Junhyong Kim.
- Department: Computer and Information Science.
- Ph.D. University of Pennsylvania 2021.
- Local Notes:
- School code: 0175
- ISBN:
- 9798738617621
- Access Restriction:
- Restricted for use by site license.
- This item must not be sold to any third party vendors.