TechRxiv

DataCurator.jl: Efficient, portable, and reproducible validation, curation, and transformation of large heterogeneous datasets using human-readable recipes compiled into machine verifiable templates

preprint
posted on 2023-02-15, 16:11, authored by Ben Cardoen, Hanene Ben Yedder, Sieun Lee, Ivan Robert Nabi, Ghassan Hamarneh

Large-scale processing of heterogeneous datasets in interdisciplinary research often requires time-consuming manual data curation. Ambiguity in the data layout and preprocessing conventions can easily compromise reproducibility and scientific discovery, and even when detected, correcting it costs domain experts time and effort. Poor data curation can also interrupt processing jobs on large computing clusters, causing frustration and delays. We introduce DataCurator, a portable software package that verifies arbitrarily complex datasets of mixed formats and works as well on clusters as on local systems. Human-readable TOML recipes are converted into executable, machine-verifiable templates, enabling users to verify datasets using custom rules without writing code. Recipes can be used to transform and validate data, for pre- or post-processing, selection of data subsets, sampling, and aggregation, such as summary statistics. Processing pipelines no longer need to be burdened by laborious data validation: data curation and validation are replaced by human- and machine-verifiable recipes that specify rules and actions. Multithreaded execution ensures scalability on clusters, and existing Julia, R, and Python libraries can be reused. DataCurator enables efficient remote workflows, offering integration with Slack and the ability to transfer curated data to clusters using OwnCloud and SCP. Code available at: https://github.com/bencardoen/DataCurator.jl.

Email Address of Submitting Author

bcardoen@sfu.ca

ORCID of Submitting Author

0000-0001-6871-1165

Submitting Author's Institution

Simon Fraser University

Submitting Author's Country

Canada