
Managing Data-Intensive Scientific Workflows in Distributed Environments

Ewa Deelman

Information Sciences Institute
University of Southern California, USA

deelman@isi.edu

Abstract:

In this talk we examine how to optimize disk usage and schedule large-scale scientific workflows onto distributed resources when the workflows are data-intensive, requiring large amounts of storage, and the resources have limited storage capacity. Our approach is two-fold: we minimize the amount of space a workflow requires during execution by removing data files at runtime when they are no longer needed, and we demonstrate that some workflows may need to be restructured in order to significantly reduce their data footprint. We describe the results of our data management and workflow restructuring solutions using a Laser Interferometer Gravitational-Wave Observatory (LIGO) application, the binary inspiral analysis, and an astronomy application, Montage, running on the Open Science Grid. We also examine the cost of the restructuring in terms of the application's runtime.
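
The idea of removing files at runtime can be illustrated with a small sketch. It is not the system presented in the talk: the function name peak_footprint, the job and file names, and the sizes are all hypothetical, and it assumes jobs run one at a time with external inputs staged in before the run starts. A file is deleted as soon as the last job that reads it has finished, and the peak disk usage is compared with and without that cleanup.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def peak_footprint(jobs, sizes, cleanup=True):
    """Return the peak disk usage of a sequential workflow run.

    jobs:  dict mapping job name -> (input files, output files)
    sizes: dict mapping file name -> size (arbitrary units)
    Hypothetical model for illustration only.
    """
    # Which job produces each file.
    producer = {f: name for name, (_, outs) in jobs.items() for f in outs}
    # Job dependencies derived from producer/consumer file relations.
    deps = {name: {producer[f] for f in ins if f in producer}
            for name, (ins, _) in jobs.items()}
    order = list(TopologicalSorter(deps).static_order())

    # Number of unfinished jobs that still read each file.
    remaining = {}
    for ins, _ in jobs.values():
        for f in set(ins):
            remaining[f] = remaining.get(f, 0) + 1

    # Assumption: external inputs are on disk before the first job starts.
    used = sum(sizes[f] for f in remaining if f not in producer)
    peak = used
    for name in order:
        ins, outs = jobs[name]
        used += sum(sizes[f] for f in outs)  # outputs are written
        peak = max(peak, used)
        for f in set(ins):
            remaining[f] -= 1
            if cleanup and remaining[f] == 0:
                used -= sizes[f]  # last reader finished: delete the file
    return peak


# A three-stage chain: raw -> a -> b -> out
jobs = {
    "j1": (["raw"], ["a"]),
    "j2": (["a"], ["b"]),
    "j3": (["b"], ["out"]),
}
sizes = {"raw": 10, "a": 10, "b": 10, "out": 1}

without_cleanup = peak_footprint(jobs, sizes, cleanup=False)  # 31
with_cleanup = peak_footprint(jobs, sizes, cleanup=True)      # 20
```

In this toy chain the peak drops from 31 to 20 units. With cleanup enabled, the peak also depends on the order in which jobs run, which is one way to see why restructuring a workflow can reduce its footprint further.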
