<?php  
            require('/srv/new-pegasus.isi.edu/includes/common.php'); 
            pegasus_header("10.6. Data Cleanup");
        ?><div class="breadcrumbs">
</div><hr><div class="section">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="data_cleanup"></a>10.6. Data Cleanup</h2></div></div></div>
<div class="toc"><dl class="toc"><dt><span class="section"><a href="data_cleanup.php#idm4816">10.6.1. Data Cleanup in Hierarchical Workflows</a></span></dt></dl></div>
<p>Users executing large workflows often run out of disk space
    on the remote clusters or staging site. Pegasus provides several ways
    to enable automated data cleanup on the staging site (i.e. the scratch
    space used by the workflows). It does so by adding data cleanup jobs
    to the executable workflow that the Pegasus Mapper generates. These
    cleanup jobs remove files and directories during workflow
    execution. To enable data cleanup, pass the --cleanup
    option to pegasus-plan. The value passed selects the cleanup
    strategy:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p><span class="bold"><strong>none</strong></span> disables cleanup
        altogether. The planner does not add any cleanup jobs to the
        executable workflow.</p></li>
<li class="listitem"><p><span class="bold"><strong>leaf</strong></span> the planner adds one leaf
        cleanup node per staging site, which removes the directory created by
        the create dir job in the workflow.</p></li>
<li class="listitem"><p><span class="bold"><strong>inplace</strong></span> the mapper adds cleanup
        nodes per level of the workflow in addition to leaf cleanup nodes. The
        nodes remove files no longer required during execution. For example,
        an added cleanup node will remove the input files of a particular compute
        job after the job has finished successfully. Starting with the 4.8.0 release,
        the number of cleanup nodes this algorithm creates on a particular
        level is dictated by the number of nodes it encounters on that level of
        the workflow.</p></li>
<li class="listitem"><p><span class="bold"><strong>constraint</strong></span> the mapper adds
        cleanup nodes to constrain the amount of storage space used by a
        workflow, in addition to leaf cleanup nodes. The nodes remove files no
        longer required during execution. The added cleanup nodes guarantee
        limits on disk usage. File sizes are read from the <span class="bold"><strong>size</strong></span> flag in the DAX, or from a CSV file (<a class="link" href="properties.php#cleanup_props" title="13.3.13. Cleanup Properties"><span class="emphasis"><em>pegasus.file.cleanup.constraint.csv</em></span></a>).</p></li>
</ol></div>
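<p>The four strategy values above can be sketched as a small helper that
    builds a pegasus-plan command line. This is only an illustration: the
    function name <code>plan_command</code> and the file name
    <code>workflow.dax</code> are hypothetical, and all other planner options
    (sites, submit directory and so on) are left out.</p>

```python
# Cleanup strategies listed in this section.
CLEANUP_STRATEGIES = ("none", "leaf", "inplace", "constraint")


def plan_command(dax_file, cleanup):
    """Return an argv list for pegasus-plan with the chosen cleanup strategy.

    Hypothetical helper: it only assembles the arguments, it does not run
    the planner, and it omits every option unrelated to cleanup.
    """
    if cleanup not in CLEANUP_STRATEGIES:
        raise ValueError("unknown cleanup strategy: %r" % (cleanup,))
    return ["pegasus-plan", "--dax", dax_file, "--cleanup", cleanup]


if __name__ == "__main__":
    # Print the command line for a leaf-only cleanup run.
    print(" ".join(plan_command("workflow.dax", "leaf")))
```

<p>Validating the value before calling the planner catches typos such as
    <code>in-place</code> early, instead of at planning time.</p>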
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>For large workflows with lots of files, the inplace strategy may
      take a long time, because the algorithm works at a per-file level to figure
      out when it is safe to remove a file.</p>
</div>
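<p>To give a feel for the per-file reasoning mentioned in the note, the
    sketch below computes, for every file, the set of jobs that must finish
    before the file can be removed. It is a simplified illustration, not the
    algorithm implemented in the Pegasus Mapper: the function name, the
    <code>DIAMOND</code> example workflow and its file names are assumptions,
    and stage-out of final outputs is ignored.</p>

```python
def cleanup_dependencies(jobs):
    """Map each file to the jobs that must finish before it can be removed.

    `jobs` maps a job name to a pair (inputs, outputs). A cleanup node for
    a file would have to depend on every job that reads or writes the file.
    """
    users = {}
    for name, (inputs, outputs) in jobs.items():
        for f in list(inputs) + list(outputs):
            users.setdefault(f, set()).add(name)
    return {f: sorted(names) for f, names in users.items()}


# Hypothetical diamond-shaped workflow used only for illustration.
DIAMOND = {
    "preprocess": (["f.a"], ["f.b1", "f.b2"]),
    "findrange1": (["f.b1"], ["f.c1"]),
    "findrange2": (["f.b2"], ["f.c2"]),
    "analyze": (["f.c1", "f.c2"], ["f.d"]),
}

if __name__ == "__main__":
    for f, parents in sorted(cleanup_dependencies(DIAMOND).items()):
        print(f, "can be removed after", ", ".join(parents))
```

<p>Because every file is examined individually, the amount of work grows
    with the number of files in the workflow, which is why the strategy can be
    slow for workflows with very many files.</p>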
<p>The behaviour of the cleanup strategies implemented in the Pegasus
    Mapper can be controlled by the properties described <a class="link" href="properties.php#cleanup_props" title="13.3.13. Cleanup Properties">here</a>.</p>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="idm4816"></a>10.6.1. Data Cleanup in Hierarchical Workflows</h3></div></div></div>
<p>By default, inplace cleanup is always turned off for hierarchical
      workflows, because the InPlace cleanup algorithm does not
      work across sub workflows. For example, suppose the top level workflow
      has two DAX jobs, and the child DAX job refers to a file generated
      during the execution of the parent DAX job. Applying the InPlace cleanup
      algorithm to the parent DAX job deletes that file while the sub workflow
      corresponding to the parent DAX job executes. The sub workflow
      corresponding to the child DAX job then fails, because the deleted file
      must be present during its execution.</p>
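<p>The failure described above can be reproduced with a toy simulation.
    The sketch below is an assumption-laden model, not Pegasus code: each sub
    workflow is a list of (inputs, outputs) pairs run in order against a shared
    scratch directory modelled as a set, and the hypothetical
    <code>run_sub_workflow</code> function removes a file as soon as no later
    job of the same sub workflow needs it.</p>

```python
def run_sub_workflow(jobs, scratch, inplace_cleanup):
    """Run jobs in order against a shared scratch set; return missing inputs."""
    missing = []
    for position, (inputs, outputs) in enumerate(jobs):
        missing.extend(f for f in inputs if f not in scratch)
        scratch.update(outputs)
        if inplace_cleanup:
            # Only later jobs of this sub workflow are consulted, so files
            # needed by a different sub workflow look unused.
            still_needed = set()
            for later_inputs, _ in jobs[position + 1:]:
                still_needed.update(later_inputs)
            for f in list(inputs) + list(outputs):
                if f not in still_needed:
                    scratch.discard(f)
    return missing


# Hypothetical sub workflows: the child reads f.b, which the parent writes.
PARENT = [(["f.a"], ["f.b"])]
CHILD = [(["f.b"], ["f.c"])]

if __name__ == "__main__":
    scratch = {"f.a"}
    run_sub_workflow(PARENT, scratch, inplace_cleanup=True)
    print("missing in child:", run_sub_workflow(CHILD, scratch, False))
```

<p>With cleanup enabled for the parent, the child finds <code>f.b</code>
    missing; with cleanup disabled, the file is still in the scratch set when
    the child runs.</p>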
<p>If there are no data dependencies across the DAX jobs, you can
      enable the InPlace algorithm for the sub DAXes. To do this,
      set the property:</p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>pegasus.file.cleanup.scope deferred</p></li></ul></div>
<p>This causes the cleanup option to be picked up from the
      arguments for the DAX job in the top level DAX.</p>
</div>
</div><div class="navfooter">
</div><?php  
            pegasus_footer();
        ?>
