<?php  
            include_once( $_SERVER['DOCUMENT_ROOT']."/static/includes/common.inc.php" );
            do_html_header("Documentation");
        ?><div id="content">
<div class="navheader">
<table width="100%" summary="Navigation header"><tr>
<td width="20%" align="left">
<a accesskey="p" href="running_workflows.php">Prev</a> </td>
<td width="60%" align="center"><a accesskey="h" href="index.php">Table of Contents</a></td>
<td width="20%" align="right"> <a accesskey="n" href="data_staging_configuration.php">Next</a>
</td>
</tr></table>
<hr>
</div>
<div class="section" title="5.2. Mapping Refinement Steps">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="mapping_refinement_steps"></a>5.2. Mapping Refinement Steps</h2></div></div></div>
<div class="toc"><dl>
<dt><span class="section"><a href="mapping_refinement_steps.php#idp44744288">5.2.1. Data Reuse</a></span></dt>
<dt><span class="section"><a href="mapping_refinement_steps.php#idp35180144">5.2.2. Site Selection</a></span></dt>
<dt><span class="section"><a href="mapping_refinement_steps.php#idp42615296">5.2.3. Job Clustering</a></span></dt>
<dt><span class="section"><a href="mapping_refinement_steps.php#idp40807504">5.2.4. Addition of Data Transfer and
      Registration Nodes</a></span></dt>
<dt><span class="section"><a href="mapping_refinement_steps.php#idp40321216">5.2.5. Addition of Create Dir and Cleanup Jobs</a></span></dt>
<dt><span class="section"><a href="mapping_refinement_steps.php#idp41385824">5.2.6. Code Generation</a></span></dt>
</dl></div>
<p>During the mapping process, the abstract workflow undergoes a series
    of refinement steps that convert it to an executable form.</p>
<div class="section" title="5.2.1. Data Reuse">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp44744288"></a>5.2.1. Data Reuse</h3></div></div></div>
<p>After parsing, the abstract workflow is optionally handed over to
      the Data Reuse Module. The Data Reuse Algorithm in Pegasus attempts to
      prune all the nodes in the abstract workflow for which the output files
      already exist in the Replica Catalog. It also attempts to cascade the
      deletion to the parents of the deleted nodes. For example, if the output
      files of the leaf nodes already exist in the Replica Catalog, Pegasus
      will prune the entire workflow, because the output files the user is
      interested in already exist.</p>
<p>The Data Reuse Algorithm works in two passes:</p>
<p><span class="bold"><strong>First Pass</strong></span> - Determine all the
      jobs whose output files exist in the Replica Catalog. An output file
      with the transfer flag set to false is treated as equivalent to a file
      existing in the Replica Catalog, provided that the output file is not an
      input to any of the children of the job.</p>
<p><span class="bold"><strong>Second Pass</strong></span> - The algorithm
      removes the jobs whose output files exist in the Replica Catalog and
      tries to cascade the deletion upwards to the parent jobs. The workflow
      is traversed breadth first, bottom up, and a node is marked for deletion
      if:</p>
<pre class="programlisting">( It is already marked for deletion in Pass 1
     OR
      ( ALL of its children have been marked for deletion
        AND
        Node's output files have transfer flags set to false
       )
 )</pre>
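<p>The two passes can be illustrated with a minimal Python sketch. It is
      not the Pegasus implementation: the function name, the dictionaries
      describing the workflow, and the rule that the cascade only applies to
      jobs that have children are assumptions made for this example.</p>

```python
from collections import deque


def data_reuse(jobs, children, inputs, outputs, transfer, replica_catalog):
    """Return the set of jobs that the two passes would prune.

    All arguments are hypothetical structures used only for this sketch:
    jobs is a list of job ids, children maps a job to its child jobs,
    inputs and outputs map a job to lists of file names, transfer maps a
    file name to its transfer flag, replica_catalog is a set of file names.
    """

    def available(job, lfn):
        # A file counts as available if it is in the Replica Catalog, or if
        # its transfer flag is false and no child of the job reads it.
        if lfn in replica_catalog:
            return True
        read_by_child = any(lfn in inputs[c] for c in children[job])
        return not transfer[lfn] and not read_by_child

    # First pass: mark jobs whose output files are all available.
    marked = {j for j in jobs
              if outputs[j] and all(available(j, f) for f in outputs[j])}

    # Second pass: breadth first, bottom up. A job is visited only after
    # all of its children have been visited.
    parents = {j: [] for j in jobs}
    for j in jobs:
        for c in children[j]:
            parents[c].append(j)
    pending = {j: len(children[j]) for j in jobs}
    queue = deque(j for j in jobs if pending[j] == 0)
    deleted = set()
    while queue:
        job = queue.popleft()
        kids = children[job]
        # Assumption: the cascade rule only applies to jobs with children.
        cascade = (bool(kids)
                   and all(k in deleted for k in kids)
                   and not any(transfer[f] for f in outputs[job]))
        if job in marked or cascade:
            deleted.add(job)
        for p in parents[job]:
            pending[p] -= 1
            if pending[p] == 0:
                queue.append(p)
    return deleted


# Chain A -> B -> C; only the leaf output f.c is in the Replica Catalog.
jobs = ["A", "B", "C"]
children = {"A": ["B"], "B": ["C"], "C": []}
inputs = {"A": [], "B": ["f.a"], "C": ["f.b"]}
outputs = {"A": ["f.a"], "B": ["f.b"], "C": ["f.c"]}
transfer = {"f.a": False, "f.b": False, "f.c": True}
pruned = data_reuse(jobs, children, inputs, outputs, transfer, {"f.c"})
print(sorted(pruned))
```

<p>In this example the leaf output already exists and the intermediate
      files are not transferred, so the deletion cascades up the chain and all
      three jobs are pruned. If f.b had its transfer flag set to true, only
      job C would be pruned.</p>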
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>The Data Reuse Algorithm can be disabled by passing the
        <span class="bold"><strong>--force</strong></span> option to
        pegasus-plan.</p>
</div>
<div class="figure">
<a name="idp36582256"></a><p class="title"><b>Figure 5.2. Workflow Data Reuse</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="./images/refinement-data-reuse.png" align="middle" alt="Workflow Data Reuse"></div></div>
</div>
<br class="figure-break">
</div>
<div class="section" title="5.2.2. Site Selection">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp35180144"></a>5.2.2. Site Selection</h3></div></div></div>
<p>The abstract workflow is then handed over to the Site Selector
      module, where the abstract jobs in the pruned workflow are mapped to the
      sites specified by the user. The target sites for planning are
      specified on the command line using the <span class="bold"><strong>--sites</strong></span>
      option to pegasus-plan. If it is not specified, Pegasus picks up all the
      sites in the Site Catalog as candidate sites.
      Pegasus will map a compute job to a site only if Pegasus can:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>find an INSTALLED executable on the site</p></li>
<li class="listitem">
<p>OR find a STAGEABLE executable that can be staged to the site
          as part of the workflow execution.</p>
<p>Pegasus supports a variety of site selectors, with Random being
          the default:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="circle">
<li class="listitem">
<p><span class="bold"><strong>Random</strong></span></p>
<p>The jobs will be randomly distributed among the sites that
              can execute them.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>RoundRobin</strong></span></p>
<p>The jobs will be assigned in a round robin manner amongst
              the sites that can execute them. Since not every site can
              execute every type of job, the round robin scheduling is done
              per level on a sorted list. The list is sorted by the number of
              jobs a particular site has been assigned in that level so far.
              If a job cannot be run on the first site in the queue (because
              there is no matching entry in the transformation catalog for
              the transformation referred to by the job), it goes to the next
              one and so on. This implementation defaults to classic round
              robin in the case where all the jobs in the workflow can run on
              all the sites.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Group</strong></span></p>
<p>Jobs in the same group will be assigned to the same site, one that
              can execute them. The use of the <span class="bold"><strong>PEGASUS
              profile key group</strong></span> in the DAX associates a job with a
              particular group. Jobs that do not have the profile key
              associated with them are put in the default group. The jobs
              in the default group are handed over to the "Random" Site
              Selector for scheduling.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Heft</strong></span></p>
<p>A version of the HEFT processor scheduling algorithm is
              used to schedule jobs in the workflow to multiple grid sites.
              The implementation assumes default data communication costs when
              jobs are not scheduled on to the same site. Later on this may be
              made more configurable.</p>
<p>The runtime for the jobs is specified in the
              transformation catalog by associating the <span class="bold"><strong>pegasus profile key runtime</strong></span> with the
              entries.</p>
<p>The number of processors in a site is picked up from the
              attribute <span class="bold"><strong>idle-nodes</strong></span> associated
              with the vanilla jobmanager of the site in the site
              catalog.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>NonJavaCallout</strong></span></p>
<p>Pegasus will call out to an external site selector. In this
              mode, a temporary file containing the job information is
              prepared and passed to the site selector as an argument when
              it is invoked. The path to the site selector is specified by
              setting the property pegasus.selector.site.path. The environment
              variables that need to be set to run the site selector can be
              specified using properties with the pegasus.selector.site.env.
              prefix. The temporary file contains information about the job
              that needs to be scheduled, as key value pairs, with
              each pair on a new line and the key separated from the value
              by an = character.</p>
<p>The following pairs are currently generated for the site
              selector temporary file in the NonJavaCallout mode.</p>
<div class="table">
<a name="idp42268576"></a><p class="title"><b>Table 5.1. Key Value Pairs that are currently generated
                for the site selector temporary file in the
                NonJavaCallout mode.</b></p>
<div class="table-contents"><table summary="Key Value Pairs that are currently generated
                for the site selector temporary file in the
                NonJavaCallout mode." border="1">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td><span class="bold"><strong>Key</strong></span></td>
<td><span class="bold"><strong>Value</strong></span></td>
</tr>
<tr>
<td>version</td>
<td>is the version of the site selector API, currently
                      2.0.</td>
</tr>
<tr>
<td>transformation</td>
<td>is the fully qualified definition identifier for
                      the transformation (TR), namespace::name:version.</td>
</tr>
<tr>
<td>derivation</td>
<td>is the fully qualified definition identifier for
                      the derivation (DV), namespace::name:version.</td>
</tr>
<tr>
<td>job.level</td>
<td>is the job's depth in the tree of the workflow
                      DAG.</td>
</tr>
<tr>
<td>job.id</td>
<td>is the job's ID, as used in the DAX file.</td>
</tr>
<tr>
<td>resource.id</td>
<td>is a pool handle, followed by whitespace,
                      followed by a gridftp server. Typically, each gridftp
                      server is enumerated once, so you may have multiple
                      occurrences of the same site. There can be multiple
                      occurrences of this key.</td>
</tr>
<tr>
<td>input.lfn</td>
<td>is an input LFN, optionally followed by
                      whitespace and the file size. There can be multiple
                      occurrences of this key, one for each input LFN required
                      by the job.</td>
</tr>
<tr>
<td>wf.name</td>
<td>is the label of the DAX, as found in the DAX's root
                      element.</td>
</tr>
<tr>
<td>wf.index</td>
<td>is the DAX index, which is incremented
                      for each partition in case of deferred planning.</td>
</tr>
<tr>
<td>wf.time</td>
<td>is the mtime of the workflow.</td>
</tr>
<tr>
<td>wf.manager</td>
<td>is the name of the workflow manager being used,
                      e.g. condor.</td>
</tr>
<tr>
<td>vo.name</td>
<td>is the name of the virtual organization that is
                      running this workflow. It is currently set to
                      NONE.</td>
</tr>
<tr>
<td>vo.group</td>
<td>is unused at present and is set to NONE.</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break">
</li>
</ul></div>
</li>
</ul></div>
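<p>The per-level behaviour described for the RoundRobin selector can be
      sketched as follows. This is an illustration only, not the Pegasus
      code: the function name and the <code>can_run</code> mapping (standing
      in for a transformation catalog lookup) are hypothetical.</p>

```python
def round_robin(levels, sites, can_run):
    """Map jobs to sites level by level.

    levels  : list of lists of job ids, one inner list per workflow level
    sites   : list of candidate site handles
    can_run : dict mapping a job id to the set of sites that have a
              matching transformation catalog entry (hypothetical input)
    """
    mapping = {}
    for jobs in levels:
        # The per-site counters are reset for every level.
        assigned = {site: 0 for site in sites}
        for job in jobs:
            # Least loaded site first; sorted() is stable, so ties keep
            # the original order of the site list.
            for site in sorted(sites, key=lambda s: assigned[s]):
                if site in can_run[job]:
                    mapping[job] = site
                    assigned[site] += 1
                    break
            else:
                raise ValueError("no candidate site can run " + job)
    return mapping


sites = ["A", "B"]
levels = [["j1", "j2", "j3"]]
can_run = {"j1": {"A", "B"}, "j2": {"A"}, "j3": {"A", "B"}}
mapping = round_robin(levels, sites, can_run)
print(mapping)
```

<p>Here j2 can only run on site A, so it skips site B even though B
      is first in the sorted list; j3 then goes to B, the site with the
      fewest jobs in that level. When every job can run on every site, the
      same loop reduces to classic round robin.</p>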
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>The site selector to use for site selection can be specified by
        setting the property <span class="bold"><strong>pegasus.selector.site</strong></span>.</p>
</div>
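<p>For the NonJavaCallout selector, an external program has to read the
      temporary file of key value pairs listed in the table above. A minimal
      sketch of such a reader is shown below. The function names and the
      sample values are made up for illustration, and the sketch does not
      cover how the selector reports its decision back to Pegasus.</p>

```python
from collections import defaultdict


def parse_selector_file(text):
    """Parse key=value lines; repeated keys are collected into lists."""
    pairs = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first = only, so values may themselves contain =.
        key, value = line.split("=", 1)
        pairs[key.strip()].append(value.strip())
    return dict(pairs)


def candidate_sites(pairs):
    """Return the distinct site handles from the resource.id entries."""
    handles = []
    for entry in pairs.get("resource.id", []):
        parts = entry.split()  # site handle, whitespace, gridftp server
        if parts and parts[0] not in handles:
            handles.append(parts[0])
    return handles


# Hypothetical file content; real values are generated by the planner.
sample = """version=2.0
job.id=ID000001
resource.id=siteA gsiftp://a.example.org
resource.id=siteA gsiftp://a2.example.org
resource.id=siteB gsiftp://b.example.org
input.lfn=f.a 1024
"""
pairs = parse_selector_file(sample)
print(candidate_sites(pairs))
```

<p>A real selector would receive the path of the temporary file as a
      command line argument and read the text from that file. Keeping the
      values as lists matters because resource.id and input.lfn can occur
      more than once.</p>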
<div class="figure">
<a name="idp42237056"></a><p class="title"><b>Figure 5.3. Workflow Site Selection</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="./images/refinement-site-selection.png" align="middle" alt="Workflow Site Selection"></div></div>
</div>
<br class="figure-break">
</div>
<div class="section" title="5.2.3. Job Clustering">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp42615296"></a>5.2.3. Job Clustering</h3></div></div></div>
<p>After site selection, the workflow is optionally handed over to the
      job clustering module, which clusters jobs that are scheduled to the
      same site. Clustering is usually done on short running jobs in order to
      reduce the remote execution overheads associated with a job. Clustering
      is described in detail in the <a class="link" href="job_clustering.php" title="10.2. Job Clustering">optimization</a> chapter.</p>
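<p>As a rough illustration of the idea of grouping jobs that were mapped
      to the same site, consider the sketch below. It is deliberately
      simplified: the fixed <code>cluster_size</code> parameter and the
      function name are assumptions for this example, and the sketch ignores
      workflow levels and the clustering options described in the
      optimization chapter.</p>

```python
from collections import defaultdict


def cluster_by_site(job_sites, cluster_size):
    """Group jobs by site, then cut each group into fixed-size clusters.

    job_sites    : list of (job id, site handle) pairs
    cluster_size : hypothetical maximum number of jobs per cluster
    """
    by_site = defaultdict(list)
    for job, site in job_sites:
        by_site[site].append(job)

    clusters = []
    for site, jobs in by_site.items():
        for start in range(0, len(jobs), cluster_size):
            clusters.append((site, jobs[start:start + cluster_size]))
    return clusters


job_sites = [("j1", "siteA"), ("j2", "siteA"), ("j3", "siteB"), ("j4", "siteA")]
clusters = cluster_by_site(job_sites, 2)
print(clusters)
```

<p>Each resulting cluster would be submitted as a single job, so the
      remote execution overhead is paid once per cluster rather than once per
      short running job.</p>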
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Job clustering is turned on by passing the <span class="bold"><strong>--cluster</strong></span> option to pegasus-plan.</p>
</div>
</div>
<div class="section" title="5.2.4. Addition of Data Transfer and Registration Nodes">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp40807504"></a>5.2.4. Addition of Data Transfer and
      Registration Nodes</h3></div></div></div>
<p>After job clustering, the workflow is handed to the Data Transfer
      module, which adds data stage-in, inter-site and stage-out nodes to the
      workflow. Data stage-in nodes transfer input data required by the
      workflow from the locations specified in the Replica Catalog to a
      directory on the staging site associated with the job. The staging site
      for a job is the execution site if running in sharedfs mode; otherwise it
      is the one specified by the <span class="bold"><strong>--staging-site</strong></span>
      option to the planner. If multiple locations are specified for the
      same input file, the location from which to stage the data is selected
      using a <span class="bold"><strong>Replica Selector</strong></span>. Replica
      Selection is described in detail in the <a class="link" href="data_management.php#replica_selection" title="9.1. Replica Selection">Replica Selection</a> section of the
      <a class="link" href="data_management.php" title="Chapter 9. Data Management">Data Management</a> chapter. More
      details about the staging site can be found in the <a class="link" href="data_staging_configuration.php" title="5.3. Data Staging Configuration">data staging configuration</a>
      chapter.</p>
<p>The process of adding the data stage-in and data stage-out nodes
      is handled by Transfer Refiners. All data transfer jobs in Pegasus are
      executed using <span class="bold"><strong>pegasus-transfer</strong></span>. The
      pegasus-transfer client is a Python based wrapper around various
      transfer clients such as globus-url-copy, s3cmd, irods-transfer, scp, wget,
      cp and ln. It looks at the source and destination URLs and figures out
      automatically which underlying client to use. pegasus-transfer is
      distributed with Pegasus and can be found in the bin subdirectory.
      Pegasus Transfer Refiners are described in detail in the
      Transfers section of the <a class="link" href="data_management.php" title="Chapter 9. Data Management">Data
      Management</a> chapter. The default transfer refiner used in
      Pegasus is the <span class="bold"><strong>BalancedCluster</strong></span> Transfer
      Refiner, which clusters data stage-in nodes and data stage-out nodes per
      level of the workflow on the basis of certain pegasus profile keys
      associated with the workflow.</p>
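<p>The idea of choosing an underlying client from the source and
      destination URLs can be sketched as below. The scheme-to-client table
      and the function name are assumptions made for this example; they do
      not reproduce the actual dispatch rules of pegasus-transfer.</p>

```python
from urllib.parse import urlsplit

# Hypothetical mapping from URL scheme to transfer client.
CLIENT_BY_SCHEME = {
    "gsiftp": "globus-url-copy",
    "s3": "s3cmd",
    "scp": "scp",
    "http": "wget",
    "https": "wget",
}


def pick_client(source_url, dest_url):
    """Pick a client from the first URL that is not a local file."""
    for url in (source_url, dest_url):
        scheme = urlsplit(url).scheme
        if scheme in ("", "file"):
            continue  # local path, look at the other end of the transfer
        if scheme not in CLIENT_BY_SCHEME:
            raise ValueError("no client known for scheme " + scheme)
        return CLIENT_BY_SCHEME[scheme]
    # Both ends are local files.
    return "cp"


remote = pick_client("gsiftp://a.example.org/data/f.a", "file:///scratch/f.a")
local = pick_client("file:///tmp/f.a", "file:///scratch/f.a")
print(remote, local)
```

<p>A table keyed on the URL scheme keeps the choice of client in one
      place, which is the main point of wrapping several transfer tools
      behind a single command.</p>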
<div class="figure">
<a name="idp42197984"></a><p class="title"><b>Figure 5.4. Addition of Data Transfer Nodes to the Workflow</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="./images/refinement-transfer-jobs.png" align="middle" alt="Addition of Data Transfer Nodes to the Workflow"></div></div>
</div>
<br class="figure-break"><p>Data Registration Nodes may also be added to the final executable
      workflow to register the locations of the output files on the final
      output site back in the Replica Catalog. An output file is registered
      in the Replica Catalog if the register flag for the file is set to true
      in the DAX.</p>
<div class="figure">
<a name="idp42195968"></a><p class="title"><b>Figure 5.5. Addition of Data Registration Nodes to the Workflow</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="./images/refinement-registration-jobs.png" align="middle" alt="Addition of Data Registration Nodes to the Workflow"></div></div>
</div>
<br class="figure-break"><p>Data is staged in to and staged out from a directory that is created
      on the head node by a create dir job in the workflow. In the vanilla
      case, the directory is visible to all the worker nodes, and compute jobs
      are launched in this directory on the shared filesystem. In the case
      where there is no shared filesystem, users can turn on worker node
      execution, where the data is staged from the head node directory to a
      directory on the worker node filesystem. This feature will be refined
      further for Pegasus 3.1. To use it with Pegasus 3.0, send email to
      <span class="bold"><strong>pegasus-support at isi.edu</strong></span>.</p>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>The replica selector to use for replica selection can be
        specified by setting the property <span class="bold"><strong>pegasus.selector.replica</strong></span>.</p>
</div>
</div>
<div class="section" title="5.2.5. Addition of Create Dir and Cleanup Jobs">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp40321216"></a>5.2.5. Addition of Create Dir and Cleanup Jobs</h3></div></div></div>
<p>After the data transfer nodes have been added to the workflow,
      Pegasus adds create dir jobs to the workflow. Pegasus usually
      creates one workflow specific directory per compute site, on
      the staging site associated with the job. In the case of a shared
      filesystem setup, it is a directory on the shared filesystem of the
      compute site; this directory is visible to all the worker nodes, and
      it is where the data is staged in by the data stage-in jobs.</p>
<p>The staging site for a job is the execution site if running in
      sharedfs mode; otherwise it is the one specified by the <span class="bold"><strong>--staging-site</strong></span> option to the planner. More
      details about the staging site can be found in the <a class="link" href="data_staging_configuration.php" title="5.3. Data Staging Configuration">data staging configuration</a>
      chapter.</p>
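<p>The staging site rule above amounts to a single conditional. The
      sketch below restates it in Python; the function and parameter names
      are hypothetical, and it returns None when no staging site was given
      rather than guessing what the planner does in that case.</p>

```python
def staging_site_for(execution_site, data_configuration, staging_site_option=None):
    """Return the staging site for a job mapped to execution_site.

    data_configuration  : hypothetical string naming the data mode
    staging_site_option : value of the --staging-site option, if any
    """
    if data_configuration == "sharedfs":
        # Shared filesystem: data is staged to the execution site itself.
        return execution_site
    # Otherwise the site named on the command line is used.
    return staging_site_option


shared = staging_site_for("siteA", "sharedfs", "siteB")
separate = staging_site_for("siteA", "nonsharedfs", "siteB")
print(shared, separate)
```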
<p>After addition of the create dir jobs, the workflow is optionally
      handed to the cleanup module. The cleanup module adds cleanup nodes to
      the workflow that remove data from the directory on the shared
      filesystem when it is no longer required by the workflow. This is useful
      in reducing the peak storage requirements of the workflow.</p>
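<p>One way to picture where a cleanup node can be placed: a file may be
      removed only after every job that reads or writes it has finished, so a
      cleanup node for that file depends on all of those jobs. The sketch
      below computes these dependencies. It is an illustration under that
      assumption, with hypothetical names, and it ignores stage-out jobs,
      which in a real workflow also have to finish before a file is
      removed.</p>

```python
def cleanup_dependencies(inputs, outputs):
    """Map each file to the set of jobs a cleanup node must wait for.

    inputs, outputs : dicts mapping a job id to a list of file names
    """
    users = {}
    for job in inputs:
        for lfn in inputs[job] + outputs[job]:
            users.setdefault(lfn, set()).add(job)
    return users


# A produces f.a, which is read by both B and C.
inputs = {"A": [], "B": ["f.a"], "C": ["f.a"]}
outputs = {"A": ["f.a"], "B": ["f.b"], "C": ["f.c"]}
deps = cleanup_dependencies(inputs, outputs)
print(deps)
```

<p>In the example, f.a can be removed once A, B and C have all
      finished, which frees its space before the rest of the workflow
      completes.</p>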
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>The addition of the cleanup nodes to the workflow can be
        disabled by passing the <span class="bold"><strong>--nocleanup</strong></span>
        option to pegasus-plan.</p>
</div>
<div class="figure">
<a name="idp40322544"></a><p class="title"><b>Figure 5.6. Addition of Directory Creation and File Removal Jobs</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="./images/refinement-creadir-rm-jobs.png" align="middle" alt="Addition of Directory Creation and File Removal Jobs"></div></div>
</div>
<br class="figure-break"><div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Users can specify the maximum number of cleanup jobs added per
        level by specifying the property <span class="bold"><strong>pegasus.file.cleanup.clusters.num</strong></span> in the
        properties.</p>
</div>
</div>
<div class="section" title="5.2.6. Code Generation">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp41385824"></a>5.2.6. Code Generation</h3></div></div></div>
<p>The last step of the refinement process is code generation, where
      Pegasus writes out the executable workflow in a form understandable by
      the underlying workflow executor. At present, Pegasus supports the
      following code generators:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">
<p><span class="bold"><strong>Condor</strong></span></p>
<p>This is the default code generator for Pegasus. This
          generator generates the executable workflow as a Condor DAG file and
          associated job submit files. The Condor DAG file is passed as input
          to Condor DAGMan for job execution.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Shell</strong></span></p>
<p>This code generator generates the executable workflow as a
          shell script that can be executed on the submit host. When using
          this code generator, all the jobs should be mapped to the site local,
          i.e. specify <span class="bold"><strong>--sites local</strong></span> to
          pegasus-plan.</p>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>To use the Shell code generator, set the property <span class="bold"><strong>pegasus.code.generator</strong></span> to Shell.</p>
</div>
</li>
<li class="listitem">
<p><span class="bold"><strong>PMC</strong></span></p>
<p>This code generator generates the executable workflow as a PMC
          task workflow. This is useful on platforms where it is not
          feasible to run Condor, such as the new XSEDE machines, for example
          Blue Waters. In this mode, Pegasus will generate the executable workflow
          as a PMC task workflow and a sample PBS submit script that submits
          this workflow. Note that the generated PBS file needs to be manually
          updated before it can be submitted.</p>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>To use the PMC code generator, set the property <span class="bold"><strong>pegasus.code.generator</strong></span> to PMC.</p>
</div>
</li>
</ol></div>
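<p>To give a feel for the output of the Condor code generator, the
      sketch below writes the skeleton of a DAGMan input file using the
      standard JOB and PARENT ... CHILD statements. It is not the Pegasus
      generator: the function name and the .sub submit file names are
      hypothetical, and real generated files contain additional
      statements.</p>

```python
def write_dag(jobs, edges):
    """Return the text of a minimal DAGMan input file.

    jobs  : list of job names
    edges : list of (parent, child) pairs
    """
    # One JOB line per node, pointing at its (hypothetical) submit file.
    lines = ["JOB {0} {0}.sub".format(job) for job in jobs]
    # One dependency line per edge of the executable workflow.
    for parent, child in edges:
        lines.append("PARENT {0} CHILD {1}".format(parent, child))
    return "\n".join(lines) + "\n"


dag_text = write_dag(
    ["create_dir", "stage_in", "compute"],
    [("create_dir", "stage_in"), ("stage_in", "compute")],
)
print(dag_text)
```

<p>The example mirrors the refinement steps of this section: the create
      dir job runs first, then the stage-in job, then the compute job.</p>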
<div class="figure">
<a name="idp40642400"></a><p class="title"><b>Figure 5.7. Final Executable Workflow</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="./images/refinement-final-executable-wf.png" align="middle" alt="Final Executable Workflow"></div></div>
</div>
<br class="figure-break">
</div>
</div>
<div class="navfooter">
<hr>
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left">
<a accesskey="p" href="running_workflows.php">Prev</a> </td>
<td width="20%" align="center"><a accesskey="u" href="running_workflows.php">Up</a></td>
<td width="40%" align="right"> <a accesskey="n" href="data_staging_configuration.php">Next</a>
</td>
</tr>
<tr>
<td width="40%" align="left" valign="top">Chapter 5. Running Workflows </td>
<td width="20%" align="center"><a accesskey="h" href="index.php">Table of Contents</a></td>
<td width="40%" align="right" valign="top"> 5.3. Data Staging Configuration</td>
</tr>
</table>
</div>
</div><?php  
            do_html_footer();
        ?>
