<?php  
            include_once( $_SERVER['DOCUMENT_ROOT']."/static/includes/common.inc.php" );
            do_html_header("Documentation");
        ?><div id="content">
<div class="navheader">
<table width="100%" summary="Navigation header"><tr>
<td width="20%" align="left">
<a accesskey="p" href="data_management.php">Prev</a> </td>
<td width="60%" align="center"><a accesskey="h" href="index.php">Table of Contents</a></td>
<td width="20%" align="right"> <a accesskey="n" href="cred_staging.php">Next</a>
</td>
</tr></table>
<hr>
</div>
<div class="section">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="transfer"></a>9.2. Data Transfers</h2></div></div></div>
<div class="toc"><dl class="toc">
<dt><span class="section"><a href="transfer.php#ref_data_staging_configuration">9.2.1. Data Staging Configuration</a></span></dt>
<dt><span class="section"><a href="transfer.php#local_vs_remote_transfers">9.2.2. Local versus Remote Transfers</a></span></dt>
<dt><span class="section"><a href="transfer.php#controlling_transfer_parallelism">9.2.3. Controlling Transfer Parallelism</a></span></dt>
<dt><span class="section"><a href="transfer.php#idp62625344">9.2.4. Symlinking Against Input Data</a></span></dt>
<dt><span class="section"><a href="transfer.php#data_movement_nodes">9.2.5. Addition of Separate Data Movement Nodes to Executable
      Workflow</a></span></dt>
<dt><span class="section"><a href="transfer.php#idp62674752">9.2.6. Executable Used for Transfer Jobs</a></span></dt>
<dt><span class="section"><a href="transfer.php#idp62694944">9.2.7. Staging of Executables</a></span></dt>
<dt><span class="section"><a href="transfer.php#idp63339216">9.2.8. Staging of Pegasus Worker Package</a></span></dt>
<dt><span class="section"><a href="transfer.php#staging_job_checkpoint_files">9.2.9. Staging of Job Checkpoint Files</a></span></dt>
<dt><span class="section"><a href="transfer.php#idp63362688">9.2.10. Using Amazon S3 as a Staging Site</a></span></dt>
<dt><span class="section"><a href="transfer.php#idp63372992">9.2.11. iRODS data access</a></span></dt>
<dt><span class="section"><a href="transfer.php#idp63378256">9.2.12. GridFTP over SSH (sshftp)</a></span></dt>
</dl></div>
<p>As part of the Workflow Mapping Process, Pegasus handles data
    management for the executable workflow. It queries a Replica Catalog to
    discover the locations of the input datasets and adds data movement and
    registration nodes to the workflow to:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>stage in input data to the staging site (a site associated
        with the compute job and used for staging; in the shared filesystem
        setup, the staging site is the same as the execution site where the jobs
        in the workflow are executed).</p></li>
<li class="listitem"><p>stage out output data generated by the workflow to the final
        storage site.</p></li>
<li class="listitem"><p>stage in intermediate data between compute sites, if
        required.</p></li>
<li class="listitem"><p>register the locations of the output data on the final
        storage site in the Replica Catalog.</p></li>
</ol></div>
<p>The separate data movement jobs that are added to the executable
    workflow are responsible for staging data to a workflow-specific directory
    accessible to the staging server on a staging site associated with the
    compute sites. Depending on the data staging configuration, the staging
    site for a compute site can be the compute site itself. In the default case,
    the staging server is usually on the head node of the compute site and has
    access to the filesystem shared between the worker nodes and the head
    node. Pegasus adds a directory creation job to the executable workflow
    that creates the workflow-specific directory on the staging server.</p>
<p>In addition to data, Pegasus also transfers user executables to the
    compute sites if the executables are not already installed on the remote
    sites. This chapter gives an overview of how transfers of data and
    executables are managed in Pegasus.</p>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="ref_data_staging_configuration"></a>9.2.1. Data Staging Configuration</h3></div></div></div>
<p>Broadly, Pegasus can be set up to run workflows in the following
      configurations:</p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
<li class="listitem">
<p><span class="bold"><strong>Shared File System</strong></span></p>
<p>This setup applies where the head node and the worker nodes
          of a cluster share a filesystem. Compute jobs in the workflow run in
          a directory on the shared filesystem.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Non Shared Filesystem</strong></span></p>
<p>This setup applies where the head node and the worker nodes
          of a cluster don't share a filesystem. Compute jobs in the workflow
          run in a local directory on the worker node.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Condor Pool Without a Shared
          Filesystem</strong></span></p>
<p>This setup applies to a Condor pool whose worker nodes
          don't share a filesystem. All data IO is
          achieved using Condor File IO. This is a special case of the non
          shared filesystem setup, where Condor File IO is used instead of
          pegasus-transfer to transfer input and output data.</p>
</li>
</ul></div>
<p>For the purposes of data configuration, the various sites and
      directories are defined below.</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">
<p><span class="bold"><strong>Submit Host</strong></span></p>
<p>The host from which the workflows are submitted. This is
          where Pegasus and Condor DAGMan are installed. It is referred to
          as the <span class="bold"><strong>"local"</strong></span> site in the site
          catalog.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Compute Site</strong></span></p>
<p>The site where the jobs mentioned in the DAX are executed.
          There needs to be an entry in the Site Catalog for every compute
          site. The compute site is passed to pegasus-plan using the <span class="bold"><strong>--sites</strong></span> option.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Staging Site</strong></span></p>
<p>The site to which the separate transfer jobs in the executable
          workflow (jobs with stage_in, stage_out and stage_inter prefixes
          that Pegasus adds using the transfer refiners) stage the input data,
          and from which they transfer the output data to the final output site.
          In the shared filesystem setup, the staging site is always the compute
          site where the jobs execute.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Output Site</strong></span></p>
<p>The output site is the final storage site where users want
          the output data from jobs to be stored. The output site is passed to
          pegasus-plan using the <span class="bold"><strong>--output</strong></span>
          option. The stageout jobs in the workflow stage the data from the
          staging site to the final storage site.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Input Site</strong></span></p>
<p>The site where the input data is stored. The locations of the
          input data are catalogued in the Replica Catalog, and the
          <span class="emphasis"><em>"site"</em></span> attribute of the locations gives us the
          site handle for the input site.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Workflow Execution
          Directory</strong></span></p>
<p>This is the directory created by the create dir jobs in the
          executable workflow on the staging site. There is one such directory
          per workflow per staging site. In the shared filesystem setup, the
          staging site is the compute site.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Worker Node Directory</strong></span></p>
<p>This is the per-job directory created on the worker node,
          usually by the job wrapper that launches the job.</p>
</li>
</ol></div>
<div class="section">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp63596992"></a>9.2.1.1. Shared File System</h4></div></div></div>
<p>By default, Pegasus is set up to run workflows in the shared file
        system setup, where the worker nodes and the head node of a cluster
        share a filesystem.</p>
<div class="figure">
<a name="idp63598320"></a><p class="title"><b>Figure 9.1. Shared File System Setup</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><img src="images/data-configuration-sharedfs.png" align="middle" height="450" alt="Shared File System Setup"></div></div>
</div>
<br class="figure-break"><p>The data flow in this case is as follows:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>The stagein job executes (either on the submit host or on the head node)
            to stage in input data from the input sites (1 to n) to a
            workflow-specific execution directory on the shared filesystem.</p></li>
<li class="listitem"><p>The compute job starts on a worker node in the workflow
            execution directory and accesses the input data using POSIX IO.</p></li>
<li class="listitem"><p>The compute job executes on the worker node and writes its
            output data to the workflow execution directory using POSIX IO.</p></li>
<li class="listitem"><p>The stageout job executes (either on the submit host or on the head node)
            to stage out output data from the workflow-specific execution
            directory to a directory on the final output site.</p></li>
</ol></div>
<div class="tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Set <span class="bold"><strong>
          pegasus.data.configuration</strong></span> to <span class="bold"><strong>
          sharedfs</strong></span> to run in this configuration.</p>
</div>
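<p>As a minimal sketch, the corresponding entry in the properties file
        could look like the following. Only the property name and value come
        from the tip above; the comments are explanatory.</p>
<pre class="programlisting"># Run compute jobs in a directory on the filesystem that is
# shared by the head node and the worker nodes.
pegasus.data.configuration = sharedfs
</pre>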
</div>
<div class="section">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp63611744"></a>9.2.1.2. Non Shared Filesystem</h4></div></div></div>
<p>In this setup, Pegasus runs workflows on the local filesystems of
        worker nodes, with the worker nodes not sharing a filesystem. The
        data transfers happen between the worker node and a staging / data
        coordination site. The staging site server can be a file server on the
        head node of a cluster or can be on a separate machine.</p>
<p><span class="bold"><strong>Setup</strong></span> </p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
<li class="listitem"><p>The compute site and the staging site are different.</p></li>
<li class="listitem"><p>The head node and worker nodes of the compute site don't share a
              filesystem.</p></li>
<li class="listitem"><p>Input data is staged from remote sites.</p></li>
<li class="listitem"><p>The output site is remote, i.e. a site other than the compute site. It can
              be the submit host.</p></li>
</ul></div>
<div class="figure">
<a name="idp63618064"></a><p class="title"><b>Figure 9.2. Non Shared Filesystem Setup</b></p>
<div class="figure-contents"><div class="mediaobject" align="center">
<a name="Figure2"></a><img src="images/data-configuration-nonsharedfs.png" align="middle" height="450" alt="Non Shared Filesystem Setup">
</div></div>
</div>
<br class="figure-break"><p>The data flow in this case is as follows:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>The stagein job executes (either on the submit host or on the staging
            site) to stage in input data from the input sites (1 to n) to a
            workflow-specific execution directory on the staging site.</p></li>
<li class="listitem"><p>The compute job starts on a worker node in a local execution
            directory. It uses pegasus-transfer to
            transfer the input data from the staging site to a local directory on
            the worker node.</p></li>
<li class="listitem"><p>The compute job executes on the worker node.</p></li>
<li class="listitem"><p>The compute job writes its output data to the local
            directory on the worker node using POSIX IO.</p></li>
<li class="listitem"><p>The output data is pushed out to the staging site from the
            worker node using pegasus-transfer.</p></li>
<li class="listitem"><p>The stageout job executes (either on the submit host or on the staging
            site) to stage out output data from the workflow-specific
            execution directory to a directory on the final output
            site.</p></li>
</ol></div>
<p>In this case, the compute jobs are wrapped as <a class="link" href="pegasuslite.php" title="5.4. PegasusLite">PegasusLite</a> instances.</p>
<p>This mode is especially useful for running in cloud
        environments where you don't want to set up a shared filesystem between
        the worker nodes. Running in that mode is explained in detail <a class="link" href="cloud.php#amazon_aws" title="7.3.1. Amazon EC2">here.</a></p>
<div class="tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Set <span class="bold"><strong>pegasus.data.configuration</strong></span> to <span class="bold"><strong>nonsharedfs</strong></span> to run in this configuration. The
          staging site can be specified using the <span class="bold"><strong>--staging-site</strong></span> option to pegasus-plan.</p>
</div>
<p>In this setup, Pegasus always stages the input files through the
        staging site, i.e. the stage-in job stages in data from the input site
        to the staging site. The PegasusLite jobs that start up on the worker
        nodes then pull the input data from the staging site for each job. In
        some cases, it might be useful to set up the PegasusLite jobs to pull
        input data directly from the input site without going through the
        staging server. This is based on the assumption that the worker nodes
        can access the input site. Starting with the 4.3 release, users can enable
        this. However, you should be aware that access to the input site
        is then no longer throttled (as it is in the case of stage-in jobs). If a
        large number of compute jobs start at the same time in a workflow, the
        input server will see a connection from each job.</p>
<div class="tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Set <span class="bold"><strong>
          pegasus.transfer.bypass.input.staging</strong></span> to <span class="bold"><strong>true</strong></span> to enable the bypass of staging of input
          files via the staging server.</p>
</div>
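<p>Putting the two tips above together, a properties file for this
        setup might contain the lines below. This is only a sketch: the bypass
        line is optional and should be enabled only if the worker nodes can
        reach the input site directly.</p>
<pre class="programlisting"># Compute jobs run in a local directory on the worker node and
# pull their inputs from the staging site.
pegasus.data.configuration = nonsharedfs

# Optional (4.3 and later): let PegasusLite jobs fetch inputs directly
# from the input site instead of going through the staging site.
# Access to the input site is then no longer throttled.
pegasus.transfer.bypass.input.staging = true
</pre>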
</div>
<div class="section">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp63640368"></a>9.2.1.3. Condor Pool Without a Shared Filesystem</h4></div></div></div>
<p>This setup applies to a Condor pool whose worker nodes
        don't share a filesystem. All data IO is
        achieved using Condor File IO. This is a special case of the non
        shared filesystem setup, where Condor File IO is used instead of
        pegasus-transfer to transfer input and output data.</p>
<p><span class="bold"><strong>Setup</strong></span> </p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
<li class="listitem"><p>The submit host and the staging site are the same.</p></li>
<li class="listitem"><p>The head node and worker nodes of the compute site don't share a
              filesystem.</p></li>
<li class="listitem"><p>Input data is staged from remote sites.</p></li>
<li class="listitem"><p>The output site is remote, i.e. a site other than the compute site. It can
              be the submit host.</p></li>
</ul></div>
<div class="figure">
<a name="idp63646752"></a><p class="title"><b>Figure 9.3. Condor Pool Without a Shared Filesystem</b></p>
<div class="figure-contents"><div class="mediaobject" align="center">
<a name="Figure13"></a><img src="images/data-configuration-condorio.png" align="middle" height="450" alt="Condor Pool Without a Shared Filesystem">
</div></div>
</div>
<br class="figure-break"><p>The data flow in this case is as follows:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>The stagein job executes on the submit host to stage in input
            data from the input sites (1 to n) to a workflow-specific execution
            directory on the submit host.</p></li>
<li class="listitem"><p>The compute job starts on a worker node in a local execution
            directory. Before the compute job starts, Condor transfers the
            input data for the job from the workflow execution directory on
            the submit host to the local execution directory on the worker
            node.</p></li>
<li class="listitem"><p>The compute job executes on the worker node.</p></li>
<li class="listitem"><p>The compute job writes its output data to the local
            directory on the worker node using POSIX IO.</p></li>
<li class="listitem"><p>When the compute job finishes, Condor transfers the output
            data for the job from the local execution directory on the worker
            node to the workflow execution directory on the submit
            host.</p></li>
<li class="listitem"><p>The stageout job executes (either on the submit host or on the staging
            site) to stage out output data from the workflow-specific
            execution directory to a directory on the final output
            site.</p></li>
</ol></div>
<p>In this case, the compute jobs are wrapped as <a class="link" href="pegasuslite.php" title="5.4. PegasusLite">PegasusLite</a> instances.</p>
<p>This mode is especially useful for running in cloud
        environments where you don't want to set up a shared filesystem between
        the worker nodes. Running in that mode is explained in detail <a class="link" href="cloud.php#amazon_aws" title="7.3.1. Amazon EC2">here.</a></p>
<div class="tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Set <span class="bold"><strong>pegasus.data.configuration</strong></span> to <span class="bold"><strong>condorio</strong></span> to run in this configuration. In
          this mode, the staging site is automatically set to site <span class="bold"><strong>local</strong></span>.</p>
</div>
<p>In this setup, Pegasus always stages the input files through the
        submit host, i.e. the stage-in job stages in data from the input site to
        the submit host (local site). The input data is then transferred to
        remote worker nodes from the submit host using Condor file transfers.
        If the input data is locally accessible at the submit
        host, i.e. the input site and the submit host are the same, it is
        possible to bypass the creation of separate stage-in jobs that copy
        the data to the workflow-specific directory on the submit host.
        Instead, Condor file transfers can be set up to transfer the input
        files directly from the locally accessible input locations (file
        URLs with the "<span class="emphasis"><em>site</em></span>" attribute set to local)
        specified in the replica catalog. Starting with the 4.3 release, users can
        enable this.</p>
<div class="tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Set <span class="bold"><strong>
          pegasus.transfer.bypass.input.staging</strong></span> to <span class="bold"><strong>true</strong></span> to bypass the creation of separate stage-in
          jobs.</p>
</div>
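<p>A properties file for this setup might therefore look like the
        sketch below. The bypass line is optional and only has an effect for
        input files whose replica catalog entries are file URLs with the site
        attribute set to local.</p>
<pre class="programlisting"># Use Condor File IO to move data between the submit host and the
# worker nodes; the staging site is automatically set to "local".
pegasus.data.configuration = condorio

# Optional (4.3 and later): skip the separate stage-in jobs for inputs
# that are already locally accessible on the submit host.
pegasus.transfer.bypass.input.staging = true
</pre>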
</div>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="local_vs_remote_transfers"></a>9.2.2. Local versus Remote Transfers</h3></div></div></div>
<p>As far as possible, Pegasus will ensure that the transfer jobs
      added to the executable workflow are executed on the submit host. By
      default, Pegasus will schedule a transfer to be executed on the remote
      staging site only if there is no way to execute it on the submit host.
      For example, if the file server specified for the staging site/compute site
      is addressed with a file URL, and so cannot be reached from the submit
      host, then Pegasus will schedule all the stage-in data
      movement jobs on the compute site to stage in the input data for the
      workflow. Another case would be if a user has symlinking turned on. In
      that case, the transfer jobs that symlink against the input data on the
      compute site will be executed remotely (on the compute site).</p>
<p>Users can set the property <span class="bold"><strong>
      pegasus.transfer.*.remote.sites</strong></span> to change the default
      behaviour of Pegasus and force the different types of
      transfer jobs for the specified sites to run on the remote site. The value of
      the property is a comma separated list of compute sites for which you
      want the transfer jobs to run remotely.</p>
<p>The table below illustrates all the possible variations of the
      property.</p>
<div class="table">
<a name="idp63674080"></a><p class="title"><b>Table 9.1. Property Variations for pegasus.transfer.*.remote.sites</b></p>
<div class="table-contents"><table summary="Property Variations for pegasus.transfer.*.remote.sites" border="1">
<colgroup>
<col>
<col>
</colgroup>
<thead><tr>
<th>Property Name</th>
<th>Applies to</th>
</tr></thead>
<tbody>
<tr>
<td>pegasus.transfer.stagein.remote.sites</td>
<td>the stage in transfer jobs</td>
</tr>
<tr>
<td>pegasus.transfer.stageout.remote.sites</td>
<td>the stage out transfer jobs</td>
</tr>
<tr>
<td>pegasus.transfer.inter.remote.sites</td>
<td>the inter site transfer jobs</td>
</tr>
<tr>
<td>pegasus.transfer.*.remote.sites</td>
<td>all types of transfer jobs</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break"><p>The prefix of the transfer job name indicates whether the
      transfer job is to be executed locally (on the submit host) or
      remotely (on the compute site). For example, stage_in_local_ in the
      transfer job name stage_in_local_isi_viz_0 indicates that the transfer
      job is a stage-in transfer job that is executed locally and is used to
      transfer input data to the compute site isi_viz. The prefix naming scheme
      for the transfer jobs is <span class="bold"><strong>
      [stage_in|stage_out|inter]_[local|remote]_</strong></span>.</p>
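<p>For illustration, the properties below force the stage-in jobs for two
      compute sites to run remotely and leave the other transfer types at
      their defaults. The site handle isi_viz is taken from the job name
      above; the handle cluster_b is hypothetical.</p>
<pre class="programlisting"># Comma separated list of compute sites whose stage-in jobs should
# run on the remote site rather than on the submit host.
pegasus.transfer.stagein.remote.sites = isi_viz,cluster_b

# To apply the same choice to every transfer type, the wildcard
# form could be used instead:
# pegasus.transfer.*.remote.sites = isi_viz,cluster_b
</pre>
<p>With such a setting, the stage-in jobs for isi_viz would be expected to
      carry the stage_in_remote_ prefix instead of stage_in_local_.</p>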
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="controlling_transfer_parallelism"></a>9.2.3. Controlling Transfer Parallelism</h3></div></div></div>
<p>When it comes to data transfers, Pegasus ships with a default
      configuration that tries to strike a balance between performance
      and aggressiveness. We want data transfers to be as quick as
      possible, but we also do not want our transfers to overwhelm data
      services and systems. The default configuration consists of a
      combination of the maximum number of transfer jobs per level in the
      workflow and the number of threads each pegasus-transfer job can
      spawn.</p>
<p>Information on how to control the number of stagein and stageout
      jobs can be found in the <a class="link" href="transfer.php#data_movement_nodes" title="9.2.5. Addition of Separate Data Movement Nodes to Executable Workflow"> Data
      Movement Nodes</a> section.</p>
<p>How to control the number of threads pegasus-transfer can use
      depends on whether you want to control standard transfer jobs or
      PegasusLite. For the former, see the <a class="link" href="properties.php#transfer_props" title="12.3.9. Transfer Configuration Properties">
      pegasus.transfer.threads</a> property, and for the latter the <a class="link" href="properties.php#transfer_props" title="12.3.9. Transfer Configuration Properties"> pegasus.transfer.lite.threads</a>
      property.</p>
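<p>A sketch of how the two thread properties might be set side by side.
      The values are arbitrary placeholders, not recommended settings and not
      the shipped defaults.</p>
<pre class="programlisting"># Threads pegasus-transfer may use in the standard transfer jobs
# (value chosen only for illustration).
pegasus.transfer.threads = 4

# Threads pegasus-transfer may use for PegasusLite
# (value chosen only for illustration).
pegasus.transfer.lite.threads = 2
</pre>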
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp62625344"></a>9.2.4. Symlinking Against Input Data</h3></div></div></div>
<p>If input data for a job already exists on a compute site, then it
      is possible for Pegasus to symlink against that data. In this case, the
      remote stage in transfer jobs that Pegasus adds to the executable
      workflow will symlink instead of doing a copy of the data.</p>
<p>Pegasus determines whether a file is on the same site as the
      compute site by inspecting the <span class="emphasis"><em>"site"</em></span> attribute
      associated with the URL in the Replica Catalog. If the
      <span class="emphasis"><em>"site"</em></span> attribute of an input file location matches
      the compute site where the job is scheduled, then that particular input
      file is a candidate for symlinking.</p>
<p>For Pegasus to symlink against existing input data on a compute
      site, the following must be true:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>Property <span class="bold"><strong>
          pegasus.transfer.links</strong></span> is set to <span class="bold"><strong>
          true</strong></span></p></li>
<li class="listitem"><p>The input file location in the Replica Catalog has the
          <span class="emphasis"><em>"site"</em></span> attribute matching the compute
          site.</p></li>
</ol></div>
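<p>The first condition can be met with a single line in the properties
      file, sketched below. The second condition is satisfied in the Replica
      Catalog, not in the properties file.</p>
<pre class="programlisting"># Allow stage-in jobs to symlink against input data that already
# resides on the compute site instead of copying it.
pegasus.transfer.links = true
</pre>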
<div class="tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>To confirm whether a particular input file is symlinked instead of
        being copied, look for the destination URL for that file in
        the stage_in_remote*.in file. The destination URL will start with
        symlink:// .</p>
</div>
<p>In the symlinking case, Pegasus strips the URL prefix (the protocol
      and host) from the source URL and replaces it with a file URL.</p>
<p>For example, if a user has the following URL catalogued in the
      Replica Catalog for an input file f.input</p>
<pre class="programlisting">f.input   gsiftp://server.isi.edu/shared/storage/input/data/f.input site="isi"</pre>
<p>and the compute job that requires this file executes on a compute
      site named isi, then, if symlinking is turned on, the data stage-in job
      (stage_in_remote_isi_0) will have the following source and destination
      specified for the file</p>
<pre class="programlisting">#isi isi
file:///shared/storage/input/data/f.input  symlink://shared-scratch/workflow-exec-dir/f.input
</pre>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="data_movement_nodes"></a>9.2.5. Addition of Separate Data Movement Nodes to Executable
      Workflow</h3></div></div></div>
<p>Pegasus relies on a Transfer Refiner to decide
      how many data movement nodes are added to the executable
      workflow. All the compute jobs scheduled to a site share the same
      workflow-specific directory. The transfer refiners ensure that only one
      copy of the input data is transferred to the workflow execution
      directory. This is to prevent data clobbering. Data clobbering can
      occur when compute jobs of a workflow share some input files and have
      different stage-in transfer jobs associated with them that are staging
      the shared files to the same destination workflow execution
      directory.</p>
<p>Pegasus supports three different transfer refiners that dictate
      how the stagein and stageout jobs are added for the workflow. The default
      Transfer Refiner used in Pegasus is the BalancedCluster Refiner, which
      allows the user to specify how many local or remote stagein and stageout
      jobs are created per execution site.</p>
<p>The behavior of the refiners (BalancedCluster and Cluster) is
      controlled by specifying certain pegasus profiles</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>either with the execution sites in the site catalog</p></li>
<li class="listitem"><p>OR globally in the properties file</p></li>
</ol></div>
<div class="table">
<a name="idp62644128"></a><p class="title"><b>Table 9.2. Pegasus Profile Keys For the Cluster Transfer Refiner</b></p>
<div class="table-contents"><table summary="Pegasus Profile Keys For the Cluster Transfer Refiner" border="1">
<colgroup>
<col>
<col>
</colgroup>
<thead><tr>
<th>Profile Key</th>
<th>Description</th>
</tr></thead>
<tbody>
<tr>
<td>stagein.clusters</td>
<td>This key determines the maximum number of stage-in jobs
              that can be executed locally or remotely per compute site per
              workflow.</td>
</tr>
<tr>
<td>stagein.local.clusters</td>
<td>This key provides finer grained control in determining
              the number of stage-in jobs that are executed locally and are
              responsible for staging data to a particular remote
              site.</td>
</tr>
<tr>
<td>stagein.remote.clusters</td>
<td>This key provides finer grained control in determining
              the number of stage-in jobs that are executed remotely on the
              remote site and are responsible for staging data to it.</td>
</tr>
<tr>
<td>stageout.clusters</td>
<td>This key determines the maximum number of stage-out jobs
              that can be executed locally or remotely per compute site per
              workflow.</td>
</tr>
<tr>
<td>stageout.local.clusters</td>
<td>This key provides finer grained control in determining
              the number of stage-out jobs that are executed locally and are
              responsible for staging data from a particular remote
              site.</td>
</tr>
<tr>
<td>stageout.remote.clusters</td>
<td>This key provides finer grained control in determining
              the number of stage-out jobs that are executed remotely on the
              remote site and are responsible for staging data from
              it.</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break"><div class="tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>The transfer refiner to use is selected with the property
        pegasus.transfer.refiner.</p>
</div>
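<p>As a sketch, the refiner and its profile keys could be set globally in
      the properties file as shown below. This assumes that pegasus profiles
      are written in the properties file by prefixing the profile key with the
      pegasus namespace; the numeric values are placeholders.</p>
<pre class="programlisting"># Select the transfer refiner (BalancedCluster is the default).
pegasus.transfer.refiner = BalancedCluster

# Profile key stagein.clusters: maximum number of stage-in jobs
# per compute site per workflow (namespace prefix is an assumption).
pegasus.stagein.clusters = 10

# Profile key stageout.local.clusters: finer grained control over
# the stage-out jobs that are executed locally.
pegasus.stageout.local.clusters = 4
</pre>
<p>Setting the same keys as profiles on an individual site in the site
      catalog restricts them to that execution site.</p>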
<div class="section">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp62658112"></a>9.2.5.1. BalancedCluster</h4></div></div></div>
<p>This is a new transfer refiner that was introduced in Pegasus
        4.4.0 and is the default one used in Pegasus. It does a round robin
        distribution of the files amongst the stagein and stageout jobs per
        level of the workflow. The figure below illustrates the behavior of
        this transfer refiner.</p>
<div class="figure">
<a name="idp62659600"></a><p class="title"><b>Figure 9.4. BalancedCluster Transfer Refiner : Input Data To Workflow
          Specific Directory on Shared File System</b></p>
<div class="figure-contents"><div class="mediaobject" align="center">
<a name="img-balanced-cluster-transfer-refiner"></a><img src="images/balanced-cluster-transfer-refiner.png" align="middle" height="650" alt="BalancedCluster Transfer Refiner : Input Data To Workflow Specific Directory on Shared File System">
</div></div>
</div>
<br class="figure-break">
</div>
<div class="section">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp62665440"></a>9.2.5.2. Cluster</h4></div></div></div>
<p>This transfer refiner is similar to BalancedCluster but differs
        in how files are distributed across the stagein and
        stageout jobs per level of the workflow. In this refiner, all the
        input files for a job get associated with a single transfer job. As
        illustrated in the figure below, each compute job usually gets associated
        with one stagein transfer job. In contrast, with BalancedCluster a
        compute job may be associated with multiple data stagein jobs.</p>
<div class="figure">
<a name="idp62667120"></a><p class="title"><b>Figure 9.5. Cluster Transfer Refiner : Input Data To Workflow Specific
          Directory on Shared File System</b></p>
<div class="figure-contents"><div class="mediaobject" align="center">
<a name="img-cluster-transfer-refiner"></a><img src="images/cluster-transfer-refiner.png" align="middle" height="650" alt="Cluster Transfer Refiner : Input Data To Workflow Specific Directory on Shared File System">
</div></div>
</div>
<br class="figure-break">
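<p>The per-job grouping described above can be sketched as follows.
        This is an illustration under stated assumptions, not the refiner's
        actual code: the job names, file names, and the rule used to pick a
        transfer job for each compute job are hypothetical.</p>
<pre class="programlisting">
# Sketch only: keep all inputs of a compute job in one stage-in job.
# Job names, file names, and the assignment rule are hypothetical.

def cluster_by_job(job_inputs, num_transfer_jobs):
    """Assign whole compute jobs, not single files, to transfer jobs."""
    buckets = [[] for _ in range(num_transfer_jobs)]
    for index, job in enumerate(sorted(job_inputs)):
        # Every file of this compute job goes to the same transfer job.
        buckets[index % num_transfer_jobs].extend(job_inputs[job])
    return buckets


job_inputs = {
    "jobA": ["f.a1", "f.a2"],
    "jobB": ["f.b1", "f.b2"],
    "jobC": ["f.c1"],
}
print(cluster_by_job(job_inputs, 2))
</pre>
<p>Here both inputs of jobA end up in the same stage-in job, so jobA
        depends on only one transfer job. With a file-by-file round robin
        distribution, the two files could land in different stage-in
        jobs.</p>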
</div>
<div class="section">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp62672944"></a>9.2.5.3. Basic</h4></div></div></div>
<p>Pegasus also supports a basic transfer refiner that adds one
        stage-in and one stage-out job per compute job of the workflow. It is
        not recommended for large workflows, because in the worst case the
        number of data transfer nodes is 2n, where n is the number of compute
        jobs in the workflow.</p>
</div>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp62674752"></a>9.2.6. Executable Used for Transfer Jobs</h3></div></div></div>
<p>Pegasus uses a Python script called <span class="bold"><strong>pegasus-transfer</strong></span>
      as the executable in transfer jobs. pegasus-transfer is a wrapper around
      various transfer clients: it looks at the source and destination URLs and
      automatically determines which underlying client to use. pegasus-transfer
      is distributed with Pegasus and can be found at
      $PEGASUS_HOME/bin/pegasus-transfer.</p>
<p>Currently, pegasus-transfer interfaces with the following transfer
      clients:</p>
<div class="table">
<a name="idp62677520"></a><p class="title"><b>Table 9.3. Transfer Clients interfaced to by pegasus-transfer</b></p>
<div class="table-contents"><table summary="Transfer Clients interfaced to by pegasus-transfer" border="1">
<colgroup>
<col>
<col>
</colgroup>
<thead><tr>
<th>Transfer Client</th>
<th>Used For</th>
</tr></thead>
<tbody>
<tr>
<td>globus-url-copy</td>
<td>staging files to and from a GridFTP server.</td>
</tr>
<tr>
<td>lcg-copy</td>
<td>staging files to and from an SRM server.</td>
</tr>
<tr>
<td>wget</td>
<td>staging files from an HTTP server.</td>
</tr>
<tr>
<td>cp</td>
<td>copying files from a POSIX filesystem.</td>
</tr>
<tr>
<td>ln</td>
<td>symlinking against input files.</td>
</tr>
<tr>
<td>pegasus-s3</td>
<td>staging files to and from an S3 bucket in the Amazon
              cloud.</td>
</tr>
<tr>
<td>gsutil</td>
<td>staging files to and from Google Storage buckets.</td>
</tr>
<tr>
<td>scp</td>
<td>staging files using scp.</td>
</tr>
<tr>
<td>iget</td>
<td>staging files to and from an iRODS server.</td>
</tr>
</tbody>
</table></div>
</div>
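<p>The idea of choosing a client from the URLs can be sketched as
        follows. This is a deliberately simplified illustration and not the
        logic of pegasus-transfer itself: the lookup table, the function name,
        and the URL scheme names other than file, s3, and irods are
        assumptions made for the example.</p>
<pre class="programlisting">
from urllib.parse import urlsplit

# Simplified, hypothetical lookup table based on the table above.
CLIENT_FOR_SCHEME = {
    "gsiftp": "globus-url-copy",
    "srm": "lcg-copy",
    "http": "wget",
    "file": "cp",
    "s3": "pegasus-s3",
    "gs": "gsutil",
    "scp": "scp",
    "irods": "iget",
}


def pick_client(source_url, destination_url):
    """Pick a client name from the first URL that is not a local file."""
    for url in (source_url, destination_url):
        scheme = urlsplit(url).scheme
        if scheme != "file":
            # An unknown scheme raises KeyError in this sketch.
            return CLIENT_FOR_SCHEME[scheme]
    # Both ends are local files, so a plain copy is enough.
    return "cp"


print(pick_client("s3://user@amazon/wf-scratch/f.a", "file:///tmp/wf/work/f.a"))
</pre>
<p>The call above prints pegasus-s3, because the source URL uses
        the s3 scheme. The real script considers the source and destination
        together and supports more combinations than this sketch.</p>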
<br class="table-break"><p>For remote sites, Pegasus constructs the default path to
      pegasus-transfer from the PEGASUS_HOME env profile specified in
      the site catalog. To specify a different path to the pegasus-transfer
      client, users can add an entry to the transformation catalog with the
      fully qualified logical name <span class="bold"><strong>pegasus::pegasus-transfer</strong></span>.</p>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp62694944"></a>9.2.7. Staging of Executables</h3></div></div></div>
<p>Users can have Pegasus stage the user executables (the executables
      that the jobs in the DAX refer to) as part of the transfer jobs to the
      workflow specific execution directory on the compute site. The URL
      locations of the executables need to be specified in the transformation
      catalog as the PFN, and the type of the executable needs to be set to
      <span class="bold"><strong>STAGEABLE</strong></span>.</p>
<p>The location of a transformation can be specified in either of the following:</p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
<li class="listitem"><p>The DAX, in the executables section. More details are <a class="link" href="api.php#dax_transformation_catalog" title="14.1.1.3.. The Transformation Catalog Section">here</a>.</p></li>
<li class="listitem"><p>The Transformation Catalog. More details are <a class="link" href="transformation.php" title="4.4. Executable Discovery (Transformation Catalog)">here</a>.</p></li>
</ul></div>
<p>A transformation catalog entry of type STAGEABLE is
      compatible with a compute site only if all the System Information
      attributes associated with the entry match the System Information
      attributes for the compute site in the Site Catalog. The System
      Information attributes are:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>arch</p></li>
<li class="listitem"><p>os</p></li>
<li class="listitem"><p>osrelease</p></li>
<li class="listitem"><p>osversion</p></li>
</ol></div>
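<p>The compatibility rule above can be sketched as a simple
      comparison. This is a minimal illustration, assuming that an attribute
      left unset on the transformation catalog entry does not constrain the
      match; the dictionaries, the function name, and the osrelease value are
      hypothetical.</p>
<pre class="programlisting">
SYSINFO_KEYS = ("arch", "os", "osrelease", "osversion")


def is_compatible(entry, site):
    """True if every attribute set on the entry equals the site's value."""
    for key in SYSINFO_KEYS:
        wanted = entry.get(key)
        # Assumption: attributes not set on the entry are not compared.
        if wanted is not None and wanted != site.get(key):
            return False
    return True


entry = {"arch": "x86_64", "os": "LINUX"}
condorpool = {"arch": "x86_64", "os": "LINUX", "osrelease": "rhel"}
print(is_compatible(entry, condorpool))
print(is_compatible({"arch": "x86", "os": "LINUX"}, condorpool))
</pre>
<p>The first call prints True and the second prints False, because
      the arch values differ.</p>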
<div class="section">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp63324384"></a>9.2.7.1. Transformation Mappers</h4></div></div></div>
<p>Pegasus has a notion of transformation mappers that determine
        what type of executables are picked up when a job is executed on a
        remote compute site. For the transfer of executables, Pegasus constructs
        a soft state map that sits on top of the transformation catalog and
        helps in determining the locations from which an executable can be
        staged to the remote site.</p>
<p>Users can set the following property to pick a specific
        transformation mapper:</p>
<pre class="programlisting"><span class="bold"><strong>pegasus.catalog.transformation.mapper</strong></span> </pre>
<p>Currently, the following transformation mappers are
        supported.</p>
<div class="table">
<a name="idp63328272"></a><p class="title"><b>Table 9.4. Transformation Mappers Supported in Pegasus</b></p>
<div class="table-contents"><table summary="Transformation Mappers Supported in Pegasus" border="1">
<colgroup>
<col>
<col>
</colgroup>
<thead><tr>
<th>Transformation Mapper</th>
<th>Description</th>
</tr></thead>
<tbody>
<tr>
<td>Installed</td>
<td>This mapper relies only on transformation catalog
                entries of type INSTALLED to construct the soft state
                map. As a result, Pegasus never transfers executables as
                part of the workflow; it always prefers the executables
                installed at the remote sites.</td>
</tr>
<tr>
<td>Staged</td>
<td>This mapper relies only on matching transformation
                catalog entries of type STAGEABLE to construct the
                soft state map. As a result, the executable workflow
                refers only to staged executables, even if the
                executables are already installed at the remote
                end.</td>
</tr>
<tr>
<td>All</td>
<td>This mapper relies on all matching transformation
                catalog entries of type STAGEABLE or INSTALLED for a
                particular transformation as valid sources for the transfer of
                executables. This is the most general mode; the map is
                constructed from the cartesian product of
                the matches.</td>
</tr>
<tr>
<td>Submit</td>
<td>This mapper relies only on matching transformation catalog
                entries that are of type STAGEABLE and reside at the submit
                host (site local) to construct the soft state
                map. This is especially helpful when users want to use
                the latest compute code on their submit host for their
                computations on the grid.</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break">
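<p>The effect of the four mappers can be sketched as a filter over
        transformation catalog entries. This is an illustration of the
        descriptions in the table, not the planner's implementation. It
        assumes that INSTALLED entries only count when they belong to the
        compute site; the entry layout, the site names, and the PFNs are
        hypothetical.</p>
<pre class="programlisting">
def select_entries(mapper, entries, compute_site):
    """Filter catalog entries the way each mapper is described above."""
    selected = []
    for entry in entries:
        # Assumption: an installed executable is usable only on its own site.
        installed_here = (entry["type"] == "INSTALLED"
                          and entry["site"] == compute_site)
        stageable = entry["type"] == "STAGEABLE"
        if mapper == "Installed" and installed_here:
            selected.append(entry)
        elif mapper == "Staged" and stageable:
            selected.append(entry)
        elif mapper == "Submit" and stageable and entry["site"] == "local":
            selected.append(entry)
        elif mapper == "All" and (installed_here or stageable):
            selected.append(entry)
    return selected


entries = [
    {"site": "condorpool", "type": "INSTALLED", "pfn": "/usr/bin/keg"},
    {"site": "local", "type": "STAGEABLE", "pfn": "file:///home/user/bin/keg"},
    {"site": "web", "type": "STAGEABLE", "pfn": "http://example.org/bin/keg"},
]
for mapper in ("Installed", "Staged", "All", "Submit"):
    pfns = [e["pfn"] for e in select_entries(mapper, entries, "condorpool")]
    print(mapper, pfns)
</pre>
<p>For the compute site condorpool, Installed keeps only the local
        path, Staged keeps both URLs, All keeps all three entries, and Submit
        keeps only the file URL on the submit host.</p>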
</div>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp63339216"></a>9.2.8. Staging of Pegasus Worker Package</h3></div></div></div>
<p>Pegasus can optionally stage the Pegasus worker package as part of
      the executable workflow to the remote workflow specific execution
      directory. The worker package contains the Pegasus auxiliary executables
      that are required on the remote site. If the worker package is not
      staged as part of the executable workflow, Pegasus relies on the
      installed version of the worker package on the remote site. To determine
      the location of the installed version on a remote site, Pegasus looks
      for an environment profile PEGASUS_HOME for the site in the Site
      Catalog.</p>
<p>Users can set the following property to true to turn on worker
      package staging:</p>
<pre class="programlisting"><span class="bold"><strong>pegasus.transfer.worker.package          true</strong></span> </pre>
<p>By default, when worker package staging is turned on, Pegasus pulls
      the compatible worker package from the Pegasus website. To specify a
      different worker package location, users can specify the transformation
      <span class="bold"><strong>pegasus::worker</strong></span> in the transformation
      catalog with:</p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
<li class="listitem"><p>the type set to STAGEABLE.</p></li>
<li class="listitem"><p>System Information attributes that match the System
          Information attributes of the compute
          site.</p></li>
<li class="listitem"><p>a PFN that is a remote URL that can be pulled to
          the compute site.</p></li>
</ul></div>
<div class="section">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp63347392"></a>9.2.8.1. Worker Package Staging in Non Shared Filesystem setup</h4></div></div></div>
<p>Worker package staging is automatically set to true when
        workflows are set up to run in a non shared filesystem setup, i.e.
        <span class="bold"><strong>pegasus.data.configuration</strong></span> is set to
        <span class="bold"><strong>nonsharedfs</strong></span> or <span class="bold"><strong>condorio</strong></span>. In these configurations, a
        stage_worker job is created that brings the worker package to the
        submit directory of the workflow. For each job, the worker package is
        then transferred with the job using Condor File Transfers (<span class="bold"><strong>transfer_input_files</strong></span>). This transfer always
        happens unless PEGASUS_HOME is specified in the site catalog for the
        site on which the job is scheduled to run.</p>
<p>Users can explicitly set the following property to false to
        turn off worker package staging by the planner. This is applicable
        when running in the cloud and the virtual machines or worker nodes
        already have the Pegasus worker tools installed.</p>
<pre class="programlisting"><span class="bold"><strong>pegasus.transfer.worker.package          false</strong></span> </pre>
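<p>The rules in this section can be summarized in a short sketch.
        It is an illustration only and not the planner's code: the function
        names, the dictionary representation of the properties and of the
        site's env profiles, and the /opt/pegasus path are assumptions.</p>
<pre class="programlisting">
# Sketch only: mirrors the rules described in this section.
NON_SHARED = ("nonsharedfs", "condorio")


def staging_enabled(properties):
    """Decide whether the planner stages the worker package."""
    explicit = properties.get("pegasus.transfer.worker.package")
    if explicit is not None:
        # An explicit user setting wins over the automatic default.
        return explicit.strip().lower() == "true"
    return properties.get("pegasus.data.configuration") in NON_SHARED


def transferred_with_job(properties, site_env):
    """The per job transfer is skipped when the site defines PEGASUS_HOME."""
    return staging_enabled(properties) and "PEGASUS_HOME" not in site_env


props = {"pegasus.data.configuration": "condorio"}
print(staging_enabled(props))
print(transferred_with_job(props, {"PEGASUS_HOME": "/opt/pegasus"}))
props["pegasus.transfer.worker.package"] = "false"
print(staging_enabled(props))
</pre>
<p>The three calls print True, False, and False: condorio turns
        staging on automatically, a site with PEGASUS_HOME does not receive
        the package with the job, and an explicit false turns staging
        off.</p>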
</div>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="staging_job_checkpoint_files"></a>9.2.9. Staging of Job Checkpoint Files</h3></div></div></div>
<p>Pegasus supports transferring job checkpoint files back to
      the staging site when a job exceeds its advertised running time. To
      use this feature, you need to:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>Associate a job checkpoint file (that the job creates) with
          the job in the DAX. A checkpoint file is specified by setting the
          link attribute to checkpoint for the uses tag.</p></li>
<li class="listitem"><p>Associate a Pegasus profile key named <span class="bold"><strong>checkpoint.time</strong></span>
          with the job. It is the time in minutes after which the job
          is sent the TERM signal by pegasus-kickstart, telling it to create
          the checkpoint file.</p></li>
<li class="listitem"><p>Associate a Pegasus profile key named <span class="bold"><strong>maxwalltime</strong></span>
          with the job. It specifies the maximum runtime
          in minutes before the job is killed by the local resource
          manager (such as PBS) deployed on the site. Usually, this value
          should be associated with the execution site in the site
          catalog.</p></li>
</ol></div>
<p>The Pegasus planner uses these profile keys to set up
      pegasus-kickstart so that the job is sent a TERM signal when the
      checkpoint time of the job is reached. A KILL signal is sent at
      (checkpoint.time + (maxwalltime-checkpoint.time)/2) minutes. This
      ensures that there is enough time for pegasus-lite to transfer the
      checkpoint file before the job is killed by the underlying
      scheduler.</p>
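<p>As a worked example of the formula above, the following sketch
      computes the time of the KILL signal. The function name and the sample
      values of 40 and 60 minutes are assumptions chosen for
      illustration.</p>
<pre class="programlisting">
def kill_time(checkpoint_time, maxwalltime):
    """Minutes after job start at which the KILL signal is sent."""
    return checkpoint_time + (maxwalltime - checkpoint_time) / 2


# TERM is sent at 40 minutes; the resource manager kills the job at 60.
print(kill_time(40, 60))
</pre>
<p>With these values the KILL signal is sent at 50 minutes, halfway
      between the TERM signal and the walltime limit, which leaves 10 minutes
      to write the checkpoint and 10 minutes to transfer it.</p>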
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp63362688"></a>9.2.10. Using Amazon S3 as a Staging Site</h3></div></div></div>
<p>Pegasus can be configured to use Amazon S3 as a staging site. In
      this mode, Pegasus transfers workflow inputs from the input site to S3.
      When a job runs, the inputs for that job are fetched from S3 to the
      worker node, the job is executed, then the output files are transferred
      from the worker node back to S3. When the jobs are complete, Pegasus
      transfers the output data from S3 to the output site.</p>
<p>In order to use S3, it is necessary to create a config file for
      the S3 transfer client, <a class="link" href="cli-pegasus-s3.php" title="pegasus-s3">
      pegasus-s3</a>. See the <a class="link" href="cli-pegasus-s3.php" title="pegasus-s3">man
      page</a> for details on how to create the config file. You also need
      to specify <a class="link" href="data_staging_configuration.php#non_shared_fs" title="5.3.2. Non Shared Filesystem">S3 as a staging
      site</a>.</p>
<p>Next, you need to modify your site catalog to specify the location of
      your s3cfg file. See <a class="link" href="cred_staging.php" title="9.3. Credentials Management">the section on
      credential staging</a>.</p>
<p>The following site catalog shows how to specify the location of
      the s3cfg file on the local site and how to specify an Amazon S3 staging
      site:</p>
<pre class="programlisting">&lt;sitecatalog xmlns="http://pegasus.isi.edu/schema/sitecatalog"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://pegasus.isi.edu/schema/sitecatalog
             http://pegasus.isi.edu/schema/sc-3.0.xsd" version="3.0"&gt;
    &lt;site handle="local" arch="x86_64" os="LINUX"&gt;
        &lt;head-fs&gt;
            &lt;scratch&gt;
                &lt;shared&gt;
                    &lt;file-server protocol="file" url="file://" mount-point="/tmp/wf/work"/&gt;
                    &lt;internal-mount-point mount-point="/tmp/wf/work"/&gt;
                &lt;/shared&gt;
            &lt;/scratch&gt;
            &lt;storage&gt;
                &lt;shared&gt;
                    &lt;file-server protocol="file" url="file://" mount-point="/tmp/wf/storage"/&gt;
                    &lt;internal-mount-point mount-point="/tmp/wf/storage"/&gt;
                &lt;/shared&gt;
            &lt;/storage&gt;
        &lt;/head-fs&gt;
        <span class="bold"><strong>&lt;profile namespace="env" key="S3CFG"&gt;/home/username/.s3cfg&lt;/profile&gt;</strong></span>
    &lt;/site&gt;
    <span class="bold"><strong>&lt;site handle="s3" arch="x86_64" os="LINUX"&gt;
        &lt;head-fs&gt;
            &lt;scratch&gt;
                &lt;shared&gt;
                    &lt;!-- wf-scratch is the name of the S3 bucket that will be used --&gt;
                    &lt;file-server protocol="s3" url="s3://user@amazon" mount-point="/wf-scratch"/&gt;
                    &lt;internal-mount-point mount-point="/wf-scratch"/&gt;
                &lt;/shared&gt;
            &lt;/scratch&gt;
        &lt;/head-fs&gt;
    &lt;/site&gt;</strong></span>
    &lt;site handle="condorpool" arch="x86_64" os="LINUX"&gt;
        &lt;head-fs&gt;
            &lt;scratch/&gt;
            &lt;storage/&gt;
        &lt;/head-fs&gt;
        &lt;profile namespace="pegasus" key="style"&gt;condor&lt;/profile&gt;
        &lt;profile namespace="condor" key="universe"&gt;vanilla&lt;/profile&gt;
        &lt;profile namespace="condor" key="requirements"&gt;(Target.Arch == "X86_64")&lt;/profile&gt;
    &lt;/site&gt;
&lt;/sitecatalog&gt;
</pre>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp63372992"></a>9.2.11. iRODS data access</h3></div></div></div>
<p>iRODS can be used as an input data location, a storage site for
      intermediate data during workflow execution, or a location for final
      output data. Pegasus uses a URL notation to identify iRODS files.
      Example:</p>
<pre class="programlisting">irods://some-host.org/path/to/file.txt</pre>
<p>The path to the file is <span class="bold"><strong>relative</strong></span>
      to the internal iRODS location. In the example above, the path used to
      refer to the file in iRODS is <span class="emphasis"><em>path/to/file.txt</em></span> (no
      leading /).</p>
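<p>The split between host and relative path can be illustrated with
      Python's standard URL parser. This is only a sketch of the notation
      described above, not the code Pegasus uses; the variable names are
      hypothetical.</p>
<pre class="programlisting">
from urllib.parse import urlsplit

parts = urlsplit("irods://some-host.org/path/to/file.txt")
host = parts.netloc
# Drop the leading slash: the path is relative to the iRODS location.
irods_path = parts.path.lstrip("/")
print(host)
print(irods_path)
</pre>
<p>This prints some-host.org followed by path/to/file.txt.</p>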
<p>See <a class="link" href="cred_staging.php" title="9.3. Credentials Management">the section on credential
      staging</a> for information on how to set up an irodsEnv file to be
      used by Pegasus.</p>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp63378256"></a>9.2.12. GridFTP over SSH (sshftp)</h3></div></div></div>
<p>Instead of using X.509 based security, newer versions of Globus
      GridFTP can be configured to set up transfers over SSH. See the <a class="ulink" href="http://toolkit.globus.org/toolkit/docs/6.0/gridftp/admin/#gridftp-admin-config-security-sshftp" target="_top">Globus
      Documentation</a> for details on installing and setting up this
      feature.</p>
<p>Pegasus requires the ability to specify at runtime which SSH key is
      used, and thus a small modification to the default Globus
      configuration is necessary. On the hosts where Pegasus initiates transfers
      (which depends on the data configuration of the workflow), please
      replace <span class="emphasis"><em>gridftp-ssh</em></span>, usually located under
      <span class="emphasis"><em>/usr/share/globus/gridftp-ssh</em></span>, with:</p>
<pre class="programlisting">
#!/bin/bash

url_string=$1
remote_host=$2
port=$3
user=$4

port_str=""
if  [ "X" = "X$port" ]; then
    port_str=""
else
    port_str=" -p $port "
fi

if  [ "X" != "X$user" ]; then
    remote_host="$user@$remote_host"
fi

remote_default1=.globus/sshftp
remote_default2=/etc/grid-security/sshftp
remote_fail="echo -e 500 Server is not configured for SSHFTP connections.\\\r\\\n"
remote_program=$GLOBUS_REMOTE_SSHFTP
if  [ "X" = "X$remote_program" ]; then
    remote_program="(( test -f $remote_default1 &amp;&amp; $remote_default1 ) || ( test -f $remote_default2 &amp;&amp; $remote_default2 ) || $remote_fail )"
fi

if [ "X" != "X$GLOBUS_SSHFTP_PRINT_ON_CONNECT" ]; then
    echo "Connecting to $1 ..." &gt;/dev/tty
fi

# for pegasus-transfer
extra_opts=" -o StrictHostKeyChecking=no"
if [ "x$SSH_PRIVATE_KEY" != "x" ]; then
    extra_opts="$extra_opts -i $SSH_PRIVATE_KEY"
fi

exec /usr/bin/ssh $extra_opts $port_str $remote_host $remote_program
</pre>
<p>Once configured, you should be able to use URLs such as
      <span class="emphasis"><em>sshftp://username@host/foo/bar.txt</em></span> in your
      workflows.</p>
</div>
</div>
<div class="navfooter">
<hr>
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left">
<a accesskey="p" href="data_management.php">Prev</a> </td>
<td width="20%" align="center"><a accesskey="u" href="data_management.php">Up</a></td>
<td width="40%" align="right"> <a accesskey="n" href="cred_staging.php">Next</a>
</td>
</tr>
<tr>
<td width="40%" align="left" valign="top">Chapter 9. Data Management </td>
<td width="20%" align="center"><a accesskey="h" href="index.php">Table of Contents</a></td>
<td width="40%" align="right" valign="top"> 9.3. Credentials Management</td>
</tr>
</table>
</div>
</div><?php  
            do_html_footer();
        ?>
