<?php  
            include_once( $_SERVER['DOCUMENT_ROOT']."/static/includes/common.inc.php" );
            do_html_header("Documentation");
        ?><div id="content">
<div class="navheader">
<table width="100%" summary="Navigation header"><tr>
<td width="20%" align="left">
<a accesskey="p" href="example_workflows.php">Prev</a> </td>
<td width="60%" align="center"><a accesskey="h" href="index.php">Table of Contents</a></td>
<td width="20%" align="right"> <a accesskey="n" href="cli-pegasus-analyzer.php">Next</a>
</td>
</tr></table>
<hr>
</div>
<div class="chapter" title="Chapter 10. Reference Manual">
<div class="titlepage"><div><div><h2 class="title">
<a name="reference"></a>Chapter 10. Reference Manual</h2></div></div></div>
<div class="toc"><dl>
<dt><span class="section"><a href="reference.php#Properties">10.1. Properties</a></span></dt>
<dt><span class="section"><a href="reference.php#profiles">10.2. Profiles</a></span></dt>
<dt><span class="section"><a href="reference.php#replica_selection">10.3. Replica Selection</a></span></dt>
<dt><span class="section"><a href="reference.php#job_clustering">10.4. Job Clustering</a></span></dt>
<dt><span class="section"><a href="reference.php#transfer">10.5. Data Transfers</a></span></dt>
<dt><span class="section"><a href="reference.php#hierarchial_workflows">10.6. Hierarchical Workflows</a></span></dt>
<dt><span class="section"><a href="reference.php#notifications">10.7. Notifications</a></span></dt>
<dt><span class="section"><a href="reference.php#monitoring">10.8. Monitoring</a></span></dt>
<dt><span class="section"><a href="reference.php#api">10.9. API Reference</a></span></dt>
<dt><span class="section"><a href="reference.php#cli">10.10. Command Line Tools</a></span></dt>
</dl></div>
<div class="section" title="10.1. Properties">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="Properties"></a>10.1. Properties</h2></div></div></div>
<div class="toc"><dl>
<dt><span class="section"><a href="reference.php#Propertiespegasus.home">10.1.1. pegasus.home</a></span></dt>
<dt><span class="section"><a href="reference.php#PropertiesLocalDirectories">10.1.2. Local Directories</a></span></dt>
<dt><span class="section"><a href="reference.php#PropertiesSiteDirectories">10.1.3. Site Directories</a></span></dt>
<dt><span class="section"><a href="reference.php#PropertiesSchemaFileLocationProperties">10.1.4. Schema File Location Properties</a></span></dt>
<dt><span class="section"><a href="reference.php#PropertiesDatabaseDriversForAllRelationalCatalogs">10.1.5. Database Drivers For All Relational Catalogs</a></span></dt>
<dt><span class="section"><a href="reference.php#PropertiesCatalogProperties">10.1.6. Catalog Properties</a></span></dt>
<dt><span class="section"><a href="reference.php#PropertiesReplicaSelectionProperties">10.1.7. Replica Selection Properties</a></span></dt>
<dt><span class="section"><a href="reference.php#PropertiesSiteSelectionProperties">10.1.8. Site Selection Properties</a></span></dt>
<dt><span class="section"><a href="reference.php#PropertiesDataStagingConfiguration">10.1.9. Data Staging Configuration</a></span></dt>
<dt><span class="section"><a href="reference.php#PropertiesTransferConfigurationProperties">10.1.10. Transfer Configuration Properties</a></span></dt>
<dt><span class="section"><a href="reference.php#PropertiesGridstartAndExitcodeProperties">10.1.11. Gridstart And Exitcode Properties</a></span></dt>
<dt><span class="section"><a href="reference.php#PropertiesInterfaceToCondorAndCondorDagman">10.1.12. Interface To Condor And Condor Dagman</a></span></dt>
<dt><span class="section"><a href="reference.php#PropertiesMonitoringProperties">10.1.13. Monitoring Properties</a></span></dt>
<dt><span class="section"><a href="reference.php#PropertiesJobClusteringProperties">10.1.14. Job Clustering Properties</a></span></dt>
<dt><span class="section"><a href="reference.php#PropertiesLoggingProperties">10.1.15. Logging Properties</a></span></dt>
<dt><span class="section"><a href="reference.php#PropertiesMiscellaneousProperties">10.1.16. Miscellaneous Properties</a></span></dt>
</dl></div>
<p></p>
<p>This is the reference guide to all properties of the
Pegasus Workflow Planner and their respective default values. Please refer
to the user guide for a discussion of when and which properties to use to
configure various components. Note that property values are
case-sensitive unless explicitly noted otherwise.
</p>
<p>Some properties derive their default from the value of other
properties. As a notation, ${...} refers to the value of the
named property. For instance, ${pegasus.home} means that the value depends
on the value of the pegasus.home property plus any noted additions. You
can use this notation to refer to other properties, though the extent
of the substitutions is limited. Usually, you want to refer to one
of the standard system properties. Nesting is not allowed.
Substitutions are performed only once.
</p>
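<p>As a minimal sketch of this notation, the fragment below builds two
values from a standard Java system property. The directory and file names
are hypothetical and only illustrate the substitution:
</p>
<pre class="screen">
# ${user.home} is a standard Java system property.
# The paths below are placeholders, not Pegasus defaults.
pegasus.catalog.replica.file         ${user.home}/catalogs/rc.data
pegasus.catalog.transformation.file  ${user.home}/catalogs/tc.text
</pre>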
<p>Property files are read and evaluated in a fixed priority order.
Usually one does not need to worry about the priorities. However, it
is good to know which property applies when, and how
one property can override another. The following is a mutually exclusive
list (highest priority first) of property file locations.
</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">The --conf option to the tools. Almost all of the clients that use properties
have a --conf option to specify the property file to pick up.
</li>
<li class="listitem">The submit-dir/pegasus.xxxxxxx.properties file. All tools that work on the
submit directory (i.e., after Pegasus has planned a workflow) pick up the
pegasus.xxxxxxx.properties file from the submit directory. The location of the
pegasus.xxxxxxx.properties file is picked up from the braindump file.
</li>
<li class="listitem">The properties defined in the user property file
<span class="emphasis"><em>${user.home}/.pegasusrc</em></span> have the lowest priority.
</li>
</ol></div>
<p>
</p>
<p>Command-line properties have the highest priority. They override any property loaded
from a property file. Each command-line property is introduced by a -D argument.
Note that these arguments are parsed by the shell wrapper, and thus the -D arguments
must be the first arguments to any command. Command-line properties are useful for debugging
purposes.
</p>
<p>From the Pegasus 3.1 release onwards, support has been dropped for the following
properties that were used to signify the location of the properties file:
</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">pegasus.properties</li>
<li class="listitem">pegasus.user.properties</li>
</ul></div>
<p>
</p>
<p>The following example provides a sensible set of properties to be set
by the user property file. These properties use mostly non-default
settings. It is an example only, and will not work for you:
</p>
<pre class="screen">
pegasus.catalog.replica              File
pegasus.catalog.replica.file         ${pegasus.home}/etc/sample.rc.data
pegasus.catalog.transformation       Text
pegasus.catalog.transformation.file  ${pegasus.home}/etc/sample.tc.text
pegasus.catalog.site.file            ${pegasus.home}/etc/sample.sites.xml
</pre>
<p>
</p>
<p>If you are in doubt about which properties are actually in effect, note that
Pegasus, while planning the workflow, dumps all properties after reading and
prioritizing them into a file with the suffix properties in the submit directory.
</p>
<div class="section" title="10.1.1. pegasus.home">
<div class="titlepage"><div><div><h3 class="title">
<a name="Propertiespegasus.home"></a>10.1.1. pegasus.home</h3></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">Systems:</td>
<td align="left">all</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">directory location string</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">"$PEGASUS_HOME"</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>The property pegasus.home cannot be set in the property file. It is
set automatically by the Pegasus clients, which determine the installation
directory of Pegasus internally. Knowledge of this property is important for developers who
want to invoke Pegasus Java classes without the shell wrappers.
</p>
</div>
<div class="section" title="10.1.2. Local Directories">
<div class="titlepage"><div><div><h3 class="title">
<a name="PropertiesLocalDirectories"></a>10.1.2. Local Directories</h3></div></div></div>
<p></p>
<p>This section describes the GNU directory structure conventions. GNU
distinguishes between architecture independent and thus sharable
directories, and directories with data specific to a platform, and
thus often local. It also distinguishes between frequently modified
data and rarely changing data. These two axes form a space of four
distinct directories.
</p>
<div class="section" title="10.1.2.1. pegasus.home.datadir">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.home.datadir"></a>10.1.2.1. pegasus.home.datadir</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">Systems:</td>
<td align="left">all</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">directory location string</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">${pegasus.home}/share</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>The datadir directory contains broadly visible and possibly exported
configuration files that rarely change. This directory is currently
unused.
</p>
</div>
<div class="section" title="10.1.2.2. pegasus.home.sysconfdir">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.home.sysconfdir"></a>10.1.2.2. pegasus.home.sysconfdir</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">Systems:</td>
<td align="left">all</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">directory location string</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">${pegasus.home}/etc</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>The system configuration directory contains configuration files that
are specific to the machine or installation, and that rarely change.
This is the directory where the XML schema definition copies are
stored, and where the base pool configuration file is stored.
</p>
</div>
<div class="section" title="10.1.2.3. pegasus.home.sharedstatedir">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.home.sharedstatedir"></a>10.1.2.3. pegasus.home.sharedstatedir</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">Systems:</td>
<td align="left">all</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">directory location string</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">${pegasus.home}/com</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>Frequently changing files that are broadly visible are stored in the
shared state directory. This is currently unused.
</p>
</div>
<div class="section" title="10.1.2.4. pegasus.home.localstatedir">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.home.localstatedir"></a>10.1.2.4. pegasus.home.localstatedir</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">Systems:</td>
<td align="left">all</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">directory location string</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">${pegasus.home}/var</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>Frequently changing files that are specific to a machine and/or
installation are stored in the local state directory. This directory
is used for the textual transformation catalog
and the file-based replica catalog.
</p>
</div>
<div class="section" title="10.1.2.5. pegasus.dir.submit.logs">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.dir.submit.logs"></a>10.1.2.5. pegasus.dir.submit.logs</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.4</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">directory location string</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>This property can be used to specify the directory where the Condor
logs for the workflow should be written. By default, starting with the 4.2.1 release,
Pegasus sets up the log in the workflow submit directory.
This can create problems if the user's submit directories are on NFS.
</p>
<p>Setting this property ensures that the logs are created in a local directory
even though the submit directory may be on NFS.
</p>
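<p>For instance, a submit host whose submit directories live on NFS might
redirect the logs to local disk. The path below is a hypothetical
placeholder, not a Pegasus default:
</p>
<pre class="screen">
pegasus.dir.submit.logs  /var/tmp/pegasus-logs
</pre>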
<p></p>
</div>
</div>
<div class="section" title="10.1.3. Site Directories">
<div class="titlepage"><div><div><h3 class="title">
<a name="PropertiesSiteDirectories"></a>10.1.3. Site Directories</h3></div></div></div>
<p>The site directory properties modify the behavior of remotely run jobs.
On rare occasions, they may also pertain to locally run compute jobs.
</p>
<div class="section" title="10.1.3.1. pegasus.dir.useTimestamp">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.dir.useTimestamp"></a>10.1.3.1. pegasus.dir.useTimestamp</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.1</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">false</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>While creating the submit directory, Pegasus employs a run numbering
scheme. Users can set this property to select a timestamp-based
numbering scheme instead of the runxxxx scheme.
</p>
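<p>A one-line sketch of switching from the default runxxxx scheme to the
timestamp-based scheme in a property file:
</p>
<pre class="screen">
pegasus.dir.useTimestamp  true
</pre>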
</div>
<div class="section" title="10.1.3.2. pegasus.dir.exec">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.dir.exec"></a>10.1.3.2. pegasus.dir.exec</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">remote directory location string</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">(no default)</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>This property modifies the remote work directory in which all
your jobs will run. If the path is relative, it is appended to the
work directory associated with the site, as specified in the site
catalog. If the path is absolute, it overrides the work directory
specified in the site catalog.
</p>
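<p>The two cases can be sketched as follows. The site work directory
/scratch/work and both property values are hypothetical, and the two
settings are alternatives rather than a single configuration:
</p>
<pre class="screen">
# Assume the site catalog lists /scratch/work as the work directory.

# Relative path: appended, giving /scratch/work/myproject
pegasus.dir.exec  myproject

# Absolute path: replaces /scratch/work, giving /data/runs
pegasus.dir.exec  /data/runs
</pre>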
</div>
<div class="section" title="10.1.3.3. pegasus.dir.storage.mapper">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.dir.storage.mapper"></a>10.1.3.3. pegasus.dir.storage.mapper</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">4.3</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">Flat</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">Fixed</td>
</tr>
<tr>
<td align="left">Value[2]:</td>
<td align="left">Hashed</td>
</tr>
<tr>
<td align="left">Value[3]:</td>
<td align="left">Replica</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">Flat</td>
</tr>
<tr>
<td align="left">See Also:</td>
<td align="left">pegasus.dir.storage.deep</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>This property determines how the output files are mapped on the
output site storage location.
</p>
<p>In order to preserve backward compatibility, setting the boolean property
pegasus.dir.storage.deep causes the Hashed output mapper to be loaded
if no output mapper property is specified.
</p>
<div class="variablelist"><dl>
<dt><span class="term">Flat</span></dt>
<dd>
By default, Pegasus will place the output files in the storage directory
specified in the site catalog for the output site.
</dd>
<dt><span class="term">Fixed</span></dt>
<dd>
Using this mapper, users can specify an externally accessible URL to
the storage directory in their properties file. The following property
needs to be set.
<pre class="screen">
pegasus.dir.storage.mapper.fixed.url  an externally accessible URL to the
storage directory on the output site
e.g. gsiftp://outputs.isi.edu/shared/outputs
</pre>
Note: For hierarchical workflows, the above property needs to be set
separately for each dax job if you want the sub-workflow outputs
to go to a different directory.
</dd>
<dt><span class="term">Hashed</span></dt>
<dd>
This mapper results in the creation of a deep directory structure
on the output site while populating the results. The base directory
on the remote end is determined from the site catalog.
Depending on the number of files being staged to the remote site, a
hashed file structure is created that ensures that no more than 256 files
reside in one directory.
To create this directory structure on the storage site, Pegasus
relies on the directory creation feature of the GridFTP server,
which appeared in Globus 4.0.x.
</dd>
<dt><span class="term">Replica</span></dt>
<dd>
This mapper determines the path for an output file on the output site by
querying an output replica catalog. The output site is the one that is
passed on the command line. The output replica catalog can be configured
by specifying the properties with the prefix pegasus.dir.storage.mapper.replica.
By default, a Regex File based backend is assumed unless overridden.
For example:
<pre class="screen">
pegasus.dir.storage.mapper.replica       Regex|File
pegasus.dir.storage.mapper.replica.file  the RC file at the backend to use if using a file based RC
</pre>
</dd>
</dl></div>
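<p>As a sketch, a property file that selects the Fixed mapper would set
the mapper and its URL together. The URL is the sample value quoted above,
not a working endpoint:
</p>
<pre class="screen">
pegasus.dir.storage.mapper            Fixed
pegasus.dir.storage.mapper.fixed.url  gsiftp://outputs.isi.edu/shared/outputs
</pre>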
<p>
</p>
</div>
<div class="section" title="10.1.3.4. pegasus.dir.storage.deep">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.dir.storage.deep"></a>10.1.3.4. pegasus.dir.storage.deep</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.1</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">false</td>
</tr>
<tr>
<td align="left">See Also:</td>
<td align="left">pegasus.dir.storage.mapper</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>This property results in the creation of a deep directory structure
on the output site, while populating the results. The base directory
on the remote end is determined from the site catalog.
</p>
<p>To this base directory, the relative submit directory structure
( $user/$vogroup/$label/runxxxx ) is appended.
</p>
<p>$storage = $base + $relative_submit_directory
</p>
<p>This is the base directory that is passed to the storage mapper.
</p>
<p>Note: To preserve backward compatibility, setting this
property causes the Hashed mapper to be loaded unless
pegasus.dir.storage.mapper is explicitly specified. Before 4.3,
this property resulted in a HashedDirectory structure.
</p>
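<p>A worked sketch of the resulting path, assuming a hypothetical base
storage directory and a hypothetical relative submit directory:
</p>
<pre class="screen">
# Assume the site catalog gives /storage/outputs as the base directory
# and the relative submit directory is alice/physics/montage/run0001.
pegasus.dir.storage.deep  true

# $storage = /storage/outputs + alice/physics/montage/run0001
#          = /storage/outputs/alice/physics/montage/run0001
</pre>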
</div>
<div class="section" title="10.1.3.5. pegasus.dir.create.strategy">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.dir.create.strategy"></a>10.1.3.5. pegasus.dir.create.strategy</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.2</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">HourGlass</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">Tentacles</td>
</tr>
<tr>
<td align="left">Value[2]:</td>
<td align="left">Minimal</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">Minimal</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>If the --randomdir option is given to the planner at
runtime, the Pegasus planner adds nodes that create the random
directories at the remote pool sites before any jobs are
actually run. The selected strategy determines the placement of these
nodes and their dependencies on the rest of the graph.
</p>
<div class="variablelist"><dl>
<dt><span class="term">HourGlass</span></dt>
<dd>
It adds make directory nodes at the top level of the graph, and all
of these converge on a single dummy job before branching out to the root
nodes of the original (concrete) DAG. This introduces a
classic X shape at the top of the graph, hence the name HourGlass.
</dd>
<dt><span class="term">Tentacles</span></dt>
<dd>
This option places the jobs creating directories at the top of the
graph. However, instead of constricting the graph to an hourglass shape,
this mode links the top node to all the relevant nodes for which the
create dir job is necessary. It looks as if the node spreads its
tentacles all around. This puts more load on DAGMan because of
the added dependencies, but removes the restriction that the plan
progresses only when all the create directory jobs have completed
on the remote pools, as is the case in the HourGlass model.
</dd>
<dt><span class="term">Minimal</span></dt>
<dd>
This strategy walks the graph in BFS order and
updates a bit set associated with each job based on the bit sets
of its parent jobs. The bit set indicates whether an edge exists
from the create dir job to an ancestor of the node.
For a node, the bit set is the union of all the parents' bit sets.
The BFS traversal ensures that the bit set of a node is
only updated once its parents have been processed.
</dd>
</dl></div>
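<p>As a sketch, a workflow that should not wait for every create directory
job to finish, and can accept the extra DAGMan dependencies, might select
the Tentacles strategy instead of the default:
</p>
<pre class="screen">
pegasus.dir.create.strategy  Tentacles
</pre>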
<p>
</p>
</div>
<div class="section" title="10.1.3.6. pegasus.dir.create.impl">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.dir.create.impl"></a>10.1.3.6. pegasus.dir.create.impl</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.2</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">DefaultImplementation</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">S3</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">DefaultImplementation</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>This property is used to select the executable that is used to
create the working directory on the compute sites.
</p>
<div class="variablelist"><dl>
<dt><span class="term">DefaultImplementation</span></dt>
<dd>
The default executable that is used to create a directory is the
dirmanager executable shipped with Pegasus. It is found at
$PEGASUS_HOME/bin/dirmanager in the pegasus distribution.
An entry for transformation pegasus::dirmanager needs
to exist in the Transformation Catalog or the PEGASUS_HOME
environment variable should be specified in the site catalog for
the sites for this mode to work.
</dd>
<dt><span class="term">S3</span></dt>
<dd>
This option is used to create buckets in S3 instead of a
directory. This should be set when running workflows on Amazon
EC2. This implementation relies on s3cmd command line client to
create the bucket. An entry for transformation amazon::s3cmd needs
to exist in the Transformation Catalog for this to work.
</dd>
</dl></div>
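<p>A minimal sketch for a workflow running on Amazon EC2. It assumes the
Transformation Catalog already contains the amazon::s3cmd entry described
above:
</p>
<pre class="screen">
pegasus.dir.create.impl  S3
</pre>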
<p>
</p>
<p></p>
</div>
</div>
<div class="section" title="10.1.4. Schema File Location Properties">
<div class="titlepage"><div><div><h3 class="title">
<a name="PropertiesSchemaFileLocationProperties"></a>10.1.4. Schema File Location Properties</h3></div></div></div>
<p>This section defines the location of XML schema files that are
used to parse the various XML document instances in Pegasus. The
schema backups in the installed file system permit Pegasus operations
without being online.
</p>
<div class="section" title="10.1.4.1. pegasus.schema.dax">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.schema.dax"></a>10.1.4.1. pegasus.schema.dax</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">Systems:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">XML schema file location string</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">${pegasus.home.sysconfdir}/dax-3.2.xsd</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">${pegasus.home.sysconfdir}/dax-3.2.xsd</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>This file is a copy of the XML schema that describes abstract DAG
files that are the result of the abstract planning process, and input
into any concrete planning. Providing a copy of the schema enables the
parser to use the local copy instead of reaching out to the internet,
and obtaining the latest version from the GriPhyN website dynamically.
</p>
</div>
<div class="section" title="10.1.4.2. pegasus.schema.sc">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.schema.sc"></a>10.1.4.2. pegasus.schema.sc</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">Systems:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">XML schema file location string</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">${pegasus.home.sysconfdir}/sc-3.0.xsd</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">${pegasus.home.sysconfdir}/sc-3.0.xsd</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>This file is a copy of the XML schema that describes the XML
description of the site catalog, which is generated by the
genpoolconfig command.
Providing a copy of the schema enables the parser to use the local
copy  instead of reaching out to the internet, and obtaining the
latest version from the GriPhyN website dynamically.
</p>
</div>
<div class="section" title="10.1.4.3. pegasus.schema.ivr">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.schema.ivr"></a>10.1.4.3. pegasus.schema.ivr</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">Systems:</td>
<td align="left">all</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">XML schema file location string</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">${pegasus.home.sysconfdir}/iv-2.0.xsd</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">${pegasus.home.sysconfdir}/iv-2.0.xsd</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>This file is a copy of the XML schema that describes invocation record
files that are the result of a grid launch at a remote or local
site. Providing a copy of the schema enables the parser to use the
local copy instead of reaching out to the internet, and obtaining the
latest version from the GriPhyN website dynamically.
</p>
</div>
</div>
<div class="section" title="10.1.5. Database Drivers For All Relational Catalogs">
<div class="titlepage"><div><div><h3 class="title">
<a name="PropertiesDatabaseDriversForAllRelationalCatalogs"></a>10.1.5. Database Drivers For All Relational Catalogs</h3></div></div></div>
<p></p>
<p></p>
<div class="section" title="10.1.5.1. pegasus.catalog.*.db.driver">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.catalog.*.db.driver"></a>10.1.5.1. pegasus.catalog.*.db.driver</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Java class name</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">Postgres</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">MySQL</td>
</tr>
<tr>
<td align="left">Value[2]:</td>
<td align="left">SQLServer2000	(not yet implemented!)</td>
</tr>
<tr>
<td align="left">Value[3]:</td>
<td align="left">Oracle		(not yet implemented!)</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">(no default)</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.catalog.provenance</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>The database driver class is dynamically loaded, as required by the
schema. Currently, only PostgreSQL 7.3 and MySQL 4.0 are supported.
Their respective JDBC3 drivers are provided as part of
Pegasus.
</p>
<p>A user may provide their own implementation, derived from
org.griphyn.vdl.dbdriver.DatabaseDriver, to talk to a database of
their choice.
</p>
<p>For each schema in PTC, a driver is instantiated
separately, which has the same prefix as the schema. This may result
in multiple connections to the database backend. As fallback, the
schema "*" driver is attempted.
</p>
<p>The * in the property name can be replaced by a catalog name to
apply the property only for that catalog.
Valid catalog names are
</p>
<pre class="screen">
replica
provenance
</pre>
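<p>As a sketch of the wildcard and per-catalog forms, the fragment below
sets a fallback driver under the "*" key and a different driver for the
replica catalog. The choice of drivers is illustrative only:
</p>
<pre class="screen">
# Fallback for any relational catalog without its own setting
pegasus.catalog.*.db.driver        Postgres
# Applies to the replica catalog only
pegasus.catalog.replica.db.driver  MySQL
</pre>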
<p>
</p>
<p></p>
</div>
<div class="section" title="10.1.5.2. pegasus.catalog.*.db.url">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.catalog.*.db.url"></a>10.1.5.2. pegasus.catalog.*.db.url</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">PTC, ...</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">JDBC database URI string</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">(no default)</td>
</tr>
<tr>
<td align="left">Example:</td>
<td align="left">jdbc:postgresql:${user.name}</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Each database has its own string to contact the database on a given
host, port, and database. Although most driver URLs allow arbitrary
arguments to be passed, please use the
pegasus.catalog.[catalog-name].db.* keys or pegasus.catalog.*.db.*
to preload these arguments.
THE URL IS A MANDATORY PROPERTY FOR ANY DBMS BACKEND.
</p>
<pre class="screen">
Postgres : jdbc:postgresql:[//hostname[:port]/]database
MySQL    : jdbc:mysql://hostname[:port]/database
SQLServer: jdbc:microsoft:sqlserver://hostname:port
Oracle   : jdbc:oracle:thin:[user/password]@//host[:port]/service
</pre>
<p>
</p>
<p>The * in the property name can be replaced by a catalog name to
apply the property only for that catalog.
Valid catalog names are
</p>
<pre class="screen">
replica
provenance
</pre>
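<p>Filled in against the URL templates above, per-catalog settings might
look like the following sketch. The host name, ports, and database names
are hypothetical:
</p>
<pre class="screen">
pegasus.catalog.replica.db.url     jdbc:mysql://db.example.org:3306/pegasus_rc
pegasus.catalog.provenance.db.url  jdbc:postgresql://db.example.org:5432/pegasus_ptc
</pre>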
<p>
</p>
<p></p>
</div>
<div class="section" title="10.1.5.3. pegasus.catalog.*.db.user">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.catalog.*.db.user"></a>10.1.5.3. pegasus.catalog.*.db.user</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">PTC,  ...</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">string</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">(no default)</td>
</tr>
<tr>
<td align="left">Example:</td>
<td align="left">${user.name}</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>In order to access a database, you must provide the name of your
account on the DBMS. This property is database-independent. THIS IS A
MANDATORY PROPERTY FOR MANY DBMS BACKENDS.
</p>
<p>The * in the property name can be replaced by a catalog name to
apply the property only for that catalog.
Valid catalog names are
</p>
<pre class="screen">
replica
provenance
</pre>
<p>
</p>
<p></p>
</div>
<div class="section" title="10.1.5.4. pegasus.catalog.*.db.password">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.catalog.*.db.password"></a>10.1.5.4. pegasus.catalog.*.db.password</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">PTC, ...</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">string</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">(no default)</td>
</tr>
<tr>
<td align="left">Example:</td>
<td align="left">${user.name}</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>To access a database, you may need to provide the password
of your account on the DBMS. This property is database-independent.
THIS IS A MANDATORY PROPERTY IF YOUR DBMS BACKEND ACCOUNT REQUIRES
A PASSWORD.
</p>
<p>The * in the property name can be replaced by a catalog name to
apply the property only for that catalog.
Valid catalog names are
</p>
<pre class="screen">
replica
provenance
</pre>
<p>
</p>
<p></p>
</div>
<div class="section" title="10.1.5.5. pegasus.catalog.*.db.*">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.catalog.*.db.*"></a>10.1.5.5. pegasus.catalog.*.db.*</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody><tr>
<td align="left">System:</td>
<td align="left">PTC,  RC</td>
</tr></tbody>
</table></div>
<p>
</p>
<p>Each database has a multitude of options that control its behaviour
in fine detail. Check the JDBC3 documentation
of the JDBC driver for your database for details. The keys are
passed as part of the connect properties after stripping the
"pegasus.catalog.[catalog-name].db." prefix from them.
The catalog-name can be replaced by the following values:
provenance for the Provenance Catalog (PTC), and
replica for the Replica Catalog (RC).
<p>Postgres 7.3 parses the following properties:
</p>
<pre class="screen">
pegasus.catalog.*.db.user
pegasus.catalog.*.db.password
pegasus.catalog.*.db.PGHOST
pegasus.catalog.*.db.PGPORT
pegasus.catalog.*.db.charSet
pegasus.catalog.*.db.compatible
</pre>
<p>
</p>
<p>MySQL 4.0 parses the following properties:
</p>
<pre class="screen">
pegasus.catalog.*.db.user
pegasus.catalog.*.db.password
pegasus.catalog.*.db.databaseName
pegasus.catalog.*.db.serverName
pegasus.catalog.*.db.portNumber
pegasus.catalog.*.db.socketFactory
pegasus.catalog.*.db.strictUpdates
pegasus.catalog.*.db.ignoreNonTxTables
pegasus.catalog.*.db.secondsBeforeRetryMaster
pegasus.catalog.*.db.queriesBeforeRetryMaster
pegasus.catalog.*.db.allowLoadLocalInfile
pegasus.catalog.*.db.continueBatchOnError
pegasus.catalog.*.db.pedantic
pegasus.catalog.*.db.useStreamLengthsInPrepStmts
pegasus.catalog.*.db.useTimezone
pegasus.catalog.*.db.relaxAutoCommit
pegasus.catalog.*.db.paranoid
pegasus.catalog.*.db.autoReconnect
pegasus.catalog.*.db.capitalizeTypeNames
pegasus.catalog.*.db.ultraDevHack
pegasus.catalog.*.db.strictFloatingPoint
pegasus.catalog.*.db.useSSL
pegasus.catalog.*.db.useCompression
pegasus.catalog.*.db.socketTimeout
pegasus.catalog.*.db.maxReconnects
pegasus.catalog.*.db.initialTimeout
pegasus.catalog.*.db.maxRows
pegasus.catalog.*.db.useHostsInPrivileges
pegasus.catalog.*.db.interactiveClient
pegasus.catalog.*.db.useUnicode
pegasus.catalog.*.db.characterEncoding
</pre>
<p>
</p>
<p>MS SQL Server 2000 supports the following properties (keys are
case-insensitive, e.g. both "user" and "User" are valid):
</p>
<pre class="screen">
pegasus.catalog.*.db.User
pegasus.catalog.*.db.Password
pegasus.catalog.*.db.DatabaseName
pegasus.catalog.*.db.ServerName
pegasus.catalog.*.db.HostProcess
pegasus.catalog.*.db.NetAddress
pegasus.catalog.*.db.PortNumber
pegasus.catalog.*.db.ProgramName
pegasus.catalog.*.db.SendStringParametersAsUnicode
pegasus.catalog.*.db.SelectMethod
</pre>
<p>
</p>
<p>The * in the property name can be replaced by a catalog name to
apply the property only for that catalog.
Valid catalog names are
</p>
<pre class="screen">
replica
provenance
</pre>
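<p>The prefix stripping described above can be pictured with a small sketch.
Assuming a MySQL-backed replica catalog, the entries below would reach the
JDBC driver as the connect properties autoReconnect and useUnicode. The
values shown are illustrative assumptions only.
</p>
<pre class="screen">
pegasus.catalog.replica.db.autoReconnect = true
pegasus.catalog.replica.db.useUnicode = true
</pre>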
<p>
</p>
<p></p>
</div>
</div>
<div class="section" title="10.1.6. Catalog Properties">
<div class="titlepage"><div><div><h3 class="title">
<a name="PropertiesCatalogProperties"></a>10.1.6. Catalog Properties</h3></div></div></div>
<p></p>
<div class="section" title="10.1.6.1. Replica Catalog">
<div class="titlepage"><div><div><h4 class="title">
<a name="PropertiesReplicaCatalog"></a>10.1.6.1. Replica Catalog</h4></div></div></div>
<p></p>
<div class="section" title="10.1.6.1.1. pegasus.catalog.replica">
<div class="titlepage"><div><div><h5 class="title">
<a name="Propertiespegasus.catalog.replica"></a>10.1.6.1.1. pegasus.catalog.replica</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">RLS</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">LRC</td>
</tr>
<tr>
<td align="left">Value[2]:</td>
<td align="left">JDBCRC</td>
</tr>
<tr>
<td align="left">Value[3]:</td>
<td align="left">File</td>
</tr>
<tr>
<td align="left">Value[4]:</td>
<td align="left">Directory</td>
</tr>
<tr>
<td align="left">Value[5]:</td>
<td align="left">MRC</td>
</tr>
<tr>
<td align="left">Value[6]:</td>
<td align="left">Regex</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">RLS</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Pegasus queries a Replica Catalog to discover the physical filenames
(PFN) for input files specified in the DAX. Pegasus can interface
with various types of Replica Catalogs. This property specifies
which type of Replica Catalog to use during the planning process.
</p>
<div class="variablelist"><dl>
<dt><span class="term">RLS</span></dt>
<dd> RLS (Replica Location Service) is a distributed replica
catalog, which ships with GT4. There is an index service called
Replica Location Index (RLI) to which one or more Local Replica
Catalogs (LRCs) report. Each LRC can contain all or a subset of the
mappings. In this mode, Pegasus queries the central RLI to
discover in which LRCs the mappings for an LFN reside. It then
queries the individual LRCs for the PFNs.
To use RLS, the user additionally needs to set the property
pegasus.catalog.replica.url to specify the URL of the RLI to
query.
Details about RLS can be found at
http://www.globus.org/toolkit/data/rls/
</dd>
<dt><span class="term">LRC</span></dt>
<dd> In this mode, Pegasus does not query the RLI but directly queries a
single Local Replica Catalog.
To use LRC, the user additionally needs to set the property
pegasus.catalog.replica.url to specify the URL of the LRC to
query.
Details about RLS can be found at
http://www.globus.org/toolkit/data/rls/
</dd>
<dt><span class="term">JDBCRC</span></dt>
<dd> In this mode, Pegasus queries a SQL-based replica catalog that
is accessed via JDBC. The SQL schemas for this catalog can be
found in the $PEGASUS_HOME/sql directory.
To use JDBCRC, the user additionally needs to set the following
properties:
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">pegasus.catalog.replica.db.driver = mysql</li>
<li class="listitem">pegasus.catalog.replica.db.url = JDBC URL of the database, e.g. jdbc:mysql://database-host.isi.edu/database-name</li>
<li class="listitem">pegasus.catalog.replica.db.user = database-user</li>
<li class="listitem">pegasus.catalog.replica.db.password = database-password</li>
</ol></div>
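<p>Put together, a properties file for this mode might look like the
following sketch. The host, database name, and credentials are
placeholders to be replaced with your own values.
</p>
<pre class="screen">
pegasus.catalog.replica = JDBCRC
pegasus.catalog.replica.db.driver = mysql
pegasus.catalog.replica.db.url = jdbc:mysql://database-host.isi.edu/database-name
pegasus.catalog.replica.db.user = database-user
pegasus.catalog.replica.db.password = database-password
</pre>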
</dd>
<dt><span class="term">File</span></dt>
<dd>
<p>In this mode, Pegasus queries a file-based replica catalog.
It is neither transactionally safe, nor advised to use for
production purposes in any way. Multiple concurrent instances
<span class="emphasis"><em>will clobber</em></span> each other! The site attribute should
be specified whenever possible. The attribute key for the site
attribute is "pool".
</p>
<p>The LFN may or may not be quoted. If it contains linear
whitespace, quotes, a backslash or an equality sign, it must be
quoted and escaped. The same applies to the PFN. The attribute key-value
pairs are separated by an equality sign without any
whitespace. The value may be quoted, in which case the quoting rules for the LFN apply.
</p>
<pre class="screen">
LFN PFN
LFN PFN a=b [..]
LFN PFN a="b" [..]
"LFN w/LWS" "PFN w/LWS" [..]
</pre>
<p>
</p>
<p>To use File, the user additionally needs to set the
pegasus.catalog.replica.file property to the path of the
file-based RC.
</p>
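<p>A minimal sketch of this setup, assuming a hypothetical catalog file at
/home/user/rc.data:
</p>
<pre class="screen">
pegasus.catalog.replica = File
pegasus.catalog.replica.file = /home/user/rc.data
</pre>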
</dd>
<dt><span class="term">Regex</span></dt>
<dd>
<p>In this mode, Pegasus queries a file-based replica catalog.
It is neither transactionally safe, nor advised to use for
production purposes in any way. Multiple concurrent accesses to
the file will end up clobbering its contents. The
site attribute should be specified whenever possible. The attribute
key for the site attribute is "pool".
</p>
<p>The LFN may or may not be quoted. If it contains linear
whitespace, quotes, a backslash or an equality sign, it must be
quoted and escaped. The same applies to the PFN. The attribute key-value
pairs are separated by an equality sign without any
whitespace. The value may be quoted, in which case the quoting rules for the LFN apply.
</p>
<p>In addition, users can specify regular expression based LFNs. A regular expression
based entry should be qualified with an attribute named 'regex'. The attribute regex,
when set to true, identifies the catalog entry as a regular expression based entry.
Regular expressions should follow Java regular expression syntax.
</p>
<p>For example, consider a replica catalog as shown below.
</p>
<p>Entry 1 is an entry that does not use a regular expression. This entry
would only match a file named 'f.a', and nothing else.
Entry 2 is an entry that uses a regular expression. In this entry, f.a
matches files named f[any-character]a, i.e. faa, f.a, f0a, etc.
</p>
<pre class="screen">
f.a file:///Volumes/data/input/f.a pool="local"
f.a file:///Volumes/data/input/f.a pool="local" regex="true"
</pre>
<p>
</p>
<p>Regular expression based entries also support substitutions. For example,
consider the regular expression based entry shown below.
</p>
<p>Entry 3 will match files with name alpha.csv, alpha.txt, alpha.xml.
In addition, values matched in the expression can be used to generate a PFN.
</p>
<p>For the entry below, if the file being looked up is alpha.csv, the PFN for the file
would be generated as file:///Volumes/data/input/csv/alpha.csv. Similarly, if the
file being looked up is alpha.xml, the PFN for the file would be generated as
file:///Volumes/data/input/xml/alpha.xml, i.e. the sections [0] and [1] will be replaced.
Section [0] refers to the entire string, i.e. alpha.csv. Section [1] refers to a partial
match in the input, i.e. csv, txt, or xml. Users can utilize as many sections as they wish.
</p>
<pre class="screen">
alpha\.(csv|txt|xml) file:///Volumes/data/input/[1]/[0] pool="local" regex="true"
</pre>
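<p>As a worked illustration, applying the substitution rule described above
to each name that the entry above matches would give the following lookups
(LFN on the left, generated PFN on the right):
</p>
<pre class="screen">
alpha.csv  file:///Volumes/data/input/csv/alpha.csv
alpha.txt  file:///Volumes/data/input/txt/alpha.txt
alpha.xml  file:///Volumes/data/input/xml/alpha.xml
</pre>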
<p>
</p>
<p>To use Regex, the user additionally needs to set the
pegasus.catalog.replica.file property to the path of the
file-based RC.
</p>
</dd>
<dt><span class="term">Directory</span></dt>
<dd>
<p>In this mode, Pegasus does a directory listing on an input
directory to create the LFN to PFN mappings. The directory listing is
performed recursively, resulting in deep LFN mappings. For example, if an
input directory $input is specified with the following structure
</p>
<pre class="screen">
$input
$input/f.1
$input/f.2
$input/D1
$input/D1/f.3
</pre>
<p>
Pegasus will create the following LFN PFN mappings internally
</p>
<pre class="screen">
f.1 file://$input/f.1  pool="local"
f.2 file://$input/f.2  pool="local"
D1/f.3 file://$input/D1/f.3 pool="local"
</pre>
<p>
</p>
<p>If you don't want the deep LFNs to be created, you can set
pegasus.catalog.replica.directory.flat.lfn to true.
In that case, for the previous example, Pegasus will create the following
LFN PFN mappings internally.
</p>
<pre class="screen">
f.1 file://$input/f.1  pool="local"
f.2 file://$input/f.2  pool="local"
f.3 file://$input/D1/f.3 pool="local"
</pre>
<p>
</p>
<p>pegasus-plan has --input-dir option that can be used to specify an input
directory.
</p>
<p>Users can optionally specify additional properties to configure the behavior
of this implementation.
</p>
<p>pegasus.catalog.replica.directory.site specifies a site attribute other than
local to associate with the mappings.
</p>
<p>pegasus.catalog.replica.directory.url.prefix specifies a URL prefix for the PFNs
constructed. If not specified, the URL defaults to file://
</p>
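<p>For illustration, the optional properties might be combined as in the
sketch below. The site handle isi and the gsiftp URL prefix are hypothetical
values chosen for the example.
</p>
<pre class="screen">
pegasus.catalog.replica = Directory
pegasus.catalog.replica.directory.flat.lfn = true
pegasus.catalog.replica.directory.site = isi
pegasus.catalog.replica.directory.url.prefix = gsiftp://data.example.org
</pre>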
</dd>
<dt><span class="term">MRC</span></dt>
<dd>
<p>In this mode, Pegasus queries multiple replica catalogs to
discover the file locations on the grid.  To use it set
</p>
<pre class="screen">
pegasus.catalog.replica MRC
</pre>
<p>
</p>
<p>Each associated replica catalog can be configured via properties
as follows.
</p>
<p>The user associates a variable name, referred to as [value], with
each of the catalogs, where [value] is any legal identifier
(concretely [A-Za-z][_A-Za-z0-9]*). For each associated replica
catalog the user specifies the following properties.
</p>
<pre class="screen">
pegasus.catalog.replica.mrc.[value]       specifies the type of replica catalog.
pegasus.catalog.replica.mrc.[value].key   specifies a property name key for a
                                          particular catalog
</pre>
<p>
</p>
<p>For example, to query two LRCs at the same time,
a user can specify the following
</p>
<pre class="screen">
pegasus.catalog.replica.mrc.lrc1 LRC
pegasus.catalog.replica.mrc.lrc1.url rls://sukhna
pegasus.catalog.replica.mrc.lrc2 LRC
pegasus.catalog.replica.mrc.lrc2.url rls://smarty
</pre>
<p>
</p>
<p>In the above example, lrc1 and lrc2 are arbitrary valid identifier names,
and url is the property key that needs to be specified.
</p>
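<p>The same pattern should extend to catalogs of different types. The sketch
below pairs an LRC with a file-based catalog. It assumes that the key for
the File type is "file", mirroring pegasus.catalog.replica.file. The
identifiers rc1 and rc2, the host name, and the path are hypothetical.
</p>
<pre class="screen">
pegasus.catalog.replica = MRC
pegasus.catalog.replica.mrc.rc1 = LRC
pegasus.catalog.replica.mrc.rc1.url = rls://lrc.example.org
pegasus.catalog.replica.mrc.rc2 = File
pegasus.catalog.replica.mrc.rc2.file = /home/user/rc.data
</pre>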
</dd>
</dl></div>
<p>
</p>
<p></p>
</div>
<div class="section" title="10.1.6.1.2. pegasus.catalog.replica.url">
<div class="titlepage"><div><div><h5 class="title">
<a name="Propertiespegasus.catalog.replica.url"></a>10.1.6.1.2. pegasus.catalog.replica.url</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">URI string</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">(no default)</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>When using the modern RLS replica catalog, the URI to the replica
catalog must be provided to Pegasus to enable it to look up
filenames. There is no default.
</p>
</div>
<div class="section" title="10.1.6.1.3. pegasus.catalog.replica.chunk.size">
<div class="titlepage"><div><div><h5 class="title">
<a name="Propertiespegasus.catalog.replica.chunk.size"></a>10.1.6.1.3. pegasus.catalog.replica.chunk.size</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus, rc-client</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Integer</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">1000</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>The rc-client takes an input file containing the mappings upon
which to work. This property determines the number of lines that
are read in at a time and worked upon together. This allows
operations like insert and delete to happen in bulk if the
underlying replica implementation supports it.
</p>
<p></p>
</div>
<div class="section" title="10.1.6.1.4. pegasus.catalog.replica.lrc.ignore">
<div class="titlepage"><div><div><h5 class="title">
<a name="Propertiespegasus.catalog.replica.lrc.ignore"></a>10.1.6.1.4. pegasus.catalog.replica.lrc.ignore</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Replica Catalog - RLS</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">comma separated list of LRC urls</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">(no default)</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.catalog.replica.lrc.restrict</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Certain users may like to skip some LRCs while querying for the physical
locations of a file. If some LRCs need to be skipped from those found in the
RLI, then use this property. You can define either the full URL or partial
domain names that need to be skipped. For example, if a user wants
rls://smarty.isi.edu and all LRCs on usc.edu to be skipped, then the
property is set as pegasus.catalog.replica.lrc.ignore=rls://smarty.isi.edu,usc.edu
</p>
</div>
<div class="section" title="10.1.6.1.5. pegasus.catalog.replica.lrc.restrict">
<div class="titlepage"><div><div><h5 class="title">
<a name="Propertiespegasus.catalog.replica.lrc.restrict"></a>10.1.6.1.5. pegasus.catalog.replica.lrc.restrict</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Replica Catalog - RLS</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">1.3.9</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">comma separated list of LRC urls</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">(no default)</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.catalog.replica.lrc.ignore</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>This property applies a tighter restriction on the results returned
from the LRCs specified. Only those PFNs are returned that have a
pool attribute associated with them. The property "pegasus.catalog.replica.lrc.ignore"
has a higher priority than "pegasus.catalog.replica.lrc.restrict". For example, if
an LRC is specified in both properties, the LRC is ignored (i.e.
not queried at all, instead of applying a tighter restriction on the
results returned).
</p>
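<p>As an illustrative sketch of the priority rule, consider the settings
below, where rls://lrc.example.org is a hypothetical LRC. Because
rls://smarty.isi.edu appears in both lists, it would not be queried at all,
while results from rls://lrc.example.org would be restricted to PFNs that
carry a pool attribute.
</p>
<pre class="screen">
pegasus.catalog.replica.lrc.ignore = rls://smarty.isi.edu
pegasus.catalog.replica.lrc.restrict = rls://smarty.isi.edu,rls://lrc.example.org
</pre>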
</div>
<div class="section" title="10.1.6.1.6. pegasus.catalog.replica.lrc.site.[site-name]">
<div class="titlepage"><div><div><h5 class="title">
<a name="Propertiespegasus.catalog.replica.lrc.site.%5Bsite-name%5D"></a>10.1.6.1.6. pegasus.catalog.replica.lrc.site.[site-name]</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Replica Catalog - RLS</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.3.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">LRC url</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">(no default)</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>This property allows an LRC URL to be associated with a site
handle. Usually, a pool attribute is required to be associated with
the PFN for Pegasus to figure out the site on which the PFN resides.
However, in the case where an LRC is responsible for only
a single site's mappings, Pegasus can safely associate the LRC URL
with the site. This association is used to determine the pool
attribute for all mappings returned from the LRC that do
not have a pool attribute associated with them.
</p>
<p>The [site-name] in the property should be replaced by the name of
the site. For example
</p>
<pre class="screen">
pegasus.catalog.replica.lrc.site.isi  rls://lrc.isi.edu
</pre>
<p>
tells Pegasus that all PFNs returned from LRC rls://lrc.isi.edu
are associated with site isi.
</p>
<p>The [site-name] should be the same as the site handle specified in
the site catalog.
</p>
</div>
<div class="section" title="10.1.6.1.7. pegasus.catalog.replica.cache.asrc">
<div class="titlepage"><div><div><h5 class="title">
<a name="Propertiespegasus.catalog.replica.cache.asrc"></a>10.1.6.1.7. pegasus.catalog.replica.cache.asrc</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">false</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">true</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">false</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.catalog.replica</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>This property determines whether to treat the cache files specified
as supplemental replica catalogs or not. Users can pass
a comma separated list of cache files to pegasus-plan on the command line using
the --cache option. By default, the LFN-&gt;PFN mappings contained in
the cache files are treated as a cache, i.e. if an entry is found in a
cache file, the replica catalog is not queried. As a result, only
the entry specified in the cache file is available for replica
selection.
</p>
<p>Setting this property to true causes the cache files to be
treated as supplemental replica catalogs. The
mappings found in the replica catalog (as specified by
pegasus.catalog.replica) are then merged with the ones found in the
cache files. Thus, mappings for a particular LFN found in both the
cache and the replica catalog are available for replica selection.
</p>
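<p>For instance, to have the mappings from the cache files merged with those
from the replica catalog, the property could be set as sketched below.
</p>
<pre class="screen">
pegasus.catalog.replica.cache.asrc = true
</pre>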
</div>
</div>
<div class="section" title="10.1.6.2. Site Catalog">
<div class="titlepage"><div><div><h4 class="title">
<a name="PropertiesSiteCatalog"></a>10.1.6.2. Site Catalog</h4></div></div></div>
<p></p>
<p></p>
<div class="section" title="10.1.6.2.1. pegasus.catalog.site">
<div class="titlepage"><div><div><h5 class="title">
<a name="Propertiespegasus.catalog.site"></a>10.1.6.2.1. pegasus.catalog.site</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Site Catalog</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">XML4</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">XML3</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">XML4</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>The site catalog file format is now automatically detected, so there
should be no need to use the property anymore.
</p>
</div>
<div class="section" title="10.1.6.2.2. pegasus.catalog.site.file">
<div class="titlepage"><div><div><h5 class="title">
<a name="Propertiespegasus.catalog.site.file"></a>10.1.6.2.2. pegasus.catalog.site.file</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Site Catalog</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">file location string</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">${pegasus.home.sysconfdir}/sites.xml</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.catalog.site</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>Running things on the grid requires an extensive description of the
capabilities of each compute cluster, commonly termed a "site". This
property describes the location of the file that contains such a site
description. As the format is currently in flux, please refer to the
user guide and Pegasus for details on which format is expected.
</p>
</div>
</div>
<div class="section" title="10.1.6.3. Transformation Catalog">
<div class="titlepage"><div><div><h4 class="title">
<a name="PropertiesTransformationCatalog"></a>10.1.6.3. Transformation Catalog</h4></div></div></div>
<p></p>
<p></p>
<div class="section" title="10.1.6.3.1. pegasus.catalog.transformation">
<div class="titlepage"><div><div><h5 class="title">
<a name="Propertiespegasus.catalog.transformation"></a>10.1.6.3.1. pegasus.catalog.transformation</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Transformation Catalog</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">Text</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">File</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">Text</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.catalog.transformation.file</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<div class="variablelist"><dl>
<dt><span class="term">Text</span></dt>
<dd>
<p>In this mode, a multiline file-based format is understood. The file
is read and cached in memory. Any modification, such as adding or
deleting an entry, updates the in-memory copy and hence the file
underneath. All queries are done against the memory
representation.
</p>
<p>The file sample.tc.text in the etc directory contains an example.
</p>
<p>Here is a sample textual format for a transformation catalog containing
one transformation on two sites:
</p>
<pre class="screen">
tr example::keg:1.0 {
    #specify profiles that apply for all the sites for the transformation
    #in each site entry the profile can be overridden
    profile env "APP_HOME" "/tmp/karan"
    profile env "JAVA_HOME" "/bin/app"
    site isi {
        profile env "me" "with"
        profile condor "more" "test"
        profile env "JAVA_HOME" "/bin/java.1.6"
        pfn "/path/to/keg"
        arch  "x86"
        os    "linux"
        osrelease "fc"
        osversion "4"
        type "INSTALLED"
    }
    site wind {
        profile env "me" "with"
        profile condor "more" "test"
        pfn "/path/to/keg"
        arch  "x86"
        os    "linux"
        osrelease "fc"
        osversion "4"
        type "STAGEABLE"
    }
}
</pre>
<p>
</p>
</dd>
<dt><span class="term">File</span></dt>
<dd>THIS FORMAT IS DEPRECATED AND WILL BE REMOVED IN COMING VERSIONS.
USE pegasus-tc-converter to convert the File format to the Text format.
In this mode, a file format is understood. The file is
read and cached in memory. Any modification, such as adding or
deleting an entry, updates the in-memory copy and hence the file
underneath. All queries are done against the memory
representation. The new TC file format uses 6 columns:
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">The resource ID is represented in the first column.</li>
<li class="listitem">The logical transformation uses the colonized format
ns::name:vs.</li>
<li class="listitem">The path to the application on the system</li>
<li class="listitem">The installation type is identified by one of the following
keywords - all upper case: INSTALLED, STAGEABLE.
If not specified, or <span class="command"><strong>NULL</strong></span> is used, the type
defaults to INSTALLED.</li>
<li class="listitem">The system is of the format ARCH::OS[:VER:GLIBC]. The
following arch types are understood: "INTEL32", "INTEL64",
"SPARCV7", "SPARCV9".
The following os types are understood: "LINUX", "SUNOS",
"AIX". If unset or <span class="command"><strong>NULL</strong></span>, defaults to
INTEL32::LINUX.</li>
<li class="listitem">Profiles are written in the format
NS::KEY=VALUE,KEY2=VALUE;NS2::KEY3=VALUE3
Multiple key-values for the same namespace are separated by a
comma "," and multiple namespaces are separated by a
semicolon ";". If any of your profile values contains a
comma, you must not use the namespace abbreviator.</li>
</ol></div>
</dd>
</dl></div>
<p>
</p>
<p></p>
</div>
<div class="section" title="10.1.6.3.2. pegasus.catalog.transformation.file">
<div class="titlepage"><div><div><h5 class="title">
<a name="Propertiespegasus.catalog.transformation.file"></a>10.1.6.3.2. pegasus.catalog.transformation.file</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Transformation Catalog</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">file location string</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">${pegasus.home.sysconfdir}/tc.text | ${pegasus.home.sysconfdir}/tc.data</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.catalog.transformation</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property is used to set the path to the textual transformation
catalogs of type File or Text. If the transformation catalog is of type Text,
the tc.text file is picked up from sysconfdir by default; otherwise tc.data is used.
</p>
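<p>A minimal sketch, assuming a hypothetical catalog file at
/home/user/tc.text:
</p>
<pre class="screen">
pegasus.catalog.transformation = Text
pegasus.catalog.transformation.file = /home/user/tc.text
</pre>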
<p></p>
</div>
</div>
<div class="section" title="10.1.6.4. Provenance Catalog">
<div class="titlepage"><div><div><h4 class="title">
<a name="PropertiesProvenanceCatalog"></a>10.1.6.4. Provenance Catalog</h4></div></div></div>
<p></p>
<div class="section" title="10.1.6.4.1. pegasus.catalog.provenance">
<div class="titlepage"><div><div><h5 class="title">
<a name="Propertiespegasus.catalog.provenance"></a>10.1.6.4.1. pegasus.catalog.provenance</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Provenance Tracking Catalog (PTC)</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Java class name</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">InvocationSchema</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">NXDInvSchema</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">(no default)</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.catalog.*.db.driver</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property denotes the schema that is being used to access a PTC.
The PTC is usually not a standard installation. If you use a database
backend, you most likely have a schema that supports PTCs. By default,
no PTC will be used.
</p>
<p>Currently only the InvocationSchema is available for storing the
provenance tracking records. Beware, this can become a lot of data.
The values are names of Java classes. If no absolute Java classname
is given, "org.griphyn.vdl.dbschema." is prepended. Thus, by deriving
from the DatabaseSchema API, and implementing the PTC interface,
users can provide their own classes here.
</p>
<p>Alternatively, if you use a native XML database like eXist, you can
store data using the NXDInvSchema. This will avoid using any of the
other database driver properties.
</p>
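<p>As an illustration, a properties file that enables the PTC with the
relational schema might contain the line below. This is only a sketch: the
value is one of those listed in the table above, and the database driver
settings referred to by pegasus.catalog.*.db.driver are assumed to be
configured separately.
</p>
<pre class="screen">
# sketch: store provenance tracking records using the InvocationSchema
pegasus.catalog.provenance   InvocationSchema
</pre>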
</div>
<div class="section" title="10.1.6.4.2. pegasus.catalog.provenance.refinement">
<div class="titlepage"><div><div><h5 class="title">
<a name="Propertiespegasus.catalog.provenance.refinement"></a>10.1.6.4.2. pegasus.catalog.provenance.refinement</h5></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">PASOA Provenance Store</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0.1</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Java class name</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">Pasoa</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">InMemory</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">InMemory</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.catalog.*.db.driver</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>This property turns on the logging of the refinement process that
happens inside Pegasus to the PASOA store. Not all actions are
currently captured. It is still an experimental feature.
</p>
<p>The PASOA store needs to be running on localhost on port 8080, at
https://localhost:8080/prserv-1.0
</p>
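<p>A minimal sketch of switching on refinement logging, assuming a PASOA
store is already reachable at the address given above:
</p>
<pre class="screen">
# experimental: log the planner's refinement steps to the PASOA store
pegasus.catalog.provenance.refinement   Pasoa
</pre>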
<p></p>
</div>
</div>
</div>
<div class="section" title="10.1.7. Replica Selection Properties">
<div class="titlepage"><div><div><h3 class="title">
<a name="PropertiesReplicaSelectionProperties"></a>10.1.7. Replica Selection Properties</h3></div></div></div>
<p></p>
<div class="section" title="10.1.7.1. pegasus.selector.replica">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.selector.replica"></a>10.1.7.1. pegasus.selector.replica</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Replica Selection</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">URI string</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">default</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.selector.replica.*.ignore.stagein.sites</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.selector.replica.*.prefer.stagein.sites</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Each job in the DAX may be associated with input LFNs denoting the
files that are required for the job to run. To determine the
physical replica (PFN) for an LFN, Pegasus queries the replica
catalog to get all the PFNs (replicas) associated with that LFN.
Pegasus then calls out to a replica selector to select one replica
from those returned. This property determines which
replica selector to use.
</p>
<div class="variablelist"><dl>
<dt><span class="term">Default</span></dt>
<dd>
If a PFN is found that is a file URL (starting with file:///) and has a
pool attribute matching the site handle of the site where the
compute job is to be run, then that PFN is returned.
Else, a random PFN is selected from all the PFNs that
have a pool attribute matching the site handle of the site
where the compute job is to be run.
Else, a random PFN is selected from all the PFNs.
</dd>
<dt><span class="term">Restricted</span></dt>
<dd>
<p>
This replica selector allows the user to specify good sites and
bad sites for staging in data to a particular compute site. A good
site for a compute site X is a preferred site from which
replicas should be staged to site X. If more than one
good site has a particular replica, then a random site is
selected from these preferred sites.
</p>
<p>A bad site for a compute site X is a site from which replicas
should not be staged. The reasons for not accessing a replica from a
bad site can range from the link being down to the user not having
permissions on that site's data.
</p>
<p>The good and bad sites are specified by the properties
</p>
<pre class="screen">
pegasus.selector.replica.*.prefer.stagein.sites
pegasus.selector.replica.*.ignore.stagein.sites
</pre>
<p>
</p>
<p>where the * in the property name is replaced by the name of the compute
site. A literal * in the property key is taken to mean all sites.
</p>
<p>The pegasus.selector.replica.*.prefer.stagein.sites property takes precedence
over the pegasus.selector.replica.*.ignore.stagein.sites property, i.e. if for a
site X, a site Y is specified in both the ignored and the
preferred set, then site Y is treated only as a preferred
site for site X.
</p>
</dd>
<dt><span class="term">Regex</span></dt>
<dd>
<p>
This replica selector allows the user to specify regular
expressions that are used to rank the various PFNs returned from the
Replica Catalog for a particular LFN. This replica selector selects the
highest ranked PFN, i.e. the replica with the lowest rank value.
</p>
<p>The regular expressions are assigned different ranks that determine
the order in which the expressions are employed. The rank values for
the expressions can be set in the user properties using the property
</p>
<pre class="screen">
pegasus.selector.replica.regex.rank.[value]   regex-expression
</pre>
<p>
</p>
<p>The value is an integer value that denotes the rank of an expression
with a rank value of 1 being the highest rank.
</p>
<p>Please note that before any regular expressions are applied to
the PFNs, the file URLs that don't match the preferred site are
explicitly filtered out.
</p>
</dd>
<dt><span class="term">Local</span></dt>
<dd>
This replica selector prefers replicas from the local host that
start with a file: URL scheme. It is useful when users want to
stage in files to a remote site from the submit host using the
Condor file transfer mechanism.
</dd>
</dl></div>
<p>
</p>
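<p>For illustration, the fragment below sketches a Restricted selector
setup. The site names isi, usc and ucsd are hypothetical placeholders.
Because usc appears in both the preferred and the ignored set for
compute site isi, it is treated as a preferred site only, following the
precedence rule described above, while ucsd remains ignored.
</p>
<pre class="screen">
pegasus.selector.replica                            Restricted
# for jobs running at isi, prefer replicas located at usc
pegasus.selector.replica.isi.prefer.stagein.sites   usc
# for jobs running at isi, never stage in from ucsd (usc stays preferred)
pegasus.selector.replica.isi.ignore.stagein.sites   usc,ucsd
</pre>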
<p></p>
</div>
<div class="section" title="10.1.7.2. pegasus.selector.replica.*.ignore.stagein.sites">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.selector.replica.*.ignore.stagein.sites"></a>10.1.7.2. pegasus.selector.replica.*.ignore.stagein.sites</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Replica Selection</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">comma separated list of sites</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">no default</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.selector.replica</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.selector.replica.*.prefer.stagein.sites</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>A comma separated list of storage sites from which to never stage in
data to a compute site. The property can apply to all or a single
compute site, depending on how the * in the property name is expanded.
</p>
<p>The * in the property name means all compute sites unless replaced
by a site name.
</p>
<p>For example, setting pegasus.selector.replica.*.ignore.stagein.sites to usc means
that all replicas from site usc are ignored when staging in to any compute site.
Setting pegasus.selector.replica.isi.ignore.stagein.sites to usc means that
all replicas from site usc are ignored when staging in data to site isi.
</p>
<p></p>
</div>
<div class="section" title="10.1.7.3. pegasus.selector.replica.*.prefer.stagein.sites">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.selector.replica.*.prefer.stagein.sites"></a>10.1.7.3. pegasus.selector.replica.*.prefer.stagein.sites</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Replica Selection</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">comma separated list of sites</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">no default</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.selector.replica</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.selector.replica.*.ignore.stagein.sites</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>A comma separated list of preferred storage sites from which to stage in
data to a compute site. The property can apply to all or a single
compute site, depending on how the * in the property name is expanded.
</p>
<p>The * in the property name means all compute sites unless replaced
by a site name.
</p>
<p>For example, setting pegasus.selector.replica.*.prefer.stagein.sites to usc means
that replicas from site usc are preferred when staging in to any compute site.
Setting pegasus.selector.replica.isi.prefer.stagein.sites to usc means that
replicas from site usc are preferred when staging in data to site isi.
</p>
<p></p>
</div>
<div class="section" title="10.1.7.4. pegasus.selector.replica.regex.rank.[value]">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.selector.replica.regex.rank.%5Bvalue%5D"></a>10.1.7.4. pegasus.selector.replica.regex.rank.[value]</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Replica Selection</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Regex Expression</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.3.0</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">no default</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.selector.replica</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Specifies the regex expressions to be applied on the PFNs returned
for a particular LFN.  Refer to
</p>
<pre class="screen">
http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html
</pre>
<p>
for information on how to construct a regular expression.
</p>
<p>The [value] in the property key is to be replaced by an int value
that designates the rank value for the regex expression to be
applied in the Regex replica selector.
</p>
<p>The example below indicates a preference for file URLs over
URLs referring to the GridFTP server at example.isi.edu
</p>
<pre class="screen">
pegasus.selector.replica.regex.rank.1 file://.*
pegasus.selector.replica.regex.rank.2 gsiftp://example\.isi\.edu.*
</pre>
<p>
</p>
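<p>The rank properties are read by the Regex replica selector, so a
working setup presumably also needs that selector to be chosen through
pegasus.selector.replica. A sketch of such a fragment, reusing the
expressions from the example above:
</p>
<pre class="screen">
pegasus.selector.replica                Regex
# rank 1 is the highest rank: prefer file URLs
pegasus.selector.replica.regex.rank.1   file://.*
# fall back to the GridFTP server at example.isi.edu
pegasus.selector.replica.regex.rank.2   gsiftp://example\.isi\.edu.*
</pre>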
<p></p>
</div>
</div>
<div class="section" title="10.1.8. Site Selection Properties">
<div class="titlepage"><div><div><h3 class="title">
<a name="PropertiesSiteSelectionProperties"></a>10.1.8. Site Selection Properties</h3></div></div></div>
<p></p>
<div class="section" title="10.1.8.1. pegasus.selector.site">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.selector.site"></a>10.1.8.1. pegasus.selector.site</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">Random</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">RoundRobin</td>
</tr>
<tr>
<td align="left">Value[2]:</td>
<td align="left">NonJavaCallout</td>
</tr>
<tr>
<td align="left">Value[3]:</td>
<td align="left">Group</td>
</tr>
<tr>
<td align="left">Value[4]:</td>
<td align="left">Heft</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">Random</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.selector.site.path</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.selector.site.timeout</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.selector.site.keep.tmp</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.selector.site.env.*</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Site selection in Pegasus can be based on any of the
following strategies.
</p>
<div class="variablelist"><dl>
<dt><span class="term">Random</span></dt>
<dd>In this mode, the jobs will be randomly distributed among the
sites that can execute them.
</dd>
<dt><span class="term">RoundRobin</span></dt>
<dd>In this mode, the jobs will be assigned in a round
robin manner amongst the sites that can execute them. Since
each site cannot execute every type of job, the round robin
scheduling is done per level on a sorted list. The sorting is
on the basis of the number of jobs a particular site has been
assigned in that level so far. If a job cannot be run on the
first site in the queue (due to no matching entry in the
transformation catalog for the transformation referred to by
the job), it goes to the next one and so on. This implementation
defaults to classic round robin in the case where all the jobs
in the workflow can run on all the sites.
</dd>
<dt><span class="term">NonJavaCallout</span></dt>
<dd>
<p>In this mode, Pegasus will call out to an external site
selector. A temporary file containing the job information is
prepared and passed to the site selector as an
argument when it is invoked. The path to the site selector is
specified by setting the property pegasus.selector.site.path. The
environment variables that need to be set to run the site
selector can be specified using properties with a
pegasus.site.selector.env. prefix. The temporary file contains
information about the job that needs to be scheduled, as
key value pairs, one pair per line, with key and value
separated by a =.
</p>
<p>The following pairs are currently written to the temporary
file in the NonJavaCallout mode.
</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">

version                  </td>
<td align="left"> is the version of the site selector
API, currently 2.0.</td>
</tr>
<tr>
<td align="left">

transformation           </td>
<td align="left"> is the fully-qualified definition
identifier for the transformation (TR)
namespace::name:version. </td>
</tr>
<tr>
<td align="left">

derivation               </td>
<td align="left"> is the fully-qualified definition
identifier for the derivation (DV),
namespace::name:version. </td>
</tr>
<tr>
<td align="left">

job.level                </td>
<td align="left"> is the job's depth in the tree of the
workflow DAG. </td>
</tr>
<tr>
<td align="left">

job.id                   </td>
<td align="left"> is the job's ID, as used in the DAX
file. </td>
</tr>
<tr>
<td align="left">

resource.id              </td>
<td align="left"> is a pool handle, followed by whitespace,
followed by a gridftp server. Typically,
each gridftp server is enumerated once,
so you may have multiple occurrences of
the same site. There can be multiple
occurrences of this key. </td>
</tr>
<tr>
<td align="left">

input.lfn                </td>
<td align="left"> is an input LFN, optionally followed by
whitespace and the file size. There can be
multiple occurrences of this key, one for
each input LFN required by the job.</td>
</tr>
<tr>
<td align="left">

wf.name                  </td>
<td align="left"> is the label of the DAX, as found in the DAX's
root element.</td>
</tr>
<tr>
<td align="left">

wf.index                 </td>
<td align="left"> is the DAX index, that is incremented for
each partition in case of deferred
planning.</td>
</tr>
<tr>
<td align="left">

wf.time                  </td>
<td align="left"> is the mtime of the workflow. </td>
</tr>
<tr>
<td align="left">

wf.manager               </td>
<td align="left"> is the name of the workflow manager being
used, e.g. condor </td>
</tr>
<tr>
<td align="left">

vo.name                  </td>
<td align="left"> is the name of the virtual organization
that is running this workflow. It is
currently set to NONE </td>
</tr>
<tr>
<td align="left">

vo.group                 </td>
<td align="left"> unused at present and is set to NONE. </td>
</tr>
</tbody>
</table></div>
<p>

</p>
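<p>To make the layout concrete, a temporary file for one job might look
like the sketch below. All values are invented for illustration (the site
handles, URLs, identifiers and file size are hypothetical); only the keys
and the key=value layout come from the table above. wf.time is left out
because the exact format of its value is not described here.
</p>
<pre class="screen">
version=2.0
transformation=example::findrange:1.0
derivation=example::right:1.0
job.level=2
job.id=ID000002
resource.id=isi gsiftp://gridftp.isi.example.org
resource.id=usc gsiftp://gridftp.usc.example.org
input.lfn=f.b1 1024
input.lfn=f.b2
wf.name=example-workflow
wf.index=0
wf.manager=condor
vo.name=NONE
vo.group=NONE
</pre>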
</dd>
<dt><span class="term">Group</span></dt>
<dd>In this mode, a group of jobs will be assigned to the same
site that can execute them. Using the PEGASUS profile key
group in the DAX associates a job with a particular group. Jobs
that do not have the profile key associated with them
are put in the default group. The jobs in the
default group are handed over to the "Random" Site Selector for
scheduling.
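<p>As a sketch, a job could be tagged with a group in the DAX through a
profile element like the one below. The group name group-1 is a
hypothetical placeholder, and the element is shown on its own rather than
inside a complete job definition.
</p>
<pre class="screen">
&lt;profile namespace="pegasus" key="group"&gt;group-1&lt;/profile&gt;
</pre>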
</dd>
<dt><span class="term">Heft</span></dt>
<dd>
<p>In this mode, a version of the HEFT processor scheduling
algorithm is used to schedule jobs in the workflow to multiple
grid sites. The implementation assumes default data
communication costs when jobs are not scheduled on to the same
site. Later on this may be made more configurable.
</p>
<p>The runtime for the jobs is specified in the transformation
catalog by associating the pegasus profile key runtime with the
entries.
</p>
<p>The number of processors in a site is picked up from the
attribute idle-nodes associated with the vanilla jobmanager of
the site in the site catalog.
</p>
</dd>
</dl></div>
<p>
</p>
</div>
<div class="section" title="10.1.8.2. pegasus.selector.site.path">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.selector.site.path"></a>10.1.8.2. pegasus.selector.site.path</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Site Selector</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">String</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>If one calls out to an external site selector using the
NonJavaCallout mode, this refers to the path where the site selector
is installed. In case other strategies are used it does not need to
be set.
</p>
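<p>Putting the NonJavaCallout properties together, a configuration might
look like the sketch below. The selector path is a hypothetical
placeholder, and the timeout and keep.tmp values are arbitrary choices
from the ranges documented in this section.
</p>
<pre class="screen">
pegasus.selector.site            NonJavaCallout
# hypothetical location of the external site selector executable
pegasus.selector.site.path       /path/to/site-selector
# wait up to 120 seconds for the selector to answer (default is 60)
pegasus.selector.site.timeout    120
# keep the generated temporary input files for inspection
pegasus.selector.site.keep.tmp   always
</pre>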
</div>
<div class="section" title="10.1.8.3. pegasus.site.selector.env.*">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.site.selector.env.*"></a>10.1.8.3. pegasus.site.selector.env.*</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">1.2.3</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">String</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>This property specifies the environment variables that need to be set when calling out to the
site selector. These are the variables that the user would set if
running the site selector on the command line. The name of the
environment variable is obtained by stripping the
"pegasus.site.selector.env." prefix from the property key. The value of the
environment variable is the value of the property.
</p>
<p>For example, pegasus.site.selector.env.LD_LIBRARY_PATH /globus/lib would lead to
the site selector being called with LD_LIBRARY_PATH set to
/globus/lib.
</p>
</div>
<div class="section" title="10.1.8.4. pegasus.selector.site.timeout">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.selector.site.timeout"></a>10.1.8.4. pegasus.selector.site.timeout</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Site Selector</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">non negative integer</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">60</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>It sets the number of seconds Pegasus waits to hear back from an
external site selector using the NonJavaCallout interface before
timing out.
</p>
</div>
<div class="section" title="10.1.8.5. pegasus.selector.site.keep.tmp">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.selector.site.keep.tmp"></a>10.1.8.5. pegasus.selector.site.keep.tmp</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">onerror</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">always</td>
</tr>
<tr>
<td align="left">Value[2]:</td>
<td align="left">never</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">onerror</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>It determines whether Pegasus deletes the temporary input files that
are generated in the temp directory or not. These temporary input
files are passed as input to the external site selectors.
</p>
<p>A temporary input file is created for each job that needs to be scheduled.
</p>
</div>
</div>
<div class="section" title="10.1.9. Data Staging Configuration">
<div class="titlepage"><div><div><h3 class="title">
<a name="PropertiesDataStagingConfiguration"></a>10.1.9. Data Staging Configuration</h3></div></div></div>
<p></p>
<p></p>
<div class="section" title="10.1.9.1. pegasus.data.configuration">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.data.configuration"></a>10.1.9.1. pegasus.data.configuration</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">4.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">sharedfs</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">nonsharedfs</td>
</tr>
<tr>
<td align="left">Value[2]:</td>
<td align="left">condorio</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">sharedfs</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property sets up Pegasus to run in different environments.
</p>
<div class="variablelist"><dl>
<dt><span class="term">sharedfs</span></dt>
<dd>If this is set, Pegasus will be set up to execute jobs on the shared
filesystem on the execution site. This assumes that the head node of a cluster
and the worker nodes share a filesystem. The staging site in this case is
the same as the execution site. Pegasus adds a create dir job to the executable
workflow that creates a workflow specific directory on the shared filesystem.
The data transfer jobs in the executable workflow ( stage_in_ , stage_inter_ ,
stage_out_ ) transfer the data to this directory. The compute jobs in the
executable workflow are launched in the directory on the shared filesystem.
Internally, if this is set, the following property is set.
<pre class="screen">
pegasus.execute.*.filesystem.local   false
</pre>
</dd>
<dt><span class="term">condorio</span></dt>
<dd>If this is set, Pegasus will be set up to run jobs in a pure Condor pool,
with the nodes not sharing a filesystem. Data is staged to the compute nodes from
the submit host using Condor File IO.
The planner is automatically set up to use the submit host ( site local ) as the
staging site. All the auxiliary jobs added by the planner to the executable
workflow ( create dir, data stagein and stage-out, cleanup ) refer to
the workflow specific directory on the local site.  The data transfer jobs in
the executable workflow ( stage_in_ , stage_inter_ , stage_out_ ) transfer the
data to this directory. When the compute jobs start, the input data for each
job is shipped from the workflow specific directory on the submit host to
the compute/worker node using Condor file IO. The output data for each job is
similarly shipped back to the submit host from the compute/worker node.
This setup is particularly helpful when running workflows in a cloud
environment where setting up a shared filesystem across the VMs may be
tricky.
When this value is loaded, the following properties are set internally.
<pre class="screen">
pegasus.transfer.sls.*.impl          Condor
pegasus.execute.*.filesystem.local   true
pegasus.gridstart                    PegasusLite
pegasus.transfer.worker.package      true
</pre>
</dd>
<dt><span class="term">nonsharedfs</span></dt>
<dd>If this is set, Pegasus will be set up to execute jobs on an execution site
without relying on a shared filesystem between the head node and the worker nodes.
You can specify a staging site ( using the --staging-site option to pegasus-plan) to
indicate the site to use as a central storage location for a workflow. The
staging site is independent of the execution sites on which a workflow executes.
All the auxiliary jobs added by the planner to the executable
workflow ( create dir, data stagein and stage-out, cleanup ) refer to
the workflow specific directory on the staging site.  The data transfer jobs in
the executable workflow ( stage_in_ , stage_inter_ , stage_out_ ) transfer the
data to this directory. When the compute jobs start, the input data for each
job is shipped from the workflow specific directory on the staging site to
the compute/worker node using pegasus-transfer. The output data for each job is
similarly shipped back to the staging site from the compute/worker node.
The protocols currently supported are SRM, GridFTP, iRods, and S3.
This setup is particularly helpful when running workflows on OSG, where
most of the execution sites don't have enough data storage. Only a few
sites have large amounts of data storage exposed that can be used to place
data during a workflow run. This setup is also helpful when running workflows
in a cloud environment where setting up a shared filesystem across the VMs may be
tricky.
When this value is loaded, the following properties are set internally.
<pre class="screen">
pegasus.execute.*.filesystem.local   true
pegasus.gridstart                    PegasusLite
pegasus.transfer.worker.package      true
</pre>
</dd>
</dl></div>
<p>
</p>
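<p>As a sketch, selecting a data configuration takes a single line in the
properties file. The example assumes a Condor pool without a shared
filesystem; the dependent properties listed under condorio above are set
internally and are not repeated here.
</p>
<pre class="screen">
# stage data between the submit host and the worker nodes with Condor file IO
pegasus.data.configuration   condorio
</pre>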
<p></p>
</div>
<div class="section" title="10.1.9.2. pegasus.transfer.bypass.input.staging">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.transfer.bypass.input.staging"></a>10.1.9.2. pegasus.transfer.bypass.input.staging     </h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">4.3</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">(no default)</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.data.configuration</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>When executing in a non shared filesystem setup, i.e. with data configuration set to nonsharedfs
or condorio, Pegasus always stages the input files through the staging site, i.e. the stage-in
job stages in data from the input site to the staging site. The PegasusLite jobs that start
up on the worker nodes then pull the input data from the staging site for each job.
</p>
<p>This property can be used to set up the PegasusLite jobs to pull input data directly
from the input site without going through the staging server. This is based on the
assumption that the worker nodes can access the input site. Users who set this to true
should be aware that access to the input site is no longer throttled ( as it is in the case
of stage-in jobs). If a large number of compute jobs start at the same time in a workflow,
the input server will see a connection from each job.
</p>
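<p>A sketch of a properties fragment that enables direct input access,
assuming the worker nodes can reach the input site and that the input
server can cope with one connection per running job:
</p>
<pre class="screen">
pegasus.data.configuration              nonsharedfs
# PegasusLite jobs pull inputs straight from the input site
pegasus.transfer.bypass.input.staging   true
</pre>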
<p></p>
</div>
</div>
<div class="section" title="10.1.10. Transfer Configuration Properties">
<div class="titlepage"><div><div><h3 class="title">
<a name="PropertiesTransferConfigurationProperties"></a>10.1.10. Transfer Configuration Properties</h3></div></div></div>
<p></p>
<div class="section" title="10.1.10.1. pegasus.transfer.*.impl">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.transfer.*.impl"></a>10.1.10.1. pegasus.transfer.*.impl</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">Transfer</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">GUC</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">Transfer</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.transfer.refiner</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Each compute job usually has data products that are required to be
staged in to the execution site, materialized data products staged
out  to a final resting place, or staged to another job running at a
different site. This property determines the underlying grid
transfer tool that is used to manage the transfers.
</p>
<p>The * in the property name can be replaced to achieve finer grained
control to dictate what type of transfer jobs need to be managed
with which grid transfer tool.
</p>
<p>Usually, the arguments with which the client is invoked can be
specified by
</p>
<pre class="screen">
- the property pegasus.transfer.arguments
- associating the PEGASUS profile key transfer.arguments
</pre>
<p>
</p>
<p>The table below illustrates all the possible variations of the
property.
</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">

Property Name	         </td>
<td align="left"> Applies to </td>
</tr>
<tr>
<td align="left">

pegasus.transfer.stagein.impl   </td>
<td align="left"> the stage in transfer jobs</td>
</tr>
<tr>
<td align="left">

pegasus.transfer.stageout.impl  </td>
<td align="left"> the stage out transfer jobs</td>
</tr>
<tr>
<td align="left">

pegasus.transfer.inter.impl     </td>
<td align="left"> the inter pool transfer jobs </td>
</tr>
<tr>
<td align="left">

pegasus.transfer.setup.impl     </td>
<td align="left"> the setup transfer job</td>
</tr>
<tr>
<td align="left">

pegasus.transfer.*.impl         </td>
<td align="left"> applies to all types of transfer jobs </td>
</tr>
</tbody>
</table></div>
<p>

</p>
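<p>For example, the per-type variants from the table above could be used
to pick a different client for each kind of transfer job. The assignment
below is an arbitrary illustration, not a recommendation.
</p>
<pre class="screen">
# stage-in jobs go through pegasus-transfer
pegasus.transfer.stagein.impl    Transfer
# stage-out jobs use the guc client
pegasus.transfer.stageout.impl   GUC
</pre>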
<p>Note: Since version 2.2.0 the worker package is staged automatically during
staging of executables to the remote site. This is achieved
by adding a setup transfer job to the workflow. The setup transfer job by
default uses GUC to stage the data. The implementation to use can be
configured by setting the property
</p>
<pre class="screen">pegasus.transfer.setup.impl </pre>
<p>However, if you have pegasus.transfer.*.impl set in your properties file,
then you need to set pegasus.transfer.setup.impl to GUC.
</p>
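<p>A sketch of the situation described in the note, where the wildcard
property is set and the setup transfer job is therefore configured
explicitly:
</p>
<pre class="screen">
pegasus.transfer.*.impl       Transfer
# set explicitly because the wildcard property above is present
pegasus.transfer.setup.impl   GUC
</pre>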
<p>The various grid transfer tools that can be used to manage data
transfers are explained below
</p>
<div class="variablelist"><dl>
<dt><span class="term">Transfer</span></dt>
<dd>
<p>This results in pegasus-transfer being used for transferring files. It
is a Python based wrapper around various transfer clients like
globus-url-copy, lcg-copy, wget, cp, ln. pegasus-transfer looks at the
source and destination URLs and automatically figures out which underlying
client to use. pegasus-transfer is distributed with Pegasus and can
be found at $PEGASUS_HOME/bin/pegasus-transfer.
</p>
<p>For remote sites, Pegasus constructs the default path to pegasus-transfer
on the basis of PEGASUS_HOME env profile specified in the site catalog.
To specify a different path to the pegasus-transfer client , users can
add an entry into the transformation catalog with fully qualified logical
name as pegasus::pegasus-transfer
</p>
</dd>
<dt><span class="term">GUC</span></dt>
<dd>This refers to the new guc client that does multiple file
transfers per invocation. The globus-url-copy client
distributed with Globus 4.x is compatible with this mode.
</dd>
</dl></div>
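<p>As a minimal sketch, assuming a properties file in the same key and value
layout used elsewhere in this chapter, the fragment below selects
pegasus-transfer for all types of transfer jobs while keeping GUC for the
setup transfer job. The chosen values are illustrative only.
</p>
<pre class="screen">
# use pegasus-transfer for every type of transfer job
pegasus.transfer.*.impl         Transfer
# set explicitly, because pegasus.transfer.*.impl is set above
pegasus.transfer.setup.impl     GUC
</pre>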
<p>
</p>
<p></p>
</div>
<div class="section" title="10.1.10.2. pegasus.transfer.refiner">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.transfer.refiner"></a>10.1.10.2. pegasus.transfer.refiner</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">Basic</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">Cluster</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">Cluster</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.transfer.*.impl</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property determines how the transfer nodes are added to the
workflow. The various refiners differ in how they link the
transfer jobs, and in the number of transfer jobs that are
created per compute job.
</p>
<div class="variablelist"><dl>
<dt><span class="term">Basic</span></dt>
<dd>This is a basic refinement strategy that adds one stage-in
job and one stage-out job per compute job. Using it is not
recommended, especially for large workflows, where a large number
of stage-in jobs may be created. It is only recommended
for experimental setups.
</dd>
<dt><span class="term">Cluster</span></dt>
<dd><p>In this refinement strategy, clusters of stage-in and stage-out jobs
are created per level of the workflow. This refiner allows you to
control the number of stage-in and stage-out jobs by associating the pegasus
profiles stagein.clusters and stageout.clusters with the jobs, or with the
staging sites in the site catalog.
</p></dd>
</dl></div>
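<p>For illustration, the first fragment below selects the Cluster refiner in the
properties file. The second fragment is a sketch of how the stagein.clusters and
stageout.clusters pegasus profiles might be attached to a site entry in an XML
site catalog. The values 4 and 2 are arbitrary, and the exact profile element
syntax is an assumption that depends on the site catalog schema in use.
</p>
<pre class="screen">
pegasus.transfer.refiner        Cluster
</pre>
<pre class="screen">
&lt;profile namespace="pegasus" key="stagein.clusters"&gt;4&lt;/profile&gt;
&lt;profile namespace="pegasus" key="stageout.clusters"&gt;2&lt;/profile&gt;
</pre>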
<p>
</p>
<p></p>
</div>
<div class="section" title="10.1.10.3. pegasus.transfer.sls.*.impl">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.transfer.sls.*.impl"></a>10.1.10.3. pegasus.transfer.sls.*.impl</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">Transfer</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">Condor</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">Transfer</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.2.0</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.data.configuration</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.execute.*.filesystem.local</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property specifies the transfer tool to be used for
Second Level Staging (SLS) of input and output data between the
head node and worker node filesystems.
</p>
<p>Currently, the * in the property name CANNOT be replaced with a specific
transfer type to gain finer-grained control over which grid transfer tool
manages which type of SLS transfer.
</p>
<p>The grid transfer tools that can be used to manage SLS data
transfers are explained below.
</p>
<div class="variablelist"><dl>
<dt><span class="term">Transfer</span></dt>
<dd>
<p>With this setting, pegasus-transfer is used to transfer files. It
is a Python-based wrapper around various transfer clients such as
globus-url-copy, lcg-copy, wget, cp and ln. pegasus-transfer looks at the
source and destination URLs and automatically determines which underlying
client to use. pegasus-transfer is distributed with Pegasus and can
be found at $PEGASUS_HOME/bin/pegasus-transfer.
</p>
<p>For remote sites, Pegasus constructs the default path to pegasus-transfer
on the basis of the PEGASUS_HOME env profile specified in the site catalog.
To specify a different path to the pegasus-transfer client, users can
add an entry to the transformation catalog with the fully qualified logical
name pegasus::pegasus-transfer.
</p>
</dd>
<dt><span class="term">Condor</span></dt>
<dd>
<p>With this setting, the Condor file transfer mechanism is used to transfer the
input data files from the submit host directly to the worker node
directories. This is used when running in pure Condor mode or in a Condor
pool that does not have a shared filesystem between the nodes.
</p>
<p>When setting the SLS transfers to Condor, make sure that the
following properties are also set:
</p>
<pre class="screen">
pegasus.gridstart		        PegasusLite
pegasus.execute.*.filesystem.local  true
</pre>
<p>
Alternatively, you can set
</p>
<pre class="screen">
pegasus.data.configuration           condorio
</pre>
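<p>in lieu of the three properties above (pegasus.transfer.sls.*.impl and the
two properties just listed).
</p>
<p>When using pegasus.data.configuration instead, also make sure that
pegasus.gridstart is not set.
</p>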
<p>Please refer to the section on "Condor Pool Without a Shared Filesystem"
in the chapter on Planning and Submitting.
</p>
</dd>
</dl></div>
<p>
</p>
<p></p>
</div>
<div class="section" title="10.1.10.4. pegasus.transfer.arguments">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.transfer.arguments"></a>10.1.10.4. pegasus.transfer.arguments</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">String</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">(no default)</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.transfer.sls.arguments</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This determines the extra arguments with which the transfer implementation is
invoked. The transfer executable that is invoked depends on
the transfer mode that has been selected.
The property can be overridden by associating the pegasus profile key
transfer.arguments either with the site in the site catalog or with the
corresponding transfer executable in the transformation catalog.
</p>
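<p>As a hedged example, assume the GUC implementation is selected, so that
globus-url-copy is the transfer executable. The fragment below then passes the
globus-url-copy -p option, which sets the number of parallel streams. The
value 4 is arbitrary. The arguments that are valid depend entirely on the
transfer client actually invoked, so check that client's documentation before
reusing this.
</p>
<pre class="screen">
pegasus.transfer.*.impl         GUC
# extra arguments handed to the transfer executable (illustrative value)
pegasus.transfer.arguments      -p 4
</pre>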
<p></p>
</div>
<div class="section" title="10.1.10.5. pegasus.transfer.sls.arguments">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.transfer.sls.arguments"></a>10.1.10.5. pegasus.transfer.sls.arguments</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.4</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">String</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">(no default)</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.transfer.arguments</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.transfer.sls.*.impl</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This determines the extra arguments with which the SLS transfer
implementation is invoked. The transfer executable that is invoked
depends on the SLS transfer implementation that has been selected.
</p>
<p></p>
</div>
<div class="section" title="10.1.10.6. pegasus.transfer.stage.sls.file">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.transfer.stage.sls.file"></a>10.1.10.6. pegasus.transfer.stage.sls.file</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">3.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">(no default)</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.gridstart</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.execute.*.filesystem.local</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>For executing jobs on the local filesystem, Pegasus creates an SLS file for
each compute job. These SLS files list the files that need to be
staged to the worker node and the output files that need to be pushed out
from the worker node after completion of the job. By default, Pegasus
stages these SLS files to the shared filesystem on the head node as part of
the first level data stage-in jobs. However, when there is no
shared filesystem between the head nodes and the worker nodes, the user can set
this property to false. The SLS files are then transferred
from the submit host using Condor file transfer.
</p>
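<p>A minimal sketch of the setting described above, for a site without a
shared filesystem between the head node and the worker nodes:
</p>
<pre class="screen">
# have the SLS files shipped from the submit host via Condor file transfer
pegasus.transfer.stage.sls.file     false
</pre>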
<p></p>
</div>
<div class="section" title="10.1.10.7. pegasus.transfer.worker.package">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.transfer.worker.package"></a>10.1.10.7. pegasus.transfer.worker.package</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">false</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">3.0</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.data.configuration</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>By default, Pegasus expects the worker package to be installed in a directory
accessible to the worker nodes on the remote sites. Pegasus uses the value of the
PEGASUS_HOME environment profile in the site catalog for the remote sites to
construct paths to Pegasus auxiliary executables like kickstart, pegasus-transfer,
seqexec etc.
</p>
<p>If the Pegasus worker package is not installed on the remote sites,
users can set this property to true to have Pegasus deploy the worker package on the
nodes.
</p>
<p>In a sharedfs setup, the worker package is deployed to the shared scratch
directory for the workflow, which is accessible to all the compute nodes of the
remote sites.
</p>
<p>When running in nonsharedfs environments, the worker package is first brought to the
submit directory and then transferred to the worker node filesystem using Condor
file IO.
</p>
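<p>For example, when the remote sites have no Pegasus worker package
installed, the property could be set as in the sketch below:
</p>
<pre class="screen">
# ask Pegasus to deploy the worker package as part of the workflow
pegasus.transfer.worker.package     true
</pre>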
<p></p>
</div>
<div class="section" title="10.1.10.8. pegasus.transfer.links">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.transfer.links"></a>10.1.10.8. pegasus.transfer.links</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">false</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.transfer</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property takes effect when it is set to true and the transfer
implementation is set to Transfer, i.e. the transfer executable distributed
with Pegasus is used.
In that case, if Pegasus, while fetching data from the
Replica Catalog, sees a pool attribute associated with the PFN that matches
the execution pool to which the data has to be transferred,
it replaces the URL returned by the Replica Catalog with
a file-based URL. This is based on the assumption that if the pools match, the
filesystem where the input data resides is visible from the remote
execution directory.
On seeing that both the source and destination URLs are file-based URLs,
the transfer executable spawns a job that creates a symbolic link
by calling ln -s on the remote pool.
</p>
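<p>A minimal sketch of the configuration this property relies on is shown
below. The URLs in the comments are hypothetical and only illustrate the kind
of rewrite described above. They are not output produced by Pegasus.
</p>
<pre class="screen">
pegasus.transfer.*.impl     Transfer
pegasus.transfer.links      true

# hypothetical illustration, for a replica whose pool attribute matches
# the execution pool:
#   URL from the Replica Catalog : gsiftp://host.example.org/data/f.a
#   file-based URL used instead  : file:///data/f.a
</pre>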
<p></p>
</div>
<div class="section" title="10.1.10.9. pegasus.transfer.*.remote.sites">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.transfer.*.remote.sites"></a>10.1.10.9. pegasus.transfer.*.remote.sites</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">comma separated list of sites</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">no default</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>By default, Pegasus looks at the source and destination URLs to determine
whether the associated transfer job runs on the submit host or on the head node
of a remote site, with a preference for running the transfer job on the submit
host.
</p>
<p>Pegasus will run transfer jobs on the remote sites
</p>
<pre class="screen">
-  if the file server for the compute site is a file server, i.e. has the url prefix file://
-  if symlink jobs need to be added, since symlink transfer jobs must
be run remotely.
</pre>
<p>
</p>
<p>This property can be used to change the default behaviour of Pegasus and force Pegasus
to run the different types of transfer jobs for the specified sites on the remote site.
</p>
<p>The table below illustrates all the possible variations of the
property.
</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">

Property Name			    </td>
<td align="left"> Applies to </td>
</tr>
<tr>
<td align="left">

pegasus.transfer.stagein.remote.sites  </td>
<td align="left"> the stage in transfer jobs</td>
</tr>
<tr>
<td align="left">

pegasus.transfer.stageout.remote.sites </td>
<td align="left"> the stage out transfer jobs</td>
</tr>
<tr>
<td align="left">

pegasus.transfer.inter.remote.sites    </td>
<td align="left"> the inter pool transfer jobs </td>
</tr>
<tr>
<td align="left">

pegasus.transfer.*.remote.sites        </td>
<td align="left"> applies to all types of transfer jobs </td>
</tr>
</tbody>
</table></div>
<p>

</p>
<p>In addition, * can be specified as the property value to designate
that the property applies to all sites.
</p>
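<p>The two fragments below are independent, illustrative alternatives. The
site names siteA and siteB are hypothetical placeholders for site handles
defined in your site catalog.
</p>
<pre class="screen">
# run stage-in transfer jobs remotely for two specific sites
pegasus.transfer.stagein.remote.sites     siteA,siteB
</pre>
<pre class="screen">
# run all types of transfer jobs remotely for all sites
pegasus.transfer.*.remote.sites           *
</pre>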
</div>
<div class="section" title="10.1.10.10. pegasus.transfer.staging.delimiter">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.transfer.staging.delimiter"></a>10.1.10.10. pegasus.transfer.staging.delimiter</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">String</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">:</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.transformation.selector</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>Pegasus supports executable staging as part of the
workflow. Currently, only statically linked executables can be
staged. An executable is normally staged to the work
directory for the workflow/partition on the remote site. The
basename of the staged executable is derived from the namespace, name
and version of the transformation in the transformation
catalog. This property sets the delimiter that is used when
constructing the name of the staged executable.
</p>
<p></p>
</div>
<div class="section" title="10.1.10.11. pegasus.transfer.disable.chmod.sites">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.transfer.disable.chmod.sites"></a>10.1.10.11. pegasus.transfer.disable.chmod.sites</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">comma separated list of sites</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">no default</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>During staging of executables to remote sites, chmod jobs are
added to the workflow. These jobs run on the remote sites and do a
chmod on the staged executable. For some sites, this may not be
required. The permissions might be preserved, or there may be an
automatic mechanism that sets them.
</p>
<p>This property allows you to specify the list of sites where you do
not want the chmod jobs to be executed. For those sites, the chmod
jobs are replaced by NoOP jobs. The NoOP jobs are not executed by
Condor. Instead, they immediately have a terminate event written
to the job log file and are removed from the queue.
</p>
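<p>As an illustration, the fragment below disables the chmod jobs for two
sites. The names siteA and siteB are hypothetical placeholders for site
handles from your site catalog.
</p>
<pre class="screen">
# chmod jobs for these sites are replaced by NoOP jobs
pegasus.transfer.disable.chmod.sites     siteA,siteB
</pre>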
<p></p>
</div>
<div class="section" title="10.1.10.12. pegasus.transfer.setup.source.base.url">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.transfer.setup.source.base.url"></a>10.1.10.12. pegasus.transfer.setup.source.base.url </h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">URL</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">no default</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.3</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property specifies the base URL to the directory containing the
Pegasus worker package builds. During staging of executables, the
Pegasus worker package is also staged to the remote site. By default, the
worker packages are pulled from the http server at pegasus.isi.edu.
This property can be used to override the location from which the worker
packages are staged. This may be required if the remote compute sites don't
allow file transfers from an http server.
</p>
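<p>A sketch of such an override is shown below. The host name and path are
hypothetical. They stand in for a server that you maintain and that the
remote compute sites are allowed to reach.
</p>
<pre class="screen">
# hypothetical mirror holding the worker package builds
pegasus.transfer.setup.source.base.url     gsiftp://staging.example.org/pegasus/worker-packages
</pre>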
</div>
</div>
<div class="section" title="10.1.11. Gridstart And Exitcode Properties">
<div class="titlepage"><div><div><h3 class="title">
<a name="PropertiesGridstartAndExitcodeProperties"></a>10.1.11. Gridstart And Exitcode Properties</h3></div></div></div>
<p></p>
<p></p>
<div class="section" title="10.1.11.1. pegasus.gridstart">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.gridstart"></a>10.1.11.1. pegasus.gridstart</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">Kickstart</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">None</td>
</tr>
<tr>
<td align="left">Value[2]:</td>
<td align="left">PegasusLite</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">Kickstart</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.execute.*.filesystem.local</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Jobs that are launched on the grid may be wrapped in a wrapper
executable or script that captures information about the
execution, the resource consumption and, most importantly, the
exitcode of the remote application.
At present, a job scheduled on a remote site is launched with a
gridstart if the site catalog has the corresponding gridlaunch attribute
set and the job being launched is not an MPI job.
</p>
<p>Users can explicitly decide which gridstart to use for a job by
associating the pegasus profile key named gridstart with the job.
</p>
<div class="variablelist"><dl>
<dt><span class="term">Kickstart</span></dt>
<dd>In this mode, all the jobs are launched via kickstart. The
kickstart executable is a light-weight program
which connects the stdin, stdout and stderr filehandles for
Pegasus jobs on the remote site. Kickstart is an executable
distributed with Pegasus that can generally be found at
${pegasus.home.bin}/kickstart.
</dd>
<dt><span class="term">None</span></dt>
<dd>In this mode, all the jobs are launched directly on
the remote site. Each job's stdin, stdout and stderr are
connected to Condor commands in a manner that ensures they are
sent back to the submit host.
</dd>
<dt><span class="term">PegasusLite</span></dt>
<dd>In this mode, the compute jobs are wrapped by PegasusLite instances.
A PegasusLite instance is a bash script that is launched on the compute node.
It determines at runtime the directory a job needs to execute in, pulls in data
from the staging site, launches the job, pushes out the data and cleans up the
directory after execution.
</dd>
</dl></div>
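<p>For illustration, the first fragment below sets the gridstart for the whole
workflow in the properties file. The second is a sketch of the per-job
override through the pegasus profile key gridstart. Its XML profile element
syntax is an assumption that depends on the catalog or DAX schema in use.
</p>
<pre class="screen">
pegasus.gridstart     PegasusLite
</pre>
<pre class="screen">
&lt;profile namespace="pegasus" key="gridstart"&gt;None&lt;/profile&gt;
</pre>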
<p>
</p>
<p></p>
</div>
<div class="section" title="10.1.11.2. pegasus.gridstart.kickstart.set.xbit">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.gridstart.kickstart.set.xbit"></a>10.1.11.2. pegasus.gridstart.kickstart.set.xbit</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.4</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">false</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.transfer.disable.chmod.sites</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Kickstart has an option to set the x bit on an executable before it
launches it on the remote site. When executables are staged,
chmod jobs are launched by default to set the x bit of the user
executables staged to a remote site.
</p>
<p>When this property is set to true, the kickstart gridstart module adds the
-X option to the kickstart arguments. The -X argument tells kickstart
to set the x bit of the executable before launching it.
</p>
<p>Users who set this property to true should usually also disable the chmod
jobs by setting the property pegasus.transfer.disable.chmod.sites.
</p>
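<p>The combination described above might look like the following sketch. The
site names siteA and siteB are hypothetical placeholders for site handles from
your site catalog.
</p>
<pre class="screen">
# let kickstart set the x bit itself (adds -X to the kickstart arguments)
pegasus.gridstart.kickstart.set.xbit     true
# the separate chmod jobs are then no longer needed on these sites
pegasus.transfer.disable.chmod.sites     siteA,siteB
</pre>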
<p></p>
</div>
<div class="section" title="10.1.11.3. pegasus.gridstart.kickstart.stat">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.gridstart.kickstart.stat"></a>10.1.11.3. pegasus.gridstart.kickstart.stat</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.1</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">false</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.gridstart.generate.lof</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Kickstart has an option to stat the input files and the output
files. The stat information is collected in the XML record generated
by kickstart. Since stat is an expensive operation, it is not turned
on by default. Set this property to true if you want to see stat
information for the input files and output files of a job in its
kickstart output.
</p>
<p></p>
</div>
<div class="section" title="10.1.11.4. pegasus.gridstart.generate.lof">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.gridstart.generate.lof"></a>10.1.11.4. pegasus.gridstart.generate.lof</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.1</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">false</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.gridstart.kickstart.stat</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>For the stat option of kickstart, two lof (list of
filenames) files are generated for each job: one containing the input
LFNs for the job, and the other containing the output LFNs for the
job.
In some cases, it may be beneficial to have these lof files generated
without doing the actual stat. This property allows you to generate the
lof files without triggering the stat in kickstart invocations.
</p>
<p></p>
</div>
<div class="section" title="10.1.11.5. pegasus.gridstart.invoke.always">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.gridstart.invoke.always"></a>10.1.11.5. pegasus.gridstart.invoke.always</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">false</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.gridstart.invoke.length</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Condor limits the length of the arguments to
an executable to 4K. To get around this limit, you can have
Kickstart invoked with the -I option. In this case, an
arguments file is prepared per job and transferred to the remote
end via the Condor file transfer mechanism. This way, the arguments
to the executable are not specified in the Condor submit file for
the job. This property specifies whether you want to use the invoke
option for all jobs, or want it to be triggered only when the
argument string is determined to be longer than a certain limit.
</p>
<p></p>
</div>
<div class="section" title="10.1.11.6. pegasus.gridstart.invoke.length">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.gridstart.invoke.length"></a>10.1.11.6. pegasus.gridstart.invoke.length</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Long</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">4000</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.gridstart.invoke.always</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Gridstart is automatically invoked with the -I option if it is
determined that the length of the arguments to be specified is going
to be greater than a certain limit. By default, this limit is set to
4K. However, it can be overridden by specifying this property.
</p>
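<p>The two fragments below sketch the two ways of using the invoke
properties. They are alternatives, and the threshold value 2000 is arbitrary.
</p>
<pre class="screen">
# always pass the arguments through an arguments file (-I)
pegasus.gridstart.invoke.always     true
</pre>
<pre class="screen">
# use -I only when the argument string exceeds a lower, custom limit
pegasus.gridstart.invoke.always     false
pegasus.gridstart.invoke.length     2000
</pre>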
<p></p>
</div>
</div>
<div class="section" title="10.1.12. Interface To Condor And Condor Dagman">
<div class="titlepage"><div><div><h3 class="title">
<a name="PropertiesInterfaceToCondorAndCondorDagman"></a>10.1.12. Interface To Condor And Condor Dagman</h3></div></div></div>
<p>The Condor DAGMan facility is usually activated using the
condor_submit_dag command. However, many shapes of workflows have the
ability to either overburden the submit host, or overflow remote
gatekeeper hosts. While DAGMan provides throttles, unfortunately these
can only be supplied on the command-line. Thus, Pegasus provides a
versatile wrapper to invoke DAGMan, called pegasus-submit-dag. It can be
configured from the command-line, from user and system properties,
and by defaults.
</p>
<div class="section" title="10.1.12.1. pegasus.condor.logs.symlink">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.condor.logs.symlink"></a>10.1.12.1. pegasus.condor.logs.symlink</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Condor</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">false</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">3.0</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>Starting with the 4.2.1 release, this property defaults to false. Prior to that,
it defaulted to true.
</p>
<p>If this property is set to true, Pegasus makes the Condor
common log [dagname]-0.log in the submit directory a symlink to a
location in /tmp. Set this to true when your workflow
submit directory is on a shared filesystem, because the
common log should not be written to a shared filesystem. If you know
for sure that the workflow submit directory is not on a shared filesystem,
then the value of this property should be false.
</p>
<p></p>
</div>
<div class="section" title="10.1.12.2. pegasus.condor.arguments.quote">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.condor.arguments.quote"></a>10.1.12.2. pegasus.condor.arguments.quote</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Condor</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">true</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Old Name:</td>
<td align="left">pegasus.condor.arguments.quote</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property determines whether to apply the new Condor quoting
rules for quoting the argument string. The new argument quoting
rules appeared in the Condor 6.7.xx series. We have verified them for
version 6.7.19. If you are using an older Condor on the submit host,
set this property to false.
</p>
<p></p>
</div>
<div class="section" title="10.1.12.3. pegasus.dagman.notify">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.dagman.notify"></a>10.1.12.3. pegasus.dagman.notify</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">DAGman wrapper</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Case-insensitive enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">Complete</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">Error</td>
</tr>
<tr>
<td align="left">Value[2]:</td>
<td align="left">Never</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">Never</td>
</tr>
<tr>
<td align="left">Document:</td>
<td align="left">http://www.cs.wisc.edu/condor/manual/v6.9/condor_submit_dag.html</td>
</tr>
<tr>
<td align="left">Document:</td>
<td align="left">http://www.cs.wisc.edu/condor/manual/v6.9/condor_submit.html</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>The pegasus.dagman.notify property has been deprecated in favor of the
Pegasus notification framework. Please see the reference manual for
details on how to get workflow notifications. pegasus.dagman.notify
will be removed in an upcoming version of Pegasus.
</p>
</div>
<div class="section" title="10.1.12.4. pegasus.dagman.verbose">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.dagman.verbose"></a>10.1.12.4. pegasus.dagman.verbose</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">DAGman wrapper</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">false</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">true</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">false</td>
</tr>
<tr>
<td align="left">Document:</td>
<td align="left">http://www.cs.wisc.edu/condor/manual/v6.9/condor_submit_dag.html</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>The pegasus-submit-dag wrapper processes properties to set DAGMan
commandline arguments. If set and true, the argument activates
verbose output in case of DAGMan errors.
</p>
</div>
<div class="section" title="10.1.12.5. pegasus.dagman.[category].maxjobs">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.dagman.%5Bcategory%5D.maxjobs"></a>10.1.12.5. pegasus.dagman.[category].maxjobs</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">DAGman wrapper</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Integer</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.2</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">no default</td>
</tr>
<tr>
<td align="left">Document:</td>
<td align="left">http://vtcpc.isi.edu/pegasus/index.php/ChangeLog#Support_for_DAGMan_node_categories</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>DAGMan allows the nodes in the DAG to be grouped into
categories. Tuning parameters such as maxjobs can then be applied per
category instead of to the whole workflow. To use this
facility, users need to associate the dagman profile key named
category with their jobs. The value of the key is the category to
which the job belongs.
</p>
<p>You can then use this property to specify the maxjobs value for a
category. For example, for jobs in a category named short-running, you would set
pegasus.dagman.short-running.maxjobs
</p>
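<p>As a minimal sketch, assuming a category named short-running, the entry
in the properties file could look like the following. The limit of 2 is an
arbitrary value chosen for illustration.
</p>
<pre class="screen">
# illustrative value: allow at most 2 jobs of the short-running category at a time
pegasus.dagman.short-running.maxjobs = 2
</pre>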
</div>
</div>
<div class="section" title="10.1.13. Monitoring Properties">
<div class="titlepage"><div><div><h3 class="title">
<a name="PropertiesMonitoringProperties"></a>10.1.13. Monitoring Properties</h3></div></div></div>
<p></p>
<p></p>
<div class="section" title="10.1.13.1. pegasus.monitord.events">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.monitord.events"></a>10.1.13.1. pegasus.monitord.events</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus-monitord</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">true</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">3.0.2</td>
</tr>
<tr>
<td align="left">See Also:</td>
<td align="left">pegasus.monitord.output</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property determines whether pegasus-monitord generates log
events. If log events are disabled using this property, no bp file
or database will be created, even if the pegasus.monitord.output
property is specified.
</p>
<p></p>
</div>
<div class="section" title="10.1.13.2. pegasus.monitord.output">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.monitord.output"></a>10.1.13.2. pegasus.monitord.output</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus-monitord</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">String</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">3.0.2</td>
</tr>
<tr>
<td align="left">See Also:</td>
<td align="left">pegasus.monitord.events</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property specifies the destination for generated log events in
pegasus-monitord. By default, events are stored in a SQLite database
in the workflow directory; the database file is named after the
workflow, with a ".stampede.db" extension. Users can specify an
alternative database by using a SQLAlchemy connection
string. Details are available at:
</p>
<pre class="screen">
http://www.sqlalchemy.org/docs/05/reference/dialects/index.html
</pre>
<p>
Note that users will need to have the appropriate
database interface library installed. SQLAlchemy is a
wrapper around the database interface library (for instance, the MySQL one); it does
not provide a MySQL driver itself. The Pegasus distribution
includes both SQLAlchemy and the SQLite Python driver.
Also note that, unlike with SQLite databases, when using SQLAlchemy with
other database servers, e.g. MySQL or Postgres, the target database needs
to exist already.
Users can also specify a file name using this property in order to
create a file with the log events.
</p>
<p>Example values for the SQLAlchemy connection string for various end points
are listed below:
</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">

SQLAlchemy End Point              </td>
<td align="left"> Example Value </td>
</tr>
<tr>
<td align="left">

Netlogger BP File                 </td>
<td align="left"> file:///submit/dir/myworkflow.bp</td>
</tr>
<tr>
<td align="left">

SQLite Database                   </td>
<td align="left"> sqlite:///submit/dir/myworkflow.db</td>
</tr>
<tr>
<td align="left">

MySQL Database          	       </td>
<td align="left"> mysql://user:password@host:port/databasename</td>
</tr>
</tbody>
</table></div>
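<p>For instance, one of the example values above might be set in the properties
file as sketched below. The path, credentials, host, and database name are
placeholders, not defaults.
</p>
<pre class="screen">
# write events to a SQLite database (placeholder path)
pegasus.monitord.output = sqlite:///submit/dir/myworkflow.db

# alternatively, point at an already existing MySQL database (placeholder values)
# pegasus.monitord.output = mysql://user:password@host:port/databasename
</pre>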
<p>

</p>
<p></p>
</div>
<div class="section" title="10.1.13.3. pegasus.dashboard.output">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.dashboard.output"></a>10.1.13.3. pegasus.dashboard.output </h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus-monitord</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">String</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">4.2</td>
</tr>
<tr>
<td align="left">See Also:</td>
<td align="left">pegasus.monitord.output</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property specifies the destination for the workflow dashboard database.
By default, the workflow dashboard database is a SQLite database
named workflow.db in the $HOME/.pegasus directory. This database is shared
by all workflows run as a particular user.
Users can specify an alternative database by using a SQLAlchemy connection
string. Details are available at:
</p>
<pre class="screen">
http://www.sqlalchemy.org/docs/05/reference/dialects/index.html
</pre>
<p>
Note that users will need to have the appropriate
database interface library installed. SQLAlchemy is a
wrapper around the database interface library (for instance, the MySQL one); it does
not provide a MySQL driver itself. The Pegasus distribution
includes both SQLAlchemy and the SQLite Python driver.
Also note that, unlike with SQLite databases, when using SQLAlchemy with
other database servers, e.g. MySQL or Postgres, the target database needs
to exist already.
Users can also specify a file name using this property in order to
create a file with the log events.
</p>
<p>Example values for the SQLAlchemy connection string for various end points
are listed below:
</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">

SQLAlchemy End Point              </td>
<td align="left"> Example Value </td>
</tr>
<tr>
<td align="left">

SQLite Database                   </td>
<td align="left"> sqlite:///shared/myworkflow.db</td>
</tr>
<tr>
<td align="left">

MySQL Database          	       </td>
<td align="left"> mysql://user:password@host:port/databasename</td>
</tr>
</tbody>
</table></div>
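<p>As a sketch, a dashboard database shared through a MySQL server might be
configured as follows. The credentials, host, port, and database name are
placeholders taken from the table above, and the database is assumed to
exist already.
</p>
<pre class="screen">
# placeholder connection string; replace every component with real values
pegasus.dashboard.output = mysql://user:password@host:port/databasename
</pre>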
<p>

</p>
<p></p>
</div>
<div class="section" title="10.1.13.4. pegasus.monitord.notifications">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.monitord.notifications"></a>10.1.13.4. pegasus.monitord.notifications</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus-monitord</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">true</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">3.1</td>
</tr>
<tr>
<td align="left">See Also:</td>
<td align="left">pegasus.monitord.notifications.max</td>
</tr>
<tr>
<td align="left">See Also:</td>
<td align="left">pegasus.monitord.notifications.timeout</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property determines whether pegasus-monitord processes
notifications. When notifications are enabled, pegasus-monitord will
parse the .notify file generated by pegasus-plan and will invoke
notification scripts whenever a condition matches one of the
notifications.
</p>
<p></p>
</div>
<div class="section" title="10.1.13.5. pegasus.monitord.notifications.max">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.monitord.notifications.max"></a>10.1.13.5. pegasus.monitord.notifications.max</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus-monitord</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Integer</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">10</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">3.1</td>
</tr>
<tr>
<td align="left">See Also:</td>
<td align="left">pegasus.monitord.notifications</td>
</tr>
<tr>
<td align="left">See Also:</td>
<td align="left">pegasus.monitord.notifications.timeout</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property determines how many notification scripts
pegasus-monitord will call concurrently. Upon reaching this limit,
pegasus-monitord will wait for one notification script to finish
before issuing another one. This is a way to keep the number of
processes under control at the submit host. Setting this property to
0 will disable notifications completely.
</p>
<p></p>
</div>
<div class="section" title="10.1.13.6. pegasus.monitord.notifications.timeout">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.monitord.notifications.timeout"></a>10.1.13.6. pegasus.monitord.notifications.timeout</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus-monitord</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Integer</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">0</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">3.1</td>
</tr>
<tr>
<td align="left">See Also:</td>
<td align="left">pegasus.monitord.notifications</td>
</tr>
<tr>
<td align="left">See Also:</td>
<td align="left">pegasus.monitord.notifications.max</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property determines how long pegasus-monitord lets
notification scripts run before terminating them. When this property
is set to 0 (default), pegasus-monitord will not terminate any
notification scripts, letting them run indefinitely. If some
notification scripts misbehave, they can
exhaust pegasus-monitord's notification slots (see the
pegasus.monitord.notifications.max property) and block further
notifications. In addition, users should be aware that
pegasus-monitord will not exit until all notification scripts are
finished.
</p>
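<p>The three notification properties could be combined as in the sketch below.
The values are illustrative only. This section does not state the unit of the
timeout, so confirm it before relying on a particular number.
</p>
<pre class="screen">
pegasus.monitord.notifications = true
# illustrative: at most 5 notification scripts running at once
pegasus.monitord.notifications.max = 5
# illustrative: a non-zero value lets pegasus-monitord terminate long-running scripts
pegasus.monitord.notifications.timeout = 30
</pre>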
<p></p>
</div>
<div class="section" title="10.1.13.7. pegasus.monitord.stdout.disable.parsing">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.monitord.stdout.disable.parsing"></a>10.1.13.7. pegasus.monitord.stdout.disable.parsing</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus-monitord</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">false</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">3.1.1</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>By default, pegasus-monitord parses the stdout/stderr section of the
kickstart record to populate the application's captured stdout and stderr in
the job instance table of the stampede schema. For large workflows,
this may slow down monitord, especially if the application
writes a lot of output to its stdout and stderr. This property
can be used to turn off this database population.
</p>
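<p>For a large workflow whose jobs produce a lot of output, the setting might
look like this (a sketch, not a recommendation for every workflow):
</p>
<pre class="screen">
# skip loading the applications' captured stdout/stderr into the database
pegasus.monitord.stdout.disable.parsing = true
</pre>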
<p></p>
</div>
</div>
<div class="section" title="10.1.14. Job Clustering Properties">
<div class="titlepage"><div><div><h3 class="title">
<a name="PropertiesJobClusteringProperties"></a>10.1.14. Job Clustering Properties</h3></div></div></div>
<p></p>
<div class="section" title="10.1.14.1. pegasus.clusterer.job.aggregator">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.clusterer.job.aggregator"></a>10.1.14.1. pegasus.clusterer.job.aggregator</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Job Clustering</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">String</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">seqexec</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">mpiexec</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">seqexec</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>A large number of workflows executed through the Virtual Data
System are composed of several jobs that run for only a few seconds
or so. The overhead of running any job on the grid is usually 60
seconds or more. Hence, it makes sense to collapse small independent
jobs into a larger job.
This property determines the executable that will be used for
running the larger job on the remote site.
</p>
<div class="variablelist"><dl>
<dt><span class="term">seqexec</span></dt>
<dd>In this mode, the executable used to run the merged job is
seqexec, which runs each of the smaller jobs sequentially on the
same node. The executable "seqexec" is a Pegasus tool distributed
in the Pegasus worker package, and can usually be found at
{pegasus.home}/bin/seqexec.
</dd>
<dt><span class="term">mpiexec</span></dt>
<dd>In this mode, the executable used to run the merged job is
mpiexec, which runs the smaller jobs via MPI on n nodes, where n
is the nodecount associated with the merged job. The executable
"mpiexec" is a Pegasus tool distributed in the Pegasus worker package,
and can usually be found at {pegasus.home}/bin/mpiexec.
</dd>
</dl></div>
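<p>As an illustration, selecting the MPI-based aggregator instead of the
default could be written as:
</p>
<pre class="screen">
# one of the two documented values: seqexec (default) or mpiexec
pegasus.clusterer.job.aggregator = mpiexec
</pre>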
<p>
</p>
<p></p>
</div>
<div class="section" title="10.1.14.2. pegasus.clusterer.job.aggregator.seqexec.log">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.clusterer.job.aggregator.seqexec.log"></a>10.1.14.2. pegasus.clusterer.job.aggregator.seqexec.log</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Job Clustering</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">false</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.3</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.clusterer.job.aggregator</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.clusterer.job.aggregator.seqexec.log.global</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Seqexec can log the progress of the jobs that it runs to a
progress file on the remote cluster where it is executed.
</p>
<p>This Boolean property indicates whether this logging is
turned on.
</p>
<p></p>
</div>
<div class="section" title="10.1.14.3. pegasus.clusterer.job.aggregator.seqexec.log.global">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.clusterer.job.aggregator.seqexec.log.global"></a>10.1.14.3. pegasus.clusterer.job.aggregator.seqexec.log.global</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Job Clustering</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">true</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.3</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.clusterer.job.aggregator</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.clusterer.job.aggregator.seqexec.log</td>
</tr>
<tr>
<td align="left">Old Name:</td>
<td align="left">pegasus.clusterer.job.aggregator.seqexec.hasgloballog</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Seqexec logs the progress of the jobs that it runs to a
progress file on the remote cluster where it is executed. The
progress log is useful for tracking the progress of your
computations and for remote grid debugging. The progress log file can
either be shared by multiple seqexec jobs that are running on a particular
cluster as part of the same workflow, or be kept per job.
</p>
<p>This Boolean property indicates whether to have
a single global log for all the seqexec jobs on a particular cluster
or a separate progress log per job.
</p>
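<p>A sketch of how the two seqexec logging properties might be set together,
here to get one progress log per job rather than a global one:
</p>
<pre class="screen">
# turn on seqexec progress logging
pegasus.clusterer.job.aggregator.seqexec.log = true
# false: a progress log per seqexec job instead of a single global log
pegasus.clusterer.job.aggregator.seqexec.log.global = false
</pre>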
<p></p>
</div>
<div class="section" title="10.1.14.4. pegasus.clusterer.job.aggregator.seqexec.firstjobfail">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.clusterer.job.aggregator.seqexec.firstjobfail"></a>10.1.14.4. pegasus.clusterer.job.aggregator.seqexec.firstjobfail</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Job Clustering</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">true</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.2</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.clusterer.job.aggregator</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>By default seqexec does not stop execution even if one of the
clustered jobs it is executing fails. This is because seqexec tries
to get as much work done as possible.
</p>
<p>This Boolean property indicates whether to make
seqexec stop on the first job failure it detects.
</p>
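<p>For example, to state explicitly that seqexec should stop at the first
failing job, the entry could be written as:
</p>
<pre class="screen">
# stop the clustered job as soon as one constituent job fails
pegasus.clusterer.job.aggregator.seqexec.firstjobfail = true
</pre>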
<p></p>
</div>
<div class="section" title="10.1.14.5. pegasus.clusterer.label.key">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.clusterer.label.key"></a>10.1.14.5. pegasus.clusterer.label.key</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Job Clustering</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">String</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">label</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.partitioner.label.key</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>While clustering jobs in the workflow into larger jobs, you can
optionally label your graph to control which jobs are clustered and
to which clustered job they belong. This is done using a label-based
clustering scheme, by associating a profile key in
the Pegasus namespace with the jobs in the DAX. All jobs that have the
same label value for this profile key are put in the same
clustered job.
</p>
<p>This property allows you to specify the Pegasus profile key that you
want to use for label-based clustering.
</p>
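<p>As a sketch, using a profile key other than the default could look like
this. The key name cluster_group is hypothetical and chosen only for the
example.
</p>
<pre class="screen">
# cluster_group is a hypothetical profile key name; the default key is label
pegasus.clusterer.label.key = cluster_group
</pre>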
<p></p>
</div>
</div>
<div class="section" title="10.1.15. Logging Properties">
<div class="titlepage"><div><div><h3 class="title">
<a name="PropertiesLoggingProperties"></a>10.1.15. Logging Properties</h3></div></div></div>
<p></p>
<div class="section" title="10.1.15.1. pegasus.log.manager">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.log.manager"></a>10.1.15.1. pegasus.log.manager</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">Default</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">Log4j</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">Default</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.log.manager.formatter</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property selects the logging implementation to use.
</p>
<div class="variablelist"><dl>
<dt><span class="term">Default</span></dt>
<dd>This implementation refers to the legacy Pegasus logger, which
logs directly to stdout and stderr. It does, however, have the
concept of levels, similar to log4j or syslog.
</dd>
<dt><span class="term">Log4j</span></dt>
<dd>This implementation uses Log4j to log messages. The log4j
properties can be specified in a properties file, the location of
which is specified by the property
<pre class="screen">
pegasus.log.manager.log4j.conf
</pre>
</dd>
</dl></div>
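<p>Put together, switching to the Log4j implementation might look like the
sketch below. The configuration file path is a placeholder.
</p>
<pre class="screen">
pegasus.log.manager = Log4j
# placeholder path to a log4j properties file
pegasus.log.manager.log4j.conf = /path/to/log4j.properties
</pre>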
<p>
</p>
<p></p>
</div>
<div class="section" title="10.1.15.2. pegasus.log.manager.formatter">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.log.manager.formatter"></a>10.1.15.2. pegasus.log.manager.formatter</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">Simple</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">Netlogger</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">Simple</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.log.manager</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property selects the formatter used for the log messages.
</p>
<div class="variablelist"><dl>
<dt><span class="term">Simple</span></dt>
<dd>This formats the messages in a simple format. The messages are logged as
is, with minimal formatting. Below are sample log messages in this format,
produced while ranking a DAX according to performance.
<pre class="screen">
event.pegasus.ranking dax.id se18-gda.dax  - STARTED
event.pegasus.parsing.dax dax.id se18-gda-nested.dax  - STARTED
event.pegasus.parsing.dax dax.id se18-gda-nested.dax  - FINISHED
job.id jobGDA
job.id jobGDA query.name getpredicted performace time 10.00
event.pegasus.ranking dax.id se18-gda.dax  - FINISHED
</pre>
</dd>
<dt><span class="term">Netlogger</span></dt>
<dd>
<p>This formats the messages in the Netlogger format, which is based on
key-value pairs. The Netlogger format is useful for loading the logs into a
database for later analysis. Below are sample log messages
in this format, produced while ranking a DAX according to performance.
</p>
<pre class="screen">
ts=2008-09-06T12:26:20.100502Z event=event.pegasus.ranking.start \
msgid=6bc49c1f-112e-4cdb-af54-3e0afb5d593c \
eventId=event.pegasus.ranking_8d7c0a3c-9271-4c9c-a0f2-1fb57c6394d5 \
dax.id=se18-gda.dax prog=Pegasus
ts=2008-09-06T12:26:20.100750Z event=event.pegasus.parsing.dax.start \
msgid=fed3ebdf-68e6-4711-8224-a16bb1ad2969 \
eventId=event.pegasus.parsing.dax_887134a8-39cb-40f1-b11c-b49def0c5232 \
dax.id=se18-gda-nested.dax prog=Pegasus
ts=2008-09-06T12:26:20.100894Z event=event.pegasus.parsing.dax.end \
msgid=a81e92ba-27df-451f-bb2b-b60d232ed1ad \
eventId=event.pegasus.parsing.dax_887134a8-39cb-40f1-b11c-b49def0c5232
ts=2008-09-06T12:26:20.100395Z event=event.pegasus.ranking \
msgid=4dcecb68-74fe-4fd5-aa9e-ea1cee88727d \
eventId=event.pegasus.ranking_8d7c0a3c-9271-4c9c-a0f2-1fb57c6394d5 \
job.id="jobGDA"
ts=2008-09-06T12:26:20.100395Z event=event.pegasus.ranking \
msgid=4dcecb68-74fe-4fd5-aa9e-ea1cee88727d \
eventId=event.pegasus.ranking_8d7c0a3c-9271-4c9c-a0f2-1fb57c6394d5 \
job.id="jobGDA" query.name="getpredicted performace" time="10.00"
ts=2008-09-06T12:26:20.102003Z event=event.pegasus.ranking.end \
msgid=31f50f39-efe2-47fc-9f4c-07121280cd64 \
eventId=event.pegasus.ranking_8d7c0a3c-9271-4c9c-a0f2-1fb57c6394d5
</pre>
<p>
</p>
</dd>
</dl></div>
<p>
</p>
<p></p>
</div>
<div class="section" title="10.1.15.3. pegasus.log.*">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.log.*"></a>10.1.15.3. pegasus.log.* </h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">String</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">No default</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property sets the path to the file to which all Pegasus
logging is redirected. Both stdout and stderr are logged to
the file specified.
</p>
<p></p>
</div>
<div class="section" title="10.1.15.4. pegasus.log.metrics">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.log.metrics"></a>10.1.15.4. pegasus.log.metrics</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.1.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">true</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.log.metrics.file</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property enables the logging of certain planning and workflow
metrics to a global log file. By default the file to which the
metrics are logged is ${pegasus.home}/var/pegasus.log.
</p>
<p></p>
</div>
<div class="section" title="10.1.15.5. pegasus.log.metrics.file">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.log.metrics.file"></a>10.1.15.5. pegasus.log.metrics.file</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.1.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">String</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">${pegasus.home}/var/pegasus.log</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.log.metrics</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property determines the file to which the workflow and planning
metrics are logged if enabled.
</p>
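<p>A sketch of enabling metrics logging and redirecting it to a custom file.
The path shown is a placeholder, not the default.
</p>
<pre class="screen">
pegasus.log.metrics = true
# placeholder path; if unset, ${pegasus.home}/var/pegasus.log is used
pegasus.log.metrics.file = /path/to/metrics.log
</pre>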
<p></p>
</div>
<div class="section" title="10.1.15.6. pegasus.metrics.app">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.metrics.app"></a>10.1.15.6. pegasus.metrics.app</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">4.3.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">String</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.log.metrics</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property namespace allows users to pass application level metrics
to the metrics server. The value of this property is the name of the
application.
</p>
<p>Additional application specific attributes can be passed by using the
prefix pegasus.metrics.app
</p>
<pre class="screen">
pegasus.metrics.app.[attribute-name]       attribute-value
</pre>
<p>
</p>
<p>Note: the attribute cannot be named "name". That attribute is automatically
assigned the value from pegasus.metrics.app.
</p>
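<p>As an illustration, an application name plus two extra attributes could be
passed as follows. The application name and the attribute names and values
are all hypothetical.
</p>
<pre class="screen">
# hypothetical application name
pegasus.metrics.app = myapp
# hypothetical application-specific attributes
pegasus.metrics.app.version = 1.2
pegasus.metrics.app.group = seismology
</pre>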
</div>
</div>
<div class="section" title="10.1.16. Miscellaneous Properties">
<div class="titlepage"><div><div><h3 class="title">
<a name="PropertiesMiscellaneousProperties"></a>10.1.16. Miscellaneous Properties</h3></div></div></div>
<p></p>
<p></p>
<div class="section" title="10.1.16.1. pegasus.code.generator">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.code.generator"></a>10.1.16.1. pegasus.code.generator</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">3.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">Condor</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">Shell</td>
</tr>
<tr>
<td align="left">Value[2]:</td>
<td align="left">PMC</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">Condor</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property is used to load the appropriate Code Generator to use for
writing out the executable workflow.
</p>
<div class="variablelist"><dl>
<dt><span class="term">Condor</span></dt>
<dd>
This is the default code generator for Pegasus. This generator generates
the executable workflow as a Condor DAG file and associated job submit files.
The Condor DAG file is passed as input to Condor DAGMan for job execution.
</dd>
<dt><span class="term">Shell</span></dt>
<dd>
This Code Generator generates the executable workflow as a shell script that
can be executed on the submit host. When using this code generator, all the
jobs should be mapped to the site local, i.e. specify --sites local to pegasus-plan.
</dd>
<dt><span class="term">PMC</span></dt>
<dd>
This Code Generator generates the executable workflow as a PMC task workflow.
This is useful on platforms where it is not feasible to run Condor, such
as the new XSEDE machines, for example Blue Waters.
In this mode, Pegasus will generate the executable workflow as a PMC task
workflow and a sample PBS submit script that submits this workflow.
</dd>
</dl></div>
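<p>For example, generating the executable workflow as a shell script could be
requested with the entry below. As noted above, this is meant to be combined
with --sites local on the pegasus-plan command line.
</p>
<pre class="screen">
# one of the documented values: Condor (default), Shell, or PMC
pegasus.code.generator = Shell
</pre>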
<p>
</p>
<p></p>
</div>
<div class="section" title="10.1.16.2. pegasus.register">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.register"></a>10.1.16.2. pegasus.register</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">4.1.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">true</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Pegasus creates registration jobs to register the output files in the replica
catalog. An output file is registered only if both of the following hold:
</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>a user has configured a replica catalog in the properties</p></li>
<li class="listitem"><p>the register flag for the output file in the DAX is set to true</p></li>
</ol></div>
<p>This property can be used to turn off the creation of the registration jobs
even though the files may be marked to be registered in the replica catalog.
</p>
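<p>For example, a properties file that suppresses all registration jobs,
regardless of the register flags in the DAX, might contain the following
line. This is a sketch of the setting only.
</p>
<pre class="programlisting"># Do not create registration jobs, even for files flagged for registration.
pegasus.register = false
</pre>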
<p></p>
</div>
<div class="section" title="10.1.16.3. pegasus.job.priority.assign">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.job.priority.assign"></a>10.1.16.3. pegasus.job.priority.assign</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">3.0.3</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">true</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property can be used to turn off the default level-based Condor priorities
that are assigned to jobs in the executable workflow.
</p>
<p></p>
</div>
<div class="section" title="10.1.16.4. pegasus.file.cleanup.strategy">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.file.cleanup.strategy"></a>10.1.16.4. pegasus.file.cleanup.strategy</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.2</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">InPlace</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">InPlace</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property is used to select the strategy by which the cleanup
nodes are added to the executable workflow.
</p>
<div class="variablelist"><dl>
<dt><span class="term">InPlace</span></dt>
<dd>
This is the only mode available.
</dd>
</dl></div>
<p>
</p>
<p></p>
</div>
<div class="section" title="10.1.16.5. pegasus.file.cleanup.impl">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.file.cleanup.impl"></a>10.1.16.5. pegasus.file.cleanup.impl</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.2</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">Cleanup</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">RM</td>
</tr>
<tr>
<td align="left">Value[2]:</td>
<td align="left">S3</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">Cleanup</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>This property is used to select the executable that is used to
delete files on the compute sites.
</p>
<div class="variablelist"><dl>
<dt><span class="term">Cleanup</span></dt>
<dd>
The default executable used to delete files is the
dirmanager executable shipped with Pegasus. It is found at
$PEGASUS_HOME/bin/dirmanager in the Pegasus distribution.
For this mode to work, either an entry for the transformation
pegasus::dirmanager must exist in the Transformation Catalog, or the
PEGASUS_HOME environment variable must be specified in the site catalog
for the sites.
</dd>
<dt><span class="term">RM</span></dt>
<dd>
This mode causes the rm executable to be used to delete files
from remote directories. The rm executable is standard on *nix
systems and is usually found at /bin/rm.
</dd>
<dt><span class="term">S3</span></dt>
<dd>
This mode is used to delete files/objects from the buckets in S3
instead of a directory. This should be set when running workflows
on Amazon EC2. This implementation relies on the s3cmd command line
client to create the bucket. An entry for the transformation
amazon::s3cmd needs to exist in the Transformation Catalog for
this to work.
</dd>
</dl></div>
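<p>As an illustrative sketch, a site setup where plain rm is preferred for
deleting files could combine the cleanup properties as follows.
</p>
<pre class="programlisting"># InPlace is the only strategy listed above, so this line is optional.
pegasus.file.cleanup.strategy = InPlace
# Delete files with the rm executable instead of the default Cleanup mode.
pegasus.file.cleanup.impl = RM
</pre>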
<p>
</p>
<p></p>
</div>
<div class="section" title="10.1.16.6. pegasus.file.cleanup.clusters.num">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.file.cleanup.clusters.num"></a>10.1.16.6. pegasus.file.cleanup.clusters.num</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">4.2</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Integer</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">2</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>In case of the InPlace strategy for adding the cleanup nodes to the
workflow, this property specifies the maximum number of cleanup
jobs that are added to the executable workflow on each level.
</p>
<p></p>
</div>
<div class="section" title="10.1.16.7. pegasus.file.cleanup.clusters.size">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.file.cleanup.clusters.size"></a>10.1.16.7. pegasus.file.cleanup.clusters.size</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">4.2.1</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Integer</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">2</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>In case of the InPlace strategy, this property sets the number of
cleanup jobs that get clustered into a bigger cleanup job.
This parameter is only used if pegasus.file.cleanup.clusters.num
is not set.
</p>
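<p>The two clustering properties are alternatives, so a configuration would
normally set only one of them. The sketch below uses made-up values. The
remark about how many clustered jobs result assumes that cleanup jobs on a
level are simply grouped in fixed-size batches, which the text above does
not spell out.
</p>
<pre class="programlisting"># Option A: cap the number of cleanup jobs on each workflow level.
pegasus.file.cleanup.clusters.num = 4

# Option B: leave clusters.num unset and fix the cluster size instead.
# Under the batching assumption, 20 cleanup jobs on a level would be
# merged into 4 clustered cleanup jobs of 5 each.
# pegasus.file.cleanup.clusters.size = 5
</pre>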
</div>
<div class="section" title="10.1.16.8. pegasus.file.cleanup.scope">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.file.cleanup.scope"></a>10.1.16.8. pegasus.file.cleanup.scope</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.3.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">fullahead</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">deferred</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">fullahead</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>By default in case of deferred planning InPlace file cleanup is turned OFF.
This is because the cleanup algorithm does not work across partitions.
This property can be used to turn on the cleanup in case of deferred planning.
</p>
<div class="variablelist"><dl>
<dt><span class="term">fullahead</span></dt>
<dd>
This is the default scope. The Pegasus cleanup algorithm does not work
across partitions in deferred planning. Hence cleanup is always turned
OFF when deferred planning occurs and the cleanup scope is set to fullahead.
</dd>
<dt><span class="term">deferred</span></dt>
<dd>
If the scope is set to deferred, then Pegasus will not disable file cleanup
in case of deferred planning. This is useful for scenarios where the
partitions themselves are independent (i.e. do not share files). Even if
the scope is set to deferred, users can turn off cleanup by specifying
the --nocleanup option to pegasus-plan.
</dd>
</dl></div>
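<p>A workflow whose partitions are known not to share files could therefore
keep cleanup enabled under deferred planning with a setting along these
lines (a sketch of the property only):
</p>
<pre class="programlisting"># Keep InPlace cleanup enabled even when deferred planning is used.
# Only safe if the partitions are independent of each other.
pegasus.file.cleanup.scope = deferred
</pre>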
<p>
</p>
<p></p>
</div>
<div class="section" title="10.1.16.9. pegasus.catalog.transformation.mapper">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.catalog.transformation.mapper"></a>10.1.16.9. pegasus.catalog.transformation.mapper</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Staging of Executables</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">All</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">Installed</td>
</tr>
<tr>
<td align="left">Value[2]:</td>
<td align="left">Staged</td>
</tr>
<tr>
<td align="left">Value[3]:</td>
<td align="left">Submit</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">All</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.selector.transformation</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>Pegasus supports the transfer of statically linked executables as
part of the concrete workflow. At present, staging is supported only
for executables referred to by the compute jobs specified in
the DAX file.
Pegasus determines the source locations of the binaries from the
transformation catalog, where it searches for entries of type
STATIC_BINARY for a particular architecture type. The PFN for these
entries should be a remote URL that is accessible and valid for
globus-url-copy.
For the transfer of executables, Pegasus constructs a soft state map
that resides on top of the transformation catalog and helps in
determining the locations from which an executable can be staged to
the remote site.
</p>
<p>This property determines how that map is created.
</p>
<div class="variablelist"><dl>
<dt><span class="term">All</span></dt>
<dd>In this mode, all sources with entries of type STATIC_BINARY
for a particular transformation are considered valid sources for
the transfer of executables. This is the most general mode, and
results in the map being constructed as the cartesian
product of the matches.
</dd>
<dt><span class="term">Installed</span></dt>
<dd>In this mode, only entries that are of type INSTALLED
are used while constructing the soft state map. This results in
Pegasus never doing any transfer of executables as part of the
workflow. It always prefers the installed executables at the remote
sites.
</dd>
<dt><span class="term">Staged</span></dt>
<dd>In this mode, only entries that are of type STATIC_BINARY
are used while constructing the soft state map. This results in
the concrete workflow referring only to the staged executables,
irrespective of the fact that the executables are already
installed at the remote end.
</dd>
<dt><span class="term">Submit</span></dt>
<dd>In this mode, only entries that are of type STATIC_BINARY
and reside at the submit host (pool local) are used while
constructing the soft state map. This is especially helpful
when users want to run the latest version of their compute code,
which resides on their submit host, for computations on the
grid.
</dd>
</dl></div>
<p>
</p>
</div>
<div class="section" title="10.1.16.10. pegasus.selector.transformation">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.selector.transformation"></a>10.1.16.10. pegasus.selector.transformation</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Staging of Executables</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.0</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">enumeration</td>
</tr>
<tr>
<td align="left">Value[0]:</td>
<td align="left">Random</td>
</tr>
<tr>
<td align="left">Value[1]:</td>
<td align="left">Installed</td>
</tr>
<tr>
<td align="left">Value[2]:</td>
<td align="left">Staged</td>
</tr>
<tr>
<td align="left">Value[3]:</td>
<td align="left">Submit</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">Random</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.catalog.transformation.mapper</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>When executables are transferred, Pegasus may have several
transformations to select from when it schedules a particular
compute job at a remote site. For example, it may have the choice of
staging an executable from a particular remote pool, staging it from
the local (submit host) pool only, or using the one that is installed
on the remote site.
</p>
<p>This property determines how a transformation is selected from among
the candidate transformations. It is applied after the property
pegasus.catalog.transformation.mapper has been applied. For example,
setting pegasus.catalog.transformation.mapper to Staged and then
pegasus.selector.transformation to Installed does not work because, by
the time this property is applied, the soft state map only has entries
of type STATIC_BINARY.
</p>
<div class="variablelist"><dl>
<dt><span class="term">Random</span></dt>
<dd>In this mode, a random matching candidate transformation
is selected to be staged to the remote execution pool.
</dd>
<dt><span class="term">Installed</span></dt>
<dd>In this mode, only entries that are of type INSTALLED
are selected. This means that the concrete workflow only refers
to the transformations already preinstalled on the remote
pools.
</dd>
<dt><span class="term">Staged</span></dt>
<dd>In this mode, only entries that are of type STATIC_BINARY
are selected, ignoring the ones that are installed at the remote
site.
</dd>
<dt><span class="term">Submit</span></dt>
<dd>In this mode, only entries that are of type STATIC_BINARY
and reside at the submit host (pool local), are selected as
sources for staging the executables to the remote execution
pools.
</dd>
</dl></div>
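<p>Because the selector only sees what the mapper has left in the soft state
map, the two properties need to be set consistently. The following sketch
uses the property names from the section titles and shows one combination
that fits together: the map is restricted to STATIC_BINARY entries, and the
selector then picks those residing on the submit host.
</p>
<pre class="programlisting"># Build the soft state map from STATIC_BINARY entries only.
pegasus.catalog.transformation.mapper = Staged
# From that map, stage executables that reside on the submit host.
pegasus.selector.transformation = Submit
</pre>
<p>By contrast, pairing the Staged mapper with the Installed selector would
leave the selector with no matching candidates, since the map no longer
contains entries of type INSTALLED.
</p>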
<p>
</p>
</div>
<div class="section" title="10.1.16.11. pegasus.execute.*.filesystem.local">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.execute.*.filesystem.local"></a>10.1.16.11. pegasus.execute.*.filesystem.local</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">false</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.1.0</td>
</tr>
<tr>
<td align="left">See also:</td>
<td align="left">pegasus.data.configuration</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p></p>
<p>Normally, Pegasus transfers the data to and from a directory on the
shared filesystem on the head node of a compute site. The directory
needs to be visible to both the head node and the worker nodes for
the compute jobs to execute correctly.
</p>
<p>By setting this property to true, you can get Pegasus to execute jobs
on the worker node filesystem. In this case, when the jobs are
launched on the worker nodes, the jobs grab the input data from
the workflow specific execution directory on the compute site and
push the output data to the same directory after completion.
The transfer of data to and from the worker node directory is referred
to as Second Level Staging (SLS).
</p>
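<p>A minimal sketch of enabling worker node execution is shown below. It
assumes that the property key is entered literally as it appears in the
section title, including the asterisk.
</p>
<pre class="programlisting"># Run jobs on the worker node filesystem and use Second Level Staging
# to move data between it and the workflow execution directory.
pegasus.execute.*.filesystem.local = true
</pre>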
<p></p>
</div>
<div class="section" title="10.1.16.12. pegasus.parser.dax.preserver.linebreaks">
<div class="titlepage"><div><div><h4 class="title">
<a name="Propertiespegasus.parser.dax.preserver.linebreaks"></a>10.1.16.12. pegasus.parser.dax.preserver.linebreaks</h4></div></div></div>
<p>

</p>
<div class="informaltable"><table border="0">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td align="left">System:</td>
<td align="left">Pegasus</td>
</tr>
<tr>
<td align="left">Type:</td>
<td align="left">Boolean</td>
</tr>
<tr>
<td align="left">Default:</td>
<td align="left">false</td>
</tr>
<tr>
<td align="left">Since:</td>
<td align="left">2.2.0</td>
</tr>
</tbody>
</table></div>
<p>
</p>
<p>The DAX Parser normally does not preserve line breaks while parsing the
CDATA section that appears in the arguments section of the job element
in the DAX. On setting this to true, the DAX Parser preserves any
line breaks that appear in the CDATA section.
</p>
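<p>The sketch below illustrates the kind of input this affects. The job
id, job name, and argument values are invented for illustration, and the
fragment is abbreviated rather than a complete DAX.
</p>
<pre class="programlisting"># pegasus.properties
pegasus.parser.dax.preserver.linebreaks = true

&lt;!-- DAX fragment (hypothetical job): the line break inside the CDATA
     section is kept only when the property above is true --&gt;
&lt;job id="ID0000001" name="preprocess"&gt;
  &lt;argument&gt;&lt;![CDATA[-a first
-b second]]&gt;&lt;/argument&gt;
&lt;/job&gt;
</pre>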
</div>
</div>
</div>
<div class="section" title="10.2. Profiles">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="profiles"></a>10.2. Profiles</h2></div></div></div>
<div class="toc"><dl>
<dt><span class="section"><a href="reference.php#idp19109616">10.2.1. Profile Structure Heading</a></span></dt>
<dt><span class="section"><a href="reference.php#idp12081248">10.2.2. Profile Namespaces</a></span></dt>
<dt><span class="section"><a href="reference.php#idp12664752">10.2.3. Sources for Profiles</a></span></dt>
<dt><span class="section"><a href="reference.php#idp12932416">10.2.4. Profiles Conflict Resolution</a></span></dt>
<dt><span class="section"><a href="reference.php#idp17100208">10.2.5. Details of Profile Handling</a></span></dt>
</dl></div>
<p>The Pegasus Workflow Mapper uses the concept of profiles to
  encapsulate configurations for various aspects of dealing with the Grid
  infrastructure. Profiles provide an abstract yet uniform interface to
  specify configuration options for various layers, from planner/mapper
  behavior to remote environment settings. At various stages during the
  mapping process, profiles may be associated with the job.</p>
<p>This document describes various types of profiles, levels of
  priorities for intersecting profiles, and how to specify profiles in
  different contexts.</p>
<div class="section" title="10.2.1. Profile Structure Heading">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp19109616"></a>10.2.1. Profile Structure Heading</h3></div></div></div>
<p>All profiles are triples composed of a namespace, a name or key,
    and a value. The namespace is a simple identifier. The key has
    meaning only within its namespace, and is itself an identifier.
    There are no constraints on the contents of a value.</p>
<p>Profiles may be represented with different syntaxes in different
    contexts. However, each syntax describes the same underlying triple.</p>
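<p>As one illustration of the triple, the XML profile element used in the
DAX and the site catalog carries the namespace and key as attributes and
the value as element content. The variable name and value here are
arbitrary examples.</p>
<pre class="programlisting">&lt;!-- namespace = env, key = APP_MODE, value = batch (all illustrative) --&gt;
&lt;profile namespace="env" key="APP_MODE"&gt;batch&lt;/profile&gt;
</pre>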
</div>
<div class="section" title="10.2.2. Profile Namespaces">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp12081248"></a>10.2.2. Profile Namespaces</h3></div></div></div>
<p>Each namespace refers to a different aspect of a job's
    runtime settings. A profile's representation in the concrete
    plan (e.g. the Condor submit files) depends on its namespace. Pegasus
    supports the following namespaces for profiles:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p><span class="bold"><strong>env</strong></span> permits remote environment
        variables to be set.</p></li>
<li class="listitem"><p><span class="bold"><strong>globus</strong></span> sets Globus RSL
        parameters.</p></li>
<li class="listitem"><p><span class="bold"><strong>condor</strong></span> sets Condor
        configuration parameters for the submit file.</p></li>
<li class="listitem"><p><span class="bold"><strong>dagman</strong></span> introduces Condor DAGMan
        configuration parameters.</p></li>
<li class="listitem"><p><span class="bold"><strong>pegasus</strong></span> configures the
        behaviour of various planner/mapper components.</p></li>
</ul></div>
<div class="section" title="10.2.2.1. The env Profile Namespace">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp14749408"></a>10.2.2.1. The env Profile Namespace</h4></div></div></div>
<p>The <span class="emphasis"><em>env</em></span> namespace allows users to specify
      environment variables of remote jobs. Globus transports the environment
      variables, and ensures that they are set before the job starts.</p>
<p>The key used in conjunction with an <span class="emphasis"><em>env</em></span>
      profile denotes the name of the environment variable. The value of the
      profile becomes the value of the remote environment variable.</p>
<p>Grid jobs usually only set a minimum of environment variables by
      virtue of Globus. You cannot compare the environment variables visible
      from an interactive login with those visible to a grid job. Thus, it
      often becomes necessary to set environment variables like
      LD_LIBRARY_PATH for remote jobs.</p>
<p>If you use any of the Pegasus worker package tools like transfer
      or the rc-client, it becomes necessary to set PEGASUS_HOME and
      GLOBUS_LOCATION even for jobs that run locally.</p>
<div class="table">
<a name="idp13387488"></a><p class="title"><b>Table 10.1. Table 1: Useful Environment Settings</b></p>
<div class="table-contents"><table summary="Table 1: Useful Environment Settings" border="1">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td><span class="bold"><strong>Environment
              Variable</strong></span></td>
<td><span class="bold"><strong>Description</strong></span></td>
</tr>
<tr>
<td>PEGASUS_HOME</td>
<td>Used by auxiliary jobs created by Pegasus on both the remote
              site and the local site. Should usually be set in the Site
              Catalog for the sites.</td>
</tr>
<tr>
<td>GLOBUS_LOCATION</td>
<td>Used by auxiliary jobs created by Pegasus on both the remote
              site and the local site. Should usually be set in the Site
              Catalog for the sites.</td>
</tr>
<tr>
<td>LD_LIBRARY_PATH</td>
<td>Point this to $GLOBUS_LOCATION/lib, except you cannot use
              the dollar variable. You must use the full path. Applies to
              both local and remote jobs that use Globus components, and
              should usually be set in the Site Catalog for the sites.</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break"><p>Even though Condor and Globus both permit environment variable
      settings through their profiles, all remote environment variables must
      be set through the means of <span class="emphasis"><em>env</em></span> profiles.</p>
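<p>As an illustrative sketch, the settings from Table 1 could be written as
env profiles in the XML profile syntax used by the DAX and the site catalog.
The installation paths are placeholders. Note that LD_LIBRARY_PATH spells
out the full path rather than referring to $GLOBUS_LOCATION.</p>
<pre class="programlisting">&lt;!-- /opt/pegasus and /opt/globus are hypothetical install locations --&gt;
&lt;profile namespace="env" key="PEGASUS_HOME"&gt;/opt/pegasus&lt;/profile&gt;
&lt;profile namespace="env" key="GLOBUS_LOCATION"&gt;/opt/globus&lt;/profile&gt;
&lt;profile namespace="env" key="LD_LIBRARY_PATH"&gt;/opt/globus/lib&lt;/profile&gt;
</pre>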
</div>
<div class="section" title="10.2.2.2. The Globus Profile Namespace">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp13029504"></a>10.2.2.2. The Globus Profile Namespace</h4></div></div></div>
<p>The <span class="emphasis"><em>globus</em></span> profile namespace encapsulates
      Globus resource specification language (RSL) instructions. The RSL
      configures settings and behavior of the remote scheduling system. Some
      systems require a queue name to schedule jobs, a project name for
      accounting purposes, or a run-time estimate to schedule jobs. The Globus
      RSL addresses all these issues.</p>
<p>A key in the <span class="emphasis"><em>globus</em></span> namespace denotes the
      command name of an RSL instruction. The profile value becomes the RSL
      value. Even though Globus RSL is typically shown using parentheses
      around the instruction, the outer pair of parentheses is not necessary in
      globus profile specifications.</p>
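<p>For instance, RSL instructions that would conventionally be written as
(queue=normal)(maxwalltime=60) could be given as two globus profiles without
the parentheses. The sketch assumes the XML profile syntax of the DAX and
site catalog. The queue name and walltime value are made up.</p>
<pre class="programlisting">&lt;!-- key = RSL instruction name, content = RSL value (illustrative values) --&gt;
&lt;profile namespace="globus" key="queue"&gt;normal&lt;/profile&gt;
&lt;profile namespace="globus" key="maxwalltime"&gt;60&lt;/profile&gt;
</pre>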
<p>Table 2 shows some commonly used RSL instructions. For an
      authoritative list of all possible RSL instructions refer to the Globus
      RSL specification.</p>
<div class="table">
<a name="idp13841456"></a><p class="title"><b>Table 10.2. Table 2: Useful Globus RSL Instructions</b></p>
<div class="table-contents"><table summary="Table 2: Useful Globus RSL Instructions" border="1">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td><span class="bold"><strong>Key</strong></span></td>
<td><span class="bold"><strong>Description</strong></span></td>
</tr>
<tr>
<td>count</td>
<td>the number of times an executable is started.</td>
</tr>
<tr>
<td>jobtype</td>
<td>specifies how the job manager should start the remote
              job. While Pegasus defaults to single, use mpi when running MPI
              jobs.</td>
</tr>
<tr>
<td>maxcputime</td>
<td>the max cpu time for a single execution of a job.</td>
</tr>
<tr>
<td>maxmemory</td>
<td>the maximum memory in MB required for the job</td>
</tr>
<tr>
<td>maxtime</td>
<td>the maximum time or walltime for a single execution of a
              job.</td>
</tr>
<tr>
<td>maxwalltime</td>
<td>the maximum walltime for a single execution of a
              job.</td>
</tr>
<tr>
<td>minmemory</td>
<td>the minimum amount of memory required for this
              job</td>
</tr>
<tr>
<td>project</td>
<td>associates an account with a job at the remote
              end.</td>
</tr>
<tr>
<td>queue</td>
<td>the remote queue in which the job should be run. Used
              when the remote scheduler, such as PBS, supports queues.</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break"><p>Pegasus prevents the user from specifying certain RSL instructions
      as globus profiles, because they are either automatically generated or
      can be overridden through some different means. For instance, if you
      need to specify remote environment settings, do not use the environment
      key in the globus profiles. Use one or more env profiles instead.</p>
<div class="table">
<a name="idp14750848"></a><p class="title"><b>Table 10.3. Table 3: RSL Instructions that are not permissible</b></p>
<div class="table-contents"><table summary="Table 3: RSL Instructions that are not permissible" border="1">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td><span class="bold"><strong>Key</strong></span></td>
<td><span class="bold"><strong>Reason for
              Prohibition</strong></span></td>
</tr>
<tr>
<td>arguments</td>
<td>you specify arguments in the arguments section for a job
              in the DAX</td>
</tr>
<tr>
<td>directory</td>
<td>the site catalog and properties determine which directory
              a job will run in.</td>
</tr>
<tr>
<td>environment</td>
<td>use multiple env profiles instead</td>
</tr>
<tr>
<td>executable</td>
<td>the physical executable to be used is specified in the
              transformation catalog and is also dependent on the gridstart
              module being used. If you are launching jobs via kickstart, then
              the executable created is the path to kickstart, and the
              application executable path appears in the arguments for
              kickstart</td>
</tr>
<tr>
<td>stdin</td>
<td>you specify in the DAX for the job</td>
</tr>
<tr>
<td>stdout</td>
<td>you specify in the DAX for the job</td>
</tr>
<tr>
<td>stderr</td>
<td>you specify in the DAX for the job</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break">
</div>
<div class="section" title="10.2.2.3. The Condor Profile Namespace">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp12558496"></a>10.2.2.3. The Condor Profile Namespace</h4></div></div></div>
<p>The Condor submit file controls every detail of how and where a job
      is run. The <span class="emphasis"><em>condor</em></span> profiles make it possible to add or
      overwrite instructions in the Condor submit file.</p>
<p>The <span class="emphasis"><em>condor</em></span> namespace directly sets commands
      in the Condor submit file for the job the profile applies to. Keys in the
      <span class="emphasis"><em>condor</em></span> profile namespace denote the name of the
      Condor command. The profile value becomes the command's argument. All
      <span class="emphasis"><em>condor</em></span> profiles are translated into key=value lines
      in the Condor submit file.</p>
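<p>To make the translation concrete, the sketch below pairs two condor
profiles with the submit file lines they would be expected to produce. The
profile syntax is the XML form used in the DAX and site catalog, the memory
value is arbitrary, and the exact spacing of the generated lines is an
assumption.</p>
<pre class="programlisting">&lt;!-- condor profiles attached to a job --&gt;
&lt;profile namespace="condor" key="universe"&gt;vanilla&lt;/profile&gt;
&lt;profile namespace="condor" key="request_memory"&gt;2048&lt;/profile&gt;

# corresponding lines in the job's Condor submit file
universe = vanilla
request_memory = 2048
</pre>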
<p>Some of the common Condor commands that a user may need to specify
      are listed below. For an authoritative list, refer to the online Condor
      documentation. Note: the Pegasus Workflow Planner/Mapper by default specifies
      many Condor commands in the submit files, depending upon the job and
      where it is being run.</p>
<div class="table">
<a name="idp19147264"></a><p class="title"><b>Table 10.4. Table 4: Useful Condor Commands</b></p>
<div class="table-contents"><table summary="Table 4: Useful Condor Commands" border="1">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td><span class="bold"><strong>Key</strong></span></td>
<td><span class="bold"><strong>Description</strong></span></td>
</tr>
<tr>
<td>universe</td>
<td>Pegasus defaults to either globus or scheduler universes.
              Set to standard for compute jobs that require standard universe.
              Set to vanilla to run natively in a condor pool, or to run on
              resources grabbed via condor glidein.</td>
</tr>
<tr>
<td>periodic_release</td>
<td>is the number of times a job is released back to the queue
              if it goes to HOLD, e.g. due to Globus errors. Pegasus defaults
              to 3.</td>
</tr>
<tr>
<td>periodic_remove</td>
<td>is the number of times a job is allowed to get into HOLD
              state before being removed from the queue. Pegasus defaults to
              3.</td>
</tr>
<tr>
<td>filesystemdomain</td>
<td>Useful for Condor glide-ins to pin a job to a remote
              site.</td>
</tr>
<tr>
<td>stream_error</td>
<td>boolean to turn on the streaming of the stderr of the
              remote job back to submit host.</td>
</tr>
<tr>
<td>stream_output</td>
<td>boolean to turn on the streaming of the stdout of the
              remote job back to submit host.</td>
</tr>
<tr>
<td>priority</td>
<td>integer value to assign the priority of a job. A higher
              value means higher priority. The priorities are only applied for
              vanilla, standard, and local universe jobs. Determines the order in
              which a user's own jobs are executed.</td>
</tr>
<tr>
<td>request_cpus</td>
<td>New in Condor 7.8.0. Number of CPUs a job
              requires.</td>
</tr>
<tr>
<td>request_memory</td>
<td>New in Condor 7.8.0. Amount of memory a job
              requires.</td>
</tr>
<tr>
<td>request_disk</td>
<td>New in Condor 7.8.0. Amount of disk a job
              requires.</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break"><p>Other Condor keys that advanced users may find useful and
      that can be set by profiles are:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>should_transfer_files</p></li>
<li class="listitem"><p>transfer_output</p></li>
<li class="listitem"><p>transfer_error</p></li>
<li class="listitem"><p>whentotransferoutput</p></li>
<li class="listitem"><p>requirements</p></li>
<li class="listitem"><p>rank</p></li>
</ol></div>
<p>Pegasus prevents the user from specifying certain Condor commands
      in condor profiles, because they are automatically generated or can be
      overridden through some different means. Table 5 shows prohibited Condor
      commands.</p>
<div class="table">
<a name="idp17310256"></a><p class="title"><b>Table 10.5. Table 5: Condor commands prohibited in condor profiles</b></p>
<div class="table-contents"><table summary="Table 5: Condor commands prohibited in condor profiles" border="1">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td><span class="bold"><strong>Key</strong></span></td>
<td><span class="bold"><strong>Reason for
              Prohibition</strong></span></td>
</tr>
<tr>
<td>arguments</td>
<td>you specify arguments in the arguments section for a job
              in the DAX</td>
</tr>
<tr>
<td>environment</td>
<td>use multiple env profiles instead</td>
</tr>
<tr>
<td>executable</td>
<td>the physical executable to be used is specified in the
              transformation catalog and is also dependent on the gridstart
              module being used. If you are launching jobs via kickstart, then
              the executable created is the path to kickstart, and the
              application executable path appears in the arguments for
              kickstart</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break">
</div>
<div class="section" title="10.2.2.4. The Dagman Profile Namespace">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp21944384"></a>10.2.2.4. The Dagman Profile Namespace</h4></div></div></div>
<p>DAGMan is Condor's workflow manager. While planners generate most
      of DAGMan's configuration, it is possible to tweak certain job-related
      characteristics using dagman profiles. A dagman profile can be used to
      specify a DAGMan pre- or post-script.</p>
<p>Pre- and post-scripts execute on the submit machine. Both inherit
      the environment settings from the submit host when pegasus-submit-dag or
      pegasus-run is invoked.</p>
<p>By default, kickstart launches all jobs except standard universe
      and MPI jobs. Kickstart tracks the execution of the job and returns
      usage statistics for it. A DAGMan post-script runs the Pegasus
      exitcode application to determine whether the job succeeded. DAGMan
      receives the success indication as the exit status of exitcode.</p>
<p>If you need to run your own post-script, you have to take over the
      job success parsing. The planner is set up to pass the file name of the
      remote job's stdout, usually the output from kickstart, as sole argument
      to the post-script.</p>
<p>Table 6 shows the keys in the dagman profile domain that are
      understood by Pegasus and can be associated on a per-job basis.</p>
<div class="table">
<a name="idp19047568"></a><p class="title"><b>Table 10.6. Table 6: Useful dagman Commands that can be associated at a
          per job basis</b></p>
<div class="table-contents"><table summary="Table 6: Useful dagman Commands that can be associated at a
          per job basis" border="1">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td><span class="bold"><strong>Key</strong></span></td>
<td><span class="bold"><strong>Description</strong></span></td>
</tr>
<tr>
<td>PRE</td>
<td>is the path to the pre-script. DAGMan executes the
                pre-script before it runs the job.</td>
</tr>
<tr>
<td>PRE.ARGUMENTS</td>
<td>are command-line arguments for the pre-script, if
                any.</td>
</tr>
<tr>
<td>POST</td>
<td>is the postscript type/mode that a user wants to
                associate with a job. <div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p><span class="bold"><strong>pegasus-exitcode</strong></span>
                      - Pegasus associates this postscript by default with
                      all jobs launched via kickstart, as long as the POST.SCOPE
                      value is not set to NONE.</p></li>
<li class="listitem"><p><span class="bold"><strong>none</strong></span> - means that
                      no postscript is generated for the job. This is useful
                      for MPI jobs, which are currently not launched via
                      kickstart.</p></li>
<li class="listitem">
<p><span class="bold"><strong>any legal
                      identifier</strong></span> - Any identifier of the form
                      ([_A-Za-z][_A-Za-z0-9]*) other than the two reserved
                      keywords above signifies a user postscript. This allows
                      users to specify their own postscript for the jobs in
                      the workflow. The path to the postscript is
                      specified by the dagman profile <span class="bold"><strong>POST.PATH.[value]</strong></span>, where [value]
                      is the identifier given as the POST value. The user postscript
                      is passed the name of the job's .out file as the
                      last argument on the command line.</p>
<p>For example, if the following dagman profiles were
                      associated with a job X</p>
<div class="orderedlist"><ol class="orderedlist" type="a">
<li class="listitem"><p>POST with value user_script</p></li>
<li class="listitem"><p>POST.PATH.user_script with value
                          /path/to/user/script</p></li>
<li class="listitem"><p>POST.ARGUMENTS with value -verbose</p></li>
</ol></div>
<p>then the following postscript will be associated
                      with the job X in the .dag file</p>
<p>/path/to/user/script -verbose X.out</p>
<p>where X.out contains the stdout of the job X.</p>
</li>
</ol></div>
</td>
</tr>
<tr>
<td>POST.PATH.* ( where * is replaced by the value of the
                POST Profile )</td>
<td>the path to the post script on the submit host.</td>
</tr>
<tr>
<td>POST.ARGUMENTS</td>
<td>are the command line arguments for the post script, if
                any.</td>
</tr>
<tr>
<td>RETRY</td>
<td>is the number of times DAGMan retries the full job
                cycle, from pre-script through post-script, if a failure is
                detected.</td>
</tr>
<tr>
<td>CATEGORY</td>
<td>the DAGMan category the job belongs to.</td>
</tr>
<tr>
<td>PRIORITY</td>
<td>the priority to apply to a job. DAGMan uses this to
                select which jobs to release when MAXJOBS is enforced for the
                DAG.</td>
</tr>
</tbody>
</table></div>
</div>
<p><br class="table-break"></p>
<p></p>
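<p>The user postscript example in Table 6 can be sketched in Python as
      follows. This is only an illustration of how the POST, POST.PATH.* and
      POST.ARGUMENTS values combine; the function name and the plain
      dictionary of profiles are assumptions made for the sketch and are not
      part of Pegasus.</p>

```python
import re

# Reserved POST values from Table 6; any other legal identifier
# names a user postscript.
RESERVED_POST_VALUES = {"pegasus-exitcode", "none"}
IDENTIFIER = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")


def user_postscript_command(job_id, dagman_profiles):
    """Return the user postscript command line for a job, or None.

    dagman_profiles is a plain dict of dagman profile keys to values
    (a hypothetical structure used only for this sketch).
    """
    post = dagman_profiles.get("POST")
    if post is None or post in RESERVED_POST_VALUES:
        return None  # default or disabled postscript, nothing to assemble
    if IDENTIFIER.fullmatch(post) is None:
        raise ValueError("not a legal postscript identifier: %r" % post)
    # The path is looked up under POST.PATH.[value].
    parts = [dagman_profiles["POST.PATH." + post]]
    arguments = dagman_profiles.get("POST.ARGUMENTS")
    if arguments:
        parts.append(arguments)
    # The job's .out file is passed as the last argument.
    parts.append(job_id + ".out")
    return " ".join(parts)
```

<p>Called with the three profiles from the example for job X, the sketch
      yields the command line shown in Table 6.</p>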
<p>Table 7 shows the keys in the dagman profile domain that are
      understood by Pegasus and apply to the whole workflow.
      They control DAGMan's behavior at the workflow level, and
      are best specified in the properties file.</p>
<div class="table">
<a name="idp14298528"></a><p class="title"><b>Table 10.7. Table 7: Useful dagman Commands that can be specified in the
        properties file.</b></p>
<div class="table-contents"><table summary="Table 7: Useful dagman Commands that can be specified in the
        properties file." border="1">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td><span class="bold"><strong>Key</strong></span></td>
<td><span class="bold"><strong>Description</strong></span></td>
</tr>
<tr>
<td>MAXPRE</td>
<td>sets the maximum number of PRE scripts within the DAG
              that may be running at one time</td>
</tr>
<tr>
<td>MAXPOST</td>
<td>sets the maximum number of POST scripts within the DAG
              that may be running at one time</td>
</tr>
<tr>
<td>MAXJOBS</td>
<td>sets the maximum number of jobs within the DAG that will
              be submitted to Condor at one time.</td>
</tr>
<tr>
<td>MAXIDLE</td>
<td>sets the maximum number of idle jobs within the DAG that
              will be submitted to Condor at one time.</td>
</tr>
<tr>
<td>[CATEGORY-NAME].MAXJOBS</td>
<td>is the value of maxjobs for a particular category. Users
              can associate jobs with different categories on a per-job
              basis. However, the value of a dagman knob for a category can
              only be specified on a per-workflow basis in the
              properties.</td>
</tr>
<tr>
<td>POST.SCOPE</td>
<td>scope for the postscripts. <div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>If set to <span class="bold"><strong>all</strong></span>,
                    each job in the workflow has a postscript
                    associated with it.</p></li>
<li class="listitem"><p>If set to <span class="bold"><strong>none</strong></span>,
                    no job has a postscript associated with it. Use the none
                    scope if you are running vanilla, standard, or
                    local universe jobs, as in those cases Condor traps the
                    remote exitcode correctly. The none scope is not recommended
                    for grid universe jobs.</p></li>
<li class="listitem"><p>If set to <span class="bold"><strong>essential</strong></span>, only essential
                    jobs have postscripts associated with them. At present,
                    the only non-essential job is the replica registration
                    job.</p></li>
</ol></div>
</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break">
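<p>The three POST.SCOPE values in Table 7 amount to a simple per-job
      decision, sketched below in Python. The function name and the boolean
      flag are hypothetical, and lower-casing the scope value is an
      assumption of the sketch rather than documented behavior.</p>

```python
def has_postscript(scope, job_is_essential):
    """Decide whether a job gets a postscript under a POST.SCOPE value.

    job_is_essential is a hypothetical flag; per Table 7 the only
    non-essential job at present is the replica registration job.
    """
    scope = scope.lower()  # assumption: accept NONE as well as none
    if scope == "all":
        return True
    if scope == "none":
        return False
    if scope == "essential":
        return job_is_essential
    raise ValueError("unknown POST.SCOPE value: %r" % scope)
```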
</div>
<div class="section" title="10.2.2.5. The Pegasus Profile Namespace">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp11342256"></a>10.2.2.5. The Pegasus Profile Namespace</h4></div></div></div>
<p>The <span class="emphasis"><em>pegasus</em></span> profiles allow users to configure
      extra options to the Pegasus Workflow Planner that can be applied
      selectively to a job or a group of jobs. Site selectors may use a
      sub-set of <span class="emphasis"><em>pegasus</em></span> profiles for their
      decision-making.</p>
<p>Table 8 shows some of the useful configuration options that Pegasus
      understands.</p>
<div class="table">
<a name="idp15378848"></a><p class="title"><b>Table 10.8. Table 8: Useful pegasus Profiles.</b></p>
<div class="table-contents"><table summary="Table 8: Useful pegasus Profiles." border="1">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td><span class="bold"><strong>Key</strong></span></td>
<td><span class="bold"><strong>Description</strong></span></td>
</tr>
<tr>
<td>workdir</td>
<td>Sets the remote initial dir for a Condor-G job. Overrides
              the work directory algorithm that uses the site catalog and
              properties.</td>
</tr>
<tr>
<td>clusters.num</td>
<td>Please refer to the <a class="link" href="reference.php#horizontal_clustering" title="10.4.1.1.1. Horizontal Clustering">Pegasus Clustering Guide</a>
              for detailed description. This option determines the total
              number of clusters per level. Jobs are evenly spread across
              clusters.</td>
</tr>
<tr>
<td>clusters.size</td>
<td>Please refer to the <a class="link" href="reference.php#horizontal_clustering" title="10.4.1.1.1. Horizontal Clustering">Pegasus Clustering Guide</a>
              for detailed description. This profile determines the number of
              jobs in each cluster. The number of clusters depends on the
              total number of jobs on the level.</td>
</tr>
<tr>
<td>cores</td>
<td>The number of cores associated with the job. This is
              used solely for accounting purposes in the database while
              generating statistics. It corresponds to the multiplier_factor
              in the job_instance table described <a class="link" href="reference.php#stampede-schema">here</a>.</td>
</tr>
<tr>
<td>runtime</td>
<td>Please refer to the <a class="link" href="reference.php#runtime_clustering" title="10.4.1.1.2. Runtime Clustering">Pegasus Clustering Guide</a> for
              detailed description. This profile specifies the expected
              runtime of a job.</td>
</tr>
<tr>
<td>clusters.maxruntime</td>
<td>Please refer to the <a class="link" href="reference.php#runtime_clustering" title="10.4.1.1.2. Runtime Clustering">Pegasus Clustering Guide</a> for
              detailed description. This profile specifies the maximum runtime
              of a job.</td>
</tr>
<tr>
<td>job.aggregator</td>
<td>Indicates the clustering executable that is used to run
              the clustered job on the remote site.</td>
</tr>
<tr>
<td>gridstart</td>
<td>Determines the executable for launching a job. Possible
              values are <span class="bold"><strong><span class="emphasis"><em>Kickstart |
              NoGridStart</em></span></strong></span> at the moment.</td>
</tr>
<tr>
<td>gridstart.path</td>
<td>Sets the path to the gridstart. This profile is best set
              in the Site Catalog.</td>
</tr>
<tr>
<td>gridstart.arguments</td>
<td>Sets the arguments with which GridStart is used to launch
              a job on the remote site.</td>
</tr>
<tr>
<td>stagein.clusters</td>
<td>This key determines the maximum number of
              <span class="emphasis"><em>stage-in</em></span> jobs that can be executed locally
              or remotely per compute site per workflow. This is used to
              configure the <span class="emphasis"><em>Bundle</em></span> Transfer Refiner,
              which is the default refiner used in Pegasus. This profile is
              best set in the Site Catalog or in the properties file.</td>
</tr>
<tr>
<td>stagein.local.clusters</td>
<td>This key provides finer grained control in determining
              the number of stage-in jobs that are executed locally and are
              responsible for staging data to a particular remote site. This
              profile is best set in the Site Catalog or in the Properties
              file</td>
</tr>
<tr>
<td>stagein.remote.clusters</td>
<td>This key provides finer grained control in determining
              the number of stage-in jobs that are executed remotely on the
              remote site and are responsible for staging data to it. This
              profile is best set in the Site Catalog or in the Properties
              file</td>
</tr>
<tr>
<td>stageout.clusters</td>
<td>This key determines the maximum number of
              <span class="emphasis"><em>stage-out</em></span> jobs that can be executed
              locally or remotely per compute site per workflow. This is used
              to configure the <span class="emphasis"><em>Bundle</em></span> Transfer Refiner,
              which is the default refiner used in Pegasus.</td>
</tr>
<tr>
<td>stageout.local.clusters</td>
<td>This key provides finer grained control in determining
              the number of stage-out jobs that are executed locally and are
              responsible for staging data from a particular remote site. This
              profile is best set in the Site Catalog or in the Properties
              file</td>
</tr>
<tr>
<td>stageout.remote.clusters</td>
<td>This key provides finer grained control in determining
              the number of stage-out jobs that are executed remotely on the
              remote site and are responsible for staging data from it. This
              profile is best set in the Site Catalog or in the Properties
              file</td>
</tr>
<tr>
<td>group</td>
<td>Tags a job with an arbitrary group identifier. The group
              site selector makes use of the tag.</td>
</tr>
<tr>
<td>change.dir</td>
<td>If true, tells <span class="emphasis"><em>kickstart</em></span> to change
              into the remote working directory. Kickstart itself is executed
              in whichever directory the remote scheduling system chose for
              the job.</td>
</tr>
<tr>
<td>create.dir</td>
<td>If true, tells <span class="emphasis"><em>kickstart</em></span> to create
              the remote working directory before changing into it.
              Kickstart itself is executed in whichever
              directory the remote scheduling system chose for the
              job.</td>
</tr>
<tr>
<td>transfer.proxy</td>
<td>If true, tells Pegasus to explicitly transfer the proxy
              for transfer jobs to the remote site. This is useful when you
              want to use a full proxy at the remote end instead of the
              limited proxy that is transferred by Condor-G.</td>
</tr>
<tr>
<td>transfer.arguments</td>
<td>Allows the user to specify the arguments with which the
              transfer executable is invoked. However, certain options are
              always generated for the transfer executable (base-uri
              se-mount-point).</td>
</tr>
<tr>
<td>style</td>
<td>Sets the condor submit file style. If set to globus,
              the generated submit file refers to Condor-G job submissions. If set
              to condor, the generated submit file refers to direct Condor
              submission to the local Condor pool. The condor style applies to glidein,
              where nodes from remote grid sites are glided into the local
              Condor pool. The default style is
              globus.</td>
</tr>
<tr>
<td>pmc_request_memory</td>
<td>This key is used to set the -m option for
              pegasus-mpi-cluster. It specifies the amount of memory in MB
              that a job requires. This profile is usually set in the DAX for
              each job.</td>
</tr>
<tr>
<td>pmc_request_cpus</td>
<td>This key is used to set the -c option for
              pegasus-mpi-cluster. It specifies the number of CPUs that a job
              requires. This profile is usually set in the DAX for each
              job.</td>
</tr>
<tr>
<td>pmc_priority</td>
<td>This key is used to set the -p option for
              pegasus-mpi-cluster. It specifies the priority for a job. This
              profile is usually set in the DAX for each job. Negative values
              are allowed for priorities.</td>
</tr>
<tr>
<td>pmc_task_arguments</td>
<td>This key is used to pass extra arguments to the PMC
              task at planning time. They are added to the very end of
              the argument string constructed for the task in the PMC file,
              and hence allow overriding any argument constructed by the
              planner for a particular task in the PMC job.</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break">
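<p>The difference between clusters.num and clusters.size in Table 8 comes
      down to simple arithmetic over the jobs on one level. The Python sketch
      below illustrates it; the function names are hypothetical, and the exact
      rounding and ordering the planner applies may differ from what is
      assumed here.</p>

```python
import math


def clusters_for_size(total_jobs, clusters_size):
    """Number of clusters on a level when clusters.size is set.

    Each cluster holds up to clusters_size jobs, so the cluster count
    depends on the total number of jobs on the level.
    """
    return math.ceil(total_jobs / clusters_size)


def cluster_sizes_for_num(total_jobs, clusters_num):
    """Job count per cluster when clusters.num is set.

    Jobs are spread evenly, so cluster sizes differ by at most one.
    Assumes total_jobs is at least clusters_num.
    """
    base, extra = divmod(total_jobs, clusters_num)
    return [base + 1] * extra + [base] * (clusters_num - extra)
```

<p>For ten jobs on a level, a clusters.size of 4 gives three clusters,
      while a clusters.num of 4 gives four clusters of sizes 3, 3, 2 and
      2.</p>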
</div>
</div>
<div class="section" title="10.2.3. Sources for Profiles">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp12664752"></a>10.2.3. Sources for Profiles</h3></div></div></div>
<p>Profiles may enter the job-processing stream at various stages.
    Depending on the requirements and the scope a profile is to apply to, profiles
    can be associated</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>as user property settings</p></li>
<li class="listitem"><p>at the DAX level</p></li>
<li class="listitem"><p>in the site catalog</p></li>
<li class="listitem"><p>in the transformation catalog</p></li>
</ul></div>
<p>Unfortunately, a different syntax applies to each level and context.
    This section shows the different profile sources and syntaxes. However, at
    the foundation of each profile lies the triple of namespace, key and
    value.</p>
<div class="section" title="10.2.3.1. User Profiles in Properties">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp22048704"></a>10.2.3.1. User Profiles in Properties</h4></div></div></div>
<p>Users can specify all profiles in the properties files where the
      property name is <span class="bold"><strong>[namespace].key</strong></span> and
      <span class="bold"><strong>value</strong></span> of the property is the value of
      the profile.</p>
<p>Namespace can be env|condor|globus|dagman|pegasus</p>
<p>Any profile specified as a property applies to the whole workflow
      unless overridden at the DAX, Site Catalog, or Transformation
      Catalog level.</p>
<p>Some profiles that can be set in the properties file are
      listed below</p>
<pre class="programlisting">env.JAVA_HOME "/software/bin/java"

condor.periodic_release 5
condor.periodic_remove  my_own_expression
condor.stream_error true
condor.stream_output false

globus.maxwalltime  1000
globus.maxtime      900
globus.maxcputime   10
globus.project      test_project
globus.queue        main_queue

dagman.post.arguments --test arguments
dagman.retry  4
dagman.post simple_exitcode
dagman.post.path.simple_exitcode  /bin/exitcode/exitcode.sh
dagman.post.scope all
dagman.maxpre  12
dagman.priority 13

dagman.bigjobs.maxjobs 1


pegasus.clusters.size 5

pegasus.stagein.clusters 3</pre>
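<p>The mapping from a property line to the profile triple of namespace,
      key and value can be sketched in Python as follows. The function name
      is hypothetical, and the sketch handles only the whitespace-separated
      form used in the listing above, not every separator a Java properties
      file accepts.</p>

```python
NAMESPACES = ("env", "condor", "globus", "dagman", "pegasus")


def parse_profile_property(line):
    """Split a 'namespace.key value' line into (namespace, key, value)."""
    parts = line.split(None, 1)  # property name, then the rest of the line
    if not parts:
        raise ValueError("blank line")
    name = parts[0]
    value = parts[1].strip() if len(parts) == 2 else ""
    # Everything before the first dot is the namespace; the key may
    # itself contain dots, as in post.path.simple_exitcode.
    namespace, _, key = name.partition(".")
    if namespace not in NAMESPACES or not key:
        raise ValueError("not a profile property: %r" % name)
    return namespace, key, value
```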
</div>
<div class="section" title="10.2.3.2. Profiles in DAX">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp13940352"></a>10.2.3.2. Profiles in DAX</h4></div></div></div>
<p>The user can associate profiles with logical transformations in the
      DAX. Environment settings required by a job's application, or a maximum
      estimate of the run-time, are examples of profiles at this stage.</p>
<pre class="programlisting">&lt;job id="ID000001" namespace="asdf" name="preprocess" version="1.0"
 level="3" dv-namespace="voeckler" dv-name="top" dv-version="1.0"&gt;
  &lt;argument&gt;-a top -T10  -i &lt;filename file="voeckler.f.a"/&gt;
 -o &lt;filename file="voeckler.f.b1"/&gt;
 &lt;filename file="voeckler.f.b2"/&gt;&lt;/argument&gt;
  <span class="bold"><strong>&lt;profile namespace="pegasus" key="walltime"&gt;2&lt;/profile&gt;
  &lt;profile namespace="pegasus" key="diskspace"&gt;1&lt;/profile&gt;</strong></span>
  &hellip;
&lt;/job&gt;
</pre>
</div>
<div class="section" title="10.2.3.3. Profiles in Site Catalog">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp6820496"></a>10.2.3.3. Profiles in Site Catalog</h4></div></div></div>
<p>If it becomes necessary to limit the scope of a profile to a
      single site, these profiles should go into the site catalog. A profile
      in the site catalog applies to all jobs and all applications run at the
      site. Commonly, site catalog profiles set environment settings like the
      LD_LIBRARY_PATH, or Globus RSL parameters like queue and project
      names.</p>
<p>Currently, there is no tool to manipulate the site catalog, e.g.
      by adding profiles. Modifying the site catalog requires that you load it
      into your editor.</p>
<p>The XML version of the site catalog uses the following
      syntax:</p>
<pre class="programlisting"><span class="bold"><strong>&lt;profile namespace=</strong></span>"<span class="emphasis"><em>namespace</em></span>" <span class="bold"><strong>key=</strong></span>"<span class="emphasis"><em>key</em></span>"&gt;<span class="emphasis"><em>value</em></span><span class="bold"><strong>&lt;/profile&gt;</strong></span></pre>
<p>The XML schema requires that profiles are the first children of a
      pool element. If the element ordering is wrong, the XML parser will
      produce errors and warnings:</p>
<pre class="programlisting">&lt;pool handle="isi_condor" gridlaunch="/home/shared/pegasus/bin/kickstart"&gt;
  <span class="bold"><strong>&lt;profile namespace="env"
   key="GLOBUS_LOCATION"&gt;/home/shared/globus/&lt;/profile&gt;
  &lt;profile namespace="env"
   key="LD_LIBRARY_PATH" &gt;/home/shared/globus/lib&lt;/profile&gt;</strong></span>
  &lt;lrc url="rls://sukhna.isi.edu" /&gt;
  &hellip;
&lt;/pool&gt;
</pre>
<p>The multi-line textual version of the site catalog uses the
      following syntax:</p>
<pre class="programlisting"><span class="bold"><strong>profile</strong></span> <span class="emphasis"><em>namespace "key" "value"</em></span></pre>
<p>The order within the textual pool definition is not important.
      Profiles can appear anywhere:</p>
<pre class="programlisting">pool isi_condor {
  gridlaunch "/home/shared/pegasus/bin/kickstart"
  <span class="bold"><strong>profile env "GLOBUS_LOCATION" "/home/shared/globus"
  profile env "LD_LIBRARY_PATH" "/home/shared/globus/lib"</strong></span>
  &hellip;
}
</pre>
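<p>Since both site catalog syntaxes carry the same namespace, key and
      value triple, one profile can be rendered in either form. The Python
      sketch below illustrates this with the standard library; the function
      names are hypothetical and the sketch does not validate the result
      against the site catalog schema.</p>

```python
import xml.etree.ElementTree as ET


def profile_as_xml(namespace, key, value):
    """Render a profile in the XML site catalog syntax."""
    element = ET.Element("profile", {"namespace": namespace, "key": key})
    element.text = value
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(element, encoding="unicode")


def profile_as_text(namespace, key, value):
    """Render a profile in the multi-line textual site catalog syntax."""
    return 'profile %s "%s" "%s"' % (namespace, key, value)
```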
</div>
<div class="section" title="10.2.3.4. Profiles in Transformation Catalog">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp12740336"></a>10.2.3.4. Profiles in Transformation Catalog</h4></div></div></div>
<p>Some profiles require a narrower scope than the site catalog
      offers. Some profiles only apply to certain applications on certain
      sites, or change with each application and site. Transformation-specific
      and CPU-specific environment variables, or job clustering profiles are
      good candidates. Such profiles are best specified in the transformation
      catalog.</p>
<p>Profiles are associated with a physical transformation and site in the
      transformation catalog. The Database version of the transformation
      catalog also permits the convenience of connecting a transformation with
      a profile.</p>
<p>The Pegasus tc-client tool is a convenient helper to associate
      profiles with transformation catalog entries. As a benefit, the user does
      not have to worry about the formats of profiles in the various
      transformation catalog instances.</p>
<pre class="programlisting">tc-client -a -P -E -p /home/shared/executables/analyze -t INSTALLED -r isi_condor -e env::GLOBUS_LOCATION="/home/shared/globus"</pre>
<p>The above example adds an environment variable GLOBUS_LOCATION to
      the application /home/shared/executables/analyze on site isi_condor. The
      transformation catalog guide has more details on the usage of the
      tc-client.</p>
</div>
</div>
<div class="section" title="10.2.4. Profiles Conflict Resolution">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp12932416"></a>10.2.4. Profiles Conflict Resolution</h3></div></div></div>
<p>Irrespective of where the profiles are specified, eventually the
    profiles are associated with jobs. Multiple sources may specify the same
    profile for the same job. For instance, the DAX may specify an environment
    variable X. The site catalog may also specify an environment variable X
    for the chosen site. The transformation catalog may specify an environment
    variable X for the chosen site and application. When the job is
    concretized, the conflict between these three values needs to be resolved.</p>
<p>Pegasus defines a priority ordering of profiles. A profile of higher priority
    takes precedence over (overwrites) a profile of lower priority. The
    sources are listed below from highest to lowest priority.</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>Transformation Catalog Profiles</p></li>
<li class="listitem"><p>Site Catalog Profiles</p></li>
<li class="listitem"><p>DAX Profiles</p></li>
<li class="listitem"><p>Profiles in Properties</p></li>
</ol></div>
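<p>The priority ordering above can be sketched in Python as a merge in
    which higher-priority sources overwrite lower-priority ones. The source
    names and the dictionary layout are assumptions made for the sketch and
    do not reflect the planner's internal data structures.</p>

```python
# Sources ordered from lowest to highest priority, following the
# list above read bottom to top.
PRIORITY_LOW_TO_HIGH = (
    "properties",
    "dax",
    "site_catalog",
    "transformation_catalog",
)


def resolve_profiles(profiles_by_source):
    """Merge profiles so that higher-priority sources win.

    profiles_by_source maps a source name to a dict keyed by
    (namespace, key) tuples (a hypothetical layout for this sketch).
    """
    resolved = {}
    for source in PRIORITY_LOW_TO_HIGH:
        # Later, higher-priority sources overwrite earlier entries.
        resolved.update(profiles_by_source.get(source, {}))
    return resolved
```

<p>If the DAX, the site catalog and the transformation catalog all set
    the environment variable X, the sketch keeps the value from the
    transformation catalog.</p>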
</div>
<div class="section" title="10.2.5. Details of Profile Handling">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp17100208"></a>10.2.5. Details of Profile Handling</h3></div></div></div>
<p>The previous sections omitted some of the finer details for the sake
    of clarity. To understand some of the constraints that Pegasus imposes, it
    is necessary to look at the way profiles affect jobs.</p>
<div class="section" title="10.2.5.1. Details of env Profiles">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp12576384"></a>10.2.5.1. Details of env Profiles</h4></div></div></div>
<p>Profiles in the env namespace are translated to a
      semicolon-separated list of key-value pairs. The list becomes the
      argument for the Condor environment command in the job's submit
      file.</p>
<pre class="programlisting">######################################################################
# Pegasus WMS  SUBMIT FILE GENERATOR
# DAG : black-diamond, Index = 0, Count = 1
# SUBMIT FILE NAME : findrange_ID000002.sub
######################################################################
globusrsl = (jobtype=single)
<span class="bold"><strong>environment=GLOBUS_LOCATION=/shared/globus;LD_LIBRARY_PATH=/shared/globus/lib;</strong></span>
executable = /shared/software/linux/pegasus/default/bin/kickstart
globusscheduler = columbus.isi.edu/jobmanager-condor
remote_initialdir = /shared/CONDOR/workdir/isi_hourglass
universe = globus
&hellip;
queue
######################################################################
# END OF SUBMIT FILE
</pre>
<p>Condor-G, in turn, will translate the
      <span class="emphasis"><em>environment</em></span> command for any remote job into Globus
      RSL environment settings, and append them to any existing RSL syntax it
      generates. To permit proper mixing, all <span class="emphasis"><em>environment</em></span>
      settings should solely use the env profiles, and not the Condor or
      Globus environment settings.</p>
<p>If <span class="emphasis"><em>kickstart</em></span> starts a job, it may make use of
      environment variables in its executable and arguments setting.</p>
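<p>The translation of env profiles into the Condor environment command
      described in this section can be sketched in Python as follows. The
      function name is hypothetical, and the sketch does no quoting or
      escaping of values, which a real submit file generator may need.</p>

```python
def condor_environment(env_profiles):
    """Render env profiles as a Condor environment command.

    env_profiles is a dict of variable names to values. Each pair is
    written as NAME=VALUE followed by a semicolon, matching the
    submit file shown above.
    """
    pairs = "".join("%s=%s;" % (name, value)
                    for name, value in env_profiles.items())
    return "environment=" + pairs
```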
</div>
<div class="section" title="10.2.5.2. Details of globus Profiles">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp16738800"></a>10.2.5.2. Details of globus Profiles</h4></div></div></div>
<p>Profiles in the <span class="emphasis"><em>globus</em></span> namespace are
      translated into a list of parenthesis-enclosed, equals-separated key-value
      pairs. The list becomes the value for the Condor
      <span class="emphasis"><em>globusrsl</em></span> setting in the job's submit file:</p>
<pre class="programlisting">######################################################################
# Pegasus WMS SUBMIT FILE GENERATOR
# DAG : black-diamond, Index = 0, Count = 1
# SUBMIT FILE NAME : findrange_ID000002.sub
######################################################################
<span class="bold"><strong>globusrsl = (jobtype=single)(queue=fast)(project=nvo)</strong></span>
executable = /shared/software/linux/pegasus/default/bin/kickstart
globusscheduler = columbus.isi.edu/jobmanager-condor
remote_initialdir = /shared/CONDOR/workdir/isi_hourglass
universe = globus
&hellip;
queue
######################################################################
# END OF SUBMIT FILE
</pre>
<p>For this reason, Pegasus prohibits the use of the
      <span class="emphasis"><em>globusrsl</em></span> key in the <span class="emphasis"><em>condor</em></span>
      profile namespace.</p>
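<p>The globusrsl line in the submit file above can be reproduced with a
      short Python sketch. The function name is hypothetical, and treating
      jobtype as just another entry in the dictionary is an assumption; the
      planner may add it on its own.</p>

```python
def globus_rsl(globus_profiles):
    """Render globus profiles as a Condor globusrsl setting.

    Each key-value pair becomes one parenthesis-enclosed,
    equals-separated element, in dictionary order.
    """
    elements = "".join("(%s=%s)" % (key, value)
                       for key, value in globus_profiles.items())
    return "globusrsl = " + elements
```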
</div>
</div>
</div>
<div class="section" title="10.3. Replica Selection">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="replica_selection"></a>10.3. Replica Selection</h2></div></div></div>
<div class="toc"><dl>
<dt><span class="section"><a href="reference.php#idp16798448">10.3.1. Configuration</a></span></dt>
<dt><span class="section"><a href="reference.php#idp11948336">10.3.2. Supported Replica Selectors</a></span></dt>
</dl></div>
<p>Each job in the DAX may be associated with input LFNs
  denoting the files that are required for the job to run. To determine the
  physical replica (PFN) for an LFN, Pegasus queries the Replica Catalog to get
  all the PFNs (replicas) associated with the LFN. The Replica
  Catalog may return multiple PFNs for each of the LFNs queried. Hence,
  Pegasus needs to select a single PFN amongst the various PFNs returned for
  each LFN. This process is known as replica selection in Pegasus. Users can
  specify the replica selector to use in the properties file.</p>
<p>This section describes the various Replica Selection Strategies in
  Pegasus.</p>
<div class="section" title="10.3.1. Configuration">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp16798448"></a>10.3.1. Configuration</h3></div></div></div>
<p>The user properties determine what replica selector Pegasus Workflow
    Mapper uses. The property <span class="bold"><strong>pegasus.selector.replica</strong></span> is used to specify the
    replica selection strategy. Currently supported Replica Selection
    strategies are</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>Default</p></li>
<li class="listitem"><p>Restricted</p></li>
<li class="listitem"><p>Regex</p></li>
</ol></div>
<p>The values are case sensitive. For example, the following property
    setting will throw a Factory Exception.</p>
<pre class="programlisting">pegasus.selector.replica  default</pre>
<p>The correct way to specify it is</p>
<pre class="programlisting">pegasus.selector.replica  Default</pre>
</div>
<div class="section" title="10.3.2. Supported Replica Selectors">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp11948336"></a>10.3.2. Supported Replica Selectors</h3></div></div></div>
<p>The various Replica Selectors supported in Pegasus Workflow Mapper
    are explained below</p>
<div class="section" title="10.3.2.1. Default">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp6807712"></a>10.3.2.1. Default</h4></div></div></div>
<p>This is the default replica selector used in the Pegasus Workflow
      Mapper. If the property pegasus.selector.replica is not defined in
      properties, then Pegasus uses this selector.</p>
<p>This selector looks at each PFN returned for an LFN and checks
      whether</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>the PFN is a file URL (starting with file:///)</p></li>
<li class="listitem"><p>the PFN has a pool attribute matching the site handle of
          the site where the compute job that requires the input file is to be
          run.</p></li>
</ol></div>
<p>If a PFN matching the conditions above exists, it is
      returned by the selector.</p>
<p><span class="bold"><strong>Else,</strong></span> a random PFN is selected
      from among all the PFNs that have a pool attribute matching
      the site handle of the site where the compute job is to be run.</p>
<p><span class="bold"><strong>Else,</strong></span> a random PFN is selected
      from among all the PFNs.</p>
<p>To use this replica selector set the following
      property</p>
<pre class="programlisting">pegasus.selector.replica                  Default</pre>
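<p>The selection order used by this selector can be sketched in a few lines
      of Python. This is only an illustrative sketch, not the planner's own
      code: the function name select_default and the representation of a
      replica as a (pfn, pool) tuple are assumptions made for this example, as
      is reading the two conditions as having to hold together and returning
      the first match when several PFNs satisfy them.</p>

```python
import random


def select_default(replicas, compute_site, rng=random):
    """Pick one replica for an LFN, following the Default selector's order.

    replicas: list of (pfn, pool) tuples; pool is the site handle attached
    to the PFN (a hypothetical in-memory form of a catalog entry).
    """
    if not replicas:
        raise ValueError("no replicas to choose from")

    on_site = [r for r in replicas if r[1] == compute_site]

    # 1. Prefer a file URL whose pool attribute matches the compute site.
    local_files = [r for r in on_site if r[0].startswith("file:///")]
    if local_files:
        return local_files[0]  # assumption: first match wins

    # 2. Else pick randomly among the replicas on the compute site.
    if on_site:
        return rng.choice(on_site)

    # 3. Else pick randomly among all replicas.
    return rng.choice(replicas)
```

<p>Passing a seeded random.Random instance as rng makes the random
      fallbacks reproducible when experimenting with the sketch.</p>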
</div>
<div class="section" title="10.3.2.2. Restricted">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp13543856"></a>10.3.2.2. Restricted</h4></div></div></div>
<p>This replica selector allows the user to specify good sites and
      bad sites for staging in data to a particular compute site. A good site
      for a compute site X is a preferred site from which replicas should be
      staged to site X. If more than one good site has a
      particular replica, then a random site is selected from among these
      preferred sites.</p>
<p>A bad site for a compute site X is a site from which
      replicas should not be staged. The reasons for not accessing a
      replica from a bad site can range from the link being down to the user
      not having permissions on that site's data.</p>
<p>The good and bad sites are specified by the following
      properties</p>
<pre class="programlisting">pegasus.selector.replica.*.prefer.stagein.sites
pegasus.selector.replica.*.ignore.stagein.sites</pre>
<p>where the * in the property name stands for the name of the compute
      site. A literal * in the property key is taken to mean all sites. The value of
      these properties is a comma separated list of sites.</p>
<p>For example, the following settings</p>
<pre class="programlisting">pegasus.selector.replica.*.prefer.stagein.sites            usc
pegasus.selector.replica.uwm.prefer.stagein.sites          isi,cit
</pre>
<p>mean that replicas from site usc are preferred for staging in to any
      compute site. However, for uwm a tighter constraint applies, and only
      replicas from site isi or cit are preferred. The pool attribute associated with a
      PFN tells the replica selector which site the replica/PFN is
      associated with.</p>
<p>The pegasus.selector.replica.*.prefer.stagein.sites property takes
      precedence over the pegasus.selector.replica.*.ignore.stagein.sites property, i.e. if
      for a site X, a site Y is specified in both the ignored and the
      preferred set, then site Y is treated only as a preferred site for
      site X.</p>
<p>To use this replica selector set the following property</p>
<pre class="programlisting">pegasus.selector.replica                  Restricted</pre>
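<p>The interplay of preferred and ignored sites can be illustrated with the
      following Python sketch. It is not the planner's implementation. The
      names sites_for and select_restricted, the (pfn, pool) tuples and the
      dictionaries keyed by compute site (with "*" standing for all sites) are
      assumptions made for this example. The fallback to any non-ignored
      replica when no preferred replica exists is also an assumption, since
      this section does not spell that case out.</p>

```python
import random


def sites_for(table, compute_site):
    """Return the site set for a compute site; a specific entry wins over '*'."""
    return set(table.get(compute_site, table.get("*", ())))


def select_restricted(replicas, compute_site, prefer, ignore, rng=random):
    """Pick a (pfn, pool) replica honouring preferred and ignored sites."""
    preferred = sites_for(prefer, compute_site)
    # A site listed as both preferred and ignored counts as preferred only.
    ignored = sites_for(ignore, compute_site) - preferred

    good = [r for r in replicas if r[1] in preferred]
    if good:
        return rng.choice(good)  # random choice among the preferred sites

    # Assumption: with no preferred replica, use any replica not ignored.
    allowed = [r for r in replicas if r[1] not in ignored]
    if not allowed:
        raise ValueError("all replicas are on ignored sites")
    return rng.choice(allowed)
```

<p>With prefer set to {"*": ["usc"], "uwm": ["isi", "cit"]}, a job on uwm
      only considers replicas from isi or cit as preferred, while every other
      compute site prefers usc.</p>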
</div>
<div class="section" title="10.3.2.3. Regex">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp18293568"></a>10.3.2.3. Regex</h4></div></div></div>
<p>This replica selector allows the user to specify
      regular expressions that are used to rank the PFNs
      returned from the Replica Catalog for a particular LFN. The
      selector picks the highest ranked PFN, i.e. the replica with the lowest
      rank value.</p>
<p>The regular expressions are assigned different ranks that
      determine the order in which the expressions are employed. The rank
      values for the regular expressions can be expressed in user properties using the
      property</p>
<pre class="programlisting">pegasus.selector.replica.regex.rank.<span class="bold"><strong>[value]</strong></span>                  regex-expression</pre>
<p>The <span class="bold"><strong>[value]</strong></span> in the above property
      is an integer value that denotes the rank of an expression with a rank
      value of 1 being the highest rank.</p>
<p>For example, a user can specify the following regular expressions
      that ask Pegasus to prefer file URLs over gsiftp URLs from
      example.isi.edu</p>
<pre class="programlisting">pegasus.selector.replica.regex.rank.1                       file://.*
pegasus.selector.replica.regex.rank.2                       gsiftp://example\.isi\.edu.*</pre>
<p>Users can specify as many regular expressions as they want.</p>
<p>Since Pegasus is written in Java, the regular expression syntax supported is
      that of Java. It is close to what Perl supports. More
      details can be found at
      http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html</p>
<p>Before applying any regular expressions to the PFNs
      for a particular LFN that has to be staged to a site X, the file
      URLs that don't match the site X are explicitly filtered
      out.</p>
<p>To use this replica selector set the following
      property</p>
<pre class="programlisting">pegasus.selector.replica                  Regex</pre>
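<p>The ranking behaviour can be sketched as follows. This is an illustration
      only: it uses Python's re module as a stand-in for the Java regular
      expression engine that Pegasus actually uses, and the function name
      select_by_regex, the (pfn, pool) tuples and the dictionary of ranked
      patterns are assumptions made for this example. Whether an expression
      must match the whole PFN, and what happens when no expression matches,
      are not stated in this section. The sketch assumes a full match and
      returns None when nothing matches.</p>

```python
import re


def select_by_regex(replicas, compute_site, ranked_patterns):
    """Return the first (pfn, pool) matched by the best-ranked expression.

    ranked_patterns: dict mapping an integer rank to a regex string,
    where rank 1 is the highest rank.
    """
    # File URLs that do not belong to the compute site are dropped first.
    candidates = [
        (pfn, pool)
        for pfn, pool in replicas
        if not pfn.startswith("file:") or pool == compute_site
    ]
    # Try the expressions in rank order, lowest rank value first.
    for rank in sorted(ranked_patterns):
        pattern = re.compile(ranked_patterns[rank])
        for pfn, pool in candidates:
            if pattern.fullmatch(pfn):  # assumption: whole-PFN match
                return pfn, pool
    return None  # assumption: no fallback is described for this case
```

<p>With the two example expressions above, a job running on the site that
      holds the file URL gets the file URL, while a job on any other site gets
      the gsiftp URL from example.isi.edu.</p>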
</div>
<div class="section" title="10.3.2.4. Local">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp15504736"></a>10.3.2.4. Local</h4></div></div></div>
<p>This replica selector always prefers replicas from the local host
      (pool attribute set to local) that start with a file: URL scheme.
      It is useful when users want to stage in files to a remote site from the
      submit host using the Condor file transfer mechanism.</p>
<p>To use this replica selector set the following
      property</p>
<pre class="programlisting">pegasus.selector.replica                  Local</pre>
</div>
</div>
</div>
<div class="section" title="10.4. Job Clustering">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="job_clustering"></a>10.4. Job Clustering</h2></div></div></div>
<div class="toc"><dl><dt><span class="section"><a href="reference.php#idp15534112">10.4.1. Overview</a></span></dt></dl></div>
<p>A large number of workflows executed through the Pegasus Workflow
  Management System are composed of several jobs that run for only a few
  seconds or so. The overhead of running any job on the grid is usually 60
  seconds or more. Hence, it makes sense to cluster small independent jobs
  into a larger job. This is done while mapping an abstract workflow to an
  executable workflow. Site specific or transformation specific criteria are
  taken into consideration while clustering smaller jobs into a larger job in
  the executable workflow. The user can control the granularity of
  this clustering on a per transformation, per site basis.</p>
<div class="section" title="10.4.1. Overview">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp15534112"></a>10.4.1. Overview</h3></div></div></div>
<p>The abstract workflow is mapped onto the various sites by the Site
    Selector. This semi-executable workflow is then passed to the clustering
    module. The clustering of the workflow can be either</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>level based (horizontal clustering)</p></li>
<li class="listitem"><p>label based (label clustering)</p></li>
</ul></div>
<p>The clustering module clusters the jobs into larger, clustered jobs
    that can then be executed on the remote sites. The execution can either be
    sequential on a single node or on multiple nodes using MPI. To specify
    which clustering technique to use, the user has to pass the <span class="bold"><strong>--cluster</strong></span> option to <span class="bold"><strong>pegasus-plan</strong></span>.</p>
<div class="section" title="10.4.1.1. Generating Clustered Executable Workflow">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp23317072"></a>10.4.1.1. Generating Clustered Executable Workflow</h4></div></div></div>
<p>The clustering of a workflow is activated by passing the <span class="bold"><strong>--cluster|-C</strong></span> option to <span class="bold"><strong>pegasus-plan</strong></span>. The clustering granularity of a
      particular logical transformation on a particular site is dependent upon
      the clustering techniques being used. The executable that is used for
      running the clustered job on a particular site is determined as
      explained in section 7.</p>
<pre class="programlisting"># Running pegasus-plan to generate clustered workflows

$ pegasus-plan --dax example.dax --dir ./dags -s siteX --output local \
               --cluster [comma separated list of clustering techniques] --verbose

# Valid clustering techniques are horizontal and label.</pre>
<p>The naming convention of submit files of the clustered jobs
      is <span class="bold"><strong>merge_NAME_IDX.sub</strong></span>. The NAME is
      derived from the logical transformation name. The IDX is an integer
      number between 1 and the total number of jobs in a cluster. Each of the
      submit files has a corresponding input file, following the naming
      convention <span class="bold"><strong>merge_NAME_IDX.in</strong></span>. The
      input file contains the respective execution targets and the arguments
      for each of the jobs that make up the clustered job.</p>
<div class="section" title="10.4.1.1.1. Horizontal Clustering">
<div class="titlepage"><div><div><h5 class="title">
<a name="horizontal_clustering"></a>10.4.1.1.1. Horizontal Clustering</h5></div></div></div>
<p>In case of horizontal clustering, each job in the workflow is
        associated with a level. The levels of the workflow are determined by
        doing a modified Breadth First Traversal of the workflow starting from
        the root nodes. The level associated with a node is its furthest
        distance from the root node, instead of the shortest
        distance as in normal BFS. For each level, the jobs are grouped by the
        site on which they have been scheduled by the Site Selector. Only jobs
        of the same type (txnamespace, txname, txversion) can be clustered into a
        larger job. To use horizontal clustering, the user needs to set the
        <span class="bold"><strong>--cluster</strong></span> option of <span class="bold"><strong>pegasus-plan</strong></span> to <span class="bold"><strong>horizontal</strong></span>.</p>
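<p>The level assignment can be illustrated with a short Python sketch. It is
        not the planner's code. The function name assign_levels and the
        dictionary that maps each job to its parents are assumptions made for
        this example, as is numbering the root level as 1. The sketch visits a
        job only after all of its parents have been visited, so each job ends
        up with its furthest distance from a root.</p>

```python
from collections import deque


def assign_levels(parents_of):
    """Map each job to its level: 1 for roots, else 1 + its deepest parent.

    parents_of: dict mapping every job id to a list of its parent job ids.
    """
    children_of = {job: [] for job in parents_of}
    pending = {job: len(parents) for job, parents in parents_of.items()}
    for job, parents in parents_of.items():
        for parent in parents:
            children_of[parent].append(job)

    # Root jobs have no parents and start the traversal.
    level = {job: 1 for job, count in pending.items() if count == 0}
    queue = deque(level)
    while queue:
        job = queue.popleft()
        for child in children_of[job]:
            # Keep the furthest distance seen so far, not the shortest.
            level[child] = max(level.get(child, 0), level[job] + 1)
            pending[child] -= 1
            if pending[child] == 0:  # all parents visited, level is final
                queue.append(child)
    return level
```

<p>For a workflow in which job D depends on both A and B, and B depends on
        A, a plain BFS would place D at distance one from A, whereas this
        traversal places D one level below B.</p>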
<div class="section" title="10.4.1.1.1.1. Controlling Clustering Granularity">
<div class="titlepage"><div><div><h6 class="title">
<a name="idp13752576"></a>10.4.1.1.1.1. Controlling Clustering Granularity</h6></div></div></div>
<p>The number of jobs that have to be clustered into a single
          large job is determined by the value of two parameters associated
          with the smaller jobs. Both these parameters are specified by the
          use of PEGASUS namespace profile keys. The keys can be specified
          at any of the placeholders for the profiles (abstract transformation
          in the DAX, site in the site catalog, transformation in the
          transformation catalog). The normal overloading semantics apply, i.e.
          a profile in the transformation catalog overrides the one in the site
          catalog, and that in turn overrides the one in the DAX. The two
          parameters are described below.</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
<p><span class="bold"><strong>clusters.size
              factor</strong></span></p>
<p>The clusters.size factor denotes how many jobs need to be
              merged into a single clustered job. It is specified via the use
              of the PEGASUS namespace profile key
              &ldquo;clusters.size&rdquo;. For example, suppose that at a
              particular level 4 jobs referring to logical transformation
              B have been scheduled to siteX, and that the clusters.size factor
              associated with job B for siteX is 3. This will result in 2
              clustered jobs, one composed of 3 jobs and another of 1 job.
              The clusters.size factor can be specified in the transformation
              catalog as follows</p>
<pre class="programlisting"><span class="bold"><strong>#site   transformation   pfn            type               architecture  profiles
</strong></span>
siteX    B     /shared/PEGASUS/bin/jobB INSTALLED       INTEL32::LINUX  PEGASUS::clusters.size=3
siteX    C     /shared/PEGASUS/bin/jobC INSTALLED       INTEL32::LINUX  PEGASUS::clusters.size=2
</pre>
<div class="figure">
<a name="idp28952080"></a><p class="title"><b>Figure 10.1. Clustering by clusters.size</b></p>
<div class="figure-contents"><div class="mediaobject"><img src="images/advanced-clustering-1.png" height="720" alt="Clustering by clusters.size"></div></div>
</div>
<br class="figure-break">
</li>
<li class="listitem">
<p><span class="bold"><strong>clusters.num
              factor</strong></span></p>
<p>The clusters.num factor denotes how many clustered jobs
              the user wants to see per level per site. It is specified
              via the use of the PEGASUS namespace profile key
              &ldquo;clusters.num&rdquo;. For example, suppose that at a particular
              level 4 jobs referring to logical transformation B have
              been scheduled to siteX, and that the
              &ldquo;clusters.num&rdquo; factor associated with job B
              for siteX is 3. This will result in 3 clustered jobs, one
              composed of 2 jobs and the others of a single job each. The
              clusters.num factor in the transformation catalog can be
              specified as follows</p>
<pre class="programlisting"><span class="bold"><strong>#site  transformation      pfn           type            architecture    profiles
</strong></span>
siteX    B     /shared/PEGASUS/bin/jobB INSTALLED       INTEL32::LINUX  PEGASUS::clusters.num=3
siteX    C     /shared/PEGASUS/bin/jobC INSTALLED       INTEL32::LINUX  PEGASUS::clusters.num=2
</pre>
<p>In the case where both the factors are associated with
              the job, the clusters.num value supersedes the clusters.size
              value.</p>
<pre class="programlisting"><span class="bold"><strong>#site  transformation   pfn             type             architecture   profiles
</strong></span>
siteX    B     /shared/PEGASUS/bin/jobB INSTALLED       INTEL32::LINUX PEGASUS::clusters.size=3,clusters.num=3
</pre>
<p>In the above case the jobs referring to logical
              transformation B scheduled on siteX will be clustered on the
              basis of the &ldquo;clusters.num&rdquo; value. Hence, if
              there are 4 jobs referring to logical transformation B scheduled
              to siteX, then 3 clustered jobs will be created.</p>
<div class="figure">
<a name="idp12513040"></a><p class="title"><b>Figure 10.2. Clustering by clusters.num</b></p>
<div class="figure-contents"><div class="mediaobject"><img src="images/advanced-clustering-2.png" height="720" alt="Clustering by clusters.num"></div></div>
</div>
<br class="figure-break">
</li>
</ul></div>
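<p>The arithmetic behind the two factors can be sketched in Python as follows.
          This is an illustration rather than the planner's code: the function
          name cluster_sizes is hypothetical, and two details are assumptions
          made for this example, namely that leftover jobs under clusters.num
          are handed out one each to the first clusters, and that jobs without
          either profile are left unclustered.</p>

```python
def cluster_sizes(n_jobs, clusters_size=None, clusters_num=None):
    """Return the number of jobs placed in each clustered job."""
    if n_jobs == 0:
        return []
    if clusters_num is not None:
        # clusters.num supersedes clusters.size when both are set.
        count = min(clusters_num, n_jobs)
        base, extra = divmod(n_jobs, count)
        # Assumption: leftover jobs go to the first clusters, one each.
        return [base + 1] * extra + [base] * (count - extra)
    if clusters_size is not None:
        full, rest = divmod(n_jobs, clusters_size)
        return [clusters_size] * full + ([rest] if rest else [])
    # Assumption: without either profile every job runs on its own.
    return [1] * n_jobs
```

<p>For the clusters.num example above, cluster_sizes(4, clusters_num=3)
          yields one cluster of 2 jobs and two clusters of a single job
          each.</p>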
</div>
</div>
<div class="section" title="10.4.1.1.2. Runtime Clustering">
<div class="titlepage"><div><div><h5 class="title">
<a name="runtime_clustering"></a>10.4.1.1.2. Runtime Clustering</h5></div></div></div>
<p>Workflows often consist of jobs of the same type that have varying
        run times. Two or more instances of the same job with varying inputs
        can differ significantly in their runtimes. A simple way to think
        about this is running the same program on two distinct input sets,
        where one input is small (1 MB) and the other is 10
        GB in size. In such a case the two jobs will have significantly
        differing run times. When such jobs are clustered using horizontal
        clustering, the benefits of job clustering may be lost if all smaller
        jobs get clustered together, while the larger jobs are clustered
        together. In such scenarios it would be beneficial to be able to
        cluster jobs together such that all clustered jobs have similar
        runtimes.</p>
<p>In case of runtime clustering, jobs in the workflow are
        associated with a level. The levels of the workflow are determined in
        the same manner as in horizontal clustering. For each level, the jobs
        are grouped by the site on which they have been scheduled by the Site
        Selector. Only jobs of the same type (txnamespace, txname, txversion) can
        be clustered into a larger job. To use runtime clustering, the user
        needs to set the <span class="bold"><strong>--cluster</strong></span> option of
        <span class="bold"><strong>pegasus-plan</strong></span> to <span class="bold"><strong>horizontal</strong></span>.</p>
<p>The basic algorithm for grouping jobs into clusters is as
        follows</p>
<pre class="programlisting">// clusters.maxruntime - the maximum runtime for which the clustered job should run.
// j.runtime - the runtime of the job j.
1. Create a set of jobs of the same type (txnamespace, txname, txversion) that run on the same site.
2. Sort the jobs in decreasing order of their runtime.
3. For each job j, repeat
  a. If j.runtime &gt; clusters.maxruntime then
        ignore j and continue with the next job.
  // j fits in a bin if: sum of runtimes of jobs already in the bin + j.runtime &lt;= clusters.maxruntime
  b. If j can be added to any existing bin (clustered job) then
        Add j to that bin
     Else
        Add a new bin
        Add job j to the newly added bin</pre>
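<p>The steps above amount to a first-fit packing of jobs sorted by
        decreasing runtime. A minimal Python sketch of that idea follows. It
        is an illustration under assumptions, not the planner's code: the
        function name cluster_by_runtime, the dictionary of job runtimes and
        the separate list of skipped jobs are all introduced for this
        example.</p>

```python
def cluster_by_runtime(runtimes, maxruntime):
    """Group jobs into bins whose total runtime stays within maxruntime.

    runtimes: dict mapping job id to its expected runtime.
    Returns (bins, skipped); each bin is a list of job ids.
    """
    bins, totals, skipped = [], [], []
    # Visit jobs from longest to shortest.
    for job in sorted(runtimes, key=runtimes.get, reverse=True):
        runtime = runtimes[job]
        if runtime > maxruntime:
            skipped.append(job)  # too long for any clustered job
            continue
        for i, total in enumerate(totals):
            if maxruntime >= total + runtime:  # first bin with room
                bins[i].append(job)
                totals[i] += runtime
                break
        else:
            bins.append([job])  # no bin had room: open a new one
            totals.append(runtime)
    return bins, skipped
```

<p>Sorting by decreasing runtime places the long jobs first, so that the
        short jobs can fill the space left in bins that are already open.</p>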
<p>The runtime of a job and the maximum runtime for which a clustered
        job should run are determined by the values of two parameters
        associated with the jobs.</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
<p><span class="bold"><strong>runtime</strong></span></p>
<p>The expected runtime of a job.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>clusters.maxruntime</strong></span></p>
<p>The maximum runtime of the clustered job.</p>
</li>
</ul></div>
<p>Both these parameters are specified by the use of PEGASUS
        namespace profile keys. The keys can be specified at any of the
        placeholders for the profiles (abstract transformation in the DAX,
        site in the site catalog, transformation in the transformation
        catalog). The normal overloading semantics apply, i.e. a profile in the
        transformation catalog overrides the one in the site catalog, and that
        in turn overrides the one in the DAX. An example of setting the two
        parameters in the transformation catalog is shown below.</p>
<pre class="programlisting"><span class="bold"><strong>#site  transformation   pfn             type             architecture   profiles
</strong></span>
siteX    B     /shared/PEGASUS/bin/jobB INSTALLED       INTEL32::LINUX PEGASUS::clusters.maxruntime=250,runtime=100
siteX    C     /shared/PEGASUS/bin/jobC INSTALLED       INTEL32::LINUX PEGASUS::clusters.maxruntime=300,runtime=100</pre>
<div class="figure">
<a name="idp12384272"></a><p class="title"><b>Figure 10.3. Clustering by runtime</b></p>
<div class="figure-contents"><div class="mediaobject"><img src="images/advanced-clustering-5.png" height="720" alt="Clustering by runtime"></div></div>
</div>
<br class="figure-break"><p>In the above case the jobs referring to logical transformation B
        scheduled on siteX will be clustered such that all clustered jobs
        run for approximately the same duration, as specified by the
        clusters.maxruntime profile key. In the above case we assume all jobs
        referring to transformation B run for 100 seconds. For jobs with
        significantly differing runtimes, the runtime profile key will be
        associated with the jobs in the DAX.</p>
<p>In addition to the above two profiles, we need to inform
        pegasus-plan to use runtime clustering. This is done by setting the
        following property.</p>
<pre class="programlisting"><span class="bold"><strong> pegasus.clusterer.preference          Runtime</strong></span> </pre>
</div>
<div class="section" title="10.4.1.1.3. Label Clustering">
<div class="titlepage"><div><div><h5 class="title">
<a name="label_clustering"></a>10.4.1.1.3. Label Clustering</h5></div></div></div>
<p>In label based clustering, the user labels the workflow. All
        jobs having the same label value are clustered into a single clustered
        job. This allows users to create clusters or use a clustering
        technique that is specific to their workflows. If there is no label
        associated with a job, the job is not clustered and is executed as
        is.</p>
<div class="figure">
<a name="idp16758560"></a><p class="title"><b>Figure 10.4. Label-based clustering</b></p>
<div class="figure-contents"><div class="mediaobject"><img src="images/advanced-clustering-3.png" height="720" alt="Label-based clustering"></div></div>
</div>
<p><br class="figure-break"></p>
<p>Since the jobs in a cluster in this case are not independent,
        it is important that the jobs are executed in the correct order. This is
        done by doing a topological sort on the jobs in each cluster. To use
        label based clustering, the user needs to set the <span class="bold"><strong>--cluster</strong></span> option of <span class="bold"><strong>pegasus-plan</strong></span> to label.</p>
<div class="section" title="10.4.1.1.3.1. Labelling the Workflow">
<div class="titlepage"><div><div><h6 class="title">
<a name="idp24445728"></a>10.4.1.1.3.1. Labelling the Workflow</h6></div></div></div>
<p>The labels for the jobs in the workflow are specified by
          associating <span class="bold"><strong>pegasus</strong></span> profile keys
          with the jobs during the DAX generation process. The user can choose
          which profile key to use for labeling the workflow. By default, it
          is assumed that the user is using the PEGASUS profile key label to
          associate the labels. To use another key in the <span class="bold"><strong>pegasus</strong></span> namespace, the user needs to set the
          following property</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>pegasus.clusterer.label.key</p></li></ul></div>
<p>For example, if the user sets <span class="bold"><strong>pegasus.clusterer.label.key</strong></span> to <span class="bold"><strong>user_label</strong></span>, then the job description in the
          DAX looks as follows</p>
<pre class="programlisting">&lt;adag &gt;
...
  &lt;job id="ID000004" namespace="app" name="analyze" version="1.0" level="1" &gt;
    &lt;argument&gt;-a bottom -T60  -i &lt;filename file="user.f.c1"/&gt;  -o &lt;filename file="user.f.d"/&gt;&lt;/argument&gt;
    &lt;profile namespace="pegasus" key="user_label"&gt;p1&lt;/profile&gt;
    &lt;uses file="user.f.c1" link="input" dontRegister="false" dontTransfer="false"/&gt;
    &lt;uses file="user.f.c2" link="input" dontRegister="false" dontTransfer="false"/&gt;
    &lt;uses file="user.f.d" link="output" dontRegister="false" dontTransfer="false"/&gt;
  &lt;/job&gt;
...
&lt;/adag&gt;</pre>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>The above states that the <span class="bold"><strong>pegasus</strong></span> profiles with key as <span class="bold"><strong>user_label</strong></span> are to be used for designating
              clusters.</p></li>
<li class="listitem"><p>Each job with the same value for <span class="bold"><strong>pegasus</strong></span> profile key <span class="bold"><strong>user_label</strong></span> appears in the same
              cluster.</p></li>
</ul></div>
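<p>Grouping by label and ordering the jobs inside each cluster can be
          sketched in Python as follows. This is an illustration only, not the
          planner's code. The function name cluster_by_label and the two input
          dictionaries are assumptions made for this example. The topological
          sort uses graphlib from the Python standard library (Python 3.9 or
          later).</p>

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # Python 3.9+


def cluster_by_label(labels, parents_of):
    """Group jobs by label and order each group so that parents run first.

    labels: dict mapping job id to its label, or None if unlabelled.
    parents_of: dict mapping job id to a list of parent job ids.
    """
    groups = defaultdict(list)
    for job, label in labels.items():
        if label is not None:  # unlabelled jobs are left unclustered
            groups[label].append(job)

    clusters = {}
    for label, jobs in groups.items():
        members = set(jobs)
        # Only dependencies between jobs of the same cluster matter here.
        graph = {
            job: [p for p in parents_of.get(job, ()) if p in members]
            for job in jobs
        }
        clusters[label] = list(TopologicalSorter(graph).static_order())
    return clusters
```

<p>The keys of labels correspond to the job ids in the DAX, and the label
          values to the contents of the profile element with the configured
          key, such as p1 in the example above.</p>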
</div>
</div>
<div class="section" title="10.4.1.1.4. Recursive Clustering">
<div class="titlepage"><div><div><h5 class="title">
<a name="idp19107296"></a>10.4.1.1.4. Recursive Clustering</h5></div></div></div>
<p>In some cases, a user may want to use a combination of
        clustering techniques. For example, a user may want some jobs in the
        workflow to be horizontally clustered and some to be label clustered.
        This can be achieved by specifying a comma separated list of
        clustering techniques to the <span class="bold"><strong>--cluster</strong></span> option of <span class="bold"><strong>pegasus-plan</strong></span>. In this case the clustering
        techniques are applied one after the other on the workflow in the
        order specified on the command line.</p>
<p>For example</p>
<pre class="programlisting">$ <span class="emphasis"><em>pegasus-plan --dax example.dax --dir ./dags --cluster label,horizontal -s siteX --output local --verbose</em></span></pre>
<div class="figure">
<a name="idp14650976"></a><p class="title"><b>Figure 10.5. Recursive clustering</b></p>
<div class="figure-contents"><div class="mediaobject"><img src="images/advanced-clustering-4.png" height="720" alt="Recursive clustering"></div></div>
</div>
<br class="figure-break">
</div>
</div>
<div class="section" title="10.4.1.2. Execution of the Clustered Job">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp15396240"></a>10.4.1.2. Execution of the Clustered Job</h4></div></div></div>
<p>The execution of the clustered job on the remote site involves
      the execution of the smaller constituent jobs either</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
<p><span class="bold"><strong>sequentially on a single node of the
          remote site</strong></span></p>
<p>The clustered job is executed using <span class="bold"><strong>pegasus-cluster</strong></span>, a wrapper tool written in C
          that is distributed as part of Pegasus. It takes in the jobs
          passed to it and executes them sequentially on a single
          node. To use pegasus-cluster for executing any clustered job on
          siteX, there needs to be an entry in the transformation catalog for
          an executable with the logical name seqexec and the namespace
          pegasus.</p>
<pre class="programlisting"><span class="bold"><strong>#site  transformation   pfn            type                 architecture    profiles</strong></span>

siteX    pegasus::seqexec     /usr/pegasus/bin/pegasus-cluster INSTALLED       INTEL32::LINUX NULL</pre>
<p>If the entry is not specified, Pegasus will attempt to create a
          default path on the basis of the environment profile PEGASUS_HOME
          specified in the site catalog for the remote site.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>On multiple nodes of the remote site
          using MPI based task management tool called Pegasus MPI Cluster
          (PMC)</strong></span></p>
<p>The clustered job is executed using <span class="bold"><strong>pegasus-mpi-cluster</strong></span>, a wrapper MPI program
          written in C that is distributed as part of Pegasus. A PMC job
          consists of a single master process (this process is rank 0 in MPI
          parlance) and several worker processes. These processes follow the
          standard master-worker architecture. The master process manages the
          workflow and assigns workflow tasks to workers for execution. The
          workers execute the tasks and return the results to the master.
          Communication between the master and the workers is accomplished
          using a simple text-based protocol implemented using MPI_Send and
          MPI_Recv. PMC relies on a shared filesystem on the remote site to
          manage the individual tasks' stdout and stderr, and stages them back to
          the submit host as part of its own stdout/stderr.</p>
<p>The input format for PMC is a DAG based format similar to
          Condor DAGMan's. PMC follows the dependencies specified in the DAG
          to release the jobs in the right order and executes parallel jobs
          via the workers when possible. The input file for PMC is
          automatically generated by the Pegasus Planner when generating the
          executable workflow. PMC allows finer-grained control over how
          each task is executed. This can be enabled by associating the
          following pegasus profiles with the jobs in the DAX</p>
<div class="table">
<a name="idp11824096"></a><p class="title"><b>Table 10.9. Pegasus Profiles that can be associated with jobs
            in the DAX for PMC</b></p>
<div class="table-contents"><table summary="Table : Pegasus Profiles that can be associated with jobs
            in the DAX for PMC" border="1">
<colgroup>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td><span class="bold"><strong>Key</strong></span></td>
<td><span class="bold"><strong>Description</strong></span></td>
</tr>
<tr>
<td>pmc_request_memory</td>
<td>This key is used to set the -m option for
                  pegasus-mpi-cluster. It specifies the amount of memory in MB
                  that a job requires. This profile is usually set in the DAX
                  for each job.</td>
</tr>
<tr>
<td>pmc_request_cpus</td>
<td>This key is used to set the -c option for
                  pegasus-mpi-cluster. It specifies the number of CPUs that a
                  job requires. This profile is usually set in the DAX for
                  each job.</td>
</tr>
<tr>
<td>pmc_priority</td>
<td>This key is used to set the -p option for
                  pegasus-mpi-cluster. It specifies the priority for a job.
                  This profile is usually set in the DAX for each job.
                  Negative values are allowed for priorities.</td>
</tr>
<tr>
<td>pmc_task_arguments</td>
<td>This key is used to pass any extra arguments to the
                  PMC task at planning time. They are added to the
                  very end of the argument string constructed for the task in
                  the PMC file. Hence, it allows overriding any argument
                  constructed by the planner for any particular task in the
                  PMC job.</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break"><p>Refer to the pegasus-mpi-cluster man page in the <a class="link" href="reference.php#pegasus-cli-chapter">command line tools chapter</a> to
          learn more about PMC and how it schedules individual tasks.</p>
<p>It is recommended to have a pegasus::mpiexec entry in the
          transformation catalog to specify the path to PMC on the remote site and
          to specify the relevant globus profiles such as xcount, host_xcount and
          maxwalltime to control the size of the MPI job.</p>
<pre class="programlisting"><span class="bold"><strong>#site  transformation   pfn            type                 architecture    profiles</strong></span>

siteX    pegasus::mpiexec     /usr/pegasus/bin/pegasus-mpi-cluster INSTALLED       INTEL32::LINUX globus::xcount=32;globus::host_xcount=1</pre>
<p>If the entry is not specified, Pegasus will attempt to create a
          default path on the basis of the environment profile PEGASUS_HOME
          specified in the site catalog for the remote site.</p>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Users are encouraged to use label based clustering in
            conjunction with PMC.</p>
</div>
</li>
</ul></div>
<div class="section" title="10.4.1.2.1. Specification of Method of Execution for Clustered Jobs">
<div class="titlepage"><div><div><h5 class="title">
<a name="idp18840304"></a>10.4.1.2.1. Specification of Method of Execution for Clustered Jobs</h5></div></div></div>
<p>The method of execution of the clustered job (whether to launch via
        mpiexec or seqexec) can be specified</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">
<p><span class="bold"><strong>globally in the properties
            file</strong></span></p>
<p>The user can set a property in the properties file that
            results in all the clustered jobs of the workflow being executed
            by the same type of executable.</p>
<pre class="programlisting"><span class="bold"><strong>#PEGASUS PROPERTIES FILE</strong></span>
pegasus.clusterer.job.aggregator seqexec|mpiexec</pre>
<p>In the above example, all the clustered jobs on the remote
            sites are launched via the executable named by the property
            value, as long as the property value is not overridden in the
            site catalog.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>associating profile key job.aggregator
            with the site in the site catalog</strong></span></p>
<pre class="programlisting">&lt;site handle="siteX" gridlaunch = "/shared/PEGASUS/bin/kickstart"&gt;
    &lt;profile namespace="env" key="GLOBUS_LOCATION" &gt;/home/shared/globus&lt;/profile&gt;
    &lt;profile namespace="env" key="LD_LIBRARY_PATH"&gt;/home/shared/globus/lib&lt;/profile&gt;
    &lt;profile namespace="pegasus" key="job.aggregator" &gt;seqexec&lt;/profile&gt;
    &lt;lrc url="rls://siteX.edu" /&gt;
    &lt;gridftp  url="gsiftp://siteX.edu/" storage="/home/shared/work" major="2" minor="4" patch="0" /&gt;
    &lt;jobmanager universe="transfer" url="siteX.edu/jobmanager-fork" major="2" minor="4" patch="0" /&gt;
    &lt;jobmanager universe="vanilla" url="siteX.edu/jobmanager-condor" major="2" minor="4" patch="0" /&gt;
    &lt;workdirectory &gt;/home/shared/storage&lt;/workdirectory&gt;
  &lt;/site&gt;</pre>
<p>In the above example, all the clustered jobs on siteX are
            going to be executed via seqexec, as long as the value is not
            overridden in the transformation catalog.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>associating profile key job.aggregator
            with the transformation that is being clustered, in the
            transformation catalog</strong></span></p>
<pre class="programlisting"><span class="bold"><strong>#site  transformation   pfn            type                architecture profiles
</strong></span>
siteX    B     /shared/PEGASUS/bin/jobB INSTALLED       INTEL32::LINUX pegasus::clusters.size=3,job.aggregator=mpiexec</pre>
<p>In the above example, all the clustered jobs that consist of
            transformation B on siteX will be executed via mpiexec.</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p><span class="bold"><strong>The clustering of jobs on a site
              happens only if</strong></span></p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>there exists an entry in the transformation catalog
                    for the clustering executable that has been determined by
                    the above 3 rules</p></li>
<li class="listitem"><p>the number of jobs being clustered on the site is
                    more than 1</p></li>
</ul></div>
</div>
</li>
</ol></div>
</div>
</div>
<div class="section" title="10.4.1.3. Outstanding Issues">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp14944288"></a>10.4.1.3. Outstanding Issues</h4></div></div></div>
<div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem">
<p><span class="bold"><strong>Label Clustering</strong></span></p>
<p>More rigorous checks are required to ensure that the labeling
          scheme applied by the user is valid.</p>
</li></ol></div>
</div>
</div>
</div>
<div class="section" title="10.5. Data Transfers">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="transfer"></a>10.5. Data Transfers</h2></div></div></div>
<div class="toc"><dl>
<dt><span class="section"><a href="reference.php#ref_data_staging_configuration">10.5.1. Data Staging Configuration</a></span></dt>
<dt><span class="section"><a href="reference.php#local_vs_remote_transfers">10.5.2. Local versus Remote Transfers</a></span></dt>
<dt><span class="section"><a href="reference.php#idp16922688">10.5.3. Symlinking Against Input Data</a></span></dt>
<dt><span class="section"><a href="reference.php#idp13332752">10.5.4. Addition of Separate Data Movement Nodes to Executable
    Workflow</a></span></dt>
<dt><span class="section"><a href="reference.php#ref_output_mapper">10.5.5. Output Mappers</a></span></dt>
<dt><span class="section"><a href="reference.php#idp13849840">10.5.6. Executable Used for Transfer Jobs</a></span></dt>
<dt><span class="section"><a href="reference.php#idp24601488">10.5.7. Executables used for Directory Creation and Cleanup Jobs</a></span></dt>
<dt><span class="section"><a href="reference.php#cred_staging">10.5.8. Credentials Staging</a></span></dt>
<dt><span class="section"><a href="reference.php#idp12641440">10.5.9. Staging of Executables</a></span></dt>
<dt><span class="section"><a href="reference.php#idp14063568">10.5.10. Staging of Pegasus Worker Package</a></span></dt>
<dt><span class="section"><a href="reference.php#idp12098064">10.5.11. Using Amazon S3 as a Staging Site</a></span></dt>
<dt><span class="section"><a href="reference.php#idp15427664">10.5.12. iRODS data access</a></span></dt>
</dl></div>
<p>As part of the Workflow Mapping Process, Pegasus does data management
  for the executable workflow. It queries a Replica Catalog to discover the
  locations of the input datasets and adds data movement and registration
  nodes to the workflow to</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>stage in input data to the staging sites (a staging site is a
      site associated with the compute job and used for staging; in the
      shared filesystem setup, the staging site is the same as the execution
      site where the jobs in the workflow are executed).</p></li>
<li class="listitem"><p>stage out output data generated by the workflow to the final
      storage site.</p></li>
<li class="listitem"><p>stage in intermediate data between compute sites, if
      required.</p></li>
<li class="listitem"><p>register the locations of the output data on the final storage
      site in the replica catalog.</p></li>
</ol></div>
<p>The separate data movement jobs that are added to the executable
  workflow are responsible for staging data to a workflow specific directory
  accessible to the staging server on a staging site associated with the
  compute sites. Depending on the data staging configuration, the staging site
  for a compute site can be the compute site itself. In the default case, the
  staging server is usually on the headnode of the compute site and has access
  to the shared filesystem between the worker nodes and the head node. Pegasus
  adds a directory creation job in the executable workflow that creates the
  workflow specific directory on the staging server.</p>
<p>In addition to data, Pegasus also transfers user executables to the
  compute sites if the executables are not installed on the remote sites
  beforehand. This section gives an overview of how transfers of data and
  executables are managed in Pegasus.</p>
<div class="section" title="10.5.1. Data Staging Configuration">
<div class="titlepage"><div><div><h3 class="title">
<a name="ref_data_staging_configuration"></a>10.5.1. Data Staging Configuration</h3></div></div></div>
<p>Broadly, Pegasus can be set up to run workflows in the following
    configurations:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
<p><span class="bold"><strong>Shared File System</strong></span></p>
<p>This setup applies where the head node and the worker nodes
        of a cluster share a filesystem. Compute jobs in the workflow run in a
        directory on the shared filesystem.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Non Shared Filesystem</strong></span></p>
<p>This setup applies where the head node and the worker nodes
        of a cluster don't share a filesystem. Compute jobs in the workflow
        run in a local directory on the worker node.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Condor Pool Without a Shared
        Filesystem</strong></span></p>
<p>This setup applies to a Condor pool whose worker nodes don't
        share a filesystem. All data IO is achieved using Condor File IO.
        This is a special case of the non shared filesystem setup, where
        Condor File IO is used instead of pegasus-transfer to transfer input
        and output data.</p>
</li>
</ul></div>
<p>For the purposes of data configuration, the various sites and
    directories involved are defined below.</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">
<p><span class="bold"><strong>Submit Host</strong></span></p>
<p>The host from which the workflows are submitted. This is where
        Pegasus and Condor DAGMan are installed. It is referred to as the
        <span class="bold"><strong>"local"</strong></span> site in the site
        catalog.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Compute Site</strong></span></p>
<p>The site where the jobs mentioned in the DAX are executed. There
        needs to be an entry in the Site Catalog for every compute site. The
        compute site is passed to pegasus-plan using the <span class="bold"><strong>--sites</strong></span> option.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Staging Site</strong></span></p>
<p>A site to which the separate transfer jobs in the executable
        workflow (jobs with the stage_in, stage_out and stage_inter prefixes
        that Pegasus adds using the transfer refiners) stage the input data,
        and from which they transfer the output data to the final output
        site. In the shared filesystem setup, the staging site is always the
        compute site where the jobs execute.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Output Site</strong></span></p>
<p>The output site is the final storage site where users want the
        output data from the jobs to end up. The output site is passed to
        pegasus-plan using the <span class="bold"><strong>--output</strong></span>
        option. The stage-out jobs in the workflow stage the data from the
        staging site to the final storage site.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Input Site</strong></span></p>
<p>The site where the input data is stored. The locations of the
        input data are catalogued in the Replica Catalog, and the pool
        attribute of the locations gives us the site handle for the input
        site.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Workflow Execution
        Directory</strong></span></p>
<p>This is the directory created by the create dir jobs in the
        executable workflow on the staging site. There is one such directory
        per workflow per staging site. In the shared filesystem setup, the
        staging site is the compute site.</p>
</li>
<li class="listitem">
<p><span class="bold"><strong>Worker Node Directory</strong></span></p>
<p>This is the directory created on the worker nodes per job,
        usually by the job wrapper that launches the job.</p>
</li>
</ol></div>
<div class="section" title="10.5.1.1. Shared File System">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp15371248"></a>10.5.1.1. Shared File System</h4></div></div></div>
<p>By default, Pegasus is set up to run workflows in the shared file
      system setup, where the worker nodes and the head node of a cluster
      share a filesystem.</p>
<div class="figure">
<a name="idp15744000"></a><p class="title"><b>Figure 10.6. Shared File System Setup</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="100%"><tr><td align="center"><img src="images/data-configuration-sharedfs.png" align="middle" height="360" alt="Shared File System Setup"></td></tr></table></div></div>
</div>
<br class="figure-break"><p>The data flow in this case is as follows:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>The stage-in job executes (either on the submit host or on the
          head node) to stage in input data from the input sites (1 to n) to a
          workflow specific execution directory on the shared filesystem.</p></li>
<li class="listitem"><p>The compute job starts on a worker node in the workflow
          execution directory and accesses the input data using POSIX IO.</p></li>
<li class="listitem"><p>The compute job executes on the worker node and writes its
          output data to the workflow execution directory using POSIX IO.</p></li>
<li class="listitem"><p>The stage-out job executes (either on the submit host or on
          the head node) to stage out output data from the workflow specific
          execution directory to a directory on the final output site.</p></li>
</ol></div>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Set <span class="bold"><strong>pegasus.data.configuration</strong></span>
        to <span class="bold"><strong>sharedfs</strong></span> to run in this
        configuration.</p>
</div>
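<p>As a minimal sketch, the tip above could be written in the properties
      file as shown below. Only the property name and its value come from
      this section; the comment and the key = value layout are
      illustrative and assume standard Java properties syntax. Because the
      shared filesystem setup is the default, the entry mainly serves to
      make the choice explicit.</p>
<pre class="programlisting"><span class="bold"><strong>#PEGASUS PROPERTIES FILE</strong></span>
# compute jobs run in the workflow directory on the shared filesystem
pegasus.data.configuration = sharedfs</pre>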
</div>
<div class="section" title="10.5.1.2. Non Shared Filesystem">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp19717440"></a>10.5.1.2. Non Shared Filesystem</h4></div></div></div>
<p>In this setup, Pegasus runs workflows on the local filesystems of
      the worker nodes, which do not share a filesystem. The data transfers
      happen between the worker node and a staging / data coordination
      site. The staging site server can be a file server on the head node
      of a cluster or can be on a separate machine.</p>
<p><span class="bold"><strong>Setup</strong></span></p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>The compute site and the staging site are different.</p></li>
<li class="listitem"><p>The head node and worker nodes of the compute site don't
            share a filesystem.</p></li>
<li class="listitem"><p>Input data is staged from remote sites.</p></li>
<li class="listitem"><p>The output site is remote, i.e. a site other than the compute
            site. It can be the submit host.</p></li>
</ul></div>
<div class="figure">
<a name="idp14580224"></a><p class="title"><b>Figure 10.7. Non Shared Filesystem Setup</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="100%"><tr><td align="center"><img src="images/data-configuration-nonsharedfs.png" align="middle" height="360" alt="Non Shared Filesystem Setup"></td></tr></table></div></div>
</div>
<br class="figure-break"><p>The data flow in this case is as follows:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>The stage-in job executes (either on the submit host or on the
          staging site) to stage in input data from the input sites (1 to n)
          to a workflow specific execution directory on the staging site.</p></li>
<li class="listitem"><p>The compute job starts on a worker node in a local execution
          directory. It uses pegasus-transfer to transfer the input data from
          the staging site to a local directory on the worker node.</p></li>
<li class="listitem"><p>The compute job executes on the worker node.</p></li>
<li class="listitem"><p>The compute job writes its output data to the local directory
          on the worker node using POSIX IO.</p></li>
<li class="listitem"><p>The output data is pushed out to the staging site from the
          worker node using pegasus-transfer.</p></li>
<li class="listitem"><p>The stage-out job executes (either on the submit host or on
          the staging site) to stage out output data from the workflow
          specific execution directory to a directory on the final output
          site.</p></li>
</ol></div>
<p>In this case, the compute jobs are wrapped as <a class="link" href="running_workflows.php#pegasuslite" title="5.4. PegasusLite">PegasusLite</a> instances.</p>
<p>This mode is especially useful for running in cloud environments
      where you don't want to set up a shared filesystem between the worker
      nodes. Running in that mode is explained in detail <a class="link" href="execution_environments.php#amazon_aws" title="6.3.1. Amazon EC2">here</a>.</p>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Set <span class="bold"><strong>pegasus.data.configuration</strong></span>
        to <span class="bold"><strong>nonsharedfs</strong></span> to run in this
        configuration. The staging site can be specified using the <span class="bold"><strong>--staging-site</strong></span> option to pegasus-plan.</p>
</div>
<p>In this setup, Pegasus always stages the input files through the
      staging site, i.e. the stage-in job stages in data from the input site
      to the staging site. The PegasusLite jobs that start up on the worker
      nodes then pull the input data from the staging site for each job. In
      some cases, it might be useful to set up the PegasusLite jobs to pull
      input data directly from the input site without going through the
      staging server. This is based on the assumption that the worker nodes
      can access the input site. Starting with the 4.3 release, users can
      enable this. However, you should be aware that access to the input
      site is then no longer throttled (as it is in the case of stage-in
      jobs). If a large number of compute jobs start at the same time in a
      workflow, the input server will see a connection from each job.</p>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Set <span class="bold"><strong>pegasus.transfer.bypass.input.staging</strong></span>
        to <span class="bold"><strong>true</strong></span> to enable the
        bypass of staging of input files via the staging server.</p>
</div>
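<p>Taken together, the two tips in this section could translate into
      a properties file along the following lines. This is a sketch rather
      than a complete configuration: the property names and values are the
      ones named above, while the comments and the key = value layout are
      illustrative. Enabling the bypass assumes that every worker node can
      reach the input site directly.</p>
<pre class="programlisting"><span class="bold"><strong>#PEGASUS PROPERTIES FILE</strong></span>
# compute jobs run in a local directory on the worker node
pegasus.data.configuration = nonsharedfs

# optional: PegasusLite jobs pull inputs straight from the input site;
# access to the input site is then not throttled by stage-in jobs
pegasus.transfer.bypass.input.staging = true</pre>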
</div>
<div class="section" title="10.5.1.3. Condor Pool Without a Shared Filesystem">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp16673888"></a>10.5.1.3. Condor Pool Without a Shared Filesystem</h4></div></div></div>
<p>This setup applies to a Condor pool whose worker nodes don't share
      a filesystem. All data IO is achieved using Condor File IO. This is a
      special case of the non shared filesystem setup, where Condor File IO
      is used instead of pegasus-transfer to transfer input and output
      data.</p>
<p><span class="bold"><strong>Setup</strong></span></p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>The submit host and the staging site are the same.</p></li>
<li class="listitem"><p>The head node and worker nodes of the compute site don't
            share a filesystem.</p></li>
<li class="listitem"><p>Input data is staged from remote sites.</p></li>
<li class="listitem"><p>The output site is remote, i.e. a site other than the compute
            site. It can be the submit host.</p></li>
</ul></div>
<div class="figure">
<a name="idp16190528"></a><p class="title"><b>Figure 10.8. Condor Pool Without a Shared Filesystem</b></p>
<div class="figure-contents"><div class="mediaobject" align="center"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="100%"><tr><td align="center"><img src="images/data-configuration-condorio.png" align="middle" height="360" alt="Condor Pool Without a Shared Filesystem"></td></tr></table></div></div>
</div>
<br class="figure-break"><p>The data flow in this case is as follows:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>The stage-in job executes on the submit host to stage in input
          data from the input sites (1 to n) to a workflow specific execution
          directory on the submit host.</p></li>
<li class="listitem"><p>The compute job starts on a worker node in a local execution
          directory. Before the compute job starts, Condor transfers the input
          data for the job from the workflow execution directory on the submit
          host to the local execution directory on the worker node.</p></li>
<li class="listitem"><p>The compute job executes on the worker node.</p></li>
<li class="listitem"><p>The compute job writes its output data to the local directory
          on the worker node using POSIX IO.</p></li>
<li class="listitem"><p>When the compute job finishes, Condor transfers the output
          data for the job from the local execution directory on the worker
          node to the workflow execution directory on the submit host.</p></li>
<li class="listitem"><p>The stage-out job executes (either on the submit host or on
          the staging site) to stage out output data from the workflow
          specific execution directory to a directory on the final output
          site.</p></li>
</ol></div>
<p>In this case, the compute jobs are wrapped as <a class="link" href="running_workflows.php#pegasuslite" title="5.4. PegasusLite">PegasusLite</a> instances.</p>
<p>This mode is especially useful for running in cloud environments
      where you don't want to set up a shared filesystem between the worker
      nodes. Running in that mode is explained in detail <a class="link" href="execution_environments.php#amazon_aws" title="6.3.1. Amazon EC2">here</a>.</p>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Set <span class="bold"><strong>pegasus.data.configuration</strong></span>
        to <span class="bold"><strong>condorio</strong></span> to run in this
        configuration. In this mode, the staging site is automatically set to
        site <span class="bold"><strong>local</strong></span>.</p>
</div>
<p>In this setup, Pegasus always stages the input files through the
      submit host, i.e. the stage-in job stages in data from the input site
      to the submit host (local site). The input data is then transferred to
      the remote worker nodes from the submit host using Condor file
      transfers. In the case where the input data is locally accessible at
      the submit host, i.e. the input site and the submit host are the same,
      it is possible to bypass the creation of separate stage-in jobs that
      copy the data to the workflow specific directory on the submit host.
      Instead, Condor file transfers can be set up to transfer the input
      files directly from the locally accessible input locations (file URLs
      with the site attribute set to local) specified in the replica
      catalog. Starting with the 4.3 release, users can enable this.</p>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>Set <span class="bold"><strong>pegasus.transfer.bypass.input.staging</strong></span>
        to <span class="bold"><strong>true</strong></span> to bypass the
        creation of separate stage-in jobs.</p>
</div>
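<p>The two tips in this section could be combined in a properties
      file roughly as follows. The property names and values are taken from
      the text above; the comments and the key = value layout are
      illustrative only.</p>
<pre class="programlisting"><span class="bold"><strong>#PEGASUS PROPERTIES FILE</strong></span>
# all data IO between submit host and worker nodes uses Condor File IO
pegasus.data.configuration = condorio

# optional: skip separate stage-in jobs for inputs that are already
# locally accessible on the submit host
pegasus.transfer.bypass.input.staging = true</pre>
<p>For the bypass to apply, the input file needs a file URL that is
      catalogued against the local site. A replica catalog entry of the
      following shape is one way this might look. The path is hypothetical,
      and the entry assumes the file based replica catalog format in which
      the site is named by the pool attribute.</p>
<pre class="programlisting">f.input   file:///home/user/inputs/f.input pool="local"</pre>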
</div>
</div>
<div class="section" title="10.5.2. Local versus Remote Transfers">
<div class="titlepage"><div><div><h3 class="title">
<a name="local_vs_remote_transfers"></a>10.5.2. Local versus Remote Transfers</h3></div></div></div>
<p>As far as possible, Pegasus will ensure that the transfer jobs added
    to the executable workflow are executed on the submit host. By default,
    Pegasus will schedule a transfer to be executed on the remote staging site
    only if there is no way to execute it on the submit host. For example, if
    the file server specified for the staging site/compute site is a file
    server (one that is addressed through a file URL), then Pegasus will
    schedule all the stage-in data movement jobs on the compute site to
    stage in the input data for the workflow. Another case is when a user has
    symlinking turned on. In that case, the transfer jobs that symlink
    against the input data on the compute site will be executed remotely
    (on the compute site).</p>
<p>Users can specify the property <span class="bold"><strong>pegasus.transfer.*.remote.sites</strong></span> to change the
    default behaviour of Pegasus and force Pegasus to run the different types
    of transfer jobs remotely for the sites specified. The value of the
    property is a comma separated list of compute sites for which you want the
    transfer jobs to run remotely.</p>
<p>The table below illustrates all the possible variations of the
    property.</p>
<div class="table">
<a name="idp16418720"></a><p class="title"><b>Table 10.10. Property Variations for pegasus.transfer.*.remote.sites</b></p>
<div class="table-contents"><table summary="Property Variations for pegasus.transfer.*.remote.sites" border="1">
<colgroup>
<col>
<col>
</colgroup>
<thead><tr>
<th>Property Name</th>
<th>Applies to</th>
</tr></thead>
<tbody>
<tr>
<td>pegasus.transfer.stagein.remote.sites</td>
<td>the stage in transfer jobs</td>
</tr>
<tr>
<td>pegasus.transfer.stageout.remote.sites</td>
<td>the stage out transfer jobs</td>
</tr>
<tr>
<td>pegasus.transfer.inter.remote.sites</td>
<td>the inter site transfer jobs</td>
</tr>
<tr>
<td>pegasus.transfer.*.remote.sites</td>
<td>all types of transfer jobs</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break"><p>The prefix of the transfer job name indicates whether the transfer
    job is to be executed locally (on the submit host) or remotely (on the
    compute site). For example, the prefix stage_in_local_ in the transfer job
    name stage_in_local_isi_viz_0 indicates that the job is a stage-in
    transfer job that is executed locally and is used to transfer input data
    to the compute site isi_viz. The prefix naming scheme for the transfer jobs is
    <span class="bold"><strong>[stage_in|stage_out|inter]_[local|remote]_</strong></span>.</p>
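<p>As an illustration of the properties in Table 10.10, the following
    sketch forces the stage-in jobs for two compute sites to run remotely
    while leaving the other transfer job types at their default behaviour.
    The site handles siteX and siteY are hypothetical, and the key = value
    layout is an assumption about how the properties file is written.</p>
<pre class="programlisting"><span class="bold"><strong>#PEGASUS PROPERTIES FILE</strong></span>
# comma separated list of compute sites whose stage-in jobs run remotely
pegasus.transfer.stagein.remote.sites = siteX,siteY</pre>
<p>With such a setting, the stage-in jobs for these sites would be
    expected to carry the stage_in_remote_ prefix instead of
    stage_in_local_.</p>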
</div>
<div class="section" title="10.5.3. Symlinking Against Input Data">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp16922688"></a>10.5.3. Symlinking Against Input Data</h3></div></div></div>
<p>If input data for a job already exists on a compute site, then it is
    possible for Pegasus to symlink against that data. In this case, the
    remote stage in transfer jobs that Pegasus adds to the executable workflow
    will symlink instead of doing a copy of the data.</p>
<p>Pegasus determines whether a file is on the same site as the compute
    site, by inspecting the pool attribute associated with the URL in the
    Replica Catalog. If the pool attribute of an input file location matches
    the compute site where the job is scheduled, then that particular input
    file is a candidate for symlinking.</p>
<p>For Pegasus to symlink against existing input data on a compute
    site, the following must be true:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>Property <span class="bold"><strong>pegasus.transfer.links</strong></span>
        is set to <span class="bold"><strong>true</strong></span></p></li>
<li class="listitem"><p>The input file location in the Replica Catalog has the pool
        attribute matching the compute site.</p></li>
</ol></div>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>To confirm whether a particular input file is symlinked instead of
      being copied, look for the destination URL for that file in the
      stage_in_remote*.in file. The destination URL will start with
      symlink://.</p>
</div>
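<p>The first condition above could be met with a properties file
    entry such as the one sketched here. The property name and value are
    taken from the list above; the comment and the key = value layout are
    illustrative. The second condition is satisfied in the replica catalog,
    not in the properties file.</p>
<pre class="programlisting"><span class="bold"><strong>#PEGASUS PROPERTIES FILE</strong></span>
# symlink against inputs that already reside on the compute site
pegasus.transfer.links = true</pre>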
<p>In the symlinking case, Pegasus strips out the URL prefix from a URL and
    replaces it with a file URL.</p>
<p>For example, if a user has the following URL catalogued in the
    Replica Catalog for an input file f.input</p>
<pre class="programlisting">f.input   gsiftp://server.isi.edu/shared/storage/input/data/f.input pool="isi"</pre>
<p>and the compute job that requires this file executes on a compute
    site named isi, then, if symlinking is turned on, the data stage-in job
    (stage_in_remote_viz_0) will have the following source and destination
    specified for the file</p>
<pre class="programlisting">#viz viz
file:///shared/storage/input/data/f.input  symlink://shared-scratch/workflow-exec-dir/f.input
</pre>
</div>
<div class="section" title="10.5.4. Addition of Separate Data Movement Nodes to Executable Workflow">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp13332752"></a>10.5.4. Addition of Separate Data Movement Nodes to Executable
    Workflow</h3></div></div></div>
<p>Pegasus relies on a Transfer Refiner to decide how many data movement
    nodes are added to the executable workflow. All the compute jobs
    scheduled to a site share the same workflow specific directory. The
    transfer refiners ensure that only one copy of the input data is
    transferred to the workflow execution directory. This is to prevent data
    clobbering. Data clobbering can occur when compute jobs of a workflow
    share some input files and have different stage-in transfer jobs
    associated with them that are staging the shared files to the same
    destination workflow execution directory.</p>
<p>The default Transfer Refiner used in Pegasus is the Bundle Refiner
    that allows the user to specify how many local|remote stagein|stageout
    jobs are created per execution site.</p>
<p>The behavior of the refiner is controlled by specifying certain
    pegasus profiles</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>either with the execution sites in the site catalog</p></li>
<li class="listitem"><p>OR globally in the properties file</p></li>
</ol></div>
<div class="table">
<a name="idp13088880"></a><p class="title"><b>Table 10.11. Pegasus Profile Keys For the Cluster Transfer Refiner</b></p>
<div class="table-contents"><table summary="Pegasus Profile Keys For the Cluster Transfer Refiner" border="1">
<colgroup>
<col>
<col>
</colgroup>
<thead><tr>
<th>Profile Key</th>
<th>Description</th>
</tr></thead>
<tbody>
<tr>
<td>stagein.clusters</td>
<td>This key determines the maximum number of stage-in jobs
            that can be executed locally or remotely per compute site per
            workflow.</td>
</tr>
<tr>
<td>stagein.local.clusters</td>
<td>This key provides finer grained control in determining the
            number of stage-in jobs that are executed locally and are
            responsible for staging data to a particular remote site.</td>
</tr>
<tr>
<td>stagein.remote.clusters</td>
<td>This key provides finer grained control in determining the
            number of stage-in jobs that are executed remotely on the remote
            site and are responsible for staging data to it.</td>
</tr>
<tr>
<td>stageout.clusters</td>
<td>This key determines the maximum number of stage-out jobs
            that can be executed locally or remotely per compute site per
            workflow.</td>
</tr>
<tr>
<td>stageout.local.clusters</td>
<td>This key provides finer grained control in determining the
            number of stage-out jobs that are executed locally and are
            responsible for staging data from a particular remote
            site.</td>
</tr>
<tr>
<td>stageout.remote.clusters</td>
<td>This key provides finer grained control in determining the
            number of stage-out jobs that are executed remotely on the remote
            site and are responsible for staging data from it.</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break"><div class="figure">
<a name="idp18817392"></a><p class="title"><b>Figure 10.9. Default Transfer Case : Input Data To Workflow Specific Directory
      on Shared File System</b></p>
<div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="100%"><tr><td><img src="images/cluster-transfer-refiner.png" height="360" alt="Default Transfer Case : Input Data To Workflow Specific Directory on Shared File System"></td></tr></table></div></div>
</div>
<br class="figure-break">
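<p>As an illustration, the clustering keys in the table above might be set as
    profiles on a site in the site catalog. The fragment below is only a sketch: it
    assumes that these keys are specified in the <code class="varname">pegasus</code>
    profile namespace, and the values shown are arbitrary.</p>
<pre class="programlisting">&lt;!-- assumption: the keys live in the pegasus profile namespace; values are arbitrary --&gt;
&lt;profile namespace="pegasus" key="stagein.clusters"&gt;4&lt;/profile&gt;
&lt;profile namespace="pegasus" key="stageout.clusters"&gt;2&lt;/profile&gt;</pre>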
</div>
<div class="section" title="10.5.5. Output Mappers">
<div class="titlepage"><div><div><h3 class="title">
<a name="ref_output_mapper"></a>10.5.5. Output Mappers</h3></div></div></div>
<p>Starting with the 4.3 release, Pegasus supports output mappers, which
    give users fine-grained control over how the output files are laid out
    on the output site. By default, Pegasus stages output products to the
    storage directory specified in the site catalog for the output site.</p>
<p>The following mappers are currently supported:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p><span class="bold"><strong>Flat</strong></span> : This is the default mapper. Pegasus
        places the output files in the storage directory specified in the site
        catalog for the output site.</p></li>
<li class="listitem">
<p><span class="bold"><strong>Fixed</strong></span> : This mapper allows
        users to specify an externally accessible URL to the storage directory
        in their properties file. To use this mapper, the following property
        needs to be set.</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><code class="varname">pegasus.dir.storage.mapper.fixed.url</code>: an externally
            accessible URL to the storage directory on the output site, e.g.
            gsiftp://outputs.isi.edu/shared/outputs</p></li></ul></div>
<p>Note: For hierarchical workflows, the above property needs to be
        set separately for each DAX job if you want the sub workflow outputs
        to go to a different directory.</p>
</li>
<li class="listitem"><p><span class="bold"><strong>Hashed</strong></span> : This mapper results in
        the creation of a deep directory structure on the output site while
        populating the results. The base directory on the remote end is
        determined from the site catalog. Depending on the number of files
        being staged to the remote site, a hashed file structure is created
        that ensures that no more than 256 files reside in one directory. To create
        this directory structure on the storage site, Pegasus relies on the
        directory creation feature of the underlying file servers, such as
        the GridFTP server, where this feature appeared in Globus 4.0.x.</p></li>
<li class="listitem">
<p><span class="bold"><strong>Replica: </strong></span>This mapper determines
        the path for an output file on the output site by querying an output
        replica catalog. The output site is the one passed on the command
        line. The output replica catalog can be configured by specifying the
        following properties:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
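<li class="listitem"><p><code class="varname">pegasus.dir.storage.mapper.replica</code>: the type of the
            output replica catalog, either Regex or File</p></li>
<li class="listitem"><p><code class="varname">pegasus.dir.storage.mapper.replica.file</code>: the replica
            catalog file to use as the backend</p></li>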
</ul></div>
</li>
</ol></div>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>The mapper to use is selected by setting the property <span class="bold"><strong>pegasus.dir.storage.mapper</strong></span>.</p>
</div>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>The Fixed mapper will be available starting with the 4.3.1 release.</p>
</div>
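<p>For illustration, a properties file that selects the Fixed mapper could look
    like the following sketch. It assumes that the mapper names listed above are the
    values accepted by <code class="varname">pegasus.dir.storage.mapper</code>. The URL is the
    example URL used above, and the replica catalog file path in the commented
    alternative is hypothetical.</p>
<pre class="programlisting"># select the output mapper (assumed values: Flat, Fixed, Hashed or Replica)
pegasus.dir.storage.mapper               Fixed
# externally accessible URL to the storage directory on the output site
pegasus.dir.storage.mapper.fixed.url     gsiftp://outputs.isi.edu/shared/outputs

# alternative: the Replica mapper backed by a file (hypothetical path)
# pegasus.dir.storage.mapper               Replica
# pegasus.dir.storage.mapper.replica       File
# pegasus.dir.storage.mapper.replica.file  /path/to/output.replicas</pre>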
</div>
<div class="section" title="10.5.6. Executable Used for Transfer Jobs">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp13849840"></a>10.5.6. Executable Used for Transfer Jobs</h3></div></div></div>
<p>Pegasus uses a Python script called <span class="bold"><strong>pegasus-transfer</strong></span> as the executable in the transfer
    jobs. pegasus-transfer is a Python-based wrapper
    around various transfer clients. It looks at the source and
    destination URLs and automatically determines which underlying client to
    use. pegasus-transfer is distributed with Pegasus and can be found at
    $PEGASUS_HOME/bin/pegasus-transfer.</p>
<p>Currently, pegasus-transfer interfaces with the following transfer
    clients:</p>
<div class="table">
<a name="idp14382672"></a><p class="title"><b>Table 10.12. Transfer Clients interfaced to by pegasus-transfer</b></p>
<div class="table-contents"><table summary="Transfer Clients interfaced to by pegasus-transfer" border="1">
<colgroup>
<col>
<col>
</colgroup>
<thead><tr>
<th>Transfer Client</th>
<th>Used For</th>
</tr></thead>
<tbody>
<tr>
<td>globus-url-copy</td>
<td>staging files to and from a gridftp server.</td>
</tr>
<tr>
<td>lcg-copy</td>
<td>staging files to and from an SRM server.</td>
</tr>
<tr>
<td>wget</td>
<td>staging files from an HTTP server.</td>
</tr>
<tr>
<td>cp</td>
<td>copying files from a POSIX filesystem.</td>
</tr>
<tr>
<td>ln</td>
<td>symlinking against input files.</td>
</tr>
<tr>
<td>pegasus-s3/s3cmd</td>
<td>staging files to and from an S3 bucket in the Amazon
            cloud.</td>
</tr>
<tr>
<td>scp</td>
<td>staging files using scp</td>
</tr>
<tr>
<td>iget</td>
<td>staging files to and from an iRODS server.</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break"><p>For remote sites, Pegasus constructs the default path to
    pegasus-transfer on the basis of the PEGASUS_HOME env profile specified in the
    site catalog. To specify a different path to the pegasus-transfer client,
    users can add an entry to the transformation catalog with the fully
    qualified logical name <span class="bold"><strong>pegasus::pegasus-transfer</strong></span>.</p>
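<p>As a sketch, such an entry might look as follows in a text-format
    transformation catalog. The site handle <code class="varname">condorpool</code> and the
    installation path are hypothetical, and the entry assumes the multi-line text
    format of the transformation catalog is in use.</p>
<pre class="programlisting"># hypothetical site handle and path; adjust to your installation
tr pegasus::pegasus-transfer {
    site condorpool {
        pfn "/opt/pegasus/bin/pegasus-transfer"
        arch "x86_64"
        os "linux"
        type "INSTALLED"
    }
}</pre>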
</div>
<div class="section" title="10.5.7. Executables used for Directory Creation and Cleanup Jobs">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp24601488"></a>10.5.7. Executables used for Directory Creation and Cleanup Jobs</h3></div></div></div>
<p>Starting with 4.0, Pegasus has changed the way the scratch
    directories are created on the staging site. The planner now prefers to
    schedule the directory creation and cleanup jobs locally. The jobs refer
    to Python-based tools that call out to protocol-specific clients.
    For protocols where specific remote
    cleanup and directory creation clients don't exist (for example, gridftp),
    the Python tools rely on the corresponding transfer tool to create a
    directory by initiating a transfer of an empty file. The Python clients
    used to create directories and remove files are called:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>pegasus-create-dir</p></li>
<li class="listitem"><p>pegasus-cleanup</p></li>
</ul></div>
<p>Both these clients inspect the URLs to determine which underlying
    client to pick up.</p>
<div class="table">
<a name="idp13233536"></a><p class="title"><b>Table 10.13. Clients interfaced to by pegasus-create-dir</b></p>
<div class="table-contents"><table summary="Clients interfaced to by pegasus-create-dir" border="1">
<colgroup>
<col>
<col>
</colgroup>
<thead><tr>
<th>Client</th>
<th>Used For</th>
</tr></thead>
<tbody>
<tr>
<td>globus-url-copy</td>
<td>to create directories against a gridftp/ftp server</td>
</tr>
<tr>
<td>srm-mkdir</td>
<td>to create directories against an SRM server.</td>
</tr>
<tr>
<td>mkdir</td>
<td>to create a directory on the local filesystem</td>
</tr>
<tr>
<td>pegasus-s3</td>
<td>to create an S3 bucket in the Amazon cloud</td>
</tr>
<tr>
<td>scp</td>
<td>to create a directory against an scp server</td>
</tr>
<tr>
<td>imkdir</td>
<td>to create a directory against an iRODS server</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break"><div class="table">
<a name="idp12833888"></a><p class="title"><b>Table 10.14. Clients interfaced to by pegasus-cleanup</b></p>
<div class="table-contents"><table summary="Clients interfaced to by pegasus-cleanup" border="1">
<colgroup>
<col>
<col>
</colgroup>
<thead><tr>
<th>Client</th>
<th>Used For</th>
</tr></thead>
<tbody>
<tr>
<td>globus-url-copy</td>
<td>to remove a file against a gridftp/ftp server. In this case,
            a zero-byte file is created in its place.</td>
</tr>
<tr>
<td>srm-rm</td>
<td>to remove files against an SRM server.</td>
</tr>
<tr>
<td>rm</td>
<td>to remove a file on the local filesystem</td>
</tr>
<tr>
<td>pegasus-s3</td>
<td>to remove a file from the S3 bucket.</td>
</tr>
<tr>
<td>scp</td>
<td>to remove a file against an scp server. In this case, a
            zero-byte file is created in its place.</td>
</tr>
<tr>
<td>irm</td>
<td>to remove a file against an iRODS server</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break"><p>The create dir and cleanup jobs are scheduled
    to run remotely only when a file server is specified for the staging
    site.</p>
</div>
<div class="section" title="10.5.8. Credentials Staging">
<div class="titlepage"><div><div><h3 class="title">
<a name="cred_staging"></a>10.5.8. Credentials Staging</h3></div></div></div>
<p>Pegasus tries to do data staging from localhost by default, but some
    data scenarios require <a class="link" href="reference.php#local_vs_remote_transfers" title="10.5.2. Local versus Remote Transfers">remote
    jobs to do data staging</a>. An example of such a case is when running in
    <a class="link" href="reference.php#ref_data_staging_configuration" title="10.5.1. Data Staging Configuration">nonsharedfs</a> mode.
    Depending on the transfer protocols used, the job may have to carry
    credentials to enable these data transfers. To specify which
    credential to use and where Pegasus can find it, use environment variable
    profiles in your site catalog. The supported credential types are X.509
    grid proxies, Amazon AWS S3 keys, iRODS passwords and SSH keys.</p>
<div class="section" title="10.5.8.1. X.509 Grid Proxies">
<div class="titlepage"><div><div><h4 class="title">
<a name="x509_cred"></a>10.5.8.1. X.509 Grid Proxies</h4></div></div></div>
<p>If the grid proxy is required by transfer jobs, and the proxy is
      in the standard location, Pegasus will pick the proxy up automatically.
      For non-standard proxy locations, you can use the
      <code class="varname">X509_USER_PROXY</code> environment variable. Site catalog
      example:</p>
<pre class="programlisting">&lt;profile namespace="env" key="X509_USER_PROXY" &gt;/some/location/x509up&lt;/profile&gt;</pre>
</div>
<div class="section" title="10.5.8.2. Amazon AWS S3">
<div class="titlepage"><div><div><h4 class="title">
<a name="s3_cred"></a>10.5.8.2. Amazon AWS S3</h4></div></div></div>
<p>If a workflow is using s3 URLs, Pegasus has to be told where to
      find the .s3cfg file. The format of the file is described in the <a class="link" href="cli-pegasus-s3.php" title="pegasus-s3">pegasus-s3 command line client's man
      page</a>. For the file to be picked up by the workflow, set the
      <code class="varname">S3CFG</code> environment profile to the location of the
      file. Site catalog example:</p>
<pre class="programlisting">&lt;profile namespace="env" key="S3CFG" &gt;/home/user/.s3cfg&lt;/profile&gt;</pre>
</div>
<div class="section" title="10.5.8.3. iRods Password">
<div class="titlepage"><div><div><h4 class="title">
<a name="irods_cred"></a>10.5.8.3. iRods Password</h4></div></div></div>
<p>If a workflow is using irods URLs, Pegasus has to be given an
      irodsEnv file. It is a standard file, with the addition of a password
      attribute. Example:</p>
<pre class="programlisting"># iRODS personal configuration file.
#
# iRODS server host name:
irodsHost 'iren.renci.org'
# iRODS server port number:
irodsPort 1259

# Default storage resource name:
irodsDefResource 'renResc'
# Home directory in iRODS:
irodsHome '/tip-renci/home/mats'
# Current directory in iRODS:
irodsCwd '/tip-renci/home/mats'
# Account name:
irodsUserName 'mats'
# Zone:
irodsZone 'tip-renci' 

# this is used with Pegasus
irodsPassword 'somesecretpassword'</pre>
<p>The location of the file can be given to the workflow using the
      <code class="varname">irodsEnvFile</code> environment profile. Site catalog
      example:</p>
<pre class="programlisting">&lt;profile namespace="env" key="irodsEnvFile" &gt;/home/user/.irods/.irodsEnv&lt;/profile&gt;</pre>
</div>
<div class="section" title="10.5.8.4. SSH Keys">
<div class="titlepage"><div><div><h4 class="title">
<a name="ssh_cred"></a>10.5.8.4. SSH Keys</h4></div></div></div>
<p>New in Pegasus 4.0 is support for data staging with scp using
      ssh public/private key authentication. In this mode, Pegasus transports
      a private key with the jobs. The storage machines must have the
      public part of the key listed in ~/.ssh/authorized_keys.</p>
<div class="warning" title="Warning" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Warning</h3>
<p>SSH keys should be handled in a secure manner. In order to keep
        your personal ssh keys secure, it is recommended that a special set of
        keys be created for use with the workflow. Note that Pegasus will not
        pick up ssh keys automatically. The user has to specify which
        key to use with <code class="varname">SSH_PRIVATE_KEY</code>.</p>
</div>
<p>The location of the ssh private key can be specified with the
      <code class="varname">SSH_PRIVATE_KEY</code> environment profile. Site catalog
      example:</p>
<pre class="programlisting">&lt;profile namespace="env" key="SSH_PRIVATE_KEY" &gt;/home/user/wf/wfsshkey&lt;/profile&gt;</pre>
</div>
</div>
<div class="section" title="10.5.9. Staging of Executables">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp12641440"></a>10.5.9. Staging of Executables</h3></div></div></div>
<p>Users can get Pegasus to stage the user executables (executables
    that the jobs in the DAX refer to) as part of the transfer jobs to the
    workflow specific execution directory on the compute site. The URL
    locations of the executables need to be specified in the transformation
    catalog as the PFN and the type of executable needs to be set to <span class="bold"><strong>STAGEABLE</strong></span>.</p>
<p>The location of a transformation can be specified in either of the following:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>DAX in the executables section. More details <a class="link" href="reference.php#dax_transformation_catalog" title="10.9.1.1.3.. The Transformation Catalog Section">here</a> .</p></li>
<li class="listitem"><p>Transformation Catalog. More details <a class="link" href="creating_workflows.php#transformation" title="4.4. Executable Discovery (Transformation Catalog)">here</a> .</p></li>
</ul></div>
<p>A particular transformation catalog entry of type STAGEABLE is
    compatible with a compute site only if all the System Information
    attributes associated with the entry match the System Information
    attributes for the compute site in the Site Catalog. The following
    attributes make up the System Information attributes:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>arch</p></li>
<li class="listitem"><p>os</p></li>
<li class="listitem"><p>osrelease</p></li>
<li class="listitem"><p>osversion</p></li>
</ol></div>
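<p>To make the matching rule concrete, a STAGEABLE entry in a text-format
    transformation catalog could look like the sketch below. The transformation
    name, URL, and attribute values are hypothetical; the entry would be considered
    compatible only with compute sites whose System Information attributes in the
    Site Catalog match the ones given here.</p>
<pre class="programlisting"># hypothetical transformation, URL and attribute values
tr example::analyze:1.0 {
    site local {
        pfn "http://example.org/binaries/analyze"
        arch "x86_64"
        os "linux"
        osrelease "rhel"
        osversion "6"
        type "STAGEABLE"
    }
}</pre>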
<div class="section" title="10.5.9.1. Transformation Mappers">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp19106096"></a>10.5.9.1. Transformation Mappers</h4></div></div></div>
<p>Pegasus has a notion of transformation mappers that determine
      what type of executables are picked up when a job is executed on a
      remote compute site. For transfer of executables, Pegasus constructs a
      soft state map that resides on top of the transformation catalog and
      helps in determining the locations from which an executable can be
      staged to the remote site.</p>
<p>Users can specify the following property to pick up a specific
      transformation mapper:</p>
<pre class="programlisting"><span class="bold"><strong>pegasus.catalog.transformation.mapper</strong></span> </pre>
<p>Currently, the following transformation mappers are
      supported.</p>
<div class="table">
<a name="idp22056752"></a><p class="title"><b>Table 10.15. Transformation Mappers Supported in Pegasus</b></p>
<div class="table-contents"><table summary="Transformation Mappers Supported in Pegasus" border="1">
<colgroup>
<col>
<col>
</colgroup>
<thead><tr>
<th>Transformation Mapper</th>
<th>Description</th>
</tr></thead>
<tbody>
<tr>
<td>Installed</td>
<td>This mapper only relies on transformation catalog entries
              that are of type INSTALLED to construct the soft state map. This
              results in Pegasus never doing any transfer of executables as
              part of the workflow. It always prefers the installed
              executables at the remote sites</td>
</tr>
<tr>
<td>Staged</td>
<td>This mapper only relies on matching transformation
              catalog entries that are of type STAGEABLE to construct the soft
              state map. This results in the executable workflow referring
              only to the staged executables, irrespective of the fact that
              the executables are already installed at the remote end</td>
</tr>
<tr>
<td>All</td>
<td>This mapper relies on all matching transformation catalog
              entries of type STAGEABLE or INSTALLED for a particular
              transformation as valid sources for the transfer of executables.
              This is the most general mode, and results in the map being
              constructed as the cartesian product of the matches.</td>
</tr>
<tr>
<td>Submit</td>
<td>This mapper relies only on matching transformation catalog
              entries that are of type STAGEABLE and reside at the submit host
              (pool local) to construct the soft state map.
              This is especially helpful when users want to use the
              latest compute code for their computations on the grid and that
              code resides on their submit host.</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break">
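<p>For example, a properties file entry that restricts Pegasus to staged
      executables might look like the following sketch. It assumes that the mapper
      names in the table above are the values accepted by the property.</p>
<pre class="programlisting"># assumed values: Installed, Staged, All or Submit (names taken from the table above)
pegasus.catalog.transformation.mapper    Staged</pre>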
</div>
</div>
<div class="section" title="10.5.10. Staging of Pegasus Worker Package">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp14063568"></a>10.5.10. Staging of Pegasus Worker Package</h3></div></div></div>
<p>Pegasus can optionally stage the Pegasus worker package as part of
    the executable workflow to the remote workflow specific execution directory.
    The Pegasus worker package contains the Pegasus auxiliary executables that
    are required on the remote site. If the worker package is not staged as
    part of the executable workflow, then Pegasus relies on the installed
    version of the worker package on the remote site. To determine the
    location of the installed version of the worker package on a remote site,
    Pegasus looks for an environment profile PEGASUS_HOME for the site in the
    Site Catalog.</p>
<p>Users can set the following property to true to turn on worker
    package staging:</p>
<pre class="programlisting"><span class="bold"><strong>pegasus.transfer.worker.package          true</strong></span> </pre>
<p>By default, when worker package staging is turned on, Pegasus pulls
    the compatible worker package from the Pegasus website. To specify a
    different worker package location, users can specify the transformation
    <span class="bold"><strong>pegasus::worker</strong></span> in the transformation
    catalog with</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>type set to STAGEABLE</p></li>
<li class="listitem"><p>System Information attributes of the transformation catalog
        entry match the System Information attributes of the compute
        site.</p></li>
<li class="listitem"><p>the PFN specified should be a remote URL that can be pulled to
        the compute site.</p></li>
</ul></div>
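<p>A transformation catalog entry satisfying the three conditions above might
    look like the following sketch in the text format. The site handle, URL, and
    System Information values are hypothetical and would need to match your own
    compute site.</p>
<pre class="programlisting"># hypothetical site handle and URL; arch/os must match the compute site
tr pegasus::worker {
    site example_site {
        pfn "http://example.org/downloads/pegasus-worker.tar.gz"
        arch "x86_64"
        os "linux"
        type "STAGEABLE"
    }
}</pre>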
<div class="section" title="10.5.10.1. Worker Package Staging in Non Shared Filesystem setup">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp12515824"></a>10.5.10.1. Worker Package Staging in Non Shared Filesystem setup</h4></div></div></div>
<p>Worker package staging is automatically set to true when
      workflows are set up to run in a non shared filesystem setup, i.e.
      <span class="bold"><strong>pegasus.data.configuration</strong></span> is set to
      <span class="bold"><strong>nonsharedfs</strong></span> or <span class="bold"><strong>condorio</strong></span>. In these configurations, a
      stage_worker job is created that brings in the worker package to the
      submit directory of the workflow. For each job, the worker package is
      then transferred with the job using Condor File Transfers (<span class="bold"><strong>transfer_input_files</strong></span>). This transfer always
      happens unless PEGASUS_HOME is specified in the site catalog for the
      site on which the job is scheduled to run.</p>
<p>Users can explicitly set the following property to false to turn
      off worker package staging by the planner. This is applicable when
      running in the cloud and the virtual machines / worker nodes already have
      the Pegasus worker tools installed.</p>
<pre class="programlisting"><span class="bold"><strong>pegasus.transfer.worker.package          false</strong></span> </pre>
</div>
</div>
<div class="section" title="10.5.11. Using Amazon S3 as a Staging Site">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp12098064"></a>10.5.11. Using Amazon S3 as a Staging Site</h3></div></div></div>
<p>Pegasus can be configured to use Amazon S3 as a staging site. In
    this mode, Pegasus transfers workflow inputs from the input site to S3.
    When a job runs, the inputs for that job are fetched from S3 to the worker
    node, the job is executed, then the output files are transferred from the
    worker node back to S3. When the jobs are complete, Pegasus transfers the
    output data from S3 to the output site.</p>
<p>In order to use S3, it is necessary to create a config file for the
    S3 transfer client, <a class="link" href="cli-pegasus-s3.php" title="pegasus-s3">pegasus-s3</a>. See
    the <a class="link" href="cli-pegasus-s3.php" title="pegasus-s3">man page</a> for details on how to
    create the config file. You also need to specify <a class="link" href="running_workflows.php#non_shared_fs" title="5.3.2. Non Shared Filesystem">S3 as a staging site</a>.</p>
<p>Next, you need to modify your site catalog to specify the location of
    your s3cfg file. See <a class="link" href="reference.php#cred_staging" title="10.5.8. Credentials Staging">the section on
    credential staging</a>.</p>
<p>The following site catalog shows how to specify the location of the
    s3cfg file on the local site and how to specify an Amazon S3 staging
    site:</p>
<pre class="programlisting">&lt;sitecatalog xmlns="http://pegasus.isi.edu/schema/sitecatalog"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://pegasus.isi.edu/schema/sitecatalog
             http://pegasus.isi.edu/schema/sc-3.0.xsd" version="3.0"&gt;
    &lt;site handle="local" arch="x86_64" os="LINUX"&gt;
        &lt;head-fs&gt;
            &lt;scratch&gt;
                &lt;shared&gt;
                    &lt;file-server protocol="file" url="file://" mount-point="/tmp/wf/work"/&gt;
                    &lt;internal-mount-point mount-point="/tmp/wf/work"/&gt;
                &lt;/shared&gt;
            &lt;/scratch&gt;
            &lt;storage&gt;
                &lt;shared&gt;
                    &lt;file-server protocol="file" url="file://" mount-point="/tmp/wf/storage"/&gt;
                    &lt;internal-mount-point mount-point="/tmp/wf/storage"/&gt;
                &lt;/shared&gt;
            &lt;/storage&gt;
        &lt;/head-fs&gt;
        <span class="bold"><strong>&lt;profile namespace="env" key="S3CFG"&gt;/home/username/.s3cfg&lt;/profile&gt;</strong></span>
    &lt;/site&gt;
    <span class="bold"><strong>&lt;site handle="s3" arch="x86_64" os="LINUX"&gt;
        &lt;head-fs&gt;
            &lt;scratch&gt;
                &lt;shared&gt;
                    &lt;!-- wf-scratch is the name of the S3 bucket that will be used --&gt;
                    &lt;file-server protocol="s3" url="s3://user@amazon" mount-point="/wf-scratch"/&gt;
                    &lt;internal-mount-point mount-point="/wf-scratch"/&gt;
                &lt;/shared&gt;
            &lt;/scratch&gt;
        &lt;/head-fs&gt;
    &lt;/site&gt;</strong></span>
    &lt;site handle="condorpool" arch="x86_64" os="LINUX"&gt;
        &lt;head-fs&gt;
            &lt;scratch/&gt;
            &lt;storage/&gt;
        &lt;/head-fs&gt;
        &lt;profile namespace="pegasus" key="style"&gt;condor&lt;/profile&gt;
        &lt;profile namespace="condor" key="universe"&gt;vanilla&lt;/profile&gt;
        &lt;profile namespace="condor" key="requirements"&gt;(Target.Arch == "X86_64")&lt;/profile&gt;
    &lt;/site&gt;
&lt;/sitecatalog&gt;
</pre>
</div>
<div class="section" title="10.5.12. iRODS data access">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp15427664"></a>10.5.12. iRODS data access</h3></div></div></div>
<p>iRODS can be used as an input data location, a storage site for intermediate
        data during workflow execution, or a location for final output data. Pegasus uses
        a URL notation to identify iRODS files. Example:
    </p>
<pre class="programlisting">irods://some-host.org/path/to/file.txt</pre>
<p>The path to the file is <span class="bold"><strong>relative</strong></span> to the internal
        iRODS location. In the example above, the path used to refer to the file in iRODS is
        <span class="emphasis"><em>path/to/file.txt</em></span> (no leading /).
    </p>
<p>See <a class="link" href="reference.php#cred_staging" title="10.5.8. Credentials Staging">the section on credential staging</a> for 
        information on how to set up an irodsEnv file to be used by Pegasus.
    </p>
</div>
</div>
<div class="section" title="10.6. Hierarchical Workflows">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="hierarchial_workflows"></a>10.6. Hierarchical Workflows</h2></div></div></div>
<div class="toc"><dl>
<dt><span class="section"><a href="reference.php#idp14855440">10.6.1. Introduction</a></span></dt>
<dt><span class="section"><a href="reference.php#idp13898064">10.6.2. Specifying a DAX Job in the DAX</a></span></dt>
<dt><span class="section"><a href="reference.php#idp18700720">10.6.3. Specifying a DAG Job in the DAX</a></span></dt>
<dt><span class="section"><a href="reference.php#idp11976352">10.6.4. File Dependencies Across DAX Jobs</a></span></dt>
<dt><span class="section"><a href="reference.php#idp13061488">10.6.5. Recursion in Hierarchal Workflows</a></span></dt>
<dt><span class="section"><a href="reference.php#idp13959056">10.6.6. Example</a></span></dt>
</dl></div>
<div class="section" title="10.6.1. Introduction">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp14855440"></a>10.6.1. Introduction</h3></div></div></div>
<p>In addition to compute jobs, the Abstract Workflow can
    also contain jobs that refer to other workflows. This is useful for
    running large workflows or ensembles of workflows.</p>
<p>Users can embed two types of workflow jobs in the DAX:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">
<p>daxjob - refers to a sub workflow represented as a DAX. During
        the planning of a workflow, the DAX jobs are mapped to condor dagman
        jobs that have, as their prescript, a pegasus-plan invocation on the DAX
        referred to in the DAX job.</p>
<div class="figure">
<a name="idp16649408"></a><p class="title"><b>Figure 10.10. Planning of a DAX Job</b></p>
<div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="100%"><tr><td><img src="./images/daxjob-mapping.png" height="360" alt="Planning of a DAX Job"></td></tr></table></div></div>
</div>
<br class="figure-break">
</li>
<li class="listitem">
<p>dagjob - refers to a sub workflow represented as a DAG. During
        the planning of a workflow, the DAG jobs are mapped to condor dagman
        and refer to the DAG file mentioned in the DAG job.</p>
<div class="figure">
<a name="idp14440224"></a><p class="title"><b>Figure 10.11. Planning of a DAG Job</b></p>
<div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="100%"><tr><td><img src="./images/dagjob-mapping.png" height="360" alt="Planning of a DAG Job"></td></tr></table></div></div>
</div>
<br class="figure-break">
</li>
</ol></div>
</div>
<div class="section" title="10.6.2. Specifying a DAX Job in the DAX">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp13898064"></a>10.6.2. Specifying a DAX Job in the DAX</h3></div></div></div>
<p>Specifying a DAXJob in a DAX is similar to how normal compute
    jobs are specified. There are minor differences in terms of the XML
    element name (dax vs job) and the attributes specified. The DAXJob XML
    specification is described in detail in the <a class="link" href="reference.php#api" title="10.9. API Reference">chapter on
    DAX API</a>. An example DAX Job in a DAX is shown below:</p>
<a name="dax_job_example"></a><pre class="programlisting">  &lt;dax id="ID000002" name="black.dax" node-label="bar" &gt;
    &lt;profile namespace="dagman" key="maxjobs"&gt;10&lt;/profile&gt;
    &lt;argument&gt;-Xmx1024 -Xms512 -Dpegasus.dir.storage=storagedir  -Dpegasus.dir.exec=execdir -o local -vvvvv --force -s dax_site &lt;/argument&gt;
  &lt;/dax&gt;</pre>
<div class="section" title="10.6.2.1. DAX File Locations">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp17107296"></a>10.6.2.1. DAX File Locations</h4></div></div></div>
<p>The name attribute in the dax element refers to the LFN ( Logical
      File Name ) of the dax file. The location of the DAX file can be
      catalogued either in the</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>Replica Catalog</p></li>
<li class="listitem">
<p>Replica Catalog Section in the <a class="link" href="reference.php#dax_replica_catalog" title="10.9.1.1.3.1. The Replica Catalog Section">DAX</a> .</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>Currently, only file URLs on the local site (submit host)
              can be specified as DAX file locations.</p>
</div>
</li>
</ol></div>
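<p>As a sketch, a DAX file location could be catalogued in the replica catalog
      section of the parent DAX along the following lines. The path is hypothetical,
      and the element and attribute names are assumed to follow the replica catalog
      section of the DAX schema referenced above.</p>
<pre class="programlisting">&lt;!-- hypothetical path; black.dax is the LFN given in the name attribute of the dax element --&gt;
&lt;file name="black.dax"&gt;
    &lt;pfn url="file:///home/user/workflows/black.dax" site="local"/&gt;
&lt;/file&gt;</pre>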
</div>
<div class="section" title="10.6.2.2. Arguments for a DAX Job">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp14727200"></a>10.6.2.2. Arguments for a DAX Job</h4></div></div></div>
<p>Users can specify arguments for the DAX Jobs. The
      arguments specified for the DAX Jobs are passed to the pegasus-plan
      invocation in the prescript for the corresponding condor dagman job in
      the executable workflow.</p>
<p>The following options for pegasus-plan are inherited from the
      pegasus-plan invocation of the parent workflow. If an option is
      specified in the arguments section for the DAX Job then that overrides
      what is inherited.</p>
<div class="table">
<a name="idp24909696"></a><p class="title"><b>Table 10.16. Options inherited from parent workflow</b></p>
<div class="table-contents"><table summary="Options inherited from parent workflow" border="1">
<colgroup>
<col>
<col>
</colgroup>
<thead><tr>
<th>Option Name</th>
<th>Description</th>
</tr></thead>
<tbody><tr>
<td>--sites</td>
<td>list of execution sites.</td>
</tr></tbody>
</table></div>
</div>
<br class="table-break"><p>It is highly recommended that users <span class="bold"><strong>do not
      specify</strong></span> directory-related options in the arguments section
      for DAX Jobs. Pegasus assigns values to the following options for sub
      workflows automatically.</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>--relative-dir</p></li>
<li class="listitem"><p>--dir</p></li>
<li class="listitem"><p>--relative-submit-dir</p></li>
</ol></div>
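<p>The override rule described above can be pictured as a simple dictionary
      merge. The following Python sketch is only an illustration and not how
      pegasus-plan is implemented: the function names effective_options and
      directory_options_used are hypothetical, and options are modelled as a
      plain mapping from option name to value.</p>
<pre class="programlisting"># Directory-related options that Pegasus assigns for sub workflows itself.
DIRECTORY_OPTIONS = ("--relative-dir", "--dir", "--relative-submit-dir")


def effective_options(inherited, dax_job_arguments):
    """Merge options: values given for the DAX Job override inherited ones."""
    merged = dict(inherited)
    merged.update(dax_job_arguments)
    return merged


def directory_options_used(dax_job_arguments):
    """Return the directory-related options a DAX Job sets explicitly."""
    return [name for name in DIRECTORY_OPTIONS if name in dax_job_arguments]


parent = {"--sites": "local"}
print(effective_options(parent, {}))                      # inherits --sites
print(effective_options(parent, {"--sites": "isi_viz"}))  # DAX Job value wins
print(directory_options_used({"--dir": "/tmp/run"}))      # options to avoid</pre>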
</div>
<div class="section" title="10.6.2.3. Profiles for DAX Job">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp13575328"></a>10.6.2.3. Profiles for DAX Job</h4></div></div></div>
<p>Users can specify dagman profiles with the DAX Job to
      control the behavior of the corresponding condor dagman instance in the
      executable workflow. In the example <a class="link" href="reference.php#dax_job_example">above</a>, maxjobs is set to 10 for the sub
      workflow.</p>
</div>
<div class="section" title="10.6.2.4. Execution of the PRE script and Condor DAGMan instance">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp13913312"></a>10.6.2.4. Execution of the PRE script and Condor DAGMan instance</h4></div></div></div>
<p>The pegasus-plan that is invoked as part of the prescript to the
      condor dagman job is executed on the submit host. The output of
      pegasus-plan is redirected to a log file (ending with the suffix
      pre.log) in the submit directory of the workflow that contains the DAX
      Job. The path to pegasus-plan is determined automatically.</p>
<p>The DAX Job maps to a Condor DAGMan job. The path to the condor dagman
      binary is determined according to the following rules, in order:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>the entry in the transformation catalog for condor::dagman for
          site local, else</p></li>
<li class="listitem"><p>the value of CONDOR_HOME from the environment, if
          specified, with the path to condor dagman set to
          $CONDOR_HOME/bin/condor_dagman, else</p></li>
<li class="listitem"><p>the value of CONDOR_LOCATION from the environment, if
          specified, with the path to condor dagman set to
          $CONDOR_LOCATION/bin/condor_dagman, else</p></li>
<li class="listitem"><p>the path to condor dagman found in the
          user's PATH</p></li>
</ol></div>
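<p>The lookup order above can be sketched in Python as follows. This is a
      minimal illustration under stated assumptions, not the Pegasus
      implementation: the function name resolve_condor_dagman and the
      tc_entry parameter (standing in for a transformation catalog entry for
      condor::dagman on site local) are hypothetical, and POSIX-style paths
      are assumed.</p>
<pre class="programlisting">import os
import shutil


def resolve_condor_dagman(tc_entry=None, env=None):
    """Return the condor_dagman path using the lookup order listed above.

    tc_entry is a hypothetical stand-in for the transformation catalog
    entry for condor::dagman on site local.
    """
    env = os.environ if env is None else env
    # Rule 1: an entry in the transformation catalog wins.
    if tc_entry:
        return tc_entry
    # Rules 2 and 3: CONDOR_HOME first, then CONDOR_LOCATION.
    for variable in ("CONDOR_HOME", "CONDOR_LOCATION"):
        if env.get(variable):
            return env[variable] + "/bin/condor_dagman"
    # Rule 4: search the directories named in PATH (None if not found).
    return shutil.which("condor_dagman", path=env.get("PATH"))</pre>
<p>Passing an explicit env mapping, instead of relying on os.environ,
      makes the order easy to check without changing the real
      environment.</p>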
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Tip</h3>
<p>It is recommended that users set dagman.maxpre in their properties
        file to control the maximum number of pegasus-plan instances launched
        by each running dagman instance.</p>
</div>
</div>
</div>
<div class="section" title="10.6.3. Specifying a DAG Job in the DAX">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp18700720"></a>10.6.3. Specifying a DAG Job in the DAX</h3></div></div></div>
<p>Specifying a DAG Job in a DAX is similar to specifying a normal compute
    job. There are minor differences in the XML
    element name (dag instead of job) and the attributes specified. For DAG Job XML
    details, see the <a class="link" href="reference.php#api" title="10.9. API Reference">API Reference</a> chapter. An
    example DAG Job in a DAX is shown below.</p>
<a name="dag_job_example"></a><pre class="programlisting">  &lt;dag id="ID000003" name="black.dag" node-label="foo" &gt;
    &lt;profile namespace="dagman" key="maxjobs"&gt;10&lt;/profile&gt;
    &lt;profile namespace="dagman" key="DIR"&gt;/dag-dir/test&lt;/profile&gt;
  &lt;/dag&gt;</pre>
<div class="section" title="10.6.3.1. DAG File Locations">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp14530480"></a>10.6.3.1. DAG File Locations</h4></div></div></div>
<p>The name attribute in the dag element refers to the LFN (Logical
      File Name) of the DAG file. The location of the DAG file can be
      catalogued in either of the following:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>Replica Catalog</p></li>
<li class="listitem">
<p>Replica Catalog Section in the DAX.</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>Currently, only file URLs on the local site (submit host)
              can be specified as DAG file locations.</p>
</div>
</li>
</ol></div>
</div>
<div class="section" title="10.6.3.2. Profiles for DAG Job">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp26320528"></a>10.6.3.2. Profiles for DAG Job</h4></div></div></div>
<p>Users can specify dagman profiles with the DAG Job to
      control the behavior of the corresponding condor dagman instance in the
      executable workflow. In the example above, maxjobs is set to 10 for the
      sub workflow.</p>
<p>The dagman profile DIR allows users to specify the directory in
      which the condor dagman instance should execute. In the example
      <a class="link" href="reference.php#dag_job_example">above</a>, black.dag is set to be
      executed in the directory /dag-dir/test. This directory must be
      created beforehand.</p>
</div>
</div>
<div class="section" title="10.6.4. File Dependencies Across DAX Jobs">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp11976352"></a>10.6.4. File Dependencies Across DAX Jobs</h3></div></div></div>
<p>In hierarchical workflows, if a sub workflow generates output
    files required by another sub workflow, then there should be an edge
    connecting the two DAX jobs. Pegasus ensures that the prescript for
    the child sub workflow has the path to the cache file generated during
    the planning of the parent sub workflow. The cache file in the submit
    directory for a workflow is a textual replica catalog that lists the
    locations of all the output files created in the remote workflow execution
    directory when the workflow executes.</p>
<p>This automatic passing of the cache file to a child sub workflow
    ensures that the datasets from the same workflow run are used. However,
    it also means that Pegasus prefers the locations in the cache file
    over all other locations in the Replica Catalog. If you
    need Replica Selection to also consider locations in the Replica Catalog,
    set the following property.</p>
<pre class="programlisting"><span class="bold"><strong>pegasus.catalog.replica.cache.asrc  true</strong></span></pre>
<p>This is useful when you are staging out the
    output files to a storage site and want the child sub workflow to
    stage these files from the output storage site instead of the workflow
    execution directory where the files were originally created.</p>
</div>
<div class="section" title="10.6.5. Recursion in Hierarchal Workflows">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp13061488"></a>10.6.5. Recursion in Hierarchical Workflows</h3></div></div></div>
<p>A user can add DAX jobs to a DAX that already
    contains DAX jobs. Pegasus does not place a limit on how many
    levels of recursion a user can have in their workflows. From Pegasus's
    perspective, recursion in hierarchical workflows ends when a DAX with only
    compute jobs is encountered. However, the levels of recursion are limited
    by the system resources consumed by the DAGMan processes that are running
    (each level of nesting produces another DAGMan process).</p>
<p>The figure below illustrates an example with recursion 2 levels
    deep.</p>
<div class="figure">
<a name="idp15227248"></a><p class="title"><b>Figure 10.12. Recursion in Hierarchical Workflows</b></p>
<div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="100%"><tr><td><img src="./images/recursion_in_hierarchal_workflows.png" height="360" alt="Recursion in Hierarchal Workflows"></td></tr></table></div></div>
</div>
<br class="figure-break"><p>The execution time-line of the various jobs in the above figure is
    illustrated below.</p>
<div class="figure">
<a name="idp11999696"></a><p class="title"><b>Figure 10.13. Execution Time-line for Hierarchical Workflows</b></p>
<div class="figure-contents"><div class="mediaobject"><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="100%"><tr><td><img src="./images/hierarchal_workflows_execution_timeline.png" height="360" alt="Execution Time-line for Hierarchal Workflows"></td></tr></table></div></div>
</div>
<br class="figure-break">
</div>
<div class="section" title="10.6.6. Example">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp13959056"></a>10.6.6. Example</h3></div></div></div>
<p>The Galactic Plane workflow is a hierarchical workflow of many
    Montage workflows. For details, see <a class="link" href="example_workflows.php" title="Chapter 9. Example Workflows">Workflow of Workflows</a>.</p>
</div>
</div>
<div class="section" title="10.7. Notifications">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="notifications"></a>10.7. Notifications</h2></div></div></div>
<div class="toc"><dl>
<dt><span class="section"><a href="reference.php#idp13703536">10.7.1. Specifying Notifications in the DAX</a></span></dt>
<dt><span class="section"><a href="reference.php#pegasus_notify_file">10.7.2. Notify File created by Pegasus in the submit directory</a></span></dt>
<dt><span class="section"><a href="reference.php#idp15464592">10.7.3. Configuring pegasus-monitord for notifications</a></span></dt>
<dt><span class="section"><a href="reference.php#idp16765904">10.7.4. Default Notification Scripts</a></span></dt>
</dl></div>
<p>The Pegasus Workflow Mapper supports job and workflow level
  notifications. In the DAX, you can specify the following for a job or
  for the workflow:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>the event when the notification needs to be sent</p></li>
<li class="listitem"><p>the executable that needs to be invoked.</p></li>
</ul></div>
<p>The notifications are issued from the submit host by the
  pegasus-monitord daemon, which monitors the Condor logs for the workflow. When
  it issues a notification, pegasus-monitord invokes the notifying
  executable with certain environment variables set that contain information about
  the job and workflow state.</p>
<p>The Pegasus release comes with default notification clients that send
  notifications via email or jabber.</p>
<div class="section" title="10.7.1. Specifying Notifications in the DAX">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp13703536"></a>10.7.1. Specifying Notifications in the DAX</h3></div></div></div>
<p>Currently, you can specify notifications for jobs and for the
    workflow by using invoke elements.</p>
<p>Invoke elements can be sub elements of the following elements in
    the DAX schema.</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>job - to associate notifications with a compute job in the
          DAX.</p></li>
<li class="listitem"><p>dax - to associate notifications with a dax job in the
          DAX.</p></li>
<li class="listitem"><p>dag - to associate notifications with a dag job in the
          DAX.</p></li>
<li class="listitem"><p>executable - to associate notifications with a job that uses a
          particular executable.</p></li>
</ul></div>
<p>The invoke element can be specified at the root element level of the
    DAX to indicate workflow level notifications.</p>
<p>The invoke element may be specified multiple times, as needed. It
    has a mandatory <span class="bold"><strong>when</strong></span> attribute that takes one of the
    following values:</p>
<div class="table">
<a name="notification_conditions_table"></a><p class="title"><b>Table 10.17. Table 1. Invoke Element attributes and meaning.</b></p>
<div class="table-contents"><table summary="Table 1. Invoke Element attributes and meaning." border="1">
<colgroup>
<col>
<col>
</colgroup>
<thead><tr>
<th align="center">Enumeration of Values for when
            attribute</th>
<th align="center">Meaning</th>
</tr></thead>
<tbody>
<tr>
<td>never</td>
<td>(default). Never notify of anything. This is useful to
            temporarily disable existing notifications.</td>
</tr>
<tr>
<td>start</td>
<td>create a notification when the job is submitted.</td>
</tr>
<tr>
<td>on_error</td>
<td>after a job finishes with failure (exitcode != 0).</td>
</tr>
<tr>
<td>on_success</td>
<td>after a job finishes with success (exitcode == 0).</td>
</tr>
<tr>
<td>at_end</td>
<td>after a job finishes, regardless of exitcode.</td>
</tr>
<tr>
<td>all</td>
<td>like start and at_end combined.</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break"><p>You can specify multiple invoke elements with the same when
    attribute value in the DAX. This allows you to have multiple
    notifications for the same event.</p>
<p>Here is an example that illustrates this.</p>
<pre class="programlisting">&lt;job id="ID000001" namespace="example" name="mDiffFit" version="1.0" 
       node-label="preprocess" &gt;
    &lt;argument&gt;-a top -T 6  -i &lt;file name="f.a"/&gt;  -o &lt;file name="f.b1"/&gt;&lt;/argument&gt;

    &lt;!-- profiles are optional --&gt;
    &lt;profile namespace="execution" key="site"&gt;isi_viz&lt;/profile&gt;
    &lt;profile namespace="condor" key="getenv"&gt;true&lt;/profile&gt;

    &lt;uses name="f.a" link="input"  register="false" transfer="true" type="data" /&gt;
    &lt;uses name="f.b" link="output" register="false" transfer="true" type="data" /&gt;
    
    &lt;!-- 'WHEN' enumeration: never, start, on_error, on_success, at_end, all --&gt;
    <span class="bold"><strong>&lt;invoke when="start"&gt;/path/to/notify1 arg1 arg2&lt;/invoke&gt;
    &lt;invoke when="start"&gt;/path/to/notify1 arg3 arg4&lt;/invoke&gt;
    &lt;invoke when="on_success"&gt;/path/to/notify2 arg3 arg4&lt;/invoke&gt;</strong></span>
  &lt;/job&gt;</pre>
<p>In the above example, the executable notify1 will be invoked twice
    when the job is submitted (when="start"): once with arguments arg1 and
    arg2, and a second time with arguments arg3 and arg4.</p>
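<p>One way to read the when values in the table above is as a predicate
    over two kinds of events: "start" (the job is submitted) and "at_end"
    (the job has finished with some exit code). The Python sketch below
    encodes that reading. It is an illustration only: the function name
    should_notify and the two-event model are assumptions made for this
    example, not pegasus-monitord code.</p>
<pre class="programlisting">WHEN_VALUES = ("never", "start", "on_error", "on_success", "at_end", "all")


def should_notify(when, event, exitcode=None):
    """Decide whether an invoke element with the given when value fires.

    event is "start" when the job is submitted and "at_end" when it has
    finished; exitcode is only meaningful for "at_end".
    """
    if when not in WHEN_VALUES:
        raise ValueError("unknown when value: " + when)
    if when == "never":
        return False
    if event == "start":
        return when in ("start", "all")
    # From here on the job has finished with the given exit code.
    if when in ("at_end", "all"):
        return True
    if when == "on_success":
        return exitcode == 0
    if when == "on_error":
        return exitcode != 0
    return False  # when == "start" does not fire once the job has finished</pre>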
<p>The DAX Generator API <a class="link" href="reference.php#dax_generator_api" title="10.9.2. DAX Generator API">chapter</a> has information about how to
    add notifications to the DAX using the DAX APIs.</p>
</div>
<div class="section" title="10.7.2. Notify File created by Pegasus in the submit directory">
<div class="titlepage"><div><div><h3 class="title">
<a name="pegasus_notify_file"></a>10.7.2. Notify File created by Pegasus in the submit directory</h3></div></div></div>
<p>While planning a workflow, Pegasus writes out a notify file in the
    submit directory that contains all the notifications that need to be sent
    for the workflow. pegasus-monitord reads this notifications file to
    determine what notifications need to be sent and when. Each line in the
    file has one of the following two formats:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">
<p>ENTITY_TYPE ID NOTIFICATION_CONDITION ACTION</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
<p>ENTITY_TYPE can be either of the following keywords</p>
<div class="itemizedlist"><ul class="itemizedlist" type="circle">
<li class="listitem"><p>WORKFLOW - indicates workflow level notification</p></li>
<li class="listitem"><p>JOB - indicates notifications for a job in the
                executable workflow</p></li>
<li class="listitem"><p>DAXJOB - indicates notifications for a DAX Job in the
                executable workflow</p></li>
<li class="listitem"><p>DAGJOB - indicates notifications for a DAG Job in the
                executable workflow</p></li>
</ul></div>
</li>
<li class="listitem">
<p>ID indicates the identifier for the entity. It has a different
            meaning depending on the entity type:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="circle">
<li class="listitem"><p>WORKFLOW - ID is wf_uuid</p></li>
<li class="listitem"><p>JOB|DAXJOB|DAGJOB - ID is the job identifier in the
                executable workflow ( DAG ).</p></li>
</ul></div>
</li>
<li class="listitem"><p>NOTIFICATION_CONDITION is the condition when the
            notification needs to be sent. The notification conditions are
            enumerated in <a class="link" href="reference.php#notification_conditions_table" title="Table 10.17. Table 1. Invoke Element attributes and meaning.">Table
            1</a></p></li>
<li class="listitem"><p>ACTION is what needs to happen when the condition is satisfied:
            the executable followed by its arguments.</p></li>
</ul></div>
</li>
<li class="listitem">
<p>INVOCATION JOB_IDENTIFIER INV.ID NOTIFICATION_CONDITION
        ACTION</p>
<p>The INVOCATION lines are generated only for clustered jobs, to
        specify the finer-grained notifications for each constituent
        job/invocation.</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>JOB_IDENTIFIER is the job identifier in the executable
            workflow ( DAG ).</p></li>
<li class="listitem"><p>INV.ID indicates the index of the task in the clustered job
            for which the notification needs to be sent.</p></li>
<li class="listitem"><p>NOTIFICATION_CONDITION is the condition when the
            notification needs to be sent. The notification conditions are
            enumerated in <a class="link" href="reference.php#notification_conditions_table" title="Table 10.17. Table 1. Invoke Element attributes and meaning.">Table
            1</a></p></li>
<li class="listitem"><p>ACTION is what needs to happen when the condition is satisfied:
            the executable followed by its arguments.</p></li>
</ul></div>
</li>
</ol></div>
<p>A sample generated notifications file is listed below.</p>
<pre class="programlisting">WORKFLOW d2c4f79c-8d5b-4577-8c46-5031f4d704e8 on_error /bin/date1

INVOCATION merge_vahi-preprocess-1.0_PID1_ID1 1 on_success /bin/date_executable
INVOCATION merge_vahi-preprocess-1.0_PID1_ID1 1 on_success /bin/date_executable
INVOCATION merge_vahi-preprocess-1.0_PID1_ID1 1 on_error /bin/date_executable

INVOCATION merge_vahi-preprocess-1.0_PID1_ID1 2 on_success /bin/date_executable
INVOCATION merge_vahi-preprocess-1.0_PID1_ID1 2 on_error /bin/date_executable

DAXJOB subdax_black_ID000003 on_error /bin/date13
JOB    analyze_ID00004    on_success /bin/date
</pre>
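<p>The two line formats described above can be parsed with a few lines of
    Python. The sketch below is a hypothetical reader written for
    illustration (the function name parse_notify_line and the dictionary keys
    are assumptions); it is not the parser used by pegasus-monitord.</p>
<pre class="programlisting">ENTITY_TYPES = ("WORKFLOW", "JOB", "DAXJOB", "DAGJOB")


def parse_notify_line(line):
    """Parse one line of the notify file into a dict, or None if blank."""
    line = line.strip()
    if not line:
        return None
    kind = line.split(None, 1)[0]
    if kind in ENTITY_TYPES:
        # ENTITY_TYPE ID NOTIFICATION_CONDITION ACTION
        _, entity_id, condition, action = line.split(None, 3)
        return {"type": kind, "id": entity_id,
                "condition": condition, "action": action}
    if kind == "INVOCATION":
        # INVOCATION JOB_IDENTIFIER INV.ID NOTIFICATION_CONDITION ACTION
        _, job_id, inv_id, condition, action = line.split(None, 4)
        return {"type": kind, "id": job_id, "inv_id": int(inv_id),
                "condition": condition, "action": action}
    raise ValueError("unrecognized line: " + line)


print(parse_notify_line("DAXJOB subdax_black_ID000003 on_error /bin/date13"))</pre>
<p>Splitting with a maximum field count keeps the ACTION field intact,
    so an executable followed by several arguments is returned as one
    string.</p>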
</div>
<div class="section" title="10.7.3. Configuring pegasus-monitord for notifications">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp15464592"></a>10.7.3. Configuring pegasus-monitord for notifications</h3></div></div></div>
<p>Whenever pegasus-monitord enters a workflow (or sub-workflow)
    directory, it will read the notifications file generated by Pegasus.
    Pegasus-monitord will match events in the running workflow against the
    notifications specified in the notifications file and will initiate the
    script specified in a notification when that notification matches an event
    in the workflow. It is important to note that there will be a delay
    between a certain event happening in the workflow, and pegasus-monitord
    processing the log file and executing the corresponding notification
    script.</p>
<p>The following command line options (and properties) can change how
    pegasus-monitord handles notifications:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>--no-notifications (pegasus.monitord.notifications=False): Will
        disable notifications completely.</p></li>
<li class="listitem"><p>--notifications-max=nn (pegasus.monitord.notifications.max=nn):
        Will limit the number of concurrent notification scripts to nn. Once
        pegasus-monitord reaches this number, it will wait until one
        notification script finishes before starting a new one. Notifications
        happening during this time will be queued by the system. The default
        number of concurrent notification scripts for pegasus-monitord is
        10.</p></li>
<li class="listitem"><p>--notifications-timeout=nn
        (pegasus.monitord.notifications.timeout=nn): Changes how long
        pegasus-monitord will wait for a notification script
        to finish. By default, pegasus-monitord will wait for as long as it
        takes (possibly indefinitely) until a notification script ends. With
        this option, pegasus-monitord will wait for at most nn seconds before
        killing the notification script.</p></li>
</ul></div>
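<p>The semantics of the concurrency limit and the timeout can be
    illustrated with a small Python sketch. It is not the pegasus-monitord
    implementation; the function names run_script and run_all are
    hypothetical, and the standard library thread pool is used here only as a
    stand-in for the queueing behavior described above.</p>
<pre class="programlisting">import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_script(argv, timeout=None):
    """Run one notification script; give up after timeout seconds."""
    try:
        return subprocess.run(argv, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return None


def run_all(scripts, max_concurrent=10, timeout=None):
    """Run scripts with at most max_concurrent of them active at once.

    Scripts beyond the limit wait in the pool's queue until a slot frees up.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(lambda argv: run_script(argv, timeout), scripts))


if __name__ == "__main__":
    quick = [sys.executable, "-c", "pass"]
    print(run_all([quick, quick], max_concurrent=1, timeout=30))</pre>
<p>With timeout=None the sketch waits indefinitely, which mirrors the
    default behavior described for --notifications-timeout.</p>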
<p>It is also important to understand that pegasus-monitord will not
    issue any notifications when it is executed in replay mode.</p>
<div class="section" title="10.7.3.1. Environment set for the notification scripts">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp13964080"></a>10.7.3.1. Environment set for the notification scripts</h4></div></div></div>
<p>Whenever a notification in the notifications file matches an event
      in the running workflow, pegasus-monitord will run the corresponding
      script specified in the ACTION field of the notifications file.
      Pegasus-monitord will set the following environment variables for each
      notification script it starts:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>PEGASUS_EVENT: The NOTIFICATION_CONDITION that caused the
          notification. In the case of the "all" condition, pegasus-monitord
          will replace it with the actual event that caused the match (e.g.
          "start" or "at_end").</p></li>
<li class="listitem"><p>PEGASUS_EVENT_TIMESTAMP: Timestamp in EPOCH format for the
          event (better for automated processing).</p></li>
<li class="listitem"><p>PEGASUS_EVENT_TIMESTAMP_ISO: Same as above, but in ISO format
          (better for human readability).</p></li>
<li class="listitem"><p>PEGASUS_SUBMIT_DIR: The submit directory for the workflow
          (usually the value from "submit_dir" in the braindump.txt
          file)</p></li>
<li class="listitem"><p>PEGASUS_STDOUT: For workflow notifications, this will
          correspond to the dagman.out file for that workflow. For job and
          invocation notifications, this field will contain the output file
          (stdout) for that particular job instance.</p></li>
<li class="listitem"><p>PEGASUS_STDERR: For job and invocation notifications, this
          field will contain the error file (stderr) for the particular
          executable job instance. This field does not exist in case of
          workflow notifications.</p></li>
<li class="listitem"><p>PEGASUS_WFID: Contains the workflow id for this notification
          in the form of DAX_LABEL + DAX_INDEX (from the braindump.txt
          file).</p></li>
<li class="listitem"><p>PEGASUS_JOBID: For workflow notifications, this contains the
          workflow wf_uuid (from the braindump.txt file). For job and
          invocation notifications, this field contains the job identifier in
          the executable workflow ( DAG ) for the particular
          notification.</p></li>
<li class="listitem"><p>PEGASUS_INVID: Contains the index of the task in the clustered
          job for the notification.</p></li>
<li class="listitem"><p>PEGASUS_STATUS: For workflow notifications, this contains
          DAGMan's exit code. For job and invocation notifications, this field
          contains the exit code for the particular job/task. Please note that
          this field is not present for 'start' notification events.</p></li>
</ul></div>
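<p>A notification script can read these variables directly from its
      environment. The following Python sketch shows one way to do so; the
      function name describe_event and the output format are hypothetical.
      Variables that are documented above as absent for some events are
      treated as optional.</p>
<pre class="programlisting">import os


def describe_event(env=None):
    """Build a one-line summary from the PEGASUS_* variables listed above."""
    env = os.environ if env is None else env
    parts = [
        "event=" + env.get("PEGASUS_EVENT", "unknown"),
        "workflow=" + env.get("PEGASUS_WFID", "unknown"),
        "job=" + env.get("PEGASUS_JOBID", "unknown"),
    ]
    # PEGASUS_STATUS is not set for 'start' events, so treat it as optional.
    if "PEGASUS_STATUS" in env:
        parts.append("status=" + env["PEGASUS_STATUS"])
    # PEGASUS_STDERR does not exist for workflow notifications.
    if "PEGASUS_STDERR" in env:
        parts.append("stderr=" + env["PEGASUS_STDERR"])
    return " ".join(parts)


if __name__ == "__main__":
    print(describe_event())</pre>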
</div>
</div>
<div class="section" title="10.7.4. Default Notification Scripts">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp16765904"></a>10.7.4. Default Notification Scripts</h3></div></div></div>
<p>Pegasus ships with two reference notification scripts. These can be
    used as a starting point when creating your own notification scripts, or,
    if the default scripts are all you need, you can use them directly in your
    workflows. The scripts are:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
<p><span class="bold"><strong>libexec/notification/email</strong></span> -
        sends email, including the output from
        <span class="command"><strong>pegasus-status</strong></span> (default) or
        <span class="command"><strong>pegasus-analyzer</strong></span>.</p>
<pre class="screen"><span class="bold"><strong>$ ./libexec/notification/email --help</strong></span>
Usage: email [options]

Options:
  -h, --help            show this help message and exit
  -t TO_ADDRESS, --to=TO_ADDRESS
                        The To: email address. Defines the recipient for the
                        notification.
  -f FROM_ADDRESS, --from=FROM_ADDRESS
                        The From: email address. Defaults to the required To:
                        address.
  -r REPORT, --report=REPORT
                        Include workflow report. Valid values are: none
                        pegasus-analyzer pegasus-status (default)
</pre>
</li>
<li class="listitem">
<p><span class="bold"><strong>libexec/notification/jabber</strong></span> -
        sends simple notifications to Jabber/GTalk. This can be useful for job
        failures.</p>
<pre class="screen"><span class="bold"><strong>$ ./libexec/notification/jabber --help</strong></span>
Usage: jabber [options]

Options:
  -h, --help            show this help message and exit
  -i JABBER_ID, --jabberid=JABBER_ID
                        Your jabber id. Example: user@jabberhost.com
  -p PASSWORD, --password=PASSWORD
                        Your jabber password
  -s HOST, --host=HOST  Jabber host, if different from the host in your jabber
                        id. For Google talk, set this to talk.google.com
  -r RECIPIENT, --recipient=RECIPIENT
                        Jabber id of the recipient. Not necessary if you want
                        to send to your own jabber id
</pre>
</li>
</ul></div>
<p>For example, if the DAX generator is written in Python and you want
    notifications on 'at_end' events (successful or failed):</p>
<pre class="programlisting"># job level notifications - in this case for at_end events
job.invoke('at_end', pegasus_home + "/libexec/notification/email --to me@somewhere.edu")</pre>
<p>Please see the <a class="link" href="example_workflows.php#notifications_example" title="9.4. Notifications Example">notifications
    example</a> for a full workflow using notifications.</p>
</div>
</div>
<div class="section" title="10.8. Monitoring">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="monitoring"></a>10.8. Monitoring</h2></div></div></div>
<div class="toc"><dl>
<dt><span class="section"><a href="reference.php#monitoring_pegasus-monitord">10.8.1. pegasus-monitord</a></span></dt>
<dt><span class="section"><a href="reference.php#stampede_schema_overview">10.8.2. Overview of the Stampede Database Schema.</a></span></dt>
</dl></div>
<p>Pegasus launches a monitoring daemon called pegasus-monitord per
  workflow (a single daemon is launched if a user submits a hierarchical
  workflow). pegasus-monitord parses the workflow and job logs in the submit
  directory and populates a database. This section gives an overview of
  pegasus-monitord and describes the schema of the runtime database.</p>
<div class="section" title="10.8.1. pegasus-monitord">
<div class="titlepage"><div><div><h3 class="title">
<a name="monitoring_pegasus-monitord"></a>10.8.1. pegasus-monitord</h3></div></div></div>
<p><span class="bold"><strong>Pegasus-monitord</strong></span> is used to follow
    workflows by parsing DAGMan's dagman.out file. In addition to
    generating the jobstate.log file, which contains the various states that a
    job goes through during the workflow execution, <span class="bold"><strong>pegasus-monitord</strong></span> can also be used to mine
    information from jobs' submit and output files, and either populate a
    database, or write a file with NetLogger events containing this
    information. <span class="bold"><strong>Pegasus-monitord</strong></span> can also
    send notifications to users in real-time as it parses the workflow
    execution logs.</p>
<p><span class="bold"><strong>Pegasus-monitord</strong></span> is automatically
    invoked by <span class="bold"><strong>pegasus-run</strong></span>, and tracks
    workflows in real-time. By default, it produces the jobstate.log file, and
    a SQLite database, which contains all the information listed in the <a class="link" href="reference.php#stampede-schema">Stampede schema</a>. When a workflow fails,
    and is re-submitted with a rescue DAG, <span class="bold"><strong>pegasus-monitord</strong></span> will automatically pick up from
    where it left off and continue to write the jobstate.log file and
    populate the database.</p>
<p>If, after the workflow has already finished, users need to re-create
    the jobstate.log file, or re-populate the database from scratch, <span class="bold"><strong>pegasus-monitord</strong></span>'s <span class="bold"><strong>--replay</strong></span> option should be used when running it
    manually.</p>
<div class="section" title="10.8.1.1. Populating to different backend databases">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp12893088"></a>10.8.1.1. Populating to different backend databases</h4></div></div></div>
<p>In addition to SQLite, <span class="bold"><strong>pegasus-monitord</strong></span> supports other types of
      databases, such as MySQL and Postgres. Users will need to install the
      low-level database drivers, and can use the <span class="bold"><strong>--dest</strong></span> command-line option, or the <span class="bold"><strong>pegasus.monitord.output</strong></span> property to select where
      the logs should go.</p>
<p>As an example, the command:</p>
<pre class="programlisting">$ pegasus-monitord -r diamond-0.dag.dagman.out</pre>
<p>will launch <span class="bold"><strong>pegasus-monitord</strong></span> in
      replay mode. In this case, if a jobstate.log file already exists, it
      will be rotated and a new file will be created. It will also create/use
      a SQLite database in the workflow's run directory, with the name of
      diamond-0.stampede.db. If the database already exists, it will make sure
      to remove any references to the current workflow before it populates the
      database. In this case, <span class="bold"><strong>pegasus-monitord</strong></span> will process the workflow
      information from start to finish, including any restarts that may have
      happened.</p>
<p>Users can specify an alternative database for the events, as
      illustrated by the following examples:</p>
<pre class="programlisting">$ pegasus-monitord -r -d mysql://username:userpass@hostname/database_name diamond-0.dag.dagman.out</pre>
<pre class="programlisting">$ pegasus-monitord -r -d sqlite:////tmp/diamond-0.db diamond-0.dag.dagman.out</pre>
<p>In the first example, <span class="bold"><strong>pegasus-monitord</strong></span> will send the data to the
      <span class="bold"><strong>database_name</strong></span> database located at
      server <span class="bold"><strong>hostname</strong></span>, using the <span class="bold"><strong>username</strong></span> and <span class="bold"><strong>userpass</strong></span> provided. In the second example,
      <span class="bold"><strong>pegasus-monitord</strong></span> will store the data in
      the /tmp/diamond-0.db SQLite database.</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>For absolute paths four slashes are required when specifying an
        alternative database path in SQLite.</p>
</div>
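<p>The slash count follows from how the connection string splits into a
      prefix and a path. The following Python sketch is only an illustration: it
      assumes the common <span class="bold"><strong>sqlite:///</strong></span> prefix convention, and the helper
      name <code class="code">sqlite_path</code> is hypothetical and not part of Pegasus.</p>
<pre class="programlisting">import posixpath

SQLITE_PREFIX = "sqlite:///"  # scheme, "//", an empty host part, then "/"


def sqlite_path(conn_string):
    """Return the file path encoded in an sqlite connection string."""
    if not conn_string.startswith(SQLITE_PREFIX):
        raise ValueError("not an sqlite connection string: " + conn_string)
    # Everything after the three-slash prefix is the path itself.
    return conn_string[len(SQLITE_PREFIX):]


absolute = sqlite_path("sqlite:////tmp/diamond-0.db")
relative = sqlite_path("sqlite:///diamond-0.db")
print(absolute, posixpath.isabs(absolute))  # /tmp/diamond-0.db True
print(relative, posixpath.isabs(relative))  # diamond-0.db False</pre>
<p>With only three slashes the remaining path has no leading slash and is
      therefore relative; the fourth slash is the root of the absolute path.</p>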
<p>Users should also be aware that in all cases, with the exception
      of SQLite, the database should exist before <span class="bold"><strong>pegasus-monitord</strong></span> is run (as it creates all needed
      tables but does not create the database itself).</p>
<p>Finally, the following example:</p>
<pre class="programlisting">$ pegasus-monitord -r --dest diamond-0.bp diamond-0.dag.dagman.out</pre>
<p>sends events to the diamond-0.bp file. (Please note that in replay
      mode, any data in the file will be overwritten.)</p>
<p>One important detail is that while processing a workflow,
      <span class="bold"><strong>pegasus-monitord</strong></span> will automatically
      detect if/when sub-workflows are initiated, and will automatically track
      those sub-workflows as well. In this case, although <span class="bold"><strong>pegasus-monitord</strong></span> will create a separate
      jobstate.log file in each workflow directory, the database at the
      top-level workflow will contain the information from not only the main
      workflow, but also from all sub-workflows.</p>
</div>
<div class="section" title="10.8.1.2. Monitoring related files in the workflow directory">
<div class="titlepage"><div><div><h4 class="title">
<a name="monitoring-files"></a>10.8.1.2. Monitoring related files in the workflow directory</h4></div></div></div>
<p><span class="bold"><strong>Pegasus-monitord</strong></span> generates a
      number of files in each workflow directory:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p><span class="bold"><strong>jobstate.log</strong></span>: contains a
          summary of workflow and job execution.</p></li></ul></div>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p><span class="bold"><strong>monitord.log</strong></span>: contains any
          log messages generated by <span class="bold"><strong>pegasus-monitord</strong></span>. It is not overwritten when
          it restarts. This file is not generated in replay mode, as all log
          messages from <span class="bold"><strong>pegasus-monitord</strong></span> are
          output to the console. Also, when sub-workflows are involved, only
          the top-level workflow will have this log file. Starting with
          releases 4.0 and 3.1.1, the monitord.log file is rotated if it
          already exists.</p></li>
<li class="listitem"><p><span class="bold"><strong>monitord.started</strong></span>: contains a
          timestamp indicating when <span class="bold"><strong>pegasus-monitord</strong></span> was started. This file is
          overwritten every time <span class="bold"><strong>pegasus-monitord</strong></span> starts.</p></li>
<li class="listitem"><p><span class="bold"><strong>monitord.done</strong></span>: contains a
          timestamp indicating when <span class="bold"><strong>pegasus-monitord</strong></span> finished. This file is
          overwritten every time <span class="bold"><strong>pegasus-monitord</strong></span> starts.</p></li>
<li class="listitem"><p><span class="bold"><strong>monitord.info</strong></span>: contains
          <span class="bold"><strong>pegasus-monitord</strong></span> state information,
          which allows it to resume processing if a workflow does not finish
          properly and a rescue dag is submitted. This file is erased when
          <span class="bold"><strong>pegasus-monitord</strong></span> is executed in
          replay mode.</p></li>
<li class="listitem"><p><span class="bold"><strong>monitord.recover</strong></span>: contains
          <span class="bold"><strong>pegasus-monitord</strong></span> state information
          that allows it to detect that a previous instance of <span class="bold"><strong>pegasus-monitord</strong></span> failed (or was killed)
          midway through parsing a workflow's execution logs. This file is
          only present while <span class="bold"><strong>pegasus-monitord</strong></span>
          is running, as it is deleted when it ends and the <span class="bold"><strong>monitord.info</strong></span> file is generated.</p></li>
<li class="listitem"><p><span class="bold"><strong>monitord.subwf.db</strong></span>: contains
          information that helps <span class="bold"><strong>pegasus-monitord</strong></span> track when sub-workflows
          fail and are re-planned/re-tried. It is overwritten when <span class="bold"><strong>pegasus-monitord</strong></span> is started in replay
          mode.</p></li>
<li class="listitem"><p><span class="bold"><strong>monitord-notifications.log</strong></span>:
          contains the log file for notification-related messages. Normally,
          this file only includes logs for failed notifications, but can be
          populated with all notification information when <span class="bold"><strong>pegasus-monitord</strong></span> is run in verbose mode via
          the <span class="bold"><strong>-v</strong></span> command-line option.</p></li>
</ul></div>
</div>
</div>
<div class="section" title="10.8.2. Overview of the Stampede Database Schema.">
<div class="titlepage"><div><div><h3 class="title">
<a name="stampede_schema_overview"></a>10.8.2. Overview of the Stampede Database Schema.</h3></div></div></div>
<p>Pegasus takes in a DAX, which is composed of tasks, and plans it
    into a Condor DAG / executable workflow that consists of jobs. With
    clustering, multiple tasks in the DAX can be captured in a single job in
    the executable workflow. When DAGMan executes a job, a job instance is
    populated. Job instances capture information as seen by DAGMan. If
    DAGMan retries a job on detecting a failure, a new job instance is
    populated. When DAGMan finds that a job instance has finished, an invocation
    is associated with the job instance. In the case of a clustered job, multiple
    invocations will be associated with a single job instance. If a pre script
    or post script is associated with a job instance, then invocations are
    populated in the database for the corresponding job instance.</p>
<p>The current schema version is <span class="bold"><strong>4.0</strong></span>,
    which is stored in the schema_info table.</p>
<div class="figure">
<a name="idp12572000"></a><p class="title"><b>Figure 10.14. Stampede Database Schema</b></p>
<div class="figure-contents"><div class="mediaobject"><img src="images/stampede-schema-small.png" width="NaN" alt="Stampede Database Schema"></div></div>
</div>
<br class="figure-break"><div class="section" title="10.8.2.1. Stampede Schema Upgrade Tool">
<div class="titlepage"><div><div><h4 class="title">
<a name="schema_upgrade_tool"></a>10.8.2.1. Stampede Schema Upgrade Tool</h4></div></div></div>
<p>Starting with Pegasus 4.x, the monitoring and statistics database schema
      has changed. If you want to use pegasus-statistics, pegasus-analyzer
      and pegasus-plots against a 3.x database, you will need to upgrade the
      schema first using the schema upgrade tool
      /usr/share/pegasus/sql/schema_tool.py or
      /path/to/pegasus-4.x/share/pegasus/sql/schema_tool.py.</p>
<p>Upgrading the schema is required for people using the MySQL
      database for storing their monitoring information if it was set up with
      3.x monitoring tools.</p>
<p>If your setup uses the default SQLite database, then the new
      databases for runs with Pegasus 4.x are automatically created with the
      correct schema. In this case you only need to upgrade the SQLite
      databases from older runs if you wish to query them with the newer
      clients.</p>
<p>To upgrade the database:</p>
<pre class="programlisting">For SQLite Database

<span class="bold"><strong>cd /to/the/workflow/directory/with/3.x.monitord.db</strong></span>

Check the db version<span class="bold"><strong>

/usr/share/pegasus/sql/schema_tool.py -c connString=sqlite:////to/the/workflow/directory/with/workflow.stampede.db</strong></span>
2012-02-29T01:29:43.330476Z INFO   netlogger.analysis.schema.schema_check.SchemaCheck.init | 
2012-02-29T01:29:43.330708Z INFO   netlogger.analysis.schema.schema_check.SchemaCheck.check_schema.start | 
2012-02-29T01:29:43.348995Z INFO   netlogger.analysis.schema.schema_check.SchemaCheck.check_schema 
                                   | Current version set to: 3.1. 
2012-02-29T01:29:43.349133Z ERROR  netlogger.analysis.schema.schema_check.SchemaCheck.check_schema 
                                   | Schema version 3.1 found - expecting 4.0 - database admin will need to run upgrade tool.


Convert the Database to be version 4.x compliant<span class="bold"><strong>

/usr/share/pegasus/sql/schema_tool.py -u connString=sqlite:////to/the/workflow/directory/with/workflow.stampede.db
</strong></span>2012-02-29T01:35:35.046317Z INFO   netlogger.analysis.schema.schema_check.SchemaCheck.init | 
2012-02-29T01:35:35.046554Z INFO   netlogger.analysis.schema.schema_check.SchemaCheck.check_schema.start | 
2012-02-29T01:35:35.064762Z INFO   netlogger.analysis.schema.schema_check.SchemaCheck.check_schema 
                                  | Current version set to: 3.1. 
2012-02-29T01:35:35.064902Z ERROR  netlogger.analysis.schema.schema_check.SchemaCheck.check_schema 
                                  | Schema version 3.1 found - expecting 4.0 - database admin will need to run upgrade tool. 
2012-02-29T01:35:35.065001Z INFO   netlogger.analysis.schema.schema_check.SchemaCheck.upgrade_to_4_0 
                                  | Upgrading to schema version 4.0.

Verify if the database has been converted to Version 4.x<span class="bold"><strong>

/usr/share/pegasus/sql/schema_tool.py -c connString=sqlite:////to/the/workflow/directory/with/workflow.stampede.db</strong></span>
2012-02-29T01:39:17.218902Z INFO   netlogger.analysis.schema.schema_check.SchemaCheck.init | 
2012-02-29T01:39:17.219141Z INFO   netlogger.analysis.schema.schema_check.SchemaCheck.check_schema.start | 
2012-02-29T01:39:17.237492Z INFO   netlogger.analysis.schema.schema_check.SchemaCheck.check_schema | Current version set to: 4.0. 
2012-02-29T01:39:17.237624Z INFO   netlogger.analysis.schema.schema_check.SchemaCheck.check_schema | Schema up to date. 

For upgrading a MySQL database the steps remain the same. The only thing that changes is the connection String to the database
E.g.<span class="bold"><strong>

/usr/share/pegasus/sql/schema_tool.py -u connString=mysql://username:password@server:port/dbname

</strong></span></pre>
<p>After the database has been upgraded you can use either 3.x or 4.x
      clients to query the database with <span class="bold"><strong>pegasus-statistics</strong></span>, as well as <span class="bold"><strong>pegasus-plots</strong></span> and <span class="bold"><strong>pegasus-analyzer</strong></span>.</p>
</div>
<div class="section" title="10.8.2.2. Storing of Exitcode in the database">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp15465584"></a>10.8.2.2. Storing of Exitcode in the database</h4></div></div></div>
<p>Kickstart records capture the raw status in addition to the exitcode.
      The exitcode is derived from the raw status. Starting with the Pegasus 4.0
      release, all exitcode columns (i.e. the invocation and job instance table
      columns) are stored with the raw status by pegasus-monitord. If an
      exitcode is encountered while parsing the DAGMan log files, the value
      is converted to the corresponding raw status before it is stored. All
      user tools, pegasus-analyzer and pegasus-statistics, then convert the raw
      status back to an exitcode when retrieving it from the database.</p>
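<p>The two conversions can be sketched in Python as follows. This is a
      minimal sketch and not the pegasus-monitord code: it assumes the usual
      POSIX wait-status layout (exit code in the high byte, terminating signal
      in the low seven bits), and the function names are hypothetical.</p>
<pre class="programlisting">def exitcode_to_raw(exitcode):
    """Raw status of a process that exited normally with this exit code."""
    return exitcode * 256  # the exit code occupies the high byte


def raw_to_exitcode(raw_status):
    """Derive an exit code from a raw status.

    Assumption of this sketch: termination by signal N is reported as -N.
    """
    high_byte, low_byte = divmod(raw_status, 256)
    signal_number = low_byte % 128  # low seven bits hold the signal number
    if signal_number != 0:
        return -signal_number
    return high_byte


print(raw_to_exitcode(exitcode_to_raw(1)))  # 1
print(raw_to_exitcode(0))                   # 0
print(raw_to_exitcode(9))                   # -9, i.e. killed by signal 9</pre>
<p>Storing the raw status keeps the signal information that a plain
      exit code would lose.</p>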
</div>
<div class="section" title="10.8.2.3. Multiplier Factor">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp12496992"></a>10.8.2.3. Multiplier Factor</h4></div></div></div>
<p>Starting with the 4.0 release, there is a multiplier factor
      associated with the jobs in the job_instance table. It defaults to one,
      unless the user associates a Pegasus profile key named <span class="bold"><strong>cores</strong></span> with the job in the DAX. The factor can be
      used for getting more accurate statistics for jobs that run on multiple
      processors/cores, or for MPI jobs.</p>
<p>The multiplier factor is used by pegasus-statistics for computing the
      following metrics:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>In the summary, the workflow cumulative job walltime</p></li>
<li class="listitem"><p>In the summary, the cumulative job walltime as seen from the
          submit side</p></li>
<li class="listitem"><p>In the jobs file, the multiplier factor is listed along with
          the multiplied kickstart time.</p></li>
<li class="listitem"><p>In the breakdown file, where statistics are listed per
          transformation, the mean, min, max and average values take into
          account the multiplier factor.</p></li>
</ul></div>
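<p>The effect of the factor on a cumulative walltime can be sketched in
      Python as follows. The job names, walltimes and factors are made-up
      values for illustration, and <code class="code">cumulative_walltime</code> is a
      hypothetical helper, not a pegasus-statistics function.</p>
<pre class="programlisting"># Hypothetical job instances: (job name, kickstart walltime in seconds, multiplier factor).
job_instances = [
    ("preprocess", 60.0, 1),
    ("mpi_analyze", 120.0, 16),  # e.g. pegasus profile key cores set to 16
    ("findrange", 30.0, 1),
]


def cumulative_walltime(instances, use_multiplier=True):
    """Sum the walltimes, optionally weighting each by its multiplier factor."""
    total = 0.0
    for _name, walltime, multiplier in instances:
        total += walltime * (multiplier if use_multiplier else 1)
    return total


print(cumulative_walltime(job_instances, use_multiplier=False))  # 210.0
print(cumulative_walltime(job_instances))                        # 2010.0</pre>
<p>In this made-up case the unweighted sum hides most of the resources
      consumed by the 16-core job.</p>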
</div>
</div>
</div>
<div class="section" title="10.9. API Reference">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="api"></a>10.9. API Reference</h2></div></div></div>
<div class="toc"><dl>
<dt><span class="section"><a href="reference.php#idp13985216">10.9.1. DAX XML Schema</a></span></dt>
<dt><span class="section"><a href="reference.php#dax_generator_api">10.9.2. DAX Generator API</a></span></dt>
<dt><span class="section"><a href="reference.php#idp10760848">10.9.3. DAX Generator without a Pegasus DAX API</a></span></dt>
</dl></div>
<div class="section" title="10.9.1. DAX XML Schema">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp13985216"></a>10.9.1. DAX XML Schema</h3></div></div></div>
<p>The DAX format is described by the XML schema instance document
    <a class="ulink" href="http://pegasus.isi.edu/wms/docs/schemas/dax-3.3/dax-3.3.xsd" target="_top">dax-3.3.xsd</a>.
    A local copy of the schema definition is provided in the
    <span class="quote">“<span class="quote">etc</span>”</span> directory. The documentation of the XML schema and its
    elements can be found in <a class="ulink" href="http://pegasus.isi.edu/wms/docs/schemas/dax-3.3/dax-3.3.html" target="_top">dax-3.3.html</a>
    as well as locally in
    <code class="filename">doc/schemas/dax-3.3/dax-3.3.html</code> in your Pegasus
    distribution.</p>
<div class="section" title="10.9.1.1. DAX XML Schema In Detail">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp13602240"></a>10.9.1.1. DAX XML Schema In Detail</h4></div></div></div>
<p>The DAX file format has four major sections, with the second
      section divided into more sub-sections. The DAX format works on the
      abstract or logical level, letting you focus on the shape of the
      workflows, what to do and what to work upon.</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">
<p>Workflow-level Notifications</p>
<p>Very simple workflow-level notifications. These are defined in
          the <a class="link" href="reference.php#notifications" title="10.7. Notifications">Notification</a>
          section.</p>
</li>
<li class="listitem">
<p>Catalogs</p>
<p>This section deals with included catalogs. While we do
          recommend using external replica and transformation catalogs, it
          is possible to include some replicas and transformations in the
          DAX file itself. Any DAX-included entry takes precedence over
          regular replica catalog (RC) and transformation catalog (TC)
          entries.</p>
<p>This section (and any of its sub-sections) is completely
          optional.</p>
<div class="orderedlist"><ol class="orderedlist" type="a">
<li class="listitem"><p>The first sub-section deals with included replica
              descriptions.</p></li>
<li class="listitem"><p>The second sub-section deals with included transformation
              descriptions.</p></li>
<li class="listitem"><p>The third sub-section declares multi-item
              executables.</p></li>
</ol></div>
</li>
<li class="listitem">
<p>Job List</p>
<p>The jobs section defines the job or task descriptions. For
          each task to conduct, a three-part logical name declares the task
          and aids in identifying it in the transformation catalog or one of the
          <span class="emphasis"><em>executable</em></span> sections above. During planning, the
          logical name is translated into the physical executable location on
          the chosen target site. Because jobs are declared abstractly, the
          physical layout of the target sites does not matter. The job's
          <span class="emphasis"><em>id</em></span> uniquely identifies the job within this
          workflow.</p>
<p>The arguments declare what command-line arguments to pass to
          the job. If you are passing filenames, you should refer to the
          logical filename using the <span class="emphasis"><em>file</em></span> element in the
          argument list.</p>
<p>Important for properly planning the task is the list of files
          consumed by the task, its input files, and the files produced by the
          task, its output files. Each file is described with a
          <span class="emphasis"><em>uses</em></span> element inside the task.</p>
<p>Elements exist to link a logical file to any of the stdio file
          descriptors. The <span class="emphasis"><em>profile</em></span> element is Pegasus's
          way to abstract site-specific data.</p>
<p>Jobs are nodes in the workflow graph. Other nodes include
          unplanned workflows (DAX), which are planned and then run when the
          node runs, and planned workflows (DAG), which are simply
          executed.</p>
</li>
<li class="listitem">
<p>Control-flow Dependencies</p>
<p>This section lists the dependencies between the tasks.
          The relationships are defined as child-parent relationships, and
          thus determine the order in which tasks are run. No cyclic
          dependencies are permitted.</p>
<p>Dependencies are directed edges in the workflow graph.</p>
</li>
</ol></div>
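<p>The requirement that the child-parent dependencies contain no cycles
      can be illustrated with a short Python sketch. It uses the standard
      library <code class="code">graphlib</code> module (Python 3.9 or later); the job ids
      and the helper name <code class="code">run_order</code> are hypothetical, and this is
      not how the planner itself is implemented.</p>
<pre class="programlisting">from graphlib import CycleError, TopologicalSorter


def run_order(parents_of):
    """Return one valid execution order for a child-to-parents mapping."""
    return list(TopologicalSorter(parents_of).static_order())


# Hypothetical job ids forming a diamond: ID01 fans out to ID02 and ID03,
# which are both parents of ID04.
diamond = {"ID02": {"ID01"}, "ID03": {"ID01"}, "ID04": {"ID02", "ID03"}}
print(run_order(diamond))  # ID01 comes first, ID04 comes last

try:
    run_order({"A": {"B"}, "B": {"A"}})  # A and B depend on each other
except CycleError as error:
    print("rejected:", error.args[0])</pre>
<p>A graph with a cycle has no valid execution order, which is why such
      a workflow cannot be run.</p>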
<div class="section" title="10.9.1.1.1. XML Intro">
<div class="titlepage"><div><div><h5 class="title">
<a name="idp12861040"></a>10.9.1.1.1. XML Intro</h5></div></div></div>
<p>If you have seen the DAX schema before, you will find few new items
        in the root element. <span class="emphasis"><em>However</em></span>, we did retire the
        (old) attributes ending in <span class="emphasis"><em>Count</em></span>.</p>
<pre class="programlisting">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;!-- generated: 2011-07-28T18:29:57Z --&gt;
&lt;adag xmlns="http://pegasus.isi.edu/schema/DAX" 
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:schemaLocation="http://pegasus.isi.edu/schema/DAX http://pegasus.isi.edu/schema/dax-3.3.xsd" 
      version="3.3" 
      name="diamond" 
      index="0" 
      count="1"&gt;</pre>
<p>The following attributes are supported for the root element
        <span class="emphasis"><em>adag</em></span>.</p>
<div class="table">
<a name="idp13258688"></a><p class="title"><b>Table 10.18. </b></p>
<div class="table-contents"><table border="1">
<colgroup>
<col>
<col>
<col>
<col>
</colgroup>
<thead><tr>
<th>attribute</th>
<th>optional?</th>
<th>type</th>
<th>meaning</th>
</tr></thead>
<tbody>
<tr>
<td>version</td>
<td>required</td>
<td>
                  <span class="emphasis"><em>VersionPattern</em></span>
                </td>
<td>Version number of DAX instance document. Must be
                3.3.</td>
</tr>
<tr>
<td>name</td>
<td>required</td>
<td>string</td>
<td>name of this DAX (or set of DAXes).</td>
</tr>
<tr>
<td>count</td>
<td>optional</td>
<td>positiveInteger</td>
<td>size of list of DAXes with this
                <span class="emphasis"><em>name</em></span>. Defaults to 1.</td>
</tr>
<tr>
<td>index</td>
<td>optional</td>
<td>nonNegativeInteger</td>
<td>current index of DAX with same
                <span class="emphasis"><em>name</em></span>. Defaults to 0.</td>
</tr>
<tr>
<td>fileCount</td>
<td>removed</td>
<td>nonNegativeInteger</td>
<td>Old 2.1 attribute, removed, do not use.</td>
</tr>
<tr>
<td>jobCount</td>
<td>removed</td>
<td>positiveInteger</td>
<td>Old 2.1 attribute, removed, do not use.</td>
</tr>
<tr>
<td>childCount</td>
<td>removed</td>
<td>nonNegativeInteger</td>
<td>Old 2.1 attribute, removed, do not use.</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break"><p>The <span class="emphasis"><em>version</em></span> attribute is restricted to the
        regular expression <code class="code">\d+(\.\d+(\.\d+)?)?</code>. This expression
        represents the <span class="emphasis"><em>VersionPattern</em></span> type that is used
        in other places, too. It is a more restrictive expression than before,
        but allows us to compute comparable version numbers using the following
        formula:</p>
<table border="1" id="idp12259136">
<tr>
            <td>version1: a.b.c</td>

            <td>version2: d.e.f</td>
          </tr>
<tr>
            <td>n = a * 1,000,000 + b * 1,000 + c</td>

            <td>m = d * 1,000,000 + e * 1,000 + f</td>
          </tr>
<tr>
            <td align="center" colspan="2">version1 &gt; version2 if n &gt;
            m</td>
          </tr>
</table>
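<p>The formula can be sketched in Python as follows. The helper name
      <code class="code">version_number</code> is hypothetical, and treating missing
      components as 0 is an assumption of this sketch. Note that the formula
      only orders versions correctly while the second and third components
      stay below 1,000.</p>
<pre class="programlisting">import re

VERSION_PATTERN = re.compile(r"\d+(\.\d+(\.\d+)?)?")


def version_number(version):
    """Map a version string a.b.c to a * 1,000,000 + b * 1,000 + c."""
    if VERSION_PATTERN.fullmatch(version) is None:
        raise ValueError("not a valid VersionPattern: " + version)
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (3 - len(parts))  # assumption: missing components count as 0
    major, minor, patch = parts
    return major * 1000000 + minor * 1000 + patch


print(version_number("3.3"))     # 3003000
print(version_number("2.6.18"))  # 2006018
# Numeric comparison ranks 3.10 above 3.3.1, unlike plain string comparison.
print(max(["3.3", "3.10", "3.3.1"], key=version_number))  # 3.10</pre>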
</div>
<div class="section" title="10.9.1.1.2. Workflow-level Notifications">
<div class="titlepage"><div><div><h5 class="title">
<a name="idp21776080"></a>10.9.1.1.2. Workflow-level Notifications</h5></div></div></div>
<p>Workflow-level notifications are declared with <span class="emphasis"><em>invoke</em></span> elements, for example:</p>
<pre class="programlisting">  &lt;!-- part 1.1: invocations --&gt;
  &lt;invoke when="at_end"&gt;/bin/date -Ins &amp;gt;&amp;gt; my.log&lt;/invoke&gt;</pre>
<p>The above snippet will append the current time to a log file in
        the current directory. The current directory here is that of the monitord instance
        acting on the <a class="link" href="reference.php#notifications" title="10.7. Notifications">notification</a>.</p>
</div>
<div class="section" title="10.9.1.1.3. The Catalogs Section">
<div class="titlepage"><div><div><h5 class="title">
<a name="idp13899152"></a>10.9.1.1.3. The Catalogs Section</h5></div></div></div>
<p>The initial section features three sub-sections:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>a catalog of files used,</p></li>
<li class="listitem"><p>a catalog of transformations used, and</p></li>
<li class="listitem"><p>compound transformation declarations.</p></li>
</ol></div>
<div class="section" title="10.9.1.1.3.1. The Replica Catalog Section">
<div class="titlepage"><div><div><h6 class="title">
<a name="dax_replica_catalog"></a>10.9.1.1.3.1. The Replica Catalog Section</h6></div></div></div>
<p>The file section acts as an in-file replica catalog (RC). Any
          files declared in this section take precedence over files in
          external replica catalogs during planning.</p>
<pre class="programlisting">  &lt;!-- part 1.2: included replica catalog --&gt;
  &lt;file name="example.a" &gt;
    &lt;!-- profiles are optional --&gt;
    &lt;!-- The "stat" namespace is ONLY AN EXAMPLE --&gt;
    &lt;profile namespace="stat" key="size"&gt;/* integer to be defined */&lt;/profile&gt;
    &lt;profile namespace="stat" key="md5sum"&gt;/* 32 char hex string */&lt;/profile&gt;
    &lt;profile namespace="stat" key="mtime"&gt;/* ISO-8601 timestamp */&lt;/profile&gt;

    &lt;!-- metadata is currently NOT SUPPORTED --&gt;
    &lt;metadata key="timestamp" type="int"&gt;/* ISO-8601 *or* 20100417134523:int */&lt;/metadata&gt;
    &lt;metadata key="origin" type="string"&gt;ocean&lt;/metadata&gt;
    
    &lt;!-- PFN to by-pass replica catalog --&gt;
    &lt;!-- The "site" attribute is optional --&gt;
    &lt;pfn url="file:///tmp/example.a" site="local"&gt;
      &lt;profile namespace="stat" key="owner"&gt;voeckler&lt;/profile&gt;
    &lt;/pfn&gt;
    &lt;pfn url="file:///storage/funky.a" site="local"/&gt;    
  &lt;/file&gt;

  &lt;!-- a more typical example from the black diamond --&gt;
  &lt;file name="f.a"&gt;
    &lt;pfn url="file:///Users/voeckler/f.a" site="local"/&gt;
  &lt;/file&gt;</pre>
<p>The first <span class="emphasis"><em>file</em></span> entry above is an example
          of a data file with two replicas. The <span class="emphasis"><em>file</em></span>
          element requires a logical file <span class="emphasis"><em>name</em></span>. Each
          logical filename may have additional information associated with it,
          enumerated by <span class="emphasis"><em>profile</em></span> elements. Each file entry
          may have 0 or more <span class="emphasis"><em>metadata</em></span> associated with it.
          Each piece of metadata has a <span class="emphasis"><em>key</em></span> string and
          <span class="emphasis"><em>type</em></span> attribute describing the element's
          value.</p>
<div class="warning" title="Warning" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Warning</h3>
<p>The <span class="emphasis"><em>metadata</em></span> element is not supported as
            of this writing! Details may change in the future.</p>
</div>
<p>The <span class="emphasis"><em>file</em></span> element can provide 0 or more
          <span class="emphasis"><em>pfn</em></span> locations, taking precedence over the
          replica catalog. A <span class="emphasis"><em>file</em></span> element that does not
          name any <span class="emphasis"><em>pfn</em></span> child elements will still
          require look-ups in external replica catalogs. Each
          <span class="emphasis"><em>pfn</em></span> element names a concrete location of a
          file. Multiple locations constitute replicas of the same file, and
          are assumed to be usable interchangeably. The
          <span class="emphasis"><em>url</em></span> attribute is mandatory, and typically would
          use a file scheme URL. The <span class="emphasis"><em>site</em></span> attribute is
          optional, and defaults to the value <span class="emphasis"><em>local</em></span> if
          missing. A <span class="emphasis"><em>pfn</em></span> element may have
          <span class="emphasis"><em>profile</em></span> child elements, which refer to
          attributes of the physical file. The file-level profiles refer to
          attributes of the logical file.</p>
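<p>How a consumer of the DAX might read these replicas can be sketched
          in Python with the standard <code class="code">xml.etree.ElementTree</code> module.
          The helper name <code class="code">replicas</code> is hypothetical, and the
          elements are built in memory without the DAX XML namespace to keep the
          sketch short; this is not the Pegasus parser.</p>
<pre class="programlisting">import xml.etree.ElementTree as ET


def replicas(file_element):
    """Return (url, site) pairs for the pfn children of a file element."""
    return [
        (pfn.get("url"), pfn.get("site", "local"))  # site defaults to local
        for pfn in file_element.findall("pfn")
    ]


# Build a file entry in memory; a real DAX would also carry the DAX namespace.
entry = ET.Element("file", name="example.a")
ET.SubElement(entry, "pfn", url="file:///tmp/example.a", site="local")
ET.SubElement(entry, "pfn", url="file:///storage/funky.a")  # no site attribute

print(replicas(entry))
# An empty list means the external replica catalogs still have to be consulted.
print(replicas(ET.Element("file", name="f.b")))</pre>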
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>The <code class="literal">stat</code> profile namespace is only an
            example, and details about stat are not yet implemented. The
            proper namespaces <code class="literal">pegasus</code>,
            <code class="literal">condor</code>, <code class="literal">dagman</code>,
            <code class="literal">env</code>, <code class="literal">hints</code>,
            <code class="literal">globus</code> and <code class="literal">selector</code> enjoy
            full support.</p>
</div>
<p>The second <span class="emphasis"><em>file</em></span> entry above shows a usage
          example from the black-diamond example workflow that you are more
          likely to encounter or write.</p>
<p>The presence of an in-file replica catalog lets you declare a
          couple of interesting advanced features. The DAG and DAX file
          declarations are just files for all practical purposes. For deferred
          planning, the location of the site catalog (SC) can be captured in a
          file, too, that is passed to the job dealing with the deferred
          planning as a logical filename.</p>
<pre class="programlisting">  &lt;file name="black.dax" &gt;
    &lt;!-- specify the location of the DAX file --&gt;
    &lt;pfn url="file:///Users/vahi/Pegasus/work/dax-3.0/blackdiamond_dax.xml" site="local"/&gt;
  &lt;/file&gt;

  &lt;file name="black.dag" &gt;
    &lt;!-- specify the location of the DAG file --&gt;
    &lt;pfn url="file:///Users/vahi/Pegasus/work/dax-3.0/blackdiamond.dag" site="local"/&gt;
  &lt;/file&gt;
  
  &lt;file name="sites.xml" &gt;
    &lt;!-- specify the location of a site catalog to use for deferred planning --&gt;
    &lt;pfn url="file:///Users/vahi/Pegasus/work/dax-3.0/conf/sites.xml" site="local"/&gt;
  &lt;/file&gt;</pre>
</div>
<div class="section" title="10.9.1.1.3.2. The Transformation Catalog Section">
<div class="titlepage"><div><div><h6 class="title">
<a name="dax_transformation_catalog"></a>10.9.1.1.3.2. The Transformation Catalog Section</h6></div></div></div>
<p>The executable section acts as an in-file transformation
          catalog (TC). Any transformations declared in this section take
          precedence over the external transformation catalog during
          planning.</p>
<pre class="programlisting">  &lt;!-- part 1.3: included transformation catalog --&gt;
  &lt;executable namespace="example" name="mDiffFit" version="1.0" 
              arch="x86_64" os="linux" installed="true" &gt;
    &lt;!-- profiles are optional --&gt;
    &lt;!-- The "stat" namespace is ONLY AN EXAMPLE! --&gt;
    &lt;profile namespace="stat" key="size"&gt;5000&lt;/profile&gt;
    &lt;profile namespace="stat" key="md5sum"&gt;AB454DSSDA4646DS&lt;/profile&gt;
    &lt;profile namespace="stat" key="mtime"&gt;2010-11-22T10:05:55.470606000-0800&lt;/profile&gt;

    &lt;!-- metadata is currently NOT SUPPORTED! --&gt;
    &lt;metadata key="timestamp" type="int"&gt;/* see above */&lt;/metadata&gt;
    &lt;metadata key="origin" type="string"&gt;ocean&lt;/metadata&gt;
 
    &lt;!-- PFN to by-pass transformation catalog --&gt;
    &lt;!-- The "site" attribute is optional --&gt;
    &lt;pfn url="file:///tmp/mDiffFit"          site="local"/&gt;     
    &lt;pfn url="file:///tmp/storage/mDiffFit"  site="local"/&gt;     
  &lt;/executable&gt;

  &lt;!-- to be used in compound transformation later --&gt;
  &lt;executable namespace="example" name="mDiff" version="1.0" 
              arch="x86_64" os="linux" installed="true" &gt;
    &lt;pfn url="file:///tmp/mDiff" site="local"/&gt;        
  &lt;/executable&gt;

  &lt;!-- to be used in compound transformation later --&gt;
  &lt;executable namespace="example" name="mFitplane" version="1.0"
              arch="x86_64" os="linux" installed="true" &gt;
    &lt;pfn url="file:///tmp/mDiffFitplane"  site="local"&gt;
      &lt;profile namespace="stat" key="md5sum"&gt;0a9c38b919c7809cb645fc09011588a6&lt;/profile&gt;
    &lt;/pfn&gt;
    &lt;invoke when="at_end"&gt;/path/to/my_send_email some args&lt;/invoke&gt;
  &lt;/executable&gt;

  &lt;!-- a more likely example from the black diamond --&gt;
  &lt;executable namespace="diamond" name="preprocess" version="2.0" 
              arch="x86_64"
              os="linux" 
              osversion="2.6.18"&gt;
    &lt;pfn url="file:///opt/pegasus/default/bin/keg" site="local" /&gt;
  &lt;/executable&gt;</pre>
<p>Logical filenames pertaining to a single executable in the
          transformation catalog use the <span class="emphasis"><em>executable</em></span>
          element. Any <span class="emphasis"><em>executable</em></span> element features the
          optional <span class="emphasis"><em>namespace</em></span> attribute, a mandatory
          <span class="emphasis"><em>name</em></span> attribute, and an optional
          <span class="emphasis"><em>version</em></span> attribute. The
          <span class="emphasis"><em>version</em></span> attribute defaults to "1.0" when
          absent. An executable typically needs additional attributes to
          describe it properly, like the architecture, OS release and other
          flags typically seen with transformations, or found in the
          transformation catalog.</p>
<div class="table">
<a name="idp13521744"></a><p class="title"><b>Table 10.19. </b></p>
<div class="table-contents"><table border="1">
<colgroup>
<col>
<col>
<col>
<col>
</colgroup>
<thead><tr>
<th>attribute</th>
<th>optional?</th>
<th>type</th>
<th>meaning</th>
</tr></thead>
<tbody>
<tr>
<td>name</td>
<td>required</td>
<td>string</td>
<td>logical transformation name</td>
</tr>
<tr>
<td>namespace</td>
<td>optional</td>
<td>string</td>
<td>namespace of logical transformation, defaults to the
                  <span class="emphasis"><em>null</em></span> value.</td>
</tr>
<tr>
<td>version</td>
<td>optional</td>
<td>VersionPattern</td>
<td>version of logical transformation, defaults to
                  "1.0".</td>
</tr>
<tr>
<td>installed</td>
<td>optional</td>
<td>boolean</td>
<td>whether to stage the file (false), or not (true,
                  default).</td>
</tr>
<tr>
<td>arch</td>
<td>optional</td>
<td>Architecture</td>
<td>restricted set of tokens, see schema definition
                  file.</td>
</tr>
<tr>
<td>os</td>
<td>optional</td>
<td>OSType</td>
<td>restricted set of tokens, see schema definition
                  file.</td>
</tr>
<tr>
<td>osversion</td>
<td>optional</td>
<td>VersionPattern</td>
<td>kernel version, matched against the beginning of the output of <code class="literal">uname -r</code>.</td>
</tr>
<tr>
<td>glibc</td>
<td>optional</td>
<td>VersionPattern</td>
<td>version of libc.</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break"><p>The rationale for giving these flags in the
          <span class="emphasis"><em>executable</em></span> element header is that PFNs are just
          identical replicas or instances of a given LFN. If you need a
          different word size (32-bit versus 64-bit) or OS release, the
          underlying PFN would be different, and thus the LFN for it
          should be different, too.</p>
<div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>We are still discussing some details and implications of
            this decision.</p>
</div>
<p>The initial examples come with the same caveats as for the
          included replica catalog.</p>
<div class="warning" title="Warning" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Warning</h3>
<p>The <span class="emphasis"><em>metadata</em></span> element is not supported as
            of this writing! Details may change in the future.</p>
</div>
<p>Similar to the replica catalog, each
          <span class="emphasis"><em>executable</em></span> element may have zero or more
          <span class="emphasis"><em>profile</em></span> elements abstracting away site-specific
          details, zero or more <span class="emphasis"><em>metadata</em></span> elements, and
          zero or more <span class="emphasis"><em>pfn</em></span> elements. If there are no
          <span class="emphasis"><em>pfn</em></span> elements, the transformation must still be
          searched for in the external transformation catalog. As before, the
          <span class="emphasis"><em>pfn</em></span> element may have
          <span class="emphasis"><em>profile</em></span> child elements, referring to
          attributes of the physical filename itself.</p>
<p>Each <span class="emphasis"><em>executable</em></span> element may also feature
          <span class="emphasis"><em>invoke</em></span> elements. These enable notifications at
          the appropriate point when every job that uses this executable
          reaches the point of notification. Please refer to the <a class="link" href="reference.php#notifications" title="10.7. Notifications">notification section</a> for details and
          caveats.</p>
<p>The last example above comes from the black diamond example
          workflow, and presents the kind and extent of attributes you are
          most likely to see and use in your own workflows.</p>
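<p>As a minimal sketch of how a generator could emit such an
          <span class="emphasis"><em>executable</em></span> element, the following Python
          snippet uses the standard <code class="literal">xml.etree.ElementTree</code>
          module. The helper name <code class="literal">make_executable</code> is
          hypothetical and not part of any Pegasus API, and the sketch ignores
          the DAX XML namespace as well as <span class="emphasis"><em>profile</em></span>
          and <span class="emphasis"><em>invoke</em></span> children.</p>

```python
import xml.etree.ElementTree as ET


def make_executable(name, pfn_url, site, namespace=None, version=None, **flags):
    """Build an executable element with one pfn child (hypothetical helper)."""
    attrs = {"name": name}  # name is the only mandatory attribute
    if namespace is not None:
        attrs["namespace"] = namespace
    if version is not None:
        attrs["version"] = version
    # Optional descriptive flags such as arch, os, osversion, glibc, installed.
    attrs.update(flags)
    exe = ET.Element("executable", attrs)
    ET.SubElement(exe, "pfn", {"url": pfn_url, "site": site})
    return exe


exe = make_executable(
    "preprocess", "file:///opt/pegasus/default/bin/keg", "local",
    namespace="diamond", version="2.0",
    arch="x86_64", os="linux", osversion="2.6.18",
)
print(ET.tostring(exe, encoding="unicode"))
```

<p>The attribute values mirror the black diamond example above. Omitted
          optional attributes are simply left out of the element, so the
          documented defaults apply.</p>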
</div>
<div class="section" title="10.9.1.1.3.3. The Compound Transformation Section">
<div class="titlepage"><div><div><h6 class="title">
<a name="idp13093088"></a>10.9.1.1.3.3. The Compound Transformation Section</h6></div></div></div>
<p>The compound transformation section declares a transformation
          that comprises multiple plain transformations. You can think of a
          compound transformation like a script interpreter and the script
          itself. In order to properly run the application, you must start
          both the script interpreter and the script passed to it. The
          compound transformation helps Pegasus to properly deal with this
          case, especially when it needs to stage executables.</p>
<pre class="programlisting">  &lt;transformation namespace="example" version="1.0" name="mDiffFit" &gt;
    &lt;uses name="mDiffFit" /&gt;
    &lt;uses name="mDiff" namespace="example" version="2.0" /&gt;
    &lt;uses name="mFitplane" /&gt;
    &lt;uses name="mDiffFit.config" executable="false" /&gt;
  &lt;/transformation&gt;</pre>
<p>A <span class="emphasis"><em>transformation</em></span> element declares a set
          of purely logical entities, executables and config (data) files,
          that are all required together for the same job. Being purely
          logical entities, the lookup happens only when the transformation
          element is referenced (or instantiated) by a job element later
          on.</p>
<p>The <span class="emphasis"><em>namespace</em></span> and
          <span class="emphasis"><em>version</em></span> attributes of the transformation
          element are optional, and provide the defaults for the inner uses
          elements. They are also essential for matching the transformation
          with a job.</p>
<p>The <span class="emphasis"><em>transformation</em></span> is made up of one or
          more <span class="emphasis"><em>uses</em></span> elements. Each
          <span class="emphasis"><em>uses</em></span> has a boolean attribute
          <span class="emphasis"><em>executable</em></span>, <code class="literal">true</code> by default,
          or <code class="literal">false</code> to indicate a data file. The
          <span class="emphasis"><em>name</em></span> is a mandatory attribute, referring to an
          LFN declared previously in the File Catalog
          (<span class="emphasis"><em>executable</em></span> is <code class="literal">false</code>),
          Executable Catalog (<span class="emphasis"><em>executable</em></span> is
          <code class="literal">true</code>), or to be looked up as necessary at
          instantiation time. The lookup catalog is determined by the
          <span class="emphasis"><em>executable</em></span> attribute.</p>
<p>After <span class="emphasis"><em>uses</em></span> elements, any number of
          <span class="emphasis"><em>invoke</em></span> elements may occur to add a <a class="link" href="reference.php#notifications" title="10.7. Notifications">notification</a> each whenever this
          transformation is instantiated.</p>
<p>The <span class="emphasis"><em>namespace</em></span> and
          <span class="emphasis"><em>version</em></span> attributes' default values inside
          <span class="emphasis"><em>uses</em></span> elements are inherited from the
          <span class="emphasis"><em>transformation</em></span> attributes of the same name.
          There is no such inheritance for <span class="emphasis"><em>uses</em></span> elements
          with <span class="emphasis"><em>executable</em></span> attribute of
          <code class="literal">false</code>.</p>
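<p>The inheritance rule can be illustrated with a small Python sketch.
          Plain dictionaries stand in for the XML attributes, and the function
          name <code class="literal">resolve_uses</code> is hypothetical. This is
          a reading of the rule as described in this section, not the planner's
          actual implementation.</p>

```python
def resolve_uses(transformation, uses):
    """Return the (namespace, name, version) triple a uses entry refers to."""
    # The executable attribute is true by default.
    is_executable = uses.get("executable", "true") == "true"
    if is_executable:
        # Executables inherit missing attributes from the transformation.
        namespace = uses.get("namespace", transformation.get("namespace"))
        version = uses.get("version", transformation.get("version"))
    else:
        # Data files do not inherit anything from the transformation.
        namespace = uses.get("namespace")
        version = uses.get("version")
    return (namespace, uses["name"], version)


transformation = {"namespace": "example", "name": "mDiffFit", "version": "1.0"}
uses_list = [
    {"name": "mDiffFit"},
    {"name": "mDiff", "namespace": "example", "version": "2.0"},
    {"name": "mDiffFit.config", "executable": "false"},
]
resolved = [resolve_uses(transformation, u) for u in uses_list]
for triple in resolved:
    print(triple)
```

<p>In this sketch the first entry picks up both defaults from the
          transformation, the second keeps its explicit values, and the data
          file inherits nothing.</p>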
</div>
</div>
<div class="section" title="10.9.1.1.4. Graph Nodes">
<div class="titlepage"><div><div><h5 class="title">
<a name="api-graph-nodes"></a>10.9.1.1.4. Graph Nodes</h5></div></div></div>
<p>The nodes in the DAX comprise regular job nodes, already
        instantiated sub-workflows as dag nodes, and still to be instantiated
        dax nodes. Each graph node has a mandatory
        <span class="emphasis"><em>id</em></span> attribute. The <span class="emphasis"><em>id</em></span>
        attribute is currently of type
        <span class="emphasis"><em>NodeIdentifierPattern</em></span>, which is a
        restriction of the <code class="code">xs:NMTOKEN</code> type to letters, digits,
        hyphen and underscore.</p>
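<p>A generator may want to check identifiers before writing them out.
        The following Python sketch approximates the restriction with a
        regular expression over ASCII letters, digits, hyphen and underscore.
        The exact pattern is an assumption; the authoritative definition is the
        one in the schema definition file.</p>

```python
import re

# Assumed approximation of NodeIdentifierPattern: ASCII letters, digits,
# hyphen and underscore. The authoritative pattern lives in the schema file.
NODE_ID = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_node_id(node_id):
    """Return True if node_id consists only of the permitted characters."""
    return NODE_ID.fullmatch(node_id) is not None


print(is_valid_node_id("ID000001"))
print(is_valid_node_id("job 1"))
```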
<p>The <span class="emphasis"><em>level</em></span> attribute is deprecated, as the
        planner will trust its own re-computation more than user input. Please
        neither use nor produce any <span class="emphasis"><em>level</em></span>
        attribute.</p>
<p>The <span class="emphasis"><em>node-label</em></span> attribute is optional. It
        applies to the use-case when every transformation has the same name,
        but its arguments determine what it really does. In the presence of a
        <span class="emphasis"><em>node-label</em></span> value, a workflow grapher could use
        the label value to show graph nodes to the user. It may also come in
        handy while debugging.</p>
<p>Any job-like graph node has the following set of children
        elements, as defined in the <span class="emphasis"><em>AbstractJobType</em></span>
        declaration in the schema definition:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>0 or 1 <span class="emphasis"><em>argument</em></span> element to declare the
            command-line of the job's invocation.</p></li>
<li class="listitem"><p>0 or more <span class="emphasis"><em>profile</em></span> elements to abstract
            away site-specific or job-specific details.</p></li>
<li class="listitem"><p>0 or 1 <span class="emphasis"><em>stdin</em></span> element to link a logical
            file to the job's standard input.</p></li>
<li class="listitem"><p>0 or 1 <span class="emphasis"><em>stdout</em></span> element to link a logical
            file to the job's standard output.</p></li>
<li class="listitem"><p>0 or 1 <span class="emphasis"><em>stderr</em></span> element to link a logical
            file to the job's standard error.</p></li>
<li class="listitem"><p>0 or more <span class="emphasis"><em>uses</em></span> elements to declare
            consumed data files and produced data files.</p></li>
<li class="listitem"><p>0 or more <span class="emphasis"><em>invoke</em></span> elements to solicit
            <a class="link" href="reference.php#notifications" title="10.7. Notifications">notifications</a> when a job
            reaches a certain state in its life-cycle.</p></li>
</ul></div>
<div class="section" title="10.9.1.1.4.1. Job Nodes">
<div class="titlepage"><div><div><h6 class="title">
<a name="api-job-nodes"></a>10.9.1.1.4.1. Job Nodes</h6></div></div></div>
<p>A job element has a number of attributes. In addition to the
          <span class="emphasis"><em>id</em></span> and <span class="emphasis"><em>node-label</em></span>
          described in <a class="link" href="reference.php#api-graph-nodes" title="10.9.1.1.4. Graph Nodes">Graph Nodes</a> above, the optional
          <span class="emphasis"><em>namespace</em></span>, mandatory <span class="emphasis"><em>name</em></span>
          and optional <span class="emphasis"><em>version</em></span> identify the
          transformation, and provide the look-up handle: first in the DAX's
          <span class="emphasis"><em>transformation</em></span> elements, then in the
          <span class="emphasis"><em>executable</em></span> elements, and finally in an external
          transformation catalog.</p>
<pre class="programlisting">  &lt;!-- part 2: definition of all jobs (at least one) --&gt;
  &lt;job id="ID000001" namespace="example" name="mDiffFit" version="1.0" 
       node-label="preprocess" &gt;
    &lt;argument&gt;-a top -T 6  -i &lt;file name="f.a"/&gt;  -o &lt;file name="f.b1"/&gt;&lt;/argument&gt;

    &lt;!-- profiles are optional --&gt;
    &lt;profile namespace="execution" key="site"&gt;isi_viz&lt;/profile&gt;
    &lt;profile namespace="condor" key="getenv"&gt;true&lt;/profile&gt;

    &lt;uses name="f.a" link="input"  register="false" transfer="true" type="data" /&gt;
    &lt;uses name="f.b1" link="output" register="false" transfer="true" type="data" /&gt;
    
    &lt;!-- 'WHEN' enumeration: never, start, on_error, on_success, at_end, all --&gt;
    &lt;!-- PEGASUS_* env-vars: event, status, submit dir, wf/job id, stdout, stderr --&gt;
    &lt;invoke when="start"&gt;/path/to arg arg&lt;/invoke&gt;
    &lt;invoke when="on_success"&gt;&lt;![CDATA[/path/to arg arg]]&gt;&lt;/invoke&gt;
    &lt;invoke when="at_end"&gt;&lt;![CDATA[/path/to arg arg]]&gt;&lt;/invoke&gt;
  &lt;/job&gt;</pre>
<p>The <span class="emphasis"><em>argument</em></span> element contains the
          complete command-line that is needed to invoke the executable. The
          only variable components are logical filenames, as included
          <span class="emphasis"><em>file</em></span> elements.</p>
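<p>Because the <span class="emphasis"><em>argument</em></span> element mixes
          plain text with <span class="emphasis"><em>file</em></span> children,
          generators need to handle XML mixed content. The Python sketch below
          shows one way to do this with the standard
          <code class="literal">xml.etree.ElementTree</code> module, where text
          after a child element is stored in that child's
          <code class="literal">tail</code>. The helper
          <code class="literal">make_argument</code> and its tuple convention are
          hypothetical.</p>

```python
import xml.etree.ElementTree as ET


def make_argument(parts):
    """Build an argument element from strings and ("file", name) tuples."""
    arg = ET.Element("argument")
    last = None  # most recently added file child, if any
    for part in parts:
        if isinstance(part, tuple):
            last = ET.SubElement(arg, "file", {"name": part[1]})
        elif last is None:
            # Text before the first child belongs to the element's text.
            arg.text = (arg.text or "") + part
        else:
            # Text after a child belongs to that child's tail.
            last.tail = (last.tail or "") + part
    return arg


arg = make_argument(["-a top -T 6 -i ", ("file", "f.a"), " -o ", ("file", "f.b1")])
print(ET.tostring(arg, encoding="unicode"))
```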
<p>The <span class="emphasis"><em>profile</em></span> element lets you encapsulate
          site-specific knowledge.</p>
<p>The <span class="emphasis"><em>stdin</em></span>, <span class="emphasis"><em>stdout</em></span>
          and <span class="emphasis"><em>stderr</em></span> elements permit you to connect a
          stdio file descriptor to a logical filename. Note that you will
          still have to declare these files in the <span class="emphasis"><em>uses</em></span>
          section below.</p>
<p>The <span class="emphasis"><em>uses</em></span> element enumerates all the files
          that the task consumes or produces. While files are not
          required to appear on the command-line, it is
          imperative that you declare even hidden files that your task
          requires in this section, so that the proper ancillary staging and
          clean-up tasks can be generated during planning.</p>
<p>The <span class="emphasis"><em>invoke</em></span> element may be specified
          multiple times, as needed. It has a mandatory <span class="emphasis"><em>when</em></span> attribute with
          the following value set:</p>
<div class="table">
<a name="idp13462064"></a><p class="title"><b>Table 10.20. </b></p>
<div class="table-contents"><table border="1">
<colgroup>
<col>
<col>
<col>
</colgroup>
<thead><tr>
<th align="center">keyword</th>
<th align="center">job life-cycle state</th>
<th align="center">meaning</th>
</tr></thead>
<tbody>
<tr>
<td>never</td>
<td>never</td>
<td>(default). Never notify of anything. This is useful
                  to temporarily disable an existing notification.</td>
</tr>
<tr>
<td>start</td>
<td>submit</td>
<td>create a notification when the job is
                  submitted.</td>
</tr>
<tr>
<td>on_error</td>
<td>end</td>
<td>after a job finishes with failure (exitcode !=
                  0).</td>
</tr>
<tr>
<td>on_success</td>
<td>end</td>
<td>after a job finishes with success (exitcode ==
                  0).</td>
</tr>
<tr>
<td>at_end</td>
<td>end</td>
<td>after a job finishes, regardless of exitcode.</td>
</tr>
<tr>
<td>all</td>
<td>always</td>
<td>like start and at_end combined.</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break"><div class="warning" title="Warning" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Warning</h3>
<p>In clustered jobs, a notification can only be sent at the
            start or end of the clustered job, not for each member.</p>
</div>
<p>Each <span class="emphasis"><em>invoke</em></span> is a simple local invocation
          of an executable or script with the specified arguments. The
          executable inside the invoke body will see the following environment
          variables:</p>
<div class="table">
<a name="idp12299056"></a><p class="title"><b>Table 10.21. </b></p>
<div class="table-contents"><table border="1">
<colgroup>
<col>
<col>
<col>
</colgroup>
<thead><tr>
<th align="center">variable</th>
<th align="center">job life-cycle state</th>
<th align="center">meaning</th>
</tr></thead>
<tbody>
<tr>
<td>PEGASUS_EVENT</td>
<td>always</td>
<td>The value of the <code class="code">when</code> attribute</td>
</tr>
<tr>
<td>PEGASUS_STATUS</td>
<td>end</td>
<td>The exit status of the graph node. Only available for
                  end notifications.</td>
</tr>
<tr>
<td>PEGASUS_SUBMIT_DIR</td>
<td>always</td>
<td>In which directory to find the job (or
                  workflow).</td>
</tr>
<tr>
<td>PEGASUS_JOBID</td>
<td>always</td>
<td>The job (or workflow) identifier. This is potentially
                  more than merely the value of the <span class="emphasis"><em>id</em></span>
                  attribute.</td>
</tr>
<tr>
<td>PEGASUS_STDOUT</td>
<td>always</td>
<td>The filename where <span class="emphasis"><em>stdout</em></span> goes.
                  Empty and possibly non-existent at submit time (though we
                  still have the filename). The kickstart record for job
                  nodes.</td>
</tr>
<tr>
<td>PEGASUS_STDERR</td>
<td>always</td>
<td>The filename where <span class="emphasis"><em>stderr</em></span> goes.
                  Empty and possibly non-existent at submit time (though we
                  still have the filename).</td>
</tr>
</tbody>
</table></div>
</div>
<br class="table-break"><p>Generators should wrap the body of the invoke
          element in a CDATA section to minimize interference. Unfortunately,
          CDATA sections cannot be nested, so if the user invocation itself
          contains a CDATA section, we suggest using carefully
          XML-entity-escaped strings instead. The <a class="link" href="reference.php#notifications" title="10.7. Notifications">notifications section</a> describes these
          in further detail.</p>
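<p>A notification script can pick up the variables listed above from
          its environment. The following Python sketch formats a one-line
          summary. The function name <code class="literal">describe_notification</code>
          and the message layout are hypothetical; only the
          <code class="literal">PEGASUS_*</code> variable names are taken from
          the table.</p>

```python
import os


def describe_notification(env):
    """Format a one-line summary from the PEGASUS_* variables."""
    event = env.get("PEGASUS_EVENT", "unknown")
    jobid = env.get("PEGASUS_JOBID", "unknown")
    line = "job " + jobid + ": event " + event
    # PEGASUS_STATUS is only available for end notifications.
    status = env.get("PEGASUS_STATUS")
    if status is not None:
        line += ", exit status " + status
    return line


if __name__ == "__main__":
    print(describe_notification(os.environ))
```

<p>Passing the environment as a parameter keeps the function easy to
          try out with a plain dictionary before wiring the script into an
          <span class="emphasis"><em>invoke</em></span> element.</p>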
</div>
<div class="section" title="10.9.1.1.4.2. DAG Nodes">
<div class="titlepage"><div><div><h6 class="title">
<a name="idp14568736"></a>10.9.1.1.4.2. DAG Nodes</h6></div></div></div>
<p>A workflow that has already been concretized, either by an
          earlier run of Pegasus, or otherwise constructed for DAGMan
          execution, can be included into the current workflow using the
          <span class="emphasis"><em>dag</em></span> element.</p>
<pre class="programlisting">  &lt;dag id="ID000003" name="black.dag" node-label="foo" &gt;
    &lt;profile namespace="dagman" key="DIR"&gt;/dag-dir/test&lt;/profile&gt;
    &lt;invoke&gt; &lt;!-- optional, should be possible --&gt; &lt;/invoke&gt;
    &lt;uses name="sites.xml" link="input" register="false" transfer="true" type="data"/&gt;     
  &lt;/dag&gt;</pre>
<p>The <span class="emphasis"><em>id</em></span> and
          <span class="emphasis"><em>node-label</em></span> attributes were described <a class="link" href="reference.php#api-graph-nodes" title="10.9.1.1.4. Graph Nodes">previously</a>. The
          <span class="emphasis"><em>name</em></span> attribute refers to a file from the File
          Catalog that provides the actual DAGMan DAG as data content. The
          <span class="emphasis"><em>dag</em></span> element features optional
          <span class="emphasis"><em>profile</em></span> elements. These would most likely
          pertain to the <code class="literal">dagman</code> and <code class="literal">env</code>
          profile namespaces. It should be possible to have the optional
          <span class="emphasis"><em>invoke</em></span> element in the same manner as for
          jobs.</p>
<p>A graph node that is a dag instead of a job would just use a
          different submit file generator to create a DAGMan invocation. There
          can be an <span class="emphasis"><em>argument</em></span> element to modify the
          command-line passed to DAGMan.</p>
</div>
<div class="section" title="10.9.1.1.4.3. DAX Nodes">
<div class="titlepage"><div><div><h6 class="title">
<a name="idp16938336"></a>10.9.1.1.4.3. DAX Nodes</h6></div></div></div>
<p>A still to be planned workflow incurs an invocation of the
          Pegasus planner as part of the workflow. This still abstract
          sub-workflow uses the <span class="emphasis"><em>dax</em></span> element.</p>
<pre class="programlisting">  &lt;dax id="ID000002" name="black.dax" node-label="bar" &gt;
    &lt;profile namespace="env" key="foo"&gt;bar&lt;/profile&gt;
    &lt;argument&gt;-Xmx1024 -Xms512 -Dpegasus.dir.storage=storagedir  -Dpegasus.dir.exec=execdir -o local --dir ./datafind -vvvvv --force -s dax_site &lt;/argument&gt;
    &lt;invoke&gt; &lt;!-- optional, may not be possible here --&gt; &lt;/invoke&gt;
    &lt;uses name="sites.xml" link="input" register="false" transfer="true" type="data" /&gt;
  &lt;/dax&gt;</pre>
<p>The <span class="emphasis"><em>id</em></span> and
          <span class="emphasis"><em>node-label</em></span> attributes are described in <a class="link" href="reference.php#api-graph-nodes" title="10.9.1.1.4. Graph Nodes">Graph Nodes</a>. The
          <span class="emphasis"><em>name</em></span> attribute refers to a file from the File
          Catalog that provides the to-be-planned DAX as external file data
          content. The <span class="emphasis"><em>dax</em></span> element features optional
          <span class="emphasis"><em>profile</em></span> elements. These would most likely
          pertain to the <code class="literal">pegasus</code>, <code class="literal">dagman</code>
          and <code class="literal">env</code> profile namespaces. It may be possible to
          have the optional <span class="emphasis"><em>invoke</em></span> element in the same
          manner as for jobs.</p>
<p>A graph node that is a <span class="emphasis"><em>dax</em></span> instead of a
          job would just use yet another submit file and pre-script generator
          to create a DAGMan invocation. The <span class="emphasis"><em>argument</em></span>
          string pertains to the command line of the to-be-generated DAGMan
          invocation.</p>
</div>
<div class="section" title="10.9.1.1.4.4. Inner ADAG Nodes">
<div class="titlepage"><div><div><h6 class="title">
<a name="idp14334928"></a>10.9.1.1.4.4. Inner ADAG Nodes</h6></div></div></div>
<p>While completeness would argue to have a recursive nesting of
          <span class="emphasis"><em>adag</em></span> elements, such recursive nestings are
          currently not supported, not even in the schema. If you need to nest
          workflows, please use the <span class="emphasis"><em>dax</em></span> or
          <span class="emphasis"><em>dag</em></span> element to achieve the same goal.</p>
</div>
</div>
<div class="section" title="10.9.1.1.5. The Dependency Section">
<div class="titlepage"><div><div><h5 class="title">
<a name="idp15634352"></a>10.9.1.1.5. The Dependency Section</h5></div></div></div>
<p>This section describes the dependencies between the jobs.</p>
<pre class="programlisting">  &lt;!-- part 3: list of control-flow dependencies --&gt;
  &lt;child ref="ID000002"&gt;
    &lt;parent ref="ID000001" edge-label="edge1" /&gt;
  &lt;/child&gt;
  &lt;child ref="ID000003"&gt;
    &lt;parent ref="ID000001" edge-label="edge2" /&gt;
  &lt;/child&gt;
  &lt;child ref="ID000004"&gt;
    &lt;parent ref="ID000002" edge-label="edge3" /&gt;
    &lt;parent ref="ID000003" edge-label="edge4" /&gt;
  &lt;/child&gt;</pre>
<p>Each <span class="emphasis"><em>child</em></span> element contains one or more
        <span class="emphasis"><em>parent</em></span> elements. Either element refers to the
        <span class="emphasis"><em>id</em></span> attribute of a
        <span class="emphasis"><em>job</em></span>, <span class="emphasis"><em>dag</em></span> or
        <span class="emphasis"><em>dax</em></span> element using the
        <span class="emphasis"><em>ref</em></span> attribute. In this version, we relaxed the
        <code class="code">xs:IDREF</code> constraint in favor of a restriction on the
        <code class="code">xs:NMTOKEN</code> type to permit a larger set of
        identifiers.</p>
<p>The <span class="emphasis"><em>parent</em></span> element has an optional
        <span class="emphasis"><em>edge-label</em></span> attribute.</p>
<div class="warning" title="Warning" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Warning</h3>
<p>The <span class="emphasis"><em>edge-label</em></span> attribute is currently
          unused.</p>
</div>
<p>The goal of the <span class="emphasis"><em>edge-label</em></span> attribute is
        to annotate edges when drawing workflow graphs.</p>
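<p>To illustrate what the dependency section expresses, the Python
        sketch below derives one valid execution order from the child and
        parent references of the listing above. It uses the standard
        <code class="literal">graphlib</code> module (Python 3.9 or later). The
        dictionary layout is an assumption made for this example and does not
        reflect how the planner stores dependencies.</p>

```python
from graphlib import TopologicalSorter

# Each child id is mapped to the set of its parent ids.
dependencies = {
    "ID000002": {"ID000001"},
    "ID000003": {"ID000001"},
    "ID000004": {"ID000002", "ID000003"},
}

# TopologicalSorter expects each node mapped to its predecessors, which is
# exactly the child-to-parents layout used here.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

<p>Every parent appears before its children in the resulting list. A
        cyclic set of dependencies would make
        <code class="literal">static_order</code> raise a
        <code class="literal">CycleError</code>.</p>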
</div>
<div class="section" title="10.9.1.1.6. Closing">
<div class="titlepage"><div><div><h5 class="title">
<a name="idp8295280"></a>10.9.1.1.6. Closing</h5></div></div></div>
<p>Like any XML element, the root element needs to be closed.</p>
<pre class="programlisting">&lt;/adag&gt;</pre>
</div>
</div>
<div class="section" title="10.9.1.2. DAX XML Schema Example">
<div class="titlepage"><div><div><h4 class="title">
<a name="idp16748928"></a>10.9.1.2. DAX XML Schema Example</h4></div></div></div>
<p>The following code example shows the XML instance document
      representing the diamond workflow.</p>
<pre class="programlisting">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;adag xmlns="http://pegasus.isi.edu/schema/DAX"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://pegasus.isi.edu/schema/DAX http://pegasus.isi.edu/schema/dax-3.3.xsd"
 version="3.3" name="diamond" index="0" count="1"&gt;
  &lt;!-- part 1.1: invocations --&gt;
  &lt;invoke when="on_error"&gt;/bin/mailx -s &amp;apos;diamond failed&amp;apos; use@some.domain&lt;/invoke&gt;

  &lt;!-- part 1.2: included replica catalog --&gt;
  &lt;file name="f.a"&gt;
    &lt;pfn url="file:///lfs/voeckler/src/svn/pegasus/trunk/examples/grid-blackdiamond-perl/f.a" site="local" /&gt;
  &lt;/file&gt;

  &lt;!-- part 1.3: included transformation catalog --&gt;
  &lt;executable namespace="diamond" name="preprocess" version="2.0" arch="x86_64" os="linux" installed="false"&gt;
    &lt;profile namespace="globus" key="maxtime"&gt;2&lt;/profile&gt;
    &lt;profile namespace="dagman" key="RETRY"&gt;3&lt;/profile&gt;
    &lt;pfn url="file:///opt/pegasus/latest/bin/keg" site="local" /&gt;
  &lt;/executable&gt;
  &lt;executable namespace="diamond" name="analyze" version="2.0" arch="x86_64" os="linux" installed="false"&gt;
    &lt;profile namespace="globus" key="maxtime"&gt;2&lt;/profile&gt;
    &lt;profile namespace="dagman" key="RETRY"&gt;3&lt;/profile&gt;
    &lt;pfn url="file:///opt/pegasus/latest/bin/keg" site="local" /&gt;
  &lt;/executable&gt;
  &lt;executable namespace="diamond" name="findrange" version="2.0" arch="x86_64" os="linux" installed="false"&gt;
    &lt;profile namespace="globus" key="maxtime"&gt;2&lt;/profile&gt;
    &lt;profile namespace="dagman" key="RETRY"&gt;3&lt;/profile&gt;
    &lt;pfn url="file:///opt/pegasus/latest/bin/keg" site="local" /&gt;
  &lt;/executable&gt;

  &lt;!-- part 2: definition of all jobs (at least one) --&gt;
  &lt;job namespace="diamond" name="preprocess" version="2.0" id="ID000001"&gt;
    &lt;argument&gt;-a preprocess -T60 -i &lt;file name="f.a" /&gt; -o &lt;file name="f.b1" /&gt; &lt;file name="f.b2" /&gt;&lt;/argument&gt;
    &lt;uses name="f.b2" link="output" register="false" transfer="true" /&gt;
    &lt;uses name="f.b1" link="output" register="false" transfer="true" /&gt;
    &lt;uses name="f.a" link="input" /&gt;
  &lt;/job&gt;
  &lt;job namespace="diamond" name="findrange" version="2.0" id="ID000002"&gt;
    &lt;argument&gt;-a findrange -T60 -i &lt;file name="f.b1" /&gt; -o &lt;file name="f.c1" /&gt;&lt;/argument&gt;
    &lt;uses name="f.b1" link="input" register="false" transfer="true" /&gt;
    &lt;uses name="f.c1" link="output" register="false" transfer="true" /&gt;
  &lt;/job&gt;
  &lt;job namespace="diamond" name="findrange" version="2.0" id="ID000003"&gt;
    &lt;argument&gt;-a findrange -T60 -i &lt;file name="f.b2" /&gt; -o &lt;file name="f.c2" /&gt;&lt;/argument&gt;
    &lt;uses name="f.b2" link="input" register="false" transfer="true" /&gt;
    &lt;uses name="f.c2" link="output" register="false" transfer="true" /&gt;
  &lt;/job&gt;
  &lt;job namespace="diamond" name="analyze" version="2.0" id="ID000004"&gt;
    &lt;argument&gt;-a analyze -T60 -i &lt;file name="f.c1" /&gt; &lt;file name="f.c2" /&gt; -o &lt;file name="f.d" /&gt;&lt;/argument&gt;
    &lt;uses name="f.c2" link="input" register="false" transfer="true" /&gt;
    &lt;uses name="f.d" link="output" register="false" transfer="true" /&gt;
    &lt;uses name="f.c1" link="input" register="false" transfer="true" /&gt;
  &lt;/job&gt;

  &lt;!-- part 3: list of control-flow dependencies --&gt;
  &lt;child ref="ID000002"&gt;
    &lt;parent ref="ID000001" /&gt;
  &lt;/child&gt;
  &lt;child ref="ID000003"&gt;
    &lt;parent ref="ID000001" /&gt;
  &lt;/child&gt;
  &lt;child ref="ID000004"&gt;
    &lt;parent ref="ID000002" /&gt;
    &lt;parent ref="ID000003" /&gt;
  &lt;/child&gt;
&lt;/adag&gt;
</pre>
<p>The above workflow defines the black diamond from the abstract
      workflow section of the <a class="link" href="about.php" title="Chapter 1. Introduction">Introduction</a>
      chapter. It will require minimal configuration, because the catalog
      sections include all necessary declarations.</p>
<p>Please note the following:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem"><p>The <span class="bold"><strong>file</strong></span> element declares the
          required input file "f.a" in terms of the local machine. Please note
          that if you plan the workflow for a remote site, there has to be some
          way for the file to be staged from the local site to the remote
          site. While Pegasus will augment the workflow with such ancillary
          jobs, the site catalog as well as local and remote site have to be
          set up properly. For a locally run workflow you don't need to do
          anything.</p></li>
<li class="listitem"><p>The <span class="bold"><strong>executable</strong></span> elements
          declare the same executable keg that is to be run for each
          logical transformation, here in terms of the site
          <code class="literal">local</code>. To declare it for a remote site, you
          would have to adjust the <span class="emphasis"><em>site</em></span> attribute's value
          to the name of that site. This section also shows that the same
          executable may come in different guises as transformations.</p></li>
<li class="listitem"><p>The <span class="bold"><strong>job</strong></span> elements define the
          workflow's logical constituents, the way to invoke the
          <code class="literal">keg</code> command, where to put filenames on the
          command line, and what files are consumed or produced. In addition to
          the direction of files, further attributes determine whether to
          register the file with a replica catalog and whether to transfer it
          to the output site in case of a product. We are only interested in
          the final data product "f.d" in this workflow, and not any
          intermediary files. Typically, you would also want to register the
          data products in the replica catalog, especially in larger
          scenarios.</p></li>
<li class="listitem"><p>The <span class="bold"><strong>child</strong></span> elements define the
          control flow between the jobs.</p></li>
</ul></div>
</div>
</div>
<div class="section" title="10.9.2. DAX Generator API">
<div class="titlepage"><div><div><h3 class="title">
<a name="dax_generator_api"></a>10.9.2. DAX Generator API</h3></div></div></div>
<p>The DAX generating APIs support Java, Perl and Python. This section
    will show in each language the necessary code, using Pegasus-provided
    libraries, to generate the diamond DAX example above. There may be minor
    differences in details, e.g. to showcase certain features, but
    effectively all generate the same basic diamond.</p>
<div class="section" title="10.9.2.1. The Java DAX Generator API">
<div class="titlepage"><div><div><h4 class="title">
<a name="api-java"></a>10.9.2.1. The Java DAX Generator API</h4></div></div></div>
<p>The Java DAX API provided with the Pegasus distribution allows
      easy creation of complex and huge workflows. This API is used by several
      applications to generate their abstract DAX. The Southern
      California Earthquake Center (SCEC) uses this API in their CyberShake workflow
      generator to generate huge DAXes containing tens of thousands
      of tasks with hundreds of thousands of input and output files.
      The <a class="ulink" href="javadoc/index.html" target="_top">Java API</a> is well documented
      using <a class="ulink" href="javadoc/edu/isi/pegasus/planner/dax/ADAG.html" target="_top">Javadoc
      for ADAGs</a>.</p>
<p>The steps involved in creating a DAX using the API are:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>Create a new <span class="emphasis"><em>ADAG</em></span> object</p></li>
<li class="listitem"><p>Add any Workflow notification elements</p></li>
<li class="listitem"><p>Create <span class="emphasis"><em>File</em></span> objects as necessary. You can
          augment the files with physical information, if you want to include
          them into your DAX. Otherwise, the physical information is
          determined from the replica catalog.</p></li>
<li class="listitem"><p>(Optional) Create <span class="emphasis"><em>Executable</em></span> objects, if
          you want to include your transformation catalog into your DAX.
          Otherwise, the translation of a job/task into executable location
          happens with the transformation catalog.</p></li>
<li class="listitem"><p>Create a new <span class="emphasis"><em>Job</em></span> object.</p></li>
<li class="listitem"><p>Add arguments, files, profiles, notifications and other
          information to the <span class="emphasis"><em>Job</em></span> object</p></li>
<li class="listitem"><p>Add the job object to the <span class="emphasis"><em>ADAG</em></span>
          object</p></li>
<li class="listitem"><p>Repeat steps 5-7 as necessary.</p></li>
<li class="listitem"><p>Add all dependencies to the <span class="emphasis"><em>ADAG</em></span>
          object.</p></li>
<li class="listitem"><p>Call the <span class="emphasis"><em>writeToFile()</em></span> method on the
          <span class="emphasis"><em>ADAG</em></span> object to render the XML DAX file.</p></li>
</ol></div>
<p>Example Java code that generates the diamond DAX shown above is
      listed below. The same code can be found in the Pegasus distribution in
      the <code class="filename">examples/grid-blackdiamond-java</code> directory
      as <code class="filename">BlackDiamondDAX.java</code>:</p>
<pre class="programlisting">/**
 *  Copyright 2007-2008 University Of Southern California
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *  http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing,
 *  software distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 */

import edu.isi.pegasus.planner.dax.*;


/**
 * An example class to highlight how to use the JAVA DAX API to generate a diamond
 * DAX.
 * 
 */
public class Diamond {

    

    public ADAG generate(String site_handle, String pegasus_location) throws Exception {

        java.io.File cwdFile = new java.io.File (".");
        String cwd = cwdFile.getCanonicalPath(); 

        ADAG dax = new ADAG("blackdiamond");
        dax.addNotification(Invoke.WHEN.start,"/pegasus/libexec/notification/email -t notify@example.com");
        dax.addNotification(Invoke.WHEN.at_end,"/pegasus/libexec/notification/email -t notify@example.com");
        File fa = new File("f.a");
        fa.addPhysicalFile("file://" + cwd + "/f.a", "local");
        dax.addFile(fa);

        File fb1 = new File("f.b1");
        File fb2 = new File("f.b2");
        File fc1 = new File("f.c1");
        File fc2 = new File("f.c2");
        File fd = new File("f.d");
        fd.setRegister(true);

        Executable preprocess = new Executable("pegasus", "preprocess", "4.0");
        preprocess.setArchitecture(Executable.ARCH.X86).setOS(Executable.OS.LINUX);
        preprocess.setInstalled(true);
        preprocess.addPhysicalFile("file://" + pegasus_location + "/bin/keg", site_handle);

        Executable findrange = new Executable("pegasus", "findrange", "4.0");
        findrange.setArchitecture(Executable.ARCH.X86).setOS(Executable.OS.LINUX);
        findrange.setInstalled(true);
        findrange.addPhysicalFile("file://" + pegasus_location + "/bin/keg", site_handle);

        Executable analyze = new Executable("pegasus", "analyze", "4.0");
        analyze.setArchitecture(Executable.ARCH.X86).setOS(Executable.OS.LINUX);
        analyze.setInstalled(true);
        analyze.addPhysicalFile("file://" + pegasus_location + "/bin/keg", site_handle);

        dax.addExecutable(preprocess).addExecutable(findrange).addExecutable(analyze);

        // Add a preprocess job
        Job j1 = new Job("j1", "pegasus", "preprocess", "4.0");
        j1.addArgument("-a preprocess -T 60 -i ").addArgument(fa);
        j1.addArgument("-o ").addArgument(fb1);
        j1.addArgument(" ").addArgument(fb2);
        j1.uses(fa, File.LINK.INPUT);
        j1.uses(fb1, File.LINK.OUTPUT);
        j1.uses(fb2, File.LINK.OUTPUT);
        j1.addNotification(Invoke.WHEN.start,"/pegasus/libexec/notification/email -t notify@example.com");
        j1.addNotification(Invoke.WHEN.at_end,"/pegasus/libexec/notification/email -t notify@example.com");
        dax.addJob(j1);

        // Add left Findrange job
        Job j2 = new Job("j2", "pegasus", "findrange", "4.0");
        j2.addArgument("-a findrange -T 60 -i ").addArgument(fb1);
        j2.addArgument("-o ").addArgument(fc1);
        j2.uses(fb1, File.LINK.INPUT);
        j2.uses(fc1, File.LINK.OUTPUT);
        j2.addNotification(Invoke.WHEN.start,"/pegasus/libexec/notification/email -t notify@example.com");
        j2.addNotification(Invoke.WHEN.at_end,"/pegasus/libexec/notification/email -t notify@example.com");
        dax.addJob(j2);

        // Add right Findrange job
        Job j3 = new Job("j3", "pegasus", "findrange", "4.0");
        j3.addArgument("-a findrange -T 60 -i ").addArgument(fb2);
        j3.addArgument("-o ").addArgument(fc2);
        j3.uses(fb2, File.LINK.INPUT);
        j3.uses(fc2, File.LINK.OUTPUT);
        j3.addNotification(Invoke.WHEN.start,"/pegasus/libexec/notification/email -t notify@example.com");
        j3.addNotification(Invoke.WHEN.at_end,"/pegasus/libexec/notification/email -t notify@example.com");
        dax.addJob(j3);

        // Add analyze job
        Job j4 = new Job("j4", "pegasus", "analyze", "4.0");
        j4.addArgument("-a analyze -T 60 -i ").addArgument(fc1);
        j4.addArgument(" ").addArgument(fc2);
        j4.addArgument("-o ").addArgument(fd);
        j4.uses(fc1, File.LINK.INPUT);
        j4.uses(fc2, File.LINK.INPUT);
        j4.uses(fd, File.LINK.OUTPUT);
        j4.addNotification(Invoke.WHEN.start,"/pegasus/libexec/notification/email -t notify@example.com");
        j4.addNotification(Invoke.WHEN.at_end,"/pegasus/libexec/notification/email -t notify@example.com");
        dax.addJob(j4);

        dax.addDependency("j1", "j2");
        dax.addDependency("j1", "j3");
        dax.addDependency("j2", "j4");
        dax.addDependency("j3", "j4");
        return dax;
    }
    
    /**
     * Create an example DIAMOND DAX
     * @param args
     */
    public static void main(String[] args) {
        if (args.length != 1) {
            System.out.println("Usage: java Diamond &lt;pegasus_location&gt;");
            System.exit(1);
        }

        try {
            Diamond diamond = new Diamond();
            String pegasusHome = args[0];
            String site = "TestCluster";
            ADAG dag = diamond.generate( site, pegasusHome );
            dag.writeToSTDOUT();
            //generate(args[0], args[1]).writeToFile(args[2]);
        }
        catch (Exception e) {
            e.printStackTrace();
        }

    }
}

</pre>
<p>Of course, you will have to set up some catalogs and properties to
      run this example. The details are captured in the examples directory
      <code class="filename">examples/grid-blackdiamond-java</code>.</p>
</div>
<div class="section" title="10.9.2.2. The Python DAX Generator API">
<div class="titlepage"><div><div><h4 class="title">
<a name="api-python"></a>10.9.2.2. The Python DAX Generator API</h4></div></div></div>
<p>Refer to the <a class="ulink" href="python/" target="_top">auto-generated Python
      documentation</a> for details on this API. The example below generates
      the diamond DAX:</p>
<pre class="programlisting">#!/usr/bin/env python

from Pegasus.DAX3 import *
import sys
import os

if len(sys.argv) != 2:
        print("Usage: %s PEGASUS_HOME" % (sys.argv[0]))
        sys.exit(1)

# Create an abstract DAG
diamond = ADAG("diamond")

# Add input file to the DAX-level replica catalog
a = File("f.a")
a.addPFN(PFN("file://" + os.getcwd() + "/f.a", "local"))
diamond.addFile(a)
        
# Add executables to the DAX-level transformation catalog
# In this case the binary is keg, which is shipped with Pegasus, so we use
# the remote PEGASUS_HOME to build the path.
e_preprocess = Executable(namespace="diamond", name="preprocess", version="4.0", os="linux", arch="x86_64")
e_preprocess.addPFN(PFN("file://" + sys.argv[1] + "/bin/keg", "TestCluster"))
diamond.addExecutable(e_preprocess)
        
e_findrange = Executable(namespace="diamond", name="findrange", version="4.0", os="linux", arch="x86_64")
e_findrange.addPFN(PFN("file://" + sys.argv[1] + "/bin/keg", "TestCluster"))
diamond.addExecutable(e_findrange)
        
e_analyze = Executable(namespace="diamond", name="analyze", version="4.0", os="linux", arch="x86_64")
e_analyze.addPFN(PFN("file://" + sys.argv[1] + "/bin/keg", "TestCluster"))
diamond.addExecutable(e_analyze)

# Add a preprocess job
preprocess = Job(namespace="diamond", name="preprocess", version="4.0")
b1 = File("f.b1")
b2 = File("f.b2")
preprocess.addArguments("-a preprocess","-T60","-i",a,"-o",b1,b2)
preprocess.uses(a, link=Link.INPUT)
preprocess.uses(b1, link=Link.OUTPUT)
preprocess.uses(b2, link=Link.OUTPUT)
diamond.addJob(preprocess)

# Add left Findrange job
frl = Job(namespace="diamond", name="findrange", version="4.0")
c1 = File("f.c1")
frl.addArguments("-a findrange","-T60","-i",b1,"-o",c1)
frl.uses(b1, link=Link.INPUT)
frl.uses(c1, link=Link.OUTPUT)
diamond.addJob(frl)

# Add right Findrange job
frr = Job(namespace="diamond", name="findrange", version="4.0")
c2 = File("f.c2")
frr.addArguments("-a findrange","-T60","-i",b2,"-o",c2)
frr.uses(b2, link=Link.INPUT)
frr.uses(c2, link=Link.OUTPUT)
diamond.addJob(frr)

# Add Analyze job
analyze = Job(namespace="diamond", name="analyze", version="4.0")
d = File("f.d")
analyze.addArguments("-a analyze","-T60","-i",c1,c2,"-o",d)
analyze.uses(c1, link=Link.INPUT)
analyze.uses(c2, link=Link.INPUT)
analyze.uses(d, link=Link.OUTPUT, register=True)
diamond.addJob(analyze)

# Add control-flow dependencies
diamond.depends(parent=preprocess, child=frl)
diamond.depends(parent=preprocess, child=frr)
diamond.depends(parent=frl, child=analyze)
diamond.depends(parent=frr, child=analyze)

# Add notification for analyze job
analyze.invoke(When.ON_ERROR, '/home/user/bin/email -s "Analyze job failed" user@example.com')

# Add notification for workflow
diamond.invoke(When.AT_END, '/home/user/bin/email -s "Workflow finished" user@example.com')
diamond.invoke(When.ON_SUCCESS, '/home/user/bin/publish_workflow_result')

# Write the DAX to stdout
diamond.writeXML(sys.stdout)</pre>
</div>
<div class="section" title="10.9.2.3. The Perl DAX Generator">
<div class="titlepage"><div><div><h4 class="title">
<a name="api-perl"></a>10.9.2.3. The Perl DAX Generator</h4></div></div></div>
<p>The Perl API example below can be found in file
      <code class="filename">blackdiamond.pl</code> in directory <code class="filename">examples/grid-blackdiamond-perl</code>. It
      requires that you set the environment variable
      <code class="envar">PEGASUS_HOME</code> to the installation directory of Pegasus,
      and include into <code class="envar">PERL5LIB</code> the path to the directory
      <code class="filename">lib/perl</code> of the Pegasus
      installation. The actual code in the distribution is longer and does not
      require these settings; only the example below does. The Perl API is documented using
      <a class="ulink" href="http://pegasus.isi.edu/wms/docs/3.0/perl/" target="_top">perldoc</a>.
      For each of the modules you can invoke
      <span class="application">perldoc</span>, if your <code class="envar">PERL5LIB</code>
      variable is set.</p>
<p>The steps to generate a DAX from Perl are similar to the Java
      steps. However, since most methods are buried deep within the
      Perl class modules, the convenience module
      <code class="code">Pegasus::DAX::Factory</code> makes most constructors accessible
      without you needing to type your fingers raw:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem"><p>Create a new <span class="emphasis"><em>ADAG</em></span> object.</p></li>
<li class="listitem"><p>Create <span class="emphasis"><em>Job</em></span> objects as necessary.</p></li>
<li class="listitem"><p>As an example, the required input file "f.a" is declared as a
          <span class="emphasis"><em>File</em></span> object and linked to the
          <span class="emphasis"><em>ADAG</em></span> object.</p></li>
<li class="listitem"><p>The first job arguments and files are filled into the job, and
          the job is added to the <span class="emphasis"><em>ADAG</em></span> object.</p></li>
<li class="listitem"><p>Repeat step 4 for the remaining jobs.</p></li>
<li class="listitem"><p>Add dependencies for all jobs. You have the option of
          assigning label text to edges, though these are not used
          (yet).</p></li>
<li class="listitem"><p>To generate the DAX file, invoke the
          <span class="emphasis"><em>toXML()</em></span> method on the <span class="emphasis"><em>ADAG</em></span>
          object. The first argument is an opened file handle or
          <code class="code">IO::Handle</code> descriptor scalar to write to, the second
          the default indentation for the root element, and the third the XML
          namespace to use for elements and attributes. The latter is
          typically unused unless you want to include your output into another
          XML document.</p></li>
</ol></div>
<pre class="programlisting">#!/usr/bin/env perl
#
use 5.006;
use strict;
use IO::Handle;
use Cwd;
use File::Spec;
use File::Basename;
use Sys::Hostname;
use POSIX ();

BEGIN { $ENV{'PEGASUS_HOME'} ||= `pegasus-config --nocrlf --home` }
use lib File::Spec-&gt;catdir( $ENV{'PEGASUS_HOME'}, 'lib', 'perl' );

use Pegasus::DAX::Factory qw(:all);
use constant NS =&gt; 'diamond';

my $adag = newADAG( name =&gt; NS );
my $job1 = newJob( namespace =&gt; NS, name =&gt; 'preprocess', version =&gt; '2.0' );
my $job2 = newJob( namespace =&gt; NS, name =&gt; 'findrange', version =&gt; '2.0' );
my $job3 = newJob( namespace =&gt; NS, name =&gt; 'findrange', version =&gt; '2.0' );
my $job4 = newJob( namespace =&gt; NS, name =&gt; 'analyze', version =&gt; '2.0' );

# create "f.a" locally
my $fn = "f.a";
open( F, "&gt;$fn" ) || die "FATAL: Unable to open $fn: $!\n";
my @now = gmtime();
printf F "%04u-%02u-%02u %02u:%02u:%02uZ\n",
        $now[5]+1900, $now[4]+1, @now[3,2,1,0];
close F;

my $file = newFile( name =&gt; 'f.a' );
$file-&gt;addPFN( newPFN( url =&gt; 'file://' . Cwd::abs_path($fn),
                       site =&gt; 'local' ) );
$adag-&gt;addFile($file);

# follow this path, if the PEGASUS_HOME was determined
if ( exists $ENV{'PEGASUS_HOME'} ) {
    my $keg = File::Spec-&gt;catfile( $ENV{'PEGASUS_HOME'}, 'bin', 'keg' );
    my @os = POSIX::uname();
    # $os[2] =~ s/^(\d+(\.\d+(\.\d+)?)?).*/$1/;  ## create a proper osversion
    $os[4] =~ s/i.86/x86/;

    # add Executable instances to DAX-included TC. This will only work,
    # if we know how to access the keg executable. HOWEVER, for a grid
    # workflow, these entries are not used, and you need to
    # [1] install the work tools remotely
    # [2] create a TC with the proper entries
    if ( -x $keg ) {
        for my $j ( $job1, $job2, $job4 ) {
            my $app = newExecutable( namespace =&gt; $j-&gt;namespace,
                                     name =&gt; $j-&gt;name,
                                     version =&gt; $j-&gt;version,
                                     installed =&gt; 'false',
                                     arch =&gt; $os[4],
                                     os =&gt; lc($^O) );
            $app-&gt;addProfile( 'globus', 'maxtime', '2' );
            $app-&gt;addProfile( 'dagman', 'RETRY', '3' );
            $app-&gt;addPFN( newPFN( url =&gt; "file://$keg", site =&gt; 'local' ) );
            $adag-&gt;addExecutable($app);
        }
    }
}

my %hash = ( link =&gt; LINK_OUT, register =&gt; 'false', transfer =&gt; 'true' );
my $fna = newFilename( name =&gt; $file-&gt;name, link =&gt; LINK_IN );
my $fnb1 = newFilename( name =&gt; 'f.b1', %hash );
my $fnb2 = newFilename( name =&gt; 'f.b2', %hash );
$job1-&gt;addArgument( '-a', $job1-&gt;name, '-T60', '-i', $fna,
                    '-o', $fnb1, $fnb2 );
$adag-&gt;addJob($job1);

my $fnc1 = newFilename( name =&gt; 'f.c1', %hash );
$fnb1-&gt;link( LINK_IN );
$job2-&gt;addArgument( '-a', $job2-&gt;name, '-T60', '-i', $fnb1,
                    '-o', $fnc1 );
$adag-&gt;addJob($job2);

my $fnc2 = newFilename( name =&gt; 'f.c2', %hash );
$fnb2-&gt;link( LINK_IN );
$job3-&gt;addArgument( '-a', $job3-&gt;name, '-T60', '-i', $fnb2,
                    '-o', $fnc2 );
$adag-&gt;addJob($job3);
# a convenience function -- you can specify multiple dependents
$adag-&gt;addDependency( $job1, $job2, $job3 );

my $fnd = newFilename( name =&gt; 'f.d', %hash );
$fnc1-&gt;link( LINK_IN );
$fnc2-&gt;link( LINK_IN );
$job4-&gt;separator('');                # just to show the difference wrt default
$job4-&gt;addArgument( '-a ', $job4-&gt;name, ' -T60 -i ', $fnc1, ' ', $fnc2,
                    ' -o ', $fnd );
$adag-&gt;addJob($job4);
# this is a convenience function adding parents to a child.
# it is clearer than overloading addDependency
$adag-&gt;addInverse( $job4, $job2, $job3 );

# workflow level notification in case of failure
# refer to Pegasus::DAX::Invoke for details
my $user = $ENV{USER} || $ENV{LOGNAME} || scalar getpwuid($&gt;);
$adag-&gt;invoke( INVOKE_ON_ERROR,
               "/bin/mailx -s 'blackdiamond failed' $user" );

my $xmlns = shift;
$adag-&gt;toXML( \*STDOUT, '', $xmlns );</pre>
</div>
</div>
<div class="section" title="10.9.3. DAX Generator without a Pegasus DAX API">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp10760848"></a>10.9.3. DAX Generator without a Pegasus DAX API</h3></div></div></div>
<p>If you are using some other scripting or programming environment,
    you can write out the DAX format directly, in any language, by following
    the provided schema. For instance, LIGO, the Laser Interferometer Gravitational
    Wave Observatory, generates its DAX files as XML using its own Python
    code rather than our provided API.</p>
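<p>As a rough illustration of writing the XML yourself, the sketch below
    builds a two-job workflow using only the Python standard library. It is
    not taken from LIGO or from Pegasus. The namespace URI, the version
    string, the job identifiers and the helper names
    <code class="literal">add_job</code> and <code class="literal">build_dax</code>
    are assumptions made for this example, so check the element and attribute
    names against the DAX schema shipped with your Pegasus release.</p>

```python
import xml.etree.ElementTree as ET

# Assumptions: the namespace URI and version below follow DAX 3.x documents;
# confirm both against the schema shipped with your Pegasus release.
DAX_NS = "http://pegasus.isi.edu/schema/DAX"
DAX_VERSION = "3.4"


def add_job(adag, job_id, name, infile, outfile):
    """Append one job that reads infile and writes outfile (hypothetical helper)."""
    job = ET.SubElement(adag, "job", {"id": job_id, "name": name})
    # The argument element mixes plain text with file references.
    argument = ET.SubElement(job, "argument")
    argument.text = "-a %s -i " % name
    ET.SubElement(argument, "file", {"name": infile}).tail = " -o "
    ET.SubElement(argument, "file", {"name": outfile})
    # Declare which files the job consumes and produces.
    ET.SubElement(job, "uses", {"name": infile, "link": "input"})
    ET.SubElement(job, "uses", {"name": outfile, "link": "output"})
    return job


def build_dax():
    """Build a two-job workflow in which findrange depends on preprocess."""
    adag = ET.Element(
        "adag", {"xmlns": DAX_NS, "version": DAX_VERSION, "name": "mini"}
    )
    add_job(adag, "ID0000001", "preprocess", "f.a", "f.b1")
    add_job(adag, "ID0000002", "findrange", "f.b1", "f.c1")
    # Control flow: the child element lists the parents it waits for.
    child = ET.SubElement(adag, "child", {"ref": "ID0000002"})
    ET.SubElement(child, "parent", {"ref": "ID0000001"})
    return adag


dax_xml = ET.tostring(build_dax(), encoding="unicode")
print(dax_xml)
```

<p>Building the document through an XML library rather than string
    concatenation guarantees well-formed output, but it does not guarantee
    schema validity.</p>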
<p>If you write your own XML, you <span class="emphasis"><em>must</em></span> ensure that
    the generated XML is well formed and valid with respect to the DAX schema.
    You can use the <span class="command"><strong>pegasus-dax-validator</strong></span> to verify the
    validity of your generated file. Typically, you generate a small test
    file first, use the validator to confirm that your generator creates valid
    XML, and then ramp it up to produce the full workflow(s) you want to
    run. At this point the <span class="command"><strong>pegasus-dax-validator</strong></span> is a very
    simple program that takes exactly one argument, the name of the
    file to check. The following snippet checks a black-diamond file that uses
    an improper <span class="emphasis"><em>osversion</em></span> attribute in its
    <span class="emphasis"><em>executable</em></span> element:</p>
<pre class="screen"><code class="prompt">$</code> <span class="command"><strong>pegasus-dax-validator <em class="replaceable"><code>blackdiamond.dax</code></em></strong></span>
ERROR: cvc-pattern-valid: Value '2.6.18-194.26.1.el5' is not facet-valid
 with respect to pattern '[0-9]+(\.[0-9]+(\.[0-9]+)?)?' for type 'VersionPattern'.
ERROR: cvc-attribute.3: The value '2.6.18-194.26.1.el5' of attribute 'osversion'
 on element 'executable' is not valid with respect to its type, 'VersionPattern'.

0 warnings, 2 errors, and 0 fatal errors detected.</pre>
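<p>A generator can avoid this particular error by checking version strings
    before writing them. The sketch below copies the pattern from the
    validator message above and applies it with Python's
    <code class="literal">re</code> module. The helper names
    <code class="literal">is_valid_version</code> and
    <code class="literal">trim_to_version</code> are hypothetical, and whole-string
    matching is used on the assumption that schema patterns must match the
    entire value.</p>

```python
import re

# Pattern copied from the validator message above (type 'VersionPattern').
VERSION_PATTERN = re.compile(r"[0-9]+(\.[0-9]+(\.[0-9]+)?)?")


def is_valid_version(value):
    """Return True if the whole string matches the version pattern."""
    return VERSION_PATTERN.fullmatch(value) is not None


def trim_to_version(release):
    """Keep only the leading part of a release string that fits the pattern.

    Hypothetical helper; returns None when the string does not start
    with a digit.
    """
    match = VERSION_PATTERN.match(release)
    return match.group(0) if match else None


release = "2.6.18-194.26.1.el5"
print(is_valid_version(release))
print(trim_to_version(release))
```

<p>For the kernel release string from the error message, the check fails
    and the trimmed value is <code class="literal">2.6.18</code>, which does
    match the pattern.</p>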
<p>We are working on improving this program, e.g. to report the
    line number where an issue occurred. In the meantime, it
    returns a non-zero exit code whenever errors are detected.</p>
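<p>Because the validator signals errors through a non-zero exit code, a
    generator script can check its own output automatically. The following
    is a minimal sketch, assuming
    <span class="command"><strong>pegasus-dax-validator</strong></span> is on
    the <code class="envar">PATH</code>. The function name
    <code class="literal">dax_is_valid</code> and its
    <code class="literal">command</code> parameter are hypothetical and exist
    only so that the command can be swapped out.</p>

```python
import subprocess


def dax_is_valid(dax_path, command=("pegasus-dax-validator",)):
    """Run the validator on one file and report whether it exited with 0.

    Hypothetical helper: 'command' is the program (plus any leading
    arguments) to run; the DAX file name is appended as the last argument.
    """
    completed = subprocess.run(list(command) + [dax_path])
    return completed.returncode == 0
```

<p>A generator could call <code class="literal">dax_is_valid("blackdiamond.dax")</code>
    after writing its small test file and stop before producing the full
    workflow if the result is false.</p>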
</div>
</div>
<div class="section" title="10.10. Command Line Tools">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="cli"></a>10.10. Command Line Tools</h2></div></div></div>
<div class="toc"><dl>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-analyzer.php">pegasus-analyzer</a></span><span class="refpurpose"> — debugs a workflow.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-archive.php">pegasus-archive</a></span><span class="refpurpose"> — Compresses a workflow submit directory in a way that allows pegasus-dashboard, pegasus-statistics, pegasus-plots, and pegasus-analyzer to keep working.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-cleanup.php">pegasus-cleanup</a></span><span class="refpurpose"> — Removes files during Pegasus workflows enactment.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-cluster.php">pegasus-cluster</a></span><span class="refpurpose"> — run a list of applications</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-config.php">pegasus-config</a></span><span class="refpurpose"> — The authority for where parts of the Pegasus system exist on the filesystem. pegasus-config can be used to find libraries such as the DAX generators.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-create-dir.php">pegasus-create-dir</a></span><span class="refpurpose"> — Creates work directories in Pegasus workflows.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-dagman.php">pegasus-dagman</a></span><span class="refpurpose"> — Wrapper around condor_dagman. Not to be run by user.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-dax-validator.php">pegasus-dax-validator</a></span><span class="refpurpose"> — determines if a given DAX file is valid.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-exitcode.php">pegasus-exitcode</a></span><span class="refpurpose"> — Checks the stdout/stderr files of a workflow job for any indication that an error occurred in the job. This script is intended to be invoked automatically by DAGMan as the POST script of a job.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-gridftp.php">pegasus-gridftp</a></span><span class="refpurpose"> — Perform file and directory operations on remote GridFTP servers</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-invoke.php">pegasus-invoke</a></span><span class="refpurpose"> — invokes a command from a file</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-keg.php">pegasus-keg</a></span><span class="refpurpose"> — kanonical executable for grids</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-kickstart.php">pegasus-kickstart</a></span><span class="refpurpose"> — run an executable in a more universal environment.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-monitord.php">pegasus-monitord</a></span><span class="refpurpose"> — tracks a workflow's progress, mining information</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-mpi-cluster.php">pegasus-mpi-cluster</a></span><span class="refpurpose"> — a tool for running computational workflows expressed as DAGs (Directed Acyclic Graphs) on computational clusters using MPI.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-plan.php">pegasus-plan</a></span><span class="refpurpose"> — runs Pegasus to generate the executable workflow</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-plots.php">pegasus-plots</a></span><span class="refpurpose"> — A tool to generate graphs and charts to visualize workflow run.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-rc-client.php">pegasus-rc-client</a></span><span class="refpurpose"> — shell client for replica implementations</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-remove.php">pegasus-remove</a></span><span class="refpurpose"> — removes a workflow that has been planned and submitted using pegasus-plan and pegasus-run</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-run.php">pegasus-run</a></span><span class="refpurpose"> — executes a workflow that has been planned using pegasus-plan.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-s3.php">pegasus-s3</a></span><span class="refpurpose"> — Upload, download, delete objects in Amazon S3</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-sc-client.php">pegasus-sc-client</a></span><span class="refpurpose"> — generates a site catalog by querying sources.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-sc-converter.php">pegasus-sc-converter</a></span><span class="refpurpose"> — A client to convert site catalog from one format to another format.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-statistics.php">pegasus-statistics</a></span><span class="refpurpose"> — A tool to generate statistics about the workflow run.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-status.php">pegasus-status</a></span><span class="refpurpose"> — Pegasus workflow- and run-time status</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-submit-dag.php">pegasus-submit-dag</a></span><span class="refpurpose"> — Wrapper around condor_submit_dag. Not to be run by user.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-tc-client.php">pegasus-tc-client</a></span><span class="refpurpose"> — A full featured generic client to handle adds, deletes and queries to the Transformation Catalog (TC).</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-tc-converter.php">pegasus-tc-converter</a></span><span class="refpurpose"> — A client to convert transformation catalog from one format to another format.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-transfer.php">pegasus-transfer</a></span><span class="refpurpose"> — Handles data transfers in Pegasus workflows.</span>
</dt>
<dt>
<span class="refentrytitle"><a href="cli-pegasus-version.php">pegasus-version</a></span><span class="refpurpose"> — print or match the version of the toolkit.</span>
</dt>
</dl></div>
</div>
</div>
<div class="navfooter">
<hr>
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left">
<a accesskey="p" href="example_workflows.php">Prev</a> </td>
<td width="20%" align="center"> </td>
<td width="40%" align="right"> <a accesskey="n" href="cli-pegasus-analyzer.php">Next</a>
</td>
</tr>
<tr>
<td width="40%" align="left" valign="top">Chapter 9. Example Workflows </td>
<td width="20%" align="center"><a accesskey="h" href="index.php">Table of Contents</a></td>
<td width="40%" align="right" valign="top"> pegasus-analyzer</td>
</tr>
</table>
</div>
</div><?php  
            do_html_footer();
        ?>
