Launch a Lemon workflow
C++ Workflows
Since Lemon is a header-only library, users of the C++ API have complete control over how they launch their workflows. The lemon::Options class allows users to read command line options in a consistent manner. See the previous section for a more in-depth discussion of this class. The following function is used to launch a Lemon workflow.
template<typename Function, typename Collector>
int lemon::launch(const Options &o, Function &&worker, Collector &collect)
Launch a Lemon workflow.
This function reads Lemon options and passes them to the appropriate run_parallel function. This is the main entry point of a C++ Lemon program.
- Return
0 on success or a non-zero integer on error.
- Parameters
[in] o: An instance of Options used to pass arguments to Lemon.
worker: Function object representing the body of the workflow.
collect: Function object for collecting the results of worker.
Submitting Lemon jobs
If a Lemon workflow does not require any custom features (i.e. custom options passed to the main function from the shell), users can use the script called launch_lemon.pbs to launch their workflow on PBS-based submission systems. An example of how to use this script is given below for a sample workflow.
wget -N https://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
qsub launch_lemon.pbs -v LEMON_PROG=protein_angle
Note: The provided script will not work on all systems and may need to be edited to work in a given computing environment.
Python Workflows using lemon_python
The program lemon_python is secretly a Lemon workflow! This workflow searches for a class called MyWorkflow (which must subclass lemon.Workflow). It then instantiates this class and runs the member function Workflow.worker for every entry in the PDB. The member function Workflow.finalize is called when the workflow completes. An example of using this workflow is given below:
wget -N https://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
tar xf full.tar
lemon_python -p tmscore.py -w full/
Note: The entire Python script passed to lemon_python is evaluated before the MyWorkflow class is instantiated. Control is not passed back to the script after the Workflow.finalize method is called.
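The call order described above can be illustrated with a small, self-contained Python sketch. The Workflow base class and the run_like_lemon_python driver below are hypothetical stand-ins written for this illustration, not the real lemon module or the internals of lemon_python, and the entries are placeholders. Joining the returned strings into a single output is likewise an assumption.

```python
class Workflow:
    """Hypothetical stand-in for lemon.Workflow (assumption, not the real class)."""

    def worker(self, entry, pdbid):
        raise NotImplementedError

    def finalize(self):
        pass


class MyWorkflow(Workflow):
    """Counts entries and reports the total once the run completes."""

    def __init__(self):
        self.seen = 0

    def worker(self, entry, pdbid):
        self.seen += 1
        return pdbid + "\n"

    def finalize(self):
        print("processed", self.seen, "entries")


def run_like_lemon_python(workflow_cls, entries):
    """Hypothetical driver that mimics the documented order of calls."""
    workflow = workflow_cls()          # instantiated after the script is evaluated
    output = []
    for pdbid, entry in entries:
        output.append(workflow.worker(entry, pdbid))  # called once per entry
    workflow.finalize()                # called once, when the run completes
    return "".join(output)


# Placeholder entries; a real run iterates over structures from the PDB.
result = run_like_lemon_python(MyWorkflow, [("1ABC", object()), ("2XYZ", object())])
```

In a real script only the MyWorkflow class would be written by the user; the base class comes from the lemon module and lemon_python drives the calls.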
Using the Python Interpreter
Users of the candiy_lemon PyPI package must submit the derived Lemon workflow manually using the lemon.launch function. An example is given below. The arguments to this function are an instance of the lemon.Workflow subclass, the path to the RCSB Hadoop files, and the number of cores to use.
from candiy_lemon.lemon import *

class MyWorkflow(Workflow):
    def worker(self, entry, pdbid):
        # Return the chain name of the residue at index 1, one line per entry
        return entry.topology().residue(1).get("chainname").get().as_string() + '\n'

    def finalize(self):
        # Nothing to clean up in this example
        pass

work = MyWorkflow()
launch(work, "full", 2)
Prefiltering the PDB with searches originating on RCSB
The advanced search features on the RCSB website can be used to prefilter the PDB. First, perform a search on the website. Then obtain the query details by clicking the blue button in the bottom left corner of the search results. This generates an XML version of the search. Pass this XML to the script obtain_entries_from_search.pl, provided with Lemon, which produces an entry file that can be passed to a workflow. An example XML result and the commands that produce and use an entry file are given below:
<orgPdbQuery>
<version>head</version>
<queryType>org.pdb.query.simple.AdvancedKeywordQuery</queryType>
<description>Text Search for: hiv</description>
<queryId>AC7A2BC3</queryId>
<resultCount>2875</resultCount>
<runtimeStart>2019-01-28T00:51:01Z</runtimeStart>
<runtimeMilliseconds>1686</runtimeMilliseconds>
<keywords>HIV</keywords>
</orgPdbQuery>
Then run:
perl obtain_entries_from_search.pl hiv_search.xml > hiv_prots.lst
tar xf full.tar
./small_molecules -w full -e hiv_prots.lst
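Before converting a saved query, it can be useful to confirm that the XML describes the intended search. The following is a minimal sketch using only the Python standard library, assuming the query has the structure shown above. The summarize_query helper is hypothetical and is not part of Lemon.

```python
import xml.etree.ElementTree as ET

# Query text copied from the example above; in practice it would be read
# from the saved file (for example hiv_search.xml).
QUERY_XML = """<orgPdbQuery>
<version>head</version>
<queryType>org.pdb.query.simple.AdvancedKeywordQuery</queryType>
<description>Text Search for: hiv</description>
<queryId>AC7A2BC3</queryId>
<resultCount>2875</resultCount>
<runtimeStart>2019-01-28T00:51:01Z</runtimeStart>
<runtimeMilliseconds>1686</runtimeMilliseconds>
<keywords>HIV</keywords>
</orgPdbQuery>"""


def summarize_query(xml_text):
    """Hypothetical helper: return the fields worth checking before a run."""
    root = ET.fromstring(xml_text)
    if root.tag != "orgPdbQuery":
        raise ValueError("unexpected root element: " + root.tag)
    return {
        "type": root.findtext("queryType"),
        "description": root.findtext("description"),
        "keywords": root.findtext("keywords"),
        "result_count": int(root.findtext("resultCount")),
    }


summary = summarize_query(QUERY_XML)
print(summary["description"], "->", summary["result_count"], "results reported")
```

The helper only inspects the query; the entry file itself is still produced by obtain_entries_from_search.pl.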
Danger Zone: Internal documentation!
These functions are internal to Lemon and not meant to be used as part of the external API. They are documented here for the interested reader and future Lemon developers.
template<typename Function, typename Collector>
void lemon::run_parallel(Function &&worker, const std::string &p, Collector &collector, size_t ncpu = 1, const Entries &entries = Entries(), const Entries &skip_entries = Entries())
The run_parallel function launches jobs which do return data.
Use this function to run the worker function on ncpu threads. The worker should accept two arguments, a chemfiles::Frame and a std::string. It must return a value, as this value will be appended, using the combine function object, to the collector. See the Lemon Workflow documentation for more details.
- Parameters
worker: A function object (C++11 lambda, struct with operator() overloaded, or std::function object) that the user wishes to apply.
[in] p: A path to the Hadoop sequence file directory.
collector: A function object that handles the output of worker.
[in] ncpu: The number of threads to use.
[in] entries: Which entries to use. Not used if blank.
[in] skip_entries: Which entries to skip. Not used if blank.
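The roles of the parameters above can be modelled in a few lines of Python. This is a conceptual sketch only: run_parallel_sketch is a hypothetical name, the records are placeholder strings rather than chemfiles::Frame objects, and the real C++ implementation may schedule work and collect results differently.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel_sketch(worker, records, collector, ncpu=1, entries=(), skip_entries=()):
    """Conceptual model only; lemon::run_parallel itself is C++ and may differ."""
    entries, skip_entries = set(entries), set(skip_entries)
    # An empty `entries` means "use everything"; an empty `skip_entries` skips nothing.
    selected = [
        (pdbid, frame)
        for pdbid, frame in records
        if (not entries or pdbid in entries) and pdbid not in skip_entries
    ]
    with ThreadPoolExecutor(max_workers=ncpu) as pool:
        # The worker receives the structure and its PDB ID, in that order.
        for result in pool.map(lambda rec: worker(rec[1], rec[0]), selected):
            collector(result)  # results are handed to the collector in input order


# Placeholder records standing in for (PDB ID, structure) pairs.
records = [("1ABC", "frame-a"), ("2XYZ", "frame-b"), ("3DEF", "frame-c")]
collected = []
run_parallel_sketch(
    lambda frame, pdbid: pdbid.lower(),
    records,
    collected.append,
    ncpu=2,
    skip_entries=["2XYZ"],
)
```

Here the collector is simply list.append, so collected ends up holding one value per processed entry.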
class lemon::Hadoop
The Hadoop class is used to read input sequence files.
This class reads an Apache Hadoop Sequence file and iterates through the key/value pairs. It has been modified so that it can only read files supplied by RCSB at this location: Full
Public Functions
Hadoop(std::istream &stream)
Create a Hadoop object using a std::istream.
This Hadoop constructor takes a binary stream as input. This stream must be open and contain data from a sequence file obtained from RCSB.
bool has_next()
Returns whether a sequence file has remaining MMTF records in it.
Use this function to check if the sequence file has any remaining MMTF records stored in it.
- Return
True if another MMTF record is present. False otherwise.
std::pair<std::string, std::vector<char>> next()
Returns the next MMTF file.
This function reads the next MMTF record from the underlying stream. Be warned that this function does minimal error checking and should only be used if has_next() has returned true.
- Return
A pair consisting of a std::string and a std::vector<char>. The first member contains the PDB ID and the second contains the GZ compressed MMTF file.
Public Static Attributes
auto constexpr HADOOP_HEADER_SIZE = 90
The size of the starting header.
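The has_next()/next() contract of this class can be illustrated with a short Python sketch. FakeSequenceReader is a hypothetical stand-in that only mirrors the shape of the two member functions; it does not parse the Hadoop sequence file format, and the payloads are placeholders compressed with gzip on the assumption that "GZ compressed" refers to that format.

```python
import gzip


class FakeSequenceReader:
    """Hypothetical stand-in with the same has_next()/next() shape as lemon::Hadoop."""

    def __init__(self, records):
        self._records = list(records)
        self._pos = 0

    def has_next(self):
        return self._pos < len(self._records)

    def next(self):
        # Like the documented next(), this performs no checks of its own.
        record = self._records[self._pos]
        self._pos += 1
        return record


# Placeholder payloads; real records hold compressed MMTF data.
reader = FakeSequenceReader([
    ("1ABC", gzip.compress(b"payload one")),
    ("2XYZ", gzip.compress(b"payload two")),
])

decoded = {}
while reader.has_next():  # always check before calling next()
    pdbid, compressed = reader.next()
    decoded[pdbid] = gzip.decompress(compressed)
```

Guarding every call to next() with has_next() is the point of the example, since next() is documented as doing minimal error checking.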
std::vector<std::string> lemon::read_hadoop_dir(const std::string &p)
Read a directory containing Hadoop sequence files.