Launch a Lemon workflow

C++ Workflows

Since Lemon is a header-only library, users of the C++ API have complete control over how they launch their workflows. The lemon::Options class allows users to read command line options in a consistent manner. See the previous section for a more in-depth discussion of this class. The following function is used to launch a Lemon workflow.

template<typename Function, typename Collector>
int lemon::launch(const Options &o, Function &&worker, Collector &collect)

Launch a Lemon workflow.

This function reads Lemon options and passes them to the appropriate run_parallel function. This is the main entry point of a C++ Lemon program.

Return

0 on success or a non-zero integer on error.

Parameters
  • [in] o: An instance of the Options used to pass arguments to Lemon

  • worker: Function object representing the body of the workflow.

  • collect: Function object that collects the results of worker.

Submitting Lemon jobs

If a Lemon workflow does not require any custom features (i.e., custom options passed to the main function from the shell), users can launch it on PBS-based submission systems with the script launch_lemon.pbs. An example of how to use this script is given below for a sample workflow.

wget -N https://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
qsub launch_lemon.pbs -v LEMON_PROG=protein_angle

Note: The provided script will not work on all systems and may need to be edited to work in a given computing environment.

Python Workflows using lemon_python

The program lemon_python is secretly a Lemon workflow! This workflow searches for a class called MyWorkflow (which must subclass lemon.Workflow). It then instantiates this class and runs the member function Workflow.worker for every entry in the PDB. The member function Workflow.finalize is called when the workflow completes. An example of using this workflow is given below:

wget -N https://mmtf.rcsb.org/v1.0/hadoopfiles/full.tar
tar xf full.tar
lemon_python -p tmscore.py -w full/

Note: The entire Python script passed to lemon_python is evaluated before the MyWorkflow class is instantiated. Control is not passed back to the script after the Workflow.finalize method is called.
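The control flow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the actual lemon_python implementation: the Workflow base class, the run_script helper, and the entry list are hypothetical stand-ins, and real entries would be structures rather than strings.

```python
# Sketch of the described control flow; NOT the real lemon_python program.
# Workflow, run_script and the entries below are hypothetical stand-ins.

class Workflow:
    """Stand-in for lemon.Workflow (assumed interface: worker and finalize)."""

    def worker(self, entry, pdbid):
        raise NotImplementedError

    def finalize(self):
        pass


def run_script(source, entries):
    # 1. Evaluate the whole script first, so its top-level statements run
    #    before any workflow object exists.
    namespace = {"Workflow": Workflow}
    exec(source, namespace)

    # 2. Look up the class by its required name and check its base class.
    cls = namespace.get("MyWorkflow")
    if not isinstance(cls, type) or not issubclass(cls, Workflow):
        raise TypeError("script must define MyWorkflow as a Workflow subclass")

    # 3. Instantiate it, call worker once per entry, then call finalize.
    workflow = cls()
    results = [workflow.worker(entry, pdbid) for pdbid, entry in entries]
    workflow.finalize()
    # Nothing from the script runs after finalize.
    return results


SCRIPT = '''
class MyWorkflow(Workflow):
    def worker(self, entry, pdbid):
        return pdbid + ":" + str(len(entry))
    def finalize(self):
        pass
'''

results = run_script(SCRIPT, [("1ABC", "fake-entry"), ("2XYZ", "x")])
print(results)
```

The point of the sketch is the ordering: any module-level code in the script runs during step 1, and nothing in the script runs after finalize returns.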

Using the Python Interpreter

Users of the candiy_lemon PyPI package must submit their derived Lemon workflow manually using the lemon.launch function. An example is given below. The arguments to this function are an instance of the lemon.Workflow subclass, the path to the RCSB Hadoop files, and the number of cores to use.

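from candiy_lemon.lemon import *

class MyWorkflow(Workflow):
    # Run for every entry; returns the chain name of the residue at index 1.
    def worker(self, entry, pdbid):
        return entry.topology().residue(1).get("chainname").get().as_string() + '\n'
    # Called when the workflow completes.
    def finalize(self):
        pass

work = MyWorkflow()

# Arguments: workflow instance, path to the RCSB Hadoop files, number of cores.
launch(work, "full", 2)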

Prefiltering the PDB with searches originating on RCSB

The advanced search features on the RCSB website can be used to prefilter the PDB. First, perform a search on the website. Then obtain the query details by clicking the blue button in the bottom-left corner of the search results, which generates an XML version of the search. Pass this XML file to the script obtain_entries_from_search.pl provided with Lemon to produce an entry file that can be given to a workflow. An example XML result and the commands that create and use an entry file are given below:

<orgPdbQuery>
    <version>head</version>
    <queryType>org.pdb.query.simple.AdvancedKeywordQuery</queryType>
    <description>Text Search for: hiv</description>
    <queryId>AC7A2BC3</queryId>
    <resultCount>2875</resultCount>
    <runtimeStart>2019-01-28T00:51:01Z</runtimeStart>
    <runtimeMilliseconds>1686</runtimeMilliseconds>
    <keywords>HIV</keywords>
</orgPdbQuery>

Then convert the saved query into an entry file and pass that file to a workflow:

perl obtain_entries_from_search.pl hiv_search.xml > hiv_prots.lst
tar xf full.tar
./small_molecules -w full -e hiv_prots.lst
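Before converting a saved query, it can be useful to confirm that it describes the intended search. The following minimal sketch reads the fields shown in the example XML using Python's standard library. It does not reproduce what obtain_entries_from_search.pl does; the summarize_query helper is a hypothetical name introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Query description copied from the example XML in this section.
QUERY_XML = """<orgPdbQuery>
    <version>head</version>
    <queryType>org.pdb.query.simple.AdvancedKeywordQuery</queryType>
    <description>Text Search for: hiv</description>
    <queryId>AC7A2BC3</queryId>
    <resultCount>2875</resultCount>
    <runtimeStart>2019-01-28T00:51:01Z</runtimeStart>
    <runtimeMilliseconds>1686</runtimeMilliseconds>
    <keywords>HIV</keywords>
</orgPdbQuery>"""


def summarize_query(xml_text):
    """Return the query type, keywords and result count of a saved query."""
    root = ET.fromstring(xml_text)
    return {
        "query_type": root.findtext("queryType"),
        "keywords": root.findtext("keywords"),
        "result_count": int(root.findtext("resultCount")),
    }


summary = summarize_query(QUERY_XML)
print(summary)
```

To inspect a file such as hiv_search.xml, read its contents and pass the text to summarize_query.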

Danger Zone: Internal documentation!

These functions are internal to Lemon and not meant to be used as part of the external API. They are documented here for the interested reader and future Lemon developers.

template<typename Function, typename Collector>
void lemon::run_parallel(Function &&worker, const std::string &p, Collector &collector, size_t ncpu = 1, const Entries &entries = Entries(), const Entries &skip_entries = Entries())

The run_parallel function launches jobs that return data.

Use this function to run the worker function on ncpu threads. The worker should accept two arguments, a chemfiles::Frame and a std::string. It must return a value, as this value will be appended, using the combine function object, to the collector. See the Lemon Workflow documentation for more details.

Parameters
  • worker: A function object (C++11 lambda, struct with operator() overloaded, or std::function object) that the user wishes to apply.

  • [in] p: A path to the Hadoop sequence file directory.

  • collector: A function object that handles the output of worker.

  • [in] ncpu: The number of threads to use.

  • [in] entries: Which entries to use. Not used if blank.

  • [in] skip_entries: Which entries to skip. Not used if blank.
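The worker and collector pattern described by these parameters can be illustrated with a small Python analogue. This is a sketch of the general technique only, not Lemon's C++ implementation: run_parallel_sketch, the in-memory records list, and the use of a thread pool are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel_sketch(worker, records, collector, ncpu=1,
                        entries=frozenset(), skip_entries=frozenset()):
    """Apply worker to each (pdbid, frame) record using ncpu threads."""
    selected = [
        (pdbid, frame) for pdbid, frame in records
        # An empty entries set means "use every record".
        if (not entries or pdbid in entries) and pdbid not in skip_entries
    ]
    with ThreadPoolExecutor(max_workers=ncpu) as pool:
        # The worker receives (frame, pdbid), matching the order described above.
        for result in pool.map(lambda rec: worker(rec[1], rec[0]), selected):
            # Results are handed to the collector one at a time, in input order.
            collector(result)


# Hypothetical records: strings stand in for chemfiles frames.
records = [("1ABC", "AAAA"), ("2XYZ", "BB"), ("3DEF", "C")]
lengths = []
run_parallel_sketch(lambda frame, pdbid: (pdbid, len(frame)),
                    records, lengths.append, ncpu=2, skip_entries={"3DEF"})
print(lengths)
```

Keeping the collector call outside the worker means the worker never touches shared state, which is the main reason to separate the two roles.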

class lemon::Hadoop

The Hadoop class is used to read input sequence files.

This class reads an Apache Hadoop Sequence file and iterates through the key/value pairs. It has been modified so that it can only read files supplied by RCSB at this location: Full

Public Functions

Hadoop(std::istream &stream)

Create a Hadoop class using a std::istream.

This Hadoop constructor takes a binary stream as input. This stream must be open and contain data from a sequence file obtained from RCSB.

bool has_next()

Returns whether the sequence file has remaining MMTF records in it.

Use this function to check if the sequence file has any remaining MMTF records stored in it.

Return

True if another MMTF record is present. False otherwise.

std::pair<std::string, std::vector<char>> next()

Returns the next MMTF file.

This function reads the next MMTF record from the underlying stream. Be warned that this function does minimal error checking and should only be used if has_next() has returned true.

Return

A pair consisting of a std::string and a std::vector<char>. The first member contains the PDB ID and the second contains the GZ compressed MMTF file.
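The intended calling pattern, checking has_next() before every call to next() and then decompressing the returned payload, can be sketched in Python. FakeHadoop is a hypothetical in-memory stand-in and not a binding to lemon::Hadoop; the payloads are placeholder bytes rather than real MMTF data.

```python
import gzip


class FakeHadoop:
    """Hypothetical in-memory stand-in exposing has_next() and next()."""

    def __init__(self, records):
        self._records = list(records)
        self._pos = 0

    def has_next(self):
        return self._pos < len(self._records)

    def next(self):
        # No checking here: callers are expected to test has_next() first.
        record = self._records[self._pos]
        self._pos += 1
        return record


reader = FakeHadoop([
    ("1ABC", gzip.compress(b"first payload")),
    ("2XYZ", gzip.compress(b"second payload")),
])

sizes = {}
while reader.has_next():                # guard every call to next()
    pdbid, compressed = reader.next()   # (PDB ID, GZ compressed bytes)
    raw = gzip.decompress(compressed)   # recover the uncompressed payload
    sizes[pdbid] = len(raw)

print(sizes)
```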

Public Static Attributes

auto constexpr HADOOP_HEADER_SIZE = 90

The size of the starting header.

std::vector<std::string> lemon::read_hadoop_dir(const std::string &p)

Read a directory containing Hadoop sequence files.