3. Running GOMAP

  1. Install Singularity (local machine) or load it as a module (HPC). These instructions use version 3.5.2.

    # On HPC systems
    module load singularity
    # Some systems require loading a specific version instead
    module load singularity/3.5.2
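
Before continuing, it can help to confirm that Singularity is actually on the PATH after installing it or loading the module. The following is a minimal sketch; check_tool is a hypothetical helper name introduced here for illustration and is not part of GOMAP.

```shell
# check_tool is a hypothetical helper (not part of GOMAP):
# it reports whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Print the Singularity version only if the command is available.
if check_tool singularity; then
    singularity --version
fi
```

If the helper reports that singularity is missing on an HPC system, the module has probably not been loaded in the current shell.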
    
  2. Clone the git repository

    mkdir -p /path/to/GOMAP-singularity/install/location
    git clone https://github.com/Dill-PICL/GOMAP-singularity.git /path/to/GOMAP-singularity/install/location
    cd /path/to/GOMAP-singularity/install/location
    git checkout v1.3.5
    
  3. Run the setup step to create the necessary directories and download the data files from CyVerse

    1. Run setup

      ./setup.sh
      

      Attention

      The pipeline download is large and requires ~40GB of free disk space during the setup step.

      Important

      The default image that is downloaded is built for mpich-3.2.1 for parallelization. Please open an issue on GitHub if you need an image for a different MPI version, or download the Singularity files and build the image yourself.
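
The ~40GB requirement mentioned above can be checked before running setup.sh. This is a rough sketch that assumes the data will be downloaded to the filesystem holding the current directory; free_gb is a hypothetical helper name, not something shipped with GOMAP.

```shell
# Required free space in GB, taken from the ~40GB figure in the setup notes.
required_gb=40

# free_gb is a hypothetical helper: prints the free space (whole GB)
# on the filesystem that holds the given directory.
free_gb() {
    df -Pk "$1" | awk 'NR==2 { printf "%d\n", $4 / 1024 / 1024 }'
}

available_gb=$(free_gb .)
if [ "$available_gb" -ge "$required_gb" ]; then
    echo "OK: ${available_gb}GB free"
else
    echo "Only ${available_gb}GB free; setup needs about ${required_gb}GB"
fi
```

Run it from the GOMAP-singularity install location so that the check applies to the filesystem that will receive the download.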

  4. [optional] Test whether the container and the data files are working as intended.

    1. Add your email to test/config.yml. This is necessary to submit jobs to Argot2.5.

    2. Run the test using the following command.

    ./test.sh
    

    Attention

    Run the test from the GOMAP-singularity install location, because the test directory location is fixed.

  5. Edit the config file

    1. Declare and export the GOMAP_LOC environment variable

      # Add this to your ~/.bashrc or run the line in the terminal
      export GOMAP_LOC="/path/to/GOMAP-singularity/install/location"
      
    2. Download the config.yml file and change the values to fit your input data

      Attention

      A boilerplate for running GOMAP-singularity in a SLURM environment is available on GitHub at GOMAP-boilerplate. Following the instructions there can get you to annotating faster.

      #Input section
      input:
        #input fasta file name
        fasta: test.fa
        # output file basename
        basename: test
        #input NCBI taxonomy id
        taxon: "4577"
        # Name of the species
        species: "Zea mays"
        # Email is mandatory
        email: 
        #Number of CPUs used for tools
        cpus: 4
        #Whether openmpi should be used
        mpi: False
        #what the name of the temporary directory is
        tmpdir: "/tmpdir"
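
Because the email field is mandatory and ships empty, a quick check of the edited config can catch a missing value before a run. The sketch below is illustrative only: config_value is a hypothetical helper that handles simple "key: value" lines like the ones in this file, not general YAML, and the sample file is written to a temporary location purely for demonstration.

```shell
# config_value is a hypothetical helper: prints the value of a simple
# "key: value" line from a YAML file, with surrounding double quotes removed.
config_value() {
    sed -n "s/^[[:space:]]*$1:[[:space:]]*//p" "$2" | head -n 1 | tr -d '"'
}

# Sample config mirroring the fields above, written to a temporary file.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
input:
  fasta: test.fa
  basename: test
  taxon: "4577"
  species: "Zea mays"
  email:
  cpus: 4
  mpi: False
  tmpdir: "/tmpdir"
EOF

# Report every input field that was left empty.
for key in fasta basename taxon species email; do
    if [ -z "$(config_value "$key" "$cfg")" ]; then
        echo "config.yml: '$key' is empty"
    fi
done
```

With the sample values above only the email field is reported. To check a real file, set cfg to the path of your own config.yml instead of creating the sample.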
    
  6. Run the pipeline

    GOMAP has 7 distinct steps for running the pipeline after setup. The steps are listed in the table below.

    Number  Step             Single  Parallel  Concurrent
    ------  ---------------  ------  --------  ----------
    1       seqsim           Y       N         Y
    2       domain           Y       Y         Y
    3       fanngo           Y       N         Y
    4       mixmeth-blast    Y       Y         Y
    5       mixmeth-preproc  Y       N         N
    6       mixmeth          Y       N         N
    7       aggregate        Y       N         N

    The first four steps (seqsim, domain, fanngo, and mixmeth-blast) can be run concurrently, which allows the pipeline to complete faster. The subsequent steps (mixmeth-preproc, mixmeth, and aggregate) depend on the output of the first four steps.

    GOMAP-singularity helper scripts

    The GOMAP-singularity git repository has two helper scripts.

    1. run-GOMAP-SINGLE.sh

      This script can be used to run GOMAP steps 1-7 on a single machine or on a single node of a cluster.

    2. run-GOMAP-mpi.sh

      This script can be used to run GOMAP steps 2 (domain) and 4 (mixmeth-blast) on multiple nodes of a SLURM cluster. These steps are parallelized using mpich.

    Tip

    If you are familiar with Singularity, you can run the GOMAP-singularity container directly with the necessary binds, but it is easier to use the helper scripts.

    Attention

    Steps 1-4 can be run concurrently because they do not depend on each other. The subsequent steps depend on earlier output, so they must be run one at a time, after the first four have finished.
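
The ordering constraint described above can be sketched as a small driver script. This is only an illustration under some assumptions: it launches the first four steps as background processes on one machine (on a cluster they would more likely be separate jobs), run_step, DRY_RUN, and CONFIG are names introduced here, and by default it only prints the commands rather than executing them. The aggregate step is deliberately left out because it has to wait for the Argot2.5 jobs to finish.

```shell
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"
CONFIG="${CONFIG:-test/config.yml}"

# run_step is a hypothetical wrapper around the single-node helper script.
run_step() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "./run-GOMAP-SINGLE.sh --step=$1 --config=$CONFIG"
    else
        ./run-GOMAP-SINGLE.sh --step="$1" --config="$CONFIG"
    fi
}

# Steps 1-4 are independent, so start them in the background.
pids=""
for step in seqsim domain fanngo mixmeth-blast; do
    run_step "$step" &
    pids="$pids $!"
done

# Wait for all four and remember whether any of them failed.
failed=0
for pid in $pids; do
    wait "$pid" || failed=1
done

# Steps 5 and 6 run one at a time, and only if steps 1-4 succeeded.
if [ "$failed" -eq 0 ]; then
    run_step mixmeth-preproc && run_step mixmeth
fi
```

Running the first four steps side by side on one machine multiplies CPU and memory use, so the cpus value in the config may need to be lowered accordingly.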

    The details of how to run each GOMAP step are below.

    1. seqsim

      ./run-GOMAP-SINGLE.sh --step=seqsim --config=test/config.yml
      
    2. domain

      Running on a single node

      ./run-GOMAP-SINGLE.sh --step=domain --config=test/config.yml
      

      Running on multiple nodes (MPI)

      Warning

      The provided scripts require the Slurm job scheduler in order to use MPI. They also require the correct version of MPI for the container.

      Attention

      The mpi setting on line 16 of the config file should be changed to True to enable MPI. If it is set to False, MPI will not be enabled.

      #Input section
      input:
        #input fasta file name
        fasta: test.fa
        # output file basename
        basename: test
        #input NCBI taxonomy id
        taxon: "4577"
        # Name of the species
        species: "Zea mays"
        # Email is mandatory
        email: 
        #Number of CPUs used for tools
        cpus: 4
        #Whether openmpi should be used
        mpi: True
        #what the name of the temporary directory is
        tmpdir: "/tmpdir"
      

      Slurm directives needed for a successful sbatch submission

      # Any number of nodes can be used, but 10-20 has been optimal
      #SBATCH -N 10
      
      #SBATCH --ntasks-per-node=1
      # Use 16 or the number of CPUs available on each node
      #SBATCH --cpus-per-task=16
      

      You may also need to load the mpich module on HPC systems.

      # On HPC systems
      module load mpich
      
      # Or it might be provided as part of MVAPICH
      module load mvapich
      
      ./run-GOMAP-mpi.sh --step=domain --config=test/config.yml
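
The directives, module loads, and run command above can be combined into a single batch script. The sketch below writes such a script to a file; the file name gomap-domain.sbatch is an arbitrary choice made here, the module names vary between sites, and it assumes GOMAP_LOC has been exported as described earlier. Treat it as a starting point rather than a tested submission script.

```shell
# Name of the batch script to generate (an arbitrary, hypothetical file name).
job_script="gomap-domain.sbatch"

# The quoted 'EOF' keeps $GOMAP_LOC literal so it is expanded when the job runs.
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH -N 10
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16

# Module names differ between clusters; adjust as needed.
module load singularity
module load mpich

cd "$GOMAP_LOC"
./run-GOMAP-mpi.sh --step=domain --config=test/config.yml
EOF

echo "wrote $job_script"
```

The generated file would then be submitted with sbatch gomap-domain.sbatch. Changing --step=domain to --step=mixmeth-blast gives the equivalent script for step 4.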
      
    3. fanngo

      ./run-GOMAP-SINGLE.sh --step=fanngo --config=test/config.yml
      
    4. mixmeth-blast

      Running on a single node

      ./run-GOMAP-SINGLE.sh --step=mixmeth-blast --config=test/config.yml
      

      Running on multiple nodes (MPI)

      ./run-GOMAP-mpi.sh --step=mixmeth-blast --config=test/config.yml
      

      The --nodes and --cpus-per-task values can be tuned to suit your cluster when using the Slurm scheduler.

    5. mixmeth-preproc

      ./run-GOMAP-SINGLE.sh --step=mixmeth-preproc --config=test/config.yml
      
    6. mixmeth

      ./run-GOMAP-SINGLE.sh --step=mixmeth --config=test/config.yml
      

      Attention

      The mixmeth step submits annotation jobs to the Argot2.5 web server. Please wait until you have received the job completion emails before you run the next step.

    7. aggregate

      Attention

      Please wait for all your Argot2.5 jobs to finish before running this step. You will get emails from Argot2.5 when your jobs are submitted and when they are finished. You can also check the status of all current jobs from all users here.

      ./run-GOMAP-SINGLE.sh --step=aggregate --config=test/config.yml
      
  7. The final dataset will be available at GOMAP-[basename]/gaf/e.agg/[basename].aggregate.gaf, where [basename] is the value defined in the config.yml file that was used.
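
Once the aggregate step has finished, the output location described above can be checked with a short script. This sketch assumes it is run from the directory that contains the GOMAP-[basename] output folder and uses the basename test from the sample config; gaf_path is a hypothetical helper name. It counts lines that do not start with "!", which is how comment and header lines are marked in GAF files.

```shell
# gaf_path is a hypothetical helper: builds the aggregate GAF path for a basename.
gaf_path() {
    echo "GOMAP-$1/gaf/e.agg/$1.aggregate.gaf"
}

# "test" is the basename used in the sample config.yml.
gaf=$(gaf_path test)

if [ -f "$gaf" ]; then
    # Count annotation lines, skipping GAF comment lines that start with "!".
    awk '!/^!/ { n++ } END { print n + 0 }' "$gaf"
else
    echo "not found: $gaf"
fi
```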