3. Running GOMAP

Install Singularity (version 3.5.2) locally, or load it as a module on an HPC system.

#On HPC Systems
module load singularity
# Sometimes a specific version has to be loaded:
module load singularity/3.5.2
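
To confirm which Singularity version ended up on your PATH after loading the module, a small check like the following may help. It is only a sketch: the helper names are hypothetical, the format of the `singularity --version` output is an assumption, and treating 3.5.2 as a minimum (rather than an exact requirement) is also an assumption.

```shell
#!/usr/bin/env bash
# Return success if version $1 is at least version $2.
# Relies on `sort -V` (version sort), available in GNU coreutils.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Read text on stdin and print the first dotted version number found,
# e.g. "3.5.2" from "singularity version 3.5.2" (output format assumed).
extract_version() {
    grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# Usage (not executed here):
#   loaded="$(singularity --version | extract_version)"
#   version_at_least "$loaded" 3.5.2 || echo "Singularity $loaded is older than 3.5.2" >&2
```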

Clone the git repository

mkdir -p /path/to/GOMAP-singularity/install/location
git clone https://github.com/Dill-PICL/GOMAP-singularity.git /path/to/GOMAP-singularity/install/location
# Enter the cloned repository before checking out the release tag
cd /path/to/GOMAP-singularity/install/location
git checkout v1.3.5

Run the setup step to make necessary directories and download data files from CyVerse
1. Run setup
  ./setup.sh
  
  Attention
  
  The pipeline download is large and requires ~40GB of free hard drive space during the setup step.
  
  Important
  
  The default image that is downloaded is built for mpich-3.2.1 for parallelization. Please submit an issue on GitHub if you need the image for a different MPI version, or download the Singularity files and build the image yourself.
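
Because the setup step needs roughly 40GB of free space, it can be worth checking the target filesystem before running ./setup.sh. The function below is a minimal sketch; `has_free_gb` is a hypothetical helper name and is not part of GOMAP-singularity.

```shell
#!/usr/bin/env bash
# Return success if the filesystem holding directory $1 has at least $2 GB free.
has_free_gb() {
    local dir="$1" need_gb="$2" avail_kb
    # `df -Pk` prints POSIX-format output in 1024-byte blocks;
    # the fourth column of the second line is the available space.
    avail_kb="$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')"
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Usage (not executed here):
#   has_free_gb /path/to/GOMAP-singularity/install/location 40 && ./setup.sh
```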
[optional] Test whether the container and the data files are working as intended.
1. Add your email to test/config.yml. This is necessary to submit jobs to Argot2.5.
2. Run the test using the following command.
./test.sh
Attention

This has to be performed from the GOMAP-singularity install location because the test directory location is fixed.

Edit the config file

Declare the GOMAP_LOC environment variable
# Add this to your ~/.bashrc or run the line in the terminal
export GOMAP_LOC="/path/to/GOMAP-singularity/install/location"
Download the config.yml file and change the values to fit your input data.

Attention

A boilerplate for running GOMAP-singularity in a SLURM environment is available on GitHub at GOMAP-boilerplate. You can follow the instructions there to start annotating faster.
#Input section
input:
  #input fasta file name
  fasta: test.fa
  # output file basename
  basename: test
  #input NCBI taxonomy id
  taxon: "4577"
  # Name of the species
  species: "Zea mays"
  # Email is mandatory
  email: 
  #Number of CPUs used for tools
  cpus: 4
  #Whether openmpi should be used
  mpi: False
  #what the name of the temporary directory is
  tmpdir: "/tmpdir"
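
Since the email field is mandatory and ships empty, a quick check before launching the pipeline can catch a forgotten edit. The sketch below reads simple `key: value` lines with sed; `config_value` and `config_has_email` are hypothetical helper names, and this is not a full YAML parser.

```shell
#!/usr/bin/env bash
# Print the value of the first "key: value" line for key $1 in file $2.
# Handles only flat entries like those in the GOMAP config.
config_value() {
    local key="$1" file="$2"
    sed -n -E "s/^[[:space:]]*${key}:[[:space:]]*//p" "$file" \
        | head -n 1 \
        | sed -E 's/[[:space:]]+$//; s/^"//; s/"$//'
}

# Return success only if the email entry is present and non-empty.
config_has_email() {
    [ -n "$(config_value email "$1")" ]
}

# Usage (not executed here):
#   config_has_email config.yml || echo "Set 'email' in config.yml first" >&2
```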

Run the pipeline
GOMAP has 7 distinct steps for running the pipeline after setup. The steps are listed in the table below.

Number  Step             Single  Parallel  Concurrent
1       seqsim           Y       N         Y
2       domain           Y       Y         Y
3       fanngo           Y       N         Y
4       mixmeth-blast    Y       Y         Y
5       mixmeth-preproc  Y       N         N
6       mixmeth          Y       N         N
7       aggregate        Y       N         N

The first four steps (seqsim, domain, fanngo, and mixmeth-blast) can be run concurrently, which allows the pipeline to complete faster. The subsequent steps (mixmeth-preproc, mixmeth, and aggregate) depend on the output of the first four steps.
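
On a single machine with enough CPUs, the four independent steps could be launched as background jobs and awaited together. This is a sketch under assumptions: `run_concurrent_steps` and the `GOMAP_RUNNER` override are hypothetical names introduced here (the override exists only to allow a dry run), and whether one node can comfortably hold all four steps at once depends on your hardware.

```shell
#!/usr/bin/env bash
# Launch the four independent GOMAP steps in the background and wait for all.
# GOMAP_RUNNER is a hypothetical override for dry runs; by default the
# run-GOMAP-SINGLE.sh helper in the current directory is used.
run_concurrent_steps() {
    local config="$1" runner="${GOMAP_RUNNER:-./run-GOMAP-SINGLE.sh}"
    local step pid status=0 pids=()
    for step in seqsim domain fanngo mixmeth-blast; do
        "$runner" --step="$step" --config="$config" &
        pids+=("$!")
    done
    # Collect every exit code so a failed step is not silently ignored.
    for pid in "${pids[@]}"; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Usage (not executed here), from the GOMAP-singularity install location:
#   run_concurrent_steps test/config.yml && echo "Steps 1-4 finished"
```

Only continue to mixmeth-preproc once the function returns success.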

GOMAP-singularity helper scripts
The GOMAP-singularity git repository has two helper scripts.

run-GOMAP-SINGLE.sh

This script can be used to run GOMAP steps 1-7 on a single machine or a single node of a cluster.

run-GOMAP-mpi.sh

This script can be used to run GOMAP steps 2 (domain) and 4 (mixmeth-blast) on multiple nodes of a SLURM cluster. These steps are parallelized using mpich.

Tip

If you are familiar with Singularity, you can run the GOMAP-singularity container directly with the necessary binds, but it is easier to use the helper scripts.

Attention

Steps 1-4 can be run concurrently because they do not depend on each other. The subsequent steps depend on previous output, so they can only be run one at a time, after the first four have finished.
The details of how to run the GOMAP steps are below.
1. seqsim
  ./run-GOMAP-SINGLE.sh --step=seqsim --config=test/config.yml
2. domain
  Running on a single node
  
  ./run-GOMAP-SINGLE.sh --step=domain --config=test/config.yml
  
  Running on multiple nodes (MPI)
  
  Warning
  
  The provided scripts require the Slurm job scheduler to run with MPI. The MPI version available on the cluster must also be compatible with the one in the container.
  
  Attention
  
  Line 16 of the config file (the mpi setting) should be changed to True to enable MPI. If it is set to False, MPI will not be enabled.
  
  #Input section
  input:
    #input fasta file name
    fasta: test.fa
    # output file basename
    basename: test
    #input NCBI taxonomy id
    taxon: "4577"
    # Name of the species
    species: "Zea mays"
    # Email is mandatory
    email: 
    #Number of CPUs used for tools
    cpus: 4
    #Whether openmpi should be used
    mpi: True
    #what the name of the temporary directory is
    tmpdir: "/tmpdir"
  
  Slurm directives needed for a successful sbatch submission
  
  # This can be any number of nodes, but 10-20 has been optimal
  #SBATCH -N 10
  #SBATCH --ntasks-per-node=1
  #SBATCH --cpus-per-task=16 #or the number of CPUs on each node
  
  You may also need to load the mpich module on HPC systems.
  
  #On HPC Systems
  module load mpich
  #Or it might be packaged as part of MVAPICH
  module load mvapich
  
  ./run-GOMAP-mpi.sh --step=domain --config=test/config.yml
3. fanngo
  ./run-GOMAP-SINGLE.sh --step=fanngo --config=test/config.yml
4. mixmeth-blast
  Running on a single node
  
  ./run-GOMAP-SINGLE.sh --step=mixmeth-blast --config=test/config.yml
  
  Running on multiple nodes (MPI)
  
  ./run-GOMAP-mpi.sh --step=mixmeth-blast --config=test/config.yml
  
  On Slurm schedulers, the --nodes and --cpus-per-task values can be tuned to suit the cluster.
5. mixmeth-preproc
  ./run-GOMAP-SINGLE.sh --step=mixmeth-preproc --config=test/config.yml
6. mixmeth
  ./run-GOMAP-SINGLE.sh --step=mixmeth --config=test/config.yml
  
  Attention
  
  The mixmeth step submits annotation jobs to the Argot2.5 webserver. Please wait until you have received the job completion emails before you run the next step.
7. aggregate
  Attention
  
  Please wait for all your Argot2.5 jobs to finish before running this step. You will get emails from Argot2.5 when your jobs are submitted and when they are finished. You can also check the status of all current jobs from all users here.
  
  ./run-GOMAP-SINGLE.sh --step=aggregate --config=test/config.yml
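
For the two MPI-capable steps (domain and mixmeth-blast), the Slurm directives, module loads, and helper-script call described above can be collected into one batch script. The generator below is a sketch only: `write_mpi_sbatch` is a hypothetical helper, the module names must match what your cluster provides, and the defaults of 10 nodes and 16 CPUs per task are taken from the example directives rather than being recommendations for every system.

```shell
#!/usr/bin/env bash
# Write an sbatch script for an MPI-enabled GOMAP step.
# Arguments: step, config path, output file, [nodes=10], [cpus per task=16].
write_mpi_sbatch() {
    local step="$1" config="$2" out="$3" nodes="${4:-10}" cpus="${5:-16}"
    cat > "$out" <<EOF
#!/bin/bash
#SBATCH -N ${nodes}
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=${cpus}

# Module names are site-specific (assumed here).
module load singularity
module load mpich

# Slurm starts the job in the directory it was submitted from, so submit
# this script from the GOMAP-singularity install location.
./run-GOMAP-mpi.sh --step=${step} --config=${config}
EOF
}

# Usage (not executed here):
#   write_mpi_sbatch domain test/config.yml domain.sbatch
#   sbatch domain.sbatch
```

Remember that mpi must be set to True in the config file for the MPI run to take effect.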

The final dataset will be available at GOMAP-[basename]/gaf/e.agg/[basename].aggregate.gaf, where [basename] is the value defined in the config.yml file that was used.
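
To check that the aggregate step produced its output, the path pattern above can be built from the basename and tested. This is a small sketch with hypothetical helper names; it assumes the path is relative to the directory in which the pipeline wrote its results.

```shell
#!/usr/bin/env bash
# Print the expected location of the final GAF file for basename $1.
gomap_output_path() {
    printf 'GOMAP-%s/gaf/e.agg/%s.aggregate.gaf\n' "$1" "$1"
}

# Return success if the final GAF file exists and is not empty.
final_output_ready() {
    [ -s "$(gomap_output_path "$1")" ]
}

# Usage (not executed here), with basename "test" from the example config:
#   final_output_ready test && echo "Annotations written to $(gomap_output_path test)"
```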
