2. Running GOMAP
Install (local) or load (HPC) Singularity container (version 3.1.x).
# On HPC systems
module load singularity
Clone the git repository
mkdir -p /path/to/GOMAP-singularity/install/location
git clone -b v1.3 https://github.com/Dill-PICL/GOMAP-singularity.git /path/to/GOMAP-singularity/install/location
cd /path/to/GOMAP-singularity/install/location
Run the setup step to make necessary directories and download data files from CyVerse
Run setup
./setup.sh
Attention
The pipeline download is large and requires ~40GB of free hard drive space during the setup step.
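Because setup needs roughly 40GB of free space, it can help to check the target filesystem first. The following is a minimal sketch, not part of GOMAP-singularity: `has_free_gb` is a hypothetical helper name, and the check relies only on the POSIX `df -Pk` output format.

```shell
#!/usr/bin/env bash
# Sketch: check free disk space before running ./setup.sh.
# has_free_gb is a hypothetical helper, not part of GOMAP-singularity.

# Succeeds if the filesystem holding directory $1 has at least $2 GB free.
has_free_gb() {
    local dir="$1" need_gb="$2" avail_kb
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}
```

From the install location, one possible use is `has_free_gb . 40 && ./setup.sh`, which runs setup only when the check passes.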
Important
The default image that is downloaded is built for mpich-3.2.1 for parallelization. Please submit an issue on GitHub if you want an image for a different MPI version, or download the Singularity files and build the image yourself.
[optional] Test whether the container and the data files are working as intended.
Add your email to test/config.yml. This is necessary to submit jobs to Argot2.5.
Run the test using the following command.
./test.sh
Attention
This has to be performed from the GOMAP-singularity install location because the test directory location is fixed.
Edit the config file
Declare the GOMAP_LOC environment variable
# Add this to your ~/.bashrc or run the line in the terminal
export GOMAP_LOC="/path/to/GOMAP-singularity/install/location"
Download the config.yml file and make necessary changes. Change the highlighted lines to fit your input data
Attention
A boilerplate for running GOMAP-singularity in a SLURM environment is available on GitHub at GOMAP-boilerplate. Following the instructions there can get you to annotating faster.
 1 #Input section
 2 input:
 3   #input fasta file name
 4   fasta: test.fa
 5   # output file basename
 6   basename: test
 7   #input NCBI taxonomy id
 8   taxon: "4577"
 9   # Name of the species
10   species: "Zea mays"
11   # Email is mandatory
12   email:
13   #Number of CPUs used for tools
14   cpus: 4
15   #Whether openmpi should be used
16   mpi: False
17   #what the name of the temporary directory is
18   tmpdir: "/tmpdir"
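The sample config leaves the mandatory email field empty, so a quick check before starting a run may catch a forgotten edit. This is a sketch under assumptions: `config_has_email` is a hypothetical helper, and it only tests that something follows the `email:` key, not that the address is valid.

```shell
#!/usr/bin/env bash
# Sketch: verify that the mandatory email field in config.yml is filled in.
# config_has_email is a hypothetical helper, not part of GOMAP-singularity.

# Succeeds if the file $1 has an "email:" key followed by a value.
config_has_email() {
    # After "email:" and optional spaces, require one character that is
    # neither whitespace nor the start of a comment.
    grep -Eq '^[[:space:]]*email:[[:space:]]*[^[:space:]#]' "$1"
}
```

A possible use is `config_has_email config.yml || echo "Set email in config.yml first"`.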
Run the pipeline
GOMAP has 7 distinct steps for running the pipeline after setup. The steps are listed in the table below.
Number  Step             Single  Parallel  Concurrent
1       seqsim           Y       N         Y
2       domain           Y       Y         Y
3       fanngo           Y       N         Y
4       mixmeth-blast    Y       Y         Y
5       mixmeth-preproc  Y       N         N
6       mixmeth          Y       N         N
7       aggregate        Y       N         N
The first four steps (seqsim, domain, fanngo, and mixmeth-blast) can be run concurrently, which allows the pipeline to complete faster. The subsequent steps (mixmeth-preproc, mixmeth, and aggregate) depend on the output of the first four steps.
GOMAP-singularity helper scripts
The GOMAP-singularity git repository has two helper scripts.
run-GOMAP-SINGLE.sh
This script can be used to run GOMAP steps 1-7 on a single machine or on a single node of a cluster.
run-GOMAP-mpi.sh
This script can be used to run GOMAP steps 2 (domain) and 4 (mixmeth-blast) on multiple nodes of a SLURM cluster. These steps are parallelized using mpich.
Tip
If you are familiar with Singularity, you can run the GOMAP-singularity container directly with the necessary binds, but it will be easier to use the helper scripts.
Attention
Steps 1-4 can be run concurrently, because they do not depend on each other. Subsequent steps do depend on previous output so they can be run only one at a time and after the first four are finished.
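One way to run the first four steps concurrently on a single machine is to start each one in the background and wait for all of them. The sketch below is illustrative only: `run_first_four` and the `GOMAP_RUNNER` override are hypothetical names, and it assumes the machine has enough CPUs and memory for four steps at once. On a cluster, submitting each step as a separate job is the likelier choice.

```shell
#!/usr/bin/env bash
# Sketch: start steps 1-4 in the background and wait for all of them.
# run_first_four and GOMAP_RUNNER are hypothetical names for illustration;
# the runner defaults to the helper script from the repository.

# $1 is the path to the config file. Returns non-zero if any step failed.
run_first_four() {
    local config="$1" runner="${GOMAP_RUNNER:-./run-GOMAP-SINGLE.sh}"
    local step pid pids="" failed=0
    for step in seqsim domain fanngo mixmeth-blast; do
        "$runner" --step="$step" --config="$config" &
        pids="$pids $!"
    done
    # Wait on each PID so that a failure in any step is noticed.
    for pid in $pids; do
        wait "$pid" || failed=1
    done
    return "$failed"
}
```

For example, `run_first_four test/config.yml && ./run-GOMAP-SINGLE.sh --step=mixmeth-preproc --config=test/config.yml` starts step 5 only after all four earlier steps succeed.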
The details of how to run the GOMAP steps are below
seqsim
./run-GOMAP-SINGLE.sh --step=seqsim --config=test/config.yml
domain
Running on a Single node
./run-GOMAP-SINGLE.sh --step=domain --config=test/config.yml
Running on multiple nodes (MPI)
Warning
Using MPI with the provided scripts requires the Slurm job scheduler. It also requires a version of MPI that is correct for the container.
Attention
Line 16 of the config file (mpi) should be changed to True to enable MPI. If it is set to False, MPI will not be enabled.
 1 #Input section
 2 input:
 3   #input fasta file name
 4   fasta: test.fa
 5   # output file basename
 6   basename: test
 7   #input NCBI taxonomy id
 8   taxon: "4577"
 9   # Name of the species
10   species: "Zea mays"
11   # Email is mandatory
12   email:
13   #Number of CPUs used for tools
14   cpus: 4
15   #Whether openmpi should be used
16   mpi: True
17   #what the name of the temporary directory is
18   tmpdir: "/tmpdir"
Slurm commands needed for successful sbatch submission
# This can be any number of nodes, but 10-20 has been optimal
#SBATCH -N 10
#SBATCH --ntasks-per-node=1
# Or the number of CPUs on each node
#SBATCH --cpus-per-task=16
You may also need to load the mpich module on HPC systems.
# On HPC systems
module load mpich
# Or it might be packaged as part of MVAPICH
module load mvapich
./run-GOMAP-mpi.sh --step=domain --config=test/config.yml
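The Slurm directives, module loads, and MPI run command can be collected into one batch script. The sketch below generates such a script; `write_gomap_sbatch` is a hypothetical helper, the module names are site-specific assumptions, and changing into `$GOMAP_LOC` before calling the helper script is also an assumption about where the script is run from.

```shell
#!/usr/bin/env bash
# Sketch: print a Slurm batch script for one MPI-enabled GOMAP step.
# write_gomap_sbatch is a hypothetical helper, not part of GOMAP-singularity.
# Arguments: step, config file, number of nodes (default 10),
# CPUs per task (default 16).
write_gomap_sbatch() {
    local step="$1" config="$2" nodes="${3:-10}" cpus="${4:-16}"
    cat <<EOF
#!/bin/bash
#SBATCH -N $nodes
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=$cpus

# Module names vary between sites (assumption: singularity and mpich)
module load singularity
module load mpich

# Assumption: the helper scripts are run from the install location
cd "\$GOMAP_LOC"
./run-GOMAP-mpi.sh --step=$step --config=$config
EOF
}
```

A possible use is `write_gomap_sbatch domain test/config.yml > domain.sbatch`, followed by `sbatch domain.sbatch` once the generated file has been reviewed for your cluster.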
fanngo
./run-GOMAP-SINGLE.sh --step=fanngo --config=test/config.yml
mixmeth-blast
Running on a Single node
./run-GOMAP-SINGLE.sh --step=mixmeth-blast --config=test/config.yml
Running on multiple nodes (MPI)
./run-GOMAP-mpi.sh --step=mixmeth-blast --config=test/config.yml
The --nodes and --cpus-per-task values can be optimized based on the cluster for Slurm schedulers.
mixmeth-preproc
./run-GOMAP-SINGLE.sh --step=mixmeth-preproc --config=test/config.yml
mixmeth
./run-GOMAP-SINGLE.sh --step=mixmeth --config=test/config.yml
Attention
The mixmeth step submits annotation jobs to the Argot2.5 webserver. Please wait until you have received the job completion emails before you run the next step.
aggregate
Attention
Please wait for all your Argot2.5 jobs to finish before running this step. You will get emails from Argot2.5 when your jobs are submitted and when they are finished. You can also check the status of all current jobs from all users here.
./run-GOMAP-SINGLE.sh --step=aggregate --config=test/config.yml
The final dataset will be available at GOMAP-[basename]/gaf/e.agg/[basename].aggregate.gaf, where [basename] is the value defined in the config.yml file that was used.
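The output path can be derived from the config file rather than typed by hand. The following sketch assumes the `basename:` key appears as in the sample config and that the path is relative to the directory the pipeline was run from; `gomap_gaf_path` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: build the expected path of the final GAF file from config.yml.
# gomap_gaf_path is a hypothetical helper, not part of GOMAP-singularity.

# Prints GOMAP-<basename>/gaf/e.agg/<basename>.aggregate.gaf for config $1.
# Fails if no basename value is found.
gomap_gaf_path() {
    local config="$1" base
    # Take the first "basename:" value and strip any double quotes.
    base=$(awk '$1 == "basename:" { gsub(/"/, "", $2); print $2; exit }' "$config")
    [ -n "$base" ] && printf 'GOMAP-%s/gaf/e.agg/%s.aggregate.gaf\n' "$base" "$base"
}
```

For example, `ls -l "$(gomap_gaf_path test/config.yml)"` checks whether the aggregate step has produced its output.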