Integrating BLAST with Sun Grid Engine Software

 
By Bogdan Vasiliu, July 2003  

Introduction

How do you take advantage of a collection of computing resources to run BLAST (Basic Local Alignment Search Tool) searches against a DNA or protein database? This paper shows how to integrate BLAST with Sun Grid Engine software (SGE). Although the paper targets system administrators supporting life sciences research teams, anyone proficient in UNIX system administration should be able to follow these instructions.

Typically, BLAST usage consists of many CPU-intensive and memory-demanding independent runs. By integrating BLAST with Sun Grid Engine software, each of these independent runs can be assigned to a host that is best suited for its CPU and memory demands. Also, by taking advantage of underutilized or unused hosts, many of these runs can be done in parallel in the SGE cluster rather than serially on a single host, and thus significantly improve the throughput of BLAST queries.

This document is not intended to replace the user and installation guides for BLAST and Sun Grid Engine software. For complete instructions, please follow the original documentation.

Problem Explanation

Assume that you have an eclectic pool of computing resources -- like the ones in Table 1 -- and that you would like to use them for your BLAST runs ([1], [2], [3]). These machines could be tightly connected by a fast network, or they could be scattered around the building, or some of them could be desktop machines sitting in your colleagues' offices. In the latter case, users do not have to surrender control of their machines. Instead, the SGE can take advantage of idle cycles during the night, on weekends, or on holidays.

Name      Type of Workstation   CPU                   RAM        OS Version
---------------------------------------------------------------------------
saqqara   Sun Blade 2000        2 x 1056 MHz US-III+  8 Gbyte    Solaris 8
caesar    Sun Blade 2000        2 x 900 MHz US-III+   4 Gbyte    Solaris 9
andre     Sun Enterprise E420R  4 x 450 MHz US-II     4 Gbyte    Solaris 8
kaiser    Ultra 60              2 x 360 MHz US-II     512 Mbyte  Solaris 8
tonylama  Ultra 60              2 x 360 MHz US-II     896 Mbyte  Solaris 7
odiche    Ultra 5               1 x 300 MHz US-IIi    512 Mbyte  Solaris 8

TABLE 1: Pool of Computing Resources

The machines in Table 1 differ from one another in operating system version, CPU type, and memory size. You may want to use the more powerful machines like saqqara or caesar for the more demanding BLAST jobs and allocate the lighter queries to the slower boxes like odiche or tonylama. The next section presents a simple step-by-step method for integrating all these resources for use with BLAST queries. While this method can support one user or many users, the focus of this article will be on the single-user setup.

Solution

This section describes the main steps that should be followed. Note that for some steps root access will be needed.

  1. Environment setup
  2. Install and configure BLAST
  3. Install and configure SGE
  4. Submit and monitor BLAST jobs

1. Environment setup

1.1 We will need three accounts to create an integrated BLAST-SGE environment:

  • root for system administration purposes
  • sgeadmin, which contains the SGE files and directories, and is also used for administering the SGE
  • blast, which will manage the BLAST installation

Throughout the article we will assume that the '#' symbol is the UNIX prompt for the root account, and that the '%' symbol is the UNIX prompt for a regular user without root privileges, such as sgeadmin or blast.

1.2 These instructions require a collection of hosts running the Solaris OS, SPARC Platform Edition, with version 9, 8, 7, or 2.6. The instructions in this paper can also be used for other UNIX systems that support SGE and BLAST.

1.3 You should have NIS, NIS+, or LDAP configured as name services for passwd, groups, hosts, and auto.home. You should be able to automatically NFS mount users' home directories on all of the machines involved in your work.

1.4 The Sun compilers (Workshop 6 Update 2, or Sun ONE Studio 7 or 8) or GNU compilers have to be installed in your network only if you decide to compile the BLAST sources instead of using the executables from NCBI's (National Center for Biotechnology Information) site. More details about installing and compiling BLAST are provided in the "Install and configure BLAST" section.

1.5 You will need to allocate around 61 Mbyte of disk space for the BLAST binaries downloaded from NCBI, or around 275 Mbyte if you download the entire distribution including the sources. You'll also need disk space for the DNA and protein databases from NCBI. Depending on your needs, these databases can take from a few hundred Mbyte to 10 Gbyte or more of disk space. The disk space used for the examples in this paper was around 8 Gbyte. For fast disk access these databases should be installed locally on each host. The disk space requirement for the SGE binaries and documentation is around 80 Mbyte. You'll also need disk space for the SGE log files: ideally between 30 and 200 Mbyte for the master host spool directories, and around 10 to 20 Mbyte for each host in this project. So, for a configuration with six hosts, SGE may need up to 400 Mbyte of disk space.
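The 400 Mbyte figure follows from the upper bounds quoted in this step. A minimal Bourne shell sketch of that arithmetic is shown here; the function name sge_disk_estimate is hypothetical, and the three sizes are simply the largest values mentioned above.

```shell
#!/bin/sh
# Rough upper bound, in Mbyte, of the disk space SGE needs for N hosts.
# The sizes are the upper bounds quoted in step 1.5; the function name
# is illustrative only.
sge_disk_estimate() {
    hosts=$1
    binaries=80        # SGE binaries and documentation
    master_spool=200   # master host spool directories (upper bound)
    per_host=20        # spool and log space per host (upper bound)
    echo $((binaries + master_spool + hosts * per_host))
}

sge_disk_estimate 6    # six hosts, as in Table 1
```

The BLAST binaries and databases are not included in this estimate and have to be budgeted separately.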

1.6 A 100-Mbit LAN should serve this setup well. The most time-consuming component will be the transfer of the DNA and protein databases from the NCBI site to the local hosts.

2. Install and configure BLAST

For a complete guide to installing and configuring BLAST please refer to [4] and the documentation available in the doc subdirectory that comes with the BLAST archive. The installation and configuration of the BLAST binaries will be done as a regular user (blast in this paper).

2.1 Pick a machine and a location where you would like the BLAST files to reside. For our example we'll pick caesar and the directory /net/caesar/files/BLAST. We need to make sure that the /net/caesar/files/BLAST directory can be read and written from the other machines in the cluster so that SGE jobs launched on other hosts have access to the files and directories stored here. To achieve this, log in as root on caesar and add the following line to /etc/dfs/dfstab:

share -F nfs -o rw=andre:tonylama:saqqara:odiche:kaiser,ro \
-d "BLAST" /net/caesar/files/BLAST/

and then as root type:

# shareall

2.2 Leave the root account. Then, as user blast, create the local directories for the installation of the BLAST files as shown in Table 2.

Command                                        Purpose
----------------------------------------------------------------------------------------
% mkdir -p /net/caesar/files/BLAST/bin         Location of the BLAST binaries
% mkdir -p /net/caesar/files/BLAST/data        Location of the DNA and protein databases
% mkdir -p /net/caesar/files/BLAST/output      Location of the output of the BLAST runs
% mkdir -p /net/caesar/files/BLAST/queries     Location of the BLAST query files
% mkdir -p /net/caesar/files/BLAST/test_suite  Location of the BLAST submit scripts
% mkdir -p /net/caesar/files/BLAST/matrices    Location of the BLAST matrices

Table 2: Environment Setup

We will assume the setup shown in Table 2 for the rest of this paper.

2.3 Download the BLAST sources from ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz.

2.4 Unpack the BLAST archive in /net/caesar/files/BLAST:

% gzcat ncbi.tar.gz | tar xvf -

or, if you downloaded the precompiled Solaris executables instead of the sources:

% gzcat blast.solaris.tar.gz | tar xvf -

If you transferred only the executables, there is little left to do: just copy them to the desired location, for example, /net/caesar/files/BLAST/bin.

If you transferred the source files then you'll have to compile the binaries yourself. Assuming that the unpacked BLAST archive is located in /net/caesar/files/BLAST/ncbi/<BLAST subdirectories>, you can just type:

% ./ncbi/make/makedis.csh

Once the compilation ends, the binaries should be located in /net/caesar/files/BLAST/ncbi/bin. Copy them to /net/caesar/files/BLAST/bin, the location that the scripts in this paper expect.

2.5 Download the DNA and protein databases from ftp://ftp.ncbi.nih.gov/blast/db/ into /net/caesar/files/BLAST/data. For example:

Protein databases: 
	swissprot, nr

DNA databases:
	est, nt

In the current setup these files will be located in a directory that is NFS mounted. For faster disk access they should be replicated on each machine in this cluster.
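One way to replicate the databases is to have every host copy them from the NFS-mounted directory to a local disk. The Bourne shell sketch below only prints the commands so that they can be reviewed first. It rests on assumptions that are not part of this setup: /export/blastdb is a hypothetical local directory, rsh access between the hosts is taken to be configured, and print_copy_commands is an illustrative name.

```shell
#!/bin/sh
# Print (do not run) one copy command per remote execution host.
# ASSUMPTIONS: /export/blastdb is a hypothetical local directory and
# rsh access between the hosts is already configured.
SRC=/net/caesar/files/BLAST/data
DEST=/export/blastdb
HOSTS="saqqara andre kaiser tonylama odiche"

print_copy_commands() {
    for host in $HOSTS; do
        echo "rsh $host \"mkdir -p $DEST && cp -r $SRC/. $DEST\""
    done
}

print_copy_commands
```

If you replicate the databases this way, point BLASTDB in the scripts that follow at the local directory instead of /net/caesar/files/BLAST/data.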

2.6 Copy your own nucleotide or protein query files to /net/caesar/files/BLAST/queries.

2.7 Download the BLAST matrices from ftp://ftp.ncbi.nih.gov/blast/matrices/ into /net/caesar/files/BLAST/matrices.

2.8 You should test that BLAST is correctly installed. Below is an example of a simple shell script that starts a BLAST job on the local machine:

	#!/bin/csh
	#
	# Arguments:
	# -p blastn                     Use the blastn program
	# -d nt                         Use the nt database
	# -i ${seqpath}/nt.123          Use the nt.123 query file
	# -e 0.1                        Expectation value
	# -o ${outpath}/out.123.blastn  Output file (final report)

	# Location of the nt database
	setenv BLASTDB   /net/caesar/files/BLAST/data

	
	# Location of the BLOSUM62 matrix
	setenv BLASTMAT /net/caesar/files/BLAST/matrices

	# Location of BLAST executables
	set progpath = /net/caesar/files/BLAST/bin

	# Location of BLAST query files
	set seqpath = /net/caesar/files/BLAST/queries

	# Location of BLAST output (final report) file
	set outpath = /net/caesar/files/BLAST/output
 
	echo "--- blastn 123 ---"
	${progpath}/blastall -p blastn \
	                     -d nt \
	                     -i ${seqpath}/nt.123 \
	                     -e 0.1 \
	                     -o ${outpath}/out.123.blastn
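Before running the test script, it can help to confirm that the directory layout from Table 2 is complete on the host in question. The following Bourne shell sketch lists any missing subdirectory; the function name missing_blast_dirs is hypothetical, and the directory names are the ones from Table 2.

```shell
#!/bin/sh
# List the Table 2 subdirectories that are missing under a BLAST root.
# The function name is illustrative; the directory names come from Table 2.
missing_blast_dirs() {
    root=$1
    for sub in bin data output queries test_suite matrices; do
        [ -d "$root/$sub" ] || echo "$sub"
    done
}

# Report anything missing from the layout used in this paper.
missing_blast_dirs /net/caesar/files/BLAST
```

No output means that all six directories are visible from the current host, which is also a quick check that the NFS share from step 2.1 works.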

3. Install and configure SGE

In this paper the Sun Grid Engine 5.3 software is used. This software is freely available from the Sun Grid Engine site [5]. For advanced configurations, multiple users, or teams and departments sharing common resources, the Sun Grid Engine Enterprise Edition 5.3 software is recommended. For complete installation instructions consult the SGE documentation [6].

3.1 Create an SGE administrator account that can be NFS mounted across the network. For our example we will assume that this account is called sgeadmin and that it is located in /home/sgeadmin. Make sure that this account has enough disk space, as described in step 1.5.

3.2 Log in to the sgeadmin account and download the SGE 5.3 archives and patches from the Sun Grid Engine site [5]. You can choose between the pkgadd and tar.gz installations. In this paper the tar.gz archives are installed:

sge-5_3p2-bin-solsparc32.tar.gz
sge-5_3p2-bin-solsparc64.tar.gz
sge-5_3p2-common.tar.gz
sge-5_3p2-doc.tar.gz

Do not forget to download the patches that will upgrade the current 5.3p2 SGE version to 5.3p3. The patches for both the tar.gz and pkgadd Solaris SPARC binaries are located on the same page as the SGE distribution. At the time this document was written, the most recent tar.gz SGE Solaris SPARC patches were 113849-02, 113850-02, 113853-01, and 113854-01.

3.3 Unpack the SGE archives in the /home/sgeadmin directory:

% gzcat sge-5_3p2-bin-solsparc32.tar.gz | tar xvf -
% gzcat sge-5_3p2-bin-solsparc64.tar.gz | tar xvf -
% gzcat sge-5_3p2-common.tar.gz | tar xvf -
% gzcat sge-5_3p2-doc.tar.gz | tar xvf -

Then unpack the patches in a similar way and follow the installation instructions in the README file that comes with each patch. For example, to install patch 113849-02 you'll have to follow these steps (in the /home/sgeadmin directory):

% gzcat 113849-02.tar.Z | tar xvf -
% cp 113849-02/sge-5_3p3-bin-solsparc32.tar.gz .
% gzip -dc sge-5_3p3-bin-solsparc32.tar.gz | tar xvf -

These steps assume that the SGE daemons are not running (they will be started in steps 3.5 and 3.6). If they are running, stop them before applying the patches. Instructions on how to stop the SGE daemons are provided in the README files that come with the patches.

3.4 On each host where you would like to install SGE, add the following line in the /etc/services file:

sge_commd        <port number>/tcp

You will need root access to modify this file. Alternatively, you can add this line to the NIS services map or the NIS+ database, which avoids having to update the /etc/services file on each host. The recommended value for <port number> is less than 600, for example:

sge_commd        595/tcp

Make sure no other service uses this port number. On the Solaris Operating System you can check this with:

% netstat -an -f inet
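The netstat command shows ports that are currently in use. To check whether the number is already assigned to another service by name, you can also search the services file itself. The Bourne shell sketch below does that; port_in_services is a hypothetical helper, not an SGE command, and it assumes a grep that supports -E and POSIX character classes.

```shell
#!/bin/sh
# Succeeds when PORT/tcp is assigned on a non-comment line of a services file.
# port_in_services is an illustrative helper, not an SGE command.
port_in_services() {
    port=$1
    file=${2:-/etc/services}
    [ -r "$file" ] || return 2
    grep -E "^[^#]*[[:space:]]$port/tcp([[:space:]#]|\$)" "$file" > /dev/null
}

if port_in_services 595; then
    echo "595/tcp is already assigned"
else
    echo "595/tcp is not listed"
fi
```

If your site distributes the services map through NIS or NIS+, the local file is not the whole picture and the name service has to be checked as well.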

3.5 Designate a machine as the SGE master host. This machine should be fast enough to ensure a good response time and it should also have enough RAM to accommodate the SGE daemons. In our example, we chose caesar, but andre or saqqara are good candidates too. Then become root with su and type:

% su
Password:
# cd /home/sgeadmin
# ./install_qmaster

Then follow the instructions. Most likely you will only have to enter the names of the hosts involved in your project and group id range. For the other questions you can just accept the default options. The install_qmaster script has to be called from the installation directory /home/sgeadmin as shown above. This step is done only once on the SGE master host machine.

3.6 On each of the hosts included in your project, log in as sgeadmin, become root with su, and then run:

% su
Password:
# cd /home/sgeadmin
# ./install_execd

In most cases you won't have to type anything; just accept the default options. This script will configure the current host as an execution host, which means that SGE jobs can be launched on this machine. The host that was chosen as SGE master host can also be an execution host. The install_execd script has to be called from the installation directory /home/sgeadmin, as shown above.

3.7 If the shell of the blast account is csh or tcsh, add the following line in the .cshrc or .login file:

source /home/sgeadmin/default/common/settings.csh

or, if the shell is sh, ksh, or bash, add the following line in the .profile or .kshrc file (bash also reads .bash_profile):

. /home/sgeadmin/default/common/settings.sh
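After logging in again as blast, you can check that the settings file took effect. The Bourne shell sketch below assumes that settings.sh exports SGE_ROOT and puts the SGE commands such as qsub on the PATH; sge_env_ok is a hypothetical helper name.

```shell
#!/bin/sh
# Check that the SGE settings file has taken effect in the current shell.
# ASSUMPTION: settings.sh exports SGE_ROOT and adds qsub to the PATH.
# sge_env_ok is an illustrative helper name.
sge_env_ok() {
    [ -n "$SGE_ROOT" ] || { echo "SGE_ROOT is not set"; return 1; }
    command -v qsub > /dev/null 2>&1 || { echo "qsub is not on the PATH"; return 1; }
    echo "SGE environment looks usable (SGE_ROOT=$SGE_ROOT)"
}

sge_env_ok || true
```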

3.8 A template SGE script is included below:

	#!/bin/csh

	# specify the shell for this job

	#$ -S /bin/csh

	# Tell Sun Grid Engine to send an email when the job begins
	# and when it ends.

	#$ -M xyz@sun.com
	#$ -m be

	# Merge the standard output and standard error
	#$ -j y

	# Specify the location of the output
	#$ -o /net/caesar/files/BLAST/output/blast.out

	# < your commands here >

4. Submit and monitor BLAST jobs

By merging the two scripts from 2.8 and 3.8 you can start submitting your jobs to the SGE. Here is the final script blast.csh (we assume it is located in /net/caesar/files/BLAST/test_suite):

	#!/bin/csh

	# Specify the shell for this job

	#$ -S /bin/csh

	# Tell Sun Grid Engine to send an email when the job begins
	# and when it ends.

	#$ -M xyz@sun.com
	#$ -m be

	# Merge the standard output and standard error
	#$ -j y

	# Specify the location of the output
	#$ -o /net/caesar/files/BLAST/output/

	# BLAST Section
	# Arguments:
	# -p blastn                     Use the blastn program
	# -d nt                         Use the nt database
	# -i ${seqpath}/nt.123          Use the nt.123 query file
	# -e 0.1                        Expectation value
	# -o ${outpath}/out.123.blastn  Output file (final report)

	# Location of the nt database
	setenv BLASTDB   /net/caesar/files/BLAST/data
	
	# Location of the BLOSUM62 matrix
	setenv BLASTMAT /net/caesar/files/BLAST/matrices

	# Location of BLAST executables
	set progpath = /net/caesar/files/BLAST/bin

	# Location of BLAST query files
	set seqpath = /net/caesar/files/BLAST/queries

	# Location of BLAST output (final report) file
	set outpath = /net/caesar/files/BLAST/output

	echo "--- blastn 123 ---"
	${progpath}/blastall -p blastn \
	                     -d nt \
	                     -i ${seqpath}/nt.123 \
	                     -e 0.1 \
	                     -o ${outpath}/out.123.blastn

To submit a job to SGE, type:

% qsub blast.csh 

in the /net/caesar/files/BLAST/test_suite directory.
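When there are many query files, writing one job script per query by hand becomes tedious. The Bourne shell sketch below generates a csh job script modelled on blast.csh for a given query file. The helper name make_job_script and the blast_<query>.csh naming are assumptions made for illustration, and every query is treated as a blastn search against nt.

```shell
#!/bin/sh
# Write one SGE job script per query file, modelled on blast.csh.
# ASSUMPTIONS: the helper name and the blast_<query>.csh naming are
# illustrative; every query is a nucleotide search against nt.
BLAST_ROOT=/net/caesar/files/BLAST

make_job_script() {
    query=$1      # query file name inside $BLAST_ROOT/queries
    outdir=$2     # directory that receives the generated script
    script="$outdir/blast_$query.csh"
    cat > "$script" <<EOF
#!/bin/csh
#\$ -S /bin/csh
#\$ -j y
#\$ -o $BLAST_ROOT/output/
setenv BLASTDB  $BLAST_ROOT/data
setenv BLASTMAT $BLAST_ROOT/matrices
$BLAST_ROOT/bin/blastall -p blastn -d nt \\
    -i $BLAST_ROOT/queries/$query -e 0.1 \\
    -o $BLAST_ROOT/output/out.$query.blastn
EOF
    echo "$script"
}

# Typical use (commented out because it needs a running SGE cluster):
#   for q in `ls $BLAST_ROOT/queries`; do
#       qsub `make_job_script $q $BLAST_ROOT/test_suite`
#   done
```

Each generated script writes its report to its own out.<query>.blastn file, so jobs that run in parallel do not overwrite each other's results.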

To monitor the job you could type qstat or use the qmon GUI (for more details see [6]).

The qstat command or the qmon GUI will tell you where your job was allocated. Here is an example of qstat output for 15 jobs submitted to our cluster of 6 computers and a total of 13 CPUs:

job-ID prior name       user   state submit/start at    queue master ja-task-ID
------------------------------------------------------------------------------
113    0    blast_5.cs blast   r   05/19/2003 15:57:24 andre.q    MASTER
115    0    blast_7.cs blast   r   05/19/2003 15:57:24 andre.q    MASTER
116    0    blast_8.cs blast   r   05/19/2003 15:57:24 andre.q    MASTER
119    0    blast_11.c blast   r   05/19/2003 15:58:24 andre.q    MASTER
120    0    blast_12.c blast   r   05/19/2003 15:58:54 caesar.q   MASTER
121    0    blast_13.c blast   r   05/19/2003 15:58:55 saqqara.q  MASTER
112    0    blast_4.cs blast   r   05/19/2003 15:57:24 odiche.q   MASTER
117    0    blast_9.cs blast   r   05/19/2003 15:57:25 kaiser.q   MASTER
118    0    blast_10.c blast   r   05/19/2003 15:57:25 kaiser.q   MASTER
122    0    blast_14.c blast   r   05/19/2003 15:59:39 tonylama.q MASTER
123    0    blast_15.c blast   qw  05/19/2003 15:57:14
124    0    blast_16.c blast   qw  05/19/2003 15:57:14
125    0    blast_17.c blast   qw  05/19/2003 15:57:14
126    0    blast_18.c blast   qw  05/19/2003 15:57:14
127    0    blast_19.c blast   qw  05/19/2003 15:57:14
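With many jobs in flight, a per-queue summary is easier to read than the full listing. The Bourne shell sketch below counts running jobs per queue and the jobs still waiting. It assumes the column layout shown above (state in the fifth field, queue in the eighth); summarize_qstat is a hypothetical name.

```shell
#!/bin/sh
# Summarise qstat output read from standard input: running jobs per queue
# plus the number of jobs still waiting. The field positions assume the
# column layout shown above; summarize_qstat is an illustrative name.
summarize_qstat() {
    awk '
        $1 ~ /^[0-9]+$/ && $5 == "r"  { running[$8]++ }
        $1 ~ /^[0-9]+$/ && $5 == "qw" { waiting++ }
        END {
            for (q in running) print q, running[q]
            print "waiting", waiting + 0
        }' | sort
}

# Typical use on a submit host:
#   qstat | summarize_qstat
```

Header and separator lines are skipped because their first field is not a job number.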

Depending on the resources demanded by the BLAST query, you can specify certain options to the qsub command line or in the blast.csh script. For example, to avoid swapping, you may want your job to fit entirely in the physical memory of the host on which it will be allocated by the SGE. You can specify this as a soft limit (that is, these resources are ideal, but not mandatory) or as a hard limit, which means that the job shouldn't be started unless these requirements are met. The user guide [6] is a good source of help.

To request 5 Gbyte of RAM as a hard limit, you can type the following command:

% qsub -hard -l mf=5G blast.csh

and qstat will show that the job was allocated on the only host in the cluster that has this amount of physical memory available:

% qstat
job-ID prior name    user  state submit/start at queue master ja-task-ID
------------------------------------------------------------------------
151      0 blast.csh blast r    06/04/2003 12:14:16 saqqara.q MASTER
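The same request can also be kept in blast.csh, so that a plain qsub blast.csh carries it. This is a minimal sketch, assuming that the embedded "#$" lines already used in the script accept the same -hard and -l arguments as the command line:

```shell
# Added to the "#$" header of blast.csh; the shell treats it as a comment,
# and SGE reads it as a submit option.
#$ -hard -l mf=5G
```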

BLAST queries can also be assigned to hosts whose CPUs have certain clock rates. One way to achieve this is to create a new complex. Let's call it minmaxcpu, with two attributes, maxcpu and mincpu, as follows:

    
#name  shortcut type value relop requestable consumable default
#----------------------------------------------------------------
mincpu  mnc     DOUBLE 0     <      YES         NO         0
maxcpu  mxc     DOUBLE 0     >=     YES         NO         0

After creating the minmaxcpu complex you have to attach it to the SGE host object of each of the hosts in this cluster via SGE commands and then assign the value of the CPU clock rate of the respective host to both maxcpu and mincpu. For example, for the host andre, both maxcpu and mincpu should be set to 450. This task can be easily done with the SGE GUI. The Sun Grid Engine 5.3 Administration and User's Guide [6] describes in detail how user-defined complexes are created and configured.

Jobs that request mincpu will be assigned to hosts whose CPU clock rate is higher than the requested value (the relation operator for mincpu is a strict <, so a host whose clock rate equals the request does not qualify). This allows a user to request SGE to allocate BLAST queries on CPUs with high clock rates for jobs whose results are needed as soon as possible. Similarly, jobs that request maxcpu will be assigned to hosts whose CPU clock rate is at most the requested value. This potentially allows the user to assign BLAST queries to slower hosts, freeing the faster ones for jobs whose results are needed sooner.

For example, to request that a BLAST job be assigned to a host with CPUs running at 330 MHz or less you can type:

% qsub -hard -l mxc=330 blast.csh
% qstat

job-ID  prior name user state submit/start at queue master  ja-task-ID
------------------------------------------------------------------------
162     0 blast.csh blast r  06/04/2003 13:37:20 odiche.q MASTER 

The qstat command shows that this job was assigned to odiche, whose CPU clock rate meets the user's request (330 >= 300 evaluates to TRUE).

To request a fast CPU for your BLAST job you can type:

% qsub -hard -l mnc=1000 blast.csh
% qstat
job-ID  prior name user state submit/start at   queue  master ja-task-ID
------------------------------------------------------------------------
165     0 blast.csh blast  r  06/04/2003 13:45:10 saqqara.q MASTER

Here qstat shows that the job was allocated as requested on saqqara, whose CPU clock rate is 1056 MHz. (Note that the expression 1000<1056 evaluates to TRUE.)

You can combine memory requirements with CPU clock rate; for example you can request that a BLAST job has at least 3 Gbyte of physical memory available and runs on a host with CPU clock rates of at most 500 MHz:

% qsub -hard -l mf=3G,mxc=500 blast.csh
% qstat
job-ID  prior name user state submit/start at    queue master ja-task-ID
------------------------------------------------------------------------
166     0 blast.csh blast  r  06/04/2003 13:55:19 andre.q MASTER

andre is the only host in this cluster that satisfied these requirements, and the SGE correctly picked this machine as shown by the qstat command in the preceding example.
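The host selection in the three preceding examples can be reproduced with a small Bourne shell sketch that applies the same comparisons to the clock rates and memory sizes from Table 1. It is only an illustration of the relation operators, not of how SGE schedules jobs. Two assumptions apply: total RAM stands in for the free memory that mf actually tests, and the memory comparison is taken to be request <= value. The name match_hosts is hypothetical.

```shell
#!/bin/sh
# Mimic how the mincpu/maxcpu relation operators select hosts.
# Columns: host, CPU clock in MHz, RAM in Mbyte (both from Table 1).
# ASSUMPTION: total RAM stands in for the free memory that mf really tests.
HOSTS='saqqara 1056 8192
caesar 900 4096
andre 450 4096
kaiser 360 512
tonylama 360 896
odiche 300 512'

# match_hosts MNC MXC MF_MBYTE; pass 0 to leave a request out.
match_hosts() {
    echo "$HOSTS" | awk -v mnc="$1" -v mxc="$2" -v mf="$3" '{
        ok = 1
        if (mnc + 0 != 0 && !(mnc + 0 <  $2 + 0)) ok = 0   # mincpu relop is <
        if (mxc + 0 != 0 && !(mxc + 0 >= $2 + 0)) ok = 0   # maxcpu relop is >=
        if (mf  + 0 != 0 && !(mf  + 0 <= $3 + 0)) ok = 0   # memory: request <= value
        if (ok) print $1
    }'
}

match_hosts 0 330 0       # corresponds to -l mxc=330
match_hosts 1000 0 0      # corresponds to -l mnc=1000
match_hosts 0 500 3072    # corresponds to -l mf=3G,mxc=500
```

When several hosts match a request, SGE chooses among them according to its own scheduling policy, which this sketch does not model.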

The minmaxcpu complex is only an example. You can define other complexes that address other resources, like network bandwidth, number of licenses, and so on.

Conclusion

This article showed how to install BLAST and Sun Grid Engine software, and how to configure SGE scripts to launch BLAST jobs. Step-by-step instructions for BLAST and SGE installation and integration were provided. This method shows how to use a collection of heterogeneous computers running Solaris OS, SPARC Platform Edition, to run BLAST searches. This setup can be used to take advantage of underutilized computing resources, such as desktop machines that sit idle during the night or on weekends or holidays. This paper is only an introduction to SGE and BLAST environments. For more complex configurations consult the SGE and BLAST documentation ([3], [4], and [6]).

References

[1] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman (1990), "Basic local alignment search tool," Journal of Molecular Biology, 215, pp. 403-410
[2] S.F. Altschul, W. Gish (1996), "Local alignment statistics," Methods in Enzymology (R. Doolittle editor), 266, pp. 460-480
[3] NCBI BLAST Home Page: http://www.ncbi.nlm.nih.gov/BLAST/
[4] Installing NCBI's BLAST2.x Executables: http://genome.nhgri.nih.gov/blastall/blast_install/
[5] Sun Grid Engine Software: http://wwws.sun.com/software/gridware/
[6] Sun ONE Grid Engine, Enterprise Edition Administration and User's Guide, Sun Microsystems, Inc.: http://docs-pdf.sun.com/816-4739-11/816-4739-11.pdf

About the Author

Bogdan Vasiliu is a member of technical staff at Sun Microsystems. He ports, optimizes, and benchmarks independent software vendor applications for Sun systems.
