Introduction

How do you take advantage of a collection of computing resources to run BLAST (Basic Local Alignment Search Tool) searches against a DNA or protein database? This paper shows how to integrate BLAST with Sun Grid Engine software (SGE). Although the paper targets system administrators supporting life sciences research teams, anyone proficient in UNIX system administration should be able to follow these instructions.

Typically, BLAST usage consists of many CPU-intensive and memory-demanding independent runs. By integrating BLAST with Sun Grid Engine software, each of these independent runs can be assigned to a host that is best suited for its CPU and memory demands. Also, by taking advantage of underutilized or unused hosts, many of these runs can be done in parallel in the SGE cluster rather than serially on a single host, and thus significantly improve the throughput of BLAST queries.

This document is not intended to replace the user and installation guides for BLAST and Sun Grid Engine software. For complete instructions, please follow the original documentation.

Problem Explanation

Assume that you have an eclectic pool of computing resources -- like the ones in Table 1 -- and that you would like to use them for your BLAST runs ([1], [2], [3]). These machines could be tightly connected by a fast network, or they could be scattered around the building, or some of them could be desktop machines sitting in your colleagues' offices. In the latter case, users do not have to surrender control of their machines. Instead, the SGE can take advantage of idle cycles during the night, on weekends, or on holidays.
TABLE 1: Pool of Computing Resources

The machines in Table 1 differ from one another in operating system version, CPU type, and memory size. You may want to use the more powerful machines like saqqara or caesar for the more demanding BLAST jobs and allocate the lighter queries to the slower boxes like odiche or tonylama.

The next section presents a simple step-by-step method for integrating all these resources for use with BLAST queries. While this method can support one user or many users, the focus of this article will be on the single-user setup.

Solution

This section describes the main steps that should be followed. Note that for some steps root access will be needed.
1. Environment setup
2. Install and configure BLAST
For a complete guide to installing and configuring BLAST, please refer to [4] and the documentation available in the doc subdirectory that comes with the BLAST archive. The installation and configuration of the BLAST binaries will be done as a regular user (blast in this paper).
3. Install and configure SGE
In this paper the Sun Grid Engine 5.3 software is used. This software is freely available for download. For advanced configurations, multiple users, or teams and departments sharing common resources, the Sun Grid Engine Enterprise Edition 5.3 software is recommended. For complete installation instructions, consult the SGE documentation [6].
4. Submit and monitor BLAST jobs
By merging the two scripts from 2.8 and 3.8 you can start submitting your jobs to the SGE. Here is the final script blast.csh (we assume it is located in /net/caesar/files/BLAST/test_suite):
#!/bin/csh
# Specify the shell for this job
#$ -S /bin/csh
# Tell Sun Grid Engine to send an email when the job begins
# and when it ends.
#$ -M xyz@sun.com
#$ -m be
# Merge the standard output and standard error
#$ -j y
# Specify the location of the output
#$ -o /net/caesar/files/BLAST/output/
# BLAST Section
# Arguments:
# -p blastn Use the blastn program
# -d nt Use the nt database
# -i ${seqpath}/nt.123 Use the nt.123 query file
# -e 0.1 Expectation value
# -o ${outpath}/out.123.blastn Output file (final report)
# Location of the nt database
setenv BLASTDB /net/caesar/files/BLAST/data
# Location of the BLOSUM62 matrix
setenv BLASTMAT /net/caesar/files/BLAST/matrices
# Location of BLAST executables
set progpath = /net/caesar/files/BLAST/bin
# Location of BLAST query files
set seqpath = /net/caesar/files/BLAST/queries
# Location of BLAST output (final report) file
set outpath = /net/caesar/files/BLAST/output
echo " blastn 123 "
${progpath}/blastall -p blastn \
-d nt \
-i ${seqpath}/nt.123 \
-e 0.1 \
-o ${outpath}/out.123.blastn
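The script above hard-codes a single query file, nt.123. As a minimal sketch only, the following POSIX sh helper prints (without running) the equivalent blastall command line for any query file. The helper name blast_command is hypothetical, and the assumption that every query is named nt.<id> and reports to out.<id>.blastn is extrapolated from the one example in the script.

```shell
#!/bin/sh
# Dry-run sketch: print the blastall command line for one query file.
# Paths and flags are copied from blast.csh; the nt.<id> -> out.<id>.blastn
# naming rule is an assumption based on the single example shown there.

BLAST_ROOT=/net/caesar/files/BLAST

# blast_command is a hypothetical helper, not part of BLAST or SGE.
blast_command() {
    query=$1            # query file name, for example nt.123
    id=${query#*.}      # text after the first dot, for example 123
    printf '%s/bin/blastall -p blastn -d nt -i %s/queries/%s -e 0.1 -o %s/output/out.%s.blastn\n' \
        "$BLAST_ROOT" "$BLAST_ROOT" "$query" "$BLAST_ROOT" "$id"
}

blast_command nt.123
```

Printing the command first makes it easy to check the input and output paths before a copy of blast.csh is prepared for each query.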
To submit a job to SGE, type the following in the /net/caesar/files/BLAST/test_suite directory:

% qsub blast.csh

To monitor the job you could type qstat or use the qmon GUI (for more details see [6]). The qstat command or the qmon GUI will tell you where your job was allocated. Here is an example of qstat output for 15 jobs submitted to our cluster of 6 computers and a total of 13 CPUs:

job-ID prior name       user  state submit/start at     queue      master ja-task-ID
------------------------------------------------------------------------------------
   113     0 blast_5.cs blast r     05/19/2003 15:57:24 andre.q    MASTER
   115     0 blast_7.cs blast r     05/19/2003 15:57:24 andre.q    MASTER
   116     0 blast_8.cs blast r     05/19/2003 15:57:24 andre.q    MASTER
   119     0 blast_11.c blast r     05/19/2003 15:58:24 andre.q    MASTER
   120     0 blast_12.c blast r     05/19/2003 15:58:54 caesar.q   MASTER
   121     0 blast_13.c blast r     05/19/2003 15:58:55 saqqara.q  MASTER
   112     0 blast_4.cs blast r     05/19/2003 15:57:24 odiche.q   MASTER
   117     0 blast_9.cs blast r     05/19/2003 15:57:25 kaiser.q   MASTER
   118     0 blast_10.c blast r     05/19/2003 15:57:25 kaiser.q   MASTER
   122     0 blast_14.c blast r     05/19/2003 15:59:39 tonylama.q MASTER
   123     0 blast_15.c blast qw    05/19/2003 15:57:14
   124     0 blast_16.c blast qw    05/19/2003 15:57:14
   125     0 blast_17.c blast qw    05/19/2003 15:57:14
   126     0 blast_18.c blast qw    05/19/2003 15:57:14
   127     0 blast_19.c blast qw    05/19/2003 15:57:14

Depending on the resources demanded by the BLAST query, you can specify certain options to the qsub command line or in the blast.csh script. For example, to avoid swapping, you may want your job to fit entirely in the physical memory of the host on which it will be allocated by the SGE. You can specify this as a soft limit (that is, these resources are ideal, but not mandatory) or as a hard limit, which means that the job shouldn't be started unless these requirements are met. The user guide [6] is a good source of help.
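The difference between a soft and a hard request comes down to one qsub flag. The following POSIX sh sketch prints one qsub command per job script and submits nothing. The function name print_submissions is hypothetical, and the script names blast_4.csh and blast_5.csh are assumed from the truncated job names in the qstat listing above.

```shell
#!/bin/sh
# Dry-run sketch: print one qsub command per job script.
# Nothing is submitted here; review the output, then run the lines by hand.

# print_submissions MODE RESOURCES SCRIPT...
#   MODE       "hard" (must be satisfied) or "soft" (preferred only)
#   RESOURCES  value passed to qsub -l, for example mf=5G
print_submissions() {
    mode=$1
    resources=$2
    shift 2
    case $mode in
        hard|soft) ;;
        *) echo "mode must be hard or soft" >&2; return 1 ;;
    esac
    for script in "$@"; do
        printf 'qsub -%s -l %s %s\n' "$mode" "$resources" "$script"
    done
}

# Script names are assumptions based on the truncated qstat job names.
print_submissions soft mf=5G blast_4.csh blast_5.csh
```

With a soft request the scheduler may still start the job on a host with less free memory, so the soft form suits queries that benefit from the extra memory but can run without it.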
To request 5 Gbyte of RAM as a hard limit, you can type the following command:

% qsub -hard -l mf=5G blast.csh

The qstat command then shows that the job was allocated on the only host in the cluster that has this amount of physical memory available:

% qstat
job-ID prior name      user  state submit/start at     queue     master ja-task-ID
------------------------------------------------------------------------
   151     0 blast.csh blast r     06/04/2003 12:14:16 saqqara.q MASTER

BLAST queries can also be assigned to hosts whose CPUs have certain clock rates. One way to achieve this is to create a new complex. Let's call it minmaxcpu, with two attributes, maxcpu and mincpu, as follows:

#name   shortcut type   value relop requestable consumable default
#-----------------------------------------------------------------
mincpu  mnc      DOUBLE 0     <     YES         NO         0
maxcpu  mxc      DOUBLE 0     >=    YES         NO         0

After creating the minmaxcpu complex, you have to attach it to the SGE host object of each host in the cluster via SGE commands, and then assign the CPU clock rate of the respective host to both maxcpu and mincpu. For example, for the host andre, both maxcpu and mincpu should be set to 450. This task can be easily done with the SGE GUI. The Sun Grid Engine 5.3 Administration and User's Guide [6] describes in detail how user-defined complexes are created and configured.

Jobs that request mincpu will be assigned to hosts with CPUs having at least mincpu clock rate. This allows a user to request SGE to allocate BLAST queries on CPUs with high clock rates for jobs whose results are needed as soon as possible. Similarly, jobs that request maxcpu will be assigned to hosts with CPUs having at most maxcpu clock rate. This potentially allows the user to assign BLAST queries to slower hosts, freeing the faster ones for jobs whose results are needed sooner.
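The relop column determines how a requested value is compared with the value configured on each host. The following POSIX sh sketch models that comparison only. It is not how SGE is implemented, it uses integer arithmetic although the attributes are of type DOUBLE, and it lists only the three hosts whose clock rates are quoted in this article (andre 450, odiche 300, saqqara 1056). The function name matching_hosts is hypothetical.

```shell
#!/bin/sh
# Simplified model of the minmaxcpu relops (not the SGE scheduler itself):
#   mnc (relop <)   matches when request <  host value
#   mxc (relop >=)  matches when request >= host value

# Clock rates in MHz quoted in this article; other hosts are left out.
HOSTS="andre:450 odiche:300 saqqara:1056"

# matching_hosts ATTR REQUEST prints every listed host that satisfies it.
matching_hosts() {
    attr=$1
    request=$2
    for entry in $HOSTS; do
        host=${entry%%:*}   # part before the colon
        mhz=${entry##*:}    # part after the colon
        case $attr in
            mnc) [ "$request" -lt "$mhz" ] && echo "$host" ;;
            mxc) [ "$request" -ge "$mhz" ] && echo "$host" ;;
        esac
    done
    return 0
}

matching_hosts mxc 330     # prints odiche
matching_hosts mnc 1000    # prints saqqara
```

Because the relop for mincpu is a strict less-than, a request equal to a host's clock rate does not match that host in this model.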
For example, to request that a BLAST job be assigned to a host with CPUs running at 330 MHz or less, you can type:

% qsub -hard -l mxc=330 blast.csh
% qstat
job-ID prior name      user  state submit/start at     queue    master ja-task-ID
------------------------------------------------------------------------
   162     0 blast.csh blast r     06/04/2003 13:37:20 odiche.q MASTER

The qstat command shows that this job was assigned to odiche, whose CPU clock rate of 300 MHz meets the user's request (330 >= 300 evaluates to TRUE).

To request a fast CPU for your BLAST job, you can type:

% qsub -hard -l mnc=1000 blast.csh
% qstat
job-ID prior name      user  state submit/start at     queue     master ja-task-ID
------------------------------------------------------------------------
   165     0 blast.csh blast r     06/04/2003 13:45:10 saqqara.q MASTER

Here qstat shows that the job was allocated as requested on saqqara, whose CPU clock rate is 1056 MHz. (Note that the expression 1000 < 1056 evaluates to TRUE.)

You can combine memory requirements with CPU clock rate. For example, you can request that a BLAST job has at least 3 Gbyte of physical memory available and runs on a host with CPU clock rates of at most 500 MHz:

% qsub -hard -l mf=3G,mxc=500 blast.csh
% qstat
job-ID prior name      user  state submit/start at     queue   master ja-task-ID
------------------------------------------------------------------------
   166     0 blast.csh blast r     06/04/2003 13:55:19 andre.q MASTER

andre is the only host in this cluster that satisfied these requirements, and the SGE correctly picked this machine, as shown by the qstat command in the preceding example.

The minmaxcpu complex is only an example. You can define other complexes that address other resources, like network bandwidth, number of licenses, and so on.

Conclusion

This article showed how to install BLAST and Sun Grid Engine software, and how to configure SGE scripts to launch BLAST jobs. Step-by-step instructions for BLAST and SGE installation and integration were provided.
This method shows how to use a collection of heterogeneous computers running Solaris OS, SPARC Platform Edition, to run BLAST searches. This setup can be used to take advantage of underutilized computing resources, such as desktop machines that sit idle during the night or on weekends or holidays.

This paper is only an introduction to SGE and BLAST environments. For more complex configurations consult the SGE and BLAST documentation ([3], [4], and [6]).

References
[1] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman (1990), "Basic local alignment search tool," Journal of Molecular Biology, 215, pp. 403-410
About the Author

Bogdan Vasiliu is a member of technical staff at Sun Microsystems. He ports, optimizes, and benchmarks independent software vendor applications for Sun systems.