
ELIXIR Technical Feasibility Study WP13.2: Assessment of Supercomputing Facilities for genomic data

Partners

BSC-CNS - Barcelona Supercomputing Center (BSC); CSC - IT Center for Science Ltd (CSC); European Bioinformatics Institute (EBI).

Authors

Sarah Hunter (EBI); Antony Quinn (EBI); Josep Lluis Gelpi (BSC); Tommi Nyronen (CSC)

Executive summary

Europe has several large supercomputing centres and is currently planning major upgrades. As the compute demands of the life sciences grow rapidly, it is sensible to explore the feasibility of performing analytical tasks using European supercomputer data centres. Feasibility in this case covers cost-effectiveness, transparency in the working environment and performance. Work Package 13.2 investigates the potential of using these resources for the HMMER algorithm from the InterProScan signature scanning software.

The key conclusion arising from the technical feasibility study is the absolute importance of collaboration between experts at the compute providers and compute utilisers (i.e. nodes within ELIXIR). Compute providers, such as supercomputing facilities, tend to deliver a baseline level of service that includes the provision of computational facilities and a certain level of support to users (outlined in more detail in the appendices to this report). It is expected that in order to effectively run a service-based, distributed infrastructure such as the one currently proposed by ELIXIR, it will be necessary to expand and evolve this basic service-level agreement according to the demands placed upon the infrastructure in the future.

The quality of service, however, will need to be balanced against the economic costs of providing it; making a commitment to support and develop an infrastructure requires dedicated expertise, something which has a direct impact on the cost of the service. In the past, funding of compute infrastructures to support biological research projects has tended to be within national borders. If a stable, pan-European infrastructure is to be provided for the long term, funding agencies at both the national and European level will need to account for this.

Another important aspect to take into consideration is that the TFS only covered a single computational problem faced by biologists: using HMMER to characterise protein data (a computationally heavy problem). There are many other challenges within bioinformatics, many of which involve data security issues (patient confidentiality) or much larger data volumes (human genomic variation, for example). It would be prudent to also address these use cases before a complete strategy for compute within ELIXIR can be drawn up.


Main conclusions/recommendations of the TFS

The final conclusions of the ELIXIR feasibility study 13.2 are as follows:

• Expert involvement is absolutely necessary to set up and optimise compute tasks on the system but, once this is well established, it is likely the amount of support required will reduce. We can imagine an initial peak of expert involvement when a new requirement is placed on the infrastructure. Assuming no changes then occur, the overhead will reduce to a stable level until the next change in software or hardware occurs. It is also possible that experts may be involved in optimising a system so the support overhead is lowered in the end. It is important to clearly define how any collaboration will be organised between an ELIXIR member organisation and a computing provider.

• To fully incorporate supercomputing centres' resources seamlessly into database production systems would require considerably more work than was allowed for within the TFS. However, it would be necessary for this to happen in the long term if supercomputing centre facilities were to be utilised as a central component of data production infrastructures, to ensure accessibility to the service.

• Economic costings for compute services are very difficult to measure, and depend greatly not only on the physical hardware and software infrastructures put in place but also on the expertise required to maintain and support them. Indeed, making experts available for collaboration is perhaps the most costly but important part of the infrastructure.

• Allocation of a greater number of compute hours over a longer period of time (e.g. 1 year), which could then be used on demand, is preferable to short periods, due to an effective increase in responsiveness to users' requirements and a reduction in the overhead experienced when applying for compute more frequently.

• Grid should be further investigated as a possible resource for data production work. A workshop to bring together experts in the field and end-users is recommended.

• Relying on commercial, profit-making service providers for computing (e.g. Amazon EC2) imposes operational risks in the long term, and would not be appropriate for core operations of public databases such as building InterPro. However, "cloud" interfaces to computing can be provided by public e-Infrastructures, if needed. Thus software can be made "cloud-ready" and run on the chosen provider by the third party.

• New hardware paradigms (e.g. Cell-based processors and multi-core GPU and CPU systems), and the applications intended to be run on them, should be investigated thoroughly to ensure appropriate implementations of hardware and software.

• The performance of the network infrastructure between the Hinxton, Espoo and Barcelona sites was sufficient for the computational problem in the TFS. However, investigation of Lightpath technologies is recommended if larger data volumes need to be transferred between sites.


Introduction

The technical feasibility study (TFS) in ELIXIR Work Package 13.2 explores the potential for utilising European supercomputing facilities to perform computationally-expensive bioinformatics tasks as a part of a possible ELIXIR infrastructure.

Biological background

The amount of biological data in the public domain continues to grow rapidly, its growth mainly fuelled by the availability of increasingly inexpensive genomic sequencing methods and the advent of high-throughput transcriptomics and proteomics technologies.

A major challenge that has arisen as a consequence of the vast number of DNA and protein sequences being produced is how to accurately predict the function of these proteins without undertaking large-scale, time-consuming and costly experimental research programmes. One widely-used approach consists of building protein signatures that model the sequence and/or structural similarities found between proteins with shared function, and then using these signatures to predict whether or not a novel sequence might also share the same characteristics.

InterPro

InterPro is a database which amalgamates predictive protein signatures from 10 different databases into a single resource. These signatures include Hidden Markov Models (HMMs, described below), together with other types of models such as regular expressions, multi-motif fingerprints and position-specific scoring matrices (PSSMs). The searching algorithms for these different types of models are incorporated into a software tool known as InterProScan.

InterPro has wide utility in bioinformatics and biological research: genome sequencing projects very often use InterPro as a way of obtaining a "first-pass" annotation of proteins predicted as being encoded by the genomic DNA. UniProtKB, the protein information resource, uses InterPro to automatically enhance the functional annotation of the proteins contained within it. World-wide, InterPro is the only resource managing such an in-depth classification of all known protein sequences, a classification that is the only information available for the vast majority of proteins; as a consequence, InterPro has many thousands of users globally.

Hidden Markov Models

The majority of the protein signatures currently contained within InterPro are based on profile Hidden Markov Models (HMMs). An HMM is a probabilistic model where the system being modelled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters, from the observable parameters, based on this assumption. The extracted model parameters can then be used to perform further analysis, for example for pattern recognition applications.


In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. A hidden Markov model adds outputs: each state has a probability distribution over the possible output tokens. Therefore, looking at a sequence of tokens generated by an HMM does not directly indicate the sequence of states. State transitions in a Hidden Markov Model are commonly illustrated with a state-transition diagram.
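For readers unfamiliar with the notation, a standard textbook parameterisation of a discrete HMM (a generic formulation, not taken from the report itself) can be written as follows:

    % Parameters of a discrete HMM lambda = (A, B, pi) with hidden states s_1,...,s_N:
    %   a_{ij}: state transition probabilities, b_j(o): per-state output
    %   distributions, pi_i: initial state distribution.
    \[
      a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i), \quad
      b_j(o) = P(o_t = o \mid q_t = s_j), \quad
      \pi_i = P(q_1 = s_i)
    \]
    % The probability of an observed token sequence O = o_1 ... o_T marginalises
    % over all hidden state paths Q = q_1 ... q_T, which is why the states are "hidden":
    \[
      P(O \mid \lambda) = \sum_{Q} \pi_{q_1}\, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t)
    \]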

Within the field of bioinformatics, profile HMMs are often used to model the commonalities of the amino acid sequences for a family of proteins. Considered to be more expressive than a standard consensus sequence or regular expression, profile HMMs allow position-dependent insertion and deletion penalties, as well as the option to use a separate distribution for inserted portions of the amino acid sequence. Once a model is trained on a number of amino acid sequences from a given family or group, it is most commonly used for three purposes:

▪ By aligning sequences to the model, one can construct multiple alignments.

▪ The model itself can offer insights into the characteristics of the family when one examines the structure and probabilities of the trained HMM.

▪ The model can be used to score how well a new protein sequence fits the motif. For example, one could train a model on a number of proteins in a family and then match sequences in a database to that model in order to try to find other family members. This technique is also used to infer protein structure and function. It is the fitting of new proteins to the existing family motifs (modelled by HMMs) which was the focus of the calculations in this feasibility study.

A software package known as HMMER exists which contains various programs for building and calibrating HMMs and searching the resultant models against sequences. In the latest version (2.3.2) of the software, the searching programs are known as hmmsearch (searches an HMM against a sequence database) and hmmpfam (searches a sequence database against an HMM database), the former generally performing at better speeds than the latter. HMMER also has a utility to transform the HMMs (which are ASCII by default) into binary format, also resulting in a performance improvement. The algorithms utilised by these search programs are the Viterbi algorithm and the forward algorithm.
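As an aside, the two algorithms named above are usually written as the following recursions, using the notation of the previous block (again a generic textbook formulation rather than HMMER's exact implementation). The Viterbi recursion scores the single best state path, while the forward recursion sums over all paths; both run in O(N^2 T) time for N states and a sequence of length T:

    % Viterbi (best single path) and forward (sum over all paths) recursions:
    \[
      \delta_t(j) = \Big[ \max_{i} \delta_{t-1}(i)\, a_{ij} \Big]\, b_j(o_t),
      \qquad
      \alpha_t(j) = \Big[ \sum_{i} \alpha_{t-1}(i)\, a_{ij} \Big]\, b_j(o_t)
    \]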


Rationale for use case selection in the study

At the start of the ELIXIR project, it was anticipated that, in addition to the new sequences being produced from genomics projects, there would also be a large influx of "meta-genomic" data, that is, sequence that is not immediately and directly attributable to a particular organism but rather is taken from an environmental context. A well-publicised example of a meta-genomics sequencing project is the Global Ocean Survey (GOS) project. It was anticipated that there would be a doubling of sequence data every 18 months as a consequence of increased genomic and meta-genomic sequencing activity, resulting in a total of approximately 60 million sequences in the UniParc protein archive database by mid-2009. This influx of data has not yet materialised; however, it is a certainty that it will appear as the meta-genomics community becomes better established and more coherently organised.

The chart below shows the growth in the number of sequences contained in the UniParc protein archive database from its inception in 2003 until the current date. (Note the leap in sequence numbers between 2006 and 2007, which corresponds to the first release of meta-genomic data to the database.)

Figure: Growth in UniParc protein archive

A major drawback of scanning algorithms such as the HMMER software, which is used to search HMMs against sequences, is that they can be computationally expensive to run. Indeed, the current version of HMMER (v2.3.2) contributes the majority of the computational cost of InterProScan. With the total number of known protein sequences archived in the UniParc database now exceeding 19 million, and the total number of HMMs in InterPro being in the region of ninety thousand models, this poses a serious computational challenge: how to keep up with the classification of increasing numbers of proteins using increasing numbers of models?

The HMMER programs are particularly well suited to parallel processing as they are able to run on multiple CPUs and because each HMM can be compared to the sequence database independently (an "embarrassingly parallel" application).
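As a minimal sketch of this decomposition (illustrative only: the file names, chunk size and the bsub/hmmsearch invocation below are assumptions rather than the actual InterProScan production code), one could split the model library into chunks and submit each chunk as an independent job:

    # split_hmm_jobs.py -- illustrative sketch: split an ASCII HMMER 2 model library
    # into chunks and submit one independent hmmsearch job per chunk, since each
    # model can be searched against the sequence set on its own.
    import subprocess
    from pathlib import Path

    def split_models(hmm_library: str, out_dir: str, chunk_size: int = 100) -> list:
        """Write every `chunk_size` models of the library to a separate chunk file."""
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        chunks, current, model = [], [], []
        with open(hmm_library) as fh:
            for line in fh:
                model.append(line)
                if line.startswith("//"):          # "//" terminates each model record
                    current.append("".join(model))
                    model = []
                    if len(current) == chunk_size:
                        chunks.append(current)
                        current = []
        if current:
            chunks.append(current)
        paths = []
        for i, chunk in enumerate(chunks):
            path = out / f"chunk_{i:04d}.hmm"
            path.write_text("".join(chunk))
            paths.append(path)
        return paths

    def submit_all(chunk_paths, seq_db: str) -> None:
        """Submit one hmmsearch job per chunk to LSF (assumed batch system here)."""
        for path in chunk_paths:
            cmd = f"hmmsearch --cpu 1 {path} {seq_db} > {path}.out"
            subprocess.run(["bsub", "-J", path.stem, cmd], check=True)

    if __name__ == "__main__":
        # Hypothetical file names, mirroring those used elsewhere in the report.
        submit_all(split_models("Pfam_ls", "chunks"), "10k.fasta")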

This feasibility study was therefore intended to give an indication, given certain projections for sequence and signature database growth, of whether the use of supercomputing facilities would be desirable or feasible from three main perspectives: cost-effectiveness, transparency in the working environment and performance.


Participants in the Study

The study involved personnel from BSC-CNS (Barcelona Supercomputing Center - Centro Nacional de Supercomputación, based in Barcelona, Spain); CSC (CSC - IT Center for Science Ltd., based in Keilaranta, Espoo, Finland) and EBI (European Bioinformatics Institute, based in Hinxton, UK).

BSC-CNS

Early in 2004 the Ministry of Education and Science (Spanish Government), Generalitat de Catalunya (local Catalan Government) and Technical University of Catalonia (UPC) took the initiative of creating a National Supercomputing Center in Barcelona. BSC-CNS (Barcelona Supercomputing Center - Centro Nacional de Supercomputación) is the National Supercomputing Facility in Spain and was officially constituted in April 2005. BSC-CNS manages MareNostrum, one of the most powerful supercomputers in Europe, located at the Torre Girona chapel. The mission of BSC-CNS is to investigate, develop and manage information technology in order to facilitate scientific progress. With this aim, special dedication has been made to areas such as Computational Sciences, Life Sciences and Earth Sciences.

BSC also hosts the Computational Bioinformatics node of the National Institute of Bioinformatics (INB). INB is a research and services structure founded by the Spanish government through the Genoma España Foundation; it aims to develop new bioinformatics tools and provide bioinformatics support to the Spanish research community. The Computational Bioinformatics Node's mission is to provide common resources such as biological databases and applications for sequence and structural analysis. This node also provides the computational infrastructure needed to keep and distribute the results of such research, giving support to the bioinformatics users that need access to the centre's resources. BSC is therefore developing work protocols that provide a good integration of the INB's projects with the special supercomputational environment (see http://inb.bsc.es).

All these activities are complementary to each other and are very tightly related. In this way, a multidisciplinary loop is set up: our exposure to industrial and non-computer science academic practices improves our understanding of their needs and helps us to focus our basic research towards improving those practices. The result is very positive both for our research work as well as for improving the way we service our society.

CSC

CSC - IT Center for Science Ltd is administered by the Finnish Ministry of Education. CSC is a non-profit limited company providing versatile IT services, support and resources for academia, research institutes, and companies. The company has core competences in modelling, computing and information services. CSC provides Finland's widest selection of scientific software and databases and Finland's most powerful supercomputers that researchers can use via the Funet network (see http://www.csc.fi). CSC provides numerous services to its users:

• Data services for science and culture - Solutions for data storage, management and analysis needs - Scope: National/Nordic/European

• Funet network services - Fast backbone network for academia - Versatile network services - Scope: National/Nordic/European/global

• Computing services - High-performance computing and IT consulting services - Scope: National/Nordic/European

• Application services - Support, development and infrastructure for computational science and engineering - Scope: National/Nordic/European/global

• Information management services - Information technology services for the needs of academia, research institutes, state administration and companies - Scope: National/Nordic

• CSC Service Desk (help desk)

• Courses and events related to services are arranged regularly (http://www.csc.fi/courses/)

• Books and magazines are available in PDF and in print (http://www.csc.fi/english/csc/publications/)

EMBL-EBI

EBI - the European Bioinformatics Institute - is part of the European Molecular Biology Laboratory (EMBL) (see http://www.ebi.ac.uk/). EMBL-EBI is the European node for globally coordinated efforts to collect and disseminate biological data. Building on more than 20 years' experience in bioinformatics, it maintains the world's most comprehensive range of molecular databases. Many of these databases are household names to biologists - they include EMBL-Bank (DNA and RNA sequences), Ensembl (genomes), ArrayExpress (microarray-based gene-expression data), UniProt (protein sequences), InterPro (protein families, domains and motifs) and PDBe (macromolecular structures). Others, such as IntAct (protein-protein interactions), Reactome (pathways) and ChEBI (small molecules), are new resources that help researchers to understand not only the molecular parts that go towards constructing an organism, but how these parts combine to create systems. The details of each database vary, but they all uphold the same principles of service provision. The EBI's mission is as follows:

▪ To provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress

▪ To contribute to the advancement of biology through basic investigator-driven research in bioinformatics

▪ To provide advanced bioinformatics training to scientists at all levels, from PhD students to independent investigators

Details of the Feasibility study

Timeline

Start date: Month 1 (November 2007)
End date: Month 20 (July 2009)

Tasks

• T13.2.1 Kick-off meeting and review of scope (month 3) and workshop planning; CSC & BSC: opening user accounts for EMBL-EBI, computing capacity allocation, setting up initial computing environment

• T13.2.2 CSC/EBI: survey data management requirements

• T13.2.3 Present TFS at stakeholders meeting at EMBL-EBI (month 6)

• T13.2.4 Workshop with key stakeholders and opinion leaders in Barcelona and Interim Review meeting (month 8)

• T13.2.5 CSC & BSC: maintaining computing environment, test alternative computing platforms; CSC/EBI: networking performance survey

• T13.2.6 Interim report (month 12)

• T13.2.7 Present interim report at stakeholders progress meeting at EMBL-EBI

• T13.2.8 CSC & BSC: maintaining computing environment, test adapting jobs to distributed computing platforms (e.g. grids); CSC/BSC/EBI: validation of results

• T13.2.9 Draft final report (month 18)

• T13.2.10 Third committee meetings (month 19)

• T13.2.11 Third Stakeholders' Meeting (month 19)

Deliverables

• D13.2 Report summarising the technical and budgetary information on the feasibility of utilising European supercomputing facilities to address tasks such as the InterProScan indexing task, and outcomes from the workshop with key opinion leaders and decision makers.

Methods

Hardware Used

anthill cluster at EBI

The main production cluster used at EBI is a CentOS 4.4 Linux LSF cluster called "anthill". To submit jobs to the cluster, users log in to one of the interactive/submit hosts and use the relevant cluster commands (LSF). Anthill contains 64-bit hosts only and approximately 280 CPU cores across 37 machines.

• CPU: 8-core 64-bit (AMD Opteron 880 2.4 GHz and Intel Xeon X5355 2.6 GHz)
• RAM: 15 nodes with 16 GB, 49 nodes with 32 GB (64 nodes in total)
• OS: Linux kernel 2.6.9-42.ELsmp
• Software: LSF 6.2

MareNostrum at BSC-CNS

In March 2004 the Spanish government and IBM signed an agreement to build one of the fastest compute facilities in Europe: MareNostrum (MN). MareNostrum is housed at BSC-CNS and is one of the most powerful supercomputers in Europe (number 40 in the world, according to the latest Top500 list). By November 2006 its capacity had been increased due to large demands from scientific projects, reaching 94.21 teraflops (10 240 processors) and doubling its previous capacity of 42.35 teraflops (4812 processors). MareNostrum is based on PowerPC processors, the BladeCenter architecture, a Linux system and a Myrinet interconnection.

• Peak performance of 94.21 teraflops
• 10 240 IBM PowerPC 970MP processors at 2.3 GHz (2560 JS21 blades)
• 20 TB of main memory
• 390 + 90 TB of disk storage
• Interconnection networks: Myrinet and Gigabit Ethernet
• Linux: SuSE distribution

MareNostrum has 44 racks and takes up a space of 120 m².
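As a rough consistency check on the peak figure quoted above (assuming four double-precision floating-point operations per core per clock cycle for the PowerPC 970MP, which is our assumption rather than a figure stated in the report):

    \[
      10\,240\ \text{cores} \times 2.3\ \text{GHz} \times 4\ \text{flop/cycle}
      \approx 94.2\ \text{Tflop/s}
    \]

which agrees with the 94.21 teraflops listed above.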


murska.csc.fi at CSC

Murska is CSC's HP CP4000 BL ProLiant cluster system, which has 1088 dual-core AMD Opteron processors in its computing nodes. Murska can be used for running both medium-size parallel codes (up to 256 cores) and serial codes.

• Network: CSC <-> EU - 840 Mb/s UDP (Iperf) - 100-770 Mb/s TCP (Iperf) - ~0% packet loss
• Network: InfiniBand - 4.5 µs ping-pong - 18.8 µs RR latency - 1.13 GB/s min bandwidth - 67.2 GB/s PTRANS
• Disk: IOZone - 6 GB/s read - 4.7 GB/s write - ~100 TB storage
• Computing: 2176 AMD 2.6 GHz cores - 1-8 GB / core - RHEL4 compatible
• Grid/batch SW: DEISA/UNICORE - EGEE/gLite - NorduGrid/ARC - SLURM/LSF/SGE/PBS/LL

M-Grid via CSC

The main contribution to the grid feasibility study, briefly outlined below, was the M-grid, the Finnish national grid infrastructure (http://www.csc.fi/english/research/Computing_services/grid_environments/mgrid/index_html). In addition, a small fraction of Murska was included in some of the runs, as well as a small test cluster called Kiniini. The computational hardware used in the feasibility study was:

• M-grid
  ◦ Geographically distributed throughout Finland
  ◦ AMD Opteron based clusters (HP DL145 G1 and G2)
  ◦ Gigabit Ethernet interconnect
  ◦ NFS shared file system over Ethernet
  ◦ total number of cores: about 900
  ◦ number of cores prioritised for grid use: 170

• Murska
  ◦ See the specifications above
  ◦ number of cores prioritised for grid use: 64

• Kiniini
  ◦ HP BL460 Intel Xeon 5340 based cluster
  ◦ Gigabit Ethernet interconnect
  ◦ NFS shared file system over Ethernet
  ◦ number of cores: 64

The total number of available cores during the feasibility study runs was thus about 300, of which 64 cores could be used without queuing for resources. This could be scaled up considerably for production.

Software used

HMMER

HMMER 2.3.2 was used for all calculations; the specific programs were hmmsearch and hmmpfam.

HMMER on anthill@EBI

There are two versions of HMMER 2.3.2 available to users on anthill. The first is a patched version (rare bug-fix) of the program available via http://hmmer.janelia.org. The second is a licensed, software-accelerated product, LDHmmer (from Logical Depth), which has considerable performance speed-ups compared to vanilla HMMER with no loss of accuracy (from benchmarking prior to the TFS). However, LDHmmer was not used in the TFS.

HMMER on murska@CSC

Once users have logged in and transferred their files, the environment is set up by running "module load hmmer". This command loads the most recent version of HMMER (currently 2.3.2). Murska has LSF (Load Sharing Facility) for compute job distribution and handling, and EBI users use bsub to submit jobs. An example of a typical Murska batch job script was placed in the shared project directory /fs/proj1/ELIXIR for reference. Further details about Murska are in the manual on the web: http://www.csc.fi/english/pages/murska_guide/.

An optimised version of HMMER 2.3.2 was produced in collaboration with Åbo Akademi, who provided patches to improve the performance of HMMER. We found slight differences in results between the original version and the patched version. Åbo Akademi eventually identified the issue and provided a corrected version; due to its late arrival, only the original HMMER 2.3.2 was used on Murska.

HMMER on grid@CSC

Grid jobs were submitted from a gateway machine hosting the software environment that provides the user interface and takes care of executing the tasks in the grid. Users access the gateway through an interactive shell (ssh). The necessary input files (profiles and databases) are copied over scp to the gateway, then users submit the searches using a gsub command (which intentionally bears a close resemblance to LSF bsub), so it appears as if the job were being submitted to a local cluster.

For comparison, below are the command lines (file paths removed for brevity) for submitting the jobs at EBI and grid@CSC:

• EBI: bsub -J PF00001.ls -o PF00001.ls.out -e PF00001.ls.err "hmmsearch --cpu 1 -A 0 PF00001.ls 10k.fasta > results/PF00001.ls_out.txt"

• CSC: gsub -J PF00001.ls -o PF00001.ls.out -e PF00001.ls.err hmmsearch -A 0 PF00001.ls 10k.fasta

The interface software took care of assembling the searches into feasibly sized grid jobs, transferring the input data, executing the jobs in the grid and returning the results into the local directory of the gateway machine. The grid middleware used was NorduGrid ARC (http://www.nordugrid.org/).

As in the case of Murska, an optimised version of HMMER 2.3.2 produced in collaboration with Åbo Akademi was installed on the grid, but for the sake of consistency and comparability of the performance only the original HMMER 2.3.2 was used.

HMMER on MN@BSC

The group of Rosa Badia, at BSC's Computer Sciences department, has addressed in a general way the need to adapt serial calculations (such as those in the TFS) into grid environment workflows. They have developed a set of tools called Grid Superscalar (GS) that effectively transform a series of natively serial tasks into parallel single tasks that run in a variety of grid environments. Grid Superscalar takes care of designing the execution schedule of the tasks and takes into account the intrinsic data dependencies in the workflow. A special version of Grid Superscalar was adapted to run on MN, and has been used by the INB team to adapt HMMER and other bioinformatics applications to run in a cluster. It should be noted that Grid Superscalar effectively uses MN as a grid (albeit one located in a single domain), something which can potentially generate problems not seen in a usual distributed grid, due to faster communications and the concurrence of a large number of jobs on the same databases. The version of GS on MN is tuned to handle the fast communications in MN; however, some overhead in grid management is expected. GS is based on the C programming language.

Recently, GS has evolved into a new version called COMP Superscalar (COMPSs). It differs from its predecessor in two main ways: on the one hand, COMPSs offers a programming model that is specially tailored for Java applications; on the other, the runtime of COMPSs is formed by a set of components which follow the Grid Component Model (GCM). The base technologies of COMPSs are ProActive (used to build the components) and JavaGAT (which enables COMPSs to access numerous types of Grid middleware through a uniform interface).

A couple of applications have been prepared to run HMMER's hmmsearch using GS. Both are built on the INB's CommLine API, so they can be run as standalone applications from a simple script, run on MN itself or on users' own computers. runHMMSearch analyses a series of sequences against a single model, splitting the input sequences across a series of processors. In contrast, runHMMSearchMassive splits the model database, using the same set of sequences for all parts. In both cases a single hmmsearch is split amongst a number of processors, acting on MN as a single parallel job.

In the case of hmmpfam, the program already analyses a series of models and sequences, so splitting among processors could be more complex. In this case COMP Superscalar was considered more suitable. The application built for hmmpfam splits the input set of sequences among processors, and also splits the input model database to improve memory management among processors. The application rebuilds the output of the analysis to give the same result as if the run had been done on a single processor. In this way the use of COMPSs is transparent to the user and there is no need for the user to adapt to the new system. The application is prepared as a shell script which hides the details of the execution.
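A minimal sketch of this split-and-merge pattern is shown below (illustrative only: the real application is built on COMPSs rather than Python multiprocessing, and the file names and chunk count are assumptions):

    # merge_hmmpfam.py -- illustrative sketch: run hmmpfam on subsets of the input
    # sequences in parallel and concatenate the per-chunk outputs so that the
    # combined report reads as if a single run had been performed.
    import subprocess
    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path

    def split_fasta(fasta: str, n_chunks: int, out_dir: str):
        """Distribute the sequences of a FASTA file round-robin into n_chunks files."""
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        records = Path(fasta).read_text().split(">")[1:]     # each record minus its ">"
        buffers = [[] for _ in range(n_chunks)]
        for i, record in enumerate(records):
            buffers[i % n_chunks].append(">" + record)
        paths = []
        for i, buffer in enumerate(buffers):
            path = out / f"part_{i:03d}.fasta"
            path.write_text("".join(buffer))
            paths.append(path)
        return paths

    def run_chunk(args):
        """Run hmmpfam for one sequence chunk and return its raw text output."""
        hmm_db, fasta_chunk = args
        result = subprocess.run(["hmmpfam", hmm_db, fasta_chunk],
                                capture_output=True, text=True, check=True)
        return result.stdout

    def run_parallel(hmm_db: str, fasta: str, n_chunks: int, merged_out: str) -> None:
        chunks = split_fasta(fasta, n_chunks, "hmmpfam_parts")
        with ProcessPoolExecutor(max_workers=n_chunks) as pool:
            outputs = pool.map(run_chunk, [(hmm_db, str(c)) for c in chunks])
        # Per-sequence result blocks are independent, so simple concatenation in
        # chunk order is enough for this sketch to rebuild a single report.
        Path(merged_out).write_text("".join(outputs))

    if __name__ == "__main__":
        # Hypothetical inputs mirroring file names used elsewhere in the report.
        run_parallel("Pfam_ls.bin", "5k.fasta", 8, "hmmpfam_merged.out")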

Data management and Networking Surveys

Data management covers both manual and automatic aspects of how data is handled between the EBI and the computing sites; for example, data transfer, data processing and job monitoring. The disk space required for all the model files in the TFS from the 7 HMM databases is approximately 11.5 GB. The sequence file for all UniParc sequences was about 4.6 GB at the beginning of 2008. The total size of the output files generated by a run can vary considerably; for TIGRFAMs, one of the smaller databases, 38 MB of results were generated, whereas for Gene3D the output would be several orders of magnitude larger.

The command iperf (http://iperf.sourceforge.net/) was used to measure TCP/UDP bandwidth performance between EBI and the two supercomputing centres in both directions.


Access: User account set-up and compute allocation

CSC Account setup for EBI users

A CSC ELIXIR computing project (GID 593) was set up and EBI users were first given access to murska.csc.fi at CSC (a 64-bit RHEL4 Linux-based cluster). The project was given a total quota of 245 000 computing units (1 unit being equivalent to about 1 hour of computing on an AMD 2.6 GHz core). CSC monitored resource usage by EBI from the LSF batch queue system, and original and remaining resources were displayed for EBI users with the unix command 'saldo'.

Each CSC user account has a personal $HOME pointing to /home/u3/user. CSC computing project members also have a shared disk, $PROJ, at /fs/proj1/ELIXIR. Fast (unlimited) disk for computing is $WRKDIR. The setup of the computing environment is automated using 'module load hmmer', which loads the most suitable version of HMMER (currently 2.3.2). Murska uses LSF/bsub to submit jobs. An example of a Murska batch submission was placed in the project directory /fs/proj1/ELIXIR.

ELIXIR technical issues were channelled directly through the project manager ([email protected], +358-9-4572235) or via specialists. The preferred general address for registering service requests from EBI was the CSC Service Desk ([email protected]).

BSC Account setup for EBI users

MareNostrum's computing power is available to public researchers based on activity calls. The activity calls are continuously open and applications are evaluated every three months by an Access Committee. The duration of activities is limited to 4 months; larger activities would need to divide their work into 4-month periods. The Access Committee has four members: a BSC-CNS external manager with wide experience in innovation management, an ANEP representative, a supercomputing expert external to BSC-CNS, and a supercomputing expert who is a member of BSC-CNS. The Access Committee is advised by an Expert Panel composed of prestigious Spanish scientists, external to BSC-CNS, who have outstanding careers and experience in the management of research projects and are mainly managers of National Programs or of the Agencia Nacional de Evaluación y Prospectiva (ANEP).

This structure for applying for compute time at MN is less suitable for continuous submission of calculations, but is adequate for projects involving a large number of processors during a limited period of time. In the scenario presented by the ELIXIR TFS, this could potentially correspond to large updates of the InterPro database, either due to the inclusion of new models or the addition of large sets of sequences from specific projects requiring analysis. In those conditions, a project could be submitted to the advisory board and could be executed without altering the normal use policy of the centre. Projects which are granted compute time on MareNostrum are required to fill in a report stating their progress every two weeks (otherwise, access to the facility is temporarily de-activated).

During the study three ways of accessing BSC resources from EBI were tested:

1. Access as internal BSC users. BSC has a limited amount of CPU time on MareNostrum reserved for internal research groups and their collaborations. This entry point requires usage to be related administratively to BSC. EBI users were granted internal CPU time provided that ELIXIR was a project in which groups from the Life Sciences Department participated. Calculations by the local team were also performed using internal time. A relatively small number of CPU hours were granted through this channel, 5000 of which were used by BSC teams to set up the software.

2. Normal access to MN time for external users. This access required applying through the ordinary channels (web) and subsequent approval by an external advisory committee. EBI users were granted 500 000 CPU hours during a four-month period.

3. Access via extraordinary project. This route is reserved for special projects that cannot use the normal procedure due to limitations in their schedule or need for resources. This approach still requires approval but can be granted irrespective of the ordinary terms. EBI users were granted 150 000 CPU hours for the final proof-of-concept calculations.

A total of 1.5 million hours were granted to the TFS (1 million in the first period and 0.5 million in the second period). Of these, 520 027 CPU hours were used.

MareNostrum was accessed via nodes mn1.bsc.es to mn4.bsc.es. User home directories were set up in /gpfs/home/, with project and scratch directories in /gpfs/projects and /gpfs/scratch respectively. A total of 20 GB was allocated to the projects directory, with 10 GB available in scratch and 50 GB available on HSM/tape.

Technical issues were directed to the helpdesk ([email protected]) and directly to expert technical personnel at the MN Computer and Life Sciences departments. Technical information, resource usage reports, etc. were accessed at http://www.bsc.es/RES.

Compute tasks carried out as part of TFS 13.2

Data used

Three sequence files were produced, containing 10, 5000 (5k) and 10 000 (10k) sequences. All files had a similar mean sequence length, corresponding to the mean sequence length in the whole of UniParc (284.5), with the 10k average slightly higher than the others. A wide distribution of sequence lengths was sampled using Chebyshev's inequality, which states that 98% of the samples will be within seven standard deviations of the mean (we assumed a normal distribution). We also created a separate file of very large sequences as an extreme case, as there are a small number of very large (>10 000 aa) proteins in the protein databases and these are known to cause problems for some algorithms.
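For reference, the bound invoked above is Chebyshev's inequality, which in fact holds for any distribution with finite mean and standard deviation, so no normality assumption is needed for the bound itself:

    \[
      P\big(|X - \mu| \ge k\sigma\big) \le \frac{1}{k^{2}},
      \qquad
      k = 7 \;\Rightarrow\; P\big(|X - \mu| \ge 7\sigma\big) \le \tfrac{1}{49} \approx 2\%
    \]

so at least roughly 98% of sampled sequence lengths are expected to fall within seven standard deviations of the mean.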

Database    Version  Min Len  Max Len  Mean Len  SD Len  Count
10.fasta    N/A      284      285      284.4     0.5     10
5k.fasta    N/A      8        2442     281.0     265.1   5000
10k.fasta   N/A      8        2442     307.6     265.3   10 000
big.fasta   N/A      11 088   23 054   18 840.5  3199.2  10
UniParc     13.0     1        37 777   284.5     311.7   16 277 049

Len = sequence length
SD = standard deviation

The UniProt decoy database (v14.6) was also downloaded for use in the study. This consists of 7 199 319 randomised sequences.


HMMER searches the HMM files which are provided by InterPro's member databases: Gene3D, Pfam (2 files), PIRSF, SuperFamily, SMART and TIGRFAMs. These HMMs vary in size and number per database. The statistics of the initial set used are as follows:

Database     Version  Min Len  Max Len  Mean Len  SD Len  Count
Gene3D       3.0.0    11       1174     153.7     88.4    8426
PFAM FS      22.0     5        2295     209.8     173.9   9318
PFAM LS      22.0     5        2295     209.8     173.9   9318
PIRSF        270      33       6092     435.1     405.5   2742
SMART        5.1      11       971      137.9     121.0   725
SuperFamily  169      21       1504     180.5     119.1   10 894
TIGRFAMs     7.0      10       2773     348.7     246.0   3418
Total                                                     44 841

Len = sequence length
SD = standard deviation

Benchmarking of HMMER searches

For the feasibility study, comparisons of search performance and accuracy were done using a test set of sequences against various HMM databases. Sequence files containing 10, 5000 and 10 000 sequences were created where the average length of the sequences in each was the same and the sequence lengths were evenly distributed. Statistics about the models contained in each of the model databases were calculated. These sequences were then searched against the databases in a variety of modes, listed below:

1. hmmsearch ASCII models vs 10, 5000 and 10 000 sequences
2. hmmsearch binary models vs 10, 5000 and 10 000 sequences
3. hmmpfam ASCII models vs 5000 sequences
4. hmmpfam binary models vs 5000 sequences

Other computational tasks

The following computational tasks were attempted as part of the feasibility study:

1. Searching new meta-genomics sequences against HMM databases: One of the aims in the TFS was to see whether or not the supercomputing centres could be used to search large numbers of meta-genomic sequences when they entered the public domain. During the course of the project a number of new meta-genomic sequences became available (471 239) and it was decided to use this as a "real" case.

2. Building HMMs from sequence alignments: The hmmalign and hmmcalibrate programs can be used to build HMMs from sequences. It was decided to use this as an initial test case on the supercomputing infrastructures. The sequence alignments of proteins from the HAMAP protein families database were used for the model building.

3. Running the UniProtKB decoy database against HMM databases: A way of testing the sensitivity and accuracy of HMMs (i.e. how likely it is that a model matching a sequence is a false positive) is to search a randomised protein sequence database and see whether or not any results are obtained. If they are, this leads one to believe that the model is not performing well. The randomised "decoy database" produced by UniProtKB was searched against HMM databases using hmmpfam in order to test this.

4. Running against HMM databases on grid resources: As previously mentioned, access to compute grid resources was made available for the project, something which was not in the original proposal but which was felt to be of importance to the feasibility study. The 10 000 sequences used in the other parts of the study were also used here, alongside the Pfam 23.0 database.

5. Calculation of MD5 checksums for all UniParc protein sequences: Despite this being slightly outside the scope of the TFS, calculation of MD5 checksums on all UniParc sequences was performed, to test the performance of running very short jobs on the CSC infrastructure. MD5 checksums can be used to uniquely identify protein sequences (by treating the sequence as a string) with very little practical chance of a collision (see the sketch after this list).
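A minimal sketch of the checksum calculation referred to in task 5 (the exact normalisation applied by UniParc, e.g. upper-casing and whitespace handling, is assumed here rather than taken from the report):

    # md5_checksum.py -- illustrative sketch: derive an MD5 checksum from a protein
    # sequence string so it can serve as a compact, practically unique identifier.
    import hashlib

    def sequence_md5(sequence: str) -> str:
        """Return the hex MD5 of a protein sequence, ignoring whitespace and case."""
        normalised = "".join(sequence.split()).upper()   # assumed normalisation
        return hashlib.md5(normalised.encode("ascii")).hexdigest().upper()

    if __name__ == "__main__":
        print(sequence_md5("MKTAYIAKQR"))                # short hypothetical peptide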

Other non-computational tasks carried out as part of TFS 13.2

Grand Challenges in Computational Biology workshop

The Grand Challenges in Computational Biology workshop was held in Barcelona in June 2008. The outcome of the workshop is listed in the "Results" section.

Dates: 2-4 June 2008
Location: Barcelona, Institut d'Estudis Catalans
Organizers: Barcelona Supercomputing Center & Institute of Research in Biomedicine
Attendees: 130
Invited speakers: 19
Communications presented: 47 (4 oral, 43 poster)

Results

Data management and networking Surveys

Surveys between EBI and CSC

A networking performance survey between EBI and CSC was carried out using the above use case to represent the end-user usage of the network. Lossless UDP and TCP of up to 300-500 Mbit/s was reached in both the CSC -> EBI and EBI -> CSC directions. Rates over 500 Mbit/s were not tried. This seemed sufficient at this point for the use case, therefore dedicated lightpath or lambda solutions were not investigated. It appears that the public network is not the bottleneck for exporting computations such as HMMER searching, but rather the local networks and IT infrastructure at the corresponding organisations.

Experiences from the networking survey suggest that data transfer should be done using a custom protocol (e.g. GridFTP) or a modified version of SSH. Regular SSH (3.9p1) performance is only about 1.5 MB/s. We therefore investigated and recommend usage of the higher-performance HPN-SSH. The most important features of HPN-SSH are 1) the capability to use "None" encryption for data while still using encryption for the password exchange, and 2) the ability to use larger (and dynamic) buffers in file transfers. Replacing SSH with HPN-SSH on both ends allows up to 55 MB/s in the EBI -> CSC direction (the bottleneck being CPU on the transmitting server sirpale.csc.fi) and 70 MB/s in the CSC -> EBI direction (bottleneck not tested), where the stable rate is somewhat lower. CSC -> EBI transfer can still reach 50 MB/s by replacing SSH only on the EBI (receiver, client) side.

We also observed that when the buffers are increased to 8 MB, fluctuations in the data transfer appear more frequently, probably because the system hits CPU bottlenecks earlier. This probably results in packet loss/delays, which slows down TCP. Therefore, in a possible production environment it makes sense to adjust the TCP parameters so that they are not overly aggressive. The assumption here is that no encryption is needed for the data transfer; the CPU bottleneck is reached at lower transfer rates if encryption/decryption is needed.
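To put these rates into perspective, transferring the roughly 4.6 GB UniParc sequence file mentioned earlier would take approximately (treating 1 GB as 1000 MB):

    \[
      \frac{4600\ \text{MB}}{1.5\ \text{MB/s}} \approx 3100\ \text{s} \approx 51\ \text{min}
      \quad\text{(plain SSH)},
      \qquad
      \frac{4600\ \text{MB}}{55\ \text{MB/s}} \approx 84\ \text{s}
      \quad\text{(HPN-SSH, EBI} \rightarrow \text{CSC)}
    \]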

Surveys between EBI and BSC

A networking performance survey was carried out between EBI and BSC. Raw throughput can be obtained in both directions at close to the limit imposed by the EBI's current JANET connection (1 Gb/s). Actual mileage would vary with application traffic. The network speed was measured at 900 Mb/s from EBI -> BSC. The reverse (BSC -> EBI) was carried out at a time of higher congestion but regardless managed to achieve speeds of more than 650 Mb/s.

User account set-up and compute allocation experiences

The CSC accounts and compute hours were allocated within three weeks of the kick-off meeting and lasted for the duration of the project. However, applying for accounts and external compute hours at MareNostrum was more complicated. EBI was unfamiliar with the application and reporting processes in place at BSC and encountered delays because of this. Firstly, the initial deadline for application for compute time in April was missed by a day, as we had been misinformed regarding the deadline's date. Due to the missed deadline in April, the next application was submitted and granted in October, with a contingency number of internal hours allotted as a temporary measure in September. EBI found the online application form confusing and hard to use and, because of this, mistakenly applied for 10 times the number of hours actually intended to be used. In order to complete the project in May 2009, an extra set of CPU hours was granted via an extraordinary project application process to complete the final tasks of the TFS.

Outcomes of computational tasks

For details of the outcome of the computational tasks, please see Appendix I.

Outcomes of non-computational tasks

Grand Challenges in Computational Biology workshop

Biology is one of the very few cases of human activity growing faster than Moore's Law. The exponential growth of sequence and structural databases, and the discovery of the complexity of most biomolecular interactions, are giving rise to computational challenges that cannot be addressed with current hardware architectures. In many fields, the limited power of computers is becoming the main obstacle for the advance of biological research. During the meeting, leaders in the field reviewed the current state of the art in computational biology, from atomistic simulation to systems biology. We also learned about the recent advances in computer architecture that are defining the characteristics of supercomputers in the coming decade.


The interaction between supercomputer centres and computational biology platforms is mandatory in order to find computational solutions to the challenges that new genomics projects are raising. Grand challenges are visible both in the area of simulation and in that of data processing and mining. The coming generation of supercomputer centres will incorporate massively parallel machines, with many moderately powerful computational cores connected by fast networks. Peak power will be above the petaflop range, but its practical use in computational biology studies will require a large effort in: i) updating programs to allow a more efficient use of multiple processors, ii) improving the way in which supercomputers make use of extremely large databases, and iii) a change in the paradigm of use of high-performance computer platforms by the computational biology community.

Conclusions

Other considerations, not included in TFS

HMMER on Cell BE architecture@BSC

The new generation of supercomputers will explore different architectures. BSC is involved in benchmarking Cell BE architectures within the PRACE project with a prototype named MariCel. This computer has 12 IBM JS22 Power6 nodes and 72 IBM QS22 compute nodes powered by 2x PowerXCell 8i. We have explored the possible use of HMMER on this architecture. A partial porting of HMMER 2.3 (hmmsearch), prepared at IBM's Watson Research Center, is available. Unfortunately this is not a fully operative version, and only the core of the calculation, which exploits the parallel architecture, can be run. We have installed and tested hmmsearch successfully and attempted to complete the necessary code; however, the changes needed were too complex to fit in the present study. We are now planning to port the new HMMER v3.

Grid workshop

The idea of a practical hands-on grid workshop for ELIXIR was proposed by experts in the TFS. A draft agenda for the workshop could be:

• Scope: two days of hands-on work

• Maximum 20 attendees (24 absolute maximum) from CSC, EBI and possibly one or two other organisations from ELIXIR. Workshop teachers would include grid professionals (outside Europe) to look at ELIXIR-defined computing problems

• EBI would start the workshop programme by setting the scene: defining the computing problems/use cases related to ELIXIR (e.g. database production workflow). This sets the technical challenge for the technical experts to solve using grid/cloud computing

• CSC would then give introductory lectures on grid technology terms to calibrate the terminology

• The remainder of day one and the beginning of day two could be spent working on the ELIXIR challenge with the given grid technologies

• Day one would close with something more free-form in the evening that supports discussion of problems related to the day's work. The goal is to start work the next morning with fresh minds

• The closure of day two would involve spending an hour or so writing technical recommendations based on the challenge that was set at the beginning of the workshop, including how well the work succeeded. The recommendations would be written as a wiki page or blog that could be started at the workshop and extended later.

Commercial Cloud infrastructures

Since the proposal was written, new compute solutions have appeared in the guise of cloud computing, particularly from commercial vendors such as Amazon and Google. These infrastructures have not been directly assessed by the TFS; however, costings for running on these systems are publicly available, for example for Amazon EC2 (http://aws.amazon.com/ec2/#pricing), and appear to be quite competitive. However, basing core production tasks on a commercial service would leave EBI exposed to changes in pricing and is not considered feasible.

Improved data transfer performance

For the TFS, further enhancements and optimisations of the network connectivity between the EBI and CSC/BSC were not explored, as only relatively small amounts of data were being moved between the centres. This meant that the existing network and hardware set-ups were sufficient for the demands. However, the establishment of a direct, dedicated Lightpath (which has a capacity of 10-40 Gb/s) could be a possibility should larger data sets be involved. As data gets larger, the limiting factor becomes not the network but the hardware behind it, such as the disk set-up and CPU.

Overview, compute provider perspective

CSC perspective

Benchmarking shows that computations on similar cluster systems are not distinguishable in performance. Fine-tuning of performance with different run-time parameters and different compilers is possible and should be considered if large-scale production runs are started.

When other users ran massive serial jobs at the same time as the HMMER production runs, in practice this caused temporarily lower throughput for serial computing, since the disk systems became a bottleneck. Disk performance was thus important when performing massive serial computing analogous to the HMMER computations.

A few options can be fine-tuned to increase the throughput at a computing provider's end:

1. Increase resources for serial jobs generally for the entire target machine.
2. Dedicate a queue for serial jobs to be able to control the job priorities better.
3. Use grid technology to balance peak computing. If an application such as HMMER has been enabled on the grid, control of resources is streamlined and can be flexibly increased to get the job finished on schedule.

Scientific data storage and management systems are evolving especially quickly at the moment. The biological use cases from ELIXIR had an impact on the planning of data storage systems at CSC. Storing research data requires multi-disciplinary data management tools, well-functioning access management, secured data integrity, privacy and availability, as well as cost-effectiveness; all these aspects must be balanced. To achieve these targets, multiple-level storage solutions with centralised as well as distributed capacity are needed, and should be built in close collaboration with the user communities.

In summary:
• As noted above, the CSC and EBI clusters with similar processors compute equally fast; this is not a surprise.
• The key factor in service provision is that interactions/processes work effortlessly between the staff specialists in the collaborating organisations.
• Once an application is enabled on a grid through a gateway, the possibilities to scale up computing dynamically are straightforward.

The total price of computing thus includes, in addition to hardware investments, salaries for personnel, space and electricity. The price and feasibility of scaling up computing dynamically is an important consideration for the total economic cost of ELIXIR. In the proposed hub-node structure of ELIXIR, networked organisations will obviously need local computing/IT resources to maintain their own expertise and flexibility. However, providing scaled-up computing resources need not be a necessary role for all collaborating organisations if, as in the case of HMMER, computations can be exported between organisations.

From CSC's perspective, the services for EBI/ELIXIR could be divided into two service levels or operational models that were tested in the TFS:

• Light collaboration model: A standard account/computing project is set up for EBI/ELIXIR. In this operational model, EBI staff members are able to compute at CSC using a CSC account/project as described above. The EBI/ELIXIR computing project then has similar resources and status to those that e.g. a large university group could get from CSC. The collaboration does not involve dedicated expert staff actively participating in the scientific goals of the computations.

• Extended collaboration model: CSC and EBI agree to build a common infrastructure for ELIXIR and its end users/stakeholders. This is much more labour intensive: building it is a challenging task, and running the service requires dedicated technical staff with long-term commitments. The target technical function could be hosting/distributing computing defined by EBI/ELIXIR into a "grid/cloud" through a gateway. This type of service differs from that of proprietary cloud providers: the service quality depends on the availability of expert staff (scientific and technical) at the resource provider's site. As a stable non-profit organisation, CSC could take such a role in the long term and dedicate staff to focus on the goals set by ELIXIR. This is, however, probably not possible for most commercial "grid/cloud" providers, considering the time-scales and operational stability sketched in the ELIXIR plans. Furthermore, if the required ICT infrastructure is very complicated, it may be difficult to switch between service providers. The downside is that maintaining staff with the highest level of topical expertise anywhere constitutes a high cost in the long term.

Either service level can be set up by CSC for EBI. However, specialisation and deeper collaboration between hubs/nodes would enable organisations to concentrate on what they can/want to do best.

BSC perspective

Interaction between research institutions based on different conceptions is always a challenge. Two different aspects of the work done have to be considered.


On the one hand, the technical adaptation to the use of bioinformatics applications (HMMER being the test case) that are already routinely established in the requesting institution (EBI) has been solved. This was done through a combination of tools built and adapted at BSC to fill the gap between normal, data-intensive bioinformatics applications and the optimal use of supercomputers, where a high degree of parallelisation is desired. Tools developed at BSC, such as Grid Superscalar and COMP Superscalar, have been used successfully to run massive HMMER jobs on the MareNostrum supercomputer. The use of scripting-based APIs and middleware allows interaction with the centres through simple modifications of EBI protocols. Full integration into the EBI pipeline is theoretically possible, but it has not been tested in this study.

The second aspect to consider is the management of the relationship between EBI and BSC, which is a good example of the kind of problems that may arise when two institutions try to collaborate. BSC, due to its administrative rules, needs to have a strict use policy in which most activities require explicit approval by an external advisory committee. In the TFS it was therefore decided to reserve for BSC the large calculations that could be presented to the committee as single projects. We have tested the three possible ways to access BSC computer resources:

1. A limited amount of computer time is allowed for internal use. In this study, internal time was used by the BSC team to set up the software (ca. 5000 h) and also by EBI users for test runs.

2. Application for computer time using the normal procedures was also used. A significant amount of computer time was granted to the project. The practical use of this CPU time was, however, less than expected due to the limitations in the dates of the approved period.

3. The third method was to present the project as a singular one, using a special allocation of CPU time that was granted after the necessary formalities.

The three approaches have pros and cons. Internal time is strongly limited; normal external access requires advance planning, as time is granted in four-month periods; and the third approach, indeed the most suitable from the user's point of view, is highly irregular from BSC's point of view and requires altering the MareNostrum schedule. The general recommendation in the case of BSC, after the study and considering the formalities required in each case, is to make use of the regular procedure. This requires advance planning of projects, but has less impact on the centre and requires far fewer formalities. BSC is adapting its internal policies both to accept external groups as collaborative internal users and to extend allocation periods beyond the four-month limit, allowing long-term projects without the hassle of constant renewal. Both improvements were triggered partly by the ELIXIR TFS and would allow future projects to be carried out more easily.

Overview, end user (EBI) perspective

EBI is a service organisation, and the InterPro database needs to have the most up-to-date data possible in order to satisfy its users and to fulfil its remit to be the most comprehensive protein family and domain database in the world. EBI therefore needs to be confident that any computational tasks are performed to the same level of accuracy and quality as if they were carried out locally at EBI. It is clear that the calculations performed at both BSC and CSC are equal in this regard to those carried out at EBI.

The TFS has taught EBI some useful lessons regarding the estimation of compute time and resources. Firstly, it is clear that it is very difficult to anticipate the volume of sequences produced in the public domain. Similarly, it is also quite difficult to predict when a new set of protein signatures will be released and what computational load the new combination of algorithm and models will impose. In the original proposal, the number of sequences predicted to be available by the time of writing of this report was three times what it actually is. These predictions were based on database growth trends at the time the proposal was written; it may be that similar rapid increases in sequence volumes over short periods will recur in the future. In these cases, having access to facilities which can accommodate these excesses above the normal levels of sequence production would be very useful. However, it is hard to pinpoint when these bursts may happen, so forward planning is very difficult. Secondly, disk I/O contention, which occurs as a consequence of running multiple parallel processes and which was not predicted during small-scale testing, only becomes a concern if a user has a limited amount of wallclock time in which to complete their calculations. The TFS has highlighted the importance of optimising compute job sizes to avoid such bottlenecks, particularly on systems where time is shared between many different users running many different tasks.

Transparency is a keyword that has arisen from this TFS: transparency in both communication and the set-up of the computing environment is of utmost importance. The feasibility study worked best when there were clear communication channels between EBI and CSC/BSC. The availability of experts at the supercomputing centres was crucial in enabling the compute environments to be set up appropriately for running the tasks that had been set out. When this did not happen, there was confusion regarding the most appropriate way to carry out the tasks, which inevitably led to delays, wasted time and frustration. The experts available via the helpdesks at both CSC and BSC are to be commended for their support to the EBI personnel during the course of the study. It also became clear that, where a computing environment was set up by the supercomputing centres such that the technicalities of the implementation were hidden as much as possible from the users at the EBI, the transition to using that environment was much smoother. The grid study carried out at CSC should be given special mention, as the changes to the grid set-up were made invisibly to the end user, making for a much improved experience.

The operational set-ups of EBI and CSC are much more closely aligned than those of EBI and BSC. EBI found the process of applying for compute time at BSC much more difficult than at CSC; however, EBI understands why such procedures are necessary for a large resource such as MareNostrum. It is unclear whether estimating and applying for compute hours every four months will be feasible in practical terms, due to the unpredictable nature of the data itself (described above). The requirement to report to BSC every two weeks in order to keep access active is an undesirable overhead.

If a set amount of time could be reserved over a longer period (e.g. per year) and used at times of peak demand (with a certain amount of warning for scheduling purposes), this would be much more suitable for EBI's needs, as it would mean that the resource was suitably responsive. The short study carried out on the grid infrastructure was very encouraging. However, more work is required to check that any eventual gains in throughput would not be lost to system and network latency.

Overall, despite the teething troubles encountered, the outcome of the TFS has been positive, with all partners learning from the experience. It is expected that, once systems for performing calculations have been set up and tested (utilising in-house expertise at both the supercomputing centres and the EBI), the actual overhead of keeping these running would be relatively low and would only require new expert input if an entirely new set of calculations (e.g. using a new algorithm) were to be started.


References

Acknowledgements

Those involved in the work package would also like to acknowledge the contribution of the following people:

EBI: Many thanks to Craig McAnulla for helping to prepare this report and contributing to the TFS.
CSC: Thanks also go to Olli Tourunen, Samuli Saarinen, Marko Myllynen, Pekka Savola and Jarno Tuimala for contributing to the TFS.
BSC: Thanks to Romina Royo, David Torrents and all other helpdesk personnel who worked on the project at BSC.

Appendices

Appendix I: Details of computational tasks carried out in the TFS

Initial benchmarking of HMMER searches

It was important to check how performance compared between the two supercomputer sites and EBI. We therefore ran some benchmarking at EBI of different-sized sequence files against all the HMM databases in InterPro (except PANTHER). Below are the results of running the hmmsearch program against ASCII and binary models (note: no results exist for "Gene3D binary" because it could not be converted; SMART was not run due to the complexity of applying E-values). All results confirm that running against binary models is consistently faster than against ASCII models.
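
As an illustration of the kind of comparison performed, the sketch below converts an ASCII HMM library to binary and times hmmsearch against both forms. This is a minimal sketch only: it assumes HMMER 2.x tools on the PATH, the file names are hypothetical, and the actual chunking and submission of jobs followed the existing EBI pipeline.

    import subprocess
    import time

    ASCII_HMMS = "TIGRFAMs.hmm"        # hypothetical ASCII HMM library
    BINARY_HMMS = "TIGRFAMs.hmm.bin"   # binary copy produced below
    SEQS = "test_10000.fasta"          # hypothetical sequence file

    # hmmconvert -b writes a binary-format copy of an HMM library (HMMER 2.x)
    subprocess.run(["hmmconvert", "-b", ASCII_HMMS, BINARY_HMMS], check=True)

    # Time hmmsearch against the ASCII and binary forms of the same models;
    # --cpu 1 keeps the comparison to a single processor, as in the TFS runs.
    for hmms in (ASCII_HMMS, BINARY_HMMS):
        start = time.time()
        subprocess.run(["hmmsearch", "--cpu", "1", hmms, SEQS],
                       check=True, stdout=subprocess.DEVNULL)
        print(hmms, round(time.time() - start, 1), "seconds wallclock")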

EBI results - ASCII models vs binary models (figures in CPU hours)

Database             10 seqs   5000 seqs   10 000 seqs   10 (large)   Total
Gene 3D ASCII        0.1       14.6        28.6          1.9          45.2
Gene 3D binary       –         –           –             –            –
Pfam FS ASCII        0.1       24.2        49.6          3.2          77.1
Pfam FS binary       0.1       18.9        43.7          2.2          64.9
Pfam LS ASCII        0.1       22.9        43.9          3.2          70.1
Pfam LS binary       0.1       20.0        43.0          2.8          65.9
PIRSF ASCII          0.1       12.9        27.1          1.9          42
PIRSF binary         0.1       12.8        27.8          1.9          42.6
SMART ASCII          –         –           –             –            –
SMART binary         –         –           –             –            –
Superfamily ASCII    0.1       19.9        41.7          3.0          64.7
Superfamily binary   0.1       20.6        43.5          2.7          66.9
TIGRFAMs ASCII       0.1       13.4        27.6          2.2          43.3
TIGRFAMs binary      0.1       13.2        26.5          1.9          41.7


Graph of binary vs ASCII performance at EBI

BSC results - ASCII models vs binary models (figures in CPU hours)

Database             10 seqs   5000 seqs   10 000 seqs   10 (large)
Gene 3D ASCII        0.23      4.45        10.06         1.98
Gene 3D binary       –         –           –             –
Pfam FS ASCII        0.29      8.41        18.29         2.08
Pfam FS binary       0.28      8.40        18.27         2.05
Pfam LS ASCII        0.2       8.23        19.04         2.10
Pfam LS binary       0.18      7.63        19.02         2.09
PIRSF ASCII          0.08      4.93        12.48         1.68
PIRSF binary         0.06      5.38        12.48         1.63
SMART ASCII          –         –           –             –
SMART binary         –         –           –             –
Superfamily ASCII    0.26      7.34        15.52         1.51
Superfamily binary   0.24      7.55        14.68         1.52
TIGRFAMs ASCII       0.07      4.15        10.52         1.42
TIGRFAMs binary      0.05      4.81        9.90          1.45

BSC repeated the benchmarking as per EBI, with some differences:
▪ Only jobs shorter than 48 hours (the maximum wall-clock time of MareNostrum) were submitted.
▪ Pfam 23.0 and Gene3D 3.1.0 were used, rather than the versions of the databases used at EBI.
▪ The --cpu=1 parameter was used in all executions to force a maximum of one CPU to be used per job (rather than the threaded version of HMMER).

It should be noted that the larger sequence files seem to perform better at BSC than at EBI, probably due to the way that MareNostrum and Superscalar are configured.

EBI results - hmmpfam vs hmmsearch (figures in CPU hours)

Benchmarks were also run at EBI to compare the hmmpfam and hmmsearch algorithms. Running hmmpfam against a file of 5000 sequences performed as expected, with hmmpfam running significantly slower than the fastest hmmsearch searches.
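
The two programs take their inputs in opposite directions, which is what the table below compares. A minimal sketch, assuming HMMER 2.x command syntax and hypothetical file names:

    import subprocess

    HMM_LIBRARY = "Pfam_fs.bin"       # hypothetical binary HMM library
    SINGLE_HMM = "single_model.hmm"   # hypothetical file holding one profile HMM
    SEQS = "query_5000.fasta"         # hypothetical 5000-sequence query file

    # hmmpfam: each query sequence is scanned against every model in the library
    subprocess.run(["hmmpfam", "--cpu", "1", HMM_LIBRARY, SEQS], check=True)

    # hmmsearch: a profile HMM is searched against the sequence file; covering a
    # whole library is typically farmed out as many independent runs like this one
    subprocess.run(["hmmsearch", "--cpu", "1", SINGLE_HMM, SEQS], check=True)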

Database      hmmpfam vs ASCII   hmmpfam vs binary   hmmsearch vs binary
Gene 3D       83.4               55.2                –
Pfam FS       72.9               49.6                18.9
Pfam LS       70.5               61.7                20.0
PIRSF         43.4               38.1                12.8
SMART         6.5                3.2                 –
Superfamily   114.6              59.7                20.6
TIGRFAMs      80.3               35.9                13.2
Total         471.6              303.3               107.9

Graph of hmmsearch vs hmmpfam results at EBI

Optimised CSC HMMER


An optimised version of HMMER was produced by CSC and was also searched against the same sequences, resulting in an improvement in performance compared to the vanilla version of HMMER. However, there were accuracy problems and core dumps when EBI initially tried to run it (these were later fixed, as explained above).

BSC GS and COMPSs results

Due to the grid management overhead, runHMMSearch is only adequate for running small jobs and is not suitable for large-scale projects. It has been tested only with web services. In contrast, runHMMSearchMassive with a large number of sequences has been tested successfully on MareNostrum. The optimal configuration of runHMMSearchMassive would be sets of 500 000 sequences over 64 processors. A test run of this size against TIGRFAMs (3800 models) lasted 15.4 h, which corresponds to a speed-up factor of 60.4 (around 90% of optimal scaling).

Benchmarking of the COMPSs hmmpfam tool with sets of 5000 sequences on 48 processors gives a general 20x speed-up. It is expected that larger input sequence sets would give a better speed-up, but no systematic test has been performed.

Comparing typical performance at EBI, CSC, BSC

The 10 000 sequences file was searched at EBI, CSC and BSC against the TIGRFAMs 7.0 database. The results, in both CPU and wall-clock hours, are shown below.

Software    CPU hours   CPU hours   CPU hours   Wallclock     Wallclock     Wallclock
            (EBI)       (CSC)       (BSC)       hours (EBI)   hours (CSC)   hours (BSC)
hmmpfam     56.5        66.7        82.8        58.75         33            26.7
hmmsearch   27.6        25.2        N/A         0.30          0.48          N/A

Results of running other tasks on supercomputing resources

Originally, tasks were to be split between computing for new genomic and meta-genomic sequences (at CSC and BSC respectively), but because of the limitations outlined above we were forced to structure things differently. Instead we decided upon a few small compute tasks which could be carried out at either of the supercomputing centres and later compared. These are highlighted in the sections below.

Murska@CSC

Various tasks were performed by EBI users on Murska at CSC. These included searching new meta-genomics sequences against HMM databases; calculating MD5 checksums for UniParc sequences; building HMMs from sequence alignments; and running the UniProtKB decoy database against the TIGRFAMs 7.0 database. Data was transferred from the local machines at EBI to CSC using scp (the secure copy command).

The metagenomics task was eventually completed successfully. This highlighted problems with not restricting the number of processors used by HMMER to one (HMMER tries to use all cores on a machine by default, even if other jobs are running there), which led to a small number of core dumps.


We also encountered a small number of problems when running the UniProtKB decoy database against HMM databases. One of these was that EBI underestimated the CPU time required for the searches to be completed (we originally intended to search all HMM databases). Several of the searches also took longer than the maximum allotted CPU time on Murska. In total, 50 695 CPU hours were used during this exercise, as we restricted the searches to the TIGRFAMs 7.0 database only. We also encountered a known bug in HMMER where the error message "FATAL: you can't init get_wee_midpt with a T" is occasionally reported. This is fixed by editing the source code and re-setting the RAMLIMIT; once re-compiled, the jobs were able to complete. Finally, the same problem with multi-threaded HMMER searches occurred as before; this time, LSF was run with the option to reserve 4 cores on the compute host for the search (by using the -n 4 option to bsub).
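
A minimal sketch of how such a submission might look, assuming LSF's bsub is available; the chunk and library file names are hypothetical. Requesting 4 slots keeps a threaded hmmpfam from oversubscribing the execution host:

    import subprocess

    CHUNK = "decoy_chunk_0001.fasta"   # hypothetical chunk of the decoy set
    HMM_DB = "TIGRFAMs_7.0.bin"        # hypothetical binary TIGRFAMs library

    # Ask LSF for 4 job slots so that a multi-threaded hmmpfam has the cores it
    # will try to use; -o captures the job's standard output to a log file.
    subprocess.run(["bsub", "-n", "4", "-o", CHUNK + ".lsf.log",
                    "hmmpfam", HMM_DB, CHUNK], check=True)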

Two other tasks which did not directly use the search programs from the HMMER package but were still of interest to EBI were also carried out: building HMMs from sequence alignments using hmmalign and hmmcalibrate, and calculating MD5 checksums for sequences.

The MD5 task was initially run in a similar manner to the HMMER searching jobs, with the input sequence file (the complete UniParc set of sequences, over 17 million at the time) split up into chunks and jobs individually spawned for each chunk. However, this spawned millions of very short jobs, and any that remained in the queue were terminated after approximately 4 hours. Instead, a Perl script was run which required certain BioPerl modules; these were not present on Murska and had to be copied to the user's home directory and then exported as part of the command submitted via bsub. This allowed the task to complete successfully.
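
A minimal sketch of the batching idea in Python (the chunk file name is hypothetical, the FASTA parsing is kept self-contained, and the checksum is assumed to be taken over the upper-cased sequence string): many checksums are computed within a single job rather than one very short job being spawned per sequence.

    import hashlib

    def fasta_records(path):
        """Yield (identifier, sequence) pairs from a FASTA file."""
        ident, seq = None, []
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    if ident is not None:
                        yield ident, "".join(seq)
                    ident, seq = line[1:].split()[0], []
                elif line:
                    seq.append(line)
        if ident is not None:
            yield ident, "".join(seq)

    # One chunk file per job keeps each job long enough to avoid the queue's
    # limits on very short jobs; the chunk name below is hypothetical.
    for ident, seq in fasta_records("uniparc_chunk_0001.fasta"):
        print(ident, hashlib.md5(seq.upper().encode("ascii")).hexdigest())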

The HMM building task involved taking sequence alignments from the HAMAP database, submitting them to hmmalign and hmmcalibrate using Python scripts, and then searching the resulting HMMs against all UniParc sequences. This also completed successfully, producing the same results as the equivalent commands on EBI resources.
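
A minimal sketch of scripting this step from Python. Note that the report names hmmalign and hmmcalibrate; the sketch assumes the conventional HMMER 2.x pairing of hmmbuild (to build a model from an alignment) followed by hmmcalibrate, and the file names are illustrative:

    import subprocess

    ALIGNMENT = "family_alignment.sto"   # hypothetical HAMAP family alignment
    HMM_FILE = "family_model.hmm"        # profile HMM to be produced

    # Build a profile HMM from the alignment, then calibrate its E-value
    # statistics so that subsequent searches report comparable E-values.
    subprocess.run(["hmmbuild", HMM_FILE, ALIGNMENT], check=True)
    subprocess.run(["hmmcalibrate", HMM_FILE], check=True)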

MareNostrum@BSC

Once user accounts had been set up at MareNostrum, EBI users checked that it was possible to log in and run compute jobs. As with CSC, data was transferred from the local machines at EBI to BSC using scp (the secure copy command).

Initially, when trying to run simple jobs, the following error was received: "WARNING: your user could be disabled to run jobs. Contact [email protected]". After contacting support, the issue was resolved.

Numerous tasks were performed at BSC once the external hours for the project had been granted. EBI experienced a steeper learning curve using BSC systems compared to CSC, as the BSC job submission system is quite different from that used at EBI and CSC. Initially, we attempted to use the same strategy as at CSC and EBI, which is to script submission commands to the queue system (via the command 'mnsubmit'). However, this caused an overload of the BSC system, leading to many empty files due to premature termination of the jobs. Instead, EBI used the BSC-developed Grid Superscalar (GS) system, with help from BSC support. EBI also had to specify a wallclock time limit, as with CSC, something which is not necessary at EBI.

First of all, EBI repeated the benchmarking of 10 000 sequences against TIGRFAMs 7.0 using EBI's submission scripts rather than GS; this eventually completed successfully, after re-submission of a small number of jobs.


Next, the UniProt decoy database was run against TIGRFAMs 7.0, as at CSC, and additionally against PIRSF 2.70, both using GS. The run against TIGRFAMs 7.0 highlighted the difficulty of running multiple similar searches concurrently on MareNostrum: the volume of HMMER searches attempting to access the TIGRFAMs HMM file was so great that it caused disk contention issues, leading to the maximum allowed wallclock time being exceeded on many of the jobs. These issues had not arisen earlier because the scale of the job was considerably larger than anything run previously in testing, and we were attempting to push jobs through in a shorter time period. The solution tried for this problem was to create multiple copies of the database files and limit the number of jobs run in parallel, again with help from personnel at BSC.
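
A minimal sketch of that mitigation, with hypothetical file names and limits: each search is pointed at one of several physical copies of the HMM library in round-robin fashion, and jobs are released in capped batches so that fewer processes hit the same file at once. In the TFS the jobs were dispatched through GS rather than run as local subprocesses as shown here.

    import subprocess
    from itertools import cycle

    # Several physical copies of the library spread the read load across files;
    # the copy count, chunk names and parallelism cap are all assumed values.
    DB_COPIES = cycle(["TIGRFAMs_7.0.copy%d.bin" % i for i in range(4)])
    CHUNKS = ["decoy_chunk_%04d.fasta" % i for i in range(1, 201)]
    MAX_PARALLEL = 64

    for start in range(0, len(CHUNKS), MAX_PARALLEL):
        batch = CHUNKS[start:start + MAX_PARALLEL]
        procs = [subprocess.Popen(["hmmpfam", "--cpu", "1", db, chunk])
                 for chunk, db in zip(batch, DB_COPIES)]
        for proc in procs:
            proc.wait()   # wait for the whole batch before releasing the next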

The use of both the CSC and BSC clusters highlighted in particular that problems sometimes arise with large-scale parallel computes which cannot be anticipated or predicted by testing on a smaller sub-set of the data.

Comparing performance of Murska@CSC versus Grid Superscalar@BSC

The UniProt decoy database was run against the TIGRFAMs 7.0 library using hmmpfam, with results as follows:

Run location          Total CPU time (hours)   Total elapsed run time (hours)   Total wallclock time (hours)
Murska@CSC            50 695                   45 345                           453
GridSuperscalar@BSC   53 043                   125 140                          208

Using M-Grid

The M-grid (Finnish grid) resource was accessed via CSC in order to compare performance on the grid with performance on clusters. This was not part of the original TFS; however, it was agreed that it was important to explore the possibilities of utilising grid resources as part of the ELIXIR infrastructure.

To be able to compare grid performance benchmarking directly with cluster performance benchmarking, the number of processors on the grid should have been fixed. This measurement would have given us the average performance of CPUs on the grid and the overhead produced by the grid middleware. However, it would not have given the full picture of the capabilities of the grid for HMMER computations. The idea, in terms of wallclock time, is that once an application is enabled on the grid, there is a large pool of processors to draw from for the tasks. In the actual operative situation, ELIXIR would ideally only need to define when a task must be ready; the resource provider would then scale up the resources to be able to crunch the job in time (see the sketch below).
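
A back-of-the-envelope sizing of that idea, with purely illustrative numbers (queueing delay and middleware overhead are ignored): given a total CPU-hour requirement and a deadline, the pool size the provider must offer follows directly.

    # Deadline-driven sizing: processors needed to finish a workload of
    # total_cpu_hours within deadline_hours at a given parallel efficiency.
    def processors_needed(total_cpu_hours, deadline_hours, efficiency=0.9):
        return int(total_cpu_hours / (deadline_hours * efficiency)) + 1

    # e.g. a 50 000 CPU-hour production run to be ready within 48 hours
    print(processors_needed(50000, 48))   # prints 1158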

The Pfam 23.0 database was run against a test set of 10 000 sequences using hmmsearch, run on the grid via CSC in various configurations.

Run name     Average grid job wallclock time (mins)   Total wallclock time (mins)   Total CPU time (hours)
CSC-grid-1   45                                       96                            62
CSC-grid-2   15                                       57                            57
CSC-grid-3   5                                        40                            49
EBI          N/A                                      82                            55

Each run at CSC was tweaked by CSC support staff, lowering the number of tasks per job to try to find the optimal number to minimise the wallclock time per run on the grid. The CSC grid was easy to use thanks to the gsub interface provided by CSC. Jobs ran quickly and reliably, allowing EBI to perform several runs to investigate the effect of grid job weighting on performance. As can be seen above, the total time experienced by a user with the third configuration was approximately half that of local EBI resources, i.e. roughly twice as fast.

MareNostrum is effectively set up as a grid. However, as stated earlier, the interface was not as simple to use as that set up by CSC, and EBI noticed problems with disk contention when running hmmpfam, which meant throughput was not as high as it could have been. It is possible that these problems could be minimised by working with the BSC support team to configure jobs optimally.

Appendix II: Example Supercomputer service contract

(To be supplied separately)

Appendix III: Potential budgetary information

(To be supplied separately)