the technology of the human protein reference database (draft, 2003)
DESCRIPTION
Between 2002 and 2004, I managed the technology team that built the Human Protein Reference Database (http://hprd.org) at the Institute of Bioinformatics in Bangalore and Johns Hopkins University in Baltimore. These are my notes on the tech from sometime in 2003, rediscovered in 2014 when I was looking through old files.
TRANSCRIPT
Human Protein Reference Database
An analysis of the technology powering the database and website,
and how it was developed.
Kiran Jonnalagadda
2
Facts About HPRD
• HPRD is a database of all disease-causing proteins in the human body.
• It is the most comprehensive database of its kind in the world today.
• Unlike most other biological databases, HPRD is protein-centric, not gene-centric.
3
Factors Leading to Choice of DB
• The biologists hadn’t settled on what information was to be stored and therefore the data type definitions changed often.
• Several data types were fairly similar to others but not the same.
• Future extensions had to be built by tech-savvy biologists with minimal assistance from programmers.
4
What We Used
• The Zope application server, comprising:
– The Web publishing object framework.
– ZODB, the object database storage system.
– ZCatalog, the indexing and search system.
– ZEO, the stand-alone database server for multiple front-end Web servers.
5
Why an RDBMS Was Not Suited
• Data type definition changed frequently. In an RDBMS, this would have meant redefining tables every week.
• The code currently has about forty data classes. Imagine having that many data tables, plus tables for relationships between them, all under frequent revision.
6
How Zope Handled These Issues
• Zope is built on Python, which offers dynamic data structures.
• ZODB uses this ability to make the entire database look like one large data structure, transparently swapping unused parts to disk and recovering them as needed.
• ZCatalog indexes data for searching.
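To make the "one large data structure" idea concrete, here is a minimal sketch. It uses Python's standard-library shelve module as a stand-in for ZODB; shelve is not ZODB and has none of its transactions or caching, but it shows the same basic experience of treating a disk-backed store like a dictionary. The record keys and field names are invented for illustration.

```python
import os
import shelve
import tempfile


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "proteins")

        # First session: store records as ordinary Python dictionaries.
        with shelve.open(path) as db:
            db["P1"] = {"name": "example protein", "aliases": ["EP1"]}
            # A later record carries an extra field; there is no table to redefine.
            db["P2"] = {
                "name": "another protein",
                "aliases": [],
                "diseases": ["example disease"],
            }

        # Second session: the records come back from disk as the same structures.
        with shelve.open(path) as db:
            return dict(db)


records = demo()
print(sorted(records))
```

The point of the sketch is the second record: it has a field the first one lacks, and nothing had to be migrated or redeclared to store it.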
7
At Zope’s Core is Python
• Python is a dynamic language.
• When I say dynamic, I mean everything is dynamic!
• Code, variables, classes, modules, everything can be modified at run-time.
• Most of Zope is built around this ability. Zope could not have been implemented in another language.
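A short sketch of what "everything can be modified at run-time" looks like in practice. It is written in present-day Python syntax, and the class and attribute names (Protein, molecular_weight, Enzyme) are made up for the example rather than taken from the HPRD code.

```python
class Protein:
    def __init__(self, name):
        self.name = name


p = Protein("example protein")

# Add a new field to one existing instance at run time.
p.molecular_weight = 1234


# Add a method to the class; existing instances pick it up immediately.
def describe(self):
    return "%s (%s)" % (self.name, getattr(self, "molecular_weight", "unknown"))


Protein.describe = describe

# Create a whole new class at run time with type().
Enzyme = type("Enzyme", (Protein,), {"kind": "enzyme"})
e = Enzyme("example enzyme")

print(p.describe())
print(e.describe(), e.kind)
```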
8
Data Storage in Zope
• In Zope, data is stored in instances of a data class.
• The data class has variables, which are like fields, and methods, which manipulate data.
• Instances of a data class (objects) are stored in the ZODB, making the database.
• Objects can contain other objects, forming hierarchies.
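The bullets above can be sketched in plain Python. This is an in-memory illustration only: it does not use Zope or ZODB, and DataObject, Protein, Interaction and their fields are hypothetical names chosen for the example.

```python
class DataObject:
    """Base for data classes: holds named child objects."""

    def __init__(self, id):
        self.id = id
        self.children = {}
        self.parent = None

    def add(self, child):
        # Containment: the child becomes part of this object's hierarchy.
        child.parent = self
        self.children[child.id] = child
        return child

    def path(self):
        # Walk up to the root to build the object's position in the hierarchy.
        parts = []
        node = self
        while node is not None:
            parts.append(node.id)
            node = node.parent
        return "/".join(reversed(parts))


class Protein(DataObject):
    def __init__(self, id, name):
        super().__init__(id)
        self.name = name  # a variable, which acts like a field

    def title(self):  # a method that works on the fields
        return self.name.upper()


class Interaction(DataObject):
    def __init__(self, id, partner):
        super().__init__(id)
        self.partner = partner


root = DataObject("hprd")
protein = root.add(Protein("protein-1", "example protein"))
interaction = protein.add(Interaction("interaction-1", "protein-2"))
print(interaction.path())
```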
9
Components of Zope
• ZServer (formerly Medusa)
– Handles incoming requests.
– Does HTTP, FTP, WebDAV, XML-RPC; soon SOAP.
• ZPublisher
– Maps URLs to objects and handles security.
• ZODB (Zope Object DataBase)
– Stores objects on disk in a transactional DB.
• ZEO (Zope Enterprise Objects)
– ZODB server for multiple Zope front-end servers.
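The "maps URLs to objects" step can be illustrated with a toy traversal. This is a simplified sketch loosely modelled on the idea, not ZPublisher's actual code or API: Node, Page, traverse and publish are hypothetical names, and security checks are left out entirely.

```python
class Node:
    """A container object with named children."""

    def __init__(self, **children):
        self.children = children

    def index_html(self):
        return "listing: " + ", ".join(sorted(self.children))


class Page:
    """A leaf object that renders some text."""

    def __init__(self, text):
        self.text = text

    def index_html(self):
        return self.text


def traverse(root, url_path):
    # Each URL segment names a child of the object found so far.
    obj = root
    for segment in url_path.strip("/").split("/"):
        if not segment:
            continue
        children = getattr(obj, "children", {})
        if segment not in children:
            raise KeyError("not found: " + segment)
        obj = children[segment]
    return obj


def publish(root, url_path):
    # Call a default method on whatever object the URL resolved to.
    return traverse(root, url_path).index_html()


site = Node(proteins=Node(p1=Page("example protein page")))
print(publish(site, "/proteins/p1"))
print(publish(site, "/"))
```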
10
Security in Zope
• Security is fine grained.
• Security is defined around four concepts:
– Users, Roles, Permissions and Hierarchies.
• A user is assigned one or more roles.
• A role is assigned a set of permissions.
• This set can be reassigned at different positions in the hierarchy.
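A minimal sketch of how the four concepts fit together, assuming a much simpler model than Zope's real one. Folder, permissions_for and allowed are hypothetical names, and the role and permission names are only examples.

```python
class Folder:
    def __init__(self, parent=None, role_permissions=None):
        self.parent = parent
        # Local settings: role name -> set of permission names.
        self.role_permissions = role_permissions or {}

    def permissions_for(self, role):
        # Walk up the hierarchy until some folder defines this role's permissions.
        node = self
        while node is not None:
            if role in node.role_permissions:
                return node.role_permissions[role]
            node = node.parent
        return set()


def allowed(user_roles, folder, permission):
    # A user may act if any of their roles carries the permission here.
    return any(permission in folder.permissions_for(role) for role in user_roles)


root = Folder(role_permissions={"Anonymous": {"View"}, "Curator": {"View", "Edit"}})
# The drafts folder reassigns Anonymous to an empty permission set.
drafts = Folder(parent=root, role_permissions={"Anonymous": set()})

users = {"visitor": ["Anonymous"], "annotator": ["Anonymous", "Curator"]}

print(allowed(users["visitor"], root, "View"))
print(allowed(users["visitor"], drafts, "View"))
print(allowed(users["annotator"], drafts, "Edit"))
```

In this sketch the drafts folder overrides only the Anonymous role, so the Curator role's permissions are still picked up from the root.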
11
Security Outside Zope
• Zope’s security mechanism is limited to the Web front.
• It is applied only to objects that directly interface with the end-user.
• Code written in a module in the filesystem has no security restrictions. It can do anything.
12
Limitations in Zope
• The API for creating extensions (called Products) is complicated and poorly documented.
• The Property Manager interface is too primitive. It only handles the very basic data types such as strings, integers, boolean fields, selection lists and multi-line text.
13
Our Extensions to Zope
• A framework for separating Zope specifics from our data types, making it much simpler to add new data types.
• An extended property management system that could handle changes in data type definitions and automatically migrate data.
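The notes do not describe how the extended property system worked, so the following is only a guess at the general shape of such a mechanism: a declared list of fields, and a migration step that fills in defaults and carries values across renames. Field, migrate, PROTEIN_SCHEMA and all field names are hypothetical.

```python
import copy


class Field:
    """One property in a data type definition (hypothetical helper)."""

    def __init__(self, name, default=None, renamed_from=None):
        self.name = name
        self.default = default
        self.renamed_from = renamed_from


def migrate(stored, schema):
    """Return a copy of `stored` brought in line with `schema`."""
    data = dict(stored)
    for field in schema:
        old = field.renamed_from
        if old is not None and old in data:
            # Carry the value over from the field's previous name.
            value = data.pop(old)
            data.setdefault(field.name, value)
        # Fields added since the record was stored get their default.
        data.setdefault(field.name, copy.deepcopy(field.default))
    return data


PROTEIN_SCHEMA = [
    Field("protein_name", default="", renamed_from="gene_name"),
    Field("length", default=0),
    Field("localization", default="unknown"),
]

old_record = {"gene_name": "EXAMPLE1", "length": 120}
new_record = migrate(old_record, PROTEIN_SCHEMA)
print(new_record)
```

Running the migration whenever a record is loaded would let the definition change without a separate conversion pass over the whole database, which is one plausible reading of "automatically migrate data".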
Part II
User Interface
The rationale behind decisions affecting how a user experiences the database.
15
User Interface Design
• We started with exposing Zope’s hierarchy as the public user interface
• But there were some elements such as the category browser and the
16
Templates for the Web UI
• Choice of DTML and ZPT for templates.
• ZPT for templating system.
Part III
Project Management Lessons
What we learnt about managing a project across continents and distant time zones.
18
Project Management Issues 1
• We learnt the hard way that a project manager’s place is with his team, not with the client.
• Productivity suffers in the absence of an effective collaboration tool.
• E-mail and instant messengers are not effective collaboration tools.
19
Project Management Issues 2
• Collaboration over e-mail imposes the burden of articulation on the communicator, which many dislike and therefore avoid.
• Instant messaging prevents collecting thoughts before presenting them and is therefore a poor planning tool.
20
Collaboration Tools
• We experimented with several collaboration systems, with varying effectiveness:
– Phone calls.
– Instant messengers.
– Wikis.
– Issue tracking software.
– Mailing lists.
21
Phone Calls
• Next best thing to face-to-face discussions.
• But only connect two people unless non-standard equipment is used.
• International calls are usually too expensive for the resulting gain.
22
Instant Messengers
• Provide critical communication between geographically distributed team members.
• But the pressure of maintaining continuity in a conversation hinders pausing to gather thoughts.
• Typing is much slower than talking. Therefore little else gets done alongside.
23
Wikis
• The easy hyperlinking system of a wiki combined with structured text makes presenting information a snap.
• With a little code thrown in, Wikis could make a wonderful project management tool.
• A changed page notification system is needed or changes go unnoticed.
24
Issue Tracking Software
• We use Bugzilla to track issues.
• But in eight months of use, only 30 issues have been reported through it.
• The other few hundred were reported over e-mail, instant messengers and in person.
• Clearly, the problem is with Bugzilla’s usability. The search for a new system is on.
25
Mailing Lists
• E-mail is push media: the latest is always on top of your inbox.
• E-mail makes an effective to-do list in an interface the user is comfortable with.
• Mailing lists are e-mail in broadcast mode.
• Mailing lists have been the most effective collaboration tool we’ve used so far.
26
Issues With Programmers
• Programmer skill levels and attitudes vary.
• C programmers tend to write C code in Python.
• PHP programmers tend to write PHP code in Python.
• Learning Python is easy but thinking in Python takes a long time.
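As an illustration of "C code in Python", compare two ways of doing the same small task. Both functions are invented for this example and are not from the HPRD code.

```python
names = ["alpha", "beta", "gamma"]


def lengths_c_style(items):
    # Index-driven loop, as one would write it in C.
    result = []
    i = 0
    while i < len(items):
        result.append(len(items[i]))
        i = i + 1
    return result


def lengths_pythonic(items):
    # Iterate over the items directly with a list comprehension.
    return [len(item) for item in items]


print(lengths_c_style(names))
print(lengths_pythonic(names))
```

Both produce the same result; the second simply says what is wanted instead of managing an index by hand.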
27
Programming Tools We Used
• CVS for source control.
• ViewCVS for a Web front-end to CVS.
• Vim in GUI mode for source editing (preferred editor of everyone in the team).
• The print statement for debugging.
28
Tools We Should Have Used
• WingIDE is a $35 piece of software that provides an interactive Python debugger usable with Zope. It would have paid for itself within minutes of use, given the programmer hours we instead spent debugging with the print statement.
Part IV
Things Needing Fixing
Mistakes we made during development, how they affect things now, and how they can be fixed.
30
Naming Conventions
• We started by assuming HPRD was gene-centric and named several things GeneSomething.
• In code, this can be considered just an identifier.
• But in a URL, such names can confuse users, so they need renaming.
31
Reusable Modules
• All of the code currently sits in one directory.
• Several important pieces are general-purpose and have nothing to do with how they are being used in HPRD.
• These modules could be separated and contributed independently to the open source code pool.
32
Data in Code
• There are bits of implementation-specific data embedded in code in some places, particularly related to graph generation.
• These were introduced as quick patches for a temporary problem but have remained in place for months now.
• These need to be taken out so that the code is truly reusable.
33
Documentation
• DocStrings needed in code.
• Consistent language in DocStrings.
• HTML documentation files to be distributed with code.