[ieee 2009 16th working conference on reverse engineering - lille, france (2009.10.13-2009.10.16)]...

Legacy and Future of Data Reverse Engineering

Jean-Luc Hainaut

Laboratory of Database Engineering, PReCISE Research Centre

University of Namur, Belgium [email protected]

Abstract

Data(base) reverse engineering is the process through which the missing technical and/or semantic schemas of a database (or, equivalently, of a set of files) are reconstructed. If carefully performed, this process allows legacy databases to be safely maintained, extended, migrated to modern platforms or merged with other, possibly heterogeneous, databases. Although this process is mostly pertinent for old databases, which are assumed to be poorly documented, it proves highly useful for recent databases as well, inasmuch as many of them are huge and complex, but poorly designed and insufficiently (if at all) documented.

Compared with standard software reverse engineering, database reverse engineering exhibits some interesting particularities. Firstly, its very goal is to recover the complete specification of a database in such a way that its conversion to another data model can be automated, an ability that is, so far, not achievable for procedural code. Secondly, it makes use of a large variety of information sources, ranging from DDL (data definition language) code analysis to data analysis, program code analysis, program behaviour observation and ontology alignment. Finally, it quickly appears that database reverse engineering requires program understanding techniques, in the same way as serious data-intensive program understanding requires database reverse engineering.

Historically, we can identify three periods in DBRE: discovery, deepening and widening. They more or less correspond to the last three decades.

The first period, the eighties, was mainly devoted to solving the problem of migrating CODASYL databases, IMS databases and standard files to relational technology. The techniques were based on automated DDL code interpretation augmented with some trivial heuristics to elicit undeclared constraints such as implicit foreign keys. Unfortunately, this approach proved insufficient to recover the complete database schemas, since it ignored the many implicit data structures and constraints that were implemented elsewhere, for instance in the procedural code and in user interfaces.
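As an illustration of the kind of trivial heuristic mentioned above, the following minimal sketch proposes implicit foreign keys by name matching: a column is flagged when it bears the same name as the primary key of another table. It is not a reconstruction of any specific tool of that period; the schema representation, the function name candidate_foreign_keys and the sample tables are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical schema as it might be extracted from DDL code:
# table name -> (primary key column, list of all columns)
Schema = Dict[str, Tuple[str, List[str]]]


def candidate_foreign_keys(schema: Schema) -> List[Tuple[str, str, str]]:
    """Return (table, column, referenced_table) triples for every column
    whose name matches the primary key of another table (case-insensitive)."""
    candidates = []
    for table, (_, columns) in schema.items():
        for column in columns:
            for other, (other_pk, _) in schema.items():
                if other != table and column.lower() == other_pk.lower():
                    candidates.append((table, column, other))
    return candidates


# Illustrative (made-up) schema with an undeclared reference ORDERS -> CUSTOMER
schema = {
    "CUSTOMER": ("CUST_ID", ["CUST_ID", "NAME"]),
    "ORDERS": ("ORD_ID", ["ORD_ID", "CUST_ID", "ORD_DATE"]),
}

for table, column, referenced in candidate_foreign_keys(schema):
    print(f"{table}.{column} may reference {referenced}")
```

Such a heuristic only yields candidates. It misses references whose columns are named differently and cannot see constraints enforced solely in program code, which is the limitation described above.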

The main objectives of the second period were to refine elicitation techniques to recover implicit constructs and to develop more flexible (semi-automated) methodologies to address the problem in all its complexity. In particular, sophisticated tool-based techniques for application code analysis and data analysis were designed in order to recover field and record structures, relationships, constraints and is-a hierarchies. In addition, the need for reverse engineering relational databases was acknowledged.
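Data analysis of this kind can be sketched as an inclusion test: if every non-null value of a column also appears in a candidate key of another table, the data do not contradict an implicit foreign key. The example below is a minimal sketch using Python's standard sqlite3 module; the function name inclusion_holds and the customer/orders tables are assumptions made for illustration, and identifiers are interpolated directly into the query, which is acceptable only for trusted schema names.

```python
import sqlite3


def inclusion_holds(conn, child, child_col, parent, parent_col) -> bool:
    """True if every non-null value of child.child_col occurs in parent.parent_col."""
    query = (
        f'SELECT COUNT(*) FROM "{child}" c '
        f'WHERE c."{child_col}" IS NOT NULL '
        f'AND NOT EXISTS (SELECT 1 FROM "{parent}" p '
        f'WHERE p."{parent_col}" = c."{child_col}")'
    )
    (orphans,) = conn.execute(query).fetchone()
    return orphans == 0


# Illustrative (made-up) data with no declared foreign key
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER, name TEXT);
CREATE TABLE orders (ord_id INTEGER, cust_id INTEGER);
INSERT INTO customer VALUES (1, 'A'), (2, 'B');
INSERT INTO orders VALUES (10, 1), (11, 2), (12, 1);
""")

print(inclusion_holds(conn, "orders", "cust_id", "customer", "cust_id"))
```

A successful test is evidence, not proof: the inclusion may hold by accident in the current data, so the hypothesis still needs to be confirmed against program code or domain knowledge.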

In the present decade, the scope of data(base) reverse engineering and the supporting techniques are being considerably extended. Many facts and trends make data reverse engineering both more necessary and more complex by an order of magnitude: the increasing consensus on XML as a data model; the view of the web as an infinite database; the expression of data semantics through ontology technologies; the development of model-driven transformational models of engineering processes; the requirement of maintaining data traceability; the high cost of system (schemas + data + programs) migration; the explosion of web databases developed by unqualified developers; the increasing complexity and size of corporate databases; the need for heterogeneous database integration; the inescapable shortage of legacy database technology skills; the use of dynamic SQL in most web information systems, which makes popular static program analysis techniques practically useless; and the increasing use of ORM (object-relational mapping) environments that bury the database as a mere transparent persistence service.

The future of data(base) reverse engineering is tied to its ability to address these challenges and to contribute to their resolution. Conversely, the future of information system engineering seems, to a large extent, to depend on these solutions.

2009 16th Working Conference on Reverse Engineering

1095-1350/09 $26.00 © 2009 IEEE

DOI 10.1109/WCRE.2009.58

