querying models for historical-legal language databases

Costanza Badii, Querying models for historical-legal language databases. Report of research program at University of Missouri-Rolla (USA). Istituto di Teoria e Tecniche dell’Informazione Giuridica, Consiglio Nazionale delle Ricerche
SHORT TERM MOBILITY PROGRAM 2005
QUERYING MODELS FOR HISTORICAL-LEGAL LANGUAGE DATABASES
REPORT OF RESEARCH PROGRAM AT UNIVERSITY OF MISSOURI-ROLLA (USA)
Costanza Badii Firenze - 2005
Technical Report (Rapporto tecnico) no. 8/2005
Costanza Badii Istituto di Teoria e Tecniche dell’Informazione Giuridica
ITTIG CNR Florence, Italy
After searching, analyzing, and studying the material found in the Computer Science Department of the University of Missouri-Rolla, we identified the need to document the PHP language and the MySQL system. These have been used to build LLI (Italian Legislative Language), LGI (Italian Legal Lexicon), and Index (Legal Language Subject Index), the legal lexical databases of ITTIG CNR that are the subject of our activity and implementation [1]. The aim is to understand these languages and systems better and to support their application to our projects. This research is interdisciplinary and requires very clear communication between the technicians who take care of the software and the application of the databases and the people who analyze and study the historical semantic evolution of legal language. Information technology techniques are important tools for retrieving information from a legal lexical corpus. The huge quantity of data can be organized, stored, and retrieved through different methods of query. Databases not only store large amounts of data but also impose an organization on the data, which gives researchers and application developers easy access. The report takes a practical approach, with some examples, which is more useful for understanding these applications. It is important to consider the latest versions of projects, texts, and papers, because in this sector it is necessary to know the most recent developments. Some topics are detailed in the appendix (Appendix 1: Database terminology).
[1] About these ITTIG projects, see C. Badii, P. Mariani, Methods of digital access for language documentation, in Proceedings LREC 2004 (4th International Conference on language resources and evaluation), Lisbon, 24th-30th May, 2004, vol. II, pp. 461-64, Artipol, Lisbon, 2004; C. Badii, P. Mariani, Resources and tools for electronic legal language documentation, in Proceedings of L&T'05 (Language & Technology Conference 2005), Poznan, Poland, 21st-23rd April 2005. The databases are available online: http://www.ittig.cnr.it/BancheDatiGuide/lli/Index.htm (LLI); http://w3.ittig.cnr.it/vocanet/ricerca.asp (LGI); http://www.ittig.cnr.it/BancheDatiGuide/vgi (Index). The project leader is Dr. Paola Mariani Biagini, senior researcher at CNR; the software was developed by the ITTIG technical team and by AdActa S.r.l., Florence.
1. WHAT IS THE PHP LANGUAGE
1.1 PHP's Place in the Web World
PHP is a programming language that's used mostly for building web sites. Instead of a PHP program running on a desktop computer for the use of one person, it typically runs on a web server and is accessed by lots of people using web browsers on their own computers. This section explains how PHP fits into the interaction between a web browser and a web server [2].
When you sit down at your computer and pull up a web page using a browser such as Internet Explorer, you cause a little conversation to happen over the Internet between your computer and another computer. This conversation and how it makes a web page appear on your screen is illustrated in Figure 1.
Figure 1. Client and server communication without PHP
[2] PHP is an open source project of the Apache Software Foundation and it's the most popular Apache web server add-on module, with around 53% of the Apache HTTP servers having PHP capabilities. PHP is particularly suited to web database applications because of its integration tools for the Web and database environments. In particular, the flexibility of embedding scripts in HTML pages permits easy integration of HTML presentation and code.
Here's what's happening in the numbered steps of the diagram:
1. You type www.example.com/catalog.html into the location bar of Internet Explorer.
2. Internet Explorer sends a message over the Internet to the computer named www.example.com asking for the /catalog.html page.
3. Apache, a program running on the www.example.com computer, gets the message and reads the catalog.html file from the disk drive.
4. Apache sends the contents of the file back to your computer over the Internet as a response to Internet Explorer's request.
5. Internet Explorer displays the page on the screen, following the instructions of the HTML tags in the page.
Every time a browser asks for http://www.example.com/catalog.html, the web server sends back the contents of the same catalog.html file. The only time the response from the web server changes is if someone edits the file on the server.
When PHP is involved, however, the server does more work for its half of the conversation. Figure 2 shows what happens when a web browser asks for a page that is generated by PHP.
Figure 2. Client and server communication with PHP
Here's what's happening in the numbered steps of the PHP-enabled conversation:
1. You type www.example.com/catalog/yak.php into the location bar of Internet Explorer.
2. Internet Explorer sends a message over the Internet to the computer named www.example.com asking for the /catalog/yak.php page.
3. Apache, a program running on the www.example.com computer, gets the message and asks the PHP interpreter, another program running on the www.example.com computer, "What does /catalog/yak.php look like?"
4. The PHP interpreter reads the file /usr/local/www/catalog/yak.php from the disk drive.
5. The PHP interpreter runs the commands in yak.php, possibly exchanging data with a database program such as MySQL.
6. The PHP interpreter takes the yak.php program output and sends it back to Apache as an answer to "What does /catalog/yak.php look like?"
7. Apache sends the page contents it got from the PHP interpreter back to your computer over the Internet in response to Internet Explorer's request.
8. Internet Explorer displays the page on the screen, following the instructions of the HTML tags in the page.
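The two conversations above can be sketched in a few lines of code. The following is only an illustration, written in Python rather than PHP so that it runs without a web server. The in-memory file table, the yak_script function, and handle_request are hypothetical names introduced here; they are not part of Apache or PHP.

```python
from datetime import date

# Hypothetical in-memory "disk" standing in for the server's document root.
STATIC_FILES = {"/catalog.html": "<html><body>Catalog</body></html>"}


def yak_script(today):
    """Stands in for yak.php: the page is built anew on every request."""
    return f"<html><body>Yak catalog for {today.isoformat()}</body></html>"


# Paths that must be handed to the interpreter instead of read from disk.
SCRIPTS = {"/catalog/yak.php": yak_script}


def handle_request(path, today):
    """Plays the role of the web server in Figures 1 and 2."""
    if path in SCRIPTS:
        # Dynamic case: run the script and return whatever it outputs.
        return SCRIPTS[path](today)
    if path in STATIC_FILES:
        # Static case: return the stored file contents unchanged.
        return STATIC_FILES[path]
    return "<html><body>404 Not Found</body></html>"
```

Calling handle_request("/catalog.html", ...) returns the same text whatever the date, while the response for "/catalog/yak.php" depends on the input passed to the script.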
"PHP" is a programming language. Something in the web server reads your PHP programs, which are instructions written in this programming language, and figures out what to do. The "PHP interpreter" follows your instructions. Programmers often say "PHP" when they mean either the programming language or the interpreter.
If PHP (the programming language) is like English (the human language), then the PHP interpreter is like an English-speaking person. The English language defines various words and combinations that, when read or heard by an English-speaking person, translate into various meanings that cause the person to do things such as feel embarrassed, go to the store to buy some milk, or put on pants. The programs you write in PHP (the programming language) cause the PHP interpreter to do things such as talk to a database, generate a personalized web page, or display an image.
PHP is called a server-side language because, as Figure 2 illustrates, it runs on a web server. Languages and technologies such as JavaScript and Flash, in contrast, are called client-side because they run on a web client (like a desktop PC). The instructions in a PHP program cause the PHP interpreter on a web server to output a web page. The instructions in a JavaScript program cause Internet Explorer, while running on your desktop PC, to do something such as pop up a new window. Once the web server has sent the generated web page to the client (Step 7 in Figure 2), PHP is out of the picture. If the page content contains some JavaScript, then that JavaScript runs on the client but is totally disconnected from the PHP program that generated the page.
A plain HTML web page is like, for example, the "sorry you found a cockroach in your soup" form letter you might get after dispatching an angry complaint to a bug-infested airline. When your letter arrives at airline headquarters, the overburdened secretary in the customer service department pulls the "cockroach reply letter" out of the filing cabinet, makes a copy, and puts the copy in the mail back to you. Every similar request gets the exact same response.
In contrast, a dynamic page that PHP generates is like a postal letter you write to a friend across the globe. You can put whatever you like down on the page - diagrams, haikus, stories. The content of your letter is tailored to the specific person to whom it's being sent. Once you put that letter in the mailbox, however, you can't change it any more. It wings its way across the globe and is read by your friend. You don't have any way to modify the letter as your friend is reading it.
In other words, with a dynamic page, when we look for information by typing a word into an HTML page and clicking the “search” button, the word is sent to the server, which uses it to run the query and organize the results.
These results are formatted as HTML and sent back to the client program that made the request.
So the server builds the HTML page dynamically, depending on the input we gave for the query.
For example, in our databases, if we search for the headword “locazione” in LGI (Italian Legal Lexicon), we obtain HTML pages containing the results for that word; if we look for the term “enfiteusi”, we get different results and different HTML pages.
This means that every query produces different results and different HTML pages.
A static page, by contrast, is always the same; the home page of LGI, for example, does not change.
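The search scenario just described can be sketched as follows. This is a minimal illustration in Python, with an in-memory SQLite database standing in for PHP and MySQL. The headword table, its columns, and the search_page function are assumptions made for the example and do not reproduce the real LGI schema or code.

```python
import sqlite3
from html import escape


def build_demo_db():
    """Create a tiny in-memory database with two headwords (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE headword (id INTEGER PRIMARY KEY, word TEXT NOT NULL)")
    conn.executemany("INSERT INTO headword (word) VALUES (?)",
                     [("locazione",), ("enfiteusi",)])
    return conn


def search_page(conn, term):
    """Run the query for the submitted word and wrap the results in HTML."""
    rows = conn.execute(
        "SELECT id, word FROM headword WHERE word = ? ORDER BY id", (term,)
    ).fetchall()
    # escape() keeps user input from being interpreted as HTML markup.
    items = "".join(f"<li>{escape(word)} (id {row_id})</li>" for row_id, word in rows)
    if not items:
        items = "<li>No results</li>"
    return (f"<html><body><h1>Results for {escape(term)}</h1>"
            f"<ul>{items}</ul></body></html>")


conn = build_demo_db()
```

Each call to search_page with a different word yields a different HTML page, which is the behavior that distinguishes a dynamic page from a static one.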
1.2 Why Use PHP
There are lots of great reasons to write computer programs in PHP. Maybe you want to learn PHP because you need to put together a small web site for yourself that has some interactive elements. Perhaps PHP is being used where you work and you have to get up to speed.
1.3 What's So Great About PHP?
a) PHP Is Free (as in Money)
You don't have to pay anyone to use PHP. Whether you run the PHP interpreter on a beat-up 10-year-old PC in your basement or in a room full of million-dollar "enterprise-class" servers, there are no licensing fees, support fees, maintenance fees, upgrade fees, or any other kind of charge.
Most Linux distributions come with PHP already installed. If yours doesn't, or you are using another operating system such as Windows, you can download PHP from http://www.php.net/.
b) PHP Is Free (as in Speech)
As an open source project, PHP makes its innards available for anyone to inspect. If it doesn't do what you want, or you're just curious about why a feature works the way it does, you can poke around in the guts of the PHP interpreter (written in the C programming language) to see what's what. Even if you don't have the technical expertise to do that, you can get someone who does to do the investigating for you. Most people can't fix their own cars, but it's nice to be able to take your car to a mechanic who can pop open the hood and fix it.
c) PHP Is Cross-Platform and Integrates with Databases
You can use PHP with a web server computer that runs Windows, Mac OS X, Linux, Solaris, and many other versions of Unix. Plus, if you switch web server operating systems, you generally don't have to change any of your PHP programs. Just copy them from your Windows server to your Unix server, and they will still work.
PHP has native connections available to many database systems. In addition to MySQL, you can directly connect to PostgreSQL and mSQL, among others [3]. Using the Open Database Connectivity Standard (ODBC), you can connect to any database that provides an ODBC driver. This includes Microsoft products and many others.
d) PHP Is Widely Used
As of March 2004, PHP is installed on more than 15 million different web sites, from countless tiny personal home pages to giants like Yahoo!. There are many books, magazines, and web sites devoted to teaching PHP and exploring what you can do with it. There are companies that provide support and training for PHP. In short, if you are a PHP user, you are not alone.
e) PHP Hides Its Complexity
You can build powerful e-commerce engines in PHP that handle millions of customers. You can also build a small site that automatically maintains links to a changing list of articles or press releases. When you're using PHP for a simpler project, it doesn't get in your way with concerns that are only relevant in a massive system. When you need advanced features such as caching, custom libraries, or dynamic image generation, they are available. If you don't need them, you don't have to worry about them. You can just focus on the basics of handling user input and displaying output.
f) PHP Is Built for Web Programming
Unlike most other programming languages, PHP was created from the ground up for generating web pages. This means that common web programming tasks, such as accessing form submissions and talking to a database, are often easier in PHP. PHP comes with the capability to format HTML, manipulate dates and times, and manage web cookies - tasks that are often available only as add-on libraries in other programming languages.
g) Performance
PHP is very efficient. Using a single inexpensive server, you can serve millions of hits per day. If you use large numbers of commodity servers, your capacity is effectively unlimited. Benchmarks published by Zend Technologies (http://www.zend.com) show PHP outperforming its competition.
[3] PHP 5 also has a built-in SQL interface to a flat file, called SQLite.
h) Built-in Libraries
Because PHP was designed for use on the Web, it has many built-in functions for performing many useful web-related tasks. You can generate GIF images on the fly, connect to web services and other network services, parse XML, send email, work with cookies, and generate PDF documents, all with just a few lines of code.
i) Ease of Learning PHP
The syntax of PHP is based on other programming languages, primarily C and Perl. If you already know C or Perl, or a C-like language such as C++ or Java, you will be productive using PHP almost immediately.
l) Object-Oriented Support
PHP version 5 [4] has well-designed object-oriented features. If you learned to program in Java or C++, you will find the features (and generally the syntax) that you expect, such as inheritance, private and protected attributes and methods, abstract classes and methods, interfaces, constructors, and destructors. You will even find some less common features such as built-in iteration behavior. Some of this functionality was available in PHP versions 3 and 4, but the object-oriented support in version 5 is much more complete.
m) Source Code
You have access to PHP's source code. With PHP, unlike commercial, closed-source products, if you want to modify something or add to the language, you are free to do so.
You do not need to wait for the manufacturer to release patches. You also don't need to worry about the manufacturer going out of business or deciding to stop supporting a product.
n) Availability of Support
Zend Technologies (www.zend.com), the company behind the engine that powers PHP, funds its PHP development by offering support and related software on a commercial basis.
o) Flexible for integration with HTML
One or more PHP scripts can be embedded into static HTML files and this makes client tier integration easy. On the downside, this can blend the scripts with the presentation.
[4] See the next section on the different versions of PHP.
p) Fast at running scripts
Using its built-in Zend scripting engine, PHP script execution is fast and all components run within the main memory space of PHP (in contrast to other scripting frameworks, in which components are in distinct modules). Our experiments suggest that for tasks of at least moderate complexity, PHP is faster than other popular scripting tools.
1.4 The Different Versions of PHP
1.4.1 PHP4, PHP5
The current version of PHP is PHP4 (Version 4.3.4), which is the version used in LLI, LGI, and Index, the ITTIG projects. PHP5 is available for beta testing at the time of writing as Version 5.0.0b3.
1.4.2 What Is New in PHP 5.0?
PHP 5.0 is worth considering for those moving from one of the PHP 4.x versions. As you would expect in a new major version, it has some significant changes. The Zend engine beneath PHP has been rewritten for this version. Major new features are as follows:
• Better object-oriented support built around a completely new object model;
• Exceptions for scalable, maintainable error handling;
• SimpleXML for easy handling of XML data.
PHP4 included the first release of the Zend engine, version 1.0. This is PHP's scripting engine, which implements the syntax of the language and provides all of the tools needed to run library functions. PHP5 includes a new Zend engine, version 2.0, which is enhanced to address the limitations of version 1.0 and to include new features that have been requested by developers. However, unlike the changes that occurred when PHP3 became PHP4, the changes from PHP4 to PHP5 only affect part of the language. Most code that's written for PHP4 will run without modification under PHP5.
In brief, the following are the major new features in PHP5.
a) New Object Model
PHP4 has a simple object model that doesn't include many of the features that object-oriented programmers expect in an OOP (Object Oriented Programming) language such as destructors, private and protected member
b) Exception Handling
New statements are available that are aimed at improving the robustness of applications when errors occur.
c) Improved memory handling and speed
PHP4 was fast, but PHP5 is faster and makes even better use of memory.
d) New XML support
There were several different tools for working with the eXtensible Markup Language (XML) in PHP4. These tools have been replaced with a single new, robust framework in PHP5.
So we could apply this new version of PHP to the ITTIG databases once the final version has been released.
e) The Improved MySQL library (mysqli)
A new MySQL function library is available in PHP5 that supports MySQL 4. The library has the significant feature that it allows an SQL query to be prepared once and executed many times, and this substantially improves speed if a query is often used [5].
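The idea of preparing a query once and executing it many times can be illustrated as follows. This is a sketch in Python with the standard sqlite3 module, not the mysqli API: sqlite3 compiles a parameterized statement and keeps it in a statement cache, so repeated calls with the same SQL text reuse it, which only approximates an explicit prepare step. The headword table and the lookup_ids function are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE headword (id INTEGER PRIMARY KEY, word TEXT NOT NULL)")
conn.executemany("INSERT INTO headword (word) VALUES (?)",
                 [("locazione",), ("enfiteusi",), ("abbazia",)])

# One SQL text with a placeholder; only the bound value changes between calls.
LOOKUP = "SELECT id FROM headword WHERE word = ?"


def lookup_ids(words):
    """Execute the same parameterized statement once per word."""
    found = {}
    for word in words:
        row = conn.execute(LOOKUP, (word,)).fetchone()
        found[word] = row[0] if row else None
    return found
```

Binding values through placeholders also keeps user input separate from the SQL text, which is a second reason to prefer this style over building query strings by hand.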
2. WHAT IS A RELATIONAL DATABASE
A relational database management system (RDBMS) is an essential tool in many environments, from the more traditional uses in business, research, and education contexts, to newer applications, such as powering search engines on the Internet. However, despite the importance of a good database for managing and accessing information resources, many organizations have found them to be out of reach of their financial resources. Historically, database systems have been an expensive proposition, with vendors charging healthy fees both for software and for support. In addition, because database engines often had substantial hardware requirements to run with any reasonable performance, the cost was even greater.
In recent years, the situation has changed on both the hardware and software sides of the picture. Personal computers have become inexpensive but powerful, and a whole movement has sprung up to write high-performance operating systems for them that are available for the cost of an inexpensive CD, or even free over the Internet. These include several BSD UNIX derivatives as well as various forms of Linux.
[5] You can find out more about what's new in PHP5 from http://www.zend.com/zend/future.php.
Production of free operating systems to drive personal computers to their full capabilities has proceeded in concert with - and to a large extent has been made possible by - the development of freely available tools such as gcc, the GNU C compiler. These efforts to make software available to anyone who wants it have resulted in what is now called the Open Source movement, which has produced many important pieces of software. For example, Apache is the most widely used Web server on the Internet. Other Open Source successes are the Perl general-purpose scripting language and PHP, which is popular largely because of the ease with which it allows dynamic Web pages to be written. These all stand in contrast to proprietary solutions that lock you into high-priced products from vendors that don't even provide source code [6].
2.1 Designing Your Web Database
Now that we understand the basics of PHP, we can begin looking at integrating a database. The advantages of using a relational database instead of a flat file are the following:
• RDBMSs can provide faster access to data than flat files;
• RDBMSs can be easily queried to extract sets of data that fit certain criteria;
• RDBMSs have built-in mechanisms for dealing with concurrent access;
• RDBMSs provide random access to your data;
• RDBMSs have built-in privilege systems.
As concrete examples, a relational database allows you to quickly and easily answer queries about where your customers are from, which of your products is selling the best, or what types of customers spend the most.
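Questions of this kind map directly onto SQL queries. The following sketch uses Python with an in-memory SQLite database in place of MySQL. The table layout, the customer names, the cities, and the amounts are all invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
-- Invented sample rows
INSERT INTO customers VALUES
    (1, 'Customer A', 'City X'), (2, 'Customer B', 'City X'), (3, 'Customer C', 'City Y');
INSERT INTO orders VALUES (1, 1, 20.0), (2, 3, 50.0), (3, 1, 15.0);
""")

# "Where are your customers from?": count the customers in each city.
customers_per_city = conn.execute(
    "SELECT city, COUNT(*) FROM customers GROUP BY city ORDER BY city"
).fetchall()

# "Which customer spends the most?": total the orders of each customer.
top_spender = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY total DESC
    LIMIT 1
""").fetchone()
```

With a flat file, each of these questions would require custom code to read and aggregate the records; here the database engine does the grouping and sorting.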
2.2 Relational Database Concepts
Relational databases are, by far, the most commonly used type of database. They depend on a sound theoretical basis in relational algebra. An example makes these concepts easier to understand.
a) Tables
[6] We introduce relational databases here before going into MySQL. For more information about relational databases and their terminology, see Appendix 1.
Relational databases are made up of relations, more commonly called tables. A table is exactly what it sounds like - a table of data. If you've used an electronic spreadsheet, you've already used a table.
Look at the table in Figure 3. It contains the names and addresses of the customers of a bookstore named Book-O-Rama.
Figure 3. Book-O-Rama's customer details are stored in a table.
The table has a name (Customers); a number of columns, each corresponding to a different piece of data; and rows that correspond to individual customers.
b) Columns
Each column in the table has a unique name and contains different data. Additionally, each column has an associated data type. For instance, in the Customers table in Figure 3 you can see that CustomerID is an integer and the other three columns are strings. Columns are sometimes called fields or attributes.
c) Rows
Each row in the table represents a different customer. Because of the tabular format, each row has the same attributes. Rows are also called records or tuples.
d) Values
Each row consists of a set of individual values that correspond to columns. Each value must have the data type specified by its column.
e) Keys
We need to have a way of identifying each specific customer. Names usually aren't a very good way of doing this. If you have a common name, you probably understand why. Consider Julie Smith from the Customers table, for example. If you open your telephone directory, you may find too many listings of that name to count.
We could distinguish Julie in several ways. Chances are, she's the only Julie Smith living at her address. Talking about "Julie Smith, of 25 Oak Street, Airport West" is pretty cumbersome and sounds too much like legalese. It also requires using more than one column in the table.
What you have done in this example, and what you will likely do in your applications, is assign a unique CustomerID. This is the same principle that leads to your having a unique bank account number or club membership number. It makes storing your details in a database easier. An artificially assigned identification number can be guaranteed to be unique. Few pieces of real information, even if used in combination, have this property.
The identifying column in a table is called the key or the primary key. A key can also consist of multiple columns. If, for example, you choose to refer to Julie as "Julie Smith, of 25 Oak Street, Airport West," the key would consist of the Name, Address, and City columns and could not be guaranteed to be unique.
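The notions of table, column, row, value, and key can be made concrete with a short sketch. It uses Python with an in-memory SQLite database instead of MySQL. The lower-case column names and the add_customer helper are assumptions for the example; the second address used in the usage note is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table, four typed columns; customer_id is the primary key.
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- the key: unique for every row
        name        TEXT NOT NULL,
        address     TEXT NOT NULL,
        city        TEXT NOT NULL
    )""")

# One row: a tuple of values, one per column.
conn.execute("INSERT INTO customers VALUES (?, ?, ?, ?)",
             (1, "Julie Smith", "25 Oak Street", "Airport West"))


def add_customer(customer_id, name, address, city):
    """Return True if the row was stored, False if the key is already taken."""
    try:
        conn.execute("INSERT INTO customers VALUES (?, ?, ?, ?)",
                     (customer_id, name, address, city))
        return True
    except sqlite3.IntegrityError:
        return False


cursor = conn.execute("SELECT * FROM customers")
column_names = [col[0] for col in cursor.description]
```

A second Julie Smith with a different customer_id is accepted, for instance add_customer(2, "Julie Smith", "1 Example Road", "Elsewhere"), whereas reusing customer_id 1 is rejected. This is why the artificial ID, and not the name, serves as the key.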
Databases usually consist of multiple tables and use a key as a reference from one table to another. Figure 4 shows a second table added to the database. This one stores orders placed by customers. Each row in the Orders table represents a single order, placed by a single customer. You know who the customer is because you store her CustomerID. You can look at the order with OrderID 2, for example, and see that the customer with CustomerID 1 placed it. If you then look at the Customers table, you can see that CustomerID 1 refers to Julie Smith.
Figure 4. Each order in the Orders table refers to a customer from the Customers table.
The relational database term for this relationship is foreign key. CustomerID is the primary key in Customers, but when it appears in another table, such as Orders, it is referred to as a foreign key.
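A foreign key can be declared so that the database itself checks the reference. The sketch that follows uses Python with SQLite in place of MySQL; SQLite only enforces foreign keys after the PRAGMA shown. The order amount and the helper names are invented for the illustration, and the Date column of the Orders table is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked

conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),  -- foreign key
        amount      REAL NOT NULL
    )""")
conn.execute("INSERT INTO customers VALUES (1, 'Julie Smith')")
conn.execute("INSERT INTO orders VALUES (2, 1, 12.99)")  # amount is invented


def customer_for_order(order_id):
    """Follow the foreign key from an order to the customer who placed it."""
    row = conn.execute("""
        SELECT c.name
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
        WHERE o.order_id = ?""", (order_id,)).fetchone()
    return row[0] if row else None


def place_order(order_id, customer_id, amount):
    """Return False when the order points at a customer that does not exist."""
    try:
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)",
                     (order_id, customer_id, amount))
        return True
    except sqlite3.IntegrityError:
        return False
```

Looking up order 2 leads back to Julie Smith, and an order for a nonexistent customer is refused by the database rather than by application code.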
f) Schemas
The complete set of table designs for a database is called the database schema. A schema should show the tables along with their columns, and the primary key of each table and any foreign keys. A schema does not include any data, but you might want to show the data with your schema to explain what it is for. The schema can be shown in informal diagrams as you have done, in entity relationship diagrams or in a text form, such as
Customers(CustomerID, Name, Address, City)
Orders(OrderID, CustomerID, Amount, Date)
Underlined terms in the schema are primary keys in the relation in which they are underlined (here, CustomerID in Customers and OrderID in Orders). Dotted underlined terms are foreign keys in the relation in which they appear with a dotted underline (here, CustomerID in Orders).
g) Relationships
Foreign keys represent a relationship between data in two tables. For example, the link from Orders to Customers represents a relationship between a row in the Orders table and a row in the Customers table.
Three basic kinds of relationships exist in a relational database. They are classified according to the number of elements on each side of the relationship. Relationships can be either one-to-one, one-to-many, or many-to-many [7].
A one-to-one relationship means that one of each thing is used in the relationship. For example, if you put addresses in a separate table from Customers, they would have a one-to-one relationship between them. You could have a foreign key from Addresses to Customers or the other way around (both are not required).
In a one-to-many relationship, one row in one table is linked to many rows in another table. In this example, one Customer might place many Orders. In these relationships, the table that contains the many rows has a foreign key to the table with the one row. Here, you put the CustomerID into the Orders table to show the relationship.
In a many-to-many relationship, many rows in one table are linked to many rows in another table. For example, if you have two tables, Books and Authors, you might find that one book was written by two coauthors, each of whom had written other books, on their own or possibly with other authors. This type of relationship usually gets a table all to itself, so you might have Books, Authors, and Books_Authors. This third table would contain only the keys of the other tables as foreign keys in pairs.
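The Books, Authors, and Books_Authors arrangement can be sketched as follows, using Python with an in-memory SQLite database in place of MySQL. The titles, author names, and helper functions are invented placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books   (book_id   INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
-- The linking table holds only pairs of keys from the two other tables.
CREATE TABLE books_authors (
    book_id   INTEGER NOT NULL REFERENCES books(book_id),
    author_id INTEGER NOT NULL REFERENCES authors(author_id),
    PRIMARY KEY (book_id, author_id)
);
INSERT INTO books   VALUES (1, 'Book One'), (2, 'Book Two');
INSERT INTO authors VALUES (1, 'Author A'), (2, 'Author B');
-- Book One has two coauthors; Author A also wrote Book Two alone.
INSERT INTO books_authors VALUES (1, 1), (1, 2), (2, 1);
""")


def authors_of(title):
    """All authors linked to one book."""
    rows = conn.execute("""
        SELECT a.name
        FROM books AS b
        JOIN books_authors AS ba ON ba.book_id = b.book_id
        JOIN authors AS a ON a.author_id = ba.author_id
        WHERE b.title = ?
        ORDER BY a.name""", (title,)).fetchall()
    return [r[0] for r in rows]


def books_by(name):
    """All books linked to one author."""
    rows = conn.execute("""
        SELECT b.title
        FROM authors AS a
        JOIN books_authors AS ba ON ba.author_id = a.author_id
        JOIN books AS b ON b.book_id = ba.book_id
        WHERE a.name = ?
        ORDER BY b.title""", (name,)).fetchall()
    return [r[0] for r in rows]
```

The composite primary key on the linking table prevents the same book and author pair from being recorded twice.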
We now understand the structure of a relational database.
LGI (Italian Legal Lexicon) and Index (Legal Language Subject Index), created by ITTIG, are composed of different tables linked by different relationships.
[7] This is ER (entity-relationship) modeling. In the model, cardinality refers to the three possible relationships between two entities. A one-to-one relationship means that, for the two entities connected by the line, there is exactly one instance of the first entity for each instance of the second entity. A one-to-many relationship means that, for the two entities connected by the line, there are one or more instances of the second entity for each instance of the first entity. From the perspective of the second entity, any instance of the second entity is related to only one instance of the first entity. A many-to-many relationship means that, for the two entities connected by the line, each instance of the first entity is related to one or more instances of the second entity and, from the other perspective, each instance of the second entity is related to one or more instances of the first entity.
These include: the headword table with word frequencies (INDEXFREQ); the lexical variants table (INDEXVARIANTE); the table of technical legal expressions (INDEXFRASEOLOGIA); the table of images of the original digital documents (INDEXIMM), with the related data (legal source, author, language, dating), which is what makes these databases distinctive [8]; and the table of the different meanings of each headword (INDEXACC).
Each table has its own ID, which identifies each row.
Figure 5. Relationships among LGI information units
It is worth going into detail about the relationships that can exist among these tables.
If we analyze the relationship between a headword and its lexical variants (or its expressions), for example, we see that it is one-to-many, because for each headword we can find zero, one, or many variants (or legal expressions).
So all the tables are linked to the headword table in a one-to-many relationship.
For example, the word “contingente” has no variants; “enfiteusi” has only the variant “emfiteusi”; “abbazia” has several variants: “abazia”, “abbatia”, “abbadia”, “abbacia” [9].
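The one-to-many link between a headword and its variants can be sketched with the three examples just given. The code uses Python with an in-memory SQLite database instead of MySQL. The tables headword and variant and their columns are hypothetical simplifications and do not reproduce the actual layout of INDEXFREQ or INDEXVARIANTE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE headword (
    headword_id INTEGER PRIMARY KEY,
    word        TEXT NOT NULL UNIQUE
);
-- Each variant row points at exactly one headword (the "many" side).
CREATE TABLE variant (
    variant_id  INTEGER PRIMARY KEY,
    headword_id INTEGER NOT NULL REFERENCES headword(headword_id),
    form        TEXT NOT NULL
);
INSERT INTO headword VALUES (1, 'contingente'), (2, 'enfiteusi'), (3, 'abbazia');
INSERT INTO variant (headword_id, form) VALUES
    (2, 'emfiteusi'),
    (3, 'abazia'), (3, 'abbatia'), (3, 'abbadia'), (3, 'abbacia');
""")

# LEFT JOIN keeps headwords that have no variant at all (count 0).
variant_counts = dict(conn.execute("""
    SELECT h.word, COUNT(v.variant_id)
    FROM headword AS h
    LEFT JOIN variant AS v ON v.headword_id = h.headword_id
    GROUP BY h.headword_id"""))


def variants_of(word):
    """List the variant forms recorded for one headword."""
    rows = conn.execute("""
        SELECT v.form
        FROM headword AS h
        JOIN variant AS v ON v.headword_id = h.headword_id
        WHERE h.word = ?
        ORDER BY v.variant_id""", (word,)).fetchall()
    return [r[0] for r in rows]
```

The counts come out as zero, one, and four, matching the three cases of the one-to-many relationship.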
[8] The ITTIG databases are distinctive because they contain the digitized original historical-legal documents. Some dictionaries of legal language can be found in libraries, but none of them links to image reproductions of the original documents.
[9] These examples are in Italian, but the technique could be applied to other languages or to a multilingual database.
Consider a headword that has different legal meanings: the word is linked to each of its meanings, and each meaning is linked to the images of the documents that contain it.
For example, the headword “contingente” has four different meanings: military troop (noun), tax (noun), relating to a particular and current event (adjective), and relating to a share or part (adjective). There are 68 occurrences, spanning the period from 1322 to 1967.
So in the database each meaning is linked to the images of the documents that contain that particular meaning, and each image is linked only to its own meaning and not to the others.
A many-to-many relationship, by contrast, does not occur here: one row can be linked, as noted above, to one or to many rows, but we cannot imagine many rows (for example, many syntagms) being linked to many other rows (for example, many images).
As another example, the database shows that “contestabile” has one variant, “constabile”, and one legal expression, “gran contestabile”. Digitized images are linked to the corresponding meaning, as the following scheme shows.
Figure 6. Scheme of INDEX information units
So, from these examples we can understand that a relational database is a suitable tool for our goals, because it represents and realizes the relationships among the different records.
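The chain from headword to meaning to document image can be sketched in the same way. The example uses Python with an in-memory SQLite database instead of MySQL. The table and column names are hypothetical and do not reproduce INDEXACC or INDEXIMM. The four glosses of “contingente” come from the discussion above, while the image file names and their assignment to meanings are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE headword (headword_id INTEGER PRIMARY KEY, word TEXT NOT NULL);
CREATE TABLE meaning (
    meaning_id  INTEGER PRIMARY KEY,
    headword_id INTEGER NOT NULL REFERENCES headword(headword_id),
    gloss       TEXT NOT NULL
);
-- Each image row carries a single meaning_id, so it belongs to one meaning only.
CREATE TABLE image (
    image_id   INTEGER PRIMARY KEY,
    meaning_id INTEGER NOT NULL REFERENCES meaning(meaning_id),
    file_name  TEXT NOT NULL
);
INSERT INTO headword VALUES (1, 'contingente');
INSERT INTO meaning VALUES
    (1, 1, 'military troop'),
    (2, 1, 'tax'),
    (3, 1, 'relative to a particular and current event'),
    (4, 1, 'relative to a share or part');
-- Placeholder file names, not real document images
INSERT INTO image VALUES
    (1, 1, 'placeholder_001.jpg'),
    (2, 1, 'placeholder_002.jpg'),
    (3, 2, 'placeholder_003.jpg');
""")


def images_for(word, gloss):
    """Return the images attached to one meaning of one headword."""
    rows = conn.execute("""
        SELECT i.file_name
        FROM headword AS h
        JOIN meaning AS m ON m.headword_id = h.headword_id
        JOIN image   AS i ON i.meaning_id  = m.meaning_id
        WHERE h.word = ? AND m.gloss = ?
        ORDER BY i.image_id""", (word, gloss)).fetchall()
    return [r[0] for r in rows]
```

Because the foreign key sits in the image table, an image can never be attached to two meanings at once, which is the one-to-many constraint described in the text.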
2.3 What Is SQL?
SQL stands for Structured Query Language. It is the standard language for accessing relational database management systems (RDBMSs). SQL is used to store data to and retrieve it from a database. It is used in database systems such as MySQL, Oracle, PostgreSQL, Sybase, and Microsoft SQL Server, among others [10].
There's an ANSI standard [11] for SQL, and database systems such as MySQL generally strive to implement this standard. There are some subtle differences between standard SQL and MySQL's SQL. Some of these differences are planned to become standard in future versions of MySQL, and some are deliberate differences.
You might have heard the terms Data Definition Language (DDL), used for defining databases, and Data Manipulation Language (DML), used for querying databases. SQL covers both of these bases.
You will use the DML aspects of SQL far more frequently because these are the parts that you use to store and retrieve real data in a database.
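The difference between the DDL and DML sides of SQL can be seen in a small sketch. It assumes Python's built-in sqlite3 module in place of a MySQL server, and the customer table and its sample values are purely illustrative. The statements shown (CREATE TABLE, INSERT, UPDATE, SELECT, DELETE) are standard SQL, although details such as data types differ slightly between SQLite and MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure of the database.
conn.execute("""
    CREATE TABLE customer (
        cust_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        address TEXT,
        city    TEXT
    )
""")

# DML: store, change, and retrieve the data itself.
conn.execute(
    "INSERT INTO customer (cust_id, name, address, city) VALUES (?, ?, ?, ?)",
    (1, "Matthew Richardson", "Punt Road", "Richmond"),
)
conn.execute("UPDATE customer SET city = ? WHERE cust_id = ?", ("Melbourne", 1))
rows = conn.execute("SELECT cust_id, name, address, city FROM customer").fetchall()
print(rows)

conn.execute("DELETE FROM customer WHERE cust_id = ?", (1,))
remaining = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(remaining)
```

The single CREATE TABLE statement is the only DDL here; everything after it is DML, which matches the observation that DML is what one uses most of the time.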
2.4 What Is MySQL?
MySQL (pronounced My-Ess-Que-Ell), which is used in the ITTIG projects, is a very fast, robust relational database management system (RDBMS). A database enables you to efficiently store, search, sort, and retrieve data. The MySQL server controls access to your data to ensure that multiple users can work with it concurrently, to provide fast access to it, and to ensure that only authorized users can obtain access. Hence, MySQL is a multiuser, multithreaded server. It uses Structured Query Language (SQL), the standard database query language worldwide. MySQL has been publicly available since 1996 but has a development history going back to 1979. It is the world's most popular open source database and has won the Linux Journal Readers' Choice Award on a number of occasions.
2.4.1 Tools Provided with MySQL
The MySQL distribution includes the following tools:
• A SQL server. This is the engine that powers MySQL and provides access to your databases.
10 Some authors maintain that SQL does not stand for Structured Query Language and is not pronounced “sequel”: in their view it is pronounced as the three-letter acronym S-Q-L and does not stand for anything. 11 American National Standards Institute.
• Client programs for accessing the server. An interactive program allows you to enter queries directly and view the results, and several administrative and utility programs help you run your site. One utility allows you to control the server. Others let you import or export data, check access permissions, and more12.
• A client library for writing your own programs. You can write clients in C because the library is in C, but the library also provides the basis for third-party bindings for other languages.
When you use MySQL, you're actually using two programs, because MySQL operates using a client/server architecture:
• The server program, mysqld, is located on the machine where your databases are stored. It listens for client requests coming in over the network and accesses database contents according to those requests to provide clients with the information they request.
• Clients are programs that connect to the database server and issue queries to tell it what information they want.
The MySQL distribution includes the database server and several client programs.
2.4.2 Benefits of MySQL's client/server architecture
• The server provides concurrency control so that two users cannot modify the same record at the same time. All client requests go through the server, so the server sorts out who gets to do what and when. If multiple clients want to access the same table at the same time, they don't all have to find and negotiate with each other. They just send their requests to the server and let it take care of determining the order in which the requests will be performed.
• You don't have to be logged in on the machine where your database is located. MySQL understands how to work over the Internet, so you can run a client program from wherever you happen to be, and the client can connect to the server over the network. Distance isn't a factor; you can access the server from anywhere in the world. If the server is located on a computer in Canada, you can take your laptop computer on a trip to Iceland and still access your database. Does that mean anyone can get at your data just by connecting to the Internet? No. MySQL includes a flexible security system, so you can allow access only to people who should have it. And you can make sure those people are able to do only what they should. Perhaps John in the billing office should be able to read and update (modify) records, but David at the service desk should be able only to look at them. You can set each person's privileges accordingly. If you do want to run a self-contained system, just set the access privileges so that clients can connect only from the host on which the server is running.
12 To avoid confusion, we should point out that MySQL refers to the entire MySQL RDBMS and mysql is the name of a particular client program. They sound the same if you pronounce them, but they're distinguished here by capitalization and typeface differences. Speaking of pronunciation, MySQL is pronounced "my-ess-queue-ell." You know this because the MySQL Reference Manual says so. On the other hand, SQL is pronounced "sequel" or "ess-queue-ell," depending on who you ask.
In addition to the software provided with MySQL itself, MySQL is used by many talented and capable people who like writing software to enhance their productivity and who are willing to make that software available. The result is that you have access to a variety of third-party tools that make MySQL easier to use or that extend its reach into areas such as Web site development.
2.4.3 The History of MySQL
This story actually goes back to 1979, when MySQL’s inventor, Michael Widenius (a.k.a. Monty), developed an in-house database tool called UNIREG for managing databases. UNIREG is a tty interface builder that uses a low-level connection to an ISAM storage with indexing. Since then, UNIREG has been rewritten in several different languages and extended to handle big databases. It is still available today, but is largely supplanted by MySQL. The Swedish company TcX began developing web-based applications in 1994 and used UNIREG to support this effort. Unfortunately, UNIREG created too much overhead to be successful in dynamically generating web pages. TcX thus began looking at alternatives.
TcX looked at SQL and mSQL (miniSQL), a cheap DBMS that gave away its source code with database licenses, which made it almost open source. At the time, mSQL was still in its 1.x releases and had even fewer features than the currently available version. Most important to Monty, it did not support any indexes. mSQL’s performance was therefore poor in comparison to UNIREG.
Monty contacted David Hughes, the author of mSQL, to see if Hughes would be interested in connecting mSQL to UNIREG’s B+ ISAM handler to provide indexing to mSQL. Hughes was already well on his way to mSQL 2, however, and had his indexing infrastructure in place. TcX decided to create a database server that was more compatible with its requirements.
TcX was smart enough not to try to reinvent the wheel. It built upon UNIREG and capitalized on the growing number of third-party mSQL utilities by writing an API into its system that was, at least initially, practically identical to the mSQL API. Consequently, an mSQL user who wanted to move to TcX’s more feature-rich database server would only have to make trivial changes to any existing code. The code supporting this new database, however, was completely original.
By May 1995, TcX had a database that met its internal needs: MySQL 3.11. A business partner, David Axmark at Detron HB, began pressing TcX to release this server on the Internet and follow a business model pioneered by Aladdin’s L. Peter Deutsch. Specifically, this business model enabled the TcX developers to work on their own projects while their software generated enough income to create a comfortable lifestyle. The result is a very flexible copyright that makes MySQL “more free” than mSQL. Eventually, Monty released MySQL under the GPL so that MySQL is now “free as in speech” and “free as in beer”.
As for the name MySQL, Monty says, “It is not perfectly clear where the name MySQL derives from. TcX’s base directory and a large amount of their libraries and tools have had the prefix ‘My’ for well over ten years. However, my daughter (some years younger) is also named My. So which of the two gave its name to MySQL is still a mystery.”
A few years ago, TcX evolved into the company MySQL AB, at http://www.mysql.com. This change better enabled its commercial control of the development and support of MySQL. MySQL AB, a Swedish company run by MySQL’s core developers, owns the copyright to MySQL, as well as the trademark “MySQL”. Since the initial Internet release of MySQL, it has been ported to a host of Unix operating systems (including Linux, FreeBSD, and Mac OS X), Win32, and OS/2. MySQL AB estimates that MySQL runs on about four million servers.
2.4.4 Some of MySQL's Strengths
MySQL's main competitors are PostgreSQL, Microsoft SQL Server, and Oracle.
a) Performance
MySQL is undeniably fast. You can see the developers' benchmark page at http://web.mysql.com/benchmark.html. Many of these benchmarks show MySQL to be orders of magnitude faster than the competition. In 2002, eWeek published a benchmark comparing five databases powering a web application. The best result was a tie between MySQL and the much more expensive Oracle.
b) Low Cost
MySQL is available at no cost under an open source license or at low cost under a commercial license. You need a license if you want to redistribute MySQL as part of an application and do not want to license your application under an Open Source license. If you do not intend to distribute your application or are working on Free Software, you do not need to buy a license.
c) Ease of Use
Most modern databases use SQL. If you have used another RDBMS, you should have no trouble adapting to this one. MySQL is also easier to set up than many similar products.
d) Portability
MySQL can be used on many different Unix systems as well as under Microsoft Windows.
e) Source Code
As with PHP, you can obtain and modify the source code for MySQL. This point is not important to most users most of the time, but it provides you with excellent peace of mind, ensuring future continuity and giving you options in an emergency.
f) Availability of Support
Not all open source products have a parent company offering support, training, consulting, and certification, but you can get all of these benefits from MySQL AB (www.mysql.com).
g) Speed
MySQL is fast. The developers contend that MySQL is about the fastest database you can get. You can investigate this claim by visiting http://www.mysql.com/benchmark.html, a performance-comparison page on the MySQL Web site.
h) Query language support
MySQL understands SQL, the language of choice for all modern database systems.
i) Capability
l) Connectivity and security
MySQL is fully networked, and databases can be accessed from anywhere on the Internet, so you can share your data with anyone, anywhere. But MySQL has access control so that people who shouldn't see your data can't.
m) Small size
MySQL has a modest distribution size, especially compared to the huge disk space footprint of certain commercial database systems.
n) Open distribution
MySQL is easy to obtain; just use your Web browser. If you don't understand how something works or are curious about an algorithm, you can get the source code and poke around in it. If you don't like how something works, you can change it. If you think you've found a bug, report it; the developers listen.
What about support? Good question; a database isn't much use if you can't get help for it. You'll find that other resources are available and that MySQL has good support. MySQL is freely available, but you're not on your own when you install it:
• The MySQL Reference Manual is included in MySQL distributions and also is available online. The Reference Manual regularly receives good marks in the MySQL user community. This is important, because the value of a good product is diminished if no one can figure out how to use it.
• Training classes and technical support contracts are available from MySQL AB, for those who prefer or require formal arrangements.
• There is an active mailing list to which anyone may subscribe. The list has many helpful participants, including several MySQL developers. As a support resource, many people find this list sufficient for their purposes.
The MySQL community, developers and non-developers alike, is very responsive. Answers to questions on the mailing list often arrive within minutes. When bugs are reported, the developers generally release a fix quickly, and fixes become available immediately over the Internet13.
MySQL is an ideal candidate for evaluation if you are in the database-selection process. You can try MySQL with no risk or financial commitment. Yet, if you get stuck, you can use the mailing list to get help. An evaluation costs some of your time, but that's true no matter what database system you're considering—and it's a safe bet that your installation and setup time for MySQL will be less than for many other systems.
2.4.5 Querying a MySQL Database Using PHP
In PHP, library functions are provided for executing SQL statements, as well as for managing result sets returned from queries, error handling, and controlling how data is passed from the database server to the PHP engine. We give an overview of these functions here and show how they can be combined to access the MySQL server14.
In this section, we introduce the basic PHP scripting techniques to query a MySQL server and produce HTML for display in a web browser.
Connecting to and querying a MySQL server with PHP is a five-step process.
1. Connect to the server with the MySQL function mysql_connect( ). We use three parameters here: the hostname of the database server, a username, and a password. Let's assume here that MySQL is installed on the same server as the scripting engine and, therefore, localhost is the hostname. If the servers are on different machines, you can replace localhost with the domain name of the machine that hosts the database server.
The function mysql_connect( ) returns a connection resource that is used later to work with the server. Many server functions return resources that you pass to further calls. In most cases, the variable type and value of the resource isn't important: the resource is simply stored after it's created and used as required. In Step 3, running a query also returns a resource that's used to access results.
13 As regards security, providing secure transactions using the Internet is a matter of examining the flow of information in your system and ensuring that, at each point, your information is secure. In the context of network security, there are no absolutes. No system is ever going to be impenetrable. By secure, we mean that the level of effort required to compromise a system or transmission is high compared to the value of the information involved. The details of each transaction occurring in your system will vary, depending both on your system design and on the user data and actions that triggered the transaction. The system has three main parts: the user's machine, the Internet, and your system. Obviously, the user's machine and the Internet are largely out of your control.
14 PHP 4.3 and MySQL 4.0 are the stable releases. The MySQL library functions that are discussed here work with those versions. The PHP5 MySQL library functions also work with MySQL 4.0.
2. Select the database. Once you connect, you can select a database to use through the connection with the mysql_select_db( ) function.
3. Run the query on the database using mysql_query( ). The function takes two parameters: the SQL query itself and the server connection resource to use. The connection resource is the value returned from connecting in the first step. The function mysql_query( ) returns a result set resource, a value that can retrieve the result set from the query in the next step.
4. Retrieve a row of results. The function mysql_fetch_array( ) retrieves one row of the result set, taking the result set resource from the third step as the first parameter. Each row is stored in an array $row, and the attribute values in the array are extracted in Step 5. The second parameter is a PHP constant that tells the function to return a numerically accessed array; we explain how array indexing affects query processing later in this section.
A while loop is used to retrieve rows of database results and, each time the loop executes, the variable $row is overwritten with a new row of database results. When there are no more rows to fetch, the function mysql_fetch_array( ) returns false and the loop ends.
5. Process the attribute values. For each retrieved row, a foreach loop is used with a print statement to display each of the attribute values in the current row. For a wine table, for example, there are six attributes in each row: wine_id, wine_name, type, year, winery_id, and description.
The script prints each row on a line, separating each attribute value with a single space character. Each line is terminated with a carriage return using print "\n" and Steps 4 and 5 are repeated.
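The same five-step pattern can be sketched in a runnable form. Note the assumptions: this is not the PHP script described above, and it does not call the PHP MySQL functions. It uses Python's built-in sqlite3 module as a stand-in so that no database server is needed, the function name dump_table is hypothetical, and the two wine rows are invented sample data. Only the six attribute names of the wine table come from the text.

```python
import sqlite3

# Steps 1 and 2: connect and select the database. With SQLite both happen in
# one call; in PHP, mysql_connect() and mysql_select_db() are separate steps.
conn = sqlite3.connect(":memory:")

# Set-up for this sketch only: a wine table with the six attributes named above.
conn.execute("""
    CREATE TABLE wine (
        wine_id INTEGER PRIMARY KEY,
        wine_name TEXT, type TEXT, year INTEGER,
        winery_id INTEGER, description TEXT
    )
""")
conn.executemany(
    "INSERT INTO wine VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Sample Red", "red", 1999, 1, "invented row"),
        (2, "Sample White", "white", 2001, 1, "invented row"),
    ],
)


def dump_table(connection):
    """Return one text line per wine row, attribute values separated by spaces."""
    lines = []
    # Step 3: run the query; the cursor plays the role of the result set resource.
    cursor = connection.execute("SELECT * FROM wine ORDER BY wine_id")
    # Step 4: fetch one row at a time; fetchone() returns None when no rows remain.
    row = cursor.fetchone()
    while row is not None:
        # Step 5: process the attribute values of the current row.
        lines.append(" ".join(str(value) for value in row))
        row = cursor.fetchone()
    return lines


for line in dump_table(conn):
    print(line)
```

As in the PHP description, the row variable is overwritten on every pass through the loop, and the loop ends when the fetch call signals that the result set is exhausted.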
2.4.6 How MySQL can help you
This section describes situations in which the MySQL database system is useful. This will give you an idea of the kinds of things MySQL can do and the ways in which it can help you.
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources, as some examples make clear: it can represent research data, business records, customer requests, sports statistics, sales reports, lexical data, or student grades. However, although database systems can deal with a wide range of information, you don't use such a system for its own sake. If a job is easy to do already, there's no reason to drag a database into it just to use one. A grocery list is a good example; you write down the items to get, cross them off as you do your shopping, and then throw the list away. It's highly unlikely that you'd use a database for this. Even if you have a palmtop computer, you'd probably keep track of a grocery list by using its notepad function rather than its database capabilities.
The power of a database system comes into play when the information you want to organize and manage becomes voluminous or complex and your records become more burdensome than you care to deal with by hand. Clearly this is the case for large corporations processing millions of transactions a day; a database is a necessity under such circumstances. But even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of scenarios in which the use of a database can be beneficial because you needn't have huge amounts of information before that information becomes difficult to manage. Consider the following situations:
• Your carpentry business has several employees. You need to maintain employee and payroll records so that you know whom you've paid and when, and you must summarize those records so that you can report earnings statements to the government for tax purposes. You also need to keep track of the jobs your company has been hired to do and which employees you've scheduled to work on each job.
• You run a network of automobile parts warehouses and need to be able to tell which ones have any given part in their inventories so that you can fill customer orders.
• You're a teacher who needs to keep track of grades and attendance. Each time you give a quiz or a test, you record every student's grade. It's easy enough to write down scores in a gradebook, but using the scores later is a tedious chore. You'd rather avoid sorting the scores for each test to determine the grading curve, and you'd really rather not add up each student's scores when you determine final grades at the end of the grading period. Counting each student's absences is no fun, either.
• If we think about the legal linguistic databases of ITTIG mentioned above, it is clear that a database system is necessary to manage thousands of records. Manually, we could not query and extract a single headword from the context of the original historical documents; or rather, it would take much longer, and we would probably obtain only part of the results that a database system provides.
For example, if we search for the headword “contract” in LLI (Italian Legislative Language), we find automatically, within a few seconds, 2773 occurrences in 63 different legal texts from 1773 to 1996, and we can directly read the context of the document that contains the queried word.
If we conducted the same research manually, we would probably spend many weeks and a great deal of effort to obtain the same results, which would also be subject to human error or distraction.
These scenarios range from situations involving relatively small amounts to large amounts of information. They share the common characteristic of involving tasks that can be performed manually but that could be performed more efficiently by a database system.
What specific benefits should you expect to see from using a database system such as MySQL? It depends on your particular needs and requirements - and as illustrated by the preceding examples, those can vary quite a bit. Let's look at a type of situation that occurs frequently and so is fairly representative of database use. Database management systems are often employed to handle tasks such as those for which people use filing cabinets. Indeed, a database is like a big filing cabinet in some ways, but one with a built-in filing system. There are some important advantages of electronically maintained records over records maintained by hand. For example, if you work in an office setting in which client records are maintained, the following are some of the ways MySQL can help you in its filing system capacity:
• Reduced record filing time. You don't have to look through drawers in cabinets to figure out where to add a new record. You just hand it to the filing system and let it put the record in the right place for you.
This is an important feature for the ITTIG projects, because we frequently have to add or update records.
• Reduced record retrieval time. When you're looking for records, you don't search through each one yourself to find the ones containing the information you want. Suppose you work in a dentist's office. If you want to send out reminders to all patients who haven't been in for their checkup in a while, you ask the filing system to find the appropriate records for you. Of course, you do this differently than if you were talking to another person, to whom you'd say, "Please determine which patients haven't visited within the last 6 months." With a database, you utter a strange incantation:
SELECT last_name, first_name, last_visit FROM patient
WHERE last_visit < DATE_SUB(CURDATE(), INTERVAL 6 MONTH);
As with the previous point, this is another important strength for ITTIG applications, because it reduces the time needed to find the requested information.
• Flexible retrieval order. You needn't retrieve records according to the fixed order in which you store them (by patient's last name, for example). You can tell the filing system to pull out records sorted in any order you like - by last name, insurance company name, date of last visit, and so on.
• Flexible output format. After you've found the records in which you're interested, there's no need to copy the information manually. You can let the filing system generate a list for you. Sometimes you might just print the information. Other times you might want to use it in another program. Or you might be interested only in summary information, such as a count of the selected records. You don't have to count them yourself; the filing system can generate the summary for you.
• Simultaneous multiple-user access to records. With paper records, if two people want to look up a record at the same time, the second person must wait for the first one to put the record back. MySQL gives you multiple-user capability so that both can access the record simultaneously.
This last strength is important when the ITTIG research team is checking and correcting the records of the databases. It is equally useful when several people (students, researchers, and any other interested users) access the ITTIG databases at the same time.
• Remote access to and electronic transmission of records. Paper records require you to be where the records are located or for someone to make copies and send them to you. Electronic records open up the potential for remote access to the records or electronic transmission of them. If someone who needs records doesn't have the same kind of database software you do but does have electronic mail, you can select the desired records and send their contents electronically.
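The retrieval, ordering and summary points in the list above can be tried out with a small sketch. It rests on several assumptions: Python's built-in sqlite3 module replaces MySQL, so the MySQL expression DATE_SUB(CURDATE(), INTERVAL 6 MONTH) is rewritten with SQLite's date() function; the reference date is fixed so that the result is reproducible; and the three patient rows and the helper names overdue and overdue_count are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE patient (
        last_name TEXT, first_name TEXT, last_visit TEXT  -- ISO date string
    )
""")
conn.executemany(
    "INSERT INTO patient VALUES (?, ?, ?)",
    [
        ("Rossi", "Anna", "2005-01-10"),
        ("Bianchi", "Luca", "2005-09-01"),
        ("Verdi", "Carla", "2004-11-20"),
    ],
)

TODAY = "2005-10-01"  # fixed reference date (assumption) instead of the current date


def overdue(order_by="last_name"):
    """Patients whose last visit is more than six months before TODAY."""
    # Column names cannot be bound as parameters, so the sort column is whitelisted.
    if order_by not in {"last_name", "first_name", "last_visit"}:
        raise ValueError(f"unsupported sort column: {order_by}")
    return conn.execute(
        f"""
        SELECT last_name, first_name, last_visit
        FROM patient
        WHERE last_visit < date(?, '-6 months')
        ORDER BY {order_by}
        """,
        (TODAY,),
    ).fetchall()


def overdue_count():
    """Summary information: how many patients are overdue."""
    return conn.execute(
        "SELECT COUNT(*) FROM patient WHERE last_visit < date(?, '-6 months')",
        (TODAY,),
    ).fetchone()[0]


print(overdue())              # sorted by last name
print(overdue("last_visit"))  # same records, different retrieval order
print(overdue_count())        # summary only
```

The stored order of the rows never changes; only the ORDER BY clause does, which is the sense in which the retrieval order is flexible.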
APPENDIX 1: DATABASE TERMINOLOGY
The field of databases has its own terminology. Terms such as database, table, attribute, row, primary key, and relational model have specific meanings. In this section, we present an example of a simple database to introduce the basic components of relational databases, and we list and define selected terms used in the report.
About relational databases
A simple example relational database is shown in Figure 7. This database stores data about wineries and the wine regions they are located in. A relational database is organized into tables, and there are two tables in this example: a winery table that stores information about wineries, and a region table that has information about wine regions. Tables collect together information that is about one object.
Figure 7. An example relational database containing two related tables
Databases are managed by a database management system (DBMS) or database server. A database server supports a database language to create and delete databases and to manage and search data. The database language used by almost all database servers is SQL, a set of statements that define and manipulate data. After creating a database, the most common SQL statements used are INSERT, UPDATE, DELETE, and SELECT, which add, change, remove, and search data in a database, respectively.
A database table may have multiple attributes, each of which has a name. For example, the winery table in Figure 7 has four attributes: winery ID, winery name, address, and region ID. A table contains the data as rows, and a row contains values for each attribute that together represent one related object. (Attributes are also known as fields or columns, while rows are also known as records.)
Consider an example. The winery table has five rows, one for each winery, and each row has a value for each attribute. For example, in the first winery row, the attribute winery ID has a value of 1, the winery name attribute has a value of Moss Brothers, the attribute address has a value of Smith Rd., and the region ID attribute has a value of 3. There is a row for region 3 in the region table and it corresponds to Margaret River in Western Australia. Together this data forms the information about an object, the Moss Brothers Winery in Western Australia.
In our example, the relationship between wineries and regions is maintained by assigning a region ID to each winery row. The region ID value for each region is unique, and this allows you to unambiguously discover which region each winery is located in. Managing relationships using unique values is fundamental to relational databases. Indeed, good database design requires that you make the right choice of which objects are represented as tables and which relationships exist between the tables.
Attributes have data types. For example, in the winery table, the winery ID is an integer, the winery name and address are strings, and the region ID is an integer. Data types are assigned when a database is designed.
Tables usually have a primary key, which is formed by one or more attributes whose values uniquely identify each row in a table. The primary key of the winery table is the winery ID, and the primary key of the region table is the region ID. The values of these attributes aren't usually meaningful to the user; they're just unique ordinal numbers used to identify a row of data and to maintain relationships.
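The winery and region tables, their data types and their primary keys can be written down as a short sketch. The assumptions are as follows: Python's built-in sqlite3 module stands in for MySQL, the column names region_name and state are guesses (the text names only the region ID for the region table), the helper winery_with_region is hypothetical, and only the one winery row and the one region row described above are inserted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.executescript("""
CREATE TABLE region (
    region_id   INTEGER PRIMARY KEY,
    region_name TEXT NOT NULL,   -- assumed column name
    state       TEXT NOT NULL    -- assumed column name
);
CREATE TABLE winery (
    winery_id   INTEGER PRIMARY KEY,
    winery_name TEXT NOT NULL,
    address     TEXT,
    region_id   INTEGER NOT NULL REFERENCES region(region_id)
);
INSERT INTO region VALUES (3, 'Margaret River', 'Western Australia');
INSERT INTO winery VALUES (1, 'Moss Brothers', 'Smith Rd.', 3);
""")


def winery_with_region(winery_id):
    """Follow the region ID stored in a winery row to the matching region row."""
    return conn.execute(
        """
        SELECT w.winery_name, w.address, r.region_name, r.state
        FROM winery AS w
        JOIN region AS r ON r.region_id = w.region_id
        WHERE w.winery_id = ?
        """,
        (winery_id,),
    ).fetchone()


print(winery_with_region(1))
```

The join condition uses nothing but the two ID values, which is what it means to say that relationships are managed through unique values.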
Figure 8 shows our example database modeled using entity-relationship (ER) modeling. An ER model is a standard method for visualizing a database and for understanding the relationships between the tables. It's particularly useful for more complex databases where relationships of different types exist and you need to understand how to keep these up-to-date and use them in querying15.
In the ER model in Figure 8, the winery and region tables or entities are shown as rectangles. An entity is often a real-world object and each one has attributes, where those that are part of the primary key are shown underlined. The relationship between the tables is shown as a diamond that connects the two tables, and in this example the relationship is annotated with an M at the winery-end of the relationship. The M indicates that there are potentially many winery rows associated with each region. Because the relationship isn't annotated at the other end, this means that there is only one region associated with each winery.
15 See footnote n. 7.
Figure 8. An example relational model of the winery database
Database specific terms
Database
A repository to store data. For example, a database might store all of the data associated with finance in a large company, information about your CD and DVD collection, or the records of an online store.
Table
A part of a database that stores data related to an object, thing, or activity. For example, a table might store data about customers. A table has columns, fields, or attributes. The data is stored as rows or records.
Attributes
The columns in a table. All rows in a table have the same attributes. For example, a customer table might have the attributes name, address, and city. Each attribute has a data type such as string, integer, or date.
Rows
The data entries stored in a table. Rows contain values for each attribute. For example, a row in a customer table might contain the values "Matthew Richardson," "Punt Road," and "Richmond." Rows are also known as records.
Relational model
A formal model that uses databases, tables, and attributes to store data and to manage the relationships between tables.
(Relational) database management system (DBMS)
A software application that manages data in a database and is based on the relational model. Also known as a database server.
SQL (discussed in the report; only the definition is given here)
A standard query language that interacts with a database server. SQL is a set of statements to manage databases, tables, and data.
Constraints
Restrictions or limitations on tables and attributes. A database typically has many constraints: for example, a wine can be produced only by one winery, an order can't exist if it isn't associated with a customer, and having a name attribute is mandatory for a customer.
Primary key
One or more attributes that contain values that uniquely identify each row. For example, a customer table might have the primary key named cust ID. The cust ID attribute is then assigned a unique value for each customer. A primary key is a constraint of most tables.
Index
A data structure used for fast access to rows in a table. An index is usually built for the primary key of each table and can then be used to quickly find a particular row. Indexes are also defined and built for other attributes when those attributes are frequently used in queries.
Entity-relationship (ER) model
A technique used to describe real-world data in terms of entities, attributes, and relationships.
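The Constraints, Primary key and Index entries above can be illustrated together. This is a sketch under assumptions: Python's built-in sqlite3 module replaces MySQL, the customer and orders tables, the index name idx_orders_cust and the helper try_insert are invented for the example, and the exact wording of the error messages depends on the database engine, so it is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE customer (
    cust_id INTEGER PRIMARY KEY,          -- primary key constraint
    name    TEXT NOT NULL                 -- a name is mandatory
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    cust_id  INTEGER NOT NULL REFERENCES customer(cust_id)  -- order needs a customer
);
-- Extra index on an attribute that is frequently used in queries.
CREATE INDEX idx_orders_cust ON orders(cust_id);
INSERT INTO customer VALUES (1, 'Matthew Richardson');
INSERT INTO orders VALUES (10, 1);
""")


def try_insert(sql, params):
    """Run one INSERT; return None on success or the constraint error text."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)


# Each of the first three inserts violates one constraint; the last one is valid.
print(try_insert("INSERT INTO customer VALUES (?, ?)", (1, "Duplicate key")))
print(try_insert("INSERT INTO customer VALUES (?, ?)", (2, None)))
print(try_insert("INSERT INTO orders VALUES (?, ?)", (11, 99)))
print(try_insert("INSERT INTO orders VALUES (?, ?)", (12, 1)))
```

The database rejects the invalid rows by itself, so the rules do not have to be re-checked in every program that writes to the tables.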
BIBLIOGRAPHICAL REFERENCES
• PHP and MySQL Web Development, L. Welling, L. Thomson, Sams Publishing, Indianapolis, 2004
• PHP and MySQL, H. E. Williams, D. Lane, USA, 2004
• Learning PHP 5, D. Sklar, O’Reilly, Sebastopol, USA, 2004
• Web Data Management, S. S. Bhowmick, S. K. Madria, W. K. Ng, Springer, 2004
• MySQL, P. Du Bois, Indianapolis, Riders Publishing, 2003
• MySQL Pocket Reference, G. Reese, O’Reilly, 2003
• Fundamentals of database systems, R. Elmasri, S. B. Navathe, Pearson, 2003
• Database design for mere mortals: a hands-on guide to relational database design, M. J. Hernandez, Addison Wesley, 2003
• Managing and Using MySQL, G. Reese, R. J. Yarger, T. King, O’Reilly, USA, 2002
• Basi di dati. Modelli e linguaggi di interrogazione, P. Atzeni, S. Ceri, S. Paraboschi, R. Torlone, McGraw-Hill, Milano, 2002
• Linguistic databases, J. Nerbonne, Center for the Study of Language and Information, 1998
Dr. Paola Mariani Dr. Costanza Badii
(Program Proposer) (Program Beneficiary)