
A web2.0 collaborative cultural heritage archive with recommender system over trace based reasoning Editor(s): Reim Doumat, University Jean Monnet, France Solicited review(s): Open review(s):

Reim Doumat

Laboratoire Hubert Curien, UMR CNRS 5516, Bâtiment F, 18 Rue du Professeur Benoît Lauras, Université Jean Monnet, 42000 Saint-Etienne, France. [email protected]

Abstract. Cultural heritage holds a large quantity of information that attracts many kinds of people. In recent decades, computer technology and the internet have helped bring history into present life. Ancient and historical documents have been digitized and published online. Cultural heritage digital libraries and web sites were therefore created, first to enhance document preservation, and second to facilitate research on and study of ancient documents. Since the arrival of Web 2.0 technologies has enabled online access to functionally complex and rich applications, it is necessary to provide cultural heritage digital libraries with collaborative and personalized features that enable users to participate in the annotation process. The questions that interest us are: What comes after permitting users to annotate manuscript images? How can a digital library take advantage of users' collaborative annotations? Is it worthwhile to track and register user interaction with a cultural heritage library? In this article, we present our Web 2.0 cultural heritage archive, which traces user actions for recommendation purposes. Users are assisted especially throughout the annotation process in order to reduce annotation inaccuracy.

Keywords: Cultural heritage documents, Web 2.0 archive, annotation, collaboration, trace-based reasoning, recommender system

1. Introduction

Digitizing ancient texts, historical and cultural documents, and artifacts has become widespread in today's digital world, and many organizations and museums now digitize these types of documents. The reasons are, first, to keep the information from being lost and, second, to make these documents available for consultation and exploitation by a wide audience. For example, libraries seek to facilitate access to their documents, monasteries hold ancient religious documents that may contain valuable information, and museums aim to exhibit their objects. On the other side, different organizations and individuals, such as historians, linguists and researchers, are interested in studying ancient documents, and publishing cultural heritage documents on the web helps them in their mission. Thanks to digitization, interested users can finally consult cultural heritage documents online from anywhere, whereas earlier the original papers and artifacts were strictly conserved and within reach of only a few authorized persons. The questions now are: What is the next step after document digitization? How can the needed information be reached quickly and precisely? How can the content of these old documents be understood? How can important information be extracted from ancient and historical documents? Users need to access the digitized documents easily in order to study their content, search them, and enrich them with annotations or classifications. For this reason, a management system is needed to store, visualize, organize, search and annotate these documents. That is why many projects and digital libraries have been developed for these purposes [1], and that is why we are interested in developing a helpful management system for digitized manuscripts.

After digitization, documents are classified into collections. In cultural heritage repositories, collections are sets of digitized images whose originals are mostly handwritten documents. Consequently, document search and retrieval in these collections is annotation-based: search tools use the descriptive metadata, transcriptions and other annotations to find the required document. Annotations and metadata are associated with documents and collections to make them easy to locate. However, the problem facing digital libraries and museums is finding suitable and expressive annotations. Annotations can be extracted automatically from certain documents using a content recognition technique such as OCR (Optical Character Recognition). When OCR fails to extract words, as is the case for handwritten documents, manual annotations are needed. The problem here is that paying experts to extract accurate annotations and to translate such documents is expensive, and such experts are few.

Annotations are associated with documents either according to metadata standards such as Dublin Core [2], METS [3], or MARC21 [4], or according to application requirements, in order to give information about document content. We did not find a universal standard used in all libraries or museums. Two categories of annotations are used with digitized images. The first is non-textual annotations: local descriptors that mark points in images [5]; these descriptors are then used to identify, index, describe, and retrieve images. The second is textual annotations: words extracted from images automatically or added by users manually. Automatic annotation is achieved by applying Optical Character Recognition, as in the MEMORIAL project [6], or by applying handwriting segmentation algorithms [7] [8] to extract the content of scanned documents. For manual annotation, the image management system must allow annotations to be added as metadata, to facilitate document and image retrieval. This work is laborious, however, because reading a manuscript page sometimes takes hours. Moreover, experts in manuscript translation and transcription are rare and expensive. One solution to facilitate the annotation of these documents is therefore to publish them on the web with appropriate annotation tools, enabling any interested user to add annotations. The problem that may then appear is that the created annotations are not verified, and users can make many mistakes. Permitting users to correct or comment on the annotations of others is therefore useful. To achieve good-quality results in a task as difficult as annotation, a collaborative work environment is needed. In a collaborative application, we need an intelligent system that captures the user's skill: it first registers user actions and then, depending on these actions, proposes assistance during the annotation process. In other words, observing users' actions can make the application simpler and more efficient to use, improve interaction quality, and save users time and effort.

In this paper we present our Web 2.0 digital archive, which manages image collections of ancient, historical handwritten manuscripts. The archive application has annotation tools that enable users to annotate the published content manually and remotely, in a collaborative and assisted environment. Thus, besides presenting rare collections on the web, the Web 2.0 archive enables users to add their own annotations, facilitates collaborative work, and offers users an assistant for the annotation and search processes. The originality of our application is twofold: first, it traces user interaction with the application; second, it exploits the registered actions (traces) in an assistant system that uses case-based reasoning to help users and make recommendations according to what they are doing at that moment.

This article is organized as follows: section 2 gives an overview of some cultural heritage projects. In section 3, we present the structure of our web archive. In section 4 we discuss our recommendation system based on user traces. Section 5 concerns our prototype, and in section 6 we show our evaluation results. Finally, in section 7 we conclude and present our perspectives.

2. Related works: Online cultural heritage projects

In this part we show examples of cultural heritage projects. Cultural heritage documents hold a large quantity of information and are of interest to different kinds of users. Cultural heritage digital libraries and web sites are used to enhance the preservation of, and research on, ancient documents. Computer technology and the internet help bring history into present life.

Manuscripts were stored for ages in closed boxes until technological advances made it possible to publish them in an elegant form on the internet. Web projects concerning cultural heritage and ancient documents can be classified into Web 1.0 and Web 2.0 projects. Web 1.0 heritage archive sites aim to publish and share their multimedia documents on the internet; examples include the Digital Image Archive of Medieval Music (DIAMM) [9], the Avestan Digital Archive (ADA) [10], the Muhyiddin Ibn 'Arabi Society Archive Project [11], Columbia Archives and Manuscript Collections [12], Gallica [13] (the French digital archive), the National Audiovisual Institute website (INA) [14], Scraps (documents from World War I) [15], and many others. The collections in these projects have already been transcribed and have metadata through which they can be reached. None of these projects enables users to annotate or comment on the published images. Digital archives with collections of handwritten documents, which are common among cultural heritage archives, may face a real problem in annotating their collections automatically; they lack annotations, and users therefore find it difficult to reach a particular document in a large collection.

A solution is to use Web 2.0 techniques and invite internet users to annotate images. We cite here some examples of Web 2.0 cultural heritage projects.

The UVic IMT project [16] offers image mark-up tools and can load and display a wide variety of image formats. UVic IMT enables the user to specify arbitrary rectangular fragments on the image, to insert resizable and movable annotation areas on the image, and to associate them with annotations. UVic IMT marks the annotated parts of the image and stores the resulting data, following the TEI P5 XML [17] syntax, in local files on the user's machine. The disadvantage of this application is that stored annotations cannot be seen by other internet users. Another example is a University of Michigan project [18] in which internet users manually annotate collections of Sultan Abdul-Hamid, since experts are expensive. The MOSAICA project [19], about Jewish heritage, is a platform for the presentation and discovery of cultural content. This project has developed an online semantic annotator. Users can associate a free-text annotation with individual cultural objects, comment on and recommend them, or annotate them semantically using the MOSAICA ontology. The drawback of using a special-purpose ontology appears when historical document collections concern different subjects and are in different languages. Another project, IMPACT, combines OCR and manual annotations. It aims to significantly improve the accessibility of historical printed text produced before 1900 [20]: the project uses OCR to convert images into text, and it provides a collaborative web-based workspace with options for correcting the results of OCR engines, in order to improve OCR performance and accessibility.

Thus, with Web 2.0 techniques, social annotations on images allow alternative views of digital content and create a sense of collaborative effort. This simple annotation feature proves to be powerful for information management and content sharing. The annotations themselves can be interpreted as explicit metadata added by each user.

In the library domain, enabling users to add annotations in order to organize content has proved its efficiency. In [21] the authors compared the quality of tags generated by experts (such as metadata) with that of tags generated by taggers in a library. The results show that the two sets of annotations are almost similar. Thus, enabling users of a manuscript archive to add annotations to any type of document could be very useful in interpreting its content, especially when the quantity of documents is enormous and the effort required from experts to add detailed metadata is expensive.

As a result, we found that in spite of the large number of existing cultural heritage digital libraries, the majority of them do not offer annotation tools and do not contain a collaborative space. With the arrival of Web 2.0 technologies that enable online access to functionally complex and rich applications, it is necessary to provide theoretical digital library models that have collaborative and personalized features. The interface should, for example, enable the selection of image fragments to annotate. The questions that now arise are: What comes after permitting users to annotate images? How can a digital library take advantage of users' collaborative annotations? Is it worthwhile to track and register user interaction with a cultural heritage library?

In the next sections, we answer these questions. First we present the structure of our Web 2.0 manuscript archive. Then, for recommendation purposes, we show how we register user interaction with the system in the form of structured traces; traces represent user experience. Subsequently, traces are used in a CBR (Case-Based Reasoning) recommendation system.

3. Web 2.0 manuscript digital archive

In this section we define what an archive is for us and how its content of scanned documents is structured to facilitate classification and annotation [22]. In general, an archive is a place to store documents and historical objects that are no longer in use. For us, an archive is any library specialized in cultural and historical documents. The documents it holds are mostly unique, priceless and access-restricted. The proposed web archive model consists of:

3.1. Collections

A collection represents a digitized handwritten collection; it contains more than one image. Images in a collection are ordered: they reflect the initial order of the pages in the real paper collection or the scanning order. In our archive, we distinguish between two collection types: the Original collection, added by the archive administrator, and the Personal collection, created by other users. The latter enables users to gather several images from different collections into a new one with a given image order. Personal collections enable users to express their point of view about the manuscript content, depending on their experience in the domain. Additionally, users can modify or delete their collections and define the access rules on them.

3.2. Images

Each image represents a scanned page of a manuscript document. An image must belong to one original collection in order to be introduced into the archive for the first time. Images can then be gathered or separated in new personal collections.

3.3. Image fragments

Image fragments are concave closed shapes on an image. Each fragment is defined by a set of points. Point coordinates (x, y) are expressed relative to the dimensions of the image on which the fragment is created. Registering the coordinates in this manner allows a fragment to always appear at the same position in the image, even when the image is resized. In general, image fragments are visible to all users unless the annotator (the user) defines his annotation as private.
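The effect of relative coordinates can be illustrated with a minimal Python sketch. It assumes that coordinates are stored as fractions between 0 and 1 of the image width and height (an assumption consistent with the values P1 = (0.8, 0.2) and P2 = (0.9, 0.25) shown in Fig. 1); the function name `to_pixels` and the display sizes are hypothetical and do not come from the archive implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # relative (x, y), assumed to lie in [0, 1]


def to_pixels(points: List[Point], width: int, height: int) -> List[Tuple[int, int]]:
    """Map relative fragment points to pixel positions for a displayed image size."""
    return [(round(x * width), round(y * height)) for x, y in points]


# The two points shown in Fig. 1, drawn on the same image at two display sizes.
fragment = [(0.8, 0.2), (0.9, 0.25)]
full_size = to_pixels(fragment, 1000, 800)
half_size = to_pixels(fragment, 500, 400)
print(full_size, half_size)
```

Because only the relative values are stored, the same fragment record can be redrawn at any zoom level without being updated.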

3.4. Annotations

We studied different annotation standards (DC, METS, TEI, MARC) and found that none of them can be used alone to meet our archive's requirement of annotating any type of handwritten manuscript in the same way. Thus, there is a need to combine the most suitable features of these standards in order to: allow users to work in collaboration, annotate a defined area of an image, let the annotation system evolve in support of new user requirements, and enable users to exploit metadata in a protected manner. We seek an annotation system that is easy to describe in an XML-based syntax so that it can easily be converted into other types of metadata. The desired annotation system should enable describing a handwritten archive and its content as a hierarchical structure, as in TEI P5, adding semantic annotations by users, and defining the location and dimensions of an image fragment in the archive, as in ALTO (Analyzed Layout and Text Object).

Annotations are personal information added by any user to any type of document (collection, image or fragment). They are uncontrolled vocabularies that express a personal point of view, a transcription of the script, or some related information about the text content. In our archive model, an annotation is composed of an annotation type and an annotation value. Annotation types are defined in advance by the archive administrator; for the moment a type can be either Keywords, representing any free word or text added by the user, or Transcriptions, which must reproduce the same letters as in the defined fragment. The values are vocabularies added by any user. Archive users may define the language of their annotation values. This can help to filter search results by language, because the collections in the archive are in different languages.

3.5. Document units

We consider collections, images, and image fragments to be document units. A document unit is thus an abstract element that represents any type of document or document part. A document unit may contain another document unit: for example, a collection contains images, and an image may include fragments. The main objective of introducing the document unit is to simplify annotation anchoring, because annotations can be added only to document units. Fig. 1 presents an example of document units: the collection “Bolyai notes 01” is a document unit, as are the image (BJ-563-4.jpg) and the fragment (Frg_26).

Fig.1 An example of document units
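As a rough illustration of the document unit idea, the following Python sketch models collections, images and fragments with a single class to which annotations are anchored. The class and field names (`DocumentUnit`, `Annotation`, `kind`, `contains`) are hypothetical, and the annotation value used in the example is invented for illustration; the sketch is not the archive's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Annotation:
    ann_type: str                   # "Keyword" or "Transcription"
    value: str                      # free vocabulary entered by the user
    language: Optional[str] = None  # optional language of the value


@dataclass
class DocumentUnit:
    unit_id: str
    kind: str                       # "collection", "image" or "fragment"
    children: List["DocumentUnit"] = field(default_factory=list)
    annotations: List[Annotation] = field(default_factory=list)

    def add_child(self, child: "DocumentUnit") -> "DocumentUnit":
        """Nest a document unit inside this one and return the child."""
        self.children.append(child)
        return child

    def contains(self, other: "DocumentUnit") -> bool:
        """True if `other` is nested anywhere below this unit."""
        return any(c is other or c.contains(other) for c in self.children)


# The three document units named in the text: collection, image, fragment.
collection = DocumentUnit("Bolyai notes 01", "collection")
image = collection.add_child(DocumentUnit("BJ-563-4.jpg", "image"))
fragment = image.add_child(DocumentUnit("Frg_26", "fragment"))

# Annotations are anchored in the same way on any kind of unit (example value).
fragment.annotations.append(Annotation("Keyword", "example keyword", language="en"))
print(collection.contains(fragment))
```

Using one class for all three levels keeps annotation anchoring uniform, which is the stated purpose of the document unit.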

3.6. Users

Users have to authenticate themselves to use the archive annotation tools. Additionally, they must belong to user groups to have access and annotation rights. Assigning a new user to groups defines his profile. For example, a user who belongs to the “Historians” and “Linguists” groups has the profile of a historian and linguist.

Although annotating ancient manuscripts is not a recent subject, exploiting the knowledge of any user to annotate and transcribe manuscript content is novel. Because manual annotation is the only technique that makes this content accessible and exploitable, we draw on user effort to bring out information about manuscript contents. Since expert annotations are expensive and tedious to obtain, we enable internet users to participate in annotating these manuscripts. Furthermore, we offer a type of assistance that guides and helps users in adding and correcting annotations, depending on the experience of other users.

The next part describes how user traces are registered as a record of user experience.

4. User assistance based on reasoning over traces

Digital traces of information system use are registered elements representing the interaction between the user and the system. Traces give information about how the user exploited the system.

Traces are used in e-learning to study and extract the user's behavior and the objective of his activity [23]. Progress in user modeling over recent years has shown that models that learn from observing users' actions can enhance simplicity and efficiency, improve interaction quality, and save users time and effort. However, relatively few applications use this technique, which hinders the development of user assistance. Trace exploitation is considerably more efficient if the traces are conveniently modeled. It is possible to extract frequent items and correspondences using data mining methods, but if the tracing system has specific tools to cut the traces into comparable and reusable episodes, the capitalized experience can be retrieved faster and more precisely. Such tools rely on trace models that are often based on the tasks a user performs.

Since our web archive includes a collaborative environment, a tracing system that captures users' actions and experience is essential for proposing assistance in the form of recommendations. Beyond capturing and registering user actions, the important objective of the assistance system is to know what the current user is doing now. If the user is searching or adding annotations, the system suggests actions related to his work. If the user is correcting registered annotations, the assistant captures these actions for further use. Tracing user actions increases both annotation quality and recommender system performance. The assistant in our web archive is a case-based reasoning system built on the traces of user activities, as shown in Fig. 2. A case-based reasoning (CBR) [24][25] system is an assistant for situations that are hard to formalize; it contains a number of knowledge containers (the case base) of problem-solving processes that are used to solve a new problem. The assistant uses CBR to answer the question: what to do now, according to what other users have done in similar situations?

Fig. 2 A global view of our web archive components and their relationships

[Fig. 1 content: an image fragment delimited by the points P1(X1, Y1) and P2(X2, Y2), with relative coordinates P1 = (0.8, 0.2) and P2 = (0.9, 0.25). Document unit: col_id=02, col_name=« Bolyai notes 01 », img_id=14, img_filename=« bolyai notes01/BJ-536-4.jpg », img_title=« BJ-536-4.jpg », img_order=5, frg_id=Frg_26.]

[Fig. 2 labels: Users; Manuscript images; Annotations; Traces; Intelligent system (CBR: Case-Based Reasoning); Recommendations; user actions: Browse, Add, Leave.]

In the web archive, we consider users' browsing and annotation activities as the case base, the last action of the current user as the problem to be solved, and the system recommendations as solutions. In our system, users do not give feedback to the intelligent system; however, the system can infer from user traces whether or not the user took the recommendations into consideration.

In the following sections we present the three main parts used by the case-based reasoning system in our web archive [26].

4.1. Representing user knowledge as traces

Following and registering a user's interaction activities in web-based environments is called tracing. As our intention is to observe user activities only on the server side (searching for a keyword, annotating a document, etc.), the user's interaction on the client side (using scroll bars, going forward or backward, mouse clicks…) is not completely traced. Before explaining traces and their use as cases, we begin with some definitions.

4.1.1. Trace (T)

In our view, traces are records of user actions performed while exploiting the manuscript archive during a work session. A work session begins with a connection to the system and ends with the logout. In other words, we define a trace as a sequence of user actions and their parameters, as in the equation below.

$T = (a_k)$

where: $a_k$ is an action, and $k$ represents the order of the action (a) in the trace. An example of a user trace is illustrated in Fig. 3.

4.1.2. Trace components

Action (a): in our model, an action is an event created by the user that affects a document unit during a session. Each action is registered with its parameters. Not all user actions in the archive are traced; we are only interested in registering actions related to the search, browse and annotation processes.

Each action has a type (AType) and a set of parameters $p_i$:

$a = (AType, \{p_i\}), \quad i = 1..m$

where: m is the number of action parameters.

Action type (AType): the action type is a pre-defined value assigned to the action. In our application the traced action types are: Login, Logout, Select, Add, Create, Delete, and Edit (Modify).

Action parameter (p): action parameters are the objects affected by an action. Each parameter has a type (PType), representing the name of the parameter, and a value (Pvalue), holding the data in the parameter:

$p = (PType, Pvalue)$

Thus, taking the first action (login) in Fig. 3 as an example, the three parameters of this action are: (User_id, date, time).
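The definitions above (a trace as an ordered sequence of actions, each action having a type and typed parameters) can be sketched as plain Python data classes. This is a minimal sketch: the class names, the `record` helper and the validation step are hypothetical choices made for illustration, while the action types and the login parameters are those listed in the text.

```python
from dataclasses import dataclass, field
from typing import List

# Traced action types listed in the text.
ACTION_TYPES = {"Login", "Logout", "Select", "Add", "Create", "Delete", "Edit"}


@dataclass(frozen=True)
class Parameter:
    ptype: str    # name of the parameter, e.g. "User_id"
    pvalue: str   # data held by the parameter, e.g. "U3"


@dataclass
class Action:
    atype: str
    params: List[Parameter] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject action types that the archive does not trace.
        if self.atype not in ACTION_TYPES:
            raise ValueError(f"untraced action type: {self.atype}")


@dataclass
class Trace:
    actions: List[Action] = field(default_factory=list)  # kept in chronological order

    def record(self, atype: str, **params: str) -> Action:
        """Append an action; its position in the list is its order k in the trace."""
        action = Action(atype, [Parameter(name, str(value)) for name, value in params.items()])
        self.actions.append(action)
        return action


# The first action of the trace in Fig. 3: a login with three parameters.
trace = Trace()
login = trace.record("Login", User_id="U3", date="19/02/09", time="17:05")
print(len(login.params))
```

Keeping the parameter type next to each value is what later allows two actions to be compared parameter by parameter.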

4.1.3. Trace Episodes

The way user actions are traced determines the recommendations that can be given. Since our objective is to exploit traces in a recommender system, we decided to structure the traces of archive exploitation in the form of episodes.

Traces are hard to exploit when they only represent a sequence of actions; decomposing traces into reusable units (which we call episodes) is therefore very helpful for the recommendation system.

Fig. 3 An example of a user trace

[Fig. 3 content: the trace of session 1 as a sequence of actions a1 to a9: Login, Select collection, Select image, Select image, Create fragment, Create annotation, Select fragment, Create annotation, Create annotation, … Each action carries parameters shown as P type and P value pairs. The values shown are: User ID, date, time « U3, 19/02/09, 17:05 »; Collection name « Bolyai »; Image ID « Img_01 »; Image ID « Img_02 »; Fragment ID « Frg_06 »; Keyword « 53610 »; Fragment ID « Frg_05 »; Transcription « nem »; Keyword « name ».]

Hence, we use the word episode to describe a chronologically consecutive group of user actions concerning the work done on one document unit in a trace.

$ep = (a_i), \quad i = 1..n$

where: n is the number of actions in the episode, and the episode actions $a_i$ are ordered chronologically.

An example of trace episodes is illustrated in Fig. 4.

4.1.4. Sub-episode

If an episode ep2 is composed of a subset of the actions of another episode ep1, we say that ep2 is a sub-episode of ep1, and ep1 is a parent episode of ep2. This happens when the document unit (du2) concerned by the actions of the sub-episode is contained in the document unit (du1) concerned by the parent episode. Thus, we define:

$ep_2 \subset ep_1 \Leftrightarrow du_2 \subset du_1$

Furthermore, we define two types of episodes in our archive:

A simple episode is an episode that does not have any sub-episodes, for example the episodes (ep1.1, ep1.2.1, and ep1.2.2) in Fig. 4.

A complex episode contains at least one sub-episode. An example of a complex episode is (ep1) in Fig. 4, representing the work on a collection and its images.

4.1.5. Episode structure

Episodes represent actions carried out by a user on the same document unit; a good organization of these episodes in the system storage therefore makes it easy to retrieve common features in the work of different users.

As we illustrate in Fig. 4, when the user changes the document unit, a new episode is created.

Because a document unit may represent one of several types of document (collection, image or fragment), registering an episode in the database requires identifying whether the episode is simple or complex.

Complex episodes have a hierarchical structure (levels) according to the document unit type. For example, in Fig. 4 the first level (collection level) concerns the actions done on collections (ep_1), the second level (image level) represents actions performed on images (ep_1.2), and the third level (fragment level) corresponds to the actions manipulating image fragments.

The hierarchical structure allows episodes to be assembled into meaningful compositions and provides the basic structure needed for quick and relevant retrieval of these episodes. It allows the assistant to extract episodes of the same level as the user's current actions, and thus to make more precise recommendations to the user, depending on his current work.
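To make the episode hierarchy concrete, the following Python sketch cuts a trace into nested episodes. It assumes that each traced action is tagged with the path of the document unit it concerns (collection, then image, then fragment), that a new episode opens whenever this unit changes, and that session-level actions such as login carry an empty path. The names `Episode` and `build_episodes`, and the assignment of actions a5 to a9 to particular fragments, are illustrative assumptions rather than details taken from the archive implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Path = Tuple[str, ...]  # e.g. ("Bolyai", "Img_02", "Frg_05")


@dataclass
class Episode:
    unit: Path                                          # document unit of the episode
    actions: List[str] = field(default_factory=list)    # own and nested actions
    subs: List["Episode"] = field(default_factory=list)

    @property
    def level(self) -> int:
        return len(self.unit)   # 1 = collection, 2 = image, 3 = fragment

    @property
    def is_simple(self) -> bool:
        return not self.subs    # a simple episode has no sub-episodes


def build_episodes(trace: List[Tuple[str, Path]]) -> List[Episode]:
    """Group consecutive actions into episodes nested by document unit."""
    roots: List[Episode] = []
    open_eps: List[Episode] = []   # currently open episodes, outermost first
    for name, path in trace:
        # Close episodes whose document unit does not contain the new one.
        while open_eps and path[:open_eps[-1].level] != open_eps[-1].unit:
            open_eps.pop()
        # Open one episode per missing level down to the action's unit.
        while (open_eps[-1].level if open_eps else 0) < len(path):
            depth = (open_eps[-1].level if open_eps else 0) + 1
            episode = Episode(path[:depth])
            (open_eps[-1].subs if open_eps else roots).append(episode)
            open_eps.append(episode)
        # The action belongs to every open episode (parent episodes include it).
        for episode in open_eps:
            episode.actions.append(name)
    return roots


session = [
    ("a1", ()), ("a2", ("Bolyai",)),
    ("a3", ("Bolyai", "Img_01")), ("a4", ("Bolyai", "Img_02")),
    ("a5", ("Bolyai", "Img_02", "Frg_06")), ("a6", ("Bolyai", "Img_02", "Frg_06")),
    ("a7", ("Bolyai", "Img_02", "Frg_05")), ("a8", ("Bolyai", "Img_02", "Frg_05")),
    ("a9", ("Bolyai", "Img_02", "Frg_05")),
]
ep_1 = build_episodes(session)[0]
print(ep_1.level, [sub.unit[-1] for sub in ep_1.subs])
```

With this structure, episodes of the same level can be selected simply by comparing the `level` property.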

After registering and structuring traces in the database, the next step in the system is to compare trace episodes in order to find similar ones.

4.2. Trace similarity

According to the idea of Case-Based Reasoning, a case similar to the current user's work should help the user to solve his problem (the user's next action). This raises the question of detecting case similarity: which characteristics should be taken into account to determine the similarity between cases? The answer to this question cannot be general, because each algorithm has to be adapted to the system requirements.

[Fig. 4 content: the trace of session 1 (actions a1 to a9: Open session, Select collection, Select image, Select image, Create fragment, Add annotation, Select fragment, Add annotation, Add annotation, …, with the P type and P value pairs of Fig. 3) divided into episodes: ep_1 on the collection (Bolyai), which contains ep_1.1 on the image (img_01) and ep_1.2 on the image (img_02); ep_1.2 in turn contains ep_1.2.1 on the fragment (frg_06) and ep_1.2.2 on the fragment (frg_05).]

Fig. 4 The episodes and their hierarchical structure in a trace

In this section we present an algorithm to extract similar cases and to calculate the similarity between them. Since the main objective is to assist the user during his session, the comparison is always executed between the last unfinished episode of the highest level of the current user and the episodes of registered traces. For example, if the trace in Fig. 4 is that of the current user, then the last unfinished episode of the highest level is the episode (ep_1.2.2).

As a result of the proposed structure of user traces and their division into episodes according to document units, the system is able to compare episodes that have the same type of document unit (the same level). It then uses the following method to find similar episodes in order to assist the user in a subsequent step.

4.2.1. Extracting similar episodes Here we present our search algorithm to find simi-

lar episodes to the current user.

Inputs: the user’s last unfinished, highest-level episode $(ep_u)$, and the trace database.

Method:

1. From the trace database, get all the episodes $(ep_x)$ that have the same level as $ep_u$, in inverse chronological order.

2. For all returned episodes $(ep_x)$, start the comparison between $ep_u$ and $ep_x$ from the most recent to the oldest.

3. Compare $ep_u$ with $ep_x$ to calculate their similarity degree using:

$$Sim_{ep}(e_1, e_2) = \sum_{i=1}^{|e_1|} \max_{1 \le j \le |e_2|} Sim_{action}(a_{1i}, a_{2j})$$

where $|e_1|$ is the length (number of actions) of the episode $e_1$, and $|e_2|$ is the length of the episode $e_2$.

   a. For all actions $(a_u)$ in $ep_u$:

      i. For all actions $(a_i)$ in $ep_x$, compare the similarity between these actions and the action $a_u$.

         1. This comparison calls two functions. The first compares the action types:

$$Sim_{type}(aType_1, aType_2) = \begin{cases} 1 & \text{if the actions are the same, or when the two actions are Select and Create} \\ 0 & \text{when the actions are different} \end{cases}$$

            If $Sim_{type} = 0$, then $Sim_{action}$ is also zero and we do not go on to calculate the similarity between the parameters. Otherwise, we compare the parameters of the two actions:

$$Sim_{par}(P_{1i}, P_{2i}) = Sim_{ptype}(ptype_1, ptype_2) + Sim_{pvalue}(pvalue_1, pvalue_2)$$

            To calculate the similarity between parameter types and parameter values we use the next two equations:

$$Sim_{ptype}(ptype_1, ptype_2) = \begin{cases} 0 & \text{if } ptype_1 \neq ptype_2 \\ 1 & \text{if } ptype_1 = ptype_2 \end{cases}$$

$$Sim_{pvalue}(pvalue_1, pvalue_2) = 1 - distance$$

            The distance is measured depending on the parameter type. For parameters that hold annotations (strings of characters), we measure the distance using the Levenshtein algorithm (edit distance). For parameters of id type, the distance is calculated with the formula:

$$distance = \begin{cases} 1 & \text{if } pvalue_1 \neq pvalue_2 \\ 0 & \text{if } pvalue_1 = pvalue_2 \end{cases}$$

         2. Results from the two functions are in the range [0,1].

         3. Get the maximum similarity: $Sim(a_k, a_i) = \max\, sim(a_u, a_i)$ for $i = 1$ to $|ep_x|$ (the number of actions in the episode $ep_x$).

      ii. If the similarity $Sim(a_k, a_i)$ is >= Threshold (the similarity threshold is chosen according to our experiment, as shown in section 6), we register the similar actions with their similarity degree and mark the action $(a_i)$ as solved; otherwise we set $Sim(a_k, a_i)$ to zero.

   b. Calculate the similarity between the episodes $ep_u$ and $ep_x$ by summing the $Sim(a_k, a_i)$ values found in step (ii).

   c. Call a previous episode $(ep_{x-1})$ and return to step (3).

Output: list of similar episodes $(Ep_s)$.

The resulting similar episodes are the input of the recommender system (assistant); this system decides the most suitable next action for the current user depending on the similarity between the episodes.
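The similarity algorithm above can be illustrated with a minimal Python sketch. It is not the ARMARIUS implementation: the `Param` and `Action` classes, the split of an action into `verb` and `unit`, the positional pairing of parameters, the normalisation of the edit distance by the longer string, and the rescaling of the parameter similarity to [0, 1] are all assumptions made for the example. The default threshold of 0.7 follows the value retained in the evaluation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Param:
    ptype: str   # parameter type, e.g. "Transcription" or "Fragment ID"
    value: str


@dataclass
class Action:
    verb: str    # e.g. "Select", "Create", "Modify" (hypothetical split)
    unit: str    # document unit acted on, e.g. "image", "annotation"
    params: list = field(default_factory=list)


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_type(a1: Action, a2: Action) -> float:
    """1 if the actions are the same, or a Select/Create pair on the same unit."""
    if a1.unit != a2.unit:
        return 0.0
    similar = a1.verb == a2.verb or {a1.verb, a2.verb} == {"Select", "Create"}
    return 1.0 if similar else 0.0


def sim_pvalue(p1: Param, p2: Param) -> float:
    """1 - distance; exact match for ids, edit distance for annotation strings."""
    if p1.ptype.endswith("ID"):  # ASSUMPTION: id parameters are named "... ID"
        distance = 0.0 if p1.value == p2.value else 1.0
    else:
        longest = max(len(p1.value), len(p2.value)) or 1
        distance = levenshtein(p1.value, p2.value) / longest  # ASSUMPTION: normalised
    return 1.0 - distance


def sim_action(a1: Action, a2: Action) -> float:
    if sim_type(a1, a2) == 0.0:
        return 0.0  # parameters are not compared when the types differ
    pairs = list(zip(a1.params, a2.params))  # ASSUMPTION: paired by position
    if not pairs:
        return 1.0
    total = sum((1.0 if p.ptype == q.ptype else 0.0) + sim_pvalue(p, q) for p, q in pairs)
    return total / (2 * len(pairs))  # ASSUMPTION: rescaled to [0, 1]


def sim_episode(ep_u: list, ep_x: list, threshold: float = 0.7) -> float:
    """Sum, over the actions of ep_u, of the best match above the threshold."""
    solved, score = set(), 0.0
    for a_u in ep_u:
        candidates = [(sim_action(a_u, a_i), k) for k, a_i in enumerate(ep_x) if k not in solved]
        if not candidates:
            break
        best, k = max(candidates)
        if best >= threshold:
            solved.add(k)  # mark the matched action as solved
            score += best
    return score
```

With this sketch, an episode compared with an identical copy of itself scores its own length, and a Modify action never matches a Create action because their type similarity is zero.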

4.3. Recommendations

The recommender system runs after the similarity algorithm; it depends mainly on the user’s current episode and the similar registered cases in the trace database. The objective of retrieving episodes similar to an unfinished one is to retrieve what could be the next user action. The suggested action, or its parameters, may be adapted to the current context to give the concrete recommendation. The recommender system suggests zero or several actions (ordered by priority). The suggestion list is empty if no match is found, for one of two reasons: either the trace database is empty, or it contains no sufficiently similar episodes.

4.3.1. Recommendation algorithm

Inputs: current user profile $(pr_u)$, current user episode $(ep_u)$, list of similar episodes $(Ep_s)$.

1. $a_u$ = last action of $ep_u$.

2. Extract the following (next) actions $a_i$ for all similar episodes $ep_s \in Ep_s$.

3. For all $a_i$ in $list(a_i, list(ep_s))$:

   a. For all $ep_{sj}$ in $list(ep_s)$:

      i. Measure the profile similarity $sim_{pr}$:

$$sim_{pr} = \frac{|A \cap B|}{|A \cup B|}$$

         where $A$ and $B$ represent sets of user groups.

      ii. Get the episode similarity $sim_{ep}$.

      iii. Transform the episode time into a percentage score $(t)$.

      iv. Calculate the recommendation rate:

$$r_{ij} = (w_1 * sim_{pr} + w_2 * sim_{ep} + t)$$

         where $w_1$ and $w_2$ are weighting coefficients used to define the importance of $sim_{pr}$ and $sim_{ep}$.

   b. Calculate the recommendation rate of $a_i$: $R_i = \sum_j r_{ij}$.

4. Decide the actions to be recommended from the $a_i$ that have the greatest recommendation rate $R$.

Output: recommended actions $(R)$.

In the next section we present an example of reusing traced user actions to assist another person.
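As a rough illustration of the recommendation algorithm, the following Python sketch accumulates a rate per candidate action. The function names, the tuple layout of `candidates`, and the default weights are hypothetical; the episode similarity and the time score are taken as given inputs because the text does not specify how the time is converted into a percentage.

```python
from collections import defaultdict


def profile_similarity(groups_a, groups_b) -> float:
    """Ratio of shared user groups to all groups of the two users."""
    a, b = set(groups_a), set(groups_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user_groups, candidates, w_profile=1.0, w_episode=1.0, top_n=3):
    """Rank candidate next actions by their summed recommendation rate.

    candidates: iterable of (next_action, owner_groups, episode_similarity,
    time_score) tuples, one per similar episode (hypothetical layout).
    """
    rates = defaultdict(float)
    for action, owner_groups, sim_ep, time_score in candidates:
        sim_pr = profile_similarity(user_groups, owner_groups)
        # Rate contributed by one similar episode; rates for the same
        # action are summed over all episodes that propose it.
        rates[action] += w_profile * sim_pr + w_episode * sim_ep + time_score
    ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]
```

An empty candidate list yields an empty suggestion list, which matches the case where no sufficiently similar episode exists.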

4.4. Example of harnessing user traces as recommendations

In Fig. 5 we show three user traces with details of the episodes they are cut into, their actions, the associated parameters of each action and their values. The current user, User3, is annotating a fragment “Frg_116” in the image “Img_01” of the collection “Bolyai”. The system assists the user according to his last action in the last unfinished episode epX of the fragment “Frg_116”. First, the assistant extracts from the trace database the episodes of the same level as the user’s current episode, depending on the document unit type; in this example it is an image fragment. The extraction function returns two episodes here, epY and epZ, from the traces of User2 and User1 respectively.

Fig. 5 Comparison between traces to make recommendations

[Figure: the traces of User1, User2 and User3 stored in the trace database (trace dates 5/12/2009, 1/2/2010 and 3/3/2010), each shown as a sequence of actions with their parameter types and values. The episodes epX, epY and epZ are compared, and the correction of the transcription « nen » into « nem » is returned as a recommendation.]

The recommender system compares the User3 episode epX with the extracted episodes epY and epZ. Based on the user’s last action (creating the annotation “nen”) and the comparison algorithm, the recommender decides which next action to suggest. It takes into consideration the similarity between the compared episodes, the resemblance between user profiles, and the date when the registered episodes were created. If User2 belongs to the same groups as User3, his episodes are rated higher than those of User1.

Consequently, the assistant recommends that the user perform the next action of the User2 episode: modify the annotation “nen” into “nem”. When all three users have the same profile, the trace date plays a role in the recommendations. From Fig. 5 we can see that the User2 episode epY is more recent than the episode epZ of User1, hence it has a higher recommendation rate. Furthermore, when the assistant uses corrections as recommendations and the user accepts the suggestions, the system helps the user avoid some mistakes. Alternatively, the user can refuse the assistant’s suggestions.

5. ARMARIUS: a prototype of Web2.0 archive

We have developed a web2.0 manuscript archive called ARMARIUS. The aim is to expose images of historical and ancient handwritten documents on the web, and to assist users in manuscript annotation and research. The prototype contains tools to visualize collections, inspect their content, and construct personal collections. The main tools in ARMARIUS let users add annotations to images and collections, create fragments on images in order to add more specific annotations or transcriptions, view the trace of their current interaction with the system in a widget at the top of the working space, and receive suggestions from the assistant according to their actions.

5.1. Annotation tools

Annotation tools are easy to use and enable users to annotate any type of document unit. The image fragment annotation tool differs slightly from the ones for image and collection annotation. In order to annotate a fragment, the system provides tools to create a new fragment (by drawing a closed shape that defines the region of the image to be annotated), to modify the position of the fragment in the image, to edit the shape of the fragment so that users can expand or reduce it, and to delete an image fragment. Furthermore, the annotation tool maintains and shows the name of the user who added each annotation. An example of the annotation process is shown in Fig. 6.

Fig. 6 Annotation in ARMARIUS

5.2. Tracing tools

The user’s role is important in the tracing and recommendation process, so users have to be aware that their actions are traced. Tracing tools must provide methods to show the trace under construction to the user, enabling him to stop the tracing whenever he wants. The system administrator must be able to activate or deactivate the tracing process for certain users, and to delete user traces.
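The requirements above can be pictured with a small in-memory sketch. The `TraceStore` class and its method names are hypothetical and only illustrate the four operations mentioned (record, show, stop or resume, delete); they are not the ARMARIUS API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TraceStore:
    """Hypothetical in-memory trace store, for illustration only."""
    traces: dict = field(default_factory=dict)  # user id -> list of (timestamp, action)
    paused: set = field(default_factory=set)    # users whose tracing is switched off

    def set_tracing(self, user: str, enabled: bool) -> None:
        """Called by the user (stop tracing) or by the administrator."""
        if enabled:
            self.paused.discard(user)
        else:
            self.paused.add(user)

    def record(self, user: str, action: str) -> bool:
        """Append an action to the user's trace; return False if tracing is off."""
        if user in self.paused:
            return False
        self.traces.setdefault(user, []).append((datetime.now(timezone.utc), action))
        return True

    def current_trace(self, user: str) -> list:
        """Actions traced so far, in order, for display to the user."""
        return [action for _, action in self.traces.get(user, [])]

    def delete_trace(self, user: str) -> None:
        """Administrator operation: remove all traces of a user."""
        self.traces.pop(user, None)
```

Keeping the pause state separate from the stored traces means that stopping the tracing does not discard what was already registered; deletion is an explicit, separate operation.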

5.3. Recommendation tools

Recommendation tools (the assistant) have to suggest actions to users according to their current action context and profile. The application generates suggestions using case-based reasoning. Recommendations are based on the registered user traces; to be considered useful, they have to be relevant to the current user’s work and appropriate for his profile. An example of recommendations is shown in Fig. 7.

6. Evaluation

Evaluating the performance of a digital archive application has several facets: some concern the ergonomics of the interfaces, others the archive content, the supplied tools, etc. In our work, we are more interested in evaluating the quality of the generated recommendations, in order to measure the performance of the user assistance.

The challenge in evaluating a recommender system is to select the appropriate metric. A large diversity of metrics is used to evaluate the accuracy of recommender systems. The metrics most often used to evaluate recommendation quality are precision and recall. In a quality evaluation, a recommender is judged on whether its recommendations are relevant to the user’s needs. Thus, we measure the suggested recommendations against the current user’s need (what to do next). Precision and recall are computed from Table 1.

Fig. 7 ARMARIUS recommendation window

Table 1

Action categorizations in the system with respect to a given user situation

                             Suggested actions   Not suggested actions   Total
Relevant to user episode     $N_{rs}$            $N_{rn}$                $N_r$
Irrelevant to user episode   $N_{is}$            $N_{in}$                $N_i$
Total                        $N_s$               $N_n$

Precision is defined as the ratio of relevant suggested actions to the number of actions selected by our recommendation system, as shown in the equation:

$$Precision = \frac{N_{rs}}{N_s} = \frac{N_s - N_{is}}{N_s}$$

where $N_{rs}$ is the number of relevant suggested actions, $N_{is}$ the number of irrelevant suggested actions, and $N_s$ the total number of suggested actions. Precision represents the probability that a suggested action is relevant. The other measure, recall, shown in the next equation, is defined as the ratio of relevant suggested actions to the total number of relevant actions available, $N_r$; here $N_{rn}$ is the number of relevant actions that were not suggested. Thus, recall represents the probability that a relevant action is suggested.

$$Recall = \frac{N_{rs}}{N_r} = \frac{N_s - N_{is}}{N_s - N_{is} + N_{rn}}$$

One of the primary challenges of using precision and recall is that they must be considered together to evaluate the performance of an algorithm completely, because they are inversely related: when more actions are returned, recall tends to increase and precision to decrease.
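The two measures can be computed directly from the set of suggested actions and the set of relevant actions, as in this minimal Python sketch. The function name and the choice to return 0.0 when a denominator is empty are assumptions made for the example.

```python
def precision_recall(suggested, relevant):
    """Return (precision, recall) for one user situation.

    suggested: actions proposed by the recommender
    relevant:  actions that are relevant to the user's episode
    """
    suggested, relevant = set(suggested), set(relevant)
    relevant_suggested = len(suggested & relevant)
    # ASSUMPTION: an empty denominator yields 0.0 instead of raising an error.
    precision = relevant_suggested / len(suggested) if suggested else 0.0
    recall = relevant_suggested / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, if four actions are suggested and two of them are among three relevant actions, the precision is 2/4 and the recall is 2/3.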

To carry out the experiment, we asked 25 users (from different groups) to perform a given scenario. Users had to note down the system’s recommendations for every action they performed, and the number of recommendations that were related to their actions. The users’ results are used to measure the recall and precision of the suggestions. All users had the same dataset, and they were asked to repeat the same experiment three times with different action similarity percentages. The thresholds used are 60%, 70%, and 90%. The objective is to see how the similarity threshold affects the quality and quantity of the recommendations. During the test, we reset the database to the initial dataset every time we changed the action similarity degree. The results of our experiment measuring the precision and recall of the system recommendations are illustrated in Fig. 8.

We notice that recall decreases almost linearly as the action similarity percentage increases. This is expected, because increasing the required similarity between the current user action and the registered actions leads to fewer recommendations, and thus to a lower recall. We also notice that precision and recall are balanced for an action similarity percentage of 70%. Therefore we fixed this threshold in our algorithm.

7. Conclusion and future works

In this article, we presented an archive model and prototype that enable users to annotate historical manuscript documents easily and remotely. Huge quantities of digitized handwritten manuscripts are still puzzles for archivists, who need users’ help in annotating them.

Fig. 8 Precision and Recall measurements for different action similarity thresholds

[Figure: two panels, “Recommendation precision” and “Recommendation recall”, plotting precision and recall (vertical axes, 0.65 to 1) against the action similarity percentage (horizontal axes, 50% to 100%).]

We conducted this research to address the issues of annotating handwritten manuscripts remotely on the web, and of assisting this difficult task with a recommender system. We stressed the importance of integrating the effort of different users in the annotation process. Our web2.0 archive permits users to work in collaboration; their actions are registered and constitute user traces. Traces are used in a trace-based recommendation system to assist users.

As a result of user efforts, collections will be annotated by users with different levels of experience. Users are able to correct inaccurate annotations, and so increase the experience of the recommender system. The originality of our work is that the recommender system is based on registered traces of user interaction with the archive documents. Furthermore, we organize traces in hierarchical episodes; the main goal of this structure is to make quick, precise and relevant recommendations based on very similar registered episodes.

In our current work, we define a user profile as the set of his groups. In future work we want to create a detailed profile, in order to refine the recommendations according to the user’s interests. Furthermore, we hope to create relationships between images or collections according to the similarity between their contents or annotations. We propose to make it possible to link archive contents (collections, images…) of the same domain. This could be done either manually by the user, or using a semantic analysis of the annotations. The created relations can be used to facilitate user search, as well as to support the recommender system, which could use these links when searching for similar episodes.
