ferdowsi university of mashhadprofdoc.um.ac.ir/articles/a/1040823.doc · web viewtitle: using data...

Document Information:

Title: Using data island method for creating metadata records with indexability and visibility of tag names in web search engines

Author(s): Sayyed Mahdi Taheri, (Department of Knowledge and Information Science, Science and Research Branch, Islamic Azad University, Tehran, Iran), Nadjla Hariri, (Department of Knowledge and Information Science, Science and Research Branch, Islamic Azad University, Tehran, Iran),Sayyed Rahmatollah Fattahi, (Department of Knowledge and Information Sciences, Ferdowsi University of Mashhad, Mashhad, Iran)

Citation: Sayyed Mahdi Taheri, Nadjla Hariri, Sayyed Rahmatollah Fattahi, (2014) "Using data island method for creating metadata records with indexability and visibility of tag names in web search engines", Library Hi Tech, Vol. 32 Iss: 1, pp.83 - 97

Keywords: Element tag names, Indexability, Metadata standards, Visibility, Web search engines, XML data island method

Article type: Research paper

DOI: 10.1108/LHT-06-2013-0065 (Permanent URL)

Publisher: Emerald Group Publishing Limited

Abstract: Purpose – The aim of this research was to examine the use of the data island method for creating metadata records based on DCXML, MARCXML, and MODS with indexability and visibility of element tag names in web search engines.

Design/methodology/approach – A total of 600 metadata records were developed in two groups (300 HTML-based records in an experimental group with special structure embedded in the <pre> tag of HTML based on the data island method, and 300 XML-based records as the control group with the normal structure). These records were analyzed through an experimental approach. The records of these two groups were published on two independent websites, and were submitted to Google and Bing search engines.

Findings – Findings show that all the tag names of the metadata records created based on the data island method relating to the experimental group indexed by Google and Bing were visible in the search results. But the tag names in the control group's metadata records were not indexed by the search engines. Accordingly it is possible to index and retrieve the metadata records by their tag name in the search engines. But the records of the control group are accessible by the element values only. The research suggests some patterns to the metadata creators and the end users for better indexing and retrieval.

Originality/value – The research used the data island method for creating the metadata records, and deals with the indexability and visibility of the metadata element tag names for the first time.

Introduction

Metadata is a tool for organizing content objects in the new information environment, particularly the World Wide Web. Metadata standards are sets of elements with a special semantic structure developed in order to describe, identify, discover, preserve, and manage content objects in information systems. Metadata standards that focus on the functions of identification and discovery of content objects are called “descriptive metadata”. Organizing and describing the content objects of the web is done based on this kind of metadata. The MARC metadata format, Dublin Core metadata initiative (DCMI), and Metadata Object Description Schema (MODS) are considered the most important and common descriptive metadata standards. On the other hand, web search engines are among the tools most frequently used by end-users for searching and retrieving web content. Also, the ease of search and the various search capabilities that they provide for different levels of users add to their use. Accordingly, accessibility of content objects, especially metadata records, through web search engines has always been addressed by their designers and creators. Bibliographic data are stored in metadata records in a structured format. This characteristic has been designed to support the feasibility of searches based on elements and to manage the metadata efficiently. However, although contextual search in the retrieval system of web search engines is possible, these searches are limited to a few parts of the content object (title, body, URL, and hypertext).

As the results of previous studies indicate, when web search engines crawl metadata records that now use extensible markup language (XML) as syntax, they adopt the tag discarding approach (Luk et al., 2000, 2002; Farajpahlou and Tabatabaie Amiri, 2011; Taheri and Hariri, 2012; Aqa Abedi, 2012).

For instance, web search engines extract the value “Tim Berners-Lee” of the element “creator” (as in <creator>Tim Berners-Lee</creator>), one of the Dublin Core metadata elements, and make it searchable, but they do not provide the possibility of searching this value based on the element tag name (i.e. “creator”). Adoption of the tag discarding approach by web search engines is due to the extensibility of XML. On the other hand, some studies (such as the study of the “cache” version of pages in the database of web search engines) indicate that the indexing software of web search engines extracts all the characters of a web content object (such as tag names), but does not provide the possibility of retrieving them (Grehan, 2002; Search Engine Watch, 2007; Taheri and Hariri, 2012). The question that can now be raised is: how can we change the current approach of web search engines to index and make the metadata element tag names visible based on extensible markup language? Regarding the current approach of web search engines, as evidence shows, their indexing software indexes all the characters of content objects. Can we make the indexability and visibility of element tag names possible by using special methods in implementing the metadata records while preserving the XML capabilities? Do web search engines make it possible to present metadata records that are relevant to element tag name-based queries?
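The tag discarding behavior described above can be illustrated with a toy model. The sketch below is not how Google or Bing actually index pages; it only assumes an indexer that keeps character data and drops markup, which is enough to show why a tag name written as markup disappears while the same name written as escaped text survives. The class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextOnlyIndexer(HTMLParser):
    """Toy indexer (an assumption, not a real crawler): keeps text, drops markup."""

    def __init__(self):
        # convert_charrefs=True decodes &lt; and &gt; before handle_data is called
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def index_terms(document):
    """Return the set of lowercase word tokens the toy indexer would keep."""
    parser = TextOnlyIndexer()
    parser.feed(document)
    parser.close()
    return set(re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower()))


plain_xml = "<creator>Tim Berners-Lee</creator>"
escaped_in_pre = "<pre>&lt;creator&gt;Tim Berners-Lee&lt;/creator&gt;</pre>"

plain_terms = index_terms(plain_xml)        # "creator" is markup, so it is dropped
island_terms = index_terms(escaped_in_pre)  # "creator" is text, so it is kept
```

Under this simplified model, both records are findable by the value tokens, but only the escaped record is findable by the tag name.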

The present study aims to answer these questions. To that end, the following hypothesis is examined:

H1. Embedding XML-based metadata records in the <pre> tag of HTML based on the XML data island method provides the possibility of indexing and visibility of metadata element tag names by web search engines.

The XML data island method is a method for embedding an XML document in an HTML document. In other words, an XML document can be stored inside an HTML document as a “data island”. There are different ways of embedding an XML document into an HTML document, such as using the <xml>, <script>, <object>, and <pre> tags. But the <pre> tag is more appropriate than the other tags, because it preserves the formatting of XML documents exactly (ExpertRating.com, 2006; Powell, 2010; Microsoft Developer Network, 2013; Mozilla Developer Network, 2013a, b; see also www.htmlquick.com/reference/tags/pre.html). XML documents that are embedded in HTML documents become friendlier to web search engine spiders and are easier to render. This research used the XML data island method for improving the interoperability of metadata systems and web search engines and preserving the structure and formatting of XML-based metadata records. It should be noted that there are some other methods for offering metadata to web search engines, such as schema.org, microdata, and RDFa.

Literature review

The aim of designing and developing metadata systems is to improve access to content objects. Achieving this aim, which is based on the metadata capabilities, is reinforced through better interaction with other systems like web search engines. The tendency of some research on the metadata to study the interoperability of metadata systems and web search engines represents the importance of the interaction. At a general glance, there are four groups of research studying the area of interoperability among metadata systems and web search engines. The first group of research in this area focused on the effect of the indexability of HTML metatags values on improving the retrievability of content objects in web search engines. As is stated in the help document of some web search engines (Google, 2013; Yahoo, 2013), the spider-indexer software of web search engines reacts positively to HTML metatags, and indexes their values. This positive reaction paves the way for the possibility of retrieving content objects based on the indexed metatag values through web search engines. The results of studies conducted by Turner and Brackbill (1998), Quevedo-Torrero (2004), and Zhang and Dimitroff (2005a) confirm this issue. In the second group of research, the efficiency of the elements created based on metadata standards in improving the retrievability and ranking of content objects among the results of web search engines was investigated. The reason for doing this research was the enrichment and better functionality of the standard metadata elements to the HTML tags and metatags. Sokvitne (2000) and Safari (2005) studied the reaction of web search engines to the Dublin Core metadata elements implemented in the HTML syntax. The results of both studies emphasized lacking or decreasing the indexability of values of the standard metadata elements through web search engines, subsequently reducing the retrievability of the content objects.

The third group of research in the interoperability issues was concerned with the effectiveness of the HTML metatags and the standard metadata elements in a compound and comparative approach. The positive reaction of web search engines to HTML metatags and their negative reaction to the indexability of standard metadata elements caused such an approach to emerge in studies of interoperability. The findings of studies conducted by Henshaw and Valauskas (2001), Zhang and Dimitroff (2004), Zhang and Dimitroff (2005b), Mohamed (2006), and Sharif (2007) showed a better effect of the HTML metatags on the accessibility of web content objects through web search engines than the standard metadata elements. A greater focus on the HTML format in previous studies caused the development of new studies. The most important feature of the last (fourth) group of studies was change in the syntax of metadata records to extensible markup language (XML). Studies in this group found the solution of increasing interoperability among web search engines and metadata systems in changing the syntax of metadata records to extensible markup language. Moreover, the findings of the studies mentioned indicate that the adoption of the XML syntax led to improvements in indexability as well as retrievability of the values of metadata elements through web search engines (Farajpahlou and Tabatabaie Amiri, 2011; Taheri and Hariri, 2012; Aqa Abedi, 2012 [1]). The studies in interoperability had special common points. All of them were done based on the experimental method. With the exception of the research by Aqa Abedi (2012), all the other studies adopted the embedding approach for using metadata. The findings of the studies conducted in this area suggest that the HTML metatags are known to web search engines and their values are totally indexed by spider-indexer software, and bring about the accessibility of content objects. 
Also, the HTML metatags improve the rank of content objects among the results of web search engines (Turner and Brackbill, 1998; Quevedo-Torrero, 2004; Zhang and Dimitroff, 2005a). However, these engines did not react positively to the values of the standard metadata elements indexed in the syntax of HTML as well as the element tag names (Sokvitne, 2000; Safari, 2005). The survey of research in the area of interoperability indicates a lack of attention to the indexability and visibility of the element tag names as index terms by web search engines (Taheri et al., 2013). It is obvious that if such a capability is developed for the metadata elements that describe and discover content objects, end users can limit their search scope to the values of special elements (or tag names). Furthermore, just like full-text databases and computerized catalogs, which provide the possibility of field-based searches, element-based searches will be possible in web search engines.

In addition to research into interoperability among metadata systems and web search engines, there are some research studies that have used data island methods, as well as other related methods such as schema.org, Microdata, RDFa, and the <pre> tag of HTML in their methodologies (although most studies of these methods are descriptive in nature and focus on the characteristics of the methods and how creators could use them). Dehinbo (2006), Xin (2006), and Grimnes (2008) used the data island method to carry out their research. To identify “XML support” as a relevant concept in order to develop a conceptual framework for determining a platform for teaching web application development, Dehinbo (2006) embedded some XML data into HTML based on the XML data island method. Xin (2006) implemented a GUI toolkit based on the XML specifications by using the data island method. Grimnes (2008) analyzed the number of data islands to better understand the topology of his data sets that were used to evaluate individual learning methods from the semantic web. Kim et al. (2013) analyzed HTML source code based on the <pre> tag to determine the number of code examples generated by JavaDocs. Isaksen (2011) introduced RDFa as a key development in the semantic web era in building a framework to which content partners could contribute their data. Also, Mixter (2013) used schema.org (based on Microdata) to conceptualize the VRA core data model and to convert XML elements and attributes into RDF classes and properties. The findings of these research studies show the usability of some methods for developing the semantic web. These methods can be used to create rich metadata that improves the retrievability of web content in web search engines.

Methodology

A total of 600 metadata records, created based on DCXML, MARCXML, and MODS, in two groups (300 HTML-based metadata records as the experimental group with special structure: the XML-based metadata records embedded in the <pre> tag of HTML based on the XML data island method, and 300 XML-based records as the control group with the normal structure) were analyzed through an experimental approach. The records of the experimental group and the records of the control group were published on two separate websites[2]. The websites were directly submitted to the webmaster tools of Google and Bing based on XML sitemaps. The metadata records of DCMI and MARC21 were selected from the subject class “Authors” of the California Digital Library collection and were downloaded from the website www.archive.org, and their equivalent records in MODS were obtained from the website http://lccn.loc.gov. In this research, data gathering was done through structured observation from September to November 2012. A checklist developed by the researcher was used as the data gathering tool. The reason for selecting the three metadata standards studied was their semantics and structure (tree structure and family tree structure), which might influence the indexing and visibility of metadata records by web search engines.
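The submission step relies on XML sitemaps. As a rough illustration only, the following sketch builds a sitemap in the public sitemaps.org format for a handful of record pages. The URLs and the function name are hypothetical and are not the websites or tools used in the study.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap document listing each URL in a <url><loc> entry."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = address
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical record pages, one per metadata standard
record_pages = [
    "https://example.org/records/dc-001.html",
    "https://example.org/records/marc-001.html",
    "https://example.org/records/mods-001.html",
]
sitemap = build_sitemap(record_pages)
```

The resulting file would then be registered through each search engine's webmaster tools, as the methodology describes.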

Each of the three selected metadata standards had an XML-based schema. Three hundred metadata records of the control group were created based on that schema exactly, and were approved and validated by a MARC bibliographic validator and the Styles Studio software. These records were designed based on the research findings of Luk et al. (2000), Farajpahlou and Tabatabaie Amiri (2011), Taheri and Hariri (2012), and Aqa Abedi (2012), which indicate that web search engines adopt the tag discarding approach when indexing XML-based metadata records. The 300 metadata records of the experimental group were developed using the XML data island method. According to this method, the metadata records were embedded in the <pre> tag within the <body> tag of HTML-based content objects, and the less-than and greater-than signs of their tags were replaced by the entity escape characters “&lt;” and “&gt;”. These records preserved the capabilities of their previous format (XML). The name of the <pre> tag is short for “preformatted text”, meaning that any values (or text) stored in that tag preserve the formatting of the source object. Thus, in addition to using the capabilities of the previous formatting, the objects could use the capabilities of HTML (Kyrnin, 2012; Woychowsky, 2003; ExpertRating.com, 2006; Microsoft Developer Network, 2013; Mozilla Developer Network, 2013a; see also www.w3schools.com and www.htmlquick.com/reference/tags/pre.html). This method of creating metadata records was used creatively in the research.
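The embedding step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' tooling: the function name, the page skeleton, and the sample record values are hypothetical. Note that `html.escape` also escapes the ampersand, which the text does not mention but which is needed for the escaped record to decode back to the original.

```python
import html


def to_data_island(xml_record, page_title="Metadata record"):
    """Wrap an XML record in an HTML page, escaped inside <pre> within <body>."""
    # quote=False escapes only &, < and >, leaving attribute quotes readable
    escaped = html.escape(xml_record, quote=False)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        '<head><meta charset="utf-8"><title>{}</title></head>\n'
        "<body>\n"
        "<pre>{}</pre>\n"
        "</body>\n"
        "</html>\n"
    ).format(html.escape(page_title), escaped)


# Hypothetical Dublin Core record; the element values are illustrative only
record = (
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">\n'
    "  <dc:title>Example title</dc:title>\n"
    "  <dc:creator>Tim Berners-Lee</dc:creator>\n"
    "</metadata>"
)
page = to_data_island(record)
```

Saving `page` with the file extension “.html” gives a content object in which the tag names are ordinary text and the line breaks and indentation of the record are kept by the <pre> tag.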

The next step was to select some appropriate search engines to study their reaction to the metadata records. As stated in many authentic resources, the search engines Google and Bing are the most frequently used search engines on the web (ComScore, 2012; Alexa, 2013; Campex, 2013; Lewis, 2013). Therefore, Google and Bing were selected in this work as the target search engines.

Data analysis

To examine the hypothesis that states that embedding XML-based metadata records in the <pre> tag of HTML based on the XML data island method provides the possibility of indexing and visibility of metadata element tag names by web search engines, the research used a binomial test. The reasons for selecting the test were the experimental approach of the research, dichotomous variables with binomial distribution, and pre-specified probability (the possibility of the indexability and visibility of the element tag names or not).

The data shown in Table I confirm the hypothesis of the research. The two categories (experimental and control groups) are equally likely to occur, that is, the test proportion is 0.50. The observed proportion for the experimental group is 1.00, and for the control group it is 0.00. Finally, the asymptotic significance for the experimental group is less than 0.001. The findings indicate a significant difference between the indexability and visibility of the element tag names of the DCMI, MARC21, and MODS metadata records in the experimental group and the control group. In other words, the search engines Google and Bing only indexed the element tag names of the experimental group's metadata records, and have presented them in their search results, but this is not true for the control group. Accordingly, embedding XML-based metadata records in the <pre> tag of HTML based on the XML data island method improves the indexing and visibility of metadata element tag names by web search engines.
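To see why a significance below 0.001 is plausible, the binomial test can be recomputed directly. The sketch below assumes that each group of 300 records is tested on its own against the test proportion 0.50; the exact setup of the original test is not stated beyond the figures above, and the function name is hypothetical.

```python
from math import comb


def binomial_two_sided_p(successes, n, p=0.5):
    """Exact two-sided p-value: total probability of outcomes that are
    no more likely than the observed count under Binomial(n, p)."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[successes]
    # small tolerance so that ties with the observed probability are included
    return sum(q for q in pmf if q <= observed * (1 + 1e-9))


# Assumed counts: all 300 experimental records show indexed tag names
# (observed proportion 1.00) and none of the 300 control records do (0.00).
p_experimental = binomial_two_sided_p(300, 300)
p_control = binomial_two_sided_p(0, 300)
```

Under these assumptions the exact p-value is 2 × 0.5^300, many orders of magnitude below 0.001, which agrees with the significance reported for the experimental group.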

The indexing and visibility of the metadata element tag names by web search engines leads to improved retrieval and better access to the metadata records, and consequently to web content objects. Based on the result of the test conducted as regards the research hypothesis, using the <pre> tag to embed the XML-based metadata records based on the “XML data island method” results in better indexing and visibility of the element tag names of metadata records, and even the tag attributes (such as the field number and their indicators in MARC21 records). It greatly influences the improvement of retrieval and an increase of precision and relevance through web search engines. Also, the most frequently used internet browsers (such as Internet Explorer, Opera, Safari, Google Chrome, Firefox, etc.) fully support content objects containing the <pre> tag (Mozilla Developer Network, 2013a; Kyrnin, 2013; see also www.w3schools.com/tags/tag_pre.asp), and apply the previous format capability of the object meaningfully while displaying them. This approach to indexing the element tag names by web search engines is the best way for the indexability of XML-based content objects (Luk et al., 2000; Qin, 2000; Luk et al., 2002; Gigee, 2006; Gill, 2008; Taheri and Hariri, 2012).

On the other hand, the indexing and visibility of element tag names as index terms leads to an improvement in the accessibility of metadata records and content objects by web search engines. Also, it changes the role of these engines to that of an information gateway that guarantees the quality of the searchable content, since the indexed content will be among the most reliable web information systems related to the cultural heritage communities (library, archive, and museum) and within the metadata records. Moreover, the possibility of element-based searches improves precision in retrieving metadata records by web search engines. Thus, designing and creating metadata records according to the method developed in this study will create added value for metadata systems. Although only two web search engines were examined in this research, other studies show that most web search engines lack robot software, and they use the results of Google and Bing to provide the necessary information for their users (Wikipedia, 2012). Regarding the findings of this research, it should be noted that by studying three reliable metadata standards with different structures – i.e. the family tree structure and tree structure, as well as language-based (DCXML and MODS) and non-language-based (MARCXML) tag names – and publishing 100 records based on each standard and the methods developed in the research, and finally the results obtained from the hypothesis test, it can be concluded that it is possible to generalize the findings regarding interoperability among metadata systems and web search engines.

Conclusion

Comparing the findings of this research with previous research reveals the fact that the method developed in creating the metadata records of the experimental group is regarded as a step towards improving the accessibility of metadata records through web search engines. In those studies in which the HTML was adopted as a syntax for embedding the metadata records, a lower level of indexability and visibility of metadata elements through web search engines was experienced. Web search engines only indexed the value of the HTML metatags and they acted indifferently towards the values of standard metadata elements such as Dublin Core. This lack of reaction to the standard element tag names, and even HTML metatag names, was aggravating (Turner and Brackbill, 1998; Sokvitne, 2000; Henshaw and Valauskas, 2001; Quevedo-Torrero, 2004; Zhang and Dimitroff, 2004, 2005a, b; Safari, 2005; Mohamed, 2006; Sharif, 2007). According to the results of studies carried out by Farajpahlou and Tabatabaie Amiri (2011), Taheri and Hariri (2012), and Aqa Abedi (2012), indexability and visibility of element tag names of metadata records implemented in the XML format were not possible, and the spider-indexer software of web search engines indexed only the values of their elements, or they did not make the related tag names visible in the search results.

In this section of the article, considering the clear effect of the method developed in this research for creating metadata records with indexable and visible element tag names, and so that the audiences of this research (that is, the creators of the metadata records and the end users) can benefit from its advantages, four patterns are presented in the metadata records format. The first two patterns are for the creators of metadata records, and the other two are for end users.

Patterns for the creators of the metadata records

Pattern 1: A DCMI simple record embedded in the <pre> tag and with the file extension “.html”[3]

The display of the records will be normal in the Internet Explorer, Opera, Netscape, Safari, Google Chrome and Firefox browsers (Mozilla Developer Network, 2013a; Kyrnin, 2013; see also www.w3schools.com/tags/tag_pre.asp). This is the case in all the other patterns designed for the creators of the metadata records.

As mentioned earlier, all web browsers (including Internet Explorer, Opera, Safari, Netscape, Google Chrome, Firefox, etc.) are able to display the entity escape characters normally, and the created records do not differ from other records in their display. The patterns that are presented based on the XML data island method are mainly the XML-based metadata record, which is embedded within the <pre> tag, with its file extension changed to “.html”. Due to the capabilities of the <pre> tag mentioned earlier, the embedded records preserve the structure and characteristics of the XML format and also benefit from the capabilities of the new format (HTML). However, the most important characteristic in the sample record of this pattern is the use of the “&lt;” and “&gt;” signs, respectively, instead of the “less than” (“<”) and “greater than” (“>”) signs for the element tag names while designing the records. These signs cause web search engines not to regard them as tags while crawling and indexing them, and to index them as index terms, and to make them visible among their results. It is interesting to note that web browsers, unlike web search engines, easily identify these signs, and replace them with the “less than” (“<”) and “greater than” (“>”) signs. These characters, which are known as entity escape characters, are recognized by all the parsers of XML-based content objects, and are treated as the “<” and “>” signs. The sample Dublin Core element tag names &lt;dc:title&gt;&lt;/dc:title&gt; are displayed in browsers as follows: <dc:title></dc:title>.
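The decoding that a browser performs on these characters can be imitated with a standard library call. The sketch below is only an illustration: it decodes an escaped Dublin Core element and then parses the result as XML, to show that the record's XML structure is recoverable from the escaped text. The namespace declaration and the element value are added here as assumptions so that the fragment parses on its own.

```python
import html
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Escaped form, as it would be stored inside the <pre> tag (illustrative value)
escaped = (
    '&lt;dc:title xmlns:dc="http://purl.org/dc/elements/1.1/"&gt;'
    "Example title&lt;/dc:title&gt;"
)

# What a browser displays: the entity references become < and >
displayed = html.unescape(escaped)

# The displayed text is well-formed XML again and can be parsed as such
element = ET.fromstring(displayed)
tag_name = element.tag   # namespace-qualified tag name
value = element.text     # element value
```

Here `displayed` begins with `<dc:title`, and `tag_name` is the Dublin Core `title` element qualified by its namespace.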

The indexability of the element tag names by web search engines makes it possible to do element-based searches by them, and improves the precision of the retrieval process; however, due to the limited use of semantic systems or controlled vocabularies by web search engines, for example the lack of use of the name and subject authority file (or list), and the preferred forms of names and terms in the bibliographic metadata records, the capability of retrieving the records desired by users is not possible in other forms of the names and terms.

Therefore, web search engines do not benefit from the recall provided by the semantic systems. According to the method developed in this research, in order to create records with indexable tag names that can be presented in the search results, the XML-based authority metadata records can be embedded in the bibliographic metadata records. In this case, collection records will be created that include not only the tag names and values of the bibliographic elements, but also the tag names and values of the authority elements related to the bibliographic record. So they have the characteristic of being indexed by web search engines and becoming visible in their search results. Accordingly, the metadata records will be retrievable based on other names and terms. This characteristic will also create added value for the interoperability among metadata systems and web search engines.

An example of Pattern 1 is shown in Figure 1.

Pattern 2: A DCMI sample collection record, a combination of a bibliographic record and its related authority record in Metadata Authority Description Schema (MADS), embedded in the <pre> tag and with the file extension “.html”

It can be observed in the presented pattern that not only the bibliographic record created based on the Dublin Core Metadata Initiative but also its related subject authority record, extracted from the subject authority of LCSH and based on the Metadata Authority Description Schema (MADS), has been embedded in the collection record. The method used to create the collection record is technically reliable, and the record is indexable and visible to web search engines. Obviously, this is possible for the name authority records and the bibliographic records based on other metadata standards. It should be noted that, due to the special characteristics of the HTML format, the authority record is embedded within the collection record in a <pre> tag separate from the <pre> tag of the bibliographic record, and it has all the capabilities of the bibliographic record. In regard to adding the subject authority records to the collection records, it is essential to mention that, in addition to the non-preferred terms, the top, broader, narrower, and related terms are also registered in the subject authority records. So, when embedding them within the collection records, the elements that include those related terms (except the non-preferred terms) must be omitted from the record. Otherwise, searching the related terms will retrieve irrelevant records. Generally, some of the most significant practical aspects of the proposed patterns for the creators of the metadata records include:
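The construction of such a collection record might be sketched as follows. This is an illustration under assumptions, not the authors' procedure: the sample records are invented, the function names are hypothetical, and the MADS element names used (`authority`, `variant` for non-preferred terms, `related` for broader, narrower, and related terms) should be checked against the MADS schema before use.

```python
import html
import xml.etree.ElementTree as ET

MADS_NS = "http://www.loc.gov/mads/v2"
ET.register_namespace("mads", MADS_NS)  # keep the "mads:" prefix when serializing


def strip_related_terms(authority_xml):
    """Drop <mads:related> children; keep preferred and non-preferred terms."""
    root = ET.fromstring(authority_xml)
    for related in root.findall("{%s}related" % MADS_NS):
        root.remove(related)
    return ET.tostring(root, encoding="unicode")


def collection_record(bibliographic_xml, authority_xml):
    """Embed each record in its own <pre> tag, with markup escaped."""
    parts = [bibliographic_xml, strip_related_terms(authority_xml)]
    blocks = "\n".join(
        "<pre>%s</pre>" % html.escape(part, quote=False) for part in parts
    )
    return "<html>\n<body>\n%s\n</body>\n</html>\n" % blocks


# Invented sample records, for illustration only
bibliographic = "<dc:subject>Metadata</dc:subject>"
authority = (
    '<mads:mads xmlns:mads="http://www.loc.gov/mads/v2">'
    "<mads:authority><mads:topic>Metadata</mads:topic></mads:authority>"
    '<mads:variant type="other"><mads:topic>Data about data</mads:topic></mads:variant>'
    '<mads:related type="broader"><mads:topic>Information organization</mads:topic></mads:related>'
    "</mads:mads>"
)
page = collection_record(bibliographic, authority)
```

In this sketch a search for the non-preferred term still reaches the record, while the broader term is removed so that it cannot cause irrelevant retrieval.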

it is possible to implement metadata records based on other metadata standards of the cultural heritage community that have adopted XML as their main syntax or one of their syntaxes, according to the patterns presented in this research for designing and creating metadata records; and

organizations and centers for information services around the world can create their metadata records solely on the basis of the proposed patterns, or they can prepare a version based on this method and have it indexed by web search engines.

In addition to the XML data island method, there are other useful methods for offering metadata to web search engines, such as schema.org, Microdata, and RDFa. These methods focus on adding a set of attributes to the tags of HTML, XHTML, and XML documents in order to embed metadata within the web content. That is, they use HTML attributes instead of elements to provide metadata (Wikipedia, 2013; see also schema.org). This research nevertheless used the XML data island method for creating metadata, for several reasons. First, the attributes used with schema.org, Microdata, and RDFa are pre-defined (such as itemscope, itemtype, and datatype) and differ from the elements of metadata standards such as MARC, DCMI, and MODS, which are used more widely for organizing the content objects of the cultural heritage communities (libraries, archives, and museums). So it is not possible, semantically or technically, to use the elements of the metadata standards as attributes of schema.org, Microdata, or RDFa, or to implement in HTML format metadata standards such as MARC, which have a large number of elements and a hierarchical semantic structure. However, schema.org and its initial format, i.e. Microdata, are based on HTML5.

Second, the metadata standards have adopted XML as their main syntax in order to benefit from its capabilities (extensibility, hierarchical structure, and self-description), and thus the metadata records studied in this research were based on XML. On the other hand, web search engines favor HTML over XML. With the XML data island method, it is possible to embed XML-based metadata records in HTML. Furthermore, using the <pre> tag allows XML-based metadata records to preserve their XML capabilities. The most important reason is that this research emphasizes the indexability and visibility of metadata element tag names in web search engines. Only the use of the data island method and the <pre> tag, together with the “&lt;” and “&gt;” character entities, makes it possible to create metadata element tag names that are indexable by and visible to web search engines. Web search engines do not index the attributes of schema.org, Microdata, and RDFa as index terms. Accordingly, it seems that the data island method is more compatible with the knowledge organization approaches of cultural heritage communities.

An example of Pattern 2 is shown in Figure 2.
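The structure described for Pattern 2 can be sketched in Python as follows: the bibliographic record and the authority record each go into their own <pre> tag, and the elements carrying related terms are dropped from the authority record before embedding. This is only a sketch under stated assumptions. Both sample records are hypothetical, and mapping "related terms" to the MADS `related` element and "non-preferred terms" to the MADS `variant` element is an assumption made here for illustration, not a rule stated in this research.

```python
import html
import xml.etree.ElementTree as ET

MADS_NS = "http://www.loc.gov/mads/v2"

# Hypothetical bibliographic record; values are illustrative only.
DC_RECORD = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample title</dc:title>
  <dc:subject>Metadata</dc:subject>
</metadata>"""

# Hypothetical authority record; headings are illustrative only.
MADS_RECORD = f"""<mads:mads xmlns:mads="{MADS_NS}">
  <mads:authority><mads:topic>Metadata</mads:topic></mads:authority>
  <mads:variant type="other"><mads:topic>Data about data</mads:topic></mads:variant>
  <mads:related type="broader"><mads:topic>Information organization</mads:topic></mads:related>
</mads:mads>"""


def strip_related_terms(mads_xml: str) -> str:
    """Remove top-level <mads:related> elements; keep authority and variant."""
    ET.register_namespace("mads", MADS_NS)
    root = ET.fromstring(mads_xml)
    # findall returns a list, so removing children while looping is safe.
    for related in root.findall(f"{{{MADS_NS}}}related"):
        root.remove(related)
    return ET.tostring(root, encoding="unicode")


def build_collection_page(bib_xml: str, authority_xml: str) -> str:
    """Embed each record in its own <pre> tag with escaped angle brackets."""
    records = [bib_xml, strip_related_terms(authority_xml)]
    pre_blocks = "\n".join(
        f"<pre>{html.escape(record, quote=False)}</pre>" for record in records
    )
    return f"<!DOCTYPE html>\n<html>\n<body>\n{pre_blocks}\n</body>\n</html>\n"


collection_page = build_collection_page(DC_RECORD, MADS_RECORD)
```

In this sketch the non-preferred term stays searchable, while the broader term no longer appears in the page and so cannot cause the irrelevant retrieval described above.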

Patterns for designing the search strategies are suggested to end users in the following.

Pattern 3: Sample search strategy for performing an element-based search in the Google and Bing search engines

The first part of the third pattern (Figure 3) contains the tag name of the target element. Quotation marks (“ ”) are used when the tag name is a phrase or when the search is specific to the attributes. To improve the accuracy of the search, the tag names can be placed inside the quotation marks both before and after the key word or key phrase of the searchable value, provided that the user knows the exact form of the phrase; alternatively, the user can put the operator (*) in place of key words that he/she does not know but is aware exist in the key phrase. The second part contains the key word related to the element value. The third part is used when the user wants to limit the search results to a particular website or domain. Finally, the fourth part limits the retrieval results to the file format of the metadata record. The sign “[ ]” means that using these operators is optional.
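The four parts described for the third pattern can be sketched as a small query builder in Python. The exact layout of Figure 3 is not reproduced here, so this is only a sketch under stated assumptions: the function names, the sample tag name and key phrase, and the domain example.org are hypothetical, and the `site:` and `filetype:` operators are assumed to stand for the optional third and fourth parts.

```python
from typing import Optional


def quote_if_phrase(term: str) -> str:
    """Wrap a term in double quotes when it contains whitespace."""
    return f'"{term}"' if " " in term else term


def build_element_query(
    tag_name: str,
    keyword: str,
    site: Optional[str] = None,
    filetype: Optional[str] = None,
) -> str:
    """Assemble: tag name, key word, then the optional site and file format limits."""
    parts = [quote_if_phrase(tag_name), quote_if_phrase(keyword)]
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


query = build_element_query(
    "dc:title", "data island", site="example.org", filetype="html"
)
print(query)  # dc:title "data island" site:example.org filetype:html
```

Leaving out `site` and `filetype` corresponds to omitting the optional parts marked with “[ ]” in the pattern.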

Pattern 4: Sample search strategy for doing the element-based search using the “AND” operator in the Google and Bing search engines

Regarding the possibility of combining some elements or their values with each other, a fourth pattern is proposed (Figure 4). Some parts of this pattern are similar to the third pattern; the only difference is that the sign “( )” is added in order to group the keywords with their operators. The operators AND, OR, and NOT perform the combination.
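The grouping described for the fourth pattern can be sketched in Python as follows. The exact layout of Figure 4 is not reproduced here, so this is only a sketch under stated assumptions: the function names, the sample clauses, and the domain example.org are hypothetical, and how each search engine interprets the AND, OR, and NOT operators has not been verified for this example.

```python
from typing import Optional, Sequence

ALLOWED_OPERATORS = ("AND", "OR", "NOT")


def group_clauses(clauses: Sequence[str], operator: str = "AND") -> str:
    """Join element-based clauses with one operator and wrap them in ( )."""
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    if not clauses:
        raise ValueError("at least one clause is required")
    return "(" + f" {operator} ".join(clauses) + ")"


def build_combined_query(
    clauses: Sequence[str],
    operator: str = "AND",
    site: Optional[str] = None,
    filetype: Optional[str] = None,
) -> str:
    """Grouped clauses first, then the optional site and file format limits."""
    parts = [group_clauses(clauses, operator)]
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


combined = build_combined_query(
    ["dc:creator Taheri", "dc:subject metadata"], site="example.org"
)
print(combined)  # (dc:creator Taheri AND dc:subject metadata) site:example.org
```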

Using the patterns presented in this article will lead to the indexability and visibility of metadata element tag names in web search engines, and to the retrieval of records by end users through element-based search. Accordingly, improved interoperability between metadata systems and web search engines will increase users' satisfaction with web retrieval and storage systems. Creators of metadata records can easily, and without great expense of time, use the first and second patterns either as the main basis for creating records or for preparing other versions of existing records. End users can also obtain more relevant records by designing search strategies based on the third and fourth patterns. Finally, further studies of methods for creating metadata records with tag names that are indexable by and visible to web search engines would be desirable, in order to compare the performance and ease of use of each method and to offer different options to the creators of metadata records.

Figure 1 A DCMI simple record embedded in the <pre> tag and with the file extension “.html”

Figure 2 A DCMI sample collection record, a combination of a bibliographic record and its related authority record in MADS, embedded in the <pre> tag and with the file extension “.html”

Figure 3 Sample search strategy for performing an element-based search in the Google and Bing search engines

Figure 4 Sample search strategy for doing the element-based search using the “AND” operator in the Google and Bing search engines

Table I Binomial comparison of the indexability and visibility of the metadata element tag names of DCMI, MARC21, and MODS of the experimental and control groups by Google and Bing

Notes

1. The research of Aqa Abedi (2012) and Farajpahlou and Tabatabaie Amiri (2011) was based on the suggestions of the MA thesis of Taheri (2010).

2. See www.tagnamemeta-ex1.ir and www.tagnamemeta-cont.ir

3. It should be noted that the records of the patterns based on MARC21 metadata format and Metadata Object Description Schema presented in the framework of the research also look like the first pattern.

References

Alexa (2013), "Global Top 500", Alexa: The Web Information Company, available at: www.alexa.com/site/ds/top_sites?ts_mode=global&lang=none (accessed June 14, 2013).

Aqa Abedi, E. (2012), "The effect of syntax on the indexing & ranking of metadata records by the web search engine: a comparative study on MARCXML and DCXML metadata records", Science and Research Branch, Islamic Azad University, Tehran, unpublished Master's thesis.

Campex (2013), "Top search engines", available at: http://capmex.biz/resources/top-search-engines (accessed June 14, 2013).

ComScore (2012), "ComScore releases May 2012 US search engine rankings", available at: www.comscore.com/Press_Events/Press_Releases/2012/6/comScore_Releases_May_2012_U.S._Search_Engine_Rankings (accessed June 14, 2013).

Dehinbo, J.O. (2006), "Towards a framework for determining a platform for teaching web application development in tertiary institutions in South Africa", available at: http://uir.unisa.ac.za/bitstream/handle/10500/760/dissertation.pdf?sequence=1 (accessed July 14, 2013).

ExpertRating.com (2013), "XML tutorial: embedding XML in HTML", available at: www.expertrating.com/courseware/XMLCourse/XML-Embedding-HTML-8.asp (accessed February 7, 2013).

Farajpahlou, A.H., Tabatabaie Amiri, F. (2011), "How are XML-based MARC 21 and Dublin Core records indexed and ranked by general search engines in dynamic online environments?", Aslib Proceedings, Vol. 63 No.6, pp.586-592.

Gigee, G. (2006), "MARC and MARCXML", available at: http://threegee.files.wordpress.com/2006/05/marcxml.pdf (accessed November 5, 2012).

Gill, T. (2008), "Metadata and the web: introduction to metadata", available at: www.getty.edu/research/publications/electronic_publications/intrometadata/metadata.pdf (accessed November 5, 2012).

Google (2013), "Web master tools: meta tag", available at: www.google.com/support/webmasters/bin/answer.py?answer=79812 (accessed November 7, 2012).

Grehan, M. (2002), "How search engines work", available at: http://cms.searchenginewatch.com/digital_assets/3859/how-search-engines-work-mike-grehan.pdf (accessed June 5, 2012).

Grimnes, G.A. (2008), "A goal directed learning agent for the semantic web", available at: www.dfki.uni-kl.de/~grimnes/papers/grimnes_thesis_final.pdf (accessed July 13, 2003).

Henshaw, R., Valauskas, E.J. (2001), "Metadata as a catalyst: experiments with metadata and search engines in the internet journal", First Monday, available at: www.librijournal.org/pdf/1999-3pp125-131.pdf (accessed December 19, 2012).

Isaksen, L. (2011), "Archaeology and the semantic web", available at: http://eprints.soton.ac.uk/196571/1.hasCoversheetVersion/y_gao_PhD_thesis_0111.pdf (accessed July 14, 2013).

Kim, J., Lee, S., Hwang, S.-W. (2013), "Enriching documents with examples: a corpus mining approach", available at: www.cse.ust.hk/~hunkim/papers/kim-tois2013.pdf (accessed July 14, 2013).

Kyrnin, J. (2012), "What is pre-formatted text?", available at: http://webdesign.about.com/od/htmltags/f/blfaqpre.htm (accessed November 14, 2012).

Kyrnin, J. (2013), "<pre></pre>", available at: http://webdesign.about.com/od/htmltags/p/bltags_pre.htm (accessed August 29, 2013).

Lewis, E. (2013), "Top ten search engines", available at: www.seoconsultants.com/search-engines/ (accessed June 14, 2013).

Luk, R., Chan, A., Dillon, T., Leong, H.V. (2000), "A survey of search engines for XML documents", available at: http://www.haifa.il.ibm.com/sigir00-xml/final-papers/Luk/XMLSUR.htm (accessed December 14, 2012).

Luk, R., Leong, H.V., Dillon, T.S., Chan, A.T.S., Croft, W., Bruce, A.J. (2002), "A survey in indexing and searching XML documents", available at: http://onlinelibrary.wiley.com/doi/10.1002/asi.10056/full (accessed December 14, 2012).

Microsoft Developer Network (2013), "XML data islands", available at: http://msdn.microsoft.com/en-us/library/windows/desktop/ms766512(v=vs.85).aspx (accessed February 14, 2013).

Mixter, J. (2013), "Linked data in VRA Core 4.0: converting VRA XML records into RDF/XML", available at: http://jmixter.s3-website-us-east-1.amazonaws.com/thesis/LinkedDataInVRACore4.pdf (accessed July 14, 2013).

Mohamed, K.A.F. (2006), "The impact of metadata in web resources discovering", Online Information Review, Vol. 30 No.2, pp.155-167.

Mozilla Developer Network (2013a), "<pre>", available at: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/pre (accessed August 28, 2013).

Mozilla Developer Network (2013b), "Using data islands in Mozilla", available at: https://developer.mozilla.org/en/docs/Using_XML_Data_Islands_in_Mozilla (accessed August 28, 2013).

Powell, T.A. (2010), HTML and CSS: The Complete Reference, McGraw-Hill Osborne, New York, NY, Complete Reference Series.

Qin, J. (2000), "Representation and organization of information in the web space: from MARC to XML", available at: http://inform.nu/Articles/Vol3/v3n2p83-88.pdf (accessed December 14, 2012).

Quevedo-Torrero, J.U. (2004), "Improving web retrieval by mining the HTML tags for keywords and exploring the hyperlink structures of web pages", Department of Computer Science, University of Houston, Houston, TX, PhD dissertation.

Safari, M. (2005), "Search engine and resource discovery on the web: is Dublin Core an impact factor", available at: www.webology.ir/2005/v2n2/a13.html (accessed December 5, 2012).

Search Engine Watch (2007), "How search engines work", available at: http://searchenginewatch.com/article/2065173/How-Search-Engines-Work (accessed September 5, 2012).

Sharif, A. (2007), "Study the effectiveness of metadata elements on web page visibility in public search engines", available at: http://eprints.rclis.org/handle/10760/9171#.UHPNcVG94hA (accessed December 7, 2012).

Sokvitne, L. (2000), "An evaluation of the effectiveness of current Dublin Core metadata for retrieval", available at: www.vala.org.au/vala2000/2000pdf/Sokvitne.PDF (accessed September 14, 2012).

Taheri, S.M., Hariri, N. (2012), "A comparative study on the indexing and ranking of the content objects including the MARCXML and Dublin Core's metadata elements by general search engines", The Electronic Library, Vol. 30 No.4, pp.480-491.

Taheri, S.M., Hariri, N., Fattahi, S.R. (2013), "Interoperability between metadata systems and web search engines: a review article", submitted for publication.

Turner, T.P., Brackbill, L. (1998), "Rising to the top: evaluating the use of the HTML META tag to improve retrieval of world wide web documents through internet search engines", Library Resources & Technical Services, Vol. 42 No.4, pp.258-271.

Wikipedia (2012), "Web search engine", available at: http://en.wikipedia.org/wiki/Search_engines (accessed December 7, 2012).

Wikipedia (2013), "RDFa", available at: http://en.wikipedia.org/wiki/RDFa (accessed July 7, 2013).

Woychowsky, E. (2003), "XML data islands offer a useful mechanism to display web data", available at: http://www.techrepublic.com/article/xml-data-islands-offer-a-useful-mechanism-to-display-web-form-data/1058668 (accessed December 19, 2012).

Xin, W. (2006), "XML specification of GUI", available at: http://etd.dtu.dk/thesis/191676/imm4498.pdf (accessed July 14, 2013).

Yahoo (2013), "What are meta tags?", available at: http://help.yahoo.com/l/us/yahoo/smallbusiness/promotion/meta/meta-01.html (accessed September 5, 2012).

Zhang, J., Dimitroff, A. (2004), "Internet search engine's response to metadata Dublin Core implementation", Journal of Information Science, Vol. 30 No.4, pp.310-320.

Zhang, J., Dimitroff, A. (2005a), "The impact of metadata implementation on webpage visibility in search engine result (Part II)", Information Processing & Management, Vol. 41 No.3, pp.691-715.

Zhang, J., Dimitroff, A. (2005b), "The impact of webpage content characteristics on webpage visibility in search engine result (Part I)", Information Processing & Management, Vol. 41 No.3, pp.665-690.

About the authors

Sayyed Mahdi Taheri holds a PhD in Library and Information Sciences and is an Assistant Professor. His research interests are mainly in the areas of networked knowledge organization systems (NKOS) and metadata, and he has published books and articles in these areas. He has been a member of the Iranian Library and Information Science Association's Administrative Board since 2009. Sayyed Mahdi Taheri is the corresponding author and can be contacted at: [email protected]

Nadjla Hariri holds a PhD in Library and Information Sciences and is Associate Professor at the Department of Library and Information Science, Science and Research Branch, Islamic Azad University, Tehran, Iran. Her major research experiences and interests include information science, research methods, information retrieval systems, and performance evaluation of libraries, whether traditional or modern, such as digital libraries.

Sayyed Rahmatollah Fattahi is a Full Professor at the Department of Library and Information Sciences, Ferdowsi University of Mashhad, Iran. He has a BA degree in English Language and Literature, and an MS in Library and Information Science. He was awarded his PhD by the University of New South Wales, Sydney, Australia (1997). Dr Fattahi's research interests cover information organization, knowledge organization, information retrieval, and human-computer interaction. He has published papers in international journals and has attended a number of international conferences. Dr Fattahi was the President of the Iranian Library and Information Science Association during 2000-2003 and 2006-2009.