
Navigating Multi-Level Big Metadata for Scientific Collaboration Network Analysis

Sarah BrattSyracuse University, United States. [email protected]

Jian QinSyracuse University, United States. [email protected]

Jeff HemsleySyracuse University, United States. [email protected]

ABSTRACT
Metadata in publication and data repositories contain multiple levels, reflecting the contents and the information retrieval needs of the communities they serve. The multi-level metadata creates opportunities for scientometric researchers and data-driven science policymaking by providing a rich, many-layered data source for macroscopic and mesoscopic views of scientific knowledge diffusion and research impact. In recent studies, big metadata analytics (BMA) researchers leverage novel big metadata sources, such as open research data repositories, to conduct longitudinal statistical analyses of the growth and decline of science communities, scientific knowledge diffusion, and research impact. However, using the multi-level metadata gleaned from research data repositories requires domain knowledge not only of the scientific subject matter but also of the technical standards and repository-specific implementations of the metadata to provide an accurate interpretation of longitudinal scientific collaboration network analyses. In this paper, we propose a step-by-step framework for navigating multi-level metadata in the context of open research data repositories. We illustrate the framework with a use-case from a big metadata analytics project on the evolution of the mus musculus collaboration network from 1979-2013. The framework and use-case make an important methodological and theoretical contribution to computational workflows in big metadata analytics as multi-level big metadata are increasingly used for academic, industry, and government research.

KEYWORDS
Big metadata analytics, multi-level metadata, methodology, scientometrics.

INTRODUCTION
Metadata contain multiple levels and different kinds of information. They enable discovery, facilitate user searches, contextualize documents, and can be used as a data source for data-intensive, domain-specific research and applications. Multi-level metadata refers to types of metadata defined by functionality to accommodate diverse user needs (Houssos, Jörg, & Matthews, 2012). The model developed by Houssos et al. (2012) defines three tiers: Level 1, discovery metadata; Level 2, usage metadata; and Level 3, domain metadata. Their multi-level metadata framework was developed to meet the need to manage the vast array of datasets produced by governmental organizations, which are valuable assets for economic and social science research.

Likewise, metadata in large-scale cyberinfrastructure (CI)-enabled open research repositories and related services perform similar functions at multiple levels. The metadata in scientific data repositories such as GenBank at the National Center for Biotechnology Information (NCBI) and the DNA Data Bank of Japan (DDBJ) have grown rapidly since their inception. As the nucleic sequence databases grow exponentially in use, so too does the metadata accompanying dataset submissions. As of June 2018, there have been over 209 million unique sequence submissions to GenBank (Benson et al., 2013; NLM, 2015). NCBI reports that "on a typical day, researchers download about 30 terabytes of data from NCBI in an effort to make discoveries," serving some 3.3 million users (NLM, 2015).

This massive quantity of metadata is known as big metadata, referring to "the structured or semi-structured descriptions of scientific data stored in data repositories" (Bratt et al., 2017). Big metadata conform to standards created and recognized by a research community. For example, the KNB repository contains ecosystems data with metadata descriptions based on the Ecological Metadata Language (EML), the standard on which the data-submission tool "Morpho" was developed (Fegraus, Andelman, Jones, & Schildhauer, 2005). Collecting user-contributed data through specialized tools is an approach adopted by many research data repositories. For scientometric researchers and science policymakers, big metadata offer a novel source for mapping, visualizing, quantifying, evaluating, and studying the scientific enterprise. Moreover, metadata from research repositories are at a scale and breadth that enables large-scale computational analysis (Hood & Wilson, 2001). Researchers in information science have recognized the value of quantitative studies of science using classical and Bayesian statistical techniques (Newman, 2001), data visualization and mapping (Boyack, Klavans, & Börner, 2005; Rosvall, Axelsson, & Bergstrom, 2009), artificial intelligence (Van Raan & Tijssen, 1993), and algorithmic and machine learning methods (Walker, Xie, Yan, & Maslov, 2007) to glean insight into the production and function of scientific documents, data products, knowledge diffusion, and research impact.


However, this approach can also be the reason for "messy and complex" big metadata that require well-trained data scientists and costly technical and human resources before they can be used for research analysis (Bratt, Hemsley, Qin, & Costa, 2017). Records from large research repositories are often unstructured, ambiguous, and inconsistent, varying over time and between research disciplines (Qin, Costa, & Wang, 2015). In addition, while the domain metadata seems standardized and well-codified, it changes over time. In legacy systems such as GenBank, data fields, technical sophistication, and data-entry practices have changed over time, complicating the researcher's assumptions about the consistency of data fields and their contents. For example, a database's data input requirements may have undergone significant change. The journal title field may have begun with manual data entry, with the researcher typing the journal title into an annotation record; as technology improves, the data-entry method often changes to a form with greater standardization, such as a drop-down menu from which the researcher selects the journal title from a validated, pre-certified list. The metadata themselves and the research data repository documentation do not always report such technical or structural changes. The only indication big metadata analytics researchers have of these incremental changes comes from inspecting the metadata itself, for example, by observing longitudinal changes such as the existence of misspellings across fields. Likewise, the metadata standards for the fields may have changed to reflect changes in domain-specific requirements. For example, the NCBI taxonomy currently covers only 10% of known and described life species and could undergo changes in coming years ("Home - Taxonomy - NCBI," n.d.).
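One way to surface such incremental changes is to count how many distinct spellings of a field appear per year; a sharp drop hints at a shift from free-text entry to a controlled drop-down list. A minimal Python sketch, where the records, field names, and values are illustrative assumptions rather than GenBank's actual schema:

```python
from collections import defaultdict

def spelling_variants_by_year(records, field):
    """Count distinct (case/whitespace-normalized) values of a field per year."""
    variants = defaultdict(set)
    for rec in records:
        variants[rec["year"]].add(rec[field].strip().lower())
    return {year: len(vals) for year, vals in sorted(variants.items())}

# Invented example records, not actual GenBank annotation data.
records = [
    {"year": 1995, "journal": "J Mol Biol"},
    {"year": 1995, "journal": "J. Mol. Biol."},
    {"year": 1995, "journal": "Journal of Molecular Biology"},
    {"year": 2005, "journal": "Journal of Molecular Biology"},
    {"year": 2005, "journal": "Journal of Molecular Biology"},
]
print(spelling_variants_by_year(records, "journal"))
# Several variants in 1995 (free text) versus one in 2005 (controlled entry)
```

A real analysis would of course need fuzzier matching (abbreviation expansion, edit distance), but even this crude count can localize when a data-entry procedure changed.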

The complexity and messiness due in part to incremental changes in standardization norms for scientific metadata create challenges for big metadata analysis. One of the primary challenges is developing reliable workflows based on consistent strategies for navigating multi-level metadata. Data scientists often need to decide which level of domain metadata serves the research questions. For example, if taxonomic classes are to be used as the slicing criterion, one needs to determine which level of classes is suitable, or whether to drill down into further facets. To this end, they make decisions within the conceptual and technical workflow about how to subset, slice, and filter the various levels of metadata to scope the data and appropriately answer research questions. However, the analytic and interpretive decisions made during this iterative process are notoriously difficult to reproduce in retrospect. Data scientists often operate with one-off, heuristic, 'in-the-moment,' ad hoc decision-making, which is effective in the messy, exploratory, and experimental environment of big metadata. A reliable analytic framework to document, interpret, revisit, reproduce, and communicate findings based on multi-level metadata can accelerate and streamline the process, promote transparency, and save researchers from 're-inventing the wheel.' As reported in big metadata analytic studies of workflows, such a framework has not yet been codified or grounded in empirical metadata analysis.

Research Questions
In this paper, we address two research questions related to navigating multi-level metadata to inform a framework for big metadata analytics (BMA):

RQ1: What are the optimal stages for engaging with the Level 3 domain metadata to ascertain the validity of the domain-level metadata, given the challenges of evolving domain metadata standards and repository data-input systems and procedures?

RQ2: What are the optimal stages for engaging with the Level 3 domain metadata to scope a research community for a longitudinal collaboration network analysis, given the challenge of multiple, often competing definitions of a community (assuming RQ1 has been sufficiently resolved)?

While previous work such as Houssos et al. (2012) and Bratt et al. (2017) offers examples of sub-setting and filtering, it does not examine in detail the decisions made in selecting the level of domain metadata at this stage of the workflow. Previous work on workflows can thus be enriched with detailed descriptions of the steps and decision points involved in sub-setting, slicing, and filtering. In this paper, we address this gap in the literature. Our contribution is a conceptual framework for addressing the challenges of slicing domain metadata for scientometric research that informs social science research and science policy: a legacy system's metadata inconsistencies and ambiguities (RQ1), and the validity of drawing equivalencies when the slicing step confronts the well-known issue of defining a research community or discipline to guide the extraction of network data for longitudinal analysis (RQ2).

The remainder of this paper is structured to address these research questions, with a focus on the big metadata analytics workflow step of negotiating the slicing of domain metadata for longitudinal collaboration network analysis. First, we discuss previous work in bibliometrics and information systems that addresses issues with evolving domain metadata. We situate the study in the relevant literature on the use of thesauri for determining equivalencies and in the social network analysis literature on computationally defining and representing a scientific research community using big metadata.


Next, we describe a use-case from a BMA project in the context of scientific collaboration networks. The data for this project are the metadata from NCBI's GenBank and the NCBI Taxonomy. As indicated above, the NCBI Taxonomy is an exemplar of the Level 3 domain metadata defined by Houssos et al. (2012). Within this analysis, we report preliminary statistical and visualization results from the collaboration network evolution analysis to illustrate the technical and conceptual aspects of the slicing, sub-setting, and filtering step of the workflow. Specifically, we analyze the mus musculus research community over a period of 34 years (1979-2013).

In the discussion section we describe and interpret how the BMA research team can optimally negotiate the two issues of the validity of equivalencies and community-definition ambiguity, from the initial high-level conceptualization of the research question (i.e., how do the collaboration networks of communities of scientists grow, change, and evolve over 34 years?) to the technical tasks of locating the data in the database, extracting, filtering, and cleaning, to plotting and analysis, exploratory visualizations, and testing the distribution of the underlying population statistical output (e.g., network density, centrality measures, network diameter, transitivity). Finally, we discuss the implications of these issues and potential solutions. We conclude with a summary of the primary contributions.

BACKGROUND
The background situates the proposed framework in the science of science literature. We present an overview of exemplary empirical and theoretical studies in three areas relevant to and necessary for situating the conceptual framework and use-case. This background section contains three parts:

1) Workflows and multi-level metadata, with a focus on the 3rd level of domain metadata and the workflow step of ‘sub-setting, slicing, and filtering’;

2) Scientific communication theory in relation to multi-level metadata used to analyze disciplinary divisions;

3) Community detection and longitudinal analysis methods and challenges.

Big Metadata Workflows

Previous research on industry and U.S. government research workflows was conducted by Smith et al. (2014) and extended in the context of big metadata analytic workflows for scientometrics by Bratt et al. (2017). In both domains, big metadata and "big data ecosystems," there is concern with a principled means of managing massive, unsystematic data and metadata. Smith et al. (2014) argue for more systematic approaches to metadata management; the filtering, sub-setting, and slicing step in the big metadata analytic workflow entails the conceptual and technical knowledge work of navigating multi-level metadata. That part of the work requires, as the data science literature suggests, domain knowledge and reflective sense-making with a team through the steps of the process. However, when dealing with big metadata the computations are executed over large numbers of rows, so it is hard to "check your work." The lack of standardized, principled approaches to metadata impedes data sharing and the transparency of sharing computer code and workflow steps (Smith et al., 2014). Bratt et al. (2017) develop a framework for the iterative process of conducting scientometric research using big metadata. The stages in the workflows overlap and involve two types of workflows: conceptual and computational. The connecting mechanism between the conceptual and computational workflows, which continues conterminously with all the workflow steps, is what they term collaborative documentation, a "team-based effort to record goals, rationales, strategies, steps and activities for search, reuse and informational purposes." Corresponding to the conceptual workflows are the computational workflows, which define the machine execution of what has been specified at the conceptual level. During the "data-wrangling" phases, the data are aggregated, filtered, and subset according to the research questions and analytic requirements.
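As a concrete illustration of how a sub-setting step and its collaborative documentation might travel together, consider the following Python sketch. The record fields ("organism", "year", "authors") and the decision-log structure are illustrative assumptions, not GenBank's actual schema:

```python
def subset_records(records, organism, year_range, decision_log):
    """Filter metadata records and record the rationale alongside the result."""
    lo, hi = year_range
    subset = [r for r in records
              if r.get("organism") == organism and lo <= r.get("year", 0) <= hi]
    # Collaborative documentation: keep the criteria and their effect on record
    # counts next to the data, so the decision can be revisited and reproduced.
    decision_log.append({
        "step": "subset",
        "criteria": {"organism": organism, "years": year_range},
        "kept": len(subset),
        "dropped": len(records) - len(subset),
    })
    return subset

# Invented example records, not actual GenBank metadata.
records = [
    {"organism": "Mus musculus", "year": 1995, "authors": ["A", "B"]},
    {"organism": "Homo sapiens", "year": 1995, "authors": ["C"]},
    {"organism": "Mus musculus", "year": 2015, "authors": ["B", "D"]},
]
log = []
mouse_1979_2013 = subset_records(records, "Mus musculus", (1979, 2013), log)
print(len(mouse_1979_2013))  # only the 1995 mouse record falls in the window
```

The point of the sketch is the pairing: every computational slice emits a log entry that a team can review, which is one lightweight way to operationalize collaborative documentation.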

Multi-level metadata enters the big metadata analytic workflow at the computational and conceptual stage pertinent to sub-setting, slicing, and filtering. As Houssos et al. (2012) describe in their multi-level model, the third level is domain metadata. Domain metadata consist of the specific metadata standards for particular types of datasets. Examples include the metadata content within standardized descriptive systems such as the Core Scientific Metadata (CSMD) model developed for scientific datasets (Matthews et al., 2009), the Data Documentation Initiative (DDI) designed for social science data, and the NCBI Taxonomy. These can be used for advanced domain-specific services and tools for particular categories of datasets.

Using Metadata to Analyze the Development of Disciplinary Divisions

There is a well-established history of research inquiry in team science and scientific communication on how scientific communities collaborate and communicate. Fields within these lines of inquiry have measured the maturity and development of an area of scientific inquiry by analyzing the distinct subfields within it. Meadows (1997) states in his classic book Communicating Research that the evidence for a subject or research discipline's growth is the branching and expansion of fields underneath the umbrella of the original field. For example, he cites how all science was once called "natural philosophy" and then branched off as time brought specialization and professionalization. Scientometrics researchers use large publication-based metadata to measure the proliferation and decline of journals in an area of research to study the growth and death of fields. For example, the field of cold fusion had promising beginnings, with many new conferences and journals emerging from new ideas in the field (Meadows, 1997: p. 22-23). However, history of science studies show that over time the conferences diminished and the subdisciplines that initially increased died out when cold fusion was found to rest on problematic and ill-informed assumptions (Meadows, 1997).

In the context of big metadata analytics on scientific collaboration networks, this branching of research disciplines takes the form of multi-level metadata. For example, the mus musculus community is created by sub-setting the taxonomic classes at the genus-species level. This level of granularity is sensible for this study because the research community in question identifies with and collaborates around the house mouse, given its usefulness for research on human disease owing to DNA similarity (homologous genes). Moreover, the community's reliance on one another for collaboration, communication, and data sharing (evidenced by journals, conferences, and databases centered on mus musculus), along with the statistical properties of the co-authorship networks, further supports defining all geneticists who submit data on the genus and species mus musculus as part of the laboratory mouse community.

The Challenge of Community Detection and Longitudinal Analysis

Preliminary findings from the use-case study (described in detail below) show that the co-authorship networks for mus musculus are well connected: the average giant component contains 98% of the nodes (see Findings section). This tightly connected co-authorship network indicates that collaboration ties have been consistently strong within this group of scientists. As Meadows' (1997) findings highlight, a strong indicator of the growth and maturation of a research discipline is the development of specialized branches, sub-categories, and "splinters." Thus, an analysis and awareness of the multiple, dynamic ways in which metadata are structured is important for big metadata analytics researchers in the study of research discipline maturation. Community detection, for example, has multiple, differing definitions of what makes a research community. One definition allows multiple community memberships according to institutional affiliation (Khan & Niazi, 2017). Other approaches define a community as the set of nodes linked by a relationship, called "link-density" community membership (Khan & Niazi, 2017). While the quantitative study of science has focused on computational methods for measuring and mapping scientific specialization and research discipline growth, there have not been systematic analyses of the challenges and opportunities in using different levels of metadata to measure research community development.
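The connectivity statistic cited above, the share of authors in the largest connected component, can be computed directly from an adjacency structure. A minimal stdlib sketch, using an invented toy graph rather than the GenBank-derived network:

```python
from collections import deque

def giant_component_fraction(adj):
    """Fraction of nodes in the largest connected component.

    adj maps each node to the set of its neighbors (an undirected graph).
    """
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search to size this component.
        comp, queue = 0, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp += 1
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        best = max(best, comp)
    return best / len(adj)

# Invented co-authorship graph: authors A-D form one component, E is isolated.
adj = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C"}, "E": set(),
}
print(giant_component_fraction(adj))  # 4 of 5 authors -> 0.8
```

On the real network, the same computation run per year yields the longitudinal giant-component series from which an average such as "98% connected" can be reported.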

FINDINGS

In the use-case below, we describe how big metadata analytics confronts issues of ambiguity, equivalency, and semantic synonymy in the context of a scientific collaboration network community defined by the organism taxonomy. While there are multiple issues of ambiguity when drawing equivalencies, such as in name disambiguation, for this paper we focus on potential and actual ambiguities in classifying a research community. We describe how the discovery of the multiple layers requires domain knowledge (as mentioned above), cognizance of the research questions and goals, and a relatively comprehensive understanding of the possible configurations and re-configurations of the levels of metadata, such that the co-authorship research community may be defined differently, affecting the outcome of the research. As the social network analysis literature has shown, approaches to community detection and analysis vary widely. Many studies address methods for computationally or qualitatively mitigating the challenge of conflicting definitions of a 'research community' as strictly bounded, overlapping, or multi-membership. We argue that the conflicting definitions of the community boundaries of research areas both implicate and conceal the metadata at the center of the analysis, given its role in classifying and ordering entities and relationships in co-authorship networks.

A Conceptual Framework for Negotiating Multi-Level Metadata in the Context of Big Metadata Analytics
When scoping an analysis of community evolution, it is important to select the level of metadata that addresses the (often exploratory) research question. The levels are integrated through Linked Data (Figure 1). The integration of the levels in a multi-level model is valuable because it can connect and aggregate datasets from heterogeneous sources and support data services built on them (Houssos et al., 2012). Further, given the messiness and complexity of dataset metadata, a single-level metadata representation approach is inadequate.



Figure 1: A simplified view of the multi-level metadata model (Houssos et al., 2012).

First, the researcher decides on the level or levels of metadata appropriate for the analysis: Level 1 - Discovery, Level 2 - Usage, and/or Level 3 - Domain. A combination of levels may be used. The process of navigating multi-level metadata can begin early, when working with Level 1 metadata. Figure 2 below describes the framework for navigating the Level 3 domain metadata, beginning with the research question and iterating through three questions that the research team answers within the analytic and collaborative documentation process (Bratt et al., 2017).

Figure 2: A conceptual framework for the iterative research process in multi-level big metadata analysis. The framework begins and operates within the parameters of the research question (indicated by the outlined box in the figure). The steps result in the final circle: the interpretation of findings. Note that the steps may be revisited iteratively, for example during revision and peer or research team review. Three questions direct the steps; the final two are constrained and informed by the available domain metadata (i.e., "Question 2 (Q2): What sub-setting, slicing, and filtering steps should be taken?" and "Question 3 (Q3): Which analysis methods should be employed?"). The available metadata may be incorporated iteratively into previous steps of the research process and can refine and reorient the research question.

The unit of the conceptual framework is the research question (RQ). The RQ is delineated by the black-outlined box, which represents the environment of inquiry. The environment is flexible and informed by external factors (that is, project changes occurring at a higher level than this bounded research, e.g., research funding runs out, grant report deadlines, new research team members) and can be one of many RQs or overlap with other RQs.


The first step in the framework is an assessment of the metadata and a determination of the level or levels of metadata necessary and appropriate to address the RQ. This determination occurs through an exploratory and investigative step through which the researcher(s) becomes familiar with the metadata source selected to address the phenomenon of interest in the overarching project or individual RQ. For example, the Level 3 domain metadata may be selected because the RQ targets the content of the data rather than transactional information, which the Level 1 discovery and Level 2 usage metadata are frequently better suited to address. More detail on this step is discussed and illustrated in the use-case section of this paper.

Once the level or levels of metadata are selected, the researcher or team decides how to subset, slice, or filter the metadata. This process is captured in question 2 (Q2) in the model: what sub-setting, slicing, or filtering should be executed? Is there a portion of the data that serves as a representative sample, a useful case of the phenomenon in question, or an informative anomaly? This step is also an investigative process constrained and informed by the available metadata (see Figure 2). Likewise, question 3 (Q3) follows from the available metadata and the decisions made in Q2 about partitioning the metadata. The analysis methods are constrained and informed by the subset metadata, e.g., the data type (numeric, categorical, or ordinal) and the 'cleanliness' of the data (whether it is standardized in a consistent format and well parsed). At the level of these two questions (Q2 and Q3), new insights may lead to a reorientation or refinement of the research question. For example, the hypothesis on which the question was premised may have assumed a single factor to be tested, while Q2 and Q3 revealed that multiple factors are at play.

Finally, the iterative process culminates in the interpretation of findings. The findings are informed by all steps in the model, including the research question, the level of metadata selected and surrounding knowledge about the data source and domain, the sub-setting, slicing, and filtering decisions made, and the analysis methods. The research is ideally integrated in this way, but we note that the process does not always result in clear and well-defined interpreted findings; it may simply serve to alter or even discard the research question. Nevertheless, if the process is carried through Q3 (the analysis step), the interpretation of findings is the usual "resting point" where iteration ceases and it becomes clear whether the RQ is answered. In the next section, we illustrate the framework in the context of the mus musculus collaboration network analysis. This use-case enables others in big metadata analysis to approach their domain-level metadata with systematic selection and navigation of the metadata levels.

Use case: Mus Musculus Collaboration Network Analysis

For the discussion and use-case presented here, we focus on Level 3, the domain metadata: specifically, the taxonomic records imported and linked to GenBank annotation records. Within this level there are additional levels, according to the well-known "tree of life" classification schema, as well as authorship location data such as Region, Country, State, and City. In decision-making about how to define and subset a community for network analysis, the domain-level data is of chief interest. To illustrate the two issues (1. the ambiguity and inconsistency of metadata records in domain-level legacy repository metadata; 2. the challenge of defining and scoping a network dataset for community analysis), we use the taxonomy, as it is a domain metadata exemplar and is used to define a research community in the context of scientific collaboration network analysis.

NCBI Taxonomy classes as content for metadata datasets

The NCBI Taxonomy Database is a curated classification and nomenclature for all the organisms in the public sequence databases (http://www.ncbi.nlm.nih.gov/taxonomy). The Taxonomy classifies and provides names for known organisms that have been identified in life science studies. The mus musculus branch is at the genus-species level, residing below Rodentia and branching further into sub-species (Figure 3).


Figure 3: The NCBI taxonomy branch of Rodentia where mus musculus is defined in relation to its relatives in the organism hierarchy (source: ("tree.gif (678×372)," n.d.)).

Scoping and defining a research community for network analysis

The structure of the "tree of life" taxonomy is hierarchical. As such, it contains nested and sub-faceted information at the node of each branch (Kwasnik, 1999). For example, mus musculus is the genus-species binomial naming the species within the genus Mus, under the order Rodentia.

Using bibliometric data to model a dynamic research community is not new. Co-authorship networks can be used to analyze changes in co-authorship patterns in research communities over time. In this context, we follow Bozeman and Boardman (2014, p. 2) in defining collaboration as the "social processes in which researchers pool their experience, knowledge, and social skills with the objective of producing new knowledge, including knowledge embedded in technology." By modeling the co-authorships on publications and datasets, it is possible to compare the statistical properties of the collaborations over a defined period of time, e.g., by year.
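To make the modeling step concrete, the sketch below groups undirected co-authorship edges by year from metadata records. The record structure and field names ("year", "authors") are illustrative assumptions, not GenBank's actual schema:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical metadata records: each lists a year and the co-authors
# on one publication or dataset submission (sample data is invented).
records = [
    {"year": 1999, "authors": ["Abe", "Ito", "Sato"]},
    {"year": 1999, "authors": ["Abe", "Chen"]},
    {"year": 2009, "authors": ["Sato", "Chen"]},
]

def coauthorship_edges_by_year(records):
    """Group undirected co-authorship edges by year."""
    edges = defaultdict(set)
    for rec in records:
        # Every unordered pair of co-authors on one record is one edge;
        # sorting gives a canonical orientation so duplicates collapse.
        for a, b in combinations(sorted(set(rec["authors"])), 2):
            edges[rec["year"]].add((a, b))
    return dict(edges)

yearly_edges = coauthorship_edges_by_year(records)
```

With per-year edge sets in hand, each yearly snapshot can be treated as its own network and compared statistically, as described above.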

Over the specified period, network metrics such as the number of papers published, the centrality measures for the researchers in the network, and the degree of clustering in the networks can be determined. These measures indicate collaboration properties and can lead toward theoretical development about the collaboration capacity of a research community. An example of such computational social science is the work of Barabási et al. (2000), a now-famous body of work on social network analysis of scientific collaboration networks using digital metadata. Using a computer database of bibliometric metadata for publications from scientific journals, they gleaned findings about the statistical properties of collaboration networks. The research question driving the use-case analysis was "How does the mus musculus collaboration network change over time?"

Addressing this question is valuable because it can potentially inform science policy by testing theories of Scientific and Technical (S&T) Human Capital. Methodologically, it can help researchers and policy-makers better understand how to leverage big metadata analytics to test such theories, for example via collaboration capacity, a framework to operationalize S&T Human Capital (Qin, Hemsley, & Bratt, 2018).

Use-Case: The Multi-level Metadata Framework in the Case of the Mus Musculus Collaboration Network Analysis

In the mus musculus network analysis, the project began with research into how the mouse genome came to be used in research as a model organism. Then, the available metadata was inspected to see if the metadata (for example, author and co-author names, date of publication, taxonomic category, annotation record, journal in which an article is published, patenting country, or parent species) could answer questions about the network's development or decline over time. Adding the taxonomy data to the initial FTP-downloaded GenBank metadata, we could query the data according to the level of the genus, species, family, etc. we determined most useful for analysis. We determined that the genus-species binomial naming level (i.e., mus musculus) afforded a community large enough, and working on sufficiently similar research, to be considered a research community from an intellectual if not direct (that is, one degree of separation) perspective, as discussed above.
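A minimal sketch of this kind of level selection, assuming hypothetical annotation records whose taxonomic lineage has already been linked in (the accessions, field names, and lineage lists are invented for illustration):

```python
# Hypothetical linked metadata: each annotation record carries an
# organism lineage imported from the NCBI Taxonomy.
records = [
    {"accession": "A1", "lineage": ["Rodentia", "Mus", "Mus musculus"]},
    {"accession": "A2", "lineage": ["Rodentia", "Rattus", "Rattus norvegicus"]},
    {"accession": "A3", "lineage": ["Rodentia", "Mus", "Mus musculus"]},
]

def subset_by_taxon(records, taxon):
    """Keep only records whose taxonomic lineage contains the chosen taxon."""
    return [r for r in records if taxon in r["lineage"]]

mouse = subset_by_taxon(records, "Mus musculus")  # genus-species level
rodents = subset_by_taxon(records, "Rodentia")    # broader: all rodents
```

Choosing the taxon passed to the filter is exactly the level-selection decision discussed above: "Rodentia" yields a broader but less coherent community than "Mus musculus".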

The decision to use mus musculus as the level of metadata at which to define the community for network analysis was further supported by the increasing participation in GenBank by scientists submitting mus musculus research, as well as the international proliferation of submitters. The maintenance and growth of the research field is also evidenced by the increasing sophistication of the web resources available for the mouse community, such as the Mouse Genome Informatics website (http://www.informatics.jax.org/). Further, the increasing ease of use of submission tools ("wizards") for these databases differs vastly from the model organism databases of the mid-1990s, when research found that the technical difficulty scientists had in accessing the databases to make a submission in the first place was costly and inconvenient, and required a high level of computer literacy, such as downloading the correct programs to log in to the server remotely (Star & Ruhleder, 1996, p. 10).

The steps taken were based on the research question "How does the mus musculus collaboration network change over time?" This RQ was a sub-question of the overall project goals. The overarching project was to analyze the diffusion and impact of knowledge using metadata from an international open research data repository. The network evolution question, then, was a smaller part of the investigation that aimed to uncover the factors that led to robust collaboration communities. We used exploratory visualization, network metrics such as giant component and diameter measurements by decade, and descriptive statistics to model and potentially predict trends in scientific collaboration. Within the context of this question, the researchers on our team confronted the first step in the framework: negotiating the multi-level metadata. This step was initiated with internet research on the database documentation and exploratory visualizations (Figure 3).

The available metadata in our GenBank analysis included three types spanning all three levels. On the discovery metadata level (level 1), the annotation records describing the sequencing datasets, or data submissions, are functional for web search results when researchers search for datasets. These submission metadata also serve on the use level, such as when a researcher checks the date of submission and the co-authors who produced the dataset. The submission metadata then links to the NCBI Taxonomy (in our project's database, that is, not directly from the FTP download of GenBank's files; we, the research team, imported and linked the database according to the sequence's organism-classification information that came with the GenBank metadata download).

Second is the publication and patent metadata: metadata which describe the scholarly and commercial documents associated with the dataset. The level 1 (discovery) and level 2 (usage) metadata were not of immediate functional interest for this research question. The level 3 (domain) metadata tier was of most interest to the project because it provided numerous records for data mining, visualization, and community development analysis, toward better understanding how a research community develops and changes over time. These longitudinal, temporal aspects could best be answered from the perspective of domain metadata.

As alluded to in the literature section, the most critical and controversial aspect of longitudinal collaboration network analysis is the definition of a research community. Also referred to as a "discipline," "community of practice" (Wenger & Snyder, 2000), subject area, and/or research domain (Meadows, 1997), the collaboration network was defined by the model organism central to genetics and genomics research on a range of topics. Defining a community a priori in this way requires exploratory and confirmatory analysis (Tukey, 1977). With the RQ directing our selection of domain metadata, our research team moved toward answering the next question in the framework (Figure 2): "What sub-setting, slicing, and filtering steps should be taken?"

Mus Musculus Community Development Analysis


Figure 3: The mus musculus collaboration (co-authorship) networks of dataset submissions and publications: 1979, 1989, 1999, 2009. The number of nodes and links interacts with the layout algorithm (Fruchterman-Reingold) to display clusters within the co-authorship network.

Given the wealth of available metadata and multiple, competing options for sub-setting the network, the decision began with an exploration of the overall trends for viable subsets of the network at the organism level, that is, the level 3 (domain metadata) level, using the NCBI Taxonomy class names (Figure 4). Other viable sub-sets and slices considered to address the RQ (i.e., for a community analysis) were, for example, to ascend the taxonomy and inspect the collaboration patterns of researchers who had submitted sequence data at the phylogenetic position of Rodentia. If we took this tack in slicing, we would measure the network of all rodent scientists, but would obscure the interactions of the mus musculus researchers, who are all familiar with, and specialized according to, the model organism mus musculus, otherwise known as 'the laboratory mouse.'

Figure 4: Line plot of submission and publication frequency for mus musculus: 1979-2013. This analysis examined the mus musculus author-author undirected network within the mus musculus scientific community and found that the number of vertices (authors) in the network doubles every year.

Another option was to filter records according to link density. That is, another approach to defining and sub-setting the network is to extract only the metadata records forming completely connected networks, or a giant component (Table 1). This slicing approach assumes a definition of community according to collaboration. However, the method has a significant drawback: it overlooks the important cases in which researchers are related, or can be said to belong to the same community, because they study the same organism and research problem, even though they are not connected by co-authorship links. This approach was deemed inferior to the model-organism definition of the network for this analysis because it implied that some scientists in the same intellectual domain or community of practice may not exist in overlapping co-authorship networks, and thus do not "belong" to the same communities.
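The giant-component slicing described above can be operationalized with a breadth-first search over the co-authorship edge list. The sketch below is a generic stdlib implementation under our own assumptions, not the project's actual code:

```python
from collections import defaultdict, deque

def giant_component(edges):
    """Return the node set of the largest connected component (BFS)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        # Flood-fill one component from an unvisited node.
        comp, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in comp:
                    comp.add(nbr)
                    queue.append(nbr)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best

# Toy network: a connected triangle plus an isolated pair.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "E")]
gc = giant_component(edges)
all_nodes = {n for e in edges for n in e}
pct_gc_nodes = 100 * len(gc) / len(all_nodes)  # the "% GC nodes" of Table 1
```

Extracting only the records whose authors fall inside `gc` would implement the link-density slice, with the drawback noted above: authors such as "D" and "E", who may study the same organism, are excluded.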

Year    Nodes    Edges     GC nodes    GC edges    % GC nodes
1979      136      413          124         392         91.2%
1989     1602     5375         1599        5373         99.8%
1999     9253   279070         9248       45631         99.9%
2009     1895    53650         1892       15626         99.8%

Table 1: The giant component (GC) metrics at the end of each decade, 1979-2009. The giant component indicates a highly connected collaborative community. The average giant component for this sample of the network was 97.7%. This high rate of connectivity indicates a highly collaborative community where few researchers are disconnected from the primary mass of scientists. At the end of each decade from 1979 to 2009, the giant component maintained a well-connected research community.

After the challenging framework step of slicing, sub-setting, and filtering, the analysis methods were selected. We chose network metrics that measured the key indicators of community growth and change. First, we measured the number of nodes (authors) and the number of edges (collaborations on a publication or submission) every year from 1979 to 2009 (Table 1 shows a sample of these statistics over 30 years, at the end of each decade from the earliest available data in GenBank). Other measures include the percent giant component, density, betweenness centrality, and diameter, among others.
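As a sketch of how such yearly snapshot metrics can be computed (density here is the edge count divided by the maximum possible edges in an undirected simple graph; the function is illustrative, not the team's pipeline):

```python
def yearly_metrics(n_nodes, n_edges):
    """Basic size and density metrics for one yearly network snapshot."""
    # Maximum edges in an undirected simple graph on n_nodes vertices.
    possible = n_nodes * (n_nodes - 1) // 2
    density = n_edges / possible if possible else 0.0
    return {"nodes": n_nodes, "edges": n_edges, "density": round(density, 4)}

# The 1979 snapshot from Table 1: 136 authors, 413 co-authorship edges.
snapshot_1979 = yearly_metrics(136, 413)
```

Computing these metrics year by year, rather than on the cumulative network, is what makes the growth and decline of the community visible over time.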

Finally, after rounds of iteration and refining the research question according to initial findings, the last step in the conceptual framework for navigating multi-level metadata was taken. Preliminary findings suggest growth of the collaboration network for the model organism mus musculus at the level of dataset submissions. The network metrics, which indicate small-world properties because of the consistently large giant component over 30 years, are also consistent with the scientific innovation events in mus musculus research at the turn of the 21st century. Our data analysis suggests a surge of research collaboration, culminating in 2002 with the first draft of the first-ever sequencing of the mouse genome. The community continues to grow into 2009 but declines slightly after the peaks in 2002 and 2005 (Figure 4). With 279,070 collaborative connections amongst 9,235 authors (non-cumulative) in 1999 alone, there is good indication that the network metrics and definition reflect the scientific collaboration events, such as major breakthroughs and labor in the mus musculus community, during the period of the late 1990s to the early 2000s.

DISCUSSION

According to the social shaping of technology (MacKenzie & Wajcman, 1999; Pinch & Bijker, 1987), the social and technical resources we work with are co-constructed in an ongoing performance. Specific to big metadata analytics, the sources of data and the ecosystem of tools we use to store, inspect, analyze, and manage that metadata, such as databases and statistical programming software, are not so much boundary objects as boundary negotiating objects whose design, construction, and use continuously construct the analytic categories and relations we take for granted (Beitz and Lee, 2009, p. 1). Accordingly, as big metadata analytics matures, it will be important for BMA and data science researchers to simultaneously develop reflexivity regarding the tools we use and invent, and the flexibility and inflexibility of our current and future configurations of our human- and cyber-infrastructure.

For example, the mus musculus data table was created and became a useful and convenient subset of the GenBank metadata database. Over time, the database table was queried and used for multiple analyses, though it was a derivative table located in the "test" rather than the primary database. The creation and continued use of a data table such as the mus musculus bibliometric data table moves the definition of a community in the direction of crystallization. The community of mus musculus researchers, thus defined through writing a semi-permanent, visible, and useful data object, is further instantiated. By becoming a more well-established table, the mus musculus 'temporary' table thus contributes to the definition of a research community in GenBank as a species-level phenomenon, and a clearly-bounded one at that.

This idea of sharp delineation of a research community is thus tacitly reinforced and perpetuated, especially for newcomers to the project such as graduate assistants and research members. While this is not a problem per se, it does not immediately reveal the porous, dynamic nature of research communities premised on co-authorship, and it conceals the decisions made by the research team to define the community thus. Any level of the metadata taxonomic records could be used to define a community if it were clear from domain research and link density that the community is reasonably defined in this manner.

CONCLUSION

In this paper, we present a framework for navigating multi-level big metadata. We situate the framework in the BMA workflow literature and articulate the challenges of interpreting, communicating, and reproducing data analysis at the appropriate levels without a consistent, reliable framework. In the context of research data repository metadata, we describe the relationship of multi-level metadata with domain metadata to underscore the ways in which a consistent and reliable framework can strengthen research by developing systematic strategies for navigating the multiple levels of metadata. We highlight the well-known difficulty arising in community detection and longitudinal network analysis on scientific collaboration networks (How do you define a community?) to situate our BMA use case. We present a conceptual framework for negotiating the analytic process of addressing research questions concerning multi-level metadata, and we illustrate the framework with a use-case focusing on the evolution of the mus musculus co-authorship network. This framework and illustrative use-case contribute to the development of big metadata analytics by elucidating the conceptual steps necessary for high-quality, thorough, and collaborative research. In quantitative studies of science that leverage research repository metadata, the descriptive framework for navigating multi-level metadata contributes methodological and theoretical advances, as big metadata increasingly play a central role in advancing information science research and science policy decision-making.

ACKNOWLEDGMENTS

This work is supported in part by NSF grant award #1561348.

REFERENCES

Benson, D. A., Cavanaugh, M., Clark, K., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., & Sayers, E. W. (2013). GenBank. Nucleic Acids Research, 41(Database issue), D36-42. https://doi.org/10.1093/nar/gks1195

Boyack, K. W., Klavans, R., & Börner, K. (2005). Mapping the backbone of science. Scientometrics, 64(3), 351–374. https://doi.org/10.1007/s11192-005-0255-6

Bratt, S., Hemsley, J., Qin, J., & Costa, M. (2017). Big data, big metadata and quantitative study of science: A workflow model for big scientometrics. Proceedings of the Association for Information Science and Technology, 54(1), 36–45.

Fegraus, E. H., Andelman, S., Jones, M. B., & Schildhauer, M. (2005). Maximizing the value of ecological data with structured metadata: An introduction to Ecological Metadata Language (EML) and principles for metadata creation. The Bulletin of the Ecological Society of America, 86(3), 158–168.

Home - Taxonomy - NCBI. (n.d.). Retrieved August 13, 2018, from https://www.ncbi.nlm.nih.gov/taxonomy

Hood, W., & Wilson, C. (2001). The literature of bibliometrics, scientometrics, and informetrics. Scientometrics, 52(2), 291–314. https://doi.org/10.1023/A:1017919924342

Houssos, N., Jörg, B., & Matthews, B. (2012). A multi-level metadata approach for a Public Sector Information data infrastructure. In Proceedings of the 11th International Conference on Current Research Information Systems (pp. 19–31).

Khan, B. S., & Niazi, M. A. (2017). Network community detection: A review and visual survey. arXiv preprint arXiv:1708.00977.

MacKenzie, D., & Wajcman, J. (1999). The social shaping of technology. Open University Press.

Meadows, A. J. (1997). Communicating research. Emerald Group Publishing Limited.

Newman, M. E. (2001). Scientific collaboration networks. I. Network construction and fundamental results. Physical Review E, 64(1), 016131.

NLM. (2015). Congressional Justification FY 2015. Retrieved from https://www.nlm.nih.gov/about/2015CJ.html

Pinch, T. J., & Bijker, W. E. (1987). The social construction of facts and artifacts: Or how the sociology of science and the sociology of technology might benefit each other. The Social Construction of Technological Systems: New Directions in the Sociology and History of Technology, 17, 1–6.

Qin, J., Costa, M., & Wang, J. (2015). Methodological and technical challenges in big scientometric data analytics. iConference 2015 Proceedings. Retrieved from https://www.ideals.illinois.edu/handle/2142/73756

Rosvall, M., Axelsson, D., & Bergstrom, C. T. (2009). The map equation. The European Physical Journal Special Topics, 178(1), 13–23. https://doi.org/10.1140/epjst/e2010-01179-1

Smith, K., Seligman, L., Rosenthal, A., Kurcz, C., Greer, M., Macheret, C., … Eckstein, A. (2014). "Big metadata": The need for principled metadata management in big data ecosystems. In Proceedings of Workshop on Data Analytics in the Cloud (pp. 13:1-4). Snowbird, UT, USA: ACM. https://doi.org/10.1145/2627770.2627776

Staff, N. (n.d.). Open data | NCBI Insights. Retrieved August 13, 2018, from https://ncbiinsights.ncbi.nlm.nih.gov/tag/open-data/

tree.gif (678×372). (n.d.). Retrieved August 13, 2018, from http://bioweb.uwlax.edu/bio203/s2009/smith_meg2/images/tree.gif

Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Pearson.

Van Raan, A., & Tijssen, R. (1993). The neural net of neural network research. Scientometrics, 26(1), 169–192. https://doi.org/10.1007/BF02016799

Walker, D., Xie, H., Yan, K.-K., & Maslov, S. (2007). Ranking scientific publications using a model of network traffic. Journal of Statistical Mechanics: Theory and Experiment, 2007(06), P06010.

Wenger, E. C., & Snyder, W. M. (2000). Communities of practice: The organizational frontier. Harvard Business Review, 78(1), 139–146.

81st Annual Meeting of the Association for Information Science & Technology | Vancouver, Canada | Nov. 10 - 14, 2018

Author(s) Retain Copyright


i At this stage the metadata in GenBank has been transformed into the data for the network study.