rgi

Upload: pedro-soldado

Post on 18-Jul-2016

Fourth Paradigm

CONCEPT

Scientific discovery has been pursued throughout history, but the way it is done has changed dramatically. At first, it was empirical, based on the observation of natural phenomena. As society and human knowledge developed, a more theoretical approach emerged, in which scientists discover and analyze new information using models and generalizations. Third, over the last few decades, technological advances have let scientists rely on the available computational power to run simulations of natural environments, far faster and more conveniently than before.

A new approach is emerging, and it is the one that embodies the Fourth Paradigm ideas: it unifies theory, experiment and computer science. Here, information is collected extensively, analyzed and moved through a long processing pipeline. Data is not collected manually by scientists; it is gathered from tools, simulations and complex instruments, and then parsed and analyzed. Scientists often only get access to it at the end of this process, when it is ready to be used and to draw conclusions from.

The problem is that, because research today produces massive collections of results, all this data is hard to manage and understand. It is also not easy to share information across research teams, store it and make it available efficiently.

To address these problems, there is a growing demand for tools and technologies that are more general and more widely applicable, rather than tied to specific research areas (including the creation of Laboratory Information Management Systems).

Even when new tools are developed (mostly in large projects with funding for software development), some of them cannot be reused, as they are specific to the project’s scope. This widens the gap between big and small projects, and makes standardizing tools and software much more difficult.

If all these contributions (from both bigger and smaller projects) could be collected and organized, all information could be shared, and insights and conclusions could rest on much more solid ground (more information, more reliable data).

The main idea to keep in mind, and the one that summarizes the Fourth Paradigm concept, is that we can leverage the technological innovations and research currently available to create tools and mechanisms that help us manage, store, share, access and consume data in a much more efficient and effective way.

OPEN DATA

The management of information, and in particular of important data from varied sources such as scientific discoveries, legal reports and other contributions, is an increasingly common theme today.

The “Open Data” vision, the idea that certain data must be correctly stored and, most importantly, freely available for public use, has gained ground thanks to the growing need and desire to access massive data sets and leverage the relevant information they can provide.

One of the most important fields that can take advantage of this idea is scientific discovery and research in general.

As said earlier, most information processed in scientific research today comes from advanced tools and machines that generate massive amounts of data, based on simulations and models. It is therefore easy to see the advantage of making all this information (or a more processed and parsed version of it) available to similar research efforts. Scientists everywhere can benefit from research done anywhere in the world and share their conclusions, greatly speeding up the scientific discovery process.

On a wider scope, this vision (made possible by the ideas of the Fourth Paradigm) can benefit other areas, such as the legal field, where past information (former cases and former decisions, for example) can be put to great use. If all this information were available to everyone with a possible interest in it, great changes could occur in how we use information and how we leverage it.

In this vision of Open Data, the Fourth Paradigm ideas will certainly come into play, since the generalized use of information is not possible without correct and efficient management of all that information.

TECHNIQUES

Efficient Index Construction

To ensure that all the information usable for scientific discovery is effectively available (that is, can be accessed quickly and efficiently), it must be indexed.

To take advantage of all the possibilities, a robust indexing structure must be built.

Inverted indexes have to be built so that it is easy to find data relevant to a particular field and line of research. If accessing the data is not simple, interest in doing so will surely suffer and another approach must be sought.
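As an illustration of what an inverted index is, the following minimal Python sketch maps each term to the set of data sets whose description contains it, and answers a query by intersecting those sets. The data set ids, the descriptions and the function names are invented for this example; a real system would also need stemming, ranking and persistent storage.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical data set descriptions, invented for this example.
documents = {
    "ds1": "Ocean temperature simulation output",
    "ds2": "Protein folding simulation results",
    "ds3": "Ocean salinity sensor readings",
}
index = build_inverted_index(documents)
ocean_simulations = lookup(index, "ocean simulation")
```

Because the lookup only touches the posting sets of the query terms, its cost depends on how many data sets mention those terms, not on the total amount of data stored.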

The indexing must of course be done in a distributed way, because data, even if not globally shared at first (accessible to all interested parties), will be generated in different places but should be available everywhere.

Specifically, the indexing work can be done locally, with the sharing facility responsible for its own processing (treated as a node on the network). This distributed approach is more efficient and effective because, as seen in the Fourth Paradigm book, data sets are massive and handling them with only a very limited computer cluster can be hard.
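One possible way to picture this local indexing is sketched below in Python: each node indexes only its own data, and the small per-node indexes are then merged into a global one. The node names, the documents and the merge step are assumptions made for illustration; the text does not prescribe a particular merging scheme, and in practice the merge would run over a network rather than in one process.

```python
from collections import defaultdict


def build_local_index(node_id, documents):
    """Index one node's documents; postings are (node_id, doc_id) pairs."""
    local = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            local[term].add((node_id, doc_id))
    return local


def merge_indexes(local_indexes):
    """Union the per-node postings into a single global index."""
    merged = defaultdict(set)
    for local in local_indexes:
        for term, postings in local.items():
            merged[term] |= postings
    return merged


# Hypothetical facilities and data sets, invented for this example.
site_data = {
    "lab_a": {"run1": "ocean temperature simulation"},
    "lab_b": {"run1": "ocean salinity readings", "run2": "galaxy simulation"},
}

# Each facility builds its index on its own; only the indexes are shared.
local_indexes = [
    build_local_index(node_id, docs) for node_id, docs in site_data.items()
]
global_index = merge_indexes(local_indexes)
```

Keeping the node id in every posting means a query against the global index also says where the matching data lives, so the raw data can stay at the facility that produced it.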

To achieve this, and to allow the ideas of the Fourth Paradigm to succeed, tools must be provided that make this process as simple and invisible to the scientist or user as possible.

Querying

Even if information is available and reasonably organized in each of the data centers that host it, no useful practical results can be drawn from it if there is no simple and efficient way to retrieve it.

It is impossible to simply browse all this information until something relevant is found, so powerful querying techniques must be implemented to let users get the information they find relevant. As querying massive data sets can be infeasible and extremely expensive, information needs to be structured (for example, metadata must be attached to data sets for quick identification).
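A minimal Python sketch of the metadata idea follows: each data set carries a small descriptive record, and a query filters on those records without reading the data itself. The metadata fields (research_field, instrument, year) and the catalog entries are hypothetical, chosen only to show the mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetMetadata:
    """Small descriptive record stored alongside a (possibly huge) data set."""
    dataset_id: str
    research_field: str
    instrument: str
    year: int


def query(catalog, **criteria):
    """Return ids of data sets whose metadata equals every given criterion.

    Raises AttributeError if a criterion names a field that does not exist.
    """
    return [
        entry.dataset_id
        for entry in catalog
        if all(getattr(entry, key) == value for key, value in criteria.items())
    ]


# Hypothetical catalog entries, invented for this example.
catalog = [
    DatasetMetadata("ds1", "oceanography", "buoy", 2014),
    DatasetMetadata("ds2", "genomics", "sequencer", 2015),
    DatasetMetadata("ds3", "oceanography", "satellite", 2015),
]

recent_ocean_data = query(catalog, research_field="oceanography", year=2015)
```

Only the lightweight metadata records are scanned here; the massive data sets they describe are fetched afterwards, and only for the ids that match.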