Resume: Webometrics Presentation



    Methodology

    PRESENTATION

The Webometrics Ranking formally and explicitly adheres to the Berlin Principles on Ranking of Higher Education Institutions. The ultimate aim is the continuous improvement and refinement of the methodologies according to a set of agreed principles of good practice.

During the last year, several of the signatories of the code of good practices known as the Berlin Principles on Ranking of Higher Education Institutions became private, for-profit companies, and the biases of some of the rankings are now more and more evident. Although the Webometrics Ranking still formally and explicitly adheres to the Berlin Principles, we would add some points to these principles:

- A World Ranking is ONE ranking: publishing a series of completely different classifications from exactly the same data is useless and confusing.

- A World Universities Ranking is a ranking of universities from all over the world, covering thousands of them, not only a few hundred institutions from the developed world.

- A ranking backed by a for-profit company exploiting rank-related business should be examined with care.

- The unexpected presence of certain universities in top positions is a good indicator of the (lack of) quality of a ranking, independently of how supposedly sound the methodologies used are.

- Rankings that favour stability between editions and do not explicitly publish individual changes and the reasons for them (correcting errors, adding or deleting entries, changing indicators) are violating the code of good practices.

- Rankings based only on research (bibliometrics) are biased against technology, computer science, the social sciences and the humanities, disciplines that usually account for more than half of the scholars in a standard comprehensive university.

- Rankings should include indicators, even indirect ones, about the teaching mission and the so-called third mission, considering not only the scientific impact of university activities but also their economic, social, cultural and political impact.

- World-class universities are not small, very specialized institutions.

- Surveys are not a suitable tool for world rankings, as there is not even a single individual with deep (several semesters per institution), multi-institutional (several dozen institutions), multidisciplinary (hard sciences, biomedicine, social sciences, technologies) experience of a representative sample (different continents) of universities worldwide.

- Link analysis is a far more powerful tool for quality evaluation than citation analysis, which only counts formal recognition between peers; links include not only bibliographic citations but also third parties' involvement with university activities.

    0) Background of the project.

The World Universities' Ranking on the Web is an initiative of the Cybermetrics Lab, a research group of the Centro de Ciencias Humanas y Sociales (CCHS), part of the National Research Council (CSIC), the largest public research body in Spain.

http://www.che.de/downloads/Berlin_Principles_IREG_534.pdf
http://www.cchs.csic.es/
http://www.csic.es/

The Cybermetrics Lab is devoted to the quantitative analysis of the Internet and of Web contents, especially those related to the processes of generation and scholarly communication of scientific knowledge. This is a new emerging discipline that has been called Cybermetrics (our team has developed and published the free electronic journal Cybermetrics since 1997) or Webometrics.

With these rankings we intend to give researchers worldwide extra motivation to publish more and better scientific content on the Web, making it available to colleagues and to the public wherever they are located.

    The "Webometrics Ranking of World Universities" was officially launched in 2004, and it is

    updated every 6 months (data collected in January and July and published one month later). The

    Web indicators used are based and correlated with traditional scientometric and bibliometric

    indicators and the goal of the project is to convince academic and political communities of the

    importance of the web publication not only for dissemination of the academic knowledge but formeasuring scientific activities, performance and impact too.

    A) Purposes and Goals of Rankings

1. Assessment of higher education (processes and outputs) on the Web. The Web indicators can serve assessment purposes, and we are already publishing comparative analyses with similar initiatives. But the current objective of the Webometrics Ranking is to promote Web publication by universities, evaluating these organizations' commitment to electronic distribution, and to fight a very concerning academic digital divide, which is evident even among world universities from developed countries. However, even though we do not intend to assess university performance solely on the basis of web output, the Webometrics Ranking measures a wider range of activities than the current generation of bibliometric indicators, which focus only on the activities of the scientific elite.

2. Ranking purpose and target groups. The Webometrics Ranking measures the volume, visibility and impact of the web pages published by universities, with special emphasis on the scientific output (refereed papers, conference contributions, pre-prints, monographs, theses, reports, ...), but also taking into account other materials (courseware, seminar or workshop documentation, digital libraries, databases, multimedia, personal pages, ...) and the general information on the institution, its departments, research groups and supporting services, and the people working there or attending courses.

The direct target group of the ranking is the university authorities. If the web performance of an institution is below the position expected from its academic excellence, they should reconsider their web policy, promoting substantial increases in the volume and quality of their electronic publications.

Faculty members are an indirect target group, as we expect that in the near future web information could become as important as other bibliometric and scientometric indicators in evaluating the scientific performance of scholars and their research groups. Finally, prospective students should not use these data as the sole guide for choosing a university, although a top position does mean that the institution has a policy that encourages new technologies and has the resources for their adoption.

3. Diversity of institutions: missions and goals of the institutions. Quality measures for research-oriented institutions, for example, are quite different from those that are appropriate for institutions providing broad access to underserved communities. The institutions being ranked, and the experts who inform the ranking process, should be consulted often.

    http://www.cindoc.csic.es/cybermetrics/

4. Information sources and interpretation of the data provided. Access to Web information is gained mainly through search engines. These intermediaries are free, universal and very powerful, even considering their shortcomings (coverage limitations and biases, lack of transparency, commercial secrets and strategies, irregular behaviour). Search engines are key to measuring the visibility and impact of universities' websites.

There is a limited number of sources useful for webometric purposes: seven general search engines (Google*, Yahoo Search*, Live (MSN) Search*, Exalead*, Ask (Teoma), Gigablast and Alexa) and two specialised scientific databases (Google Scholar* and Live Academic). All of them have very large (huge) independent databases, but owing to the availability of their data-collection procedures (APIs), only those marked with an asterisk are used in compiling the Webometrics Ranking.

5. Linguistic, cultural, economic, and historical contexts. The project intends to have truly global coverage, not narrowing the analysis to a few hundred institutions (the world-class universities) but including as many organizations as possible. The only requirement in our international rankings is an autonomous web presence with an independent web domain. This approach allows a larger number of institutions to monitor their current rank and the evolution of that position after adopting specific policies and initiatives. Universities in developing countries thus have the opportunity to know precisely the indicator thresholds that mark the limits of the elite.

The currently identified biases of the Webometrics Ranking include the traditional linguistic one (more than half of internet users are English speakers) and a new disciplinary one (technology, rather than biomedicine, is at the moment the hot topic). Since in most cases the infrastructure (web space) and the connectivity to the Internet already exist, the economic factor is not considered a major limitation (at least for the top 3,000 universities).

    B) Design and Weighting of Indicators

6. Methodology used to create the rankings. The unit of analysis is the institutional domain, so only universities and research centres with an independent web domain are considered. If an institution has more than one main domain, two or more entries are used with the different addresses. About 5-10% of the institutions have no independent web presence, most of them located in developing countries. Our catalogue of institutions includes not only universities but also other higher-education institutions, following the recommendations of UNESCO. Names and addresses were collected from both national and international sources, including among others:

    Universities Worldwide univ.cc

    All Universities around the World www.bulter.nl/universities/

    Braintrack University Index www.braintrack.com

    Canadian Universities www.uwaterloo.ca/canu

    UK Universities www.scit.wlv.ac.uk/ukinfo

    US Universities www.utexas.edu/world/univ/state

University activity is multi-dimensional, and this is reflected in its web presence. So the best way to build the ranking is to combine a group of indicators that measure these different aspects. Almind & Ingwersen proposed the first Web indicator, the Web Impact Factor (WIF), based on link analysis; it combines the number of external inlinks and the number of pages of the website, a 1:1 ratio between visibility and size.
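In symbols (a standard rendering of that definition; the labels L and P are ours, not the source's):

\[
\mathrm{WIF} = \frac{L}{P}, \qquad L = \text{external inlinks received}, \quad P = \text{pages in the domain}
\]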

This ratio is used for the ranking, but two new indicators are added to the size component: the number of documents, measured from the number of rich files in a web domain, and the number of publications collected in the Google Scholar database. As already noted, the four indicators were obtained from the quantitative results provided by the main search engines, as follows:


Size (S). The number of pages recovered from four engines: Google, Yahoo, Live Search and Exalead. For each engine, results are log-normalised to 1 for the highest value. Then, for each domain, the maximum and minimum results are excluded and every institution is assigned a rank according to the combined sum.
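A minimal sketch of that normalise-and-combine step in Python; the function names and the dict-of-dicts input layout are our own assumptions, not part of the published methodology:

```python
import math

def log_normalise(counts):
    """Log-normalise raw page counts so the highest value maps to 1."""
    logs = [math.log(c + 1) for c in counts]
    top = max(logs)
    return [x / top if top else 0.0 for x in logs]

def size_rank(counts_by_engine):
    """counts_by_engine: {engine: {domain: page_count}} for the four
    engines. Drop each domain's highest and lowest normalised score and
    order institutions by the sum of the two remaining ones."""
    engines = list(counts_by_engine)
    domains = list(counts_by_engine[engines[0]])
    norm = {
        e: dict(zip(domains, log_normalise([counts_by_engine[e][d] for d in domains])))
        for e in engines
    }
    combined = {}
    for d in domains:
        scores = sorted(norm[e][d] for e in engines)
        combined[d] = sum(scores[1:-1])  # exclude maximum and minimum
    # best institution first; ranks are assigned by position in this list
    return sorted(domains, key=lambda d: combined[d], reverse=True)
```

Excluding each domain's extreme scores damps the irregular behaviour of any single engine, consistent with the combination rationale given in point 10 below.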

Visibility (V). The total number of unique external links received (inlinks) by a site can be obtained with confidence only from Yahoo Search. Results are log-normalised to 1 for the highest value and then combined to generate the rank.

Rich Files (R). After evaluating their relevance to academic and publication activities, and considering the volume of the different file formats, the following were selected: Adobe Acrobat (.pdf), Adobe PostScript (.ps), Microsoft Word (.doc) and Microsoft PowerPoint (.ppt). These data were extracted using Google, merging the results for each filetype after log-normalising them in the same way as described above.
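The per-format counts come from delimited Google queries. A sketch of how such queries might be composed (the helper and the example domain are hypothetical, though site: and filetype: are standard Google operators):

```python
RICH_FILETYPES = ("pdf", "ps", "doc", "ppt")  # the four selected formats

def rich_file_queries(domain):
    """One Google query per rich-file format. The count returned for each
    query is then log-normalised and the four results merged, as for Size."""
    return [f"site:{domain} filetype:{ext}" for ext in RICH_FILETYPES]

# rich_file_queries("uni.example.edu")
# -> ['site:uni.example.edu filetype:pdf', 'site:uni.example.edu filetype:ps', ...]
```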

Scholar (Sc). Google Scholar provides the number of papers and citations for each academic domain. These results from the Scholar database represent papers, reports and other academic items.

    The four ranks were combined according to a formula where each one has a different weight:
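The formula itself did not survive in this copy of the document. A hedged reconstruction, assuming the weights published for early editions of the ranking (visibility 50%, size 20%, rich files 15%, Scholar 15%; the exact weights of this edition may differ), would be:

\[
\mathrm{WR} = 0.5\,R_V + 0.2\,R_S + 0.15\,R_R + 0.15\,R_{Sc}
\]

where \(R_X\) is the institution's position in the rank for indicator \(X\), so lower combined values are better.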

7. Relevance and validity of the indicators. The choice of indicators was made according to several criteria (see note): some of them try to capture quality and academic and institutional strengths, while others are intended to promote web publication and Open Access initiatives. The inclusion of the total number of pages is based on the recognition of a new global market for academic information, the web being the adequate platform for the internationalization of institutions. A strong and detailed web presence providing exact descriptions of the structure and activities of a university can attract new students and scholars worldwide. The number of external inlinks received by a domain is a measure of the visibility and impact of the published material, and although the motivations for linking are very diverse, a significant fraction of links work in a way similar to bibliographic citations. The success of self-archiving and other repository-related initiatives can be roughly represented by the rich-file and Scholar data. The huge numbers involved for the pdf and doc formats show that these are not only administrative reports and bureaucratic forms, while PostScript and PowerPoint files are clearly related to academic activities.

8. Measure outcomes in preference to inputs whenever possible. Data on inputs are relevant, as they reflect the general condition of a given establishment, and they are more frequently available. Measures of outcomes, however, provide a more accurate assessment of the standing and/or quality of a given institution or program. We expect to offer a better balance in the future, but the current edition intends to call attention to incomplete strategies, inadequate policies and bad practices in web publication before attempting a more complete scenario.

9. Weighting the different indicators: current and future evolution. The current rules for the ranking indicators, including the weighting model described above, have been tested and published in scientific papers. More research is still being done on this topic, but the final aim is to develop a model that includes additional quantitative data, especially bibliometric and scientometric indicators.

http://www.webometrics.info/methodology.html#nota

    C) Collection and Processing of Data

10. Ethical standards. We have identified some relevant biases in the search-engine data, including the under-representation of some countries and languages. As the behaviour differs for each engine, a good practice consists of combining results from several sources. Any other mistake or error is unintentional and should not affect the credibility of the ranking. Please contact us if you think the ranking is not objective and impartial in any way.

11. Audited and verifiable data. The only source for the data of the Webometrics Ranking is a small set of globally available, free-access search engines. All the results can be duplicated according to the described methodologies, taking into account the explosive growth of web contents, their volatility, and the irregular behaviour of the commercial engines.

12. Data collection. Data are collected during the same week, in two consecutive rounds for each strategy, the higher value being selected. Every website under a common institutional domain is explored, but no attempt has been made to combine contents or links from different domains.
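A one-line sketch of that selection rule (the function and the dict layout are our illustration, not the project's code):

```python
def keep_higher(round1, round2):
    """Each query runs in two consecutive rounds in the same week; the
    higher of the two counts is kept, the lower being treated as a
    likely undercount by the engine."""
    return {query: max(round1[query], round2[query]) for query in round1}
```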

13. Quality of the ranking processes. After the automatic collection of data, positions are checked manually and compared with previous editions. Some of the processes are duplicated, and new expertise is added from a variety of sources. Pages that link to the Webometrics Ranking are explored, and comments from blogs and other fora are taken into account. Finally, our mailbox receives a lot of requests and suggestions, which are acknowledged individually.

14. Organizational measures to enhance credibility. The ranking results and methodologies are discussed in scientific journals and presented at international conferences. We expect international advisory or even supervisory bodies to take part in future developments of the ranking.

    D) Presentation of Ranking Results

15. Display of data and factors involved. The published tables show all the Web indicators used, in a very synthetic and visual way. Rankings are provided not only as a central Top 4000 classification but also as several regional rankings, for comparative purposes.

16. Updating and error reducing. The listings are offered from dynamic ASP pages built on several databases that can be corrected when errors or typos are detected.


    Glossary

This section is really a hybrid between a true glossary and a FAQ; it intends to explain some of the terms and meanings as used in the building of this ranking.

Database size. The number of records in a search engine's database that are publicly accessible from external sources. Not all robots crawl the Web at the same time or with identical procedures; besides, post-crawling processes and other commercial requirements finally result in really different databases. The current size, composition and evolution of these figures are a relevant point in webometric analysis.

Delimited search. A key characteristic of the search engines, which allows the cybermetric analysis. A delimiter operator has a specific syntax and meaning that can differ between engines. It provides the number of records (web pages) that satisfy a certain condition, filtering the results according to strings in the address (URL) or other characteristics (language, format) of the page. Of special relevance is the link delimiter, which can be used in combination with site or similar operators to calculate inlinks.
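As an illustration of combining delimiters (using the Yahoo-style syntax quoted in the Visibility entry at the end of this glossary; the helper itself is hypothetical):

```python
def external_inlinks_query(domain):
    """Pages that link to `domain` but are not hosted on it: the link
    delimiter (linkdomain:) filtered by the site delimiter."""
    return f"linkdomain:{domain} -site:{domain}"

# external_inlinks_query("uni.example.edu")
# -> 'linkdomain:uni.example.edu -site:uni.example.edu'
```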


Discipline differences. The ranking does not provide any kind of thematic assignment for the units, so a formal thematic analysis is not possible at the moment. But there are important differences in academic focus within our universities database that should be taken into account: research-focused universities are mixed with teaching institutions, and a group of discipline-oriented organizations (mainly pedagogy, medicine and theology) is also present.

Formal characteristics. As there is neither universal document control nor formal guidelines for building web pages, there is a huge diversity of formal aspects in the Webspace, including obvious malpractices. Some authors have focused on these to provide new indicators, such as link density; link quality, expressed as the ratio of non-working links; missing tags, including ones as relevant as title or metadata; or updating frequency. None of these characteristics is taken into account in our rankings, but they should be considered for micro-analysis.

Geographical biases. The use of several search engines in our ranking is due to the geographical bias observed in some of them. We do not know whether this is due to topological or traffic problems in the network (some eastern Asian countries are usually poorly covered) or to the crawlers' behaviour, nor whether the biases stay constant over time. Alexa's biases preclude us from adding its popularity data to our rankings.

Institutional domains. The basic unit of our analysis is the common URL domain shared by all the websites of an institution. Unfortunately, some organizations maintain two or more equivalent domains without marking a preferred one. Also of concern is the fact that some second-level departments maintain completely different domains. We usually keep two entries for institutions with two equivalent top-level domains. We intend to merge the results of smaller domains with those of the main one in the near future, but it is a difficult task.

Invocation. The presence of the name of an institution or a researcher in a web page. The global presence is the number of times the name appears on the Web; it can be calculated easily by putting quotation marks around the name in the search engines. Sometimes this figure is described as the number of times the name is cited on the Web. Some authors call this Web visibility, although we prefer to reserve that term for link visibility. This indicator usually favours large, well-known, old institutions, independently of their real effort to have a relevant Web presence.

No invocation measure was used in our ranking, mainly because it is not possible to assign a unique, unambiguous, universal name to every institution.

Invisible Web. Traditionally refers to the information available through gateways or search interfaces that is not accessible to the search engines' robots. It is a huge part of the Internet's content, including library catalogues, bibliographic and alphanumeric databases, and even some repositories of documents. In recent years some engines, especially Google, have made a great effort to index these records, and in fact several databases are more or less covered in their systems (e.g. PubMed is partially indexed by Google). Our ranking does not consider the Invisible or Deep Web, and we encourage transforming it into crawler-friendly information.

Language. English is the lingua franca of scientific communication, and it is also the language of a significant fraction of internet users. Non-English institutions publishing only in their mother tongue achieve a lower visibility than those with multilingual websites.

Link motivation. A major concern in link analysis is the motivation behind a link's creation. Previous studies suggest that sitations, the hypertextual equivalent of bibliographic citations, are still rare. We think this situation will improve as more papers become available on the Web, but we consider other reasons for linking very useful for describing scholarly communication. Informal linking is a powerful source of information about the intellectual, economic and political connections of academic and scientific activities. The table below summarises the main link categories:


CATEGORY                       CASE                                      COMMENTS

Research oriented              Sitation                                  Link to paper or document, generally in pdf/ps/doc format
                               Teaching/learning                         Link to course materials, mainly html pages but also pdf, doc or ppt
                               Resources index                           Portal type
                               Software repository
                               Research projects sites
                               Conferences, seminars or meetings pages
                               Raw data                                  Including media files if applicable
                               Self archive                              Pre- or post-prints, but also unpublished material

Personal                       Team or colleagues pages
                               Blog

Institutional                  Parent institution and related ones
                               Funding organization

Third parties (non-research)

Link popularity. Another term for link visibility that has been used extensively. We prefer to reserve popularity for the measure of the number of visits. Although not yet implemented in the Ranking, we intend to consider the number of visits, or popularity, as a relevant factor for our rankings in the future.

Open access. The movement to distribute openly the scientific production of, at least, publicly funded researchers is facing tougher opposition than expected. A strong bet on open-access initiatives will be clearly reflected in our rankings.

Personal pages. A frequently heard statement about the quality of web contents concerns the information provided by the personal pages of students or staff members. A lot of free space hosted on university web servers is used for personal purposes, and it is generally assumed to hold low-quality or non-academic information. The data suggest that a large number of small websites are crowding the institutional domains, but most of them are interesting enough to merit consideration. Some personal pages are in fact research-group sites, while others are institutional (scientific societies, electronic bulletins, conference sites). True personal pages cover both extremes of the content range, from people offering only their CVs to others providing very large collections of information on their academic or research topics, with links to personal repositories of documents. A striking pattern is the absence of links to other colleagues' websites or institutions.

Quality. We advise against using these rankings as a global or partial indicator of quality. Impact or visibility describes our aims better, in the particular context of promoting open and universal access to scientific activities and results through the Web.

Ranking. As their main objective is purely commercial, current search engines do not offer stable, reliable or trustworthy results for webometric purposes. The situation has improved in recent years, but there are still important biases and a worrisome instability. This is the reason we use not absolute values but relative positions in our analysis.

Rich files. A general term comprising a rather heterogeneous group of file types, mainly those devoted to representing unitary enriched documents, such as MS Word (doc), Adobe Acrobat (pdf) or PostScript (ps). In our analysis we also included MS PowerPoint (ppt) and excluded xls and latex/tex. Rich files are relevant because they are used for scholarly communication, as authors usually distribute their papers and presentations in these formats. Certainly some of these types are used


extensively for bureaucratic purposes (forms, administrative documents, internal reports), but this can only explain a small percentage of the large numbers observed in domains with extensive repositories.

There are several other file types that could be considered rich files, and even raw formats like txt are being used for distributing academic content, but their individual contributions are too low to be taken into account.

Rounding. Google and Yahoo offer rounded results, ending in ,000, which implies an error rate on the order of 2 to 5%. Moreover, the number provided by Yahoo on the first page of results is about another 4-5% higher than the one shown on the following pages, which tend towards the correct number.

Search Engine. The software that searches an index and returns matches. "Search engine" is often used synonymously with "spider" and "index", although these are separate components that work with the engine. Only four engines are useful for quantitative-analysis purposes, as they have a large and independent self-crawled database and their retrieval systems allow filtering of results according to URL-related delimiters:

    Google www.google.com

    Yahoo Search search.yahoo.com

    Bing www.bing.com

    Exalead www.exalead.com/search

Self archiving. Self-archiving involves depositing a free copy of a digital document on the World Wide Web in order to provide open access to it. The term usually refers to the self-archiving of peer-reviewed research journal and conference articles, as well as theses, deposited in the author's own institutional repository or open archive for the purpose of maximizing their accessibility, usage and citation impact. This practice is common among the most prolific authors and in certain disciplines; globally, however, only a minority of authors support this option. As many of these papers are published as rich files (pdf, ps or doc), the practice notably increases the performance of an institution in our rankings.

Size. The size of an institutional domain is the combined number of pages of all the websites under that domain, including html and the non-html formats that can be assimilated. From a practical point of view, size refers to the number provided by a search engine when a search like site:domain is done. This indicator is central to our rankings, and it is also used by other authors as the denominator in Web Impact Factor calculations. However, pages vary widely by many criteria, including content size measured in bytes: one page may contain a pdf document that is a monograph of several hundred pages totalling several megabytes of text and images, while another consists only of the phrase "page under construction". Global size could be an interesting indicator, and we expect to provide it for selected websites.
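A toy illustration of how the site:domain count feeds the Web Impact Factor mentioned in the methodology section (the counts here are invented):

```python
def web_impact_factor(external_inlinks, pages):
    """WIF in the Almind & Ingwersen sense: external inlinks received,
    divided by the number of pages in the domain (the site:domain count)."""
    return external_inlinks / pages if pages else 0.0

# A hypothetical domain with 12,000 external inlinks and 48,000 pages:
print(web_impact_factor(12_000, 48_000))  # 0.25
```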

Stability. From the earliest days, the instability of search results in general, and of the reported result counts in particular, has been a subject of special concern. Certainly the Web is a highly dynamic system, growing at an incredible pace, but the crawlers also change their specifications and schedules unexpectedly. A world crawling round can last from 15 to 45 days, and in the meantime the contents already crawled may have changed.

Visibility. In the context of this ranking, the term refers to link visibility: the number of external inlinks received by an institutional domain. The most used syntax for this request in the search engines is:

linkdomain:webometrics.info -site:webometrics.info
