Mapping French Open Data actors on the web with Common Crawl
Mapping French Open Data actors on the web with Common Crawl
[email protected] @glebourg
Mining the Web at Data Publica
Different needs, different techniques
● Scraping
● Focused crawling
● Prospective crawling
Mining the Web at Data Publica
Scraping
● Identified resources
● Configured extractors
● Structured content
● Not scalable
Mining the Web at Data Publica
Focused crawling
● Identified entities
● Fuzzy extraction
● Structured content using text-mining
● Scalable
● Useful to get meta information on known entities
Mining the Web at Data Publica
Prospective crawling
● No starting point
● Fuzzy extraction
● Structured content using text-mining
● Very hard to scale
● Heavy resources needed: CPU, RAM, HDD
Using a third-party crawl makes your life easier!
From a crawl to a map
Goal: build a map of the French Open Data actors on the web
● As a graph
● Showing websites
From a crawl to a map
Using Common Crawl
● Large web crawl archives, fully accessible
● Good coverage of the French web
● Easy access via AWS / MapReduce jobs
From a crawl to a map
Working on the French web
● The .fr TLD is not a relevant criterion for detection
● Detecting page language
● Giving websites a "frenchness" score
○ Sw = number of French pages / total number of pages
○ Cutoff chosen manually by testing on French websites
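The "frenchness" score above can be sketched as follows. This is a minimal illustration, not the code used for the project: it assumes the language of each page has already been detected upstream, and the function names, the sample URLs, and the 0.5 cutoff are all hypothetical (the slides only say the cutoff was chosen manually).

```python
from collections import defaultdict
from urllib.parse import urlparse


def frenchness_scores(pages):
    """Return {website: share of its pages detected as French}.

    `pages` is an iterable of (url, language_code) pairs; the language
    code is assumed to come from an upstream language detector.
    """
    total = defaultdict(int)
    french = defaultdict(int)
    for url, lang in pages:
        site = urlparse(url).netloc.lower()  # group pages by hostname
        total[site] += 1
        if lang == "fr":
            french[site] += 1
    # Sw = number of French pages / total number of pages
    return {site: french[site] / total[site] for site in total}


def french_websites(pages, cutoff=0.5):
    """Keep websites whose score reaches the cutoff (hypothetical value)."""
    scores = frenchness_scores(pages)
    return {site for site, score in scores.items() if score >= cutoff}


# Made-up sample: note that the TLD plays no role in the score.
sample_pages = [
    ("http://example.fr/a", "fr"),
    ("http://example.fr/b", "fr"),
    ("http://example.fr/c", "en"),
    ("http://example.com/x", "en"),
    ("http://example.com/y", "fr"),
    ("http://example.com/z", "en"),
    ("http://example.com/w", "en"),
]
scores = frenchness_scores(sample_pages)
```

Grouping by hostname is one possible definition of a "website"; grouping by registered domain would be an equally reasonable choice.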
From a crawl to a map
Working on Open Data websites
● Building an Open Data "vocabulary"
● Detecting whether a page talks about Open Data
● Giving websites an "opendataness" score
○ Sw = number of Open Data pages / total number of pages
○ Cutoff chosen manually by testing on Open Data websites
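A rough sketch of the "opendataness" score, under stated assumptions: the vocabulary below is invented for illustration (the slides do not list the actual terms), a page counts as an Open Data page as soon as it contains one term, and the function names are hypothetical. The real detection may well have been more elaborate.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical vocabulary; the terms actually used are not given in the slides.
OPEN_DATA_VOCABULARY = {
    "open data",
    "opendata",
    "données ouvertes",
    "données publiques",
}


def mentions_open_data(text, vocabulary=OPEN_DATA_VOCABULARY, min_hits=1):
    """Return True if at least `min_hits` vocabulary terms occur in the text."""
    # Lowercase and collapse whitespace so multi-word terms match reliably.
    normalized = re.sub(r"\s+", " ", text.lower())
    hits = sum(1 for term in vocabulary if term in normalized)
    return hits >= min_hits


def opendataness_scores(pages):
    """Return {website: share of its pages that mention Open Data}.

    `pages` is an iterable of (url, page_text) pairs.
    """
    total = defaultdict(int)
    matching = defaultdict(int)
    for url, text in pages:
        site = urlparse(url).netloc.lower()
        total[site] += 1
        if mentions_open_data(text):
            matching[site] += 1
    # Sw = number of Open Data pages / total number of pages
    return {site: matching[site] / total[site] for site in total}
```

As with the "frenchness" score, a manually chosen cutoff on the result would then decide which websites stay in the subset.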
From a crawl to a map
Building the graph
● Inside our subset
○ Inlinks
○ Outlinks
● Generating two files
○ nodes.csv (list of websites with an id)
○ edges.csv (directed links between websites)
[Figure: node A with its inlinks and outlinks]
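The two files could be produced along these lines. This is a sketch under assumptions: links are available as (source URL, target URL) pairs, a website is identified by its hostname, and the column headers (Id, Label, Source, Target, Type) are chosen to suit Gephi's spreadsheet import; the slides do not specify the actual file layout, and the function names are hypothetical.

```python
import csv
from urllib.parse import urlparse


def build_graph(links, websites):
    """Build node ids and directed edges restricted to the selected subset.

    `links` is an iterable of (source_url, target_url) pairs and
    `websites` is the set of hostnames kept after scoring.
    """
    node_ids = {site: i for i, site in enumerate(sorted(websites))}
    edges = set()
    for source_url, target_url in links:
        source = urlparse(source_url).netloc.lower()
        target = urlparse(target_url).netloc.lower()
        # Keep only links between two distinct websites of the subset.
        if source in node_ids and target in node_ids and source != target:
            edges.add((node_ids[source], node_ids[target]))
    return node_ids, sorted(edges)


def write_graph(node_ids, edges, nodes_path="nodes.csv", edges_path="edges.csv"):
    """Write the node list and the directed edge list as CSV files."""
    with open(nodes_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Label"])
        for site, node_id in sorted(node_ids.items(), key=lambda item: item[1]):
            writer.writerow([node_id, site])
    with open(edges_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target", "Type"])
        for source, target in edges:
            writer.writerow([source, target, "Directed"])
```

Page-level links are collapsed into a single edge per pair of websites here; keeping a link count as an edge weight would be a natural variant.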
From a crawl to a map
Building the graph
● Links tell a lot about websites
○ Authorities
○ Hubs
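As a simple illustration of how links reveal authorities and hubs, the sketch below counts inlinks and outlinks per node. Plain degree counts are only a rough proxy chosen here for brevity; they are not a full hub/authority algorithm such as HITS, and the slides do not say which measure was used beyond node size following the number of inlinks. The function name is hypothetical.

```python
from collections import Counter


def degree_counts(edges):
    """Return (in_degree, out_degree) counters for a directed edge list.

    A high in-degree suggests an authority (many websites link to it);
    a high out-degree suggests a hub (it links to many websites).
    """
    in_degree = Counter()
    out_degree = Counter()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree


# Made-up edges between three websites.
sample_edges = [("a", "c"), ("b", "c"), ("a", "b")]
in_degree, out_degree = degree_counts(sample_edges)
top_authority = in_degree.most_common(1)[0][0]
top_hub = out_degree.most_common(1)[0][0]
```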
From a crawl to a map
Visualizing the graph using Gephi
● Load the graph
● Spatialize the graph
○ links between websites create "attraction", so linked websites appear near each other
○ the more inlinks a node has, the bigger it is drawn (= authority)
○ websites are categorized for better understanding (one color per category)
■ Companies, Non-profits/blogs, Government agencies
○ communities can now appear!
From a crawl to a map
Visualizing the graph on the web
● Sigma.js
● Uses Gephi files
● Gives better interactivity
Analysis
● The final graph is a good way to understand interactions between actors
○ Open Data is definitely initiated by a non-profit movement
○ Companies are beginning to work on the subject
○ The French state has only had sporadic initiatives so far
● This graph will be generated again in the near future, to see changes in this ecosystem
Results
● Large-scale crawl made easy
○ Easy to focus on mining the results instead of finding/storing the data
● Nice workflow from raw data to an understandable visualisation
● The final graph is a good way to understand interactions between actors
Feedback
● Common Crawl
○ Common Crawl doesn't have an exhaustive crawl of the French web for now
○ Data is not as fresh as it could be
○ It is missing an index to access at least domains, and maybe pages, in O(1)
● Methodology
○ Opendataness scoring can set aside websites that are relevant but not focused enough on Open Data
Resources
● http://webatlas.fr/tempshare/OpenDataActeursTypes.pdf
○ Poster by Franck Ghitalla
● http://french-opendata.data-publica.com/index.html
○ Dynamic visualisation of the results, by Data Publica
● http://fr.slideshare.net/willounet/a-sneak-peek-into-the-web-presentation
○ A sneak peek into the web, by GL
● http://french-opendata.data-publica.com/
○ Project host page