scoda company networks2
Some slide prompts to support a data framing investigation around corporate data – originally prepared for the OGP Festival, London, October 2013.
For more information, contact: schoolOfData.org
1
When I buy something from a Shell petrol station, who do I enter into a transaction with?
When Shell builds a new petrol station, who owns it?
When Shell enters into a new extraction contract, who actually enters the contract?
What, exactly, is this thing we think of as the company we refer to as “Shell”?
2
It’s a sprawl….
A complex network of interconnected companies with intertwined ownership structures and registered addresses in a wide range of countries spread across the globe.
This diagram, taken from OpenCorporates, shows companies for which Royal Dutch Shell is a beneficial owner, as well as the beneficial ownership and shareholder relationships those subsidiary companies have with other companies.
3
This map shows companies in a corporate sprawl grown out from Royal Dutch Shell.
In this case, companies are connected if they share a common director, rather than a shareholding or ownership relation.
Starting with a seed company, we look for other companies that share two or more directors with the seed company. For each of these companies, we look up their directors and repeat, looking for further companies co-directed by two or more members of this extended set of directors.
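The "grow outwards from a seed" idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the company and director names in `DIRECTORS` are made up, the data is held in memory rather than looked up from OpenCorporates, and `grow_network` is a hypothetical helper, not the code used to draw the map.

```python
# Hypothetical toy data: company name -> set of director names.
DIRECTORS = {
    "Seed Plc": {"A", "B", "C"},
    "Alpha Ltd": {"A", "B", "D"},
    "Beta Ltd": {"C", "D", "E"},
    "Gamma Ltd": {"A", "X"},
    "Delta Ltd": {"Y", "Z"},
}


def grow_network(seed, directors, min_shared=2, max_rounds=5):
    """Return the companies reached from `seed` by repeatedly adding any
    company that shares at least `min_shared` directors with the pool of
    directors collected so far."""
    companies = {seed}
    pool = set(directors[seed])
    for _ in range(max_rounds):
        added = {
            name
            for name, board in directors.items()
            if name not in companies and len(board & pool) >= min_shared
        }
        if not added:
            break  # nothing new was found, so the network has stopped growing
        companies |= added
        for name in added:
            pool |= directors[name]  # extend the director pool and repeat
    return companies


network = grow_network("Seed Plc", DIRECTORS)
```

With this toy data, Alpha Ltd joins in the first round, and Beta Ltd only joins in the second round once Alpha's directors have been added to the pool. The `max_rounds` limit is an assumption added to stop the sprawl growing indefinitely on real data.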
The network is organised so that companies connected by several directors are positioned closely together. The result is a network map where different groups – or clusters – of companies that share common directors tend to neighbour each other.
The size of the company name is broadly related to how ‘influential’ the company is in the network, based on the extent to which it is connected to other influential companies.
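The notes do not say which measure of influence was used to size the labels. One measure with this "connected to other influential nodes" flavour is eigenvector centrality, so the sketch below assumes that measure purely for illustration. The node names and edges are invented, and `eigenvector_centrality` is a hand-rolled power iteration rather than a call to any particular tool.

```python
def eigenvector_centrality(edges, iterations=100):
    """Score nodes of an undirected graph so that a node scores highly
    when its neighbours score highly (simple power iteration)."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    score = {node: 1.0 for node in neighbours}
    for _ in range(iterations):
        # Each node keeps its own score and adds its neighbours' scores;
        # keeping the node's own score avoids oscillation on some graphs.
        new = {
            node: score[node] + sum(score[other] for other in neighbours[node])
            for node in neighbours
        }
        top = max(new.values())
        score = {node: value / top for node, value in new.items()}  # rescale to max 1.0
    return score


# Hypothetical network: "Hub" is linked to three companies, "Edge" to only one.
EDGES = [("Hub", "A"), ("Hub", "B"), ("Hub", "C"), ("A", "B"), ("C", "Edge")]
scores = eigenvector_centrality(EDGES)
```

In this toy network "Hub" ends up with the highest score and "Edge" with the lowest, which is the kind of ordering a label-size scale could be driven from.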
Reading the map as if it were a geographical map, we see concentrations of different companies operating in related sectors, presumably as a result of particular directors specialising in certain areas of the business.
Note the presence of BP in there, even though this network was grown out of the seed company Royal Dutch Shell Plc. – somehow, these two groupings are connected by shared directorships of some intermediate company.
4
This map shows a different view, concentrating on how directors are connected by virtue of being co-directors of the same company or companies.
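A director-to-director view like this can be derived from a list of boards by pairing up everyone who sits on the same board. The following is a minimal sketch, assuming made-up board data in `BOARDS`; `co_director_edges` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations

# Hypothetical boards: company name -> list of director names.
BOARDS = {
    "Alpha Ltd": ["Ann", "Bob", "Cat"],
    "Beta Ltd": ["Ann", "Bob"],
    "Gamma Ltd": ["Cat", "Dan"],
}


def co_director_edges(boards):
    """Count, for every pair of directors, how many boards they share."""
    weights = Counter()
    for members in boards.values():
        # Sorting makes ("Ann", "Bob") and ("Bob", "Ann") the same pair.
        for pair in combinations(sorted(set(members)), 2):
            weights[pair] += 1
    return weights


edges = co_director_edges(BOARDS)
```

Each key in `edges` is a pair of directors and each value is the number of companies they co-direct, which could serve as an edge weight in a network tool.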
5
These maps are all built from data – company data – but where can we find this data? And how can we start to work with it /as data/ ourselves?
To explore that, let’s first think about what we mean by company data, before looking at strategies for discovering or
6
At their heart, complex network visualisations can be constructed from quite simply stated data sets, such as this one.
Each row represents a connection – or link – between two companies, with a “directed edge”, that is, an arrow, going from the Source element to the Target element.
Loading a simple CSV (comma-separated values) text-based data file such as this into a network visualisation tool gives the tool enough information to work out the connections between all the elements and then plot the corresponding network diagram.
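To make the idea concrete, here is a minimal sketch of how a program might read such a two-column file and work out which elements are connected. The company names in `EDGE_CSV` are invented for illustration, and `read_edges` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Hypothetical edge list: one directed link per row.
EDGE_CSV = """Source,Target
Example Child A Ltd,Example Parent Plc
Example Child B Ltd,Example Parent Plc
Example Grandchild Ltd,Example Child A Ltd
"""


def read_edges(csv_text):
    """Map each Source to the list of Targets its arrows point at."""
    targets = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        targets[row["Source"]].append(row["Target"])
    return dict(targets)


graph = read_edges(EDGE_CSV)
# Every name that appears in either column is a node in the network.
nodes = set(graph) | {t for ts in graph.values() for t in ts}
```

Three rows are enough here to recover four nodes and the arrows between them, which is all a layout tool needs before it starts drawing.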
7
If you would like to learn more about generating network visualisations using Gephi, there are several tutorials available.
For example:
http://schoolofdata.org/2013/03/14/first-steps-in-identifying-climate-change-denial-networks-on-twitter/
http://blog.ouseful.info/2012/11/09/drug-deal-network-analysis-with-gephi-tutorial/
8
Launch Gephi and create a new project. In the Data Laboratory, select “Import spreadsheet” and load in a CSV file containing a list of connected companies, one pair of connected companies per row.
Note that the data file must contain one column called Source and one called Target. Also make sure that you have identified the file as being an Edge Table.
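Before importing, it can be worth checking that a file really does have the two required column names. The sketch below does that check in Python; `check_edge_table` is a hypothetical helper and is not part of Gephi.

```python
import csv
import io

REQUIRED_COLUMNS = ("Source", "Target")


def check_edge_table(csv_text):
    """Return the (Source, Target) pairs, or raise ValueError if either
    required column heading is missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError("missing column(s): " + ", ".join(missing))
    return [(row["Source"], row["Target"]) for row in reader]


pairs = check_edge_table("Source,Target\nA Ltd,B Plc\n")
```

Note that the check is case-sensitive, on the assumption that the headings should be spelled exactly as described above.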
9
The controls are slightly too involved to go into here, but once you’ve loaded the data into Gephi, you can, with practice, very quickly (2-3 minutes in all) generate interactive visualisations such as the one shown here.
10
Here’s a reminder of a couple of tutorials to get you started:
http://schoolofdata.org/2013/03/14/first-steps-in-identifying-climate-change-denial-networks-on-twitter/
http://blog.ouseful.info/2012/11/09/drug-deal-network-analysis-with-gephi-tutorial/
11
So given some data represented in quite a simple, two-column, text-based data file, we can create quite rich and complex network visualisations of our own.
But where can we get the data from in the first place?
12
OpenCorporates is a private company that has set itself the ambitious task of building a database of registered company information for every legal corporate entity in the world.
13
One of the views OpenCorporates offers over at least some of the data in its database shows how companies are connected by beneficial ownership or shareholder relationships.
Although complex, this diagram is “human readable” – the data is presented in a way that is intended to make some sort of meaningful sense to us.
14
But as well as publishing data for us humans to read, OpenCorporates also makes data available in a way that machines can read – machine readable data.
You may have heard of the term “API” in the context of data publishing websites. To all intents and purposes, an API is an interface that computers can use to get information out of websites in a way that they, and the databases they work with, can understand.
15
If you aren’t a programmer, here’s a way of getting the data out of OpenCorporates and into a tabular form you may be more comfortable with, and which we can use to generate a network diagram to display in a tool such as Gephi…
16
Using the web address – or URL – of a web page that reveals the data used to publish a corporate ownership network on OpenCorporates, we can load the data into OpenRefine.
Note that you can import data into OpenRefine from several web addresses all in one go, though the data returned from each URL should have the same format or structure.
Using multiple URLs results in a combined data set, which can be quite handy.
17
Being machine readable, the data makes more sense to OpenRefine than it probably does to us!
Select a block of data in the preview view that is typical of a set of data that you want to map into a single row in a “traditional” spreadsheet-like view.
Data blocks are typically contained within braces (curly brackets), these things: { }
Note that in some machine readable data, some data blocks may be contained within other data blocks…
Each of the items in a single data block can be mapped into a separate cell – that is, a separate column – in a single row of data.
So each data block is a row, and each item in the block is a column…. OpenRefine will give you a preview of how the data will look if you click the right button!
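The "each block is a row, each item is a column" mapping can be sketched in Python as follows. The JSON in `SAMPLE` is invented to show the shape of the idea, including one block nested inside another; it is not the actual structure of the OpenCorporates data, and `flatten` is a hypothetical helper rather than anything OpenRefine exposes.

```python
import json

# Hypothetical machine readable data: a list of blocks, with nested blocks inside.
SAMPLE = """
[
  {"child": {"name": "Example Child A Ltd"},
   "parent": {"name": "Example Parent Plc"},
   "relationship": "shareholder"},
  {"child": {"name": "Example Child B Ltd"},
   "parent": {"name": "Example Parent Plc"},
   "relationship": "beneficial owner"}
]
"""


def flatten(block, prefix=""):
    """Turn one data block into one row, giving nested items dotted column names."""
    row = {}
    for key, value in block.items():
        column = prefix + key
        if isinstance(value, dict):
            row.update(flatten(value, prefix=column + "."))  # a block within a block
        else:
            row[column] = value
    return row


rows = [flatten(block) for block in json.loads(SAMPLE)]
```

Each entry in `rows` now has the columns `child.name`, `parent.name` and `relationship`, which is roughly what a spreadsheet-like view of the same data would show.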
18
Once we’re happy with the data preview, we can import the data into a more familiar looking layout.
The arrows at the top of each column pop up menus that allow us to run a wide variety of operations on a column.
One of the operations lets us change the column name, so I’m going to rename the child company and parent company columns to Source and Target.
19
We can now export the data using the Custom Tabular Exporter.
Then, from the Download tab, select the CSV output type and export your data.
You should now have the two-column data that you can load into Gephi.
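For readers who would rather script this step, the rename-and-export operation can be approximated in Python. This is a sketch under assumptions: the input column names `child_name` and `parent_name` are hypothetical, as is the helper `export_edges`; in OpenRefine itself the same result comes from the menus described above.

```python
import csv
import io

# Hypothetical mapping from the imported column names to the names Gephi expects.
RENAME = {"child_name": "Source", "parent_name": "Target"}


def export_edges(rows, rename=RENAME):
    """Write rows out as CSV text, keeping only the renamed columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rename.values()), lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({new: row[old] for old, new in rename.items()})
    return buffer.getvalue()


csv_text = export_edges([
    {"child_name": "Example Child Ltd", "parent_name": "Example Parent Plc", "type": "shareholder"},
])
```

Any extra columns, such as `type` here, are simply left out of the exported file.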
20
OpenRefine is a very powerful tool for working with data sets.
It can be used to help harvest information from other websites by loading in data from every URL contained within a particular column (particularly machine readable data from web addresses that make calls to website APIs).
21
OpenRefine also provides tools for cleaning data within a column (for example, changing everything to UPPERCASE or Title Case), removing unwanted punctuation, or replacing one phrase with another (such as replacing Ltd with Limited).
If you have a column containing data elements at least some of which are supposed to match, or be consistent, but which aren’t, several clustering tools may be able to help.
For example, we may recognise Royal Dutch Shell PLC, ROYAL DUTCH SHELL P.L.C., and Royal Dutch Shell as representing the same thing, but a computer will treat them all as different companies.
The clustering tools will attempt to group together items that resemble each other in some way and provide you with the option of rewriting all the different flavours in the same way (for example, all the above examples as: Royal Dutch Shell plc).
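One simple way of grouping names that "resemble each other" is to reduce each name to a normalised key and collect the names that share a key. The sketch below is loosely modelled on the fingerprint style of key-collision clustering that OpenRefine offers, but it is a simplified illustration, not OpenRefine's own implementation; `fingerprint` and `cluster` are hypothetical helper names.

```python
import re
from collections import defaultdict


def fingerprint(name):
    """Lowercase, strip punctuation, then sort and de-duplicate the words."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(sorted(set(cleaned.split())))


def cluster(names):
    """Group names whose fingerprints collide; singletons are left out."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


clusters = cluster([
    "Royal Dutch Shell PLC",
    "ROYAL DUTCH SHELL P.L.C.",
    "Royal Dutch Shell",
])
```

This key catches the first two spellings, which differ only in case and punctuation. The third, with no "plc" at all, is not caught by this key, which is one reason clustering tools typically offer several matching methods to try.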
22
That said, if you’re keen to learn more about OpenRefine, here are some tutorials to get you started:
http://schoolofdata.org/2013/10/18/in-support-of-the-bangladeshi-garment-industries-data-expedition/
http://schoolofdata.org/handbook/recipes/cleaning-data-with-refine/
http://blog.ouseful.info/2013/03/14/first-dabblings-with-the-gateway-to-research-api-using-openrefine/
http://blog.ouseful.info/2013/05/01/a-simple-openrefine-example-tidying-cutnpaste-data-from-a-web-page/
http://blog.ouseful.info/2013/05/03/a-wrangling-example-with-openrefine-making-ready-data/
http://blog.ouseful.info/2013/06/15/working-jobs-data-with-openrefine/
http://schoolofdata.org/2013/07/26/using-openrefine-to-clean-multiple-documents-in-the-same-way/
http://schoolofdata.org/2013/06/04/analysing-uk-lobbying-data-using-openrefine/
http://blog.ouseful.info/2013/10/10/screenscraping-html-web-pages-with-openrefine-norwegian-oil-company-data/ (advanced)
23
We’ve started to see how we can get machine readable data out of OpenCorporates and into another set of tools that we can then use to start to analyse the data.
So what other data can we get out of OpenCorporates?
Note that while what follows mainly focusses on what machine readable data we can get from OpenCorporates, and where we can get it from, it’s worth also bearing in mind higher level journalistic or investigative questions, such as: what sorts of structures or relationships might we be able to discover by analysing this data?
24
Looking at web pages on OpenCorporates provides us with a human readable view of the data. But if we want to start developing our own corporate maps or looking for connections across hundreds of companies, it’s often easier to let a machine handle the task.
Let’s look again at a company page on OpenCorporates. We can see company information relating to a particular company is contained on the page in a human readable way.
Look at the web address – or URL – of the page, and compare it to the company information – do you recognise any pieces of the address in the company data?
The web address actually contains the jurisdiction and the company identifier for the company shown. So if we know the jurisdiction and the company number, we can get to this page (or the human readable version – just cut api. off the front of the web address). And if we can get to this page, we can find additional information about the company, such as its registered name or its registered address.
To learn more about reading and writing web addresses, or URLs as they are also known, see: http://schoolofdata.org/2013/05/09/hunting-for-data-learning-how-to-read-and-write-web-addresses-aka-urls/
But what if we wanted to pull this information into OpenRefine? Is the data for a particular company available in a machine readable way?
25
If we look behind a company web page for a company listed on OpenCorporates, we can see the company data in a machine readable way. To get this view, simply add api. to the start of the web address of the company page.
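The "add api. to the front" trick is easy to script. The sketch below assumes a company page address of the form `http://opencorporates.com/companies/<jurisdiction>/<company number>`; that path pattern, the jurisdiction code `gb` and the company number `01234567` are illustrative assumptions rather than details taken from the slides, and all three function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def company_url(jurisdiction, number, host="opencorporates.com"):
    """Build a company page address from its two identifying parts (assumed pattern)."""
    return f"http://{host}/companies/{jurisdiction}/{number}"


def to_api_url(url):
    """Add api. to the front of the host to ask for the machine readable view."""
    parts = urlsplit(url)
    host = parts.netloc if parts.netloc.startswith("api.") else "api." + parts.netloc
    return urlunsplit(parts._replace(netloc=host))


def to_human_url(url):
    """Cut api. off the front of the host to get back to the human readable page."""
    parts = urlsplit(url)
    host = parts.netloc[len("api."):] if parts.netloc.startswith("api.") else parts.netloc
    return urlunsplit(parts._replace(netloc=host))


page = company_url("gb", "01234567")  # hypothetical jurisdiction and company number
data = to_api_url(page)
```

Only the host part of the address changes between the two views; the jurisdiction and company number in the path stay exactly as they were.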
If you pluck up the courage to look at the data, you’ll see that you can start to make sense of it. What is the company name, for example? When was it incorporated? Where is its registered address? What is the company number, and which jurisdiction does it apply to?
26
The OpenCorporates company data feed for a company (and the associated human readable web page) also contains a wealth of other information presented in a form that computers can read – and manipulate (we’ll see an example of that shortly…).
In this case we can see a partial listing of the officers of the company (that is, the directors), along with their appointment date, their current status, and the termination date of the appointment, if appropriate.
The important thing to realise is that in the same way we can make sense of and read normal web pages, computers can make sense of and read structured, machine readable data such as this, as well as storing it in databases that allow us to search for patterns and structures across large amounts of data in a relatively straightforward way.
27
Here’s a recap of a couple of example web addresses/URLs for machine readable data feeds relating to companies on OpenCorporates.
Can you see how to hack (that is, edit, or modify) the URLs to give the human readable web page for the corresponding company?
Could you see how to change the URL to give a web page (either in machine readable form or human readable form) for another company in the same jurisdiction? How about for a company in another jurisdiction?
28
As well as looking up information about a particular company on a web page whose address includes the jurisdiction and company number of the company whose details we want to look up, we can also search for companies on OpenCorporates.
If you look at the web address, you should be able to see that the search term appears in it, although there also appear to be some gibberish characters (these characters actually represent – or encode – the blank space characters in the search term that appears in the search box).
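Python's standard library can show what those "gibberish" characters are doing. In URLs, a space is commonly written either as `%20` or as `+`; the slides do not say which form the search page uses, so both are shown here, with a made-up search term.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

term = "royal dutch shell"  # what you might type into the search box

percent_form = quote(term)    # spaces become %20
plus_form = quote_plus(term)  # spaces become +

# Decoding either form gives back the original search term.
decoded_percent = unquote(percent_form)
decoded_plus = unquote_plus(plus_form)
```

Here `percent_form` is `royal%20dutch%20shell` and `plus_form` is `royal+dutch+shell`, and both decode back to the term as typed.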
If I tell you that a machine readable version of the search results is available in a data form, could you guess at how to hack the URL in order to reveal this data?
29
As well as providing machine readable versions of company information, via web addresses/URLs that contain the jurisdiction and company identifier that uniquely describe the corresponding company, OpenCorporates also provides a machine readable version of a search made on company name. It is reachable via a tweak to the web address similar to the one that gave the machine readable, rather than human readable, version of a single company page.
Given the URL, do you think you would be able to pull this machine readable data into a tool such as OpenRefine?
30
As well as the “simple” search for companies by name, OpenCorporates offers another form of search that can be more tightly integrated with OpenRefine.
It’s known as a reconciliation API, and it’s used to try to find matches for company names in the OpenCorporates database, along with an estimate of how confident the match is.
For example, if you have a list of company names in a single column, you can use the reconciliation API in an attempt to get matches in OpenCorporates for all those companies.
This works similarly to search, but we also get a confidence score back estimating how well the company name suggested by OpenCorporates matches the one you provided.
Tools such as OpenRefine can hook into the reconciliation API and pitch a company name to it, and the API will give a set of best guess matches of company names OpenCorporates knows about, along with a confidence estimate. (You can also limit attempts at reconciliation to just those companies within a particular jurisdiction.)
When OpenRefine gets the suggestions back, you can choose to accept the most confident matches, or matches above a certain confidence. The result is an automatically enriched data set that includes OpenCorporates identifiers and registered names for each of the companies in your original list.
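The "accept matches above a certain confidence" step might look something like the sketch below. The response shape (a `result` list whose items carry `id`, `name` and `score`) is an assumption made for illustration, and the identifiers, names and scores in `SAMPLE_RESPONSE` are invented; `best_match` is a hypothetical helper, not part of OpenRefine or OpenCorporates.

```python
import json

# Hypothetical reconciliation response for one company name.
SAMPLE_RESPONSE = json.loads("""
{"result": [
  {"id": "/companies/xx/0000001", "name": "EXAMPLE TRADING LIMITED", "score": 92.0},
  {"id": "/companies/xx/0000002", "name": "EXAMPLE TRADING (HOLDINGS) LIMITED", "score": 61.5},
  {"id": "/companies/xx/0000003", "name": "SAMPLE TRADERS LIMITED", "score": 30.0}
]}
""")


def best_match(response, min_score=80.0):
    """Return the highest scoring candidate at or above `min_score`, else None."""
    candidates = [c for c in response.get("result", []) if c["score"] >= min_score]
    if not candidates:
        return None  # nothing confident enough: leave this name for manual checking
    return max(candidates, key=lambda c: c["score"])


match = best_match(SAMPLE_RESPONSE)
```

The threshold of 80 is arbitrary. Setting it higher means fewer wrong matches but more names left unmatched, which is the trade-off you make when choosing a confidence level.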
For more on reconciling company names with OpenCorporates company listings, see: http://schoolofdata.org/2013/10/18/in-support-of-the-bangladeshi-garment-industries-data-expedition/
31
Just so you know what it looks like (because you should be getting your eye in now…), here’s an example of the data returned by the OpenCorporates reconciliation API. Note the confidence score associated with each item in the response list.
You may notice that the web address for this API call has a slightly different form to the other API calling web addresses you have seen. You should also note that no human readable web page version corresponding to this data exists.
Of course, as well as calling the reconciliation API, if you can construct the web address/URL yourself, you can also load data into a new OpenRefine project from that address directly.
32
Here’s a summary of the search and reconciliation API web addresses.
Do you think you could hack the URLs to run searches for other company names?
33
As well as information about companies, OpenCorporates also tries to collect information about company directors. This includes the director’s full name as well as the start date and termination date (if appropriate) of the appointment.
If we are collecting data around a particular company, one thing we might look at is how directorial appointments change over time in the various companies associated with a particular grouping, as well as the dynamics of how companies are formed and dissolved.
At the moment, a separate page exists for each person-who-is-a-director-of-a-particular-company. If the same human is a director of two companies, that human will be associated with two separate director numbers (one corresponding to the directorship of the first company, a second corresponding to their directorship of the other company).
Hopefully, in time we will also get a unique identifier for a particular person, and a mapping to the director-company identifiers associated with pages such as the one shown here that describe a particular director role with a specific company.
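Until a per-person identifier exists, one rough workaround is to group the separate director-company records by a normalised form of the name. The sketch below shows the idea with invented records; the field names, identifiers and helper names are all hypothetical, and matching on name alone will wrongly merge different people who happen to share a name.

```python
from collections import defaultdict

# Hypothetical director records: one identifier per directorship, not per person.
OFFICER_RECORDS = [
    {"officer_id": 101, "name": "JANE A EXAMPLE", "company": "Alpha Ltd"},
    {"officer_id": 202, "name": "Jane A. Example", "company": "Beta Ltd"},
    {"officer_id": 303, "name": "John Sample", "company": "Alpha Ltd"},
]


def normalise(name):
    """Uppercase the name, drop full stops and collapse repeated spaces."""
    return " ".join(name.replace(".", "").upper().split())


def group_by_person(records):
    """Map each normalised name to the directorship identifiers that share it."""
    people = defaultdict(list)
    for record in records:
        people[normalise(record["name"])].append(record["officer_id"])
    return dict(people)


people = group_by_person(OFFICER_RECORDS)
```

Here the two "Jane A Example" directorships end up under one key, which is a guess at a shared identity rather than proof of one, so it is worth checking dates of birth or addresses where they are available.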
34
As with company information pages, we can also get a peek behind the scenes at the data associated with a particular director in the context of a particular company.
Given the URL of a human readable company director web page, do you think you could get hold of the data version?
35
As well as searching for companies, we can also search for directors. As before, if you look at the web address for the search, you may be able to recognise that the search term that appears in the search box on the web page also appears in the URL that acts as the web address for that human readable web page.
What happens if you filter the results by jurisdiction? Click on one of the jurisdiction links and then look at the URL of the page. Do you notice anything different about it?
(The same trick works for searches on companies…)
Knowing what you know about the OpenCorporates API, could you guess at a URL that might return a data version of this page?
36
A slight tweak to the URL of the human readable search page, and we get the machine readable data.
Do you think you would be able to find a way of pulling data relating to a search on a particular director into OpenRefine?
Do you think you could hack the URL to pull data about a search for a particular director name limited to a search for appointments within a particular jurisdiction into OpenRefine?
Recalling that OpenRefine can load in data from several URLs, do you think you could find a way of pulling data into OpenRefine from searches within a particular territory on a director’s name with two different spellings, or the same director’s name in two different jurisdictions?
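If you want several search URLs at once, for example two spellings of a name across two jurisdictions, they can be generated rather than typed by hand. In the sketch below the base address and the parameter names `q` and `jurisdiction_code` are assumptions for illustration; check them against what you actually see in the address bar of the search page. The director names, jurisdiction codes and the helper `search_urls` are also hypothetical.

```python
from itertools import product
from urllib.parse import urlencode


def search_urls(base_url, names, jurisdictions,
                name_param="q", jurisdiction_param="jurisdiction_code"):
    """Build one search URL for every combination of name and jurisdiction."""
    urls = []
    for name, jurisdiction in product(names, jurisdictions):
        query = urlencode({name_param: name, jurisdiction_param: jurisdiction})
        urls.append(f"{base_url}?{query}")
    return urls


urls = search_urls(
    "http://example.org/officers/search",  # placeholder base address
    names=["jon example", "john example"],
    jurisdictions=["gb", "nl"],
)
```

Two spellings and two jurisdictions give four URLs, each of which should return data with the same structure and so can be combined into a single data set.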
37
Here’s a quick recap of how to pull data out of OpenCorporates that relates to a search for a particular director, and how to further limit the search to appointments made to companies registered within a particular jurisdiction.
Note that there is no reconciliation API for director names.
38
If you want to know more, contact us…
39