Finding stories in spreadsheets
Presentation at Data Harvest 2014
@PaulBradshaw
Leanpub.com/u/paulbradshaw
Birmingham City University, City University London
Online Journalism Blog, HelpMeInvestigate
Saturday, 10 May 14
Show of hands. Who has...
- Calculated a proportion
- Used a function like SUM
- Used pivot tables
- Used a function like VLOOKUP
PART ONE:
BASICS
Download this data: Donations or EIB loans
https://pefonline.electoralcommission.org.uk/search/searchintro.aspx
http://www.eib.org/projects/loans/list/
Speed: keyboard shortcuts for checking the data
- Make a copy, work on that
- Use CTRL+arrow keys to skip to edges of data
- Clean first few rows to create single heading row
- Remove grand total row
- Remove empty rows (Open Refine)
        Column A   Column B     Column C
Row 1   Numbers    Strings      Calculations
Row 2   10         John Smith   =10+20+30
Row 3   20         Kate Brown   =A2+A3+A4
Row 4   30         Mike Moore   =SUM(A2:A4)
Row 5   N/A        Kim Smith    =COUNT(A:A)
Row 6   50                      =COUNTA(B:B)
Two types of datasets: Aggregate and granular
- Granular data has a row for every payment, person, crime etc.
- Aggregate data has rows for total crimes, payments, etc.
- Granular is always better: you can calculate your own aggregates
Granular: pivot tables
To aggregate the data:
- put the focus in Rows
- put the numbers (money, crimes) in Values
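For anyone who scripts as well as spreadsheets, the pivot table step can be sketched in Python. This is an illustration only: the supplier names and amounts are invented, and `pivot_sum` is a hypothetical helper that mimics putting one field in Rows and summing another in Values.

```python
from collections import defaultdict

# Hypothetical granular data: one row per payment (all values made up).
payments = [
    {"supplier": "Acme Ltd", "amount": 1200.0},
    {"supplier": "Bolt plc", "amount": 300.0},
    {"supplier": "Acme Ltd", "amount": 800.0},
    {"supplier": "Crane & Co", "amount": 450.0},
]


def pivot_sum(rows, row_field, value_field):
    """Group rows by row_field and sum value_field, like Rows + Values (sum)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[row_field]] += row[value_field]
    return dict(totals)


totals = pivot_sum(payments, "supplier", "amount")

# Largest total first, as you would sort a pivot table.
for supplier, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{supplier}: {total:.2f}")
```

The same granular rows can be regrouped by any other field, which is why granular data is the more useful starting point.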
Using functions - and arguments
=        indicates this is a formula
SUM      is the function to be applied
(        contains the ingredients for that formula
D2:D300  this is a range (array) of cells
,        separates each ingredient
)        ends the list of ingredients
More speed: use column ranges
=SUM(D:D) - ignores any text/empty cells
=MAX(D:D)
=MIN(D:D)
=AVERAGE(D:D)
Sense-checking: misleading averages
=AVERAGE(D:D)
=MEDIAN(D:D)
=MODE(D:D) - for ‘most common’: useful for ordinal ratings which shouldn’t be averaged.
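A rough Python sketch of why the average can mislead, using the standard `statistics` module. All figures are made up: one very large payment drags the mean far above the typical value while the median barely moves, and the mode picks out the most common rating.

```python
import statistics

# Hypothetical payments: four ordinary amounts and one outlier.
amounts = [100, 120, 110, 130, 9000]

mean_amount = statistics.mean(amounts)      # like =AVERAGE(D:D)
median_amount = statistics.median(amounts)  # like =MEDIAN(D:D)

# Hypothetical ordinal ratings on a 1-5 scale: report the most common one.
ratings = [1, 2, 2, 3, 5, 5, 5]
most_common_rating = statistics.mode(ratings)  # like =MODE(D:D)

print(mean_amount, median_amount, most_common_rating)
```

When the mean and median are far apart, as here, it is worth checking for outliers before quoting an "average".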
Combining functions to quickly make numbers meaningful
=MAX(D:D)/SUM(D:D) - how much of the total is accounted for by the biggest value?
=SUM(D35:D64)/SUM(D:D) - what proportion from one entity?
=SUM(D:D)/365 - how much per day? (for annual data)
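The same three calculations, sketched in Python with invented figures. The list stands in for column D, and the slice `payments[1:3]` is an assumed stand-in for the block of rows belonging to one entity (D35:D64 in the formula).

```python
# Hypothetical annual payments (made-up values standing in for column D).
payments = [5000.0, 1500.0, 2500.0, 1000.0]

total = sum(payments)

# =MAX(D:D)/SUM(D:D): share of the total taken by the biggest value.
share_of_largest = max(payments) / total

# =SUM(D35:D64)/SUM(D:D): share from one entity's rows (here rows 2-3).
share_of_entity = sum(payments[1:3]) / total

# =SUM(D:D)/365: spending per day, assuming the data covers one year.
per_day = total / 365

print(f"{share_of_largest:.0%} of spending is one payment")
print(f"{share_of_entity:.0%} goes to one entity")
print(f"{per_day:.2f} per day")
```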
Stories you can report quickly
- Org spending £X per day
- Company receives X% of spending
- Org spent £X on Y
Data health warning!
Remember the context: e.g. spending over £500, inflation
PART TWO:
CHECKING
COUNT functions: Checking data coverage
=COUNT(D:D)
=COUNTA(D:D)
=COUNTBLANK(D2:D15000) - have to use a specific range, or blank cells underneath the table are counted
=COUNTIF(D:D, "Other")
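What each COUNT function is checking can be sketched in Python. This is a simplified model, not how any spreadsheet implements it: the cell values are invented, and both `None` and `""` are assumed to represent empty cells.

```python
# Hypothetical column: numbers, text, and two empty cells.
cells = [10, 20, "N/A", None, 30, "Other", "", "other"]


def is_blank(cell):
    return cell is None or cell == ""


def count(cells):
    """Like COUNT: cells that hold a number."""
    return sum(
        1 for c in cells
        if isinstance(c, (int, float)) and not isinstance(c, bool)
    )


def counta(cells):
    """Like COUNTA: cells that are not empty."""
    return sum(1 for c in cells if not is_blank(c))


def countblank(cells):
    """Like COUNTBLANK: cells that are empty."""
    return sum(1 for c in cells if is_blank(c))


def countif(cells, value):
    """Like COUNTIF with plain text: exact match, ignoring case."""
    return sum(
        1 for c in cells
        if isinstance(c, str) and c.lower() == value.lower()
    )


print(count(cells), counta(cells), countblank(cells), countif(cells, "Other"))
```

If COUNT is much lower than COUNTA, the column has text where numbers were expected, which is worth checking before summing it.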
IF functions: Drill down further
=COUNTIF(D:D, "Individual")
=COUNTIFS(D:D, "Individual", B:B, "<10000")
=SUMIF(D:D, "<10000")
=IF(this, then that, otherwise this)
COUNTIF: Use wildcards - and spaces
=COUNTIF(D:D, "*hire*")
=COUNTIF(D:D, "Scottish*")
=COUNTIF(D:D, "* hire*")
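Why the space matters can be seen in a small Python sketch. It uses `fnmatch` from the standard library as an approximation of spreadsheet wildcards: `*` behaves the same way, but `fnmatch` also gives `[` and `]` a special meaning, so this is not an exact equivalent. The descriptions are invented.

```python
from fnmatch import fnmatchcase


def countif(cells, pattern):
    """Approximate COUNTIF with wildcards: case-insensitive, * matches anything."""
    pattern = pattern.lower()
    return sum(
        1 for c in cells
        if isinstance(c, str) and fnmatchcase(c.lower(), pattern)
    )


# Hypothetical free-text spending descriptions.
descriptions = [
    "Car hire",
    "Vehicle hire charges",
    "Scottish Water",
    "Shire Hall catering",
    "Hired help",
]

print(countif(descriptions, "*hire*"))      # also catches "Shire"
print(countif(descriptions, "Scottish*"))   # starts with "Scottish"
print(countif(descriptions, "* hire*"))     # needs a space before "hire"
```

Adding the space removes the false match on "Shire", but it also drops any cell that starts with "hire", so neither pattern is perfect on its own.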
COUNTIF: Test free text data
=COUNTIF(D2, "*adidas*")
=COUNTIF(D3, "*adidas*")
=COUNTIF(D4, "*adidas*")
...
Then sort to bring the 1s to the top
THE BLACK CROSS
DOUBLE CLICK
PART THREE:
CLEANING
Cleaning text: TRIM, SEARCH, SUBSTITUTE
=TRIM(D2)
=SUBSTITUTE(D2, " ", "") - (target cell, what you want to substitute, what you want to replace it with)
=SEARCH("Wales", A2) - gives the position of the first match
Cleaning text: UPPER, LOWER, PROPER
mr SMITH
=UPPER(D2) = MR SMITH
=LOWER(D2) = mr smith
=PROPER(D2) = Mr Smith
Cleaning text: LEFT, RIGHT, MID
=LEFT(E2,3) = first 3 characters in E2
=RIGHT(E2,3) = last 3 characters in E2
=MID(E2,10,3) = the 3 characters in E2 starting from position 10
Combine with LEN
=LEN(E2) = how many characters in E2
=LEFT(E2,LEN(E2)-3) = Length of E2 - 3. Grab that many characters, i.e.
- If E2 is 5 characters, it will grab the first 2 (5-3=2)
- If E2 is 7 characters, it will grab the first 4 (7-3=4)
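LEFT, RIGHT, MID and LEN map closely onto string slicing, so the logic can be checked with a short Python sketch. The helper names mirror the spreadsheet functions but are written here for illustration, and the sample code `AB12345XYZ` is invented. Note that spreadsheet positions start at 1 while Python starts at 0.

```python
def left(text, n):
    """Like LEFT: the first n characters."""
    return text[:n]


def right(text, n):
    """Like RIGHT: the last n characters."""
    return text[-n:] if n > 0 else ""


def mid(text, start, n):
    """Like MID: n characters starting at 1-based position start."""
    return text[start - 1:start - 1 + n]


code = "AB12345XYZ"  # hypothetical 10-character reference

print(left(code, 3))              # first 3 characters
print(right(code, 3))             # last 3 characters
print(mid(code, 3, 5))            # 5 characters from position 3
print(left(code, len(code) - 3))  # everything except the last 3
```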
Combine with SEARCH
=SEARCH(" ",E2) = which position is the first space
=LEFT(E2,SEARCH(" ",E2)) = Grab all characters up to (and including) that space
=TRIM(LEFT(E2,SEARCH(" ",E2))) = the same, with the trailing space removed
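The SEARCH + LEFT + TRIM combination, sketched in Python to show each step. The `search` helper is a simplified stand-in: it is case-insensitive and 1-based like SEARCH, but it does not handle wildcards, and it raises `ValueError` where a spreadsheet would show an error. `strip()` only removes spaces at the ends, which is all that is needed here.

```python
def search(needle, text):
    """Simplified SEARCH: 1-based position of the first match, ignoring case."""
    position = text.lower().find(needle.lower())
    if position == -1:
        raise ValueError(f"{needle!r} not found in {text!r}")
    return position + 1


def first_word(text):
    """Like =TRIM(LEFT(E2,SEARCH(" ",E2))): everything before the first space."""
    up_to_space = text[:search(" ", text)]  # includes the space itself
    return up_to_space.strip()              # drop the trailing space


print(search(" ", "Kate Brown"))
print(first_word("Kate Brown"))
print(search("Wales", "North Wales"))
```

A cell with no space at all makes `search` fail, just as the formula would return an error, so single-word cells need handling separately.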
Finding errors: ISERROR, ISNA, ISBLANK
=ISERROR(D2) = TRUE or FALSE
See also: ISNUMBER, ISTEXT, ISNONTEXT, ISLOGICAL, ISEVEN, ISODD, ISERR (all errors but #N/A)
PART FOUR:
ADDING
Save time typing search URLs
Generate URL
="https://www.duedil.com/beta/search/companies?name="&B2
="https://www.duedil.com/beta/search/companies?name="&SUBSTITUTE(B2," ","%20")
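The same URL-building idea in Python, as a sketch. It uses `urllib.parse.quote` from the standard library, which replaces spaces with `%20` as the SUBSTITUTE formula does, and additionally escapes other unsafe characters such as `&`, which SUBSTITUTE does not. The company names are invented.

```python
from urllib.parse import quote

# Search URL taken from the slide; the company names below are made up.
BASE_URL = "https://www.duedil.com/beta/search/companies?name="


def search_url(company_name):
    """Build a search link for one company name, escaping spaces as %20."""
    return BASE_URL + quote(company_name)


companies = ["Acme Widgets Ltd", "Bolt & Sons"]
for name in companies:
    print(search_url(name))
```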
Merging data: VLOOKUP
=VLOOKUP(what you’re looking for, what range contains a match & what you want back, which column you want back, nearest match?)
=VLOOKUP(D2,Sheet1!D:E,2,FALSE)
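An exact-match VLOOKUP (the FALSE case) behaves like a dictionary lookup, which a short Python sketch can show. The codes, regions and amounts are all invented, and `vlookup` is a hypothetical helper; it returns `None` where the spreadsheet would show #N/A.

```python
# Hypothetical lookup sheet: code -> region (stands in for Sheet1!D:E).
regions = {"A1": "North", "B2": "South", "C3": "West"}

# Hypothetical main sheet: one row per payment, with a code to look up.
rows = [
    {"code": "A1", "amount": 100},
    {"code": "C3", "amount": 250},
    {"code": "Z9", "amount": 75},  # no match in the lookup sheet
]


def vlookup(key, table, default=None):
    """Exact-match lookup: return the matching value, or default if none."""
    return table.get(key, default)


# Add the looked-up region as a new column on every row.
merged = [{**row, "region": vlookup(row["code"], regions)} for row in rows]

for row in merged:
    print(row)
```

Rows that come back without a match are worth counting: they usually point to inconsistent spelling or codes between the two datasets.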
Convert dates to years: TEXT functions
=TEXT(D2, "dddd")
=YEAR(D2)
=MONTH(D2) = 1
=TEXT(D2, "mmmm") = ‘January’
=TEXT(D2, "mmm") = ‘Jan’
If not formatted as date, use LEFT
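A Python sketch of the same date breakdowns, using the standard `datetime` module and the date of this talk as the example. Day and month names come out in English under Python's default locale. The text-date case assumes the year is the first four characters, which will not be true of every dataset.

```python
from datetime import date

d = date(2014, 5, 10)  # the date of this presentation

day_name = d.strftime("%A")     # like =TEXT(D2, "dddd")
year = d.year                   # like =YEAR(D2)
month_number = d.month          # like =MONTH(D2)
month_name = d.strftime("%B")   # like =TEXT(D2, "mmmm")
month_short = d.strftime("%b")  # like =TEXT(D2, "mmm")

# If the date is stored as text, take the year with a slice, like LEFT.
# Assumes a year-first format such as 2014-05-10.
text_date = "2014-05-10"
year_from_text = text_date[:4]

print(day_name, year, month_number, month_name, month_short, year_from_text)
```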
Convert amounts to categories: nested IF functions
=IF(B2>2500,"High","Low")
=IF(B2>2500,"High",IF(B2<1000,"Low","Mid"))
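The nested IF reads more easily when written out as a sequence of tests. A small Python sketch with the same thresholds as the formula; the function name `categorise` is made up for illustration.

```python
def categorise(amount):
    """Same logic as =IF(B2>2500,"High",IF(B2<1000,"Low","Mid"))."""
    if amount > 2500:
        return "High"
    if amount < 1000:
        return "Low"
    return "Mid"


for amount in [3000, 2500, 1000, 999]:
    print(amount, categorise(amount))
```

The boundary values matter: exactly 2500 and exactly 1000 both fall into "Mid", because the formula uses > and < rather than >= and <=.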
IF can’t use wildcards. Combine with COUNTIF
=IF(COUNTIF(B2, "*dropped*"), "Dropped", "Not dropped")
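In Python the wildcard test becomes a simple "contains" check. A sketch with invented case notes; the comparison is lowercased because COUNTIF ignores case, and `label_dropped` is a hypothetical helper name.

```python
def label_dropped(text):
    """Same idea as =IF(COUNTIF(B2,"*dropped*"),"Dropped","Not dropped")."""
    return "Dropped" if "dropped" in text.lower() else "Not dropped"


# Hypothetical free-text outcomes.
notes = ["Case dropped by CPS", "Convicted", "Charges DROPPED"]
labels = [label_dropped(note) for note in notes]

print(labels)
```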
1. Save time.
2. Check your data.
3. Clean your data.
4. Add to your data.
5. Feel clever. But don’t be too clever.
Thank you
Leanpub.com/u/spreadsheetstories
@paulbradshaw