![Page 1: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/1.jpg)
BUILDING A LIGHTWEIGHT DISCOVERY INTERFACE FOR CHINESE PATENTS
ERIC PUGH | [email protected] | @dep4b
![Page 2: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/2.jpg)
Who am I?
• Principal of OpenSource Connections - Solr/Lucene Search Consultancy http://bit.ly/OSCCommercialSummary
• Member of Apache Software Foundation
• SOLR-284 UpdateRichDocuments (July 07)
• Fascinated by the art of software development
![Page 3: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/3.jpg)
Co-Author
Next Edition May!
![Page 4: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/4.jpg)
Congrats to Trey and Tim! (Tim is here somewhere)
![Page 5: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/5.jpg)
Agilista
![Page 6: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/6.jpg)
Selected Customers
![Page 7: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/7.jpg)
Telling some stories
![Page 8: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/8.jpg)
Telling some war stories
![Page 9: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/9.jpg)
![Page 10: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/10.jpg)
Risks
• Cloud is new at USPTO
• Discovery is a tenuous concept
• Conflicting User Goals
• Fixed Budget: trade scope for budget/quality
![Page 11: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/11.jpg)
• First USPTO application in “the cloud”
• Simple, and discoverable
• Expresses our philosophy of “Cloud meets Ocean”
• Check it out at http://gpsn.uspto.gov
![Page 12: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/12.jpg)
Telling some stories
➡ How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
![Page 13: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/13.jpg)
Flow of understanding
Data → Information → Understanding
![Page 14: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/14.jpg)
Building “Discovery”
UX ↔ Data (Tension)
![Page 15: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/15.jpg)
Building “Discovery”
Engine
UX ↔ Data (Tension)
![Page 16: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/16.jpg)
UX
• User Interviews
• Surveys
• Card Sorting
• Scenarios/Personas
Data
• Grok data at gut level
• Look for outliers
Brainstorm
• Mockups
• Proof of concept
![Page 17: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/17.jpg)
Where to spend time?
• UX: 40%
• Engine: 20%
• Data: 40%
![Page 18: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/18.jpg)
Where to spend time?
• UX: 40%
• Engine: 20%
• Data: 40%
We spent: 40% / 40% / 20%
![Page 19: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/19.jpg)
Telling some stories
• How to inject “Discovery” into your app
➡ The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
![Page 20: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/20.jpg)
Boy meets Girl Story
![Page 21: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/21.jpg)
Boy meets Girl Story
![Page 22: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/22.jpg)
Boy meets Girl Story
![Page 23: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/23.jpg)
Boy meets Girl Story
Metadata
Ingest Pipeline
Discovery UX
Content Files
![Page 24: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/24.jpg)
How we built it
EmberJS Single Page Search App
HTML
XML
JSON
Server Dashboard
GPSN UI (Bootstrap CSS)
Browsers
Mobile/Tablet
Third Party Application
Servers
S3 Bucket
Solr
![Page 25: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/25.jpg)
Lessons Learned
![Page 26: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/26.jpg)
Don’t Move Files
• Copying 5 TB of data up to S3 was very painful.
• We used S3Funnel, which is “rsync-like”.
• We bought more network bandwidth for our office
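To put the pain in numbers, a rough back-of-the-envelope calculation helps. The sketch below estimates how long 5 TB takes at a few link speeds. The speeds are illustrative assumptions, not the bandwidth we actually had, and the math ignores protocol overhead and retries.

```java
import java.time.Duration;

final class TransferTime {
    /** Ideal time to move {@code bytes} over a link of {@code megabitsPerSecond}. */
    static Duration estimate(long bytes, double megabitsPerSecond) {
        double bitsPerSecond = megabitsPerSecond * 1_000_000.0;
        long seconds = Math.round(bytes * 8.0 / bitsPerSecond);
        return Duration.ofSeconds(seconds);
    }

    public static void main(String[] args) {
        long fiveTerabytes = 5_000_000_000_000L;      // 5 TB (decimal), the figure from the slide
        double[] assumedLinksMbps = {10, 100, 1000};  // hypothetical link speeds
        for (double mbps : assumedLinksMbps) {
            Duration d = estimate(fiveTerabytes, mbps);
            System.out.printf("%6.0f Mbit/s -> %.1f days%n", mbps, d.getSeconds() / 86_400.0);
        }
    }
}
```

Even at an assumed 100 Mbit/s, the ideal case is more than four and a half days of continuous upload.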
![Page 27: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/27.jpg)
Never underestimate
the bandwidth of a station wagon
full of tapes hurtling down the highway.
–Andrew Tanenbaum, 1981
![Page 28: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/28.jpg)
Data Size
[Chart: Patent Count by year, 1985 to 2011; y-axis 0 to 1,000,000; labeled value: 277871]
![Page 29: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/29.jpg)
Data Size
[Chart: Patent Count by year, 1985 to 2011; y-axis 0 to 1,000,000; labeled value: 277871]
![Page 30: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/30.jpg)
Think about Data Volume
• Started with older dataset, and tasks like TIFF -> PNG conversion became progressively harder. Map/Reduce is nice, but we needed more visibility into progress.
• Should have sharded our search index from the beginning, just to make indexing a faster and cheaper process (500 GB index!)
• 8 shards dropped time from 12 hours to 2 hours. Merging took 5!
• We had too many steps in our pipeline
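One way to shard from the beginning is to route each document to a shard by hashing its ID. This is a minimal sketch of that idea, not how GPSN or Solr actually routes documents. The class name, the ID format, and the choice of CRC32 are all assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** Returns a stable shard number in [0, shardCount) for the given document ID. */
    int shardFor(String docId) {
        CRC32 crc = new CRC32();
        crc.update(docId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % shardCount); // getValue() is never negative
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(8); // 8 shards, as in the slide
        // Made-up document IDs, for illustration only.
        for (String id : new String[] {"US06644224", "US03930271", "CN101000001"}) {
            System.out.println(id + " -> shard " + router.shardFor(id));
        }
    }
}
```

Because the mapping depends only on the ID, each shard can be indexed independently and in parallel.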
![Page 31: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/31.jpg)
Building a Patents Index
[Chart: Machine Count (y-axis, 0 to 300) against build time]
• 1 machine: 5 days
• 5 machines: 3 days
• 300 machines: 30 minutes
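The chart trades machines for wall-clock time, so it is worth checking what that costs in total machine-hours. A quick worked calculation, assuming the three data points read as 1 machine for 5 days, 5 machines for 3 days, and 300 machines for 30 minutes:

```java
final class MachineHours {
    /** Total machine-hours consumed by {@code machines} running for {@code hours}. */
    static double machineHours(int machines, double hours) {
        return machines * hours;
    }

    public static void main(String[] args) {
        System.out.println(machineHours(1, 5 * 24));  // 1 machine, 5 days
        System.out.println(machineHours(5, 3 * 24));  // 5 machines, 3 days
        System.out.println(machineHours(300, 0.5));   // 300 machines, 30 minutes
    }
}
```

Under that reading, the 300-machine run uses about 150 machine-hours, only a little more than the single machine's 120, and finishes in half an hour.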
![Page 32: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/32.jpg)
Key scaling concept behind GPSN:
Cloud meets Ocean
![Page 33: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/33.jpg)
More prosaically…
[Diagram: one Database, three Servers, three Clients; cost markers: $ $ $ $]
![Page 34: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/34.jpg)
More prosaically…
[Diagram: one Database, three Servers, three Clients; cost markers: $ $ $ $$]
![Page 35: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/35.jpg)
More prosaically…
[Diagram: one Database Server, three Clients; cost markers: $ $ $ $$]
![Page 36: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/36.jpg)
More prosaically…
[Diagram: one Database Server, five Clients; cost markers: $ $ $ $$]
![Page 37: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/37.jpg)
More prosaically…
[Diagram: one Database Server, five Clients; cost markers: $ $ $ $ $$ $]
![Page 38: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/38.jpg)
Telling some stories
• How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
➡ Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
![Page 39: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/39.jpg)
Why so many pipelines?
Morphlines
![Page 40: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/40.jpg)
Tika as a pipeline?
![Page 41: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/41.jpg)
Lots of File Types
• Sometimes in ZIP archives, sometimes not!
• Multiple XML formats as well as CSV and EDI
• Purplebook, Yellowbook, Redbook, Greenbook, Questel, SIPO…
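Since input may or may not arrive zipped, one approach is to sniff the first bytes rather than trust file extensions. A minimal sketch, assuming a stream that supports mark/reset; the class and method names are made up for illustration. It checks only for the ZIP local-file-header signature `PK\x03\x04`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ZipSniffer {
    // ZIP local file header signature: 'P' 'K' 0x03 0x04.
    // Empty or spanned archives use other signatures and are not handled here.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Peeks at the first four bytes, then rewinds the stream. */
    static boolean looksLikeZip(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(ZIP_MAGIC.length);
            byte[] head = in.readNBytes(ZIP_MAGIC.length);
            in.reset();
            return Arrays.equals(head, ZIP_MAGIC);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream zipLike = new ByteArrayInputStream(new byte[] {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00});
        InputStream plain = new ByteArrayInputStream("PATN\nWKU 039302717".getBytes(StandardCharsets.US_ASCII));
        System.out.println(looksLikeZip(zipLike));
        System.out.println(looksLikeZip(plain));
    }
}
```

Because the stream is rewound after the peek, the same stream can be handed on to whichever parser is chosen.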
![Page 42: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/42.jpg)
Tika as a pipeline!
• Auto detects content type
• Metadata structure has all the key/value needed for Solr
• Allows us to scale up with Behemoth project (and others!).
![Page 43: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/43.jpg)
Lots of files!
HHHHHT APS1 ISSUE - 760106
PATN
WKU 039302717
SRC 5
APN 5328756
APT 1
ART 353
APD 19741216
TTL Golf glove
ISD 19760106
NCL 4
ECL 1
<PatentGrant>
  <BibliographicData>
    <GrantIdentification>
      <DocumentKindCode>B1</DocumentKindCode>
      <GrantNumber>06644224</GrantNumber>
      <CountryCode>US</CountryCode>
      <IssueDateText>2003-11-11</IssueDateText>
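The Greenbook sample on this slide is essentially tag/value lines, which maps naturally onto the key/value metadata that Solr needs. Here is a rough sketch of that mapping in plain Java, without the Tika API. The class name is hypothetical, and the assumption that every field is a short uppercase tag followed by whitespace and a value is inferred from the sample, not from the format specification.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreenbookFields {
    // Assumed line shape: a 3-4 letter uppercase tag, whitespace, then the value.
    private static final Pattern FIELD = Pattern.compile("^([A-Z]{3,4})\\s+(\\S.*)$");

    /** Collects tag/value pairs in order; lines without a value (e.g. "PATN") are skipped. */
    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : record.split("\\R")) {
            Matcher m = FIELD.matcher(line.trim());
            if (m.matches()) {
                fields.putIfAbsent(m.group(1), m.group(2).trim()); // keep the first occurrence
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String record = String.join("\n",
                "PATN", "WKU 039302717", "APN 5328756", "TTL Golf glove", "ISD 19760106");
        parse(record).forEach((tag, value) -> System.out.println(tag + " = " + value));
    }
}
```

Each entry in the resulting map could then be copied into a metadata field for indexing.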
![Page 44: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/44.jpg)
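Detector to pick File

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.regex.Pattern;

    import org.apache.commons.io.IOUtils;
    import org.apache.tika.detect.Detector;
    import org.apache.tika.io.LookaheadInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;

    public class GreenbookDetector implements Detector {

        // Greenbook files carry a "PATN" section marker near the top.
        private static final Pattern PATTERN = Pattern.compile("PATN");

        @Override
        public MediaType detect(InputStream stream, Metadata metadata) throws IOException {
            MediaType type = MediaType.OCTET_STREAM;
            if (stream == null) {
                return type; // nothing to inspect
            }
            // Peek at the first 1024 bytes; closing the lookahead resets the stream.
            try (InputStream lookahead = new LookaheadInputStream(stream, 1024)) {
                String extract = IOUtils.toString(lookahead, "UTF-8");
                if (PATTERN.matcher(extract).find()) {
                    type = GreenbookParser.MEDIA_TYPE;
                }
            }
            return type;
        }
    }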
![Page 45: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/45.jpg)
Telling some stories
• How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
➡ Don’t be Afraid to Share!
![Page 46: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/46.jpg)
Your BigData solution isn’t perfect
• Allow users to export data
• Most business users want to work in Excel! Accept it!
• Allow other applications to build on top of it.
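For the Excel crowd, the simplest export is often a CSV file. Below is a minimal sketch of RFC 4180 style quoting in plain Java. The column names and the sample row are made up for illustration and are not GPSN's actual export format.

```java
import java.util.List;
import java.util.stream.Collectors;

final class CsvExport {
    /** Quotes a field when it contains a comma, quote, or line break; doubles embedded quotes. */
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        return needsQuotes ? "\"" + field.replace("\"", "\"\"") + "\"" : field;
    }

    /** Joins one row of fields into a single CSV line (without the trailing line break). */
    static String toLine(List<String> row) {
        return row.stream().map(CsvExport::escape).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(toLine(List.of("number", "title", "issue_date")));         // hypothetical columns
        System.out.println(toLine(List.of("0000001", "Widget, improved", "2003-11-11"))); // made-up row
    }
}
```

Quoting matters here because patent titles routinely contain commas, which would otherwise shift columns when the file is opened in a spreadsheet.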
![Page 47: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/47.jpg)
GPSN has
• Lots of easy “Print to PDF” options.
• Data stored in S3 as:
  • individual patent files
  • chunky downloads.
• Filtering to expand or select specific data sets.
• Permalinks: simple, very sharable URLs.
• Underlying Solr service is exposed to the public via a proxy. You can query Solr yourself.
• Need advanced querying? Use Lucene syntax in the search bar.
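Because the Solr service is reachable through a proxy, a client can build its own query URL. The sketch below assembles a standard Solr `/select` request with a Lucene-syntax query. The base URL and the field names (`title`, `year`) are placeholders, since the slides give neither the proxy path nor the schema.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class SolrQueryUrl {
    /** Builds a /select URL; the query is URL-encoded and rows caps the number of results. */
    static String build(String baseUrl, String luceneQuery, int rows) {
        String q = URLEncoder.encode(luceneQuery, StandardCharsets.UTF_8);
        return baseUrl + "/select?q=" + q + "&rows=" + rows + "&wt=json";
    }

    public static void main(String[] args) {
        String base = "https://example.org/solr/patents";               // placeholder, not the real GPSN endpoint
        String query = "title:\"golf glove\" AND year:[1976 TO 1980]";  // hypothetical field names
        System.out.println(build(base, query, 10));
    }
}
```

The same URL can double as a permalink or be handed to another application that wants raw JSON results.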
![Page 48: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/48.jpg)
One more thought...
![Page 49: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/49.jpg)
Measuring the impact of our algorithm changes is just getting harder with Big Data.
![Page 51: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/51.jpg)
www.quepid.com
Quepid: Give your Queries some Love
Project SolrPanl
We need beta users!
![Page 52: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC](https://reader033.vdocuments.net/reader033/viewer/2022060108/5550984ab4c90590208b46f6/html5/thumbnails/52.jpg)
Thank you!
Questions?
• @dep4b
• www.opensourceconnections.com
• slideshare.com/o19s
Nervous about speaking up? Ask me later!