Building a Lightweight Discovery Interface for Chinese Patents

1. Building a Lightweight Discovery Interface for Chinese Patents. Strata 2014, Santa Clara. Eric Pugh | [email protected] | @dep4b

2. Who am I? Principal of OpenSource Connections, a Solr/Lucene search consultancy (http://bit.ly/OSCCommercialSummary). Member of the Apache Software Foundation. SOLR-284 UpdateRichDocuments (July 07). Fascinated by the art of software development.
3. Co-Author (Next Edition Mar!)
4. Agilista
5. Selected Customers
6. Telling some war stories
7. First USPTO application in the cloud. Simple, and discoverable. Expresses our philosophy of "Cloud meets Ocean".
8. Risks: Cloud is new at USPTO. Discovery is a tenuous concept. Conflicting user goals. Fixed budget: trade scope for budget/quality.
9. Telling some stories: How to inject Discovery into your app. The Cloud to the Rescue (sorta!). Parsers and Parsers and Parsers. Don't be Afraid to Share!
10. Flow of understanding: Data -> Information -> Understanding
11. Building Discovery: UX, Data, Engine, and the tension between them
12. UX: User Interviews; Card Sorting; Scenarios/Personas. Data: Grok data at gut level; Look for outliers. Brainstorm. Surveys; Mockups; Proof of concept.
13. Where to spend time? [Chart: UX, Engine, Data; 40% 20% 40%; 40% 40% 20%; "We spent"]
14. Walk through results: http://gpsn.uspto.gov
15. Telling some stories: How to inject Discovery into your app. The Cloud to the Rescue (sorta!). Parsers and Parsers and Parsers. Don't be Afraid to Share!
16. Boy meets Girl Story
17. Boy meets Girl Story: Content Files, Ingest Pipeline, Metadata, Discovery UX
18. How we built it
19. Lessons Learned
20. Don't Move Files: Copying 5 TB of data up to S3 was very painful. We used S3Funnel, which is rsync-like. We bought more network bandwidth for our office.
21. "Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway." Andrew Tanenbaum, 1981
22. Data Size: 277871
23. Think about Data Volume: We started with an older dataset, and tasks like TIFF -> PNG conversion became progressively harder.
Map/Reduce is nice, but we needed more visibility into progress. We should have sharded our search index from the beginning, just to make indexing a faster and cheaper process (500 GB index!). 8 shards dropped indexing time from 12 hours to 2 hours. Merging took 5! We had too many steps in our pipeline.
24. Building a Patents Index
25. Cloud meet Ocean
26. More prosaically: [Diagram: Server, Database, and Client boxes, with $ cost markers on the servers]
27. Telling some stories: How to inject Discovery into your app. The Cloud to the Rescue (sorta!). Parsers and Parsers and Parsers. Don't be Afraid to Share!
28. Morphlines. Why so many pipelines?
29. Tika as a pipeline?
30. Lots of File Types: Sometimes in ZIP archives, sometimes not! Multiple XML formats, as well as CSV and EDI. Purplebook, Yellowbook, Redbook, Greenbook, Questel, SIPO.
31. Tika as a pipeline! Auto-detects content type. The Metadata structure has all the key/value pairs needed for Solr. Allows us to scale up with the Behemoth project (and others!).
32. Detector to pick File

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.tika.detect.Detector;
    import org.apache.tika.io.LookaheadInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;

    public class GreenbookDetector implements Detector {
        // Marker string searched for at the start of the file.
        private static final Pattern pattern = Pattern.compile("PATN");

        @Override
        public MediaType detect(InputStream stream, Metadata metadata) throws IOException {
            MediaType type = MediaType.OCTET_STREAM;
            if (stream == null) {
                return type;
            }
            // Peek at the first 1024 bytes; close() resets the underlying stream.
            InputStream lookahead = new LookaheadInputStream(stream, 1024);
            String extract = org.apache.commons.io.IOUtils.toString(lookahead, "UTF-8");
            Matcher matcher = pattern.matcher(extract);
            if (matcher.find()) {
                type = GreenbookParser.MEDIA_TYPE;
            }
            lookahead.close();
            return type;
        }
    }

33. Telling some stories: How to inject Discovery into your app. The Cloud to the Rescue (sorta!). Parsers and Parsers and Parsers. Don't be Afraid to Share!
34. Your solution isn't perfect: Allow users to export data. Most business users want to work in Excel! Accept it! Allow other applications to build on top of it.
35. GPSN has lots of easy Print to PDF options.
Data is stored in S3 as individual patent files and as chunky downloads. Filtering to expand or select specific data sets. Permalinks: simple, very sharable URLs. The underlying Solr service is exposed to the public through the firewall. You can query Solr yourself.
36. One more thought...
37. Measuring the impact of our algorithm changes is just getting harder with Big Data.
38. Quepid: Give your Queries some Love. We need beta users! www.quepid.io
39. Office Hours: Thurs 10:50 AM. What's Up with the Lucene Community?
40. Questions? Nervous about speaking up? Ask me later! [email protected] | @dep4b | www.opensourceconnections.com | slideshare.com/o19s
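Slide 35 notes that the underlying Solr service is exposed and can be queried directly. As a rough illustration of what "query Solr yourself" could look like, here is a minimal sketch that builds a request URL for Solr's standard `/select` handler. The host, the core name (`patents`), and the field name (`title`) are assumptions made for this example and do not come from the slides; only the `q`, `rows`, and `wt` parameters are standard Solr.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class SolrQueryUrl {
    // Hypothetical endpoint: the slides do not give the real host or core name.
    static final String BASE = "http://solr.example.org/solr/patents/select";

    // Builds a /select URL with the standard q, rows, and wt parameters.
    static String build(String base, String query, int rows) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", query);                      // the search query
        params.put("rows", Integer.toString(rows));  // number of results to return
        params.put("wt", "json");                    // response format
        StringJoiner url = new StringJoiner("&", base + "?", "");
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.add(enc(p.getKey()) + "=" + enc(p.getValue()));
        }
        return url.toString();
    }

    // Percent-encodes a single query-string key or value as UTF-8.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // "title" is an assumed field name, used only for illustration.
        System.out.println(build(BASE, "title:battery", 10));
    }
}
```

The resulting URL can be pasted into a browser or fetched with any HTTP client. Because the query is just a URL, it also doubles as the kind of simple, sharable link the slide describes.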
