Zemanta tech talk at Audible

Post on 12-Jan-2015


DESCRIPTION

Tech talk about Zemanta's st

TRANSCRIPT

Audible Tech Talk

23 April 2012

Andraz Tori

andraz@zemanta.com

@andraz

Today's plan

• Short story of Zemanta

• The Zemanta technology

Where am I right now?

Wonders of modern communication

Ljubljana

Strip mine

• A system for Slovenian National television in 2006

• Closed captioning web page for each episode of each show

• Natural Language Processing, Information Retrieval...

Start-up? Why not?


Tour de Slovénie

Sales

Seedcamp

• First European program inspired by YC (2007)

• London based

• 3 months, 50.000 EUR / 10%

Roller coaster

• 12 August: Deadline

• 20 August: Shortlist

• 23 August: Phone interview

• 24 August: Results

• 3 September: London week start

• 7 September: London week end

• 16 September: ==> London

3 months in London

Back to Ljubljana


• Figuring out that the US is our target market

• Figuring out where in the US to be and who to have here

• Partnerships

• And naturally the business model

And then ...

Technology

What do we do?

• Zemanta: Personal Writing Assistant

- on your current platform

• While bloggers write we suggest:

- images

- related articles

- in-text links

- tags

Some stats

• 80k bloggers monthly

• 1.3 million posts enhanced in 2011

How does it work

• Natural Language Processing

- Big database of “meanings” (entities, concepts, topics)

- Word Sense Disambiguation

- Linking out to Wikipedia, Freebase, …

- Categorization, Named Entity Recognition

• Information Retrieval

- Solr based, using features from NLP

- With some twists
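A minimal sketch of the word sense disambiguation idea, assuming a simplified Lesk-style word overlap between the article and a per-meaning word profile. The `MEANINGS` table, its contents and the function names are hypothetical illustrations, not Zemanta's data or method.

```python
import re

# Hypothetical toy "database of meanings": every candidate meaning of a
# surface form carries a bag of words that tend to appear near it.
MEANINGS = {
    "jaguar": {
        "Jaguar (animal)": {"cat", "jungle", "prey", "species", "habitat"},
        "Jaguar Cars": {"car", "engine", "sedan", "dealer", "luxury"},
    },
}


def tokenize(text):
    """Lowercase and split on non-letters; a stand-in for a real tokenizer."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


def disambiguate(surface_form, text):
    """Return the candidate meaning whose word profile overlaps the text most."""
    context = set(tokenize(text))
    candidates = MEANINGS.get(surface_form.lower(), {})
    best, best_score = None, 0
    for meaning, profile in candidates.items():
        score = len(profile & context)
        if score > best_score:
            best, best_score = meaning, score
    return best  # None when the context gives no evidence for any meaning
```

For example, `disambiguate("Jaguar", "The new Jaguar sedan has a powerful engine")` picks the car maker because two profile words occur in the text. A production system would replace the raw overlap count with statistical features.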

[Diagram: Plain text (article), Analysis, Background knowledge, Semantic search, Indexed content, Content suggestions]

“Text Understanding”

- Input is meaningful chunk of text (not a keyword or a phrase)

- Input is (semi) English language

- Has to work across all domains in the open world

- music, celebrities, finance, entertainment, politics, gardening, parenting, …


Background knowledge

- Data from Wikipedia, MusicBrainz, Freebase… and the world wild web

- Includes linguistic and semantic properties and unstructured data

- Present in two forms:

- in “original” custom built triple store on top of MySQL (150 GB)

- processed into 7 GB optimized “memory mapped dump”
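A minimal sketch of what a “memory mapped dump” can look like, assuming fixed-size records sorted by key and looked up with binary search. The record layout (`RECORD`), the file name and the function names are hypothetical; the slides do not describe the real format.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: 8-byte entity id followed by a 4-byte float score.
RECORD = struct.Struct("<Qf")


def build_dump(path, records):
    """Write (entity_id, score) pairs, sorted by id, as fixed-size records."""
    with open(path, "wb") as f:
        for entity_id, score in sorted(records):
            f.write(RECORD.pack(entity_id, score))


def lookup(mm, entity_id):
    """Binary search over the memory-mapped records; returns the score or None."""
    lo, hi = 0, len(mm) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        key, score = RECORD.unpack_from(mm, mid * RECORD.size)
        if key == entity_id:
            return score
        if key < entity_id:
            lo = mid + 1
        else:
            hi = mid
    return None


def demo():
    """Build a tiny dump in a temp directory and query it through mmap."""
    path = os.path.join(tempfile.mkdtemp(), "knowledge.dump")
    build_dump(path, [(42, 0.5), (7, 0.25), (1000, 1.0)])
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return lookup(mm, 42), lookup(mm, 8)
        finally:
            mm.close()
```

The point of the design is that the operating system pages the file in on demand, so a multi-gigabyte dump can be shared read-only between processes without parsing it at start-up.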

Analysis pipeline

- Named Entity Extraction

- Known phrases extraction (Aho-Corasick)

- Triple store

- Surface form features evaluation

- Statistical comparison to background knowledge

- Semantic coherence and hand-tuned heuristics

- Disambiguated entities

- etc.
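A minimal sketch of the known-phrases step, assuming a textbook Aho-Corasick automaton over characters. The phrase list and function names are hypothetical; the slides only name the algorithm, not how it is implemented.

```python
from collections import deque


def build_automaton(phrases):
    """Build goto/fail/output tables for a collection of lowercase phrases."""
    goto, fail, out = [{}], [0], [[]]
    for phrase in phrases:
        state = 0
        for ch in phrase:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(phrase)
    # Breadth-first pass: depth-1 states keep the root as their failure link.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure state (shorter suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_phrases(text, automaton):
    """Return (start_offset, phrase) for every known phrase found in text."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text.lower()):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for phrase in out[state]:
            matches.append((i - len(phrase) + 1, phrase))
    return matches
```

All phrases are matched in a single pass over the text, which is why this family of algorithms suits a large dictionary of known surface forms. Overlapping hits such as "new york" and "new york times" are all reported, leaving the choice between them to the later disambiguation steps.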


Connecting content

• Indexing blogosphere and mediasphere

• Solr based index

• Twist: complicated queries – 50 terms

• Filtering out spam is “fun”

• Probably best “related content” in terms of accuracy

• Coming soon: social signal
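A minimal sketch of how a “complicated query” of up to 50 terms could be assembled for Solr from weighted features, using the standard Lucene `field:"phrase"^boost` syntax. The field name `text`, the weights and the function name are hypothetical; the slides do not show Zemanta's actual queries.

```python
def build_related_query(weighted_terms, field="text", max_terms=50, rows=10):
    """Turn {term: weight} into request parameters for a Solr select handler."""
    # Keep the highest-weighted terms; ties are broken alphabetically.
    top = sorted(weighted_terms.items(), key=lambda kv: (-kv[1], kv[0]))[:max_terms]
    clauses = []
    for term, weight in top:
        # Escape backslashes and quotes so the term stays one quoted phrase.
        escaped = term.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field}:"{escaped}"^{weight:.2f}')
    return {"q": " OR ".join(clauses), "rows": rows, "fl": "id,score"}
```

For example, `build_related_query({"Ljubljana": 2.0, "Seedcamp": 1.5})` produces a `q` parameter with two boosted phrase clauses joined by `OR`. The returned dictionary can be passed as query parameters to any HTTP client.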

But why just for bloggers?

Let's open up the API!

Some API users

Back to reality.

Age of “smart”

Blog me up, Scotty!

23 April 2012

Some takeaways

• Accelerators are good

• World is getting flatter

But it will never be flat

• Start monetizing soon: to learn, not to earn

• Be where your market is

• Many markets left to innovate in

Thank you!
