Zemanta tech talk at Audible

Audible Tech Talk

23. April 2012

Andraz Tori

andraz@zemanta.com

@andraz

Today's plan

• Short story of Zemanta

• The Zemanta technology

Where am I right now?

Wonders of modern communication

Ljubljana

Strip mine

• A system for Slovenian National television in 2006

• Closed captioning web page for each episode of each show

• Natural Language Processing, Information Retrieval...

Start-up? Why not?

Tour de Slovénie

Seedcamp

• First European program inspired by YC (2007)

• London based

• 3 months, 50.000 EUR / 10%

Roller coaster

• 12. August: Deadline

• 20. August: Shortlist

• 23. August: Phone interview

• 24. August: Results

• 3. September: London week start

• 7. September: London week end

• 16. September: ==> London

3 months in London

Back to Ljubljana

• Figuring out that the US is our target market

• Figuring out where in the US to be and who to have here

• Partnerships

• And naturally the business model

And then ...

Technology

What do we do?

• Zemanta – Personal Writing Assistant

- on your current platform

• While bloggers write we suggest:

- images

- related articles

- in-text links

- tags

Some stats

• 80k bloggers monthly

• 1.3 million posts enhanced in 2011

How does it work

• Natural Language Processing

• Big database of “meanings” (entities, concepts, topics)

• Word Sense Disambiguation

• Linking out to Wikipedia, Freebase, …

• Categorization, Named Entity Recognition

• Information Retrieval

• Solr based, using features from NLP

• With some twists
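As a rough illustration of the word sense disambiguation bullet, the sketch below picks the candidate entity whose context words overlap the input text the most (a simplified Lesk-style heuristic). It is not Zemanta's method: the `CANDIDATES` table, the entity names and the scoring are all invented for this example.

```python
from collections import Counter

# Hypothetical background knowledge: surface form -> candidate entities,
# each with a set of context words. All entries are invented for illustration.
CANDIDATES = {
    "jaguar": {
        "Jaguar (animal)": {"cat", "jungle", "predator", "prey", "species"},
        "Jaguar Cars": {"car", "engine", "luxury", "british", "sedan"},
    },
}


def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    return [w.strip(".,;:!?()\"'").lower() for w in text.split()]


def disambiguate(surface_form, text):
    """Return (best entity, all scores) for an ambiguous surface form.

    The score of a candidate is the number of times its context words
    occur in the text.
    """
    words = Counter(tokenize(text))
    senses = CANDIDATES[surface_form.lower()]
    scores = {
        entity: sum(words[w] for w in context)
        for entity, context in senses.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    print(disambiguate("Jaguar", "The new Jaguar sedan has a powerful engine."))
```

A production system would replace the hand-written context sets with statistics derived from the background knowledge, but the shape of the decision (score every candidate against the surrounding text, keep the best) stays the same.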

[Diagram: Plain text (article), Analysis, Semantic search, Content suggestions, Background knowledge, Indexed content]

“Text Understanding”

- Input is a meaningful chunk of text (not a keyword or a phrase)

- Input is (semi) English language

- Has to work across all domains in the open world

- music, celebrities, finance, entertainment, politics, gardening, parenting, …


Background knowledge

- Data from Wikipedia, MusicBrainz, Freebase… and the world wild web

- Includes linguistic and semantic properties and unstructured data

- Present in two forms:

- in “original” custom-built triple store on top of MySQL (150 GB)

- processed into 7 GB optimized “memory mapped dump”
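To give a feel for what a “memory mapped dump” can look like, here is a minimal sketch: fixed-width records are written to a file sorted by id, then the file is memory-mapped read-only and searched with binary search, so nothing has to be parsed or loaded up front. The record layout (a `uint32` entity id plus a `float32` score) and all function names are assumptions made for this example; the talk does not describe the actual format.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: entity id (uint32) and a score (float32),
# little-endian, 8 bytes per record.
RECORD = struct.Struct("<If")


def write_dump(path, records):
    """Write (entity_id, score) records sorted by id so lookups can bisect."""
    with open(path, "wb") as f:
        for entity_id, score in sorted(records):
            f.write(RECORD.pack(entity_id, score))


def open_dump(path):
    """Memory-map the dump read-only; the OS pages data in on demand."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def lookup(mm, entity_id):
    """Binary search over the mapped records; returns the score or None."""
    lo, hi = 0, len(mm) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        found_id, score = RECORD.unpack_from(mm, mid * RECORD.size)
        if found_id == entity_id:
            return score
        if found_id < entity_id:
            lo = mid + 1
        else:
            hi = mid
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "entities.bin")
        write_dump(path, [(42, 0.5), (7, 0.25), (1000, 1.0)])
        mm = open_dump(path)
        try:
            print(lookup(mm, 42), lookup(mm, 8))
        finally:
            mm.close()
```

The appeal of this kind of layout is that several worker processes can map the same file and share the pages through the operating system's cache instead of each holding a private copy.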

Analysis pipeline

- Named Entity Extraction

- Known phrases extraction (aho-corasick)

- Triple store

- Surface form features evaluation

- Statistical comparison to background knowledge

- Semantic coherence and hand-tuned heuristics

- Disambiguated entities
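The known phrases extraction step names Aho-Corasick, which finds every occurrence of a large set of phrases in a single pass over the text. Below is a compact, self-contained sketch of the algorithm. The class name, the phrase list and the character-level matching (no word-boundary checks) are choices made for this example, not details taken from the talk.

```python
from collections import deque


class AhoCorasick:
    """Multi-pattern matcher: a trie with failure links, built once."""

    def __init__(self, phrases):
        self.goto = [{}]   # goto[state][char] -> next state
        self.fail = [0]    # fail[state] -> state for the longest proper suffix
        self.out = [[]]    # out[state] -> phrases ending at this state
        for phrase in phrases:
            self._add(phrase)
        self._build_failure_links()

    def _add(self, phrase):
        state = 0
        for ch in phrase:
            nxt = self.goto[state].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = nxt
            state = nxt
        self.out[state].append(phrase)

    def _build_failure_links(self):
        # Breadth-first, so shallower states are finished before deeper ones.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit phrases that end at the suffix state.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find(self, text):
        """Return (start_index, phrase) for every match, overlaps included."""
        state = 0
        matches = []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for phrase in self.out[state]:
                matches.append((i - len(phrase) + 1, phrase))
        return matches


if __name__ == "__main__":
    matcher = AhoCorasick(["new york", "new york times", "york", "times"])
    for start, phrase in sorted(matcher.find("the new york times")):
        print(start, phrase)
```

Overlapping matches such as "new york" and "new york times" are all reported, which suits a pipeline where later stages decide which candidate surface form to keep.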


Connecting content

• Indexing blogosphere and mediasphere

• Solr based index

• Twist: complicated queries – 50 terms

• Filtering out spam is “fun”

• Probably best “related content” in terms of accuracy

• Coming soon: social signal
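The “complicated queries” twist can be pictured as turning the weighted terms produced by the analysis into one long boosted Solr query. The sketch below only builds the request parameters and URL and sends nothing. The field name `text`, the core name `blogs`, the weights and the helper names are assumptions for illustration; `q`, `rows`, `fl` and `wt` are standard Solr select parameters, and `^` is the standard query-parser boost operator.

```python
from urllib.parse import urlencode


def build_related_query(weighted_terms, field="text", max_terms=50, rows=10):
    """Build Solr select parameters from (term, weight) pairs.

    Keeps the max_terms highest-weighted terms and ORs them together,
    each as a quoted phrase with its weight as a boost.
    """
    top = sorted(weighted_terms, key=lambda tw: tw[1], reverse=True)[:max_terms]
    clauses = []
    for term, weight in top:
        # Escape backslashes and quotes so the term stays one quoted phrase.
        escaped = term.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append('{}:"{}"^{:.2f}'.format(field, escaped, weight))
    return {"q": " OR ".join(clauses), "rows": rows, "fl": "id,score", "wt": "json"}


def select_url(base_url, params):
    """Append the encoded parameters to a core's /select handler."""
    return base_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    terms = [("ljubljana", 2.0), ("seedcamp", 1.5), ("startup", 0.5)]
    params = build_related_query(terms)
    print(params["q"])
    # The base URL below is a placeholder for a local Solr core.
    print(select_url("http://localhost:8983/solr/blogs", params))
```

Capping the number of terms keeps query cost bounded while still letting the strongest signals from the analysed text drive the ranking.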

But why just for bloggers?

Let's open up the API!

Some API users

Back to reality.

Age of “smart”

Blog me up, Scotty!

23. April 2012

Some takeaways

• Accelerators are good

• World is getting flatter

- But it will never be flat

• Start monetizing soon – to learn, not to earn

• Be where your market is

• Many markets left to innovate in

Thank you!
