Copyright © 2018 Omniscien Technologies. All Rights Reserved.



Page 2:

AI, MT and Language Processing Symposium

Dion Wiggins is a highly experienced ICT industry visionary, entrepreneur, analyst and consultant. He has an impressive knowledge in the fields of software development, architecture and management, as well as an in-depth understanding of Asian ICT markets. He is an accomplished speaker and has a high media profile for his perceptive analysis of ICT in Asia/Pacific.

Previously Dion was Vice President and Research Director for Gartner based in Hong Kong, where he was the most senior and highly-respected analyst based in all of Asia. Dion's research reports on ICT in China helped change the way the world views this market.

Dion is also a well-known pioneer of the Asian Internet Industry, being the founder of one of Asia's first ever ISPs (Asia Online in Hong Kong). In his role at Gartner and in various other consulting positions prior to that, Dion advised literally hundreds of enterprises on their ICT strategy.

Dion was a founder of The ActiveX Factory, where he was recipient of the Chairman's Commendation Award presented by Microsoft's Bill Gates for the best showcase of software developed in the Philippines. The US Government has recognized Dion as being in the top 5% of his field worldwide and he is a former holder of a US O1 Extraordinary Ability Visa.

Speaker Overview

Dion Wiggins, Chief Technology Officer, Co-Founder, Omniscien Technologies

Page 3:

Dion Wiggins, Chief Technology Officer, Co-Founder, Omniscien Technologies

Big Data and Domain Adaptation of Machine Translation

Page 4:

2 April 2018

Big Data and Domain Adaptation of Machine Translation

Dion Wiggins, Chief Technology Officer, [email protected]

Page 5:

Introduction

• This presentation is born out of Omniscien’s own challenges with developing high-quality in-domain MT engines for customers.

• Machine translation technologies have changed rapidly in recent years, enabling a range of new processing capabilities

• Static or periodically trained MT alone is not enough

• A live engine training platform was needed to keep MT engines current

• New approaches to teaching an engine to translate were needed

• Enhancements that find/manufacture MT corpora via Machine Learning/AI in conjunction with other NLP tools can be applied to assist in keeping MT engines up to date

• MT that adapts delivers a higher quality translation output and includes rich metadata created via cross language data enrichment

Page 6:

In a good wine, all the taste components "work" together. None of the flavors compete against each other, and you aren't overwhelmed by one aspect of the wine. The wine has a depth of flavors that evolve in "layers" the longer the wine is in your mouth, and after you've swallowed it.

The nuances are what make wines unique.

Page 7:

Near-Human Quality Translation is Possible - But Requires the Right Data

The coagulation time was determined as described above.
The setting time was determined as described above.

The lighting device also typically includes a light source disposed at the end of the light conductor.
The light device typically also includes a light source arranged at an end of the light guide.

Such communication between components is but one example of a unidirectional communication system.
Such communication between components is only one example of a one-way communication system.

The use of a hearing aid by a healthcare provider is routine.
The use of a stethoscope by health care providers is routine.

This can further enhance the electrical and long-term performance of the backsheet.
This may further increase the electrical properties and long-term performance of the backsheets.

Initial Binding measurements were performed as described above for Plaque Initial Binding measurements.
Initial bonding measurements were carried out as described above for Plaque Initial Bonding Measurements.

The subtractive color mixture selected may depend on the metalized surface area and the resistance material used.
The subtractive process selected can depend upon the metallized structured surface region and the resist material utilized.

Japanese -> English Patent Translations. Key: MACHINE, HUMAN

Page 8:

Omniscien’s Old Approach Worked, But Had Expired / Run Its Course. A New Approach Was Needed

OLD APPROACH

• Start with an average, often better than Google, Industry Engine

• Customer:

• Translate with Industry engine as is

• Use Industry as a base to build a custom engine

• Problem:

• NMT engines now beat SMT Industry

• Enough data for SMT

• Not enough data for NMT

• Competitors are producing high-quality engines in NMT; we must raise our bar.

Page 9:

We Need Data

A Lot of Data

Page 10:

New Approach

• Begin from high-quality Deep NMT/SMT Hybrid Engines

• A typical SMT engine is built on 1 to 5 million clean, focused, in-domain bilingual sentences.

• A large SMT engine is built on ~30M sentences.

• New SMT/NMT Hybrid engines are built on between 50 million and 1 billion+ clean, focused, in-domain bilingual sentences.

Page 11:

Building Out Domains of Data

Where/How do you get 50M+ In-Domain Bilingual Sentences for each Domain?

General purpose. A generic engine that is not specialized to any domain, instead providing a broad coverage of topics.

Subtitles, dialog, closed captioning and e-learning materials.

SUBTITLES and DIALOG

Page 12:

Where Will The Data Come From?

• Crawling and Aligning Documents
  • Takes time
  • Can focus on domains, but must find matching sites

• RSS Feeds and Aligning Documents

• Previous Crawls (10+ Years of Data)

• Data Repositories

• Reprocessing Existing Data

• Client Data

Page 13:

Reprocessing our Existing Data

• We leveraged many of our existing tools

• We built new tools to validate and classify our existing data and improve the quality.

• We were able to identify the domain of huge bodies of ambiguous bilingual and monolingual data.

• We were able to analyze quality, domain and other meta data at a sentence-by-sentence level.

Diagram: Big Bucket of Mixed Data -> Domain ID -> News, Finance, Life Sciences
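The sentence-level domain identification described on this slide could be sketched as follows. This is a minimal illustration using a naive Bayes classifier over word counts; the seed sentences, the function names and the tiny training set are assumptions made for the example, not Omniscien's actual tooling.

```python
import math
from collections import Counter, defaultdict

# Hypothetical seed data: a few labeled sentences per domain.
SEED = {
    "News": ["the president met reporters after the election",
             "police said the protest ended overnight"],
    "Finance": ["the bank raised interest rates on loans",
                "shares fell as quarterly earnings missed forecasts"],
    "Life Sciences": ["the patient received a dose of the drug",
                      "clinical trial results showed reduced tumor growth"],
}

def train(seed):
    """Count word frequencies per domain and collect the vocabulary."""
    counts = defaultdict(Counter)
    for domain, sentences in seed.items():
        for s in sentences:
            counts[domain].update(s.lower().split())
    vocab = {w for c in counts.values() for w in c}
    return counts, vocab

def classify(sentence, counts, vocab):
    """Return the domain with the highest add-one-smoothed log-likelihood."""
    best, best_score = None, -math.inf
    for domain, c in counts.items():
        total = sum(c.values())
        score = sum(math.log((c[w] + 1) / (total + len(vocab)))
                    for w in sentence.lower().split())
        if score > best_score:
            best, best_score = domain, score
    return best

counts, vocab = train(SEED)
mixed_bucket = ["the bank cut interest rates", "the drug reduced tumor growth"]
sorted_by_domain = {s: classify(s, counts, vocab) for s in mixed_bucket}
```

A production system would presumably train on far more data and use stronger features, but the shape is the same: score each sentence against every domain and route it to the best match.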

Page 14:

NMT Research Paying Dividends

Engine Training Time
• July 2016: 4 weeks
• January 2017: 1 week
• May 2017: 1 day
• January 2018: Minutes/Hours

Rapid Domain Adaptation: March 2017 - Manufacturing of synthetic data for domain adaptation to enhance human translated training data.

Translation Speed (Single GPU)
• July 2016: 3,000 Words Per Minute
• January 2017: 20,000 Words Per Minute
• June 2017: 40,000 Words Per Minute
• January 2018: 50,000 Words Per Minute

XML Based Directives: March 2017 – Support for Glossary/Do Not Translate/Forced Translations

Hybrid NMT/SMT: February 2017 – Both technologies have shortcomings. Many are mitigated when used together in a hybrid mix of quality metrics and translation technologies.

Deep Learning NMT: July 2017 - The latest advance in NMT allows for more complex processing of input signals and notably higher quality translations.

Accurate Word Alignment: February 2017 - Accurate word alignment enables many other functions such as tag handling, formatting and cross referencing data between language pairs.
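To put the translation-speed figures on this slide in perspective, the short calculation below derives the relative speedup and the time needed for a fixed workload at each milestone. The one-million-word workload is an assumption chosen purely for illustration; the words-per-minute figures are those quoted on the slide.

```python
# Words-per-minute figures quoted on the slide (single GPU).
SPEEDS_WPM = {
    "July 2016": 3_000,
    "January 2017": 20_000,
    "June 2017": 40_000,
    "January 2018": 50_000,
}

WORKLOAD_WORDS = 1_000_000  # assumed workload, for illustration only

def minutes_needed(words, wpm):
    """Minutes required to translate `words` at `wpm` words per minute."""
    return words / wpm

baseline = SPEEDS_WPM["July 2016"]
for date, wpm in SPEEDS_WPM.items():
    speedup = wpm / baseline
    print(f"{date}: {minutes_needed(WORKLOAD_WORDS, wpm):.1f} min, "
          f"{speedup:.1f}x the July 2016 speed")
```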

Page 15:

Automation and Workflows

Humans are slow (comparatively), make mistakes and get tired.

Machines only make the mistakes humans programmed into them and can work 24 hours a day, 7 days a week, 365 days a year.

Page 16:

Tools at Our Disposal

Diagram labels: Rules, Syntax, Neural, Statistical, Data Repository, Process & Workflow, Machine Translation, Data Manufacturing, Complex Linguistic Processing, Engine Customization, Language Processing, Integration & Automation, Automated Management & Training, Expertise, APIs & Third Party Products, Machine Learning

Page 17:

New Tools

LSScript
• A JavaScript host for custom workflow and automation on Windows and Linux

LSTools
• A helper object written in Java for file / text manipulation and language processing

REST API
• Provides standards-based access to server processing such as translation and complex language processing

Page 18:

Understanding the Basics

Diagram labels: LSScript, JSJob, LSTools, Translate, Queue, RealTime

• Translate
• Tokenize
• LanguageID
• JSJob
• …

Job Load Management
• Chunking
• Load Balancing
• Sub-Jobs

• Workflow

LSTools
• FileSystem
• Text
• NLP
• Job
• Account
• …
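The job load management steps named on this slide (chunking, load balancing, sub-jobs) might look like the following in outline. This is a generic sketch and not the LSScript or LSTools API: the function names, the chunk size and the worker names are all hypothetical.

```python
from itertools import cycle

def chunk(sentences, size):
    """Split a list of sentences into sub-jobs of at most `size` items."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]

def assign_round_robin(sub_jobs, workers):
    """Distribute sub-jobs across workers in round-robin order."""
    plan = {w: [] for w in workers}
    for job_id, (worker, sub_job) in enumerate(zip(cycle(workers), sub_jobs)):
        plan[worker].append((job_id, sub_job))
    return plan

def reassemble(results):
    """Merge (job_id, sentences) results back into document order."""
    return [s for _, part in sorted(results) for s in part]

document = [f"sentence {n}" for n in range(7)]
sub_jobs = chunk(document, size=3)
plan = assign_round_robin(sub_jobs, workers=["engine-a", "engine-b"])

# Pretend each worker returned its sub-jobs unchanged, possibly out of order.
results = [item for jobs in plan.values() for item in jobs]
restored = reassemble(results)
```

Keeping a job identifier with every sub-job is what lets the pieces be processed in any order and on any worker, then stitched back together.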

Page 19:

What Data Do We Need?

• HTML from Websites
  • Average quality text in most cases.

• RSS Feeds
  • Similar to HTML

• PDF files
  • Higher quality, but with many technical challenges due to formatting issues

• Existing Data Repositories

• Newly Acquired Data

• Customer data

Page 20:

Many Bilingual Formats – Normalize to Tab Pair – Standardize Processing

Bilingual Data from Data Gathering
• TMX/XLIFF: Format Extract to Tab Pair
• SRT/TTML: Alignment to Tab
• Plain Text x 2: to Tab Pair
• XLS/XLSX/CSV: Extract to Tab Pair
• XML: Extract to Tab Pair
• Text: Align to Tab Pair
• Other??: Extract to Tab Pair

Tab Pair Processing Preparation

Tab Pair Quality Analysis & Classification

Tab Pair Data Extraction
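As one illustration of normalizing a bilingual format to tab pairs, the sketch below extracts source and target segments from a TMX document and emits one tab-separated line per translation unit. It uses only Python's standard library; the function name and the sample TMX are assumptions for the example, and real TMX files carry header attributes and inline markup that a production extractor would handle more carefully.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_to_tab_pairs(tmx_text, src_lang, tgt_lang):
    """Yield 'source<TAB>target' lines from TMX translation units."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() flattens any inline markup inside <seg>.
                text = "".join(seg.itertext())
                # Collapse whitespace so stray tabs/newlines cannot break the format.
                segs[lang] = " ".join(text.split())
        # Skip units that lack either side of the pair.
        if segs.get(src_lang) and segs.get(tgt_lang):
            yield f"{segs[src_lang]}\t{segs[tgt_lang]}"

SAMPLE_TMX = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Hello world.</seg></tuv>
    <tuv xml:lang="ko"><seg>안녕하세요 세계.</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Missing target.</seg></tuv></tu>
</body></tmx>"""

pairs = list(tmx_to_tab_pairs(SAMPLE_TMX, "en", "ko"))
```

Once every input format is reduced to the same tab-pair shape, the later preparation, quality analysis and extraction stages only need to understand one format.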

Page 21:

Meta Data Enrichment and Classification – Document and Page Level

Page 22:

Cross Language Data Enrichment

• LSScript + LSTools for Orchestration
  • LSScript manages the workflow: an ECMA-compliant enhanced JavaScript engine.
  • LSTools Helper Object for advanced functionality, communications and language processing.

• Data Enrichment
  • Named Entity Recognition (NER)
  • Different results in different languages. It is next to impossible to have the same data in all languages.
  • This can be enriched via real-time word alignment mapping
  • New entities can be detected from either the source or the target and used to enrich each other.

head

العراق كردستان

• Reverse Analysis Data Enrichment
  • Word Movement Tracking between languages: reverse mapping of information between source and target

• Additional Meta Data
  • Geolocation server: returns GPS coordinates for all locations
  • Third Party services: Alternative Names, Relationship Analysis, NER, Entity Linking, Sentiment, Synonyms

• Consolidated Meta Data Output
  • Enhances digital forensics
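The idea of enriching one language's entities from the other via word alignment can be sketched as below. The function, the token lists and the alignment pairs are hypothetical illustrations; a real system would obtain the alignment from the MT engine and the entities from an NER tool.

```python
def project_entities(src_entities, alignment, tgt_tokens):
    """Map source-side entity spans onto target tokens via word alignment.

    src_entities: list of (start, end_exclusive, label) over source tokens
    alignment:    set of (source_index, target_index) pairs
    """
    projected = []
    for start, end, label in src_entities:
        tgt_idx = sorted(t for s, t in alignment if start <= s < end)
        if tgt_idx:
            # Take the contiguous target span covering all aligned tokens.
            span = tgt_tokens[tgt_idx[0]:tgt_idx[-1] + 1]
            projected.append((" ".join(span), label))
    return projected

# Hypothetical example: the entity was detected only on the English side.
src_tokens = ["Earthquake", "in", "Mexico", "City"]
tgt_tokens = ["Terremoto", "en", "Ciudad", "de", "México"]
alignment = {(0, 0), (1, 1), (2, 4), (3, 2)}
src_entities = [(2, 4, "LOCATION")]  # "Mexico City"

tgt_entities = project_entities(src_entities, alignment, tgt_tokens)
```

The same function works in the reverse direction by swapping the roles of source and target and inverting the alignment pairs, which is how entities found on either side could enrich the other.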

Page 23:

Data Manufacturing

Creating new data

Page 24:

Gather New Data – Web Crawler

• Start with a list of URLs of bilingual websites

• We have identified 20 million+ bilingual sites by language combinations
  • List continues to grow

• We have been processing them progressively for several years
  • Crawling
  • Document matching
  • Extracting
  • Cleaning
  • …

HTML/PDF File

Page 25:

The Basics of Processing a Multilingual Website

Web Crawl -> Document Align -> Sentence Align -> Output is Bilingual Tab Pair Data
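One common heuristic for the document alignment step is to pair crawled pages whose URLs differ only in a language code. The sketch below assumes that convention; the URL pattern, the two language codes and the function name are illustrative assumptions, and sites that do not encode language in the path would need content-based matching instead.

```python
import re
from collections import defaultdict

# Matches a language code used as a whole path segment, e.g. /en/ or /ko/.
LANG_SEGMENT = re.compile(r"/(en|ko)(?=/|$)")

def align_by_url(urls):
    """Pair crawled URLs that differ only in their language path segment."""
    groups = defaultdict(dict)
    for url in urls:
        m = LANG_SEGMENT.search(url)
        if m:
            # Replace the language segment with a placeholder to form a shared key.
            key = LANG_SEGMENT.sub("/{lang}", url, count=1)
            groups[key][m.group(1)] = url
    return [(g["en"], g["ko"]) for g in groups.values()
            if "en" in g and "ko" in g]

crawled = [
    "https://example.com/en/products/widget",
    "https://example.com/ko/products/widget",
    "https://example.com/en/about",  # no Korean counterpart, so it is dropped
]
document_pairs = align_by_url(crawled)
```

Each resulting document pair would then go on to sentence alignment to produce the bilingual tab pair output.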

Page 26:

Data Synthesis

I watered my flowers
I watered my roses
I watered my plants
I watered my bicycle

He watered my flowers
She watered my flowers
I watered your flowers
He watered their flowers
He watered Jane's roses
I did not water my flowers
…
He did not water your roses

Multiply client in-domain data x 10-100?

Intelligently synthesize new and relevant bilingual sentences derived directly from high-quality in-domain client data
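The multiplication effect shown on this slide can be illustrated with simple slot substitution over a bilingual template. This is only a sketch of the general idea, assuming hand-written English-Spanish slot fillers; the slide does not describe the actual synthesis method, which would also need to filter out implausible combinations.

```python
from itertools import product

# Hypothetical bilingual template with aligned slots (English -> Spanish).
TEMPLATE = ("{subj_en} watered my {noun_en}", "{subj_es} mis {noun_es}")

# Each filler is an (English, Spanish) pair; the Spanish subject slot
# carries the conjugated verb so that agreement stays correct.
SUBJECTS = [("I", "Regué"), ("He", "Él regó"), ("She", "Ella regó")]
NOUNS = [("flowers", "flores"), ("roses", "rosas"), ("plants", "plantas")]

def synthesize(template, subjects, nouns):
    """Generate every subject/noun combination as a bilingual sentence pair."""
    src_t, tgt_t = template
    pairs = []
    for (s_en, s_es), (n_en, n_es) in product(subjects, nouns):
        slots = dict(subj_en=s_en, subj_es=s_es, noun_en=n_en, noun_es=n_es)
        pairs.append((src_t.format(**slots), tgt_t.format(**slots)))
    return pairs

synthetic = synthesize(TEMPLATE, SUBJECTS, NOUNS)
```

Here one seed pattern with three subjects and three nouns yields nine sentence pairs, which is the kind of 10x growth the slide refers to.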

Page 27:

Rapid Domain Adaptation

Learning in near-real time

Page 28:

Topics Change, Events Occur… Constantly and Quickly

Earthquake in Mexico

Protesters clash with police over missing activist in Argentina

School girls kidnapped in Nigeria

North Korea tensions

Shrine bombed in Thailand

Russian troop movements near Ukraine

Bomb on train in UK

Attempted coup in Turkey

Attack on publication in France

Iran uranium enrichment

Anti-government protests in Venezuela

Keeping up with terminology, names, locations and slang across languages for all these events and topics is a challenge for technology

Foiled plot to blow up plane in Australia

Page 29:

The Challenges of Language Change

• People and organizations of interest change rapidly. New names are a constant.

• Terms change with events: Hurricane Irma, Sluice Gate, Buk Missile

• Locations: Erawan Shrine, Chernobyl Nuclear Plant, Dhankutā

Machines and data models are trained on the data available at the time of training.

Page 30:

Gather New Data – RSS Feeds

• Focus on News

• Most newspapers support RSS

• Newspapers are in every language

Page 31:

Keeping up by Automating Language Learning

• RSS feeds from major news sites in each language of interest.

• Data is stored in buckets with associated metadata attached.

• Many articles are translations but not directly associated or linked to the source document.

US English, Russian, … Korean
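Collecting RSS items into per-language buckets with metadata attached could be sketched as follows, using only the standard library. The bucket structure, the field names and the sample feed are assumptions for illustration; fetching the feed over the network is omitted.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def bucket_feed(rss_text, language, buckets):
    """Parse RSS 2.0 items and store them in a per-language bucket with metadata."""
    channel = ET.fromstring(rss_text).find("channel")
    source = channel.findtext("title", default="")
    for item in channel.iter("item"):
        buckets[language].append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "source": source,
        })

SAMPLE_FEED = """<rss version="2.0"><channel><title>Example News</title>
<item><title>Sample headline</title>
<link>https://example.com/en/1</link>
<pubDate>Mon, 02 Apr 2018 08:00:00 GMT</pubDate></item>
</channel></rss>"""

buckets = defaultdict(list)
bucket_feed(SAMPLE_FEED, "en", buckets)
```

Storing the publication date and source alongside each article is what later makes it possible to narrow down which articles in another language might be translations of it.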

Page 32:

Unsupervised Document Matching

• Documents are first categorized to reduce comparison scope.

• Once matched, bilingual sentences are aligned and extracted.

Diagram: US English and Korean documents are each analyzed and broken down by classification, then matching documents are analyzed. Document pairs are analyzed for matching bilingual sentences, producing Aligned Bilingual Sentences (KO-EN). These feed Automated Unsupervised Engine Training, which yields an Updated MT Engine with new terminology.
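A minimal sketch of category-scoped document matching is shown below. It compares only documents that share a category, and scores candidates by the overlap of language-neutral anchors such as numbers and Latin-script names. The anchor heuristic, the threshold and the sample documents are assumptions for the example; the slide does not specify which matching signals are actually used.

```python
import re

def anchors(text):
    """Extract language-neutral anchors: numbers and capitalized Latin-script tokens."""
    return set(re.findall(r"\d+(?:[.,]\d+)*|[A-Z][A-Za-z]+", text))

def jaccard(a, b):
    """Overlap of two sets as a fraction of their union."""
    return len(a & b) / len(a | b) if a | b else 0.0

def match_documents(src_docs, tgt_docs, threshold=0.3):
    """Best match per source document, comparing only within the same category."""
    matches = []
    for s in src_docs:
        candidates = [t for t in tgt_docs if t["category"] == s["category"]]
        scored = [(jaccard(anchors(s["text"]), anchors(t["text"])), t)
                  for t in candidates]
        if scored:
            score, best = max(scored, key=lambda x: x[0])
            if score >= threshold:
                matches.append((s["id"], best["id"], score))
    return matches

src_docs = [{"id": "en-1", "category": "news",
             "text": "Magnitude 7.1 earthquake hits Mexico on 19 September 2017"}]
tgt_docs = [
    {"id": "ko-1", "category": "news",
     "text": "2017년 9월 19일 멕시코에서 규모 7.1 지진 발생"},
    # Same numbers but a different category, so it is never compared.
    {"id": "ko-2", "category": "finance", "text": "2017년 9월 19일 7.1"},
]
matches = match_documents(src_docs, tgt_docs)
```

Categorizing first keeps the number of pairwise comparisons manageable, since each document is scored only against the candidates in its own category rather than the whole collection.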

Page 33:

Automated Unsupervised Training of Hybrid NMT/SMT

• Data Manufacturing

• SMT is able to be updated in near real-time
  • Kept current always
  • Language model excluded

• Full SMT and NMT updates daily

• Managed progressive switch over
  • Switch time is transparent

Diagram labels:
• Aligned Bilingual Sentences (KO-EN), Target Language Monolingual Data (EN)
• Bilingual Data Manufacturing, Monolingual Data Manufacturing
• Adaptive SMT Training (near real-time): KO-EN, Always Current
• Daily SMT Training: Live Engine Switching (2-5 Minutes)
• Daily NMT Training: Live Engine Switching (10 seconds)
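Live engine switching, where a freshly trained engine replaces the running one without interrupting requests, can be sketched with an atomic reference swap. The class, the smoke test and the stand-in engines below are hypothetical; they illustrate the pattern rather than the platform's actual mechanism.

```python
import threading

class LiveEngine:
    """Holds the active engine and swaps it when a retrained one is ready."""

    def __init__(self, engine):
        self._engine = engine
        self._lock = threading.Lock()

    def translate(self, text):
        with self._lock:
            engine = self._engine  # take a reference under the lock
        # Translate outside the lock so a slow request never blocks a switch.
        return engine(text)

    def switch(self, new_engine, smoke_test):
        """Install new_engine only if it passes a sanity check; return success."""
        if not smoke_test(new_engine):
            return False
        with self._lock:
            self._engine = new_engine
        return True

# Hypothetical stand-ins for trained engines: plain functions from text to text.
def yesterday(text):
    return f"[v1] {text}"

def today(text):
    return f"[v2] {text}"

live = LiveEngine(yesterday)
before = live.translate("hello")
switched = live.switch(today, smoke_test=lambda e: bool(e("test")))
after = live.translate("hello")
```

Requests already in flight finish on the old engine while new requests pick up the replacement, so callers never see a gap during the switch.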

Page 34:

Summary / Conclusions

• NLP and NMT research has advanced notably

• Data is out there, but processing is expensive and complex

• Data volumes needed far exceed anything in the past
  • Billions of bilingual sentences

• Automated Rapid Adaptation of MT scope is a necessity
  • Content and topics change, and always will

• MT engines and NLP tools can keep up with change using rapid adaptation approaches that learn continuously in near-real time

• As content changes, meta data and the tools that leverage them are immediately out of date
  • Cross language data enrichment ensures the latest data can be fed back into tools for progressive learning in near-real time

Page 35:

Big Data and Domain Adaptation of Machine Translation

Dion Wiggins, Chief Technology Officer, [email protected]
