
Post on 29-Dec-2021


Page 1: Phonexia Portfolio Brochure

Phonexia Product Portfolio

Turning Voice into Knowledge


TABLE OF CONTENTS

About Phonexia
Phonexia Speaker Identification
Phonexia Language Identification
Phonexia Gender Identification
Phonexia Keyword Spotting
Phonexia Speech Transcription
Phonexia Speaker Diarization
Phonexia Voice Activity Detection
Phonexia Speech Quality Estimation
Phonexia Age Estimation
Phonexia Voice Inspector
Phonexia Denoiser
Integration Possibilities and Licensing


About Phonexia

Phonexia transforms voice into knowledge with its innovative speech analytics and voice biometrics technologies. Its Phonexia Speech Platform is the first on the market to use exclusively deep neural networks for speaker identification, delivering extremely accurate and fast results. The platform packs a wide range of speech technologies into a single, highly modular package that is easy to integrate with other solutions. Phonexia's innovations are available through its network of integration partners. A university spin-off, Phonexia has been delivering its technologies to call centers, financial institutions and security agencies in more than 60 countries since 2006.

14 years on the market. Based in CZ, the European Union. Projects in 60 countries.

Phonexia Products

Phonexia Voice Biometrics helps identify or search for a speaker by comparing a recording with a previously created voiceprint. Like a fingerprint, a voiceprint can be used for voice authentication, fraud prevention or speaker search.

Phonexia Speech Analytics provides ready-to-analyze data on speech content using full Speech Transcription, Keyword Spotting (a phonetically based keyword search) or Language Identification.

Phonexia Voice Inspector is an out-of-the-box solution providing police forces and forensic experts with a highly accurate Speaker Identification tool that supports criminal investigations.

Phonexia Denoiser software cleans audio signals of reverberation and other noises to make them more audible to the human ear.

Phonexia Speaker Identification

Output

XML/JSON format with all results, or results files with a log-likelihood ratio (-∞;∞) and/or percentage metric scoring <0–100%>
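The brochure does not state how the two score types relate to each other. Purely as an illustration, a log-likelihood ratio can be squashed into a 0–100% score with a logistic function. The natural-log base, the equal priors and the function name `llr_to_percent` are assumptions of this sketch, not Phonexia's documented calibration.

```python
import math


def llr_to_percent(llr: float) -> float:
    """Map a log-likelihood ratio in (-inf, inf) to a score in [0, 100].

    Hypothetical mapping: assumes a natural-log LLR and equal priors.
    """
    # Numerically stable logistic function: never exponentiates a large
    # positive number, so extreme scores do not overflow.
    if llr >= 0:
        p = 1.0 / (1.0 + math.exp(-llr))
    else:
        e = math.exp(llr)
        p = e / (1.0 + e)
    return 100.0 * p
```

With this mapping an LLR of 0 (no evidence either way) lands at 50%, and the score rises monotonically with the LLR.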

Accuracy and speed

Phonexia provides several technology models optimized for different use cases.

The most precise model (XL4) is optimized for best performance on a short speech signal. With it, a speaker can be verified from 3 seconds of net speech with more than 92% accuracy. With a longer speech signal, the accuracy can reach 97% without calibration. This result can be improved further through calibration within the customer's environment. The model runs approximately 64 times faster than real time for a typical call center call (one channel).

The accuracy was measured on the NIST SRE16 dataset.

Technology

• A calibration tool for even higher accuracy
• 1:1 (verification), 1:n and n:m (identification) comparison possible
• The technology is language-, accent-, text-, and channel-independent
• Uses deep neural networks to generate highly representative voiceprints
• Applies state-of-the-art channel compensation techniques, verified by NIST evaluation
• Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.
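The three comparison modes in the list above differ only in how many voiceprints sit on each side. The sketch below illustrates that with plain cosine similarity over toy vectors. Cosine scoring, the `Voiceprint` type, the threshold and all function names are stand-ins chosen for this example and do not describe Phonexia's actual scorer.

```python
import math
from typing import Dict, List, Tuple

Voiceprint = List[float]  # hypothetical: a fixed-length embedding vector


def cosine_score(a: Voiceprint, b: Voiceprint) -> float:
    """Cosine similarity in [-1, 1]; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def verify(probe: Voiceprint, enrolled: Voiceprint, threshold: float = 0.5) -> bool:
    """1:1 verification: accept if the score reaches an (illustrative) threshold."""
    return cosine_score(probe, enrolled) >= threshold


def identify(probe: Voiceprint, gallery: Dict[str, Voiceprint]) -> List[Tuple[str, float]]:
    """1:n identification: rank every enrolled speaker, best match first."""
    scored = [(name, cosine_score(probe, vp)) for name, vp in gallery.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def cross_compare(
    probes: Dict[str, Voiceprint], gallery: Dict[str, Voiceprint]
) -> Dict[str, List[Tuple[str, float]]]:
    """n:m comparison: run a 1:n identification for each probe."""
    return {probe_id: identify(vp, gallery) for probe_id, vp in probes.items()}
```

In this framing, n:m is simply 1:n repeated per probe, so its cost grows with the product of the two set sizes.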

Input

Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)

Minimum speech signal for enrolment: recommended 20+ secs

Minimum speech signal for identification: recommended 3+ secs

In specific use cases the time required for speaker enrolment and identification can be much shorter.
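Before submitting a file, it can be useful to check locally that it meets the 8 kHz+ sampling requirement. The following is a minimal sketch using Python's standard `wave` module, which reads plain PCM WAV files only (A-law, Mu-law, ADPCM, FLAC and OPUS inputs would need another reader). The helper name `wav_summary` is hypothetical. Note that the reported duration is the total file length, not net speech.

```python
import wave

MIN_SAMPLE_RATE_HZ = 8000  # from the "8 kHz+ sampling" input requirement


def wav_summary(path: str) -> dict:
    """Read basic properties of a PCM WAV file and check its sample rate."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate_hz": rate,
            "bits_per_sample": 8 * wav.getsampwidth(),
            "channels": wav.getnchannels(),
            # Total duration of the file, silence included.
            "duration_s": frames / rate if rate else 0.0,
            "rate_ok": rate >= MIN_SAMPLE_RATE_HZ,
        }
```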

Phonexia Speaker Identification uses the power of voice biometrics to recognize a speaker automatically by their voice. Its latest generation, called Deep Embeddings™, uses deep neural networks for even greater performance.

Voice Biometrics


Phonexia Language Identification

Technology

• The technology is text- and channel-independent
• Applies state-of-the-art channel compensation techniques, verified by NIST evaluation
• Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.

Supported languages

Oromo, Albanian, Amharic, Arabic_Egypt, Arabic_Gulf, Arabic_Iraqi, Arabic_Levantine, Arabic_Maghrebi, Arabic_MSA, Assamese, Azerbaijani, Bangla_Bengali, Basque, Belarusian, Bulgarian, Burmese, Catalan, Cebuano, Chinese_Cantonese, Chinese_Mandarin, Chinese_Min_Nan, Chinese_Wu, Chuvash, Czech, Dari, Dutch, English_American, English_British, English_Indian, Estonian, Farsi, French, Georgian, German, German_Switzerland, Greek, Guarani, Haitian_Creole, Hausa, Hindi, Hungarian, Indonesian, Italian, Japanese, Kazakh, Khmer, Kirundi_Kinyarwanda, Korean, Kurdish, Lao, Lithuanian, Luxembourgish, Macedonian, Ndebele, Pashto, Polish, Portuguese, Punjabi, Romanian, Russian, Serbo-Croat-Bosnian, Shona, Slovak, Slovenian, Somali, Spanish_American, Spanish_European, Swahili, Swedish, Tagalog, Tamil, Telugu, Thai, Tibetan, Tigrignya, Tok_Pisin, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Zulu

The Phonexia Language Identification (LID) system allows the automatic detection of spoken language or dialect.

Input

Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)

Minimum speech signal for identification: recommended 7+ secs

Output

XML/JSON format with all results, or results files with a logarithm of probabilities scoring (-∞;0> and/or percentage metric scoring <0–100%>
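A log-probability in (-∞;0> and a percentage in <0–100%> carry the same information if the logarithm's base is known. The sketch below assumes natural logarithms, which the brochure does not specify; the function names and the example language scores are invented for illustration.

```python
import math
from typing import Dict, Tuple


def log_prob_to_percent(log_p: float) -> float:
    """Convert a natural-log probability in (-inf, 0] to a percentage in [0, 100]."""
    if log_p > 0:
        raise ValueError("a log-probability cannot be positive")
    return 100.0 * math.exp(log_p)


def best_language(log_scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the top-scoring language label and its score as a percentage."""
    label = max(log_scores, key=log_scores.get)
    return label, log_prob_to_percent(log_scores[label])


# Made-up scores, for illustration only.
example = {"Czech": math.log(0.7), "Slovak": math.log(0.25), "Polish": math.log(0.05)}
top_label, top_percent = best_language(example)
```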

Processing speed

Approx. 20× faster than real-time processing on 1 CPU core with the most precise model, i.e., a standard 1 CPU core server processes 480 hours of audio in one day of computing time.
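The daily-throughput figure follows directly from the speed-up factor: a core running N× faster than real time handles N × 24 hours of audio per day. A small sketch of that arithmetic follows. The linear scaling across several cores is an assumption of this example, since the brochure quotes single-core figures only.

```python
def audio_hours_per_day(speedup: float, cores: int = 1) -> float:
    """Hours of audio processed in 24 hours of computing time.

    Assumes throughput scales linearly with the number of cores,
    which real deployments may not achieve.
    """
    return speedup * 24.0 * cores


def processing_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock seconds needed to process a recording on one core."""
    return audio_seconds / speedup
```

For the 20× figure quoted above, `audio_hours_per_day(20)` gives 480 hours, and a one-hour recording takes `processing_seconds(3600, 20)`, i.e. 180 seconds.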

A user can add new languages to the system without assistance from Phonexia. Approx. 20 hours of audio recordings are recommended for training a new language.

Technology

• Uses the acoustic characteristics of speech
• Speech is converted to frequency spectra and modeled with advanced statistical methods
• The technology is language-, accent-, text-, and channel-independent
• Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.

Input

Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)

Minimum speech signal for identification: recommended 7+ secs

Output

XML/JSON format with all results, or results files with processed information (scores for male and female)

Phonexia Gender Identification

Phonexia Gender Identification (GID) automatically recognizes the gender of a speaker.

Processing speed

Approx. 200× faster than real-time processing on 1 CPU core with the most precise model, i.e., a standard 1 CPU core server processes 4,800 hours of audio in one day of computing time.


Technology

• Robust acoustic-based technology, even with noisy recordings
• Keywords are automatically converted into phonemes and searched for
• Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.

Input

Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)

List of keywords or key phrases to be searched for

Output

XML/JSON format with all results, or results files generated with detected keywords (containing the keyword, start/end time, path, probability, etc.)
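Downstream code typically filters such detections by probability before acting on them. The sketch below does that on a made-up JSON payload. The field names (`keyword`, `start`, `end`, `path`, `probability`) mirror the items listed above, but the exact schema is an assumption and not the product's real output format.

```python
import json
from typing import List

# Hypothetical payload; the real result schema is not given in the brochure.
SAMPLE_RESULTS = """
[
  {"keyword": "invoice", "start": 12.4, "end": 13.0, "path": "call_002.wav", "probability": 0.91},
  {"keyword": "refund",  "start": 3.1,  "end": 3.6,  "path": "call_001.wav", "probability": 0.42},
  {"keyword": "refund",  "start": 40.0, "end": 40.5, "path": "call_001.wav", "probability": 0.88}
]
"""


def confident_hits(results_json: str, min_probability: float = 0.8) -> List[dict]:
    """Keep detections at or above the threshold, ordered by file and start time."""
    hits = json.loads(results_json)
    kept = [hit for hit in hits if hit["probability"] >= min_probability]
    return sorted(kept, key=lambda hit: (hit["path"], hit["start"]))


hits = confident_hits(SAMPLE_RESULTS)
```

The threshold trades missed keywords against false alarms, so it is worth tuning on recordings from the target channel.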

Phonexia Keyword Spotting

Phonexia Keyword Spotting (KWS) identifies the occurrences of keywords and/or key phrases in audio recordings.

Processing speed

The 5th generation is approximately 30× faster than real-time processing on 1 CPU core, i.e., a standard 1 CPU core server processes 720 hours of audio in one day of computing time. The 4th generation is approximately 10× faster than real-time processing on 1 CPU core.

Supported languages

Language                  Code    Note
Arabic (Levantine)        ar-XL   5th Gen.
Arabic (Gulf)             ar-KW   4th Gen.
Chinese                   zh-CN   4th Gen. – Beta
Croatian                  hr-HR   5th Gen.
Czech                     cs-CZ   5th Gen.
Dutch                     nl-NL   5th Gen.
English UK                en-UK   4th Gen.
English US                en-US   5th Gen.
Farsi                     fa-IR   4th Gen. – Beta
French                    fr-FR   4th Gen.
German                    de-DE   4th Gen.
Hungarian                 hu-HU   4th Gen. – Beta
Italian                   it-IT   4th Gen.
Pashtu                    ps-AR   4th Gen.
Polish                    pl-PL   5th Gen.
Russian                   ru-RU   5th Gen.
Slovak                    sk-SK   5th Gen.
Spanish – Latin America   es-LA   5th Gen.
Swedish                   sw-SE   5th Gen.
Turkish                   tr-TR   4th Gen. – Beta

A user can add an unlimited number of keywords to the system, as well as an unlimited number of pronunciation variants for each keyword.

Technology

• In the fifth generation, a Language Model Customization tool is available for the optional addition of desired words to the model
• Trained with an emphasis on spontaneous telephone conversation
• Based on state-of-the-art techniques for acoustic modeling, including discriminative training and neural network-based features
• Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.

Input

Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)

Processing speed

The 5th generation is approximately 7× faster than real-time processing on 1 CPU core, i.e., a standard 1 CPU core server processes 168 hours of audio in one day of computing time. The 4th generation is approximately 1.2× faster than real-time processing.

Phonexia Speech Transcription

Phonexia Speech Transcription (STT) converts speech signals into plain text.

Output

XML/JSON format with all results, or results files with:

• One-best transcription, i.e., a file with a time-aligned speech transcript (the start and end time of each word)
• n-best transcription, i.e., a confusion network with hypotheses for words at each moment
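The two output types are related: picking the most probable hypothesis in each slot of a confusion network yields a one-best, time-aligned transcript. The sketch below shows that reduction on a toy network. The `Slot` structure and the use of an empty string for a "no word" hypothesis are conventions invented for this example, not the product's file format.

```python
from typing import List, NamedTuple, Tuple


class Slot(NamedTuple):
    """One time span of a confusion network with competing word hypotheses."""
    start: float
    end: float
    hypotheses: List[Tuple[str, float]]  # (word, probability); "" means no word


def one_best(network: List[Slot]) -> List[Tuple[float, float, str]]:
    """Pick the most probable word per slot, skipping 'no word' winners."""
    transcript = []
    for slot in network:
        word, _prob = max(slot.hypotheses, key=lambda hyp: hyp[1])
        if word:
            transcript.append((slot.start, slot.end, word))
    return transcript


# Toy network, for illustration only.
network = [
    Slot(0.0, 0.4, [("hello", 0.9), ("yellow", 0.1)]),
    Slot(0.4, 0.5, [("", 0.7), ("a", 0.3)]),
    Slot(0.5, 1.1, [("world", 0.6), ("word", 0.4)]),
]
text = " ".join(word for _start, _end, word in one_best(network))
```

Keeping the full network rather than only the one-best lets a search or analytics layer recover words that narrowly lost, such as "word" in the last slot here.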

Supported languages

Language                  Code    Note
Arabic (Levantine)        ar-XL   5th Gen.
Arabic (Gulf)             ar-KW   4th Gen. – Beta
Chinese                   zh-CN   4th Gen. – Beta
Croatian                  hr-HR   5th Gen.
Czech                     cs-CZ   5th Gen.
Dutch                     nl-NL   5th Gen.
English UK                en-UK   4th Gen.
English US                en-US   5th Gen.
Farsi                     fa-IR   4th Gen. – Beta
French                    fr-FR   4th Gen.
German                    de-DE   4th Gen.
Italian                   it-IT   4th Gen.
Polish                    pl-PL   5th Gen.
Russian                   ru-RU   5th Gen.
Slovak                    sk-SK   5th Gen.
Spanish – Latin America   es-LA   5th Gen.
Swedish                   sw-SE   5th Gen.


Technology

• Trained with an emphasis on spontaneous telephone conversation
• The technology is language-, accent-, text-, and channel-independent
• Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.

Input

Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)

Output

XML/JSON format with all results, or results files with segmentation of speech, silence, and technical signals (i.e., elimination of phone line beeps, DTMF tones, music, etc.)

Audio file extracted for each speaker
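A per-speaker segmentation makes simple conversation statistics easy to derive, for example how long each party spoke. The sketch below sums segment durations per speaker. The `(speaker, start, end)` tuple layout and the speaker labels are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Segment = Tuple[str, float, float]  # hypothetical: (speaker label, start s, end s)


def talk_time(segments: Iterable[Segment]) -> Dict[str, float]:
    """Total speaking time in seconds for each speaker label."""
    totals: Dict[str, float] = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += end - start
    return dict(totals)


# Toy segmentation of a ten-second, two-speaker recording.
segments = [("agent", 0.0, 4.0), ("customer", 4.0, 6.5), ("agent", 6.5, 10.0)]
totals = talk_time(segments)
```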

Phonexia Speaker Diarization

Phonexia Speaker Diarization (DIAR) segments the voices of individual speakers in a single mono-channel audio recording.

Processing speed

Approx. 50× faster than real-time processing on 1 CPU core with the most precise model, i.e., a standard 1 CPU core server processes 1,200 hours of audio in one day of computing time.

Technology

• Trained with an emphasis on spontaneous telephone conversation
• The technology is language-, accent-, text-, and channel-independent
• Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.

Input

Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)

Output

XML/JSON format with all results, or results files with segmentation of speech, silence, and technical signals (i.e., elimination of phone line beeps, DTMF tones, music, etc.)

Audio file extracted for each speaker

Phonexia Voice Activity Detection

Phonexia Voice Activity Detection (VAD) identifies parts of audio recordings with speech content vs. non-speech content.

Processing speed

Approx. 50× faster than real-time processing on 1 CPU core with the most precise model, i.e., a standard 1 CPU core server processes 1,200 hours of audio in one day of computing time.


Technology

• Trained with an emphasis on spontaneous telephone conversation
• The technology is language-, accent-, text-, and channel-independent
• Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.

Input

Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)

Output

XML/JSON format with all results, or results files with labels (speech vs. non-speech segments)
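Speech/non-speech labels of this kind give the net speech duration of a recording, which is the quantity behind the recommended minimums quoted for Speaker Identification (3+ seconds for identification, 20+ seconds for enrolment). The sketch below computes it. The `(label, start, end)` layout and the label strings are assumptions made for this example.

```python
from typing import Iterable, Tuple

LabeledSegment = Tuple[str, float, float]  # hypothetical: (label, start s, end s)


def net_speech_seconds(segments: Iterable[LabeledSegment]) -> float:
    """Sum the durations of segments labeled as speech."""
    return sum(end - start for label, start, end in segments if label == "speech")


def enough_speech(segments: Iterable[LabeledSegment], minimum_s: float) -> bool:
    """Check net speech against a recommended minimum duration."""
    return net_speech_seconds(segments) >= minimum_s


# Toy labeling: 6.5 s of audio containing 3.5 s of speech.
labels = [("speech", 0.0, 2.0), ("non-speech", 2.0, 5.0), ("speech", 5.0, 6.5)]
```

Here `enough_speech(labels, 3.0)` holds while `enough_speech(labels, 20.0)` does not, so this toy recording would meet the identification recommendation but not the enrolment one.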

Phonexia Speech Quality Estimation

Phonexia Speech Quality Estimator (SQE) measures the quality parameters of the speech in an audio recording.

Processing speed

Approx. 150× faster than real-time processing on 1 CPU core with the most precise model, i.e., a standard 1 CPU core server processes 3,600 hours of audio in one day of computing time.

Technology

• The technology is language-, accent-, text-, and channel-independent
• Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, etc.

Input

Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)

Output

XML/JSON format with all results, or results files with:

• Global score, i.e., a percentage expression of audio quality (range <0; 100>). By default, the global score is calculated from the waveform_n_bits and waveform_snr variables.
• Detailed outputs, i.e., clipped signal, amplitude, sample values, sampling frequency, SNR, technical signal, encoding, etc.

Phonexia Age Estimation

Phonexia Age Estimation (AGE) estimates the age of a speaker from an audio recording.

Processing speed

Approx. 2,000× faster than real-time processing on 1 CPU core with the most precise model, i.e., a standard 1 CPU core server processes 48,000 hours of audio in one day of computing time.


Phonexia Voice Inspector

Phonexia Voice Inspector is an out-of-the-box solution providing police forces and forensic experts with highly accurate, AI-powered automatic speaker recognition to support criminal investigations.

Technology

• Deep Embeddings™: uses deep neural networks to generate highly representative voiceprints
• Applies state-of-the-art channel compensation techniques, verified by NIST evaluation
• Compatible with the widest range of audio sources possible: GSM/CDMA, 3G, VoIP, landlines, etc.
• Independent of language, accent, text and channel

Input

• WAV (8 or 16 bits linear coding), A-law and Mu-law, PCM, 8 kHz+ sampling
• 7 seconds recommended minimum speech signal duration for a questioned recording
• 20 seconds recommended minimum speech signal duration for a suspected speaker

Features and Benefits

• 1:1 speaker comparison in accordance with ENFSI guidelines
• 1:N speaker identification for more complex cases
• Automatic Forensic Voice Comparison
• A diarization tool to make working with audio recordings containing multiple speakers easier
• A phoneme recognizer for searching and visualizing the same phoneme sequences across multiple audio files
• An evaluation tool for measuring accuracy on a user's data sets
• A waveform editor with tools such as a spectrum panel, voice activity detection and more
• Easy management of investigation cases

Output

• Scoring as a likelihood ratio (LR), a log-likelihood ratio (LLR) and a verbal presentation of results
• Graphic presentation of the likelihood ratio (LR)
• Detailed report output (expert opinion template automatically generated) for presentation of results (to a court or an investigation team)
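The three score presentations are different views of one number: the LR is the LLR exponentiated, and a verbal statement assigns the LR to a band. The sketch below illustrates the relationship. The base-10 logarithm, the band thresholds and the wording are illustrative assumptions for this example and are not taken from Voice Inspector.

```python
from typing import List, Tuple


def llr_to_lr(llr: float, base: float = 10.0) -> float:
    """Convert a log-likelihood ratio to a likelihood ratio (base is an assumption)."""
    return base ** llr


# Illustrative bands only: (minimum strength, wording), strongest first.
VERBAL_BANDS: List[Tuple[float, str]] = [
    (1e6, "extremely strong"),
    (1e4, "very strong"),
    (1e3, "strong"),
    (1e2, "moderately strong"),
    (1e1, "moderate"),
    (2.0, "weak"),
]


def verbal_statement(lr: float) -> str:
    """Describe which hypothesis an LR supports, and how strongly."""
    if lr <= 0:
        raise ValueError("a likelihood ratio must be positive")
    # LR > 1 favors the same-speaker hypothesis; LR < 1 favors the
    # different-speakers hypothesis with strength 1 / LR.
    hypothesis = "same speaker" if lr >= 1 else "different speakers"
    strength = lr if lr >= 1 else 1.0 / lr
    for minimum, wording in VERBAL_BANDS:
        if strength >= minimum:
            return f"{wording} support for {hypothesis}"
    return "no meaningful support for either hypothesis"
```

Under these assumptions, an LLR of 2 corresponds to an LR of 100, which falls in the "moderately strong support for same speaker" band.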

Phonexia Voice Inspector User Interface

A visualization of scores from a sample case


Phonexia Denoiser

Phonexia Denoiser software cleans audio signals of reverberation and other noises to make them more audible to the human ear.

Technology

• Denoiser is distributed as a part of Phonexia Speech Engine and is accessible via REST API. Its algorithms use deep neural networks to automatically clean and reconstruct the processed audio signals. Removing noise and enhancing the speech signal improve audibility and make the speech content easier to understand. For each denoised file, the change in signal-to-noise ratio is reported to indicate the improvement achieved by denoising.
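The reported figure is a difference of two signal-to-noise ratios. As a reminder of the standard definition, SNR in decibels is 10·log10 of the ratio of signal power to noise power, and the improvement is the SNR after denoising minus the SNR before. The sketch below applies that definition to made-up power values. How Denoiser itself estimates the two powers is not described in the brochure, and the function names are hypothetical.

```python
import math


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels from average powers."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(signal_power / noise_power)


def snr_improvement_db(snr_before_db: float, snr_after_db: float) -> float:
    """Positive values mean the denoised file has a higher SNR."""
    return snr_after_db - snr_before_db


# Made-up example: noise power drops tenfold while signal power is unchanged.
before = snr_db(signal_power=1.0, noise_power=0.1)
after = snr_db(signal_power=1.0, noise_power=0.01)
gain = snr_improvement_db(before, after)
```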

Input

• A WAVE (*.wav) container including any of the following:
  • signed 8-bit PCM (s8)
  • signed 16-bit PCM (s16le)
  • IEEE float 32-bit (f32le)
  • IEEE float 64-bit (f64le)
  • A-law (alaw)
  • µ-law (mulaw)
  • ADPCM
• FLAC codec inside a FLAC (*.flac) container
• OPUS codec inside an OGG (*.opus) container

Output

• A RAW or WAV audio file (8 or 16 bits)

The processed audio is intended to be listened to and examined by an expert. It should not be used as input for other automatic processing.

Interfaces

• REST API interface

• Command line interface

• Graphical user interface (GUI) for evaluation

Supported OS

• Windows 64 bit (x86_64)

• Linux 64 bit (x86_64)

Licensing options

• USB dongle licensing key (offline license, on-premises installation)
• HW profile licensing key (offline license, on-premises installation)
• Licensing server (offline license, on-premises installation, used for HA)
• NET-based license (for demo purposes)

Integration Possibilities and Licensing

Phonexia offers multiple integration and licensing possibilities, as well as custom development.

Recommended hardware

For a production system, a 64-bit server processor with a large L3 cache is recommended (the larger, the better), for example Intel® Xeon® E5/E7/Gold/Platinum or Intel® Core™ i5/i7/i9 processors. Phonexia technologies also work in a virtualized environment.

Detailed consultation on hardware configuration is provided upon a specific deployment request.

Customization

Phonexia provides research and development services such as speech technology optimization for target channels, development of new language versions, etc. Phonexia also offers multiple engines balancing speed and accuracy according to the specific use case. Contact our team for more details.

More information

To learn more about Phonexia technologies, please do not hesitate to contact us at [email protected]


V-2020-10


Phonexia s.r.o.

+420 511 205 265
[email protected]
Chaloupkova 3002/1a, 612 00 Brno, Czech Republic, European Union

phonexia.com