
Chinese Academy of Sciences, Beijing, China

Speech and Language Processing Techniques

Report Document

Overview of MPEG-7

Dr Zhang Sen

Speech Group, INRIA-LORIA, Villers les Nancy, France

Chinese Academy of Sciences, Beijing, China

04/18/23




Outline of contents

• Introduction
• Basic Components
• Content Description
• Audiovisual (AV) Descriptions
• Multimedia Description Schemes
• XM and Applications
• More Information




Ozone WP2 architecture

[Figure: Ozone WP2 architecture diagram. Recoverable labels: Ozone application; Software Environment layer; Ozone Services; Situation Sensitivity; User Context; Ozone Context; User-interaction module; multi-modal widgets; dialog management; smart agent; user interface management; perception; QoS; security; authentication; speech recognition; gesture recognition; video browser; animated agent; ...]




From MPEG-1 to MPEG-7

[Figure: timeline with years 90, 92, 94, 98, 99, 01, ? showing MPEG-1, MPEG-2, MPEG-4 (v1, v2), MPEG-7, MPEG-21]

• MPEG-3 was once planned, but abandoned

• MPEG-5 and MPEG-6 were not defined




MPEG Family

MPEG-1 – Coding of moving pictures and audio for digital storage media (CD-ROM, MP3), 11/92

MPEG-2 – Generic coding of moving pictures and audio information (DVD, digital TV), 11/94

MPEG-4 – Coding of audiovisual objects for multimedia applications, Ver. 1 09/98, Ver. 2 11/99

MPEG-7 – Multimedia content description for AV material, 08/01

MPEG-21 – Digital AV framework: integration of multimedia technologies, 11/01




Why is MPEG-7 needed?

• Digital audiovisual information is increasing
– more and more content is available
– from all kinds of information sources

• Use of the digital audiovisual information
– description of the contents
– fast search of the contents




Objective of MPEG-7

• Standardize content-based description for various types of audiovisual information
– Enable fast and efficient content searching, filtering and identification
– Describe several aspects of the content (low-level features, structure, semantics, models, collections, creation, etc.)
– Address a large range of applications

• Types of audiovisual information:
– Audio, speech
– Moving video, still pictures, graphics, 3D models
– Information on how objects are combined in scenes




Scope of MPEG-7

• Description generation (feature extraction, indexing process, annotation and authoring tools, ...) and description consumption (search engine, filtering tool, retrieval process, browsing device, ...) are non-normative parts of MPEG-7.

• The goal is to define the minimum that enables interoperability.

[Figure: Description generation → Description (scope of MPEG-7) → Description consumption; generation and consumption are labelled "Research and future competition"]




Scope of MPEG-7

[Figure: Feature Extraction → MPEG-7 Description (standardization) → Search Engine]

Feature Extraction:
– Content analysis (D, DS)
– Feature extraction (D, DS)
– Annotation tools (DS)
– Authoring (DS)

MPEG-7 Scope:
– Description Schemes (DSs)
– Descriptors (Ds)
– Language (DDL)
Ref: MPEG-7 Concepts

Search Engine:
– Searching & filtering
– Classification
– Manipulation
– Summarization
– Indexing




Audio in MPEG-7

• Audio content description (yes)

• Sound retrieval and classifier (yes)

• Speech synthesis (no)

• Speech recognition (no)

• Probability Models (yes)




Parts of the MPEG-7 Standard

• ISO/IEC 15938-1: Systems
• ISO/IEC 15938-2: Description Definition Language
• ISO/IEC 15938-3: Visual
• ISO/IEC 15938-4: Audio
• ISO/IEC 15938-5: Multimedia Description Schemes
• ISO/IEC 15938-6: Reference Software




Main elements of MPEG-7

• Descriptors (D): representations of features; each defines the syntax and the semantics of a feature representation (low-level).

• Description Schemes (DS): specify the structure and semantics of the relationships between their components, which may be both Ds and DSs (high-level).

• Description Definition Language (DDL): based on XML Schema; allows the creation of new DSs and Ds, and the extension and modification of existing DSs.

• System tools: support multiplexing of descriptions, synchronization, transmission mechanisms, coded representations, and management and protection of intellectual property.




Relations of main elements

[Figure: diagram relating the DDL, Description Schemes (DS) and Descriptors (D); DS boxes contain further DS and D boxes]




Description Definition Language

• The Description Definition Language (DDL) is a language that defines which descriptions are valid and allows the creation of new Description Schemes and Descriptors. It also allows the extension and modification of existing Description Schemes.

• The DDL is used to define a set of formal rules, such as
– ordering of the elements
– occurrences of elements
– ……...

• XML + MPEG-7 extensions




XML: Base for DDL

• Why choose XML as the base for the DDL?
– The popularity of XML
– The interoperability with other standards in the future

• Why should XML be extended for MPEG-7?
– SGML > XML
– Structural extensions
– Datatype extensions




DDL parser

A DDL parser is software that checks whether a description is valid.

[Figure: Description + Schema → Parser → Yes or No]
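The yes/no check can be illustrated with a minimal sketch. It does not implement XML Schema or the real DDL: the rule table `TOY_SCHEMA`, its (tag, min, max) format and the element names are hypothetical, chosen only to show how ordering and occurrence rules lead to a validity decision.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy schema: for each element, the ordered list of allowed
# children as (tag, min_occurs, max_occurs). This is NOT XML Schema syntax.
TOY_SCHEMA = {
    "letter": [("header", 1, 1), ("text", 1, 1)],
    "header": [("name", 1, 1), ("address", 0, 1)],
    "address": [("street", 1, 1), ("city", 1, 1)],
}


def is_valid(element, schema=TOY_SCHEMA):
    """Return True if the children respect the ordering and occurrence rules."""
    rules = schema.get(element.tag)
    children = list(element)
    if rules is None:
        # Leaf element in this toy schema: it must not contain child elements.
        return not children
    pos = 0
    for tag, min_occurs, max_occurs in rules:
        count = 0
        while pos < len(children) and children[pos].tag == tag and count < max_occurs:
            count += 1
            pos += 1
        if count < min_occurs:
            return False
    if pos != len(children):
        return False  # unexpected, misordered, or surplus child
    return all(is_valid(child, schema) for child in children)


def parse_and_check(xml_text):
    """Mimic the parser's Yes/No answer for a description given as text."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not even well-formed
    return is_valid(root)
```

A description with the children in the wrong order, or one that is not well-formed XML, is answered with "No" (False); a real DDL parser would additionally check datatypes and the MPEG-7 extensions.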




Type of descriptions

• Low-level description (features, etc.)
– Generic and flexible
– Intelligent / efficient search engine

• High-level description (structures, concepts, etc.)
– Efficient and powerful
– Lack of flexibility




Low-level Description

• Information on the creation and production processes
– director, title, short feature movie

• Information related to the usage of the content
– copyright pointers, usage history, broadcast schedule

• Information on the storage features of the content
– storage format, encoding

• Information about low-level features in the content
– colors, textures, sound timbres, melody




High-level Description

• Structural description
– video segments, frames, still and moving regions, audio segments
– Segment DS (representing the spatial, temporal or spatio-temporal structure)

• Conceptual (semantic) description
– objects, events, and notions
– links between the two descriptions




Illustration of descriptions




Basic description

• Elements
– Information containers
– containing data and other elements
– <city> …… </city>

• Attributes
– Attribute-value pairs used to characterize elements
– <city population="10000"> …… </city>




Structured descriptions

• Structured descriptions are trees
• Trees are suitable for retrieval and search

[Figure: tree with a DS at the root, DSs and Ds as inner nodes, and Ds as leaves]
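A minimal sketch of such a tree, assuming two hypothetical Python classes (`Descriptor` as leaf, `DescriptionScheme` as inner node) that are not part of MPEG-7 itself. It only illustrates why a tree lends itself to search: a depth-first walk finds every node with a given name.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Descriptor:
    """Leaf node: a named feature representation with a value."""
    name: str
    value: object


@dataclass
class DescriptionScheme:
    """Inner node: groups Descriptors and other Description Schemes."""
    name: str
    children: List[Union["DescriptionScheme", Descriptor]] = field(default_factory=list)


def find(node, name):
    """Depth-first search: yield every node in the tree with the given name."""
    if node.name == name:
        yield node
    if isinstance(node, DescriptionScheme):
        for child in node.children:
            yield from find(child, name)


# Hypothetical description of a video with one shot.
video = DescriptionScheme("Video", [
    Descriptor("Title", "The daily news"),
    DescriptionScheme("Shot", [
        Descriptor("DominantColor", (12, 80, 200)),
        Descriptor("MotionActivity", 0.3),
    ]),
])
```

For example, `[d.value for d in find(video, "MotionActivity")]` collects the motion activity values of all shots in the description.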




Description trees

<letter>
  <header>
    <name> Mr Sen </name>
    <address>
      <street> 16 rue Laplace </street>
      <city> Nancy </city>
    </address>
  </header>
  <text> Dear Mr White, … </text>
</letter>

[Figure: the corresponding tree: letter → (header, text); header → (name, address); address → (street, city)]
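The correspondence between the XML text and its tree can be checked with Python's standard `xml.etree.ElementTree` module. This is only a sketch; the helper name `outline` is made up for the illustration, and the elided letter text is left as "...".

```python
import xml.etree.ElementTree as ET

LETTER = """
<letter>
  <header>
    <name> Mr Sen </name>
    <address>
      <street> 16 rue Laplace </street>
      <city> Nancy </city>
    </address>
  </header>
  <text> Dear Mr White, ... </text>
</letter>
"""


def outline(element, depth=0):
    """Return the element tree as a list of indented tag names."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(LETTER)
tree_lines = outline(root)
# A path expression walks the tree from the root down to one leaf.
city = root.findtext("header/address/city").strip()
```

Printing `"\n".join(tree_lines)` shows the same nesting as the tree in the figure.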




Example: Audio description

<Mpeg7Main>
  <DescriptionMetadata>
    <Version>1.0</Version>
  </DescriptionMetadata>
  <ContentDescription>
    <AudioContent xsi:type="AudioType">
      <Audio>
        <CreationInformation>
          <Creation>
            <Title> The daily news </Title>
          </Creation>
        </CreationInformation>
      </Audio>
    </AudioContent>
  </ContentDescription>
</Mpeg7Main>
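A hedged sketch of how an application might read such a description with the Python standard library. Two assumptions are made for the sake of a parseable example: the type attribute is taken to be in the usual XML Schema instance namespace (prefix `xsi`), and that namespace is declared on the root element; neither is shown on the slide.

```python
import xml.etree.ElementTree as ET

# Assumption: the type attribute uses the standard XML Schema instance namespace.
XSI = "http://www.w3.org/2001/XMLSchema-instance"

DESCRIPTION = f"""
<Mpeg7Main xmlns:xsi="{XSI}">
  <DescriptionMetadata><Version>1.0</Version></DescriptionMetadata>
  <ContentDescription>
    <AudioContent xsi:type="AudioType">
      <Audio>
        <CreationInformation>
          <Creation><Title> The daily news </Title></Creation>
        </CreationInformation>
      </Audio>
    </AudioContent>
  </ContentDescription>
</Mpeg7Main>
"""

root = ET.fromstring(DESCRIPTION)
content = root.find("ContentDescription/AudioContent")
content_type = content.get(f"{{{XSI}}}type")  # attribute in the xsi namespace
title = root.findtext(".//Title").strip()
version = root.findtext("DescriptionMetadata/Version")
```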




Audio description

• Low-level Description
– spectrum, parametric, and temporal features

• High-level Description
– Audio signature Description Scheme
– Instrument timbre Description Schemes
– The melody Description Tools
– Sound recognition and indexing Description Tools
– Spoken Content Description Tools




Audio low-level descriptors

• Waveform
• Loudness
• Spectral basis
• Spectral envelope
• Spectral centroid
• Spectral spread
• Fundamental frequency
• Harmonicity
• Attack time




Audio descriptor: Basic

• Two basic audio Descriptors
– AudioWaveform Descriptor
  • describes the audio waveform envelope (minimum and maximum)
– AudioPower Descriptor
  • describes the temporally-smoothed instantaneous power
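The two basic descriptors can be sketched in a few lines. This is an illustration under stated assumptions, not the normative extraction: the frame-based layout, the `frame_size` parameter and the use of a plain per-frame mean as the smoothing are choices made here.

```python
def audio_waveform(samples, frame_size):
    """Per-frame (minimum, maximum) of the samples: the waveform envelope."""
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    return [(min(f), max(f)) for f in frames]


def audio_power(samples, frame_size):
    """Per-frame mean of the squared samples: a temporally smoothed power."""
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    return [sum(x * x for x in f) / len(f) for f in frames]


# Hypothetical 8-sample signal, split into two frames of 4 samples.
signal = [0.0, 0.5, -0.5, 1.0, -1.0, 0.0, 0.25, -0.25]
envelope = audio_waveform(signal, frame_size=4)  # [(-0.5, 1.0), (-1.0, 0.25)]
power = audio_power(signal, frame_size=4)        # [0.375, 0.28125]
```

Both results are much shorter than the signal itself, which is what makes them usable as a compact display or index of the waveform.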




Audio descriptor: Basic Spectral

• AudioSpectrumEnvelope Descriptor
– describes the short-term power spectrum

• AudioSpectrumCentroid Descriptor
– describes the center of gravity of the log-frequency power spectrum

• AudioSpectrumSpread Descriptor
– describes the second moment of the log-frequency power spectrum

• AudioSpectrumFlatness Descriptor
– describes the flatness properties of the spectrum
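A simplified sketch of the three summary statistics. Assumptions, none of them taken from the slide: frequencies are put on a log2 scale relative to 1 kHz, the spread is reported as the square root of the second central moment, flatness is the ratio of geometric to arithmetic mean over a single band, and all power values are positive. The normative definitions include further details (such as band edges and low-frequency handling) that are omitted here.

```python
import math


def spectrum_centroid(freqs_hz, power):
    """Power-weighted mean of log2(f / 1000 Hz)."""
    total = sum(power)
    return sum(math.log2(f / 1000.0) * p for f, p in zip(freqs_hz, power)) / total


def spectrum_spread(freqs_hz, power):
    """Power-weighted RMS deviation of log2(f / 1000 Hz) from the centroid."""
    c = spectrum_centroid(freqs_hz, power)
    total = sum(power)
    var = sum((math.log2(f / 1000.0) - c) ** 2 * p for f, p in zip(freqs_hz, power)) / total
    return math.sqrt(var)


def spectrum_flatness(power):
    """Geometric mean divided by arithmetic mean of the power coefficients."""
    geo = math.exp(sum(math.log(p) for p in power) / len(power))
    return geo / (sum(power) / len(power))
```

A spectrum that is symmetric around 1 kHz on the log scale has centroid 0; a perfectly flat spectrum has flatness 1, and a spectrum dominated by a single peak has flatness close to 0.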




Audio Signature Description

• The AudioSignature Description Scheme provides a unique content identifier for the purpose of robust automatic identification of audio signals

• Applications include
– audio fingerprinting
– identification of audio
– locating metadata for legacy audio content




Instrument Timbre Description

• Timbre is defined as the perceptual features that make two sounds with the same pitch and loudness sound different.

• The Timbre Description describes these perceptual features with a reduced set of Descriptors
– HarmonicInstrumentTimbre Descriptor
– LogAttackTime Descriptor
– PercussiveInstrumentTimbre Descriptor
– Combination with Basic Spectral Descriptors




Melody Description Tools

The melody Description Tools facilitate efficient, robust, and expressive melodic similarity matching.

• MelodyContour Description Scheme
– 5-step contour representation
– basic rhythmic information representation

• MelodySequence Description Scheme
– supports an expanded descriptor set and high precision of interval encoding
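A 5-step contour can be sketched as a quantization of the interval between successive notes into the values -2 to 2. The thresholds used below (50 and 250 cents) and the MIDI-note input are assumptions for illustration; the normative boundaries should be taken from the standard.

```python
def contour_step(interval_cents):
    """Quantize a pitch interval (in cents) to one of five contour values."""
    if interval_cents <= -250:
        return -2   # large step down
    if interval_cents <= -50:
        return -1   # small step down
    if interval_cents < 50:
        return 0    # (almost) the same pitch
    if interval_cents < 250:
        return 1    # small step up
    return 2        # large step up


def melody_contour(midi_notes):
    """Contour of a note sequence given as MIDI note numbers (100 cents per semitone)."""
    return [contour_step(100 * (b - a)) for a, b in zip(midi_notes, midi_notes[1:])]


# C4 D4 G4 G4 E4 C4
contour = melody_contour([60, 62, 67, 67, 64, 60])  # [1, 2, 0, -2, -2]
```

Because only the coarse direction and size of each step is kept, a hummed query that is slightly out of tune still maps to the same contour, which is what makes the representation robust for similarity matching.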




General Sound Recognition and Indexing Description Tools

• SoundModel (SM) DS
– statistical model, such as HMM or GMM
– SoundModelStatePath Descriptor
  • consists of a state sequence generated by an SM
– SoundModelStateHistogram Descriptor
  • consists of a normalized histogram of the state sequence generated by an SM given an audio segment

• SoundClassificationModel DS
– a trainable multi-way classifier based on SMs
  • speech vs music, male vs female, trumpet vs violin
  • genre classification, voice recognition
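The relation between a state path and its normalized histogram is simple enough to sketch directly. The state path below is hypothetical; in practice it would come from decoding an audio segment with the sound model.

```python
from collections import Counter


def state_histogram(state_path, num_states):
    """Fraction of frames spent in each state of the model."""
    counts = Counter(state_path)
    return [counts[s] / len(state_path) for s in range(num_states)]


# Hypothetical state path over 8 frames for a 4-state model.
path = [0, 0, 1, 2, 2, 2, 1, 0]
histogram = state_histogram(path, num_states=4)  # [0.375, 0.25, 0.375, 0.0]
```

The histogram discards the order of the states, so it is a more compact (and coarser) index than the state path itself.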




Spoken content retrieval

• Output of ASR
– phone lattice or word lattice
– the spoken content DS stores these lattices instead of plain text
– lattices are good for retrieval
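Why lattices help retrieval can be seen in a small sketch: a query phrase is found if it lies on any path through the lattice, not only on the single best transcript. The edge format (from node, to node, word, probability), the words and the probabilities below are all hypothetical.

```python
# Hypothetical word lattice: each edge is (from_node, to_node, word, probability).
LATTICE = [
    (0, 1, "taj", 0.7),
    (0, 1, "targe", 0.3),
    (1, 2, "mahal", 0.8),
    (1, 2, "ma", 0.2),
    (2, 3, "drawing", 0.6),
    (2, 3, "draw", 0.4),
]


def phrase_score(lattice, words):
    """Best path probability of the word sequence anywhere in the lattice (0.0 if absent)."""
    best = 0.0

    def extend(node, remaining, prob):
        nonlocal best
        if not remaining:
            best = max(best, prob)
            return
        for src, dst, word, p in lattice:
            if src == node and word == remaining[0]:
                extend(dst, remaining[1:], prob * p)

    # Try every node as a possible starting point of the phrase.
    for start in {src for src, _, _, _ in lattice}:
        extend(start, list(words), 1.0)
    return best
```

Here `phrase_score(LATTICE, ["mahal", "draw"])` is non-zero even though "draw" is not the most likely word on its edge, so a plain-text transcript of the best path would have missed it.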




Spoken Content Description Tools

• SpokenContentLattice
– represents the actual decoding produced by an ASR engine

• SpokenContentHeader
– contains information about the speakers being recognized and the recognizer itself
– WordLexicon Descriptor
– PhoneLexicon Descriptor
– SpeakerInfo Descriptor
– ConfusionInfo Descriptor




Gaussian DS

<Gaussian>

<Mean>

4087.18 7173.73 1.36364 94.2727 1834.36 2359.55 2645.27 2577.09

………………………………

</Mean>

<Variance>

1.6982e+007 5.21621e+007 14.3636 9749.09 3.65743e+006

………………………………

</Variance>

</Gaussian>
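A sketch of how such a description could be used, assuming that the Variance element holds the per-dimension variances of a Gaussian with diagonal covariance (the slide does not say so explicitly). The three-dimensional values are hypothetical stand-ins for the elided vectors.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical, shortened description: three dimensions instead of the full vectors.
GAUSSIAN_XML = """
<Gaussian>
  <Mean> 1.0 2.0 3.0 </Mean>
  <Variance> 1.0 4.0 0.25 </Variance>
</Gaussian>
"""


def load_gaussian(xml_text):
    """Read whitespace-separated mean and variance vectors."""
    root = ET.fromstring(xml_text)
    mean = [float(x) for x in root.findtext("Mean").split()]
    variance = [float(x) for x in root.findtext("Variance").split()]
    return mean, variance


def log_likelihood(x, mean, variance):
    """Log density of x under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, variance)
    )


mean, variance = load_gaussian(GAUSSIAN_XML)
```

A classifier built on several such models would evaluate `log_likelihood` of a feature vector under each model and pick the largest value.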




State-transition model DS

<StateTransitionModel>

<Transitions size1="20" size2="20">

0 0 0.210526 0.0526316 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

……………………………………

</Transitions>

<Initial size="20">

0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

</Initial>

<State label="0 players" confidence="1">

……………………………………

<State label="19 players" confidence="0.223607">

</StateTransitionModel>
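A sketch of how the Initial vector and the Transitions matrix might be used. It assumes the matrix values are listed row by row, with row i holding the probabilities of leaving state i; the 3-state model below is hypothetical and much smaller than the 20-state example.

```python
# Hypothetical 3-state model in the same layout as the description:
# INITIAL[i] is the probability of starting in state i,
# TRANSITIONS[i][j] is the probability of moving from state i to state j.
INITIAL = [0.0, 1.0, 0.0]
TRANSITIONS = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.0, 1.0],
]


def parse_matrix(text, rows, cols):
    """Turn a whitespace-separated value list into a rows x cols matrix."""
    values = [float(v) for v in text.split()]
    if len(values) != rows * cols:
        raise ValueError("expected %d values, got %d" % (rows * cols, len(values)))
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


def path_probability(path, initial, transitions):
    """Probability of a state sequence under the Markov chain."""
    prob = initial[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= transitions[a][b]
    return prob
```

For instance, `path_probability([1, 1, 2, 2], INITIAL, TRANSITIONS)` multiplies 1.0 × 0.5 × 0.25 × 1.0, giving 0.125.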




ProbabilityModelClassifier DS

<ProbabilityModelClassifier confidence="0.9" length="2">

<ProbabilityModelClass SemanticLabel="fish" Confidence="0.5"

DescriptorName="ColorHistogram">

<Gaussian>

<Mean>

4087.18 7173.73 1.36364 94.2727 1834.36 2359.55

………………………….

</Mean>

<Variance>

1.6982e+007 5.21621e+007 14.3636 9749.09

………………………….

</Variance>

</Gaussian>

</ProbabilityModelClass>




SpokenContentLattice DS

A lattice structure for a hypothetical (combined phone and word) decoding of the expression “Taj Mahal drawing …”.




[Figure: block diagram. Recoverable labels: audio query; spectrum projection (AudioSpectrumBasis); sound-recognition classifier with HMM 1, HMM 2, ..., HMM N-1, HMM N and a select stage; HMM and bases (SoundRecognitionModel); segmented audio description; model reference + state path (SoundModelStatePath); MPEG-7 sound database.]

Extraction of sound indexes using a sound-recognition classifier. The model reference and state path are stored.




[Figure: block diagram. Recoverable labels: audio query; spectrum projection (AudioSpectrumBasis); sound-recognition classifier with HMM 1, HMM 2, ..., HMM N-1, HMM N and a select stage; HMM and basis (ContinuousMarkovModel); model reference + state path (SoundModelStatePath); matching; MPEG-7 sound database (indexed audio); result list.]

Query-by-example application with a query in media source form. Features must be extracted and projected into the classification space of each model in order to match against the database.




[Figure: block diagram. Recoverable labels: DDL query; model reference + state path; matching; MPEG-7 sound database; result list.]

An example search application using a query in DDL format.




Extraction of hidden Markov model and basis functions, and storage in a DDL representation

[Figure: block diagram. Recoverable labels: audio WAV files; basis extraction (AudioSpectrumBasis); feature extraction (SoundRecognitionFeatures); HMM (ContinuousMarkovModel); HMM and basis.]




Scenarios for the Spoken Content Description Tools

• Recall of AV data by memorable spoken events
– A film or video recording where a character or person spoke a particular word or sequence of words. The source media would be known, and the query would return a position in the media.

• Spoken Document Retrieval
– There is a database consisting of separate spoken documents. The result of the query is the relevant documents, and optionally the position of the matched speech in those documents.

• Annotated Media Retrieval
– Similar to spoken document retrieval, but the result of the query is the media annotated with speech, not the speech itself. An example is a photograph retrieved using a spoken annotation.




Multimedia DSs

Multimedia Description Schemes are metadata structures for describing and annotating audio-visual (AV) content.

• Basic Elements
• Content Management
• Content Description
• Content Organization
• Navigation and Access
• User Interaction




Organization of Multimedia DSs




Content Management

• Creation and production information
– Creation information
  • title, textual annotation, creators, and dates
– Classification information
  • genre, subject, purpose, language

• Media coding, storage and file formats
– format, compression, and coding

• Content usage
– usage rights, usage record




Navigation and Access

• Summaries
– hierarchical summaries
– sequential summaries

• Partitions and Decompositions
– decompositions in space, time and frequency
– used in multi-resolution access and progressive retrieval

• Variations
– selection of the most suitable variation of an AV program
– adapt to the different capabilities of terminal devices, network conditions or user preferences
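The idea of selecting a variation can be sketched as follows. The record fields (`bitrate_kbps`, `fidelity`), the example variations and the selection rule are hypothetical; they only illustrate adapting delivery to network conditions and are not the structure of the Variation DS.

```python
from dataclasses import dataclass


@dataclass
class Variation:
    """Hypothetical record for one variation of an AV program."""
    name: str
    bitrate_kbps: int   # bandwidth needed to deliver it
    fidelity: float     # 0..1, how close it is to the original


def select_variation(variations, available_kbps):
    """Pick the highest-fidelity variation that fits the available bandwidth."""
    feasible = [v for v in variations if v.bitrate_kbps <= available_kbps]
    if not feasible:
        return None
    return max(feasible, key=lambda v: v.fidelity)


VARIATIONS = [
    Variation("full video", 4000, 1.0),
    Variation("low-resolution video", 500, 0.6),
    Variation("key-frame slideshow", 64, 0.3),
    Variation("text summary", 2, 0.1),
]
```

With 600 kbit/s available, `select_variation(VARIATIONS, 600)` returns the low-resolution video; terminal capabilities or user preferences could be added as further filters in the same way.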




Hierarchical summary




Illustration of variations




Content Organization

• Collections
– group the contents into clusters
– describe statistics and models of the attribute values
– describe relationships among collection clusters

• Models
– model the attributes and features of AV content
– Probability Model
  • specify statistical functions and structures
– Analytic Model
  • specify semantic labels
  • specify the confidence
  • build classifiers




Collection Structure




User Interaction

• User Preference
– context dependency in terms of time and place
– relative importance of different preferences
– privacy characteristics of the preferences
– preferences updated by agent or user

• Usage History
– history of actions
– used to determine the user's preferences




eXperimentation Model (XM)

• Simulation platform for:
– Ds, DSs, CSs, DDL (CS: Coding Schemes)

• XM applications:
– the server (extraction) applications
– the client (search, filtering and/or transcoding) applications




The XM applications

• Extraction from Media
– all low-level Ds or DSs should have an application class of this type

• Search & Retrieval Application
– either client application

• Media Transcoding Application
– either client application

• Description Filtering Application
– either client application




Extraction from Media




Search and retrieval application




Media transcoding application




Description Filtering Application




Interface model for XM app




Real world application

MDB = media database, DDB = description database. First, two features are extracted from a media database. Then, based on the first feature, relevant media files are selected from the media database. Finally, the relevant media files are transcoded based on the second extracted feature.




MPEG-7 application areas

• Storage and retrieval of audiovisual databases (image, film, radio archives)
• Broadcast media selection (radio, TV programs)
• Surveillance (traffic control, surface transportation, production chains)
• E-commerce and tele-shopping (searching for clothes / patterns)
• Remote sensing (cartography, ecology, natural resources management)
• Entertainment (searching for a game, for a karaoke)
• Cultural services (museums, art galleries)
• Journalism (searching for events, persons)
• Personalized news service on the Internet (push media filtering)
• Intelligent multimedia presentations
• Educational applications
• Bio-medical applications




Illustration of applications

Users




Information Flow

[Figure: information flow diagram. Recoverable labels: feature extraction (manual/automatic); AV description; encoding; storage; transmission; decoding; search/query and browse (pull); filter (push); users.]




Push and Pull applications

• Pull applications
– Example: search engines for the Internet and databases
– Advantage: many search engines work on standardized descriptions

• Push applications
– Example: broadcast of video, interactive TV
– Advantage: intelligent agents filter standardized descriptions




Example: Pull application

[Figure: pull application built around an MPEG-7 database]




Example: Push application




Example: queries

• Text (keywords):
– Find AV material with a subject corresponding to some keywords

• Semantic description:
– Find AV material corresponding to a specified semantic

• Image as an example:
– Find an image with similar characteristics (global or local)

• A few notes of music:
– Find corresponding musical pieces or movies

• Low-level features (example: motion):
– Find video with specific object motion trajectories




Integration of MPEG-7 into XML

<seq begin="20s" dur="10s">
  <img id="Image1" dur="5s">
    <MP7:annotation>
      <Who>Fernando Morientes</Who>
      <WhatAction>Spain vs. Sweden soccer match</WhatAction>
    </MP7:annotation>
  </img>
  <img id="Image2" dur="2s" />
</seq>




MPEG-7 and other Standards

• MPEG-1, -2, and -4 are designed to represent the information itself, while MPEG-7 is meant to represent information about the information.

• MPEG-1, -2, and -4 make content available, while MPEG-7 allows you to find the content you need.




Ultimate ambition of MPEG-7

• To make the web as searchable for multimedia content as it is searchable for text today

• To make the use of computer systems as easy as possible




MPEG-7 beyond

• To mould computers around human requirements and not humans around computer requirements

• To enable content disclosure based on facts, rather than on human annotations

• To find information by rich spoken queries and hand-drawn images, and to address what most people expect computers to be able to do




More Information on WWW

• Major MPEG-7 documents

http://www.cselt.it/mpeg/, semi-official website

http://www.mpeg-7.com, official website

• Others

http://www.elsevier.com/locate/image




Conclusion

[Figure: summary diagram. Recoverable labels: AV contents; features; structures; Ds; DSs; DDL (Ds, DSs); user.]




Thanks




Low level AV descriptors

Video segments
• Color
• Camera motion
• Motion activity
• Mosaic

Moving regions
• Color
• Motion trajectory
• Parametric motion
• Spatio-temporal shape

Still regions
• Color
• Shape
• Position
• Texture

Audio segments
• Spoken content
• Spectral feature
• Timbre




Face Recognition Descriptor

• Projection of a face vector onto a set of basis vectors (face patterns)

• Feature set is extracted from a normalized face image

• Normalized face image
– 56 lines with 46 intensity values in each line
– The centers of the two eyes are located on the 24th row, at the 16th and 31st columns for the right and left eye respectively
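The projection step can be sketched with plain dot products. The 56 × 46 image size comes from the slide; the function names, the optional mean removal and the basis vectors themselves are assumptions (the actual face patterns are defined by the standard and are not reproduced here).

```python
ROWS, COLS = 56, 46          # normalized face image size from the slide
VECTOR_LENGTH = ROWS * COLS  # 2576 intensity values


def flatten(image):
    """Turn a ROWS x COLS image (list of rows) into one face vector."""
    if len(image) != ROWS or any(len(row) != COLS for row in image):
        raise ValueError("expected a 56 x 46 normalized face image")
    return [value for row in image for value in row]


def project(face_vector, basis_vectors, mean_vector=None):
    """Feature set: dot product of the (optionally mean-removed) face with each basis vector."""
    if mean_vector is not None:
        face_vector = [f - m for f, m in zip(face_vector, mean_vector)]
    return [sum(f * b for f, b in zip(face_vector, basis)) for basis in basis_vectors]
```

Two faces can then be compared through the distance between their (short) feature sets instead of their 2576 raw intensity values.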




Segment Decomposition




MPEG-7 Normative Interfaces




Example: Content description

[Figure: content description example. Recoverable labels: MPEG-7 database; indexing / feature extraction; search / retrieval; high-level process; low-level process.]




Segment DS

The Segment DS describes the result of a spatial, temporal, or spatio-temporal partitioning of the AV content. It has nine major subclasses:

• Multimedia Segment DS
• AudioVisual Region DS
• AudioVisual Segment DS
• Audio Segment DS
• Still Region DS
• Still Region 3D DS
• Moving Region DS
• Video Segment DS
• Ink Segment DS




Examples: T/S segments




Example: Segment trees




Illustration of conceptual description

[Figure: semantic description of AV content. Recoverable labels: Semantic base DS; Semantic container DS; Semantic DS; Object DS; Event DS; Concept DS; Semantic state DS; Semantic place DS; Semantic time DS; AV content.]




Visual description

• Basic structures
– Grid layout, Time series, Multiple view, Spatial 2D coordinates, Temporal interpolation

• Descriptors
– Color, Texture, Shape, Motion, Localization




Example: Color Descriptors

• Color space

• Color Quantization

• Dominant Colors

• Scalable Color

• Color Layout

• Color-Structure

• GoF/GoP Color




Example: Color space

• R,G,B

• Y,Cr,Cb

• H,S,V

• HMMD

• Linear transformation matrix with reference to R, G, B

• Monochrome




Audio Framework




Descriptor

• Definition: A Descriptor (D) is a representation of a Feature. A Descriptor defines the syntax and the semantics of the Feature representation.

• Notes: A Descriptor allows an evaluation of the corresponding feature via the descriptor value. It is possible to have several descriptors representing a single feature.

• Examples: For the color feature, a possible descriptor is the color histogram. Descriptors for other features include the average of the frequency components, the motion field, and the text of the title.
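As an illustration of a descriptor for the color feature, the sketch below computes a normalized color histogram over uniformly quantized RGB values. The quantization (two bins per channel) and the tiny example image are assumptions; this is not one of the normative MPEG-7 color descriptors.

```python
def color_histogram(pixels, bins_per_channel=2):
    """Normalized histogram over uniformly quantized (R, G, B) values in 0..255.

    bins_per_channel is assumed to be a power of two so that it divides 256.
    """
    step = 256 // bins_per_channel
    counts = [0] * bins_per_channel ** 3
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel + g // step) * bins_per_channel + b // step
        counts[index] += 1
    return [c / len(pixels) for c in counts]


# Hypothetical 4-pixel image: two dark pixels, one red, one white.
pixels = [(0, 0, 0), (10, 20, 30), (255, 0, 0), (255, 255, 255)]
descriptor_value = color_histogram(pixels)
```

The function plays the role of the Descriptor (it fixes what the numbers mean), while `descriptor_value` is the result for one particular image.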




Descriptor Value

• Definition: A Descriptor Value is an instantiation of a Descriptor for a given data set (or subset thereof).

• Notes: Descriptor Values are combined via the mechanism of a Description Scheme to form a Description.




Description Scheme

• Definition: A Description Scheme (DS) specifies the structure and semantics of the relationships between its components, which may be both Descriptors and Description Schemes.

• Examples: A movie, structured as scenes and shots, including some textual descriptors at the scene level, and color, motion and some audio descriptors at the shot level.

• Note: A D contains only basic data types and does not refer to other Ds or DSs.




DS: XML Schema & Extensions

• XML Schema
– Data types
– Simple and Complex types
– Elements
– Inheritance, Abstract types

• MPEG-7 extensions
– Array and Matrix datatypes
– Enumerated datatypes for MimeType, CountryCode, RegionCode, CurrencyCode and CharacterSetCode
– Typed references




Basic elements of DS

• Constructs for linking media files

• Localizing pieces of content

• Describing
– time, places, persons, individuals, groups, organizations, textual annotation, etc.
– Who? What object? What action? Where? When? Why? and How?




Content recognition tools

• No speech, face or gesture recognition engines are included in MPEG-7

• Building content recognition tools is a task for industry, not for the standard
– coding tools in MPEG-1, -2, -4 were for research purposes, not part of the standard
– no tools were part of the MPEG standard


