
BIT 3193 MULTIMEDIA DATABASE

CHAPTER 3 : MULTIMEDIA INFORMATION MODELING

• Media objects such as audio, image and video are binary, unstructured, and uninterpreted.

• Multimedia databases have to derive and store interpretation based on the content of these media objects.

• These interpretations are termed metadata.

• Metadata is the description of the media object’s content which is:

• subjective in nature

• dependent on the media type as well as the role of an application

• has to be generated automatically or semi-automatically from the media information.

Intra-media metadata deals with the interpretation of information within a single medium.

Inter-media metadata deals with the interpretation of information across multiple media.

[Figure: individual media objects pass through extracting functions (f1), (f2), …, (fn) to produce intra-media metadata i1, i2, …, in, and an extracting function F produces inter-media metadata from the composed media.]

Figure 3.1 : Intra-media and Inter-media Metadata

• Figure 3.1 describes how metadata is generated from various media objects.

• Here the extracting functions f1, f2, …, fn work on the individual media and generate the intra-media metadata i1, i2, …, in.

• The extracting function F generates the inter-media metadata I1, I2, …, In by working on all the composed media.

• The functions applied to extract metadata can be dependent or independent of the contents of media objects.
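As a rough illustration of this extraction process (not code from the chapter), the following Python sketch applies per-medium extracting functions to produce intra-media metadata and a combining function to produce inter-media metadata; the function names and metadata fields are hypothetical placeholders.

```python
# Minimal sketch of intra- vs inter-media metadata extraction.
# The extractor functions below are hypothetical placeholders, not
# real feature-extraction algorithms.

def extract_image_metadata(image_bytes):
    # f1: intra-media extractor for images (placeholder).
    return {"media": "image", "size_bytes": len(image_bytes)}

def extract_audio_metadata(audio_samples):
    # f2: intra-media extractor for audio (placeholder).
    return {"media": "audio", "num_samples": len(audio_samples)}

def extract_inter_media_metadata(intra_metadata):
    # F: works on all composed media to relate them, e.g. by recording
    # which media types occur together in one multimedia document.
    return {"composed_of": sorted(m["media"] for m in intra_metadata)}

if __name__ == "__main__":
    intra = [
        extract_image_metadata(b"\x89PNG..."),      # i1
        extract_audio_metadata([0.0, 0.1, -0.2]),   # i2
    ]
    inter = extract_inter_media_metadata(intra)      # inter-media metadata
    print(intra)
    print(inter)
```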

• Content-dependent metadata depends only on the content of media objects.

• Example:

• derivation of facial features of a person’s photographic image (ex: type of nose)

• derivation of camera operations in a video clip

• Content-independent metadata does not depend on the contents of the media information.

• Example:

• name of the photographer who took the picture

• budget of the movie

• author who created a multimedia document

It can still be useful for document retrieval and information checking.

• Direct content-based metadata is based directly on the content of a document.

• Example:

• Full-text indices based on text of the document such as inverted tree and document vectors.

• This type of metadata describes the contents of a document with direct utilization of those contents.

• Content-descriptive metadata always involves the use of knowledge of human perception.

• Example:

• fragrance of an image containing a flower

• Domain-independent metadata captures information present in the document independent of the application or subject domain of the information.

• Example:

• C++ parse tree and HTML/SGML document type definition.

• Domain-dependent metadata is described in a manner specific to the application or subject domain of the information.

• Issues of vocabulary become very important, as the terms have to be chosen in a domain-specific manner.

• Example:

• relief, land-cover from the GIS

• The generation of metadata is done by applying feature extracting functions to media objects.

• These feature extracting functions employ content-dependent, content-independent and content-descriptive techniques.

• These techniques use sets of terminologies that represent an application’s view as well as the contents of media objects.

• These sets of terminologies, referred to as ontologies, create a semantic space onto which metadata is mapped.

• The ontologies used for derivation of metadata are:

• media dependent ontologies

• media independent ontologies

• metacorrelation

Media-Dependent Ontologies

• It refers to concepts and relationships that are specific to a particular media type, such as text, image or video.

• Example:

• features such as color or texture can be applied to image data

• features such as silence periods can be applied only to audio data

Media-Independent Ontologies

• These ontologies describe the characteristics that are independent of the content of media objects.

• Example:

• the ontologies corresponding to the time of creation, location, and owner of a media object are media independent.

Meta-correlation

• Metadata associated with different media objects have to be correlated to provide a unified picture of a multimedia database.

• This unified picture, called query metadata, is modeled for the needs of database users based on application-dependent ontologies.

Meta-correlation

• Example:

• Consider the following query on a GIS database on the Himalayan mountain ranges:

“ Get me images of the peaks that are at least 20000 feet in height, showing at least 5 people climbing the peaks”

• Here a correlation has to be established between the heights of the peaks in the Himalayas and their available images.

• Figure 3.2 shows the various stages in the process of generation of metadata.

• The media pre-processor helps in identifying the contents of interest in the media information.

• For each media, a separate pre-processor is required to identify the contents of interest.

[Figure: text, audio, image and video media pass through media pre-processors over the physical storage view; media-dependent and media-independent ontologies are applied to produce text, image, audio and video metadata and inter-media metadata; metacorrelation then produces the query metadata.]

Figure 3.2 : Ontologies and Metadata Generation

• Text is often represented as a string of characters, stored in formats such as ASCII.

• Text can also be stored as digitized images, scanned from the original text document.

• Text has a logical structure based on the type of information it contains.

• Example:

• if the text is a book, the information can be logically structured into chapters, sections, etc.

• Metadata for text describes this logical structure, and the generation of metadata for text involves identifying its logical structure.

• Text that is keyed into a computer system normally uses a language to represent the logical structure.

• In the case of scanned text images, mechanisms are needed for identifying columns, paragraphs, and semantic information, and for locating the keywords.

Metadata for text images

Types of Text Metadata

• Document Content Description

• Representation of Text Data

• Document History

• Document Location

Document Content Description

• This metadata provides additional information about the content of the document.

• Example:• a document on global warming can be described as a document in environmental sciences.

Representation of Text Data

• This metadata describes the format, coding and compression techniques that are used to store the data.

• Example: the language in which the text has been formatted is one such item of metadata.

• This metadata is content descriptive

Document History

• This metadata describes the history of creation of the document.

• This metadata is also content-descriptive.

• The components described as part of the history could include:

• status of the document

• date of update of the document

• components of the older document which have been modified

Document Location

• This content-independent metadata describes the location in a computer network that holds the document.

• Implicit metadata generation takes place when the raw media data is created, without further analysis of it.

• For instance:

• a digital camera can implicitly deliver the time and date for pictures or video taken.

• an SGML editor generates metadata according to the document type definition.

• Explicit metadata generation depends on analysis of the media object.

• This is very often done off-line,

• after the medium has been recorded and stored, and prior to any processing which makes use of the data.

• Reason: the tremendous cost of extracting the image’s features.

• SGML describes typographical annotations as well as the structure of the document.

• Markup is a notion from the publishing world.

• Manuscripts are annotated with a pencil or a marker to indicate the layout.

• The markups or the structure provided by the author alone may not be sufficient in many instances.

• Hence, automatic or semi-automatic mechanisms might be required to identify metadata such as topic and subtopic boundaries.

• Such mechanisms are especially needed when text is present in the form of scanned images.

• Here, we discuss the following text metadata generation methodologies:

a) Text Formatting Language : SGML (Implicit Metadata Generation)

b) Automatic / Semi-automatic Mechanisms (Explicit Metadata Generation)

i. Subtopic Boundary Location

ii. Word Image Spotting

Subtopic Boundary Location

• Texttiling algorithms are used to partition text information into tiles that reflect the underlying topic structure.

• Texttiling is a technique for automatically subdividing texts into multi-paragraph units that represent passages or subtopics.

• The algorithm uses quantitative lexical analysis to determine the segmentation of the documents.

The algorithm identifies subtopic boundaries by :

a) Dividing the text into adjacent token sequences of 20 words each.

• In Texttiling, a block of k token sequences is treated as a logical unit.

b) Comparing adjacent blocks of token sequences for overall lexical similarity.

• The frequency of occurrence of a term within each block is compared to its frequency in the entire domain.

• This helps in identifying whether the term is used within one discussed topic or across the entire text.

• If the term occurs frequently over the entire text, then it cannot be used to identify topics.

c) Computing similarity values for adjacent blocks.

d) Determining boundary changes in the sequence of similarity scores.
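A minimal sketch of the block-comparison step, assuming fixed token-sequence and block sizes and cosine similarity over term counts; this illustrates the idea rather than reproducing the full Texttiling implementation, and locating the boundary valleys in the score sequence is left to the caller.

```python
import math
import re
from collections import Counter

def token_sequences(text, seq_len=20):
    """Split text into adjacent token sequences of seq_len words each."""
    words = re.findall(r"[a-z]+", text.lower())
    return [words[i:i + seq_len] for i in range(0, len(words), seq_len)]

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def gap_scores(text, seq_len=20, block_size=2):
    """Similarity of the blocks on either side of each token-sequence gap.
    Low values (valleys) hint at subtopic boundaries."""
    seqs = token_sequences(text, seq_len)
    scores = []
    for gap in range(block_size, len(seqs) - block_size + 1):
        left = Counter(w for s in seqs[gap - block_size:gap] for w in s)
        right = Counter(w for s in seqs[gap:gap + block_size] for w in s)
        scores.append(cosine(left, right))
    return scores

if __name__ == "__main__":
    sample = ("cats and dogs are common pets " * 10 +
              "stars and planets form in galaxies " * 10)
    print(["%.2f" % s for s in gap_scores(sample)])
```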

Word Image Spotting

• In the case of digitized text images, keywords have to be located in the document.

• The set of keywords that are to be located can be specified as part of the application.

Typical word spotting systems need to do the following:

a) Identify a text line by using a bounding box of a standard height and width.

• The concept of multi-resolution morphology is used to identify text lines using the specified bounding boxes.

b) Identify specific words within the determined text line.

• A technique termed Hidden Markov Model (HMM) is used to identify the specific words in the text line.

* A separate sheet will be given for Word Image Spotting using HMM.

• The speech media refers to spoken language.

• It is considered as part of audio.

• The importance of speech processing arises from its ease of use as an input/output mechanism for multimedia applications.

• The metadata that needs to be generated can be content-dependent or content-descriptive.

The metadata generated for speech can be as follows:

a) Identification of spoken words; this is called speech recognition.

b) Identification of the speaker.

• Deciding whether or not a particular speaker produced the utterance is termed verification.

• Choosing a person’s identity from a set of known speakers is called speaker identification or speaker recognition.

c) Identification of prosodic information.

• Used for drawing attention to a phrase or sentence or to alter the word meaning.

• Metadata generated as part of speech recognition is content dependent.

• This metadata can consist of the start and end times of the speech, along with a confidence level of spoken-word identification.

• Metadata generated as part of speaker recognition can be considered content-descriptive,

• even though this metadata is generated by analyzing the content of the speech.

• This metadata can consist of the name of the speaker and the start and end times of the speech.

• Metadata describing the prosodic information can be considered as content dependent.

• it can consist of :

• the implied meaning in case the speaker altered the word meaning, and

• a confidence score of the recognition of the prosodic information.

• Content independent metadata can also be associated with speech data.

• The time of the speech and the location where the speech was given can be considered content-independent metadata for speech.

A speech recognition system has two (2) main components:

• Signal Processing Module

• Pattern Matching Module

Signal Processing Module

• gets the speech analog signal (via microphone or a recorder), and digitizes it

• the digitized signal is processed to do the following actions:

• detection of silence periods

• separation of speech from non-speech components

• conversion of the raw waveform into a frequency domain representation, and data compression

Signal Processing Module

• the stream of such sample speech data values is grouped into frames of usually 10-30 ms duration

• the aim of this conversion is to retain only those components that are useful for recognition purpose

• this processed speech signal is used for identification of the spoken words or the speaker or prosodic information
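A minimal sketch of the framing step only, assuming a 16 kHz sample rate, 25 ms frames and a 10 ms hop (typical values, not taken from these notes), with silence removal approximated by a crude energy threshold:

```python
import numpy as np

def frame_signal(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Group raw speech samples into overlapping short-time frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, hop_len)]
    return np.array(frames)

def drop_silence(frames, energy_threshold=1e-4):
    """Very crude silence removal: keep frames whose mean energy exceeds a threshold."""
    energies = (frames ** 2).mean(axis=1)
    return frames[energies > energy_threshold]

if __name__ == "__main__":
    t = np.linspace(0, 1, 16000, endpoint=False)
    speechy = 0.1 * np.sin(2 * np.pi * 440 * t)   # stand-in for voiced speech
    silence = np.zeros(16000)                      # stand-in for a silence period
    frames = frame_signal(np.concatenate([silence, speechy]))
    voiced = drop_silence(frames)
    # Frequency-domain features (e.g. an FFT per frame) would be computed
    # on the kept frames before pattern matching.
    print(frames.shape, voiced.shape)
```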

Pattern Matching Module

• Pattern matching is done by matching the processed speech with stored patterns.

• The pattern matching module has a repository of reference patterns that consists of the following:

• different utterances of the same set of words

• different utterances by the same speaker

• different ways of modifying the meaning of a word

• Metadata for images depend on :

• the type of images that are to be analyzed

• the applications that will be using the analyzed data

• Metadata that can be used for a few types of images, such as:

• satellite images

• facial images

• architectural design images

• A computer scientist treats the satellite images as 3-D grids (2889 rows, 4578 columns, 10 layers deep).

• The perception of earth scientists is to focus on the processes that created the images.

• From this point of view, the image has 10 bands or layers, each created by a different process.

Categories of satellite metadata:

• Raster metadata

• Object Description metadata

• Lineage metadata

• Data set metadata

Raster Metadata

• describes the grid structure, spatial, and temporal information

• the spatial information describes the location and overlay of images

• the temporal information describes the time at which the images were taken

Lineage Metadata

• includes the processing history : algorithm and parameters used to produce the image

Data Set Metadata

• Describes the sets of data available at a particular site as well as the detailed information about each data set

Object Description Metadata

• Includes the structures of a database table or the specific properties of an attribute

• Algorithms used for generating the required metadata perform better when they know the type of images that are being analyzed.

• The algorithms are unique, depending on the type of images.

Steps involved in extracting the features from images :

Object locater design (Image Segmentation)

Feature Selection

Classifier Design

Classifier Training

Image Segmentation

• The basic requirement is to locate the objects that occur in an image.

• This requires the image to be segmented into regions or objects.

• Designing the object locater involves selecting an image segmentation algorithm that can isolate individual objects.

Feature Selection

• The specific properties or feature of objects are to be determined.

• These features should help in distinguishing different types of objects in the set of images.

Classifier Design

• This step helps in establishing the mathematical basis

• for determining how objects can be distinguished based on their features.

Classifier Training

• The various adjustable parameters, such as threshold values in the object classifier, must be fixed

• to help in classifying objects.

• Design and training of the classifier module are specific to the type of images.

• Image segmentation helps in isolating objects in a digitized image.

• There are two (2) approaches to isolate objects in an image:

a) Boundary Detection Approach

• Attempts to locate the boundaries that exist among the objects

b) Region Approach

• Determine whether pixels fall inside or outside an object

• Partitioning image into sets of interior and exterior points

• Techniques used in image segmentation :

a) Thresholding

b) Region Growing

Thresholding

• The principles behind this technique:

• all pixels with a gray level at or above a threshold are assigned to the object

• pixels below the threshold fall outside the object

• This technique falls under the region approach.

• It helps in identifying objects against a contrasting background.

Thresholding

• The identification of threshold value has to be done carefully

• since it influences the boundary position as well as the overall size of objects
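A minimal sketch of gray-level thresholding with NumPy; the threshold here is fixed by hand purely for illustration, whereas in practice it would be chosen carefully (for example from the image histogram), for the reasons noted above:

```python
import numpy as np

def threshold_segment(gray_image, threshold):
    """Return a boolean mask: True for pixels at or above the threshold (object),
    False for pixels below it (background)."""
    return gray_image >= threshold

if __name__ == "__main__":
    # Toy 6x6 image: a bright square object on a dark, contrasting background.
    img = np.full((6, 6), 40, dtype=np.uint8)
    img[2:5, 2:5] = 200
    mask = threshold_segment(img, threshold=128)
    print(mask.astype(int))
    print("object pixels:", int(mask.sum()))
```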

Region Growing

• This technique proceeds as though the interiors of the objects grow until their borders correspond with the edges of the objects.

• Here an image is divided into a set of tiny regions, which may be a single pixel or a set of pixels.

Region Growing

• Properties that distinguish the objects (such as gray level, color, texture) are identified,

• and a value of each property is assigned to each region.

• Then the values of adjacent regions are examined,

• by comparing the assigned values for each of the properties.

Region Growing

• If the difference is below a certain value, then the boundary between the two regions is dissolved.

• This region merging process is continued until no more boundaries can be dissolved.
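A minimal sketch of this region-merging idea on a gray-level image, assuming a single distinguishing property (mean gray level), 4-neighbour adjacency and a hand-picked tolerance; it is an illustrative simplification, not a production segmentation algorithm:

```python
import numpy as np

def region_growing(gray_image, tolerance=10):
    """Merge 4-adjacent pixels whose region mean gray levels differ by < tolerance.
    Returns an array of region labels."""
    h, w = gray_image.shape
    parent = list(range(h * w))                     # union-find forest: one region per pixel
    total = gray_image.astype(float).flatten().tolist()
    count = [1] * (h * w)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        # Dissolve the boundary only if the region means are close enough.
        if abs(total[ra] / count[ra] - total[rb] / count[rb]) < tolerance:
            parent[rb] = ra
            total[ra] += total[rb]
            count[ra] += count[rb]

    changed = True
    while changed:                                   # repeat until no boundary can be dissolved
        before = sum(1 for i in range(h * w) if find(i) == i)
        for y in range(h):
            for x in range(w):
                if x + 1 < w:
                    union(y * w + x, y * w + x + 1)  # right neighbour
                if y + 1 < h:
                    union(y * w + x, (y + 1) * w + x)  # bottom neighbour
        after = sum(1 for i in range(h * w) if find(i) == i)
        changed = after < before

    return np.array([find(i) for i in range(h * w)]).reshape(h, w)

if __name__ == "__main__":
    img = np.array([[10, 12, 11, 90, 92],
                    [11, 13, 12, 91, 95],
                    [12, 11, 10, 93, 94]])
    print(region_growing(img))   # two labels: dark region and bright region
```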

• After segmentation of the image, the objects in the image have to be classified according to desired features.

• This involves the steps:

a) Feature selection

b) Classifier Design

c) Classifier Training

• For instance : facial recognition

• The objects to be identified include the left eye, right eye, etc.

• The area of search for a particular object can be reduced by applying relationships between objects that are known a priori.

• Example : facial image

• We know that the mouth is below the nose.

• The right eye should be at a distance y from the left eye, and so on.

• When one eyeball is located, the other can be located at a known distance.

• Example: the nose

• can be identified with the constraint that the bottom of the nose should be near the horizontal center of the eyeballs,

• approximately half the vertical distance from the eyes to the chin.

• A score of certainty is also specified with each extracted feature.

• In case the certainty is low, alternative mechanisms can be used.

• These mechanisms include using a relaxed facial template or re-examining a previously located feature.
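A minimal sketch of using such a priori constraints to narrow the search: given located eye positions and the chin height, restrict the search for the bottom of the nose to a band around the horizontal centre of the eyes, roughly halfway from the eyes to the chin. The coordinate convention and the tolerance are hypothetical:

```python
def nose_search_region(left_eye, right_eye, chin_y, tolerance=10):
    """Return an (x_min, x_max, y_min, y_max) box in which to search for the
    bottom of the nose, based on a priori facial geometry constraints.
    Coordinates are (x, y) pixels with y increasing downwards."""
    eye_center_x = (left_eye[0] + right_eye[0]) / 2
    eye_y = (left_eye[1] + right_eye[1]) / 2
    nose_y = eye_y + (chin_y - eye_y) / 2      # roughly halfway from eyes to chin
    return (eye_center_x - tolerance, eye_center_x + tolerance,
            nose_y - tolerance, nose_y + tolerance)

if __name__ == "__main__":
    print(nose_search_region(left_eye=(60, 80), right_eye=(100, 82), chin_y=180))
```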

• Video is stored as a sequence of frames,

• with compression mechanisms applied to reduce the storage space requirements.

• The stored data is raw or uninterpreted in nature.

• Hence, interpretations have to be drawn from this raw data.

• Video metadata can be associated with:

a) a sequence of video frames

b) a single video frame

The following video metadata can be identified:

Content dependent

Content independent

Content descriptive

Content dependent

• describes the raw features of video information

• metadata for a sequence of video frames:

• camera motion

• camera light

• track of an object

• metadata at the individual frame level:

• frame characteristics such as the color histogram

Content descriptive

• metadata for a sequence of video frames:

• camera shot distance

• shot angle

• shot motion

• metadata at the individual frame level:

• frame brightness

• color

Content independent

• features that are applicable, perhaps, to the whole video:

• production date

• producer’s name

• budget, etc.

• The easiest form of generating video metadata is to provide textual description.

• This can be done by two (2) approaches:

a)Manually

b)Automatic / semi-automatic

• By applying algorithms

• Content-descriptive metadata

• uses application-dependent ontologies to describe the content of video objects.

• Content-independent metadata

• is generated based on the inputs given about a video object by a user / application.

• To help in the process of generating the video metadata, the tools should have the following functions:

a) Identify logical information units in the video.

b) Identify the different types of video camera operations.

c) Identify the low-level image properties of the video.

d) Identify the semantic properties of the parsed logical units.

e) Identify objects and their properties (such as object motion) in the video frames.

• The logical unit of information that is to be parsed automatically is termed a camera shot or clip.

• A shot is assumed to be a sequence of frames representing a contiguous action in time and space.

• The basic idea behind the identification of shots is that the frames on either side of a camera break show a significant change in information content.

• The algorithm:

• needs a quantitative metric that can capture the information content of a frame

• is based on a threshold

• This idea of identifying the boundary between two shots gets complex when fancy video presentation techniques (such as dissolve, wipe, fade-in) are used.

• In such cases, the boundary between two shots no longer lies between two consecutive frames; instead it is spread over a sequence of frames.

• Two types of metrics are used to quantify and compare the information content of a video frame:

a) Comparison of corresponding pixels or blocks in the frames.

b) Comparison of histograms based on color or gray-level intensities.
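A minimal sketch of the second metric: compare gray-level histograms of consecutive frames and declare a camera break where the difference exceeds a threshold. The bin count and threshold are illustrative, and gradual transitions (dissolve, wipe, fade-in) would need the more elaborate handling noted above:

```python
import numpy as np

def gray_histogram(frame, bins=16):
    """Normalized gray-level histogram of a 2-D frame (values 0..255)."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def detect_shot_breaks(frames, threshold=0.5):
    """Return indices i where a camera break occurs between frame i and i+1."""
    breaks = []
    for i in range(len(frames) - 1):
        diff = np.abs(gray_histogram(frames[i]) - gray_histogram(frames[i + 1])).sum()
        if diff > threshold:
            breaks.append(i)
    return breaks

if __name__ == "__main__":
    dark = np.full((32, 32), 30, dtype=np.uint8)
    bright = np.full((32, 32), 220, dtype=np.uint8)
    frames = [dark, dark, dark, bright, bright]   # a cut after frame index 2
    print(detect_shot_breaks(frames))              # -> [2]
```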

• The available video information may be compressed or uncompressed.

• Hence, the video parsing algorithm might have to work on compressed or uncompressed information.

• The objects composing a multimedia database have an associated temporal characteristic.

• These characteristics specify the following parameters:

• Time instant of an object presentation.

• Duration of presentation.

• Synchronization of an object presentation with those of others.

• These characteristics can be specified in a hard or a flexible manner.

• In the case of hard temporal specification, the parameters such as time instants and durations of presentation of objects are fixed.

• In the case of flexible specification, these parameters are allowed to vary as long as they preserve certain specified relationships.

• For instance, consider the following temporal specifications:

(a)Show the video of the movie Toy Story at 11 am for 10 minutes.

(b)Show the video of the movie Toy Story SOMETIME BETWEEN 10.58am and 11.03am till the audio is played out.

• The first one (a) is a hard temporal specification with the instant and duration of presentation fixed at 11 am and for 10 minutes respectively.

• Whereas the specification (b) is a flexible one in that it allows the presentation start time to vary within a range of 5 minutes and the duration of video presentation till the corresponding audio is played out.
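A minimal sketch of how these two styles of specification might be represented as data, using hypothetical field names: the hard specification pins down the instant and duration, while the flexible one carries a start-time range and a dependency on another object's presentation:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class HardTemporalSpec:
    start: float        # presentation instant, e.g. minutes from a reference time
    duration: float     # presentation duration in minutes

@dataclass
class FlexibleTemporalSpec:
    start_range: Tuple[float, float]   # allowed interval for the start instant
    ends_with: Optional[str] = None    # object whose end this presentation tracks

# (a) hard: show the video at 11:00 for 10 minutes (11:00 encoded as minute 660)
toy_story_hard = HardTemporalSpec(start=660.0, duration=10.0)

# (b) flexible: start sometime between 10:58 and 11:03, play until the audio ends
toy_story_flexible = FlexibleTemporalSpec(start_range=(658.0, 663.0), ends_with="audio")

print(toy_story_hard)
print(toy_story_flexible)
```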

• The temporal specification, apart from describing the parameters for an individual object presentation, also needs to describe the synchronization among the composing objects.

• This synchronization description brings out temporal dependencies among the individual object presentations.

• For example : refer to temporal specification (b)

• video has to be presented till the audio object is presented

• Hence a temporal specification needs to be described:

• individual object presentation characteristics (time instant and duration of presentation)

• the relationships among the composing objects

• Users viewing multimedia data presentation can interact by operations such as fast forwarding, rewinding and freezing

• The temporal model needs to describe how such user interactions are handled.

• Given any two multimedia object presentations, the temporal requirements of one object can be related to those of the other in 13 possible ways, as shown below:

• a before b / a before⁻¹ b

• a meets b / a meets⁻¹ b

• a overlaps b / a overlaps⁻¹ b

• a starts b / a starts⁻¹ b

• b during a / b during⁻¹ a

• b finishes a / b finishes⁻¹ a

• a equals b

[Figure: each relation is illustrated by the relative placement of the presentation intervals a and b on a time axis.]

• These 13 relationships describe how the time instants and presentation durations of two multimedia objects are related.

• These relationships, however, do not quantify the temporal parameters, time instants and duration of presentations.
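A minimal sketch that classifies two presentation intervals into one of the thirteen relations once exact start and end times are known; the interval encoding and the example times are illustrative:

```python
def interval_relation(a, b):
    """Classify intervals a = (start, end) and b = (start, end) into one of the
    thirteen relations (six relations, their inverses, and 'equals')."""
    a1, a2 = a
    b1, b2 = b
    if a2 < b1:  return "a before b"
    if b2 < a1:  return "b before a"          # before, inverse
    if a2 == b1: return "a meets b"
    if b2 == a1: return "b meets a"           # meets, inverse
    if a1 == b1 and a2 == b2: return "a equals b"
    if a1 == b1: return "a starts b" if a2 < b2 else "b starts a"
    if a2 == b2: return "a finishes b" if a1 > b1 else "b finishes a"
    if b1 < a1 and a2 < b2: return "a during b"
    if a1 < b1 and b2 < a2: return "b during a"
    return "a overlaps b" if a1 < b1 else "b overlaps a"

if __name__ == "__main__":
    print(interval_relation((0, 5), (5, 9)))   # a meets b
    print(interval_relation((0, 9), (3, 6)))   # b during a
    print(interval_relation((0, 5), (3, 9)))   # a overlaps b
```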

• Many models have been proposed to describe the temporal relationships among the multimedia objects.

• Hard specification models describe the temporal relationships in a precise manner by specifying exact values for:

• time instants

• duration of presentations

• The simplest such model is the timeline model.

• Media objects are placed on a timeline that describes the values for the:

• time instants

• presentation durations

• Figure 3.3 shows the timeline model of VoD database example.

• For instance, the values for time instant and duration of presentation of text object W are t1 and t7-t1.

[Figure: text object w, image objects x1–x4, video objects y1–y2 and audio objects z1–z2 placed along a time axis marked t1 … t7.]

Figure 3.3 : Timeline Model

• The timeline model

• is extensively used in describing the temporal relationships in multimedia databases.

• However, this model describes only the parameters for individual objects and not the presentation dependencies among the objects.

• Example: in Figure 3.3,

• video object y1 and audio object z1 have to be presented simultaneously;

• this dependency is not explicitly shown in the timeline model.
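A minimal sketch of the timeline model as data: each object is placed on the timeline with an explicit time instant and duration, loosely mirroring Figure 3.3 (the names and times are made up). As noted above, nothing in this structure records that y1 and z1 must stay synchronized; it only records their placement:

```python
from dataclasses import dataclass

@dataclass
class TimelineEntry:
    object_id: str    # e.g. "w", "x1", "y1", "z1"
    medium: str       # "text", "image", "video", "audio"
    start: float      # time instant of presentation (e.g. t1)
    duration: float   # duration of presentation (e.g. t7 - t1)

timeline = [
    TimelineEntry("w",  "text",  start=0.0,  duration=60.0),  # spans t1 .. t7
    TimelineEntry("x1", "image", start=0.0,  duration=10.0),
    TimelineEntry("y1", "video", start=10.0, duration=20.0),
    # z1 is meant to accompany y1, but the timeline only records its placement:
    TimelineEntry("z1", "audio", start=10.0, duration=20.0),
]

for entry in sorted(timeline, key=lambda e: e.start):
    print(entry)
```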

[Figure: text, image and video windows on a monitor, with the lower-left and top-right corners of each window numbered 1–6 and their x and y coordinates, x(1)…x(6) and y(1)…y(5), marked on the axes.]

Figure 3.4 : Spatial Characteristic Representation

• Most multimedia objects have to be delivered through windows on a monitor.

• A multimedia database presentation might have to be arranged in such a way that the presentations of different objects do not overlap.

• A window of the monitor can be specified by using :

• the position ( x and y coordinates) of the lower left corners and the top right corners.

• relative to the position of another window

• For example : Figure 3.4, VoD database

• the lower left and top right corners of each window are numbered

• the corresponding x as well as y coordinates are shown

• As in the case of the temporal model,

• the values of the x and y coordinates of the window corners can be specified in an absolute manner (hard spatial specification) or in a flexible manner.

• For example : Figure 3.4, VoD database

• a hard spatial specification would assign values (corresponding to the pixel positions) for the x and y coordinates

• The spatial specifications are simple,

• and they help in the artistic presentation of multimedia information.
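A minimal sketch of a hard spatial specification: each window is given pixel coordinates for its lower-left and top-right corners, and a helper checks that two windows do not overlap. The window names and coordinates are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Window:
    name: str
    x1: int   # lower-left corner x (pixels)
    y1: int   # lower-left corner y (pixels)
    x2: int   # top-right corner x (pixels)
    y2: int   # top-right corner y (pixels)

def overlaps(a, b):
    """True if the two windows share any screen area."""
    return a.x1 < b.x2 and b.x1 < a.x2 and a.y1 < b.y2 and b.y1 < a.y2

text_window  = Window("text",  x1=0,   y1=0,   x2=300, y2=200)
image_window = Window("image", x1=320, y1=0,   x2=640, y2=200)
video_window = Window("video", x1=0,   y1=220, x2=640, y2=480)

print(overlaps(text_window, image_window))   # False: presentations do not overlap
print(overlaps(text_window, video_window))   # False
```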

• Authoring describes the temporal and spatial specifications of the objects composing a database.

• Skillful temporal and spatial specification brings out the aesthetic sense of a multimedia database presentation.

[Figure: an image window whose position is shifted by dX along the X axis and by dY along the Y axis, next to a text window reading “Launch of the missile : …”.]

Figure 3.5 : Authoring Spatial Requirement : Example

• Spatial Specification :

• Figure 3.5 shows an example where a multimedia presentation on a missile launch is authored.

• The launch of the missile can be animated by shifting the spatial location of the image window (by dX to the right and by dY towards the top).

• Positions of the windows can be described by specifying the coordinates of the window corners.

• Temporal Specification :

• Figure 3.6 shows the corresponding temporal specifications.

• The objects represented by rectangles can be placed on a timeline.

[Figure: objects 1–5 placed on a timeline marked t1 … t6, with “Animate” transitions between successive objects.]

Figure 3.6 : Authoring Temporal Requirements : Example

• Temporal Specification :

• Hard temporal specification

• the length of the rectangle specifies the presentation duration of the object.

• for instance:

• the presentation duration of the text object is t6 - t1

• Temporal Specification :

• Flexible temporal specification

• an arc can be used to represent the relation between the object presentations

• for instance:

• the relation between the start of the text presentation and the start of the image presentation:

t2 - t1 ≤

• Graphical User Interface (GUI) based tools are required to facilitate multimedia authoring.

• Some existing commercial tools are:

a) Multimedia Toolbook

- Microsoft Windows platform

- Supports the OpenScript language for authoring

b) Icon Author

- Microsoft Windows and Mac platforms

- Flowchart is specified using icons

- The flowchart describes the sequence of actions to be performed

- Non-programmer oriented software

c) Director

- Microsoft Windows and Mac platforms

- Object-oriented language environment : Lingo