high throuput multi standard transform core realisation using csda

Upload: sughi

Post on 23-Feb-2018


  • 7/24/2019 high throuput multi standard transform core realisation using csda

    1/58

    1

    CHAPTER 1

    INTRODUCTION

    1.1 VLSI INTRODUCTION

    Most students of Electronics Engineering are exposed to Integrated
    Circuits (ICs) at a very basic level, involving SSI (small scale
    integration) circuits like logic gates or MSI (medium scale integration)
    circuits like multiplexers and parity encoders. But there is a much
    bigger world out there, involving miniaturization at levels so great
    that a micrometer and a microsecond are considered huge. This is the
    world of VLSI - Very Large Scale Integration. This project aims to
    introduce Electronics Engineering students to the possibilities of, and
    the work involved in, this field.

    VLSI stands for "Very Large Scale Integration". This is the field
    which involves packing more and more logic devices into smaller and
    smaller areas. Thanks to VLSI, circuits that would once have filled a
    whole board can now be put into a space a few millimeters across. This
    has opened up a big opportunity to do things that were not possible
    before. VLSI circuits are everywhere: your computer, your car, your
    brand new state-of-the-art digital camera, your cell phone, and so on.
    All this requires a lot of expertise on many fronts within the same
    field.

    VLSI has been around for a long time and there is nothing new about
    it, but as a side effect of advances in the world of computers, there
    has been a dramatic proliferation of tools that can be used to design
    VLSI circuits. Alongside this, in keeping with Moore's law, the
    capability of an IC has increased exponentially over the years in terms
    of computation power, utilization of available area, and yield. The
    combined effect of these two advances is that people can now put
    diverse functionality into ICs, opening up new frontiers. Examples are
    embedded systems, where intelligent devices are put inside everyday
    objects, and ubiquitous computing, where small computing devices
    proliferate to such an extent that even the shoes you wear may do
    something useful, like monitoring your heartbeat. These two fields are
    related, and a full description of either is a topic in its own right.

    In the past two decades, CMOS technology has claimed a preeminent
    position in modern electrical system design and enabled the widespread
    use of personal computers. Continued advances in the previous decade
    have led to the explosion of the internet and wireless communication.
    The transistor counts and clock frequencies of state-of-the-art chips
    have grown by orders of magnitude.

    The main concept involved in VLSI is Moore's law. A corollary of
    Moore's law is that transistors become faster, consume less power, and
    become cheaper to manufacture every year. For example, Intel
    microprocessor clock frequencies have doubled roughly every 34 months.
    Remarkably, the improvements have accelerated in recent years.
    Computer performance has grown even more than raw clock speed. Even
    though an individual CMOS transistor uses very little energy each time
    it switches, the enormous number of transistors switching at very high
    rates has made power consumption a major design consideration again.
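
    As a rough illustration of the 34-month doubling figure quoted above,
    the sketch below computes the growth multiple over a given time span.
    It assumes ideal exponential growth; the function name and the
    ten-year span are chosen here for illustration and do not come from
    the source.

```python
def growth_factor(months: float, doubling_period_months: float = 34.0) -> float:
    """Growth multiple after `months`, assuming one doubling per period."""
    return 2.0 ** (months / doubling_period_months)


# Clock-frequency multiple after ten years at one doubling per 34 months.
print(round(growth_factor(120), 2))
```

    Under this assumption, ten years of growth corresponds to a little
    more than three doublings.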

    1.2 DEALING WITH VLSI CIRCUITS

    Digital VLSI circuits are predominantly CMOS based. The way
    normal blocks like latches and gates are implemented is different from
    what students have seen so far, but the behavior remains the same.
    Miniaturization, however, brings new things to consider, and a lot of
    thought has to go into the actual implementation as well as the design.


    Let us look at some of the factors involved.

    1.2.1 Circuit Delays

    Large, complicated circuits running at very high frequencies have
    one big problem to tackle: the delay in the propagation of signals
    through gates and wires, even across areas a few micrometers wide. The
    operating speed is so high that, as the delays add up, they can become
    comparable to the clock period.
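
    The comparison between accumulated delay and clock period can be
    sketched as below. The stage delays and clock frequencies are
    hypothetical numbers chosen for illustration, and the model ignores
    setup time, clock skew and wire effects.

```python
def path_delay_ns(stage_delays_ns):
    """Total propagation delay along one path, in nanoseconds."""
    return sum(stage_delays_ns)


def meets_timing(stage_delays_ns, clock_mhz):
    """True if the summed path delay fits within one clock period."""
    period_ns = 1000.0 / clock_mhz
    return path_delay_ns(stage_delays_ns) <= period_ns


# Hypothetical path: three stages of gates and wiring.
path = [0.25, 0.5, 0.25]
print(meets_timing(path, 500))   # 2 ns period
print(meets_timing(path, 2000))  # 0.5 ns period
```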

    1.2.2 Power

    Another effect of high operating frequencies is increased
    consumption of power. This has a two-fold effect: devices drain
    batteries faster, and heat dissipation increases. Because surface
    areas have also decreased, heat is generated at a very considerable
    rate in a small space and poses a major threat to the stability of the
    circuit itself.
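
    The link between frequency and power can be made concrete with the
    usual first-order model of CMOS switching power, P = a * C * V^2 * f.
    This model is not stated in the source; it is brought in here as an
    assumption, and the parameter values below are illustrative only.

```python
def dynamic_power_watts(activity, capacitance_farads, vdd_volts, freq_hz):
    """First-order CMOS switching power: P = a * C * Vdd^2 * f."""
    return activity * capacitance_farads * vdd_volts ** 2 * freq_hz


# Illustrative values: 1 nF of switched capacitance, 1 V supply.
p_1ghz = dynamic_power_watts(0.5, 1e-9, 1.0, 1e9)
p_2ghz = dynamic_power_watts(0.5, 1e-9, 1.0, 2e9)
print(p_1ghz, p_2ghz)
```

    In this model, doubling the clock frequency doubles the switching
    power when everything else is held constant.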

    1.2.3 Layout

    Laying out the circuit components is a task common to all
    branches of electronics. What is special in our case is that there are
    many possible ways to do this: there can be multiple layers of different
    materials on the same silicon, there can be different arrangements of
    the smaller parts for the same component, and so on.

    Power dissipation and speed in a circuit present a trade-off: if we
    try to optimize one, the other is affected. The balance between the two
    is determined by the way we choose to lay out the circuit components.
    Layout can also affect the fabrication of VLSI chips, making it either
    easy or difficult to implement the components on the silicon.


    1.3 VIDEO COMPRESSION

    Video compression uses modern coding techniques to reduce
    redundancy in video data. A video is a sequence of frames, or images,
    captured from a moving scene. Most video compression algorithms and
    codecs combine spatial image compression and temporal motion
    compensation techniques. Video compression is a practical
    implementation of source coding in information theory. In practice,
    most video codecs also use audio compression techniques in parallel to
    compress the separate, but combined, data streams as one package. The
    majority of video compression algorithms use lossy compression, which
    best reduces the data rate and hence the delay. Uncompressed video
    requires a very high data rate. Although lossless video compression
    codecs achieve an average compression factor of over 3, a typical
    MPEG-4 lossy compression video has a compression factor between 20 and
    200. As in all lossy compression, there is a trade-off between video
    quality, the cost of processing the compression and decompression, and
    system requirements. Highly compressed video may present visible or
    distracting artifacts.
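
    To give a feel for the data rates involved, the sketch below computes
    the raw bit rate of an uncompressed video and the rate left after
    applying the compression factors quoted above. The frame size, bit
    depth and frame rate are assumptions chosen for illustration.

```python
def raw_bitrate_mbps(width, height, bits_per_pixel, fps):
    """Uncompressed video bit rate in megabits per second."""
    return width * height * bits_per_pixel * fps / 1e6


def compressed_bitrate_mbps(raw_mbps, compression_factor):
    """Bit rate after compression by the given factor."""
    return raw_mbps / compression_factor


# Assumed source: 720x576 pixels, 24 bits per pixel, 25 frames per second.
raw = raw_bitrate_mbps(720, 576, 24, 25)
for factor in (3, 20, 200):
    print(factor, round(compressed_bitrate_mbps(raw, factor), 2))
```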

    Some video compression schemes operate on square-shaped groups of
    neighboring pixels, often called macroblocks. These blocks of pixels
    are compared from one frame to the next, and the video compression
    codec sends only the differences within those blocks. In areas of video
    with more motion, the compression must encode more data to keep up with
    the larger number of pixels that are changing. During explosions,
    flames, flocks of animals, and some panning shots, the high-frequency
    detail commonly leads to decreases in quality or to increases in the
    variable bitrate.
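
    The idea of sending only the blocks that changed can be sketched as
    follows. This is plain block differencing on tiny frames, with a 2x2
    block size chosen for readability; real codecs use larger macroblocks
    and motion compensation, which this sketch does not attempt. Frame
    dimensions are assumed to be multiples of the block size.

```python
def changed_blocks(prev, curr, block=2):
    """Map (row, col) of each changed block to its residual (curr - prev)."""
    changed = {}
    for r in range(0, len(curr), block):
        for c in range(0, len(curr[0]), block):
            residual = [
                [curr[y][x] - prev[y][x] for x in range(c, c + block)]
                for y in range(r, r + block)
            ]
            # Only blocks with at least one non-zero difference are sent.
            if any(v != 0 for row in residual for v in row):
                changed[(r, c)] = residual
    return changed


prev_frame = [[10] * 4 for _ in range(4)]
curr_frame = [row[:] for row in prev_frame]
curr_frame[2][3] = 15  # a single pixel changes between frames
print(changed_blocks(prev_frame, curr_frame))
```

    Only one of the four blocks differs here, so only one residual block
    would need to be encoded.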


    1.4 VIDEO CODEC DESIGN

    A video codec is a device or software that enables compression or
    decompression of digital video; the format of the compressed data
    adheres to a video compression specification. The compression is
    usually lossy. Historically, video was stored as an analog signal on
    magnetic tape. Around the time when the compact disc entered the market
    as a digital-format replacement for analog audio, it became feasible to
    also begin storing and using video in digital form, and a variety of
    such technologies began to emerge. Audio and video call for customized
    methods of compression, which may lead to new trends in wireless
    telecommunication systems. Engineers and mathematicians have tried a
    number of solutions for tackling this problem.

    There is a complex relationship between the video quality, the
    quantity of data needed to represent it (also known as the bit rate),
    the complexity of the encoding and decoding algorithms, robustness to
    data losses and errors, ease of editing, random access, and end-to-end
    delay.

    Video codecs seek to represent a fundamentally analog data set in a
    digital format. Because of the design of analog video signals, which
    represent luma and color information separately, a common first step in
    image compression in codec design is to represent and store the image
    in a Y,Cb,Cr color space. The conversion to Y,Cb,Cr provides two
    benefits: first, it improves compressibility by decorrelating the color
    signals; and second, it separates the luma signal, which is
    perceptually much more important, from the chroma signal, which is
    perceptually less important and can be represented at lower resolution
    to achieve more efficient data compression. The ratios of information
    stored in these different channels are commonly written in the form
    Y:Cb:Cr, a notation known as chroma subsampling.
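
    The conversion from RGB to Y,Cb,Cr described above can be sketched as
    below. The source does not say which variant it has in mind; this
    sketch assumes the full-range 8-bit form with BT.601 luma weights, as
    used in JPEG/JFIF. The function name is chosen here for illustration.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert 8-bit RGB to 8-bit full-range Y, Cb, Cr (BT.601 weights)."""
    def clamp(value):
        # Round to the nearest integer and keep the result in 0..255.
        return max(0, min(255, round(value)))

    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return clamp(y), clamp(cb), clamp(cr)


print(rgb_to_ycbcr(255, 255, 255))  # white: full luma, neutral chroma
print(rgb_to_ycbcr(255, 0, 0))      # pure red
```

    Grey pixels map to neutral chroma (Cb = Cr = 128), which is why the
    chroma channels of natural images tend to compress well.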

    Different codecs use different chroma subsampling ratios as
    appropriate to their compression needs. Video compression schemes for
    Web and DVD make use of a 4:2:0 color sampling pattern, and the DV
    standard uses 4:1:1 sampling ratios. Professional video codecs designed
    to function at much higher bitrates and to record a greater amount of
    color information for post-production manipulation sample in 3:1:1
    (uncommon), 4:2:2 and 4:4:4 ratios. Examples of these codecs include
    Panasonic's DVCPRO50 and DVCPRO HD codecs (4:2:2), Sony's HDCAM-SR
    (4:4:4) and Panasonic's HD D5 (4:2:2). Apple's ProRes 422 HQ codec also
    samples in 4:2:2 color space. More codecs that sample in 4:4:4 patterns
    exist as well, but are less common, and tend to be used internally in
    post-production houses. It is also worth noting that video codecs can
    operate in RGB space as well. These codecs tend not to sample the red,
    green, and blue channels in different ratios, since there is less
    perceptual motivation for doing so; only the blue channel could be
    undersampled.
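
    The effect of these sampling ratios on data volume can be worked out
    from the usual J:a:b reading of the notation: in a region J pixels
    wide and 2 rows high, each chroma channel keeps a samples in the first
    row and b in the second. The sketch below, with a function name chosen
    for illustration, counts the average number of samples stored per
    pixel.

```python
def samples_per_pixel(j, a, b):
    """Average stored samples per pixel for J:a:b chroma subsampling."""
    luma = 2 * j            # one luma sample for every pixel in the J x 2 region
    chroma = 2 * (a + b)    # Cb and Cr each keep a + b samples
    return (luma + chroma) / (2 * j)


for ratio in [(4, 4, 4), (4, 2, 2), (4, 2, 0), (4, 1, 1)]:
    print(ratio, samples_per_pixel(*ratio))
```

    Under this reading, 4:2:0 and 4:1:1 both store half as many samples
    as 4:4:4, and 4:2:2 stores two thirds.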

    Some amount of spatial and temporal downsampling may also be used to
    reduce the raw data rate before the basic encoding process. The most
    popular encoding transform is the 8x8 discrete cosine transform (DCT).
    Codecs which make use of a wavelet transform are also entering the
    market, especially in camera workflows which involve dealing with RAW
    image formatting in motion sequences. The output of the transform is
    first quantized, then entropy encoding is applied to the quantized
    values. When a DCT has been used, the coefficients are typically
    scanned using a zig-zag scan order, and the entropy coding typically
    combines a number of consecutive zero-valued quantized coefficients
    with the value of the next non-zero quantized coefficient into a single
    symbol. It also has special ways of indicating when all of the
    remaining quantized coefficient values are equal to zero. The entropy
    coding method typically uses variable-length coding tables. Some
    encoders can compress the video in a multiple-step process called
    n-pass encoding (e.g. 2-pass), which is slower but potentially gives
    better quality compression.
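
    The chain of transform, quantization, zig-zag scan and zero-run
    symbols described above can be sketched end to end as below. This is
    a simplified model, not any particular standard: it assumes an
    orthonormal DCT-II computed directly, a single uniform quantizer step
    (real codecs use quantization matrices), and a plain (run, level)
    symbol list with an "EOB" marker in place of real variable-length
    code tables. All function names are chosen here for illustration.

```python
import math

N = 8  # block size of the 8x8 DCT


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (direct form, not optimized)."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def quantize(coeffs, step):
    """Uniform quantization with a single step size (an assumption)."""
    return [[round(value / step) for value in row] for row in coeffs]


def zigzag_order(n=N):
    """(row, col) positions of an n x n block in zig-zag scan order."""
    def key(pos):
        r, c = pos
        diagonal = r + c
        # Odd diagonals run top-right to bottom-left, even ones the reverse.
        return (diagonal, r if diagonal % 2 else c)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def run_level(levels):
    """Pair each non-zero level with the number of zeros before it;
    trailing zeros collapse into a single end-of-block marker."""
    symbols, run = [], 0
    for level in levels:
        if level == 0:
            run += 1
        else:
            symbols.append((run, level))
            run = 0
    symbols.append("EOB")
    return symbols


def encode_block(block, step=16):
    """Transform, quantize, scan and run-level code one block."""
    levels = quantize(dct_2d(block), step)
    scanned = [levels[r][c] for r, c in zigzag_order()]
    return run_level(scanned)


flat = [[8] * N for _ in range(N)]  # a block with no spatial detail
print(encode_block(flat))
```

    A flat block has energy only in the DC coefficient, so after the scan
    it reduces to one (run, level) pair followed by the end-of-block
    marker.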

    The decoding process consists of performing, to the extent

    possible, an inversion of each stage of the encoding process. The one

    stage that cannot be exactly inverted is the quantization stage. There, a

    best-effort approximation of inversion is performed. This part of the

    process is often called "inverse quantization" or "dequantization",

    although quantization is an inherently non-invertible process.
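
    Why quantization cannot be inverted exactly can be seen from a
    one-coefficient example. The step size and values are illustrative
    assumptions, and the function names are chosen here.

```python
def quantize_value(value, step):
    """Map a coefficient to the index of the nearest multiple of `step`."""
    return round(value / step)


def dequantize_value(level, step):
    """Best-effort inverse: return the multiple of `step` that was chosen."""
    return level * step


# Two different coefficients collapse onto the same level.
for original in (37, 43):
    level = quantize_value(original, 10)
    print(original, level, dequantize_value(level, 10))
```

    Both inputs are reconstructed as the same value, so the decoder
    cannot tell which one the encoder saw.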

    This process involves representing the video image as a set of
    macroblocks, a critical facet of video codec design.

    Video codec designs are often standardized, or will be in the future;
    that is, they are specified precisely in a published document. However,
    only the decoding process needs to be standardized to enable
    interoperability. The encoding process is typically not specified at
    all in a standard, and implementers are free to design their encoder
    however they want, as long as the video can be decoded in the specified
    manner. For this reason, the quality of the video produced by decoding
    the results of different encoders that use the same video codec
    standard can vary dramatically from one encoder implementation to
    another.

    1.5 VERILOG INTRODUCTION

    There are many texts on Verilog that provide a more in-depth
    treatment. The IEEE standard itself is quite readable as well as
    authoritative [IEEE 1364-01]. Verilog is best understood as a shorthand
    for describing digital hardware. It is best to begin the design process
    by planning, on paper or in your mind, the hardware you want, and then
    to write Verilog that implies that hardware to a synthesis tool.

    The Verilog language was developed by Gateway Design Automation as
    a proprietary language for logic simulation in 1984. Gateway was
    acquired by Cadence in 1989, and Verilog was made an open standard in
    1990 under the control of Open Verilog International. It became an
    IEEE standard in 1995.

    There are two general styles of description:

    Behavioral

    Structural

    Behavioral Verilog describes how the outputs are computed as a
    function of the inputs. Structural Verilog describes how a module is
    composed of simpler modules or basic primitives such as gates or
    transistors. There are two general types of statements used in
    behavioral Verilog: continuous assignments and always blocks.

    A continuous assignment statement implies combinational logic,
    because the output on the left side is a function of the inputs on the
    right side. An always block can imply combinational logic or sequential
    logic, depending on how it is used. It is good practice to partition
    the design into combinational and sequential components and then write
    the Verilog.
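
    The distinction between combinational and sequential logic can be
    modelled in a few lines of software. The sketch below is a Python
    analogy only: it is not Verilog and is not synthesizable, and the
    names are chosen here for illustration. A pure function stands in for
    combinational logic, and an object that updates its state only on a
    clock edge stands in for sequential logic.

```python
def mux2(a, b, sel):
    """Combinational: the output depends only on the current inputs."""
    return b if sel else a


class DFlipFlop:
    """Sequential: the stored output changes only on a clock edge."""

    def __init__(self):
        self.q = 0

    def posedge(self, d):
        # Capture the input at the rising clock edge and hold it.
        self.q = d
        return self.q


ff = DFlipFlop()
print(mux2(0, 1, sel=1), ff.q)  # the flip-flop still holds its reset value
ff.posedge(mux2(0, 1, sel=1))   # the mux output is registered on the edge
print(ff.q)
```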


    CHAPTER 2

    DESIGN CONSIDERATION

    2.1 THE VLSI DESIGN PROCESS

    Fig.2.1 VLSI design flow


    2.1.1 Digital Design Flow

    Specification

    Architecture

    RTL Coding

    RTL Verification

    Synthesis

    Backend

    Tape out to foundry to get the end product: a wafer carrying many
    identical ICs.

    All modern digital designs start with a designer writing a hardware
    description of the IC in a hardware description language (HDL) such as
    Verilog or VHDL. A Verilog or VHDL program essentially describes the
    hardware (logic gates, flip-flops, counters, etc.), the interconnection
    of the circuit blocks, and the functionality. Various CAD tools are
    available to synthesize a circuit based on the HDL. The most widely
    used synthesis tools come from two CAD companies: Synopsys and Cadence.

    Without going into details, we can say that VHDL can be called the
    "C" of the VLSI industry. VHDL stands for "VHSIC Hardware Description
    Language", where VHSIC stands for "Very High Speed Integrated Circuit".
    This language is used to design circuits at a high level, in two ways.
    It can either be a behavioral description, which describes what the
    circuit is supposed to do, or a structural description, which describes
    what the circuit is made of. There are other languages for describing
    circuits, such as Verilog, which work in a similar fashion.

    Both forms of description are then used to generate a very low-

    level description that actually spells out how all this is to be fabricated on

    the silicon chips. This will result in the manufacture of the intended IC.


    2.1.2 Analog Design Flow

    In case of analog design, the flow changes somewhat.

    Specifications

    Architecture

    Circuit Design

    SPICE Simulation

    Layout

    Parametric Extraction / Back Annotation

    Final Design

    Tape Out to foundry.

    While digital design is highly automated now, only a very small
    portion of analog design can be automated. There is a hardware
    description language called AHDL, but it is not widely used, as it does
    not give an accurate behavioral model of the circuit because of the
    complexity of the effects of parasitics on the analog behavior of the
    circuit. Many analog chips are what are termed "flat" or
    non-hierarchical designs. This is true for small transistor count chips
    such as an operational amplifier, a filter or a power management chip.
    For more complex analog chips such as data converters, the design is
    done at the transistor level, building up to a cell level, then a block
    level, and is then integrated at the chip level. Not many CAD tools are
    available for analog design even today, and thus analog design remains
    a difficult art. SPICE remains the most useful simulation tool for
    analog as well as digital design.


    2.2 CATEGORIES IN VLSI DESIGN

    2.2.1 Analog

    Small transistor count precision circuits such as Amplifiers, Data

    converters, filters, Phase Locked Loops, Sensors etc.

    2.2.2 ASICs or Application Specific Integrated Circuits

    Progress in the fabrication of ICs has enabled us to create fast and
    powerful circuits in smaller and smaller devices. This also means that
    we can pack a lot more functionality into the same area. The biggest
    application of this ability is found in the design of ASICs. These are
    ICs that are created for specific purposes: each device is created to
    do a particular job, and do it well. The most common application area
    for this is DSP: signal filters, image compression, etc. To go to
    extremes, consider the fact that a digital wristwatch normally consists
    of a single IC doing all the time-keeping jobs as well as extra
    features like games, calendar, etc.

    2.2.3 SoC or Systems on a chip

    These are highly complex mixed signal circuits (digital and analog

    all on the same chip). A network processor chip or a wireless radio chip is

    an example of a SoC.

    2.3 DEVELOPMENTS IN THE FIELD OF VLSI

    There are a number of directions a person can take in VLSI, and

    they are all closely related to each other. Together, these developments

    are going to make possible the visions of embedded systems and

    ubiquitous computing.


    2.3.1 Reconfigurable computing

    Reconfigurable computing is a very interesting and fairly recent
    development in microelectronics. It involves fabricating circuits that
    can be reprogrammed on the fly. This does not mean microcontrollers
    running with EEPROM inside. Reconfigurable computing involves specially
    fabricated devices called FPGAs that, when programmed, act just like
    normal electronic circuits. They are designed so that by changing, or
    "reprogramming", the connections between numerous submodules, the FPGAs
    can be made to behave like any circuit we wish.
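
    One way to picture reprogrammable hardware is the look-up table
    (LUT), a common FPGA building block whose stored truth table decides
    which logic function it implements. The source speaks only of
    reprogramming connections between submodules, so the LUT model below
    is an assumption brought in for illustration, and the class name is
    hypothetical.

```python
class Lut2:
    """Two-input look-up table; the 4-entry truth table is its 'program'."""

    def __init__(self, truth_table):
        self.reprogram(truth_table)

    def reprogram(self, truth_table):
        """Load a new truth table, indexed by the input pair (a, b)."""
        if len(truth_table) != 4:
            raise ValueError("a 2-input LUT needs exactly 4 entries")
        self.table = list(truth_table)

    def __call__(self, a, b):
        # The inputs form the address into the stored table.
        return self.table[(a << 1) | b]


lut = Lut2([0, 0, 0, 1])                 # behaves as an AND gate
print([lut(a, b) for a in (0, 1) for b in (0, 1)])
lut.reprogram([0, 1, 1, 0])              # same object, now an XOR gate
print([lut(a, b) for a in (0, 1) for b in (0, 1)])
```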

    This ability to create modifiable circuits again opens up new
    possibilities in microelectronics. Consider, for example,
    microprocessors which are partly reconfigurable. We know that running
    complex programs can benefit greatly if support is built into the
    hardware itself. We could have a microprocessor that optimises itself
    for every task that it tackles. Or consider a system that is too big to
    implement on hardware that may be limited by cost or other constraints.
    If we use a reconfigurable platform, we could design the system so that
    parts of it are mapped onto the same hardware at different times. One
    could think of many such applications, not the least of which is
    prototyping: using an FPGA to try out a new design before it is
    actually fabricated. This can drastically reduce development cycles,
    and also save some of the money that would have been spent in
    fabricating prototype ICs.

    2.3.2 Takeover of Hardware design

    ASICs provided the path to creating miniature devices that can
    perform many diverse functions. But with the impending boom in this
    kind of technology, what we need is a large number of people who can
    design these ICs. This is where we cross the threshold between a chip
    designer and a systems designer working at a higher level. Does a
    person designing a chip really need to know every minute detail of the
    IC manufacturing process? Can there be tools that allow a designer to
    simply create design specifications that get translated into hardware
    specifications?

    The solution to this is rather simple: hardware compilers, or
    silicon compilers as they are called. We know by now that there exist
    languages like VHDL which can be used to specify the design of a chip.
    What if we had a compiler that converts a high-level language into a
    VHDL specification? The potential of this technology is tremendous:
    put simply, we could convert all software programmers into hardware
    designers.

    2.3.3 The need for hardware compilers:

    Before we go further let us look at why we need this kind of

    technology that can convert high-level languages into hardware

    definitions. We see a set of needs which actually lead from one to the

    other in a series.

    A. Rapid development cycles

    The traditional method of designing hardware is a long and winding
    process, going through many stages with special effort spent on design
    verification at every stage. This means that the time from drawing
    board to market is very long. This is rather undesirable in the case of
    a large, expanding market with many competitors trying to grab a share.
    We need alternatives that cut down on this time so that new ideas reach
    the market faster, since the first to get in normally gains a large
    advantage.


    B. Large number of designers

    With embedded systems becoming more and more popular, there is a
    need for a large number of chip designers who can churn out chips
    designed for specific applications. It is impractical to think of
    training so many people in the intricacies of VLSI design.

    C. Specialized training

    A person who wishes to design ASICs will require extensive training
    in the field of VLSI design. But we cannot possibly expect to find a
    large number of people who would wish to undergo such training. Also,
    the process of training these people will itself entail large
    investments in time and money. This means there has to be a system
    which can abstract out all the details of VLSI, and which allows the
    user to think in simple system-level terms.

    There are quite a few tools available for using high-level languages
    in circuit design, but this area has started bearing fruit only
    recently. For example, there is a language called Handel-C that looks
    just like good old C, but has some special extensions that make it
    usable for defining circuits. A program written in Handel-C can be
    represented block-by-block by hardware equivalents. In doing all this,
    the compiler takes care of all low-level issues like clock frequency,
    layout, etc. The biggest selling point is that the user does not really
    have to learn anything new, except for the few extensions made to C so
    that it may be conveniently used for circuit design.

    Another, quite different language that is still under development is
    Lava. It is based on an esoteric branch of computer science called
    "functional programming". FP itself is pretty old, and is radically
    different from the normal way we write programs. This is because it
    assumes parallel execution as a part of its structure; it is not based
    on the normal idea of a "sequence of instructions". This parallel
    nature is very suitable for hardware, since logic circuits are
    inherently parallel. Preliminary studies have shown that Lava can
    actually create better circuits than VHDL itself, since it affords a
    high-level view of the system without losing sight of low-level
    features.

    2.4 DESIGN METHODOLOGY

    A good VLSI design system should provide consistent descriptions in
    all three description domains (behavioral, structural, and physical)
    and at all relevant levels of abstraction (e.g. architecture,
    RTL/block, logic, circuit). How well this is achieved can be measured
    in various terms that differ in importance depending on the
    application. These parameters can be summarized in terms of:

    Performance: speed, power, flexibility

    Size of die (cost of die)

    Time to design (cost of engineering and schedule)

    Ease of verification, test generation and testability (cost of
    engineering and schedule)

    Design is a continuous trade-off to achieve adequate results for
    all of the above parameters, so the tools and methodologies used for a
    particular chip will depend on these parameters. Other constraints
    depend on economics (e.g. size of die affecting yield) or even on
    subjectivity.

    The process of designing a system on silicon is complicated. The
    role of good VLSI design aids is to reduce this complexity, increase
    productivity, and assure the designer of a working product. A good
    method of simplifying the approach to a design is the use of
    constraints and abstraction. The design methods considered here stand
    in contrast to the design flow used to build a chip. The base design
    methods are arranged roughly in order of increasing investment, which
    loosely relates to the time and cost it takes to design and implement
    the system. It is important to understand the cost, capabilities and
    limitations of a given implementation technology in order to select the
    right solution. For example, there is little point in designing a
    custom chip when an off-the-shelf solution that meets the system
    criteria is available for the same or lower cost.

    2.5 OBJECTIVE AND SCOPE

    This project deals with a multi-standard transform (MST) core that
    supports H.264 (8×8, 4×4), MPEG-1/2/4 (8×8), and VC-1 (8×8, 8×4, 4×8,
    4×4) transforms. The proposed MST core combines distributed arithmetic
    and factor sharing schemes, as common sharing distributed arithmetic
    (CSDA), to reduce hardware cost.

    Our new design of the multi-standard transform video codec
    architecture targets high throughput, low area and low delay.
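
    As an example of the kind of transform such a core has to compute, the
    sketch below applies the 4×4 forward core transform of H.264, usually
    written Y = C X C^T with a small integer matrix C. The scaling that
    the standard folds into quantization is omitted. This is a software
    illustration only and is not the CSDA architecture of this project;
    the helper names are chosen here.

```python
# Integer matrix of the H.264 4x4 forward core transform.
C4 = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_4x4(block):
    """Y = C4 * X * C4^T, using integer arithmetic only."""
    return matmul(matmul(C4, block), transpose(C4))


flat = [[1] * 4 for _ in range(4)]
print(forward_4x4(flat))
```

    Every entry of C is 1 or 2 in magnitude, so each product reduces to
    an addition, a subtraction or a one-bit shift, which is what makes
    adder-based hardware attractive for these transforms.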

    2.6 APPLICATIONS

    Digital video codecs are found in DVD systems (players, recorders),
    Video CD systems, emerging satellite and digital terrestrial broadcast
    systems, and various digital devices and software products with video
    recording or playing capability. Online video material is encoded by a
    variety of codecs, and this has led to the availability of codec packs:
    pre-assembled sets of commonly used codecs combined with an installer,
    available as a software package for PCs, such as K-Lite Codec Pack.
    Encoding of media by the public has seen an upsurge with the
    availability of CD and DVD recorders.


    2.7 RECENT RESEARCH IN VIDEO COMPRESSION

    Although the imminent death of research into video compression

    has often been proclaimed, the growth in capacity of telecommunications

    networks is being outpaced by the rapidly increasing demand for services.

    The result is an ongoing need for better multimedia compression, and

    particularly video and image compression.

    At the same time, there is a need for these services to be carried
    on networks of greatly varying capacities and qualities of service, and
    to be decoded by devices ranging from small, low-power, handheld
    terminals to much more capable fixed systems. Hence, the ideal video
    compression algorithm should have high compression efficiency, be
    scalable to accommodate variations in network performance including
    capacity and quality of service, and be scalable to accommodate
    variations in decoder capability. In this presentation, these issues
    will be examined, illustrated by recent research at UNSW@ADFA in
    compression efficiency, scalability and error resilience.


    CHAPTER 3

    SYSTEM ANALYSIS

    3.1 PROJECT INTRODUCTION

    Compression of video and image signals is mainly done using
    transforms such as the discrete cosine transform (DCT) and integer
    transforms. Distributed arithmetic (DA), factor sharing (FS) and
    matrix decomposition methods are used to reduce the hardware cost as
    well as the implementation cost of these transforms. Swartzlander and
    Yu present an efficient method for reducing ROM size by using recursive
    DCT algorithms. ROMs scale better than some other circuits as
    technology nodes shrink, but numerous ROM-free DA architectures have
    emerged recently. One of them is a DA scheme called NEDA, which uses
    bit-level sharing to implement the butterfly matrix with adders. These
    transforms are used to support any one of the application standards
    (Table 3.1).

    Table 3.1 Corresponding dimensions of different video codecs

    Video codec    Dimensions            Group
    MPEG-1/2/4     8×8                   ISO
    H.264          8×8, 4×4              ITU-T
    VC-1           8×8, 8×4, 4×8, 4×4    Microsoft
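
    The ROM-based form of distributed arithmetic mentioned in Section 3.1
    replaces the multipliers of an inner product with a table look-up and
    shift-accumulate steps. The sketch below is a minimal software model
    under simplifying assumptions: inputs are unsigned integers of a fixed
    bit width (hardware designs normally handle two's-complement inputs),
    and the function names are chosen here for illustration. It is not
    the NEDA or CSDA design.

```python
def build_rom(coeffs):
    """ROM[address] = sum of the coefficients whose address bit is 1."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (address >> k) & 1)
            for address in range(1 << n)]


def da_inner_product(coeffs, inputs, bits=8):
    """Inner product of coeffs and unsigned `bits`-bit inputs, computed
    one bit-plane at a time with look-ups, shifts and additions only."""
    rom = build_rom(coeffs)
    acc = 0
    for b in range(bits):
        # Bit b of every input forms the ROM address for this bit-plane.
        address = 0
        for k, x in enumerate(inputs):
            address |= ((x >> b) & 1) << k
        acc += rom[address] << b
    return acc


coeffs = [3, -1, 4, 2]
inputs = [5, 7, 0, 255]
print(da_inner_product(coeffs, inputs))
print(sum(c * x for c, x in zip(coeffs, inputs)))  # reference result
```

    The ROM grows as 2^n with the number of inputs n, which is the size
    problem that recursive and ROM-free DA schemes set out to reduce.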


    The DFT is the most important discrete transform, used to

    perform Fourier analysis in many practical applications. In digital signal

    processing, the function is any quantity or signal that varies over time,

    such as the pressure of a sound wave, a radio signal, or

    daily temperature readings, sampled over a finite time interval (often

    defined by a window function). In image processing, the samples can be

    the values of pixels along a row or column of a raster image. The DFT is

    also used to efficiently solve partial differential equations, and to perform

    other operations such as convolutions or multiplying large integers.

    Likewise, the DCT and the other transforms discussed here can be

    implemented efficiently to increase the throughput rate.

    3.2 EXISTING SYSTEM

    3.2.1 INTRODUCTION

    Numerous researchers have worked on transform core designs,

    including the discrete cosine transform (DCT) and integer transforms,

    using distributed arithmetic (DA), factor sharing (FS) and matrix

    decomposition methods to reduce hardware cost. In DA, the inner product

    is implemented with ROMs and accumulators instead of multipliers to

    reduce the area cost. High-throughput adder trees have been introduced

    to improve the throughput rate of the NEDA method. The FS method

    derives the matrices of multiple standards as linear combinations of

    the same matrix and a delta matrix, and shows that the coefficients in

    the same matrix can share the same hardware resources. Matrices for

    VC-1 transformations can be decomposed into several small matrices.

    Recently, reconfigurable architectures have been presented as a way to

    give processors good flexibility on field-programmable gate array

    (FPGA) platforms or in application-specific integrated circuits

    (ASICs). These existing designs fully support the transform core for

    the H.264 standard, including the 8×8 and 4×4 transforms. However, the

    eight-point and four-point transform cores for MPEG-1/2/4 and H.264

    cannot support the VC-1 compression standard. The proposed system is

    designed to overcome this limitation.

    3.2.2 CSDA

    CSDA stands for Common Sharing Distributed Arithmetic. The technique

    combines factor sharing and distributed arithmetic to generate the

    CSDA coefficients. Factor sharing shares a factor that is common to

    several coefficients, and distributed arithmetic shares a combination

    of inputs that is common to several coefficient positions. In the

    existing system, a pipeline register is used as the storage element.

    3.2.3 LIMITATIONS OF EXISTING SYSTEM

    Low throughput

    High cost

    High delay

    Large number of adders

    3.3 PROPOSED SYSTEM

    3.3.1 INTRODUCTION

    The proposed CSDA combines the DA and FS methods. The coefficient

    matrix is first expanded at the bit level. The FS method then shares

    the factor that is common to the coefficients, and the DA method is

    applied to share the same combination of inputs among the coefficient

    positions. The proposed CSDA algorithm can be explained with the

    following matrix inner product:

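    $\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$    (3.1)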

    where the coefficients C11 ~ C22 are all five-bit CSD numbers

    (4.2)

    The shared factor Fs in the four coefficients is [1 -1], and C11 ~ C22

    can use Fs in place of [1 -1] at the corresponding positions under the

    FS method. The DA method is then applied to share the same position

    for the inputs, giving the DA shared coefficient DA1 = (X1 + X2) Fs.

    Finally, the matrix inner product in the above equation can be

    implemented by shifting and adding every nonzero weight position.

    3.3.2 BUFFER AS A MEMORY

    In the proposed system, a buffer is used as the memory element

    instead of a pipeline register. The buffer is active only while the

    clock input is high, so the bit stream passes through without being

    held in memory. Hence the delay is considerably reduced. Because no

    data is stored in a register, the retrieval time is very small.

    3.3.3 ADVANTAGES OF PROPOSED SYSTEM

    High throughput

    Low cost

    Supports three different types of video codecs

    Reduction in number of adders

    CHAPTER 4

    SYSTEM DESIGN IMPLEMENTATION

    4.1 DERIVATION OF CSDA ALGORITHM

    The CSDA algorithm mainly combines Factor Sharing and

    Distributed Arithmetic techniques. The methods for computing the

    coefficients are given below.

    4.1.1 Factor sharing derivation

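    In this technique, a factor that is common to several coefficients

    is shared. If the signals S1 and S2 are the products of an input X

    with the coefficients C1 and C2, they can be written as

    $S_1 = (F_s 2^{k_1} + F_{d1}) X$

    $S_2 = (F_s 2^{k_2} + F_{d2}) X$    (4.1)

    where Fs is the shared factor found in both C1 and C2, k1 and k2 are

    the weight positions of Fs in C1 and C2, and Fd1 and Fd2 are the

    remainder coefficients of C1 and C2, respectively.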
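
    The factor-sharing idea can be checked numerically. The Python sketch

    below is illustrative only. It assumes each coefficient is written as

    C = Fs*2^k + Fd (a shared factor Fs at weight position k plus a

    remainder Fd). The helper name shared_products and the example values

    (Fs = 5, C1 = 21, C2 = 13) are assumptions chosen for the example and

    are not taken from the transform cores discussed here.

```python
def shared_products(x, fs, k1, fd1, k2, fd2):
    """Compute S1 = C1*x and S2 = C2*x, where C = Fs*2**k + Fd.

    The product Fs*x is formed once and reused for both outputs,
    which is the point of factor sharing.
    """
    shared = fs * x                    # computed once, used twice
    s1 = (shared << k1) + fd1 * x      # S1 = (Fs*2^k1 + Fd1) * x
    s2 = (shared << k2) + fd2 * x      # S2 = (Fs*2^k2 + Fd2) * x
    return s1, s2


# Hypothetical coefficients: C1 = 5*4 + 1 = 21 and C2 = 5*1 + 8 = 13.
x = 7
s1, s2 = shared_products(x, fs=5, k1=2, fd1=1, k2=0, fd2=8)
print(s1 == 21 * x, s2 == 13 * x)
```

    In hardware terms, the single Fs*x term corresponds to one shared

    adder network whose output is shifted to the two weight positions.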

    4.1.2 Distributed Arithmetic format

    For matrix multiplication and accumulation, the inner product can

    be written as

    $Y = A^T X = \sum_{i=1}^{L} A_i X_i$    (4.2)

    where Ai is an N-bit CSD coefficient, Xi is the input data and L is

    the number of inputs. Expanding the coefficients at the bit level gives

    $Y = \begin{bmatrix} 2^0 & 2^{-1} & \cdots & 2^{-(N-1)} \end{bmatrix} \begin{bmatrix} Y_0 \\ Y_1 \\ \vdots \\ Y_{N-1} \end{bmatrix}, \quad Y_j = \sum_{i=1}^{L} A_{i,j} X_i, \quad A_{i,j} \in \{-1, 0, 1\}$    (4.3)

    The product Y is obtained by shifting and adding every nonzero Yj.

    The inner product can therefore be computed with shifters and adders

    instead of multipliers, which lowers the implementation cost.
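
    The shift-and-add evaluation can be sketched in Python as follows.

    This is a minimal software model under stated assumptions, not the

    hardware design: coefficients are plain integers with weights 2^j

    (rather than the fractional weights 2^-j used above), and the function

    names to_csd and da_inner_product are hypothetical.

```python
def to_csd(value, nbits):
    """Return signed digits of an integer, least significant first.

    Digits are in {-1, 0, 1} and no two adjacent digits are nonzero
    (canonical signed-digit form).
    """
    digits = []
    while value != 0:
        if value & 1:
            d = 2 - (value & 3)    # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    if len(digits) > nbits:
        raise ValueError("coefficient does not fit in nbits digits")
    return digits + [0] * (nbits - len(digits))


def da_inner_product(coeffs, inputs, nbits=8):
    """Compute sum(A_i * X_i) with additions, subtractions and shifts only."""
    rows = [to_csd(a, nbits) for a in coeffs]
    y = 0
    for j in range(nbits):
        yj = 0                     # Y_j: signed sum of inputs at weight j
        for row, x in zip(rows, inputs):
            if row[j] == 1:
                yj += x
            elif row[j] == -1:
                yj -= x
        if yj != 0:                # only nonzero Y_j contribute
            y += yj << j
    return y


coeffs, inputs = [23, 11, 7], [3, -5, 2]
print(da_inner_product(coeffs, inputs) == sum(a * x for a, x in zip(coeffs, inputs)))
```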

    4.1.3 CSDA Algorithm

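    The inner product is the product of the coefficient matrix and the

    inputs:

    $\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$    (4.4)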

    1 1 1 0 0

    1 1 0 0 1

    1 1 1 0 0

    0 1 1 0 0 (4.5)

    This section provides a discussion of the hardware resources and

    system accuracy for the proposed 2-D CSDA-MST core and also presents

    a comparison with previous works. Finally, the characteristics of the

    implementation into a chip are described.

    The coefficients are written as in the matrix above and compared to

    find the factor they have in common, here [1 -1], which is shared as

    Fs. Given the shared factor, the distributed arithmetic terms are then

    formed from the inputs Xi. In this way the CSDA combines the FS and DA

    methods: the FS method is applied first to identify the factors that

    give the greatest hardware resource sharing. The shared factor Fs in

    the four coefficients is [1 -1], and C11 ~ C22 can use Fs in place of

    [1 -1] at the corresponding positions under the FS method. The DA

    method is then applied to share the same position for the inputs,

    giving the DA shared coefficient DA1 = (X1 + X2) Fs. Finally, the

    matrix inner product in the above equation can be implemented by

    shifting and adding every nonzero weight position.

    To carry out the searching flow, software code runs the iterative

    searching loop under a constraint on the minimum number of nonzero

    elements. Because the choice of shared coefficients is obtained under

    such constraints, the chosen coefficients are not a globally optimal

    solution with the minimal number of nonzero bits.

    However, the chosen coefficients of CSD expression can achieve

    high sharing capability for arithmetic resources by using the proposed

    CSDA algorithm. More information about CSDA coefficients for MPEG-

    1/2/4, H.264, and VC-1 transforms.

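    Using the symmetry property, the coefficients are allocated

    separately to the even part and the odd part. The even part is

    decomposed further to obtain the values Zee and Zeo, whose inputs are

    $A = \begin{bmatrix} a_0 + a_3 \\ a_1 + a_2 \end{bmatrix}, \quad B = \begin{bmatrix} a_0 - a_3 \\ a_1 - a_2 \end{bmatrix}$    (4.6)

    where a0 to a3 are the inputs of the even part.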

    4.1.4 FLOW DIAGRAM

    1. Input the coefficient matrix.

    2. Enter the iteration searching loop:

       a. FS finds a new shared factor in the coefficient matrix.

       b. DA finds the shared coefficient based on the FS results.

       c. Calculate the number of adders.

       d. Compare with the previous data (adders), and update the

          smallest one for FS and DA.

    3. Find the CSDA shared coefficient.

    4.1.5 DESCRIPTION:

    To obtain better resource sharing for inner product operation, the

    proposed CSDA combines the FS and DA methods. The FS method is

    adopted first to identify the factors that can achieve higher capability in

    hardware resource sharing, where the hardware resource in this paper is

    defined as the number of adders used. Next, the DA method is used to

    find the shared coefficient based on the results of the FS method. The

    adder-tree circuits will be followed by the proposed CSDA circuit. Thus,

    the CSDA method aims to reduce the nonzero elements to as few as

    possible. The CSDA shared coefficient is used for estimating and

    comparing thereafter the number of adders in the CSDA loop. Therefore,

    the iteration searching loop requires a large number of loops to determine

    the smallest hardware resource by these steps, and the CSDA shared

    coefficient can be established. Note that the factor or coefficient

    that is optimal for the FS or DA method alone does not necessarily

    give the smallest resource in the proposed CSDA method. Thus, a number

    of iteration loops is needed to determine a better CSDA shared

    coefficient.
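
    The skeleton of such a search (try a candidate, count adders, keep the

    smallest) can be sketched in Python. This is a toy model and not the

    authors' search code: it only tries two-digit shared factors, it omits

    the DA sharing step, and the adder cost model, the function names and

    the example digit rows are all assumptions made for illustration.

```python
from itertools import product


def count_hits(digits, pattern):
    """Count non-overlapping occurrences of pattern in a digit list."""
    hits, i, m = 0, 0, len(pattern)
    while i + m <= len(digits):
        if tuple(digits[i:i + m]) == pattern:
            hits += 1
            i += m
        else:
            i += 1
    return hits


def estimate_adders(coeffs, pattern=None):
    """Toy cost model (assumption): a coefficient with t terms needs t-1
    adders, each use of the shared two-digit pattern counts as one term,
    and building the shared pattern costs one extra adder if it is used."""
    total, used = 0, False
    for digits in coeffs:
        terms = sum(1 for d in digits if d != 0)
        if pattern is not None:
            hits = count_hits(digits, pattern)
            terms -= hits          # two digits merge into one shared term
            used = used or hits > 0
        total += max(terms - 1, 0)
    return total + (1 if used else 0)


def search_shared_factor(coeffs):
    """Iterate over candidate factors and keep the one with fewest adders."""
    best_pattern, best_cost = None, estimate_adders(coeffs)
    for pattern in product((-1, 1), repeat=2):
        cost = estimate_adders(coeffs, pattern)
        if cost < best_cost:       # compare with previous best and update
            best_pattern, best_cost = pattern, cost
    return best_pattern, best_cost


# Hypothetical five-digit signed-digit coefficients.
example_coeffs = [
    [1, -1, 1, 0, 0],
    [1, -1, 0, 0, 1],
    [0, 1, -1, 0, 1],
    [0, 0, 1, -1, 0],
]
print(search_shared_factor(example_coeffs))
```

    For these example rows the loop settles on the factor (1, -1), which

    lowers the estimated adder count from 7 without sharing to 4.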

    CHAPTER 5

    SOFTWARE DESCRIPTION

    5.1 Xilinx ISE Overview

    The Integrated Software Environment (ISE) is the Xilinx

    design software suite that allows you to take your design from design

    entry through Xilinx device programming. The ISE Project Navigator

    manages and processes your design through the following steps in the

    ISE design flow.

    5.1.1 Design Entry

    Design entry is the first step in the ISE design flow. During design

    entry, you create your source files based on your design objectives. You

    can create your top-level design file using a Hardware Description

    Language (HDL), such as VHDL, Verilog, or ABEL, or using a

    schematic. You can use multiple formats for the lower-level source files

    in your design.

    5.1.2 Synthesis

    After design entry and optional simulation, you run synthesis.

    During this step, VHDL, Verilog, or mixed language designs become

    netlist files that are accepted as input to the implementation step.

    5.1.3 Implementation

    After synthesis, you run design implementation, which converts the

    logical design into a physical file format that can be downloaded to the

    selected target device. From Project Navigator, you can run the

    implementation process in one step, or you can run each of the

    implementation processes separately. Implementation processes vary

    depending on whether you are targeting a Field Programmable Gate

    Array (FPGA) or a Complex Programmable Logic Device (CPLD).

    5.1.4 Verification

    You can verify the functionality of your design at several points in

    the design flow. You can use simulator software to verify the

    functionality and timing of your design or a portion of your design. The

    simulator interprets VHDL or Verilog code into circuit functionality and

    displays logical results of the described HDL to determine correct circuit

    operation. Simulation allows you to create and verify complex functions

    in a relatively small amount of time. You can also run in-circuit

    verification after programming your device.

    5.1.5 Device Configuration

    After generating a programming file, you configure your device.

    During configuration, you generate configuration files and download the

    programming files from a host computer to a Xilinx device.

    5.2 Model Sim Overview

    ModelSim is a very powerful simulation environment, and as such

    can be difficult to master. Thankfully with the advent of Xilinx Project

    Navigator 6.2i, the

    Xilinx tools can take care of launching ModelSim to simulate most

    projects. However, a rather large flaw in Xilinx Project Navigator 6.2i is

    its inability to correctly handle test benches which instantiate multiple

    modules. To correctly simulate a test bench which instantiates multiple

    modules, you will need to create and use a ModelSim project manually.

    The steps are fairly simple:

    1. Create a directory for your project

    2. Start ModelSim and create a new project

    3. Add all your Verilog files to the project

    4. Compile your Verilog files

    5. Start the simulation

    6. Add signals to the wave window

    7. Recompile changed Verilog files

    8. Restart/Run the simulation

    ModelSim is a simulation and debugging tool for VHDL, Verilog, and

    mixed-language designs.

    5.2.1 Basic simulation flow

    The following diagram shows the basic steps for simulating a

    design in ModelSim

    Fig.5.1 Simulating design flowchart

    5.2.2 Creating the working library

    In ModelSim, all designs, be they VHDL, Verilog, or some

    combination thereof, are compiled into a library. You typically start a

    new simulation in ModelSim by creating a working library called "work".

    "Work" is the library name used by the compiler as the default destination

    for compiled design units.

    5.2.3 Compiling your design

    After creating the working library, you compile your design units

    into it. The ModelSim library format is compatible across all supported

    platforms. You can simulate your design on any platform without having

    to recompile your design.

    5.2.4 Running the simulation

    With the design compiled, you invoke the simulator on a top-level

    module (Verilog) or a configuration or entity/architecture pair (VHDL).

    Assuming the design loads successfully, the simulation time is set to zero,

    and you enter a run command to begin simulation.

    5.2.5 Debugging your results

    If you don't get the results you expect, you can use ModelSim's

    robust debugging environment to track down the cause of the problem.

    5.3 Project flow

    A project is a collection mechanism for an HDL design under

    specification or test. Even though you don't have to use projects in

    ModelSim, they may ease interaction with the tool and are useful for

    organizing files and specifying simulation settings. The following

    diagram shows the basic steps for simulating a design within a ModelSim

    project

    Fig.5.2 Project Flow

    As you can see, the flow is similar to the basic simulation flow. However,

    there are two important differences:

    You do not have to create a working library in the project flow; it is

    done for you automatically.

    Projects are persistent. In other words, they will open every time you

    invoke ModelSim unless you specifically close them.

    5.4 Multiple library flow

    ModelSim uses libraries in two ways:

    1) As a local working library that contains the compiled version of your

    design;

    2) As a resource library.

    The contents of your working library will change as you update

    your design and recompile. A resource library is typically static and

    serves as a parts source for your design. You can create your own

    resource libraries, or they may be supplied by another design team or a

    third party (e.g., a silicon vendor).

    You specify which resource libraries will be used when the design

    is compiled, and there are rules to specify in which order they are

    searched. A common example of using both a working library and a

    resource library is one where your gate-level design and test bench are

    compiled into the working library, and the design references gate-level

    models in a separate resource library.

    The diagram below shows the basic steps for simulating with

    multiple libraries.

    Fig.5.3 Library Flow

    You can also link to resource libraries from within a project. If you

    are using a project, you would replace the first step above with these two

    steps: create the project and add the test bench to the project.

    5.5 Debugging tools

    ModelSim offers numerous tools for debugging and analyzing your

    design. Several of these tools are covered in subsequent lessons,

    including:

    Setting breakpoints and stepping through the source code

    Viewing waveforms and measuring time

    Viewing and initializing memories

    A project may also consist of,

    HDL source files or references to source files

    Other files such as READMEs or other project documentation

    Local libraries

    References to global libraries

    5.6 VERILOG

    Verilog, standardized as IEEE 1364, is a hardware description

    language (HDL) used to model electronic systems. It is most commonly

    used in the design and verification of digital circuits at the register-

    transfer level of abstraction. It is also used in the verification of analog

    circuits and mixed-signal circuits.

    Verilog HDL is one of the two most common Hardware

    Description Languages (HDL) used by integrated circuit (IC) designers.

    The other one is VHDL. HDLs allow the design to be simulated earlier

    in the design cycle in order to correct errors or experiment with

    different architectures. Designs described in HDL are technology-independent,

    easy to design and debug, and are usually more readable than schematics,

    particularly for large circuits.

    Verilog can be used to describe designs at four levels of

    abstraction:

    (i) Algorithmic level (much like C code with if, case and loop

    statements).

    (ii) Register transfer level (RTL uses registers connected by

    Boolean equations).

    (iii) Gate level (interconnected AND, NOR etc.).

    (iv) Switch level (the switches are MOS transistors inside gates).

    The language also defines constructs that can be used to control the

    input and output of simulation. More recently Verilog is used as an input

    for synthesis programs which will generate a gate-level description (a

    netlist) for the circuit. Some Verilog constructs are not synthesizable.

    Also, the way the code is written will greatly affect the size and speed

    of the synthesized circuit. Most readers will want to synthesize their

    circuits, so non-synthesizable constructs should be used only for test

    benches. These are program modules used to generate the I/O needed to

    simulate the rest of the design. The words "not synthesizable" will be

    used for examples and constructs as needed that do not synthesize.

    There are two types of code in most HDLs:

    Structural, which is a verbal wiring diagram without storage.

    assign a = b & c | d;   /* "|" is an OR */

    assign d = e & (~c);

    Here the order of the statements does not matter. Changing e will change

    a.

    Procedural which is used for circuits with storage, or as a

    convenient way to write conditional logic.

    always @(posedge clk)   // Execute the next statement on every rising clock edge.

        count <= count + 1;

    CHAPTER 6

    MODULES

    6.1 1-D Common Sharing Distributed arithmetic-MST:

    Based on the proposed CSDA algorithm, the coefficients for the

    MPEG-1/2/4, H.264, and VC-1 transforms are chosen to achieve high

    sharing capability for arithmetic resources. To carry out the searching

    flow, software code performs the iterative searching loop under a

    constraint on the minimum number of nonzero elements. In this paper, the

    constraint of minimum nonzero elements is set to five. After the

    software search, the coefficients are given in CSD expression, where

    $\bar{1}$ indicates -1. Note that this choice of shared coefficient is

    obtained under some constraints. Thus, the chosen CSDA coefficient is

    not a globally optimal solution. It is only a local or suboptimal

    solution. Besides, the CSD codes are not the optimal expressions with

    the minimal number of nonzero bits. However, the chosen coefficients of

    CSD expression can achieve high sharing capability for arithmetic

    resources by using the proposed CSDA algorithm. More information about

    CSDA coefficients for MPEG-1/2/4, H.264, and VC-1 transforms.

    Fig 6.1. Architecture of the proposed 1-D CSDA-MST.

    The selected butterfly (SBF) technique is used to obtain the inputs

    of both the even part and the odd part: additions produce the even-part

    inputs and subtractions produce the odd-part inputs. The even part

    corresponds to the four-point H.264 and VC-1 transformations. This

    reduces the complexity of implementing the DCT and the integer

    transform.

    $a = \begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix}, \quad b = \begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix}$    (6.1)

    where a = [a0, a1, a2, a3] and b = [b0, b1, b2, b3] are the

    inputs of the even part and the odd part.
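
    The butterfly stage is simple enough to model in a few lines of

    Python. The sketch below assumes eight inputs x0 to x7 paired as

    (x0, x7), (x1, x6), (x2, x5), (x3, x4), with sums feeding the even part

    and differences feeding the odd part. The second function assumes the

    same add/subtract pattern is applied once more to the even-part

    inputs. The function names are hypothetical.

```python
def butterfly8(x):
    """Split eight inputs into even-part inputs a and odd-part inputs b."""
    if len(x) != 8:
        raise ValueError("expected eight inputs")
    a = [x[i] + x[7 - i] for i in range(4)]   # sums feed the even part
    b = [x[i] - x[7 - i] for i in range(4)]   # differences feed the odd part
    return a, b


def butterfly4(a):
    """Second-level split of the even-part inputs into sums and differences."""
    if len(a) != 4:
        raise ValueError("expected four inputs")
    sums = [a[0] + a[3], a[1] + a[2]]
    diffs = [a[0] - a[3], a[1] - a[2]]
    return sums, diffs


x = [1, 2, 3, 4, 5, 6, 7, 8]
a, b = butterfly8(x)
sums, diffs = butterfly4(a)
print(a, b, sums, diffs)
```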

    6.1.1 Even part common sharing distributed arithmetic circuit:

    The SBF module executes for the eight-point transform and

    bypasses the input data for the two four-point transforms. After the

    SBF module, the CSDA_E and CSDA_O modules execute on the input data a

    and b, respectively. The CSDA_E calculates the even part of the

    eight-point transform, which is similar to the four-point transform of

    the H.264 and VC-1 standards. The CSDA_E architecture has two pipeline

    stages (12-bit and 13-bit). The first stage executes as a four-input

    butterfly matrix circuit, and the second stage then uses the proposed

    CSDA algorithm to share hardware resources among the different

    standards.

    Fig.6.2 Architecture of the even part CSDA circuit

    6.1.2 Odd part common sharing distributed arithmetic circuit:

    Similar to the CSDA_E, the CSDA_O also has two pipeline stages.

    Based on the proposed CSDA algorithm, the CSDA_O efficiently shares

    the hardware resources among the odd part of the eight-point transform

    and the four-point transform for different standards. It contains the

    selection signals of the multiplexers (MUXs) for the different

    standards. The CSDA_E and CSDA_O are followed by eight adder trees with

    error compensation (ECATs), which add the nonzero CSDA coefficients

    with their corresponding weights in a tree-like architecture. The ECAT

    circuits alleviate truncation error efficiently in a small-area design

    when summing all the nonzero data together.

    Fig.6.3. Architecture of the odd part CSDA circuit.

    6.1.3 ECAT

    The CSDA_E and CSDA_O are followed by eight adder trees with error

    compensation (ECATs), which add the nonzero CSDA coefficients with

    their corresponding weights in a tree-like architecture. The ECAT

    circuits alleviate truncation error efficiently in a small-area design

    when summing all the nonzero data together.

    Fig.6.4. Architecture of ECAT

    6.1.4 Permutation

    The eight outputs of the ECATs are given directly to the permutation

    module. Permutation is the act of rearranging, or permuting, all the

    members of a set into some sequence or order (unlike combinations,

    which are selections of some members of the set where order is

    disregarded). It is used to encode the output matrix.

    Fig.6.5 Permutation concept
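
    A permutation stage of this kind can be modelled as an index lookup.

    In the Python sketch below, the grouping of the adder-tree outputs

    (even-part results first, then odd-part results) and the resulting

    order list are assumptions made for illustration. The text does not

    state the actual wiring, and the function name permute is hypothetical.

```python
def permute(values, order):
    """Return values rearranged so that output[k] = values[order[k]]."""
    if sorted(order) != list(range(len(values))):
        raise ValueError("order must be a permutation of the input indices")
    return [values[i] for i in order]


# Assumed arrangement: the adder trees deliver Z0, Z4, Z2, Z6 (even part)
# followed by Z1, Z3, Z5, Z7 (odd part); the permutation restores Z0..Z7.
grouped = ["Z0", "Z4", "Z2", "Z6", "Z1", "Z3", "Z5", "Z7"]
order = [0, 4, 2, 5, 1, 6, 3, 7]
natural = permute(grouped, order)
print(natural)
```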

    6.2 2D CSDA CORE DESIGN

    6.2.1 Mathematical Derivation of Eight-Point and Four-Point

    Transforms

    This section introduces the proposed 2-D CSDA-MST core

    implementation. Neglecting the scaling factor, the one-dimensional (1-D)

    eight-point transform can be defined as follows

    (6.1)

    (6.2)

    Fig.6.6 2D CSDA core with TMEM

    Because the eight-point coefficient structures in the MPEG-1/2/4,

    H.264, and VC-1 standards are the same, the eight-point transform for

    these standards can use the same mathematical derivation. According to

    the symmetry property, the 1-D eight-point transform can be divided

    into two four-point transforms, the even part Ze and the odd part Zo,

    as given in (6.3).

    (6.3)

    The even part of the operation in (6.3) is the same as that of the four-point

    H.264 and VC-1 transformations. Moreover, the even part Ze can be

    further decomposed into even and odd parts: Zee and Zeo

    (6.4)

    6.2.2 TMEM

    The TMEM is implemented using a 64-word, 12-bit dual-port buffer

    and has a latency of 52 cycles. Based on the time scheduling strategy,

    the 1st-D and 2nd-D transforms can be computed simultaneously. The

    transposition memory is an 8×8 buffer array with a data width of

    16 bits and is shown in Fig.6.7.

    Fig.6.7 TMEM
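
    The role of the transposition memory in a row-column 2-D transform

    can be illustrated with a small Python model. This is a sketch under

    assumptions, not the core's implementation: it uses a 4×4 block for

    brevity, takes the well-known H.264 4×4 forward core transform matrix

    as the example coefficient matrix, and ignores scaling, bit widths and

    timing. All function names are hypothetical.

```python
def transform_1d(c, v):
    """Multiply matrix c by vector v (one 1-D transform)."""
    return [sum(cij * vj for cij, vj in zip(row, v)) for row in c]


def transpose(m):
    """Swap rows and columns, as the transposition memory does."""
    return [list(col) for col in zip(*m)]


def transform_2d(c, block):
    """Row-column 2-D transform, equal to C * block * C^T."""
    first = [transform_1d(c, row) for row in block]     # 1st-D pass on rows
    stored = transpose(first)                           # transposition memory
    second = [transform_1d(c, row) for row in stored]   # 2nd-D pass
    return transpose(second)    # second pass yields the result transposed


def matmul(a, b):
    """Direct matrix product, used only to check the sketch."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


C4 = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]
X = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
Y = transform_2d(C4, X)
print(Y == matmul(matmul(C4, X), transpose(C4)))
```

    Because the second pass only needs the first-pass results in

    transposed order, the same 1-D core structure can serve both

    dimensions.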

    CHAPTER 7

    RESULT ANALYSIS

    7.1 COMPARISION WITH EXISTING SYSTEMS

    Table 7.1 compares the proposed system with the existing systems.

    Using a buffer instead of a pipeline register considerably reduces the

    delay, which increases the speed in line with the reduction in the

    number of adders. The gate count is reduced to 27K, whereas in the

    existing system it is higher at 30K. Because fewer adders are used,

    the power consumption falls to 26 mW, while the existing system

    consumes 46.3 mW. The proposed system also supports the multiple

    standards.

    Table 7.1 Measured Results

    Measured results          Huang    Lai      Lee      Chang    Lee      Existing  Proposed
                              et al.   et al.   et al.   et al.   et al.   CSDA      CSDA
                              [5]      [8]      [15]     [9]      [10]
    Gate counts (NAND2)       39.8K    55.6K    36.6K    39.1K    36.8K    30K       27K
    Supporting standards:
      MPEG-1/2/4: 8×8
      H.264: 8×8, 4×4 (L), 4×4 (H)
      VC-1: 8×8, 8×4, 4×8, 4×4
    Power consumption (mW)    38.7     3.4      N/A      N/A      N/A      46.3      26

    - represents non supported standards

    -represents supported standards

    In the above table, the measured results are compared with the

    existing results. In the proposed CSDA, the gate count is lower than

    in the existing CSDA. The power consumption measured for the proposed

    CSDA is 26 mW, which is also lower than that of the existing CSDA.

    7.2 MUX SELECTION INPUTS

    Table 7.2 Selection Inputs For Different Standards

    Video codec   Dimensions   MUX   MUX-1   MUX-2   MUX-3   MUX-4   MUX-5   MUX-6
    standards
    MPEG          8            1     1       1       1       1       0       1
    H.264         8            1     0       0       1       0       0       0
                  4(H)         0     0       0       0       0       1       1
                  4(I)         0     0       0       0       0       1       1
    VC-1          8            1     0       1       0       0       1       1
                  4            0     0       1       0       1       1       1

    These are the selection inputs which are given to the individual standards.

    The desired standard can be obtained using the MUX selection.
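
    For a software test bench or model, the selection codes of Table 7.2

    can be kept in a small lookup table. The Python sketch below copies the

    bit patterns from the table in the column order MUX, MUX-1, ..., MUX-6.

    The dictionary layout and the helper name mux_select are assumptions

    made for illustration.

```python
# Selection codes copied from Table 7.2 (MUX, MUX-1, ..., MUX-6).
MUX_SELECT = {
    ("MPEG", "8"): "1111101",
    ("H.264", "8"): "1001000",
    ("H.264", "4(H)"): "0000011",
    ("H.264", "4(I)"): "0000011",
    ("VC-1", "8"): "1010011",
    ("VC-1", "4"): "0010111",
}


def mux_select(standard, dimension):
    """Return the seven selection bits for a standard and transform size."""
    code = MUX_SELECT[(standard, dimension)]
    return [int(bit) for bit in code]


print(mux_select("VC-1", "8"))
```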

    7.3 MPEG SIMULATION RESULT

    Applying the binary selection inputs (1111101) to the seven MUXs

    for the eight-point transform gives the MPEG simulation output

    shown in Fig.7.1.

    Fig.7.1 simulation result for MPEG

    7.4 H.264 SIMULATION RESULT

    Applying the binary selection inputs to the seven MUXs, (1001000)

    for the eight-point transform and (0000011) for the four-point

    transform, gives the H.264 simulation output shown in Fig.7.2.

    Fig.7.2 simulation result for H.264

    7.5 VC-1 SIMULATION RESULT

    Applying the binary selection inputs to the seven MUXs, (1010011)

    for the eight-point transform and (0010111) for the four-point

    transform as listed in Table 7.2, gives the VC-1 simulation output

    shown in Fig.7.3.

    Fig.7.3 simulation result for VC-1

    7.6 RTL SCHEMATIC VIEW OF ENTIRE PROCESS

    The synthesis was run with Xilinx 13.2 software. The proposed MST

    core employs distributed arithmetic and factor sharing schemes,

    combined as common sharing distributed arithmetic (CSDA), to reduce

    hardware cost and delay.

    Fig.7.4 RTL view of whole 2D-CSDA architecture

    Fig 7.5 RTL inner view of 2D CSDA

    Fig.7.6 RTL inner view of 1D CSDA

    7.7 SYNTHESIS REPORT FOR OUTPUT

    Fig.7.7 Output for 2-D Common Sharing Distributed

    arithmetic-MST delay

    7.8 POWER ANALYZER OUTPUT

    Fig.7.8 Output for 2-D Common Sharing Distributed

    arithmetic-MST Power

    7.9 DEVICE UTILIZATION SUMMARY

    Fig.7.9 Output for 2-D Common Sharing Distributed

    arithmetic-MST Gate count

    CHAPTER 8

    CONCLUSION

    The CSDA-MST core achieves high performance, combining a high

    throughput rate with a low-cost VLSI design, and supports the

    MPEG-1/2/4, H.264, and VC-1 MSTs. The proposed CSDA method

    efficiently reduces the number of adders and MUXs in the MST core.

    Measured results from synthesis and simulation show that the CSDA-MST

    core uses 27k logic gates and consumes 26 mW of power. The core

    reaches a throughput rate of 1.28 G-pixels/s, which can support the

    digital cinema format (4928 × 2048 @ 24 Hz). Because visual media

    technology has advanced rapidly, this approach will help meet rising

    high-resolution specifications and future needs as well.
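
    As a rough cross-check of the throughput figure, the sketch below compares the pixel rate required by the stated format with the reported 1.28 G-pixels/s. The function names and the `samples_per_pixel` parameter are hypothetical. One sample per pixel counts luma only, and three samples per pixel is an assumed, more demanding case with full-resolution chroma; the report does not state which sampling format was used.

```python
REPORTED_THROUGHPUT = 1.28e9  # pixels per second, as reported for the core


def required_rate(width, height, fps, samples_per_pixel=1):
    """Samples per second needed to process every frame in real time."""
    return width * height * fps * samples_per_pixel


def is_supported(width, height, fps, samples_per_pixel=1,
                 throughput=REPORTED_THROUGHPUT):
    """True when the required rate does not exceed the given throughput."""
    return required_rate(width, height, fps, samples_per_pixel) <= throughput


luma_rate = required_rate(4928, 2048, 24)
print(luma_rate)                          # 242221056 samples/s (luma only)
print(is_supported(4928, 2048, 24))       # luma only
print(is_supported(4928, 2048, 24, 3))    # assumed three samples per pixel
```

    Under either assumption the required rate stays below the reported throughput, which is consistent with the claim that the core can support this format.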

    REFERENCES

    1. Chang.H, Kim.S, Lee.S, and Cho.K, Nov. [2009]. Design of

    area-efficient unified transform circuit for multi-standard video

    decoder, in Proc. IEEE Int. SoC Design Conf., pp. 369–372.

    2. Chen.Y.H, Chang.T.Y, and Li.C.Y, Apr. [2011]. High throughput

    DA-based DCT with high accuracy error-compensated adder

    tree, IEEE Trans. Very Large Scale Integr. (VLSI) Syst.,

    vol. 19, no. 4, pp. 709–714.

    3. Hoang.D.T and Vitter.J.S, [2001]. Efficient Algorithms for

    MPEG Video Compression. New York, USA: Wiley.

    4. Huang.C.Y, Chen.L.F, and Lai.Y.K, May [2008]. A high-speed

    2-D transform architecture with unique kernel for multi-standard

    video applications, in Proc. IEEE Int. Symp. Circuits

    Syst., pp. 21–24.

    5. Lai.Y.K and Lai.Y.F, Aug. [2010]. A reconfigurable IDCT

    architecture for universal video decoders, IEEE Trans.

    Consum. Electron., vol. 56, no. 3, pp. 1872187.

    6. Lee.S and Cho.K, Feb. [2008]. Architecture of transform

    circuit for video decoder supporting multiple standards,

    Electron. Lett., vol. 44, no. 4, pp. 2742758.

    7. Uramoto.S, Inoue.Y, Takabatake.A, Takeda.J, Yamashita.Y,

    Terane.T, and Yoshimoto.M, Apr. [1992]. A 100-MHz 2-D

    discrete cosine transform core processor, IEEE J. Solid-State

    Circuits, vol. 27, no. 4, pp. 492–499.

    8. Hwangbo.W and Kyun.C.M, Apr. [2010]. A multitransform

    architecture for H.264/AVC high-profile coders, IEEE Trans.

    Multimedia, vol. 12, no. 3, pp. 157–162.