
High Performance Image Processing Solution with Intel® Platform Technology

Yang Lu

Intel Corporation

2015.

White Paper: High Performance Image Processing Solution with Intel® Platform Technology


Intel disclaims all express and implied warranties, including without limitation, the implied

warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any

warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All

information provided here is subject to change without notice. Contact your Intel representative to

obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause

deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be

obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm. Intel and the

Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others

© 2015 Intel Corporation.

Software and workloads used in performance tests of this paper may have been optimized for

performance only on Intel microprocessors. Performance tests are measured using specific computer

systems, components, software, operations and functions. Any change to any of those factors may

cause the results to vary. You should consult other information and performance tests to assist you in

fully evaluating your contemplated purchases, including the performance of that product when

combined with other products. For more information go to http://www.intel.com/performance.


Contents

1. Abstract
2. Image Processing Introduction
3. Performance Characteristics
   3.1 Image Processing Performance Overview
   3.2 Simultaneous Multithreading and Turbo Boost
   3.3 Microarchitecture Characteristics
4. High Performance Solution Based on Intel® Xeon™ Platform
   4.1 Image Compression Tuning
   4.2 Image Scaling Program Tuning
       4.2.1 Down-Sampling Algorithm
       4.2.2 Intel® High Performance Tools
5. WebP Image Processing
6. Summary
Reference

Author contacts Yang Lu, Senior Application Engineer <[email protected]>



1. Abstract

With the increasing popularity of internet and media cloud applications, a huge volume of image data is generated, used and shared every day, which presents major computing and storage challenges to media-related companies. In this paper, we study the most popular image processing techniques, analyze their performance challenges, explore the tuning methodology, and implement effective solutions on the Intel Architecture (IA) platform. We aim to maximize the capabilities of IA platforms for typical image processing workloads and to achieve the best performance and efficiency for the internet and media industry.

2. Image Processing Introduction

One picture is worth a thousand words. People increasingly upload pictures to social networking services (SNS) to describe their status, feelings and events. Sellers upload many distinct pictures to describe and advertise their products on B2B, B2C and C2C platforms. We can even search for knowledge or information by image in popular search engines. Images of all kinds fill every corner of our lives. This flood of images presents a computing and storage challenge: how can those images be processed, compacted, stored, managed and transmitted effectively, and which platforms and technologies are most efficient for image processing? In this paper, we analyze the typical image processing framework and its performance characteristics, explore the most effective IA technologies for maximizing image processing performance, and propose the best solution in terms of processing performance and efficiency.

Generally, most companies need to scale and edit the images that customers upload or the media content they purchase from third parties. Typical operations are scaling the original images to the dimensions that fit different terminal devices, compressing the images to a target format that further saves storage and network bandwidth, and editing the images (adding a logo or watermark) to meet business requirements. Figure 1 shows the typical image processing flow that most companies adopt.



Figure 1: Typical Image Processing Flow for Cloud Applications

The following software stacks are the ones most commonly adopted for this processing:

| Software | Type | License | Developed by | Language | OS |
|---|---|---|---|---|---|
| ImageMagick | Image manipulation | Apache 2.0 License | ImageMagick Studio LLC | C | Cross platform |
| GraphicsMagick | Fork from ImageMagick version 5.5.2, emphasizing stability and performance | MIT License | GraphicsMagick Group | C | Cross platform |
| OpenCV | Computer vision library and framework | BSD License | Intel Corporation, Willow Garage, Itseez | C/C++ | Cross platform |
| OpenGL | Graphics application API, aiming to achieve hardware-accelerated rendering | SGI Open Source License and Trademark License | Formerly: OpenGL Architecture Review Board (ARB); now: Khronos Group | C | Cross platform |

Table 1: Common Image Processing Software Stack


Currently, most media companies choose image formats such as JPEG, GIF and PNG to store and distribute images, balancing compression ratio against flexible presentation of internet content. Table 2 lists the characteristics, usage models, advantages and disadvantages of the common image formats.

| Image Format | Name | File Extensions | Developed by | Licensed | Lossless | Animation Support | Transparency Support | Usage | Pros | Cons | Browser Support (without plugin) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BMP | Windows Bitmap | .bmp .dib | Microsoft | No | Yes | No | Yes | | | Large file size | No |
| GIF | Graphics Interchange Format | .gif | CompuServe | No (expired) | Yes | Yes | Yes | Animation | Animation; widely supported format; transparency support | Limited to 256 colors | IE, Firefox, Chrome, Safari, Opera |
| JPEG | Joint Photographic Experts Group | .jpg .jpeg .jpe .jif .jfif .jfi | | No (invalid) | No | No | No | Photography | Small file size; widely supported format | Lossy compression | |
| JPEG2000 | Joint Photographic Experts Group 2000 | .jp2 .j2c .j2k .jpx .jpf | | Yes | Yes (lossless and lossy) | Yes (ISO/IEC 15444-2) | Yes | JPEG replacement, HD imaging | Small file size | Computing intensive | Safari |
| MNG | Multiple-image Network Graphics | .mng | W3C (donated by PNG Development Group) | | Yes | Yes | Yes | Animation | | Not widely supported | No |
| PNG | Portable Network Graphics | .png | W3C (donated by PNG Development Group) | No | Yes | No | Yes | Icons | Lossless; widely supported format; transparency support | | |
| PSD | Photoshop Document | .psd .pdd | Adobe Systems | | Yes | No | Yes | Image editing | Lossless; layers support; transparency support | | No |
| RAW | RAW image file | .crw .cr2 .raw .rw2 .nef .nrw .orf ... | Camera manufacturer | | Yes | No | No | HDR photography, archiving | Lossless | Large file size | No |
| TIFF | Tagged Image File Format | .tiff .tif | Adobe | | Yes | No | No | Images from scanner, HD imaging | Lossless | Large file size | No |
| WebP | WebP | .webp | Google | | No | No | No | | Small file size | Lossy compression | Chrome, Opera |

Table 2: Common Image Standard and Format



Processing so many types of images efficiently, each with its own dedicated processing, is a big challenge for backend clusters. End users generate and upload all kinds of images every day, and new media content is created and distributed frequently, so the image processing clusters are always under heavy load. In the following sections we analyze which IA technologies matter most for image processing applications and how to take advantage of them to improve image processing performance and, ultimately, to benefit media-related businesses.

3. Performance Characteristics

3.1 Image Processing Performance Overview

Image processing is a typical compute-intensive workload that consumes a lot of CPU and memory resources. The performance of image processing applications depends heavily on the capabilities of the CPU cores, the cache and the memory bandwidth. As examples, we ran traditional image scaling, compression and rotation applications on different IA platforms, from the low-end to the high-end processors listed in Table 3, and obtained the performance results shown in Figure 2.

| Intel® Xeon™ Platform | E5-2620 v2 | E5-2640 v2 | E5-2697 v2 | E5-2697 v3 (HSW) |
|---|---|---|---|---|
| # of Cores | 6 | 8 | 12 | 14 |
| # of Threads | 12 | 16 | 24 | 28 |
| Clock Speed | 2.1 GHz | 2 GHz | 2.7 GHz | 2.3 GHz |
| Max Turbo Frequency | 2.6 GHz | 2.5 GHz | 3.5 GHz | 4.0 GHz |
| Intel® Smart Cache | 15 MB | 20 MB | 30 MB | 35 MB |
| Intel® QPI Speed | 7.2 GT/s | 7.2 GT/s | 8 GT/s | 9.6 GT/s |

Table 3: Intel® Xeon™ Ivy Bridge and Haswell Platform Configuration

The performance metric here is processing time; lower is better. Figure 2 shows that processing performance improves as CPU frequency, cache size and core count increase, and as the IA architecture is upgraded. Therefore, high-bin processors are more efficient for image processing applications.



Figure 2: Image Processing Applications Performance at Different Platforms

We also ran the traditional image scaling application on different Haswell platforms and used IPP (Intel® Integrated Performance Primitives)[6] to optimize its performance. Figure 3 shows that image scaling performance increases with CPU frequency and cache capacity, and that CPU frequency plays the more important role in this application.

Figure 3: Image Scaling Application Performance at Haswell Platforms

[Figure 2 chart: image processing performance at different IA platforms; time in seconds (y-axis from 1.5 to 4.5) for rotate, scaling and compress on E5-2620 v2, E5-2640 v2, E5-2697 v2 and E5-2695 v3 (HSW).]

Figure 3 data: image scaling time (ms) at Haswell platforms

| Processor | Original | IPP tuning |
|---|---|---|
| E5-2670 v3 (2.30 GHz, 30 MB cache) | 233 | 192 |
| E5-2699 v3 (2.30 GHz, 45 MB cache) | 221 | 158 |
| E5-2697 v3 (2.60 GHz, 35 MB cache) | 207 | 152 |



3.2 Simultaneous Multithreading and Turbo Boost

Intel® Simultaneous Multithreading (SMT), also called Hyper-Threading (HT), and Intel® Turbo Boost are two key technologies that the IA platform provides. They are supported on most IA platforms and contribute substantial speedups in many media-related applications.

− SMT makes the operating system address two virtual (logical) cores for each physical core and shares the execution resources between them when possible. The main function of Hyper-Threading is to decrease the number of dependent instructions on the pipeline. It improves performance when the CPU cores are heavily loaded, but not in every application: when cores would otherwise stay idle, SMT only introduces task/thread switching overhead.

− Intel® Turbo Boost increases performance by translating temperature, power and current headroom into higher frequency. The actual Turbo Boost frequency is determined by the number of active cores (in C0 state), the type of workload, the estimated power consumption, and the processor temperature. For most customers, Intel® Turbo Boost Technology has a positive impact on application performance, because it provides opportunistic frequency upside above the rated frequency when conditions allow, and no action is required. For customers that need deterministic processor frequency, it is recommended to disable Intel® Turbo Boost Technology.

Figure 4 shows one example of the contribution of SMT and Turbo Boost to the image scaling workload. SMT provides a speedup of around 30%, and Turbo Boost improves performance by about 5%. Both technologies improve performance noticeably and have been widely adopted in customer image processing applications.

Figure 4: SMT and Turbo Boost Contribution for Image Processing Application


3.3 Microarchitecture Characteristics

From the processor microarchitecture perspective, a well-tuned compute-intensive workload should keep the relevant execution units almost fully busy, with few cache misses, a low branch misprediction ratio, and a good CPI (cycles per instruction) ratio. Generally, a CPI below 1 is the ideal state. Figure 5 shows the microarchitecture profiling data of a well-tuned image scaling application. It also shows that the image scaling application consumes a lot of memory bandwidth and resources, so memory capability is also very important for the performance of image processing applications.

Figure 5: Processor Micro Architecture Characters for Image Processing
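As a minimal sketch of how the ratios discussed above are derived from raw counter values: the structure and field names below are hypothetical and do not correspond to any specific profiler's output; they only illustrate the arithmetic.

```c
#include <stdint.h>

/* Hypothetical snapshot of hardware counters for one profiling run.
 * The field names are illustrative assumptions, not a profiler API. */
typedef struct {
    uint64_t cycles;         /* unhalted core cycles          */
    uint64_t instructions;   /* retired instructions          */
    uint64_t branches;       /* retired branch instructions   */
    uint64_t branch_misses;  /* mispredicted branches         */
} perf_counters;

/* Cycles per instruction; returns 0.0 when no instruction was retired. */
double cpi(const perf_counters *c)
{
    return c->instructions ? (double)c->cycles / (double)c->instructions : 0.0;
}

/* Fraction of branches that were mispredicted (0.0 when there were none). */
double branch_miss_ratio(const perf_counters *c)
{
    return c->branches ? (double)c->branch_misses / (double)c->branches : 0.0;
}
```

With these definitions, a run that retires 1000 instructions in 800 cycles has a CPI of 0.8, which falls in the "less than 1" range described above.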

With these performance characteristics of image processing applications in mind, the following sections analyze the IA platform technologies and investigate how to use them to benefit image processing applications.


4. High Performance Solution Based on Intel® Xeon™ Platform

As illustrated in section 3, image processing is a typical CPU- and memory-intensive workload that places high demands on the server platform, such as core computing efficiency, reliability, and stability. In this section, we introduce the key IA technologies that can bring a significant performance boost to image processing.

4.1 Image Compression Tuning

Most images are stored and distributed in a compressed format such as JPEG, GIF or PNG, so image compression/decompression is one of the most critical modules in an image processing cluster. Every image consists of several [width x height] matrices, and image processing is essentially a set of matrix computations. This kind of calculation can be optimized with vectorization technology such as IA SIMD (single instruction, multiple data) instructions, which perform the same operation on multiple data elements simultaneously, as shown in Figure 6. This greatly improves data throughput and execution efficiency.

Figure 6: SIMD Methodology
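To make the idea concrete, the following sketch brightens 8-bit samples with a scalar loop and with SSE2 intrinsics that process 16 samples per instruction. The function names are illustrative assumptions and are not taken from any library discussed in this paper; the SSE2 path is compiled only when the compiler defines `__SSE2__`, otherwise the scalar loop handles everything.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Scalar reference: add delta to every 8-bit sample, clamping at 255. */
void brighten_scalar(uint8_t *px, size_t n, uint8_t delta)
{
    for (size_t i = 0; i < n; i++) {
        unsigned v = (unsigned)px[i] + delta;
        px[i] = (uint8_t)(v > 255u ? 255u : v);
    }
}

/* SIMD version: one saturating add handles 16 samples at a time. */
void brighten_simd(uint8_t *px, size_t n, uint8_t delta)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i d = _mm_set1_epi8((char)delta);   /* 16 copies of delta */
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(px + i));
        v = _mm_adds_epu8(v, d);                    /* unsigned saturating add */
        _mm_storeu_si128((__m128i *)(px + i), v);
    }
#endif
    brighten_scalar(px + i, n - i, delta);          /* remaining tail samples */
}
```

Both functions produce identical output; the SIMD version simply issues one instruction per 16 samples instead of one add and one compare per sample.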

SIMD instructions are widely supported in x86 processors and have evolved from MMX through SSE and AVX to AVX2 across successive x86 platform generations. Many projects use these vector instructions to optimize image compression and decompression. libjpeg-turbo[3] is an open source project that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86 platforms. Generally, libjpeg-turbo provides a 2-4x speedup for JPEG image compression and decompression. Figure 7 shows one example in which we used the libjpeg-turbo library to optimize a JPEG image compression application via SIMD technology.

Figure 7: JPEG Image Compression Tuning by libjpeg-turbo with SIMD

4.2 Image Scaling Program Tuning

As illustrated in section 2, image scaling is another hot function in the image processing cluster, since most companies scale the original images to the target resolution and format when they receive the image source. The image scaling application can be optimized at both the algorithm level and the code level.

4.2.1 Down-Sampling Algorithm

As shown in Figure 8, a traditional image scaling application first decodes the original compressed image to raw data, then scales it down or up as the application requires, and finally saves the result in the target format.



Figure 8: Image Scaling Application

A. N. Skodras and C. A. Christopoulos proposed the algorithm "Down-Sampling of Compressed Images in the DCT Domain"[8]: during the decompression stage, only the most important part of the data is sampled and decoded instead of the full-size image, which saves a lot of computing time and improves performance. JPEG uses an 8x8 DCT compression algorithm, so a JPEG image can be decompressed directly to n/8 of its size (n=4,2,1). In this way, we can decompress the image to a suitable n/8 resolution instead of the original full size.
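A minimal sketch of how a decoder front end might pick the reduction factor: choose the largest reduction (1/8, 1/4, 1/2 or none) whose decoded size still covers the requested target, so that the final resize only ever shrinks. The type and function names are hypothetical; in libjpeg-based decoders the chosen ratio would typically be passed through the `scale_num` and `scale_denom` fields of the decompression structure, which this sketch does not touch.

```c
#include <assert.h>

/* Result of the selection: decode at 1/denom of the source size. */
typedef struct {
    int denom;   /* 1, 2, 4 or 8 */
    int width;   /* decoded width  = ceil(src_w / denom) */
    int height;  /* decoded height = ceil(src_h / denom) */
} dct_scale;

/* Pick the largest DCT-domain reduction that still covers the target.
 * Falls back to full-size decoding (denom = 1) when the target is not
 * smaller than the source. */
dct_scale choose_dct_scale(int src_w, int src_h, int dst_w, int dst_h)
{
    assert(src_w > 0 && src_h > 0 && dst_w > 0 && dst_h > 0);
    dct_scale s = { 1, src_w, src_h };
    for (int denom = 8; denom >= 1; denom /= 2) {
        int w = (src_w + denom - 1) / denom;   /* round up */
        int h = (src_h + denom - 1) / denom;
        if (w >= dst_w && h >= dst_h) {
            s.denom = denom;
            s.width = w;
            s.height = h;
            break;
        }
    }
    return s;
}
```

For a 4352x3264 source and a 500x375 target, this selects 1/8 and a decoded size of 544x408; for a 600x450 target it selects 1/4, because 544 would be narrower than the target.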

With the ImageMagick[10] framework, for example, a source image with a resolution of 4352x3264 can be scaled to a target of 500x375 as follows:

o Step 1: decompress the source image to (4352/8) x (3264/8) = 544x408

o Step 2: zoom the image to 500x375

These two steps can be implemented by ImageMagick with the following command:

# convert -size 500x375 -scale 500x375 source.jpg target.jpg

(-size 500x375: ImageMagick will decompress to 544x408 automatically)

The performance is shown in Table 4:

| Scaling Method | Time (s) | Comments |
|---|---|---|
| # convert -scale 500x375 source.jpg target.jpg | 0.825 | Original full-size decompression and scaling |
| # convert -size 500x375 -scale 500x375 source.jpg target.jpg | 0.230 | With the -size option, the IM framework first decompresses to 544x408 automatically and then scales to 500x375 |

Performance speedup: 0.825/0.230 = 3.6x

Table 4: down-sampling algorithm performance

In some ImageMagick versions, the "-size" parameter is not well supported. In that case, the most stable solution is to modify the code in the file "ImageMagick_DIR/coders/jpeg.c" as follows (the GetImageOption line is commented out and replaced by the line after it):

    static Image *ReadJPEGImage(const ImageInfo *image_info,ExceptionInfo *exception)
    {
      …
      if (units == 2)
        image->units=PixelsPerCentimeterResolution;
      number_pixels=(MagickSizeType) image->columns*image->rows;
      //option=GetImageOption(image_info,"jpeg:size");
      //change this line to the following line
      option=image_info->size;
      if (option != (const char *) NULL) {
        …
      }
      …
    }

4.2.2 Intel® High Performance Tools

To maximize the capabilities of the IA platform and help customers use and deploy advanced IA technologies, Intel has developed a full set of high performance tools and libraries[4]. They cover all IA-based client and server platforms, many operating systems (Windows*, Linux*, OS X* and Android), and various domains, such as system profiling, compilers, math kernel libraries, cluster analysis, graphics SDKs, and multithreaded programming tools. The Intel® Compiler and IPP (Intel® Integrated Performance Primitives) have been widely used to optimize image processing applications.

4.2.2.1 Intel® Compiler

ICC (Intel® Compiler)[5] automatically generates optimized code for all IA-based platforms, including auto-vectorization and auto-parallelization, memory and cache-line tuning, and a series of high-level optimizations based on the advanced features of Intel architecture. It looks for a way to complete each task in the fewest CPU cycles, and it is compatible with Microsoft Visual C++ on Windows and GCC (GNU Compiler Collection) on Linux.
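Whether a compiler can auto-vectorize a loop depends largely on how the loop is written. The sketch below is an illustrative watermark blend, not code from the applications measured in this paper: it uses contiguous arrays, a simple counted loop, and `restrict`-qualified pointers so that the compiler can assume the buffers do not overlap. Vectorization is still a compiler decision and is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

/* Blend a watermark row over an image row.
 * alpha is in the range 0..256: 0 keeps the image, 256 keeps the watermark.
 * out[i] = (img[i] * (256 - alpha) + mark[i] * alpha) / 256               */
void blend_row(uint8_t *restrict out,
               const uint8_t *restrict img,
               const uint8_t *restrict mark,
               size_t n, unsigned alpha)
{
    /* No function calls, no data-dependent branches, unit stride:
     * the properties auto-vectorizers generally look for. */
    for (size_t i = 0; i < n; i++) {
        out[i] = (uint8_t)((img[i] * (256u - alpha) + mark[i] * alpha) >> 8);
    }
}
```

Building the same source with different compilers at the same optimization level is then a fair way to compare the code they generate for such loops.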

Figures 9 and 10 show two examples of ICC's contribution to the image scaling application, in single-thread/single-core and multi-thread/multi-core scenarios respectively. With GCC replaced by ICC and the same optimization switches, ICC provides a distinct performance improvement.


Figure 9: icc contribution for single thread image scaling application

Figure 10: icc contribution for multi thread image scaling application

Generally, for applications running on Intel architecture, ICC can help customers obtain further performance improvement with little extra effort.

4.2.2.2 Intel® Integrated Performance Primitives

IPP (Intel® Integrated Performance Primitives)[6] exploits thread-level parallelism and the Intel architecture instruction sets for the following applications and algorithms:

1) Image, video and audio processing

2) Data communication

3) Data compression and encryption

4) Signal processing, etc.

IPP functions achieve significant performance improvements through the following technologies:

− thread-level parallelism
    o Multi-Core
    o Hyper-Threading

− instruction set architecture
    o SIMD vectorization instructions: MMX, SSE, AVX
    o processing data in larger chunks with each instruction
    o any new instructions based on the IA architecture

− microarchitecture
    o pre-fetching data and avoiding cache blocking
    o resolving data and trace cache misses
    o avoiding branch mis-predictions
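As an illustration of the thread-level parallelism item above, one common approach is to split the destination image into horizontal strips and give each worker thread one strip. This is a generic sketch and an assumption on our part, not a description of how IPP partitions work internally; the names are hypothetical.

```c
#include <assert.h>

/* A contiguous strip of image rows assigned to one worker. */
typedef struct {
    int first;  /* index of the first row in the strip */
    int count;  /* number of rows in the strip         */
} row_range;

/* Split `height` rows as evenly as possible among `workers` workers and
 * return the strip for worker `index` (0-based). The first
 * (height % workers) workers receive one extra row. */
row_range rows_for_worker(int height, int workers, int index)
{
    assert(height >= 0 && workers > 0 && index >= 0 && index < workers);
    int base = height / workers;
    int extra = height % workers;
    row_range r;
    r.first = index * base + (index < extra ? index : extra);
    r.count = base + (index < extra ? 1 : 0);
    return r;
}
```

Each worker can then run the scaling kernel on its own strip without sharing output rows with other workers, which avoids write contention between cores.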

Here we take a common image scaling workload to demonstrate how to use the IPP library to replace the original implementation and achieve better performance on the IA platform:

    #include "lanczos.h"   /* project header: Lanczos filter helpers and OpenCV types */
    #include <ipp.h>
    #include <time.h>
    #include <stdlib.h>

    //---------- Original Scaling Implementation ----------//
    void original_scaling(IplImage *src, IplImage *dst, int width, int height)
    {
        double x_factor = (double)width / src->width;
        double y_factor = (double)height / src->height;
        long span = 0;
        unsigned int status;
        CvSize filter_size;
        IplImage *filter_image = NULL;

        // set the value of the filter structure
        LanczosResizeFilter *resize_filter = AcquireLanczosResizeFilter();

        // choose the pass order: horizontal first or vertical first
        int horizontal_first = (x_factor * y_factor) > WorkLoadFactor;

        // create the temporary matrix that holds the result of the first pass
        if (horizontal_first)
        {
            filter_size.width = width;
            filter_size.height = src->height;
        }
        else
        {
            filter_size.width = src->width;
            filter_size.height = height;
        }
        filter_image = cvCreateImage(filter_size, src->depth, src->nChannels);

        cv::Mat srcMat = cv::cvarrToMat(src);
        cv::Mat filterMat = cv::cvarrToMat(filter_image);
        cv::Mat dstMat = cv::cvarrToMat(dst);

        // compute the pixels of the destination matrix in two separable passes
        if (horizontal_first)
        {
            span = (long)filter_image->width + height;
            status = lanczosHorizontalFilter(resize_filter, &srcMat, &filterMat,
                                             x_factor, span);
            status &= lanczosVerticalFilter(resize_filter, &filterMat, &dstMat,
                                            y_factor, span);
        }
        else
        {
            span = (long)filter_image->height + width;
            status = lanczosVerticalFilter(resize_filter, &srcMat, &filterMat,
                                           y_factor, span);
            status &= lanczosHorizontalFilter(resize_filter, &filterMat, &dstMat,
                                              x_factor, span);
        }
        (void)status;  // the result of the two passes is not checked in this listing

        // free the temporary matrix and the filter
        cvReleaseImage(&filter_image);
        DestoryLanczosResizeFilter(resize_filter);  // name kept as in the original listing
    }

    //------------------------- IPP Code -------------------------//
    void ipp_scaling(IplImage *src, IplImage *dst, int width, int height)
    {
        double x_factor = (double)width / src->width;
        double y_factor = (double)height / src->height;

        ippSetNumThreads(1);

        // define the IPP parameters
        IppiRect srcRoi = {0, 0, src->width, src->height};
        IppiRect dstRoi = {0, 0, width, height};
        IppiSize srcSize = {src->width, src->height};
        int interpolation = IPPI_INTER_LANCZOS;
        int channel = src->nChannels;

        // allocate the working buffer required by the resize function
        int BufferSize = 0;
        if (ippiResizeGetBufSize(srcRoi, dstRoi, channel, interpolation,
                                 &BufferSize) != ippStsNoErr)
            return;
        Ipp8u *pBuffer = ippsMalloc_8u(BufferSize);
        if (pBuffer == NULL)
            return;

        if (channel == 1)
        {
            // single-channel (grayscale) image
            ippiResizeSqrPixel_8u_C1R((Ipp8u*)src->imageData, srcSize, src->widthStep,
                                      srcRoi, (Ipp8u*)dst->imageData, dst->widthStep,
                                      dstRoi, x_factor, y_factor, 0.0, 0.0,
                                      interpolation, pBuffer);
        }
        else
        {
            // three-channel image
            ippiResizeSqrPixel_8u_C3R((Ipp8u*)src->imageData, srcSize, src->widthStep,
                                      srcRoi, (Ipp8u*)dst->imageData, dst->widthStep,
                                      dstRoi, x_factor, y_factor, 0.0, 0.0,
                                      interpolation, pBuffer);
        }

        ippsFree(pBuffer);
    }
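Both listings above rely on Lanczos interpolation (the hand-written Lanczos filter passes and the `IPPI_INTER_LANCZOS` mode). For reference, the standard definition of the Lanczos kernel with window parameter $a$ is given below; which value of $a$ each implementation uses is not stated in this paper, so $a = 3$ should be read only as a common choice and an assumption.

```latex
% Lanczos kernel with window parameter a (a = 3 is a common choice)
L(x) =
\begin{cases}
  1, & x = 0 \\[4pt]
  \dfrac{a \,\sin(\pi x)\,\sin(\pi x / a)}{\pi^{2} x^{2}}, & 0 < |x| < a \\[4pt]
  0, & |x| \ge a
\end{cases}

% Resampled value at position x from source samples s_i
S(x) = \sum_{i=\lfloor x \rfloor - a + 1}^{\lfloor x \rfloor + a} s_i \, L(x - i)
```

Because the kernel is separable, a two-dimensional resize can be computed as one horizontal pass followed by one vertical pass (or the reverse), which is the structure of the original listing.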

Figure 11 shows the result of applying the Intel-optimized solution to the image scaling application on an Intel® Xeon™ CPU E5-2697 v3 @ 2.60GHz system. The libjpeg-turbo library optimizes image compression and decompression, and the IPP high performance library optimizes the scaling step. Because both libjpeg-turbo and IPP make good use of the IA SIMD/AVX instructions, we achieved a 2-3x performance speedup in this application.

Figure 11: Image Scaling Application Tuning

[Figure 11 chart: Image Scaling Application Tuning on IA platform; time in seconds (y-axis from 0 to 0.08) for test1.jpg through test6.jpg under four configurations: base, turbojpeg-SIMD, IPP, and IPP+turbojpeg.]

5. WebP Image Processing

WebP[11][12] is a new image format proposed and developed by Google, based on technology from the company On2. It aims to reduce image size by about half or more compared with traditional formats such as JPEG, PNG and GIF, which helps image processing platforms save significant storage volume and network bandwidth.

WebP adopts a block-based transformation and prediction scheme, with eight bits of color depth and a luminance-chrominance model. Each block is predicted from the values of the three blocks above it and the one block to its left (blocks are decoded in raster-scan order: left to right and top to bottom). It supports four basic modes of block prediction: horizontal, vertical, DC (one color), and TrueMotion. Mis-predicted data and non-predicted blocks are compressed in 4×4 pixel sub-blocks with a discrete cosine transform or a Walsh–Hadamard transform. The lossy WebP algorithm supports only the 8-bit YUV 4:2:0 format, which may cause color loss. WebP is derived from the VP8 video format.
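The four prediction modes can be sketched as follows for a 4×4 block. This is a simplified illustration and an assumption on our part: it is not bit-exact with the VP8/WebP bitstream specification (which, for example, applies extra smoothing in some sub-block modes), and the names are hypothetical. The encoder stores only the difference between the real block and the prediction.

```c
#include <stdint.h>

enum pred_mode { PRED_DC, PRED_VERTICAL, PRED_HORIZONTAL, PRED_TRUEMOTION };

/* Clamp an intermediate value to the 8-bit sample range. */
static uint8_t clamp255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Predict a 4x4 block from already decoded neighbours:
 * above[x]  - the row of samples directly above the block
 * left[y]   - the column of samples directly to the left
 * top_left  - the sample above and to the left of the block */
void predict4x4(enum pred_mode mode, const uint8_t above[4],
                const uint8_t left[4], uint8_t top_left, uint8_t out[4][4])
{
    int dc = 4;                              /* rounding term for /8 */
    for (int i = 0; i < 4; i++)
        dc += above[i] + left[i];
    dc >>= 3;                                /* average of 8 neighbours */

    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++) {
            switch (mode) {
            case PRED_DC:         out[y][x] = (uint8_t)dc;  break;
            case PRED_VERTICAL:   out[y][x] = above[x];     break;
            case PRED_HORIZONTAL: out[y][x] = left[y];      break;
            case PRED_TRUEMOTION: /* gradient from the top-left corner */
                out[y][x] = clamp255(left[y] + above[x] - top_left);
                break;
            }
        }
    }
}
```

Every output sample depends only on a handful of neighbouring samples and uses the same arithmetic, which is why these functions map well onto SIMD instructions.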

Table 5 and Table 6 compare WebP with JPEG and other image formats in terms of color-domain display and compression performance. WebP delivers the same image quality as JPEG, but behaves somewhat differently in the color domain.

JPEG Image WebP Image

Table 5: JPEG and WebP Image Compare at Color Domain

Across the five test images in Table 6, WebP reduces file size, and therefore storage and bandwidth, by about 2.4-3.8x compared with JPEG, but it needs about 3-4.8x more compression time and computing resource.



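| Image | Format | Time (s) | Size (KB) | Quality (PSNR) | Time WebP/JPEG | File size JPEG/WebP |
|---|---|---|---|---|---|---|
| 1.tiff, 1419 x 1001, 1922 KB | WebP | 0.208 | 75 | 41.02 | | |
| | JPEG | 0.056 | 287 | 38.2934 | 3.71 | 3.83 |
| | PNG | 0.911 | 1917 | | | |
| | GIF | 0.633 | 887 | | | |
| | JPEG2000 | 1.621 | 1905 | | | |
| 2.tiff, 800 x 600, 253 KB | WebP | 0.078 | 30 | 39.9 | | |
| | JPEG | 0.026 | 113 | 34.2956 | 3.00 | 3.77 |
| | PNG | 0.294 | 252 | | | |
| | GIF | 0.227 | 198 | | | |
| | JPEG2000 | 0.593 | 647 | | | |
| 3.tiff, 5120 x 3840, 15877 KB | WebP | 2.302 | 523 | 43.74 | | |
| | JPEG | 0.548 | 1967 | 43.2104 | 4.20 | 3.76 |
| | PNG | 12.186 | 15813 | | | |
| | GIF | 5.203 | 9222 | | | |
| | JPEG2000 | 15.071 | 14309 | | | |
| 4.tiff, 2560 x 1600, 6869 KB | WebP | 0.701 | 604 | 38.57 | | |
| | JPEG | 0.176 | 1442 | 37.2701 | 3.98 | 2.39 |
| | PNG | 1.963 | 6824 | | | |
| | GIF | 1.363 | 3310 | | | |
| | JPEG2000 | 5.095 | 5367 | | | |
| 5.tiff, 3942 x 4684, 17344 KB | WebP | 2.62 | 772 | 43.26 | | |
| | JPEG | 0.549 | 2533 | 46.8093 | 4.77 | 3.28 |
| | PNG | 12.047 | 17122 | | | |
| | GIF | 5.591 | 11865 | | | |
| | JPEG2000 | 13.005 | 12983 | | | |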

Table 6: Image Compression Performance Compare
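The quality column in Table 6 is reported as PSNR. The paper does not state how the values were computed; assuming the standard definition for 8-bit images, PSNR is derived from the mean squared error between the original image $I$ and the compressed image $K$, both of size $W \times H$:

```latex
\mathrm{MSE} = \frac{1}{W H} \sum_{x=0}^{W-1} \sum_{y=0}^{H-1}
               \bigl( I(x,y) - K(x,y) \bigr)^{2}

\mathrm{PSNR} = 10 \, \log_{10}\!\left( \frac{255^{2}}{\mathrm{MSE}} \right)
                \ \text{dB}
```

A higher PSNR means the compressed image is closer to the original, so values should be compared at similar file sizes or, as in Table 6, together with the size and time columns.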

More and more media cloud customers have adopted the WebP format to serve images to supported client devices, which saves considerable storage and network bandwidth cost. However, WebP image compression consumes over 3 times more computing resource, so these customers are also looking for the most efficient WebP image processing solution to meet the computing requirement.



As in other image processing applications, the most time-consuming modules of WebP image processing are block-based, data-intensive functions, which can also be optimized with IA SIMD vectorization technology. Figure 12 shows a workload that converts an original JPEG image to the WebP format and delivers it to end users whose browsers support WebP. For the JPEG decompression step, libjpeg-turbo can be adopted to take advantage of its SIMD optimization. For the WebP compression step, we used ICC with its SIMD optimization switches and obtained a 17% performance speedup.

Figure 12: WebP Image Tuning with SIMD

The profiling data in Figure 13 shows that Google has developed IA SIMD/SSE2 code to optimize WebP compression performance, but that there is little AVX2 support. AVX2 widens integer operations from 128 bits to 256 bits, which can theoretically double the throughput of the existing 128-bit SSE code, and it is already supported on the Intel® Xeon™ E5-2600 v3 platform. We can therefore expect a further significant performance improvement when the SSE code is upgraded to AVX2 on the E5-2600 v3 platform.

[Figure 12 chart: WebP Image Workload; time in seconds, y-axis from 0.082 to 0.092.]

Figure 13: Profiling Result of the WebP Processing

6. Summary

In this paper, we analyzed the architecture and performance characteristics of the most popular image processing applications, and demonstrated the capabilities of Intel server platforms and their leadership in the media processing domain through architecture design, rigorous manufacturing and excellent performance. As new image technologies and usage models keep emerging and the IA platform is steadily upgraded, more and more applications and customers will benefit from the high performance and high reliability of IA technologies.

Reference

[1] JPEG: http://www.jpeg.org/
[2] Wikipedia: http://es.wikipedia.org
[3] libjpeg-turbo: http://www.libjpeg-turbo.org/Main/HomePage
[4] https://software.intel.com/en-us/intel-sdp-home
[5] https://software.intel.com/en-us/c-compilers/
[6] https://software.intel.com/en-us/intel-ipp/
[7] http://software.intel.com/en-us/intel-isa-extensions
[8] https://www.academia.edu/3036328/Down-sampling_of_compressed_images_in_the_DCT_domain
[9] http://jpegclub.org/jidctred/
[10] http://www.imagemagick.org/
[11] https://code.google.com/p/webp/
[12] http://en.wikipedia.org/wiki/WebP
