High Performance Image Processing Solution with Intel® Platform Technology
Yang Lu
Intel Corporation
2015.
White Paper: High Performance Image Processing Solution with Intel® Platform Technology
2
Intel disclaims all express and implied warranties, including without limitation, the implied
warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any
warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All
information provided here is subject to change without notice. Contact your Intel representative to
obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause
deviations from published specifications. Current characterized errata are available on request.
Copies of documents which have an order number and are referenced in this document may be
obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm. Intel and the
Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© 2015 Intel Corporation.
Software and workloads used in performance tests of this paper may have been optimized for
performance only on Intel microprocessors. Performance tests are measured using specific computer
systems, components, software, operations and functions. Any change to any of those factors may
cause the results to vary. You should consult other information and performance tests to assist you in
fully evaluating your contemplated purchases, including the performance of that product when
combined with other products. For more information go to http://www.intel.com/performance.
Contents
1. Abstract
2. Image Processing Introduction
3. Performance Characters
3.1 Image Processing Performances Overview
3.2 Simultaneous Multithreading and Turbo Boost
3.3 Micro Architecture Characters
4. High Performance Solution Based on Intel® Xeon™ Platform
4.1 Image Compression Tuning
4.2 Image Scaling Program Tuning
4.2.1 Down-Sampling Algorithm
4.2.2 Intel® High Performance Tools
5. WebP Image Processing
6. Summary
Reference
Author contact: Yang Lu, Senior Application Engineer
1. Abstract
With the increasing popularity of internet and media cloud applications, a huge volume of
image data is generated, used and shared every day, which presents major computing and
storage challenges to media-related industries and companies. In this paper, we study the
most popular image processing techniques, analyze their performance challenges, explore
tuning methodologies, and implement the most effective solution on the IA platform. We
aim to maximize the IA platform's capabilities for typical image processing workloads and
achieve the best performance and efficiency for the internet and media industry.
2. Image Processing Introduction
A picture is worth a thousand words. People increasingly upload pictures to social
networking services (SNS) to describe their status, feelings and events. Sellers also upload
many distinct pictures to describe and advertise their products on B2B, B2C and C2C
platforms. We can even search for knowledge or information by image on popular search
engines. Images have filled every corner of our lives. This flood of images presents a
computing and storage challenge: how can these images be processed, compressed, stored,
managed and transmitted effectively? And which platforms and technologies are most
efficient for image processing? In this paper, we analyze a typical image processing
framework and its performance characteristics, explore the IA technologies that best
maximize image processing performance, and propose the best solution in terms of
processing performance and efficiency.
Generally, most companies need to scale and edit the images that customers upload or the
media content they purchase from third parties: for example, scaling the original images to
different dimensions to fit different terminal devices, compressing the images to a target
format that saves storage space and network bandwidth, and editing the images (adding
logos and watermarks) to meet business requirements. Figure 1 shows the typical image
processing flow that most companies adopt.
Figure 1: Typical Image Processing Flow for Cloud Applications
To perform this processing, the following software stacks are mainly adopted:
Software | Type | License | Developed by | Language | OS
ImageMagick | Image manipulation | Apache 2.0 License | ImageMagick Studio LLC | C | Cross platform
GraphicsMagick | Fork from ImageMagick version 5.5.2, emphasizing stability and performance | MIT License | GraphicsMagick Group | C | Cross platform
OpenCV | Computer vision library and framework | BSD License | Intel Corporation, Willow Garage, Itseez | C/C++ | Cross platform
OpenGL | Graphics application API, aiming at hardware-accelerated rendering | SGI open source license and trademark license | Formerly OpenGL Architecture Review Board (ARB), now Khronos Group | C | Cross platform
Table 1: Common Image Processing Software Stack
Currently, most media companies store and distribute images in formats such as JPEG, GIF
and PNG, chosen to achieve the best compression ratio and flexible internet content
expression. Table 2 summarizes the characteristics, usage models, advantages and
disadvantages of these image formats.
Format (name) | File extensions | Developed by | Licensed | Lossless | Animation support | Transparency support | Usage | Pros | Cons | Browser support (without plugin)
BMP (Windows Bitmap) | .bmp .dib | Microsoft | No | Yes | No | Yes | - | - | Large file size | No
GIF (Graphics Interchange Format) | .gif | CompuServe | No (expired) | Yes | Yes | Yes | Animation | Animation; widely supported format; transparency support | Limited to 256 colors | IE, Firefox, Chrome, Safari, Opera
JPEG (Joint Photographic Experts Group) | .jpg .jpeg .jpe .jif .jfif .jfi | Joint Photographic Experts Group | No (invalid) | No | No | No | Photography | Small file size; widely supported format | Lossy compression | IE, Firefox, Chrome, Safari, Opera
JPEG2000 (Joint Photographic Experts Group 2000) | .jp2 .j2c .j2k .jpx .jpf | Joint Photographic Experts Group | Yes | Yes (lossless and lossy) | Yes (ISO/IEC 15444-2) | Yes | JPEG replacement, HD imaging | Small file size | Computing intensive | Safari
MNG (Multiple-image Network Graphics) | .mng | W3C (donated by PNG Development Group) | - | Yes | Yes | Yes | Animation | - | Not widely supported | No
PNG (Portable Network Graphics) | .png | W3C (donated by PNG Development Group) | No | Yes | No | Yes | Icons | Lossless; widely supported format; transparency support | - | IE, Firefox, Chrome, Safari, Opera
PSD (Photoshop Document) | .psd .pdd | Adobe Systems | - | Yes | No | Yes | Image editing | Lossless; layers support; transparency support | - | No
RAW (RAW image file) | .crw .cr2 .raw .rw2 .nef .nrw .orf ... | Camera manufacturer | - | Yes | No | No | HDR photography, archiving | Lossless | Large file size | No
TIFF (Tagged Image File Format) | .tiff .tif | Adobe | - | Yes | No | No | Images from scanner, HD imaging | Lossless | Large file size | No
WebP | .webp | Google | - | No | No | No | - | Small file size | Lossy compression | Chrome, Opera
Table 2: Common Image Standard and Format
Processing so many types of images efficiently, each with its own dedicated processing, is a
big challenge for backend clusters. End users generate and upload all kinds of images every
day, and new media content is created and distributed frequently, so image processing
clusters are constantly under heavy processing pressure. In the following sections we
analyze which IA technologies matter most for image processing applications and how to
take advantage of them to improve image processing performance, ultimately benefiting
media-related businesses.
3. Performance Characters
3.1 Image Processing Performances Overview
Image processing is a typical compute-intensive workload that consumes a lot of CPU and
memory resources. The performance of image processing applications depends heavily on
the capabilities of the CPU cores, the cache and the memory bandwidth. Taking traditional
image scaling, compression and rotation as examples, we ran these applications on
different IA platforms, from low-end to high-end processors as listed in Table 3, and
obtained the performance results shown in Figure 2.
Intel® Xeon™ Platform
Processor Number E5-2620V2 E5-2640V2 E5-2697V2 E5-2697V3 HSW
# of Cores 6 8 12 14
# of Threads 12 16 24 28
Clock Speed 2.1 GHz 2 GHz 2.7 GHz 2.3 GHz
Max Turbo Frequency 2.6 GHz 2.5 GHz 3.5 GHz 4.0 GHz
Intel® Smart Cache 15 MB 20 MB 30 MB 35 MB
Intel® QPI Speed 7.2 GT/s 7.2 GT/s 8 GT/s 9.6 GT/s
Table 3: Intel® Xeon™ Ivy Bridge and Haswell Platform Configuration
The performance metric here is processing time (lower is better). Figure 2 shows that
processing performance improves as CPU frequency, cache size and core count increase,
and with each IA architecture upgrade. Therefore, higher-bin processors are more efficient
for image processing applications.
Figure 2: Image Processing Applications Performance at Different Platforms
We also ran a traditional image scaling application on different Haswell platforms and used
IPP (Intel® Integrated Performance Primitives)[6] to optimize its performance. Figure 3
demonstrates that image scaling performance increases with CPU frequency and cache
capacity, and that CPU frequency plays the more important role in this application.
Figure 3: Image Scaling Application Performance at Haswell Platforms
[Figure 3 data, image scaling time (ms), original / IPP tuning: E5-2670 v3 (2.30 GHz, 30 MB cache) 233 / 192; E5-2699 v3 (2.30 GHz, 45 MB cache) 221 / 158; E5-2697 v3 (2.60 GHz, 35 MB cache) 207 / 152]
3.2 Simultaneous Multithreading and Turbo Boost
Intel® Simultaneous Multithreading (SMT), also called Hyper-Threading (HT), and Intel®
Turbo Boost Technology are two key technologies that the IA platform provides. They are
supported on most IA platforms and contribute significant speedups to many media-related
applications.
− SMT lets the operating system address two virtual (logical) cores for each physical
core, which share the core's resources when possible. The main function of
Hyper-Threading is to increase the number of independent instructions in the
pipeline. It improves performance when the CPU cores are heavily loaded, but not
for every application: when the cores are often idle, SMT may only add
task/thread switching overhead.
− Intel® Turbo Boost Technology increases performance by translating temperature,
power and current headroom into higher frequency. The actual Turbo Boost
frequency is determined by the number of active cores (in C0 state), the type of
workload, the estimated power consumption, and the processor temperature. For
most customers, Intel® Turbo Boost Technology has a positive impact on
application performance, because it opportunistically raises the frequency above
the rated frequency when conditions allow, with no action required. Customers
who need a deterministic processor frequency are advised to disable Intel® Turbo
Boost Technology.
Figure 4 shows an example of the SMT and Turbo Boost contributions for the image scaling
workload: SMT provides around 30% speedup and Turbo Boost about 5%. Both technologies
improve performance noticeably and have been widely adopted in customer image
processing applications.
Figure 4: SMT and Turbo Boost Contribution for Image Processing Application
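SMT only pays off when the application runs enough threads to keep every logical core busy. The sketch below is one way to size and partition such a workload; it assumes a POSIX system (Linux/glibc) where `sysconf(_SC_NPROCESSORS_ONLN)` reports online logical CPUs, and a row-parallel scaling job. The helper names `worker_count` and `row_band` are hypothetical, not taken from the paper's code.

```c
#include <assert.h>
#include <unistd.h>

/* One worker per online logical CPU, so both SMT siblings of each
   physical core receive work. Falls back to 1 if the query fails. */
int worker_count(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int)n : 1;
}

/* Split `rows` image rows into `nthreads` contiguous bands; thread `t`
   processes rows [*begin, *end). The first rows % nthreads bands get one
   extra row, so band sizes differ by at most one. */
void row_band(int rows, int nthreads, int t, int *begin, int *end)
{
    assert(rows >= 0 && nthreads > 0 && t >= 0 && t < nthreads);
    int base = rows / nthreads;
    int extra = rows % nthreads;
    *begin = t * base + (t < extra ? t : extra);
    *end = *begin + base + (t < extra ? 1 : 0);
}
```

For example, 10 rows over 4 threads gives the bands [0,3), [3,6), [6,8) and [8,10). Contiguous bands keep each thread's rows adjacent in memory, which suits the cache behavior discussed in Section 3.3.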
3.3 Micro Architecture Characters
From the processor micro-architecture perspective, a well-tuned compute-intensive
workload should keep the execution units busy almost all the time, with few cache misses,
a low branch misprediction ratio, and a good CPI (cycles per instruction). Generally, a CPI
below 1 is considered ideal. Figure 5 shows micro-architecture profiling data for a
well-tuned image scaling application. It also shows that image scaling consumes
considerable memory bandwidth and resources, so memory capability is also very
important for the performance of image processing applications.
Figure 5: Processor Micro Architecture Characters for Image Processing
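CPI is simply the ratio of two hardware counters collected over the same interval: core clock cycles and instructions retired, which tools such as Linux `perf stat -e cycles,instructions` or Intel VTune report. The small sketch below shows that arithmetic and the "CPI below 1" rule of thumb from the text; the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* CPI = core clock cycles / instructions retired over the same interval. */
double cycles_per_instruction(uint64_t cycles, uint64_t instructions)
{
    assert(instructions > 0);
    return (double)cycles / (double)instructions;
}

/* Rule of thumb from the text: a CPI below 1 means the core retired more
   than one instruction per cycle on average. */
int cpi_is_ideal(double cpi)
{
    return cpi < 1.0;
}
```

For example, 7.5 billion cycles for 10 billion retired instructions gives a CPI of 0.75, within the ideal range.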
Given these performance characteristics of image processing applications, the following
sections analyze IA platform technologies and investigate how to take advantage of them
for image processing applications.
4. High Performance Solution Based on Intel® Xeon™ Platform
As illustrated in Section 3, image processing is a standard CPU- and memory-intensive
workload, which requires a highly capable server platform in terms of core computing
efficiency, reliability and stability. In this section, we introduce the key IA technologies
that can significantly boost image processing performance.
4.1 Image Compression Tuning
Most images are stored and distributed in compressed formats such as JPEG, GIF and
PNG, so image compression/decompression is one of the most critical modules in image
processing clusters. Every image contains one or more [width × height] matrices (typically
one per color channel), and image processing is essentially a set of matrix computations.
These calculations can be optimized with vectorization technology such as IA SIMD
(single instruction, multiple data) instructions, which perform the same operation on
multiple data elements simultaneously, as shown in Figure 6. This greatly improves data
throughput and execution efficiency.
Figure 6: SIMD Methodology
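As a concrete instance of the SIMD idea in Figure 6, the sketch below halves an image's height by averaging pairs of rows. It assumes an x86-64 target (where SSE2 is always available) and 8-bit samples; `average_rows_sse2` is a hypothetical name, not part of any library mentioned here. One `_mm_avg_epu8` instruction averages 16 pixel pairs, where a scalar loop needs 16 iterations.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics (x86 only) */
#include <stddef.h>
#include <stdint.h>

/* Vertical 2:1 downscale of one output row:
   out[i] = (top[i] + bottom[i] + 1) / 2 for 8-bit samples.
   _mm_avg_epu8 computes exactly this rounded average for 16 bytes at once;
   a scalar loop handles the remaining n % 16 pixels. */
void average_rows_sse2(const uint8_t *top, const uint8_t *bottom,
                       uint8_t *out, size_t n)
{
    assert(top != NULL && bottom != NULL && out != NULL);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(top + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(bottom + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_avg_epu8(a, b));
    }
    for (; i < n; i++) /* scalar tail */
        out[i] = (uint8_t)((top[i] + bottom[i] + 1) >> 1);
}
```

Calling it for each pair of input rows (0 and 1, 2 and 3, ...) produces the half-height image; unaligned loads are used so the rows need no special alignment.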
SIMD instructions have long been supported in x86 processors, evolving from MMX through
SSE and AVX to AVX2 across x86 platform generations. Many projects use vectorization
instructions to optimize image compression and decompression. libjpeg-turbo[3] is an open
source project that uses SIMD instructions to accelerate baseline JPEG compression and
decompression on x86 platforms. Generally, libjpeg-turbo provides a 2-4x speedup for
JPEG compression and decompression. Figure 7 shows an example in which we used the
libjpeg-turbo library to optimize a JPEG image compression application via SIMD.
Figure 7: JPEG Image Compression Tuning by libjpeg-turbo with SIMD
4.2 Image Scaling Program Tuning
As described in Section 2, image scaling is another hot function in image processing
clusters, since most companies scale original images to a target resolution and format
when they receive them. To optimize an image scaling application, we can work at both the
algorithm level and the code level.
4.2.1 Down-Sampling Algorithm
As shown in Figure 8, a traditional image scaling application first decodes the original
compressed image to raw data, then scales it down or up as the application requires, and
finally saves it in the target format.
Figure 8: Image Scaling Application
A. N. Skodras and C. A. Christopoulos proposed the algorithm "Down-Sampling of
Compressed Images in the DCT Domain"[8]: during decompression, only the most
important data is sampled and decompressed instead of decoding the full-size image,
which saves a lot of computing time. JPEG compresses images with an 8x8 DCT, so an
image can be decompressed directly to n/8 of its original size (n = 4, 2, 1). In this way,
we can decompress the image to a suitable n/8 resolution instead of the original full
size.
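The choice of n can be sketched as follows, assuming the decoder offers at least the reductions 1/8, 2/8, 4/8 and 8/8 (in libjpeg these are selected through the `scale_num`/`scale_denom` fields of the decompression object). The helper `pick_dct_scale` is hypothetical and only illustrates the rule "reduce as much as possible while staying at least as large as the target"; it is not ImageMagick's code.

```c
#include <assert.h>

/* Pick the strongest DCT-domain reduction n/8 (n = 1, 2, 4 or 8) whose
   decoded size still covers the target, so the following resize only has
   to shrink a slightly larger image. Decoded sizes are rounded up, as
   libjpeg does when scaling during decompression. Returns n. */
int pick_dct_scale(int src_w, int src_h, int dst_w, int dst_h,
                   int *out_w, int *out_h)
{
    static const int candidates[] = {1, 2, 4, 8}; /* smallest output first */
    assert(src_w > 0 && src_h > 0 && dst_w > 0 && dst_h > 0);
    for (int k = 0; k < 4; k++) {
        int n = candidates[k];
        int w = (src_w * n + 7) / 8; /* ceil(src_w * n / 8) */
        int h = (src_h * n + 7) / 8;
        if (w >= dst_w && h >= dst_h) {
            *out_w = w;
            *out_h = h;
            return n;
        }
    }
    *out_w = src_w; /* target larger than the source: decode at full size */
    *out_h = src_h;
    return 8;
}
```

For the 4352x3264 to 500x375 example below, this returns n = 1 with a 544x408 decode size; a 1000x750 target would instead select n = 2 (1088x816).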
With the ImageMagick[10] framework, for example, scaling a source image from 4352x3264
to a target resolution of 500x375 takes two steps:
o Step 1: decompress the source image to (4352/8) x (3264/8) = 544x408
o Step 2: resize the image to 500x375
Both steps can be performed with the following ImageMagick command:
# convert -size 500x375 -scale 500x375 source.jpg target.jpg
(With -size 500x375, ImageMagick decompresses to 544x408 automatically.)
Table 4 shows the performance:
Scaling method | Time (s) | Comments
#convert -scale 500x375 source.jpg target.jpg | 0.825 | Original: full-size decompression, then scaling
#convert -size 500x375 -scale 500x375 source.jpg target.jpg | 0.230 | -size makes the IM framework decompress directly to 544x408, then scale to 500x375
Speedup: 0.825 / 0.230 = 3.6x
Table 4: down-sampling algorithm performance
In some ImageMagick versions, the "-size" option is not honored as a decode hint (newer
versions read the hint from "-define jpeg:size=500x375" instead). Therefore, the most stable
solution is to modify "ImageMagick_DIR/coders/jpeg.c" as follows, replacing the
commented-out line with the line after it:
static Image *ReadJPEGImage(const ImageInfo *image_info, ExceptionInfo *exception)
{
  …
  if (units == 2)
    image->units=PixelsPerCentimeterResolution;
  number_pixels=(MagickSizeType) image->columns*image->rows;
  /* Original line: read the decode-size hint from "-define jpeg:size=WxH". */
  /* option=GetImageOption(image_info,"jpeg:size"); */
  /* Replacement: read the hint from the "-size WxH" option instead. */
  option=image_info->size;
  if (option != (const char *) NULL)
    {
      …
    }
  …
}
4.2.2 Intel® High Performance Tools
To maximize IA platform capabilities and help customers adopt advanced IA technologies,
Intel has developed a full set of high performance tools and libraries[4] for IA-based client
and server platforms, many operating systems (Windows*, Linux*, OS X* and Android*),
and various domains such as system profiling, compilers, math kernel libraries, cluster
analysis, graphics SDKs and multithreading programming tools. The Intel® Compiler and
IPP (Intel® Integrated Performance Primitives) have been widely used to optimize image
processing applications.
4.2.2.1 Intel® Compiler
ICC (Intel® Compiler)[5] automatically generates optimized code for IA-based platforms,
including auto-vectorization and auto-parallelization, memory and cache-line tuning, and a
series of high-level optimizations based on advanced Intel architecture features. It aims to
complete each task in the fewest CPU cycles, and it is compatible with Microsoft Visual
C++ on Windows and GCC (GNU Compiler Collection) on Linux.
Figures 9 and 10 show two examples of ICC's contribution to image scaling applications, in
single-thread/single-core and multi-thread/multi-core scenarios respectively. Replacing
GCC with ICC while keeping the same optimization switches gives a distinct performance
improvement.
Figure 9: ICC Contribution for Single-Thread Image Scaling Application
Figure 10: ICC Contribution for Multi-Thread Image Scaling Application
Generally, for applications running on Intel architecture, ICC can help customers obtain
further performance improvements easily and flexibly.
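Auto-vectorization works best on simple loops with no branches, no loop-carried dependences and no pointer aliasing. The sketch below blends a watermark row over an image row (one of the editing steps listed in Section 2); `blend_row` is a hypothetical example, not code from the paper. With flags such as `icc -O3 -xCORE-AVX2` or `gcc -O3 -march=haswell`, compilers can typically vectorize this loop, and `-qopt-report` (ICC) or `-fopt-info-vec` (GCC) reports whether they did; the exact outcome depends on the compiler version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Blend a watermark/logo row over an image row with a fixed weight:
   out = (alpha * mark + (256 - alpha) * img) >> 8, with alpha in [0, 256].
   `restrict` promises the three buffers do not overlap, which lets the
   compiler process many pixels per SIMD instruction. */
void blend_row(const uint8_t *restrict img, const uint8_t *restrict mark,
               uint8_t *restrict out, size_t n, unsigned alpha)
{
    assert(alpha <= 256);
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)((alpha * mark[i] + (256u - alpha) * img[i]) >> 8);
}
```

The largest intermediate value is 256 * 255, so the shifted result always fits in 8 bits and no per-pixel clamp (which could hinder vectorization) is needed.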
4.2.2.2 Intel® Integrated Performance Primitives
IPP (Intel® Integrated Performance Primitives)[6] exploits thread-level parallelism and
Intel architecture instruction sets to provide optimized implementations for the following
applications and algorithms:
1) Image, video and audio processing
2) Data communication
3) Data compression and encryption
4) Signal processing, etc.
IPP functions achieve significant performance improvements through the following
techniques:
− Thread-level parallelism
o Multi-core
o Hyper-Threading
− Instruction set architecture
o SIMD vectorization instructions: MMX, SSE, AVX
o Processing data in larger chunks with each instruction
o Any new instructions based on the IA architecture
− Microarchitecture
o Pre-fetching data and avoiding cache blocking
o Resolving data and trace cache misses
o Avoiding branch mispredictions
Here we take a common image scaling workload to demonstrate how to replace the original
implementation with the IPP library and achieve better performance on the IA platform:
#include "lanczos.h"               // Lanczos filter helpers and WorkLoadFactor (not shown)
#include <ipp.h>
#include <opencv2/core/core_c.h>   // IplImage, cvCreateImage, cvReleaseImage
#include <opencv2/core/core.hpp>   // cv::Mat, cv::cvarrToMat
#include <stdlib.h>

//------------------ Original Scaling Implementation ------------------//
// Two-pass separable Lanczos resize: one pass along each axis, with an
// intermediate image between the passes.
void original_scaling(IplImage *src, IplImage *dst, int width, int height)
{
    double x_factor = (double)width / src->width;
    double y_factor = (double)height / src->height;
    long span = 0;
    unsigned int status;
    CvSize filter_size;

    // Set up the Lanczos filter structure.
    LanczosResizeFilter *resize_filter = AcquireLanczosResizeFilter();

    // Choose the pass order and size the intermediate image accordingly:
    // horizontal first -> width x src->height, vertical first -> src->width x height.
    int horizontal_first = (x_factor * y_factor) > WorkLoadFactor;
    if (horizontal_first) {
        filter_size.width  = width;
        filter_size.height = src->height;
    } else {
        filter_size.width  = src->width;
        filter_size.height = height;
    }
    IplImage *filter_image = cvCreateImage(filter_size, src->depth, src->nChannels);

    // Non-owning cv::Mat headers over the IplImage buffers.
    cv::Mat srcMat    = cv::cvarrToMat(src);
    cv::Mat filterMat = cv::cvarrToMat(filter_image);
    cv::Mat dstMat    = cv::cvarrToMat(dst);

    // Compute the pixels of the destination image.
    if (horizontal_first) {
        span = (long)filter_image->width + height;
        status  = lanczosHorizontalFilter(resize_filter, &srcMat, &filterMat, x_factor, span);
        status &= lanczosVerticalFilter(resize_filter, &filterMat, &dstMat, y_factor, span);
    } else {
        span = (long)filter_image->height + width;
        status  = lanczosVerticalFilter(resize_filter, &srcMat, &filterMat, y_factor, span);
        status &= lanczosHorizontalFilter(resize_filter, &filterMat, &dstMat, x_factor, span);
    }
    (void)status;  // combined result of both passes; not checked in this example

    // Free the intermediate image and the filter.
    cvReleaseImage(&filter_image);
    DestoryLanczosResizeFilter(resize_filter);
}

//------------------------------ IPP Code ------------------------------//
// The same Lanczos resize through IPP. ippiResizeGetBufSize and
// ippiResizeSqrPixel belong to the legacy IPP resize API, which later IPP
// releases deprecate in favor of the ippiResizeLanczos* functions.
void ipp_scaling(IplImage *src, IplImage *dst, int width, int height)
{
    double x_factor = (double)width / src->width;
    double y_factor = (double)height / src->height;
    int channel = src->nChannels;
    if (channel != 1 && channel != 3)
        return;  // only 8-bit 1- and 3-channel images are handled here

    ippSetNumThreads(1);  // single thread, for comparison with original_scaling

    // Define IPP parameters: full-image source and destination ROIs.
    IppiRect srcRoi  = {0, 0, src->width, src->height};
    IppiRect dstRoi  = {0, 0, width, height};
    IppiSize srcSize = {src->width, src->height};
    int interpolation = IPPI_INTER_LANCZOS;

    // Query and allocate the work buffer.
    int bufferSize = 0;
    if (ippiResizeGetBufSize(srcRoi, dstRoi, channel, interpolation, &bufferSize) != ippStsNoErr)
        return;
    Ipp8u *pBuffer = ippsMalloc_8u(bufferSize);
    if (pBuffer == NULL)
        return;

    if (channel == 1) {
        ippiResizeSqrPixel_8u_C1R((Ipp8u *)src->imageData, srcSize, src->widthStep, srcRoi,
                                  (Ipp8u *)dst->imageData, dst->widthStep, dstRoi,
                                  x_factor, y_factor, 0.0, 0.0, interpolation, pBuffer);
    } else {
        ippiResizeSqrPixel_8u_C3R((Ipp8u *)src->imageData, srcSize, src->widthStep, srcRoi,
                                  (Ipp8u *)dst->imageData, dst->widthStep, dstRoi,
                                  x_factor, y_factor, 0.0, 0.0, interpolation, pBuffer);
    }
    ippsFree(pBuffer);
}
Figure 11 shows the result of tuning the image scaling application with the Intel-optimized
solution on an Intel® Xeon™ CPU E5-2697 v3 @ 2.60GHz system: libjpeg-turbo optimizes
JPEG compression and decompression, and the IPP high performance library optimizes the
scaling step. Because both libjpeg-turbo and IPP make good use of IA SIMD/AVX
instructions, we achieved a 2-3x speedup in this application.
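Figure 11: Image Scaling Application Tuning on IA Platform (time in seconds for test1.jpg to test6.jpg with the base code, turbojpeg-SIMD, IPP, and IPP+turbojpeg)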
5. WebP Image Processing
WebP[11][12] is a newer image format proposed and developed by Google, based on
technology from On2 Technologies. It aims to reduce image size by about half or more
compared with traditional formats such as JPEG, PNG and GIF, which helps image
processing platforms significantly reduce storage volume and network bandwidth.
WebP adopts a block-based transformation and prediction scheme, with eight bits of color
depth and a luminance-chrominance model. Each block is predicted from the values of the
three blocks above it and the block to its left (blocks are decoded in raster-scan order: left
to right and top to bottom). There are four basic modes of block prediction: horizontal,
vertical, DC (one color), and TrueMotion. Mispredicted data and non-predicted blocks are
compressed in 4×4 pixel sub-blocks with a discrete cosine transform or a Walsh–Hadamard
transform. Lossy WebP supports only the 8-bit YUV 4:2:0 format, which can cause some
color loss. Lossy WebP is derived from the key-frame (intra-frame) coding of the VP8
video format.
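The four basic prediction modes can be illustrated on a single 4x4 block. This is a simplified sketch, not libwebp code: VP8 defines additional 4x4 sub-block modes and special rules for blocks on the image border, and its DC mode for larger blocks averages more neighbours. `predict4x4` and `best_mode` are hypothetical names, and choosing the mode by smallest sum of absolute differences (SAD) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* The four basic intra-prediction modes named in the text. */
enum pred_mode { PRED_DC, PRED_V, PRED_H, PRED_TM };

static uint8_t clamp255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Predict a 4x4 block from already-decoded neighbours:
   above[0..3] = row just above the block, left[0..3] = column just left of
   it, above_left = the pixel diagonally above-left of the block. */
void predict4x4(enum pred_mode mode, const uint8_t above[4],
                const uint8_t left[4], uint8_t above_left, uint8_t pred[4][4])
{
    int dc = 4; /* rounding term: (sum of the 8 neighbours + 4) / 8 */
    for (int i = 0; i < 4; i++)
        dc += above[i] + left[i];
    dc >>= 3;

    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++) {
            switch (mode) {
            case PRED_DC: pred[y][x] = (uint8_t)dc; break; /* one flat color */
            case PRED_V:  pred[y][x] = above[x];    break; /* repeat row above */
            case PRED_H:  pred[y][x] = left[y];     break; /* repeat left column */
            case PRED_TM: /* TrueMotion: left + above - corner, clamped */
                pred[y][x] = clamp255(left[y] + above[x] - above_left);
                break;
            }
        }
    }
}

/* Encoder side (illustrative): keep the mode whose prediction has the
   smallest SAD from the actual pixels; only the residual (block minus
   prediction) would then be transformed and coded. blk is not modified. */
enum pred_mode best_mode(uint8_t blk[4][4], const uint8_t above[4],
                         const uint8_t left[4], uint8_t above_left)
{
    assert(blk != NULL);
    enum pred_mode best = PRED_DC;
    int best_sad = -1;
    for (int m = PRED_DC; m <= PRED_TM; m++) {
        uint8_t p[4][4];
        int sad = 0;
        predict4x4((enum pred_mode)m, above, left, above_left, p);
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
                sad += abs(blk[y][x] - p[y][x]);
        if (best_sad < 0 || sad < best_sad) {
            best_sad = sad;
            best = (enum pred_mode)m;
        }
    }
    return best;
}
```

A block whose rows repeat the pixels above it is predicted perfectly by the vertical mode, leaving an all-zero residual that compresses to almost nothing.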
In Table 5 and Table 6 we compare WebP with JPEG and other image formats in terms of
color-domain display and compression performance. WebP images show quality
comparable to JPEG images, but behave somewhat differently in the color-domain
display.
[Table 5 panels: JPEG image, WebP image]
Table 5: JPEG and WebP Image Compare at Color Domain
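One plausible contributor to color differences like those in Table 5 is chroma subsampling: lossy WebP always stores chroma at 4:2:0, while a JPEG encoder may or may not subsample. The generic sketch below shows what 4:2:0 does to one chroma plane; `subsample_420` is a hypothetical helper, and libwebp's actual RGB-to-YUV conversion differs in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 4:2:0 chroma subsampling: every 2x2 block of a full-resolution chroma
   plane (U or V) becomes one sample, the rounded average of the four.
   Luma (Y) keeps full resolution, so brightness detail survives while color
   detail finer than 2x2 pixels is averaged away. w and h must be even. */
void subsample_420(const uint8_t *chroma, size_t w, size_t h, uint8_t *out)
{
    assert(w % 2 == 0 && h % 2 == 0);
    for (size_t y = 0; y < h; y += 2) {
        for (size_t x = 0; x < w; x += 2) {
            unsigned sum = chroma[y * w + x] + chroma[y * w + x + 1]
                         + chroma[(y + 1) * w + x] + chroma[(y + 1) * w + x + 1];
            out[(y / 2) * (w / 2) + x / 2] = (uint8_t)((sum + 2) / 4);
        }
    }
}
```

A one-pixel-wide colored line, for example, is blended with its neighbours' chroma, which is the kind of color loss the text refers to.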
Across the five test images in Table 6, WebP files are about 2.4-3.8x smaller than JPEG
files, saving storage and bandwidth, but WebP encoding takes about 3-4.8x more computing time.
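Source images: 1.tiff (1419x1001, 1922 KB), 2.tiff (800x600, 253 KB), 3.tiff (5120x3840, 15877 KB), 4.tiff (2560x1600, 6869 KB), 5.tiff (3942x4684, 17344 KB)
Image | Format | Time (s) | Size (KB) | Quality (PSNR, dB)
1.tiff | WebP | 0.208 | 75 | 41.02
1.tiff | JPEG | 0.056 | 287 | 38.2934
1.tiff | PNG | 0.911 | 1917 | -
1.tiff | GIF | 0.633 | 887 | -
1.tiff | JPEG2000 | 1.621 | 1905 | -
2.tiff | WebP | 0.078 | 30 | 39.9
2.tiff | JPEG | 0.026 | 113 | 34.2956
2.tiff | PNG | 0.294 | 252 | -
2.tiff | GIF | 0.227 | 198 | -
2.tiff | JPEG2000 | 0.593 | 647 | -
3.tiff | WebP | 2.302 | 523 | 43.74
3.tiff | JPEG | 0.548 | 1967 | 43.2104
3.tiff | PNG | 12.186 | 15813 | -
3.tiff | GIF | 5.203 | 9222 | -
3.tiff | JPEG2000 | 15.071 | 14309 | -
4.tiff | WebP | 0.701 | 604 | 38.57
4.tiff | JPEG | 0.176 | 1442 | 37.2701
4.tiff | PNG | 1.963 | 6824 | -
4.tiff | GIF | 1.363 | 3310 | -
4.tiff | JPEG2000 | 5.095 | 5367 | -
5.tiff | WebP | 2.62 | 772 | 43.26
5.tiff | JPEG | 0.549 | 2533 | 46.8093
5.tiff | PNG | 12.047 | 17122 | -
5.tiff | GIF | 5.591 | 11865 | -
5.tiff | JPEG2000 | 13.005 | 12983 | -

Image | Time ratio WebP/JPEG | File size ratio JPEG/WebP
1.tiff | 3.714 | 3.827
2.tiff | 3.000 | 3.767
3.tiff | 4.201 | 3.761
4.tiff | 3.983 | 2.387
5.tiff | 4.772 | 3.281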
Table 6: Image Compression Performance Compare
More and more media cloud customers have adopted the WebP format to serve images to
supported client devices, which saves a lot of storage and network bandwidth cost.
However, WebP compression consumes more than three times the computing resources of
JPEG, so these customers are also looking for the most efficient WebP processing solution
to meet the computing requirement.
As with other image processing applications, the most time-consuming modules of WebP
processing are block-based, data-intensive functions, which can also be optimized with IA
SIMD vectorization. Figure 12 shows a workload that converts original JPEG images to
WebP and delivers them to end users whose browsers support WebP. For the JPEG
decompression step, libjpeg-turbo can be adopted to leverage SIMD optimization. For the
WebP compression step, we used ICC's SIMD optimization switches and obtained a 17%
speedup.
Figure 12: WebP Image Tuning with SIMD
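The delivery step in Figure 12 needs to know whether a client can display WebP. One common approach, assumed here rather than taken from the paper, is HTTP content negotiation: WebP-capable browsers such as Chrome list `image/webp` in the `Accept` request header. `choose_image_variant` is a hypothetical helper showing that check.

```c
#include <string.h>

/* Return "webp" if the client's Accept header advertises WebP support,
   otherwise fall back to "jpeg". This simple substring test ignores
   q-values such as "image/webp;q=0". */
const char *choose_image_variant(const char *accept_header)
{
    if (accept_header != NULL && strstr(accept_header, "image/webp") != NULL)
        return "webp";
    return "jpeg";
}
```

A server using this scheme should also send a `Vary: Accept` response header so that shared caches keep the JPEG and WebP variants of the same URL apart.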
The profiling data in Figure 13 shows that Google has developed IA SIMD/SSE2 code to
optimize WebP compression, but there is little AVX2 support. AVX2 can theoretically
double the throughput of 128-bit SSE integer code by using 256-bit integer operations, and
it is already supported on the Intel® Xeon™ E5-2600 v3 platform, so we can expect further
performance improvement from upgrading the SSE code to AVX2 on E5-2600 v3 platforms.
Figure 13: Profiling Result of the WebP Processing
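Upgrading SSE code to AVX2 usually means widening each loop from 128-bit to 256-bit registers while keeping the SSE2 path for older CPUs. The sketch below shows that pattern on a simple saturating brightness adjustment rather than libwebp's real kernels. It assumes GCC or Clang on x86-64: `__attribute__((target("avx2")))` and `__builtin_cpu_supports` are compiler extensions, and the function names are hypothetical.

```c
#include <assert.h>
#include <immintrin.h> /* SSE2 and AVX2 intrinsics (x86 only) */
#include <stddef.h>
#include <stdint.h>

/* Scalar tail: saturating add for elements i..n-1. */
static void brighten_tail(const uint8_t *in, uint8_t *out, size_t i,
                          size_t n, uint8_t delta)
{
    for (; i < n; i++) {
        unsigned v = in[i] + delta;
        out[i] = (uint8_t)(v > 255 ? 255 : v);
    }
}

/* 128-bit SSE2 path: 16 pixels per instruction. */
static void brighten_sse2(const uint8_t *in, uint8_t *out, size_t n, uint8_t delta)
{
    __m128i d = _mm_set1_epi8((char)delta);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(in + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_adds_epu8(v, d));
    }
    brighten_tail(in, out, i, n, delta);
}

/* 256-bit AVX2 path: 32 pixels per instruction. The target attribute lets
   the compiler emit AVX2 here without -mavx2 for the whole file, so this
   function must only be called on CPUs that support AVX2. */
static __attribute__((target("avx2")))
void brighten_avx2(const uint8_t *in, uint8_t *out, size_t n, uint8_t delta)
{
    __m256i d = _mm256_set1_epi8((char)delta);
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(in + i));
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_adds_epu8(v, d));
    }
    brighten_tail(in, out, i, n, delta);
}

/* Runtime dispatch: use the widest instruction set the CPU supports. */
void brighten(const uint8_t *in, uint8_t *out, size_t n, uint8_t delta)
{
    assert(in != NULL && out != NULL);
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        brighten_avx2(in, out, n, delta);
    else
        brighten_sse2(in, out, n, delta);
}
```

The AVX2 loop handles twice as many pixels per instruction as the SSE2 loop, which is where the theoretical 2x comes from; measured gains are often smaller when a kernel is limited by memory bandwidth.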
6. Summary
In this paper, we analyzed the architecture and performance characteristics of the most
popular image processing applications, and demonstrated the capabilities of Intel server
platforms and their leadership in the media processing domain through architecture
design, a rigorous manufacturing process and excellent performance. As new image
technologies and usage models keep emerging and IA platforms continue to advance,
more and more applications and customers will benefit from the high performance and
high reliability of IA technologies.
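Reference
[1] JPEG: http://www.jpeg.org/
[2] Wikipedia: http://es.wikipedia.org
[3] libjpeg-turbo: http://www.libjpeg-turbo.org/Main/HomePage
[4] Intel Software Development Products: https://software.intel.com/en-us/intel-sdp-home
[5] Intel C/C++ Compilers: https://software.intel.com/en-us/c-compilers/
[6] Intel Integrated Performance Primitives (IPP): https://software.intel.com/en-us/intel-ipp/
[7] Intel Instruction Set Extensions: http://software.intel.com/en-us/intel-isa-extensions
[8] A. N. Skodras and C. A. Christopoulos, "Down-Sampling of Compressed Images in the DCT Domain": https://www.academia.edu/3036328/Down-sampling_of_compressed_images_in_the_DCT_domain
[9] JPEG reduced-size IDCT (jidctred): http://jpegclub.org/jidctred/
[10] ImageMagick: http://www.imagemagick.org/
[11] WebP: https://code.google.com/p/webp/
[12] Wikipedia, WebP: http://en.wikipedia.org/wiki/WebP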