Image Feature Extraction for Plankton Classification
An exploration of image feature extraction and classification on
large oceanographic data
June 3, 2015
Author
KEVIN PARK
Oregon State University
Corvallis
KHP
2015
KIDDER HALL PRESS
Contents
1 Introduction
2 Edge detection
3 Feature Extraction
3.1 Shape Analysis
3.2 Histogram Method
4 Classification
4.1 Random Forest
5 Results & Discussion
6 Future Work
7 References
1 Introduction
Plankton, perhaps surprisingly, form a critical link in the global ecosystem and are a fun-
damental source of food and energy for aquatic wildlife. As such, the population levels of
plankton are an ideal metric for determining the health and viability of oceans and aquatic
ecosystems. The challenge thus becomes determining the best way to classify and count the
multitude of phytoplankton and zooplankton species in a sample of ocean water. Modern
imaging systems can easily produce hundreds of thousands of images on a short time scale,
so human-based classification is daunting and often of minimal utility.
An open competition called the National Data Science Bowl was hosted by Booz Allen
Hamilton and Kaggle. A training set of 30,337 labeled images of plankton (121
classes) and a test set of 130,400 images were provided by the Hatfield Marine
Science Center at Oregon State University. The goal of this competition was to find an
algorithm that can properly classify the different species of plankton, and a substantial
prize went to the three teams that built the best algorithms. The competition ended in
March 2015, but individuals can still submit results from their algorithms to see how
they would have performed had they participated in the competition.
The goal of my project is to construct a contour of each plankton, extract geometric
properties from it, and examine how well those properties distinguish the different species
of plankton.
2 Edge detection
The images of the different species of plankton vary in shape and intensity, which makes
it challenging to produce an edge that captures the shape of the plankton. We found that
combining popular edge detection methods, the Canny, Roberts, and Sobel, provides more
detail than using any one method alone.
Roberts Edge Detection
The Roberts edge detection algorithm was developed in 1963 [1]. At the time it was
difficult to implement and not widely used because of the lack of computing power; as
computing power increased, so did its popularity in edge detection. The idea behind this
method involves only a few steps. The first is computing the gradient of the original image
at each pixel by convolving with the following kernels,
\[ G_x = \begin{pmatrix} +1 & 0 \\ 0 & -1 \end{pmatrix} \quad \text{and} \quad G_y = \begin{pmatrix} 0 & +1 \\ -1 & 0 \end{pmatrix}. \]
The magnitude of the gradient is computed for each pixel,
\[ \nabla I(x, y) = G(x, y) = \sqrt{G_x^2 + G_y^2}, \]
and the gradient direction,
\[ \Theta(x, y) = \arctan\!\left(\frac{G_y}{G_x}\right). \]
The two results are then combined to produce what are called the Roberts edges. The
disadvantage of Roberts edge detection is its sensitivity to noise.
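The Roberts step can be sketched in a few lines. This is an illustrative implementation, not the report's code: the kernel values follow the definitions above, while the convolution boundary handling is an assumption.

```python
import numpy as np
from scipy.ndimage import convolve

def roberts_magnitude(image):
    """Gradient magnitude using the Roberts cross kernels."""
    kx = np.array([[1.0, 0.0], [0.0, -1.0]])   # G_x
    ky = np.array([[0.0, 1.0], [-1.0, 0.0]])   # G_y
    gx = convolve(image.astype(float), kx)
    gy = convolve(image.astype(float), ky)
    return np.sqrt(gx**2 + gy**2)

# A vertical step edge: the magnitude is nonzero only near the transition.
img = np.zeros((5, 5))
img[:, 2:] = 1.0
edges = roberts_magnitude(img)
```

Thresholding `edges` then yields a binary edge map.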
Sobel Edge Detection
Sobel edge detection is similar to the Roberts method. The only difference is the kernels
used to produce the gradient,
\[ G_x = \begin{pmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{pmatrix} \quad \text{and} \quad G_y = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{pmatrix}. \]
Similarly to Roberts, the Sobel is also sensitive to noise.
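The same pipeline works with the Sobel kernels swapped in; a minimal sketch (boundary handling again an assumption):

```python
import numpy as np
from scipy.ndimage import convolve

# Sobel kernels from the text; the pipeline is identical to Roberts,
# only the 3x3 kernels change.
KX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=float)   # G_x
KY = KX.T                                  # G_y is the transpose of G_x

def sobel_magnitude(image):
    gx = convolve(image.astype(float), KX)
    gy = convolve(image.astype(float), KY)
    return np.sqrt(gx**2 + gy**2)

img = np.zeros((5, 5))
img[:, 2:] = 1.0          # vertical step edge
edges = sobel_magnitude(img)
```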
Canny Edge Detection
The most popular edge detection algorithm is the Canny algorithm, developed in
1986 [2]. It differs from Roberts and Sobel in its additional use of denoising
(smoothing) the image prior to edge formation and its elimination of isolated edges
afterward. It has been shown that under most conditions the Canny algorithm performs
better than Roberts, Sobel, and other methods; currently, it is the benchmark standard
against which new edge detectors are compared [1].
The Canny algorithm is completed in five steps.
1. Denoise the image with a Gaussian filter with some fixed parameter (σ).
2. Compute the gradient of the image, with
\[ G_x = \begin{pmatrix} -1 & +1 \\ -1 & +1 \end{pmatrix} \quad \text{and} \quad G_y = \begin{pmatrix} +1 & +1 \\ -1 & -1 \end{pmatrix}. \]
3. Mark the local maxima of the gradient magnitude as edges.
4. Eliminate any false edges, such as isolated pixels.
5. Finally, fill gaps between edges by thresholding.
We vary the smoothing parameter (σ) from 1.5 to 3 in increments of 0.5 (∆σ = 0.5) and
then compute the mean of the resulting edge maps.
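The σ sweep can be sketched as follows. This is a simplified stand-in rather than the full five-step Canny: it smooths with a Gaussian, takes the gradient magnitude, and applies a crude fixed threshold in place of non-maximum suppression and hysteresis.

```python
import numpy as np
from scipy.ndimage import gaussian_gradient_magnitude

def mean_edges(image, sigmas=(1.5, 2.0, 2.5, 3.0)):
    """Average binary edge maps over a range of smoothing scales.

    Simplified Canny-like sweep: Gaussian-smoothed gradient magnitude,
    thresholded at half its maximum (an assumption standing in for
    hysteresis thresholding)."""
    maps = []
    for sigma in sigmas:
        g = gaussian_gradient_magnitude(image.astype(float), sigma)
        maps.append(g > 0.5 * g.max())
    return np.mean(maps, axis=0)
```

The returned map holds, for each pixel, the fraction of scales at which it was detected as an edge.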
Combining edge detectors
The edges produced by Roberts and Sobel were combined with the edges from the Canny
algorithm. The combination of these methods provided accurate shapes for both simple and
complex plankton.
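The report does not specify how the edge maps were combined; one plausible operation, sketched here as an assumption, is a pixelwise union:

```python
import numpy as np

def combine_edges(canny_edges, other_edges):
    """Pixelwise union of two binary edge maps: a pixel is an edge if
    either detector marked it. (The exact combination rule used in the
    report is not stated; this union is an assumption.)"""
    return np.logical_or(canny_edges, other_edges)
```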
Figure 1: The original images of the plankton species Acantharia Protist (A. Protist), Decapods, Detritus Blob (D. Blob), and Trichodesmium Bowtie (T. Bowtie), shown with the edges produced by taking the mean of the Canny edge maps.
Figure 2: These edges are produced from combining Canny and Roberts.
Figure 3: These edges are produced from combining Canny and Sobel.
Although it is difficult to see, the edges produced by the Canny and Roberts combination
differ from those produced by the Canny and Sobel combination.
3 Feature Extraction
3.1 Shape Analysis
After forming an edge for each plankton image, we extracted several geometric properties.
Each geometric feature has an accompanying figure (Figures 4-13) showing four species (A.
Protist, Decapods, D. Blob, and T. Bowtie) to demonstrate the similarities and differences
of their distributions. In each figure, the distributions on the left represent the features
extracted from the combined Canny and Sobel edges, and those on the right from the
combined Canny and Roberts edges.
3.1.1 Area
The area is computed by counting the number of pixels that make up the object.
Figure 4: The distributions of Area for Decapods and D. Blob are different; however, there is little difference between A. Protist and T. Bowtie.
This method of computing the Area of the object is neither scale nor rotation invariant.
3.1.2 Perimeter
The perimeter is computed by counting the number of pixels that form the boundary of the object.
Figure 5: The perimeter distributions do not seem to differ much from one another, except for D. Blob. This is most likely due to the image size.
This method of computing the perimeter is not scale or rotation invariant.
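Both counts can be sketched directly on a binary mask. Treating boundary pixels as object pixels with at least one background 4-neighbour is an assumption about the exact definition used:

```python
import numpy as np
from scipy.ndimage import binary_erosion

def area(mask):
    """Pixel-count area of a binary object mask."""
    return int(mask.sum())

def perimeter(mask):
    """Boundary-pixel count: object pixels that vanish under erosion,
    i.e. those with at least one background 4-neighbour."""
    interior = binary_erosion(mask)   # default structure = 4-connectivity
    return int((mask & ~interior).sum())

square = np.zeros((7, 7), dtype=bool)
square[1:6, 1:6] = True               # a 5x5 square
```

For the 5×5 square this gives an area of 25 pixels and a perimeter of 16 boundary pixels.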
3.1.3 Major and Minor Axes
The Major Axis is the longest straight line in the object and the Minor Axis is the longest
straight line perpendicular to the Major Axis.
Figure 6: The distribution of the Major Axis length seems to differ between A. Protist and T. Bowtie.
Figure 7: The distribution of the Minor Axis length is different between A. Protist and D. Blob.
The Major and Minor Axis lengths are rotation invariant, but not scale invariant.
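One way to approximate the axis lengths is through the eigenvalues of the pixel-coordinate covariance. This is an ellipse-fit proxy, not the exact longest-chord definition above, and the factor of 4 (full axis of the fitted ellipse) is an assumption:

```python
import numpy as np

def axis_lengths(mask):
    """Approximate major/minor axis lengths from the eigenvalues of
    the pixel-coordinate covariance matrix (an ellipse-fit proxy)."""
    ys, xs = np.nonzero(mask)
    coords = np.stack([xs, ys]).astype(float)
    evals = np.clip(np.linalg.eigvalsh(np.cov(coords)), 0, None)
    minor, major = 4.0 * np.sqrt(evals)   # eigvalsh sorts ascending
    return major, minor
```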
3.1.4 Convexity
The convexity is defined as the ratio of the perimeter of the convex hull to the perimeter of
the original shape,
\[ \text{Convexity} = \frac{\text{Perimeter of the Convex Hull}}{\text{Perimeter of the Shape}}. \]
The convex hull is the smallest convex region that encloses the original shape. If
Convexity ≈ 1, then the object is convex.
Figure 8: Compared to the previous features, the distributions of convexity for these four species appear to be more distinct.
3.1.5 Compactness
Compactness gives us information on how circular the object is and is computed from the
Area and Perimeter of the shape,
\[ \text{Compactness} = \frac{4\pi \cdot \text{Area}}{\text{Perimeter}^2}. \]
The closer the compactness is to 1, the more circular the object. For example, for a circle
of radius r, Area = πr² and Perimeter = 2πr, so the compactness is 1.
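The formula is a one-liner, and the circle example above can be checked numerically:

```python
import math

def compactness(area, perimeter):
    """4*pi*Area / Perimeter^2; equals 1 for a perfect circle."""
    return 4.0 * math.pi * area / perimeter ** 2

r = 3.0
circle = compactness(math.pi * r**2, 2 * math.pi * r)   # ≈ 1.0
```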
Figure 9: Similar to the previous features, it appears that A. Protist and T. Bowtie are the most different.
3.1.6 Eccentricity
The eccentricity is computed by first fitting an ellipse around the object and then taking
the ratio of the length of the major axis to the length of the minor axis,
\[ \text{Eccentricity} = \frac{\text{Length of the Major Axis}}{\text{Length of the Minor Axis}}. \]
Figure 10: The distribution of Eccentricity for Decapods seems to differ from those of the other three plankton species.
3.1.7 Solidity
Solidity measures whether the object is concave or convex. This is done by taking the ratio
of the area of the original shape to the area of the fitted convex hull,
\[ \text{Solidity} = \frac{\text{Area of the Shape}}{\text{Area of the Convex Hull}}. \]
The solidity of a convex object is 1.
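Both hull-based ratios (the convexity above and the solidity here) can be sketched with scipy's `ConvexHull`; note that in 2D its `.area` attribute is the hull perimeter and `.volume` is the enclosed area. Building the hull from pixel centres slightly underestimates the hull size for small masks, so the ratios are approximate:

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_features(mask, shape_area, shape_perimeter):
    """Convexity = hull perimeter / shape perimeter;
    Solidity  = shape area / hull area."""
    ys, xs = np.nonzero(mask)
    hull = ConvexHull(np.stack([xs, ys], axis=1).astype(float))
    convexity = hull.area / shape_perimeter    # .area is the 2D perimeter
    solidity = shape_area / hull.volume        # .volume is the 2D area
    return convexity, solidity
```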
Figure 11: The distribution of Solidity is different between A. Protist and D. Blob, while the distributions for Decapods and T. Bowtie are similar.
3.1.8 Mean Curvature
We computed the curvature by segmenting the boundary of each object. A second-degree
polynomial was fitted to the set of points that make up each segment, and the coefficient of
the quadratic term was used to measure curvature. The mean curvature was then computed
by taking the mean of these coefficients over all segments.
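A sketch of this procedure, with one assumption made explicit: the segment points are parameterized by index and x and y are fitted separately (which avoids vertical-tangent problems), rather than fitting y against x directly as the report may have done:

```python
import numpy as np

def mean_curvature(boundary, n_segments=10):
    """Mean of quadratic-fit coefficients over boundary segments.
    `boundary` is an (n, 2) array of ordered boundary points."""
    coeffs = []
    for seg in np.array_split(boundary, n_segments):
        if len(seg) < 3:
            continue
        t = np.arange(len(seg))
        cx = np.polyfit(t, seg[:, 0], 2)[0]   # quadratic term of x(t)
        cy = np.polyfit(t, seg[:, 1], 2)[0]   # quadratic term of y(t)
        coeffs.append(np.hypot(cx, cy))
    return float(np.mean(coeffs))
```

A straight boundary yields a value near zero, while a curved boundary yields a larger one.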
Figure 12: The distributions of mean curvature for the four plankton species appear to be similar.
3.1.9 Skeleton
A skeleton is fitted inside the object, and the number of branching nodes is then counted.
Figure 13: The distributions of the number of branch points are similar between A. Protist and D. Blob; likewise, Decapods and T. Bowtie are similar. Interestingly, these two groups differ from each other.
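Counting branch points on an already-computed skeleton can be sketched as follows. Producing the skeleton itself (e.g. by morphological thinning) is assumed to be done elsewhere, and using 4-connectivity for the neighbour count is an assumption:

```python
import numpy as np
from scipy.ndimage import convolve

def branch_points(skeleton):
    """Count skeleton pixels with three or more skeleton 4-neighbours."""
    kernel = np.array([[0, 1, 0],
                       [1, 0, 1],
                       [0, 1, 0]])
    neighbours = convolve(skeleton.astype(int), kernel, mode="constant")
    return int(np.sum(skeleton & (neighbours >= 3)))
```

A plus-shaped skeleton, for example, has exactly one branch point at its centre.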
3.2 Histogram Method
The first feature extraction method we developed was measuring the distribution of grayscale
values that make up the shape and texture of each plankton. The grayscale values are scaled
between 0 and 1, where 1 is white and 0 is black. A count is made of the grayscale values
that fall in each of the intervals [0, 0.1), [0.1, 0.2), . . . , [0.9, 1).
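The binning can be sketched with a fixed-bin histogram; note that `np.histogram` closes the last bin, so values equal to 1 fall in the final interval:

```python
import numpy as np

def grayscale_histogram(pixels):
    """Fraction of pixels in each of the ten grayscale bins
    [0, 0.1), [0.1, 0.2), ..., with the last bin closed at 1."""
    counts, _ = np.histogram(pixels, bins=np.linspace(0.0, 1.0, 11))
    return counts / counts.sum()
```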
Figure 14: The grayscale distributions on the right correspond to the species selected on the left. The species of plankton are Copepod Calanoid, Echinoderm Larva Pluteus Early, Ctenophore Cydippid (no tentacles), and Jellies Tentacles.
For certain species of plankton, the distributions of grayscale values differ from each other,
while a few species have nearly identical distributions.
4 Classification
4.1 Random Forest
After extracting the features from the Canny & Sobel and Canny & Roberts edges, a random
forest model was fitted on the training set, with the number of trees varied from 500 to
6000. The model's performance was then evaluated on the test set.
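A minimal sketch of such a fit, with placeholder random data standing in for the extracted features (in the report, X would hold the 21 feature columns per training image and y the 121 class labels; the tree count is kept small here for speed):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 21))     # placeholder feature matrix
y = rng.integers(0, 4, size=200)   # placeholder labels (4 classes)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
proba = model.predict_proba(X)     # per-class probabilities for scoring
```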
5 Results & Discussion
The score for plankton classification was measured with a multi-class logarithmic loss. For
each image in the test set there is a true class label. The formula used is
\[ \text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij}), \]
where N is the number of images in the test set, M is the number of class labels, y_{ij} is
1 if image i belongs to class j and 0 otherwise, and p_{ij} is the predicted probability that
image i belongs to class j. The score is ”conveniently” evaluated on Kaggle’s website.
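The loss can be sketched in a few lines of numpy; uniform guessing over 121 classes reproduces the random-guess benchmark score:

```python
import numpy as np

def multiclass_logloss(y_true, proba, eps=1e-15):
    """y_true: integer class labels; proba: (N, M) predicted
    probabilities. Probabilities are clipped to avoid log(0)."""
    p = np.clip(proba, eps, 1.0)
    return float(-np.mean(np.log(p[np.arange(len(y_true)), y_true])))

# Uniform guessing gives -log(1/121) = log(121) ≈ 4.795791.
uniform = np.full((10, 121), 1.0 / 121)
labels = np.zeros(10, dtype=int)
score = multiclass_logloss(labels, uniform)
```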
If we were to randomly guess (with equal probability) the class of each image (1 of the
121 classes), we would obtain a benchmark score of 4.795791. Our random forest model with
the 21 features performed better than the benchmark score, and as we increased the number
of trees the score improved, converging around 1.77. The features extracted using the Canny
& Sobel edges performed better than those from the Canny & Roberts edges. We conclude
that the intensity values and geometric properties of plankton distinguish the different
species.
6 Future Work
The proper fitting of a classification model was not fully explored and should be. There is
plenty of room for improving the current feature extraction and for extracting additional
geometric features.
From our results we found that combinations of edge detection methods produced the
desired results. One possibility is to develop an algorithm that creates quality edges from
the combined Sobel and Canny.
We could focus on representing some of the extracted features more accurately, such as the
mean curvature. Our method simply fitted a second-degree polynomial; we could explore
other types of fitting, such as splines, to measure the curvature. We could also extract
additional geometric features described in A Survey of Shape Feature Extraction Techniques
[3].
There was no tuning of the random forest model; the only parameter we varied was the
number of trees. We could further explore how the model performs when other parameters
are changed, such as the number of nodes or the pruning of trees. Additionally, we could
fit a tree ensemble by boosting instead of bagging (which random forests use).
7 References
1. R. Maini and H. Aggarwal. Study and Comparison of Various Image Edge Detection Techniques. CSC Journals, vol. 3, pp. 1-60, 2009.
2. J. Canny. A Computational Approach to Edge Detection. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 8, pp. 679-697, 1986.
3. M. Yang, K. Kpalma, and J. Ronsin. A Survey of Shape Feature Extraction Techniques. In P.-Y. Yin (ed.), Pattern Recognition, IN-TECH, pp. 43-90, 2008.
4. C. Cheng, W. Liu, and H. Zhang. Image Retrieval Based on Region Shape Similarity. In 13th SPIE Symposium on Electronic Imaging, Storage, and Retrieval for Image and Video Databases, 2001.
5. M. Peura and J. Iivarinen. Efficiency of Simple Shape Descriptors. In Proc. 3rd International Workshop on Visual Form (IWVF3), May 1997.