
Efficient and Scalable MetaFeature-based Document Classification using Massively Parallel Computing

ABSTRACT

The unprecedented growth of available data has stimulated the development of new methods for organizing and extracting useful knowledge from this immense amount of data. Automatic Document Classification (ADC) is one such method: it uses machine learning techniques to build models capable of automatically associating documents with well-defined semantic classes. ADC is the basis of many important applications such as language identification, sentiment analysis, recommender systems, and spam filtering. Recently, the use of meta-features has been shown to substantially improve the effectiveness of ADC algorithms. In particular, meta-features that make combined use of local information (through kNN-based features) and global information (through category centroids) have produced promising results. However, the generation of these meta-features is very costly in terms of both memory consumption and runtime, since the kNN algorithm must be called constantly. We take advantage of current manycore GPU architectures and present a massively parallel version of the kNN algorithm for high-dimensional and sparse datasets (which is the case for ADC). Our experimental results show that we can obtain speedup gains of up to 15x while reducing memory consumption by more than 5000x when compared to a state-of-the-art parallel baseline. This opens up the possibility of applying meta-feature-based classification to large collections of documents that would otherwise take too much time or require an expensive computational platform.

Keywords

document classification, meta-features, parallelism

1. INTRODUCTION

Processing large amounts of data efficiently is critical for Information Retrieval (IR) and, as this amount of data grows, storing, indexing and searching costs rise, with penalties in response time. Parallel computing may represent an efficient solution for enhancing modern IR systems, but the challenge of designing new parallel algorithms for modern platforms such as manycore Graphical Processing Units (GPUs) has hampered the exploitation of this opportunity by the IR community.

In particular, similarity search is at the heart of many IR systems. It scores queries against documents, presenting the highest-scoring documents to the user in ranked order. The kNN algorithm is commonly used for this function, retrieving the k most similar documents for each query. Another very common application domain for kNN is Automatic Document Classification (ADC). In this case, the kNN algorithm is used to automatically map (classify) a new document d to a set of predefined classes given a set of labeled (training) documents, based on the similarities between d and each of the training documents. kNN has been shown to produce competitive results on several datasets [15]. Speeding up this type of algorithm with efficient parallel strategies, especially in applications where it is repeatedly applied as the core of other computations, opens huge opportunities.

One recent application which uses kNN intensively is the generation of meta-level features (or simply meta-features) for ADC [21]. Such meta-features capture local and global information about the likelihood of a document belonging to a class, which can then be exploited by a different classifier (e.g., SVM). More specifically, the meta-features capture: (i) the similarity between a test example and its nearest neighbor in each considered class, and (ii) the similarity of the test example to the classes' centroids¹. As shown in [21] and in our experiments, the use of meta-features can substantially improve the effectiveness of ADC algorithms.

However, the generation of meta-features is very costly in terms of both memory consumption and runtime: generating them requires constantly calling the kNN algorithm. kNN, however, is known to perform poorly (in execution time) in classification tasks compared to other (non-lazy) supervised methods, and is normally not the best choice for on-the-fly classification. Classification using kNN-based meta-features inherits this poor performance, since we need to generate meta-level features both for all training examples and for each new test sample before the actual classification.

¹ Notice that this also has to be applied to all training documents in an offline procedure.


For textual datasets (with large vocabularies), the performance problem is exacerbated, since the kNN algorithm has to run on high-dimensional data. In this scenario, kNN often requires large amounts of memory to represent all the training data and intensive computation to calculate the similarity between points. In fact, as we shall see, the generation of meta-features is not feasible with previous meta-feature generators [5, 21] for the larger datasets we experimented with.

In this paper we present a new GPU-based implementation of kNN specially designed for high-dimensional and sparse datasets (which is the case for ADC), allowing much faster and more scalable meta-feature generation and thus the application of this technique to large collections. The most interesting characteristics of our approach, compared to other state-of-the-art GPU-based proposals, include:

• most solutions use a brute-force approach (i.e., compare the query document to all training documents), which demands too much memory, while we exploit inverted indexes along with an efficient implementation to cope with GPU memory limitations;

• in order to scale, some solutions sacrifice effectiveness by using an approximate kNN solution (e.g., locality-sensitive hashing - LSH), while our proposal is an exact kNN solution which exploits state-of-the-art GPU sorting methods;

• most proposals produce good speedups only when considering many queries (i.e., multiple document classification), while our solution obtains improvements even when dealing with a single query (document); we achieve this through an effective load balancing approach within the GPU;

• some solutions require a multi-GPU approach to deal with large datasets, while ours deals with large datasets using a single GPU.

Our experimental results show that we can obtain speedup gains of up to 140x and 15x while reducing memory consumption by more than 8000x and 5000x when compared to a standard sequential implementation and to a state-of-the-art parallel baseline, respectively.

This paper is organized as follows. Section 2 covers related work. Section 3 introduces the use of meta-features in ADC. Section 4 provides a brief introduction to parallelism on GPUs. Section 5 describes our GPU-based implementation, specially designed for high-dimensional, sparse data. Section 6 presents an analysis of the complexity of the proposed solution. Section 7 presents our experimental evaluation, while Section 8 concludes the paper.

2. RELATED WORK

Several meta-features have been proposed to improve the effectiveness of machine learning methods. They can be based on ensembles of classifiers [2, 20], or derived from clustering methods [9, 14] or from the instance-based kNN method [5, 21].

Meta-features derived from ensembles exploit the probability distribution over all classes generated by each of the individual classifiers composing the ensemble [20]. In [2], other ensemble-based meta-features were also used, including the entropies of the class probability distributions and the maximum probability returned by each classifier. This scheme was found to perform better than using only probability distributions.

Clustering techniques may also be used to derive meta-features. In this case, the feature space is augmented with clusters derived from a previous clustering step considering both the labeled and unlabeled data [14, 9]. The idea is that clusters represent higher-level "concepts" in the feature space, and the features derived from the clusters indicate the similarity of each example to these concepts. In [14], the largest n clusters are chosen as representatives of the major concepts. Each cluster c contributes a set of meta-features such as, for instance, a binary feature indicating whether c is the closest of the n clusters to the example, or the similarity of the example to the cluster's centroid, among others. In [9], the number of clusters is chosen to be equal to the predefined number of classes, and each cluster corresponds to an additional meta-feature.

Recently, [5] reported good results by designing meta-features that make combined use of local information (through kNN-based features) and global information (through category centroids) in the training set. Although these meta-features are not created from an ensemble of classifiers, they differ from the cluster-derived meta-features presented above because they explicitly capture information from the labeled set.

Although the kNN algorithm can be applied broadly, it has some shortcomings. For large datasets (n) and high-dimensional spaces (d), its O(nd) complexity can easily become prohibitive. Moreover, if m successive queries are to be performed, the complexity further increases to O(mnd). Recently, some proposals have been presented to accelerate the kNN algorithm via highly multithreaded GPU-based approaches. The first, and most cited, GPU-based kNN implementation was proposed by Garcia et al. [3]. They used the brute-force approach and reported speedups of up to two orders of magnitude when compared to a brute-force CPU-based implementation. Their implementation assumes that multiple queries are performed and computes and stores a complete distance matrix, which makes it impracticable for large data (over 65,536 documents).

Following Garcia et al.'s work, Kuang and Zhao [8] implemented their own optimized matrix operations for calculating the distances, and used radix sort to find the top-k elements. Liang et al. [12] took advantage of CUDA Streams to overlap computation and communication (CPU/GPU) when dealing with several queries, and thus decreased the GPU memory requirements. The distances were computed in blocks and later merged, first locally and then globally, to find the top-k elements. However, such works can still be considered brute-force. Sismanis et al. [18] concentrated on the sorting phase of the brute-force kNN and provided an extensive comparison among parallel truncated sorts. They conclude that the truncated bitonic sort (TBiS) produces the best results.


Our proposal differs from the above-mentioned work in many aspects. First, it exploits a very efficient GPU implementation of inverted indexes which supports an exact kNN solution without relying on brute force. This also allows our solution to save a lot of memory, since the inverted index corresponds to a sparse representation of the data. In the distance calculation step, we resort to a smart load balancing among threads to increase parallelism. And in the sorting step, we exploit a GPU-based sorting procedure, which was shown to be superior to other partial sorting algorithms [18], in combination with a CPU merge operation based on a priority queue.

3. USE OF META-FEATURES FOR ADC

Here, we formally introduce the meta-features whose kNN-based calculation we intend to speed up and scale with our proposed massively parallel approach.

Let $\mathcal{X}$ and $\mathcal{C}$ denote the input (feature) and output (class) spaces, respectively. Let $D_{train} = \{(x_i, c_i) \in \mathcal{X} \times \mathcal{C}\}_{i=1}^{n}$ be the training set. Recall that the main goal of supervised classification is to learn a mapping function $h : \mathcal{X} \mapsto \mathcal{C}$ which is general enough to accurately classify examples $x' \notin D_{train}$.

The kNN-based meta-level features proposed in [5] are designed to replace the original input space $\mathcal{X}$ with a new, informative and compact input space $\mathcal{M}$. Each vector of meta-features $mf \in \mathcal{M}$ is expressed as the concatenation of the sub-vectors below, which are defined for each example $x_f \in \mathcal{X}$ and category $c_j \in \mathcal{C}$, for $j = 1, 2, \ldots, |\mathcal{C}|$, as:

• $\vec{v}^{\,cos}_{\vec{x}_f} = [\cos(\vec{x}_{ij}, \vec{x}_f)]$: a k-dimensional vector produced by considering the k nearest neighbors of class $c_j$ to the target vector $\vec{x}_f$, i.e., $\vec{x}_{ij}$ is the $i$th ($i \leq k$) nearest neighbor of $\vec{x}_f$, and $\cos(\vec{x}_{ij}, \vec{x}_f)$ is the cosine similarity between them. Thus, k meta-features are generated to represent $x_f$.

• $\vec{v}^{\,L1}_{\vec{x}_f} = [d_1(\vec{x}_{ij}, \vec{x}_f)]$: a k-dimensional vector whose elements $d_1(\vec{x}_{ij}, \vec{x}_f)$ denote the L1 distance between $\vec{x}_f$ and the $i$th nearest class-$c_j$ neighbor of $\vec{x}_f$ (i.e., $d_1(\vec{x}_{ij}, \vec{x}_f) = \|\vec{x}_{ij} - \vec{x}_f\|_1$).

• $\vec{v}^{\,L2}_{\vec{x}_f} = [d_2(\vec{x}_{ij}, \vec{x}_f)]$: a k-dimensional vector whose elements $d_2(\vec{x}_{ij}, \vec{x}_f)$ denote the L2 distance between $\vec{x}_f$ and the $i$th nearest class-$c_j$ neighbor of $\vec{x}_f$ (i.e., $d_2(\vec{x}_{ij}, \vec{x}_f) = \|\vec{x}_{ij} - \vec{x}_f\|_2$).

• $\vec{v}^{\,cent}_{\vec{x}_f} = [d_2(\vec{x}_j, \vec{x}_f), \cos(\vec{x}_j, \vec{x}_f)]$: a 2-dimensional vector where $\vec{x}_j$ is the centroid of $c_j$ (i.e., the vector average of all training examples of class $c_j$).

Considering k neighbors, the number of features in the meta-feature vector for $x_f$ is $(3k + 2)$ per category, for a total of $(3k + 2)|\mathcal{C}|$ over all categories. The size of this meta-level feature set is much smaller than that typically found in ADC tasks, while explicitly capturing class-discriminative information from the labeled set.
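As a concrete illustration of how such a vector could be assembled, the C++ sketch below (our own, not the authors' code) concatenates the per-class sub-vectors into the final $(3k + 2)|\mathcal{C}|$-dimensional meta-feature vector; the struct fields and function name are hypothetical.

#include <vector>

// One class's kNN results for a test document: k cosine similarities,
// k L1 distances, k L2 distances, plus the two centroid features.
struct ClassNeighbors {
    std::vector<float> cos;   // cos(x_ij, x_f) for the k nearest docs of class c_j
    std::vector<float> l1;    // d1(x_ij, x_f) for the same k neighbors
    std::vector<float> l2;    // d2(x_ij, x_f) for the same k neighbors
    float centL2;             // d2(x_j, x_f): L2 distance to the class centroid
    float centCos;            // cos(x_j, x_f): cosine to the class centroid
};

// Concatenate the sub-vectors of every class into one meta-feature vector.
std::vector<float> buildMetaFeatures(const std::vector<ClassNeighbors>& perClass, int k) {
    std::vector<float> mf;
    mf.reserve(perClass.size() * (3 * k + 2));           // (3k + 2) features per class
    for (const ClassNeighbors& c : perClass) {
        mf.insert(mf.end(), c.cos.begin(), c.cos.end()); // k cosine features
        mf.insert(mf.end(), c.l1.begin(), c.l1.end());   // k L1 features
        mf.insert(mf.end(), c.l2.begin(), c.l2.end());   // k L2 features
        mf.push_back(c.centL2);                          // 2 centroid features
        mf.push_back(c.centCos);
    }
    return mf;                                           // length (3k + 2)|C|
}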

4. PARALLELISM AND THE GPU

In the last few years, the focus of processor architectures has moved from increasing clock rates to increasing parallelism. Rather than speeding up individual processor cores, traditional CPUs are now virtually all multicore processors. In a similar fashion, manycore architectures like GPUs have concentrated on using simpler and slower cores, but in much larger counts, on the order of thousands of cores. The general perception is that processors are not getting faster, but instead are getting wider, with an ever-increasing number of cores. This has forced a renewed interest in parallelism as the only way to increase performance.

The high computational power and affordability of GPUs have led a growing number of researchers to use GPUs to handle massive amounts of data. While multicore CPUs are optimized for single-threaded performance, GPUs are optimized for throughput and massive multithreaded parallelism. As a result, GPUs deliver much better energy efficiency and achieve higher peak performance for throughput workloads. However, GPUs have a different architecture and memory organization, and fully exploiting their capabilities requires considerable parallelism (tens of thousands of threads) and an adequate use of the hardware resources. This imposes constraints on algorithm design, requiring novel solutions and new implementation approaches. Nevertheless, a few research groups and companies have faced this challenge with promising results in database scalability, document clustering, learning to rank, big data analytics and interactive visualization [1, 19, 17, 13, 6].

The GPU is an M-SIMD machine, that is, a Multiple SIMD (Single Instruction Multiple Data) processor. Each SIMD unit is known as a streaming multiprocessor (SM) and contains streaming processor (SP) cores. At any given clock cycle, each SP executes the same instruction, but operates on different data. The GPU supports thousands of lightweight concurrent threads and, unlike CPU threads, the overhead of creation and switching is negligible. The threads on each SM are organized into thread groups that share computation resources such as registers. A thread group is divided into multiple schedule units, called warps, that are dynamically scheduled on the SM. Because of the SIMD nature of the SPs' execution units, if threads in a schedule unit must perform different operations, such as taking different branches, these operations are executed serially rather than in parallel. Additionally, if a thread stalls on a memory operation, the entire warp is stalled until the memory access completes; in this case the SM selects another ready warp and switches to it. The GPU global memory is typically measured in gigabytes of capacity. It is an off-chip memory with both high bandwidth and high access latency. To hide this latency, it is important to have more threads than the number of SPs, and to have threads in a warp access consecutive memory addresses so that accesses can be coalesced. The GPU also provides a fast on-chip shared memory, accessible by all SPs of an SM. This memory is small, but it has low latency and can be used as a software-controlled cache. Moving data between the CPU and the GPU is done through a PCI Express connection.

The GPU programming model requires that part of the application run on the CPU while the computationally intensive part is accelerated by the GPU. The programmer has to modify the application to extract the compute-intensive kernels and map them to the GPU. A GPU program exposes parallelism through a data-parallel SPMD (Single Program Multiple Data) kernel function. During implementation, the programmer can configure the number of threads to be used. Threads execute data-parallel computations of the kernel and are organized in groups called thread blocks, which in turn are organized into a grid structure. When a kernel is launched, the blocks within a grid are distributed to idle SMs. Threads of a block are divided into warps, the schedule unit used by the SMs, and the GPU decides in which order and when to execute each warp. Threads that belong to different blocks cannot communicate explicitly and have to rely on the global memory to share their results. Threads within a thread block are executed by the SPs of a single SM and can communicate through the SM shared memory. Furthermore, each thread inside a block has its own registers and private local memory, and uses a global thread block index and a local thread index within the thread block to uniquely identify its data.
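To make the model concrete, the minimal CUDA sketch below (ours, not from the paper) launches a grid of thread blocks and shows how each thread combines its block index and thread index into the global index of the element it processes:

#include <cuda_runtime.h>

// Data-parallel SPMD kernel: every thread runs the same program on its element.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // grid may be larger than the data
        data[i] *= factor;
}

void scaleOnGpu(float* hostData, int n) {
    float* devData;
    cudaMalloc(&devData, n * sizeof(float));
    cudaMemcpy(devData, hostData, n * sizeof(float), cudaMemcpyHostToDevice);
    int threads = 256;                              // threads per block
    int blocks = (n + threads - 1) / threads;       // enough blocks to cover n elements
    scale<<<blocks, threads>>>(devData, 2.0f, n);   // kernel launch on the grid
    cudaMemcpy(hostData, devData, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(devData);
}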

5. GPU-BASED GENERATION OF META-FEATURES

The proposed parallel implementation, called GPU-based Textual kNN (GT-kNN), greatly improves the k nearest neighbor search in textual datasets. The solution efficiently implements an inverted index in the GPU, using a parallel counting operation followed by a parallel prefix-sum calculation, and takes advantage of Zipf's law, which states that in a textual corpus few terms are common while many of them are rare. This makes the inverted index a good choice for saving space and avoiding unnecessary calculations. At query time, this inverted index is used to quickly find the documents sharing terms with the query document. This is done by constructing a query index which is used in a load balancing strategy to evenly distribute the distance calculations among the GPU's threads. Finally, the k nearest neighbors are determined through a truncated bitonic sort, which avoids sorting all computed distances. Next we present a detailed description of these steps.

5.1 Creating the Inverted Index

The inverted index is created in the GPU memory, assuming the training dataset fits in memory and is static. Let V be the vocabulary of the training dataset, that is, the set of distinct terms in the training set. The input data is the set E of distinct term-document pairs (t, d) occurring in the original training dataset, with t ∈ V and d ∈ Dtrain. Each pair (t, d) ∈ E is initially associated with a term frequency tf, the number of times term t occurs in document d. An array of size |E| is used to store the inverted index. Once the set E has been moved to the GPU memory, each pair in it is examined in parallel, so that each time a term is visited the number of documents in which it appears (document frequency, df) is incremented and stored in the df array of size |V|. A parallel prefix sum is then executed on the df array, using the CUDPP library [16], mapping each element to the sum of all elements before it and storing the results in the index array. Thus, each element of the index array points to the position of the first corresponding element in the invertedIndex array, where all (t, d) pairs are stored ordered by term. Finally, the pairs (t, d) are processed in parallel and the term frequency-inverse document frequency tf-idf(t, d) of each pair is computed and stored, together with the document identifier, in the invertedIndex array, using the pointers provided by the index array. During this parallel processing, the norm of each training document, used in the computation of the cosine or Euclidean distance, is also computed and stored in the norms array. Algorithm 1 depicts the inverted index creation process.

Algorithm 1: CreateInvertedIndex(E)

input : term-document pairs in E[0 .. |E|-1].
output: df, index, norms, invertedIndex.

1  array of integers df[0 .. |V|-1]      // document-frequency array, initialized with zeros
2  array of integers index[0 .. |V|-1]
3  array of floats norms[0 .. |Dtrain|-1]
4  invertedIndex[0 .. |E|-1]             // the inverted index
5  Count the occurrences of each term of the input in parallel and accumulate them in df.
6  Perform an exclusive parallel prefix sum on df and store the result in index.
7  Access the pairs in E in parallel, with each processor performing the following tasks:
8  begin
9      Compute the tf-idf value of each pair.
10     Accumulate the square of the tf-idf value of a pair (t, d) in norms[d].
11     Store in invertedIndex the entries corresponding to pairs in E, according to index.
12 end
13 Compute in parallel the square root of the values in array norms.
14 Return the arrays df, index, norms and invertedIndex.

Figure 1 illustrates each step of the inverted index creation for a collection of five documents in which only five terms occur. Taking t2 as an example, the index array indicates that its inverted list (d2, d4) starts at position 3 of the invertedIndex array and finishes at position 4 (5 minus 1).

[Figure 1: Creating the inverted index. For the five-document example collection, the figure shows the array E of (term, document) entries, the per-term document-frequency counts in df, the prefix sums in index pointing to the first position of each term's inverted list, and the resulting invertedIndex array.]

5.2 Calculating the Distances

Once the inverted index has been created, it is possible to calculate the distances between a given query document q and the documents in Dtrain. The distance computation can take advantage of the inverted index because only the distances between the query q and those documents in Dtrain that have terms in common with q have to be computed. These documents correspond to the elements of the invertedIndex pointed to by the entries of the index array corresponding to the terms occurring in q.

The obvious solution for computing the distances is to distribute the terms of query q evenly among the processors and let each processor p access the inverted lists corresponding to the terms allocated to it. However, the distribution of terms over documents in text collections is known to approximately follow Zipf's law: few terms occur in a large number of documents and most terms occur in only a few documents. Consequently, the sizes of the inverted lists also vary according to Zipf's law, so distributing the workload according to the terms of q could cause a great imbalance of work among the processors.

In this paper, besides using an inverted index to speed up the computation of the distances, we also propose a load balancing method that distributes the documents evenly among the processors, so that each processor computes approximately the same number of distances. To facilitate the explanation of this method, suppose that we concatenate all the inverted lists corresponding to terms in q into a logical vector Eq = [0 .. |Eq|-1], where |Eq| is the sum of the sizes of all inverted lists of terms in q. Considering the example in Fig. 1 and supposing that q is composed of the terms t1, t3 and t4, the logical vector Eq would be formed by the following pairs of the inverted index: Eq = [(t1, d1), (t1, d3), (t1, d5), (t3, d1), (t3, d5), (t4, d1)], and |Eq| equals six.

Given a set of processors $\mathcal{P} = \{p_0, \cdots, p_{|\mathcal{P}|-1}\}$, the load balancing method should allocate the elements of Eq in intervals of approximately the same size, that is, each processor $p_i \in \mathcal{P}$ should process the elements of Eq in the interval $[\,i \lceil |E_q| / |\mathcal{P}| \rceil,\; \min((i+1) \lceil |E_q| / |\mathcal{P}| \rceil - 1,\; |E_q| - 1)\,]$. Consider the example stated above, and suppose that the set of processors is $\mathcal{P} = \{p_0, p_1, p_2\}$. The elements of Eq with indices in the interval [0, 1] would be assigned to p0, indices in [2, 3] would be processed by p1, and indices in [4, 5] would be processed by p2.
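A minimal host-side sketch of this interval assignment (ours, for illustration only) reproduces the example:

#include <algorithm>
#include <cstdio>

int main() {
    int eq = 6, p = 3;             // |Eq| = 6 logical positions, |P| = 3 processors
    int chunk = (eq + p - 1) / p;  // ceil(|Eq| / |P|)
    for (int i = 0; i < p; ++i)    // prints p0: [0, 1], p1: [2, 3], p2: [4, 5]
        std::printf("p%d: [%d, %d]\n", i, i * chunk,
                    std::min((i + 1) * chunk - 1, eq - 1));
}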

Since each processor knows the interval of indices of the logical vector Eq it has to process, all that is necessary to execute the load balancing is a mapping from the logical indices of Eq to the appropriate indices in the inverted index (array invertedIndex). In the case of the example associated with Fig. 1, the following mappings between logical indices and indices of the invertedIndex array must be performed: 0 → 0, 1 → 1, 2 → 2, 3 → 5, 4 → 6 and 5 → 7. Each processor executes the mapping for the indices in its interval and finds the corresponding elements in the invertedIndex array for which it has to compute the distances to the query.

Let Vq ⊂ V be the vocabulary of the query document q. The mapping proposed in this paper uses three auxiliary arrays: dfq[0 .. |Vq|-1], startq[0 .. |Vq|-1] and indexq[0 .. |Vq|-1]. The arrays dfq and startq are obtained together by copying, in parallel, df[ti] to dfq[i] and index[ti] to startq[i], respectively, for each term ti in the query q. Once dfq is obtained, an inclusive parallel prefix sum is performed on it and the results are stored in indexq.

Algorithm 2: DistanceCalculation(invertedIndex, q)

input : invertedIndex, df, index, query q[0 .. |Vq|-1].
output: distance array dist[0 .. |Dtrain|-1], initialized according to the distance function used.

1  array of integers dfq[0 .. |Vq|-1], initialized with zeros
2  array of integers indexq[0 .. |Vq|-1]
3  array of integers startq[0 .. |Vq|-1]
4  for each term ti ∈ q, in parallel do
5      dfq[i] = df[ti];
6      startq[i] = index[ti];
7  end
8  Perform an inclusive parallel prefix sum on dfq and store the results in indexq.
9  foreach processor pi ∈ P do
10     for x ∈ [i⌈|Eq|/|P|⌉, min((i+1)⌈|Eq|/|P|⌉ - 1, |Eq| - 1)] do
           // Map position x to the correct position indInvPos of the invertedIndex
11         pos = min(i : indexq[i] > x);
12         if pos = 0 then
13             p = 0; offset = x;
14         else
15             p = indexq[pos-1]; offset = x - p;
16         end
17         indInvPos = startq[pos] + offset;
18         use q[pos] and invertedIndex[indInvPos] in the partial computation of the distance between q and the document associated with invertedIndex[indInvPos];
19     end
20 end

Algorithm 2 shows the pseudo-code for the parallel computation of the distances between the documents in the training set and the query document. In lines 4-7, the arrays dfq and startq are obtained. In line 8, the array indexq is obtained by applying a parallel prefix sum to the array dfq. Next, each processor maps each position x in its interval of indices of Eq to the appropriate position of the invertedIndex; this mapping is described in lines 10-17 of the algorithm. The mapped entries of the inverted index are then used to compute the distances between each document associated with these entries and the query.
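The device function below is a sketch (ours, not the authors' kernel) of the mapping in lines 11-17: a logical position x in Eq is located via binary search over the ascending prefix sums in indexq and translated to its position in invertedIndex.

__host__ __device__ int mapToInvertedIndex(int x, const int* indexq,
                                           const int* startq, int numQueryTerms) {
    // line 11: pos = min(i : indexq[i] > x), found by binary search
    int lo = 0, hi = numQueryTerms - 1, pos = numQueryTerms - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (indexq[mid] > x) { pos = mid; hi = mid - 1; }
        else                 { lo = mid + 1; }
    }
    int offset = (pos == 0) ? x : x - indexq[pos - 1];  // lines 12-16
    return startq[pos] + offset;                        // line 17
}

For the query example below (indexq = [3, 5, 6], startq = [0, 5, 7]), this yields exactly the mapping 0 → 0, 1 → 1, 2 → 2, 3 → 5, 4 → 6 and 5 → 7 listed above.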

Figure 2 illustrates each step of Algorithm 2 for a query containing three terms, t1, t3 and t4, using the same collection presented in the example of Figure 1. Initially, the arrays dfq and startq are obtained by copying, in parallel, the entries of the arrays df and index corresponding to the three query terms. Next, a parallel prefix sum is applied to the array dfq, yielding the indexq array. Finally, the figure shows the mapping of each position of the logical array Eq to the corresponding position of the invertedIndex array.

5.3 Finding the k Nearest Neighbors

With the distances computed, it is necessary to obtain the k closest documents. This can be accomplished with a partial sorting algorithm on the array containing the distances, which is of size |Dtrain|. For this, we implemented a parallel version of the Truncated Bitonic Sort (TBiS), which was shown to be superior to other partial sorting algorithms in this context [18]. One advantage of the parallel TBiS is data independence: at each step, the algorithm distributes elements equally among the GPU's threads, avoiding synchronizations as well as memory access conflicts.


[Figure 2: Example of the execution of Algorithm 2 for a query with three terms (t1, t3 and t4). The figure shows the parallel copy of the query terms' entries from df and index into dfq and startq, the prefix sum producing indexq, and the mapping from the logical array Eq to the invertedIndex array.]

Although the partial bitonic sort is $O(|D_{train}| \log^2 k)$, worse than the best known algorithm, which is $O(|D_{train}| \log k)$, for a small k the extra $\log k$ factor becomes almost negligible; in ADC using kNN, the value of k is usually not greater than 50. Our parallel TBiS implementation also uses a reduction strategy, allowing each GPU block to act independently of the others on a partition of the array containing the computed distances. The results are then merged on the CPU using a priority queue.
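The final CPU merge can be sketched as follows (our own illustration; names are hypothetical): each GPU block returns its own k best candidates, and a bounded max-heap keeps the k smallest distances overall.

#include <algorithm>
#include <queue>
#include <utility>
#include <vector>

using Cand = std::pair<float, int>;  // (distance, document id)

std::vector<Cand> mergeTopK(const std::vector<std::vector<Cand>>& perBlock, int k) {
    std::priority_queue<Cand> heap;  // max-heap: worst of the kept candidates on top
    for (const auto& block : perBlock)
        for (const Cand& c : block) {
            if ((int)heap.size() < k) heap.push(c);
            else if (c.first < heap.top().first) { heap.pop(); heap.push(c); }
        }
    std::vector<Cand> result;        // pops largest-first; reverse to ascending order
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    std::reverse(result.begin(), result.end());
    return result;
}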

6. ANALYSIS OF THE SOLUTION

In this section we analyze the amount of time and memory used to construct the index and to compute the k nearest neighbors of a given query document q. The first step of the construction of the inverted index is to obtain the df array (line 5 of Algorithm 1). During this step, the set of input pairs E is read in parallel by all processors in $\mathcal{P}$ and, for each term, the corresponding document counter is incremented. This takes time $O(|E| / |\mathcal{P}|)$. The parallel prefix sum applied to the array df to obtain index takes time $O((|V| / |\mathcal{P}|) \log |V|)$ [16]. Next, the computation of tf-idf, the accumulation of the squared tf-idf values (to compose the norms of the documents), and the insertion of pairs in the invertedIndex (lines 9-11) are done by accessing the elements of E in parallel, thus taking time $O(|E| / |\mathcal{P}|)$. Finally, the square roots of the norms are computed in time $O(|D_{train}| / |\mathcal{P}|)$. The total time of the index construction is $O(|E| / |\mathcal{P}|) + O((|V| / |\mathcal{P}|) \log |V|) + O(|E| / |\mathcal{P}|) + O(|D_{train}| / |\mathcal{P}|)$. Since in real text collections $|E| > |D_{train}| > |V|$, we conclude that the time complexity of the index construction is $O(|E| / |\mathcal{P}|)$.

The computation of the distances between each training document and the query document q starts by obtaining the arrays dfq and startq (lines 4-7 of Algorithm 2). This step is executed in parallel by all processors in $\mathcal{P}$, so the two arrays are computed in time $O(|V_q| / |\mathcal{P}|)$. The array indexq results from the parallel prefix sum over the array dfq, and is thus obtained in time $O((|V_q| / |\mathcal{P}|) \log |V_q|)$.

Each processor $p_i \in \mathcal{P}$ executes the mapping of $|E_q| / |\mathcal{P}|$ positions of the logical array Eq. It is possible to estimate the value of $|E_q|$ in terms of the sizes of V and Vq. Remember that the logical array Eq represents the concatenation of all inverted lists of terms in the query document q, that is, $|V_q|$ inverted lists. Considering that the training collection follows Zipf's law, the probability of the occurrence of a term t with rank k is given by $k^{-s} / \sum_{i=1}^{|V|} \frac{1}{i^s}$, where s is the exponent characterizing the distribution. The term with the greatest probability is the term with rank 1, so its expected document frequency is $|E| / \sum_{i=1}^{|V|} \frac{1}{i^s}$. If we use the classic version of Zipf's law, the exponent s is 1, and the document frequency of this term is $|E| / \sum_{i=1}^{|V|} \frac{1}{i} \approx |E| / \ln |V|$. This value is an upper bound for the document frequency of each term in q. Thus, in the worst case, we have $|E_q| \approx |V_q| \, |E| / \ln |V|$. According to Heaps' law, the size of the vocabulary is $|V| = k|W|^{\beta}$, where k is a constant, usually in the range 10-100, W is the set formed by all occurrences of all terms in the collection, and $\beta$ is another constant in the range 0.4-0.6. The size of W can be taken as an upper bound for the size of the set of input pairs E. Thus $|E_q| = O\!\left(|V_q| \frac{|W|}{\log_\beta(k|W|^{\beta})}\right) = O\!\left(|V_q| \frac{|W|}{|W|^{\beta} \log_\beta k}\right) = O(|V_q|)$, and we conclude that each processor $p_i$ executes the mapping of $O(|V_q| / |\mathcal{P}|)$ positions.

Now we analyze the time to compute the mapping of a single position (lines 11-17 of Algorithm 2). The computation of the variable pos in line 11 can be performed in time $O(\log |V_q|)$, because the values in the array indexq are in ascending order and a binary search can be used to find the minimum index required. All the remaining operations (lines 12-18) take constant time, $O(1)$. The processing of each mapped pair of the inverted index, as part of the computation of the distance between q and the corresponding document, is also done in constant time. Thus, the execution time of one iteration of the inner loop (lines 10-18) is $O(\log |V_q|) + O(1) = O(\log |V_q|)$. Finally, the partial sort of the distances is computed in time $O(|D_{train}| \log k)$. Consequently, the overall execution time of Algorithm 2 is $O\!\left(\frac{|V_q|}{|\mathcal{P}|}\right) + O\!\left(\frac{|V_q|}{|\mathcal{P}|} \log |V_q|\right) + O\!\left(\frac{|V_q|}{|\mathcal{P}|}\right) O(\log |V_q|) + O(|D_{train}| \log k) = O\!\left(\frac{|V_q|}{|\mathcal{P}|} \log |V_q|\right) + O(|D_{train}| \log k)$.

The work of Garcia et al. [3] processes many query documents in parallel; however, each query q is compared to every document in the training set. Besides, the query and each document are represented as dense arrays of size |V|. Thus, the processing time of a query q is $O(|V| |D_{train}|) + O(|D_{train}| \log k)$.

When comparing the speedup of our solution over Garcia's algorithm, we do not take into consideration the time to sort the array of distances, since this task adds the same computing time to both solutions. As a consequence, the speedup achieved is $\frac{O(|V| |D_{train}|)}{O\!\left(\frac{|V_q|}{|\mathcal{P}|} \log |V_q|\right)}$. If we take |V| as an upper bound for $|V_q|$, the speedup obtained is:

$$speedup = O\!\left(\frac{|D_{train}| \, |\mathcal{P}|}{\log |V|}\right)$$

If we consider that the number of processors is constant (in one GPU), and that, according to Heaps' law, the number of new words in the vocabulary V does not grow much as the collection size increases, the speedup increases proportionally to the number of documents in the collection.
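As a rough illustration (our own arithmetic, ignoring the constants hidden by the O-notation), plugging the RCV1Uni magnitudes from Table 1 into this bound, $|D_{train}| \approx 8 \times 10^5$ and $|V| \approx 1.35 \times 10^5$, gives

$$\frac{|D_{train}| \, |\mathcal{P}|}{\ln |V|} \approx \frac{8 \times 10^5}{11.8} \, |\mathcal{P}| \approx 6.8 \times 10^4 \, |\mathcal{P}|,$$

so doubling the collection size roughly doubles the bound, while the denominator grows only logarithmically with the vocabulary.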

Considering memory space requirements, the proposed solution consumes 2|E| units of memory to store the arrays E and invertedIndex, 2|V| units to store the arrays df and index, $O(|V_q|)$ space to store the query-related arrays (dfq, startq and indexq), and $O(|D_{train}|)$ space to store the array containing the norms of the documents and the array containing the distances. Thus, the space complexity of the solution is $O(|E|) + O(|V|) + O(|V_q|) + O(|D_{train}|) = O(|E|) + O(|D_{train}|)$.

The solution presented by Garcia et al. [3] uses a matrix of dimensions $|V| \times |D_{train}|$ to store the training set and an array of size |V| to store the query q. Thus, the space complexity of their solution is $O(|V| |D_{train}|) + O(|D_{train}|)$. The ratio between the space used by Garcia's solution and ours is $O\!\left(\frac{|V| |D_{train}|}{|E|}\right)$, which corresponds to a measure of the sparsity of the matrix storing the training set in Garcia's solution.
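As an illustration (our own arithmetic, reading the Density column of Table 1 as the average number of terms per document), for MED we have $|V| = 803{,}358$ and $|E| \approx 31.8 \, |D_{train}|$, so the ratio is roughly

$$\frac{|V| \, |D_{train}|}{|E|} \approx \frac{803{,}358}{31.8} \approx 2.5 \times 10^4,$$

consistent in order of magnitude with the memory reductions of more than 5,000x reported in Section 7.2.3.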

7. EXPERIMENTAL EVALUATION

7.1 Experimental Setup

In order to evaluate the meta-feature strategies, we consider six real-world textual datasets, namely 20 Newsgroups, Four Universities, Reuters, ACM Digital Library, MEDLINE and RCV1. For all datasets, we performed a traditional preprocessing task: we removed stopwords, using the standard SMART list, and applied a simple feature selection by removing terms with low document frequency (DF)². Regarding term weighting, we used TF-IDF for both SVM and kNN. All datasets are single-label. In particular, in the case of RCV1, the original dataset is multi-label, with the multi-label cases needing special treatment, such as score thresholding (see [11] for details). As our current focus is on single-label tasks, to allow a fair comparison with the other datasets (which are also single-label) and all baselines (which also focus on single-label tasks), we decided to transform all multi-label cases into single-label ones. In order to do this fairly, we randomly selected, among all documents with more than one label, a single label to be attached to each document. This procedure was applied to the roughly 20% of the documents of RCV1 that happened to be multi-label. More details about the datasets are shown in Table 1.

Dataset   Classes  # attrib  # docs   Density  Size
4UNI            7    40,194    8,274  140.325   14MB
20NG           20    61,049   18,766  130.780   30MB
ACM            11    59,990   24,897   38.805  8.5MB
REUT90         90    19,589   13,327   78.164   13MB
MED             7   803,358  861,454   31.805  327MB
RCV1Uni       103   134,932  804,427   79.133  884MB

Table 1: General information on the datasets.

All experiments were run on an Intel i7-870, running at 2.93GHz, with 16GB of RAM. The GPU experiments were run on an NVIDIA Tesla K40, with 12GB of RAM. In order to account for the cost of all data transfers in our GPU-based algorithms, we report the wall time of the process execution in all efficiency experiments. To compare the average results of our cross-validation experiments, we assess the statistical significance of our results with a paired t-test with 95% confidence and Bonferroni correction to account for multiple tests. This test assures that the best results, marked in bold, are statistically superior to the others.

² We removed all terms that occur in less than six documents (i.e., DF < 6).

We compare the computation time to generate meta-features using three different algorithms: (1) GTkNN, our GPU-based implementation of kNN; (2) BF-CUDA, a brute-force kNN implementation using CUDA proposed by Garcia et al. [3]; and (3) ANN, a C++ library that supports exact and approximate nearest neighbor searching³. We use the exact version of ANN, since it was used in the previous meta-feature works [5, 21]. We chose BF-CUDA because it is the main representative of the GPU-based brute-force approach. The other implementations mentioned in Section 2 (some not available for download) also work with a complete distance matrix and would produce similar results.

We also conducted controlled experiments to evaluate the effectiveness of classifiers learned with three different sets of features: (1) Bag of Words, a set containing only the original features, i.e., TF-IDF weights of the documents' terms; (2) Meta-features, the set of state-of-the-art meta-features recently proposed in the literature and described in Section 3; and (3) Bag + Meta-features, the combination of the above sets. The effectiveness of the feature sets was compared using two standard text categorization measures: micro-averaged F1 (MicroF1) and macro-averaged F1 (MacroF1), which capture distinct aspects of the ADC task [11]. To evaluate the performance of the different groups of features, we adopted the LIBLINEAR implementation of the SVM classifier. The regularization parameter was chosen by 5-fold cross-validation on the training set. For the size of the neighborhood used to generate the kNN-based meta-features, we adopted k = 30 in all experiments, since it was empirically demonstrated to be the best parameter for text classification [7].

We would like to point out that the results obtained on some datasets, with and without the meta-features, may differ from the ones reported in other works for the same datasets (e.g., [10, 4]). Such discrepancies may be due to several factors, such as differences in dataset preparation⁴, the use of different splits of the datasets (e.g., some datasets have "default splits", such as REUT and RCV1⁵), and the application of score thresholding, such as SCUT or PCUT, which, besides being an important step for multi-label problems, also affects classification performance by minimizing class imbalance effects, among other factors.

³ http://www.cs.umd.edu/~mount/ANN/
⁴ For instance, some works do exploit complex feature weighting schemes or feature selection mechanisms that favor some algorithms in detriment of others.
⁵ We believe that running experiments only on the default splits is not the best experimental procedure, as it does not allow a proper statistical treatment of the results.


                Meta-features                 Bag of Words (SVM)            Bag of Words (kNN)
Dataset         MacF1         MicF1           MacF1         MicF1           MacF1         MicF1
4UNI            62.50 ± 2.27  76.52 ± 1.44    54.55 ± 1.64  70.18 ± 0.77    52.34 ± 1.77  68.25 ± 1.61
20NG            89.26 ± 0.23  89.59 ± 0.33    87.08 ± 0.33  87.34 ± 0.45    83.89 ± 0.72  84.40 ± 0.96
ACM             63.83 ± 2.05  76.03 ± 0.27    53.62 ± 1.12  67.61 ± 0.53    57.11 ± 1.64  70.71 ± 0.42
REUT90          38.96 ± 1.04  77.13 ± 1.04    29.13 ± 2.03  65.92 ± 0.78    30.01 ± 1.03  65.64 ± 1.68
MED             74.33 ± 0.17  83.55 ± 0.07    75.15 ± 0.18  85.65 ± 0.07    58.66 ± 0.50  86.05 ± 0.30
RCV1Uni         55.77 ± 0.92  77.41 ± 0.21    55.32 ± 0.66  78.28 ± 0.12    46.32 ± 0.72  68.23 ± 0.16

Table 2: MicroF1 and MacroF1 of different sets of features.

We would like to stress that we ran all alternatives under the same conditions on all datasets, using the best traditional feature weighting scheme (TF-IDF), using standardized and well-accepted cross-validation procedures that optimize parameters for each alternative, and applying the proper statistical tools for the analysis of the results. All our datasets are available for others to replicate our results and test different configurations.

7.2 Experimental Results

7.2.1 Effectiveness

We start by demonstrating the effectiveness of the meta-features. As shown in Table 2, on most datasets the use of the traditional bag-of-words was statistically worse than meta-features, justifying meta-features as a replacement for the original high-dimensional feature space, as demonstrated in previous work. Most of the results of SVM on the original space (bag-of-words) are superior to or tied with kNN.

The only datasets on which the effectiveness of meta-features was not better than that of bag-of-words were MED and RCV1Uni. We hypothesize that this is because these datasets have a large training set, which allows the classification method (SVM) to deal better with the high-dimensional data, since there are enough training examples to learn discriminative patterns from more dimensions.

Although bag-of-words achieved good results on MED and RCV1Uni, the combination Bag + Meta-features proposed in this work achieved the best results on all datasets but REUT90, as shown in Table 3. This demonstrates the complementarity between the Bag and Meta-features on these datasets. REUT90 was the only case in which Bag + Meta-features was worse than the meta-features alone. Since this dataset has only a few training examples per class, the inclusion of more noisy features (from the bag-of-words) only makes it more difficult to find an effective SVM model.

7.2.2 Computational Time to Generate Meta-features

Table 4 shows the average time to generate meta-features for a batch of test examples using our kNN implementation and the baselines'. Since we use 5-fold cross-validation, we measure the time to generate meta-features for all the examples in each test fold. Notice that the time to generate meta-features is practically the average time to classify a fold, since the time to classify the test examples with the SVM implementation is negligible.

As can be seen in Table 4, the generation of meta-features using GTkNN shows significant speedups in relation to the other kNN implementations. In particular, the speedups for the small datasets range from 4.8 to 141.3 in relation to the ANN implementation, used in previous works to generate meta-features. This high speedup was somewhat expected, since ANN does not exploit parallelism to compute the distances. However, even when compared to the parallel BF-CUDA implementation [3], GTkNN was able to achieve speedups ranging from 3.6 to 15.7. This was possible mainly because BF-CUDA neither optimizes the distance calculations to deal with the low density of terms present in textual documents nor tries to balance the load among threads.

GTkNN produced the best speedups on 4UNI, 20NG and ACM, but obtained a lower speedup on REUT90. This may be due to the fact that REUT90 has a large number of classes and only a few documents in each class. Since the meta-features are generated one class at a time, we could not exploit the parallelism to its full potential. For example, if there are only 10 training documents in a class, we can only perform at most 10 simultaneous distance calculations, leaving most CUDA cores idle.

[Figure 3: Time (in seconds) to generate meta-features for one example with different training sample sizes from the MED dataset, for BF-CUDA, ANN and GTkNN. GTkNN keeps a very low execution time (up to 0.005 seconds), while the other two approaches slow down dramatically as the training dataset grows in size.]

GTkNN was the only implementation able to generate meta-features for the larger datasets, MED and RCV1Uni. In fact, ANN and BF-CUDA are extremely slow in this context. Figure 3 shows the time to generate meta-features for a single test example considering sample sizes of 100 up to 2,000 documents from MED.


                Meta-features                 Bag + Meta-features
Dataset         MacF1         MicF1           MacF1         MicF1
4UNI            62.50 ± 2.27  76.52 ± 1.44    62.93 ± 2.03  75.38 ± 1.76
20NG            89.26 ± 0.23  89.59 ± 0.33    90.11 ± 0.30  90.36 ± 0.43
ACM             63.83 ± 2.05  76.03 ± 0.27    63.58 ± 1.15  76.77 ± 0.24
REUT90          38.96 ± 1.04  77.13 ± 1.04    37.36 ± 1.31  74.98 ± 0.71
MED             74.33 ± 0.17  83.55 ± 0.07    79.90 ± 0.20  87.43 ± 0.08
RCV1Uni         55.77 ± 0.92  77.41 ± 0.21    57.21 ± 0.32  78.92 ± 0.16

Table 3: MicroF1 and MacroF1 of the meta-features and the combination of meta-features and bag-of-words.

                Execution Time                              Speedup
Dataset         GTkNN        BF-CUDA      ANN               BF-CUDA  ANN
4UNI            40 ± 1       259 ± 46     1590 ± 29         6.4      39.6
20NG            187 ± 4      2004 ± 17    10947 ± 1323      10.7     68.7
ACM             112 ± 3      1760 ± 91    13589 ± 1539      15.7     141.3
REUT90          625 ± 12     2242 ± 5     3024 ± 303        3.6      4.8
MED             4637 ± 43    *            *                 *        *
RCV1Uni         33884 ± 111  *            *                 *        *

Table 4: Average time in seconds (and 95% confidence interval) to generate meta-features using different kNN strategies. GTkNN is significantly better than the others and makes the generation of meta-features possible for MED and RCV1Uni.

                Time to Generate Meta-features               Speedup
Dataset         GTkNN          BF-CUDA        ANN            BF-CUDA  ANN
4UNI            0.012 ± 0.003  0.316 ± 0.001  0.534 ± 0.233  26.3     45.3
20NG            0.033 ± 0.007  1.112 ± 0.002  1.974 ± 0.642  33.7     59.8
ACM             0.016 ± 0.002  1.012 ± 0.003  2.084 ± 0.769  63.2     130.2
REUT90          0.139 ± 0.028  1.369 ± 0.001  0.524 ± 0.156  9.8      3.8
MED             0.027 ± 0.002  *              *              *        *
RCV1Uni         0.191 ± 0.023  *              *              *        *

Table 5: Average execution time in seconds (and standard deviations) to classify each test example.

Since this dataset has more than 800,000 features, even a small fraction of it (e.g., 2,000 documents, or 0.3%) requires significant time (2.5 and 1.5 seconds for ANN and BF-CUDA, respectively) to generate meta-features. In contrast, GTkNN takes no more than 0.005 seconds to generate the meta-features. This discrepancy is expected, since the calculations involved in dealing with such high dimensionality (greater than 800,000) are huge. The sequential nature of ANN and the lack of support for data sparsity make both ANN and BF-CUDA less competitive in this situation.

The average time to generate meta-features for a single test example is presented in Table 5. In this experiment, we use a leave-one-out methodology, where a single example is chosen as the test example, leaving all others as training examples. GTkNN never takes more than 0.2 seconds to generate the meta-features for a test example, making it possible to classify examples on the fly even with very large training datasets. The most costly examples are in REUT90 and RCV1Uni, due to the fact that GTkNN cannot fully exploit the parallelism with multiple classes in the current implementation, as described before.

Although the ANN speedups are similar in Tables 5 and 4, we notice that the BF-CUDA speedups in Table 5 are much higher than the ones previously presented in Table 4. This happens because the BF-CUDA implementation is optimized to perform multiple queries in parallel, which matches the needs of generating meta-features for multiple examples. However, when a single test example has to be processed, meta-feature generation with the BF-CUDA implementation slows down considerably.

7.2.3 Memory Consumption

For textual datasets, the traditional data representation using a D x V matrix, with D training documents and V features (the vocabulary of the collection), is not a good choice, as the density of words in each document is very low. The proposed GTkNN approach represents textual data in a very compact way by exploiting an efficient in-memory GPU-based implementation of an inverted index, allowing us to store only the statistics of the words present in each document.

Table 6 shows the memory consumption required to generate meta-features using the three kNN implementations. The numbers correspond to the peak memory consumption during the meta-feature generation process. For MED and RCV1Uni, an estimate of the memory consumption of ANN and BF-CUDA, based on the allocation of their data structures, was made, since these implementations are not capable of processing these large datasets.

                Memory Consumption
Dataset         GTkNN    BF-CUDA    ANN
4UNI            92       1697       945
20NG            93       1257       395
ACM             90       2541       2487
REUT90          90       909        494
MED             339      1859104    2857048
RCV1Uni         120      43245      69328

Table 6: Memory consumption in megabytes.

Our GTkNN implementation stands out far ahead of the competition when it comes to memory usage. The most impressive memory reduction occurs for MED. Although it is the second largest dataset, MED has only a few classes. Since our meta-feature generation strategy is performed one class at a time, a great number of documents has to be processed for each class. Thus MED consumes the largest amount of memory (339 MB) with our implementation, but the baselines ANN and BF-CUDA consume more than 5,000x and 8,000x this space, respectively, for the same dataset. We also obtain a large reduction in memory demand for RCV1, but not as pronounced as for MED, due to its large number of classes. In general, for all kNN implementations, the peak memory usage depends on the class with the most training examples.

8. CONCLUSION

The use of meta-features in automatic document classification has permitted important improvements in the effectiveness of classification algorithms. One of these meta-feature approaches relies on intensive use of the kNN algorithm in order to exploit local information regarding the neighborhood of training documents. However, the intensive use of kNN, combined with the high dimensionality and sparsity of textual data, makes this a challenging computational task. We have presented a very fast and scalable GPU-based approach for computing kNN-based meta-features for document classification. Unlike other GPU-based kNN implementations, we avoid comparing the query document with all training documents. Instead, we build an inverted index in the GPU that is used to quickly find the documents sharing terms with the query document. Although the index does not allow regular and predictable access to the data, we use a load balancing strategy to evenly distribute the computation among thousands of threads in the GPU. After calculating the distances, we again use a massive number of threads to select the k smallest distances, by implementing a truncated bitonic sort in the GPU followed by a merge operation in the CPU. We tested our approach in a very memory-demanding and time-consuming task which requires intensive and recurrent execution of kNN. Our results show very significant gains in speedup and memory consumption when compared to our baselines. In fact, running the baselines on the largest datasets proved to be infeasible, stressing the value of our contribution. And even in scenarios where the dataset is too big for our implementation (a single GPU), our approach can easily be extended by splitting the dataset, executing the kNN search in each part, and merging the partial results. Thus, meta-feature-based classification can be applied to huge collections of documents in reasonable time and without requiring expensive machines. As future work, we intend to approach other tasks and exploit multi-GPU platforms to improve our solutions even further.

9. REFERENCES

[1] Y.-S. Chang, R.-K. Sheu, S.-M. Yuan, and J.-J. Hsu. Scaling database performance on GPUs. Information Systems Frontiers, 14(4):909-924, 2012.
[2] S. Dzeroski and B. Zenko. Is combining classifiers with stacking better than selecting the best one? Machine Learning, 54(3):255-273, 2004.
[3] V. Garcia, E. Debreuve, and M. Barlaud. Fast k nearest neighbor search using GPU. In CVPR Workshops, pages 1-6, 2008.
[4] S. Godbole and S. Sarawagi. Discriminative methods for multi-labeled classification. In Proc. PAKDD, pages 22-30, 2004.
[5] S. Gopal and Y. Yang. Multilabel classification with meta-level features. In Proc. SIGIR, pages 315-322, 2010.
[6] T. Graham. A GPU database for real-time big data analytics and interactive visualization (Map-D). In NVIDIA GTC - GPU Technology Conference, 2014.
[7] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proc. ECML, pages 137-142, 1998.
[8] Q. Kuang and L. Zhao. A practical GPU-based kNN algorithm. In Proc. ISCSCT, pages 151-155, 2009.
[9] A. Kyriakopoulou and T. Kalamboukis. Using clustering to enhance text classification. In Proc. SIGIR, pages 805-806, 2007.
[10] M. Lan, C.-L. Tan, and H.-B. Low. Proposing a new term weighting scheme for text categorization. In Proc. AAAI, pages 763-768, 2006.
[11] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361-397, 2004.
[12] S. Liang, C. Wang, Y. Liu, and L. Jian. CUKNN: A parallel implementation of k-nearest neighbor on CUDA-enabled GPU. In IEEE YC-ICT '09, pages 415-418, 2009.
[13] O. Netzer. Getting big data done on a GPU-based database (SQream). In NVIDIA GTC - GPU Technology Conference, 2014.
[14] B. Raskutti, H. L. Ferra, and A. Kowalczyk. Using unlabelled data for text classification through addition of cluster parameters. In Proc. ICML, pages 514-521, 2002.
[15] F. Sebastiani. Machine learning in automated text categorization. ACM Comput. Surv., 34(1):1-47, 2002.
[16] S. Sengupta, M. Harris, M. Garland, and J. D. Owens. Efficient parallel scan algorithms for many-core GPUs. Scientific Computing with Multicore and Accelerators, pages 413-442, 2011.
[17] A. Shchekalev. Using GPUs to accelerate learning to rank (Yandex). In NVIDIA GTC - GPU Technology Conference, 2014.
[18] N. Sismanis, N. Pitsianis, and X. Sun. Parallel search of k-nearest neighbors with synchronous operations. In HPEC, pages 1-6. IEEE, 2012.
[19] B. E. Teitler, J. Sankaranarayanan, H. Samet, and M. D. Adelfio. Online document clustering using GPUs. In New Trends in Databases and Information Systems, pages 245-254. Springer, 2014.
[20] K. M. Ting and I. H. Witten. Issues in stacked generalization. J. Artif. Int. Res., 10(1):271-289, 1999.
[21] Y. Yang and S. Gopal. Multilabel classification with meta-level features in a learning-to-rank framework. Mach. Learn., 88:47-68, 2012.