
Hierarchical Multi-label Text Classification: An Attention-based Recurrent Network Approach

Wei Huang1, Enhong Chen1,∗, Qi Liu1, Yuying Chen1,2, Zai Huang1, Yang Liu1, Zhou Zhao3, Dan Zhang4, Shijin Wang4

1School of Computer Science and Technology, University of Science and Technology of China, {cheneh,qiliuql}@ustc.edu.cn, {ustc0411,cyy33222,huangzai,ly0330}@mail.ustc.edu.cn

2Ant Financial Services Group, yuying.cyy@antfin.com
3School of Computer Science and Technology, Zhejiang University, [email protected]

4iFLYTEK Research, {danzhang, sjwang3}@iflytek.com

ABSTRACT
Hierarchical multi-label text classification (HMTC) is a fundamental but challenging task in numerous applications (e.g., patent annotation), where documents are assigned to multiple categories stored in a hierarchical structure. Categories at different levels of a document tend to have dependencies. However, the majority of prior studies for the HMTC task employ classifiers that either deal with all categories simultaneously or decompose the original problem into a set of flat multi-label classification subproblems, ignoring the associations between texts and the hierarchical structure as well as the dependencies among different levels of the hierarchical structure. To that end, in this paper, we propose a novel framework called Hierarchical Attention-based Recurrent Neural Network (HARNN) for classifying documents into the most relevant categories level by level via integrating texts and the hierarchical category structure. Specifically, we first apply a documentation representing layer to obtain the representations of texts and the hierarchical structure. Then, we develop a hierarchical attention-based recurrent layer to model the dependencies among different levels of the hierarchical structure in a top-down fashion. Here, a hierarchical attention strategy is proposed to capture the associations between texts and the hierarchical structure. Finally, we design a hybrid method which is capable of predicting the categories of each level while classifying all categories in the entire hierarchical structure precisely. Extensive experimental results on two real-world datasets demonstrate the effectiveness and explanatory power of HARNN.

CCS CONCEPTS
• Computing methodologies → Neural networks; Natural language processing;

KEYWORDS
Hierarchical Multi-label Text Classification; Attention Mechanism; Hierarchical Attention Networks

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CIKM '19, November 3–7, 2019, Beijing, China
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6976-3/19/11...$15.00
https://doi.org/10.1145/3357384.3357885

ACM Reference Format:
Wei Huang1, Enhong Chen1,∗, Qi Liu1, Yuying Chen1,2, Zai Huang1, Yang Liu1, Zhou Zhao3, Dan Zhang4, Shijin Wang4. 2019. Hierarchical Multi-label Text Classification: An Attention-based Recurrent Network Approach. In The 28th ACM International Conference on Information and Knowledge Management (CIKM '19), November 3–7, 2019, Beijing, China. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3357384.3357885

1 INTRODUCTION
Many real-world applications organize documents in a hierarchical structure, where classes are specialized into subclasses or grouped into superclasses [2, 32]. For example, an electronic document (e.g., web-page, patent or e-mail) is associated with multiple categories, and all these categories are stored hierarchically in a tree or a Directed Acyclic Graph (DAG) [16]. Figure 1 shows a toy example of a patent document with a four-level hierarchical category structure: the top part is a detailed text description of the patent, and its corresponding hierarchical categories, represented as a tree, are shown below. The root node is considered as level 0, and the parent nodes represent more general concepts than the child nodes. The hierarchical structure is an elegant way to show the characteristics of the data and offers a multi-dimensional perspective for tackling the classification problem. Thus, this kind of problem, known as hierarchical multi-label text classification (HMTC), has attracted widespread attention in both industry and academia [11, 14, 22, 25].

In the literature, there have been many efforts toward the HMTC problem. Initially, flat-based methods (e.g., Naive Bayes) were proposed to predict only the categories of the last level by reducing the HMTC problem to a flat multi-label problem [12]. Unfortunately, these simple approaches ignore the hierarchical category structure information (e.g., nuclear physics is a subclass of physics in Figure 1). To tackle this problem, some works have taken the hierarchical structure into consideration [2, 6, 10, 24, 27]; they can be further divided into two approaches: (1) Local approaches generate hierarchical classifiers, where each classifier is responsible for predicting either the corresponding categories [2] or the corresponding category levels [27]; (2) Global approaches gather all the levels of categories together and predict them with a single classifier [10]. However, these studies focus only on either the local regions or the overall structure of the category hierarchy, while ignoring the dependencies among different levels of the hierarchical structure.

In summary, there are still many unique challenges inherent in designing an effective HMTC solution. First, in a text classification task, different parts of the document are associated with different


Figure 1: A toy example of a patent document in the HMTC problem. (The figure shows a patent text excerpt — "A gas injection system of a nuclear power plant including a reactor and a gaseous waste disposal system, comprises a hydrogen injection unit for injecting hydrogen into a reactor of a nuclear power plant, an oxygen injection unit for …" — together with its hierarchical categories: level 1: Chemistry | Physics; level 2: Inorganic Chemistry | Nuclear Physics; level 3: Non-metallic Elements | Radioactive Sources | Nuclear Reactors | ….)

categories. For instance, in Figure 1, the "red" words of document D focus more on physics while the "blue" words concentrate more on chemistry in C1 (the level-1 category), and the words on the "red" underline further describe nuclear physics in C2 (the level-2 category). Therefore, when understanding the semantics of each document, it is necessary to capture these associations between texts and the hierarchical structure. Second, there are dependencies among different levels in the hierarchical category structure (i.e., the determination of a category not only is influenced by its parent category but also will affect its child categories). As shown in Figure 1, document D concentrates on nuclear physics in C2 since its parent category physics in C1 is in focus. And nuclear reactors in C3 is a subclass of nuclear physics, which should also be highly heeded. Thus, it is also critical to classify level by level while considering the dependencies among different levels of the hierarchical structure. Third, when assigning a document to hierarchical categories, not only the local regions but also the overall structure of the category hierarchy should be considered. Hence, how to predict the categories of each level while classifying all categories in the entire hierarchical structure precisely is a nontrivial problem.

To address the challenges mentioned above, in this paper, we propose a novel framework named Hierarchical Attention-based Recurrent Neural Network (HARNN), which can automatically annotate a document with the most relevant categories level by level. Specifically, given the documents and the whole hierarchical category structure, we first apply a Documentation Representing Layer (DRL) to obtain the representations of texts and the hierarchical structure. Then, we devise a Hierarchical Attention-based Recurrent Layer (HARL) to model the dependencies among different levels by leveraging the hierarchical structure gradually in a top-down fashion. To be specific, at each level, the attention mechanism quantifies the contribution of each text word to each category, and the text-category association information will affect the next category level. Throughout the recurrence, the Hierarchical Attention-based Memory (HAM) unit we designed can model the dependencies among different levels of the hierarchical structure. After that, we utilize a Hybrid Predicting Layer (HPL) to combine the predictions of each level in the category hierarchy with the prediction over the overall hierarchical structure. Thus, the proposed model is capable of predicting the categories of each level while classifying all categories in the entire hierarchical structure precisely. Finally, extensive experiments on two large-scale real-world datasets demonstrate the effectiveness and explanatory power of our model. To the best of our knowledge, this is the first comprehensive attempt to employ hierarchical attentive neural networks for the HMTC problem.

2 RELATED WORK
Generally, the related works can be grouped into two research aspects, i.e., studies on hierarchical multi-label text classification and the attention mechanism.

2.1 Studies on HMTC
There have been several efforts toward HMTC in the literature. Initially, flat-based methods were widely used for the HMTC task, such as Decision Tree (DT) and Naive Bayes (NB) [12], which learned discriminant classifiers only for the categories of the last level in the category hierarchy. However, the hierarchical structure information was ignored in these approaches, which limited their effectiveness. Thus, some hierarchical approaches were proposed that take the hierarchical structure information into consideration; they can be divided into local and global approaches according to the strategy adopted. Regarding local approaches, Cesa-Bianchi et al. [8] proposed a classification method using hierarchical SVM, where SVM learning was applied to a category only if its parent category was labeled as positive. Afterwards, a large margin method was proposed by Rousu et al. [26] to calculate the maximum structural margin for the output categories. As for global approaches, Vens et al. [31] developed a tree-based approach called Clus-HMC, which employs a single decision tree to deal with the entire hierarchical category structure. Recently, neural networks have been utilized for HMTC problems and have shown their effectiveness. Cerri et al. [7] attempted to use HMC-LMLP to incrementally train a set of neural networks, each being responsible for predicting the categories in a given level. Borges et al. [4] proposed a global approach based on a competitive artificial neural network to predict all categories in the hierarchical structure.

Nevertheless, the above works mainly focus on either the local regions or the overall structure of the category hierarchy. Moreover, they neglect the dependencies among the different levels of the hierarchical structure, which leads to error propagation and the well-known class-membership inconsistency [28]. Along this line, methods combining the advantages of local and global approaches have been applied [33] in many domains (e.g., text classification, protein function prediction), where a neural network, HMCN, was proposed to integrate the predictions of each level in the category hierarchy with the prediction over the overall hierarchical structure. Unfortunately, they failed to capture the associations between texts and the hierarchical structure, which are important for understanding the semantics of each document.

2.2 Attention Mechanism
In our framework, one of the most significant steps is to reveal the associations between texts and each category in the hierarchical structure from top to bottom and level by level, which is related to the attention mechanism [37]. In text classification, the attention mechanism is a powerful approach to highlight different parts of the text semantic representation by assigning different weights. For example, Huang et al. [18] used an attention mechanism on top of a CNN model to introduce an extra source of information for guiding the extraction of the sentence embedding. Tao et al. [30] utilized an attention method which created a weighted vector representation by using the encodings of the RNN hidden states. Lin et al. [20]


designed a self-attention mechanism for sequential models (e.g., RNNs) to replace the max pooling or averaging step, which enabled their model to pay attention to different aspects of the sentence.

In the aforementioned works, the attention weights are usually calculated from the correspondence between texts and each category at particular levels, which treats the different levels of the hierarchical structure independently and thereby ignores the dependencies among the different levels. In this work, we propose a novel hierarchical attention recurrent structure to capture the associations between texts and each category gradually from top to bottom, which integrates the dependencies among different levels. Specifically, in the hierarchical attention mechanism, the attention weights between texts and each category at a level not only are influenced by its previous level but also will affect the next level.

3 PRELIMINARIES
3.1 Problem Definition
In the HMTC problem, there is a set of documents (e.g., patents) and each document contains the text description and its expected categories, which are organized in a hierarchical structure. Before defining the HMTC problem, we start with a detailed description of the hierarchical structure and documents.

Definition 3.1. (Hierarchical Structure γ). Given the defined possible categories in H hierarchical levels C = (C^1, C^2, ..., C^H), where C^i = {c_1, c_2, ...} ∈ {0, 1}^{|C^i|} is the set of possible categories in the i-th hierarchical level, |C^i| is the number of categories in the i-th hierarchical level, and K is the total number of categories. We define the hierarchical category structure γ over C as a partially ordered set (C, ≺), where ≺ is a partial order representing the PARENT-OF relationship, which is asymmetric, anti-reflexive and transitive [34]:
• ∀ c_x ∈ C^i, c_y ∈ C^j, if c_x ≺ c_y then c_y ⊀ c_x;
• ∀ c_x ∈ C^i, c_x ⊀ c_x;
• ∀ c_x ∈ C^i, c_y ∈ C^j, c_z ∈ C^k, if c_x ≺ c_y and c_y ≺ c_z then c_x ≺ c_z.

A set of M documents with expected hierarchical categories can be denoted as X = {(D_1, L_1), (D_2, L_2), ..., (D_M, L_M)}, where D_i = {w_1, w_2, ..., w_N} can usually be represented as a sequence of N words, and L_i = {ℓ^1, ℓ^2, ..., ℓ^H} is the set of expected hierarchical categories assigned to the document D_i, with ℓ^i ⊂ C^i in the corresponding hierarchical structure γ. Then, the HMTC problem can be formulated as:

Definition 3.2. (HMTC Problem). Given a set of documents and the corresponding hierarchical category structure, our goal is to integrate the document texts D and the corresponding hierarchical category structure γ to learn a classification model Ω, which can be used to predict the hierarchical categories L for documents:

Ω(D, γ, Θ) → L,    (1)

where Θ denotes the parameters of Ω.
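To make the notation concrete, the following minimal Python sketch (illustrative only, not from the paper) shows one way the inputs of Definitions 3.1 and 3.2 can be organized, using two levels of the toy hierarchy in Figure 1; it also checks the class-membership consistency implied by the PARENT-OF relation.

```python
# A minimal, illustrative sketch (not from the paper) of the HMTC inputs.

# Level 1 and level 2 category sets C^1, C^2 of the toy hierarchy in Figure 1.
categories = [["Chemistry", "Physics"],
              ["Inorganic Chemistry", "Nuclear Physics"]]

# PARENT-OF edges (child, parent), i.e. child ≺ parent.
parent_of = {("Nuclear Physics", "Physics"),
             ("Inorganic Chemistry", "Chemistry")}

# One document D_i as a word sequence, with per-level multi-hot labels ℓ^1, ℓ^2.
document = ["a", "gas", "injection", "system", "of", "a", "nuclear", "power", "plant"]
labels = [[0, 1],   # level 1: Physics
          [0, 1]]   # level 2: Nuclear Physics

def consistent(labels, categories, parent_of):
    """Check class-membership consistency: every positive category's parent is also positive."""
    positives = {categories[h][i]
                 for h, level in enumerate(labels)
                 for i, y in enumerate(level) if y == 1}
    return all(parent in positives
               for child, parent in parent_of if child in positives)

print(consistent(labels, categories, parent_of))  # True
```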

4 MODEL ARCHITECTURE
In this section, we introduce the technical details of the HARNN framework. As shown in Figure 2, HARNN mainly contains three parts, i.e., the Documentation Representing Layer (DRL), the Hierarchical Attention-based Recurrent Layer (HARL) and the Hybrid Predicting Layer (HPL). Specifically, we utilize DRL to obtain the unified representation of each document text and the hierarchical category structure. Then, we design HARL to model the dependencies among different levels by capturing the associations between texts and each category of the hierarchical structure in a top-down fashion. Finally, HPL is applied to predict the hierarchical categories of documents.

4.1 Documentation Representing Layer
In the first stage of HARNN, DRL aims to generate the unified representation of the document text and the hierarchical category structure. To this end, we first apply an Embedding Layer to encode the text and the hierarchical category structure. Then a Bi-LSTM Layer is utilized to enhance the encodings of the text semantic representation.

Embedding Layer is used to encode the text and the hierarchical category structure. DRL receives the text tokens of a document D and the hierarchical category structure γ as input, as shown in Figure 2. Intuitively, the document text D is formalized as a sequence of N words D = (w_1, w_2, ..., w_N), where w_i ∈ R^k is initialized by a k-dimensional pre-trained word embedding with Word2vec [23]. Like the Word2vec operation on text tokens, we embed the hierarchical category structure γ into a matrix S = (S^1, S^2, ..., S^H) for training, where S^i ∈ R^{|C^i|×d_a} is a randomly initialized matrix which represents the embedding of the i-th hierarchical category level with dimension d_a. After the initialization from the Embedding Layer, we apply the subsequent Bi-LSTM Layer to enhance the semantic representation of the document text.

Bi-LSTM Layer targets at enhancing the encodings of the text semantic representation. For the text tokens in D, we utilize a Bi-LSTM architecture, which is a variant of the traditional LSTM architecture [15, 17]. Bi-LSTM can learn not only long-range dependencies across the input sequence but also context information from the forward and backward directions simultaneously, which is beneficial for enhancing the encodings of the text semantic representation. Specifically, the input of the Bi-LSTM network is a sequence D = (w_1, w_2, ..., w_N), and the hidden vectors of the Bi-LSTM are calculated as follows:

→h_n = LSTM(→h_{n−1}, w_n),
←h_n = LSTM(←h_{n+1}, w_n),
h_n = [→h_n, ←h_n],    (2)

where →h_n and ←h_n are the forward hidden vector and the backward hidden vector respectively at the n-th word step in the Bi-LSTM, and h_n ∈ R^{2u} is the hidden output of the Bi-LSTM at the n-th word step, which is the concatenation of →h_n and ←h_n, where u is the number of hidden units of each unidirectional LSTM.

For simplicity, we denote all h_n as V = {h_1, h_2, ..., h_N} ∈ R^{N×2u}. After that, to obtain the whole semantic representation of the document D, we exploit the word-wise average pooling operation to merge the N words' contextual representations into an average embedding V̄ = avg(h_1, h_2, ..., h_N) ∈ R^{2u}.
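The following PyTorch sketch illustrates DRL under the assumptions above (pre-trained k-dimensional word vectors, a Bi-LSTM with u hidden units per direction, and randomly initialized category embeddings S^h); it is a simplified illustration with placeholder names and sizes, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DocumentationRepresentingLayer(nn.Module):
    """Sketch of DRL: word embedding + Bi-LSTM (Eq. 2) + word-wise average pooling."""

    def __init__(self, vocab_size, level_sizes, k=100, u=256, da=200):
        super().__init__()
        # Word embeddings; in the paper these are initialized from pre-trained Word2vec vectors.
        self.word_emb = nn.Embedding(vocab_size, k)
        # Randomly initialized category embeddings S^h of shape (|C^h|, d_a), one per level.
        self.level_emb = nn.ParameterList(
            [nn.Parameter(torch.randn(c, da)) for c in level_sizes])
        # Bi-LSTM with u hidden units per direction, so each h_n is 2u-dimensional.
        self.bilstm = nn.LSTM(k, u, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, N) indices of the N words of each document D.
        w = self.word_emb(token_ids)       # (batch, N, k)
        V, _ = self.bilstm(w)              # (batch, N, 2u): concatenated forward/backward states
        V_bar = V.mean(dim=1)              # (batch, 2u): V̄ = avg(h_1, ..., h_N)
        return V, V_bar, list(self.level_emb)
```

For example, `DocumentationRepresentingLayer(vocab_size=30000, level_sizes=[8, 32, 128, 512])` (arbitrary placeholder sizes) returns the word-level matrix V ∈ R^{N×2u}, the pooled vector V̄, and the per-level category embeddings consumed by HARL below.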


Figure 2: The HARNN framework takes a given document text D and the corresponding hierarchical category structure γ as inputs, and outputs the hierarchical categories L for the given document. (The figure depicts the Documentation Representing Layer with the embedding and Bi-LSTM components, the Hierarchical Attention-based Recurrent Layer built from HAM units applied from level 1 to level H starting at the ROOT, and the Hybrid Predicting Layer combining the local predictions P^1_L, ..., P^H_L, weighted by α and 1−α, with the global prediction P_G into the final prediction P_F.)

4.2 Hierarchical Attention-based Recurrent Layer
After obtaining the unified representation of the text and the hierarchical category structure, we propose HARL, a recurrent architecture, to model the dependencies among different levels by leveraging the hierarchical structure gradually in a top-down fashion. At each category level, the core repeating unit, i.e., the Hierarchical Attention-based Memory (HAM) unit, is explicitly designed to capture the associations between texts and categories. Meanwhile, it can transfer the corresponding hierarchical semantic representation to the next layer. This hierarchical semantic representation is consistent with the human reading habit that people usually understand the concept of a document from shallow to deep. As shown in Figure 1, the "red" words in D pay more attention to the physics concept in C1 (the level-1 category) while the words on the "red" underline focus more on the nuclear physics concept in C2. Obviously, nuclear physics is a subclass of physics. Therefore, it is necessary to quantify the contributions of the text semantic representation to the category hierarchy from top to bottom and learn the attention representations for each category level.

Figure 3 shows the details of an HAM unit. An HAM unit mainly consists of three components, i.e., Text-Category Attention (TCA), the Class Prediction Module (CPM) and the Class Dependency Module (CDM). Specifically, TCA first captures the associations between texts and the categories of the hierarchical structure. Based on the text-category associations, CPM generates the unified representation and predictions of the corresponding category level. Next, CDM is utilized to model the dependencies among different levels of the hierarchical structure. For the h-th category level in γ, the input to the corresponding HAM unit includes three parts, i.e., the whole text semantic representation V, the corresponding category level representation S^h, and the information of the previous level ω^{h−1}.

Figure 3: Hierarchical attention-based memory unit. (The figure shows one HAM unit: TCA takes ω^{h−1}, V and S^h; CPM takes the averaged text representation concatenated with r^h_att and outputs P^h_L; CDM combines W^h_att and P^h_L to produce ω^h.)

The unified representation A^h_L and the information ω^h at the h-th level are updated with the following formulas:

V̄ = avg(V),
r^h_att, W^h_att = TCA(ω^{h−1}, V, S^h),
P^h_L, A^h_L = CPM([V̄ ⊕ r^h_att]),
ω^h = CDM(W^h_att, P^h_L),    (3)

where V is the whole text semantic representation, r^h_att is the associated text-category representation of the h-th level, W^h_att is the text-category attention matrix at the h-th level, P^h_L denotes the predicted probability vector of the h-th level, ⊕ denotes the vector concatenation operation and avg(·) is the average pooling operation. Note that all elements in ω^0 are initialized as 1.0 since level 0 is the root of the hierarchical structure, which contains no extra information.
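Putting Eq. (3) into code form, a sketch of the top-down recurrence might look as follows; the TCA, CPM and CDM callables are assumed to follow the interfaces described in the rest of this section, and this is an illustration rather than the reference implementation.

```python
import torch

def harl_forward(V, level_embs, tca, cpm, cdm):
    """Sketch of HARL (Eq. 3): apply one HAM unit per level, from level 1 down to level H.

    V: (batch, N, 2u) word-level text representation from DRL.
    level_embs: list of category embeddings S^1, ..., S^H.
    tca, cpm, cdm: callables implementing the three HAM components.
    """
    batch, N, dim = V.shape
    V_bar = V.mean(dim=1)                          # V̄ = avg(V)
    omega = V.new_ones(batch, N, dim)              # ω^0: all ones at the root level
    level_probs, level_reps = [], []
    for S_h in level_embs:                         # h = 1, ..., H
        r_att, W_att = tca(omega, V, S_h)          # text-category attention
        P_h, A_h = cpm(torch.cat([V_bar, r_att], dim=-1))  # level-h prediction
        omega = cdm(W_att, P_h)                    # dependency signal passed to level h+1
        level_probs.append(P_h)
        level_reps.append(A_h)
    return level_probs, level_reps
```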

Next, we introduce each component of the HAM unit.

Text-Category Attention aims to capture the associations between texts and categories, and outputs the associated text-category representation r^h_att and the text-category attention matrix W^h_att at the h-th category level. As shown in Figure 4, given the information of the previous category level ω^{h−1} ∈ R^{N×2u}, we integrate it with the whole text semantic representation V ∈ R^{N×2u} to generate V_h:

V_h = ω^{h−1} ⊗ V,    (4)

where V_h ∈ R^{N×2u} denotes the representation which introduces the hierarchical information of the previous category level, and ⊗ denotes the entry-wise product operation.

Inspired by [20], the document text is usually formed by multiple components which focus on different aspects, especially for long sentences. For instance, in Figure 1, the "red" words of document D focus more on physics while the "blue" words concentrate more on chemistry. Thus, for the h-th category level, we need |C^h| aspects, each focusing on a different category, to represent the overall semantics of the whole sentence. We use S^h ∈ R^{|C^h|×d_a}, the embedding of the h-th category level, to perform |C^h| different categories of attention. Formally,

O^h = tanh(W^h_s · V_h^T),
W^h_att = softmax(S^h · O^h),    (5)

where W^h_s ∈ R^{d_a×2u} is a randomly initialized weight matrix, O^h denotes the activations, which can be viewed as one MLP without bias, d_a is a hyperparameter we can set arbitrarily, and softmax(·) ensures that all the computed weights sum up to 1 for each category. W^h_att = (W^h_1, W^h_2, ..., W^h_{|C^h|}) ∈ R^{|C^h|×N} is the annotation matrix, where W^h_i ∈ R^N represents the normalized attention scores of the text with respect to the i-th category at the h-th level, and each element of this vector represents the contribution of the word token at the corresponding position to this i-th category.

Then, we compute |C^h| weighted sums by multiplying the annotation matrix W^h_att and the text semantic representation V_h. The resulting matrix M_h ∈ R^{|C^h|×2u} is the associated text-category representation for each category at the h-th level. And r^h_att ∈ R^{2u}, which is the associated text-category representation for the whole h-th level, can be modeled as a vector by averaging M_h over the category dimension:

M_h = W^h_att · V_h,
r^h_att = avg(M_h).    (6)
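As an illustration, Eqs. (4)–(6) could be implemented roughly as below (a sketch under the shapes defined above, not the authors' code; sharing the weight matrix across levels is a simplification made here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCategoryAttention(nn.Module):
    """Sketch of TCA (Eqs. 4-6): per-category attention over the word representations."""

    def __init__(self, u, da):
        super().__init__()
        # Plays the role of W^h_s ∈ R^{d_a x 2u} (one per level in the paper; shared here).
        self.W_s = nn.Linear(2 * u, da, bias=False)

    def forward(self, omega_prev, V, S_h):
        # omega_prev, V: (batch, N, 2u); S_h: (|C^h|, d_a)
        V_h = omega_prev * V                              # Eq. (4): entry-wise product
        O_h = torch.tanh(self.W_s(V_h))                   # (batch, N, d_a)
        scores = torch.einsum('cd,bnd->bcn', S_h, O_h)    # S^h · O^h
        W_att = F.softmax(scores, dim=-1)                 # Eq. (5): weights over the N words
        M_h = torch.bmm(W_att, V_h)                       # Eq. (6): (batch, |C^h|, 2u)
        r_att = M_h.mean(dim=1)                           # average over categories -> (batch, 2u)
        return r_att, W_att
```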

With the help of Text-Category Attention, we can get the associated text-category representation r^h_att and the annotation matrix W^h_att for the h-th category level.

Class Prediction Module aims at generating the unified representation and predicting the categories for each level by integrating the original text semantic representation and the associated text-category representation, which introduces the information of its previous level. Formally, let A^h_L denote the representation of the h-th level; A^h_L is given by:

A^h_L = φ(W^h_T · [V̄ ⊕ r^h_att] + b^h_T),    (7)
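Eq. (7) can be sketched as below; since the definitions of φ, W^h_T and b^h_T continue beyond this excerpt, a ReLU activation and a sigmoid multi-label head are assumed here purely for illustration.

```python
import torch
import torch.nn as nn

class ClassPredictionModule(nn.Module):
    """Sketch of CPM (Eq. 7): fuse V̄ with r^h_att and predict the level-h categories."""

    def __init__(self, u, hidden_dim, num_classes_h):
        super().__init__()
        self.proj = nn.Linear(4 * u, hidden_dim)        # W^h_T, b^h_T on the 4u-dim input [V̄ ⊕ r^h_att]
        self.classifier = nn.Linear(hidden_dim, num_classes_h)

    def forward(self, fused):
        # fused: (batch, 4u) concatenation of V̄ (2u) and r^h_att (2u)
        A_h = torch.relu(self.proj(fused))              # level representation A^h_L (φ assumed as ReLU)
        P_h = torch.sigmoid(self.classifier(A_h))       # predicted probability vector P^h_L
        return P_h, A_h
```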

(Figure 4, referenced above, illustrates the Text-Category Attention component: the entry-wise product of ω^{h−1} and V, the tanh and softmax operations that yield the attention matrix W^h_att from the hidden matrix, the transpose, and the avg_pool step that produces r^h_att.)

Oh<latexit sha1_base64="sQ0JrTzxDuGm5PXWEz12FkrdSNM=">AAAB7HicbVC7SgNBFL0bXzG+ooKNzWAQrMJuLLQMsbEzATcJxCXMTmaTIbMzy8ysEJZ8g42FIrZ2/oVfYGfjtzh5FJp44MLhnHu5954w4Uwb1/1yciura+sb+c3C1vbO7l5x/6CpZaoI9YnkUrVDrClngvqGGU7biaI4DjlthcOrid+6p0ozKW7NKKFBjPuCRYxgYyX/ppsNxt1iyS27U6Bl4s1JqXrU+GbvtY96t/h515MkjakwhGOtO56bmCDDyjDC6bhwl2qaYDLEfdqxVOCY6iCbHjtGp1bpoUgqW8Kgqfp7IsOx1qM4tJ0xNgO96E3E/7xOaqLLIGMiSQ0VZLYoSjkyEk0+Rz2mKDF8ZAkmitlbERlghYmx+RRsCN7iy8ukWSl75+VKw6ZRgxnycAwncAYeXEAVrqEOPhBg8ABP8OwI59F5cV5nrTlnPnMIf+C8/QDFB5Jb</latexit>

Whs

<latexit sha1_base64="LrVwad3stC+XY6n8dRzyDO79Wwk=">AAAB8HicbVDLSgNBEOz1GeMr6tHLkCDkFHbjIR6DXjxGMA9JYpidzCZDZmaXmVkhLPsVXgQV8eqfePUm+jFOHgdNLGgoqrrp7vIjzrRx3U9nZXVtfWMzs5Xd3tnd288dHDZ0GCtC6yTkoWr5WFPOJK0bZjhtRYpi4XPa9EcXE795R5Vmobw244h2BR5IFjCCjZVumr1Ep7fJMO3lCm7JnQItE29OCtV88fur8v5Y6+U+Ov2QxIJKQzjWuu25kekmWBlGOE2znVjTCJMRHtC2pRILqrvJ9OAUnVilj4JQ2ZIGTdXfEwkWWo+FbzsFNkO96E3E/7x2bIKzbsJkFBsqyWxREHNkQjT5HvWZosTwsSWYKGZvRWSIFSbGZpS1IXiLLy+TRrnknZbKVzaNc5ghA8eQhyJ4UIEqXEIN6kBAwD08wbOjnAfnxXmdta4485kj+APn7Qdz5ZSP</latexit>

Wh

att<latexit sha1_base64="y38c8ma/GyLSFV9mbjhARMJbcC8=">AAAB8nicbVDLSsNAFJ3UV62vqks3Q4vQVUnqoi6LblxWsA9IY5lMJ+3QySTM3Agl5DPcKCji1h9x6070Y5w+Ftp64MLhnHu59x4/FlyDbX9aubX1jc2t/HZhZ3dv/6B4eNTWUaIoa9FIRKrrE80El6wFHATrxoqR0Bes448vp37njinNI3kDk5h5IRlKHnBKwEhup58SgOw2HWX9Ytmu2jPgVeIsSLlRqnx/1d8fm/3iR28Q0SRkEqggWruOHYOXEgWcCpYVeolmMaFjMmSuoZKETHvp7OQMnxplgINImZKAZ+rviZSEWk9C33SGBEZ62ZuK/3luAsG5l3IZJ8AknS8KEoEhwtP/8YArRkFMDCFUcXMrpiOiCAWTUsGE4Cy/vEratapzVq1dmzQu0Bx5dIJKqIIcVEcNdIWaqIUoitA9ekLPFlgP1ov1Om/NWYuZY/QH1tsPCguVeQ==</latexit>

Ch

<latexit sha1_base64="FVaSI6URf3O6j1qU/kwPqJj59yY=">AAACAXicbVC7SgNBFL3rM8bXqo1gMxgEbcJuLLQMprGMYB6QXcPsZDYZMvtgZlaIa2z8AFs/QAQLRWz9Czv/xtkkhSYeuHDmnHuZe48XcyaVZX0bc/MLi0vLuZX86tr6xqa5tV2XUSIIrZGIR6LpYUk5C2lNMcVpMxYUBx6nDa9fyfzGNRWSReGlGsTUDXA3ZD4jWGmpbe46nPoK3SJUuUp7Q+QI1u1l77ZZsIrWCGiW2BNSKB893MS558dq2/xyOhFJAhoqwrGULduKlZtioRjhdJh3EkljTPq4S1uahjig0k1HFwzRgVY6yI+ErlChkfp7IsWBlIPA050BVj057WXif14rUf6pm7IwThQNyfgjP+FIRSiLA3WYoETxgSaYCKZ3RaSHBSZKh5bXIdjTJ8+SeqloHxdLFzqNMxgjB3uwD4dgwwmU4RyqUAMCd/AEr/Bm3BsvxrvxMW6dMyYzO/AHxucPEiWY/Q==</latexit>

Ch

<latexit sha1_base64="FVaSI6URf3O6j1qU/kwPqJj59yY=">AAACAXicbVC7SgNBFL3rM8bXqo1gMxgEbcJuLLQMprGMYB6QXcPsZDYZMvtgZlaIa2z8AFs/QAQLRWz9Czv/xtkkhSYeuHDmnHuZe48XcyaVZX0bc/MLi0vLuZX86tr6xqa5tV2XUSIIrZGIR6LpYUk5C2lNMcVpMxYUBx6nDa9fyfzGNRWSReGlGsTUDXA3ZD4jWGmpbe46nPoK3SJUuUp7Q+QI1u1l77ZZsIrWCGiW2BNSKB893MS558dq2/xyOhFJAhoqwrGULduKlZtioRjhdJh3EkljTPq4S1uahjig0k1HFwzRgVY6yI+ErlChkfp7IsWBlIPA050BVj057WXif14rUf6pm7IwThQNyfgjP+FIRSiLA3WYoETxgSaYCKZ3RaSHBSZKh5bXIdjTJ8+SeqloHxdLFzqNMxgjB3uwD4dgwwmU4RyqUAMCd/AEr/Bm3BsvxrvxMW6dMyYzO/AHxucPEiWY/Q==</latexit>

Ch

<latexit sha1_base64="FVaSI6URf3O6j1qU/kwPqJj59yY=">AAACAXicbVC7SgNBFL3rM8bXqo1gMxgEbcJuLLQMprGMYB6QXcPsZDYZMvtgZlaIa2z8AFs/QAQLRWz9Czv/xtkkhSYeuHDmnHuZe48XcyaVZX0bc/MLi0vLuZX86tr6xqa5tV2XUSIIrZGIR6LpYUk5C2lNMcVpMxYUBx6nDa9fyfzGNRWSReGlGsTUDXA3ZD4jWGmpbe46nPoK3SJUuUp7Q+QI1u1l77ZZsIrWCGiW2BNSKB893MS558dq2/xyOhFJAhoqwrGULduKlZtioRjhdJh3EkljTPq4S1uahjig0k1HFwzRgVY6yI+ErlChkfp7IsWBlIPA050BVj057WXif14rUf6pm7IwThQNyfgjP+FIRSiLA3WYoETxgSaYCKZ3RaSHBSZKh5bXIdjTJ8+SeqloHxdLFzqNMxgjB3uwD4dgwwmU4RyqUAMCd/AEr/Bm3BsvxrvxMW6dMyYzO/AHxucPEiWY/Q==</latexit>

N<latexit sha1_base64="POWSqR0d1EaeCwyLHgwMNxEECA0=">AAAB6HicbZC7SgNBFIbPxltcb1FLm8EgWIXdWGgjBm2sJAFzgWQJs5OzyZjZCzOzQgh5AhsLRWz1YextxLdxcik08YeBj/8/hznn+IngSjvOt5VZWl5ZXcuu2xubW9s7ud29mopTybDKYhHLhk8VCh5hVXMtsJFIpKEvsO73r8Z5/R6l4nF0qwcJeiHtRjzgjGpjVW7aubxTcCYii+DOIH/xYZ8n7192uZ37bHViloYYaSaoUk3XSbQ3pFJzJnBkt1KFCWV92sWmwYiGqLzhZNAROTJOhwSxNC/SZOL+7hjSUKlB6JvKkOqems/G5n9ZM9XBmTfkUZJqjNj0oyAVRMdkvDXpcIlMi4EByiQ3sxLWo5IybW5jmyO48ysvQq1YcE8KxYqTL13CVFk4gEM4BhdOoQTXUIYqMEB4gCd4tu6sR+vFep2WZqxZzz78kfX2Awd1kBM=</latexit>

Figure 4: Text-Category Attention.

where $W^h_T \in \mathbb{R}^{v \times 4u}$ is a randomly initialized weight matrix, $b^h_T \in \mathbb{R}^{v \times 1}$ is the corresponding bias vector, and $\phi$ is a non-linear activation function (e.g., ReLU). Then, the local prediction of the $h$-th category level, $P^h_L$, is calculated by:

$$P^h_L = \sigma(W^h_L \cdot A^h_L + b^h_L), \qquad (8)$$

where $W^h_L \in \mathbb{R}^{|C^h| \times v}$ is the weight matrix that connects the activations of the $h$-th level with $|C^h|$ output units, $b^h_L \in \mathbb{R}^{|C^h| \times 1}$ is the corresponding bias vector, and $\sigma$ is the sigmoid activation.
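For concreteness, the per-level prediction in Eq. (8) can be written as a few lines of NumPy; the shapes follow the definitions above, while the function and variable names are illustrative and not taken from the authors' released implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_prediction(A_h, W_L, b_L):
    """Eq. (8): local prediction for the h-th category level.

    A_h : (v,)        unified representation A^h_L of the h-th level
    W_L : (|C^h|, v)  weight matrix connecting the level activations to |C^h| outputs
    b_L : (|C^h|, 1)  bias vector
    Returns a (|C^h|,) vector of per-category probabilities P^h_L.
    """
    return sigmoid(W_L @ A_h + b_L.ravel())
```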

Class Dependency Module is utilized to model the dependencies between different levels of the hierarchical structure by keeping the hierarchical information of each level. For the $h$-th category level, different categories make different contributions to the prediction, which can be used as a trade-off parameter for revising the text-category attention matrix. Therefore, we first utilize an entry-wise product operation to combine the predicted category values $P^h_L$ and the text-category attention matrix $W^h_{att}$ to generate the weighted text-category attention matrix $K^h$:

$$K^h = broadcast(P^h_L) \otimes W^h_{att}, \qquad (9)$$

where $K^h = (k^h_1, \ldots, k^h_{|C^h|}) \in \mathbb{R}^{|C^h| \times N}$ is the weighted text-category attention matrix of the $h$-th level, reflecting that different categories should receive different weights, $k^h_i$ denotes the weighted attention score of the $i$-th category with respect to the text semantic representation at the $h$-th level, and $broadcast(\cdot)$ is the process of making matrices with different shapes compatible for arithmetic operations (i.e., the entry-wise product).

Then we exploit the category-wise average pooling operation to merge the $|C^h|$ categories into an average representation $\overline{K}^h$:

$$\overline{K}^h = avg(K^h), \qquad (10)$$

where $\overline{K}^h \in \mathbb{R}^N$ is the weighted attention vector of the $h$-th level, which holds the attention associations of the whole category level with the text.

Next, we broadcast the average representation $\overline{K}^h$ into $\omega^h$:

$$\omega^h = broadcast(\overline{K}^h), \qquad (11)$$

where $\omega^h = (\omega^h_1, \omega^h_2, \ldots, \omega^h_N) \in \mathbb{R}^{N \times 2u}$ is a matrix holding the hierarchy information and the associations between the text and the whole category level at the $h$-th level, and $\omega^h_i \in \mathbb{R}^{2u}$ measures the association weights between the whole previous category level and the $i$-th word token in $D$.

Finally, we bring $\omega^h$ to the next category level, since the information of each category level is not only influenced by the previous level but also affects the next level.

Figure 5: Number distributions of observed records. (Panels: (a) Education, (b) Patent; histograms of the number of documents over the number of words per document and over the number of categories per document.)
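The three operations of the Class Dependency Module (Eqs. 9–11) can be summarized with a minimal NumPy sketch. Here, the broadcast in Eq. (11) is read as replicating the per-word weights along the 2u word-representation dimension; this reading, and all names below, are assumptions for illustration rather than the authors' exact implementation.

```python
import numpy as np

def class_dependency_module(P_L, W_att, u):
    """Sketch of Eqs. (9)-(11): carry level-h information to level h+1.

    P_L   : (|C^h|,)    local predictions P^h_L of the h-th level (Eq. 8)
    W_att : (|C^h|, N)  text-category attention matrix W^h_att
    u     : Bi-LSTM hidden size (word representations are 2u-dimensional)
    """
    # Eq. (9): weight each category's attention row by its predicted probability.
    K = P_L[:, None] * W_att                            # (|C^h|, N)
    # Eq. (10): category-wise average pooling over the |C^h| categories.
    K_bar = K.mean(axis=0)                              # (N,)
    # Eq. (11): broadcast the per-word weights to the word-representation dimension.
    omega = np.repeat(K_bar[:, None], 2 * u, axis=1)    # (N, 2u), fed to the next level
    return omega
```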

4.3 Hybrid Predicting Layer

Throughout HARL, we can obtain the unified representation $A^h_L$ and the predictions $P^h_L$ of each category level. It is important to predict all categories in the entire hierarchy, and predicting the categories of each level in the HMTC problem is also critical. Thus, we apply a Hybrid Predicting Layer, a hybrid prediction method that generates the final predictions by considering both local and global prediction information. Specifically, we denote all the $A^h_L$ as $A_L = \{A^1_L, A^2_L, \ldots, A^H_L\} \in \mathbb{R}^{H \times v}$, and then exploit the level-wise average pooling operation to merge the $H$ level representations into a global embedding $\overline{A}_L = avg(A^1_L, A^2_L, \ldots, A^H_L) \in \mathbb{R}^v$. Let $P_G$ denote the global predictions of the entire category hierarchy; $P_G$ is given by:

$$A_G = \phi(W_G \cdot \overline{A}_L + b_G), \qquad P_G = \sigma(W_M \cdot A_G + b_M), \qquad (12)$$

where $W_G \in \mathbb{R}^{v \times v}$ and $W_M \in \mathbb{R}^{K \times v}$ are randomly initialized weight matrices, and $b_G \in \mathbb{R}^{v \times 1}$ and $b_M \in \mathbb{R}^{K \times 1}$ are the corresponding bias vectors. $P_G$ is a continuous vector, and each element $P^i_G$ denotes the probability $P(c_i \mid x)$ for $c_i \in C$.

In order to integrate both local and global predictions, the final prediction $P_F$ is calculated by:

$$P_F = (1 - \alpha) \cdot (P^1_L \oplus P^2_L \oplus \cdots \oplus P^H_L) + \alpha \cdot P_G, \qquad (13)$$

where $\alpha \in [0, 1]$ is a balance parameter used to regulate the trade-off between the local and global predictions. We set $\alpha = 0.5$ as the default in order to give equal importance to both.
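A minimal sketch of the Hybrid Predicting Layer (Eqs. 12–13) follows; the $\oplus$ operator is taken to be concatenation of the per-level predictions in category order, and all names are illustrative assumptions.

```python
import numpy as np

def hybrid_prediction(A_levels, P_levels, W_G, b_G, W_M, b_M, alpha=0.5):
    """Sketch of Eqs. (12)-(13).

    A_levels : list of H arrays, each (v,)  -- unified representations A^h_L
    P_levels : list of H arrays             -- local predictions P^h_L (Eq. 8)
    W_G, b_G : (v, v), (v, 1)               -- global projection parameters
    W_M, b_M : (K, v), (K, 1)               -- output layer over all K categories
    """
    relu = lambda z: np.maximum(z, 0.0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    A_bar = np.mean(np.stack(A_levels, axis=0), axis=0)   # level-wise average pooling, (v,)
    A_G = relu(W_G @ A_bar + b_G.ravel())                 # (v,)
    P_G = sigmoid(W_M @ A_G + b_M.ravel())                # (K,) global predictions, Eq. (12)

    P_local = np.concatenate(P_levels)                    # concatenated per-level predictions, (K,)
    return (1.0 - alpha) * P_local + alpha * P_G          # P_F, Eq. (13)
```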

4.4 Loss Function

In this subsection, we specify the loss function, the sum of the local and global loss functions for each document, used to train HARNN. In the training stage, for each document $D$, the local loss $\mathcal{L}_L$ and global loss $\mathcal{L}_G$ are calculated as:

Table 1: The statistics of the datasets

Statistics                                   | Education | Patent
# instances                                  | 240,309   | 100,000
# hierarchical levels                         | 3         | 4
# categories in level-1                       | 18        | 9
# categories in level-2                       | 75        | 128
# categories in level-3                       | 366       | 661
# categories in level-4                       | -         | 8,364
# total categories                            | 459       | 9,162
# average categories per instance in level-1  | 1.07      | 1.46
# average categories per instance in level-2  | 1.21      | 1.62
# average categories per instance in level-3  | 1.36      | 1.82
# average categories per instance in level-4  | -         | 2.78
# average categories per instance             | 3.64      | 7.67

$$\mathcal{L}_L = \sum_{h=1}^{H} \varepsilon(P^h_L, Y^h_L), \qquad \mathcal{L}_G = \varepsilon(P_G, Y_G), \qquad (14)$$

where $Y_G$ is the binary label vector (expected output) containing all categories of the hierarchical structure and $Y^h_L$ is the binary label vector (expected output) containing only the categories of the $h$-th level. Since categories are not mutually exclusive, we use the binary cross-entropy $\varepsilon(\cdot, \cdot)$ to minimize the local and global loss functions of a document. The final loss function we optimize is thus given by:

$$\mathcal{L}(\Theta) = \mathcal{L}_L + \mathcal{L}_G + \lambda_\Theta \|\Theta\|^2, \qquad (15)$$

where $\Theta$ denotes all parameters of HARNN and $\lambda_\Theta$ is the regularization hyperparameter. In this way, we can learn HARNN by directly minimizing the loss function $\mathcal{L}(\Theta)$ using Adam [19].
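As a sanity check, the objective in Eqs. (14)–(15) amounts to summed binary cross-entropies plus L2 regularization. A hedged NumPy sketch follows, with illustrative names and a mean-reduced cross-entropy as one reasonable reduction choice.

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross-entropy epsilon(p, y), averaged over categories."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def harnn_loss(P_levels, Y_levels, P_G, Y_G, params, lam=4e-5):
    """Sketch of Eqs. (14)-(15): local + global loss plus L2 regularization."""
    L_local = sum(bce(P_h, Y_h) for P_h, Y_h in zip(P_levels, Y_levels))  # Eq. (14), local
    L_global = bce(P_G, Y_G)                                              # Eq. (14), global
    l2 = sum(float(np.sum(w ** 2)) for w in params)                       # ||Theta||^2
    return L_local + L_global + lam * l2                                  # Eq. (15)
```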

5 EXPERIMENTS

In this section, we first introduce the datasets and our experimental setups. Then, we conduct extensive experiments on the HARNN model and the other state-of-the-art methods to answer the following questions:

• RQ1: How does HARNN perform in predicting all categories of the entire hierarchical structure compared to the state-of-the-art methods?
• RQ2: How does HARNN perform in predicting categories of each level in the hierarchical structure compared to the other hierarchical approaches?
• RQ3: How does the trade-off coefficient (α) between local and global predictions influence the performance?
• RQ4: Is the proposed hierarchical attention mechanism helpful and explanatory in the HMTC problem?


Table 2: Performance comparison on the HMTC task.

(a) Education

Baseline   | Precision | Recall | micro-F1 | AU(PRC)
Clus-HMC   | 0.806     | 0.704  | 0.752    | 0.685
HMC-LMLP   | 0.841     | 0.732  | 0.783    | 0.861
HMCN-R     | 0.844     | 0.733  | 0.785    | 0.867
HMCN-F     | 0.843     | 0.739  | 0.787    | 0.865
HARNN-LG   | 0.853     | 0.744  | 0.795    | 0.876
HARNN-LH   | 0.860     | 0.740  | 0.795    | 0.879
HARNN-GH   | 0.845     | 0.747  | 0.793    | 0.874
HARNN      | 0.860     | 0.767  | 0.811    | 0.893

(b) Patent

Baseline   | Precision | Recall | micro-F1 | AU(PRC)
Clus-HMC   | 0.419     | 0.345  | 0.379    | 0.145
HMC-LMLP   | 0.692     | 0.380  | 0.490    | 0.526
HMCN-R     | 0.684     | 0.395  | 0.501    | 0.528
HMCN-F     | 0.704     | 0.376  | 0.491    | 0.524
HARNN-LG   | 0.738     | 0.420  | 0.535    | 0.579
HARNN-LH   | 0.733     | 0.418  | 0.532    | 0.578
HARNN-GH   | 0.740     | 0.405  | 0.523    | 0.572
HARNN      | 0.742     | 0.425  | 0.541    | 0.583

5.1 Dataset Description

We conduct experiments on two real-world datasets, an educational exercise dataset and a patent document dataset, both of which contain document text records and hierarchical category structures.

Education Dataset is collected from an online education system which provides a series of exercise-based applications for high school students in China. The dataset contains 240,309 real-world math exercises, and each math exercise contains multiple categories which are organized into a tree structure with three levels.

Patent Dataset is collected from the USPTO (https://www.uspto.gov/), the system granting U.S. patents. The dataset includes 100,000 real-world granted US patents, and each patent document includes textual information (e.g., title, abstract) and multiple hierarchical categories. Figure 1 shows a toy example of a patent document with a four-level hierarchical category structure.

Table 1 and Figure 5 show the basic statistics and the document distributions of the two datasets, respectively, and we can observe that: (1) the Education dataset has 459 categories in total, organized into a three-level hierarchical structure, while the Patent dataset has 9,162 categories in total, with a four-level hierarchical structure; (2) each document has about 3.64 and 7.67 categories on average in the Education and Patent datasets, respectively. These categories are distributed across different levels, and each level averages roughly one or two categories; (3) about 95% of the documents in the Education dataset contain fewer than 350 words, while 95% of the documents in the Patent dataset contain fewer than 150 words.

The average number of categories per instance is quite small compared to the total number of categories in the whole hierarchical structure. Thus, when predicting the multiple categories of each document, it is critical to integrate the dependencies between different levels of the hierarchical structure to reduce the influence of irrelevant category information. Moreover, it is also necessary to capture the associations between texts and the hierarchical structure, especially for long sentences, which has been fully discussed in Section 4.2.

5.2 Experimental Setup

For validating the effectiveness of HARNN, we employ 10-fold cross-validation on the labeled documents in the two datasets, where one of the ten folds is used to construct the testing set and the rest form the training set.

Word Embedding Pre-training. We use the Word2vec tool [23] with a dimension (k) of 100 to generate the word embedding for each word in a document, which is trained on the texts of the documents in the datasets.

HARNN Setting. (The code is available at https://github.com/RandolphVI/Hierarchical-Multi-Label-Text-Classification.) We set the size (da) of the embedding representations for categories as 200, the dimension (u) of the hidden states in the Bi-LSTM as 256, the dimension (v) of all fully-connected layer units as 512, and the parameter (α) of the local & global information trade-off as 0.5.

Training Details. In HARNN, we set the maximum length N of words in a document as 350 in the Education dataset and 150 in the Patent dataset (zero-padded when necessary) according to our observation in Figure 5. We initialize the parameters in HARNN with a truncated normal distribution whose standard deviation is set to 0.1. For training HARNN, we use the Adam optimizer [19] with a learning rate of 1 × 10⁻³, and set the mini-batch size as 256, λΘ in Eq. (15) as 0.00004, and β as 0.1. All parameters of HARNN can be tuned during the training process. Note that all the fully-connected layers comprise 512 ReLU neurons. We also use dropout [29] with a drop rate of 0.5 to prevent overfitting and gradient clipping to avoid the gradient explosion problem.
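For reference, the reported settings can be collected into a single configuration; the dictionary keys below are illustrative names, not the ones used in the released code.

```python
# Hyperparameters as reported in Section 5.2; key names are illustrative.
HPARAMS = {
    "word_embedding_dim": 100,                       # k, Word2vec
    "category_embedding_dim": 200,                   # d_a
    "bilstm_hidden_dim": 256,                        # u (word representations are 2u = 512-d)
    "fc_dim": 512,                                   # v, fully-connected layers (ReLU)
    "alpha": 0.5,                                    # local/global trade-off in Eq. (13)
    "max_words": {"Education": 350, "Patent": 150},  # N, zero-padded
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "batch_size": 256,
    "l2_lambda": 4e-5,                               # lambda_Theta in Eq. (15)
    "dropout_rate": 0.5,
    "init_stddev": 0.1,                              # truncated normal initialization
}
```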

5.3 Baseline Approaches

In order to demonstrate the effectiveness of HARNN, we compare it with several HMTC methods, including four state-of-the-art models and some variants of HARNN:

• Clus-HMC [31]: Clus-HMC is a global approach that builds a single decision tree to classify all categories simultaneously.
• HMC-LMLP [7]: HMC-LMLP is a local approach which trains a set of neural networks, each responsible for the prediction of the categories belonging to a given level.
• HMCN-F [33]: HMCN-F is a feed-forward hybrid approach which combines the prediction of each level in the category hierarchy and the overall hierarchy structure.
• HMCN-R [33]: HMCN-R is a recurrent hybrid approach which combines the prediction of each level in the category hierarchy and the overall hierarchy structure.


Figure 6: Performance on different levels in hierarchy. (Panels: (a) Education dataset, (b) Patent dataset; Precision, Recall, Micro-F1, and AU(PRC) at each hierarchy level for HMC-LMLP, HMCN-R, HMCN-F, and HARNN.)

The variants of HARNN are listed as follows:

• HARNN-LG: HARNN-LG is a variant of HARNN without considering the dependencies between different levels of the hierarchical structure.
• HARNN-LH: HARNN-LH is a variant of HARNN without considering the overall hierarchical structure information.
• HARNN-GH: HARNN-GH is a variant of HARNN without considering the information of the local regions in the hierarchical structure.

Concretely, the chosen baselines can be categorized into a local approach (HMC-LMLP), a global approach (Clus-HMC), and hybrid approaches (HMCN-F, HMCN-R). All baselines ignore the associations between texts and the hierarchical structure. Note that we design three variants of HARNN to highlight the effectiveness of our proposed HAM unit and hybrid prediction method. In the following experiments, all methods are implemented in TensorFlow [1], and all experiments are conducted on a Linux server with four 2.0GHz Intel Xeon E5-2620 CPUs and a Tesla K80 GPU. For fair comparisons, all parameters of these baselines are tuned to achieve the best performance.

5.4 Evaluation Metrics

5.4.1 Threshold-based evaluation. In the HMTC task, we first adopt three widely used evaluation metrics [3, 5]: precision, recall, and the F1 measure. Given a category $i \in C$, let $TP_i$, $FP_i$, and $FN_i$ be the number of true positives, false positives, and false negatives, respectively. The precision and recall for the whole output hierarchical structure are then defined as [31]:

$$P = \frac{\sum_{i \in C} TP_i}{\sum_{i \in C} TP_i + \sum_{i \in C} FP_i}, \qquad R = \frac{\sum_{i \in C} TP_i}{\sum_{i \in C} TP_i + \sum_{i \in C} FN_i}. \qquad (16)$$

We also evaluate the performance using micro-F1 [13], which combines precision and recall. Micro-F1 measures the classification effectiveness by counting the true positives, false negatives, and false positives globally, which takes class imbalance into account [36]. For the three threshold-based evaluation metrics, the larger, the better.

5.4.2 Area under the average PR curve. The outputs of our model HARNN and the other baselines are probability values for each category in the entire hierarchical structure. Hence, the final predictions are generated after thresholding those probabilities. The choice of an optimal threshold is difficult and often subjective, so we follow the trend of HMTC research [7, 33] and employ the Area Under the Average Precision-Recall Curve (AU(PRC)) [9] to avoid choosing thresholds. The higher the AU(PRC) of a method, the better its predictive performance.
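A minimal sketch of the evaluation, assuming predictions and labels are stacked into document-by-category matrices; scikit-learn's average_precision_score is used here as a common stand-in for the area under the micro-averaged precision-recall curve, not necessarily the exact procedure used in the paper.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate(pred, gold, threshold=0.5):
    """Micro-averaged precision, recall, F1 (Eq. 16) and an AU(PRC) estimate.

    pred : (n_docs, |C|) predicted probabilities for every category in the hierarchy
    gold : (n_docs, |C|) binary ground-truth labels
    """
    hard = (pred >= threshold).astype(int)
    tp = np.sum((hard == 1) & (gold == 1))
    fp = np.sum((hard == 1) & (gold == 0))
    fn = np.sum((hard == 0) & (gold == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    auprc = average_precision_score(gold, pred, average="micro")  # threshold-free
    return precision, recall, f1, auprc
```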

5.5 Experimental Results

5.5.1 Performance Comparison (RQ1). To demonstrate the practical significance of our proposed model, we compare HARNN with all the baselines on the HMTC task. The results of all methods on both the Education and Patent datasets are shown in Table 2. From the results, we can make several observations:

(1) HARNN achieves the best performance on all evaluation metrics on both datasets, which indicates that HARNN is more capable of handling HMTC tasks, with the advantage of tackling the hierarchical category structure effectively and accurately.

(2) All the variants of HARNN perform better than HMCN-F and HMCN-R on all evaluation metrics on both datasets. The reason is that the variants of HARNN can capture the associations between texts and the hierarchical structure via TCA, while HMCN-F and HMCN-R cannot.

(3) HARNN beats HARNN-LG by additionally leveraging the HAM unit, which considers the dependencies among different levels of the hierarchical structure. And HARNN performs better than HARNN-LH and HARNN-GH by combining the information of the local regions in the category hierarchy and the overall hierarchical structure.


(4) HMCN-F and HMCN-R are generally better than HMC-LMLP and Clus-HMC. This once again implies that integrating the predictions of each level in the category hierarchy and the overall hierarchical structure is effective for the HMTC task.

In summary, these results all indicate that the dependencies among different levels of the hierarchical structure and the text-category associations are important for classifying documents in the HMTC task. Moreover, they also reveal that HARNN can classify documents into multiple categories more effectively by integrating the predictions of each level in the category hierarchy and the overall hierarchical structure.

Figure 7: Model performance with different α. (Precision, Recall, Micro-F1, and AU(PRC) on the Education and Patent datasets as α varies from 0 to 1.0.)

5.5.2 Performance on different levels (RQ2). In the HMTC task, it is important to predict all categories in the entire hierarchy, and annotating the categories of each level precisely is also critical. Thus, we conduct an experiment comparing HARNN with HMC-LMLP, HMCN-F and HMCN-R on each level of the hierarchical structure separately. From Figure 6, we can see that HARNN outperforms all the other methods on every category level of the hierarchical structure, since HARNN leverages the information of the hierarchical structure. Moreover, we also notice that the performance of all models tends to decrease as the hierarchy deepens, and HARNN retains superior performance compared to the other baselines. That is, as the hierarchical level increases, the number of categories per level grows rapidly (e.g., the Patent dataset has 661 and 8,364 categories in level-3 and level-4 respectively, as shown in Table 1), which strongly influences model performance. HARNN considers both the associations between texts and the hierarchical structure and the dependencies among different levels of the hierarchical structure, while the other three baselines do not. These results once again indicate that HARNN is more effective and powerful for the HMTC task by considering both the associations between texts and the hierarchical structure and modeling the dependencies between different levels of the hierarchical structure.

Figure 8: Illustration of learned attention weights. (Heatmap of the attention weights of the word tokens gas, injection, system, nuclear, power, plant, reactor, gaseous, waste, and disposal across hierarchy levels 1–4; color ranges from white to red as the weight increases from 0 to 1.)

5.5.3 Performance with different α (RQ3). In our HARNN model, the trade-off parameter α plays a crucial role: it balances the importance of the local and global predictions in Eq. (13). When α is smaller, the network tends to prioritize the local predictions during learning. Conversely, when α is larger, the network is allowed to focus more on the global prediction. We conduct an experiment by assigning α different values from the set {0, 0.1, 0.2, . . . , 1.0}. As shown in Figure 7, as α increases, the performance of HARNN increases at the beginning but decreases afterwards on both datasets. These results indicate that properly combining the local and global predictions is vital for achieving more accurate classification performance.

5.5.4 Hierarchical Attention Visualization (RQ4). An important characteristic of HARNN is that it can capture the hierarchical attention information between the text semantic representations and the hierarchical structure through the attention vector $\overline{K}^h$ in Eq. (10) for each category level. The heatmap of the learned attention weights for a patent document is illustrated in Figure 8, where the document sample is the one in Figure 1. Please note that we only show part of the document text for illustration purposes. The color in Figure 8 changes from white to red as the attention weight increases. From Figure 8, we find that the HARNN model mainly learns to capture some key words in the text. For instance, the attention weights of the words "nuclear" and "plant" are obviously higher than those of the other words, since these words dominate the prediction of the category Physics in Level-1. We also observe that the attention weights of the words "nuclear" and "reactor" gradually increase as the hierarchy deepens, while the attention weights of the words "plant" and "injection" decrease. That is, the key words at different levels may differ because different concepts are described. Specifically, the words "nuclear" and "reactor" are more appropriate for describing the concepts of the subsequent levels (e.g., the Nuclear Reactors concept in level-3), while the words "plant" and "injection" seem less relevant. This once again indicates that HARNN is capable of understanding the concept of a document text properly, from shallow to deep. These results imply that HARNN provides a good way to capture the hierarchical attention information in a given document via the HAM unit.


6 CONCLUSIONS AND FUTURE WORK

In this paper, we proposed a model named HARNN for classifying documents into the most relevant categories level by level via combining the text and the hierarchical structure. Specifically, we first applied a Documentation Representing Layer for obtaining the semantic encodings of the text and the categories. Then, we devised a Hierarchical Attention-based Recurrent Layer to model the dependencies between different levels by capturing the associations between the text and the hierarchical structure in a top-down fashion. After that, we designed a hybrid approach which is capable of predicting the categories of each level while classifying all the categories in the entire hierarchy precisely. Finally, extensive experiments on two real-world datasets clearly demonstrated the effectiveness and explanatory power of HARNN.

In the future, there are still some directions for exploration. First, we would like to leverage the explicit constraints of the hierarchical structure (e.g., the hierarchical violation [28]) as regularization to improve the classification performance. Second, we would like to address the issue of incomplete labels [35] and use more effective embeddings for representing the hierarchical structure [21]. Third, as HAM is an LSTM-like structure for encoding the category hierarchy information, we will also try to design a Bi-HAM (analogous to a Bi-LSTM) to capture better hierarchy information from the upward and downward directions simultaneously within HARNN. Finally, as HARNN is a general framework, we will test its performance in other domains, such as protein function prediction in biochemistry [7].

7 ACKNOWLEDGEMENTS

This research was partially supported by grants from the National Natural Science Foundation of China (Grants No. U1605251, 61727809, 61922073, 61602405), and the Young Elite Scientist Sponsorship Program of CAST.

REFERENCES

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] P. N. Bennett and N. Nguyen. Refined experts: improving classification in large taxonomies. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 11–18. ACM, 2009.
[3] W. Bi and J. T. Kwok. Multi-label classification on tree- and DAG-structured hierarchies. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 17–24, 2011.
[4] H. B. Borges and J. C. Nievola. Multi-label hierarchical classification using a competitive neural network for protein function prediction. In The 2012 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2012.
[5] A. Braytee, W. Liu, D. R. Catchpoole, and P. J. Kennedy. Multi-label feature selection using correlation information. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 1649–1656. ACM, 2017.
[6] L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pages 78–87. ACM, 2004.
[7] R. Cerri, R. C. Barros, A. C. de Carvalho, and Y. Jin. Reduction strategies for hierarchical multi-label classification in protein function prediction. BMC Bioinformatics, 17(1):373, 2016.
[8] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Incremental algorithms for hierarchical classification. Journal of Machine Learning Research, 7(Jan):31–54, 2006.
[9] J. Davis and M. Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pages 233–240. ACM, 2006.
[10] O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classification. In Proceedings of the Twenty-First International Conference on Machine Learning, page 27. ACM, 2004.
[11] A. Esuli, T. Fagni, and F. Sebastiani. Boosting multi-label hierarchical text categorization. Information Retrieval, 11(4):287–313, 2008.
[12] C. J. Fall, A. Törcsvári, K. Benzineb, and G. Karetka. Automated categorization in the international patent classification. In ACM SIGIR Forum, volume 37, pages 10–25. ACM, 2003.
[13] E. Gibaja and S. Ventura. Multi-label learning: a review of the state of the art and ongoing research. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(6):411–444, 2014.
[14] J. C. Gomez and M.-F. Moens. A survey of automated hierarchical classification of patents. In Professional Search in the Modern World, pages 215–249. Springer, 2014.
[15] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[16] J. Han, C. Wang, and A. El-Kishky. Bringing structure to text: mining phrases, entities, topics, and hierarchies. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1968–1968. ACM, 2014.
[17] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[18] Z. Huang, Q. Liu, E. Chen, H. Zhao, M. Gao, S. Wei, Y. Su, and G. Hu. Question difficulty prediction for reading problems in standard tests. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[20] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
[21] J. Ma, P. Cui, X. Wang, and W. Zhu. Hierarchical taxonomy aware network embedding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1920–1929. ACM, 2018.
[22] A. Mayne and R. Perry. Hierarchically classifying documents with multiple labels. In 2009 IEEE Symposium on Computational Intelligence and Data Mining (CIDM '09), pages 133–139. IEEE, 2009.
[23] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[24] Z. Ren, M.-H. Peetz, S. Liang, W. Van Dolen, and M. De Rijke. Hierarchical multi-label classification of social text streams. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 213–222. ACM, 2014.
[25] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor. Learning hierarchical multi-category text classification models. In Proceedings of the 22nd International Conference on Machine Learning, pages 744–751. ACM, 2005.
[26] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor. Kernel-based learning of hierarchical multilabel classification models. Journal of Machine Learning Research, 7(Jul):1601–1626, 2006.
[27] M. E. Ruiz and P. Srinivasan. Hierarchical text categorization using neural networks. Information Retrieval, 5(1):87–118, 2002.
[28] C. N. Silla and A. A. Freitas. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1-2):31–72, 2011.
[29] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[30] H. Tao, S. Tong, H. Zhao, T. Xu, B. Jin, and Q. Liu. A radical-aware attention-based model for Chinese text classification. In Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
[31] C. Vens, J. Struyf, L. Schietgat, S. Džeroski, and H. Blockeel. Decision trees for hierarchical multi-label classification. Machine Learning, 73(2):185, 2008.
[32] X. Wang and G. Sukthankar. Multi-label relational neighbor classification using social context features. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 464–472. ACM, 2013.
[33] J. Wehrmann, R. Cerri, and R. Barros. Hierarchical multi-label classification networks. In International Conference on Machine Learning, pages 5225–5234, 2018.
[34] F. Wu, J. Zhang, and V. Honavar. Learning classifiers using hierarchically structured class taxonomies. In International Symposium on Abstraction, Reformulation, and Approximation, pages 313–320. Springer, 2005.
[35] L. Xu, Z. Wang, Z. Shen, Y. Wang, and E. Chen. Learning low-rank label correlations for multi-label classification with missing labels. In 2014 IEEE International Conference on Data Mining, pages 1067–1072. IEEE, 2014.
[36] B. Yang, J.-T. Sun, T. Wang, and Z. Chen. Effective multi-label active learning for text classification. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 917–926. ACM, 2009.
[37] L. Zhang, K. Xiao, Q. Liu, Y. Tao, and Y. Deng. Modeling social attention for stock analysis: An influence propagation perspective. In 2015 IEEE International Conference on Data Mining, pages 609–618. IEEE, 2015.
