



Neural Adaptation Layers for Cross-domain Named Entity Recognition

Bill Yuchen Lin† and Wei Lu‡

†University of Southern California
‡Singapore University of Technology and Design
[email protected], [email protected]

Abstract

Recent research efforts have shown that neural architectures can be effective in conventional information extraction tasks such as named entity recognition, yielding state-of-the-art results on standard newswire datasets. However, despite the significant resources required for training such models, the performance of a model trained on one domain typically degrades dramatically when applied to a different domain, yet extracting entities from new, emerging domains such as social media can be of significant interest. In this paper, we empirically investigate effective methods for conveniently adapting an existing, well-trained neural NER model for a new domain. Unlike existing approaches, we propose lightweight yet effective methods for performing domain adaptation for neural models. Specifically, we introduce adaptation layers on top of existing neural architectures, where no re-training using the source domain data is required. We conduct extensive empirical studies and show that our approach significantly outperforms state-of-the-art methods.

1 Introduction

Named entity recognition (NER) focuses on extracting named entities in a given text while identifying their underlying semantic types. Most earlier approaches to NER are based on conventional structured prediction models such as conditional random fields (CRF) (Lafferty et al., 2001; Sarawagi and Cohen, 2004), relying on hand-crafted features which can be designed based on domain-specific knowledge (Yang and Cardie, 2012; Passos et al., 2014; Luo et al., 2015). Recently, neural architectures have been shown effective in such a task, whereby minimal feature engineering is required (Lample et al., 2016; Ma and Hovy, 2016; Peters et al., 2017; Liu et al., 2018).

Accepted as a long paper in EMNLP 2018 (Conference on Empirical Methods in Natural Language Processing).

Figure 1: Two existing adaptation approaches for NER. (Panel (a) INIT: training with source data, parameter initialization, then fine-tuning with target data; panel (b) MULT: shared word embeddings and shared BLSTMs trained with random samples from source or target data; both panels include source-domain and target-domain CRF layers.)

Domain adaptation, as a special case of transfer learning, aims to exploit the abundant data of well-studied source domains to improve the performance in target domains of interest (Pan and Yang, 2010; Weiss et al., 2016). There is a growing interest in investigating the transferability of neural models for NLP. Two notable approaches, namely INIT (parameter initialization) and MULT (multi-task learning), have been proposed for studying the transferability of neural networks in tasks such as sentence (pair) classification (Mou et al., 2016) and sequence labeling (Yang et al., 2017b).

The INIT method first trains a model using labeled data from the source domain; next, it initializes a target model with the learned parameters; finally, it fine-tunes the initialized target model using labeled data from the target domain. The MULT method, on the other hand, simultaneously trains two models using the source and target data respectively, where some parameters are shared across the two models during the learning process. Figure 1 illustrates the two approaches based on the BLSTM-CRF (bidirectional LSTM augmented with a CRF layer) architecture for NER. While such approaches make intuitive sense, they also come with some limitations.

First, these methods utilize shared domain-general word embeddings when performing learning from both source and target domains. This essentially assumes there is no domain shift in the input feature spaces.



Figure 2: Our base model (left) and target model (right), where we insert new adaptation layers (highlighted in red). (Both panels show the example sentence "Bob Dylan visited Sweden" with its character sequences feeding a bidirectional LSTM layer and a CRF layer producing B-PER I-PER O B-GPE; the target model additionally contains the word, sentence, and output adaptation layers.)

However, in cases where the two domains are distinct (words may carry different semantics across the two domains), we believe such an assumption can be weak.

Second, existing approaches such as INIT directly augment the LSTM layer with a new output CRF layer when learning models for the target domain. One basic assumption involved here is that the model would be able to re-construct a new CRF layer that can capture not only the variation of the input features (or hidden states output by the LSTM) to the final CRF layer across the two domains, but also the structural dependencies between output nodes in the target output space. We believe this overly restrictive assumption may prevent the model from capturing rich, complex cross-domain information due to the inherent linear nature of the CRF model.

Third, most methods involving cross-domain embeddings require highly time-consuming re-training of word embeddings on domain-specific corpora. This makes them less practical in scenarios where the source corpora are huge or even inaccessible. Also, MULT-based methods need re-training on the source-domain data for every new target domain. We believe this disadvantage keeps existing neural domain adaptation methods for NER from being practical in real scenarios.

In this work, we propose solutions to address the above-mentioned issues. Specifically, we address the first issue at both the word and the sentence level by introducing a word adaptation layer and a sentence adaptation layer respectively, bridging the gap between the two input spaces. Similarly, an output adaptation layer is introduced between the LSTM and the final CRF layer, capturing the variations in the two output spaces. Furthermore, we introduce a single hyper-parameter that controls how much information we would like to preserve from the model trained on the source domain. These approaches are lightweight and require no re-training on data from the source domain. We show through extensive empirical analysis as well as an ablation study that our proposed approach can significantly improve the performance of cross-domain NER over existing transfer approaches.

2 Base Model

We briefly introduce the BLSTM-CRF architecture for NER, which serves as our base model throughout this paper. Our base model is a combination of two recently proposed, widely used models for named entity recognition by Lample et al. (2016) and Ma and Hovy (2016). Figure 2 illustrates the BLSTM-CRF architecture.

Following Lample et al. (2016), we develop comprehensive word representations by concatenating pre-trained word embeddings and character-level word representations, which are constructed by running a BLSTM over sequences of character embeddings. The middle BLSTM layer takes a sequence of comprehensive word representations and produces a sequence of hidden states representing the contextual information of each token. Finally, following Ma and Hovy (2016), we build the final CRF layer by utilizing potential functions describing local structures


to define the conditional probabilities of complete predictions for the given input sentence.

This architecture is selected as our base model due to its generality and representativeness. We note that several recently proposed models (Peters et al., 2017; Liu et al., 2018) are built based on it. As our focus is on how to better transfer such architectures for NER, we include further discussion of the model and training details in our supplementary material.
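As a rough illustration (not the authors' released implementation), a minimal PyTorch-style sketch of such a BLSTM encoder with character-level word representations might look as follows; the CRF layer is stubbed as a linear emission layer, and all dimensions and names are assumptions.

```python
import torch
import torch.nn as nn

class BaseNERModel(nn.Module):
    """Char-BLSTM word representations concatenated with pre-trained word embeddings,
    fed into a word-level BLSTM; CRF potentials are stubbed as a linear emission layer."""
    def __init__(self, word_vectors, n_chars, n_labels,
                 char_dim=25, char_hidden=25, word_hidden=100):
        super().__init__()
        word_vectors = torch.as_tensor(word_vectors, dtype=torch.float)
        self.word_emb = nn.Embedding.from_pretrained(word_vectors, freeze=False)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 bidirectional=True, batch_first=True)
        word_dim = word_vectors.shape[1]
        self.word_lstm = nn.LSTM(word_dim + 2 * char_hidden, word_hidden,
                                 bidirectional=True, batch_first=True)
        self.emissions = nn.Linear(2 * word_hidden, n_labels)  # scores fed to a CRF layer

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_word_len)
        b, t, c = char_ids.shape
        _, (h_n, _) = self.char_lstm(self.char_emb(char_ids.view(b * t, c)))
        # concatenate the final forward and backward character states per word
        char_repr = torch.cat([h_n[0], h_n[1]], dim=-1).view(b, t, -1)
        tokens = torch.cat([self.word_emb(word_ids), char_repr], dim=-1)
        hidden, _ = self.word_lstm(tokens)
        return self.emissions(hidden)  # per-token label scores for the CRF layer
```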

3 Our Approach

We first introduce our three proposed adaptation layers and then describe the overall learning process.

3.1 Word Adaptation Layer

Most existing transfer approaches use the same domain-general word embeddings for training both the source and target models. Assuming that there is little domain shift in the input feature spaces, they simplify the problem as homogeneous transfer learning (Weiss et al., 2016). However, this simplified assumption becomes weak when the two domains have markedly different language styles and involve considerable domain-specific terms that may not share the same semantics across the domains; for example, the term "cell" has different meanings in biomedical articles and product reviews. Furthermore, some important domain-specific words may not be present in the vocabulary of domain-general embeddings. As a result, we have to regard such words as out-of-vocabulary (OOV) words, which may be harmful to the transfer learning process.

Stenetorp et al. (2012) show that domain-specific word embeddings tend to perform better when used in supervised learning tasks.2 However, maintaining such an improvement in the transfer learning process is very challenging. This is because the two domain-specific embeddings are trained separately on the source and target datasets, and therefore the two embedding spaces are heterogeneous. Thus, the model parameters trained in each model are heterogeneous as well, which hinders the transfer process. How can we address such challenges while maintaining the improvement gained from using domain-specific embeddings?

We address this issue by developing a word adaptation layer, bridging the gap between the source and target embedding spaces, so that both input features and model parameters become homogeneous across domains. Popular pre-trained word embeddings are usually trained on very large corpora, and thus methods requiring re-training them can be extremely costly and time-consuming. We propose a straightforward, lightweight yet effective method to construct the word adaptation layer, which projects the learned embeddings from the target embedding space into the source space. This method only requires some corpus-level statistics from the datasets (used for learning the embeddings) to build the pivot lexicon for constructing the adaptation layer.

2 We also confirm this claim with experiments presented in the supplementary material.

3.1.1 Pivot Lexicon

A pivot word pair is denoted as (w_s, w_t), where w_s ∈ X_S and w_t ∈ X_T. Here, X_S and X_T are the source and target vocabularies. A pivot lexicon P is a set of such word pairs.

To construct a pivot lexicon, we first, motivated by Tan et al. (2015), define P_1, which consists of the ordinary words that have high relative frequency in both the source and target domain corpora: P_1 = {(w_s, w_t) | w_s = w_t, f(w_s) ≥ φ_s, f(w_t) ≥ φ_t}, where f(w) is the frequency function that returns the number of occurrences of the word w in the dataset, and φ_s and φ_t are word frequency thresholds.3 Optionally, we can utilize a customized word-pair list P_2, which gives mappings between domain-specific words across domains, such as normalization lexicons (Han and Baldwin, 2011; Liu et al., 2012). The final lexicon is thus defined as P = P_1 ∪ P_2.
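A minimal sketch of this lexicon construction, assuming raw token lists for the two corpora and a hypothetical function name, could look like this:

```python
from collections import Counter

def build_pivot_lexicon(source_tokens, target_tokens, phi_s, phi_t, custom_pairs=None):
    """P1: words frequent in both corpora, paired with themselves; P2: an optional
    hand-made word-pair list (e.g. a normalization lexicon). Returns P = P1 | P2
    along with the two frequency tables (useful later for confidence scores)."""
    f_s, f_t = Counter(source_tokens), Counter(target_tokens)
    p1 = {(w, w) for w in f_s.keys() & f_t.keys()
          if f_s[w] >= phi_s and f_t[w] >= phi_t}
    p2 = set(custom_pairs or [])
    return p1 | p2, f_s, f_t

# One simple way to set the thresholds (cf. footnote 3): use the frequency of the
# k-th most frequent word in each corpus, e.g.
#   phi_s = sorted(f_s.values(), reverse=True)[k - 1]
```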

3.1.2 Projection Learning

Mathematically, given the pre-trained domain-specific word embeddings V_S and V_T as well as a pivot lexicon P, we would like to learn a linear projection transforming word vectors from V_T into V_S. This idea is based on a bilingual word embedding model (Mikolov et al., 2013), which we adapt to this domain adaptation task.

We first construct two matrices V*_S and V*_T, whose i-th rows are the vector representations of the two words in the i-th entry of P, P(i) = (w_s^i, w_t^i); that is, the i-th row of V*_S is the vector of the word w_s^i, and similarly the i-th row of V*_T is the vector of w_t^i.

3 A simple way of setting them is to choose the frequency of the k-th word in the word lists sorted by frequencies.


Next, we learn a transformation matrix Z minimizing the distances between V*_S and V*_T Z with the following loss function, where c_i is the confidence coefficient for the entry P(i), highlighting the significance of that entry:

$$\arg\min_{\mathbf{Z}} \sum_{i=1}^{|\mathcal{P}|} c_i \left\lVert \mathbf{V}^{*i}_{T}\mathbf{Z} - \mathbf{V}^{*i}_{S} \right\rVert^{2}$$

We use normalized frequencies ($\bar{f}$) and the Sørensen–Dice coefficient (Sørensen, 1948) to describe the significance of each word pair:

$$\bar{f}(w_s) = \frac{f(w_s)}{\max_{w' \in \mathcal{X}_S} f(w')}, \qquad \bar{f}(w_t) = \frac{f(w_t)}{\max_{w' \in \mathcal{X}_T} f(w')}$$

$$c_i = \frac{2 \cdot \bar{f}(w^i_s) \cdot \bar{f}(w^i_t)}{\bar{f}(w^i_s) + \bar{f}(w^i_t)}$$

The intuition behind this scoring method is that a word pair is more important when both of its words have high relative frequency in their respective domain corpora. This is because such words are likely more domain-independent, conveying identical or similar semantics across the two domains.

The matrix Z now exactly gives the weights of the word adaptation layer, which takes in the target-domain word embeddings and returns the transformed embeddings in the new space. We learn Z with stochastic gradient descent. After learning, the projected new embeddings are given by V_T Z, which are used in the subsequent steps as illustrated in Figure 2 and Figure 3. With such a word-level input-space transformation, the parameters of the pre-trained source models based on V_S remain relevant and can be used in the subsequent steps.
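As an illustration of this step (a sketch under assumed function names, using plain gradient descent rather than any particular released training code), the confidence scores and the projection could be computed as follows:

```python
import numpy as np

def dice_confidence(pairs, f_s, f_t):
    """Sørensen–Dice coefficient over frequencies normalized by each corpus's maximum."""
    max_s, max_t = max(f_s.values()), max(f_t.values())
    conf = []
    for ws, wt in pairs:
        ns, nt = f_s[ws] / max_s, f_t[wt] / max_t
        conf.append(2.0 * ns * nt / (ns + nt) if (ns + nt) > 0 else 0.0)
    return np.array(conf)

def learn_projection(Vt_pivot, Vs_pivot, conf, lr=0.05, epochs=500):
    """Gradient descent on  sum_i c_i * ||Vt_i Z - Vs_i||^2  to learn Z.
    Vt_pivot: (n, d_t) target-space vectors of the pivot pairs; Vs_pivot: (n, d_s)."""
    n, d_t = Vt_pivot.shape
    Z = np.zeros((d_t, Vs_pivot.shape[1]))
    for _ in range(epochs):
        residual = Vt_pivot @ Z - Vs_pivot                    # (n, d_s)
        grad = 2.0 * Vt_pivot.T @ (conf[:, None] * residual)  # d(loss)/dZ
        Z -= lr * grad / n
    return Z

# The word adaptation layer is then the fixed map  x -> x @ Z  applied to every
# target-domain embedding before it enters the source-initialized model.
```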

We would like to highlight that, unlike many previous approaches to learning cross-domain word embeddings (Bollegala et al., 2015; Yang et al., 2017a), the learning of our word adaptation layer involves no modification of the source-domain embedding space. It also requires no re-training of the embeddings on the target-domain data. This distinctive advantage of our approach comes with important practical implications: it essentially enables the transfer learning process to work directly on top of a well-trained model, performing adaptation without significant re-training effort. For example, the existing model could be one that has already gone through extensive training, tuning and testing for months on large datasets, with embeddings learned from a particular domain (which may be different from the target domain).

3.2 Sentence Adaptation Layer

The word adaptation layer serves as a way to bridge the gap between heterogeneous input spaces, but it does so only at the word level and is context independent. We can take a step further and address the input-space mismatch issue at the sentence level, capturing contextual information when learning such a mapping from labeled data in the target domain. To this end, we augment a BLSTM layer right after the word adaptation layer (see Figure 2), which we name the sentence adaptation layer.

This layer pre-encodes the sequence of projected word embeddings for each target instance before they serve as inputs to the LSTM layer inside the base model. The hidden states generated by this layer for each word can be seen as further transformed word embeddings that capture target-domain-specific contextual information, where this further transformation is learned in a supervised manner from target-domain annotations. We also believe that such a sentence adaptation layer may partially alleviate the OOV issue mentioned above. Without such a layer, all OOV words would be mapped to a single fixed vector representation, which is not desirable; with the sentence adaptation layer, each OOV word is assigned its own "transformed" embedding based on its contextual information.

3.3 Output Adaptation Layer

We focus on the problem of performing domain adaptation for NER under a general setup, where we assume the sets of output labels for the source and target domains could be different. Due to the heterogeneousness of the output spaces, we have to re-construct the final CRF layer in the target models.

However, we believe solely doing this may not be enough to address the labeling difference problem highlighted in (Jiang, 2008), as the two output spaces may be very different. For example, in the sentence "Taylor released her new songs", "Taylor" should be labeled as "MUSIC-ARTIST" instead of "PERSON" in some social media NER datasets; this suggests that re-classifying with contextual information is necessary. In another example, a tweet reads "so...#kktny in 30 mins?"; here we should recognize "#kktny" as a CREATIVE-WORK entity, but there are few similar instances in newswire data, indicating that context-aware re-recognition is also needed.


Figure 3: Overview of our transfer learning process. (The pre-trained source model, with BLSTMs trained in the source-domain embedding space and a source-domain CRF layer, initializes the target model's parameters; the matrix Z, learned from the two embedding spaces, forms the word adaptation layer; training with target data then updates the sentence adaptation layer, BLSTMs, output adaptation layer and target-domain CRF layer, using learning rates α_base and α_adapt.)


How can we perform re-classification and re-recognition with contextual information in the target model? We answer this question by inserting a BLSTM output adaptation layer into the base model, right before the final CRF layer, to capture variations in the outputs with contextual information. This output adaptation layer takes the hidden states output by the BLSTM layer of the base model as its inputs and produces a sequence of new hidden states for the re-constructed CRF layer. Without this layer, the learning process directly updates the pre-trained parameters of the base model, which may lead to a loss of the knowledge that can be transferred from the source domain.
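To make the wiring of the three adaptation layers concrete, here is a rough PyTorch-style sketch of a target model built around a transferred encoder. Character-level representations and the real CRF layer are omitted, the layer sizes are illustrative, and `base_word_lstm` is assumed to be the BLSTM taken from the pre-trained source model and to accept the projected embedding dimension; none of this is the authors' released code.

```python
import torch
import torch.nn as nn

class AdaptedTargetModel(nn.Module):
    """Target model: frozen word adaptation (Z), a sentence-level BLSTM before the
    transferred base encoder, and an output BLSTM before a re-constructed CRF layer
    (stubbed here as a linear emission layer over the target label set)."""
    def __init__(self, Z, base_word_lstm, base_hidden, n_target_labels, out_hidden=50):
        super().__init__()
        d_t, d_s = Z.shape
        self.word_adapt = nn.Linear(d_t, d_s, bias=False)
        self.word_adapt.weight.data.copy_(torch.tensor(Z, dtype=torch.float).t())
        self.word_adapt.weight.requires_grad_(False)   # Z is kept fixed after learning
        # bidirectional, so the re-encoded embeddings keep (roughly) dimension d_s
        self.sent_adapt = nn.LSTM(d_s, d_s // 2, bidirectional=True, batch_first=True)
        self.base_lstm = base_word_lstm                 # initialized from the source model
        self.out_adapt = nn.LSTM(2 * base_hidden, out_hidden,
                                 bidirectional=True, batch_first=True)
        self.target_crf = nn.Linear(2 * out_hidden, n_target_labels)

    def forward(self, target_word_vectors):
        x = self.word_adapt(target_word_vectors)   # project into the source embedding space
        x, _ = self.sent_adapt(x)                  # context-aware re-encoding of the inputs
        h, _ = self.base_lstm(x)                   # transferred source-domain encoder
        h, _ = self.out_adapt(h)                   # capture output-space variation
        return self.target_crf(h)                  # scores for the target-domain CRF
```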

3.4 Overall Learning Process

Figure 3 depicts our overall learning process. We initialize the base model with the parameters of the pre-trained source model and tune the weights of all layers, including the layers of the base model, the sentence and output adaptation layers, and the CRF layer. We use different learning rates when updating the weights of different layers with Adam (Kingma and Ba, 2014). In all our experiments, we fix the weights of the word adaptation layer to Z to avoid over-fitting. This allows us to preserve the knowledge learned from the source domain while effectively leveraging the limited training data from the target domain.

Similar to Yang et al. (2017b), who utilize a hyper-parameter λ for controlling the transferability, we introduce a hyper-parameter ψ that serves a similar purpose: it relates the learning rate used for the base model (α_base) to the learning rate used for the adaptation layers plus the final CRF layer (α_adapt), via α_base = ψ · α_adapt.

If ψ = 0, we completely fix the learned parameters (from the source domain) of the base model (Ours-Frozen). If ψ = 1, we treat all layers equally (Ours-FineTune).
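As an illustration of this two-learning-rate scheme (the attribute names follow the sketch above and the default values are assumptions, not the paper's tuned settings), a single Adam optimizer with two parameter groups could be configured as follows:

```python
import torch

def make_optimizer(model, lr_adapt=1e-3, psi=0.6):
    """One Adam optimizer, two parameter groups: the transferred base layers use
    lr_base = psi * lr_adapt, the new adaptation + CRF layers use lr_adapt.
    psi = 0 freezes the base model; psi = 1 fine-tunes every layer at the same rate."""
    base_params = list(model.base_lstm.parameters())
    adapt_params = [p for name, p in model.named_parameters()
                    if not name.startswith('base_lstm') and p.requires_grad]
    return torch.optim.Adam([
        {'params': base_params, 'lr': psi * lr_adapt},
        {'params': adapt_params, 'lr': lr_adapt},
    ])
```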

4 Experimental Setup

In this section, we present the setup of our experiments: our choice of source and target domains, the resources for embeddings, the datasets for evaluation and finally the baseline methods.

4.1 Source and Target Domains

We evaluate our approach in a setting where the source domain is newswire and the target domain is social media. We designed this experimental setup based on the following considerations:
• Challenging: Newswire is a well-studied domain for NER on which existing neural models perform very well (around 90.0 F1-score (Ma and Hovy, 2016)). However, the performance drops dramatically on social media data (around 60.0 F1-score (Strauss et al., 2016)).
• Important: Social media is a rich soil for text mining (Petrovic et al., 2010; Rosenthal and McKeown, 2015; Wang and Yang, 2015), and NER is of significant importance for other information extraction tasks in social media (Ritter et al., 2011a; Peng and Dredze, 2016; Chou et al., 2016).
• Representative: The noisy nature of user-generated content as well as emerging entities with novel surface forms make the domain shift very salient (Finin et al., 2010; Han et al., 2016).

Nevertheless, the techniques developed in this paper are domain independent and thus can be used for other learning tasks across any two domains, so long as we have the necessary resources.

4.2 Resources for Cross-domain Embeddings

We utilize GloVe (Pennington et al., 2014) to train domain-specific and domain-general word embeddings from different corpora, denoted as follows: 1) source emb, which is trained on a newswire-domain corpus (New York Times and Daily Mail articles); 2) target emb, which is trained on a social media corpus

Page 6: Neural Adaptation Layers for Cross-domain Named …Neural Adaptation Layers for Cross-domain Named Entity Recognition Bill Yuchen Liny and Wei Luz yUniversity of Southern California

                  CO        ON        RI       WN
#train token   204,567   848,220   37,098   46,469
#dev token      51,578   144,319    4,461   16,261
#test token     46,666    49,235    4,730   61,908
#train sent.    14,987    33,908    1,915    2,394
#dev sent.       3,466     5,771      239    1,000
#test sent.      3,684     1,898      240    3,856
#entity type         4        11       10       10

Table 1: Statistics of the NER datasets.

(Archive Team's Twitter stream grab4); 3) general emb, which is pre-trained on CommonCrawl, containing both formal and user-generated content.5 We obtain the intersection of the top 5K words of the source and target vocabularies sorted by frequency to build P_1. For P_2, we utilize an existing Twitter normalization lexicon containing 3,802 word pairs (Liu et al., 2012). More details are given in the supplementary material.

4.3 NER Datasets for Evaluation

For the source newswire domain, we use the following two datasets: OntoNotes-nw, the newswire section of the OntoNotes 5.0 release (ON) (Weischedel et al., 2013), which is publicly available6, and the CoNLL03 NER dataset (CO) (Sang and Meulder, 2003). For the first dataset, we randomly split the data into three sets: 80% for training, 15% for development and 5% for testing. For the second dataset, we follow the provided standard train-dev-test split. For the target domain, we consider the following two datasets: Ritter11 (RI) (Ritter et al., 2011b) and WNUT16 (WN) (Strauss et al., 2016), both of which are publicly available. The statistics of the four datasets used in this paper are shown in Table 1.

4.4 Baseline Transfer Approaches

We present the baseline approaches, which were originally investigated by Mou et al. (2016). Lee et al. (2017) explored the INIT method for NER, while Yang et al. (2017b) extended the MULT method for sequence labeling.

INIT: We first train a source model M_S using the source-domain training data D_S. Next, we construct a target model M_T and re-construct the final CRF layer to address the issue of different output spaces. We use the learned parameters of M_S to initialize M_T, excluding the CRF layer. Finally, INIT-FineTune continues training M_T with the target-domain training data D_T, while INIT-Frozen instead only updates the parameters of the newly constructed CRF layer.

4 https://archive.org/details/twitterstream
5 https://nlp.stanford.edu/projects/glove/
6 https://catalog.ldc.upenn.edu/ldc2013t19

MULT: The multi-task learning based transfer method simultaneously trains M_S and M_T using D_S and D_T. The parameters of M_S and M_T, excluding their CRF layers, are shared during the training process. Both Mou et al. (2016) and Yang et al. (2017b) follow Collobert and Weston (2008) and use a hyper-parameter λ as the probability of choosing an instance from D_S instead of D_T to optimize the model parameters. By selecting the hyper-parameter λ, the multi-task learning process tends to perform better in the target domain. Note that this method needs re-training of the source model with D_S every time we would like to build a new target model, which can be time-consuming, especially when D_S is large.
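A minimal sketch of this MULT sampling scheme is shown below; the batch format and the CRF head interface are assumptions made for illustration, not an API from the cited works.

```python
import random
from itertools import cycle

def mult_training(lam, source_loader, target_loader, shared_encoder,
                  source_head, target_head, optimizer, steps):
    """MULT-style loop: at each step, with probability lambda pick a source-domain
    batch and the source CRF head, otherwise a target-domain batch and the target
    head; the encoder below the two heads is shared."""
    src_iter, tgt_iter = cycle(source_loader), cycle(target_loader)
    for _ in range(steps):
        if random.random() < lam:
            (inputs, labels), head = next(src_iter), source_head
        else:
            (inputs, labels), head = next(tgt_iter), target_head
        loss = head(shared_encoder(inputs), labels)   # head returns the CRF loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```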

MULT+INIT: This is a combination of the above two methods. We first use INIT to initialize the target model. Afterwards, we train the two models as MULT does.

5 Results and Discussion

5.1 Main Results

We primarily focus on the discussion of experiments with a particular setup where D_S is OntoNotes-nw and D_T is Ritter11. In the experiments, "in-domain" means we only use D_T to train our base model, without any transfer from the source domain. "∆" represents the amount of improvement (in terms of F measure) obtained using transfer learning over "in-domain" for each transfer method. The hyper-parameters ψ and λ are tuned over {0.1, 0.2, ..., 1.0} on the development set, and we report results based on the tuned hyper-parameters.

We first conduct a set of experiments to evaluate the performance of different transfer methods under the assumption of homogeneous input spaces. Thus, we utilize the same word embeddings (general emb) for training both the source and target models. Consequently, we remove the word adaptation layer (cd emb) from our approach under this setting. The results are listed in Table 2. As we can observe, the INIT-Frozen method leads to a slight "negative transfer", which is also reported in the experiments of Mou et al. (2016). This indicates that solely updating the parameters of the final CRF layer is not enough for performing re-classification and re-recognition of the named entities for the new target domain.


Settings                   P      R      F      ∆
in-domain                72.73  56.14  63.37    -
INIT-Frozen              72.61  56.11  63.30  -0.07
INIT-FineTune            73.13  56.55  63.78  +0.41
MULT                     74.07  57.51  64.75  +1.38
MULT+INIT                74.12  57.57  64.81  +1.44
Ours (w/o word adapt.)   74.87  57.95  65.33  +1.96

Table 2: Comparisons of different methods for homogeneous input spaces. (D_S = ON, D_T = RI)

The INIT-FineTune method yields better results, as it also updates the parameters of the middle LSTM layers in the base model to mitigate the heterogeneousness. The MULT and MULT+INIT methods yield higher results, partly because they can better control the amount of transfer by tuning the hyper-parameter. Our proposed transfer approach outperforms all these baseline approaches. It not only controls the amount of transfer across the two domains but also explicitly captures variations in the input and output spaces when there is a significant domain shift.

We use a second set of experiments to understand the effectiveness of each method when dealing with heterogeneous input spaces. We use source emb for training the source models and target emb for learning the target models. From the results shown in Table 3, we find that all the baseline methods degrade when the word embeddings used for training the source models and learning the target models differ from each other. The heterogeneousness of the input feature spaces hinders them from making better use of the information from the source domain. However, with the help of the word and sentence adaptation layers, our method achieves better results. The experiment on learning without the word adaptation layer also confirms the importance of such a layer.7

Our results are also comparable to those obtained when the cross-domain embedding method of Yang et al. (2017a) is used instead of the word adaptation layer. However, as mentioned earlier, their method requires re-training the embeddings on the target domain and is more expensive.

5.2 Ablation Test

To investigate the effectiveness of each component of our method, we conduct an ablation test based on our full model (F=66.40) reported in Table 3.

7 We use the approximate randomization test (Yeh, 2000) for the statistical significance of the difference between "Ours" and "MULT+INIT". Our improvements are statistically significant with a p-value of 0.0033.

Settings                   P      R      F      ∆
in-domain                72.51  57.11  63.90    -
INIT-Frozen              72.65  55.25  62.77  -1.13
INIT-FineTune            72.83  56.73  63.78  -0.12
MULT                     73.11  57.35  64.28  +0.38
MULT+INIT                73.13  57.31  64.26  +0.36
Ours                     75.87  59.03  66.40  +2.50
w/ Yang et al. (2017a)   76.12  59.10  66.53  +2.63
w/o word adapt. layer    73.29  57.61  64.51  +0.61

Table 3: Comparisons of different methods for heterogeneous input spaces. (D_S = ON, D_T = RI)

We use ∆ to denote the difference in performance between each setting and our full model. The results of the ablation test are shown in Table 4. We first set ψ to 0 and 1 respectively to investigate the two special variants (Ours-Frozen, Ours-FineTune) of our method. We find that both perform worse than using the optimal ψ (0.6).

One natural concern is whether our performance gain is truly caused by an effective approach to cross-domain transfer learning, or simply because we use a new architecture with more layers (built on top of the base model) for learning the target model. To understand this, we carry out an experiment named "w/o transfer" by setting ψ to 1, where we randomly initialize the parameters of the middle BLSTMs in the target model instead of using the source model parameters. Results show that such a model does not perform well, confirming the effectiveness of transfer learning with our proposed adaptation layers. Results also confirm the importance of all three adaptation layers that we introduced. Learning the confidence scores (c_i) and employing the optional P_2 are also helpful, but they appear to play less significant roles.

5.3 Additional Experiments

As shown in Table 5, we conduct additional experiments to investigate the significance of our improvements on different source-target domain pairs, and whether the improvement is simply due to the increased model expressiveness from a larger number of parameters.8

We first set the hidden dimension of the sentence adaptation layer to be the same as the dimension of the source-domain word embeddings, which is 200 (I/M-200).

8 In this table, I-200/300 and M-200/300 represent the best performance from {INIT-Frozen, INIT-FineTune} and {MULT, MULT+INIT} respectively after tuning; "in-do." here is the best score of our base model without transfer.


Settings                               F      ∆
ψ=0.0  Ours-Frozen                   63.95  -2.45
ψ=1.0  Ours-FineTune                 63.40  -3.00
ψ=1.0  w/o transfer                  63.26  -3.14
ψ=0.6  w/o using confidence c_i      66.04  -0.36
ψ=0.6  w/o using P_2                 66.11  -0.29
ψ=0.6  w/o word adapt. layer         64.51  -1.89
ψ=0.6  w/o sentence adapt. layer     65.25  -1.15
ψ=0.6  w/o output adapt. layer       64.84  -1.56

Table 4: Comparing different settings of our method.

D_S, D_T    in-do.   I-200   M-200   I-300   M-300   Ours
(ON, RI)    63.37    +0.41   +1.44   +0.43   +1.48   +3.03
(CO, RI)             +0.23   +0.81   +0.22   +0.88   +1.86
(ON, WN)    51.03    +0.89   +1.72   +0.88   +1.77   +3.16
(CO, WN)             +0.69   +1.04   +0.71   +1.13   +2.38

Table 5: Results of transfer learning methods on different datasets with different numbers of LSTM units in the base model. (I: INIT; M: MULT; 200/300: number of LSTM units)

The dimension used for the output adaptation layer is half that of the base BLSTM model. Overall, our model involves roughly 117.3% more parameters than the base model.9 To understand the effect of a larger parameter size, we further experimented with a hidden unit size of 300 (I/M-300), leading to a parameter size of 213,750, which is comparable to "Ours" (203,750). As we can observe, our approach outperforms these baselines consistently in the four settings. More experiments with other settings can be found in the supplementary material.

5.4 Effect of Target-Domain Data Size

To assess the effectiveness of our approach when different amounts of training data from the target domain are available, we conduct additional experiments by gradually increasing the amount of target training data from 10% to 100%. We again use OntoNotes-nw and Ritter11 as D_S and D_T, respectively. Results are shown in Figure 4. The experiments for INIT and MULT are based on their respective best settings from Table 5. We find that the improvements of the baseline methods tend to become smaller as the target training set grows larger. This is partly because INIT and MULT do not explicitly preserve the parameters from the source models in the constructed target models. Thus, the transferred information is diluted as we train the target model with more data. In contrast, our transfer method explicitly preserves the transferred

9 Base model: (25×50 + 50²) + (250×200 + 200²) = 93,750; Ours: (25×50 + 50²) + (200×200 + 200²) + (250×200 + 200²) + (200×100 + 100²) = 203,750.

Figure 4: F1-score vs. amount of target-domain data (10% to 100%), comparing w/o transfer, INIT, MULT and Ours.

Figure 5: Performance of our method at different values of ψ (F1-score vs. ψ for ON→RI, ON→WN, CO→RI and CO→WN).

information in the base part of the target model, and we use separate learning rates to help the target model preserve the transferred knowledge. Similar experiments on other datasets are shown in the supplementary material.

5.5 Effect of the Hyper-parameter ψ

We present a set of experiments around the hyper-parameter ψ in Figure 5. Such experiments over different datasets can shed some light on how to select this hyper-parameter. We find that when the target data set is small (Ritter11), the best ψ values are 0.5 and 0.6 for the two source domains, whereas when the target data set is larger (WNUT16), the best ψ values become 0.7 and 0.8. The results suggest that the optimal ψ tends to be relatively larger when the target data set is larger.

6 Related Work

Domain adaptation and transfer learning have been popular topics that have been extensively studied in the past few years (Pan and Yang, 2010).


For well-studied conventional feature-based models in NLP, there are various classic transfer approaches, such as EasyAdapt (Daume, 2007), instance weighting (Jiang and Zhai, 2007) and structural correspondence learning (Blitzer et al., 2006). Fewer works have focused on transfer approaches for neural models in NLP. Mou et al. (2016) use intuitive transfer methods (INIT and MULT) to study the transferability of neural network models for the sentence (pair) classification problem; Lee et al. (2017) utilize the INIT method on highly related datasets of electronic health records to study their specific de-identification problem. Yang et al. (2017b) use the MULT approach in sequence tagging tasks including named entity recognition. Following the MULT scheme, Wang et al. (2018) introduce a label-aware mechanism into maximum mean discrepancy (MMD) to explicitly reduce the domain shift between the same labels across domains in medical data. Their approach requires the output space to be the same in both the source and target domains, whereas in the scenario considered in our paper the output spaces of the two domains are different.

None of these existing works use domain-specific embeddings for the different domains, and they all use the same neural model for the source and target models. Our word adaptation layer, in contrast, opens up the opportunity to use domain-specific embeddings. Our approach also addresses the domain shift problem at both the input and the output level by re-constructing target models with our specifically designed adaptation layers. The hyper-parameter ψ in our proposed method and λ in MULT both control the knowledge transfer from the source domain during the transfer learning process. While our method works directly on top of an existing pre-trained source model, MULT needs re-training with the source-domain data each time a new target model is trained.

Fang and Cohn (2017) add an "augmented layer" before the final prediction layer for cross-lingual POS tagging; it is a simple multi-layer perceptron that performs local adaptation for each token separately, ignoring contextual information. In contrast, we employ a BLSTM layer due to its ability to capture contextual information, which was recently shown to be crucial for sequence labeling tasks such as NER (Ma and Hovy, 2016; Lample et al., 2016). We also notice that an idea similar to ours has been used in the recently proposed Deliberation Network (Xia et al., 2017) for the sequence generation task, where a second-pass decoder is added to a first-pass decoder to polish the sequences generated by the latter.

Our proposal to learn the word adaptation layer is inspired by two prior studies. Fang and Cohn (2017) use cross-lingual word embeddings to obtain distant supervision for target languages. Yang et al. (2017a) propose to re-train word embeddings on the target domain using regularization terms based on the source-domain embeddings, where some hyper-parameter tuning based on downstream tasks is required. Our word adaptation layer serves as a linear transformation (Mikolov et al., 2013), which is learned based on corpus-level statistics. Although there are alternative methods that also learn a mapping between embeddings learned from different domains (Faruqui and Dyer, 2014; Artetxe et al., 2016; Smith et al., 2017), such methods usually involve modifying the source-domain embeddings, and thus re-training of the source model based on the modified source embeddings would be required for the subsequent transfer process.

7 Conclusion

We proposed a novel, lightweight transfer learning approach for cross-domain NER with neural networks. Our transfer method performs adaptation across two domains using adaptation layers augmented on top of an existing neural model. Through extensive experiments, we demonstrated the effectiveness of our approach, reporting better results than existing transfer methods. Our approach is general and can potentially be applied to other cross-domain structured prediction tasks. Future directions include investigating alternative neural architectures such as convolutional neural networks (CNNs) as adaptation layers, as well as how to learn the optimal value of ψ from the data automatically rather than treating it as a hyper-parameter.10

Acknowledgments

We would like to thank the anonymous reviewers for their thoughtful and constructive comments. Most of this work was done when the first author was visiting Singapore University of Technology and Design (SUTD). This work is supported by DSO grant DSOCL17061, and is partially supported by Singapore Ministry of Education Academic Research Fund (AcRF) Tier 1 SUTDT12015008.

10 We make our supplementary material and code available at http://statnlp.org/research/ie.

References

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proc. of EMNLP.

John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proc. of ACL.

Danushka Bollegala, Takanori Maehara, and Ken-ichi Kawarabayashi. 2015. Unsupervised cross-domain word representation learning. In Proc. of ACL.

Chien-Lung Chou, Chia-Hui Chang, and Ya-Yun Huang. 2016. Boosted web named entity recognition via tri-training. ACM Trans. Asian and Low-Resource Lang. Inf. Process.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proc. of ICML.

Hal Daume. 2007. Frustratingly easy domain adaptation. In Proc. of ACL.

Meng Fang and Trevor Cohn. 2017. Model transfer for tagging low-resource languages using a bilingual dictionary. In Proc. of ACL.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proc. of EACL.

Timothy W. Finin, William Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, and Mark Dredze. 2010. Annotating named entities in Twitter data with crowdsourcing. In Proc. of Mturk@HLT-NAACL.

Bo Han and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proc. of ACL.

Bo Han, Alan Ritter, Leon Derczynski, Wei Xu, and Tim Baldwin. 2016. Results of the WNUT16 named entity recognition shared task.

Jing Jiang. 2008. Domain adaptation in natural language processing. Ph.D. thesis.

Jing Jiang and ChengXiang Zhai. 2007. Instance weighting for domain adaptation in NLP. In Proc. of ACL.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

John D. Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proc. of NAACL-HLT.

Ji Young Lee, Franck Dernoncourt, and Peter Szolovits. 2017. Transfer learning for named-entity recognition with neural networks. arXiv preprint arXiv:1705.06273.

Fei Liu, Fuliang Weng, and Xiao Jiang. 2012. A broad-coverage normalization system for social media language. In Proc. of ACL.

Liyuan Liu, Jingbo Shang, Frank F. Xu, Xiang Ren, Huan Gui, Jian Peng, and Jiawei Han. 2018. Empower sequence labeling with task-aware neural language model. In Proc. of AAAI.

Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint named entity recognition and disambiguation. In Proc. of EMNLP.

Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proc. of ACL.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How transferable are neural networks in NLP applications? In Proc. of EMNLP.

Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Trans. Knowl. Data Eng., 22(10).

Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. arXiv preprint arXiv:1404.5367.

Nanyun Peng and Mark Dredze. 2016. Improving named entity recognition for Chinese social media with word segmentation representation learning. In Proc. of ACL.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP.

Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proc. of ACL.

Sasa Petrovic, Miles Osborne, and Victor Lavrenko. 2010. Streaming first story detection with application to Twitter. In Proc. of HLT-NAACL.

Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011a. Named entity recognition in tweets: An experimental study. In Proc. of EMNLP.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011b. Named entity recognition in tweets: An experimental study. In Proc. of EMNLP.

Sara Rosenthal and Kathy McKeown. 2015. I couldn't agree more: The role of conversational structure in agreement and disagreement detection in online discussions. In Proc. of SIGDIAL.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proc. of CoNLL.

Sunita Sarawagi and William W. Cohen. 2004. Semi-Markov conditional random fields for information extraction. In Proc. of NIPS.

Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax.

Thorvald Sørensen. 1948. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol. Skr.

Pontus Stenetorp, Hubert Soyer, Sampo Pyysalo, Sophia Ananiadou, and Takashi Chikayama. 2012. Size (and domain) matters: Evaluating semantic word space representations for biomedical text. In Proc. of SMBM.

Benjamin Strauss, Bethany E. Toma, Alan Ritter, Marie-Catherine de Marneffe, and Wei Xu. 2016. Results of the WNUT16 named entity recognition shared task. In Proc. of W-NUT.

Luchen Tan, Haotian Zhang, Charles L. A. Clarke, and Mark D. Smucker. 2015. Lexical comparison between Wikipedia and Twitter corpora by using word embeddings. In Proc. of ACL.

William Yang Wang and Diyi Yang. 2015. That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In Proc. of EMNLP.

Zhenghui Wang, Yanru Qu, Liheng Chen, Jian Shen, Weinan Zhang, Shaodian Zhang, Yimei Gao, G. Gu, Ken Chen, and Yong Yu. 2018. Label-aware double transfer learning for cross-specialty medical named entity recognition. In Proc. of NAACL-HLT.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes release 5.0 LDC2013T19. Linguistic Data Consortium.

Karl Weiss, Taghi M. Khoshgoftaar, and DingDing Wang. 2016. A survey of transfer learning. Journal of Big Data.

Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2017. Deliberation networks: Sequence generation beyond one-pass decoding. In Proc. of NIPS.

Bishan Yang and Claire Cardie. 2012. Extracting opinion expressions with semi-Markov conditional random fields. In Proc. of EMNLP-CoNLL.

Wei Yang, Wei Lu, and Vincent W. Zheng. 2017a. A simple regularization-based algorithm for learning cross-domain word embeddings. In Proc. of EMNLP.

Zhilin Yang, Ruslan Salakhutdinov, and William W. Cohen. 2017b. Transfer learning for sequence tagging with hierarchical recurrent networks. In Proc. of ICLR.

Alexander S. Yeh. 2000. More accurate tests for the statistical significance of result differences. In Proc. of COLING.