
QNLP in Practice: Running Compositional Models of Meaning on a Quantum Computer

Robin Lorenz† Anna Pearson† Konstantinos Meichanetzidis†‡ Dimitri Kartsaklis† Bob Coecke†

†Cambridge Quantum Computing ‡University of Oxford

{robin.lorenz;anna.pearson;k.mei;dimitri.kartsaklis;bob.coecke}@cambridgequantum.com

Abstract

Quantum Natural Language Processing (QNLP) deals with the design and implementation of NLP models intended to be run on quantum hardware. In this paper, we present results on the first NLP experiments conducted on Noisy Intermediate-Scale Quantum (NISQ) computers for datasets of size ≥ 100 sentences. Exploiting the formal similarity of the compositional model of meaning by Coecke et al. (2010) with quantum theory, we create representations for sentences that have a natural mapping to quantum circuits. We use these representations to implement and successfully train two NLP models that solve simple sentence classification tasks on quantum hardware. We describe in detail the main principles, the process and challenges of these experiments, in a way accessible to NLP researchers, thus paving the way for practical Quantum Natural Language Processing.

1 Introduction

Bringing the premise of computational speeds exponentially higher than the current standard, quantum computing is rapidly evolving to become one of the most popular cutting-edge areas in computer science. And while, until recently, most of the work in quantum computing was purely theoretical or concerned with simulations on classical hardware, the advent of the first quantum computers available to researchers, referred to as Noisy Intermediate-Scale Quantum (NISQ) devices, has already led to some promising practical results and applications spanning a wide range of topics such as cryptography (Pirandola et al., 2020), chemistry (Cao et al., 2019), and biomedicine (Cao et al., 2018).

An obvious question is whether this new paradigm of computation can also be used for NLP. Such applicability may be to the end of leveraging the computational speed-ups for language-related problems, as well as for investigating how quantum systems, their mathematical description and the way information is encoded “quantumly” may lead to conceptual and practical advances in representing and processing language meaning beyond computational speed-ups.

Inspired by these prospects, quantum natural language processing (QNLP), a field of research still in its infancy, aims at the development of NLP models explicitly designed to be executed on quantum hardware. There exists some impressive theoretical work in this area, but the proposed experiments are classically simulated. A notable exception to this is recent work by two of the authors (Meichanetzidis et al., 2020), where a proof of concept experiment of very small scale was performed on quantum hardware for the first time.

Following and adding to the ideas and results of that work, in this paper we present two complete medium-scale experiments consisting of linguistically-motivated NLP tasks running on quantum hardware. The goal of these experiments is not to demonstrate some form of “quantum advantage” over classical implementations in NLP tasks; we believe this is not yet possible due to the limited capabilities of currently available quantum computers. In this work, we are mostly interested in providing a detailed account to the NLP community of what QNLP entails in practice. We show how the traditional modelling and coding paradigm can shift to a quantum-friendly form, and we explore the challenges and limitations imposed by the current NISQ computers.

From an NLP perspective, both tasks involve some form of sentence classification: for each sentence in the dataset, we use the compositional model of Coecke et al. (2010) – often dubbed as DISCOCAT (DIStributional COmpositional CATegorical) – to compute a state vector, which is then converted to a binary label. The model is trained on a standard binary cross entropy objective, using an optimisation technique known as Simultaneous Perturbation Stochastic Approximation (SPSA). The results obtained from the quantum runs are accompanied by various classical simulations which show the projected long-term behaviour of the system in the absence of hardware limitations.

The choice of DISCOCAT is motivated by the fact that the derivations it produces essentially form a tensor network, which means they are already very close to how quantum computers process data. Furthermore, the model comes with a rigorous treatment of the interplay between syntax and semantics and with a convenient diagrammatic language. In Section 5 we will see how the produced diagrams get a natural translation to quantum circuits – the basic units of computation on a quantum computer1 – and how sentences of different grammatical structure are mapped to different quantum circuits. We will also explain how tensor contraction (the composition function) has its expression in terms of quantum gates. We further discuss the role of noise on a NISQ device, and how this affects our designing choices for the circuits. The experiments are performed on an IBM NISQ computer provided by the IBM Quantum Experience platform2.

For our experiments we use two different datasets. The first one (130 sentences) is generated automatically by a simple context-free grammar, with half of the sentences related to food and half related to IT (a binary classification task). The other one (105 noun phrases) is extracted from the RELPRON dataset (Rimell et al., 2016), and the goal of the model is to predict whether a noun phrase contains a subject-based or an object-based relative clause (again a binary classification task). We demonstrate that the models converge smoothly, and that they produce good results (given the size of the datasets) in both quantum and simulated runs.

In summary, the contributions of this paper are the following: firstly, we outline in some depth the process, the technicalities and the challenges of training and running an NLP model on a quantum computer; secondly, we provide a strong proof of concept that quantum NLP is within our reach.

The structure of the paper is the following: Section 2 discusses the most important related work on experimental quantum computing and QNLP; Section 3 describes DISCOCAT; Section 4 provides an introduction to quantum computing; Section 5 gives a high-level overview for a general QNLP pipeline; Section 6 explains the tasks; Section 7 provides all the necessary details for the experiments; and finally, Section 8 summarises our findings and points to future work.

1 In the circuit-based, as opposed to the measurement-based, model of quantum computation.

2 https://quantum-computing.ibm.com

2 Related work

There is a plethora of hybrid classical-quantum algorithms with NISQ technology in mind (Bharti et al., 2021). The majority of already implementable quantum machine learning (QML) protocols are based on variational quantum circuit methods (Benedetti et al., 2019). However, useful quantum algorithms with theoretically proven speedups assume fault-tolerant quantum computers, which are currently not available. (See Section 4 for more details.)

In the context of compositional QNLP, there is the early theoretical work by Zeng and Coecke (Zeng and Coecke, 2016), in which the DISCOCAT model3 is leveraged to obtain a quadratic speedup. Further theoretical work (Coecke et al., 2020) lays the foundations for implementations on NISQ devices. In (O’Riordan et al., 2020), a DISCOCAT-inspired workflow was introduced along with experimental results obtained by classical simulation. Regarding parsing, in (Bausch et al., 2020) the authors employ Grover search to achieve superpolynomial speedups, while Wiebe et al. (2019) use quantum annealing to solve problems for general context-free languages. In (Gallego and Orus, 2019), parse trees are interpreted as information-coarse-graining tensor networks where it is also proposed that they can be instantiated as quantum circuits. Ramesh and Vinay (2003) provide quantum speedups for string-matching which is relevant for language recognition. We also mention work on quantum-inspired classical models incorporating features of quantum theory such as its inherent probabilistic nature (Basile and Tamburini, 2017) or the existence of many-body entangled states (Chen et al., 2020). Finally, although not directly related to NLP, there is a lot of interesting work on quantum neural networks, see for example (Gupta and Zia, 2001; Beer et al., 2020).

Recently, Meichanetzidis et al. (2020) provided for the first time a proof of concept that practical QNLP is in principle possible in the NISQ era, by managing to run and optimize on a NISQ device a classifier using a dataset of 16 artificial short sentences. The current work is a natural next step, presenting for the first time two NLP experiments of medium scale on quantum hardware.

3 For an introduction to DISCOCAT, see Section 3.

3 A model of meaning inspired by quantum mechanics

Based on the rigorous mathematical framework of compact closed categories, the DISCOCAT model of Coecke et al. (2010) is of particular interest for our purposes, since its underlying compact-closed structure provides an abstraction of the Hilbert space formulation of quantum theory (Abramsky and Coecke, 2004). In this model, the meaning of words is represented by tensors whose order is determined by the types of words, expressed in a pregroup grammar (Lambek, 2008). A type p has a left (pl) and a right adjoint (pr), and the grammar has only two reduction rules:

p · pr → 1    pl · p → 1    (1)

Assuming atomic types n for nouns and noun phrases and s for sentences, the type of a transitive verb becomes nr · s · nl, denoting that an n is expected on the left and another one on the right, to return an s. Thus, the derivation for a transitive sentence such as “John likes Mary” gets the form:

n · (nr · s · nl) · n → (n · nr) · s · (nl · n) → 1 · s · 1 → s    (2)

showing that this is a grammatical sentence. In diagrammatic form:

[Pregroup diagram for “John likes Mary”: the word types n, nr · s · nl and n, with cups joining n · nr and nl · n.]

where the “cups” (∪) denote the grammar reductions. The transition from pregroups to vector space semantics is achieved by a mapping4 F that sends atomic types to vector spaces (n to N and s to S) and composite types to tensor product spaces (nr · s · nl to N ⊗ S ⊗ N). For example, a transitive verb becomes a tensor of order 3, which can be seen as a bilinear map N ⊗ N → S, while an adjective (with type n · nl) can be seen as a matrix, representing a linear map N → N. Further, F translates all grammar reductions to tensor contractions, so that the meaning of a sentence s = w1w2 . . . wn with a pregroup derivation α is given by:

s = F(α) [w1 ⊗ w2 ⊗ · · · ⊗ wn]    (3)

Here, F turns α into a linear map that, applied to the tensor product of the word representations, by tensor-contracting that expression returns a vector for the whole sentence. As a concrete example, the meaning of the sentence “John likes Mary” becomes s = j · L · m, where j, m ∈ N and L ∈ N ⊗ S ⊗ N. Above, L is a tensor of order 3, and s is a vector in S. Note that the underlying field of the vector spaces is not specified, and can, e.g., be R or C depending on the particular type of model (it is C in this work).

4 This mapping can be formalised as a category-theoretic functor (Kartsaklis et al., 2016).

Meaning computations like the example above can be conveniently represented using the diagrammatic calculus of compact closed categories (Coecke and Kissinger, 2017):

[Diagram: the words John, likes and Mary drawn as boxes with wires in N, N ⊗ S ⊗ N and N, respectively, and cups marking the tensor contractions.]

where the boxes denote tensors, the order of which is determined by the number of their wires, while the cups are now the tensor contractions. Note the similarity between the pregroup diagram above with the one here, and how the grammatical derivation essentially dictates both the shapes of the tensors and the contractions. As we will see later, these string diagrams can further naturally be mapped to quantum circuits, as used in quantum computation.
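To make the linear-algebraic reading of such a diagram concrete, the following is a minimal NumPy sketch of the contraction s = j · L · m for “John likes Mary”. The dimensions and the randomly initialised tensors are purely illustrative; they stand in for trained word representations.

```python
import numpy as np

dim_N, dim_S = 4, 2                                   # toy dimensions for N and S

rng = np.random.default_rng(0)
john = rng.normal(size=dim_N)                         # vector in N
mary = rng.normal(size=dim_N)                         # vector in N
likes = rng.normal(size=(dim_N, dim_S, dim_N))        # order-3 tensor in N (x) S (x) N

# The two cups contract the verb's noun wires with "John" and "Mary",
# leaving a vector in the sentence space S.
sentence = np.einsum("i,isj,j->s", john, likes, mary)
print(sentence.shape)                                 # (2,)
```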

There is a rich literature on the DISCOCAT model, and here we mention only a few indicative publications. Proposals for modelling verbs and implementations of the model have been provided by Grefenstette and Sadrzadeh (2011) and Kartsaklis et al. (2012); Kartsaklis and Sadrzadeh (2014). Sadrzadeh et al. (2013) address the modelling of relative pronouns. Piedeleu et al. (2015) present a version of the model where the meaning of words is given by density matrices, encoding phenomena such as homonymy and polysemy. Finally, versions of the model have been used extensively in conceptual tasks such as textual entailment at the level of sentences, see for example (Sadrzadeh et al., 2018; Bankova et al., 2019; Lewis, 2019).

4 Introduction to quantum computing

For any serious introduction into quantum information theory and quantum computing, which obviously is beyond the scope of this paper, the reader is referred to the literature (see, e.g., Coecke and Kissinger (2017) and Nielsen and Chuang (2011)). However, for the sake of a self-contained manuscript, this section will, in a pragmatic way and with a reader in mind who has had no exposure to quantum theory, set out the required terms and concepts.

We naturally start with a qubit, which, as the most basic unit of information carrier, is the quantum analogue of a bit, and yet a very different sort of thing. It is associated with a property of a physical system such as the spin of an electron (‘up’ or ‘down’ along some axis), and has a state |ψ〉 that lives in a 2-dimensional complex vector space (more precisely a Hilbert space). With |0〉, |1〉 denoting orthonormal basis vectors,5 which are related to the respective outcomes ‘0’ and ‘1’ of a measurement, a general state of the qubit is a linear combination known as a superposition: |ψ〉 = α |0〉 + β |1〉, where α, β ∈ C and |α|² + |β|² = 1.

Importantly, quantum theory is a fundamentally probabilistic theory, that is, even given that a qubit is in state |ψ〉 – a state known as perfectly as is in principle possible – this generally allows one only to make predictions for the probabilities with which the outcomes ‘0’ and ‘1’, respectively, occur when the qubit is measured. These probabilities are given by the so-called Born rule P(i) = |〈i|ψ〉|², where i = 0, 1 and the complex number 〈i|ψ〉, called the amplitude, is given by the inner product written as the composition of state |ψ〉 with quantum effect 〈i|. Hence, for the above state, P(0) = |α|² and P(1) = |β|².

The evolution of an isolated qubit before measuring it is described through the transformation of its state with a unitary linear map U, i.e. |ψ′〉 = U|ψ〉. See Fig. 1a for a diagrammatic representation of such evolution.

The joint state space of q qubits is given by the tensor product of their individual state spaces and thus has (complex) dimension 2^q. For instance, for two ‘uncorrelated’ qubits in states |ψ1〉 = α1 |0〉 + β1 |1〉 and |ψ2〉 = α2 |0〉 + β2 |1〉, the joint state is |ψ1〉 ⊗ |ψ2〉, which in basis-dependent notation becomes (α1, β1)^T ⊗ (α2, β2)^T = (α1α2, α1β2, β1α2, β1β2)^T. The evolution of a set of qubits that interact with each other is described by a unitary map acting on the overall state space. The diagrammatic representation from Fig. 1a then extends correspondingly to a quantum circuit, such as the example shown in Fig. 1b.
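As a small illustration of these definitions, the following NumPy sketch builds a superposition state, evaluates the Born rule, applies a unitary (here the Hadamard gate) and forms the joint state of two qubits via the Kronecker product. The numbers are arbitrary and only meant to mirror the formulas above.

```python
import numpy as np

ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)

alpha, beta = 1 / np.sqrt(2), 1j / np.sqrt(2)     # |alpha|^2 + |beta|^2 = 1
psi = alpha * ket0 + beta * ket1                  # |psi> = alpha|0> + beta|1>

# Born rule: P(i) = |<i|psi>|^2
p0 = abs(np.vdot(ket0, psi)) ** 2                 # = |alpha|^2 = 0.5
p1 = abs(np.vdot(ket1, psi)) ** 2                 # = |beta|^2  = 0.5

# Unitary evolution |psi'> = U|psi>, here with the Hadamard gate as U.
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
psi_prime = H @ psi

# Joint state of two uncorrelated qubits: the tensor (Kronecker) product.
phi = np.kron(psi, psi_prime)                     # 4-dimensional state vector
print(p0, p1, phi.shape)
```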

5 We use Dirac (bra-ket) notation, where the ‘bra’, also called the effect, 〈φ| is the dual vector of the ‘ket’ |φ〉, the state. States are representable as column vectors and effects as row vectors.

[Figure 1: (a) Basic example of the diagrammatic representation, here of 〈0|ψ′〉 = 〈0|U|ψ〉, i.e. the evolution of a qubit in initial state |ψ〉 with unitary map U and then composed with the effect 〈0|, corresponding to the (non-deterministic) outcome ‘0’. (b) Example of a quantum circuit, which contains all the kinds of gates relevant to this paper: the Hadamard gate H, the X-rotation gate Rx(β) by angle β, the controlled Z-rotation gate (i), a component of which is a Z-rotation gate Rz(δ) by angle δ, and finally the quantum CNOT gate (ii).]

On the one hand, such a quantum circuit captures the structure of the overall linear map evolving the respective qubits, where parallel ‘triangles’ and parallel boxes are to be read as tensor products of states and unitary maps, respectively, and sequential ‘wiring up’ of boxes as composition of linear maps. Hence, a circuit as a whole represents the application of a linear map to a vector, computing the systems’ overall state at a later time – in other words, it is simple linear algebra and can be viewed as tensor contraction of complex-valued tensors that are represented by the gates.

On the other hand, a circuit can conveniently also be seen as a list of abstract instructions for what to actually implement on some quantum hardware in order to make, say, some photons6 undergo, physically, the corresponding transformations as prescribed by the gates of the circuit.

Now, coming back to the fact that quantum theory is probabilistic, once a circuit has been run on hardware all qubits are measured. In the case of Fig. 1b this yields 5 bits each time, and many such runs have to be repeated to obtain statistics from which to estimate the outcome probabilities. These probabilities connect theory and experiment.

In order to obtain the result for a given problem, the design of the circuit has to encode the problem such that that result is a function of the outcome probabilities. Hence, the choice of circuit is key.

A special case, albeit straightforward, is worth mentioning due to its relevance for this paper: the encoding of the quantity of interest in a circuit over, say, q qubits may be such that the result is a function of the outcome distribution on just r of the qubits (r < q), but subject to the condition that the remaining q − r qubits have yielded particular outcomes, i.e. the result is a function of the corresponding conditional probability distribution. The technical term for this is post-selection, seeing as one has to run the whole circuit many times and measure all qubits to then restrict – post-select – the data for when the condition on the q − r qubits is satisfied. At the diagrammatic level the need for such post-selection is typically indicated by the corresponding quantum effects as done, e.g., in Fig. 5, which up to the 0-effects on 4 of the 5 qubits, is identical to that of Fig. 1b.

6 The basic physical object used as a qubit varies vastly across different quantum computers.
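As a small, self-contained illustration of post-selection, the following Python sketch takes made-up shot counts over three qubits, keeps only the shots in which the first two qubits gave outcome ‘0’, and estimates the conditional distribution of the remaining qubit. The bitstrings, counts and qubit ordering are invented for the example.

```python
from collections import Counter

# Toy data for q = 3, r = 1: bitstrings ordered as (post-selected qubits..., result qubit).
counts = Counter({"000": 410, "001": 135, "010": 96, "100": 88, "011": 57, "111": 42})

selected = {b: c for b, c in counts.items() if b[:2] == "00"}   # condition: first two qubits are 0
total = sum(selected.values())
p = {outcome: sum(c for b, c in selected.items() if b[2] == outcome) / total
     for outcome in ("0", "1")}
print(p)   # conditional distribution of the result qubit, e.g. {'0': 0.75, '1': 0.25}
```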

Usually, the power of quantum computing is naively attributed to ‘quantum parallelism’: that one can create exponentially large superpositions over the solution space of a problem. However, the caveat, and the reason why it is not believed that quantum computers can efficiently solve NP-complete problems, is that the output is a probability distribution one needs to sample from. Viewing quantum theory as a generalisation of probability theory, one realises that a key notion that quantum computing allows for is interference; amplitudes sum as complex numbers. Designing quantum algorithms amounts to crafting amplitudes so that the wrong solutions to a problem interfere destructively and the correct ones constructively so that they can be measured with high probability. This is a hard conceptual problem in itself, which explains the relatively small number of quantum algorithms, albeit very powerful ones, in the quantum computer scientist’s arsenal.

Actually building and running a quantum computer is a challenging engineering task for multiple reasons. Above all, qubits are prone to random errors from their environment and unwanted interactions amongst them. This ‘coherent noise’ is different in nature from that of classical computing hardware. A quantum computer that would give the expected advantages for large-scale problems is one that comes with a large number of so-called fault-tolerant qubits, essentially obtained by clever error correction techniques. Quantum error correction (Brown et al., 2016) reduces to finding ways for distributively encoding the state of a logical qubit on a large number of physical qubits (hundreds or even thousands). Their scalable technical realisability is still out of reach at the time of writing. The currently available quantum devices are rather noisy medium-scale machines with fewer than 100 physical qubits, playing mostly the role of proof of concept and being extremely valuable assets for the development of both theory and applications. This is the reason one speaks of the NISQ era, mentioned in Sec. 1, and this is the light in which the experimental pioneering on QNLP presented in this paper has to be seen – exciting proof of concept, while the machines are still too small and noisy for large-scale QNLP experiments.

5 The general pipeline

Section 4 explained that the quantum computing equivalent to classical programming involves the application of quantum gates on qubits according to a quantum circuit. This section explains the general pipeline of our approach, i.e. in particular the process of how to go from a sentence to its representation as a quantum circuit, on the basis of which the model predicts the label. Figure 2 schematically depicts this pipeline, each numbered step of which will be addressed subsequently at a generic level, whereas concrete examples of choices that one has to make along the way will be covered in the implementation of this pipeline presented in Sec. 7.

Since DISCOCAT is syntax-sensitive, the first step is to get a syntax tree corresponding to the sentence, on the basis of which a DISCOCAT derivation is created in a diagrammatic form. In order to avoid computational complications on quantum hardware, this diagram is first optimised to yield the input into an ansatz that then determines the actual conversion into a quantum circuit. A quantum compiler translates the latter into hardware-specific code that can be run on quantum hardware. These stages are described in more detail below.

[Figure 2: Schematic overview of the general pipeline: sentence →(1) syntax tree →(2) DisCoCat diagram →(3) rewritten, optimised diagram →(4) quantum circuit (via an ansatz) →(5) compiled circuit →(6) measurement statistics →(7) post-processed result.]

Step 1: For a large-scale NLP experiment with thousands of sentences of various structures, the use of a pregroup parser for providing the syntax trees would be necessary.7 In the present work, though, this step can be executed semi-automatically due to the limited vocabulary and the small number of different grammatical structures in our sentences. For instance, with nouns, adjectives and transitive verbs having the respective types n, n · nl and nr · s · nl, the sentence “person prepares tasty dinner” is parsed as below (see Sec. 3 for more details):

n · (nr · s · nl) · (n · nl) · n → (n · nr) · s · (nl · n) · (nl · n) → 1 · s · 1 · 1 → s    (4)

7 Since to the best of our knowledge at the time of writing there are not any robust pregroup parsers available, an alternative approach would be to use a CCG parser and subsequently convert the types into pregroups. This however comes with a few caveats, the discussion of which is outside the scope of this paper.
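For readers who want to play with such reductions, here is a toy Python sketch (not the parsing machinery used in the paper) that checks whether a sequence of pregroup types cancels down to a single target type using the two reduction rules, with adjoints written as suffixes (‘nr’, ‘nl’, ‘sl’, ...):

```python
def reduces_to(types, target):
    """Greedy stack-based cancellation of x . x^r -> 1 and x^l . x -> 1."""
    stack = []
    for t in types:
        if stack and (t == stack[-1] + "r" or stack[-1] == t + "l"):
            stack.pop()          # a reducible adjacent pair cancels
        else:
            stack.append(t)
    return stack == [target]

# "person prepares tasty dinner": n . (nr s nl) . (n nl) . n  ->  s
print(reduces_to(["n", "nr", "s", "nl", "n", "nl", "n"], "s"))   # True
```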

Step 2: Construct the sentences’ DisCoCat diagrams by representing each word as a state and then ‘wiring them up’ by drawing a cup for every reduction rule. The above example becomes:8

[Figure 3: DisCoCat diagram for “person prepares tasty dinner”, with word states of types n, nr · s · nl, n · nl and n (mapped to the spaces N, N ⊗ S ⊗ N, N ⊗ N and N) and cups for the grammar reductions.]

Step 3: The structure of compact closed categories comes with rewrite rules that allow the transformation of diagrams like the above into equivalent ones, which, depending on the final implementation, can bring computational advantages. While so far there is no universal recipe for how to do this optimally, one crucial observation in light of a quantum implementation is that cups are costly (see Step 5). One simple, but effective transition to equivalent diagrams is therefore achieved by ‘bending down’ all nouns of a sentence9 like ‘person’ and ‘dinner’ in the example:

[Figure 4: The rewritten diagram for “person prepares tasty dinner”, in which the noun states ‘person’ and ‘dinner’ have been bent down into effects, removing their cups.]

8 As discussed in Sec. 3, the diagram in Fig. 3 represents linear-algebraic operations between tensors in vector spaces N, S and tensor products of them. For convenience, we also include the pregroup types as a reminder of the grammatical rules involved in each contraction. For the remainder of this paper, only the pregroup types will be shown since they are indicative of the vector spaces they are mapped to.

9 Effectively replacing the cup and the noun’s state 1 → n with an effect nr → 1 or nl → 1, depending on which end of the cup the noun was.

[Figure 5: Example of interpreting Fig. 4 as a quantum circuit according to an ansatz. Here qn = 1 = qs; the words ‘prepares’ and ‘tasty’ are replaced with the quantum states marked by (i) and (ii), respectively, ‘person’ and ‘dinner’ are replaced with the quantum effects marked as (iii) and (v), respectively, and the cup is written as the Bell effect in component (iv).]

This way the number of cups is reduced by as many nouns as are present in the sentence.

Step 4: This is the step where the abstract DISCOCAT representation takes a more concrete form: the DISCOCAT diagram is mapped to a specific quantum circuit. This map is determined by: (a) choosing the number qn and qs of qubits that every wire of type n and s, respectively, as well as dual types thereof, get mapped to; and (b) choosing concrete parametrised quantum states (effects) that all word states (effects) get consistently replaced with. We refer to the conjunction of such choices as an ansatz. Note that each cup10 is equivalent to a Bell-effect (see Fig. 5). Principled approaches to (b) are presented in Sec. 7, but for an illustration consider the example from Fig. 4 translated into a quantum circuit of the form as shown in Fig. 5.

It is worth emphasising that the mapping’s output is a circuit whose connectivity is fixed by the sentence’s syntax, while the choice of ansatz determines the number of parameters for each word’s representation. Now, in principle it is of course known how many parameters p are needed to fix the most general state on q qubits, so why is this choice of ansatz important (independent from questions of overfitting and generalisation)? There are two reasons, which are both of a practical nature. First, p is exponential in q. So, even beyond the NISQ era, for the sort of dataset size and length of sentences one wishes to consider in NLP, a feasible number of parameters has to be achieved. Second, different quantum machines have different sets of ‘native’ gates, and some gates are less prone to errors than others when implemented. Hence, on NISQ machines the choice of ansatz should be informed by the actual hardware to avoid unnecessary gate-depth after compilation and hence noise from mere re-parametrisation.

10 Recall from the discussion in Sec. 3 that cups can also be seen to correspond to tensor contractions.

Step 5: A quantum compiler translates the quantum circuit into machine-specific instructions. This includes expressing the circuit’s quantum gates in terms of those actually physically available on that machine and ‘routing’ the qubits to allow for the necessary interactions given the device’s topology, as well as an optimisation to further reduce noise.

Part of this step also is the necessary bookkeeping to do with post-selection (see Sec. 4). The 0-effects in Fig. 5, while crucial parts of the sentence’s representation, are not deterministically implementable operations; as outcomes of measurements, they are obtained only with a certain probability. The circuit corresponding to Fig. 5 that can be implemented is hence precisely that of Fig. 1b with the additional operation of measuring all 5 qubits at the end. All that one has to do is remember which qubits, after many runs of the quantum circuit (these runs giving shots), have to be post-selected on the 0-outcome.

Here we see the reason why cups are costly, since each cup leads to 2qn (or 2qs) qubits that require post-selection. Had one stuck to the diagram in Fig. 3 then 6 out of 7 qubits would have had to be post-selected, rather than 4 out of 5 based on Fig. 4. With sentences just slightly longer than in the example and a limited number n_shots of shots of any given circuit (2^13 for IBM quantum devices) one easily runs into severe statistical limitations to reliably estimate the desired outcome probabilities.

Step 6: The quantum computer runs the circuit n_shots times, i.e. for each run prepares initial states, applies the gates and measures all the qubits at the end. This returns outcome counts of the shots for all qubits.

Step 7: Post-processing applies the above-mentioned post-selection process and turns the counts into estimations of relative frequencies, which are the input into any further post-processing for the calculation of a task-specific final result.

6 The tasks

We define two simple binary classification tasks for sentences. In the first one, we generated sentences of simple syntactical forms (containing at most 5 words) from a fixed vocabulary by using a simple context-free grammar (Table 1).

    noun phrase → noun
    noun phrase → adjective noun
    verb phrase → verb noun phrase
    sentence    → noun phrase verb phrase

Table 1: Context-free grammar for the MC task.

The nature of the vocabulary (of size 17) allowed us to choose sentences that look natural and refer to one of two possible topics, food or IT. The chosen dataset of this task, henceforth referred to as MC (‘meaning classification’), consists of 65 sentences from each topic, similar to the following:

“skillful programmer creates software”
“chef prepares delicious meal”

Part of the vocabulary is shared between the two classes, so the task (while still an easy one from an NLP perspective) is not trivial.
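To give a flavour of how such a dataset can be produced, the following Python sketch samples sentences from the grammar of Table 1. The small vocabulary below is invented for illustration; it is not the actual 17-word vocabulary used for the MC dataset.

```python
import random

vocab = {
    "noun":      ["programmer", "software", "chef", "meal"],
    "adjective": ["skillful", "delicious"],
    "verb":      ["creates", "prepares"],
}

def noun_phrase():
    # noun phrase -> noun | adjective noun
    if random.random() < 0.5:
        return [random.choice(vocab["noun"])]
    return [random.choice(vocab["adjective"]), random.choice(vocab["noun"])]

def sentence():
    # sentence -> noun phrase verb phrase;  verb phrase -> verb noun phrase
    return " ".join(noun_phrase() + [random.choice(vocab["verb"])] + noun_phrase())

random.seed(1)
print(sentence())   # e.g. "chef prepares delicious meal"
```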

In a slightly more conceptual task, we select 105 noun phrases containing relative clauses from the RELPRON dataset (Rimell et al., 2016). The phrases are selected in such a way that each word occurs at least 3 times in the dataset, yielding an overall vocabulary of 115 words. While the original task is to map textual definitions (such as “device that detects planets”) to terms (“telescope”), for the purposes of this work we convert the problem into a binary-classification one, with the goal to predict whether a certain noun phrase contains a subject relative clause (“device that detects planets”) or an object relative clause (“device that observatory has”). Our motivation behind this task, henceforth referred to as RP, is that it requires some syntactic awareness from the model, so it is a very reasonable choice for testing DISCOCAT. Further, the size of vocabulary and consequently the sparseness of words make this task a much more challenging benchmark compared to the MC task.

These simple datasets already pose challenges in two ways. First, concerning the lengths of sentences (see Step 5 of Sec. 5 and also Sec. 8). Second, concerning the sizes of the datasets, since they already reach the limits of the currently available quantum hardware – even just doubling the number of sentences would start to approach an unfeasible time cost given the shared available resources (more details about this in Sec. 7.4).

7 Experiments

The experiments reported in this paper address the two tasks MC and RP, respectively, by implementing the pipeline from Fig. 2 together with an optimisation procedure to train the model parameters against an objective function, as in standard supervised machine learning. Importantly, training the model amounts to learning the word representations in a task-specific way, while all other aspects of the DISCOCAT model are dictated by syntax or are contingent hyperparameters that fix the ansatz. Our Python implementation11 used the DISCOPY package12 (de Felice et al., 2020) to implement the DISCOCAT-specific Steps 2-4, the Python interface of the quantum compiler t|ket〉TM 13 (Sivarajah et al., 2020) for Step 5, and the IBM device ibmq_bogota for Step 6. The remainder of this section describes all steps in detail.

11 The Python code and the datasets will be made available at https://github.com/CQCL/qnlp_lorenz_etal_2021_resources.

12 https://github.com/oxford-quantum-group/discopy

13 https://github.com/CQCL/pytket

[Figure 6: DISCOCAT diagrams for example phrases from the RP dataset, where in (a) ‘that’ is the subject, while in (b) it is the object. (c) depicts the quantum state (a GHZ state) assigned to ‘that’ as part of our ansatze, given that qn = 1 and qs = 0.]

As the datasets of both tasks are simple, parsing can be done semi-automatically. For the noun phrases and sentences considered in this work, we simply note that relative pronouns in the subject case take pregroup type nr · n · sl · n, while in the object case their type is nr · n · (nl)l · sl, and we remind the reader that the types for adjectives and transitive verbs are, respectively, n · nl and nr · s · nl. One can then refer to Step 2 in Sec. 5 to work out the derivations and convince oneself that they have a unique reduction to n or s. For illustration, diagrams (a) and (b) in Fig. 6 depict two noun phrases from the RP dataset, while the example in Fig. 3 is a sentence from the MC dataset. The scheme described in Step 3 of ‘bending the nouns’ was consistently applied to both datasets.

7.1 Parametrisation – the ansatze

As explicated in Sec. 5, Step 4, the choice of ansatz determines the parametrisation of the concrete family of models. We studied a variety of ansatze for both tasks. For each ansatz all appearing parameters are valued in [0, 2π].

In order to keep the number of qubits as low as possible in light of the noise in NISQ devices, while having at least one qubit to encode the label – we are trying to solve a binary classification task after all – we set qn = 1 and qs = 1 for the MC task, but qs = 0 for the RP task (noting that the type of the phrases here is n). Recall that the vector spaces in which the noun and sentence embeddings live have (complex) dimension N = 2^qn and S = 2^qs, respectively.

Due to our scheme from Step 3 (see Sec. 5) all nouns appear as effects 〈w|, for which two options were considered (always consistently applied to all nouns of the dataset). First, an Euler decomposition that involves three parameters and assigns 〈w| = 〈0|Rx(θ1)Rz(θ2)Rx(θ3). This is one way to write down the most general qubit effect. Second, the use of a single Rx gate by assigning 〈w| = 〈0|Rx(θ), which, with a single parameter, gives a more economical option that is still well-motivated, since an Rx gate can mediate between the |0〉 and |1〉 basis states. Let pn ∈ {3, 1} represent the choice of which of these two options is chosen.
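As an illustration of these two parametrisations, here is a small NumPy sketch that builds the effect 〈w| as 〈0| multiplied by the chosen rotation gates. The rotation-matrix conventions (half-angle form) are a standard choice and may differ from those of the compiled hardware gates.

```python
import numpy as np

def Rx(t):
    return np.array([[np.cos(t / 2), -1j * np.sin(t / 2)],
                     [-1j * np.sin(t / 2), np.cos(t / 2)]])

def Rz(t):
    return np.array([[np.exp(-1j * t / 2), 0],
                     [0, np.exp(1j * t / 2)]])

bra0 = np.array([1, 0], dtype=complex)             # the effect <0|

def noun_effect(thetas):
    """<w| = <0| Rx(t1) [Rz(t2) Rx(t3)], for pn = 1 or pn = 3 parameters."""
    U = np.eye(2, dtype=complex)
    for gate, t in zip([Rx, Rz, Rx], thetas):
        U = U @ gate(t)
    return bra0 @ U                                # a row vector: the effect <w|

print(noun_effect([0.3]))            # pn = 1
print(noun_effect([0.3, 1.1, 2.0]))  # pn = 3
```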

Adjectives are states on two qubits and verbs (only transitive ones appear) are, depending on qs, states on two or three qubits. For adjectives and verbs, so-called IQP14-based states were used. For m qubits this consists of all m qubits initialised in |0〉, followed by d many IQP layers (Havlíček et al., 2019), where each layer consists of an H gate on every qubit composed with m − 1 controlled Rz gates, connecting adjacent qubits. Components (i) and (ii) of Fig. 5 show one IQP layer for a three- and a two-qubit state, respectively. Motivation for this choice of IQP layers comes – beyond being expressible enough to perform QML (Havlíček et al., 2019) – from the fact that the appearing gates were native to IBMQ’s machines.15 We considered d ∈ {1, 2}, again in order to keep the depth of circuits as small as possible.
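The following pytket sketch shows what such an IQP-style word state could look like as a circuit: each of d layers applies an H gate to every qubit and then m − 1 controlled-Rz gates between adjacent qubits. The function name, the way parameters are passed and the example angles are our own illustrative choices (note that pytket expresses rotation angles in half-turns).

```python
from pytket import Circuit

def iqp_word_state(m, layer_params):
    """layer_params: list of d lists, each holding the m - 1 controlled-Rz angles of a layer."""
    circ = Circuit(m)                      # all m qubits start in |0>
    for layer in layer_params:
        for q in range(m):
            circ.H(q)                      # Hadamard on every qubit
        for q, angle in enumerate(layer):
            circ.CRz(angle, q, q + 1)      # controlled Rz between adjacent qubits
    return circ

# A transitive verb with qn = 1, qs = 1 is a state on three qubits; one IQP layer:
verb_circuit = iqp_word_state(3, [[0.2, 0.7]])
print(verb_circuit.get_commands())
```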

For the RP task, the relative pronoun ‘that’ is also included in the vocabulary. While at the pregroup level its type depends on whether it is the subject or object case, at the quantum level (noting that qs = 0 for this task) only one kind of quantum state is required. Following the use of Kronecker tensors in (Sadrzadeh et al., 2013, 2014) to model functional words like relative pronouns, we chose for ‘that’ a GHZ-state, which is displayed in Fig. 6c and which, notably, does not involve any parameters.

14 Instantaneous Quantum Polynomial.

15 At the time the experiments were run.

         MC                      RP
(qs, pn, d)    k        (qs, pn, d)    k
(1, 1, 1)     22        (0, 1, 1)    114
(1, 1, 2)     35        (0, 1, 2)    168
(1, 3, 1)     40        (0, 3, 1)    234
(1, 3, 2)     53        (0, 3, 2)    288

Table 2: Overview of ansatze studied for the two tasks.

Hence, with the laid out approach, the choices that fix an ansatz (with qn = 1 fixed) can be summarised by a triple of hyperparameters (qs, pn, d). The total number k of parameters, denoted Θ = (θ1, ..., θk), varies correspondingly and depends on the vocabulary. See Table 2 for the ansatze we studied. Note that Fig. 5 shows the circuit of the example sentence from the MC task precisely for ansatz (1, 1, 1).

7.2 Model prediction and optimisation

After Step 4, every sentence or phrase P (of type s for MC and n for RP) is represented by a quantum circuit according to the chosen ansatz. Let the corresponding output quantum state16 be denoted |P(Θ)〉 and define

l^i_Θ(P) := | |〈i|P(Θ)〉|² − ε |,

where i ∈ {0, 1} and ε is a small positive number, in our case set to ε = 10^−9, which ensures that l^i_Θ(P) ∈ (0, 1) and l^0_Θ(P) + l^1_Θ(P) ≤ 1, so that

l_Θ(P) := (l^0_Θ(P), l^1_Θ(P)) / (Σ_i l^i_Θ(P))

defines a probability distribution. The label for P as predicted by the model is then obtained from rounding, i.e. defined to be L_Θ(P) := ⌈l_Θ(P)⌉ with [0, 1] ([1, 0]) corresponding to ‘food’ (‘IT’) for the MC task and to ‘subject case’ (‘object case’) for the RP task.

16 Generally, a sub-normalised state (in physics jargon).
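A compact NumPy sketch of this prediction rule, with a made-up two-component amplitude vector standing in for the circuit’s output state:

```python
import numpy as np

EPS = 1e-9

def predict(output_state):
    amps = np.asarray(output_state, dtype=complex)        # (<0|P(Theta)>, <1|P(Theta)>)
    l = np.abs(np.abs(amps) ** 2 - EPS)                   # l^i = | |<i|P(Theta)>|^2 - eps |
    l = l / l.sum()                                       # normalise to a distribution
    label = int(np.round(l[1]))                           # 0 stands for [1, 0], 1 for [0, 1]
    return l, label

print(predict([0.2 + 0.1j, 0.7 - 0.2j]))
```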

The MC dataset was partitioned randomly into subsets T (training), D (development) and P (testing) with cardinalities |T| = 70, |D| = 30, |P| = 30. Similarly, for the RP task with |T| = 74, |P| = 31, but no development set D, since the ratio of the sizes of vocabulary and dataset did not allow for yet fewer training data, while the overall dataset of 105 phrases could not be easily changed17. For both tasks, this was done such that all respective subsets are perfectly balanced with respect to the two classes.

17 In contrast to the MC task, here the data was extracted from an existing dataset, and picking further phrases while ensuring a minimum frequency of all words was non-trivial (see Sec. 6).

The objective function used for the training is standard cross-entropy; that is, letting L(P) denote the actual label according to the data, the cost is C(Θ) := −Σ_{P∈T} L(P)^T · log(l_Θ(P)). For the minimisation of C(Θ), the SPSA algorithm (Spall, 1998) is used, which for an approximation of the gradient uses two evaluations of the cost function (in a random direction in parameter space and its opposite). The reason for this choice is that in a variational quantum circuit context like here, proper back-propagation requires some form of ‘circuit differentiation’ that would in turn have to be evaluated on a quantum computer – something being actively developed but still unfeasible from a practical perspective. The SPSA approach provides a less effective but acceptable choice for the purposes of these experiments. Finally, no regularisation was used in any form.
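For concreteness, here is a minimal SPSA sketch in NumPy: the gradient estimate at each iteration comes from two cost evaluations at a random ±1 perturbation of the parameters and at its opposite. The gain schedules and step sizes are illustrative defaults, not the settings used for the experiments reported here.

```python
import numpy as np

def spsa_minimise(cost, theta0, n_iter=100, a=0.1, c=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for k in range(1, n_iter + 1):
        a_k = a / k ** 0.602                                 # standard SPSA gain schedules
        c_k = c / k ** 0.101
        delta = rng.choice([-1.0, 1.0], size=theta.shape)    # random +/-1 perturbation direction
        g_hat = (cost(theta + c_k * delta) - cost(theta - c_k * delta)) / (2 * c_k) * (1 / delta)
        theta -= a_k * g_hat                                 # gradient-descent-like update
    return theta

# Toy usage on a smooth stand-in cost (a real run would evaluate circuits instead):
print(spsa_minimise(lambda t: np.sum((t - 1.0) ** 2), np.zeros(3), n_iter=200))
```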

7.3 Classical simulation

Owing to the fact that computation with NISQ devices is slow, noisy and limited at the time of writing, it is not practical to do extensive training and comparative analyses on them. This was instead done by using classical calculations to replace Steps 5 and 6 of Fig. 2 in the following sense. For any choice of parameters Θ and some sentence or phrase P, the complex vector |P(Θ)〉 can be calculated by simple linear algebra – basically through tensor contraction – and hence the values l_Θ(P), and thereby also the cost C(Θ) as well as the respective types of errors, can be obtained through a ‘classical simulation’ of the pipeline.

Figs. 7a and 7b present the convergence of the models for the MC and RP task, respectively, on the training datasets, for the selected sets of ansatze from Table 2. Shown is the cost over SPSA iterations, where each line is from averaging over 20 runs of the optimisation with a random initial parameter point Θ. The reason for this averaging is that there are considerable variances and fluctuations between any individual run due to the crude approximation of the gradient used in the stochastic SPSA algorithm and the specificities of the cost-parameter landscape. As is clear from the plots, the training converges well in all cases. What is more, the dependence of the minima that the average cost converges to on the chosen ansatz reflects the theoretical understanding as follows.

[Figure 7: Convergence of the models in the classical simulation (averaged over 20 runs) for different ansatze; in (a) for the MC task and in (b) for the RP task.]

For the MC task, the more parameters, the lower the minimum. For the RP task – being about syntactic structure, essentially about word order – the situation is different but in a way that again is understandable. Given our treatment of ‘that’ (cf. Sec. 7.2), it is not hard to see that the task comes down to learning embeddings such that the verbs’ states become sensitive to which of their two wires connects to the first and the second noun in the phrase, respectively. Hence, the larger d (which fixes the number of parameters for verbs), the lower the minimum.

[Figure 8: Classical simulation results for the cost and errors (again averaged over 20 runs); in (a) for the MC task and chosen ansatz (1, 3, 1) and in (b) for the RP task and chosen ansatz (0, 1, 2).]

The plots in Figs. 7a and 7b showcase what is expected from a quantum device if it were noise-free, and if many iterations and runs were feasible time-wise. On that basis, we chose one ansatz per task, that does well with as few parameters as possible, for the actual implementation on quantum hardware: (1, 3, 1) for the MC task and (0, 1, 2) for the RP task. Figs. 8a and 8b show simulation results for the correspondingly chosen ansatze together with various errors, but for fewer iterations than in Figs. 7a and 7b for better visibility.

After 500 iterations in the MC case, the train and test errors read 16.9% and 20.2%, respectively; in the RP case the train and test errors are 9.4% and 27.7%, respectively. Noticeably, the test error for the latter task is somewhat higher than the test error for the former task, as expected from the discussion in Sec. 6. The large vocabulary in combination with the small size of the dataset is one of the most important reasons for this; for example, analysing the data in the aftermath revealed that many of the 115 words in the vocabulary appear only in P, but not at all in T.18

18 More precisely, 17% (36%) of the vocabulary appear zero times (once) in T.

[Figure 9: Results from quantum computation for cost and train and test errors (test error for every 10th iteration); in (a) for the MC task and chosen ansatz (1, 3, 1) and in (b) for the RP task and chosen ansatz (0, 1, 2).]

7.4 Quantum runs

Finally, for the actual experiments on quantum hardware the only missing details are how the definitions from Sec. 7.2 concerning predicted labels, the cost and errors relate to Steps 5 and 6 in Fig. 2. For both experiments, all circuits (compiled with t|ket〉TM) were run on IBM’s machine ibmq_bogota. This is a superconducting quantum computing device with 5 qubits and quantum volume 32.19

Now, every time the value of the cost or the errors are to be calculated, the compiled circuits corresponding to all sentences or phrases in the corresponding dataset (T, D or P) are sent as a single job to IBM’s device. There, each circuit is run 2^13 times (the maximum number possible from the machine side). The returned data thus comprises for each circuit (involving q qubits) 2^13 × q measurement outcomes (0 or 1). For every sentence or phrase P, after appropriately post-selecting the data20 (see Secs. 4 and 5), the relative frequencies of outcomes 0 and 1 of the qubit that carries the output state of P give the estimate of |〈i|P(Θ)〉|² (with i = 0, 1) and thus of l_Θ(P). The remaining post-processing to calculate the cost or an error is then as for the classical simulation.

The experiments involved one single run of minimising the cost over 100 iterations for the MC task and 130 iterations for the RP task, in each case with an initial parameter point that was chosen on the basis of simulated runs on the train (and dev) datasets.21 For the MC task, obtaining all the results shown in Fig. 9a took just under 12 hours of run time. This was enabled by having exclusive access to ibmq_bogota for this period of time. In contrast, the RP jobs were run in IBMQ’s ‘fairshare’ mode, i.e. with a queuing system in place that ensures fair access to the machine for different researchers. As a consequence, for the RP task, which is not computationally more involved than the MC task, obtaining all the results shown in Fig. 9b took around 72 hours. With access to quantum devices still being a limited resource and ‘exclusive access’ being rationed, we see here the reason for the problems that the time cost of yet larger datasets would entail.

19 Quantum volume is a metric which allows quantum computers of different architectures to be compared in terms of overall performance, and it quantifies the largest random circuit of equal width and depth that a quantum computer can successfully implement.

20 Note that the MC dataset includes sentences that lead to circuits on only three qubits. Here ‘appropriately post-selecting’ means post-selecting two out of the three used qubits.

Figures 9a and 9b show the cost and various errors for the MC task with ansatz (1, 3, 1) and for the RP task with ansatz (0, 1, 2), respectively. Despite the noise levels that come with NISQ-era quantum computers, and given the fact of a single run with few iterations compared to the classical simulations in Sec. 7.3, the results look remarkably good – the cost is decreasing with SPSA iterations in both cases, modulo the expected fluctuations. After 100 (130) iterations as reported in Fig. 9a (Fig. 9b), the test error was 16.7% (32.3%) for the MC and RP task, respectively, with F-score 0.85 (0.75). These results were checked to be statistically significant against random guessing with p ≤ 0.001 for MC and p ≤ 0.10 for RP according to a permutation test.

21 This choice was made to reduce the chances of being particularly unlucky with the one run that we did on actual quantum hardware. Yet, this choice’s significance should not be overrated given that the influence of quantum noise spoils the predictability of the cost at a particular parameter point from simulated data.
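The following Python sketch illustrates the kind of permutation test referred to here: the model’s accuracy is compared against accuracies obtained when the true labels are randomly permuted. The predictions and labels below are invented placeholders, and the exact test statistic used for the reported p-values may differ.

```python
import numpy as np

def permutation_test(predictions, labels, n_perm=10000, seed=0):
    rng = np.random.default_rng(seed)
    observed = np.mean(predictions == labels)
    null = [np.mean(predictions == rng.permutation(labels)) for _ in range(n_perm)]
    # p-value: fraction of label permutations doing at least as well as the real labels
    return np.mean(np.array(null) >= observed)

preds = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])
labels = np.array([0, 1, 1, 0, 1, 1, 0, 1, 0, 0])
print(permutation_test(preds, labels))
```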

Compared to the simulations (Figs. 8a, 8b), it can be seen that after the same number of iterations, test errors are actually lower for the quantum runs. However, due to the special conditions under which these experiments were performed (a single run on quantum hardware subject to quantum noise versus many averaged runs on classical hardware without noise, but with the inherent instability of SPSA optimization still present), such comparisons are not very conclusive. In general, the trends presented in the plots of Figure 9 are as expected given the size of the datasets. For example, the test error in Fig. 9b shows a paradigmatic example of overfitting around iteration 60.

8 Future work and conclusions

In this work we have provided a detailed exposition of two experiments of NLP tasks implemented on small noisy quantum computers. Our main goal was to present novel larger-scale experiments building on prior proof-of-concept work (Meichanetzidis et al., 2020), while having in mind the NLP practitioner. Despite the prototypical nature of the currently available, albeit rapidly growing in size and quality, quantum processors, we obtain meaningful results on medium-scale datasets. We report that our experiments were successful and converged smoothly, and we conclude that the DISCOCAT framework we have employed is a natural choice for QNLP implementations. We also hope that the current exposition will serve as a useful introduction to practical QNLP for the NLP community.

Having established a QNLP framework for near-term quantum hardware, we briefly outline directions for future work. The ansatz circuits we have used to parameterise the word meanings served well for this work’s goal and also are motivated in the QML literature by the fact that they are conjectured to be hard to simulate classically. However, it was beyond the scope of this work to search for optimal word circuits in a task-specific way. This opens up an exploratory arena for future work on ansatze. In particular, an open question regards trade-offs of performance of ansatz families in a specific task versus general performance on many tasks.

Furthermore, a crucial direction for further work regards scalability. There is more than one way that one can think of scaling up NLP tasks. What is special to QNLP is how scaling up in different dimensions manifests itself as a resource cost in the context of quantum computation given the modest quantum devices available today. First of all, we can consider the cost as the sentences get longer. As a sentence scales in length, the number of qubits on which its corresponding quantum circuit is defined, i.e. the circuit width, will scale as well, depending on the number of qubits assigned to each pregroup type. This consideration is remedied by the realisation that quantum computers have been growing in qubit numbers and there is no sign of this growth slowing down. More importantly, however, a longer sentence will incur an exponential time-cost in the number of qubits being post-selected. Note that in the long term, one does not aim to post-select, but employ more sophisticated protocols where only one qubit needs to be measured, resulting in additive approximations of an amplitude encoding a tensor contraction (Arad and Landau, 2010). Of course, in natural language, sentences are usually upper bounded in length and so we can consider these as up-front constant costs.

We can consider two additional ways of scaling up our experiments: the number of sentences, and the size of the vocabulary. A greater number of sentences results in a multiplicative prefactor on the time-cost, which depends only on the time needed to get statistics from the quantum computer for a representative sentence-circuit. As in classical approaches to NLP involving large-scale tasks and big data, in theory we can parallelise the independent evaluations of quantum circuits on multiple quantum processors, repeat jobs in batches on one quantum computer, or run jobs in parallel on many quantum computers to gather more samples and increase the prediction accuracy. However, the current state of available quantum computers does not yet allow for experiments of such magnitude; the runtime of a single circuit is substantial, there are limits to the number of shots and number of circuits submitted in each job, and high-fidelity quantum processors are limited in number, which can lead to prohibitive queuing times. Since the quality of quantum hardware is constantly improving, though, such techniques will eventually become valid possibilities.

A larger vocabulary would mean a higher-dimensional parameter space. This motivates the careful study of the landscapes defined by a task’s cost function, as well as the exploration of other optimisation methods beyond SPSA. Interestingly, this opens up the obvious discussion on a potential quantum advantage in NLP. The type of quantum advantage one hopes to gain over any classical algorithm varies with the problem and would depend on the task at hand. In the NISQ era, a reasonable direction for attempting to establish a quantum advantage is the expressibility of quantum models. There is therefore a need to place both classical and quantum models on equal footing, so that a fair comparison can be made. This can be achieved for example by adopting tools from information geometry (Abbas et al., 2020), rather than just using the naive approach of simply counting variational parameters.

Finally, we remark on the apparent linearity of our model. Indeed, quantum theory is a linear theory, as unitary evolution of pure states makes apparent. However, the subtlety lies in how we chose to embed the input data, in this case the word meanings, and how the cost function is defined in terms of them. The word embeddings are defined as pure quantum states and the cost function is given in terms of outcome probabilities determined by these pure quantum states. Importantly, the Born rule, which gives the probabilities, is a non-linear function of the amplitudes. More generally, quantum machine learning with variational circuits can be viewed elegantly in terms of kernel methods (Schuld and Killoran, 2019; Schuld, 2021). In this light, it becomes clear that the mapping from the parameters defining the input data to the cost is non-linear. Relating to the aforementioned potential quantum advantage in the form of expressibility of quantum models, a possible avenue for obtaining a quantum advantage arises when a QNLP task is designed so that the evaluation of the cost function (or kernel) is hard to simulate classically. These types of quantum advantage in the field of NLP would be meaningful in that they would be examples of non-contrived real-world applications of quantum computers in the near-term.

Acknowledgments

We are grateful to Richie Yeung for his help on technical issues, and also, along with Alexis Toumi, for DisCoPy support. We also thank Marcello Benedetti for helpful discussions. We would furthermore like to thank CQC's t|ket⟩™ team for support with pytket. We acknowledge the use of IBM Quantum services for this work. The views expressed are those of the authors, and do not reflect the official policy or position of IBM or the IBM Quantum team.

References

Amira Abbas, David Sutter, Christa Zoufal, Aurelien Lucchi, Alessio Figalli, and Stefan Woerner. 2020. The Power of Quantum Neural Networks.

S. Abramsky and B. Coecke. 2004. A Categorical Semantics of Quantum Protocols. In Proceedings of the 19th Annual IEEE Symposium on Logic in Computer Science, pages 415–425. IEEE Computer Science Press. arXiv:quant-ph/0402130.

Itai Arad and Zeph Landau. 2010. Quantum Computation and the Evaluation of Tensor Networks. SIAM Journal on Computing, 39(7):3089–3121.

Dea Bankova, Bob Coecke, Martha Lewis, and Dan Marsden. 2019. Graded Entailment for Compositional Distributional Semantics. Journal of Language Modelling, 6(2):225–260.

Ivano Basile and Fabio Tamburini. 2017. Towards Quantum Language Models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1840–1849, Copenhagen, Denmark. Association for Computational Linguistics.

Johannes Bausch, Sathyawageeswar Subramanian, and Stephen Piddock. 2020. A Quantum Search Decoder for Natural Language Processing.

Kerstin Beer, Dmytro Bondarenko, Terry Farrelly, Tobias J. Osborne, Robert Salzmann, Daniel Scheiermann, and Ramona Wolf. 2020. Training Deep Quantum Neural Networks. Nature Communications, 11.

Marcello Benedetti, Erika Lloyd, Stefan Sack, and Mattia Fiorentini. 2019. Parameterized Quantum Circuits as Machine Learning Models. Quantum Science and Technology, 4(4):043001.

Kishor Bharti, Alba Cervera-Lierta, Thi Ha Kyaw, Tobias Haug, Sumner Alperin-Lea, Abhinav Anand, Matthias Degroote, Hermanni Heimonen, Jakob S. Kottmann, Tim Menke, Wai-Keong Mok, Sukin Sim, Leong-Chuan Kwek, and Alan Aspuru-Guzik. 2021. Noisy Intermediate-Scale Quantum (NISQ) Algorithms.

Benjamin J. Brown, Daniel Loss, Jiannis K. Pachos, Chris N. Self, and James R. Wootton. 2016. Quantum Memories at Finite Temperature. Reviews of Modern Physics, 88(4).

Y. Cao, J. Romero, and A. Aspuru-Guzik. 2018. Potential of Quantum Computing for Drug Discovery. IBM Journal of Research and Development, 62(6):6:1–6:20.

Yudong Cao, Jonathan Romero, Jonathan P. Olson, Matthias Degroote, Peter D. Johnson, Maria Kieferova, Ian D. Kivlichan, Tim Menke, Borja Peropadre, Nicolas P. D. Sawaya, Sukin Sim, Libor Veis, and Alan Aspuru-Guzik. 2019. Quantum Chemistry in the Age of Quantum Computing. Chemical Reviews, 119(19):10856–10915.

Yiwei Chen, Yu Pan, and Daoyi Dong. 2020. Quantum Language Model with Entanglement Embedding for Question Answering.

Bob Coecke, Giovanni de Felice, Konstantinos Meichanetzidis, and Alexis Toumi. 2020. Foundations for Near-Term Quantum Natural Language Processing.

Bob Coecke and Aleks Kissinger. 2017. Picturing Quantum Processes: A First Course in Quantum Theory and Diagrammatic Reasoning. Cambridge University Press.

Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. 2010. Mathematical Foundations for a Compositional Distributional Model of Meaning. Linguistic Analysis, 36:345–384.

Giovanni de Felice, Alexis Toumi, and Bob Coecke. 2020. DisCoPy: Monoidal Categories in Python. In Proceedings of the 3rd Annual International Applied Category Theory Conference. EPTCS.

Angel J. Gallego and Roman Orus. 2019. Language Design as Information Renormalization.

Edward Grefenstette and Mehrnoosh Sadrzadeh. 2011. Experimental Support for a Categorical Compositional Distributional Model of Meaning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1394–1404. Association for Computational Linguistics.

Sanjay Gupta and R.K.P. Zia. 2001. Quantum Neural Networks. Journal of Computer and System Sciences, 63(3):355–383.

Vojtěch Havlíček, Antonio D. Córcoles, Kristan Temme, Aram W. Harrow, Abhinav Kandala, Jerry M. Chow, and Jay M. Gambetta. 2019. Supervised Learning with Quantum-Enhanced Feature Spaces. Nature, 567(7747):209–212.

Dimitri Kartsaklis and Mehrnoosh Sadrzadeh. 2014. A Study of Entanglement in a Categorical Framework of Natural Language. In B. Coecke, I. Hasuo, and P. Panangaden, editors, Quantum Physics and Logic 2014 (QPL 2014). EPTCS 172, pages 249–261.

Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, and Stephen Pulman. 2012. A Unified Sentence Space for Categorical Distributional-Compositional Semantics: Theory and Experiments. In COLING 2012, 24th International Conference on Computational Linguistics, Proceedings of the Conference: Posters, 8-15 December 2012, Mumbai, India, pages 549–558.

Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, Stephen Pulman, and Bob Coecke. 2016. Reasoning about Meaning in Natural Language with Compact Closed Categories and Frobenius Algebras, Lecture Notes in Logic, pages 199–222. Cambridge University Press.

J. Lambek. 2008. From Word to Sentence. Polimetrica, Milan.

Martha Lewis. 2019. Modelling Hyponymy for DisCoCat. In Proceedings of the Applied Category Theory Conference, Oxford, UK.

Konstantinos Meichanetzidis, Alexis Toumi, Giovanni de Felice, and Bob Coecke. 2020. Grammar-Aware Question-Answering on Quantum Computers.

Michael A. Nielsen and Isaac L. Chuang. 2011. Quantum Computation and Quantum Information: 10th Anniversary Edition, 10th edition. Cambridge University Press, New York, NY, USA.

Lee J. O'Riordan, Myles Doyle, Fabio Baruffa, and Venkatesh Kannan. 2020. A Hybrid Classical-Quantum Workflow for Natural Language Processing. Machine Learning: Science and Technology, 2(1):015011.

Robin Piedeleu, Dimitri Kartsaklis, Bob Coecke, and Mehrnoosh Sadrzadeh. 2015. Open System Categorical Quantum Semantics in Natural Language Processing. In Proceedings of the 6th Conference on Algebra and Coalgebra in Computer Science, Nijmegen, Netherlands.

S. Pirandola, U. L. Andersen, L. Banchi, M. Berta, D. Bunandar, R. Colbeck, D. Englund, T. Gehring, C. Lupo, C. Ottaviani, et al. 2020. Advances in Quantum Cryptography. Advances in Optics and Photonics, 12(4):1012.

H. Ramesh and V. Vinay. 2003. String Matching in Õ(√n + √m) Quantum Time. Journal of Discrete Algorithms, 1(1):103–110. Combinatorial Algorithms.

Laura Rimell, Jean Maillard, Tamara Polajnar, and Stephen Clark. 2016. RELPRON: A Relative Clause Evaluation Data Set for Compositional Distributional Semantics. Computational Linguistics, 42(4):661–701.

Mehrnoosh Sadrzadeh, Stephen Clark, and Bob Coecke. 2013. The Frobenius Anatomy of Word Meanings I: Subject and Object Relative Pronouns. Journal of Logic and Computation, 23(6):1293–1317.

Mehrnoosh Sadrzadeh, Stephen Clark, and Bob Coecke. 2014. The Frobenius Anatomy of Word Meanings II: Possessive Relative Pronouns. Journal of Logic and Computation, 26(2):785–815.

Mehrnoosh Sadrzadeh, Dimitri Kartsaklis, and Esma Balkır. 2018. Sentence Entailment in Compositional Distributional Semantics. Annals of Mathematics and Artificial Intelligence, 82:189–218.

Maria Schuld. 2021. Quantum Machine Learning Models are Kernel Methods.

Maria Schuld and Nathan Killoran. 2019. Quantum Machine Learning in Feature Hilbert Spaces. Physical Review Letters, 122(4):040504.

Seyon Sivarajah, Silas Dilkes, Alexander Cowtan, Will Simmons, Alec Edgington, and Ross Duncan. 2020. t|ket⟩: a Retargetable Compiler for NISQ Devices. Quantum Science and Technology, 6(1):014003.

J. C. Spall. 1998. Implementation of the Simultaneous Perturbation Algorithm for Stochastic Optimization. IEEE Transactions on Aerospace and Electronic Systems, 34(3):817–823.

Nathan Wiebe, Alex Bocharov, Paul Smolensky, Matthias Troyer, and Krysta M Svore. 2019. Quantum Language Processing.

William Zeng and Bob Coecke. 2016. Quantum Algorithms for Compositional Natural Language Processing. Electronic Proceedings in Theoretical Computer Science, 221:67–75.