Chapter 11. The Origin of Communicative System:
Communicative Efficiency
Min Su Lee
The Computational Nature of Language Learning and Evolution
Contents
11.1 Communicative Efficiency of Language
  1. Communicability in animal, human and machine communication
11.2 Communicability for Linguistic Systems
  1. Basic notions
  2. Probability of events and a communicability function
11.3 Reaching the Highest Communicability
  1. A special case of finite languages
  2. Generalizations
(C) 2009, SNU Biointelligence Lab, http://bi.snu.ac.kr/
11.4 Implications for Learning
  1. Estimating P
  2. Estimating Q
  3. Sample Complexity Bounds
11.5 Communicative Efficiency and Linguistic Structure
  1. Phonemic Contrasts and Lexical Structure
  2. Functional Load and Communicative Efficiency
  3. Perceptual Confusability and Functional Load
Introduction
In this part, we turn our attention to:
- The genesis of human language from prelinguistic precursors
  - How and why did the recursive communication system of human language arise in biological populations?
- Communicative efficiency
  - Important in the evolution of competing linguistic groups, where the different groups had different communicative abilities
  - Differential fitness and natural selection: if communicative efficiency provides biological fitness to individuals, in terms of an increased ability to reproduce and survive, would populations converge to coherent linguistic states?
- Coherence
  - A coherent population is a linguistically homogeneous population
In this part, the book studies:
- The interplay between communicative efficiency, learning fitness, and coherence

In this chapter, we:
- Develop the notions of communicative efficiency and fitness
- Characterize language as a probabilistic association between form and meaning
- Provide a natural definition of communicative efficiency between two linguistic agents possessing different languages
- Perform an empirical study on large linguistic corpora, finding that the structure of the lexicon of several modern languages does not reflect optimality in terms of communicability
1. Communicative Efficiency of Language
Mutual intelligibility (communicative efficiency):
- Quantifies the rate of success in information transfer between two linguistic agents
- Increasing the intelligibility F(L1, L2) between two languages L1 and L2 raises several questions:
  - Given a language L, what language L' maximizes the mutual intelligibility F(L, L') for two-way communication about the shared world?
  - What acquisition mechanisms/learning algorithms can serve the task of improving intelligibility?
  - What are the consequences of individual language-acquisition behavior for the population dynamics and the communicative efficiency of an interacting population of linguistic agents?
Communicability:
- Language may be viewed as an association matrix A that links referents to signals
  - With M referents and N signals, A is an N×M matrix
  - aij: the relative strength of the association between signal i and meaning j
- The matrix A characterizes the behavior of the linguistic agent in two modes:
  - Production mode: produce any of the signals corresponding to a particular meaning, in proportion to the strength of the association
  - Comprehension mode: interpret a particular signal as any of the meanings, in proportion to the association strength
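The two modes can be sketched in a few lines of Python; the matrix values below are illustrative, not taken from the chapter:

```python
import numpy as np

# Toy association matrix A: rows = signals (N=3), columns = meanings (M=2).
# A[i, j] is the relative strength of association between signal i and meaning j.
A = np.array([[4.0, 0.0],
              [1.0, 3.0],
              [0.0, 2.0]])

rng = np.random.default_rng(0)

def produce(A, j, rng):
    """Production mode: emit a signal for meaning j in proportion
    to the association strengths in column j."""
    col = A[:, j]
    return rng.choice(len(col), p=col / col.sum())

def comprehend(A, i, rng):
    """Comprehension mode: interpret signal i as a meaning in proportion
    to the association strengths in row i."""
    row = A[i, :]
    return rng.choice(len(row), p=row / row.sum())

# A speaker conveying meaning 0 mostly uses signal 0 (strength 4 vs. 1).
signals = [produce(A, 0, rng) for _ in range(1000)]
```

Normalizing a column of A gives the production distribution for that meaning; normalizing a row gives the comprehension distribution for that signal.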
Communicability in animal communication:
- Finite lexical association matrix (animal signals and their specific meanings)

Communicability in human communication:
- Infinite lexical association matrices
- Human grammars mediate a complex mapping between form and meaning
- The sets of possible sentences and meanings are infinite (the infinite expressibility of human grammars)

Communicability in machine communication:
- AI: linguistic agents interact with each other in simulated worlds; study whether coherent communication ultimately emerges
- Natural language understanding systems: develop a computer system that is able to communicate with a human; the underlying probability model is learned from data
2. Communicability for Linguistic Systems
Basic notions:
- S: the set of all possible linguistic forms (signals), {s1, s2, ...}
- M: the set of all possible semantic objects (meanings), {m1, m2, ...}
- Define a language to be a probability measure μ over S×M
  - Encoding matrix P (production mode): Pij is the probability of producing the signal si given that one wishes to convey the meaning mj
  - Decoding matrix Q (comprehension mode): Qij is the probability of the same user interpreting the expression si to mean mj
Probability of events and a communicability function:
- Given two communication systems (languages μ1 and μ2)
- The probability that an event occurs whose meaning is successfully communicated:
  - From μ1 to μ2: tr(P(1)Λ(Q(2))T)
  - From μ2 to μ1: tr(P(2)Λ(Q(1))T)
- Define the communicability function of μ1 and μ2 (mutual intelligibility, communicative efficiency):

  F(μ1, μ2) = (1/2) [ tr(P(1)Λ(Q(2))T) + tr(P(2)Λ(Q(1))T) ],

  where Λ is a diagonal matrix s.t. Λii = σ(mi), tr(A) denotes the trace of matrix A, and P(i), Q(i) refer to the coding and decoding matrices associated with μi
- Note that tr(P(1)Λ(Q(2))T) is simply the probability that an event occurs and is successfully communicated from a user of μ1 to a user of μ2
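The trace formula lends itself to a direct implementation; a minimal sketch (the matrices and σ below are illustrative):

```python
import numpy as np

# Communicability F(mu1, mu2) for finite languages, following the trace
# formula above. Matrices are N x M:
# P[i, j] = prob. of producing signal i for meaning j (columns sum to 1),
# Q[i, j] = prob. of interpreting signal i as meaning j (rows sum to 1).
def communicability(P1, Q1, P2, Q2, sigma):
    Lam = np.diag(sigma)  # Lambda_ii = sigma(m_i): prob. that event m_i occurs
    one_to_two = np.trace(P1 @ Lam @ Q2.T)  # success from user of mu1 to user of mu2
    two_to_one = np.trace(P2 @ Lam @ Q1.T)  # success from user of mu2 to user of mu1
    return 0.5 * (one_to_two + two_to_one)

# Two identical, perfectly unambiguous 2-signal / 2-meaning languages
# communicate every event successfully, so F = 1.
P = np.eye(2)
Q = np.eye(2)
sigma = np.array([0.5, 0.5])
F = communicability(P, Q, P, Q, sigma)  # 1.0
```

Swapping the second user's decoder rows (so every signal is misinterpreted) drops the μ1→μ2 term to zero, halving F.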
3. Reaching the Highest Communicability
Given a language μ0:
- For any language μ, we have (where σi = σ(mi))

  F(μ0, μ) ≤ (1/2) [ Σi maxj ( σj μ0(si|mj) ) + Σj σj maxi μ0(mj|si) ]

- Define the best response as a language μ* s.t. F(μ0, μ*) = supμ F(μ0, μ)
  - This is the maximum possible mutual intelligibility between a user of μ0 and a user of any allowable language
- Goal: construct a family of languages με (ε > 0) s.t. F(μ0, με) can be made arbitrarily close to supμ F(μ0, μ)
A special case of finite languages:
- Three simplifying assumptions:
  1. The languages are finite, and the matrices have size N×M
  2. The distribution σ is uniform, i.e. σi = 1/M for all i
  3. The measure μ0 satisfies the property of unique maxima: for each i there exist a unique p0(i) and a unique r0(i) s.t. exactly one element of each column of μ0(s|m) (row of μ0(m|s)) is strictly the biggest element in that column (row)
A special case of finite languages (cont.) — maximizing communicative efficiency:
- Find a matrix Q* s.t.

  Q* = argmaxQ tr(P(0)ΛQT),

  where we maximize over all matrices Q whose elements are non-negative and sum up to one within each row
  - The best decoder Q* places, in each row, all its mass at the position of the largest element of the corresponding row of P(0)
- Find a matrix P* s.t.

  P* = argmaxP tr(PΛ(Q(0))T),

  where we maximize over all matrices P whose elements are non-negative and sum up to one within each column
  - The best encoder P* places, in each column, all its mass at the position of the largest element of the corresponding column of Q(0)
- If a μ* existed s.t. μ*(s|m) = P* and μ*(m|s) = Q*, then this μ* would be the best response
- It turns out that, in general, μ* does not exist
- However, there always exists a measure that approximates the performance of P* and Q* arbitrarily well
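Assuming a uniform σ, the two maximizations reduce to row-wise and column-wise argmax selections; a sketch with illustrative toy matrices:

```python
import numpy as np

# Best decoder: in each row i, put all mass on the largest element of the
# corresponding row of P0. Best encoder: in each column j, put all mass on
# the largest element of the corresponding column of Q0. (Uniform sigma.)
def best_decoder(P0):
    Q = np.zeros_like(P0)
    Q[np.arange(P0.shape[0]), P0.argmax(axis=1)] = 1.0  # one "1" per row
    return Q

def best_encoder(Q0):
    P = np.zeros_like(Q0)
    P[Q0.argmax(axis=0), np.arange(Q0.shape[1])] = 1.0  # one "1" per column
    return P

# Toy matrices with the unique-maxima property assumed above:
P0 = np.array([[0.7, 0.2],
               [0.3, 0.8]])   # columns sum to 1
Q0 = np.array([[0.9, 0.1],
               [0.4, 0.6]])   # rows sum to 1
Q_star = best_decoder(P0)     # identity: each signal decoded as its likeliest meaning
P_star = best_encoder(Q0)     # identity: each meaning encoded by its likeliest signal
```

The unique-maxima assumption guarantees each argmax is well defined, so Q* has exactly one nonzero element per row and P* exactly one per column.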
Theorem 11.1 (Komarova and Niyogi 2004):
- For any finite language μ0 satisfying the property of unique maxima, and a uniform probability distribution σ, we have

  lim(ε→0) F(μ0, με) = supμ F(μ0, μ)

- In order to prove Theorem 11.1, we need to show that the family με constructed below approximates the performance of P* and Q* arbitrarily well

The auxiliary matrix and the absence of loops:
- Define an auxiliary matrix X that contains nonzero entries at the slots where either P* or Q* contains a nonzero entry
- Draw lines connecting all the "ones" of the X matrix that belong to the same row, and all the "ones" of the X matrix that belong to the same column
Lemma 11.1:
- Suppose that a finite measure μ0 has the property of unique maxima. Then graphs constructed as described above do not contain any closed loops.

Proof of Lemma 11.1:
- Assume that there exists a closed loop, and consider its "turning points"
- Suppose there are 2K such vertices x_{αi,βj}, where the pair of integers (αi, βj) gives the coordinates of the vertex (1 ≤ i, j ≤ K)
- Let x_{α1,β1} be connected with x_{α1,β2} by a horizontal line; then x_{α1,β2} is connected with x_{α2,β2} by a vertical line, ..., and x_{αK,β1} is connected with x_{α1,β1} by a vertical line, closing the loop
Proof of Lemma 11.1 (cont.):
- If a vertex corresponds to a "one" of the Q* matrix, then the corresponding slot of the P* matrix is zero, and vice versa
- Suppose that Q*_{α1,β1} = 1 and P*_{α1,β1} = 0. Then:
  - Q*_{α1,β2} = 0 (∵ there can be only one nonzero element in the same row of Q*)
  - P*_{α1,β2} = 1 (∵ the corresponding vertex is present in X)
  - P*_{α2,β2} = 0 (∵ we can only have one positive element in each column of P*)
  - ... and so on around the loop
- Recall that the positive elements of Q* correspond to the largest elements in the corresponding rows of the P0 matrix (and similarly for P*)
- Rewriting these conditions, we show that the resulting system is incompatible
Proof of Lemma 11.1 (cont.):
- The system can be presented as a closed chain of strict inequalities for Q0, which cannot all hold simultaneously
- This contradiction proves that there can be no closed loops in the matrix X
Constructing the matrix με:
- From Lemma 11.1, if we connect all the vertices of the matrix X by horizontal and vertical lines, the resulting graphs will contain no closed loops
- For each of these graphs, perform the following procedure for all pairs of connected vertices:
  - Take a pair of vertices
  - If they are connected by a horizontal (vertical) line, refer to the corresponding entries of the Q* matrix (P* matrix); one of them will be one and the other zero
  - Draw an arrow on the graph from the element corresponding to zero to the element corresponding to one
- Starting from some vertex, replace the corresponding element in X by ε; following the arrows, keep replacing the elements of X by entries of the form ε^k, where the integer k increases or decreases from one vertex to the next depending on the direction of the arrow
- The resulting matrix is Aε; the measure με is obtained from Aε by normalization
Proof of Theorem 11.1, part (b):
- To find the entries of με(s|m), renormalize each column of the matrix με so that its elements sum up to one
- Each column will contain at most one segment of one of the graphs
- By construction, the biggest element of this segment of the graph corresponds to the positive element of Q*
- In the limit ε→0, the other elements become vanishingly small in comparison, and the columns approach the corresponding columns of the P* matrix
- Similarly, the rows of με(m|s) in the limit become the rows of the Q* matrix
- ∴ The family of measures με satisfies the requirements of Theorem 11.1
Generalizations — relaxing the three restrictions on the measures μ:
1. Unique maxima in the rows and columns of μ(s|m) and μ(m|s), respectively
   - If there are multiple maxima, it turns out that loops may exist; Lemma 11.1 can be modified using a neutral vertex of the graph
2. Uniform distribution of the events in the world that need to be communicated
   - If events do not occur with uniform probability, redefine P* and Q* accordingly (weighting the entries by σ(mi) before taking the maxima)
3. Finite cardinality of S and M
   - A measure μ on a countably infinite space can be approximated arbitrarily closely by a measure with finite support; this reduces the infinite case to the finite case
4. Implications for Learning
Language learning:
- An agent tries to learn a language in order to communicate with some other agent whose language is characterized by the measure μ
- The best response μ* itself may not exist
- An arbitrarily close approximation με (for any ε) does exist
- The best the learner can do is estimate με

Two natural learning scenarios (differing in how much information is available to the learner):
- Full information: the learner is able to sample μ directly to get (sentence, meaning) pairs
- Partial information: the learner only hears the sentence while the intended meaning is latent; what the learner reasonably may have access to is whether its interpretation of the sentence was successful or not
Estimating P:
- Q* is derivable from the P matrix of the teacher
- Learning with full information:
  - The learner has access to the (s, m) pair every time the teacher produces a sentence
  - Define the event Aij = "teacher produces si to communicate mj"
  - The probability of event Aij is σj pij
  - If the teacher produces n (s, m) pairs at random in an i.i.d. manner, then the ratio kij/n (where kij is the number of occurrences of Aij) is an empirical estimate of the probability of the event Aij
  - As n→∞, kij/n → σj pij with probability 1; applying Hoeffding's inequality bounds the rate at which this convergence occurs:

    Pr( |kij/n − σj pij| > ε ) ≤ 2 exp(−2nε²)
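A simulation sketch of this estimator (the teacher matrix below is a toy example, not from the text):

```python
import numpy as np

# Learning P with full information: the learner sees n i.i.d. (signal, meaning)
# pairs and estimates Pr(A_ij) = sigma_j * p_ij by the empirical frequency k_ij/n.
rng = np.random.default_rng(1)
N, M, n = 3, 2, 20000
P_teacher = np.array([[0.7, 0.1],
                      [0.2, 0.3],
                      [0.1, 0.6]])   # columns sum to 1
sigma = np.full(M, 1.0 / M)          # uniform distribution over meanings

counts = np.zeros((N, M))
for _ in range(n):
    j = rng.choice(M, p=sigma)               # world picks a meaning
    i = rng.choice(N, p=P_teacher[:, j])     # teacher produces a signal
    counts[i, j] += 1

estimate = counts / n                 # empirical Pr(A_ij)
true_prob = P_teacher * sigma         # sigma_j * p_ij
max_err = np.abs(estimate - true_prob).max()   # shrinks like O(1/sqrt(n))
```

Hoeffding's inequality guarantees each entry's error exceeds ε with probability at most 2·exp(−2nε²), so with n = 20000 the worst-case entry error is small with very high probability.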
Estimating P (cont.) — learning with full information (cont.):
- Assume N possible sentences and M possible meanings
- In total there are NM different events whose probabilities need to be estimated (Aij, i = 1, ..., N, j = 1, ..., M)
- Let Eij be the event that the empirical estimate of Pr(Aij) deviates from its true value by more than ε; by the union bound,

  Pr( ∪ij Eij ) ≤ Σij Pr(Eij) ≤ 2NM exp(−2nε²)

- Hence, with high probability (depending on n), all empirical estimates are simultaneously close to their respective true values σj pij
Estimating P (cont.) — learning with partial information:
- Let the learner guess a meaning uniformly at random
- Define event Aij = "teacher produces si; learner guesses mj; communication is successful"
- The probability of event Aij is (1/M) σj pij
- The learner counts kij, the number of times event Aij has occurred
- The empirical estimate of the probability of Aij is kij/n
- Since M is fixed in advance and known, this allows the learner to estimate pij for each i, j arbitrarily well
- A uniform bound over all NM events follows from Hoeffding's inequality and the union bound, as in the full-information case
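The partial-information setting can be sketched as follows; note the learner never observes the intended meaning, only success or failure, and the teacher matrix is again a toy example:

```python
import numpy as np

# Learning P with partial information: Pr(A_ij) = (1/M) * sigma_j * p_ij,
# so with uniform sigma (sigma_j = 1/M) the learner recovers p_ij as
# M^2 * k_ij / n from success counts alone.
rng = np.random.default_rng(2)
N, M, n = 3, 2, 200000
P_teacher = np.array([[0.7, 0.1],
                      [0.2, 0.3],
                      [0.1, 0.6]])   # columns sum to 1

counts = np.zeros((N, M))
for _ in range(n):
    j = rng.integers(M)                      # meaning drawn under uniform sigma
    i = rng.choice(N, p=P_teacher[:, j])     # teacher produces s_i
    guess = rng.integers(M)                  # learner guesses a meaning uniformly
    if guess == j:                           # success is all the learner observes
        counts[i, guess] += 1

P_hat = M * M * counts / n                   # empirical estimate of p_ij
max_err = np.abs(P_hat - P_teacher).max()
```

The price of partial information is the extra factor of M² in the estimator, which inflates the variance and hence the number of samples needed for a given accuracy.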
Estimating Q:
- P* is derivable from the Q matrix of the teacher
- Learning with full information:
  - The learner picks a sentence uniformly at random (with probability 1/N) and produces it for the teacher to hear
  - Define event Aij = "learner produces si; teacher interprets it as mj"
  - The probability of Aij is (1/N) qij; after n trials, the estimate of qij is N kij/n
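This estimator can be sketched analogously (toy teacher matrix, illustrative values):

```python
import numpy as np

# Estimating Q with full information: the learner produces each sentence
# uniformly at random and observes the teacher's interpretation.
# Pr(A_ij) = (1/N) * q_ij, so q_ij is estimated by N * k_ij / n.
rng = np.random.default_rng(3)
N, M, n = 3, 2, 50000
Q_teacher = np.array([[0.9, 0.1],
                      [0.3, 0.7],
                      [0.2, 0.8]])   # rows sum to 1

counts = np.zeros((N, M))
for _ in range(n):
    i = rng.integers(N)                      # learner picks a sentence uniformly
    j = rng.choice(M, p=Q_teacher[i])        # teacher interprets it as m_j
    counts[i, j] += 1

Q_hat = N * counts / n                       # empirical estimate of q_ij
max_err = np.abs(Q_hat - Q_teacher).max()
```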
Estimating Q (cont.) — learning with partial information:
- The learner picks a (sentence, meaning) pair uniformly at random (with probability 1/NM)
- Define event Aij = "learner produces (si, mj); communication is successful"
- The probability of Aij is (1/NM) qij
Sample complexity bounds:
- Determine the number of learning events that need to occur so that, with high probability, the learner will be able to develop a language with ε-good communicability
- Let the teacher's measure be μ, and assume that μ is s.t. P and Q have unique row-wise and column-wise maxima, respectively
- Margin γ: the smallest gap between the largest and the second-largest element within any row of P or column of Q; the larger the margin, the easier the maxima are to identify from samples
Sample complexity bounds (cont.) — learning with partial information:
- Proof of Theorem 11.2 (sketch):
  - Let there be n/2 interactions where the teacher speaks and the learner listens, and n/2 interactions of the other form
  - From the first kind the learner estimates the entries of P; from the second, the entries of Q
  - Setting ε = γ/4 and using Hoeffding's inequality together with the union bound, all estimated entries of P and Q are within γ/4 of the true values with probability that approaches 1 as n grows
Sample complexity bounds (cont.) — Proof of Theorem 11.2 (cont.):
- The learner chooses Q*:
  - For each i, the learner wants to identify ji*, the position of the largest element in the relevant row
  - The learner chooses the position of the largest empirical estimate
  - Show that this choice is correct: assume it is not; then two entries separated by at least the margin γ would have empirical estimates, each within γ/4 of its true value, in the reversed order, which leads to a contradiction
  - Since this holds for each i, Q* is identified exactly
- The learner chooses P*: by the same argument, P* is also identified exactly
- Ensure that n is large enough that the γ/4-accuracy condition holds with probability greater than 1 − δ; for such n, both P* and Q* are identified exactly with probability greater than 1 − δ
Remarks:
- The number of examples is a function of M, N, and γ
  - The margin γ, which depends on the teacher's language μ, determines how easy it is for the learner to estimate Q* and P*; it characterizes the learning difficulty of μ in this setting
- Infinite matrices are not learnable
  - Infinite-dimensional spaces are known to be unlearnable; further constraints will be required on the space of possible measures to which the teacher's language belongs
- The constants in the bound on sample complexity may be tightened, although the order is essentially correct
Sample complexity bounds (cont.):
- Learning with full information: an analogous and simpler argument yields a corresponding sample complexity bound for the full-information setting
5. Communicative Efficiency and Linguistic Structure

An empirical study of the structure of lexical items suggests that a tight coupling between communicative efficiency and lexical structure may not always be present.
Phonemic contrasts and lexical structure:
- If all the phonemes in the sequence are heard correctly by the hearer, then the word has been successfully transmitted from speaker to hearer, and communicative efficiency is high
- If the hearer cannot distinguish between /p/ and /b/:
  - The hearer cannot tell apart the words pat and bat, pit and bit, ...
  - Information is no longer perfectly transmitted from speaker to hearer
- How much information is lost on the whole?
Phonemic contrasts and lexical structure (cont.):
- p1, ..., pn: the probabilities with which the n words are used on average
- W: the lexicon
- Entropy (information content) of the entire lexicon: H(W) = −Σi pi log pi
- If all words were equally likely, H(W) = log(n)
- H(W) is an average measure of the information transmitted from speaker to hearer by transmitting words of the lexicon
- Reduced lexicon: W({/p/, /b/}) = {c1({/p/, /b/}), ..., ck({/p/, /b/})}, where each class cl collects the words that become indistinguishable when the /p/–/b/ contrast is lost
- ql: the probability with which the hearer will encounter a word that belongs to cl({/p/, /b/}), i.e. the sum of the probabilities of the words in that class
- H(W({/p/, /b/})) = −Σl ql log ql: information content of the reduced lexicon
- Functional load (normalized loss of information):

  FL = ( H(W) − H(W({/p/, /b/})) ) / H(W)

  0 ≤ FL ≤ 1; FL is the fraction of information lost (at the lexical level) by losing the ability to distinguish between /p/ and /b/
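A worked sketch of the functional-load computation on a toy lexicon (the words and probabilities are illustrative, not corpus values):

```python
import math
from collections import Counter

# Toy lexicon with usage probabilities.
lexicon = {"pat": 0.3, "bat": 0.2, "pit": 0.2, "bit": 0.1, "cat": 0.2}

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def merged(word):
    # Orthographic stand-in for losing the /p/-/b/ phonemic contrast:
    # words differing only in p vs. b fall into the same class.
    return word.replace("b", "p")

H_W = entropy(lexicon.values())

# Reduced lexicon: each class c_l collects the now-indistinguishable words;
# its probability q_l is the sum over the class.
classes = Counter()
for word, p in lexicon.items():
    classes[merged(word)] += p
H_reduced = entropy(classes.values())

FL = (H_W - H_reduced) / H_W   # fraction of lexical information lost
```

Here pat/bat merge into one class (probability 0.5) and pit/bit into another (0.3), while cat is unaffected, so roughly a third of the lexical information is lost.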
Functional load and communicative efficiency:
- Listener's guessing strategy (randomized): on hearing a word from a merged class, the listener guesses among the words of that class in proportion to their probabilities
- The probability of successful transmission, and hence the communicative efficiency, can then be computed from the class probabilities; the loss in communicative efficiency is the drop relative to the unmerged lexicon
- Information-theoretic measure of functional load: if all n words are equally likely, H(W) = log(n) and H(W({/p/, /b/})) = log(n) − Σl (nl/n) log(nl), where nl is the number of words in class cl
- Range of FL: 0 ≤ FL ≤ 1
Perceptual confusability and functional load:
- If communicative efficiency played a role in the evolution of linguistic structure, we should observe a correlation between the perceptual difficulty of making a phonetic contrast and the functional load of that contrast
- Empirical experiment:
  - Data: Dutch, English, and Chinese
  - Perceptual confusability between phonemes: psychoacoustic data (acoustic difference); phoneme-confusion matrices from experimental psycholinguistic data
  - Lexical data: corpus-based linguistic data (colloquial pronunciation patterns, frequency of usage, semantic and syntactic information)
- Result: there is no significant correlation between functional load and confusability
Perceptual confusability and functional load (cont.):
- Plot: functional load against perceptual confusability for phonetic distinctions in English
- Other contextual cues help in identifying the word uniquely
- Several possible interpretations:
  - The structure of the lexicon does not display any sign of having been optimized to suit the perceptual limitations of humans
  - Communicative efficiency might play little role in the structure of natural languages
  - A more appropriate quantitative formulation of functional load or communicative efficiency may be needed
  - Internal optimization of linguistic interfaces, rather than external optimization of communicative efficiency, drives change and evolution