Chapter 11. The Origin of Communicative System:
Communicative Efficiency
Min Su Lee
The Computational Nature of Language Learning and Evolution
Contents
11.1 Communicative Efficiency of Language
  1. Communicability in animal, human and machine communication
11.2 Communicability for Linguistic Systems
  1. Basic notions
  2. Probability of events and a communicability function
11.3 Reaching the Highest Communicability
  1. A special case of finite languages
  2. Generalizations
(C) 2009, SNU Biointelligence Lab, http://bi.snu.ac.kr/
11.4 Implications for Learning
  1. Estimating P
  2. Estimating Q
  3. Sample Complexity Bounds
11.5 Communicative Efficiency and Linguistic Structure
  1. Phonemic Contrasts and Lexical Structure
  2. Functional Load and Communicative Efficiency
  3. Perceptual Confusability and Functional Load
Introduction
In this part, we turn our attention to:
- The genesis of human language from prelinguistic precursors
  - How and why did the recursive communication system of human language arise in biological populations?
- Communicative efficiency
  - Important in the evolution of competing linguistic groups, where the different groups had different communicative abilities
  - Differential fitness and natural selection: if communicative efficiency provides biological fitness to individuals, in terms of an increased ability to reproduce and survive, would populations converge to coherent linguistic states?
- Coherence
  - A coherent population is a linguistically homogeneous population
In this part, the book studies:
- The interplay between communicative efficiency, learning fitness, and coherence

In this chapter, we:
- Develop the notions of communicative efficiency and fitness
- Characterize language as a probabilistic association between form and meaning
- Provide a natural definition of communicative efficiency between two linguistic agents possessing different languages
- Perform an empirical study on large linguistic corpora, finding that the structure of the lexicon of several modern languages does not reflect optimality in terms of communicability
1. Communicative Efficiency of Language
Mutual intelligibility (communicative efficiency):
- Quantifies the rate of success in information transfer between two linguistic agents
- Increasing the intelligibility F(L1, L2) between two languages L1 and L2 raises several questions:
  - Given a language L, what language L' maximizes the mutual intelligibility F(L, L') for two-way communication about the shared world?
  - What acquisition mechanisms/learning algorithms can serve the task of improving intelligibility?
  - What are the consequences of individual language-acquisition behavior for the population dynamics and the communicative efficiency of an interacting population of linguistic agents?
Communicability:
- Language may be viewed as an association matrix A that links referents to signals
  - With M referents and N signals, A is an N×M matrix
  - aij: the relative strength of the association between signal i and meaning j
- The matrix A characterizes the behavior of the linguistic agent in two modes:
  - Production mode: produce any of the signals corresponding to a particular meaning, in proportion to the strength of the association
  - Comprehension mode: interpret a particular signal as any of the meanings, in proportion to the association strength
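The two modes can be sketched in a few lines of Python; the matrix values below are illustrative, not taken from the chapter:

```python
import numpy as np

# Toy association matrix A: rows = signals (N=3), columns = meanings (M=2).
# A[i, j] is the relative strength of association between signal i and meaning j.
A = np.array([[4.0, 0.0],
              [1.0, 3.0],
              [0.0, 2.0]])

rng = np.random.default_rng(0)

def produce(A, j, rng):
    """Production mode: emit a signal for meaning j in proportion
    to the association strengths in column j."""
    col = A[:, j]
    return rng.choice(len(col), p=col / col.sum())

def comprehend(A, i, rng):
    """Comprehension mode: interpret signal i as a meaning in proportion
    to the association strengths in row i."""
    row = A[i, :]
    return rng.choice(len(row), p=row / row.sum())

# A speaker conveying meaning 0 mostly uses signal 0 (strength 4 vs. 1).
signals = [produce(A, 0, rng) for _ in range(1000)]
```

Normalizing a column of A gives the production distribution for that meaning; normalizing a row gives the comprehension distribution for that signal.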
Communicability in animal communication:
- Finite lexical association matrix (animal signals and their specific meanings)

Communicability in human communication:
- Infinite lexical association matrices
- Human grammars mediate a complex mapping between form and meaning
- The sets of possible sentences and meanings are infinite (the infinite expressibility of human grammars)

Communicability in machine communication:
- AI: linguistic agents interact with each other in simulated worlds; study whether coherent communication ultimately emerges
- Natural language understanding systems: develop a computer system that is able to communicate with a human; the underlying probability model is learned from data
2. Communicability for Linguistic Systems
Basic notions:
- S: the set of all possible linguistic forms (signals), {s1, s2, ...}
- M: the set of all possible semantic objects (meanings), {m1, m2, ...}
- Define a language to be a probability measure μ over S×M
  - Encoding matrix P (production mode): Pij is the probability of producing the signal si given that one wishes to convey the meaning mj
  - Decoding matrix Q (comprehension mode): Qij is the probability of the same user interpreting the expression si to mean mj
Probability of events and a communicability function:
- Given two communication systems (languages μ1 and μ2)
- The probability that an event occurs whose meaning is successfully communicated:
  - From μ1 to μ2: tr(P(1)Λ(Q(2))T)
  - From μ2 to μ1: tr(P(2)Λ(Q(1))T)
- Define the communicability function of μ1 and μ2 (mutual intelligibility, communicative efficiency):

  F(μ1, μ2) = (1/2) [ tr(P(1)Λ(Q(2))T) + tr(P(2)Λ(Q(1))T) ],

  where Λ is a diagonal matrix s.t. Λii = σ(mi), tr(A) denotes the trace of matrix A, and P(i), Q(i) refer to the coding and decoding matrices associated with μi
- Note that tr(P(1)Λ(Q(2))T) is simply the probability that an event occurs and is successfully communicated from a user of μ1 to a user of μ2
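The trace formula lends itself to a direct implementation; a minimal sketch (the matrices and σ below are illustrative):

```python
import numpy as np

# Communicability F(mu1, mu2) for finite languages, following the trace
# formula above. Matrices are N x M:
# P[i, j] = prob. of producing signal i for meaning j (columns sum to 1),
# Q[i, j] = prob. of interpreting signal i as meaning j (rows sum to 1).
def communicability(P1, Q1, P2, Q2, sigma):
    Lam = np.diag(sigma)  # Lambda_ii = sigma(m_i): prob. that event m_i occurs
    one_to_two = np.trace(P1 @ Lam @ Q2.T)  # success from user of mu1 to user of mu2
    two_to_one = np.trace(P2 @ Lam @ Q1.T)  # success from user of mu2 to user of mu1
    return 0.5 * (one_to_two + two_to_one)

# Two identical, perfectly unambiguous 2-signal / 2-meaning languages
# communicate every event successfully, so F = 1.
P = np.eye(2)
Q = np.eye(2)
sigma = np.array([0.5, 0.5])
F = communicability(P, Q, P, Q, sigma)  # 1.0
```

Swapping the second user's decoder rows (so every signal is misinterpreted) drops the μ1→μ2 term to zero, halving F.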
3. Reaching the Highest Communicability
Given a language μ0:
- For any language μ, we have (where σi = σ(mi))

  F(μ0, μ) ≤ (1/2) [ Σi maxj ( σj μ0(si|mj) ) + Σj σj maxi μ0(mj|si) ]

- Define the best response as a language μ* s.t. F(μ0, μ*) = supμ F(μ0, μ)
  - This is the maximum possible mutual intelligibility between a user of μ0 and a user of any allowable language
- Goal: construct a family of languages με (ε > 0) s.t. F(μ0, με) can be made arbitrarily close to supμ F(μ0, μ)
A special case of finite languages:
- Three simplifying assumptions:
  1. The languages are finite, and the matrices have size N×M
  2. The distribution σ is uniform, i.e. σi = 1/M for all i
  3. The measure μ0 satisfies the property of unique maxima: for each i there exist a unique p0(i) and a unique r0(i) s.t. exactly one element of each column of μ0(s|m) (row of μ0(m|s)) is strictly the biggest element in that column (row)
A special case of finite languages (cont.) — maximizing communicative efficiency:
- Find a matrix Q* s.t.

  Q* = argmaxQ tr(P(0)ΛQT),

  where we maximize over all matrices Q whose elements are non-negative and sum up to one within each row
  - The best decoder Q* places, in each row, all its mass at the position of the largest element of the corresponding row of P(0)
- Find a matrix P* s.t.

  P* = argmaxP tr(PΛ(Q(0))T),

  where we maximize over all matrices P whose elements are non-negative and sum up to one within each column
  - The best encoder P* places, in each column, all its mass at the position of the largest element of the corresponding column of Q(0)
- If a μ* existed s.t. μ*(s|m) = P* and μ*(m|s) = Q*, then this μ* would be the best response
- It turns out that, in general, μ* does not exist
- However, there always exists a measure that approximates the performance of P* and Q* arbitrarily well
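Assuming a uniform σ, the two maximizations reduce to row-wise and column-wise argmax selections; a sketch with illustrative toy matrices:

```python
import numpy as np

# Best decoder: in each row i, put all mass on the largest element of the
# corresponding row of P0. Best encoder: in each column j, put all mass on
# the largest element of the corresponding column of Q0. (Uniform sigma.)
def best_decoder(P0):
    Q = np.zeros_like(P0)
    Q[np.arange(P0.shape[0]), P0.argmax(axis=1)] = 1.0  # one "1" per row
    return Q

def best_encoder(Q0):
    P = np.zeros_like(Q0)
    P[Q0.argmax(axis=0), np.arange(Q0.shape[1])] = 1.0  # one "1" per column
    return P

# Toy matrices with the unique-maxima property assumed above:
P0 = np.array([[0.7, 0.2],
               [0.3, 0.8]])   # columns sum to 1
Q0 = np.array([[0.9, 0.1],
               [0.4, 0.6]])   # rows sum to 1
Q_star = best_decoder(P0)     # identity: each signal decoded as its likeliest meaning
P_star = best_encoder(Q0)     # identity: each meaning encoded by its likeliest signal
```

The unique-maxima assumption guarantees each argmax is well defined, so Q* has exactly one nonzero element per row and P* exactly one per column.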
Theorem 11.1 (Komarova and Niyogi 2004):
- For any finite language μ0 satisfying the property of unique maxima, and a uniform probability distribution σ, we have

  lim(ε→0) F(μ0, με) = supμ F(μ0, μ)

- In order to prove Theorem 11.1, we need to show that the family με constructed below approximates the performance of P* and Q* arbitrarily well

The auxiliary matrix and the absence of loops:
- Define an auxiliary matrix X that contains nonzero entries at the slots where either P* or Q* contains a nonzero entry
- Draw lines connecting all the "ones" of the X matrix that belong to the same row, and all the "ones" of the X matrix that belong to the same column
Lemma 11.1:
- Suppose that a finite measure μ0 has the property of unique maxima. Then graphs constructed as described above do not contain any closed loops.

Proof of Lemma 11.1:
- Assume that there exists a closed loop, and consider its "turning points"
- Suppose there are 2K such vertices x_{αi,βj}, where the pair of integers (αi, βj) gives the coordinates of the vertex (1 ≤ i, j ≤ K)
- Let x_{α1,β1} be connected with x_{α1,β2} by a horizontal line; then x_{α1,β2} is connected with x_{α2,β2} by a vertical line, ..., and x_{αK,β1} is connected with x_{α1,β1} by a vertical line, closing the loop
Proof of Lemma 11.1 (cont.):
- If a vertex corresponds to a "one" of the Q* matrix, then the corresponding slot of the P* matrix is zero, and vice versa
- Suppose that Q*_{α1,β1} = 1 and P*_{α1,β1} = 0. Then:
  - Q*_{α1,β2} = 0 (∵ there can be only one nonzero element in the same row of Q*)
  - P*_{α1,β2} = 1 (∵ the corresponding vertex is present in X)
  - P*_{α2,β2} = 0 (∵ we can only have one positive element in each column of P*)
  - ... and so on around the loop
- Recall that the positive elements of Q* correspond to the largest elements in the corresponding rows of the P0 matrix (and similarly for P*)
- Rewriting these conditions, we show that the resulting system is incompatible
Proof of Lemma 11.1 (cont.):
- The system can be presented as a closed chain of strict inequalities for Q0, which cannot all hold simultaneously
- This contradiction proves that there can be no closed loops in the matrix X
Constructing the matrix με:
- From Lemma 11.1, if we connect all the vertices of the matrix X by horizontal and vertical lines, the resulting graphs will contain no closed loops
- For each of these graphs, perform the following procedure for all pairs of connected vertices:
  - Take a pair of vertices
  - If they are connected by a horizontal (vertical) line, refer to the corresponding entries of the Q* matrix (P* matrix); one of them will be one and the other zero
  - Draw an arrow on the graph from the element corresponding to zero to the element corresponding to one
- Starting from some vertex, replace the corresponding element in X by ε; following the arrows, keep replacing the elements of X by entries of the form ε^k, where the integer k increases or decreases from one vertex to the next depending on the direction of the arrow
- The resulting matrix is Aε; the measure με is obtained from Aε by normalization
Proof of Theorem 11.1, part (b):
- To find the entries of με(s|m), renormalize each column of the matrix με so that its elements sum up to one
- Each column will contain at most one segment of one of the graphs
- By construction, the biggest element of this segment of the graph corresponds to the positive element of Q*
- In the limit ε→0, the other elements become vanishingly small in comparison, and the columns approach the corresponding columns of the P* matrix
- Similarly, the rows of με(m|s) in the limit become the rows of the Q* matrix
- ∴ The family of measures με satisfies the requirements of Theorem 11.1
Generalizations — relaxing the three restrictions on the measures μ:
1. Unique maxima in the rows and columns of μ(s|m) and μ(m|s), respectively
   - If there are multiple maxima, it turns out that loops may exist; Lemma 11.1 can be modified using a neutral vertex of the graph
2. Uniform distribution of the events in the world that need to be communicated
   - If events do not occur with uniform probability, redefine P* and Q* accordingly (weighting the entries by σ(mi) before taking the maxima)
3. Finite cardinality of S and M
   - A measure μ on a countably infinite space can be approximated arbitrarily closely by a measure with finite support; this reduces the infinite case to the finite case
4. Implications for Learning
Language learning:
- An agent tries to learn a language in order to communicate with some other agent whose language is characterized by the measure μ
- The best response μ* itself may not exist
- An arbitrarily close approximation με (for any ε) does exist
- The best the learner can do is estimate με

Two natural learning scenarios (differing in how much information is available to the learner):
- Full information: the learner is able to sample μ directly to get (sentence, meaning) pairs
- Partial information: the learner only hears the sentence while the intended meaning is latent; what the learner reasonably may have access to is whether its interpretation of the sentence was successful or not
Estimating P:
- Q* is derivable from the P matrix of the teacher
- Learning with full information:
  - The learner has access to the (s, m) pair every time the teacher produces a sentence
  - Define the event Aij = "teacher produces si to communicate mj"
  - The probability of event Aij is σj pij
  - If the teacher produces n (s, m) pairs at random in an i.i.d. manner, then the ratio kij/n (where kij is the number of occurrences of Aij) is an empirical estimate of the probability of the event Aij
  - As n→∞, kij/n → σj pij with probability 1; applying Hoeffding's inequality bounds the rate at which this convergence occurs:

    Pr( |kij/n − σj pij| > ε ) ≤ 2 exp(−2nε²)
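A simulation sketch of this estimator (the teacher matrix below is a toy example, not from the text):

```python
import numpy as np

# Learning P with full information: the learner sees n i.i.d. (signal, meaning)
# pairs and estimates Pr(A_ij) = sigma_j * p_ij by the empirical frequency k_ij/n.
rng = np.random.default_rng(1)
N, M, n = 3, 2, 20000
P_teacher = np.array([[0.7, 0.1],
                      [0.2, 0.3],
                      [0.1, 0.6]])   # columns sum to 1
sigma = np.full(M, 1.0 / M)          # uniform distribution over meanings

counts = np.zeros((N, M))
for _ in range(n):
    j = rng.choice(M, p=sigma)               # world picks a meaning
    i = rng.choice(N, p=P_teacher[:, j])     # teacher produces a signal
    counts[i, j] += 1

estimate = counts / n                 # empirical Pr(A_ij)
true_prob = P_teacher * sigma         # sigma_j * p_ij
max_err = np.abs(estimate - true_prob).max()   # shrinks like O(1/sqrt(n))
```

Hoeffding's inequality guarantees each entry's error exceeds ε with probability at most 2·exp(−2nε²), so with n = 20000 the worst-case entry error is small with very high probability.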
Estimating P (cont.) — learning with full information (cont.):
- Assume N possible sentences and M possible meanings
- In total there are NM different events whose probabilities need to be estimated (Aij, i = 1, ..., N, j = 1, ..., M)
- Let Eij be the event that the empirical estimate of Pr(Aij) deviates from its true value by more than ε; by the union bound,

  Pr( ∪ij Eij ) ≤ Σij Pr(Eij) ≤ 2NM exp(−2nε²)

- Hence, with high probability (depending on n), all empirical estimates are simultaneously close to their respective true values σj pij
Estimating P (cont.) — learning with partial information:
- Let the learner guess a meaning uniformly at random
- Define event Aij = "teacher produces si; learner guesses mj; communication is successful"
- The probability of event Aij is (1/M) σj pij
- The learner counts kij, the number of times event Aij has occurred
- The empirical estimate of the probability of Aij is kij/n
- Since M is fixed in advance and known, this allows the learner to estimate pij for each i, j arbitrarily well
- A uniform bound over all NM events follows from Hoeffding's inequality and the union bound, as in the full-information case
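The partial-information setting can be sketched as follows; note the learner never observes the intended meaning, only success or failure, and the teacher matrix is again a toy example:

```python
import numpy as np

# Learning P with partial information: Pr(A_ij) = (1/M) * sigma_j * p_ij,
# so with uniform sigma (sigma_j = 1/M) the learner recovers p_ij as
# M^2 * k_ij / n from success counts alone.
rng = np.random.default_rng(2)
N, M, n = 3, 2, 200000
P_teacher = np.array([[0.7, 0.1],
                      [0.2, 0.3],
                      [0.1, 0.6]])   # columns sum to 1

counts = np.zeros((N, M))
for _ in range(n):
    j = rng.integers(M)                      # meaning drawn under uniform sigma
    i = rng.choice(N, p=P_teacher[:, j])     # teacher produces s_i
    guess = rng.integers(M)                  # learner guesses a meaning uniformly
    if guess == j:                           # success is all the learner observes
        counts[i, guess] += 1

P_hat = M * M * counts / n                   # empirical estimate of p_ij
max_err = np.abs(P_hat - P_teacher).max()
```

The price of partial information is the extra factor of M² in the estimator, which inflates the variance and hence the number of samples needed for a given accuracy.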
Estimating Q:
- P* is derivable from the Q matrix of the teacher
- Learning with full information:
  - The learner picks a sentence uniformly at random (with probability 1/N) and produces it for the teacher to hear
  - Define event Aij = "learner produces si; teacher interprets it as mj"
  - The probability of Aij is (1/N) qij; after n trials, the estimate of qij is N kij/n
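This estimator can be sketched analogously (toy teacher matrix, illustrative values):

```python
import numpy as np

# Estimating Q with full information: the learner produces each sentence
# uniformly at random and observes the teacher's interpretation.
# Pr(A_ij) = (1/N) * q_ij, so q_ij is estimated by N * k_ij / n.
rng = np.random.default_rng(3)
N, M, n = 3, 2, 50000
Q_teacher = np.array([[0.9, 0.1],
                      [0.3, 0.7],
                      [0.2, 0.8]])   # rows sum to 1

counts = np.zeros((N, M))
for _ in range(n):
    i = rng.integers(N)                      # learner picks a sentence uniformly
    j = rng.choice(M, p=Q_teacher[i])        # teacher interprets it as m_j
    counts[i, j] += 1

Q_hat = N * counts / n                       # empirical estimate of q_ij
max_err = np.abs(Q_hat - Q_teacher).max()
```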
Estimating Q (cont.) — learning with partial information:
- The learner picks a (sentence, meaning) pair uniformly at random (with probability 1/NM)
- Define event Aij = "learner produces (si, mj); communication is successful"
- The probability of Aij is (1/NM) qij
Sample complexity bounds:
- Determine the number of learning events that need to occur so that, with high probability, the learner will be able to develop a language with ε-good communicability
- Let the teacher's measure be μ, and assume that μ is s.t. P and Q have unique row-wise and column-wise maxima, respectively
- Margin γ: the smallest gap between the largest and the second-largest element within any row of P or column of Q; the larger the margin, the easier the maxima are to identify from samples
Sample complexity bounds (cont.) — learning with partial information:
- Proof of Theorem 11.2 (sketch):
  - Let there be n/2 interactions where the teacher speaks and the learner listens, and n/2 interactions of the other form
  - From the first kind the learner estimates the entries of P; from the second, the entries of Q
  - Setting ε = γ/4 and using Hoeffding's inequality together with the union bound, all estimated entries of P and Q are within γ/4 of the true values with probability that approaches 1 as n grows
Sample complexity bounds (cont.) — Proof of Theorem 11.2 (cont.):
- The learner chooses Q*:
  - For each i, the learner wants to identify ji*, the position of the largest element in the relevant row
  - The learner chooses the position of the largest empirical estimate
  - Show that this choice is correct: assume it is not; then two entries separated by at least the margin γ would have empirical estimates, each within γ/4 of its true value, in the reversed order, which leads to a contradiction
  - Since this holds for each i, Q* is identified exactly
- The learner chooses P*: by the same argument, P* is also identified exactly
- Ensure that n is large enough that the γ/4-accuracy condition holds with probability greater than 1 − δ; for such n, both P* and Q* are identified exactly with probability greater than 1 − δ
Remarks:
- The number of examples is a function of M, N, and γ
  - The margin γ, which depends on the teacher's language μ, determines how easy it is for the learner to estimate Q* and P*; it characterizes the learning difficulty of μ in this setting
- Infinite matrices are not learnable
  - Infinite-dimensional spaces are known to be unlearnable; further constraints will be required on the space of possible measures to which the teacher's language belongs
- The constants in the bound on sample complexity may be tightened, although the order is essentially correct
Sample complexity bounds (cont.):
- Learning with full information: an analogous and simpler argument yields a corresponding sample complexity bound for the full-information setting
5. Communicative Efficiency and Linguistic Structure

An empirical study of the structure of lexical items suggests that a tight coupling between communicative efficiency and lexical structure may not always be present.
Phonemic contrasts and lexical structure:
- If all the phonemes in the sequence are heard correctly by the hearer, then the word has been successfully transmitted from speaker to hearer, and communicative efficiency is high
- If the hearer cannot distinguish between /p/ and /b/:
  - The hearer cannot tell apart the words pat and bat, pit and bit, ...
  - Information is no longer perfectly transmitted from speaker to hearer
- How much information is lost on the whole?
Phonemic contrasts and lexical structure (cont.):
- p1, ..., pn: the probabilities with which the n words are used on average
- W: the lexicon
- Entropy (information content) of the entire lexicon: H(W) = −Σi pi log pi
- If all words were equally likely, H(W) = log(n)
- H(W) is an average measure of the information transmitted from speaker to hearer by transmitting words of the lexicon
- Reduced lexicon: W({/p/, /b/}) = {c1({/p/, /b/}), ..., ck({/p/, /b/})}, where each class cl collects the words that become indistinguishable when the /p/–/b/ contrast is lost
- ql: the probability with which the hearer will encounter a word that belongs to cl({/p/, /b/}), i.e. the sum of the probabilities of the words in that class
- H(W({/p/, /b/})) = −Σl ql log ql: information content of the reduced lexicon
- Functional load (normalized loss of information):

  FL = ( H(W) − H(W({/p/, /b/})) ) / H(W)

  0 ≤ FL ≤ 1; FL is the fraction of information lost (at the lexical level) by losing the ability to distinguish between /p/ and /b/
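A worked sketch of the functional-load computation on a toy lexicon (the words and probabilities are illustrative, not corpus values):

```python
import math
from collections import Counter

# Toy lexicon with usage probabilities.
lexicon = {"pat": 0.3, "bat": 0.2, "pit": 0.2, "bit": 0.1, "cat": 0.2}

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def merged(word):
    # Orthographic stand-in for losing the /p/-/b/ phonemic contrast:
    # words differing only in p vs. b fall into the same class.
    return word.replace("b", "p")

H_W = entropy(lexicon.values())

# Reduced lexicon: each class c_l collects the now-indistinguishable words;
# its probability q_l is the sum over the class.
classes = Counter()
for word, p in lexicon.items():
    classes[merged(word)] += p
H_reduced = entropy(classes.values())

FL = (H_W - H_reduced) / H_W   # fraction of lexical information lost
```

Here pat/bat merge into one class (probability 0.5) and pit/bit into another (0.3), while cat is unaffected, so roughly a third of the lexical information is lost.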
Functional load and communicative efficiency:
- Listener's guessing strategy (randomized): on hearing a word from a merged class, the listener guesses among the words of that class in proportion to their probabilities
- The probability of successful transmission, and hence the communicative efficiency, can then be computed from the class probabilities; the loss in communicative efficiency is the drop relative to the unmerged lexicon
- Information-theoretic measure of functional load: if all n words are equally likely, H(W) = log(n) and H(W({/p/, /b/})) = log(n) − Σl (nl/n) log(nl), where nl is the number of words in class cl
- Range of FL: 0 ≤ FL ≤ 1
Perceptual confusability and functional load:
- If communicative efficiency played a role in the evolution of linguistic structure, we should observe a correlation between the perceptual difficulty of making a phonetic contrast and the functional load of that contrast
- Empirical experiment:
  - Data: Dutch, English, and Chinese
  - Perceptual confusability between phonemes: psychoacoustic data (acoustic difference); phoneme-confusion matrices from experimental psycholinguistic data
  - Lexical data: corpus-based linguistic data (colloquial pronunciation patterns, frequency of usage, semantic and syntactic information)
- Result: there is no significant correlation between functional load and confusability
Perceptual confusability and functional load (cont.):
- Plot: functional load against perceptual confusability for phonetic distinctions in English
- Other contextual cues help in identifying the word uniquely
- Several possible interpretations:
  - The structure of the lexicon does not display any sign of having been optimized to suit the perceptual limitations of humans
  - Communicative efficiency might play little role in the structure of natural languages
  - A more appropriate quantitative formulation of functional load or communicative efficiency may be needed
  - Internal optimization of linguistic interfaces, rather than external optimization of communicative efficiency, drives change and evolution