information distance ming li canada research chair in bioinformatics university of waterloo

Information Distance

Ming Li, Canada Research Chair in Bioinformatics

University of Waterloo

A story about chain letters

Charles H. Bennett collected 33 copies between 1980 and 1997.

Like a virus, they have infected billions of people. Like a gene, they are about 2000 characters long and they mutate.

Traditional phylogeny methods fail:
• Multiple alignment is not possible because of translocations.
• There are no models of evolution.

They are not alone: programs, music scores, genomes ...

A sample letter:

A very pale letter reveals the evolutionary path: ((copy)*mutate)*

Information Distance

Bennett, Gacs, Li, Vitanyi, Zurek, STOC’93
Li et al: Bioinformatics, 17:2(2001), 149-154
Li et al, IEEE Trans. Info. Theory, 2004

In the classical Newtonian world, we use length to measure distance: 10 miles, 2 km.

In the modern information world, what measure do we use for the distance between
• Two documents?
• Two genomes?
• Two computer viruses?
• Two junk emails?
• Two (possibly plagiarized) programs?
• Two pictures?
• Two internet homepages?

The same two objects may be measured at different granulation levels.

A general theory must satisfy:

• Application independent
• Information granulation independent
• Dominate all other theories
• Useful in practice

Outline

A theory of information distance

Applications: a paradigm of parameter-free data mining

Part I: A Theory of Information Distance

The classical approaches do not work

None of the distances we know (Euclidean distance, Hamming distance, edit distance) is proper. For example, they do not reflect our intuition on:

[Figure: two images labeled "Austria" and "Byelorussia"]

But from where shall we start? We will start from the first principles of physics and make no further assumptions. We wish to derive a general theory of information distance.

Kolmogorov complexity

K(x) = length of the shortest description of x.
K(x|y) = length of the shortest description of x given y.
K(x) - K(x|y) is the information y knows about x.

Theorem (Mutual Information). Up to an additive logarithmic term,

K(x) - K(x|y) = K(y) - K(y|x)
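K(x) itself is not computable, so any experiment has to substitute a real compressor. The sketch below is only an illustration under that assumption: it uses zlib (not a compressor named in these slides) and the helper names c, c_cond and shared_info are hypothetical. It estimates K(x) by compressed length and K(x|y) by the extra bytes needed to compress x after y.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes: a computable stand-in for K(x) (assumption: zlib)."""
    return len(zlib.compress(data, 9))


def c_cond(x: bytes, y: bytes) -> int:
    """Rough stand-in for K(x|y): extra bytes needed for x once y has been seen."""
    return max(c(y + x) - c(y), 0)


def shared_info(x: bytes, y: bytes) -> int:
    """Rough stand-in for K(x) - K(x|y), the information y has about x."""
    return c(x) - c_cond(x, y)


x = (b"this paper has been sent to you for good luck. the original is in "
     b"new england. it has been around the world nine times. you will "
     b"receive good luck within four days of receiving this letter.")
y = x.replace(b"nine", b"seven").replace(b"four", b"five")  # a mutated copy of x
rng = random.Random(0)
z = bytes(rng.randrange(256) for _ in range(len(x)))         # unrelated noise

print("C(x) =", c(x))
print("C(x|y) =", c_cond(x, y), " C(x|noise) =", c_cond(x, z))
print("shared with copy =", shared_info(x, y), " shared with noise =", shared_info(x, z))
```

A mutated copy should explain most of x, while random noise should explain almost none of it. With a real compressor the two sides of the mutual information identity agree only approximately.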


Thermodynamics of Computing

Physical Law: about kT of energy (more precisely, kT ln 2) must be dissipated to irreversibly process 1 bit (von Neumann, Landauer).

Reversible computation is free.
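As a rough worked number for the physical law above, assume room temperature T = 300 K and Boltzmann's constant k ≈ 1.38 × 10^-23 J/K (both values are assumptions for illustration, not taken from the slides). The minimum energy dissipated per irreversibly processed bit is then approximately:

```latex
E_{\min} = kT \ln 2
         \approx (1.38 \times 10^{-23}\,\mathrm{J/K}) \cdot (300\,\mathrm{K}) \cdot 0.693
         \approx 2.9 \times 10^{-21}\,\mathrm{J}
```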

Heat Dissipation

[Figure: Input -> Compute -> Output]

A billiard ball computer

[Figure: billiard-ball gate with inputs A and B and output paths A AND B, B AND NOT A, A AND NOT B, A AND B]

[Figure: Input 0110011, Output 1000111]

Ultimate thermodynamic cost of erasing x:
• “Reversibly compress” x to x*.
• Then erase x*. Cost ~K(x) bits.
• The longer you compute, the less heat dissipation.

Cost of computing x from y. Define: E(x,y) = min { |p| : U(x,p) = y, U(y,p) = x }.

Fundamental Theorem: E(x,y) = max{ K(x|y), K(y|x) }, up to an additive logarithmic term.
Bennett, Gacs, Li, Vitanyi, Zurek, STOC’93

Normalized Information distance:

\[
d(x,y) = \frac{\max\{K(x|y),\, K(y|x)\}}{\max\{K(x),\, K(y)\}}
\]

First proposed in Li et al: Bioinformatics, 17:2(2001), 149-154, in slightly different form. In this form in: Li et al, IEEE Trans. Info. Theory, 2004.
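Since K is not computable, d(x,y) is estimated in practice with a compressor. The sketch below is one common way to do this, usually called the normalized compression distance. It assumes zlib as the compressor and approximates max{K(x|y), K(y|x)} by C(xy) - min{C(x), C(y)}; the function names are hypothetical.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes (assumption: zlib stands in for an ideal compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Compression-based estimate of the normalized information distance d(x,y)."""
    cx, cy = c(x), c(y)
    # C(xy) - min(C(x), C(y)) approximates max{K(x|y), K(y|x)}.
    return (c(x + y) - min(cx, cy)) / max(cx, cy)


a = (b"do not keep this letter. it must leave your hands within 96 hours. "
     b"send copies to people you think need good luck. do not send money "
     b"as faith has no price.")
b = a.replace(b"96 hours", b"four days").replace(b"faith", b"fate")  # near copy
rng = random.Random(1)
noise = bytes(rng.randrange(256) for _ in range(len(a)))             # unrelated

print("near copy:", round(ncd(a, b), 3))
print("noise:    ", round(ncd(a, noise), 3))
```

Values near 0 indicate that one object is almost fully described by the other, and values near 1 indicate that the objects share almost nothing. A real compressor can push the estimate slightly above 1.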

Theorem. d(x,y) is a nontrivial distance: it is symmetric, satisfies the triangle inequality, etc.

Open Question. We wish to show that d(x,y) is universal: if x and y are “close” in any sense, then they are “close” under d(x,y). That is, for any reasonable computable distance D, there exists a constant c such that for all x, y,

d(x,y) ≤ D(x,y) + c

d(x,y) Properties

For any computable D, for all x,y: d(x,y) ≤ D(x,y)+ c

Proof Ideas: Naively, by the density assumption |{y : |y| = n and D(x,y) ≤ d}| ≤ 2^{dn}, we have K(x|y), K(y|x) ≤ nD(x,y). So

\[
d(x,y) = \frac{\max\{K(x|y),\, K(y|x)\}}{\max\{K(x),\, K(y)\}} \le \frac{n\,D(x,y)}{\max\{K(x),\, K(y)\}} \qquad (1)
\]

Then we are stuck. This will work only if K(x) or K(y) = n. To solve this, we first prove:

Lemma: There exist shortest programs x* for x and y* for y such that K(x|y) ≤ K(x*|y*).

Now,

\[
d(x,y) = \frac{\max\{K(x|y),\, K(y|x)\}}{\max\{K(x),\, K(y)\}}
\le \frac{\max\{K(x^*|y^*),\, K(y^*|x^*)\}}{\max\{K(x),\, K(y)\}}
\le \frac{\max\{|x^*|,\, |y^*|\}\, D(x,y)}{\max\{|x^*|,\, |y^*|\}}
\le D(x,y) + c
\]

Part II: A paradigm of parameter-free data mining

Keogh, Lonardi, Ratanamahatana, KDD 2004

Perils of parameter-laden data mining algorithms:

• Incorrect settings miss true patterns.
• Too much tuning leads to overfitting: excellent performance on one dataset, but bad failures on new, similar datasets.
• Parameters impose presumptions on data.

Parameter-free data mining

Our theory provides a paradigm of parameter-free data mining, as d(x,y) is universal.

• Works at all granularity levels of information.
• No assumptions on data.

But d(x,y) is not computable. Does this theory work at all? We decided to run extensive real-life experiments.

Application 1: Reconstructing History of Chain Letters

For each pair of chain letters (x, y), we computed d(x,y) with GenCompress, giving a distance matrix.

We used a standard phylogeny program to construct their evolutionary history from the d(x,y) distance matrix.

The resulting tree is a perfect phylogeny: distinct features are all grouped together.
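The pairwise step can be sketched as follows. This is only an illustration: zlib is swapped in for GenCompress, the four short "letters" are toy fragments rather than the 33 collected letters, and distance_matrix is a hypothetical helper name.

```python
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes (assumption: zlib, not GenCompress)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Compression-based estimate of d(x,y)."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)


def distance_matrix(items):
    """Symmetric matrix of pairwise distance estimates with a zero diagonal."""
    n = len(items)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Average both orders, since a real compressor is only roughly symmetric.
            d = (ncd(items[i], items[j]) + ncd(items[j], items[i])) / 2
            m[i][j] = m[j][i] = d
    return m


# Toy data: two families of letters, each with one lightly mutated copy.
letter_a = (b"this paper has been sent to you for good luck. the original is in "
            b"new england. it has been around the world nine times. the luck "
            b"has been sent to you.")
letter_b = letter_a.replace(b"nine", b"ten").replace(b"paper", b"letter")
letter_c = (b"constantine dias received the chain in 1953. he asked his secretary "
            b"to make twenty copies and send them. a few days later, he won a "
            b"lottery of two million dollars.")
letter_d = letter_c.replace(b"1953", b"1958").replace(b"twenty", b"thirty")

letters = [letter_a, letter_b, letter_c, letter_d]
matrix = distance_matrix(letters)
# For each letter, the index of its closest other letter.
nearest = [min((j for j in range(len(letters)) if j != i), key=lambda j: matrix[i][j])
           for i in range(len(letters))]
print(nearest)
```

The resulting matrix is what a tree-building program would take as input. Here each letter's nearest neighbor should be its own mutated copy.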

C. H. Bennett, M. Li and B. Ma, Chain letters and evolutionary histories. Scientific American, 288:6 (June 2003) (feature article), 76-81.

A typical chain letter input file:

with love all things are possible

this paper has been sent to you for good luck. the original is in new england. it has been around the world nine times. the luck has been sent to you. you will receive good luck within four days of receiving this letter. provided, in turn, you send it on. this is no joke. you will receive good luck in the mail. send no money. send copies to people you think need good luck. do not send money as faith has no price. do not keep this letter. it must leave your hands within 96 hours. an r.a.f. (royal air force) officer received $470,000. joe elliot received $40,000 and lost them because he broke the chain. while in the philippines, george welch lost his wife 51 days after he received the letter. however before her death he received $7,755,000. please, send twenty copies and see what happens in four days. the chain comes from venezuela and was written by saul anthony de grou, a missionary from south america. since this letter must tour the world, you must make twenty copies and send them to friends and associates. after a few days you will get a surprise. this is true even if you are not superstitious. do note the following: constantine dias received the chain in 1953. he asked his secretary to make twenty copies and send them. a few days later, he won a lottery of two million dollars. carlo daddit, an office employee, received the letter and forgot it had to leave his hands within 96 hours. he lost his job. later, after finding the letter again, he mailed twenty copies; a few days later he got a better job. dalan fairchild received the letter, and not believing, threw the letter away, nine days later he died. in 1987, the letter was received by a young woman in california, it was very faded and barely readable. she promised herself she would retype the letter and send it on, but she put it aside to do it later. she was plagued with various problems including expensive car repairs, the letter did not leave her hands in 96 hours. she finally typed the letter as promised and got a new car. remember, send no money. do not ignore this. it works.

st. jude

Phylogeny of 33 Chain Letters

Confirmed by VanArsdale’s study, answers an open question

Application 2: Evolution of Species

Li et al: Bioinformatics, 17:2(2001), 149-154

Traditional methods work on a single gene:
• Max. likelihood: multiple alignment, assumes statistical evolutionary models, computes the most likely tree.
• Max. parsimony: multiple alignment, then finds the best tree, minimizing cost.
• Distance-based methods: multiple alignment, NJ; Quartet methods, Fitch-Margoliash method.

Problems: different gene trees, manual alignment, horizontally transferred genes; these methods do not handle genome-level events.

Whole Genome Phylogeny

• Many complete genomes sequenced (400 eukaryote projects).
• No evolutionary models.
• Multiple alignment not possible.
• Single-gene trees often give conflicting results.

• Snel, Bork, Huynen: compare gene contents.
• Boore, Brown: gene order.
• Sankoff, Pevzner, Kececioglu: reversal/translocation.

All of the above are either too simplistic or NP-hard and need approximation anyway.

Our method, using shared information, is robust:
• Uses all the information in the genome.
• No need for an evolutionary model: universal.
• No need for alignment.
• Special cases: gene contents, gene order, reversal/translocation.

Eutherian Orders:

It has been a disputed issue which two of the three groups of placental mammals are closer: Primates, Ferungulates, Rodents.

In mtDNA, 6 proteins say primates are closer to ferungulates; 6 proteins say primates are closer to rodents.

Hasegawa’s group (1998) concatenated 12 mtDNA proteins from: rat, house mouse, grey seal, harbor seal, cat, white rhino, horse, finback whale, blue whale, cow, gibbon, gorilla, human, chimpanzee, pygmy chimpanzee, orangutan, sumatran orangutan, with opossum, wallaroo, platypus as outgroup, using the max likelihood method in MOLPHY.

Who is our closer relative?

Eutherian Orders ...

We used the complete mtDNA genomes of exactly the same species.

We computed d(x,y) for each pair of species, and used Neighbor Joining in the MOLPHY package (and our own Hypercleaning software).

We constructed exactly the same tree, confirming that Primates and Ferungulates are closer to each other than either is to Rodents.
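To illustrate how a distance matrix turns into a tree, here is a minimal sketch that swaps in UPGMA (average-linkage clustering) for Neighbor Joining because it is much shorter. It is not the method used for the mammal tree, the function name upgma is hypothetical, and the distances are made-up toy values for four unnamed taxa A to D, not measured d(x,y) values.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a nested-tuple tree by repeatedly merging the closest pair of clusters.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    """
    # Each cluster is (subtree, tuple of leaf names in that subtree).
    clusters = [(name, (name,)) for name in names]

    def average(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]))
        left, right = clusters[i], clusters[j]
        merged = ((left[0], right[0]), left[1] + right[1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Toy distances (hypothetical): A is close to B, C is close to D.
toy = {("A", "B"): 0.1, ("C", "D"): 0.2,
       ("A", "C"): 0.8, ("A", "D"): 0.8, ("B", "C"): 0.8, ("B", "D"): 0.8}
dist = {frozenset(pair): value for pair, value in toy.items()}
tree = upgma(["A", "B", "C", "D"], dist)
print(tree)
```

UPGMA assumes a roughly constant rate of change along all branches, which Neighbor Joining does not, so for real genomes a package such as the one named above is the better choice.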

Evolutionary Tree of Mammals:

Li et al: Bioinformatics, 17:2(2001)

Application 3: “Uncheatable” Plagiarism Test

X. Chen, B. Francia, M. Li, B. McKinnon, A. Seker. IEEE Trans. Information Theory, 50:7 (July 2004), 1545-1550.

The shared information measure also works for checking student program assignments. We have implemented the system SID.

Our system takes input on the web, strips user comments, and unifies variables. Unlike other programs, we openly advertise our method: we check shared information between each pair of submissions. It is uncheatable because it is universal.

Available at http://genome.math.uwaterloo.ca/SID
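The preprocessing idea (strip comments, unify variables) can be sketched for Python source with the standard tokenize module. This is a hypothetical illustration, not SID's actual implementation; the function name normalize and the placeholder ID are assumptions.

```python
import io
import keyword
import tokenize


def normalize(source: str) -> str:
    """Drop comments and blank lines and replace every identifier with 'ID'."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL):
            continue                      # comments and blank lines carry no logic
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")              # unify variable and function names
        elif tok.type in (tokenize.NEWLINE, tokenize.INDENT,
                          tokenize.DEDENT, tokenize.ENDMARKER):
            out.append(tokenize.tok_name[tok.type])   # keep structure, not whitespace
        else:
            out.append(tok.string)        # keywords, operators, literals
    return " ".join(out)


original = "def add(a, b):\n    # sum two numbers\n    return a + b\n"
disguised = "def plus(x, y):\n    return x + y  # renamed copy\n"

print(normalize(original))
print(normalize(original) == normalize(disguised))
```

After normalization, renaming variables or editing comments no longer changes the text, so a compression-based shared information measure can then be applied to the normalized programs.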

Application 4: A language tree created using the UN’s The Universal Declaration of Human Rights, by three Italian physicists, in Phys. Rev. Lett. & New Scientist

Application 5: Classifying Music

By Rudi Cilibrasi, Paul Vitanyi, Ronald de Wolf, reported in New Scientist, April 2003.

They took 12 jazz, 12 classical, and 12 rock music scores. Classified well.

Potential application in identifying authorship.

The technique's elegance lies in the fact that it is tone deaf. Rather than looking for features such as common rhythms or harmonies, says Vitanyi, "it simply compresses the files obliviously."

Parameter-Free Data Mining: Keogh, Lonardi, Ratanamahatana, KDD’04

Time series clustering
• Compared against 51 different parameter-laden measures from SIGKDD, SIGMOD, ICDM, ICDE, SSDB, VLDB, PKDD, PAKDD, the simple parameter-free shared information method outperformed all of them, including HMM, dynamic time warping, etc.

Approximating Normalized Information distance for non-literal objects (R. Cilibrasi, P. Vitanyi)

Internet distribution:

\[
g(x) = \frac{\text{Internet page count for “x”}}{\text{number of pages indexed}}
\]

Theorem. -log m(x) = K(x) + O(1), where m(x) is the universal distribution. (The shorter the more likely.)

If we assume the internet distribution g(x) roughly follows m(x), then we can approximate the normalized information distance by replacing K(x) with -log m(x), estimated in practice by -log g(x).
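The substitution can be worked through as follows. This is a sketch under assumptions not spelled out on the slides: f(x) denotes the page count for x, f(x,y) the count of pages containing both terms, N the number of pages indexed, and K(x|y) is approximated by K(x,y) - K(y).

```latex
\begin{align*}
K(x) &\approx -\log g(x) = \log N - \log f(x), \qquad
K(x,y) \approx \log N - \log f(x,y) \\
K(x|y) &\approx K(x,y) - K(y) = \log f(y) - \log f(x,y) \\
d(x,y) &= \frac{\max\{K(x|y),\, K(y|x)\}}{\max\{K(x),\, K(y)\}}
\approx \frac{\max\{\log f(x),\, \log f(y)\} - \log f(x,y)}
             {\log N - \min\{\log f(x),\, \log f(y)\}}
\end{align*}
```

The base of the logarithm cancels in the ratio, so any base can be used.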

Shannon-Fano Code

Consider N symbols 1, 2, …, N with decreasing probabilities p_1 ≥ p_2 ≥ … ≥ p_N. Let

\[
P_r = \sum_{i=1}^{r-1} p_i .
\]

The binary code E(r) for r is obtained by truncating the binary expansion of P_r at length |E(r)| such that

\[
-\log p_r \le |E(r)| < -\log p_r + 1 .
\]

Highly probable symbols are mapped to shorter codes, and

\[
2^{-|E(r)|} \le p_r < 2^{-|E(r)|+1} .
\]

Near optimal: Let H = \sum_r p_r |E(r)| be the average number of bits needed to encode 1…N. Then we have

\[
-\sum_r p_r \log p_r \;\le\; H \;<\; \sum_r (-\log p_r + 1)\, p_r \;=\; 1 - \sum_r p_r \log p_r .
\]
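The construction can be sketched in a few lines. The sketch assumes the usual convention that P_r is the total probability of the symbols strictly before r, uses exact fractions to avoid rounding problems, and the function name shannon_fano and the example distribution are hypothetical.

```python
from fractions import Fraction


def shannon_fano(probs):
    """Code words for probabilities given in decreasing order."""
    codes = []
    cumulative = Fraction(0)      # P_r: total probability of the symbols before r
    for p in probs:
        p = Fraction(p)
        # Smallest length with 2^-length <= p, i.e. -log p <= length < -log p + 1.
        length = 0
        while Fraction(1, 2 ** length) > p:
            length += 1
        # First `length` bits of the binary expansion of P_r.
        bits, frac = [], cumulative
        for _ in range(length):
            frac *= 2
            bit = int(frac >= 1)
            bits.append(str(bit))
            frac -= bit
        codes.append("".join(bits))
        cumulative += p
    return codes


probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
codes = shannon_fano(probs)
avg_len = sum(p * len(code) for p, code in zip(probs, codes))
print(codes, avg_len)
```

For this dyadic example the average code length equals the entropy, 7/4 bits, which is the lower end of the bound above.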

Examples

• “horse”: #hits = 46,700,000
• “rider”: #hits = 12,200,000
• “horse” “rider”: #hits = 2,630,000
• #pages indexed: 8,058,044,651
• d''(horse,rider) = 0.453

Theoretically and empirically scale-invariant.

Cilibrasi-Vitanyi classified numbers vs colors, 17th century Dutch painters, prime numbers, electrical terms, religious terms, translation English->Spanish.

New ways of doing expert systems, wordnet, AI, translation, all sorts of stuff.
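The horse/rider example can be recomputed with a short sketch. It assumes that d'' is the page-count distance of Cilibrasi and Vitanyi (obtained by replacing K(x) with -log of the page frequency); the exact formula is not shown on the slides, and the function name ngd is hypothetical.

```python
import math


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Page-count distance: (max log f - log f(x,y)) / (log N - min log f)."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Page counts quoted in the example above.
d_horse_rider = ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651)
print(round(d_horse_rider, 3))
```

With these counts the sketch gives about 0.443, close to the 0.453 quoted above; the small gap may come from a different variant of the formula or from rounding.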

Query-Answer System

Y. Hao, X. Zhang, X. Zhu, M. Li

Adding conditions to normalized information distance, we built a Query-Answer system.

Example: “Who invented the light bulb?” Our system computes

d''(who, light bulb | invent)

Result:

Candidates      Distance d''
tomas edison    0.4801
light bulb      0.6087
latimer         0.7283
joseph swan     0.7750

Other applications

• C. Ane and M.J. Sanderson: Phylogenetic reconstruction
• K. Emanuel, S. Ravela, E. Vivant, C. Risi: Hurricane risk assessment
• Protein sequence classification
• Fetal heart rate detection
• Ortholog detection
• Authorship, topic, domain identification
• Worms and network traffic analysis

Summary

• A robust method that works when there is no clear data model: English text, music, genome.
• A quick, primitive, and dirty way that (almost) always works when other methods don’t.
• A solid theory behind it.
• When a domain is well understood, it is usually better to combine this with domain-specific methods, perhaps with parameters.

Open Questions & Research Issues

• Other applications: authorship inference, Internet plagiarism detection.
• Better compression algorithms: entropy estimation.
• Conjecture: There are problems that you cannot solve or approximate, whereas simple algorithms usually work in “practice”, but this fact is not provable.
• Provably computable approximation? An obvious example is Shannon information.

Collaborators & Credits:

• Chain letters: C. Bennett, B. Ma
• GenCompress: X. Chen, S. Kwong
• DNACompress: X. Chen, B. Ma, J. Tromp
• Tree programs: Jiang, Kearney, Zhang
• Biological experiments: J. Badger
• Plagiarism, SID: X. Chen, B. McKinnon, A. Seker
• Literature comparison: B. Ma, P. Vitanyi, X. Chen, X. Li
