

Information Processing Letters 64 (1997) 179-185

Partial Occam’s Razor and its applications

Carlos Domingo a, Tatsuie Tsukiji b, Osamu Watanabe c,*

a Departament de LSI, Universitat Politècnica de Catalunya, Barcelona, Spain
b Graduate School of Human Informatics, Nagoya University, Nagoya, Japan
c Department of Computer Science, Tokyo Institute of Technology, Ookayama, Meguro-ku, Tokyo 152-8552, Japan

* Corresponding author. Email: [email protected].

Received 15 August 1997; revised 8 July 1997
Communicated by P.M.B. Vitányi

Keywords: PAC learning; Occam's Razor; Partially consistent hypothesis; Boolean formulas

0020-0190/97/$17.00 © 1997 Elsevier Science B.V. All rights reserved. PII S0020-0190(97)00169-5

1. Partial Occam’s Razor: A new strategy for PAC learning

Occam's Razor is a philosophy claiming that a "succinct" hypothesis that is "consistent" with the observed data may predict a future observation with high accuracy. Blumer et al. [2] formally established this principle in Valiant's PAC learning framework. They proved that Occam algorithms, algorithms producing succinct and consistent hypotheses, are PAC learning algorithms. An output hypothesis of an Occam algorithm must be completely consistent with the given examples. It is sometimes the case, however, that a small proportion of disagreement between the hypothesis and the examples does not reduce the accuracy of the hypothesis too much, while it helps us to reduce the hypothesis size; thereby, several concept classes are more economically learnable by outputting such partially consistent hypotheses. In this paper, we formalize this idea as "partial Occam algorithms", and propose one approach for obtaining a PAC learning algorithm by using a partial Occam algorithm.

It has been observed and used that one can obtain a better learning algorithm by relaxing the consistency of Occam algorithms. A typical such example is learning the class of conjunctions with few, say d, relevant variables. Haussler [7] exhibited an Occam algorithm that learns m examples labeled by a conjunction of d literals by a completely consistent conjunction of O(d log m) literals. ¹ Later on, Warmuth [11, Section 2.2] suggested allowing a small proportion of misclassification in the sample in order to reduce the hypothesis size to O(d log(1/ε)). Such a reduction of the hypothesis size brings a reduction of the sample complexity (see [11, Chapter 2] for details). This paper pushes this direction to an extreme, and admits a produced hypothesis to have close to, but less than, a half proportion of mistakes in a given set of examples. (A half or more proportion of mistakes is too much, since some constant function (either the zero or the one function) already agrees with at least half of the examples.) A partial Occam algorithm is an algorithm that produces a succinct hypothesis that is "partially consistent" (in the above sense) with the sample.

¹ We use log and ln to denote the logarithm of base 2 and e, respectively.

As in the case of Occam algorithms, we show that a partial Occam algorithm can be used as a weak PAC learning algorithm. Kearns et al. [9] used this fact to weakly PAC learn the class of monotone Boolean functions under the "uniform distribution". On the other hand, under the distribution-free model, once a partial Occam algorithm is obtained for some concept class, we can use several remarkable techniques [14,5] to boost it to a usual PAC learning algorithm. Following this approach, we show efficient PAC learning algorithms for some Boolean concept classes, such as k-DNF and decision lists, DL.

Helmbold and Warmuth [8] have also studied a different notion of partial Occam algorithms where, instead of relaxing the consistency condition, they relaxed the condition on the hypothesis size. They showed that allowing the hypothesis to be larger (see the paper for precise details) while keeping full consistency with the sample also implies weak PAC learnability.

Due to the economic use of the weak hypothesis that boosting makes, plus the simplicity of the hypotheses in our applications, we are able to obtain algorithms for k-DNF and DL that match the best known sample complexity while using small hypotheses. Hypotheses that depend on a small number of features can be argued to be semantically simpler than functions over more variables, in the sense that they consider fewer aspects of each example to classify it. Moreover, algorithms that consider the minimum number of features needed to describe a target concept seem to work better in practice than well known machine learning algorithms like ID3 [1]. Moreover, our algorithms can be modified so that they are robust against random classification noise (see [4]).

We consider the problem of learning a Boolean concept represented by a Boolean function of a certain type. We fix, throughout the paper, the number of attributes (or, Boolean variables) n to be some sufficiently large number, and we assume that the domain of our Boolean concepts is X = {0, 1}^n, and concepts are from 2^X, the set of Boolean functions over X. For any concept c ∈ 2^X and any x ∈ X, we use c(x) to denote the membership of x in c.

We first recall the notions of "PAC learning" and "weak PAC learning". In the PAC learning framework, learning algorithms can make use of an oracle EX(c, D) that returns one labeled example (x, c(x)) in unit time, where instance x is drawn at random according to the distribution D.

Definition 1 (Valiant [15]). Let C and H be any two concept classes. A PAC learning algorithm L of C by H is a randomized algorithm that satisfies the following, for any distribution D over X, and for any given c ∈ C, 0 < ε < 1, and 0 < δ < 1: L produces an ε-close hypothesis h ∈ H with probability at least 1 − δ. Here a hypothesis is called ε-close if it satisfies Pr_D{x | h(x) ≠ c(x)} ≤ ε. The left-hand side of this inequality is denoted as error_D(h).

Definition 2 (Kearns and Valiant [10]). Let C and H be any two concept classes. A weak PAC learning algorithm L of C by H is a randomized algorithm such that there exist some 0 < γ0 < 1/2 and 0 < Δ0 < 1 for which the following conditions hold for any distribution D over X and for any c ∈ C: L produces a (1/2 − γ0)-close hypothesis h ∈ H with probability at least Δ0.

Now we define the notion of "partial Occam algorithm" formally.

Definition 3. For any C ⊆ 2^X, an algorithm A is a partial Occam algorithm for C if it satisfies the following for some parameters m0, ε0, δ0, and δ1, where m0 ≥ 1, 0 < ε0 ≤ 1/2, and 0 < δ0 + δ1 < 1.
(1) A takes a sample S of size m0 as an input, and outputs (a representation of) a hypothesis h ∈ 2^X, where S is a set of examples.
(2) Let H ⊆ 2^X be the hypothesis space of A, i.e., the range of the output of A. Then we have |H| exp(−ε0²m0/2) ≤ δ1.
(3) For any c ∈ C, suppose that A is given a sample S consisting of m0 examples (x, c(x)). Then, with probability at least 1 − δ0, A outputs a hypothesis h in H such that

|{(x, c(x)) ∈ S | h(x) ≠ c(x)}| / m0 ≤ 1/2 − ε0

holds, where the probability depends on the internal coin flips of A. (Thus, if A is deterministic, then δ0 = 0.) The left-hand side of the above inequality is called the sample error of h and denoted as mistake_S(h). A hypothesis satisfying the above inequality is called a (1/2 + ε0)-consistent hypothesis.

In one word, a partial Occam algorithm A produces a (conceptually) succinct hypothesis that has at most a 1/2 − ε0 proportion of mistakes in a given sample, with probability at least 1 − δ0. In this definition, we bound the cardinality of the hypothesis space H instead of bounding the description length of the output hypothesis. That is, our condition (2) requires that each hypothesis in H has a succinct index though it may have a long representation.
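Condition (2) pins down how large m0 must be relative to the size of the hypothesis space: rearranging |H| exp(−ε0²m0/2) ≤ δ1 gives m0 ≥ (2/ε0²) ln(|H|/δ1). The following sketch (our own illustration; the function name is not from the paper) evaluates this bound, and with the inputs shown it reproduces the m0 requirements of Lemmas 7 and 12 below.

```python
import math

def min_sample_size(h_size: int, eps0: float, delta1: float) -> int:
    # Smallest m0 with h_size * exp(-eps0**2 * m0 / 2) <= delta1,
    # i.e. m0 >= (2 / eps0**2) * ln(h_size / delta1)  (condition (2)).
    return math.ceil((2.0 / eps0 ** 2) * math.log(h_size / delta1))

# k-DNF setting of Lemma 7: |H| <= (2n)^(k+1), eps0 = 1/(4d), delta1 = 1/e.
n, k, d = 100, 3, 5
print(min_sample_size((2 * n) ** (k + 1), 1.0 / (4 * d), 1.0 / math.e))

# Decision-list setting of Lemma 12: |H| = 4n, eps0 = 1/2^(d+2), delta1 = 1/e.
print(min_sample_size(4 * n, 2.0 ** -(d + 2), 1.0 / math.e))
```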

From the analogy with Occam algorithms, one might expect that a hypothesis produced by a partial Occam algorithm can predict the target concept to some extent. We prove such a property formally.

Theorem 4. Let A be a partial Occam algorithm for C w.r.t. parameters m0, ε0, δ0, and δ1. Then for any distribution D over X, and for any c ∈ C, if A is given a sample of size m0 that is randomly generated w.r.t. c and D, then with probability at least 1 − (δ0 + δ1), A outputs a hypothesis h with the following property: Pr_D{x | h(x) ≠ c(x)} ≤ 1/2 − ε0/2. That is, A is a weak PAC learning algorithm with γ0 = ε0/2 and Δ0 = 1 − (δ0 + δ1).

Proof. We follow the proof of Occam's razor as it appeared in [2], together with the modification suggested by Warmuth in [11, Section 2.2].

First, we estimate the difference between a randomly drawn sample and the universe. Let S be the set of m0 examples (x, c(x)) whose instances x are drawn randomly from X according to distribution D. From the Chernoff bound (see [11, Theorem 9.2]), for any Boolean function f ∈ 2^X and any a > 0, we have Pr_S{error_D(f) > mistake_S(f) + a} ≤ exp(−2a²m0).

Let H_S ⊆ H be the set of hypotheses h that are (1/2 + ε0)-consistent with S, and let h_S be the output hypothesis of A on input S. Then,

\[
\begin{aligned}
\Pr_S\{\mathrm{error}_D(h_S) > 1/2 - \varepsilon_0 + a\}
 \le{}& \Pr_S\{h_S \notin \mathcal{H}_S\}
      + \Pr_S\{[\mathrm{error}_D(h_S) > \mathrm{mistake}_S(h_S) + a] \wedge [h_S \in \mathcal{H}_S]\} \\
 \le{}& \Pr_S\{h_S \notin \mathcal{H}_S\}
      + \Pr_S\{\exists h \in \mathcal{H}_S\ [\mathrm{error}_D(h) > \mathrm{mistake}_S(h) + a]\} \\
 \le{}& \Pr_S\{h_S \notin \mathcal{H}_S\}
      + \sum_{h \in \mathcal{H}} \Pr_S\{\mathrm{error}_D(h) > \mathrm{mistake}_S(h) + a\} \\
 \le{}& \delta_0 + |\mathcal{H}| \exp(-2a^2 m_0).
\end{aligned}
\]

Letting a = ε0/2, the right-hand side becomes δ0 + |H| exp(−ε0²m0/2), which is at most δ0 + δ1 by condition (2) of Definition 3; the theorem follows. □

By using an argument similar to the one showing the relationship between PAC learning algorithms and Occam algorithms [3], we can show the converse; that is, a weak PAC learning algorithm is indeed a partial Occam algorithm. Thus, a concept class is weakly learnable if and only if it has a partial Occam algorithm. (Since the proof is similar to the one in [3], we omit it here; see [4].)

Once a weak PAC learning algorithm is obtained, we can use several techniques to boost it to a PAC learning algorithm. Schapire [14] first investigated a technique for efficiently boosting weak PAC learning algorithms to PAC learning algorithms. Later, Freund [5] gave an improved version, which is stated here.

Theorem 5 (Freund [5, Corollary 3.3]). For any C ⊆ 2^X, let L be a deterministic weak PAC learning algorithm with parameters γ0 and Δ0, and let m0 be the number of examples required by L. Then one can construct a PAC learning algorithm for C with the following complexity.

• Sample complexity:

\[
m = O\Bigl(\frac{1}{\varepsilon}\ln\frac{1}{\delta}
      + \frac{m_0}{\varepsilon\gamma_0^2}\Bigl(\ln\frac{m_0}{\gamma_0^2}\Bigr)^{2}\Bigr).
\]

• Hypothesis size: a produced hypothesis is the majority of s = O(log m / γ0²) hypotheses produced by L.

• Time complexity: O((s/Δ0) ln(s/δ)) × (L's running time).

The following theorem is obtained by just substituting the parameter values derived from Theorem 4 into the parameters in the above theorem.

Theorem 6. For any C ⊆ 2^X, let A be a deterministic partial Occam algorithm with parameters m0, ε0, and δ1. (Since A is deterministic, δ0 = 0.) Then one can construct a PAC learning algorithm for C with sample complexity

\[
O\Bigl(\frac{1}{\varepsilon}\Bigl(\log\frac{1}{\delta}
      + \frac{m_0}{\varepsilon_0^2}\Bigl(\ln\frac{m_0}{\varepsilon_0^2}\Bigr)^{2}\Bigr)\Bigr)
\]

for any given 0 < ε < 1 and 0 < δ < 1. (The hypothesis size and time complexity are derived similarly.)
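To get a feel for how this bound grows, it can be evaluated directly. The sketch below is our own illustration of the displayed formula as reconstructed above, with all hidden constants dropped, so only relative comparisons between parameter settings are meaningful.

```python
import math

def theorem6_sample_bound(m0: float, eps0: float, eps: float, delta: float) -> float:
    # Order-of-magnitude evaluation of the Theorem 6 bound
    # (1/eps) * (log(1/delta) + (m0/eps0^2) * (ln(m0/eps0^2))^2);
    # constant factors are omitted.
    m_prime = m0 / eps0 ** 2
    return (math.log2(1.0 / delta) + m_prime * math.log(m_prime) ** 2) / eps

# Plugging in the Lemma 7 parameters (m0 = 32 d^2 ((k+1) ln 2n + 1), eps0 = 1/4d)
# makes m0/eps0^2 grow as d^4 k log n, the quantity called m0' in Theorem 8.
n, k, d = 100, 3, 5
m0 = 32 * d ** 2 * ((k + 1) * math.log(2 * n) + 1)
print(theorem6_sample_bound(m0, 1.0 / (4 * d), eps=0.1, delta=0.05))
```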


2. Applications

We have introduced the notion of "partial Occam algorithm", and proposed one approach for designing PAC learning algorithms; i.e., first construct a partial Occam algorithm, and then boost it to a PAC learning algorithm. Though simple, this approach sometimes provides us with a way to obtain an efficient PAC learning algorithm. Here we show two such examples.

First we consider the problem of learning k-DNF, the class of disjunctive normal form formulas with each term consisting of at most k literals. We note that every k-DNF formula has some term that can be used as its weak hypothesis, and that such a term can be found by searching all possible terms of size at most k. From these observations, we obtain our PAC learning algorithm. In the following, we use d to denote the number of terms of a given target formula.

Valiant's original paper [15] already analyzed PAC learnability of this class. He gave an algorithm for learning k-DNF that uses O(n^k) examples (for simplicity, we omit the dependency on ε and δ in the following discussion). Littlestone improved this dramatically with his algorithm Winnow, which is designed for on-line learning [12]. Winnow's sample complexity, after converting it to a PAC algorithm, is O(kd log n). The output hypothesis of Winnow is a linear threshold function with O(n^k) components, each component being a conjunction of at most k literals. We show a PAC learning algorithm with similar sample complexity. However, our algorithm improves the size of the output hypothesis; it outputs a hypothesis that is a majority of O(d² log(dk log n)) terms consisting of at most k literals, while Winnow's hypothesis is exponential in k, i.e., it has O(n^k) components. Note also that our algorithm is non-proper like Winnow, i.e., the hypothesis class and the target class are different.

As discussed previously, what we need is a partial Occam algorithm for k-DNF.

Lemma 7. There exists a deterministic partial Occam algorithm for k-DNF with parameters m0 ≥ 32d²((k+1) ln(2n) + 1), ε0 = 1/4d, and δ1 = 1/e, where d is the number of terms in a given target k-DNF. It runs in time O(m0 n^k).

Proof. The idea of our algorithm is based on the following observation. Suppose that we are given a sample of size m0 w.r.t. a given target k-DNF formula of d terms. Let m_p and m_n, respectively, be the number of positive examples and negative examples; hence, m0 = m_p + m_n. We observe that there is a term of size at most k which is partially consistent with the sample. The same idea was sketched by Schapire [14, Section 5.3] when proving the weak learnability of k-term DNF.

Clearly, if either m_p or m_n is bigger than m0/2 + m0/4d, then a constant formula (either 0 or 1) is 1/4d-consistent with the sample. Thus, consider the nontrivial case, m0/2 − m0/4d ≤ m_p, m_n ≤ m0/2 + m0/4d. Since our target has d terms, there exists one term in the target formula that is satisfied by at least m_p/d of the positive examples in the sample and is falsified by all the m_n negative examples. Such a term can be used as a partially consistent hypothesis. If m_n ≥ m0/2, then the term is consistent with at least m0/2 + (m0/2 − m0/4d)/d ≥ m0/2 + m0/4d examples in the sample. On the other hand, if m_p ≥ m0/2, then the term is consistent with at least (m0/2 − m0/4d) + m0/2d = m0/2 + m0/4d examples. Since the target formula is a k-DNF, we have a term of length ≤ k that is consistent with at least m0/2 + m0/4d examples in the sample.

Our partial Occam algorithm finds such a term by exhaustive search in time O(n^(k+1)) among all the terms of size at most k (including the two constant formulas 0 and 1) and chooses the one that is consistent with the largest number of examples from the sample. It is now clear from the above observation that it always finds a hypothesis that is at least (1/2 + 1/4d)-consistent. Notice that our hypothesis class is the class of terms with at most k literals and that its cardinality is bounded by (2n)^(k+1). Thus, our choice of m0 and δ1 satisfies condition (2) of Definition 3. □
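The exhaustive search in this proof is short enough to state as code. The sketch below is our own illustration, not the authors' implementation; an example is a pair (x, label) with x a 0/1 tuple, and a term is a tuple of literals, +i for x_i and -i for its negation.

```python
from itertools import combinations, product

def term_value(term, x):
    # term is a tuple of literals: +i means x_i, -i means NOT x_i (1-indexed).
    # The empty term is the constant-1 formula.
    return all((x[abs(l) - 1] == 1) if l > 0 else (x[abs(l) - 1] == 0) for l in term)

def partial_occam_kdnf(sample, n, k):
    # Return the term of size <= k (or a constant formula) agreeing with the
    # largest number of labeled examples; by the argument of Lemma 7 this
    # agreement is at least (1/2 + 1/4d)|sample| when the labels come from a
    # d-term k-DNF.
    literals = list(range(1, n + 1)) + [-i for i in range(1, n + 1)]
    best_term, best_agree = (), -1
    for size in range(0, k + 1):
        for term in combinations(literals, size):
            if any(-l in term for l in term):      # skip contradictory terms
                continue
            agree = sum(int(term_value(term, x)) == y for x, y in sample)
            if agree > best_agree:
                best_term, best_agree = term, agree
    agree_const0 = sum(y == 0 for _, y in sample)  # the constant-0 formula
    if agree_const0 > best_agree:
        return "const-0", agree_const0
    return best_term, best_agree

# Tiny usage example: the target is the 2-term 2-DNF (x1 AND x2) OR (NOT x3).
n, k = 4, 2
target = lambda x: int((x[0] and x[1]) or (not x[2]))
sample = [(x, target(x)) for x in product((0, 1), repeat=n)]
print(partial_occam_kdnf(sample, n, k))
```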

By substituting the values of Lemma 7 for the parameters in Theorem 6, we obtain the following result.
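Spelled out, the substitution rests on the following estimate (a worked step we add for clarity):

\[
\frac{m_0}{\varepsilon_0^2}
  \;=\; 16d^2 \cdot 32d^2\bigl((k+1)\ln(2n) + 1\bigr)
  \;=\; O\bigl(d^4 k \log n\bigr) \;=\; O(m_0').
\]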

Theorem 8. There is a PAC learning algorithm for k-DNF whose running time is polynomial in n^k, d, 1/ε, and log(1/δ), and whose sample complexity is

\[
O\Bigl(\frac{1}{\varepsilon}\Bigl(\log\frac{1}{\delta} + m_0'\,(\ln m_0')^{2}\Bigr)\Bigr),
\]

where m0' = d^4 k log n. (With this choice of m0', we have m0/ε0² = O(m0').) The output hypothesis is the majority of

\[
O\Bigl(d^2\Bigl(\log\frac{1}{\varepsilon} + \log\log\frac{1}{\delta} + \log m_0'\Bigr)\Bigr)
\]

terms of size at most k.

Next we consider decision lists, DL. A decision list c of length d is a sequence (l_1, b_1), ..., (l_d, b_d), (l_{d+1}, b_{d+1}), where for each i ∈ {1, ..., d}, l_i is a literal, b_i ∈ {0, 1}, l_{d+1} is the constant 1, and b_{d+1} ∈ {0, 1}. For every x ∈ X, c(x) is defined to be b_i, where i is the smallest index such that l_i(x) = 1.
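Read operationally, c(x) is the output bit of the first node whose literal is satisfied by x. A minimal evaluation routine (our own sketch, using 0 to encode the constant-1 default literal and the same +i/-i literal encoding as in the k-DNF sketch above) looks as follows.

```python
def literal_value(lit, x):
    # +i means x_i, -i means NOT x_i (1-indexed); 0 means the constant 1.
    return True if lit == 0 else x[abs(lit) - 1] == (1 if lit > 0 else 0)

def eval_decision_list(dl, x):
    # dl is a list of (literal, bit) pairs whose last literal is 0 (constant 1);
    # the output is the bit of the first node whose literal is satisfied by x.
    return next(b for lit, b in dl if literal_value(lit, x))

# Example: c = (x1, 1), (NOT x3, 0), (1, 1) evaluated at x = (0, 1, 0).
print(eval_decision_list([(1, 1), (-3, 0), (0, 1)], (0, 1, 0)))   # prints 0
```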

Rivest [13] exhibited an Occam algorithm for DL, which gives us a polynomial-time PAC learning algorithm for DL. This PAC learning algorithm requires O(n log n) examples. (Again we ignore the accuracy and confidence parameters.) This sample complexity can be reduced if the target decision list is short, i.e., d is small, and it depends on a small number of variables. Trivially, exhaustively searching among all the decision lists of length d, in total n^O(d) of them, finds one that is consistent with a given training sample. Although the sample complexity of this naive algorithm is, by Occam's razor [2], O(d log n), its running time n^O(d) is big. A reasonable goal of improvement may be, as Warmuth has recently suggested [16], to find a polynomial-time PAC learning algorithm whose sample complexity grows only polynomially in d log n.

Littlestone's Winnow [12] can also be used to learn DL. Any DL can be represented as a threshold formula. However, its mistake bound depends on the separation parameter, and for the case of DL represented as threshold formulas, this could be exponential in d. Thus, after converting Winnow from on-line learning to off-line (PAC) learning, we obtain a sample size of O(2^(2d) log n).

We present here a non-proper PAC learning algorithm for DL whose running time is bounded by a polynomial in n and 2^d (and 1/ε and log(1/δ)), and whose sample complexity is O(16^d log n (d + log log n)²). Again, we achieve sample complexity similar to Winnow's, but our hypothesis size is more efficient w.r.t. the number of relevant variables. While our algorithm outputs as a hypothesis the majority of O(d 2^(2d)) literals, Winnow's hypothesis for this case is a threshold function over n variables. Hence, the hypothesis produced by our algorithm is more succinct if d is smaller than log n / log log n.

For "proper" learning, Hancock et al. [6] recently proved that the decision lists of length at most d are not polynomial-time PAC learnable by decision lists of length at most d·2^(log^δ d), for any δ < 1, unless NP ⊆ DTIME[2^polylog(n)].

Now we explain our learning algorithm. Again we first observe that a rather simple decision list can be used as a partially consistent hypothesis. The key observation is the following combinatorial lemma: for obtaining a small decision list c' that is partially consistent with a sample generated from c, it is enough to pick one literal used in c. Thus, such a literal can be used as a partially consistent hypothesis, and the construction of the desired PAC learning algorithm follows from Theorem 6.

Consider any decision list c = (l_1, b_1), ..., (l_{d+1}, b_{d+1}) as a target and a sample S consistent with c. For each 1 ≤ i ≤ d + 1, we denote by w_i the number of instances x from the sample S that stop at the ith leaf of the decision list, i.e.,

\[
w_i = |\{x \in S \mid l_1(x) = 0 \wedge \cdots \wedge l_{i-1}(x) = 0 \wedge l_i(x) = 1\}|.
\]

By p_i^+ (respectively, p_i^-), we denote the number of instances x in the sample S such that l_i(x) = 1 and b_i = c(x) (respectively, l_i(x) = 1 and b_i ≠ c(x)).

Our goal is to prove that for a given sample S of m examples, there always exists one literal l_i in the target such that the difference between p_i^+ and p_i^- is bounded from below by some function of d; in other words, such a literal breaks the bias of the sample. This is formalized in the following lemma.

Lemma 9. For any decision list c = (l_1, b_1), ..., (l_{d+1}, b_{d+1}), and any set S of m examples consistent with c, there exists a literal l_i, 1 ≤ i ≤ d + 1, such that p_i^+ − p_i^- ≥ m/2^(d+1).

Proof. We first prove the following two claims.

Claim 10. For any i, 1 ≤ i ≤ d + 1, w_i − Σ_{k<i} w_k ≤ p_i^+ − p_i^-.

Proof. First, by the definitions of w_i and p_i^+, we have p_i^+ ≥ w_i. On the other hand, every instance x of S such that l_i(x) = 1 and c(x) ≠ b_i must satisfy l_k(x) = 1 and c(x) = b_k for some k < i; otherwise, x is misclassified. Hence, p_i^- ≤ Σ_{k<i} w_k. Therefore, p_i^+ − p_i^- ≥ w_i − Σ_{k<i} w_k. □

Claim 11. If w_i − Σ_{k<i} w_k ≤ ε for all i, 1 ≤ i ≤ d + 1, then w_i ≤ 2^(i−1) ε.

Proof. We show the claim by induction. From the assumption, w_1 ≤ ε, which proves the base case of the induction. For the induction step, suppose that the claim holds for every k < i. Then we have w_i ≤ Σ_{k<i} w_k + ε ≤ ε + 2ε + 4ε + ··· + 2^(i−2)ε + ε = 2^(i−1)ε. □

Now we are ready to prove the lemma. Suppose the lemma is false; that is, for every 1 ≤ i ≤ d + 1, we have p_i^+ − p_i^- < m/2^(d+1). By Claim 10 this implies that for every 1 ≤ i ≤ d + 1, we have w_i − Σ_{k<i} w_k < m/2^(d+1). It then follows from Claim 11 that for every 1 ≤ i ≤ d + 1, we have w_i < 2^(i−1)(m/2^(d+1)). Now consider the sum over all instances falling into the different leaves of c; we have

\[
\sum_{i=1}^{d+1} w_i \;<\; \frac{m}{2^{d+1}} + \frac{2m}{2^{d+1}} + \cdots + \frac{2^{d}m}{2^{d+1}} \;<\; m,
\]

which contradicts the fact that Σ_{i=1}^{d+1} w_i is the total number of instances in the sample S. □
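As a small worked check of Lemma 9 (our own example): let d = 2 and c = (x1, 1), (x2, 0), (1, 1) over two variables, and let S consist of all four assignments. Then c labels (1,1), (1,0) and (0,0) positive and (0,1) negative, so w_1 = 2, w_2 = 1, w_3 = 1, and for the first node p_1^+ = 2 and p_1^- = 0. Hence p_1^+ − p_1^- = 2 ≥ m/2^(d+1) = 4/8, as the lemma guarantees.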

Lemma 9 implies that just one literal covers slightly more than half of the sample. Using this fact, we can construct a partial Occam algorithm for DL.

Lemma 12. There exists a partial Occam algorithm for DL with parameters ε0 = 1/2^(d+2), δ1 = 1/e, and m0 ≥ 2^(2d+5)(ln(4n) + 1), where d is the length of a given target decision list. The algorithm runs in time O(n m0).

Proof. The algorithm works as follows. Since we know that there exist a literal l_i and a value b_i ∈ {0, 1} such that p_i^+ − p_i^- ≥ m0/2^(d+1), we exhaustively search over all 2n possible literals and the constant 1, and choose the one for which p_i^+ − p_i^- is maximized. Assume that the literal l_i satisfies this condition. Then we output the hypothesis (l_i, b_i), (1, b), where b = 1 if and only if at least half of the instances in the sample that do not satisfy l_i are positive. Now we calculate the number of instances in the sample misclassified by this hypothesis. Suppose that m_p + m_n instances of the sample satisfy l_i, where m_p of them have label b_i and m_n of them have the opposite label. By Lemma 9, and since l_i is the one that has the biggest value of p_i^+ − p_i^-, we have m_p − m_n ≥ m0/2^(d+1). Thus, the number of instances correctly classified by our hypothesis is at least m_p + (m0 − (m_p + m_n))/2, which is at least m0/2 + m0/2^(d+2). The size of the hypothesis class is 4n, and hence our choice of m0 and δ1 satisfies condition (2) of Definition 3. The running time is easily seen to be O(n m0). □
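The search described in this proof can be rendered directly. The sketch below is our own illustration (all function names are ours); the literal encoding is the same as in the earlier decision-list snippet, and the helpers are repeated so that the example is self-contained.

```python
from itertools import product

def literal_value(lit, x):
    # +i means x_i, -i means NOT x_i (1-indexed); 0 means the constant 1.
    return True if lit == 0 else x[abs(lit) - 1] == (1 if lit > 0 else 0)

def eval_decision_list(dl, x):
    return next(b for lit, b in dl if literal_value(lit, x))

def partial_occam_dl(sample, n):
    # Pick the literal l (or the constant 1) and bit b maximizing p+ - p- over
    # the sample, then return the two-node list (l, b), (1, b') where b' is the
    # majority label of the examples not satisfying l (cf. the proof above).
    literals = [0] + list(range(1, n + 1)) + [-i for i in range(1, n + 1)]
    gap, lit, b = max(
        (sum(1 if y == bb else -1 for x, y in sample if literal_value(ll, x)), ll, bb)
        for ll in literals for bb in (0, 1))
    rest = [y for x, y in sample if not literal_value(lit, x)]
    default = 1 if 2 * sum(rest) >= len(rest) else 0
    return [(lit, b), (0, default)]

# Check it against the target c = (x1, 1), (NOT x3, 0), (1, 1) over n = 3 variables.
target = [(1, 1), (-3, 0), (0, 1)]
sample = [(x, eval_decision_list(target, x)) for x in product((0, 1), repeat=3)]
h = partial_occam_dl(sample, 3)
agree = sum(eval_decision_list(h, x) == y for x, y in sample)
print(h, agree, "of", len(sample))   # agreement is at least 1/2 + 1/2^(d+2)
```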

From this lemma together with Theorem 6, we can conclude the following.
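The substitution into Theorem 6 again uses a short estimate (a worked step we add):

\[
\frac{m_0}{\varepsilon_0^2}
  \;=\; 2^{2d+4}\cdot 2^{2d+5}\bigl(\ln(4n)+1\bigr)
  \;=\; O\bigl(16^d \log n\bigr) \;=\; O(m_0').
\]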

Theorem 13. There is a PAC learning algorithm for DL whose running time is polynomial in 2^d, n, 1/ε, and log(1/δ) (where d is the length of a given target decision list), and whose sample complexity is

\[
O\Bigl(\frac{1}{\varepsilon}\Bigl(\log\frac{1}{\delta} + m_0'\,(\ln m_0')^{2}\Bigr)\Bigr),
\]

where m0' = 16^d log n. (With this choice of m0', we have m0/ε0² = O(m0').) The output hypothesis is the majority of

\[
O\Bigl(2^{2d}\Bigl(\log\frac{1}{\varepsilon} + \log\log\frac{1}{\delta} + \log m_0'\Bigr)\Bigr)
\]

literals.

Acknowledgements

The authors would like to thank Vijay Raghavan, Víctor Lavín and José Balcázar for helpful discussions, and the referees of ALT'97 and IPL for their comments on improving the presentation of the paper.

References

[1] H. Almuallim, T.G. Dietterich, Learning Boolean concepts in the presence of many irrelevant features, Artif. Intell. 69 (1) (1994) 279-306.
[2] A. Blumer, A. Ehrenfeucht, D. Haussler, M.K. Warmuth, Occam's razor, Inform. Process. Lett. 24 (1987) 377-380.
[3] R. Board, L. Pitt, On the necessity of Occam algorithms, Theoret. Comput. Sci. 100 (1992) 157-184.
[4] C. Domingo, T. Tsukiji, O. Watanabe, Partial Occam's razor and its applications, in: Proc. ALT'97, Sendai, Japan, October 1997 (to appear).
[5] Y. Freund, Boosting a weak learning algorithm by majority, Inform. and Comput. 121 (1995) 256-285.
[6] T. Hancock, T. Jiang, M. Li, J. Tromp, Lower bounds on learning decision lists and trees, in: Proc. 12th Ann. Symp. on Theoretical Aspects of Computer Science, Lecture Notes in Computer Science, Springer, Berlin, 1995, pp. 527-538.
[7] D. Haussler, Quantifying inductive bias: AI learning algorithms and Valiant's learning framework, Artif. Intell. 36 (1988) 177-221.
[8] D. Helmbold, M.K. Warmuth, On weak learning, J. Comput. System Sci. 50 (1995) 551-573.
[9] M.J. Kearns, M. Li, L.G. Valiant, Learning Boolean formulas, J. ACM 41 (1994) 1298-1328.
[10] M.J. Kearns, L.G. Valiant, Cryptographic limitations on learning Boolean formulae and finite automata, J. ACM 41 (1994) 67-95.
[11] M.J. Kearns, U.V. Vazirani, An Introduction to Computational Learning Theory, MIT Press, Cambridge, MA, 1994.
[12] N. Littlestone, Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm, Machine Learning 2 (1988) 285-318.
[13] R.L. Rivest, Learning decision lists, Machine Learning 2 (1987) 229-246.
[14] R.E. Schapire, The strength of weak learnability, Machine Learning 5 (1990) 197-227.
[15] L.G. Valiant, A theory of the learnable, Comm. ACM 27 (1984) 1134-1142.
[16] M.K. Warmuth, Posted to the COLT mailing list.