Learning Problems in Theorem Proving
Cezary Kaliszyk
Universität Innsbruck
LAIVe Summer School, July 3, 2017
Computer Theorem Proving: Historical Context
• 1940s: Algorithmic proof search (λ-calculus)
• 1960s: de Bruijn’s Automath
• 1970s: Small Certifiers (LCF)
• 1990s: Resolution (Superposition)
• 2000s: Large theories
• 2010s: ?
C. Kaliszyk (Universitat Innsbruck) Learning Problems in Theorem Proving 2/72
Lecture content
Theorem Proving Introduction
Machine Learning Problems in Theorem Proving
Premise Selection
Useful Intermediate Steps
Theorem Names
Internal Guidance
Theorem Proving Introduction
Outline
Theorem Proving Introduction
Machine Learning Problems in Theorem Proving
Premise Selection
Useful Intermediate Steps
Theorem Names
Internal Guidance
The Kepler Conjecture (year 1611)
The most compact way of stacking balls of the same size in space is a pyramid.
V = π/√18 ≈ 74%
The Kepler Conjecture (year 1611)
Proved in 1998
• Tom Hales, 300-page proof using computer programs
• Submitted to the Annals of Mathematics
• 99% correct... but we cannot verify the programs
1039 equalities and inequalities
For example:
(−x1x3 − x2x4 + x1x5 + x3x6 − x5x6 + x2(−x2 + x1 + x3 − x4 + x5 + x6))
/ √( 4x2 ( x2x4(−x2 + x1 + x3 − x4 + x5 + x6)
         + x1x5(x2 − x1 + x3 + x4 − x5 + x6)
         + x3x6(x2 + x1 − x3 + x4 + x5 − x6)
         − x1x3x4 − x2x3x5 − x2x1x6 − x4x5x6 ) )
< tan(π/2 − 0.74)
The Kepler Conjecture (year 1611)
Solution? Formalized Proof!
• Formalize the proof using Proof Assistants
• Implement the computer code in the system
• Prove the code correct
• Run the programs inside the Proof Assistant
Flyspeck Project
• Completed 2015
• Many Proof Assistants and contributors
What is a Proof Assistant? (1/2)
A Proof Assistant is
• a computer program
• to assist a mathematician
• in the production of a proof
• that is mechanically checked
What does a Proof Assistant do?
• Keep track of theories, definitions, assumptions
• Interaction - proof editing
• Proof checking
• Automation - proof search
What does it implement? (And how?)
• a formal logical system intended as a foundation for mathematics
• decision procedures
theorem sqrt2_not_rational: "sqrt (real 2) ∉ ℚ"
proof
  assume "sqrt (real 2) ∈ ℚ"
  then obtain m n :: nat where
    n_nonzero: "n ≠ 0" and sqrt_rat: "¦sqrt (real 2)¦ = real m / real n"
    and lowest_terms: "gcd m n = 1" ..
  from n_nonzero and sqrt_rat have "real m = ¦sqrt (real 2)¦ * real n" by simp
  then have "real (m²) = (sqrt (real 2))² * real (n²)"
    by (auto simp add: power2_eq_square)
  also have "(sqrt (real 2))² = real 2" by simp
  also have "... * real (n²) = real (2 * n²)" by simp
  finally have eq: "m² = 2 * n²" ..
  hence "2 dvd m²" ..
  with two_is_prime have dvd_m: "2 dvd m" by (rule prime_dvd_power_two)
  then obtain k where "m = 2 * k" ..
  with eq have "2 * n² = 2² * k²" by (auto simp add: power2_eq_square mult_ac)
  hence "n² = 2 * k²" by simp
  hence "2 dvd n²" ..
  with two_is_prime have "2 dvd n" by (rule prime_dvd_power_two)
  with dvd_m have "2 dvd gcd m n" by (rule gcd_greatest)
  with lowest_terms have "2 dvd 1" by simp
  thus False by arith
qed
de Bruijn factor
Intel Pentium® P5 (1994)
Superscalar; Dual integer pipeline; Faster floating-point, ...
4 195 835 / 3 145 727 = 1.333820...  (correct)
4 195 835 / 3 145 727 = 1.333739...  (computed by the P5)
FPU division lookup table: for certain inputs the division result is wrong
Replacement
• Few customers cared, but it still cost $475 million
• Testing and model checking insufficient:
  • Since then, Intel and AMD processors formally verified
  • HOL Light and ACL2 (along with other techniques)
Typical proof assistant problem
• Does there exist a function f from R to R, such that for all x and y, f(x + y²) − f(x) ≥ y?
1. f(x + y²) − f(x) ≥ y for any given x and y
2. f(x + n·y²) − f(x) ≥ n·y for any x, y, and n ∈ N
   (easy induction using [1] for the step case)
3. f(1) − f(0) ≥ m + 1 for any m ∈ N
   (set n = (m + 1)², x = 0, y = 1/(m + 1) in [2]; then n·y² = 1 and n·y = m + 1)
4. Contradiction with the Archimedean property of R
Formalization
let lemma =
 (‘∀f:real→real. ¬(∀x y. f(x + y * y) − f(x) ≥ y)‘,
  REWRITE_TAC[real_ge] THEN REPEAT STRIP_TAC THEN
SUBGOAL_THEN ‘∀n x y. &n * y ≤ f(x + &n * y * y) − f(x)‘ MP_TAC THENL
[MATCH_MP_TAC num_INDUCTION THEN SIMP_TAC[REAL_MUL_LZERO; REAL_ADD_RID] THEN
REWRITE_TAC[REAL_SUB_REFL; REAL_LE_REFL; GSYM REAL_OF_NUM_SUC] THEN
GEN_TAC THEN REPEAT(MATCH_MP_TAC MONO_FORALL THEN GEN_TAC) THEN
FIRST_X_ASSUM(MP_TAC o SPECL [‘x + &n * y * y‘; ‘y:real‘]) THEN
SIMP_TAC[REAL_ADD_ASSOC; REAL_ADD_RDISTRIB; REAL_MUL_LID] THEN
REAL_ARITH_TAC;
X_CHOOSE_TAC ‘m:num‘ (SPEC ‘f(&1) − f(&0):real‘ REAL_ARCH_SIMPLE) THEN
DISCH_THEN(MP_TAC o SPECL [‘SUC m EXP 2‘; ‘&0‘; ‘inv(&(SUC m))‘]) THEN
REWRITE_TAC[REAL_ADD_LID; GSYM REAL_OF_NUM_SUC; GSYM REAL_OF_NUM_POW] THEN
REWRITE_TAC[REAL_FIELD ‘(&m + &1) pow 2 * inv(&m + &1) = &m + &1‘;
REAL_FIELD ‘(&m + &1) pow 2 * inv(&m + &1) * inv(&m + &1) = &1‘] THEN
ASM_REAL_ARITH_TAC]);;
HOL(y)Hammer: general-purpose proof assistant automation
• Machine Learning
• Automated Reasoning
What is a Proof Assistant? (2/2)
• Keep track of theories, definitions, assumptions
  • set up a theory that describes mathematical concepts (or models a computer system)
  • express logical properties of the objects
• Interaction - proof editing
  • typically interactive
  • specified theory and proofs can be edited
  • provides information about required proof obligations
  • allows further refinement of the proof
  • often manually providing a direction in which to proceed
• Automation - proof search
  • various strategies
  • decision procedures
• Proof checking
  • checking of complete proofs
  • sometimes providing certificates of correctness
• Why should we trust it?
  • small core
Can a Proof Assistant do all proofs?
Decidability!
• Validity of formulas is undecidable (for non-trivial logical systems)
Automated Theorem Provers
• Specific domains
• Adjust your problem
• Answers: Valid (Theorem with proof)
• Or: Countersatisfiable (possibly with a counter-model)
Proof Assistants
• Generally applicable
• Direct modelling of problems
• Interactive
Tool Categories
Computer Algebra
• Solving equations, simplifications, numerical approximations
• Maple, Mathematica, . . .
Model Checkers
• State space abstraction
• Spin, Uppaal, . . .
ATPs
• Built-in automation (model elimination, resolution)
• ACL2, Vampire, Eprover, SPASS, . . .
Theorems and programs that use ITP/ATP
Software and Hardware
• Processors and Chips
• Security Protocols
• Project Cristal (CompCert)
• L4.verified
• Java Bytecode
Mathematical Theorems
• Kepler Conjecture (2015)
• 4 color theorem
• Feit-Thompson
• Robbins conjecture, AIM
Wiedijk’s 100
Machine Learning Problems in Theorem Proving
Outline
Theorem Proving Introduction
Machine Learning Problems in Theorem Proving
Premise Selection
Useful Intermediate Steps
Theorem Names
Internal Guidance
Fast progress in machine learning
Tasks involving logical inference
• Natural language question answering [Sukhbaatar+2015 ]
• Knowledge base completion [Socher+2013 ]
• Automated translation [Wu+2016 ]
Games
AlphaGo: problems similar to proving [Silver+2016]
• Node evaluation
• Policy decisions
Computer Vision
Better than human performance on some tasks [Russakovsky+2015 ]
A lot of AI, but little use of learning in proving
AI theorem proving techniques
High-level AI guidance
• tactic selection and premise (lemma) selection
• based on suitable features (characterizations) of the formulas
• and on learning lemma-relevance from many related proofs
Mid-level AI guidance
• learn good ATP strategies/tactics/heuristics for classes of problems
• learning lemma and concept re-use
• learn conjecturing
Low-level AI guidance
• guide (almost) every inference step by previous knowledge
• good proof-state characterization and fast relevance
Problems for Machine Learning
• Is a statement useful?
• For a conjecture
• What are the dependencies of a statement? (premise selection)
• Is a statement important? (named)
• What should the next proof step be?
• Tactic? Instantiation?
• How to name a statement?
• What new problem is likely to be true?
• Intermediate statement for a conjecture
Premise Selection
Outline
Theorem Proving Introduction
Machine Learning Problems in Theorem Proving
Premise Selection
Useful Intermediate Steps
Theorem Names
Internal Guidance
Premise selection
Intuition
Given:
• set of theorems T (together with proofs)
• conjecture c
Find:
• minimal subset of T that can be used to prove c
More formally
arg min_{t ⊆ T} { |t| : t ⊢ c }
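Read as a search problem, this can be sketched in a few lines; `proves` below is a hypothetical oracle standing in for an ATP call, which is what makes the brute-force search exponential and motivates learned approximations:

```python
from itertools import combinations

def minimal_premises(T, proves, c):
    """Brute-force search for a smallest t ⊆ T with t ⊢ c.
    `proves(t, c)` is a stand-in oracle for an ATP call; real premise
    selection has to approximate this, as the search is exponential in |T|."""
    for size in range(len(T) + 1):
        for t in combinations(sorted(T), size):
            if proves(set(t), c):
                return set(t)
    return None  # c is not provable from any subset of T
```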
C. Kaliszyk (Universitat Innsbruck) Learning Problems in Theorem Proving 23/72
Premise Selection
In machine learning terminology
Multi-label classification
Input: set of samples S, where samples are triples (s, F(s), L(s))
• s is the sample ID
• F (s) is the set of features of s
• L(s) is the set of labels of s
Output: function f that predicts n labels (sorted by relevance) for a set of features
Sample add_comm (a + b = b + a) could have:
• F(add_comm) = {“+”, “=”, “num”}
• L(add_comm) = {num_induct, add_0, add_suc, add_def}
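As a data structure, such a training set might be represented as below; the feature and label names mirror the add_comm example, and the exact encoding is illustrative:

```python
# Each training sample is a triple (ID, F(s), L(s)): the sample ID,
# its set of features, and its set of labels (premises used in its proof).
samples = {
    "add_comm": ({"+", "=", "num"},
                 {"num_induct", "add_0", "add_suc", "add_def"}),
}

def features(s):
    """F(s): the feature set of sample s."""
    return samples[s][0]

def labels(s):
    """L(s): the label (premise) set of sample s."""
    return samples[s][1]
```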
Not exactly the usual machine learning problem
Labels correspond to premises and samples to theorems
• Very often the same
Similar theorems are likely to be useful in the proof
• Also likely to have similar premises
Theorems sharing logical features are similar
• Theorems sharing rare features are very similar
Recently considered theorems and premises are important
Not exactly for the usual machine learning tools
Multi-label classifier output
• Often asked for 1000 or more most relevant lemmas
Efficient classifier update
• Learning time + prediction time small
• User will not wait more than 10–30 sec for all phases
Large numbers of features
• Complicated feature relations
Tried Premise Selection Techniques
• Syntactic methods
  • Neighbours using various metrics
  • Clustering
  • Recursive: SInE, MePo
• k-NN, Naive Bayes
• (space reductions)
• Regression
  • Kernel-based multi-output ranking
• Decision Trees (Random Forests)
• Neural Networks
  • Winnow, Perceptron (SNoW)
  • DeepMath
Problem pruning with SInE
• SInE – SUMO Inference Engine [Hoder 2008 ]
• Original version implemented in Python; debuted in CASC-J4 (with E and Vampire)
• Used by other ATPs: iProver-LTB, etc.
• Efficient reimplementations: Vampire (2009/2010) and E (2010, GSinE)
• SInE selects relevant axioms from a large set
• Axiom selection based on function/predicate symbols:
  • The conjecture is relevant
  • Symbols in the conjecture are relevant
  • Formulas defining these symbols are relevant, and their other symbols become relevant
  • Repeat until a fixed point or limit is reached
• A formula is interpreted as defining the globally rarest symbol(s) occurring in it
• Very similar to MePo [MengPaulson2009]
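A minimal sketch of this selection loop; the symbol tables and the trigger relation are simplified, and real SInE adds generality tolerances and size limits:

```python
from collections import Counter

def sine_select(conjecture_symbols, axioms, max_iterations=10):
    """Simplified SInE-style selection: axioms maps axiom name -> set of its
    symbols.  Each axiom is 'triggered' by its globally rarest symbol(s); an
    axiom is selected once one of its triggering symbols has become relevant."""
    occurrences = Counter(s for syms in axioms.values() for s in syms)
    triggers = {name: {s for s in syms
                       if occurrences[s] == min(occurrences[x] for x in syms)}
                for name, syms in axioms.items() if syms}
    relevant = set(conjecture_symbols)  # the conjecture's symbols start relevant
    selected = set()
    for _ in range(max_iterations):
        new = {name for name, trig in triggers.items()
               if name not in selected and trig & relevant}
        if not new:
            break  # fixed point reached
        selected |= new
        for name in new:               # other symbols of selected formulas
            relevant |= axioms[name]   # become relevant too
    return selected
```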
GSInE in E: e_axfilter
• Parameterizable filters
  • Different generality measures (frequency count, generosity, benevolence)
  • Different limits (absolute/relative size, # of iterations)
  • Different seeds (conjecture/hypotheses)
• Efficient implementation
  • E data types and libraries
  • Indexing (symbol → formula, formula → symbol)
• Multi-filter support
  • Parse & index once (amortize costs)
  • Apply different independent filters
• Primary use: Initial over-approximation(efficiently reduce HUGE input files to manageable size)
• Secondary use: Filtering for individual E strategies
Naive Bayes
P(f is relevant for proving g)
  = P(f is relevant | g's features)
  = P(f is relevant | f1, ..., fn)
  ∝ P(f is relevant) · ∏_{i=1..n} P(fi | f is relevant)
  ∝ (#(f is a proof dependency) / #(all dependencies))
      · ∏_{i=1..n} (#(fi appears when f is a proof dependency) / #(f is a proof dependency))
Naive Bayes: adaptation to premise selection
Extended features F(φ) of a fact φ:
features of φ and of the facts that were proved using φ (only one iteration)

More precise estimation of the relevance of φ to prove γ:

P(φ is used in ψ's proof)
· ∏_{f ∈ F(γ) ∩ F(φ)} P(ψ has feature f | φ is used in ψ's proof)
· ∏_{f ∈ F(γ) − F(φ)} P(ψ has feature f | φ is not used in ψ's proof)
· ∏_{f ∈ F(φ) − F(γ)} P(ψ does not have feature f | φ is used in ψ's proof)
All these probabilities can be computed efficiently
Update two functions (tables):
• t(φ): number of times a fact φ was a dependency
• s(φ, f): number of times a fact φ was a dependency of a fact described by feature f

Then:

P(φ is used in a proof of (any) ψ) = t(φ) / K

P(ψ has feature f | φ is used in ψ's proof) = s(φ, f) / t(φ)

P(ψ does not have feature f | φ is used in ψ's proof) = 1 − s(φ, f)/t(φ) ≈ 1 − (s(φ, f) − 1)/t(φ)
Times for large dataset: seconds
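The two tables above translate directly into code; the sketch below (with illustrative smoothing constants, not the exact ones used in the actual systems) learns from proof dependencies and ranks candidate premises:

```python
import math
from collections import defaultdict

class NaiveBayesPremiseRanker:
    """Sketch of the table-based naive Bayes ranker described above:
    t[phi] = number of proofs in which fact phi was a dependency,
    s[phi][f] = number of times phi was a dependency of a fact with feature f,
    K = number of proofs seen."""

    def __init__(self):
        self.t = defaultdict(int)
        self.s = defaultdict(lambda: defaultdict(int))
        self.K = 0

    def learn(self, features, dependencies):
        """Update the tables with one proved fact and its dependencies."""
        self.K += 1
        for phi in dependencies:
            self.t[phi] += 1
            for f in features:
                self.s[phi][f] += 1

    def score(self, phi, features):
        """log( P(phi used) * prod_f P(f | phi used) ), mildly smoothed."""
        logp = math.log(self.t[phi] / self.K)
        for f in features:
            logp += math.log((self.s[phi][f] + 0.1) / (self.t[phi] + 0.2))
        return logp

    def rank(self, features, n=10):
        return sorted(self.t, key=lambda phi: -self.score(phi, features))[:n]
```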
Decision Trees
Definition
• each leaf stores a set of samples
• each branch stores a feature f and two subtrees, where:
  • the left subtree contains only samples having f
  • the right subtree contains only samples not having f
Example (root splits on "+", then "×" and "sin"):

+ ─has─→ × ─has─→ { a × (b + c) = a × b + a × c }
│        └─not─→ { a + b = b + a }
└─not─→ sin ─has─→ { sin x = −sin(−x) }
         └─not─→ × ─has─→ { a × b = b × a }
                  └─not─→ { a = a }
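The branch/leaf structure and the query walk can be sketched as follows; the tree built at the end is one plausible reading of the example above:

```python
class Leaf:
    def __init__(self, samples):
        self.samples = samples  # set of samples stored at this leaf

class Branch:
    def __init__(self, feature, with_f, without_f):
        self.feature = feature      # splitting feature f
        self.with_f = with_f        # subtree for samples having f
        self.without_f = without_f  # subtree for samples lacking f

def query(tree, features):
    """Route a feature set to a leaf, following has-f / lacks-f branches."""
    while isinstance(tree, Branch):
        tree = tree.with_f if tree.feature in features else tree.without_f
    return tree.samples

# The example tree from the slide:
tree = Branch("+",
              Branch("×", Leaf({"a × (b + c) = a × b + a × c"}),
                          Leaf({"a + b = b + a"})),
              Branch("sin", Leaf({"sin x = −sin(−x)"}),
                            Branch("×", Leaf({"a × b = b × a"}),
                                        Leaf({"a = a"}))))
```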
Random Forests
• Online and Offline variants [Safari09, Agrawal13]
  • Online gives too much influence to the first learned nodes
  • Offline too slow to update
• Too few selected labels: multi-path query
• Too slow to compute the Gini index for splitting feature selection
• With 50 trees and 1000 labels in each, performance is weak
  • Needs combining with another classifier: k-NN in the leaves
Random Forests: Results [Farber2015 ]
[Plot: precision (0.9–1.0) of RF vs. k-NN over 200–1,000 evaluation samples]
Runtimes to achieve the same number of proven theorems:

Classifier  Classifier runtime  E timeout  E runtime  Total
k-NN        0.5 min             15 sec     314 min    341 min
RF          22 min              10 sec     252 min    272 min
Deep Learning vs Shallow Learning
Traditional machine learning: Data → Hand-crafted Features → Predictor
• Mostly convex, provably tractable
• Special-purpose solvers
• Non-layered architectures

Deep Learning: Data → Learned Features → Predictor
• Mostly NP-hard
• General-purpose solvers
• Hierarchical models
Deep Learning for Mizar Lemma Selection [Alemi+2016 ]
• Embed all lemmas into Rⁿ using an LSTM
• Embed the conjecture into Rⁿ using an LSTM
• Simple classifier on top of concatenated embeddings
• Trained to estimate usefulness on positive and negative examples
[Architecture: the statement to be proved and a potential premise each pass through an embedding network; the two embeddings feed a combiner network, followed by a classifier/ranker]
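The two-tower shape can be mimicked in a toy, untrained form: hashed bag-of-tokens embeddings stand in for the LSTM, and a dot product plus sigmoid stands in for the learned combiner/classifier. This only illustrates the data flow, not the learned model:

```python
import hashlib
import math

DIM = 16

def embed(statement):
    """Toy embedding network: average of pseudo-random per-token vectors."""
    vec = [0.0] * DIM
    tokens = statement.split()
    for tok in tokens:
        digest = hashlib.md5(tok.encode()).digest()
        for i in range(DIM):
            vec[i] += digest[i] / 255.0 - 0.5
    return [v / max(len(tokens), 1) for v in vec]

def usefulness(conjecture, premise):
    """Toy combiner + classifier: sigmoid of the embeddings' dot product."""
    dot = sum(a * b for a, b in zip(embed(conjecture), embed(premise)))
    return 1.0 / (1.0 + math.exp(-dot))
```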
DeepMath (2/2)
Cutoff  k-NN Baseline  char-CNN      word-CNN      def-CNN-LSTM  def-CNN       def+char-CNN
16      674 (24.6%)    687 (25.1%)   709 (25.9%)   644 (23.5%)   734 (26.8%)   835 (30.5%)
32      1081 (39.4%)   1028 (37.5%)  1063 (38.8%)  924 (33.7%)   1093 (39.9%)  1218 (44.4%)
64      1399 (51%)     1295 (47.2%)  1355 (49.4%)  1196 (43.6%)  1381 (50.4%)  1470 (53.6%)
128     1612 (58.8%)   1534 (55.9%)  1552 (56.6%)  1401 (51.1%)  1617 (59%)    1695 (61.8%)
256     1709 (62.3%)   1656 (60.4%)  1635 (59.6%)  1519 (55.4%)  1708 (62.3%)  1780 (64.9%)
512     1762 (64.3%)   1711 (62.4%)  1712 (62.4%)  1593 (58.1%)  1780 (64.9%)  1830 (66.7%)
1024    1786 (65.1%)   1762 (64.3%)  1755 (64%)    1647 (60.1%)  1822 (66.4%)  1862 (67.9%)

Table 1: Results of ATP premise selection experiments with hard negative mining on a test set of 2,742 theorems.
Useful Intermediate Steps
Outline
Theorem Proving Introduction
Machine Learning Problems in Theorem Proving
Premise Selection
Useful Intermediate Steps
Theorem Names
Internal Guidance
Intermediate lemmas [JSC 2015,FroCoS 2015 ]
Size of the inference graphs
Flyspeck graph
                     nodes          edges
kernel inferences    1,728,861,441  1,953,406,411
reduced trace        159,102,636    233,488,673
tactical inferences  11,824,052     42,296,208
tactical trace       1,067,107      4,268,428
Same orders of magnitude for single ATP proofs
Repetitions → Re-use
• The graphs are already computed outside of HOL
• Feature extraction, prediction, translation do not scale...
• Pre-select heuristically interesting lemmas
Heuristics for interesting lemmas (1/2)
Definition (Recursive dependencies and uses)
D(i) = 1                  if i ∈ Named ∨ i ∈ Axioms
D(i) = ∑_{j ∈ d(i)} D(j)  otherwise

Definition (Lemma quality)

Q1(i) = U(i) · D(i) / S(i)
Q2(i) = U(i) · D(i) / S(i)²
Qr1(i) = U(i)^r · D(i)^(2−r) / S(i)
Q3(i) = U(i) · D(i) / 1.1^S(i)
EpclLemma (longest chain), AGIntRater, ...
Heuristics for interesting lemmas (2/2)
PageRank
• Eigenvector centrality of a graph
• Fast, non-iterative, usable on whole Flyspeck
• Dominant eigenvector of:

      PR1(i) = (1 − f)/N + f · ∑_{j : i ∈ d(j)} PR1(j) / |d(j)|

• Size normalized:

      PR2(i) = PR1(i) / S(i)
Maximum Graph Cut
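An illustrative computation of PR1 over a dependency graph; the slides compute the dominant eigenvector directly, while the sketch below uses plain power iteration as the simplest approximation:

```python
def pagerank(deps, f=0.85, iters=50):
    """Power iteration for PR1: deps maps each lemma j to the set d(j)
    of lemmas it depends on, so rank flows from theorems to the lemmas
    they use."""
    nodes = set(deps) | {i for d in deps.values() for i in d}
    N = len(nodes)
    pr = {i: 1.0 / N for i in nodes}
    for _ in range(iters):
        new = {i: (1 - f) / N for i in nodes}
        for j, d in deps.items():
            if d:
                share = f * pr[j] / len(d)
                for i in d:
                    new[i] += share
        pr = new
    return pr
```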
Learning Lemma Usefulness [ICLR 2017 ]
HOLStep Dataset
• Intermediate steps of the Kepler proof
• Only relevant proofs of reasonable size
• Annotate steps as useful and unused
  • Same number of positive and negative examples
• Tokenization and normalization of statements
Statistics
             Train      Test     Positive   Negative
Examples     2,013,046  196,030  1,104,538  1,104,538
Avg. length  503.18     440.20   535.52     459.66
Avg. tokens  87.01      80.62    95.48      77.40
Conjectures  9,999      1,411    -          -
Avg. deps    29.58      22.82    -          -
Considered Models
Baselines (Training Profiles)
[Training profiles: curves for char-level and token-level models, each unconditioned and conditioned on the conjecture]
Theorem Names
Outline
Theorem Proving Introduction
Machine Learning Problems in Theorem Proving
Premise Selection
Useful Intermediate Steps
Theorem Names
Internal Guidance
Example Theorem Names [Aspinall2016]
• Names attached to noteworthy theorems
  RIEMANN_MAPPING_THEOREM:
  ∀s. open s ∧ simply_connected s ⇐⇒ s = {} ∨ s = (:real^2) ∨
      ∃f g. f holomorphic_on s ∧ g holomorphic_on ball(0, 1) ∧
      (∀z. z IN s =⇒ f z IN ball(Cx(&0), &1) ∧ g(f z) = z) ∧
      (∀z. z IN ball(Cx(&0), &1) =⇒ g z IN s ∧ f(g z) = z)
• Descriptive names
  ADD_ASSOC : ∀m n p. m + (n + p) = (m + n) + p
  REAL_MIN_ASSOC : ∀x y z. min x (min y z) = min (min x y) z
  SUC_GT_ZERO : ∀x. Suc x > 0
• Automatic names:
  HOJODCM LEBHIRJ OBDATYB MEEIXJO KBWPBHQ RYIUUVK
  DIOWAAS PNXVWFS PBFLHET QCDVKEA NCVIBWU PPLHULJ
Associations between names and statements
Consistent
• association between symbols and parts of theorem names
Abstract
• abstracting from concrete symbols
• association between patterns and name parts
Modified k-Nearest Neighbours multi-label classifier
The nearness of two statements s1, s2:
n(s1, s2) = √( ∑_{f ∈ F(s1) ∩ F(s2)} w(f)² )
• w(f ) is the IDF (inverse document frequency) of feature f
Training examples are indexed by features to efficiently evaluate the relevance of a label:
R(l) = ∑_{s1 ∈ N, l ∈ L(s1)} n(s1, s2) / |L(s1)|
Positions for a stem are proposed using the weighted average of the positions in the recommendations.
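The nearness and relevance formulas above can be sketched as follows; the IDF table and the sample set are toy inputs, and position averaging is omitted:

```python
import math
from collections import defaultdict

def knn_relevance(query_features, samples, idf, k=3):
    """Sketch of the modified k-NN: samples maps name -> (features, labels),
    idf maps feature f -> weight w(f).  Returns labels sorted by R(l)."""
    def nearness(f1, f2):
        return math.sqrt(sum(idf[f] ** 2 for f in f1 & f2))
    # the k nearest training samples, by IDF-weighted shared features
    neighbours = sorted(samples,
                        key=lambda s: -nearness(query_features, samples[s][0]))[:k]
    R = defaultdict(float)
    for s in neighbours:
        feats, labels = samples[s]
        n = nearness(query_features, feats)
        for l in labels:
            R[l] += n / len(labels)   # each neighbour splits its vote
    return sorted(R, key=lambda l: -R[l])
```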
Selected features of ADD_ASSOC
Feature                            Frequency  Position  IDF
(V0 + (V1 + V2) = (V0 + V1) + V2)  1          0.37      7.82
((V0 + V1) + V2)                   1          0.75      7.13
(V0 + V1)                          1          0.84      3.95
+                                  4          0.72      2.62
num                                3          0.21      1.15
=                                  1          0.43      0.23
∀                                  3          0.15      0.03
Nearest Neighbours of ADD_ASSOC

Theorem name      Statement                                        Nearness
MULT_ASSOC        (V0 * (V1 * V2)) = ((V0 * V1) * V2)              553
ADD_AC 1          ((V0 + V1) + V2) = (V0 + (V1 + V2))              264
EXP_MULT          (EXP V0 (V1 * V2)) = (EXP (EXP V0 V1) V2)        247
HREAL_ADD_ASSOC   (V0 +H (V1 +H V2)) = ((V0 +H V1) +H V2)          246
HREAL_MUL_ASSOC   (V0 *H (V1 *H V2)) = ((V0 *H V1) *H V2)          246
REAL_ADD_ASSOC    (V0 +R (V1 +R V2)) = ((V0 +R V1) +R V2)          246
REAL_MUL_ASSOC    (V0 *R (V1 *R V2)) = ((V0 *R V1) *R V2)          246
REAL_MAX_ASSOC    (MAXR V0 (MAXR V1 V2)) = (MAXR (MAXR V0 V1) V2)  246
INT_ADD_ASSOC     (V0 +Z (V1 +Z V2)) = ((V0 +Z V1) +Z V2)          246
REAL_MIN_ASSOC    (MINR V0 (MINR V1 V2)) = (MINR (MINR V0 V1) V2)  246
Predicted Stems for associativity of addition
Stem     Positions
ADD      [0.18; 0.14; 0.12; 0; 0.12]
ASSOC    [1]
AC       [0; 1]
NUM      [0.67; 0.45; 0; 0.70]
FIXED    [0]
EQUAL    [0.25; 0.25]
ONE      [0]
MONO     [0.44; 0.49]
REFL     [1]
SELECT   [0]
SPLITS   [0]
RCANCEL  [1]
IITN     [0]
GEN      [1]
Learning and Evaluation
• Use k-NN to predict stems and their positions
  • based on constants and shapes of theorems (ADD_ASSOC)
• Leave-one-out cross-validation
• 2298 statements in the HOL Light standard library

First Choice  Later Choice  Same Stems  All Stems  Stem Overlap  Fail
273           501           291         757        459           17
Failures:
• EXCLUDED_MIDDLE (proposed: DISJ_THM)
• INFINITY_AX (proposed: ONTO_ONE)
• ETA_AX, WLOG_RELATION, TOPOLOGICAL_SORT
Inconsistencies
Internal Guidance
Outline
Theorem Proving Introduction
Machine Learning Problems in Theorem Proving
Premise Selection
Useful Intermediate Steps
Theorem Names
Internal Guidance
leanCoP: Lean Connection Prover (Jens Otten)
• Connected tableaux calculus
  • goal-oriented, good for large theories
• Regularly beats Metis and Prover9 in CASC
  • despite their much larger implementation
  • very good performance on some ITP challenges
• Compact Prolog implementation, easy to modify
  • variants for other foundations: iLeanCoP, mLeanCoP
  • first experiments with machine learning: MaLeCoP
• Easy to imitate
  • leanCoP tactic in HOL Light
Lean Connection Tableaux
Very simple rules:
• Reduction unifies the current literal with a literal on the path
• Extension unifies the current literal with a copy of a clause
  ─────────────  Axiom
   {}, M, Path

       C, M, Path ∪ {L2}
  ─────────────────────────  Reduction
   C ∪ {L1}, M, Path ∪ {L2}

   C2 \ {L2}, M, Path ∪ {L1}      C, M, Path
  ──────────────────────────────────────────  Extension
             C ∪ {L1}, M, Path
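For intuition, the three rules can be turned into a tiny ground (propositional) connection prover. This is only a sketch: clauses are lists of string literals, and the unification, iterative deepening, and optimizations of the real first-order calculus are omitted.

```python
def negate(lit):
    """Complement of a propositional literal, e.g. 'p' <-> '-p'."""
    return lit[1:] if lit.startswith('-') else '-' + lit

def prove(clause, matrix, path, limit):
    """Close every literal of `clause`: empty clause is the Axiom rule."""
    if not clause:
        return True
    lit, rest = clause[0], clause[1:]
    neg = negate(lit)
    # Reduction: the complement of the current literal lies on the path.
    if neg in path and prove(rest, matrix, path, limit):
        return True
    # Extension: connect to a complementary literal of a matrix clause;
    # its remaining literals are proved with the extended path.
    if limit > 0:
        for c in matrix:
            if neg in c:
                c1 = [l for l in c if l != neg]
                if prove(c1, matrix, path | {lit}, limit - 1) \
                        and prove(rest, matrix, path, limit):
                    return True
    return False

def refute(matrix, start, limit=10):
    """Start the tableau at clause `start` of an (unsatisfiable) matrix."""
    return prove(start, matrix, frozenset(), limit)
```

For example, `refute([['p'], ['-p', 'q'], ['-q']], ['p'])` closes the tableau by extending `p` against `-p ∨ q` and then `q` against `-q`.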
leanCoP: Basic Unoptimized Code
prove([Lit|Cla],Path,PathLim,Lem,Set) :-
    (-NegLit=Lit;-Lit=NegLit) ->
    (
        member(NegL,Path), unify_with_occurs_check(NegL,NegLit)
    ;
        lit(NegLit,NegL,Cla1,Grnd1),
        unify_with_occurs_check(NegL,NegLit),
        prove(Cla1,[Lit|Path],PathLim,Lem,Set)
    ),
    prove(Cla,Path,PathLim,Lem,Set).
prove([],_,_,_,_).
leanCoP: Actual Optimized Code
prove([Lit|Cla],Path,PathLim,Lem,Set) :-
    \+ (member(LitC,[Lit|Cla]), member(LitP,Path), LitC==LitP),
    (-NegLit=Lit;-Lit=NegLit) ->
    (
        member(LitL,Lem), Lit==LitL
    ;
        member(NegL,Path), unify_with_occurs_check(NegL,NegLit)
    ;
        lit(NegLit,NegL,Cla1,Grnd1),
        unify_with_occurs_check(NegL,NegLit),
        ( Grnd1=g -> true ;
          length(Path,K), K<PathLim -> true ;
          \+ pathlim -> assert(pathlim), fail ),
        prove(Cla1,[Lit|Path],PathLim,Lem,Set)
    ),
    ( member(cut,Set) -> ! ; true ),
    prove(Cla,Path,PathLim,[Lit|Lem],Set).
prove([],_,_,_,_).
leanCoP in HOL Light
• OCaml leanCoP integrated as an automated proof tactic
  • reconstruction technique for hammers
• Some Prolog technology missing in functional languages
  • explicit stack for the current proof state,
    including the trail of variable bindings
  • continuation-passing style for flow control:
    backtracking alternatives and further tasks
• Improvement compared to existing reconstruction
• Very secure HOL Light certification of leanCoP proofs

Prover                 Theorems (%)
OCaml-leanCoP (cut)    759 (87.04)
Metis (2.3)            708 (81.19)
Meson                  683 (78.32)
any                    832 (95.41)
FEMaLeCoP: Advice Overview and Used Features
• Advise the selection of a clause for every tableau extension step
• Proof state: weighted vector of symbols (or terms)
  • extracted from all the literals on the active path
  • frequency-based weighting (IDF)
  • simple decay factor (using maximum)
• Consistent clausification
  • the formula ?[X]: p(X) becomes p('skolem(?[A]:p(A),1)')
• Advice using a custom sparse naive Bayes classifier
  • associates the features of the proof states
    with the contrapositives used for successful extension steps
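The advice step can be sketched as a sparse naive Bayes ranking (illustrative Python only: the real FEMaLeCoP weighting additionally uses IDF and decay factors, and the smoothing constants below are made up; the data maps follow the naming on the next slide):

```python
import math

def nb_score(features, contra, te_num, cn_no, cn_pf_no):
    """Log-space naive-Bayes relevance of contrapositive `contra` for the
    feature set of the current path: a prior from how often `contra` was
    used in training, plus one term per feature from co-occurrence counts."""
    used = cn_no.get(contra, 0)
    score = math.log((used + 1) / (te_num + 2))             # prior
    co = cn_pf_no.get(contra, {})
    for f in features:
        score += math.log((co.get(f, 0) + 1) / (used + 2))  # ~ P(f | contra)
    return score

def advise(features, candidates, te_num, cn_no, cn_pf_no):
    """Order the candidate extension clauses, most promising first."""
    return sorted(candidates, key=lambda c:
                  -nb_score(features, c, te_num, cn_no, cn_pf_no))
```

A contrapositive that was frequently used in proof states with the current features is then tried before rarely successful ones.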
FEMaLeCoP: Data Collection and Indexing
• Slight extension of the saved proofs
  • training data: pairs (path, used extension step)
• External data indexing (incremental)
  • te_num: number of training examples
  • pf_no: map from features to numbers of occurrences ∈ Q
  • cn_no: map from contrapositives to numbers of occurrences
  • cn_pf_no: map of maps of cn/pf co-occurrences
• Problem-specific data: upon start, FEMaLeCoP reads only the
  current-problem-relevant parts of the training data
  • cn_no and cn_pf_no filtered by the contrapositives in the problem
  • pf_no and cn_pf_no filtered by the possible features in the problem

The rest of naive Bayes is as in the previous lecture.
It cannot work? But it does!

Inference speed drops to about 40%, but:

Prover                   Proved (%)
Unguided OCaml-leanCoP   574 (27.6%)
FEMaLeCoP                635 (30.6%)
together                 664 (32.0%)

(evaluation on MPTP bushy problems, 60 s)
Even more impressive in HOL (Satallax) [BrownFarber2016]

[Plot: problems solved (up to ~4,000) against time in seconds (0-30); curves for the Offline and Training configurations]
E-Prover given-clause loop
Most important choice: unprocessed clause selection [Schulz 2015]
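A minimal sketch of the given-clause loop for ground clauses, with binary resolution standing in for E's full inference engine and `select` as the choice point that learned guidance replaces (all names here are illustrative, not E's API):

```python
def negate(lit):
    return lit[1:] if lit.startswith('-') else '-' + lit

def resolvents(given, processed):
    """All ground binary resolvents of `given` against the processed set."""
    for c in processed:
        for lit in given:
            if negate(lit) in c:
                yield (given - {lit}) | (c - {negate(lit)})

def given_clause_loop(axioms, select, max_iters=10000):
    """Repeatedly move one unprocessed clause to the processed set and
    generate its inferences; the empty clause signals a proof."""
    processed, unprocessed = [], [frozenset(c) for c in axioms]
    for _ in range(max_iters):
        if not unprocessed:
            return 'saturated'
        given = select(unprocessed)      # the crucial heuristic choice
        unprocessed.remove(given)
        processed.append(given)
        for new in resolvents(given, processed):
            if not new:                  # empty clause: proof found
                return 'proved'
            unprocessed.append(new)
    return 'timeout'
```

With `select = lambda u: u[0]` (FIFO), `given_clause_loop([{'p'}, {'-p', 'q'}, {'-q'}], ...)` derives the empty clause; a learned `select` instead scores the unprocessed clauses against the conjecture.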
Data Collection
Mizar top-level theorems [Urban 2006]
• Encoded in FOF
32,521 Mizar theorems with ≥ 1 proof
• training-validation split (90%-10%)
• replay with one strategy
Collect all CNF intermediate steps
• and unprocessed clauses when a proof is found
Deep Network Architectures
[Diagram — Overall network: a clause embedder and a negated-conjecture embedder over clause tokens and negated-conjecture tokens, concatenated, followed by a fully connected layer (1024 nodes), a fully connected layer (1 node), and a logistic loss. Convolutional embedding: input token embeddings followed by three Conv-5 (1024) + ReLU layers and max pooling.]

Non-dilated and dilated convolutions
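The convolutional embedder can be sketched with plain NumPy (toy dimensions and random weights; only the shapes and the Conv-5 + ReLU + max-pooling structure follow the slide, nothing here is the actual trained model):

```python
import numpy as np

def conv_relu(x, w):
    """Width-5 1-D convolution with 'same' padding, followed by ReLU.
    x: (seq_len, d_in); w: (5, d_in, d_out)."""
    k = w.shape[0]
    pad = np.pad(x, ((k // 2, k // 2), (0, 0)))
    out = np.stack([np.tensordot(pad[i:i + k], w, axes=([0, 1], [0, 1]))
                    for i in range(x.shape[0])])
    return np.maximum(out, 0.0)

def embed(tokens, emb, convs):
    """Token embeddings -> three Conv-5 + ReLU layers -> max pooling."""
    x = emb[tokens]
    for w in convs:
        x = conv_relu(x, w)
    return x.max(axis=0)

rng = np.random.default_rng(0)
d = 32                                    # toy size; the slide uses 1024
emb = 0.1 * rng.normal(size=(100, d))     # vocabulary of 100 token ids
convs = [0.1 * rng.normal(size=(5, d, d)) for _ in range(3)]

clause_vec = embed(np.array([3, 1, 4, 1, 5]), emb, convs)
conj_vec = embed(np.array([2, 7, 1]), emb, convs)
pair = np.concatenate([clause_vec, conj_vec])  # fed to the FC layers
```

Max pooling makes the embedding length-independent, so clauses and conjectures of any token length yield fixed-size vectors for the fully connected classifier.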
Recursive Neural Networks
• Curried representation of first-order statements
• Separate nodes for apply, or, and, not
• Layer weights learned jointly for the same formula
• Embeddings of symbols learned with rest of network
• Tree-RNN and Tree-LSTM models
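A Tree-RNN over the curried representation can be sketched as follows (toy NumPy version with random weights; the symbol set, dimensions, and node kinds shown are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
# one learned embedding per symbol, one weight matrix per node kind;
# weights are shared across every occurrence in the same formula
sym = {s: 0.1 * rng.normal(size=d) for s in ('p', 'a', 'b')}
W = {kind: 0.1 * rng.normal(size=(d, 2 * d))
     for kind in ('apply', 'or', 'and')}

def embed(term):
    """Recursively embed a curried term, e.g. ('apply', ('apply', 'p', 'a'), 'b')
    for the application p a b: embed the children, concatenate, transform."""
    if isinstance(term, str):
        return sym[term]
    kind, left, right = term
    x = np.concatenate([embed(left), embed(right)])
    return np.tanh(W[kind] @ x)

v = embed(('apply', ('apply', 'p', 'a'), 'b'))
```

In training, both the per-symbol vectors and the per-kind matrices receive gradients, so symbol embeddings are learned jointly with the rest of the network, as the slide states.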
Model accuracy
Model                    Embedding Size  Accuracy (50-50% split)
Tree-RNN-256×2           256             77.5%
Tree-RNN-512×1           256             78.1%
Tree-LSTM-256×2          256             77.0%
Tree-LSTM-256×3          256             77.0%
Tree-LSTM-512×2          256             77.9%
CNN-1024×3               256             80.3%
*CNN-1024×3              256             78.7%
CNN-1024×3               512             79.7%
CNN-1024×3               1024            79.8%
WaveNet-256×3×7          256             79.9%
*WaveNet-256×3×7         256             79.9%
WaveNet-1024×3×7         1024            81.0%
WaveNet-640×3×7(20%)     640             81.5%
*WaveNet-640×3×7(20%)    640             79.9%

* = trained on unprocessed clauses as negative examples
Hybrid Heuristic
Already on the previously proved statements, good performance requires modifications:
[Two plots: percent unproved (0%-100%) against processed clause limit (10² to 10⁵). Left: Pure CNN, Hybrid CNN, Pure CNN; Auto, and Hybrid CNN; Auto. Right: Auto, WaveNet 640*, WaveNet 256, WaveNet 256*, WaveNet 640, CNN, and CNN*.]
Harder Mizar top-level statements
Model           DeepMath 1  DeepMath 2  Union of 1 and 2
Auto            578         581         674
*WaveNet 640    644         612         767
*WaveNet 256    692         712         864
WaveNet 640     629         685         997
*CNN            905         812         1,057
CNN             839         935         1,101
Total (unique)  1,451       1,458       1,712
Overall proved 7.4% of the harder statements
• Batching and hybrid necessary
• Model accuracy unsatisfactory
Generative Models of Text
[A. Karpathy 2016]
Take Home
• Proofs are hard
• Machine learning is key to the most powerful proof-assistant automation
• Older but very efficient algorithms with significant adjustments
• Many other learning problems and scenarios
Not covered
• Learning strategy selection [Jakubův, Urban]
• Kernel methods [Kühlwein]
• SVM [Holden]
• Deep Prolog [Rocktäschel]
• Semantic features, conjecturing
• Tactic selection [Nagashima, Gauthier]
• Adversarial networks [Szegedy]
• Human proof optimization
• Theory exploration [Bundy+]
• Learning to formalize [Vyskočil]
• ...