Feature Importance Analysis with XGBoost in Tax Audit
TRANSCRIPT
![Page 1: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/1.jpg)
Preparation of a tax audit with Machine Learning
“Feature Importance” analysis applied
to accounting using XGBoost R package
Meetup Paris Machine Learning Applications Group – Paris – May 13th, 2015
![Page 2: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/2.jpg)
Who am I?
Michaël Benesty
@pommedeterre33
@pommedeterresautee
fr.linkedin.com/in/mbenesty
• CPA (Paris): 4 years
• Financial auditor (NYC): 2 years
• Tax law associate @ Taj (Deloitte - Paris) since 2013
  • Department TMC (Computerized tax audit)
• Co-author XGBoost R package with Tianqi Chen (main author) & Tong He (package maintainer)
![Page 3: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/3.jpg)
WARNING
Everything that will be presented tonight is exclusively based on open source software
Please try the same at home
![Page 4: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/4.jpg)
Plan
1. Accounting & tax audit context
2. Machine learning application
3. Gradient boosting theory
![Page 5: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/5.jpg)
Accounting crash course 101 (1/2)
Accounting is a way to record economic transactions.
• My company buys €10 worth of potatoes to cook delicious French fries.
| Account number | Account name | Debit | Credit |
|---|---|---|---|
| 601 | Purchase | 10.00 | |
| 512 | Bank | | 10.00 |

Description: Buy €10 of potatoes from XYZ
![Page 6: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/6.jpg)
Accounting crash course 101 (2/2)
French tax law requires much more information in my accounting:
• Who?
  • Name of the potato supplier
  • Account of the potato supplier
• When?
  • When the accounting entry is posted
  • Date of the invoice from the potato seller
  • Payment date
  • …
• What?
  • Invoice ref
  • Item description
  • …
• How much?
  • Foreign currency
  • …
• …
![Page 7: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/7.jpg)
Tax audit context
Since 2014, companies audited by the French tax administration must provide their entire accounting records as a CSV / XML file.
Simplified* example:
EcritureDate|CompteNum|CompteLib|PieceDate|EcritureLib|Debit|Credit
20110805|601|Purchase|20110701|Buy potatoes|10|0
20110805|512|Bank|20110701|Buy potatoes|0|10
*: usually there are 18 columns
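As a rough illustration of how such a file can be read, here is a minimal Python sketch. It assumes the pipe-delimited layout of the simplified example and reuses its two rows; the helper names `read_entries` and `is_balanced` are hypothetical, and a real file has more columns.

```python
import csv
import io
from decimal import Decimal

# Rows copied from the simplified example; a real file usually has 18 columns.
SAMPLE = """EcritureDate|CompteNum|CompteLib|PieceDate|EcritureLib|Debit|Credit
20110805|601|Purchase|20110701|Buy potatoes|10|0
20110805|512|Bank|20110701|Buy potatoes|0|10
"""


def read_entries(text):
    """Parse pipe-delimited accounting lines into a list of dicts (hypothetical helper)."""
    return list(csv.DictReader(io.StringIO(text), delimiter="|"))


def is_balanced(entries):
    """Double-entry check: total debit must equal total credit."""
    debit = sum(Decimal(e["Debit"]) for e in entries)
    credit = sum(Decimal(e["Credit"]) for e in entries)
    return debit == credit


entries = read_entries(SAMPLE)
print(len(entries), is_balanced(entries))
```

`Decimal` is used instead of `float` so that amounts compare exactly, which matters when checking that an accounting entry balances.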
![Page 8: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/8.jpg)
Example of a trivial apparent anomaly
Article 39 of French tax code states that (simplified):
“For FY 2011, an expense is deductible from P&L 2011 when its operative event happens in 2011”
In our audit software (ACL), we add a new Boolean feature to the dataset: True if the invoice date falls outside 2011, False otherwise
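The talk computes this flag in ACL. As a stand-in, the same rule can be sketched in Python. This version assumes that FY 2011 is the calendar year and that invoice dates are `YYYYMMDD` strings as in the file example; the function name `out_of_fiscal_year` is hypothetical.

```python
from datetime import date, datetime


def out_of_fiscal_year(piece_date, fy_start, fy_end):
    """Return True when the invoice date (YYYYMMDD string) falls outside the fiscal year."""
    invoice_day = datetime.strptime(piece_date, "%Y%m%d").date()
    return not (fy_start <= invoice_day <= fy_end)


# Assumption: FY 2011 runs from January 1st to December 31st.
FY_START, FY_END = date(2011, 1, 1), date(2011, 12, 31)

print(out_of_fiscal_year("20110701", FY_START, FY_END))  # False
print(out_of_fiscal_year("20101215", FY_START, FY_END))  # True
```

Passing the fiscal year bounds as arguments keeps the rule usable for companies whose fiscal year does not match the calendar year.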
![Page 9: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/9.jpg)
Boring tasks for a human to perform
Find a pattern that predicts, from the way its fields are populated, whether an accounting entry will be tagged as an anomaly.
1. Take time to display the lines marked as out of FY
   demo dataset (1 500 000 lines) ≈ 100 000 lines marked as having an invoice date out of FY
2. Take time to analyze the 18 columns of the accounting
   from 200 to >> 100 000 different values per column
3. Take time to find a pattern/rule by hand. Use filters. Iterate.
4. Take time to check that the pattern found in the selection does not also appear in the remaining data
![Page 10: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/10.jpg)
What can Machine Learning do to help?
1. Look at the whole dataset without human help
2. Analyze each value in each column without human help
3. Find a pattern without human help
4. Generate an (R Markdown) report without human help
Requirements:
• Interpretable
• Scalable
• Works (almost) out of the box
![Page 11: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/11.jpg)
Two tries for one success
1st try: Subgroup mining (Failed)
Find feature values common to a group of observations which are different from the rest of the dataset.
2nd try: Feature importance on a decision-tree-based algorithm (Success)
Use a predictive algorithm to describe the existing data.
![Page 12: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/12.jpg)
1st try: Subgroup mining algorithm
Find feature values common to a group of observations which are different from the rest of the dataset.
1. Find an existing open source project
2. Check it gives interpretable results in reasonable time
3. Help the project's main author with:
• reducing memory footprint by 50%, fixing many small bugs (2 months)
• R interface (1 month)
• finding and fixing a huge bug in the core algorithm just before going into production (1 week)
After the last bug fix, the algorithm was too slow to be used on real accounting…
![Page 13: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/13.jpg)
2nd try: XGBoost
Available for R, Python, Julia, CLI
Fast and memory efficient
• Can be more than 10 times faster than GBM in Sklearn and R (benchmark in the GitHub repository)
• New external memory learning implementation (based on the distributed computation implementation)
Distributed and portable
• The distributed version runs on Hadoop (YARN), MPI, SGE, etc.
• Scales to billions of examples (tested on 4 billion observations / 20 computers)
XGBoost won many Kaggle competitions, such as:
• WWW2015 Microsoft Malware Classification Challenge (BIG 2015)
• Tradeshift Text Classification
• HEP meets ML Award in Higgs Boson Challenge
• XGBoost is by far the most discussed tool in the ongoing Otto competition
![Page 14: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/14.jpg)
Iterative feature importance with XGBoost (1/3)
Shows which features are the most important for predicting whether an entry has its PieceDate field (invoice date) out of the fiscal year.
In this example, the FY runs from 2010/12/01 to 2011/11/30.
It is not surprising to find PieceDate among the most important features, because the label is based on this feature! But the distribution of the important invoice dates is interesting here.
Most entries out of the FY have the same invoice date: 20111201
![Page 15: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/15.jpg)
Iterative feature importance with XGBoost (2/3)
Since one feature represented > 99% of the gain in the previous slide, we remove it from the dataset and run a new analysis.
Most entries are related to the same JournalCode (nature of operation)
![Page 16: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/16.jpg)
Iterative feature importance with XGBoost (3/3)
Entries marked as out of FY have the same invoice date, and are related to the same JournalCode. We run a new analysis without JournalCode:
Most of the entries with an invoice date issue are related to Inventory accounts!
That’s the kind of pattern we were looking for
![Page 17: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/17.jpg)
XGBoost explained in 2 pics (1/2)
Classification And Regression Tree (CART)
A decision tree is about learning a set of rules:
if 𝑋1 ≤ 𝑡1 & if 𝑋2 ≤ 𝑡2 then 𝑅1
if 𝑋1 ≤ 𝑡1 & if 𝑋2 > 𝑡2 then 𝑅2
…
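Read as code, the two rules above are nested threshold tests. This is a minimal Python sketch; the function name `predict_region` is hypothetical, and the rules the slide elides are left unfilled rather than guessed.

```python
def predict_region(x1, x2, t1, t2):
    """Apply the two rules shown on the slide; the other branches are elided there."""
    if x1 <= t1 and x2 <= t2:
        return "R1"
    if x1 <= t1 and x2 > t2:
        return "R2"
    return None  # rules for x1 > t1 are not given on the slide


print(predict_region(x1=1.0, x2=0.5, t1=2.0, t2=1.0))  # R1
print(predict_region(x1=1.0, x2=3.0, t1=2.0, t2=1.0))  # R2
```

Each path from the root to a leaf is one readable rule, which is where the interpretability of the method comes from.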
Advantages:
• Interpretable
• Robust
• Captures non-linear relationships
Drawbacks:
• Weak Learner
• High variance
![Page 18: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/18.jpg)
XGBoost explained in 2 pics (2/2)
Gradient boosting on CART
• One more tree = mean loss decreases = more data explained
• Each tree captures some part of the model
• The original data points used for tree 1 are replaced by the loss points for trees 2 and 3
![Page 19: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/19.jpg)
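Learning a model ≃ Minimizing the loss function
Given a prediction $\hat{y}$ and a label $y$, a loss function $\ell$ measures the discrepancy between the algorithm's prediction and the desired output.
• Loss on training data:
$$L = \sum_{i=1}^{n} \ell(y_i, \hat{y}_i)$$
• Logistic loss for binary classification, averaged over the $n$ training examples:
$$\frac{1}{n}\sum_{i=1}^{n} \ell(y_i, \hat{y}_i) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
Logistic loss punishes a false certainty in the prediction ($\hat{y}_i \in \{0, 1\}$) with an infinite penalty*.
*: $\lim_{x \to 0^+} \log x = -\infty$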
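To make the "infinite punishment" concrete, here is a small Python sketch of the logistic loss. It assumes labels in {0, 1} and predictions that are probabilities in [0, 1]; the function names are hypothetical.

```python
import math


def logistic_loss(y, p):
    """Per-example logistic loss for a label y in {0, 1} and a predicted probability p."""
    if p <= 0.0 or p >= 1.0:
        # A prediction of exactly 0 or 1 costs nothing only when it is right.
        return 0.0 if p == y else math.inf
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def mean_logistic_loss(labels, predictions):
    """Average of the per-example losses over the training set."""
    return sum(logistic_loss(y, p) for y, p in zip(labels, predictions)) / len(labels)


# The true label is 1; the loss grows without bound as the prediction goes to 0.
for p in (0.9, 0.5, 0.1, 0.001, 0.0):
    print(p, f"{logistic_loss(1, p):.3f}")
```

The loop shows the loss rising slowly at first and then diverging, which is why a confident wrong prediction dominates the average.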
![Page 20: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/20.jpg)
Growing a tree
In practice, we grow the tree greedily:
• Start from tree with depth 0
• For each leaf node of the tree, try to add a split. The change of objective after adding the split is:
$$Gain = \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} - \gamma$$
The four terms are, in order: the score of the left child, the score of the right child, the score if we don't split, and the complexity cost of introducing an additional leaf.
G is called the sum of residuals: it gives the general mean direction of the residual we want to fit.
H corresponds to the sum of weights over all the instances.
γ and λ are two regularization parameters.
Tianqi Chen. (Oct. 2014) Learning about the model: Introduction to Boosted Trees
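The gain formula of this slide can be checked by hand with a few lines of Python. This is only a sketch of the formula as written on the slide, not XGBoost's internal code; the function names and the numeric values of G, H, λ and γ are made up for illustration.

```python
def leaf_score(g, h, lam):
    """Score of a node: G^2 / (H + lambda)."""
    return g * g / (h + lam)


def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Change of objective when a leaf is split into a left and a right child."""
    return (
        leaf_score(g_left, h_left, lam)
        + leaf_score(g_right, h_right, lam)
        - leaf_score(g_left + g_right, h_left + h_right, lam)
        - gamma
    )


# Children pulling in opposite directions: the split is worth it.
print(split_gain(-4.0, 2.0, 3.0, 1.5, lam=1.0, gamma=0.5))
# Children pulling in the same direction: the gain is negative, so no split.
print(split_gain(1.0, 1.0, 1.0, 1.0, lam=1.0, gamma=0.5))
```

A split is only attractive when the two children have clearly different residual directions; otherwise the regularization terms make the gain negative.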
![Page 21: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/21.jpg)
Gradient Boosting
Iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier.
• Each round we learn a new tree to approximate the negative gradient and minimize the loss:
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$
where $\hat{y}_i^{(t)}$ is the whole model prediction and $f_t(x_i)$ is the prediction of tree $t$.
• Loss:
$$Obj^{(t)} = \sum_{i=1}^{n} \ell\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$
where $\Omega(f_t)$ is the complexity cost of introducing an additional tree.
Friedman, J. H. (March 1999) Stochastic Gradient Boosting.
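The additive update can be illustrated with a toy booster written in plain Python. This sketch makes several simplifying assumptions that differ from XGBoost: squared loss instead of logistic loss, depth-1 trees on a single feature, no regularization term Ω, and made-up data. All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (one threshold, two leaf values) to the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = sum((r - left_value) ** 2 for r in left) + sum(
            (r - right_value) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=3):
    """Add one stump per round: prediction(t) = prediction(t-1) + f_t(x)."""
    predictions = [0.0] * len(xs)
    trees, losses = [], []
    for _ in range(n_rounds):
        # For the squared loss 1/2 (y - y_hat)^2, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + tree(x) for p, x in zip(predictions, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return trees, losses


# Made-up toy data: one feature, one numeric target.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
trees, losses = boost(xs, ys)
print(losses)
```

On this toy data the mean loss goes down after each added tree, which matches the "one more tree = mean loss decreases" picture.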
![Page 22: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/22.jpg)
Gradient descent
“Gradient Boosting is a special case of the functional gradient descent view of boosting.”
Mason, L.; Baxter, J.; Bartlett, P. L.; Frean, Marcus (May 1999). Boosting Algorithms as Gradient Descent in Function Space.
[Figure: 2D view of the loss, with two end points annotated "Sometimes you are lucky" and "Usually you finish here"]
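The point of the picture can be reproduced with plain gradient descent on a one-dimensional function that has two minima. The loss function below is an arbitrary toy choice, not the curve from the slide, and the step size and number of steps are assumptions.

```python
def loss(x):
    """Arbitrary toy loss with two minima (not the curve from the slide)."""
    return x ** 4 - 3 * x ** 2 + x


def gradient(x):
    """Derivative of the toy loss."""
    return 4 * x ** 3 - 6 * x + 1


def gradient_descent(start, learning_rate=0.01, n_steps=2000):
    """Follow the negative gradient from a starting point."""
    x = start
    for _ in range(n_steps):
        x -= learning_rate * gradient(x)
    return x


lucky = gradient_descent(start=-2.0)   # ends in the deeper minimum
unlucky = gradient_descent(start=2.0)  # ends in the shallower minimum
print(lucky, loss(lucky))
print(unlucky, loss(unlucky))
```

Both runs stop where the gradient vanishes, but only one of them reaches the lowest loss; which one you get depends on the starting point.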
![Page 23: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/23.jpg)
Building a good model for feature importance
For feature importance analysis, in the simplicity vs. accuracy trade-off, choose simplicity. A few (empirical) rules of thumb:
• nrounds: number of trees. Keep it low (< 20 trees)
• max.depth: depth of each tree. Keep it low (< 7)
• Run the feature importance analysis iteratively, removing the most important features until the 3 most important features represent less than 70% of the total gain.
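The last rule of thumb is a simple loop, sketched here in Python. The callable `gain_by_feature` is a hypothetical placeholder for "train a small model on these features and return the gain of each one"; the stub below returns fixed, made-up numbers so that the loop can run on its own. In a real analysis the gains would be recomputed by retraining at every round.

```python
def iterative_importance(gain_by_feature, features, top_n=3, max_share=0.70):
    """Drop the top feature until the top_n features hold less than max_share of the gain."""
    remaining = list(features)
    removed = []
    while len(remaining) > top_n:
        gains = gain_by_feature(remaining)
        total = sum(gains.values())
        if total == 0:
            break
        ranked = sorted(gains, key=gains.get, reverse=True)
        top_share = sum(gains[f] for f in ranked[:top_n]) / total
        if top_share < max_share:
            break
        removed.append(ranked[0])
        remaining.remove(ranked[0])
    return remaining, removed


# Made-up gains for illustration only; they are not results from the talk.
TOY_GAINS = {
    "PieceDate": 990.0,
    "JournalCode": 6.0,
    "CompteNum": 1.0,
    "EcritureLib": 0.9,
    "EcritureDate": 0.8,
    "Debit": 0.7,
    "Credit": 0.6,
}


def toy_gain(features):
    """Stub standing in for a model refit: returns the fixed toy gains."""
    return {f: TOY_GAINS[f] for f in features}


remaining, removed = iterative_importance(toy_gain, list(TOY_GAINS))
print(removed)
print(remaining)
```

With these toy numbers the loop removes two dominant features and then stops, because the three strongest remaining features account for less than 70% of the gain.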
![Page 24: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/24.jpg)
Love XGBoost? Vote XGBoost!
Otto challenge
Help the XGBoost open source project spread knowledge by voting for our script explaining how to use our tool (no prize to win):
https://www.kaggle.com/users/32300/tianqi-chen/otto-group-product-classification-challenge/understanding-xgboost-model-on-otto-data
![Page 25: Feature Importance Analysis with XGBoost in Tax audit](https://reader031.vdocuments.net/reader031/viewer/2022012320/55a8485b1a28ab98108b471b/html5/thumbnails/25.jpg)
Too much time in your life?
• General papers about gradient boosting:
• Greedy function approximation: a gradient boosting machine. J.H. Friedman
• Stochastic Gradient Boosting. J.H. Friedman
• Tricks used by XGBoost:
• Additive logistic regression: a statistical view of boosting. J.H. Friedman, T. Hastie, R. Tibshirani (for the second-order statistics for tree splitting)
• Learning Nonlinear Functions Using Regularized Greedy Forest. R. Johnson and T. Zhang (proposes a fully corrective step, as well as regularizing the tree complexity)
• Learning about the model: Introduction to Boosted Trees. Tianqi Chen (from the author of XGBoost)