d. sobczak april 27, 2017 - semcacfe · 2017-04-29 · ingredients of data visualization with a...
TRANSCRIPT
Basic Data Analytics, Visualizations, and Predictive Analytics
D. Sobczak, April 27, 2017
Topics
• Tips on how to get started in data mining and basic analytic techniques.
• Storytelling techniques to convert analytical results to easy-to-understand presentations.
• Discussion on predictive modeling.
Median Loss -$975,000
Median Loss -$200,000
Types of Business Fraud
http://www.acfe.com/rttn2016/docs/Staggering-Cost-of-Fraud-infographic.pdf
Median Loss -$125,000
Types of Business Fraud (cont’d)
http://www.acfe.com/rttn2016/docs/Staggering-Cost-of-Fraud-infographic.pdf
Median Losses Valued by Region
http://www.acfe.com/rttn2016/docs/Staggering-Cost-of-Fraud-infographic.pdf
Anti-Fraud Controls minimize the amount of the fraud theft.
http://www.acfe.com/rttn2016/docs/Staggering-Cost-of-Fraud-infographic.pdf
http://www.acfe.com/rttn2016/detection.aspx
Fraud Triangle / Data Scientist Venn Diagram
http://www.internalauditor.me/article/the-fraud-triangle/
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
CRISP-DM - Cross Industry Standard Process for Data Mining
The Analytics Process – CRISP-DM (Cross Industry Standard Process for Data Mining)
https://itsalocke.com/crisp-dm/
CRISP-DM is an Iterative Process
The Analytics Process – CRISP-DM (Cross Industry Standard Process for Data Mining)
Business Understanding
• Understand the business goal
• Assess the situation
• Translate the business goal into a data mining objective
• Develop a project plan
Data Understanding
• Consider data requirements
• Collect and explore the data
• Determine quality of the data
The Analytics Process – CRISP-DM (cont’d) (Cross Industry Standard Process for Data Mining)
Data Preparation
• Select needed data
• Data acquisition
• Data integration and formatting
• Data cleaning
• Data transformation and enrichment
Modeling
• Selection of appropriate modeling technique
• Splitting of the dataset into training and testing subsets for evaluation purposes
• Development and examination of alternative modeling algorithms and parameter settings
• Fine tuning of the model settings according to an initial assessment of the model’s performance
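The dataset-splitting step in the Modeling list can be sketched in a few lines. This is only an illustration using the Python standard library; the record layout, the 70/30 ratio, and the fixed seed are assumptions made for the example and do not come from the slides.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and split it into training and testing subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical claim records: (claim_id, paid_amount)
claims = [(i, 100 + i) for i in range(10)]
train, test = train_test_split(claims)
```

The model would be fitted on `train` only, with `test` held back to evaluate it on records it has not seen.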
The Analytics Process – CRISP-DM (cont’d) (Cross Industry Standard Process for Data Mining)
Model Evaluation
• Evaluation of the model in the context of the business success criteria
• Model approval
Deployment
• Create a report of findings
• Planning and development of the deployment procedure
• Deployment of the model
• Distribution of the model results and integration in the organization's operational system
• Development of a maintenance / update plan
• Review of the project
• Planning the next steps
Why Use CRISP-DM? (Cross Industry Standard Process for Data Mining)
http://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html
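CRISP-DM is the most widely used methodology for analytics projects, according to the KDnuggets poll cited above. It has been around since 1996; it was developed by a consortium that included SPSS (now part of IBM) and is supported in IBM's SPSS Modeler software.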
The best reason to use a methodology is for standardization of approach.
How to Minimize Fraud Using Data
An Example Using Made-Up Data
The Analytics Process – CRISP-DM (Cross Industry Standard Process for Data Mining)
Business Understanding
• Understand the business goal
• Assess the situation
• Translate the business goal into a data mining objective
• Develop a project plan
Steps
• Anomalous Providers
• Chiropractors for 3 months
• Determine which fields will address the goal
• Write down all steps needed to complete task
The Analytics Process – CRISP-DM (Cross Industry Standard Process for Data Mining)
Data Understanding
• Consider data requirements
• Collect and explore the data
• Determine quality of the data
Steps
• Where is the data, what is needed?
• Collect and summarize the data
• Determine quality of the data
The Analytics Process – CRISP-DM (cont’d) (Cross Industry Standard Process for Data Mining)
Data Preparation
• Select needed data
• Data acquisition
• Data integration and formatting
• Data cleaning
• Data transformation and enrichment
Steps
• Chiropractor Claims
• System – SQL Server
• One data set
• Ensure data is complete/accurate
• Pull only variables that are needed
Project Plan – Chiropractor Claims
1. Determine if there are any questionable billings from Providers
2. Get a sample of the data to determine the quality/quantity of the data, as well as field names. (Chiropractor claims for X time frame)
3. Summarize the data based on procedure codes by payment and count
4. Speak with a Chiropractor or do research into procedures they can perform
5. Determine how many procedure codes to use in your analysis
6. Request data from IT or pull it yourself.
7. Summarize data, again, based on procedure codes by payment and count
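The summarization steps in the plan (by procedure code, by payment and count) can be sketched as follows. The claim rows and field layout are hypothetical; in practice the data would come from the claims system rather than a hard-coded list.

```python
from collections import defaultdict

# Hypothetical claim lines: (provider, procedure_code, paid_amount)
claims = [
    ("A", "98941", 35.0),
    ("A", "98942", 45.0),
    ("B", "98941", 35.0),
    ("B", "97012", 18.0),
    ("B", "98941", 35.0),
]

def summarize_by_code(rows):
    """Return {procedure_code: (claim_count, total_paid)}."""
    summary = defaultdict(lambda: [0, 0.0])
    for _provider, code, paid in rows:
        summary[code][0] += 1
        summary[code][1] += paid
    return {code: (count, paid) for code, (count, paid) in summary.items()}

by_code = summarize_by_code(claims)
# Sort by total paid, highest first, so the largest codes are reviewed first.
ranked = sorted(by_code.items(), key=lambda item: item[1][1], reverse=True)
```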
Criteria is Very Important
• Do you want:
  • To do the analysis once?
  • To provide bad results?
  • Your work to be valued and used?
  • To explain why it is incorrect?
• Understanding the business problem and data wrangling (cleaning and transforming the data) are 80% of the work
  • Repeating this step is very inefficient
  • Once you have a process, continue to use it. Refine when necessary.
Warranty Information - Criteria
• Global/Specific Region
• Local or US currency
• Dealer/City/State/Country
• Labor codes
• Final payment flag
• Category of Warranty
• Total/labor/part payment
• Vehicle Production information
• Dealer information
• Payment schedules
• Model Year/Calendar Year
• Model
• Engine RPO
• Transmission RPO
With this many variables, it is wise to understand exactly what the requestor wants. This process is iterative, but should not be repeated over and over from the beginning. Get the correct data the first time!!
Chiropractic Medicine – Top 14 Procedure Codes – 3 Months of Data
Procedure Code | Description | Claim Count | Paid
98941 | Chiropractic manipulative treatment (CMT); spinal, three to four regions | 2,281,358 | 78,064,877
98942 | Chiropractic manipulative treatment (CMT); spinal, five regions | 462,381 | 20,663,399
98940 | Chiropractic manipulative treatment (CMT); spinal, one to two regions | 591,984 | 15,872,048
97012 | Traction, mechanical | 1,219,642 | 21,541,061
72100 | Radiologic examination, spine, lumbosacral; two or three views | 64,904 | 2,579,048
72070 | Radiologic examination, spine; thoracic, two views | 53,622 | 2,071,041
99203 | Office or other outpatient visit for the evaluation and management of a new patient | 30,399 | 2,033,939
72050 | Radiologic examination, spine, cervical; minimum of four views | 33,795 | 1,821,410
72040 | Radiologic examination, spine, cervical; two or three views | 48,493 | 1,773,110
72010 | Radiologic examination, spine, entire, survey study, anteroposterior and lateral | 22,552 | 1,491,338
99202 | Office or other outpatient visit for the evaluation and management of a new patient | 30,809 | 1,306,167
72170 | Radiologic examination, pelvis; one or two views | 33,201 | 1,032,780
72110 | Radiologic examination, spine, lumbosacral; minimum four views | 12,319 | 675,283
99213 | Office or other outpatient visit for the evaluation and management of an established patient | 17,432 | 639,408
Total of above procedure codes: 151,564,909
Total of all chiropractic procedure codes: 153,626,380
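The 75% and 14% labels that follow appear to be shares of the total paid for all chiropractic procedure codes. The arithmetic below, using the paid amounts from the table, is consistent with that reading: the three manipulation codes together and mechanical traction, respectively. The grouping is an inference from the numbers, not something the slide states.

```python
# Paid amounts taken from the table above.
paid = {
    "98941": 78_064_877,
    "98942": 20_663_399,
    "98940": 15_872_048,
    "97012": 21_541_061,
}
total_all_codes = 153_626_380  # total of all chiropractic procedure codes

manipulation_paid = paid["98940"] + paid["98941"] + paid["98942"]
manipulation_share = manipulation_paid / total_all_codes
traction_share = paid["97012"] / total_all_codes

print(f"Manipulations: {manipulation_share:.0%}, traction: {traction_share:.0%}")
```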
75%
14%
Manipulations
There are five regions of the spine:
1. Cervical
2. Thoracic
3. Lumbar
4. Sacral
5. Coccygeal
Procedure codes:
98940 – manipulation of one to two regions
98941 – manipulation of three to four regions
98942 – manipulation of five regions
Providers with High Percentages of 98942
Provider Code | Total Prov Pymt | Total Count | 98942 Prov Pymt | % 98942 | 98941 Prov Pymt | % 98941 | 98940 Prov Pymt | % 98940
All Claims | 114,600,324 | 3,335,723 | 20,663,399 | 18% | 78,064,877 | 68% | 15,872,048 | 14%
J | 424,710 | 9,750 | 405,937 | 95% | 17,824 | 5% | 949 | 0%
E | 305,118 | 6,305 | 300,941 | 99% | 2,345 | 1% | 1,832 | 0%
F | 280,632 | 6,181 | 253,826 | 91% | 19,343 | 9% | 7,464 | 0%
G | 255,393 | 5,268 | 236,332 | 93% | 14,265 | 7% | 4,796 | 0%
H | 245,337 | 5,684 | 229,482 | 93% | 14,611 | 7% | 1,243 | 0%
C | 239,414 | 5,714 | 208,748 | 86% | 25,928 | 14% | 4,738 | 0%
I | 213,776 | 5,212 | 202,576 | 94% | 11,200 | 6% | - | 0%
A | 226,838 | 5,080 | 194,337 | 84% | 26,907 | 16% | 5,594 | 0%
K | 216,626 | 4,803 | 191,203 | 88% | 20,330 | 12% | 5,093 | 0%
B | 136,710 | 2,651 | 135,502 | 99% | 899 | 1% | 309 | 0%
D | 130,255 | 2,725 | 130,221 | 100% | 34 | 0% | - | 0%
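One way to reproduce this kind of comparison is to compute each provider's share of payments billed under 98942 and compare it with the share across all claims. The sketch below uses a few rows from the table; the 80% cutoff is a hypothetical threshold chosen for illustration, not a rule from the presentation.

```python
# A few rows from the table above: provider -> (total paid, paid under 98942).
rows = {
    "All Claims": (114_600_324, 20_663_399),
    "E": (305_118, 300_941),
    "B": (136_710, 135_502),
    "D": (130_255, 130_221),
}

THRESHOLD = 0.80  # hypothetical cutoff for a "high" share of 98942

def share_98942(total_paid, paid_98942):
    """Fraction of manipulation payments billed under 98942."""
    return paid_98942 / total_paid

baseline = share_98942(*rows["All Claims"])
flagged = [
    provider
    for provider, (total, paid) in rows.items()
    if provider != "All Claims" and share_98942(total, paid) > THRESHOLD
]
```

Here `baseline` is roughly 18%, so providers billing nearly everything under the five-region code stand well apart from the norm.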
What Else Can You Do With The Data?
• Summarize by Provider and sort payment in descending order – do the results make sense? Are there outliers?
• How do you determine outliers? The easiest way is boxplots.
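A boxplot conventionally marks points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The same rule can be applied numerically, as in this sketch; the payment values are hypothetical.

```python
import statistics

def boxplot_outliers(values):
    """Return values outside the 1.5 * IQR whiskers used by a standard boxplot."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical total payments per provider
payments = [100, 110, 120, 130, 140, 150, 160, 1000]
outliers = boxplot_outliers(payments)
```

An outlier found this way is a prompt for further review, not evidence of fraud on its own.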
What Else Can You Do With The Data? (Cont’d)
• Determine how much Providers have worked in the 3 month period.
  • Have any providers worked Sundays/Holidays?
  • Every day? (see next slide)
  • What is the average number of days worked, and have some providers worked far more?
• Summarize by Provider and procedure code.
  • Do some Providers bill certain procedure codes far more often than others?
• Summarize by Provider and Patient counts per day (see next slide)
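The days-worked and patients-per-day questions can be answered directly from claim dates. A minimal sketch, assuming each claim carries a provider, a service date, and a patient identifier (all values here are made up):

```python
from collections import defaultdict
from datetime import date

# Hypothetical claims: (provider, service_date, patient_id)
claims = [
    ("P1", date(2017, 1, 2), "pt1"),
    ("P1", date(2017, 1, 2), "pt2"),
    ("P1", date(2017, 1, 3), "pt1"),
    ("P2", date(2017, 1, 1), "pt3"),  # January 1, 2017 was a Sunday
    ("P2", date(2017, 1, 2), "pt3"),
    ("P2", date(2017, 1, 3), "pt4"),
]

patients_per_day = defaultdict(set)  # (provider, day) -> distinct patients
for provider, day, patient in claims:
    patients_per_day[(provider, day)].add(patient)

days_worked = defaultdict(int)
sundays_worked = defaultdict(int)
for provider, day in patients_per_day:
    days_worked[provider] += 1
    if day.weekday() == 6:  # Monday is 0, Sunday is 6
        sundays_worked[provider] += 1
```

A holiday check would work the same way against a list of holiday dates.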
Provider P2
Rule-Based Data Mining
• Allergy Testing and Immunotherapy
  • Billing can be confusing – this procedure code is billed by quantity, based on dosages, not on cc’s in a vial and not on the number of vials.
  • Percutaneous testing is done first, and then intracutaneous testing is done for any inconclusive results.
  • Specialty – Allergists generally perform 7 times more percutaneous than intracutaneous tests. Other specialties perform anywhere from 0 to many percutaneous tests per intracutaneous test.
  • Percutaneous tests are always done first. Other specialties are not following procedure.
  • Providers often bill based on cc’s. This leads to huge overpayments. Almost 90% of dosages should be between 1 and 20. Amounts above this level may be worth looking at.
  • Evaluation and Management services should not be reported simultaneously with allergy/immunotherapy injections unless a separate service was provided.
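The dosage rule above translates into a simple filter. This is a sketch with hypothetical claim lines; the limit of 20 comes from the range quoted in the slide, and anything above it is only a candidate for review.

```python
# Hypothetical immunotherapy claim lines: (claim_id, provider, billed_units)
claims = [
    ("c1", "P1", 4),
    ("c2", "P1", 12),
    ("c3", "P2", 150),  # possibly billed in cc's rather than dosages
    ("c4", "P3", 20),
]

MAX_EXPECTED_UNITS = 20  # upper end of the 1 to 20 range mentioned above

def flag_high_units(rows, limit=MAX_EXPECTED_UNITS):
    """Return claims whose billed units exceed the expected dosage range."""
    return [row for row in rows if row[2] > limit]

for_review = flag_high_units(claims)
```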
Rule-Based Data Mining (cont’d)
• Echocardiography
  • 93307 - Echocardiography, transthoracic, real-time with image documentation (2D), includes M-mode recording, when performed, complete, without spectral or color Doppler echocardiography
  • 93320 - Doppler echocardiography, pulsed wave and/or continuous wave with spectral display (List separately in addition to codes for echocardiographic imaging); complete
  • 93325 - Doppler echocardiography color flow velocity mapping (List separately in addition to codes for echocardiography)
  • Medically used to evaluate the possible diseases of the aorta, shunts, septal defects, and to determine the severity of any meaningful heart valve narrowing (stenosis) or regurgitation (leaking backward) or of evaluations of prosthetic valves.
ACC/AHA Guidelines for the Clinical Application of Echocardiography. Circulation. 1997;95:1686-1744
Ideas for Data Mining
• Ask an expert what odd things they have seen and pull data for the expert to review
  • Next steps would be to discuss the data and whether further analysis is needed
• Prior audits with bad controls can be good starting points
• Research what types of fraud are occurring in healthcare
• Pharmacy
  • Providers prescribing medications that do not make sense with their specialty (Podiatrist prescribing heart medication)
  • Quantity of medication prescribed in total and by patient
  • Pharmacist with highest fulfillment of controlled substances
Ideas for Financial Fraud
• Vendor Fraud
  • Audit vendor list and remove any vendors that are no longer providing services and verify remaining vendors on the list
  • Review newest vendors – last 5 years
  • Review recurring payments, especially direct deposits
  • Review employees’ banking information against vendor banking information
  • Analyze data for any patterns
    • Analyze further for any suspicious transactions
  • Verify currency rates are calculated correctly
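Comparing employee banking information against vendor banking information amounts to looking for account details that appear in both lists. A minimal sketch, with hypothetical names and account numbers:

```python
# Hypothetical records: name -> (routing_number, account_number)
employees = {
    "emp_001": ("111000025", "12345678"),
    "emp_002": ("222000111", "87654321"),
}
vendors = {
    "vendor_A": ("333000222", "55550000"),
    "vendor_B": ("111000025", "12345678"),  # same account as emp_001
}

def shared_accounts(employees, vendors):
    """Return (employee, vendor) pairs that share the same bank account."""
    by_account = {}
    for emp, account in employees.items():
        by_account.setdefault(account, []).append(emp)
    return [
        (emp, vendor)
        for vendor, account in vendors.items()
        for emp in by_account.get(account, [])
    ]

matches = shared_accounts(employees, vendors)
```

A match does not prove fraud, but a vendor paid into an employee's account is worth a closer look.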
It was three feet deep on average.
Visualization – Make Your Results Shine
10 GOLDEN RULES WILL HELP YOU CREATE THE MOST SUCCESSFUL DATA VISUALIZATIONS
1. Begin with a goal. Starting with a goal provides the foundation to bring together ingredients of data visualization with a purpose: prompting a decision or action, or inviting an audience to explore the data to find new insights.
2. Know your data. This knowledge will also serve to verify that you have the best data to support your goal.
3. Put your audience first. Data visualization is rarely one size fits all, and its message can be lost if it’s not customized for its audience. What does your audience need to know?
http://www.dbta.com/BigDataQuarterly/Articles/10-Golden-Rules-of-Data-Visualization-114796.aspx
10 GOLDEN RULES WILL HELP YOU CREATE THE MOST SUCCESSFUL DATA VISUALIZATIONS
4. Be media sensitive. iPad, Notebook, Cell Phone, etc. Considering how the visualization will be viewed will help you make sure your visualization reaches its audience.
5. Choose the right chart. Know the strengths of each chart type and what key features of data they best visualize.
6. Chart smart. Data visualizations should not distort, mislead, or misrepresent. Avoid cherry picking data and do not force the data to fit a message.
http://www.dbta.com/BigDataQuarterly/Articles/10-Golden-Rules-of-Data-Visualization-114796.aspx
750,000 vs 700,000
distorted
not distorted
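A common source of this kind of distortion is a value axis that does not start at zero. The original slide image is not available here, so the sketch below simply assumes a truncated axis and shows how it would exaggerate the 750,000 vs 700,000 comparison; the 690,000 baseline is a hypothetical choice.

```python
def apparent_ratio(a, b, axis_start=0):
    """Ratio of drawn bar heights when the value axis starts at axis_start."""
    return (a - axis_start) / (b - axis_start)

# Axis starting at zero: bar heights reflect the real difference (about 7%).
true_ratio = apparent_ratio(750_000, 700_000)

# Hypothetical truncated axis starting at 690,000: the taller bar looks 6x as tall.
truncated_ratio = apparent_ratio(750_000, 700_000, axis_start=690_000)
```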
10 GOLDEN RULES WILL HELP YOU CREATE THE MOST SUCCESSFUL DATA VISUALIZATIONS
7. Use labels wisely. Give your audience context by including a simple and compelling title. Use easy-to-read labels.
8. Design to the point. The key to designing data visualizations is to be straightforward. Ultimately, make sure everything on the visualization serves a purpose.
9. Let the data speak. Use visual cues strategically to guide the audience and draw their attention, but let the data tell the story, not the design.
10. Feedback is a good thing. Take time to fine-tune visualizations by engaging with stakeholders to gather feedback.
http://www.dbta.com/BigDataQuarterly/Articles/10-Golden-Rules-of-Data-Visualization-114796.aspx
How do you know which type of data viz works best? Consider the following guidelines:
• Line Charts track changes or trends over time and show the relationship between two or more variables.
• Bar Charts are used to compare quantities of different categories.
• Scatter Plots show joint variation of two data items.
• Bubble Charts show joint variation of three data items.
• Pie Charts are used to compare parts of a whole and should be used carefully.
  • Never compare two pie charts without clearly noting that the size of the pie may have changed as well.
https://www.gooddata.com/blog/5-data-visualization-best-practices
https://img.labnol.org/di/data-chart-type.png?_ga=1.47105312.600219594.1492990363
Do’s and Don’ts of Charts
Do’s
• Use the full axis
• Simplify less important info
• Be creative with legends/labels
• Pass the squint test
• Ask the opinion of others
Don’ts
• Use 3-D or blow apart effects
• Use more than 6 colors
• Change the style of charts that are being compared
• Make users do visual math
• Overload the chart
Some Really Bad Visuals
http://www.businessinsider.com/the-27-worst-charts-of-all-time-2013-6#did-anyone-learn-anything-by-looking-at-this-pseudo-pie-chart-what-do-these-colors-even-mean-why-is-it-divided-into-quadrants-well-never-know-1
http://www.businessinsider.com/the-27-worst-charts-of-all-time-2013-6#i-never-thought-it-was-possible-but-i-actually-understand-soccer-less-after-looking-at-this-chart-3
http://www.businessinsider.com/the-27-worst-charts-of-all-time-2013-6#theres-a-lot-going-on-with-this-bloomberg-chart-that-doesnt-seem-like-an-evenly-cut-lamb-chop-and-while-im-not-a-biologist-i-have-a-strong-feeling-an-onion-is-not-a-melon-4
What Do You Think of This Visual?
Question Everything!!!
• Data can be used to show whatever a person wants to show
• Make sure it is fair and unbiased
• The method and assumptions should be stated.
• Ask to see the data, if necessary
Providers with High Percentages of 98942
Provider Code | Total Prov Pymt | Total Count | 98942 Prov Pymt | % 98942 | 98941 Prov Pymt | % 98941 | 98940 Prov Pymt | % 98940
All Claims | 114,600,324 | 3,335,723 | 20,663,399 | 18% | 78,064,877 | 68% | 15,872,048 | 14%
J | 424,710 | 9,750 | 405,937 | 95% | 17,824 | 5% | 949 | 0%
E | 305,118 | 6,305 | 300,941 | 99% | 2,345 | 1% | 1,832 | 0%
F | 280,632 | 6,181 | 253,826 | 91% | 19,343 | 9% | 7,464 | 0%
G | 255,393 | 5,268 | 236,332 | 93% | 14,265 | 7% | 4,796 | 0%
H | 245,337 | 5,684 | 229,482 | 93% | 14,611 | 7% | 1,243 | 0%
C | 239,414 | 5,714 | 208,748 | 86% | 25,928 | 14% | 4,738 | 0%
I | 213,776 | 5,212 | 202,576 | 94% | 11,200 | 6% | - | 0%
A | 226,838 | 5,080 | 194,337 | 84% | 26,907 | 16% | 5,594 | 0%
K | 216,626 | 4,803 | 191,203 | 88% | 20,330 | 12% | 5,093 | 0%
B | 136,710 | 2,651 | 135,502 | 99% | 899 | 1% | 309 | 0%
D | 130,255 | 2,725 | 130,221 | 100% | 34 | 0% | - | 0%
Four Providers Against the Norm
68%
14%
18%
100%
Which one is easier/quicker to understand?
Some Really Good Visuals
Counts
YEAR
Pinpoints problem and is easy to understand
Normalizes data for comparison
Many pages of documentation led to the final results. Management does not want to see the documentation, just the results.
This one-pager describes an investigation of 12 officers regarding 10 contracts worth almost $1B.
The costs are broken out by various categories that are informative at a high level.
The highest costs are at the top of the page.
The lowest costs are at the bottom.
The currency is shown in $ and N (naira). Should be consistent.
Currency rate: 1N = $.0033
N22.5bn = $73.4m
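As a quick check on the figures above, the implied exchange rate can be recomputed from the two amounts. The sketch suggests the quoted rate is a rounded value: the implied rate rounds to 0.0033, while applying the rounded rate directly gives a slightly higher dollar figure.

```python
naira_amount = 22.5e9    # N22.5bn, from the slide
dollar_amount = 73.4e6   # $73.4m, from the slide

implied_rate = dollar_amount / naira_amount         # dollars per naira
converted_at_rounded_rate = naira_amount * 0.0033   # dollars, using the rounded rate

print(f"Implied rate: {implied_rate:.6f}")
print(f"At 0.0033: ${converted_at_rounded_rate / 1e6:.2f}m")
```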
http://www.orodataviz.com/project/arms-procurement-fraud/
https://www.slideshare.net/Linkurious/kick-start-graph-visualization-projects
https://public.tableau.com/en-us/s/gallery/world-golf-rankings
Predictive or Advanced Analytics
The Internet of Things (IoT)
• The Internet of things (IoT) is the inter-networking of physical devices, vehicles, buildings, and other items—embedded with electronics, software, sensors, actuators, and network connectivity that enable these objects to collect and exchange data.
• In short, connected devices can talk to each other. This is used in autonomous vehicles, among other things.
Q. Why Has Predictive Analytics Become Popular?
A. The explosion of big data and the lowering cost of storing data
• What is big data?
  • 100 million rows of a relational database?
  • 100,000 rows of 10 connected relational databases?
  • All financial statement information/numbers?
• What specifically makes big data?
  • Words, pictures
• Examples of big data
  • Internet clickstream data
  • Web server logs
  • Social media content
  • Text from customer emails and survey responses
  • Mobile-phone call/text-detail records
  • Machine data captured by sensors connected to the internet of things.
https://www.google.com/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=0ahUKEwjgouL8s8DTAhXK7YMKHYQ_CwsQjB0IBg&url=http%3A%2F%2Fwww.bangkokbiznews.com%2Fblog%2Fdetail%2F636201&psig=AFQjCNH_FRAXch9T60kP7gfUUH01NX61YA&ust=1493237070424052
The 4 V’s of Big Data
http://www.abajournal.com/magazine/article/the_dawn_of_big_data
The need for large data storage has grown due to unstructured data. The cost of storage has been coming down. In 1960, hard drive storage cost almost $1 million per gigabyte. Now, it is under a dime.
In November 2015, a petabyte (1 trillion kilobytes) of data cost $48,800.
http://www.mkomo.com/cost-per-gigabyte-update
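The petabyte price and the "under a dime" per-gigabyte claim can be tied together with one division, using decimal units (1 PB = 1,000,000 GB), which matches the "1 trillion kilobytes" definition above.

```python
cost_per_petabyte = 48_800      # dollars, November 2015 figure quoted above
gigabytes_per_petabyte = 10**6  # decimal units: 1 PB = 1,000,000 GB

cost_per_gigabyte = cost_per_petabyte / gigabytes_per_petabyte
print(f"${cost_per_gigabyte:.4f} per gigabyte")
```

That works out to just under five cents per gigabyte.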
Supervised vs Unsupervised Learning
• Supervised Learning
  • Output answer is provided.
    • Fraud / No Fraud
    • Yes / No
    • A value is provided, e.g., house value
  • Classification/Neural Nets/Decision Trees/Linear and Logistic Regression/Others
• Unsupervised Learning
  • No output answer is provided
    • Looks for outliers
    • Marketing profiles
  • Clustering/K-means/Gaussian Mixture Models
Supervised Learning Algorithms
http://www.dataschool.io/comparing-supervised-learning-algorithms/
Supervised Learning Algorithms (cont’d)
http://www.dataschool.io/comparing-supervised-learning-algorithms/
http://scikit-learn.org/stable/tutorial/machine_learning_map/
Unsupervised Learning
Supervised Learning
Supervised Learning
Four types of predictive analytic models are being used to detect fraud: rules-based, anomaly, predictive, and social networking.
1. Rules-based models flag certain charges automatically.
2. Anomaly models raise suspicion based on factors that seem improbable.
3. Predictive models compare charges against a fraud profile and raise suspicion.
4. Social networking models raise suspicion based on the associations of a provider.
Examples:
1. A podiatrist prescribing heart medication.
2. A provider who billed more procedures than could possibly be performed in a day.
3. A provider billing in a fashion similar to previously known fraudsters.
4. Certain providers who worked with previously known fraudulent providers.
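The second example above (a provider who billed more procedures than could be performed in a day) can be checked by converting billed procedures into time. This is a rough sketch: the minutes per procedure and the 12-hour ceiling are hypothetical assumptions, and real values would need input from a clinical expert.

```python
from collections import defaultdict

# Hypothetical typical minutes per procedure code (not from the slides)
MINUTES_PER_PROCEDURE = {"98940": 15, "98941": 20, "98942": 25}
MAX_MINUTES_PER_DAY = 12 * 60  # hypothetical ceiling for one provider's working day

# Hypothetical billed lines: (provider, service_date, procedure_code, count)
billed = [
    ("P1", "2017-01-03", "98941", 20),
    ("P2", "2017-01-03", "98942", 40),
]

minutes = defaultdict(int)
for provider, day, code, count in billed:
    minutes[(provider, day)] += MINUTES_PER_PROCEDURE[code] * count

impossible_days = [key for key, total in minutes.items() if total > MAX_MINUTES_PER_DAY]
```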
http://www.modernhealthcare.com/article/20150225/NEWS/150229947
Credit Card Fraud - $190B/year
Location:
• Live in one place, but make a purchase in another
What you buy:
• If your card is commonly used to buy your morning cup of coffee and then a tank of gas, and out of the blue is used to buy a pair of expensive designer shoes
Spending amount:
• If you typically spend $500/month, and suddenly rack up $3000 in a week
Spending frequency:
• If your card is used to make a large number of purchases over a short period of time
Large purchase after a smaller one:
• Thieves typically test stolen credit cards with smaller purchases first; if the card works, they will proceed to make another larger purchase, like an expensive camera, or television, or sound system
Digital origins:
• The digital origins of purchases are recorded and analyzed with each purchase; if an IP address had been used to commit fraud in the past, subsequent transactions from the same IP or network may be flagged
https://thinksaveretire.com/2015/09/14/how-credit-card-fraud-detection-works/
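The "large purchase after a smaller one" signal lends itself to a simple rule. The sketch below is illustrative only: the amounts, thresholds, and time window are hypothetical, and real card-fraud systems combine many signals rather than relying on a single rule.

```python
# Hypothetical transactions for one card, in time order: (minutes_since_start, amount)
transactions = [(0, 4.50), (600, 42.00), (1500, 1.00), (1510, 1899.00)]

SMALL = 5.00         # hypothetical "test purchase" ceiling
LARGE = 1000.00      # hypothetical "large purchase" floor
WINDOW_MINUTES = 60  # hypothetical time window between the two

def small_then_large(txns):
    """Return (small, large) pairs where a large purchase closely follows a small one."""
    hits = []
    for (t1, a1), (t2, a2) in zip(txns, txns[1:]):
        if a1 <= SMALL and a2 >= LARGE and t2 - t1 <= WINDOW_MINUTES:
            hits.append(((t1, a1), (t2, a2)))
    return hits

suspicious = small_then_large(transactions)
```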
Neural Network – Thinks Like a Human
A computer system modeled on the human brain and nervous system
• Considered the black box of predictive analytics as you cannot explain how the algorithm came up with the result
inputs, outputs
http://www.ijsce.org/attachments/File/NCAI2011/IJSCE_NCAI2011_025.pdf
Black Box
Cluster Analysis
Unsupervised Learning: Clusters 1, 2, and 5 are considered outliers and may be fraudulent. Additional research should be performed on these data points.
https://pdfs.semanticscholar.org/c42e/861472fa629c37cf7ba0d329d168f0e2f890.pdf
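Once a clustering algorithm has assigned a cluster label to each data point, very small clusters can be singled out for review. The labels and the 10% cutoff below are hypothetical, chosen so that clusters 1, 2, and 5 come out small; the slide's actual data and criteria are not available.

```python
from collections import Counter

# Hypothetical cluster labels, one per provider, as produced by a clustering algorithm
labels = [3, 3, 4, 3, 4, 4, 3, 1, 4, 3, 2, 4, 3, 5, 4, 3, 4, 3, 4, 3]

MIN_SHARE = 0.10  # hypothetical: clusters holding under 10% of points are treated as outliers

sizes = Counter(labels)
small_clusters = sorted(c for c, n in sizes.items() if n / len(labels) < MIN_SHARE)

# Positions of the data points that fall in a small cluster
to_review = [i for i, label in enumerate(labels) if label in small_clusters]
```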
Investigations – Text Analytics
How did the FBI analyze 600,000 emails in just a few days?
• More efficient discovery through automated tools
• Search for various keywords and terms related to classified information and flag these emails for further review.
• Remove materials based on keywords in email From: fields - any emails sent from Netflix, Amazon, or eBay
• Remove duplicates using signatures or hash-representations of files
• Manual review of flagged emails
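The automated steps listed above (keyword search, sender filtering, and duplicate removal by hash) can be sketched in a few lines. This is only an illustration of the general approach, not the process actually used; the emails, keywords, and domains are hypothetical, and the filters are applied before the keyword search purely to reduce work.

```python
import hashlib

# Hypothetical emails: (sender, body)
emails = [
    ("alerts@netflix.com", "New shows this week"),
    ("a@example.gov", "Attached is the classified briefing"),
    ("b@example.gov", "Attached is the classified briefing"),  # duplicate body
    ("c@example.gov", "Lunch on Friday?"),
]

EXCLUDED_DOMAINS = {"netflix.com", "amazon.com", "ebay.com"}
KEYWORDS = {"classified", "secret"}  # hypothetical search terms

def triage(messages):
    """Drop excluded senders and duplicate bodies, then flag keyword matches."""
    seen, flagged = set(), []
    for sender, body in messages:
        if sender.rsplit("@", 1)[-1].lower() in EXCLUDED_DOMAINS:
            continue
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if any(word in body.lower() for word in KEYWORDS):
            flagged.append((sender, body))
    return flagged

for_manual_review = triage(emails)
```

Only the flagged messages go on to manual review, which is what makes a large volume manageable in a short time.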
586 514-0135