Emily Higgins: Data Mining


Upload: grace-riley

Post on 23-Dec-2015



TRANSCRIPT

  • Slide 1
  • Emily Higgins DATA MINING
  • Slide 2
  • Data Triangle
  • Slide 3
  • Quotes
    • “Drowning in data but starving for knowledge”
    • “Computers have promised us a fountain of wisdom but delivered a flood of data”
  • Slide 4
  • Definition
    • According to Encyclopedia Britannica, data mining is the process of discovering interesting and useful patterns and relationships in large volumes of data.
    • It incorporates statistics and artificial intelligence (e.g. neural networks and machine learning) to analyze data sets.
  • Slide 5
  • A Brief History
    • The term “data mining” was introduced in the 1990s, but its roots go back many years.
    • 1960s:
    • 1980s:
  • Slide 6
  • A Brief History cont.
    • 1990s:
    • Data mining has gone through, and continues to go through, many processes that further improve the technology.
  • Slide 7
  • Data Warehouse
    • A relational database designed for query and analysis.
    • Contains historical data and enables the consolidation of data from several sources.
    • Subject oriented: organized around a subject such as sales, so that questions like “Who was the best customer for this item in the past year?” can be answered.
    • Integrated: brings data from different sources into a single consistent format.
    • Nonvolatile: once entered, data should not change.
    • Time variant: focuses on change over time.
  • Slide 8
  • Data Warehouse Architecture (basic)
    • Summary data are very important because they pre-compute long operations in advance.
  • Slide 9
  • Data Warehouse Architecture (with staging area and data mart)
    • Staging area: data must be cleaned before being put into a warehouse; this is typically done in a staging area.
    • Data marts: allow customization of the warehouse's architecture. Data marts are systems designed for a particular line of business.
  • Slide 10
  • Data Warehouse vs. OLTP (Online Transaction Processing)
    • Data warehouse
      • Not usually in 3NF (third normal form, a type of data normalization)
      • Optimized to perform well for a wide variety of possible query operations
      • Updated on a regular basis through the ETL process
      • Typical query scans thousands or millions of rows (e.g. “Find the total sales for all customers last month”)
      • Stores many months or even years of data
    • OLTP
      • Usually in 3NF
      • Supports only predefined operations
      • Updated by end users
      • Typical query accesses only a handful of records (e.g. “Retrieve the current order for this customer”)
      • Stores only the data needed to meet the requirements of the current transaction
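
The two example queries on this slide can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the `orders` table, its columns, and the sample rows are assumptions made up for the example, and a single in-memory table stands in for what would normally be two separate systems.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 101, "2014-09-03", 20.0),
        (2, 102, "2014-09-15", 35.5),
        (3, 101, "2014-09-28", 12.5),
        (4, 103, "2014-10-02", 50.0),
    ],
)

# OLTP-style lookup: touch a handful of records for one customer.
current_order = conn.execute(
    "SELECT order_id, amount FROM orders "
    "WHERE customer_id = ? ORDER BY order_date DESC LIMIT 1",
    (101,),
).fetchone()

# Warehouse-style query: scan every row for a month and aggregate.
monthly_totals = conn.execute(
    "SELECT customer_id, SUM(amount) FROM orders "
    "WHERE order_date >= '2014-09-01' AND order_date < '2014-10-01' "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()

print(current_order)
print(monthly_totals)
```

The first query is answered from one customer's rows; the second has to read every order in the date range, which is the access pattern a warehouse is organized around.
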
  • Slide 11
  • Data Mining and Data Warehousing
    • Data mining: the process of statistical analysis.
    • Data warehousing: the process of designing how data is stored in order to improve analysis.
    • Relationship between the two: properly warehoused data is easier to mine.
    • Problem: a data mining query that has to search through terabytes of data spread across multiple databases on different networks would be very inefficient.
    • Solution: data warehouse experts design a storage system that closely connects relevant data held in different databases. The data miner can then run much more efficient queries.
  • Slide 12
  • Data Mining Process Overview
    • Extract, transform, and load (ETL) data into a data warehouse system
    • Store and manage the data in a multidimensional database system
    • Provide data access to business analysts and information technology professionals
    • Analyze the data with application software
    • Present the data in a useful format
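
The first step of this overview, ETL, might look roughly like the sketch below. Everything specific in it is an assumption for illustration: the CSV text, the cleaning rules (trim and lowercase names, drop rows with no amount), and the `sales` table are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with inconsistent names and a missing amount.
RAW_CSV = """customer,amount,date
 Alice ,20.00,2014-09-03
bob,35.50,2014-09-15
ALICE,12.50,2014-09-28
carol,,2014-10-02
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize names and drop incomplete records."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # incomplete record, left out of the warehouse
        name = row["customer"].strip().lower()
        cleaned.append((name, float(row["amount"]), row["date"]))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL, sale_date TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Because the names are cleaned before loading, both spellings of "Alice" end up under one customer when the data is later analyzed.
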
  • Slide 13
  • Data Mining Techniques: Decision Trees
    • A classification algorithm (the most common kind for predicting specific outcomes)
    • Based on conditional probabilities
    • Generates rules: conditional statements that humans can read and that can be used within a database to identify a set of records
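
To make the idea of readable rules backed by conditional probabilities concrete, here is a minimal sketch of a one-level tree (a "decision stump"), not a full tree-growing algorithm. The records, the attribute names `member` and `coupon`, and the choice to pick the split by simple accuracy are all assumptions for illustration; real decision tree learners usually use measures such as entropy or Gini impurity.

```python
from collections import Counter, defaultdict

# Hypothetical customer records, invented for illustration only.
records = [
    {"member": "yes", "coupon": "yes", "bought": "yes"},
    {"member": "yes", "coupon": "no", "bought": "yes"},
    {"member": "yes", "coupon": "yes", "bought": "yes"},
    {"member": "no", "coupon": "yes", "bought": "no"},
    {"member": "no", "coupon": "no", "bought": "no"},
    {"member": "no", "coupon": "yes", "bought": "yes"},
]

def rules_for(attribute, target, rows):
    """For each value of the attribute, find the majority outcome and
    the conditional probability P(outcome | attribute = value)."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[attribute]][row[target]] += 1
    rules, correct = [], 0
    for value, counts in sorted(by_value.items()):
        outcome, hits = counts.most_common(1)[0]
        rules.append((attribute, value, outcome, hits / sum(counts.values())))
        correct += hits
    return rules, correct

def best_split(attributes, target, rows):
    """Keep the attribute whose rules classify the most rows correctly."""
    candidates = (rules_for(a, target, rows) for a in attributes)
    return max(candidates, key=lambda pair: pair[1])[0]

rules = best_split(["member", "coupon"], "bought", records)
for attribute, value, outcome, prob in rules:
    print(f"IF {attribute} = {value} THEN bought = {outcome} (P = {prob:.2f})")
```

Each printed rule is a conditional statement that could be turned directly into a database filter, for example a `WHERE member = 'yes'` clause.
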
  • Slide 14
  • Data Mining Techniques: k-means Clustering
    • A data mining/machine learning algorithm that clusters observations into groups of related observations without any prior knowledge of those relationships
    • Commonly used in medical imaging and biometrics
    • Steps
      1. Arbitrarily select k points as the initial cluster centers (means).
      2. Assign each point in the dataset to the closest cluster, based on the distance between the point and each cluster center.
      3. Recompute each cluster center as the average of the points in that cluster.
      4. Repeat steps 2 and 3 until the clusters converge (no observations change cluster when steps 2 and 3 are repeated, or only minor changes occur that do not affect the definition of the clusters).
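
The four steps can be sketched in plain Python as follows. This is a minimal illustration under some assumptions not stated on the slide: points are 2-D tuples, distance is Euclidean, the sample `points` are made up, and an empty cluster simply keeps its previous center.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Cluster points into k groups following the four steps on the slide."""
    rng = random.Random(seed)
    # Step 1: arbitrarily pick k distinct points as the initial centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every point to its closest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step 4: stop once no observation changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, assignment

# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = k_means(points, k=2)
print(labels)
print(centers)
```

The result depends on the arbitrary starting centers in general, which is why practical implementations often run the algorithm several times with different seeds.
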
  • Slide 15
  • Data Mining Techniques: Neural Networks
    • Expand data mining's predictive power through their ability to learn from data
    • Mimic the neurophysiology of the human brain (learn from examples to find patterns in data)
    • Learn to predict by training on sample data: the network adjusts its weights to produce optimal predictions, then applies this knowledge to the data being mined
    • Use fuzzy logic, algorithms, etc. to make determinations
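
The idea of adjusting weights from sample data can be shown with the simplest possible case, a single artificial neuron trained with the classic perceptron rule. This is a sketch, not a full neural network: the training set (the logical AND of two inputs), the learning rate, and the number of epochs are assumptions chosen for illustration.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs exceeds zero, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train_neuron(samples, epochs=20, rate=0.5):
    """Perceptron rule: nudge the weights after every wrong prediction."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

# Hypothetical sample data: learn the logical AND of two inputs.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_neuron(samples)
predictions = [predict(weights, bias, inputs) for inputs, _ in samples]
print(weights, bias, predictions)
```

Once trained, the same `predict` call can be applied to inputs the neuron has not seen, which is the "applies this knowledge to the data being mined" step in miniature.
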
  • Slide 16
  • Data Mining Techniques: Visualization
    • Allows access to huge amounts of data
    • The human mind is not built to see relationships and patterns in tabular data
    • The optic nerve takes in huge amounts of data and sends it to the brain, which is very good at edge detection, shape recognition, and pattern detection
    • Example: Anscombe's Quartet. Four data sets share nearly identical means, variances, correlations, and regression lines. On a spreadsheet they are not of much interest; displayed graphically, however, it is a different story.
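
The claim about Anscombe's Quartet can be checked numerically. The sketch below computes the summary statistics for each of the four sets; the data values are the standard published quartet (Anscombe, 1973) rather than anything shown in the transcript, and the helper name `summarize` is just a label chosen for this example.

```python
from statistics import mean, variance

# Anscombe's quartet. Sets I to III share the same x values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Mean, sample variance, Pearson correlation, and least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": variance(xs),
        "var_y": variance(ys),
        "corr": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }

summaries = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four sets come out with a mean of x of 9, a mean of y of about 7.5, a correlation of about 0.82, and a fitted line close to y = 3 + 0.5x, even though their scatter plots look completely different.
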
  • Slide 17
  • Slide 18
  • Slide 19
  • Who Uses All of This?
    • Companies with a strong consumer focus: retail, financial, communication, and marketing organizations
      • E.g. Blockbuster mines its video rental history database to recommend rentals to individual customers
    • The NBA, in conjunction with image recordings, to help coaches orchestrate plays and strategies
    • The NYC Fire Department, to predict which buildings are most likely to catch fire
      • Using 60 factors, built an algorithm that assigns each of NYC's 330,000 inspectable buildings a risk score
    • Hospitals/Science
    • Government
  • Slide 20
  • Future of Data Mining
    • Everyday use by consumers: as common and easy to use as e-mail
      • Find the best airfare, cheapest prices, etc.
    • Detecting ecosystem disturbances
    • Intelligent systems may discover new treatments for diseases and insights into the nature of the universe
  • Slide 21
  • Works Cited
    • 50 Great Examples of Data Visualization (2009, June 1). In W. Retrieved October 22, 2014, from http://www.webdesignerdepot.com/2009/06/50-great-examples-of-data-visualization/
    • Adams, S. (Artist). (2000). Dilbert. [Image of painting]. Retrieved October 22, 2014, from http://search.dilbert.com/comic/Data%20Mining
    • Are data mining and data warehousing related? (n.d.). In howstuffworks. Retrieved October 22, 2014, from http://computer.howstuffworks.com/are-data-mining-and-data-warehousing-related.htm
    • Alexander, D. (n.d.). Data Mining. Retrieved October 22, 2014, from http://www.laits.utexas.edu/~anorman/BUS.FOR/course.mat/Alex/#9
    • Basu, S. (1997). Data Mining. Retrieved October 22, 2014, from https://www.siggraph.org/education/materials/HyperVis/applicat/data_mining/data_mining.html
    • Clifton, C. (n.d.). data mining. In Encyclopaedia Britannica. Retrieved October 22, 2014, from http://www.britannica.com/EBchecked/topic/1056150/data-mining
    • Data Mining: What is Data Mining? (n.d.). Retrieved October 22, 2014, from http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
    • Data Warehousing Concepts (n.d.). In Oracle9i Data Warehousing Guide Release 2 (9.2). Retrieved October 22, 2014, from http://docs.oracle.com/cd/B10500_01/server.920/a96520/concept.htm
    • Dwoskin, E. (n.d.). How New York's Fire Department Uses Data Mining. In Digits. Retrieved January 24, 2014, from http://blogs.wsj.com/digits/2014/01/24/how-new-yorks-fire-department-uses-data-mining/
    • Elvidge, S. (2014, August 4). Researchers hunt through gene chip data for Parkinson's clue. In GenoKeyBlog. Retrieved October 22, 2014, from http://blog.genokey.com/researchers-hunt-through-gene-chip-data-for-parkinsons-clue/
  • Slide 22
  • Works Cited cont.
    • fuzzy logic (n.d.). In whatis.com. Retrieved October 22, 2014, from http://whatis.techtarget.com/definition/fuzzy-logic
    • Hardes, T. (n.d.). Clustering. In Tobias Hardes. Retrieved October 23, 2014, from http://thardes.de/big-data-englisch/clustering/
    • History of Data Mining (2012, November 17). In SQLDataMining.com. Retrieved October 22, 2014, from http://www.sqldatamining.com/index.php/data-mining-basics/history-of-data-mining
    • Iliinsky, N. (n.d.). Why is Data Visualization So Hot?. In visual.ly. Retrieved October 22, 2014, from http://blog.visual.ly/why-is-data-visualization-so-hot/
    • Kumar, D., & Bhardwaj, D. (2011, September). Rise of Data Mining: Current and Future Application Areas. Retrieved October 22, 2014, from http://ijcsi.org/papers/IJCSI-8-5-1-256-260.pdf
    • Mailvaganam, H. (n.d.). Data Modeling and Mining. In Data Warehousing Review. Retrieved October 22, 2014, from http://www.dwreview.com/Data_mining/DM_models.html
    • Sample Data Visualization (n.d.). In Dundas. Retrieved October 22, 2014, from http://www.dundas.com/gallery/data-visualization-gallery/scorecard/
    • SQL tutorial (n.d.). In w3 resource. Retrieved October 22, 2014, from http://www.w3resource.com/sql/tutorials.php