Data Mining for Scientific Applications

  • 1. Introduction to Data Mining Natasha Balac, Ph.D.
  • 2. Outline
    • Motivation: Why Data Mining?
    • What is Data Mining?
    • History of Data Mining
    • Data Mining Functionality and Terminology
    • Data Mining Applications
    • Are all the Patterns Interesting?
    • Issues in Data Mining
  • 3. Necessity is the Mother of Invention
    • Data explosion
      • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
    • We are drowning in data, but starving for knowledge!
  • 4. Necessity is the Mother of Invention
    • We are drowning in data, but starving for knowledge!
    • Solution
      • Data Mining
        • Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
  • 5. Why DATA MINING?
    • Huge amounts of data
    • Electronic records of our decisions
      • Choices in the supermarket
      • Financial records
      • Our comings and goings
    • We swipe our way through the world – every swipe is a record in a database
    • Data rich – but information poor
    • Lying hidden in all this data is information!
  • 6. Data vs. Information
    • Society produces massive amounts of data
      • business, science, medicine, economics, sports, …
    • Potentially valuable resource
    • Raw data is useless
      • need techniques to automatically extract information
      • Data: recorded facts
      • Information: patterns underlying the data
  • 7. What is DATA MINING?
    • Extracting or “mining” knowledge from large amounts of data
    • Data-driven discovery and modeling of hidden patterns (that we never knew existed) in large volumes of data
    • Extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data
  • 8. What Is Data Mining?
    • Data mining:
      • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases
  • 9. Data Mining is NOT
    • Data Warehousing
    • (Deductive) query processing
      • SQL/ Reporting
    • Software Agents
    • Expert Systems
    • Online Analytical Processing (OLAP)
    • Statistical Analysis Tool
    • Data visualization
  • 10. Data Mining
    • Programs that detect patterns and rules in the data
    • Strong patterns can be used to make non-trivial predictions on new data
  • 11. Data Mining Challenges
    • Problem 1: most patterns are not interesting
    • Problem 2: patterns may be inexact, or completely spurious, when the data is noisy
  • 12. Machine Learning Techniques
    • Technical basis for data mining: algorithms for acquiring structural descriptions from examples
    • Methods originate from artificial intelligence, statistics, and research on databases
  • 13. Machine Learning Techniques
    • Structural descriptions represent patterns explicitly and can be used to
      • predict outcome in new situation
      • understand and explain how prediction is derived (maybe even more important)
  • 14. Multidisciplinary Field
    • [Diagram: Data Mining surrounded by Database Technology, Statistics, Artificial Intelligence, Machine Learning, Visualization, and Other Disciplines]
  • 15. Multidisciplinary Field
    • Database technology
    • Artificial Intelligence
      • Machine Learning including Neural Networks
    • Statistics
    • Pattern recognition
    • Knowledge-based systems/acquisition
    • High-performance computing
    • Data visualization
  • 16. History of Data Mining
  • 17. History
    • Emerged late 1980s
    • Flourished in the 1990s
    • Roots traced back along three family lines
      • Classical Statistics
      • Artificial Intelligence
      • Machine Learning
  • 18. Statistics
    • Foundation of most DM technologies
      • Regression analysis, standard distribution/deviation/variance, cluster analysis, confidence intervals
    • Building blocks
    • Significant role in today’s data mining – but alone is not powerful enough
  • 19. Artificial Intelligence
    • Heuristics vs. Statistics
    • Human-thought-like processing
    • Requires vast computer processing power
    • Supercomputers
  • 20. Machine Learning
    • Union of statistics and AI
      • Blends AI heuristics with advanced statistical analysis
    • Machine learning lets computer programs
      • learn about the data they study and make different decisions based on the qualities of that data
      • use statistics for fundamental concepts and add more advanced AI heuristics and algorithms
  • 21. Data Mining
    • Application of machine learning techniques to real-world problems
    • Union: Statistics, AI, Machine learning
    • Used to find previously hidden trends or patterns
    • Gaining acceptance in science and business areas that need to analyze large amounts of data to discover trends that could not be found otherwise
  • 22. Terminology
    • Gold Mining
    • Knowledge mining from databases
    • Knowledge extraction
    • Data/pattern analysis
    • Knowledge Discovery in Databases (KDD)
    • Information harvesting
    • Business intelligence
  • 23. KDD Process
    • [Diagram labels: Database; Selection; Transformation; Data Preparation; Data Mining; Training Data; Evaluation, Verification; Model, Patterns]
  • 24. LEARNING ALGORITHMS
    • Fundamental idea:
    • learn rules/patterns/relationships automatically from the data
  • 25. Data Mining Tasks
    • Exploratory Data Analysis
    • Predictive Modeling: Classification and Regression
    • Descriptive Modeling
      • Cluster analysis/segmentation
    • Discovering Patterns and Rules
      • Association/Dependency rules
      • Sequential patterns
      • Temporal sequences
    • Deviation detection
  • 26. Data Mining Tasks
    • Concept/Class description : Characterization and discrimination
      • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
    • Association (correlation and causality)
      • Multi-dimensional or single-dimensional association
      • age(X, “20-29”) ^ income(X, “60-90K”) → buys(X, “TV”)
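A rule such as the one above is usually judged by its support and confidence. The following is a minimal sketch of both measures, assuming each customer is represented as a set of attribute tags; the function names and the five-customer data set are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, List


def support(transactions: List[FrozenSet[str]], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: FrozenSet[str],
               consequent: FrozenSet[str]) -> float:
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    base = support(transactions, antecedent)
    if base == 0.0:
        return 0.0
    return support(transactions, antecedent | consequent) / base


# Hypothetical toy data: one frozenset of attribute tags per customer.
customers = [
    frozenset({"age:20-29", "income:60-90K", "buys:TV"}),
    frozenset({"age:20-29", "income:60-90K", "buys:TV"}),
    frozenset({"age:20-29", "income:60-90K"}),
    frozenset({"age:30-39", "income:60-90K", "buys:TV"}),
    frozenset({"age:20-29", "income:20-40K"}),
]

lhs = frozenset({"age:20-29", "income:60-90K"})
rhs = frozenset({"buys:TV"})

rule_support = support(customers, lhs | rhs)       # 2 of 5 customers match the whole rule
rule_confidence = confidence(customers, lhs, rhs)  # 2 of the 3 customers matching lhs buy a TV
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A high confidence on its own says nothing about causality; it only says that the right-hand side is frequent among customers who match the left-hand side in this sample.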
  • 27. Data Mining Tasks
    • Classification and Prediction
      • Finding models (functions) that describe and distinguish classes or concepts for future prediction
      • Example: classify countries based on climate, or classify cars based on gas mileage
      • Presentation:
        • IF-THEN rules, decision tree, classification rule, neural network
      • Prediction: Predict some unknown or missing numerical values
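As an illustration of the "classify cars based on gas mileage" example, here is a minimal sketch of a one-level decision tree (a decision stump) that learns a single threshold from labeled examples. The slides do not name a specific algorithm; the stump, the function names, the class labels, and the mileage figures are all assumptions made for this sketch.

```python
from typing import List, Sequence, Tuple

Stump = Tuple[float, str, str]  # (threshold, label at or below it, label above it)


def fit_stump(values: Sequence[float], labels: Sequence[str]) -> Stump:
    """Pick the threshold and label pair with the fewest training errors."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    # Candidate thresholds: midpoints between consecutive distinct values.
    thresholds: List[float] = [(a + b) / 2 for a, b in zip(xs, xs[1:]) if a != b]
    if not thresholds:
        raise ValueError("need at least two distinct attribute values")
    classes = sorted(set(labels))
    best = None
    for thr in thresholds:
        for below in classes:
            for above in classes:
                errors = sum(
                    1 for v, y in pairs if (below if v <= thr else above) != y
                )
                if best is None or errors < best[0]:
                    best = (errors, thr, below, above)
    _, thr, below, above = best
    return thr, below, above


def predict(stump: Stump, value: float) -> str:
    """Apply the learned IF-THEN rule to a new value."""
    thr, below, above = stump
    return below if value <= thr else above


# Hypothetical training data: miles per gallon and a hand-assigned class.
mpg = [12.0, 15.0, 18.0, 22.0, 31.0, 35.0, 40.0]
label = ["thirsty"] * 4 + ["economical"] * 3

stump = fit_stump(mpg, label)
print(f"IF mpg <= {stump[0]} THEN {stump[1]} ELSE {stump[2]}")
print(predict(stump, 28.0))
```

The learned model is itself an IF-THEN rule, which is one of the presentation forms listed on the slide.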
  • 28. Data Mining Tasks
    • Cluster analysis
      • Class label is unknown: group data to form new classes
        • Example: cluster houses to find distribution patterns
      • Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
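The clustering principle above can be sketched with k-means, one common algorithm among many (the slides do not prescribe one). The sketch assumes houses are described by two numeric coordinates and that the initial centroids are supplied by the caller; the function name and the data are hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point],
           initial_centroids: Sequence[Point],
           iterations: int = 10) -> Tuple[List[Point], List[int]]:
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    centroids = list(initial_centroids)
    assignment: List[int] = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        assignment = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignment


# Hypothetical house locations (x, y) forming two visible groups.
houses = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centers, labels = kmeans(houses, initial_centroids=[houses[0], houses[3]])
print(labels)
print(centers)
```

No class labels are given to the algorithm; the groups emerge only from the distances between points, which is what distinguishes clustering from classification.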
  • 29. Data Mining Tasks
    • Outlier analysis
      • Outlier: a data object that does not comply with the general behavior of the data
      • Usually treated as noise or an exception, but quite useful in fraud detection and rare-event analysis
    • Trend and evolution analysis
      • Trend and deviation: regression analysis
      • Sequential pattern mining, periodicity analysis
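A simple way to illustrate outlier analysis is a z-score test: flag values that lie far from the mean, measured in standard deviations. This is only one of many possible definitions of "does not comply with the general behavior of the data". The threshold of 2.0, the function name, and the claim amounts are assumptions for this sketch.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def find_outliers(values: Sequence[float], z_threshold: float = 2.0) -> List[float]:
    """Return the values whose distance from the mean exceeds z_threshold sigmas."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > z_threshold]


# Hypothetical insurance claim amounts; one is far larger than the rest.
claims = [52, 48, 50, 51, 49, 50, 47, 950]

suspicious = find_outliers(claims)
print(suspicious)
```

In small samples a single extreme value also inflates the standard deviation, so the threshold has to be tuned to the data; robust alternatives based on the median are often preferred in practice.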
  • 30. Data Mining: Classification Schemes
    • General functionality
      • Descriptive data mining vs. predictive data mining
    • Different views - different classifications
      • Kinds of databases to be mined
      • Kinds of knowledge to be discovered
      • Kinds of techniques employed
      • Kinds of applications
  • 31. A Multi-Dimensional View of Data Mining Classification
    • Databases to be mined
      • Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, WWW, etc.
    • Knowledge to be mined
      • Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
      • Multiple/integrated functions
      • Mining at multiple levels of abstraction
  • 32. A Multi-Dimensional View of Data Mining Classification
    • Techniques utilized
      • Decision/Regression trees, clustering, neural networks, etc.
    • Applications adapted
      • Retail, telecom, banking, DNA mining, stock market analysis, Web mining
  • 33. Data Mining Applications
    • Science: Chemistry, Physics, Medicine
      • Biochemical analysis
      • Remote sensors on a satellite
      • Telescopes – star galaxy classification
      • Medical Image analysis
  • 34. Data Mining Applications
    • Bioscience
      • Sequence-based analysis
      • Protein structure and function prediction
      • Protein family classification
      • Microarray gene expression
  • 35. Data Mining Applications
    • Pharmaceutical companies, Insurance and Health care, Medicine
      • Drug development
      • Identify successful medical therapies
      • Claims analysis, fraudulent behavior
      • Medical diagnostic tools
      • Predict office visits
  • 36. Data Mining Applications
    • Financial Industry, Banks, Businesses, E-commerce
      • Stock and investment analysis
      • Identify loyal customers vs. risky customers
      • Predict customer spending
      • Risk management
      • Sales forecasting
  • 37. Data Mining Applications
    • Retail and Marketing
      • Customer buying patterns/demographic characteristics
      • Mailing campaigns
      • Market basket analysis
      • Trend analysis
  • 38. Data Mining Applications
    • Database analysis and decision support
      • Market analysis and management
        • Target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
      • Risk analysis and management
        • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
      • Fraud detection and management
  • 39. Data Mining Applications
    • Sports and Entertainment
      • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat
    • Astronomy
      • JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
  • 40. DATA MINING EXAMPLES
    • Grocery store
    • NBA
    • Banking and Credit Card scoring
      • Fraud detection
    • Personalization & Customer Profiling
    • Campaign Management and Database Marketing
  • 41. Data mining at work: Case study 1
  • 42. Processing Loan Applications
    • Given: questionnaire with financial and personal information
    • Problem: should money be lent?
    • Borderline cases referred to loan officers
    • But: 50% of accepted borderline cases defaulted!
    • Solution:
      • reject all borderline cases?
    • But borderline cases are the most active customers!
  • 43. Enter Machine Learning
    • Given:
      • 1000 training examples of borderline cases
    • 20 attributes:
      • age, years with current employer, years at current address, years with the bank, years at current job, other credit cards
    • Learned rules predicted 2/3 of borderline cases correctly!
    • Rules could be used to explain decisions to customers
  • 44. Case study 2: Screening images
    • Given:
      • radar satellite images of coastal waters
    • Problem:
      • detecting oil slicks in those images
    • Oil slicks = dark regions with changing size and shape
    • Look-alike dark regions can be caused by weather conditions (e.g. high wind)
    • Expensive process requiring highly trained personnel
  • 45. Enter Machine Learning
    • Dark regions extracted from normalized image
    • Attributes:
      • size of region, shape, area, intensity, sharpness and jaggedness of boundaries, proximity of other regions, info about background
    • Constraints:
      • Scarcity of training examples (oil slicks are rare!)
      • Unbalanced data: most dark regions aren’t oil slicks
      • Regions from same image form a batch
      • Requirement is adjustable false-alarm rate
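The "adjustable false-alarm rate" requirement can be met by letting the classifier output a score per dark region and choosing the alarm threshold from scores of regions known not to be slicks. The sketch below assumes such scores exist and that higher means "more slick-like"; the function names and the score values are hypothetical, not taken from the case study.

```python
from typing import List, Sequence


def threshold_for_false_alarm_rate(negative_scores: Sequence[float],
                                   max_rate: float) -> float:
    """Return a threshold t such that flagging `score > t` raises alarms on at
    most `max_rate` of the known non-slick (negative) regions."""
    if not 0.0 <= max_rate <= 1.0:
        raise ValueError("max_rate must be between 0 and 1")
    ranked: List[float] = sorted(negative_scores, reverse=True)
    allowed = int(max_rate * len(ranked))  # number of false alarms tolerated
    if allowed >= len(ranked):
        return float("-inf")  # every region may be flagged
    # Only scores strictly above the (allowed + 1)-th highest negative are flagged.
    return ranked[allowed]


def false_alarm_rate(negative_scores: Sequence[float], threshold: float) -> float:
    """Fraction of negative regions that would trigger an alarm."""
    flagged = sum(1 for s in negative_scores if s > threshold)
    return flagged / len(negative_scores)


# Hypothetical classifier scores for ten look-alike regions that are not slicks.
look_alike_scores = [0.05, 0.10, 0.20, 0.15, 0.30, 0.60, 0.25, 0.40, 0.80, 0.35]

threshold = threshold_for_false_alarm_rate(look_alike_scores, max_rate=0.2)
print(threshold, false_alarm_rate(look_alike_scores, threshold))
```

Because slicks are rare and the data is unbalanced, fixing the false-alarm rate this way is often more meaningful to an operator than reporting overall accuracy.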
  • 46. Data Mining Challenges
    • Computationally expensive to investigate all possibilities
    • Dealing with noise/missing information and errors in data
    • Choosing appropriate attributes/input representation
    • Finding the minimal attribute space
    • Finding adequate evaluation function(s)
    • Extracting meaningful information
    • Not overfitting
  • 47. Are All the “Discovered” Patterns Interesting?
    • Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
  • 48. Are All the “Discovered” Patterns Interesting?
    • Objective vs. subjective measures:
      • Objective: based on statistics and structures of patterns
        • support and confidence
      • Subjective: based on user’s belief in the data
        • unexpectedness, novelty, actionability, etc.
  • 49. Can We Find All and Only Interesting Patterns?
    • Completeness - Find all the interesting patterns
      • Can a data mining system find all the interesting patterns?
      • Association vs. classification vs. clustering
  • 50. Can We Find All and Only Interesting Patterns?
    • Optimization - Search for only interesting patterns
      • Can a data mining system find only the interesting patterns?
      • Approaches
        • First generate all the patterns and then filter out the uninteresting ones
        • Mining query optimization
  • 51. Major Issues in Data Mining
    • Mining methodology and user interaction
      • Mining different kinds of knowledge in databases
      • Incorporation of background knowledge
      • Handling noise and incomplete data
      • Pattern evaluation: the interestingness problem
      • Expression and visualization of data mining results
  • 52. Major Issues in Data Mining
    • Performance and scalability
      • Efficiency of data mining algorithms
      • Parallel, distributed and incremental mining methods
    • Issues relating to the diversity of data types
      • Handling relational and complex types of data
      • Mining information from diverse databases
  • 53. Major Issues in Data Mining
    • Issues related to applications and social impacts
      • Application of discovered knowledge
        • Domain-specific data mining tools
        • Intelligent query answering
        • Expert systems
        • Process control and decision making
      • A knowledge fusion problem
      • Protection of data security, integrity, and privacy
  • 54. Summary
    • Data mining: discovering interesting patterns from large amounts of data
    • A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
  • 55. Summary
    • Mining can be performed in a variety of information repositories
    • Data mining functionalities: characterization, association, classification, clustering, outlier and trend analysis, etc.
    • Classification of data mining systems
    • Major issues in data mining
  • 56. Exercise
    • Practical Data mining example
  • 57. Kinds of Data Mining
    • Decision Tree Learning
    • Clustering
    • Neural Networks
    • Association Rules
    • Support Vector Machines
    • Genetic Algorithms
    • Nearest Neighbor Method
  • 58. Decision Tree Example
    • [Diagram: decision tree; surviving labels: Grandparents, A lot, A little]
  • 59. DECISION TREE FOR THE CONCEPT “Play Tennis” [Mitchell, 1997]
  • 60. DECISION TREE FOR THE CONCEPT “Play Tennis” [Mitchell, 1997]
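The tree diagrams on these two slides did not survive extraction. As a stand-in, the sketch below encodes the "Play Tennis" tree as it is commonly reproduced from Mitchell's 1997 textbook (Outlook at the root, then Humidity or Wind); treat the exact structure as an assumption about what the slides showed. The dictionary layout and the function name are hypothetical.

```python
from typing import Dict, Union

Tree = Union[str, Dict[str, object]]

# Assumed structure: internal nodes test one attribute, leaves are "Yes" or "No".
PLAY_TENNIS_TREE: Tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {
            "attribute": "Humidity",
            "branches": {"High": "No", "Normal": "Yes"},
        },
        "Overcast": "Yes",
        "Rain": {
            "attribute": "Wind",
            "branches": {"Strong": "No", "Weak": "Yes"},
        },
    },
}


def classify(tree: Tree, example: Dict[str, str]) -> str:
    """Walk from the root to a leaf, following the branch for each tested attribute."""
    node = tree
    while isinstance(node, dict):
        value = example[node["attribute"]]
        node = node["branches"][value]
    return node


day = {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}
print(classify(PLAY_TENNIS_TREE, day))
```

Each root-to-leaf path reads as an IF-THEN rule, for example: IF Outlook = Sunny AND Humidity = High THEN Play Tennis = No.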