Scientific Applications Of Data Mining

  • Published on
    18-Dec-2014

Transcript

  • 1. Scientific Applications of Data Mining
    • Bioinformatics Seminar, August 28, 2002
    • Gary Lindstrom, School of Computing, University of Utah
  • 2. Outline
    • What is data mining?
    • Where has it been successfully applied?
    • How can it be applied to scientific applications?
    • Research Opportunities
  • 3. What Is Data Mining?
    • One definition (Robert Grossman)
      • Data mining is the semi-automatic discovery of patterns, associations, anomalies, structures, and changes in large data sets
  • 4. Data Mining
    • Characteristics
      • Large data, vs. small data
      • Discovery, not validation
      • Data driven, not hypothesis driven
      • Automated, not manual application
    • Supported by
      • Statistics, machine learning, databases, high performance computing
  • 5. The Data Gap
    • Exponential growth of data
      • More automation, greater throughput, more models, e.g. simulated
    • But: linear increase in number of researchers
      • Sift the sand, rather than searching a sensor
  • 6. Classical Data Mining Applications
    • Retail
      • Market basket analysis
    • Political science
      • Targeting campaign resources
    • Financial
      • Exploiting market trends & imbalances
  • 7. Decision Support Systems
    • Generic term for analytic and historic uses of DBs
      • Contrast with: operational uses
      • Commonly known as On-Line Transaction Processing (OLTP)
    • Data warehouses
      • Data culled from operational DBs, with history and derived summary data
  • 8. Data Warehouses vs. Databases
      • Replicate data from distributed sources
      • Do not require strict currency of data
      • Oriented toward complex, often statistical queries
      • Often based on materialized views of operational data
        • Views which have been expanded into real tables
  • 9. Tools for DSS
    • Ad hoc SQL-style queries
      • Optimized for large, complex data
    • On-Line Analytic Processing (OLAP)
      • Queries optimized for aggregation operations
      • Data is viewed as multidimensional array
      • Influenced by end-user tools such as spreadsheets
    • Data mining
      • Exploratory data analysis
      • Looking for interesting unanticipated patterns in the data
  • 10. Data Warehousing
    • [Figure: external data sources feed the data warehouse through extract, transform, load and refresh steps; the warehouse, together with a metadata repository, serves OLAP, data mining and visualization tools]
  • 11. Creating And Maintaining A Warehouse
    • Challenges
      • Schema design for integrated information
      • Operations
        • Cleaning (curation): filling gaps, correcting errors
        • Transforming: making consistent with new schema
        • Loading: also sorting and summarizing
        • Refreshing: incorporating updates to operational data
        • Purging: aging out old data
    • Role of metadata
      • Sources of data, schema conversion information, refresh history, etc.
  • 12. OLAP
    • Focuses on data reduction in context
      • In SQL terms, heavy use of GROUP BY, aggregation, and HAVING
    • Multidimensional data model
      • More appropriate than operational DB tables
      • Based on numeric measures
      • Each measure depends on a set of dimensions
  • 13. Example: Retail Sales
    • Dimensions are Product, Location and Time
    • [Figure: three-dimensional sales cube with axes pid, timeid and locid; each cell holds a sales value]
  • 14. Working With Multidimensional Data
    • Can be represented as a conventional table
      • n dimensions need n+1 columns
      • Called a fact table
    • OLAP systems may store all data in relational form
      • These are Relational OLAP or ROLAP systems
    • Each dimension can have multiple components
      • Example: Location = (Country, State, City)
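A fact table of this kind can be sketched in Python. This is only an illustration: the column names pid, timeid, locid and sales follow the retail example, while every row value and the `locations` lookup are invented here.

```python
from typing import NamedTuple

class SaleFact(NamedTuple):
    pid: int      # product dimension key
    timeid: int   # time dimension key
    locid: int    # location dimension key
    sales: int    # the numeric measure

# Invented rows; three dimensions need 3 + 1 = 4 columns.
fact_table = [
    SaleFact(11, 1, 1, 25),
    SaleFact(11, 2, 1, 8),
    SaleFact(12, 1, 1, 30),
    SaleFact(12, 2, 2, 20),
]

# A dimension with several components: Location = (Country, State, City).
locations = {
    1: ("USA", "UT", "Salt Lake City"),
    2: ("USA", "UT", "Provo"),
}

columns_per_row = len(SaleFact._fields)
city_of_first_sale = locations[fact_table[0].locid][2]
```

Keeping dimension components in a separate lookup keyed by locid is one possible layout; a ROLAP system would hold both as relational tables.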
  • 15. OLAP Queries
    • Samples
      • Find the total sales
      • Find total sales for each city
      • Find total sales for each state
      • Find the top five products ranked by total sales
    • OLAP query jargon
      • Dimensionality reduction
        • Aggregating on a dimension, e.g., total sales by city
      • Roll-up
        • Given total sales by city, find total sales by state
      • Drill-down
        • Given total sales by state, ask for total sales by city
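The sample queries and the roll-up step can be sketched with plain Python dictionaries. The rows and the helper name `total_by` are assumptions made for this illustration, not part of the talk.

```python
from collections import defaultdict

# Invented rows: (state, city, sales).
rows = [
    ("UT", "Salt Lake City", 25),
    ("UT", "Provo", 10),
    ("UT", "Salt Lake City", 8),
    ("CA", "Fresno", 30),
]

def total_by(rows, key):
    """Aggregate the sales measure over rows grouped by key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)

# Finer level: total sales for each city.
sales_by_city = total_by(rows, key=lambda r: (r[0], r[1]))

# Roll-up: derive state totals from the city totals, without rescanning rows.
rolled_up = defaultdict(int)
for (state, _city), total in sales_by_city.items():
    rolled_up[state] += total
sales_by_state = dict(rolled_up)

# Coarsest level: the single grand total.
total_sales = sum(sales_by_state.values())
```

Drill-down is the reverse direction: starting from `sales_by_state`, the finer `sales_by_city` result has to be recomputed from data that still carries the city component.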
  • 16. OLAP Operations
    • Pivoting
      • Spreadsheet-like summaries
      • Example: given tabular representation of sales
        • Pivoting on Location and Time gives table of total sales for each location and each time value
      • Can be combined with aggregation
        • E.g., yearly sales by state
    • Cross tabulation
      • Displays result of pivoting
      • Aggregation values shown as summary rows and columns
      • As extra rows and columns added to original table
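Pivoting and cross tabulation can be illustrated with a small Python sketch. The fact rows, the function names and the "ALL" marker for summary rows and columns are assumptions chosen for this example.

```python
from collections import defaultdict

# Invented fact rows: (location, year, sales).
facts = [
    ("UT", 2001, 25), ("UT", 2002, 8), ("UT", 2001, 10),
    ("CA", 2001, 30), ("CA", 2002, 20),
]

def pivot(facts):
    """Pivot on location and year: cells[(loc, year)] = total sales."""
    cells = defaultdict(int)
    for loc, year, sales in facts:
        cells[(loc, year)] += sales
    return dict(cells)

def cross_tab(cells):
    """Add summary rows and columns (marked 'ALL') to the pivoted cells."""
    table = dict(cells)
    for (loc, year), total in cells.items():
        table[(loc, "ALL")] = table.get((loc, "ALL"), 0) + total
        table[("ALL", year)] = table.get(("ALL", year), 0) + total
        table[("ALL", "ALL")] = table.get(("ALL", "ALL"), 0) + total
    return table

table = cross_tab(pivot(facts))
```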
  • 17. OLAP Operations (cont’d)
    • Slicing
      • Equality selection on one or more dimensions
    • Dicing
      • Range selection
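As a rough sketch, slicing and dicing amount to two kinds of filter over the fact rows. The rows and function names below are invented for illustration.

```python
# Invented fact rows: (pid, timeid, locid, sales).
facts = [
    (11, 1, 1, 25), (11, 2, 1, 8), (11, 3, 1, 15),
    (12, 1, 1, 30), (12, 2, 2, 20), (13, 3, 2, 10),
]

def slice_by_location(facts, locid):
    """Slicing: equality selection on one dimension (here, location)."""
    return [f for f in facts if f[2] == locid]

def dice_by_time(facts, time_lo, time_hi):
    """Dicing: range selection on a dimension (here, time, inclusive bounds)."""
    return [f for f in facts if time_lo <= f[1] <= time_hi]

location_1 = slice_by_location(facts, locid=1)
first_two_periods = dice_by_time(facts, 1, 2)
```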
  • 18. OLAP Naturally Leads to Data Mining
    • Seeks interesting trends or patterns in large datasets
      • An example of exploratory data analysis
      • Related to knowledge discovery and machine learning
    • Mining for rules
      • Association rules: motivated by retail market basket analysis
  • 19. Market Basket Analysis
    • Market basket
      • A collection of items purchased by a customer in one transaction
      • Retailers want to learn of items often purchased together
        • For promotional and display grouping purposes
      • Simple tabular representation
        • Purchases(transid, custid, date, item, price, quantity)
  • 20. Association Rules
    • Seek rules of the form:
        • { pen } => { ink }
      • Meaning:
        • If a pen is purchased in a transaction, it is likely that ink will also be purchased in that transaction
  • 21. Important Measures for Association Rules
    • Support
      • % of transactions containing all items mentioned in rule
      • Low support reduces interest in the rule
    • Confidence
      • % of transactions containing the LHS that also contain RHS
      • Indicates degree of correlation
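Both measures are simple ratios, which a short Python sketch makes concrete. The transactions are invented, and the function names are chosen for this example only.

```python
# Invented transactions, each a set of items bought together.
transactions = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "milk"},
    {"pen", "ink", "juice"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# The rule { pen } => { ink }
rule_support = support({"pen", "ink"}, transactions)
rule_confidence = confidence({"pen"}, {"ink"}, transactions)
```

Here three of the four transactions contain both pen and ink, and all four contain pen, so both measures come out at 75%.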
  • 22. Using Association Rules For Prediction
    • Always somewhat risky
      • Because ultimate goal is understanding causality
      • Which is not directly reflected in transaction data
  • 23. There Can Be High Support and Confidence
    • … but no causality
    • Example: pencils and pens are often bought together
      • And pens and ink are often bought together
      • Hence pencils and ink are often bought together
    • But there is no causal link between pencils and ink
      • Hence sale promotions on pencils and ink probably won’t be effective
  • 24. Finding Association Rules
    • Seek rules with:
      • Support greater than minsup
      • Confidence greater than minconf
    • Steps
      • Find frequent item sets
        • Sets of items with support >= minsup
      • Break each frequent item set into LHS and RHS of candidate rules
        • Keep those with confidence >= minconf
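The second step, breaking frequent item sets into candidate rules, might look like the following Python sketch. It assumes the frequent item sets and their supports have already been found; the `frequent` table and the function names are invented for illustration.

```python
from itertools import combinations

def candidate_rules(itemset):
    """Yield every split of itemset into a non-empty LHS and a non-empty RHS."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for lhs in combinations(items, size):
            yield frozenset(lhs), frozenset(itemset) - frozenset(lhs)

def rules_from(frequent, minconf):
    """frequent maps item set -> support; keep rules with confidence >= minconf."""
    kept = []
    for itemset, sup in frequent.items():
        for lhs, rhs in candidate_rules(itemset):
            # Every subset of a frequent item set is frequent, so lhs is in the table.
            conf = sup / frequent[lhs]
            if conf >= minconf:
                kept.append((lhs, rhs, conf))
    return kept

# Invented supports, all at or above an assumed minsup of 0.7.
frequent = {
    frozenset({"pen"}): 1.0,
    frozenset({"ink"}): 0.75,
    frozenset({"pen", "ink"}): 0.75,
}
rules = rules_from(frequent, minconf=0.9)
```

With these numbers only { ink } => { pen } survives: { pen } => { ink } has confidence 0.75, below the assumed minconf of 0.9.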
  • 25. Testing Candidate Rules
    • Confidence calculation for each candidate rule
      • Maintain two counters: lhscount, rhscount
      • Scan entire customer transaction table
      • Increment lhscount for each transaction containing all items in LHS
      • If LHS is present, also increment rhscount if all items in RHS are present
      • Confidence is then rhscount / lhscount
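The single scan with two counters can be written out directly. This is a minimal sketch; the transactions are invented and the counter names follow the slide.

```python
def rule_confidence(lhs, rhs, transactions):
    """One scan over the transactions, maintaining lhscount and rhscount."""
    lhscount = 0  # transactions containing all LHS items
    rhscount = 0  # of those, transactions also containing all RHS items
    for items in transactions:
        if lhs <= items:
            lhscount += 1
            if rhs <= items:
                rhscount += 1
    # Guard against a rule whose LHS never occurs.
    return rhscount / lhscount if lhscount else 0.0

# Invented transactions.
transactions = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "milk"},
    {"pen", "ink", "juice"},
]
pen_ink = rule_confidence({"pen"}, {"ink"}, transactions)
milk_ink = rule_confidence({"milk"}, {"ink"}, transactions)
```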
  • 26. Identifying Frequent Item Sets
    • The a priori property:
      • Every subset of a frequent item set is also a frequent item set
    • This leads to an iterative algorithm
      • Identify frequent item sets of one item
      • Iteratively, seek to extend frequent item sets by adding an item
  • 27. Finding Frequent Itemsets
    foreach item,
      check if it is a frequent itemset
    repeat
      foreach new frequent itemset I_k with k items
        generate all itemsets I_(k+1) with k+1 items, I_k ⊂ I_(k+1)
      scan all transactions once and check if the generated (k+1)-itemsets are frequent
    until no new frequent itemsets are found
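The level-wise search on slide 27 can be sketched in Python as follows. This is an unoptimized illustration under stated assumptions: transactions are in-memory sets, the data is invented, and for brevity each candidate is counted with its own pass over the transactions rather than in one shared scan per level.

```python
def frequent_itemsets(transactions, minsup):
    """Level-wise search; returns {item set: support} for all frequent sets."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: check each single item.
    level = {}
    for item in {item for t in transactions for item in t}:
        candidate = frozenset([item])
        sup = support(candidate)
        if sup >= minsup:
            level[candidate] = sup
    frequent = dict(level)
    frequent_items = {item for itemset in level for item in itemset}

    # Extend each new frequent k-item set by one frequent item, then test.
    while level:
        candidates = {
            itemset | {item}
            for itemset in level
            for item in frequent_items
            if item not in itemset
        }
        level = {}
        for candidate in candidates:
            sup = support(candidate)
            if sup >= minsup:
                level[candidate] = sup
        frequent.update(level)
    return frequent

# Invented transactions.
transactions = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "milk"},
    {"pen", "ink", "juice"},
]
result = frequent_itemsets(transactions, minsup=0.7)
```

Extending only with items that are themselves frequent is a direct use of the a priori property: a set containing an infrequent item cannot be frequent.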
  • 28. Generalized Association Rules
    • Grouping by transaction attributes
      • Example: group by custid
      • Association can be across multiple transactions by same customer
    • Group by categories
      • Example: rules of form
        • { apparel } => { stationery }
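Grouping by a transaction attribute only changes how the baskets are formed before rule mining. A minimal sketch, assuming a reduced Purchases table with invented rows and a hypothetical helper `baskets`:

```python
from collections import defaultdict

# Invented Purchases rows, reduced to (transid, custid, item).
purchases = [
    (1, 201, "pen"), (1, 201, "ink"),
    (2, 201, "milk"),
    (3, 105, "pen"),
    (4, 105, "ink"),
]

def baskets(purchases, key_index):
    """Group items into baskets by the chosen column (0 = transid, 1 = custid)."""
    groups = defaultdict(set)
    for row in purchases:
        groups[row[key_index]].add(row[2])
    return dict(groups)

by_transaction = baskets(purchases, 0)
by_customer = baskets(purchases, 1)
```

Customer 105 never buys pen and ink in the same transaction, yet the customer-level basket contains both, so an association across transactions becomes visible.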
  • 29. Generalized Association Rules (Cont’d)
    • Sequential patterns
      • Identify frequently arising buying patterns over time
    • Classification rules
      • “If age is in a certain range and balance is in a certain range, then the customer is likely to default on a loan.”
  • 30. Example: Managing Microarray Data
    • MS thesis by John Kokinis, 2000
    • ArrayBank
      • Tool for management of microarray gene expression data sets
      • Implemented in Visual Basic / Sybase
    • Figures from MS thesis
      • Pages 20, 45, 46
      • http://www.cs.utah.edu/~kokinis/THESIS.pdf
  • 31. Example: Mining Simulated Combustion Data
    • Joint work with
      • Brijesh Garabadu, School of Computing
      • Zoran Djurisic, Chem. & Fuels Engg.
    • The problem
      • Combustion model for powdered coal furnaces
      • Which conditions control NOx pollution?
  • 32. The Data
    • Multidimensional space
      • Pressure, fuel mix, oxygen concentration
      • Can explore (simulate) any combination
        • But which to look at?
    • Need to:
      • Locate relevant subspaces
      • Characterize important events
      • Develop causal hypotheses
  • 33. Techniques Applied
    • Cluster analysis
      • Which datasets are similar?
    • Neural networks
      • Which datasets are interesting?
    • Decision trees
      • Which features best explain similarities?
  • 34. Cluster Analysis: Unsupervised Learning
    • At outset, category structure of the data is unknown
      • All that is known is a collection of observations
    • Objective: To discover a category structure which fits the observations
      • i.e. finding natural groups in data
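One common way to find such natural groups is k-means clustering. The talk does not say which clustering method was used, so the sketch below is only an example of unsupervised grouping, with invented points and hand-picked starting centroids.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of floats, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(xs) / len(cluster) for xs in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Invented observations; no labels are supplied.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```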
  • 35. Combustion Application
    • Cluster analysis was used to detect relationships among various species
      • Are the behaviors of any two species related?
      • Is the concentration of one species dependent on that of one or more other species? 
    • One confirmed hypothesis:
      • CH reaches its peak concentration either before or at the same time as H reaches its peak concentration
      • An important engineering observation
  • 36. Artificial Neural Networks
    • A general, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples
    • Combustion application
      • Finding different kinds of patterns (increasing / decreasing, etc.) in the lifetime of a species during the combustion process
      • This can be used to test various hypotheses as well as to detect patterns of specific species in previously unseen data
  • 37. Neural Networks: Supervised Learning
  • 38. Application Technique
    • Training set data are labeled by the user
      • These labeled data are used to train the ANN
    • The ANN is then used to classify previously unseen data
      • e.g., species in a particular combustion
      • Into a particular pattern class
    • For example, NO shows two different trends under differing conditions
    • A trained ANN can be used to classify the datasets according to the trend of NO
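The train-then-classify workflow can be illustrated with a single artificial neuron (a perceptron), the simplest building block of an ANN. This is not the network used in the combustion work; the concentration series, the labels and the difference-based features are all invented for this sketch.

```python
def features(series):
    """Successive differences of a concentration time series."""
    return [b - a for a, b in zip(series, series[1:])]

def predict(weights, bias, x):
    """Return class 1 if the weighted sum is positive, else class 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

def train(examples, epochs=20, rate=1.0):
    """Perceptron learning rule on (feature vector, label) pairs."""
    weights, bias = [0.0] * len(examples[0][0]), 0.0
    for _ in range(epochs):
        for x, label in examples:
            error = label - predict(weights, bias, x)
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            bias += rate * error
    return weights, bias

# User-labeled training series: 1 = increasing trend, 0 = decreasing trend.
labeled = [
    ([1.0, 2.0, 3.0, 4.0], 1),
    ([0.5, 0.6, 0.9, 1.5], 1),
    ([4.0, 3.0, 2.0, 1.0], 0),
    ([3.0, 2.0, 1.5, 1.0], 0),
]
weights, bias = train([(features(s), y) for s, y in labeled])

# Classify previously unseen series.
rising = predict(weights, bias, features([0.1, 0.4, 0.8, 1.6]))
falling = predict(weights, bias, features([5.0, 4.0, 2.0, 1.0]))
```

With so few training examples the learned boundary is crude; a practical ANN would use more data, hidden layers and a held-out test set.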
  • 39. Decision Trees
    • Characterize data by features
      • e.g., species concentration at an instant
    • Categorize data sets
      • Manually, or use ANN
      • e.g., according to the trend of NO
    • Use decision tree algorithm to discover clustering criteria
  • 40. Sample Output
    === Classifier model (full training set) ===
    J48 pruned tree
    ---------------------
    CO <= 0.002945
    |   OH <= 0.000016
    |   |   CO <= 0.000166: yes (17.0/1.0)
    |   |   CO > 0.000166: no (3.0)
    |   OH > 0.000016: yes (30.0)
    CO > 0.002945: no (60.0/1.0)
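Read as code, the pruned tree on slide 40 is a pair of nested threshold tests. The function below is a hand transcription of that output; the function name is invented, and the class labels yes and no are taken as printed, since the slide does not state what they denote.

```python
def classify(co, oh):
    """Hand transcription of the J48 tree on slide 40; returns 'yes' or 'no'."""
    if co <= 0.002945:
        if oh <= 0.000016:
            return "yes" if co <= 0.000166 else "no"
        return "yes"
    return "no"

# One probe per leaf, in the order the leaves are printed.
leaf_labels = [
    classify(co=0.0001, oh=0.00001),
    classify(co=0.001, oh=0.00001),
    classify(co=0.001, oh=0.0001),
    classify(co=0.01, oh=0.0),
]
```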
  • 41. Research Opportunities
    • Try it!
      • In your area, on your data, for new results
    • Features
      • Definition, efficient extraction
    • Community building
      • Sharing data mining results
  • 42. PMML
    • Predictive Model Markup Language
    • XML-based representation of association rules
    • Developed by Data Mining Group
      • Industrial and university research collaboration
  • 43. An Excellent Tutorial
    • Used for material in this talk
      • Data Mining Scientific and Engineering Applications
        • Tutorial at SC2001, November 12, 2001 by R. Grossman, C. Kamath and V. Kumar
    • http://www-users.cs.umn.edu/~kumar/Presentation/sc2001.html