1. Scientific Applications of Data Mining
Bioinformatics Seminar
August 28, 2002
Gary Lindstrom
School of Computing
University of Utah

2. Outline
What is data mining?
Where has it been successfully applied?
How can it be applied to scientific applications?
Research opportunities

3. What Is Data Mining?
One definition (Robert Grossman):
  Data mining is the semi-automatic discovery of patterns, associations, anomalies, structures, and changes in large data sets

4. Data Mining Characteristics
Large data, vs. small data
Discovery, not validation
Data driven, not hypothesis driven
Automated, not manual application
Supported by
  Statistics, machine learning, databases, high performance computing

5. The Data Gap
Exponential growth of data
  More automation, greater throughput, more models, e.g. simulated
But: linear increase in number of researchers
Sift the sand, rather than searching a sensor

6. Classical Data Mining Applications
Retail
  Market basket analysis
Political science
  Targeting campaign resources
Financial
  Exploiting market trends & imbalances

7. Decision Support Systems
Generic term for analytic and historic uses of DBs
Contrast with: operational uses
  Commonly known as On-Line Transaction Processing (OLTP)
Data warehouses
  Data culled from operational DBs, with history and derived summary data

8. Data Warehouses vs. Databases
Replicate data from distributed sources
Do not require strict currency of data
Oriented toward complex, often statistical queries
Often based on materialized views of operational data
  Views which have been expanded into real tables

9. Tools for DSS
Ad hoc SQL-style queries
  Optimized for large, complex data
On-Line Analytic Processing (OLAP)
  Queries optimized for aggregation operations
  Data is viewed as multidimensional array
  Influenced by end-user tools such as spreadsheets
Data mining
  Exploratory data analysis
  Looking for interesting unanticipated patterns in the data

10. Data Warehousing
[Diagram. Labels: External Data Source; Extract, Transform, Load, Refresh; Metadata Repository; Data Warehouse; Serves; OLAP; Data Mining; Visualization]

11. Creating And Maintaining A Warehouse
Challenges
  Schema design for integrated information
Operations
  Cleaning (curation): filling gaps, correcting errors
  Transforming: making consistent with new schema
  Loading: also sorting and summarizing
  Refreshing: incorporate updates to operational data
  Purging: aging out old data
Role of metadata
  Sources of data, schema conversion information, refresh history, etc.

12. OLAP
Focuses on data reduction in context
  In SQL terms, lots of GROUP BY, aggregation, and HAVING operators
Multidimensional data model
  More appropriate than operational DB tables
  Based on numeric measures
  Each measure depends on a set of dimensions

13. Example: Retail Sales
Dimensions are Product, Location and Time
[Figure: array of sales values indexed by pid, timeid, and locid]

14. Working With Multidimensional Data
Can be represented as a conventional table
  n dimensions need n+1 columns
  Called a fact table
OLAP systems may store all data in relational form
  These are Relational OLAP or ROLAP systems
Each dimension can have multiple components
  Example: Location = (Country, State, City)

15. OLAP Queries
Samples
  Find the total sales
  Find total sales for each city
  Find total sales for each state
  Find the top five products ranked by total sales
OLAP query jargon
  Dimensionality reduction
    Aggregating on a dimension, e.g., total sales by city
  Roll-up
    Given total sales by city, find total sales by state
  Drill-down
    Given total sales by state, ask for total sales by city

16. OLAP Operations
Pivoting
  Spreadsheet-like summaries
  Example: given tabular representation of sales
    Pivoting on Location and Time gives table of total sales for each location and each time value
  Can be combined with aggregation
    E.g., yearly sales by state
Cross tabulation
  Displays result of pivoting
  Aggregation values shown as summary rows and columns
    As extra rows and columns added to original table

17. OLAP Operations (cont’d)
Slicing
  Equality selection on one or more dimensions
Dicing
  Range selection

18. OLAP Naturally Leads to Data Mining
Seeks interesting trends or patterns in large datasets
An example of exploratory data analysis
Related to knowledge discovery and machine learning
Mining for rules
  Association rules: motivated by retail market basket analysis

19. Market Basket Analysis
Market basket
  A collection of items purchased by a customer in one transaction
Retailers want to learn of items often purchased together
  For promotional and display grouping purposes
Simple tabular representation
  Purchases(transid, custid, date, item, price, quantity)

20. Association Rules
Seek rules of the form: { pen } => { ink }
Meaning:
  If a pen is purchased in a transaction, it is likely that ink will also be purchased in that transaction

21. Important Measures for Association Rules
Support
  % of transactions containing all items mentioned in rule
  Low support reduces interest in the rule
Confidence
  % of transactions containing the LHS that also contain RHS
  Indicates degree of correlation

22. Using Association Rules For Prediction
Always somewhat risky
  Because ultimate goal is understanding causality
  Which is not directly reflected in transaction data

23. There Can Be High Support and Confidence …
… but no causality
Example: pencils and pens are often bought together
  And pens and ink are often bought together
  Hence pencils and ink are often bought together
  But there is no causal link between pencils and ink
  Hence sale promotions on pencils and ink probably won’t be effective

24. Finding Association Rules
Seek rules with:
  Support greater than minsup
  Confidence greater than minconf
Steps
  Find frequent item sets
    Sets of items with support >= minsup
  Break each frequent item set into LHS and RHS of candidate rules
    Keep those with confidence >= minconf

25. Testing Candidate Rules
Confidence calculation for each candidate rule
  Maintain two counters: lhscount, rhscount
  Scan entire customer transaction table
  Count in lhscount occurrences of all items in LHS
  If LHS is present, tally in rhscount if all items in RHS are present

26. Identifying Frequent Item Sets
The a priori property:
  Every subset of a frequent item set is also a frequent item set
This leads to an iterative algorithm
  Identify frequent item sets of one item
  Iteratively, seek to extend frequent item sets by adding an item

27. Finding Frequent Itemsets
foreach item, check if it is a frequent itemset
repeat
  foreach new frequent itemset I_k with k items
    generate all itemsets I_{k+1} with k+1 items, I_k ⊂ I_{k+1}
  scan all transactions once and check if the generated (k+1)-itemsets are frequent
until no new frequent itemsets are found

28. Generalized Association Rules
Grouping by transaction attributes
  Example: group by custid
  Association can be across multiple transactions by same customer
Group by categories
  Example: rules of form { apparel } => { stationery }

29. Generalized Association Rules (Cont’d)
Sequential patterns
  Identify frequently arising buying patterns over time
Classification rules
  “If age is in a certain range and balance is in a certain range, then the customer is likely to default on a loan.”

30. Example: Managing Microarray Data
MS thesis by John Kokinis, 2000
ArrayBank
  Tool for management of microarray gene expression data sets
  Implemented in Visual Basic / Sybase
Figures from MS thesis
  Pages 20, 45, 46
  http://www.cs.utah.edu/~kokinis/THESIS.pdf

31. Example: Mining Simulated Combustion Data
Joint work with
  Brijesh Garabadu, School of Computing
  Zoran Djurisic, Chem. & Fuels Engg.
The problem
  Combustion model for powdered coal furnaces
  Which conditions control NOx pollution?

32. The Data
Multidimensional space
  Pressure, fuel mix, oxygen concentration
Can explore (simulate) any combination
  But which to look at?
Need to:
  Locate relevant subspaces
  Characterize important events
  Develop causal hypotheses

33. Techniques Applied
Cluster analysis
  Which datasets are similar?
Neural networks
  Which datasets are interesting?
Decision trees
  Which features best explain similarities?

34. Cluster Analysis: Unsupervised Learning
At outset, category structure of the data is unknown
All that is known is a collection of observations
Objective: to discover a category structure which fits the observations
  i.e. finding natural groups in data

35. Combustion Application
Cluster analysis was used to detect relationships among various species
  Are the behaviors of any two species related?
  Is the concentration of one species dependent on that of one or more other species?
One confirmed hypothesis:
  CH reaches its peak concentration either before or at the same time as H reaches its peak concentration
  An important engineering observation

36. Artificial Neural Networks
A general, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples
Combustion application
  Finding out different kinds of pattern (increasing / decreasing, etc.) in the lifetime of a species during the combustion process
  This can be used to prove various hypotheses as well as to detect patterns of specific species in previously unseen data

37. Neural Networks: Supervised Learning

38. Application Technique
Training set data are labeled by the user
These labeled data are used to train the ANN
The ANN is then used to classify previously unseen data
  e.g., species in a particular combustion
  Into a particular pattern class
For example, NO shows two different trends under differing conditions
  A trained ANN can be used to classify the datasets according to the trend of NO

39. Decision Trees
Characterize data by features
  e.g., species concentration at an instant
Categorize data sets
  Manually, or use ANN
  e.g., according to the trend of NO
Use decision tree algorithm to discover clustering criteria

40. Sample Output
=== Classifier model (full training set) ===
J48 pruned tree
---------------------
CO 0.002945: no (60.0 / 1.0)

41. Research Opportunities
Try it!
  In your area, on your data, for new results
Features
  Definition, efficient extraction
Community building
  Sharing data mining results

42. PMML
Predictive Model Markup Language
XML based representation of association rules
Developed by Data Mining Group
  Industrial and university research collaboration

43. An Excellent Tutorial
Used for material in this talk
Data Mining Scientific and Engineering Applications
  Tutorial at SC2001, November 12, 2001
  by R. Grossman, C. Kamath and V. Kumar
  http://www-users.cs.umn.edu/~kumar/Presentation/sc2001.html
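
Illustration for slides 14 to 17 (not part of the original talk): the OLAP vocabulary of fact table, dimensionality reduction, roll-up, slicing, and dicing can be sketched in a few lines of plain Python. The fact table rows, the Location dimension table, and the names FACTS, LOCATIONS, total_sales, and roll_up are all hypothetical, chosen only to make the operations concrete; a real ROLAP system would express the same thing with GROUP BY queries.

```python
from collections import defaultdict

# Hypothetical fact table: 3 dimensions need 3 + 1 columns
# (pid, timeid, locid, sales).
FACTS = [
    (11, 1, 1, 25), (11, 2, 1, 8), (12, 1, 1, 30),
    (11, 1, 2, 35), (12, 1, 2, 26), (12, 2, 3, 10),
]

# Hypothetical Location dimension table: locid -> (city, state).
LOCATIONS = {
    1: ("Salt Lake City", "UT"),
    2: ("Provo", "UT"),
    3: ("Denver", "CO"),
}


def total_sales(facts, key):
    """Aggregate the sales measure over groups chosen by key(row)."""
    totals = defaultdict(int)
    for row in facts:
        totals[key(row)] += row[3]
    return dict(totals)


def roll_up(by_city):
    """Given total sales by city, derive total sales by state."""
    city_to_state = {city: state for city, state in LOCATIONS.values()}
    totals = defaultdict(int)
    for city, sales in by_city.items():
        totals[city_to_state[city]] += sales
    return dict(totals)


# Dimensionality reduction: aggregate away Product and Time.
by_city = total_sales(FACTS, lambda r: LOCATIONS[r[2]][0])
by_state = total_sales(FACTS, lambda r: LOCATIONS[r[2]][1])

# Slicing: equality selection on one dimension (timeid == 1).
slice_t1 = [r for r in FACTS if r[1] == 1]

# Dicing: range selection on a dimension (1 <= locid <= 2).
dice_loc = [r for r in FACTS if 1 <= r[2] <= 2]
```

Drill-down is the reverse of roll_up: it cannot be computed from the state totals alone, so it goes back to the fact table (here, by calling total_sales with the city key).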
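
Illustration for slides 21 and 24 to 27 (not part of the original talk): a minimal sketch of support, confidence, and the level-wise frequent itemset search, assuming transactions are held in memory as sets of item names. The sample transactions and the function names support, confidence, frequent_itemsets, and rules are hypothetical. Unlike the pseudocode on slide 27, this sketch rescans the transactions once per candidate rather than once per level, which is simpler but slower.

```python
from itertools import combinations

# Hypothetical market baskets, one set of items per transaction.
transactions = [
    frozenset({"pen", "ink", "milk", "juice"}),
    frozenset({"pen", "ink", "milk"}),
    frozenset({"pen", "milk"}),
    frozenset({"pen", "ink", "juice"}),
]


def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Fraction of transactions containing lhs that also contain rhs."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    lhscount = sum(lhs <= t for t in transactions)
    rhscount = sum((lhs | rhs) <= t for t in transactions)
    return rhscount / lhscount if lhscount else 0.0


def frequent_itemsets(transactions, minsup):
    """Level-wise search: extend frequent k-itemsets by one item at a time."""
    items = sorted({i for t in transactions for i in t})
    level = {frozenset([i]) for i in items
             if support([i], transactions) >= minsup}
    frequent = set(level)
    while level:
        # Generate (k+1)-itemsets that contain a frequent k-itemset.
        candidates = {s | {i} for s in level for i in items if i not in s}
        # A priori property: every k-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, len(c) - 1))}
        level = {c for c in candidates
                 if support(c, transactions) >= minsup}
        frequent |= level
    return frequent


def rules(transactions, minsup, minconf):
    """Split each frequent itemset into LHS => RHS and keep confident rules."""
    found = []
    for itemset in frequent_itemsets(transactions, minsup):
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                rhs = itemset - set(lhs)
                conf = confidence(lhs, rhs, transactions)
                if conf >= minconf:
                    found.append((frozenset(lhs), rhs, conf))
    return found
```

With these baskets, support({"pen", "ink"}, transactions) and confidence({"pen"}, {"ink"}, transactions) both come out at 0.75, since pen appears in every transaction and ink in three of the four.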
