Data Mining for Security Applications


Kulesh Shanmugasundaram
kulesh@cis.poly.edu
November 10, 2001

(Feel free to edit this document, but please let me know what you did so that I can keep my copy fresh. Thanks!)

1 Introduction

This is a summary of discussions at the Workshop on Data Mining for Security Applications, CCS01, PA. In this document, data mining takes a broad meaning which may sometimes include machine learning (ML) and artificial intelligence (AI). Furthermore, forensics and intrusion detection are interchangeable in some contexts. Please note that it is beyond the scope of our discussions to provide better definitions of these terms.

It was noted in early discussions that most data mining solutions overlook simple, effective substitutes for sophisticated, computationally intensive data mining techniques. One suggestion was that future performance comparisons should include some of these simple, effective methods where appropriate. It was also noted that most data mining techniques do not place enough emphasis on pre-processing and post-processing of datasets. It was emphasized that we should use machine learning techniques to fine-tune input datasets before data mining and should automate decision making after data mining. Authors demonstrated the use of data mining for intrusion detection, for identifying denial of service attacks, and for forensics.

2 Questions...

Traditionally, data mining has solved problems in database systems, in bio-informatics, where data mining techniques are still being used successfully to map the genome, and in financial engineering. Recently, the data mining community started applying similar techniques to existing security problems. This event provides an opportunity for attendees of the ACM CCS to meet with researchers who are interested in applying data mining techniques to security applications and to discuss critical issues of mutual interest during a concentrated period.

Two fundamental questions were asked and mostly went unanswered in our discussions.

1. Are we trying to solve the right security problem(s)?
   Are denial of service and intrusion detection the right problems for data mining, or are there other security problems, such as cryptanalysis, where data mining could be more effective, useful, and perhaps preventive? It was suggested that forensics is one of the fields that can use data mining for effective data reduction and for learning new insights or patterns. There were no other suggestions.

2. Do security problems need development of new data mining techniques?
   It is hard to answer this question without answering the previous question. We have not yet identified security problems that mandate a whole new data mining approach. However, it was felt strongly among the panel that new data mining techniques will have to be investigated in the near future. One of the suggestions was to investigate techniques used in bio-informatics to solve security problems, especially intrusion detection.

3 Ideas & Opinions

Following is a collection of ideas and opinions that came out of this workshop. Most of them revolve around security problems for which data mining can provide solutions.

3.1 Fusion of Information

As networks and network sensors become ubiquitous, fusion of sensor information is critical to the development of accurate insights on incidents. Therefore, fusion of sensor information and infrastructure development to support fusion are important areas for research and development. For instance, stream mining techniques can be used to develop tools that give a better overview of network traffic in [near] real time[2]. Network forensics is another area well positioned to benefit from fusion of information. One of many problems with information fusion is the lack of industry support in adopting a common standard for intrusion message exchange. However, IETF and TRENA are working together on a couple of standards for intrusion message exchange.

3.2 Rule Generation & Data Reduction

Automated rule generation for intrusion detection systems [to identify new threats,] rule generation for data mining systems to filter datasets efficiently, and data reduction in data mining systems without loss of critical information are still in primitive stages of development. Research and development effort must be put into developing better automated rule generation methods. False alarm filtering is considered an open problem. It was mentioned that most commercial IDS products produce as much as 80% false positives. New methods are required to filter false alarms without opening the way to stealth attacks. That is, an attacker may trigger a high volume of false alarms, and if the IDS reacts by filtering out that false alarm, the attacker can now bypass the IDS without triggering the alarm. New methods should reduce false alarms but should avoid such attacks as well.

3.3 Automated Ruleset Propagation

One of the problems still not addressed by IDS vendors is how to propagate rulesets or attack signatures securely over networks. An automated update strategy, through an overlay network approach, should allow intrusion detection systems to be more adaptive. Currently RealSecure is the only system that supports an anti-virus-like method to update rulesets from a central server. However, updating rulesets in a heterogeneous network means more than connecting to a central server and downloading new rulesets.

3.4 Gene Coding Applications

The idea of gene coding applications is similar to the gene coding of bio-organisms. Tools and methods should be developed to extract application behaviors at different levels of the software engineering process and embed these behaviors along with the application code. Upon execution of a gene coded application, application level firewalls and intrusion detection systems use the embedded gene code of the application to detect anomalous behaviors.

It seems quite obvious that the network is not the best place to perform effective filtering. There is too much noise on the network; any effective filtering means first filtering the noise out and then focusing on interesting signals. Host based detection methods, in comparison, are much more effective in that there is less noise. We can, however, raise the bar further by deploying detection methods at the applications themselves. New methods should be developed which allow application developers to characterize normal behaviors of applications and package that information as part of the application code. Intrusion detection systems and firewalls can then rely on this information to model anomalous behaviors.

3.5 Feature Selection of Attacks

Feature selection of attacks is an important element of intrusion detection systems. Currently there are not many useful feature selection or categorization methods available[14]. Such selection criteria would allow real time attack profiling and adaptive attack containment by intrusion detection systems.

3.6 Lack of Data Visualization

There is a lack of data visualization tools for network applications and forensics. Development of data visualization tools that offer multiple perspectives on a single dataset is an immediate necessity. Some form of certification should be developed for forensic tools to certify that such abstractions do not alter evidence and actually tell the truth. An open source forensic data visualization library seems to be a good starting point for such a certification process.

3.7 Lack of Datasets

It is a great concern of the community that the lack of realistic test datasets is making the research uncertain. What works on a test dataset may not work properly on real datasets; on the other hand, methods that are not efficient on test datasets may turn out to be effective on real data. Therefore, there is an immediate need for a tool or a network infrastructure to collect real datasets for the community while maintaining privacy standards.

References

[1] Critical Thoughts on Contemporary Data Mining Research for Security Applications, Klaus Julisch
[2] Fusing Heterogeneous Alert Streams into Scenarios, Oliver M. Dain, Robert K. Cunningham
[3] Using MIB II Variables for Network Anomaly Detection - A Feasibility Study, Xinzhou Qin
[4] Intrusion Detection with Unlabeled Data Using Clustering, Leonid Portnoy et al.
[5] Multi-Topic Email Authorship Attribution Forensics
[6] Panel discussions in and out of Sonata -03
[7] An Intrusion Detection System Based on the Teiresias Pattern Discovery Algorithm, Andreas Wespi et al.
[8] The GeneMine system for genome/proteome annotation and collaborative data mining
[9] Mining High Speed Data Streams, Pedro Domingos, Geoff Hulten
[10] Dr. Sushil Jajodia
[11] Johannes Gehrke
[12] Wenke Lee
[13] Philip Chan
[14] Columbia IDS Group
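
Illustrative note on the false alarm filtering concern (Section 3.2): the workshop observed that an attacker may flood an IDS with false alarms so that the IDS filters them out, after which a real attack passes unnoticed. The following is a minimal Python sketch of that scenario, not any vendor's actual filtering logic. The class NaiveAlarmFilter, the mute threshold, and the signature name "port-scan" are all hypothetical and chosen only for illustration.

```python
from collections import Counter


class NaiveAlarmFilter:
    """Hypothetical filter that mutes any signature that fires too often.

    This is an assumption made for illustration, not a real IDS API.
    """

    def __init__(self, mute_threshold=100):
        self.mute_threshold = mute_threshold
        self.counts = Counter()
        self.muted = set()

    def report(self, signature):
        """Return True if the alarm reaches the operator, False if filtered."""
        if signature in self.muted:
            return False
        self.counts[signature] += 1
        if self.counts[signature] >= self.mute_threshold:
            # Treat a very noisy signature as a source of false alarms.
            self.muted.add(signature)
        return True


def simulate_stealth_attack(ids, signature, decoys):
    """Send `decoys` harmless triggers, then one real attack.

    Returns whether the real attack raised an alarm.
    """
    for _ in range(decoys):
        ids.report(signature)
    return ids.report(signature)


quiet = simulate_stealth_attack(NaiveAlarmFilter(100), "port-scan", decoys=0)
flooded = simulate_stealth_attack(NaiveAlarmFilter(100), "port-scan", decoys=100)
print("alarm raised without flooding:", quiet)
print("alarm raised after flooding:", flooded)
```

Without flooding, the real attack raises an alarm. After 100 decoys the signature is muted and the same attack goes unreported, which is the behavior that new false alarm filtering methods would need to avoid.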

