What Data Do We Need and Why Do We Need It?
Jim Pepin
Chief Technology Officer
University of Southern California

Network Data: Research Depends on It
- Solutions depend on understanding the problem
- Advances in many areas depend on analysis of real data
  - Network Management: traffic engineering, network design
  - Network Control: improving routing protocols
  - High Performance: better transport protocols
  - Security: tracking/stopping DoS and worm attacks
- Over 30% of papers in the top networking conference (SIGCOMM04) depended on data collected by others
- Most common providers:
  - ISPs (e.g., ATT, Sprint, I2)
  - Service providers (e.g., Akamai)
  - Individual campuses (e.g., UNC, UOregon, USC; some campuses give data only to local researchers)

Network Data: More than Just Packet Traces
- Some data are more sensitive than others
  - Dynamic routing information: routing protocol advertisements
  - Static design information: router configuration files, peering arrangements, policies
  - Operational events: alarms, trouble tickets (very few sources of this important info!)
  - Traffic logs: netflow records, packet header traces
  - Application data: URLs, p2p filenames, DNS queries
- Tension: how much correlation to permit?
  - Data that can be correlated across multiple sites is most valuable in measuring network-wide events, e.g., worms
  - Techniques for privacy: anonymize and blur identity

Example of Data Provider
- DHS PREDICT
  - DHS support for network research
  - Not for operational use by DHS
  - Major players
  - Peer review ground rules
  - Generic sources for legitimate research

LANDER Project
- Example of PREDICT supplier
- Joint project of USC-ISI networking division and USC/ISD Center for High Performance Computing and Communications
- USC-HPCC is manager of the WAN for USC/CIT/JPL
- ISI provides networking research background
- HPCC provides data storage and computational resources
- We work together on ground rules and MOUs
- LANDER funds collection systems, support staff and disk/tape space

What is hard and easy

LANDER ground rules
- Scrambled headers are the primary product today
- Requires an MOU with the researcher
- No collection of data payloads
- Working on a very strict MOU for very limited use of non-scrambled header data, for very select uses in a very controlled environment
- Build a collection management system integrated with other PREDICT sites

How we do this
- Very close co-operation between ISI, ISD and university legal
- MOUs will be very clear and understandable for the researcher
- USC can reject any application
- USC will review any publication based on unscrambled headers, and all work processing these headers will be done inside HPCC

Why would we do this
- The Internet needs to be studied and engineered
- What is the modern equivalent of Bell Labs for the phone system?
- How did we get to where we are today?
  - Co-operation between researchers and operators
- We can't allow ourselves to have a complete bunker mentality
- We need to be selective in what we provide, but in cases of demonstrated need, provide what is needed consistent with policies
- If we don't do this, no one will
- The risks can be managed if we take the time and effort to work with campus management (legal, CIOs, etc.) to mitigate them
- Researchers can be brought into these discussions if cast correctly
- If we don't study how the network works, our ability to manage it will degrade to zero over time
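
Illustration: scrambling header addresses

The slides name "scrambled headers" and "anonymize and blur identity" but do not say how the scrambling is done. The following is only a minimal sketch of one possible approach, not LANDER's or PREDICT's actual method: a keyed, prefix-preserving pseudonymization of IPv4 addresses built on HMAC-SHA256. The function name scramble_ipv4, the key handling, and the choice of HMAC are all assumptions made for illustration. The sketch also touches on the correlation tension raised above: traces scrambled with the same key can be correlated with each other, while traces scrambled with different keys cannot.

```python
import hashlib
import hmac
import ipaddress


def scramble_ipv4(addr: str, key: bytes) -> str:
    """Illustrative prefix-preserving scrambling of one IPv4 address.

    Hypothetical helper, not taken from the slides. Each output bit is the
    input bit XORed with a keyed pseudorandom bit that depends only on the
    higher-order input bits, so two addresses sharing a k-bit prefix map to
    outputs that also share a k-bit prefix.
    """
    original = int(ipaddress.IPv4Address(addr))
    result = 0
    for i in range(32):
        # The i most significant bits of the original address (0 when i == 0).
        prefix = original >> (32 - i)
        # Keyed pseudorandom bit derived from the prefix length and value.
        msg = i.to_bytes(1, "big") + prefix.to_bytes(4, "big")
        flip = hmac.new(key, msg, hashlib.sha256).digest()[0] & 1
        bit = (original >> (31 - i)) & 1
        result = (result << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(result))


if __name__ == "__main__":
    site_key = b"example-key-kept-secret-by-the-data-provider"  # assumed key
    for address in ("192.0.2.17", "192.0.2.200", "198.51.100.5"):
        print(address, "->", scramble_ipv4(address, site_key))
```

With a single key, the mapping is deterministic and one-to-one, so flows in a trace remain distinguishable and subnet structure is retained. That retained structure is also the main weakness of such schemes, which is one reason an MOU governing use would still matter.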