Spark Summit 2013: Spark Streaming, Real-time Big Data Processing

Spark Streaming:
- Scales to hundreds of nodes
- Achieves second-scale latencies
- Efficiently recovers from failures
- Integrates with batch and interactive processing


Spark Streaming: Real-time big data processing
Large-scale near-real-time stream processing
Tathagata Das (TD), UC Berkeley

What is Spark Streaming?
- Extends Spark for big data stream processing
- Project started in early 2012; alpha released in spring 2013 with Spark 0.7
- Moving out of alpha in Spark 0.9
- Part of the Spark stack, alongside Shark, MLlib, GraphX, and BlinkDB

Why Spark Streaming?
- Many big-data applications need to process large data streams in real time: website monitoring, fraud detection, ad monetization
- This requires a framework for big data stream processing that:
  - Scales to hundreds of nodes
  - Achieves second-scale latencies
  - Efficiently recovers from failures
  - Integrates with batch and interactive processing

Integration with Batch Processing
- Many environments require processing the same data both as a live stream and in batch post-processing
- Existing frameworks cannot do both: either stream processing of hundreds of MB/s with low latency, or batch processing of TBs of data with high latency
- Maintaining two different stacks is extremely painful: different programming models, double the implementation effort

Stateful Stream Processing
- Traditional model: a processing pipeline of nodes, each maintaining mutable in-memory state
- Each input record updates the state, and new records are sent out downstream
- Mutable state is lost if a node fails
- Making stateful stream processing fault tolerant is challenging!

Speaker notes: Traditional stream processing uses the continuous operator model, where every node in the processing pipeline continuously runs an operator with in-memory mutable state. As each input record is received, the mutable state is updated and new records are sent out to downstream nodes. The problem with this model is that the mutable state is lost if the node fails. To deal with this, various techniques have been developed to make this state fault tolerant.
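To make the continuous-operator model concrete, here is a minimal Python sketch (the class name `CountingOperator` and the hashtag data are my own illustration, not from any real streaming framework): each node holds mutable in-memory state that is updated once per input record, and that state simply vanishes if the process dies, which is exactly the fault-tolerance problem the slides describe.

```python
# Sketch of the continuous-operator model: a node keeps mutable
# in-memory state (running counts) updated once per input record.
# If this process crashes, `self.state` is gone.

class CountingOperator:
    def __init__(self):
        self.state = {}  # mutable in-memory state: tag -> count

    def process(self, record):
        """Update state for one input record and emit a downstream record."""
        self.state[record] = self.state.get(record, 0) + 1
        return (record, self.state[record])  # sent to downstream nodes

node = CountingOperator()
emitted = [node.process(tag) for tag in ["#cat", "#dog", "#cat"]]
# node.state now reflects every record seen so far
```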
I am going to divide them into two broad classes and explain their limitations.

Existing Streaming Systems
- Storm: replays a record if it is not processed by a node, so each record is processed at least once. Mutable state may be updated twice, and can still be lost due to failure.
- Trident: uses transactions to update state, so each record is processed exactly once, but a per-state transaction to an external database is slow.

Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs:
- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as an RDD and processes it using RDD operations
- The processed results of the RDD operations are returned in batches
- Batch sizes as low as a second, latency of about 1 second
- Potential for combining batch processing and streaming processing in the same system

Example: Get hashtags from Twitter

    val tweets = ssc.twitterStream()

A DStream is a sequence of RDDs representing a stream of data. Each batch from the Twitter streaming API is stored in memory as an RDD (immutable and distributed).

    val hashTags = tweets.flatMap(status => getTags(status))

A transformation modifies the data in one DStream to create another DStream; new RDDs are created for every batch. Here, flatMap turns each batch of tweets into a batch of hashtags such as [#cat, #dog, ...].

    hashTags.saveAsHadoopFiles("hdfs://...")

An output operation pushes data to external storage; every batch is saved to HDFS.

    hashTags.foreach(hashTagRDD => { ... })

foreach lets you do whatever you want with the processed data: write to a database, update an analytics UI, and so on.

Demo

Java Example
Scala:

    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsHadoopFiles("hdfs://...")

Java (note the explicit Function object and the DStream of data):

    JavaDStream tweets = ssc.twitterStream()
    JavaDStream hashTags = tweets.flatMap(new Function() { })
    hashTags.saveAsHadoopFiles("hdfs://...")

Window-based Transformations

    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

A sliding window operation takes a window length (here 1 minute) and a sliding interval (here 5 seconds).

Arbitrary Stateful Computations
Specify a function to generate new state based on the previous state and the new data. Example: maintain per-user mood as state, and update it with their tweets.

    def updateMood(newTweets, lastMood) => newMood
    moods = tweetsByUser.updateStateByKey(updateMood _)

Arbitrary Combinations of Batch and Streaming Computations
Inter-mix RDD and DStream operations! Example: join incoming tweets with a spam file on HDFS to filter out bad tweets.

    tweets.transform(tweetsRDD => {
      tweetsRDD.join(spamHDFSFile).filter(...)
    })
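The window semantics just shown (a window length and a sliding interval over micro-batches) can be simulated in plain Python. This is a sketch of the semantics only, not Spark code; the function name `windowed_counts` and the batch contents are made up, and for simplicity both parameters are expressed as a number of batches rather than seconds.

```python
from collections import Counter

def windowed_counts(batches, window_len, slide):
    """Simulate window(window_len, slide).countByValue() over a list of
    micro-batches: every `slide` batches, count the values in the last
    `window_len` batches."""
    results = []
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_len)
        window = [tag for batch in batches[start:end] for tag in batch]
        results.append(Counter(window))
    return results

# e.g. 1-second micro-batches, a 3-batch window sliding every 1 batch
batches = [["#cat", "#dog"], ["#cat"], ["#mouse"], ["#dog"]]
counts = windowed_counts(batches, window_len=3, slide=1)
```

Note how consecutive results share most of their input batches; this overlap is what makes incremental window operations such as reduceByWindow worthwhile.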
DStreams + RDDs = Power
- Online machine learning: continuously learn and update data models (updateStateByKey and transform)
- Combine live data streams with historical data: generate historical data models with Spark, then use those models to process the live data stream (transform)
- CEP-style processing: window-based operations (reduceByWindow, etc.)

Input Sources
- Out of the box: Kafka, HDFS, Flume, Akka Actors, raw TCP sockets, etc.
- Very easy to write a receiver for your own data source
- You can also generate your own RDDs from Spark, etc. and push them in as a stream

Fault-tolerance
- Batches of input data are replicated in memory for fault tolerance
- Data lost due to a worker failure can be recomputed from the replicated input data: lost partitions are recomputed on other workers
- All transformations are fault tolerant, and exactly-once

Performance
- Can process 60M records/sec (6 GB/sec) on 100 nodes at sub-second latency
- Higher throughput than Storm: Spark Streaming reaches 670k records/sec/node, Storm 115k records/sec/node, and commercial systems 100-500k records/sec/node
- Spark Streaming offers similar speed while providing the fault-tolerance and consistency guarantees that these systems lack

Fast Fault Recovery
- Recovers from faults and stragglers within 1 second

Mobile Millennium Project
- Traffic transit time estimation using online machine learning on GPS observations
- Markov chain Monte Carlo simulations on GPS observations; very CPU intensive, requiring dozens of machines for useful computation
- Scales linearly with cluster size

Advantage of a Unified Stack
- Explore data interactively to identify problems
- Use the same code in Spark for processing large logs
- Use similar code in Spark Streaming for real-time processing

    $ ./spark-shell
    scala> val file = sc.hadoopFile(smallLogs)
    scala> val filtered = file.filter(_.contains(ERROR))
    scala> val mapped = ...

    object ProcessProductionData {
      def main(args: Array[String]) {
        val sc = new SparkContext(...)
        val file = sc.hadoopFile(productionLogs)
        val filtered = file.filter(_.contains(ERROR))
        val mapped = ...
      }
    }

    object ProcessLiveStream {
      def main(args: Array[String]) {
        val sc = new StreamingContext(...)
        val stream = sc.kafkaStream(...)
        val filtered = stream.filter(_.contains(ERROR))
        val mapped = ...
      }
    }

Roadmap
- Spark 0.8.1: marked alpha, but has been quite stable. Master fault tolerance requires manual recovery: restart the computation from a checkpoint file saved to HDFS.
- Spark 0.9 (January 2014): out of alpha! Automated master fault recovery, performance optimizations, a web UI, and better monitoring capabilities.

Roadmap: Long-term Goals
- Python API
- MLlib for Spark Streaming
- Shark Streaming
- Community feedback is crucial: it helps us prioritize these goals. Contributions are more than welcome!

Today's Tutorial
- Process the Twitter data stream to find the most popular hashtags over a window
- Requires a Twitter account; you need to set up Twitter OAuth keys to access tweets. All the instructions are in the tutorial.
- Your account will be safe! No need to enter your password anywhere, only the keys. Destroy the keys after the tutorial is done.

Conclusion
- Streaming programming guide
- Paper
- Thank you!
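As a closing illustration of the deck's fault-tolerance claim (lost partitions of a derived RDD are recomputed on other workers from replicated input data), here is a toy Python simulation. The helpers `flat_map` and `get_tags` and the sample data are my own; this is not how Spark is implemented internally, only the idea of lineage-based recovery.

```python
# Toy sketch of lineage-based recovery: a derived partition is never the
# only copy of anything, because it can always be recomputed by
# re-applying the recorded deterministic transformation to the
# replicated input batch.

def flat_map(partition, fn):
    """Apply fn to each record and flatten the results (the 'lineage')."""
    return [out for record in partition for out in fn(record)]

def get_tags(status):
    """Extract hashtags from a tweet's text."""
    return [w for w in status.split() if w.startswith("#")]

input_partition = ["i love #cat", "#dog and #cat"]  # replicated in memory
hash_tags = flat_map(input_partition, get_tags)      # derived partition

hash_tags = None  # simulate a worker failure: derived partition lost

# Recovery: recompute from a surviving input replica plus the lineage.
recovered = flat_map(input_partition, get_tags)
```

Determinism of the transformation is what makes this safe: recomputation yields exactly the same partition, so the result is still exactly-once.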

