[Spark meetup] Spark Streaming Overview

  • Published on
    14-Jul-2015

Transcript

SPARK STREAMING OVERVIEW

SparkSQL, Spark Streaming, MLlib (machine learning), GraphX (graph)

Kafka provides seamless integration between producers and consumers of information, without blocking the producers and without letting producers know who the final consumers are.
  • Each consumer keeps control of its own (read) offset.
  • On-demand topic creation.

  • ETL and ELT, with a wide catalog of sources and sinks.
  • Flexible design of topologies and agent deployment strategies.
  • Data transformation, thanks to interceptors.

readClob, readCSV, readLine, readMultiLine, readAvro, readJson, addCurrentTime, addLocalHost, geoIP, findReplace, Split, generateUUID, decompress, If, extractJsonPaths, detectMimeType, xquery, extractURIComponents, xslt, Grok (regular expressions)

exec, spooling, logger

Cassandra, Kafka, Stratio Deep

Shark (SQL), Spark Streaming, MLlib (machine learning), GraphX (graph)

RDD, what is that?

Spark Streaming: Overall view

Discretized Stream or DStream.

Input DStreams and Receivers.
  • Basic (distributed with Spark Streaming).
  • Advanced (available as dependency).

Basic sources
  • File Stream.
  • Sockets.
  • Actors (Akka).
  • Queue RDDs (Testing).

Advanced sources

Do It Yourself
  • Code onStart()
  • Code onStop()
  • Code receive()
  • Custom Receiver ready!

  • map(func), flatMap(func), filter(func), count()
  • repartition(numPartitions)
  • union(otherStream)
  • reduce(func), countByValue(), reduceByKey(func, [numTasks])
  • join(otherStream, [numTasks]), cogroup(otherStream, [numTasks])
  • transform(func)
  • updateStateByKey(func)

  • window(windowLength, slideInterval)
  • countByWindow(windowLength, slideInterval)
  • reduceByWindow(func, windowLength, slideInterval)
  • reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
  • countByValueAndWindow(windowLength, slideInterval, [numTasks])

  • print()
  • foreachRDD(func)
  • saveAsObjectFiles(prefix, [suffix])
  • saveAsTextFiles(prefix, [suffix])
  • saveAsHadoopFiles(prefix, [suffix])

  • Stateful transformations (updateStateByKey, reduceByKeyAndWindow).
  • As a fault-tolerance mechanism, for when the driver crashes.
HDFS is mandatory if you are going to use operations that require checkpointing.

Configuration parameters
  • spark.streaming.receiver.maxRate
  • spark.streaming.concurrentJobs
  • spark.streaming.receiver.writeAheadLog.enable
  • spark.streaming.unpersist

Each node has mutable state, and for each record it has to update that state and send new records.
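
The DStream idea and two of the operations named in the slides, updateStateByKey(func) and window(windowLength, slideInterval), can be tried out without a cluster. What follows is a minimal plain-Python simulation, not the Spark API: the function names discretize, update_state_by_key and window are hypothetical helpers written for this sketch, the sample records are made up, and window sizes are expressed in number of batches rather than in time units as Spark expects.

```python
from collections import defaultdict


def discretize(records, batch_interval):
    """Split (timestamp, value) records into consecutive micro-batches.

    Batch i holds the values whose timestamp falls in
    [i * batch_interval, (i + 1) * batch_interval), which mirrors the idea
    of a DStream as a sequence of small per-interval datasets.
    """
    if not records:
        return []
    last = max(ts for ts, _ in records)
    batches = [[] for _ in range(int(last // batch_interval) + 1)]
    for ts, value in records:
        batches[int(ts // batch_interval)].append(value)
    return batches


def update_state_by_key(batches, update_func):
    """Carry per-key state from batch to batch.

    update_func(new_values, old_state) is called for every key that has
    either previous state or new values; returning None drops the key.
    Returns one snapshot of the state per batch.
    """
    state = {}
    snapshots = []
    for batch in batches:
        grouped = defaultdict(list)
        for key, value in batch:
            grouped[key].append(value)
        new_state = {}
        for key in set(state) | set(grouped):
            result = update_func(grouped.get(key, []), state.get(key))
            if result is not None:
                new_state[key] = result
        state = new_state
        snapshots.append(dict(state))
    return snapshots


def window(batches, window_length, slide_interval):
    """Merge the last window_length batches every slide_interval batches."""
    out = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        out.append([x for b in batches[start:end] for x in b])
    return out


# Made-up input: (arrival time in seconds, word)
records = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (2.5, "a"), (2.9, "c")]

batches = discretize(records, batch_interval=1.0)
pairs = [[(word, 1) for word in batch] for batch in batches]

# Running word count: add the new values to the previous count (0 if none).
running = update_state_by_key(pairs, lambda new, old: (old or 0) + sum(new))

# Window of 2 batches, sliding by 1 batch.
windows = window(batches, window_length=2, slide_interval=1)

print(batches)
print(running[-1])
print(windows)
```

In this simulation the running count lives in an ordinary dictionary. In Spark Streaming the equivalent state has to survive failures, which is why the stateful transformations above are tied to checkpointing.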