Spark, Spark Streaming & Tachyon

  • Published on
    20-Aug-2015

Transcript

1. Spark, Spark Streaming & Tachyon
   Solving big data problems without programming for big data
2. Who am I? What do we do?
   Name: Johan Hong, johan.hong@pearson.com
   Software Architect at Pearson Higher Education
   Deliver personalized and connected learning at scale
   Build an assessment platform with microservices to serve internal and public services and applications
3. Definitions
   Apache Spark is a fast and general engine for large-scale data processing.
   Apache Spark is a cluster computing platform designed to be fast and general-purpose.
   Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
   Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks.
4. Spark Stack
5. Stack with Tachyon
6. Distributed Execution
7. HDFS Architecture
8. HDFS Block Replication
9. Limitations of MapReduce
10. Spark Runtime
11. RDD is an Interface
    Advanced Spark Internals and Tuning
12. Sample Application
13. Narrow & Wide Dependencies
14. Tachyon System Architecture
    Tachyon: Memory Throughput I/O for Cluster Computing Frameworks
15. Spark Execution Plan
16. Spark Executor on Worker Node
17. Fault Tolerance in Spark Streaming
    Could data be lost if the receiving node crashes before it replicates incoming data to other data node(s)?
    It happens. Ooyala loses 1% of its data, but that is considered acceptable.
    What can we do to prevent data loss?
    We could persist events before they reach the Spark Streaming receiver, then replay the events/messages after the receiver crashes and recovers.
18. Spark Internal Operations
19. Data Locality
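The persist-then-replay idea from slide 17 can be illustrated with a small sketch. This is plain Python, not Spark or Tachyon code: `DurableLog`, `Receiver`, and `run_demo` are hypothetical names invented for this example, a local file stands in for whatever durable queue sits in front of the receiver, and the committed offset is kept in memory only for brevity (a real setup would have to store it durably too).

```python
import json
import os
import tempfile


class DurableLog:
    """Hypothetical append-only event log on disk (stand-in for a durable queue)."""

    def __init__(self, path):
        self.path = path

    def append(self, event):
        # Persist the event before any receiver gets to see it.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def read_from(self, offset):
        # Return all events at or after the given offset (0-based line index).
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f][offset:]


class Receiver:
    """Hypothetical receiver that buffers events in memory and may crash."""

    def __init__(self, log):
        self.log = log
        # Assumption: in a real system this offset is stored durably.
        self.committed_offset = 0
        self.buffer = []

    def poll(self, max_events):
        # Pull up to max_events uncommitted events into memory.
        self.buffer = self.log.read_from(self.committed_offset)[:max_events]
        return list(self.buffer)

    def commit(self):
        # Mark the buffered events as safely processed.
        self.committed_offset += len(self.buffer)
        self.buffer = []

    def crash(self):
        # Simulate a crash: the in-memory buffer is lost, the log is not.
        self.buffer = []


def run_demo():
    with tempfile.TemporaryDirectory() as tmp:
        log = DurableLog(os.path.join(tmp, "events.log"))
        for i in range(5):
            log.append({"id": i})

        receiver = Receiver(log)
        processed = []

        processed.extend(receiver.poll(max_events=3))  # events 0..2
        receiver.commit()

        receiver.poll(max_events=2)  # events 3..4 are only in memory
        receiver.crash()             # lost before they were committed

        processed.extend(receiver.poll(max_events=10))  # replay from offset 3
        receiver.commit()
        return [event["id"] for event in processed]


if __name__ == "__main__":
    print(run_demo())
```

Because uncommitted events are re-read after a crash, this approach gives at-least-once delivery: an event may be processed twice if the crash happens after processing but before the commit, so downstream processing should tolerate duplicates.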