Realtime Reporting using Spark Streaming

  • Published on 13-Aug-2015


Transcript

1. Breaking the ETL barrier: real-time reporting with Kafka and Spark Streaming

2. About us: Concur (now part of SAP) provides travel and expense management services to businesses.

3. Data Insights: a team building solutions that give customers access to data, visualization, and reporting. Expense | Travel | Invoice

4. About me: Santosh Sahoo, Principal Architect III, Data Insights

5. Stack so far: App -> OLTP -> ETL -> OLAP -> Report

6. Numbers: 7K OLTP database sources; 14K OLAP reporting DBs; 28K ETL jobs; 2B row changes; 300M rows (compacted); only ~20 failures a night

7. Traditional ETL challenges: scheduled (high latency); hard to scale; failover and recovery; monolithic; spaghetti (logic + SQL)

8. Moving forward: streaming, real-time; scalable; highly available; reduced maintenance overhead; eventual consistency

9. Streaming data pipeline: Source -> Flow management -> Processor -> Storage -> Querying

10. Data sources: event bus for business events; change data capture via transaction log scraping (Oracle GoldenGate, MySQL binlog, MongoDB oplog, Postgres BottledWater, SQL Server fn_dblog); application messaging/JMS; micro-batching (high-watermarked change tracking)

11. Kafka for flow management: no-nonsense logging; ~100K msgs/s throughput vs ~20K for RabbitMQ; log compaction; durable persistence; partition tolerance; replication; best-in-class integration with Spark

12. Columnar storage, optimized for analytic query performance: vertical partitioning, column projection, compression, loosely coupled schema. Examples: HBase, AWS Redshift, Parquet, ORC, Postgres (Citus), SAP HANA

13. Hadoop/HDFS: pro - scale; con - latency

14. Spark Streaming. What? A data-processing framework for building scalable, fault-tolerant streaming applications. Why? It lets you reuse the same code for batch processing, join streams against historical data, and run ad-hoc queries on stream state.

15. Spark Streaming architecture: a Driver on the Master schedules tasks across Workers; Receivers ingest data blocks (D1..D4), write them to a write-ahead log (WAL), and replicate them across Executors before they are processed and written to the data store. DStream - a discretized stream of RDDs. RDD - Resilient Distributed Dataset.

16. Optimized direct Kafka API: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html

17. How:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092,anotherhost:9092")
    val topics = Set("sometopic", "anothertopic")
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      streamingContext, kafkaParams, topics)

18. Architecture

19. High-level view: App -> OLTP -> Kafka -> Spark Streaming -> OLAP -> Reporting App

20. Complete architecture: sources (Expense, Travel, TTX, via API/SQL) feed a Kafka broker (Protobuf/JSON payloads) alongside legacy import paths (FTP, HTTP, SMTP, Flume, Camus, Sqoop snapshots); stream processors (Spark Streaming, or alternatives such as Samza, Storm, Flink) write to HDFS/Tachyon for archive and to HANA (load-balanced, with standby, failover, and replication) for reporting; Pig/Hive/MR handle normalization; Hive/Spark SQL, Cognos, and Tableau serve queries; compensation jobs cover data quality, correction, and analytics.

21. Can Spark Streaming survive Chaos Monkey? http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

22. Lambda architecture: a data-processing pattern designed to handle massive quantities of data by combining batch- and stream-processing methods.

23. Demo

24. Q&A

25. We are hiring: concur.com/en-us/careers

26. Thank you!
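Slide 10's "micro batching (high watermarked, change tracking)" idea can be sketched in a few lines: each extraction round pulls only rows whose change version is above the last watermark, then advances the watermark. `Row`, `fetchChanges`, and the in-memory table below are illustrative stand-ins, not part of any real connector.

```scala
// Toy model of high-watermark micro-batch extraction (slide 10).
case class Row(id: Int, version: Long, payload: String)

object WatermarkBatch {
  // Pull rows changed since the watermark, in version order.
  def fetchChanges(table: Seq[Row], watermark: Long): Seq[Row] =
    table.filter(_.version > watermark).sortBy(_.version)

  // One micro-batch: returns the extracted rows and the advanced watermark.
  def runBatch(table: Seq[Row], watermark: Long): (Seq[Row], Long) = {
    val changes = fetchChanges(table, watermark)
    val next = if (changes.isEmpty) watermark else changes.map(_.version).max
    (changes, next)
  }
}
```

Because the watermark only moves forward after a successful batch, a failed run can simply be retried from the old watermark, which is one reason this pattern recovers more cleanly than scheduled full extracts.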
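Slide 11 lists log compaction among Kafka's strengths. Its semantics can be modeled in a few lines: compaction retains only the latest record per key, and a null value acts as a tombstone that deletes the key. This is an illustrative sketch of the semantics only, not Kafka's actual implementation.

```scala
// Toy model of Kafka-style log compaction (slide 11).
object CompactionSketch {
  // records in append order: (key, value); value = None is a tombstone
  def compact(log: Seq[(String, Option[String])]): Map[String, String] =
    log.foldLeft(Map.empty[String, String]) {
      case (state, (k, Some(v))) => state + (k -> v)
      case (state, (k, None))    => state - k
    }
}
```

This is why a compacted topic can serve as a durable changelog for the reporting pipeline: replaying it from the beginning reconstructs the latest state of every row without replaying every intermediate change.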
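The DStream on slide 15 is "a discretized stream of RDDs": the engine chops a continuous stream into fixed-size batch intervals and processes each interval as one RDD. A minimal sketch of that discretization step, using plain collections rather than Spark:

```scala
// Toy model of DStream discretization (slide 15): bucket timestamped
// events into fixed batch intervals, one bucket per would-be RDD.
object DiscretizeSketch {
  // events: (timestampMs, payload); batchMs: the batch interval length
  def discretize(events: Seq[(Long, String)], batchMs: Long): Map[Long, Seq[String]] =
    events
      .groupBy { case (t, _) => t / batchMs }          // batch index
      .map { case (bucket, es) => bucket -> es.map(_._2) }
}
```

With a 1-second interval, events at 0 ms and 500 ms land in batch 0 and an event at 1200 ms lands in batch 1; in Spark, each such bucket would become an RDD that the same batch-style code can process.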
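Slide 22's lambda architecture answers queries by merging a batch view (complete but stale) with a real-time view (covering only data since the last batch run). A minimal sketch of that merge step, assuming both views are simple per-key counters (the key names are hypothetical):

```scala
// Toy model of the lambda-architecture query merge (slide 22).
object LambdaSketch {
  // batchView: counts from the last full batch job
  // realtimeView: counts accumulated by the streaming layer since then
  def merge(batchView: Map[String, Long], realtimeView: Map[String, Long]): Map[String, Long] =
    (batchView.keySet ++ realtimeView.keySet).iterator.map { k =>
      k -> (batchView.getOrElse(k, 0L) + realtimeView.getOrElse(k, 0L))
    }.toMap
}
```

The design trade-off is the one the deck circles around: the streaming layer gives low latency and eventual consistency, while the periodic batch recompute corrects any drift.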
