Hadoop - Past, Present and Future - v1.1

  • Published on
    11-Aug-2014

DESCRIPTION

Overview of Hadoop: what it was, what it is and what it will be.

Transcript

  • Hadoop: Past, Present and Future. Presented by: Big Data Joe Rossi (@bigdatajoerossi), 6/19/2014
  • Roadmap (~45 mins): 1- What Makes Up Hadoop 1.x? 2- What's New In Hadoop 2.x? 3- The Future Of Hadoop
  • What Makes Up Hadoop 1.x?
  • Hadoop 1.0: HDFS + MapReduce [Diagram: Client, NameNode, JobTracker, and four DataNode / TaskTracker worker nodes; blocks labeled 1-1 through 4-3 spread across the worker nodes; Map and Reduce tasks]
  • MapReduce v1 Limitations. Scalability: maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000. Availability: JobTracker failure kills all queued and running jobs. Resources Partitioned into Map and Reduce: hard partitioning of Map and Reduce slots led to low resource utilization. No Support for Alternate Paradigms / Services: only MapReduce batch jobs, nothing else.
  • Apache Hadoop 1.0: Single Use System (Batch Apps) [Stack: Pig and Hive; MapReduce (cluster resource management and data processing); HDFS (redundant, reliable storage)]
  • What's New In Hadoop 2.x?
  • YARN (Yet Another Resource Negotiator) replaces MapReduce as the cluster resource manager; MapReduce continues to run as one application on YARN. YARN will be the de-facto distributed operating system for Big Data.
  • YARN: Taking Hadoop Beyond Batch. Store DATA in one place. Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service. Applications Run Natively IN Hadoop: BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (DataTorrent), GRAPH (Giraph), all on YARN (cluster resource management) over HDFS2 (redundant, reliable storage).
  • YARN: Applications. MapReduce v2, Stream Processing, Master-Worker, Online, In-Memory, Apache Storm. All running on the same Hadoop cluster to give applications access to the same source data!
  • YARN: Moving Quickly [Timeline, 2010 to 2014 (today): Conceived at Yahoo!; Alpha Releases (2.0); Beta Releases (2.1); GA Released (2.2); Version 2.3; Version 2.4]. 100,000+ nodes, 400,000+ jobs daily, 10 million+ hours of compute daily.
  • YARN: Dr. Evil Approved
  • YARN: What Has Changed? YARN vs. MRv1: RM (ResourceManager) and AM (ApplicationMaster) vs. JT (JobTracker); Scheduler vs. Scheduler; NM (NodeManager) vs. TT (TaskTracker); Container vs. Map / Reduce. [Diagram: on the YARN side, a ResourceManager (Scheduler) with NodeManagers hosting an ApplicationMaster and Containers; on the MRv1 side, a JobTracker (Scheduler) with TaskTrackers running Map and Reduce tasks]
  • 6 Benefits of YARN: Scale; New programming models and services; Improved cluster utilization; Agility; Backwards compatible with MapReduce v1; Mixed workloads on the same source of data.
  • The Future of Hadoop: Projects and Roadmap
  • Stinger: Interactive Query for Hive. Speed: deliver interactive query through 100x performance increases as compared to Hive 0.10. SQL: support the broadest array of SQL semantics for analytic applications running against Hadoop. Scale: the only SQL interface to Hadoop designed for queries that scale from terabytes to petabytes.
  • HOYA: HBase (NoSQL) on YARN. Dynamic Scaling: on-demand cluster size; increase and decrease the size with load. Easier Deployment: APIs to create, start, stop and delete HBase clusters. Availability: recover from Region Server loss with a new container.
  • Microsoft REEF (Retainable Evaluator Execution Framework). Machine Learning: framework well suited for building machine learning jobs. Scalable / Fault Tolerant: makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models. Maintain State: users can build jobs that utilize data from where it's needed and also maintain state after jobs are done.
  • Heterogeneous Storages in HDFS [Diagram: NameNode and storage types SATA, SSD, Fusion IO]
  • Hadoop Roadmap. Apache Hadoop 2.4 (RELEASED EARLY Q2 2014): ResourceManager HA / Auto Failover; HDFS Rolling Upgrades. Apache Hadoop 2.5 (MID Q2 2014): NodeManager Restart w/o disruption; Dynamic Resource Configuration.
  • I Know You Have Questions. No such thing as a stupid question. Hadoop: Past, Present and Future
  • One Last Thing: SD Big Data Meetup, meetup.com/sdbigdata. 2nd Wednesday Of The Month. Next: July 9th @ 5:45P
  • Thank You! Hadoop: Past, Present and Future Big Data Joe Rossi http://bigdatajoe.io/ @bigdatajoerossi
  • Supporting Slides: slides with information that may be asked about
  • YARN: How It Works [Diagram: a Client; the ResourceManager (Scheduler); four NodeManagers hosting one ApplicationMaster and three Containers]
  • YARN: Example App Deployment [Diagram: a HOYA Client; the ResourceManager (Scheduler); four NodeManagers hosting the HOYA / HBase Master and three Region Servers]
  • Storm Vs. DataTorrent Solution Matrix
    Feature | DataTorrent | Apache Storm
    Atomic Micro-batch | 1 | 3
    Events per Second | Billions | Thousands
    Automated Parallelism | 1 | 3
    Dynamic Runtime Changes | 1 | 3
    Linear Scalability | 1 | 3
    State Checkpointing | 1 | 3
  • Apache Spark + Shark [Stack: Shark: Hive (SQL); Apache Spark; YARN (cluster resource management); HDFS2 (redundant, reliable storage)]
  • Hadoop 2.x: YARN + HDFS [Diagram: NameNode; Standby NameNode / ResourceManager; four DataNode / NodeManager worker nodes, each running two Containers]
  • YARN: Key Take-Aways. Backwards Compatible: YARN is backwards compatible for your existing MapReduce applications, so you can get value from it right away. Resource Management: YARN enables fine-grained resource management for better cluster utilization. One Source of Data: YARN allows you to interact with one source of data in multiple ways while maintaining predictable performance and quality of service. Enabling Smart People: YARN is a flexible framework that is enabling smart people and companies to do amazing things with data. YARN will be the de-facto distributed operating system for Big Data.
  • Storm Vs. DataTorrent - Detailed Solution Matrix
    Feature | DataTorrent | Apache Storm
    Proprietary / Open Source | O | O
    Support for Hadoop 1.x | 1 | 1
    Support for Hadoop 2.x | 1 | 1
    Native YARN | 1 | 3
    Dashboard | 1 | 3
    Extensible via Modules | 1 | 1
    Technical Support | 1 | 1
    Atomic Micro-batch | 1 | 3
    Events per Second | Billions | Thousands
    Automated Parallelism | 1 | 3
    Dynamic Runtime Changes | 1 | 3
    High Availability | 1 | 2
    Prog. Languages Supported | Java, Python, etc. | Java, Python, etc.
    Log Analysis | 1 | 3
    Site Operations | 1 | 3
    MapReduce Diagnostics | 1 | 3
    Open Source Operators Library | 1 | 2
    Open Source Application Templates | 1 | 3
    Complex Computations (DAG) | 1 | 3
    Linear Scalability | 1 | 3
    Security | 1 | 3
    CLI and Macros | 1 | 3
    Configuration Based Specification | 1 | 3
    State Checkpointing | 1 | 3
  • The 1st Generation Of Hadoop. Users were forced to create data system silos (shown: Hadoop, HBase) for managing mixed workloads. Developers were forced to abuse the very specific MapReduce model to fit their use cases.
  • Apache Spark [Stack: Shark: Hive (SQL), Spark Streaming, MLlib (machine learning); Apache Spark; YARN (cluster resource management); HDFS2 (redundant, reliable storage)]
  • Project Mgt Committee Members [Bar chart; companies: Hortonworks, Others, Cloudera, Yahoo!, Facebook; bar values as extracted: 7, 6, 3, 15, 11]
  • Project Committers [Bar chart; companies: Hortonworks, Others, Cloudera, Yahoo!, Facebook; bar values as extracted: 24, 24, 11, 11, 5]
  • YARN: Why The De-Facto Distributed OS. Technology Adoption: 100,000+ nodes, 400,000 jobs, 10m compute hours daily. Enables Innovation: smart people and companies can do amazing things with data. Financial Backing: 568m+ invested in Hadoop contributing companies, nearly 400m in 2013 alone.
  • Apache Storm Topology [Diagram: two Spouts, each reading a Stream (Data Source); Bolts for Filter, Calculation, RDBMS Writes, and HDFS Writes; outputs to an RDBMS and HDFS]
  • HDFS Write Data Flow [Diagram: Client, NameNode, and three DataNodes (A, B, C); numbered steps 1 to 7; Block Bytes sent to each DataNode, an Ack returned by each, then Block Write Complete]
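
The HDFS Write Data Flow slide can be approximated with a small simulation. This is a minimal Python sketch, not HDFS client code: the `DataNode`, `NameNode`, and `write_block` names are hypothetical, choosing the first three nodes as targets is an assumption, and real HDFS streams a block as packets rather than in one call. It shows only the shape of the flow as read from the slide: the client asks the NameNode for targets, block bytes pass down the DataNode pipeline (A, then B, then C), acks travel back in reverse, and the write is then recorded as complete.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Toy stand-in for an HDFS DataNode; keeps block bytes in memory."""
    name: str
    blocks: dict = field(default_factory=dict)

    def write(self, block_id, data, downstream):
        # Store the bytes locally, then forward them to the next node.
        self.blocks[block_id] = data
        acks = []
        if downstream:
            acks = downstream[0].write(block_id, data, downstream[1:])
        # Acks come back up the pipeline; this node adds its own last.
        return acks + [self.name]


class NameNode:
    """Toy stand-in for the NameNode; tracks which nodes hold each block."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}

    def allocate(self, block_id):
        # Assumption: simply pick the first `replication` DataNodes.
        return self.datanodes[: self.replication]

    def complete(self, block_id, acked_by):
        # Record the block only after every replica has acknowledged it.
        self.block_map[block_id] = sorted(acked_by)


def write_block(namenode, block_id, data):
    """Client side: get targets, push bytes down the pipeline, finish."""
    pipeline = namenode.allocate(block_id)
    acks = pipeline[0].write(block_id, data, pipeline[1:])
    if len(acks) != len(pipeline):
        raise IOError("pipeline did not fully acknowledge the block")
    namenode.complete(block_id, acks)
    return acks


nodes = [DataNode("A"), DataNode("B"), DataNode("C")]
nn = NameNode(nodes)
acks = write_block(nn, "blk_1", b"hello")
print(acks)
```

Running it prints `['C', 'B', 'A']`: the last DataNode in the pipeline acknowledges first, and the client only sees success once all three replicas are stored.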
