Presto at Facebook - Presto Meetup @ Boston (10/6/2015)

Presto @ Facebook
Martin Traverso and Dain Sundstrom

Presto @ Facebook
  • Ad-hoc/interactive queries for warehouse
  • Batch processing for warehouse
  • Analytics for user-facing products
  • Analytics over various specialized stores

Analytics for Warehouse

Architecture
[Diagram labels: UI, CLI, Dashboards, Other tools; Gateway; Presto, Presto; Warehouse Cluster, Warehouse Cluster]

Deployment
[Diagram labels: Presto, HDFS Datanode, MR (repeated per node)]

Stats
  • 1000s of internal daily active users
  • Millions of queries each month
  • Scan PBs of data every day
  • Process trillions of rows every day
  • 10s of concurrent queries

Features
  • Pipelined partition/split enumeration
  • Streaming
  • Admission control
  • Resource management
  • System reliability

Batch workloads

Batch Requirements
  • INSERT OVERWRITE
  • More data types
  • UDFs
  • Physical properties (partitioning, etc.)

Analytics for User-facing Products

Requirements
  • Hundreds of ms to seconds latency, low variability
  • Availability
  • Update semantics
  • 10 - 15 way joins

Architecture
[Diagram labels: Loader, Presto Worker (x3), MySQL, Loader, MySQL, MySQL, Client]

Stats
  • > 99.99% query success rate
  • 100% system availability
  • 25 - 200 concurrent queries
  • 1 - 20 queries per second

Presto Raptor

Requirements
  • Large data sets
  • Seconds to minutes latency
  • Predictable performance
  • 5-15 minute load latency
  • Reliable data loads (no duplicates, no missing data)
  • 10s of concurrent queries

Basic Architecture
[Diagram labels: Coordinator, MySQL, Worker + Flash (x3), Client]

But isn't that exactly what Hive does?

Additional Features
  • Full featured and atomic DDL
  • Table statistics
  • Tiered storage
  • Atomic data loads
  • Physical organization

Table Statistics
  • Table is divided into shards
  • Each shard is stored in a separate replication unit (i.e., file)
  • Typically 1 to 10 million rows
  • Node assignment and stats stored in MySQL

Table Schema in MySQL

Tables
    id  name
    1   orders
    2   line_items
    3   parts

table1 shards
    uuid  nodes  c1_min  c1_max  c2_min  c2_max   c3_min  c3_max
    43a5  A      30      90      cat     dog      2014    2014
    6701  C      34      45      apple   banana   2005    2015
    9c0f  A,D    25      26      cheese  cracker  1982    1994
    df31  B      23      71      tiger   zebra    1999    2006

Tiered Storage
[Diagram labels: Coordinator, MySQL, Worker + Flash (x3), Client, Backup]

Tiered Storage
  • One copy in local, expensive flash
  • Backup copy in cheap, durable backup tier
  • Currently Gluster internally, but can be anything durable
  • Only assumes GET and PUT methods with a client-assigned ID

Atomic Data Loads
  • Import data periodically from streaming event system
  • Internally a Scribe-based system similar to Kafka or Kinesis
  • Provides continuation tokens
  • Loads performed using SQL

Atomic Data Loads
    INSERT INTO target
    SELECT * FROM source_stream
    WHERE token BETWEEN ${last_token} AND ${next_token}

Loader Process
  1. Record new job with "now" token in MySQL
  2. Execute INSERT from last committed token to "now" token with external batch id
  3. Wait for INSERT to commit (check external batch status)
  4. Record job complete
  5. Repeat

Failure Recovery
  • Loader crash
      - Check status of jobs using external batch id
  • INSERT hang
      - Cancel query and roll back job (verify status to avoid race)
  • Duplicate loader processes
      - Process guarantees only one job can complete
      - Monitor for lack of progress (also catches the case of no loaders)

Physical Organization
  • Temporal organization
      - Ensure files don't cross temporal boundaries
      - Common filter clause
      - Eases retention policies
  • Sorted files
      - Can reduce file sections processed (local stats)
      - Can reduce shards processed

Unorganized Data
[Diagram axes: Sort Columns, Time]

Organized Data
[Diagram axes: Sort Columns, Time]

Background Organization
  • Compaction
  • Balance data
  • Eager data recovery (from backup)
  • Garbage collection
      - Junk created by compaction, delete, balance, recovery

Future Use Cases
  • Hot data cache for Hadoop data
      - 0-N local copies of backup tier
  • Query results cache
  • Raw, not rolled-up, data store for sharded MySQL customers
  • Materialized view store
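
Looking back at the Table Statistics and Physical Organization slides: the per-shard min/max values stored in MySQL are what allow a filter to "reduce shards processed". The following is a minimal sketch of that idea, not Raptor's actual implementation. The `Shard` class and the `shards_for_range` function are hypothetical names introduced here for illustration; the shard values are copied from the example shards table in the slides, restricted to column c1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """One row of the shards metadata table (hypothetical, c1 stats only)."""
    uuid: str
    nodes: tuple
    c1_min: int
    c1_max: int


# Values taken from the example shards table in the slides.
SHARDS = [
    Shard("43a5", ("A",), 30, 90),
    Shard("6701", ("C",), 34, 45),
    Shard("9c0f", ("A", "D"), 25, 26),
    Shard("df31", ("B",), 23, 71),
]


def shards_for_range(shards, low, high):
    """Keep only shards whose [c1_min, c1_max] overlaps [low, high].

    A shard whose range does not overlap the filter cannot contain a
    matching row, so it can be skipped without being read.
    """
    return [s for s in shards if s.c1_max >= low and s.c1_min <= high]


if __name__ == "__main__":
    # Filter equivalent to: WHERE c1 BETWEEN 80 AND 100
    for shard in shards_for_range(SHARDS, 80, 100):
        print(shard.uuid, shard.nodes)
```

With the example data, a filter such as c1 BETWEEN 80 AND 100 leaves only shard 43a5, so the other three shards never need to be scanned. The narrower each shard's min/max range is (which is what sorting and temporal organization aim for), the more shards a typical filter can skip.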