Presto @ Treasure Data - Presto Meetup Boston 2015

Designing An Evolving Database Service with Presto
Taro L. Saito
Oct 6th, 2015. Presto Meetup @ Boston
leo@tresaure-data.com

Presto Usage at Treasure Data
  • 100~ customers are actively using Presto
  • 30,000~ Presto queries every day
  • Importing 1,000,000~ records / sec.
  • [Diagram labels: Import, Export, Store, Analyze with Presto/Hive]

Mobile and Web Sources
  • Mobile SDKs
  • JavaScript SDK (web access logs)

Stream Sources
  • Streaming
  • Apache logs, nginx logs, syslog
  • JSON logs

Existing Data Sources
  • Bulk Import
  • Data files (CSV, TSV, etc.)
  • MySQL, PostgreSQL, Oracle

Embedded Devices
  • Collect data from embedded Linux, serial devices, MQTT, XBee Radio, etc.

Import data, now.

Treasure Data Architecture
  • Real-Time Storage: many small log files
  • Log merge job (Hadoop MapReduce)
  • Archive Storage: time column-based partitioning into 1-hour partitions (2015-09-29 01:00:00, 2015-09-29 02:00:00, 2015-09-29 03:00:00)
  • Columnar format, stored on S3 (AWS) and Riak CS (IDCF)
  • Queried by Hive and Presto (distributed SQL query engine)

Extensible Columnar Store
  • JSON data: {time: 1412380700, user:1}
  • Additional Column: {time: 1412381000, user:2, status:200}
  • Type Escalation (int -> string): {time: 1412390000, user:U01, status:200}
  • MessagePack: a fast and compact JSON-like format
  • Auto type conversion between table schema and MessagePack types

Use Cases

E-COMMERCE: WISH.COM (http://WISH.COM)
  • Biggest Mobile Shopping
  • [Figure labels: BEFORE, AFTER]
  • Reduced costs
  • Scalability
  • Single data warehouse

GAMING
  • Before: daily upload, delay of 1-2 days (2500+ servers)
  • After: real-time (2500+ servers)
  • 1 Billion records/day
  • Reduced TCO
  • Real-time collection
  • Real-time access to KPIs
  • Top 10 globally; 40M+ users
  • x 20

AD TECH
  • Publishers Dashboard, Advertisers Dashboard
  • 800 B/month
  • Live in 2 weeks with 1 engineer!
  • 300% growth
  • Europe's largest mobile ad-exchange
  • More than 50 billion impressions/month

LOYALTY
  • Aggregation
  • E-Commerce
  • Marketing Campaigns; Promotions
  • Customer Segmentation
  • A/B Testing

Challenges
  • Handle huge query result output
      - SELECT * / CREATE TABLE AS / INSERT INTO
      - Parallel result upload to S3
      - Bypass JSON result generation at the coordinator
  • td-presto connector
      - Accesses MessagePack based columnar store
      - Handle S3 access retry / pipelining
  • Future:
      - Better query plan visualization: quickly find the performance bottleneck and memory consuming tasks
      - Storing intermediate query results to disks: process large joins, query resource limitation

[Diagram labels: Extensible Schema; SQL via Hive, Presto; Unlimited Users, Queries; Enterprise Apps; Data Science Tools; REST API; Ingestion: Streaming, Bulk; BI]
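
The schema evolution and time partitioning described on the architecture and columnar store slides can be illustrated with a small sketch. This is not Treasure Data's implementation. The function names (infer_type, evolve_schema, hour_partition), the two-type model (int and string only), and the use of UTC for partition keys are all assumptions made for illustration. The three records are the ones shown on the slide.

```python
from datetime import datetime, timezone


def infer_type(value):
    """Map a JSON value to a column type (assumed model: int or string only)."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return "int"
    return "string"


def evolve_schema(schema, record):
    """Return a new schema that also covers `record`."""
    new_schema = dict(schema)
    for column, value in record.items():
        observed = infer_type(value)
        known = new_schema.get(column)
        if known is None:
            # Additional column: first time this key is seen.
            new_schema[column] = observed
        elif known != observed:
            # Type escalation: conflicting types widen to string.
            new_schema[column] = "string"
    return new_schema


def hour_partition(epoch_seconds):
    """Truncate a Unix timestamp to its 1-hour partition key (UTC assumed)."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return ts.strftime("%Y-%m-%d %H:00:00")


# The three example records from the slide.
records = [
    {"time": 1412380700, "user": 1},
    {"time": 1412381000, "user": 2, "status": 200},
    {"time": 1412390000, "user": "U01", "status": 200},
]

schema = {}
for record in records:
    schema = evolve_schema(schema, record)

partitions = [hour_partition(r["time"]) for r in records]
```

After the loop, user has been widened from int to string by the third record, status appears as a new int column from the second record onward, and each record maps to the hourly partition that contains its time value.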