Profiling Network Performance for Multi-Tier Data Center Applications
Offense by Balasaheb Bagul and Rumou Duan

Slide 2. Polling Interval: how long should it be? (Bala)
  - Errors appear when the interval is either smaller or greater than 500 ms.
  - Error metric: (dt - ds) / dt, where dt comes from packet traces and ds from SNAP traces.
  - There is only one case in which no errors are caught at 500 ms.
  - Pg. 3 (i), fine-grained profiling: should the timescale be small? Tens of milliseconds to a second?
  - If CPU usage is reduced -> errors. If packet traces are fine-grained -> errors.

Slide 3. SNAP Configuration - I (Rumou)
  - 8K hosts and just 700 applications!
  - In Section 3.2, only discrete socket-call logs are collected (“99.8% of connections has low throughput less than 1 MB/s”).
  - 1 GB of data per host per day and 1 TB per week!
  - Continuous TCP logs are completely ignored.
  - With a polling interval averaging 500 ms, 120 bytes per connection per poll?

Slide 4. SNAP Configuration - II (Bala)
  - Where is the collected data analyzed? On the hosts or on a centralized server?
  - Centralized: (8000 * 1) GB per day of socket logs alone.
  - How and when is this data sent to the central server?

Slide 5. SNAP Configuration - III (Rumou)
  - Socket-to-process mapping is done when the sockets are open.
  - Processes can create new sockets and close old ones dynamically.
  - So the mapping has to be done within that short time frame, and continuously.

Slide 6. CPU Overhead - I (at each host) (Rumou)
  - Polling TCP stats + reading TCP table = 5% + 5% < 10%
  - Collecting socket logs: 1.6%.
  - TCP performance classifier?

Slide 7. Fine-grained profiling? The TCP Incast Problem (Rumou)
  - In the paper: “For example, the TCP incast problem [3], caused by micro bursts of traffic at the timescale of tens of milliseconds, is not even visible in SNMP data.”
  - However, based on Figure 8, the CPU overhead is really large.

Slide 8. CPU Overhead - II (at the server) (Bala)
  - Cross-connection correlation is centralized.
  - How will it scale? No mention of it!
  - How does it work? “SNAP has full knowledge of network topology, the network-stack configuration, and mappings of applications to servers.”

Slide 9. SNAP Validation (Bala)
  - The test bed includes only 36 hosts!
  - Extremely little data was collected.
  - ACC (average correlation coefficient) = 0.4. Why?
  - Are all the connections with ACC just above 0.4 facing problems?

Slide 10. Advice to DC Operators - Seriously! (Bala)
  - “Operators should schedule backup jobs more carefully to avoid triggering network congestion”
  - 2 am to 4 am is the most idle time for bulk transfers! So why change it?
  - “Operators should disable delayed ACK or reduce it significantly”
  - What about time-critical applications?

Slide 11. Advice to Developers - Again, Seriously! (Rumou, Pg. 11)
  - Claim: “Developers can use these logs to quickly find the root cause of performance problems.”
  - The problems that SNAP detected took several days or weeks to solve! Do developers have weeks to spare?
  - So does this mean that SNAP's data is not useful enough for developers?
  - “There should be better scheduling of traffic across applications…” How?

Slide 12. Conclusion
  - Not scalable because of the centralized server.
  - A huge amount of data is collected per host per day, continuously.
  - Get it to work with more applications! (Linux & Windows)

Slide 13. Thank you!
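
The numbers quoted on the polling-interval and SNAP-configuration slides can be checked with a few lines of arithmetic. The sketch below is not from the paper or the slides; the function names are hypothetical, and the only inputs are the figures the slides state (8000 hosts, 1 GB per host per day, a 500 ms polling interval, 120 bytes per connection per poll) together with the error metric (dt - ds) / dt.

```python
# Back-of-envelope checks using only figures quoted on the slides.
# All function names here are illustrative, not part of SNAP.

HOSTS = 8000                   # "8K hosts"
GB_PER_HOST_PER_DAY = 1        # "1 GB of data per host per day"
POLL_INTERVAL_S = 0.5          # "polling interval ... 500 ms"
BYTES_PER_CONN_PER_POLL = 120  # "120 bytes per connection per poll"
SECONDS_PER_DAY = 24 * 60 * 60


def relative_error(dt: float, ds: float) -> float:
    """Error metric from the slides: (dt - ds) / dt.

    dt is the value measured from packet traces, ds the value from SNAP.
    """
    if dt == 0:
        raise ValueError("dt must be non-zero")
    return (dt - ds) / dt


def aggregate_gb_per_day(hosts: int = HOSTS,
                         gb_per_host: float = GB_PER_HOST_PER_DAY) -> float:
    """Total log volume arriving at a centralized server per day, in GB."""
    return hosts * gb_per_host


def polled_bytes_per_conn_per_day(interval_s: float = POLL_INTERVAL_S,
                                  bytes_per_poll: int = BYTES_PER_CONN_PER_POLL) -> int:
    """Bytes recorded per connection per day if it is polled all day long."""
    polls_per_day = SECONDS_PER_DAY / interval_s
    return int(polls_per_day * bytes_per_poll)


if __name__ == "__main__":
    print(aggregate_gb_per_day())            # 8000 GB per day in total
    print(polled_bytes_per_conn_per_day())   # 20736000 bytes, about 20.7 MB
    print(relative_error(dt=100.0, ds=95.0)) # made-up sample values
```

The per-connection figure assumes a connection that stays open and is polled for the full 24 hours, which is an upper bound rather than a typical case.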