100GbE in Datacenter Interconnects: When, Where?
Bikash Koley, Network Architecture, Google, Sep 2009
Datacenter Interconnects (slide 3)
Large number of identical compute systems
Interconnected by a large number of identical switching elements
Can be within a single physical boundary or can span several physical boundaries
Interconnect length varies from a few meters to tens of km
Current best practice: rack switches with oversubscribed uplinks
[Figure: BW demand vs. distance between compute elements, spanning intra-datacenter and inter-datacenter connections. Highlighted: intra-datacenter connections, fiber-rich, very large BW demand]

Datacenter Interconnect Fabrics (slide 4)
High performance computing / supercomputing architectures have often used complex multi-stage fabric architectures such as Clos fabric, fat tree or torus [1, 2, 3, 4, 5]
For this theoretical study, we picked the fat-tree architecture described in [2, 3] and analyzed the impact of the choice of interconnect speed and technology on overall interconnect cost
As described in [2, 3], fat-tree fabrics are built with identical N-port switching elements
Such a switch fabric architecture delivers a constant bisectional bandwidth (CBB)
[Figures: a 2-stage fat tree with N/2 spines; a 3-stage fat tree]

References
1. C. Clos, "A study of non-blocking switching networks," Bell System Technical Journal, Vol. 32, 1953, pp. 406-424.
2. Charles E. Leiserson, "Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing," IEEE Transactions on Computers, Vol. 34, October 1985, pp. 892-901.
3. S. R. Ohring, M. Ibel, S. K. Das, M. J. Kumar, "On Generalized Fat-tree," IEEE IPPS 1995.
4. C. Gomez, F. Gilabert, M. E. Gomez, P. Lopez, J. Duato, "RUFT: Simplifying the Fat-Tree Topology," 14th IEEE International Conference on Parallel and Distributed Systems (ICPADS '08), 8-10 Dec. 2008, pp. 153-160.
5. [Beowulf] torus versus (fat) tree topologies: http://www.beowulf.org/archive/2004-November/011114.html

Interconnect at What Port Speed? (slide 5)
A switching node has a fixed switching capacity (i.e.
CMOS gate-count) within the same space and power envelope
Per-node switching capacity can be presented at different port speeds: e.g. a 400Gbps node can be 40x10Gbps, 10x40Gbps or 4x100Gbps
Lower per-port speed allows building a much larger maximal constant bisectional bandwidth fabric
There are of course trade-offs with the number of fiber connections needed to build the interconnect
Higher port speed may allow better utilization of the fabric capacity
[Figure: 3-stage Fat-tree Fabric Capacity. Maximal fabric capacity (Gbps, log scale up to 1,000,000) vs. per-node switching capacity (0 to 800 Gbps), for port speeds of 10 Gbps, 40 Gbps and 100 Gbps]

Fabric Size vs Port Speed (slide 6)
Constant switching BW/node of 1Tbps and constant fabric cross-section BW of 10Tbps assumed
Higher per-port bandwidth reduces the number of available ports in a node with constant switching bandwidth
In order to support the same cross-sectional BW:
  more stages are needed in the fabric
  more fabric nodes are needed
[Figure: No of Fabric Stages Needed for a 10T Fabric. Number of fabric stages needed (0 to 14) vs. port speed (0 to 500 Gbps), Switching_BW/node = 1Tbps; chart annotation: 5 spines]
[Figure: No of Fabric Nodes Needed for a 10T Fabric. Number of nodes needed (0 to 7000) vs. port speed (0 to 400 Gbps), Switching_BW/node = 1Tbps]

Power vs Port Speed (slide 7)
[Figure: Per Port Power Consumption. Per-port power consumption (Watts) vs. port speed (Gbps), with curves for constant power/Gbps (power parity), 4x power for 10x speed (mature) and 20x power for 10x speed (bleeding edge)]
[Figure: Total Interconnect Power Consumption. Total power consumption (Watts) vs. port speed (Gbps), same three curves]
Three power consumption curves for interface optical modules:
  Bleeding Edge: 20x power for 10x speed; e.g. if 10G is 1W/port, 100G is 20W/port
  Power Parity: power parity on a per-Gbps basis; e.g.
if 10G is 1W/port, 100G is 10W/port
  Mature: 4x power for 10x speed; e.g. if 10G is 1W/port, 100G is 4W/port
Lower port speed provides lower power consumption
For power consumption parity, power per optical module needs to follow the mature curve

Cost vs Port Speed (slide 8)
[Figure: Per Port Optics Cost. Per-port optics cost ($0 to $70,000) vs. port speed (0 to 450 Gbps), with curves for constant cost/Gbps (cost parity), 4x cost for 10x speed (mature) and 20x cost for 10x speed (bleeding edge)]
[Figure: Total Fabric Cost. Total fabric cost ($0 to $160,000,000) vs. port speed (0 to 400 Gbps), same three curves]
Three cost curves for optical interface modules:
  Bleeding Edge: 20x cost for 10x speed
  Cost Parity: cost parity on a per-Gbps basis
  Mature: 4x cost for 10x speed
Fiber cost is assumed to be constant per port (10% of 10G port cost)
For fabric cost parity, the cost of optical modules needs to increase by < 4x for a 10x increase in interface speed

[Figure (slide 9): BW demand vs. distance between compute elements, spanning intra-datacenter and inter-datacenter connections. Highlighted: inter-datacenter connections, limited fiber availability, 2km+ reach]

Beyond 100G: What data rate? (slide 10)
400Gbps? 1Tbps? Something in-between? How about all of the above?
Current optical PMD specs are designed for absolute worst-case penalties
Significant capacity is untapped within the statistical variation of various penalties
[Figure label: wasted link margin/capacity]

Where is the Untapped Capacity? (slide 11)
[Figure: receiver sensitivity (dBm) vs. bit rate (Gbps), comparing receiver performance with optical channel capacity. After M. Nakazawa, ECOC 2008]
Chart annotations: quantum limit of sensitivity; unused link margin; untapped SNR; untapped capacity
In an ideal world, 3dB of link margin would allow link capacity to be doubled
Need the ability to use additional capacity (speed up the link) when it is available (temporally or statistically) and to scale back to the baseline capacity (40G/100G?) when it is not
Source: M. Nakazawa, ECOC 2008, paper Tu.1.E.1

Rate Adaptive 100G+ Ethernet? (slide 12)
There are existing standards within the IEEE 802.3 family:
  IEEE 802.3ah 10PASS-TS: based on MCM-VDSL standard
  IEEE 802.3ah 2BASE-TL: based on SHDSL standard
Needed when channels are close to the physics limit: we are getting there with 100Gbps+ Ethernet
Shorter links, higher capacity (matches perfectly with datacenter bandwidth demand distribution, see slide # 3)
[Figure: bit rate (Gbps, 0 to 600) vs. SNR (dB, 0 to 18), with curves for mQAM and OOK]
How to get there?
  High-order modulation
  Multi-carrier modulation / OFDM
  Ultra-dense WDM
  Combination of all the above

Is There a Business Case? (slide 13)
[Figure: Example Link Distance Distribution. Distribution (0 to 0.25) vs. link distance (0 to 50 km)]
[Figure: Example Adaptive Bit Rate Implementations. Possible bit rate (0 to 600 Gbps) vs. link distance (0 to 50 km), for scheme-1, scheme-2 and non-adaptive]
[Figure: Aggregate Capacity for 1000 Links. Total capacity on 1000 links (0 to 400 Tbps) vs. max adaptive bit rate with 100Gbps base rate (0 to 600 Gbps)]
An example link-length distribution between datacenters is shown
  Can be supported by a 40km capable PMD
Various rate-adaptive 100GbE+ options are considered
  Base rate is 100Gbps
  Max adaptive bit-rate varies from 100G to 500G
Aggregate capacity for 1000 such links is computed

Q&A
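
Appendix sketch: port-speed arithmetic from slides 5 and 7
The trade-offs on slides 5 and 7 can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the model behind the plotted curves, and it is not expected to reproduce the chart values. It assumes the common folded-Clos counting in which a maximal s-stage fat tree of N-port switches exposes 2*(N/2)^s host ports (N^2/2 for two stages, N^3/4 for three), and it assumes that "k x power for 10x speed" can be interpolated as a power law in port speed. All function names and parameters are hypothetical.

```python
import math


def ports_per_node(node_gbps, port_gbps):
    """Ports a fixed-capacity switching node exposes at a given port speed."""
    return int(node_gbps // port_gbps)


def max_host_ports(n_ports, stages):
    """Host-facing ports of a maximal fat tree built from n_ports-port switches.

    ASSUMPTION: folded-Clos counting, 2 * (N/2)**stages, with N even.
    """
    return int(2 * (n_ports / 2) ** stages)


def max_fabric_gbps(node_gbps, port_gbps, stages=3):
    """Aggregate host-port capacity of the maximal fabric, in Gbps."""
    n_ports = ports_per_node(node_gbps, port_gbps)
    return max_host_ports(n_ports, stages) * port_gbps


def module_power_watts(port_gbps, factor_per_10x, watts_at_10g=1.0):
    """Per-port optical module power for a 'k x power for 10x speed' curve.

    ASSUMPTION: power-law interpolation between decades of port speed,
    anchored at watts_at_10g for a 10G port.
    """
    return watts_at_10g * (port_gbps / 10.0) ** math.log10(factor_per_10x)


if __name__ == "__main__":
    # A 400Gbps node presented as 40x10G, 10x40G or 4x100G (slide 5).
    for speed in (10, 40, 100):
        print(speed, "Gbps ports:", max_fabric_gbps(400, speed), "Gbps fabric")
    # Bleeding edge (20x), parity (10x) and mature (4x) curves at 100G (slide 7).
    for factor in (20, 10, 4):
        print(factor, "x curve:", round(module_power_watts(100, factor), 2), "W")
```

Under these assumptions, a 400Gbps node yields a maximal 3-stage fabric of 160,000 Gbps with 10G ports, 10,000 Gbps with 40G ports and 1,600 Gbps with 100G ports, which is the qualitative point of slide 5: lower per-port speed allows a much larger maximal fabric. The power function returns roughly 20W, 10W and 4W for a 100G port on the three curves, matching the per-port examples given on slide 7.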