
100GbE in Datacenter Interconnects: When, Where?

Bikash Koley
Network Architecture, Google
Sep 2009

Datacenter Interconnects

Large number of identical compute systems

Interconnected by a large number of identical switching gears

Can be within a single physical boundary or can span several physical boundaries

Interconnect length varies from a few meters to tens of kilometers

Current best practice: rack switches with oversubscribed uplinks

[Figure: BW demand vs. distance between compute elements, spanning intra-datacenter and inter-datacenter connections; the intra-datacenter region is fiber-rich with very large BW demand.]

Datacenter Interconnect Fabrics

High-performance computing/supercomputing architectures have often used complex multi-stage fabric architectures such as the Clos fabric, Fat Tree, or Torus [1, 2, 3, 4, 5]

For this theoretical study, we picked the Fat Tree architecture described in [2, 3] and analyzed the impact of the choice of interconnect speed and technology on overall interconnect cost

As described in [2, 3], Fat-tree fabrics are built with identical N-port switching elements

Such a switch fabric architecture delivers a constant bisectional bandwidth (CBB)

[Figure: a 2-stage Fat Tree and a 3-stage Fat Tree built from identical N-port elements; the 2-stage fabric has N/2 spines.]

1. C. Clos, "A Study of Non-blocking Switching Networks," Bell System Technical Journal, Vol. 32, 1953, pp. 406-424.
2. C. E. Leiserson, "Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing," IEEE Transactions on Computers, Vol. 34, October 1985, pp. 892-901.
3. S. R. Ohring, M. Ibel, S. K. Das, M. J. Kumar, "On Generalized Fat-trees," IEEE IPPS 1995.
4. C. Gomez, F. Gilabert, M. E. Gomez, P. Lopez, J. Duato, "RUFT: Simplifying the Fat-Tree Topology," ICPADS '08, 14th IEEE International Conference on Parallel and Distributed Systems, 8-10 Dec. 2008, pp. 153-160.
5. [Beowulf] torus versus (fat) tree topologies: http://www.beowulf.org/archive/2004-November/011114.html
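The constant-bisectional-bandwidth scaling of such fabrics can be sketched as follows (my own arithmetic, not from the talk: a 2-stage folded-Clos fat tree of N-port elements supports N²/2 host ports, a 3-stage one N³/4, and in general 2·(N/2)^m for m stages):

```python
# Sketch (not from the talk): maximal host-port count of an m-stage
# fat-tree (folded Clos) built from identical N-port switching elements,
# all ports delivered at constant bisectional bandwidth (CBB).

def max_host_ports(n_ports: int, stages: int) -> int:
    """Maximal CBB host ports: 2 * (N/2)^m for N-port elements, m stages."""
    return 2 * (n_ports // 2) ** stages

# A 400 Gbps node presented as 40x10G is a 40-port element:
print(max_host_ports(40, 2))  # 2-stage: 800 host ports
print(max_host_ports(40, 3))  # 3-stage: 16000 host ports
```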

Interconnect at What Port Speed?

A switching node has a fixed switching capacity (i.e., CMOS gate count) within the same space and power envelope

Per-node switching capacity can be presented at different port speeds:

e.g., a 400 Gbps node can be 40×10 Gbps, 10×40 Gbps, or 4×100 Gbps

3-stage Fat-tree Fabric Capacity

[Figure: maximal fabric capacity (Gbps, log scale) vs. per-node switching capacity (Gbps), with curves for port speeds of 10 Gbps, 40 Gbps, and 100 Gbps; the 4×100 Gbps point is marked.]

Lower per-port speed allows building a much larger maximal constant bisectional bandwidth fabric

There are, of course, trade-offs in the number of fiber connections needed to build the interconnect

Higher port speed may allow better utilization of the fabric capacity
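The trend in the chart can be reproduced with a little arithmetic (my own sketch, under the slide's setup of a fixed node capacity C presented at port speed s): a node has N = C/s ports, so a 3-stage fat tree tops out at s·2·(N/2)³ = C³/(4s²) Gbps, and halving the port speed quadruples the maximal fabric size.

```python
# Sketch (assumed model): maximal 3-stage fat-tree capacity for nodes of
# fixed switching capacity presented at different port speeds.
# With N = C/s ports per node, capacity = s * 2 * (N/2)^3 = C^3/(4 s^2).

def max_fabric_capacity_3stage(node_gbps: float, port_gbps: float) -> float:
    n = node_gbps / port_gbps            # ports per node
    return port_gbps * 2 * (n / 2) ** 3  # host ports * port speed

for s in (10, 40, 100):
    cap = max_fabric_capacity_3stage(400, s)
    print(f"{s:>3} Gbps ports -> {cap:,.0f} Gbps maximal fabric")
```

For a 400 Gbps node this gives 160 Tbps maximal fabric at 10G ports but only 1.6 Tbps at 100G ports, matching the slide's point that lower port speed buys a larger maximal CBB fabric.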

Fabric Size vs Port Speed

Constant switching BW/node of 1 Tbps and constant fabric cross-section BW of 10 Tbps assumed

[Figure: two panels, both at switching BW/node = 1 Tbps. Left: number of fabric stages needed for a 10T fabric vs. port speed (Gbps). Right: number of fabric nodes needed for a 10T fabric vs. port speed (Gbps).]

Higher per-port bandwidth reduces the number of available ports in a node with constant switching bandwidth

In order to support the same cross-sectional BW:

more stages are needed in the fabric

more fabric nodes are needed
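The stage-count blow-up can be sketched with the same fat-tree sizing as before (my own model, not the talk's data: an m-stage fabric of N-port nodes caps out at s·2·(N/2)^m Gbps, so we search for the smallest m reaching the target cross-section):

```python
# Sketch (assumed model): minimum number of fat-tree stages m such that
# nodes of fixed switching BW reach a target cross-section BW.
# With N = node_gbps/port_gbps ports per node, an m-stage fabric
# supports at most port_gbps * 2 * (N/2)^m of cross-section BW.

def stages_needed(node_gbps, port_gbps, target_gbps, max_stages=20):
    n = node_gbps / port_gbps
    for m in range(2, max_stages + 1):
        if port_gbps * 2 * (n / 2) ** m >= target_gbps:
            return m
    return None  # fabric cannot grow: too few ports per node

for s in (10, 40, 100, 200):
    print(s, "Gbps ports ->", stages_needed(1000, s, 10_000), "stages")
```

At 1 Tbps/node and a 10 Tbps target, 10G and 40G ports need 2 stages, 100G needs 3, and 200G needs 4, reproducing the left panel's upward trend.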

Power vs Port Speed

[Figure: two panels. Left: per-port power consumption (Watts) vs. port speed (Gbps). Right: total interconnect power consumption (Watts) vs. port speed (Gbps). Each panel shows three curves: constant power/Gbps (POWER PARITY), 4x power for 10x speed (MATURE), and 20x power for 10x speed (BLEEDING EDGE).]

Three power consumption curves for interface optical modules:

Bleeding Edge: 20x power for 10x speed; e.g. if 10G is 1W/port, 100G is 20W/port

Power Parity: Power parity on per Gbps basis; e.g. if 10G is 1W/port, 100G is 10W/port

Mature: 4x power for 10x speed; e.g. if 10G is 1W/port, 100G is 4W/port

Lower port speed provides lower power consumption

For power consumption parity, power per optical module needs to follow the mature curve

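The three curves can be modeled as power laws (my own sketch: per-port power scaling as (s/10)^alpha relative to an assumed 1 W 10G port, where a per-decade ratio r gives alpha = log10(r)):

```python
# Sketch (assumed power-law model of the slide's three curves):
#   bleeding edge: 20x power for 10x speed -> alpha = log10(20)
#   power parity:  constant W/Gbps         -> alpha = 1
#   mature:        4x power for 10x speed  -> alpha = log10(4)
import math

def per_port_watts(speed_gbps, ratio_per_decade, p10_watts=1.0):
    """Per-port power assuming a hypothetical 1 W 10G baseline."""
    alpha = math.log10(ratio_per_decade)
    return p10_watts * (speed_gbps / 10) ** alpha

for name, r in (("bleeding edge", 20), ("parity", 10), ("mature", 4)):
    print(f"100G port, {name}: {per_port_watts(100, r):.1f} W")
```

With a 1 W 10G port this yields 20 W, 10 W, and 4 W per 100G port for the three curves, as on the slide.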

Cost vs Port Speed

[Figure: two panels. Left: per-port optics cost vs. port speed (Gbps). Right: total fabric cost vs. port speed (Gbps). Each panel shows three curves: constant cost/Gbps (COST PARITY), 4x cost for 10x speed (MATURE), and 20x cost for 10x speed (BLEEDING EDGE).]

Three cost curves for optical interface modules:

Bleeding Edge: 20x cost for 10x speed

Cost Parity: cost parity on a per-Gbps basis

Mature: 4x cost for 10x speed

Fiber cost is assumed to be constant per port (10% of the 10G port cost)

For fabric cost parity, the cost of optical modules needs to increase by < 4x for a 10x increase in interface speed
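A toy version of this cost comparison (my own sketch with a hypothetical $1000 10G optics cost; it deliberately ignores the extra stages and nodes that lower port speeds require, so it understates the case for faster ports):

```python
# Sketch (assumed cost model, mirroring the power model): total fabric
# cost = ports * (optics cost + fiber cost). Fiber cost per port is a
# constant 10% of the 10G optics cost; optics cost scales as
# (s/10)^log10(r) for a cost ratio r per decade of speed.
import math

COST_10G = 1000.0            # hypothetical 10G optics cost, $/port
FIBER = 0.1 * COST_10G       # constant fiber cost per port

def fabric_cost(total_gbps, port_gbps, ratio_per_decade):
    ports = total_gbps / port_gbps
    optics = COST_10G * (port_gbps / 10) ** math.log10(ratio_per_decade)
    return ports * (optics + FIBER)

# 10 Tbps of ports at 100G vs 10G under the three cost curves:
for r in (20, 10, 4):
    print(r, fabric_cost(10_000, 100, r), "vs", fabric_cost(10_000, 10, r))
```

Even in this simplified model, only the mature (≤4x/decade) curve makes 100G ports clearly cheaper per Gbps than 10G once fixed fiber cost is included.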

[Figure: intra-datacenter vs. inter-datacenter connections; the inter-datacenter region has limited fiber availability and 2 km+ reach.]

Beyond 100G: What data rate?

400Gbps? 1Tbps? Something in-between? How about all of the above?

Current optical PMD specs are designed for absolute worst-case penalties

Significant capacity is untapped within the statistical variation of various penalties

[Figure: wasted link margin/capacity within worst-case PMD specs.]

Where is the Untapped Capacity?

[Figure: receiver sensitivity (dBm) vs. bit rate (Gbps), showing reported receiver performance, the quantum limit of sensitivity, and the optical channel capacity; the gaps between them represent unused link margin, untapped SNR, and untapped capacity. Source: M. Nakazawa, ECOC 2008.]

In an ideal world, 3 dB of link margin allows the link capacity to be doubled

Need the ability to use additional capacity (speed up the link) when available (temporally or statistically) and scale back to the baseline capacity (40G/100G?) when not

M. Nakazawa: ECOC 2008, paper Tu.1.E.1
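The 3 dB claim can be sanity-checked against standard Shannon arithmetic (my own sketch; the 50 GHz bandwidth is an arbitrary illustrative number): in the power-limited (low-SNR) regime, 3 dB of extra margin roughly doubles C = B·log2(1 + SNR), while at high SNR it only adds about 1 bit/s/Hz.

```python
# Sketch (standard Shannon capacity, not from the talk): effect of 3 dB
# of extra link margin on C = B * log2(1 + SNR).
import math

def capacity_gbps(bandwidth_ghz, snr_db):
    snr = 10 ** (snr_db / 10)
    return bandwidth_ghz * math.log2(1 + snr)

# Low SNR (power-limited): 3 dB of margin nearly doubles capacity.
print(capacity_gbps(50, -7), "->", capacity_gbps(50, -4))
# High SNR (bandwidth-limited): 3 dB adds ~1 bit/s/Hz instead.
print(capacity_gbps(50, 20), "->", capacity_gbps(50, 23))
```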

Rate Adaptive 100G+ Ethernet?

There are existing rate-adaptive standards within the IEEE 802.3 family:

IEEE 802.3ah 10PASS-TS: based on the MCM-VDSL standard

IEEE 802.3ah 2BASE-TL: based on the SHDSL standard

Rate adaptation is needed when channels are close to the physics limit: we are getting there with 100Gbps+ Ethernet

Shorter links → higher capacity (matches perfectly with the datacenter bandwidth demand distribution, see slide # 3)

[Figure: achievable bit rate (Gbps) vs. SNR (dB) for mQAM and OOK.]
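A rate-adaptive link along these lines could, as a sketch, map measured SNR to the densest square QAM constellation the link supports (everything here is illustrative and mine: the 6 dB implementation gap, the 50 Gbaud symbol rate, and the 100G OOK-like baseline are assumptions, not a standardized scheme):

```python
# Sketch (illustrative, not a standardized scheme): pick the densest
# square mQAM a link's SNR supports at an assumed gap to Shannon, then
# scale the line rate by bits/symbol from a 100G baseline.
import math

def bits_per_symbol(snr_db, gap_db=6.0, max_bits=8):
    """Largest even bits/symbol supportable at the given SNR, after
    backing off an implementation gap from the Shannon limit."""
    snr = 10 ** ((snr_db - gap_db) / 10)
    b = int(math.log2(1 + snr))
    return max(2, min(max_bits, b - b % 2))  # even b -> square QAM

def adaptive_rate_gbps(snr_db, baud_rate_gbaud=50):
    return baud_rate_gbaud * bits_per_symbol(snr_db)

for snr in (10, 16, 22, 28):
    print(f"SNR {snr} dB -> {adaptive_rate_gbps(snr)} Gbps")
```

The ladder behavior (100G at low SNR, stepping up as margin appears) is the point; the exact thresholds depend entirely on the assumed gap and FEC.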

How to get there?

High-order modulation

Multi-carrier-Modulation/OFDM

Ultra-dense WDM

Combination of all the above

Is There a Business Case?

[Figure: three panels. Panel 1, "Example Link Distance Distribution": distribution vs. link distance (km). Panel 2, "Example Adaptive Bit Rate Implementations": possible bit rate (Gbps) vs. link distance (km) for scheme-1, scheme-2, and non-adaptive. Panel 3, "Aggregate Capacity for 1000 Links": total capacity on 1000 links (Tbps) vs. max adaptive bit rate with 100 Gbps base rate (Gbps).]

An example link-length distribution between datacenters is shown

Can be supported by a 40 km capable PMD

Various rate-adaptive 100GbE+ options are considered:

Base rate is 100 Gbps

Max adaptive bit rate varies from 100G to 500G

Aggregate capacity for 1000 such links is computed
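The computation mirrors the slide's method, but with made-up numbers (the distance histogram, the 10 km full-rate reach, and the linear fall-off to base rate at 40 km are all my hypothetical stand-ins for the talk's undisclosed scheme):

```python
# Sketch (hypothetical data, mirroring the slide's method): aggregate
# capacity of 1000 links under a reach-dependent adaptive bit rate
# with a 100G base rate and 500G max rate over a 40 km capable PMD.

def adaptive_rate(distance_km, base_gbps=100, max_gbps=500):
    """Hypothetical scheme: full rate up to 10 km, then linear
    fall-off to the base rate at the 40 km reach limit."""
    if distance_km <= 10:
        return max_gbps
    frac = (40 - min(distance_km, 40)) / 30
    return base_gbps + (max_gbps - base_gbps) * frac

# Hypothetical link-distance histogram: {km: fraction of links}
dist = {2: 0.25, 10: 0.35, 20: 0.25, 40: 0.15}

links = 1000
total_gbps = sum(links * frac * adaptive_rate(km) for km, frac in dist.items())
print(f"Aggregate capacity: {total_gbps / 1000:.0f} Tbps "
      f"(vs {links * 100 / 1000:.0f} Tbps non-adaptive)")
```

Because most links are short, the adaptive scheme captures a large multiple of the non-adaptive 100 Tbps aggregate, which is the business case the slide argues for.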

Q&A