  • Published on
    06-Sep-2018

Datacenter-Scale Network Research on FPGAs

Zhangxi Tan, Computer Science Division, UC Berkeley, CA, xtan@eecs.berkeley.edu
Krste Asanovic, Computer Science Division, UC Berkeley, CA, krste@eecs.berkeley.edu
David Patterson, Computer Science Division, UC Berkeley, CA, pattrsn@eecs.berkeley.edu

ABSTRACT

We describe an FPGA-based datacenter network simulator to allow researchers to rapidly experiment with O(10,000) node datacenter network architectures. We configure the FPGA hardware to implement abstract models of key datacenter building blocks including servers and all levels of switches. We discuss design and implementation issues of our FPGA models and show that it is practical to prototype and scale the testbed with a few low-cost FPGA boards.

1. INTRODUCTION

Massive warehouse-scale computers (WSCs) [5] are the foundation of widely used Internet services; e.g., search, social networking, email, video sharing, and online shopping. The tremendous success of these services has led to the rapid growth of datacenters to keep up with the increasing demand. Recent advances such as modularized container-based datacenter construction and server virtualization have allowed modern datacenters to scale up from 10,000 servers to 100,000 servers or more [11].

At this extreme scale, network infrastructure has become one of the most critical data center components [8, 20] for several reasons:

1. Current networks are extremely complex, and are difficult to scale out to larger configurations without complete redesign.
2. Existing networks have many different failure modes. Occasionally, correlated failures are found in replicated million-dollar units.
3. Networking infrastructure has a significant impact on server utilization, which is an important factor in datacenter power consumption and cost-effectiveness.
4. Network infrastructure is crucial for supporting data-intensive Map-Reduce jobs.
5. Network infrastructure accounts for 18% of the monthly datacenter costs, which is the third largest contributing factor [8].
In addition, existing large commercial switches and routers command high margins and charge a great deal for features that are rarely used in datacenters.

As a result, datacenter network architecture has become an active area of research, often focusing on new switch designs [9, 10, 16, 20]. However, warehouse-scale network research is very difficult to perform. Previous work [17] notes that most of these new proposals are based on observations of existing datacenter infrastructure and applications, and lack a sound methodology to evaluate new designs. Moreover, most proposed designs have only been tested with a very small testbed running unrealistic microbenchmarks, often built using off-the-shelf devices [7] that have limitations when exploring proposed new features. The behavior observed by running a test workload over a few hundred nodes bears little relationship to the behavior of production runs completed over thousands or tens of thousands of nodes. The topology and switches used for small test clusters are very different from those in a real environment. Dedicating tens of thousands of nodes to network research is impractical even for large companies like Amazon and Microsoft, let alone academic researchers.

For systems research, one possible solution is to use cloud computing, such as Amazon EC2, when extreme scale is needed. But EC2 nodes are already fully provisioned with networking, and building an overlay network and testing at scale in EC2 is only adequate for some areas of network research, such as network configuration and management. Cloud computing cannot be readily applied to hardware network switch studies, as the network switches have to be emulated at high fidelity using software models.

Unfortunately, simulating target devices using software is prohibitively slow [19], given the growing complexity in target designs.
To mitigate this software simulation gap, many techniques have been proposed to reduce simulation time, such as statistical sampling and parallel simulation with relaxed synchronization. These techniques assume the workload is static and independent of target architecture, but datacenter networks exhibit highly dynamic target-dependent behavior, as they are tightly coupled with computation servers running very adaptive software networking stacks.

To address the above issues, we propose using Field Programmable Gate Arrays (FPGAs) to build a reconfigurable simulation testbed at the scale of O(10,000) nodes. Each node in the testbed is capable of running real datacenter applications on a full operating system. In addition, our network elements are heavily instrumented. This research testbed will allow us to record the same behaviors administrators observe when deploying equivalently scaled datacenter software. We build our testbed on top of a cost-effective FPGA-based full-system manycore simulator, RAMP Gold [18]. Instead of mapping the real target hardware directly, we build several abstract models with runtime configurable parameters of key datacenter components and compose them together in FPGAs. In this paper, we show how to construct a 10,000-node system model from several low-cost FPGA boards connected with multi-gigabit serial links. The prototype testbed has been successfully applied to evaluating a novel network proposal based on circuit-switching technology [21] running application kernels taken from the Microsoft Dryad Terasort program.

Appears in The Exascale Evaluation and Research Techniques Workshop (EXERT 2011) at ASPLOS 2011.

2. DATACENTER NETWORK INFRASTRUCTURE BACKGROUND

Datacenters use a hierarchy of local-area networks (LAN) and off-the-shelf switches. Figure 1 shows a typical datacenter network arranged in a Clos topology with three networking layers.
At the bottom layer, each rack typically holds 20–40 servers, each singly connected to a commodity Top-of-Rack (ToR) switch with a 1Gbps link. These ToR switches usually offer two to eight uplinks, which leave the rack to connect up to several array switches to provide redundancy and bandwidth. At the top of the hierarchy, datacenter switches carry traffic between array switches usually using 10Gbps links. All links use Ethernet as the physical-layer protocol, with either copper or fiber cabling depending on the connection distance.

[Figure 1: A typical datacenter network architecture. Labels: Rack, Rack Switch, Array Switch, Datacenter Switch.]

As we move up the hierarchy, one of the most challenging problems is that the bandwidth over-subscription ratio (i.e. bandwidth entering from below versus bandwidth to the level above) gets worse rapidly. This imbalance is due to the cost of switch bandwidth, which grows quadratically in the number of switch ports. The resulting limited datacenter bisection bandwidth significantly affects the design of software and the placement of services and data, hence the current active interest in improving network switch designs.

3. FPGA-BASED DATACENTER-SCALE EMULATION

FPGA Architecture Model Execution (FAME) [19] has become a promising vehicle for architectural investigation of parallel computer systems. Groups in both academia and industry have built various types of FAME simulators, which can be classified in five levels that are analogous to different RAID levels [19]. Higher FAME levels lower the simulator cost and improve performance over the lower levels, while moving further away from the concrete RTL design of the simulation target. As pointed out in our previous work [17], datacenter-scale simulation requires simulators to handle hundreds of thousands of concurrent events synchronized to within 50 ns of simulated time across O(10,000) server nodes.
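One way to see where a bound of about 50 ns comes from: it is close to the time a 10Gbps port needs to serialize one minimum-size 64-byte flit. The following Python sketch reproduces that arithmetic. The function name `max_sync_interval_ns` is illustrative and is not part of the simulator.

```python
def max_sync_interval_ns(flit_bytes: int, link_gbps: float) -> float:
    """Serialization time of one flit at line rate, in nanoseconds.

    1 Gbps equals 1 bit per nanosecond, so bits / Gbps yields ns directly.
    """
    if flit_bytes <= 0 or link_gbps <= 0:
        raise ValueError("flit size and link rate must be positive")
    return flit_bytes * 8 / link_gbps


if __name__ == "__main__":
    # 64-byte minimum flit on a 10Gbps switch port
    print(f"{max_sync_interval_ns(64, 10):.1f} ns")
    # Same flit on a 1Gbps server link, for comparison
    print(f"{max_sync_interval_ns(64, 1):.1f} ns")
```

For a 64-byte flit this gives 51.2 ns at 10Gbps and 512 ns at 1Gbps, so under this reasoning the 10Gbps switch ports set the tightest synchronization requirement.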
We propose building a FAME-7 simulator based on modules that fit in low-cost single-FPGA boards, using the following three key techniques to improve efficiency [19]:

1. Abstracted Models: A full-fledged implementation of any datacenter component requires considerable design effort and hardware resources. Instead, we employ high-level abstract models that greatly reduce these requirements and build a simplified version of each target component, capturing important features but simplifying or removing features that are rarely used in practice. We also separate functional models from timing models to simplify parameterization of target timing.
2. Decoupled Design: A function computed within a single target clock cycle can be implemented with a variable number of FPGA host clock cycles. For example, a simple ring network on the FPGA can model an expensive multiport switch crossbar in datacenter switches.
3. Host Multithreading: Instead of replicating hardware models to simulate multiple instances in the target, we use multiple model threads running in a single host hardware model to simulate different target instances. Multithreading significantly improves FPGA resource utilization and hides simulation latencies, such as those from host DRAM access and timing-model synchronization across different FPGAs.

[Figure 2: FPGA Simulator Architecture. Rack FPGA: server models, rack switch models, memory controllers, performance counters, gigabit frontend link (to control PC), simulation scheduler, multi-gigabit transceivers (to Switch FPGAs). Switch FPGA: array/datacenter switch models, memory controller, performance counters, gigabit frontend link (to control PC), simulation scheduler, multi-gigabit transceivers (to Rack FPGAs and to other Switch FPGAs).]

Figure 2 shows the high-level simulator architecture for the typical target datacenter configuration presented in Figure 1. We map all server models along with the ToR switch models into Rack FPGAs, and array and
datacenter switch models to separate Switch FPGAs. This partition enables a more modularized model design that eases experimentation with new array and datacenter switch designs. It also makes it easy to scale up the size of the emulated datacenter. To further simplify switch model design, we keep any switch model within a single FPGA. Following the physical topology of the target system, we connect Rack FPGAs to Switch FPGAs through several time-shared multi-gigabit serial transceivers using low-cost copper cables, such as standard SATA cables. Each FPGA has its own simulation scheduler that synchronizes with adjacent FPGAs over the serial links at a very fine granularity to satisfy simulation accuracy requirements. For example, a 10Gbps switch with a minimum flit size of 64 bytes requires a maximum synchronization interval of 51.2 ns. We reduce host communication latency by using our own protocol over the serial links, achieving FPGA-FPGA communication latencies of around 20 FPGA logic cycles, which is roughly the latency for a host DRAM access on the FPGA. In addition, the host-multithreaded design further helps to hide the simulator communication latency, removing model synchronization latency as a simulator performance bottleneck.

Every model in the design has numerous hardware performance counters that periodically send performance statistics to a control workstation over a separate gigabit Ethernet link to avoid interrupting the simulation. Rack FPGAs support more memory controllers than switch FPGAs to provide the larger memory capacity and bandwidth required to simulate software running on the server models. We statically partition the host DRAM resources on each FPGA between different server models and the ToR switch models.

We select multi-gigabit serial transceivers as the only inter-FPGA connection instead of the high-speed parallel LVDS links often seen on multi-FPGA boards to make the design simpler and more modular.
Specifically, parallel LVDS links increase design complexity. To ensure reliable transmission, designs require complicated dynamic calibration and special eye-opening monitoring circuits on groups of I/O signals. In addition, designs with LVDS links are less portable because of varying I/O layouts on different boards, making connections between Rack FPGAs and Switch FPGAs less flexible. Moreover, LVDS links increase both PCB and FPGA cost because they require more FPGA I/O pins and link wires for a given link bandwidth. Finally, we found that the multi-gigabit serial transceivers provide enough bandwidth between FPGAs considering our overall simulation slowdown of between 250× and 1000× of real time. For example, 2.5Gbps transceivers are common on three-year-old Xilinx Virtex 5 FPGAs. The bandwidth of a single transceiver translates to 500Gbps to 2.5Tbps in the target, which far exceeds the bandwidth between a few racks and several array switches. Moreover, recent FPGAs have significantly enhanced serial transceiver performance, supporting up to 28Gbps bandwidth [4] in the 28 nm generation.

3.1 Server Models

We build the server models on top of RAMP Gold, which is an open-source cycle-accurate full-system FAME-7 architecture simulator. RAMP Gold supports the full 32-bit SPARC v8 ISA in hardware, including floating-point instructions and precise exceptions. It also models sufficient hardware to run an operating system, including MMUs, timers, and interrupt controllers. Currently, we can boot the Linux 2.6.21 kernel and a manycore research OS [14]. We map one server to one hardware thread in RAMP Gold. One 64-thread RAMP Gold hardware pipeline simulates up to two 32-server datacenter racks. Each simulated server uses a simplified fixed-CPI timing model.
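As a rough illustration of what a fixed-CPI timing model amounts to, the Python sketch below charges every retired instruction the same number of target cycles and keeps timing separate from functional execution. The class name, the CPI value, and the target clock frequency are assumptions made for this example; the paper does not state the values used in RAMP Gold.

```python
from dataclasses import dataclass


@dataclass
class FixedCpiTimingModel:
    """Hypothetical fixed-CPI timing model.

    Every instruction costs the same number of target cycles,
    regardless of what the instruction does.
    """
    cpi: float = 1.0               # assumed cycles per instruction
    target_clock_ghz: float = 1.0  # assumed target core frequency
    target_cycles: float = 0.0     # accumulated target time, in cycles

    def retire(self, instructions: int) -> None:
        # The functional model reports how many instructions retired;
        # the timing model only scales that count by the fixed CPI.
        self.target_cycles += instructions * self.cpi

    def target_time_ns(self) -> float:
        # cycles / (cycles per nanosecond) = nanoseconds
        return self.target_cycles / self.target_clock_ghz


if __name__ == "__main__":
    model = FixedCpiTimingModel(cpi=1.5, target_clock_ghz=2.0)
    model.retire(1000)
    print(model.target_cycles, model.target_time_ns())
```

Because the model holds only a counter and two parameters per server, it adds almost no per-thread state, which is in line with the trade-off between timing detail and simulation scale discussed in this section.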
A more detailed timing model could be implemented, but it would reduce simulation scale as each server model would require additional host hardware resources.

The simulation performance of a 64-thread configuration is two orders of magnitude faster than state-of-the-art software simulators for parallel processors, or a slowdown of 1000× compared to real hardware. The server models are currently the simulation bottleneck for the whole system, but performance could be improved by reducing the number of threads per hardware model as discussed in [18]. When packing more pipelines onto a single FPGA to simulate more racks, another potential concern is the host memory controller bandwidth. However, our earlier analysis shows this is not a problem for RAMP Gold [18]. A single pipeline only consumes a maximum of 15% of the total bandwidth of a single host DRAM channel when running real-world applications on a research OS.

3.2 Switch Models

There are two broad categories of datacenter switches: connectionless packet switching, also known as datagram switching, and connection-oriented virtual circuit switching. In the first case, each packet includes complete routing information, and is routed by network devices individually. The second case requires a pre-allocated virtual circuit path before transferring any packet. To demonstrate the flexibility of our approach, we build FAME-7 models for both types of switches.

The real challenges for modeling the packet switches used in existing production datacenters arise from design complexity and proprietary architecture specifications. To work around these barriers, we build abstract models by simplifying features that are seldom used in a datacenter. Here are the abstractions we employed and the rationale behind our choice:

1. Ignore Ethernet QoS related features (e.g.
support of IEEE 802.1p class of service (CoS)): Although QoS features are available on almost every switch today, datacenters only utilize switches for basic connectivity without turning on QoS features.
2. Use simplified source routing: Many packet switches use a large ternary CAM to hold flow tables and look up the destination address for each packet. When an unknown MAC address is seen, the forwarding engine sends an interrupt to a slow-path control processor that updates the table using software. Many switches [2, 3] already support flow tables that have at least 32K entries. Given the total number of machines in datacenters, the slow-path flow-table update is rarely executed, making the TCAM lookup time constant in practice. Besides, datacenter topologies do not change frequently, and routes can be pre-configured statically. We use source routing to simplify modeling of packet routing, and we note that source routing is actually a component of many datacenter-switch research proposals. To emulate more complicated flow table operations, we could implement d-left hash tables [15] using host DRAM. Recent datacenter switches that implement large flow tables, such as the Cisco Nexus 5000, use similar techniques instead of TCAMs.

[Figure 3: FAME model for virtual output queue switches. Labels: Output Queue Switch Model (queue management, enqueue and dequeue pipelines with thread schedulers, performance counters, packet processing delay model), Crossbar Model, Parallel Switch Scheduler, time-shared physical ingress and egress ports, Global Packet Buffer Interconnect, Packet Buffer Host Cache (write buffer, read buffer, write lock), Global Queue Buffer Pool (host DRAM), links to the gigabit frontend link and to simulation timing control.]

3. Abstract packet processors: Commercial datacenter switches include many pipelined packet processors that handle different tasks such as MAC address learning, VLAN membership, and so on.
The processing time of each stage is relatively constant regardless of packet size, and the time can be as short as a few hundred nanoseconds [6] to a few microseconds [2]. We simply employ FIFOs with runtime-configurable delays to model packet processing.

Although commercial switch implementation details are generally not publicly available, the fundamentals of these switch architectures are well known. Examples include the architecture of a virtual output queue switch and common scheduling algorithms. We build our abstracted model focusing on these central well-known architectural features, and allow other parts that are unclear or of special interest to researchers (e.g. packet buffer layout) to be configurable during simulation.

Figure 3 shows the architecture of our abstracted simulation model for output-queue switches, such as the Fulcrum FocalPoint FM4000 [6]. One of the biggest differences between existing commercial packet switches is the packet buffer size. For instance, the Force 10 S60 switch has 1280MB of packet buffering, the Arista Networks 7048 switch has 768MB, and the Cisco Systems 4948-10GE switch has 16MB. According to our conversations with datacenter networking researchers in industry, switch buffer management and configurations have also become an active area for packet switching researchers. To provide maximum flexibility for simulating a wide range of switch buffer sizes and to keep the overall design simple, we place the physical storage of simulated switch buffers in the host DRAM and all virtual queue pointers in BRAMs on the FPGA.

To make efficient use of host DRAM burst accesses, we designed a shared host cache connected by a ring-like interconnect to all switch models using the same host DRAM channel. The host cache is composed of two simple buffers, one for write and one for read, partitioned equally among all physical ports.
Due to the limited size of on-chip FPGA BRAM, the write and read buffers only hold 64 bytes for every physical port, which is the minimum flit size for many packet switches. In addition, the write and read buffers for each port have a write-lock to ensure they are kept coherent.

Inside each switch model, a key component is a queue management model responsible for all virtual queue pointer operations. It also keeps track of queue status and performs packet drops when necessary. The length of every simulated virtual queue can be configured dynamically without requiring another FPGA CAD flow run before a simulation starts. We select these configurable parameters according to a Broadcom switch design [13]. Along with this module, a performance-counter module, implemented with a collection of BRAMs and LUTRAMs, maintains all statistics for every virtual queue. The performance counter module reports all of its content periodically to a remote PC through the gigabit frontend link, with which we can construct queue length dynamics offline for every virtual queue running any workload. To send unicast statistics every 6.4 µs in target time, a 10Gbps 32-port output-queue switch model demands a bandwidth of approximately 40Mbps on the frontend link.

Table 1: FPGA resource usage and lines of code for different FAME models.

FAME Model              Registers     LUTs          BRAMs     Lines of Code
64-server model         9,981 (14%)   6,928 (10%)   54 (18%)  35,000
Circuit-switched model  859 (1%)      1,498 (2%)    28 (9%)   2,550
Packet-switched model   814 (1%)      1,260 (2%)    24 (8%)   2,925

Each model has an independent 3-stage enqueue pipeline and a 4-stage dequeue pipeline that are controlled and synchronized by global simulation timing control logic. Ideally, the two pipelines send commands to the queue management model through simple FIFOs in order to simplify and decouple the control logic design.
However, this appears to be a significant area and performance overhead on FPGAs, consuming a large amount of distributed LUTRAM and making routing very hard. Instead, we implement two independent static thread schedulers for the two pipelines and replay commands that were deferred due to dependencies.

To guarantee good simulation performance, the switch scheduler model processes scheduling decisions for multiple virtual queues in every host FPGA clock cycle. To further improve performance, given the hundreds or thousands of virtual queues existing in our simulated switches, the parallel scheduler model only processes active events that happen between two scheduling quanta instead of naively scanning all virtual queues. Overall, the simulation performance of a single switch model has a slowdown of 150× compared to real hardware. This is four times faster than a software single-threaded network simulator used at Google [1], which does not simulate packet payloads or support full software stack scaling to 10,000 nodes as does our system.

Our circuit-switching models are based on a recent proposal for container-based datacenters [21], which has two levels of switch. In contrast to the complexity of the packet-switched approach, the proposed circuit-switched model is simple enough to be directly implemented on entry-level FPGAs. By employing a host-multithreaded architecture that time-multiplexes packets from multiple target switch ports, we can model several 16-port rack-level switches along with a 128-port 10-Gbps high-radix array/datacenter switch with full architecture details on a single FPGA.

4. IMPLEMENTATION

We code all models in SystemVerilog and map them to a three-year-old Xilinx Virtex-5 LX110T-1 target device. All models run well at 100MHz on FPGA hardware. To verify the correctness of our FPGA model, we use SystemVerilog assertions and code coverage groups extensively. The workloads we run are traffic patterns sampled from the Microsoft Dryad Terasort application.
Table 1 lists FPGA resource consumption as well as design effort, measured in number of lines of code. To validate our FPGA design, we also developed a cycle-accurate C software simulation model for each FPGA model. In addition, we use micro benchmarks and the TCP incast application [17] to verify the correctness of our abstracted switch models.

Overall, the model of a rack of 64 servers consumes the most FPGA resources, and also requires the most design effort to support the full SPARC v8 standard. Abstract switch models for both circuit switching and packet switching have a very small logic resource requirement, with < 1% resource utilization on the FPGA we chose. For all three models, BRAMs are still the limiting factor for overall simulation density. On a four-FPGA board, such as BEE3, we can easily simulate up to 512 servers with 16 32-port ToR switches. To scale to 10,000 servers, we could use a system with 20 BEE boards, as has been built before [12]. We would also need two BEE3 boards to simulate a datacenter switch together with several array switches to connect rack switches together. Note that the BEE3 board was designed for a different usage model three years ago, uses older Virtex 5 FPGAs, and is equipped with limited high-speed serial connectors. If we used a board that has the latest 28 nm FPGAs and is optimized for high-radix FPGA-FPGA serial communications, we could greatly reduce the number of FPGAs used and lower the overall system cost of a 10,000-node system simulator from 88 FPGAs on 22 boards to 11 FPGAs on 11 boards.

5. COMPARING TO SOFTWARE MODELS

Both switch models are relatively easy to build with only around 3,000 lines of SystemVerilog code. The circuit-switching model is more straightforward and requires less design effort due to a simpler target design. Although FAME models allow us to conduct datacenter experiments at enormous scale with greater architecture detail, they do require more design effort.
For example, a comparable cycle-accurate software packet-switch model only requires 500 lines of C++ code. However, this does not mean a software C model is easier to verify. We always develop the software model along with the FAME model, and verify the correctness of the two models against each other. The debugging of the software model has never been accomplished well ahead of the corresponding FAME model. On the other hand, the more detailed FAME model helps to find many timing-related bugs in our software simulator.

As a comparison, we also optimized and parallelized the equivalent C++ simulation model using Pthreads. We compiled the C++ module using 64-bit GCC 4.4 with -O3 -mtune=native -march=native, and measured the software simulator in a trace-replay mode with the CPU cache already warmed up. We ran the software simulator on an 8-core Intel Xeon X5550 machine with 48GB memory running the latest Linux 2.6.34 kernel. Figure 4 shows slowdowns of the multithreaded software model simulating different size 10Gbps switches under two types of workload, i.e. full load with 64-byte packets and random load with random-size packets. When simulating a small 32-port switch, the single-thread software model has better performance than our threaded 100MHz FAME-7 FPGA model. However, the simulation performance drops quickly when increasing the number of switch ports. Due to many fine-grained synchronizations (approximately every 50 ns in target time), software multithreading helps little when simulating small switches. When simulating a large switch configuration, we saw small sublinear speedups using two, or sometimes four, threads but the benefit of using more threads diminishes quickly. Profile results show that crossbar scheduling, which scans multiple virtual queues, accounts for a large fraction of the total simulation time.
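To illustrate why scanning virtual queues can dominate run time, and why the FAME scheduler model in Section 3.2 processes only active events, the Python sketch below counts how many queues each strategy touches in one scheduling pass. It is a simplification under stated assumptions: it does not model crossbar arbitration or port conflicts, and the function names and the 1,024-queue count are hypothetical.

```python
from collections import deque


def enqueue(queues, active, qid, packet):
    """Add a packet and record the queue as active (has pending work)."""
    queues[qid].append(packet)
    active.add(qid)


def schedule_by_scanning(queues):
    """Visit every virtual queue; serve one packet from each non-empty one."""
    visited, served = 0, []
    for qid, q in enumerate(queues):
        visited += 1
        if q:
            served.append((qid, q.popleft()))
    return visited, served


def schedule_active_only(queues, active):
    """Visit only queues in the active set; serve one packet from each."""
    visited, served = 0, []
    for qid in sorted(active):  # sorted() copies, so the set can be edited below
        visited += 1
        served.append((qid, queues[qid].popleft()))
        if not queues[qid]:
            active.discard(qid)  # queue drained, no longer active
    return visited, served


if __name__ == "__main__":
    n = 1024  # assumed number of virtual queues
    scanned = [deque() for _ in range(n)]
    tracked = [deque() for _ in range(n)]
    active = set()
    for qid in (5, 300, 900):
        scanned[qid].append("pkt")
        enqueue(tracked, active, qid, "pkt")
    print(schedule_by_scanning(scanned)[0], schedule_active_only(tracked, active)[0])
```

Both passes serve the same three packets, but the scanning pass touches all 1,024 queues while the active-set pass touches three. The extra bookkeeping on enqueue is the price paid for skipping idle queues.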
Other large overheads include cache misses for first time accesses to virtual queue structures as well as updating in-memory performance counters for each simulation quantum. On the other hand, CPU memory bandwidth is not at all a limiting factor, even when simulating a large switch configuration. Moreover, Figure 4 also illustrates that the workload significantly affects the simulation performance for large switch configurations.

[Figure 4: Parallel software switch simulation performance. Y axis: simulation slowdown. X axis: number of simulated switch ports and type of workload. Data labels as extracted; the series assignment follows the legend order.]

Ports (workload)   1 core   2 cores   4 cores   8 cores
32 (full)          89       249       638       1855
32 (rand)          88       268       593.5     1736
64 (full)          280      346       606       1549
64 (rand)          270      356       580       1534
128 (full)         812      590       863       1293
128 (rand)         844      604       856       1303
255 (full)         2596     2164      1545      2452
255 (rand)         5218     3001      2208      2492

Note that we measured the software simulation performance under an unrealistic setting. In a real usage scenario, switch traffic will be generated dynamically by other models connected to the switch, which requires many more synchronizations over the input and output ports of the simulated switch. When simulating a large system containing many switches and servers, we believe it will be difficult to see any performance benefit by partitioning the software model across a high-performance cluster. Besides, future datacenter switches are very likely to be high-radix switches. Simulating architectures in even greater detail could also easily render the software approach impractical.

6. CONCLUSIONS

Our initial implementation and experience with applying the prototype to some real-world datacenter network research shows our FPGA-based approach is promising. Our future work primarily involves improving system capability using multiple FPGAs and scaling the software infrastructure running on our server model. We plan to quantitatively compare both circuit-switching and packet-switching datacenter network proposals using more real applications.

7. ACKNOWLEDGEMENT

This research is supported in part by gifts from Sun Microsystems, Google, Microsoft, Amazon Web Services, Cisco Systems, Cloudera, eBay, Facebook, Fujitsu, Hewlett-Packard, Intel, Network Appliance, SAP, VMWare and Yahoo! and by matching funds from the State of California's MICRO program (grants 06-152, 07-010, 06-148, 07-012, 06-146, 07-009, 06-147, 07-013, 06-149, 06-150, and 07-008), the National Science Foundation (grant #CNS-0509559), and the University of California Industry/University Cooperative Research Program (UC Discovery) grant COM07-10240.

8. REFERENCES

[1] Glen Anderson, private communications, 2009.
[2] Cisco Nexus 5000 Series Architecture: The Building Blocks of the Unified Fabric. http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-462176.html, 2010.
[3] Force10 S60 High-Performance 1/10 GbE Access Switch, http://www.force10networks.com/products/s60.asp, 2010.
[4] Xilinx Virtex 7 Series FPGAs, http://www.xilinx.com/technology/roadmap/7-series-fpgas.htm, 2010.
[5] L. A. Barroso and U. Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2009.
[6] U. Cummings, D. Daly, R. Collins, V. Agarwal, F. Petrini, M. Perrone, and D. Pasetto. Fulcrum's FocalPoint FM4000: A Scalable, Low-Latency 10GigE Switch for High-Performance Data Centers. In Proceedings of the 2009 17th IEEE Symposium on High Performance Interconnects, pages 42–51, Washington, DC, USA, 2009. IEEE Computer Society.
[7] N. Farrington, G. Porter, S. Radhakrishnan, H. H. Bazzaz, V. Subramanya, Y. Fainman, G. Papen, and A. Vahdat. Helios: a hybrid electrical/optical switch architecture for modular data centers. In SIGCOMM '10, pages 339–350, 2010.
[8] A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel. The cost of a cloud: research problems in data center networks. SIGCOMM Comput. Commun. Rev., 39(1):68–73, 2009.
[9] A. Greenberg, J. R. Hamilton, N. Jain, S.
Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: a scalable and flexible data center network. In SIGCOMM '09, pages 51–62, New York, NY, USA, 2009. ACM.
[10] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. BCube: a high performance, server-centric network architecture for modular data centers. In SIGCOMM '09, pages 63–74, New York, NY, USA, 2009. ACM.
[11] R. Katz. Tech titans building boom: The architecture of internet datacenters. IEEE Spectrum, February 2009.
[12] A. Krasnov, A. Schultz, J. Wawrzynek, G. Gibeling, and P.-Y. Droz. RAMP Blue: A Message-Passing Manycore System In FPGAs. In Proceedings of International Conference on Field Programmable Logic and Applications, pages 54–61, Amsterdam, The Netherlands, 2007.
[13] B. Kwan, P. Agarwal, and L. Ashvin. Flexible buffer allocation entities for traffic aggregate containment. US Patent 20090207848, August 2009.
[14] R. Liu et al. Tessellation: Space-Time partitioning in a manycore client OS. In HotPar '09, Berkeley, CA, March 2009.
[15] A. Broder and M. Mitzenmacher. Using multiple hash functions to improve IP lookups. In Proceedings of IEEE INFOCOM, pages 1454–1463, 2000.
[16] R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: a scalable fault-tolerant layer 2 data center network fabric. In SIGCOMM '09, pages 39–50, New York, NY, USA, 2009. ACM.
[17] Z. Tan, K. Asanovic, and D. Patterson. An FPGA-based Simulator for Datacenter Networks. In The Exascale Evaluation and Research Techniques Workshop (EXERT 2010), at the 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2010), March 2010.
[18] Z. Tan, A. Waterman, R. Avizienis, Y. Lee, H. Cook, D. Patterson, and K. Asanovic. RAMP gold: An FPGA-based architecture simulator for multiprocessors.
In Design Automation Conference (DAC), 2010 47th ACM/IEEE, pages 463–468, 2010.
[19] Z. Tan, A. Waterman, H. Cook, S. Bird, K. Asanovic, and D. Patterson. A case for FAME: FPGA architecture model execution. In Proceedings of the 37th annual international symposium on Computer architecture, ISCA '10, pages 290–301, New York, NY, USA, 2010. ACM.
[20] C. Thacker. Rethinking data centers. October 2007.
[21] C. Thacker. A data center network using FPGAs, May 2010.