From the Archives: Future of Supercomputing at Altparty 2009

  • Published on 15-Jul-2015

Transcript

  • CSC - Tieteen tietotekniikan keskus Oy / CSC - IT Center for Science Ltd. The Future of Supercomputing. Olli-Pekka Lehto, Systems Specialist.
  • CSC - IT Center for Science
    - Center for Scientific Computing
      - Offices located in Keilaniemi, Espoo
      - All shares owned by the Ministry of Education
      - Founded in 1970 as a technical support unit for the Univac 1108
    - Provides a variety of services to the Finnish research community
      - High Performance Computing (HPC) resources
      - Consulting services for scientific computing
      - Scientific software development (Chipster, Elmer etc.)
      - IT infrastructure services
      - ISP services (FUNET)
  • CSC in numbers
    - ~180 employees
    - 3000 researchers use the computing capacity actively
      - Around 500 projects at any given time
    - ~320 000 FUNET end-users in 85 organizations
  • Louhi.csc.fi
    - Model: Cray XT4 (single-socket nodes) + Cray XT5 (dual-socket nodes)
    - Processors: 10864 AMD Opteron 2.3 GHz cores, 2716 quad-core processors, 1012 XT4 + 852 XT5 nodes
    - Theoretical peak performance: >100 TeraFlop/s (= 2.3 * 10^9 Hz * 4 Flop/Hz * 10864 cores)
    - Memory: ~10.3 TeraBytes
    - Interconnect network: Cray SeaStar2 3D torus, 6 * 5.6 GByte/s links per node
    - Power consumption: 520.8 kW (high load), ~300 kW (nominal load)
    - Local filesystem: 67 TB Lustre filesystem
    - Operating system: SuSE Linux on service nodes, Cray Compute Node Linux on compute nodes
    - A "capability" system: few large (64-10000 core) jobs
  • Murska.csc.fi
    - Model: HP ProLiant blade cluster
    - Processors: 2176 AMD Opteron 2.6 GHz cores, 1088 dual-core processors, 544 blade servers
    - Theoretical peak performance: ~11.3 TeraFlop/s (= 2.6 * 10^9 Hz * 2 Flop/Hz * 2176 cores; the arithmetic for both systems is sketched below)
    - Memory: ~5 TB
    - Interconnect network: Voltaire 4x DDR InfiniBand (16 Gbit/s fat-tree network)
    - Power consumption: ~75 kW (high load)
    - Local filesystem: 98 TB Lustre filesystem
    - Operating system: HP XC Cluster Suite (RHEL-based Linux)
    - A "capacity" system: many small (1-128 core) jobs
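The peak figures on the two slides above are just cores x clock x floating-point operations per cycle. A minimal C sketch of that arithmetic, using only the numbers from the slides (note that the dual-core Opterons in Murska retire 2 flops per cycle, which is what makes the stated ~11.3 TeraFlop/s come out):

```c
#include <stdio.h>

/* Theoretical peak = cores * clock (Hz) * floating-point ops per cycle. */
static double peak_tflops(long cores, double clock_ghz, int flops_per_cycle)
{
    return cores * clock_ghz * 1e9 * flops_per_cycle / 1e12;
}

int main(void)
{
    /* Louhi: 10864 quad-core Opteron cores at 2.3 GHz, 4 flops/cycle. */
    printf("Louhi:  %.1f TFlop/s\n", peak_tflops(10864, 2.3, 4));

    /* Murska: 2176 dual-core Opteron cores at 2.6 GHz, 2 flops/cycle. */
    printf("Murska: %.1f TFlop/s\n", peak_tflops(2176, 2.6, 2));
    return 0;
}
```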
  • Why use supercomputers?
    - Constraints
      - Results are needed in a reasonable time
        - Impatient users
        - Time-critical problems (e.g. weather forecasting)
      - Large problem sizes
        - The problem does not fit into the memory of a single system
        - Many problem types require all the processing power close to each other
      - Distributed computing (BOINC etc.) works well only on certain problem types
  • Who uses HPC? Originally a sector-by-decade chart (1960s-2010s) of military, scientific and commercial applications: weapons modelling, signals intelligence, radar image processing, nuclear physics, mathematics, quantum chemistry, fusion energy, nanotechnology, climate change, weather forecasting, Electronic Design Automation (EDA), genomics, tactical simulation, aerodynamics, crash simulations, movie SFX, feature-length movies, search engines, oil reservoir discovery, stock market prediction, banking & insurance databases, strategic simulation ("Wargames"), materials science, drug design, organ modelling.
  • State of HPC 2009
    - Move towards commodity components
      - Clusters built from off-the-shelf servers
      - Linux
      - Open source tools (compilers, debuggers, clustering mgmt, applications)
      - Standard x86 processors
    - Price-performance efficient components
      - Low-latency, high-bandwidth interconnects
        - Standard PCI cards
        - InfiniBand, 10Gig Ethernet, Myrinet
      - Parallel filesystems
        - Striped RAID (0) with fileservers
        - Lustre, GPFS, PVFS2 etc.
  • Modern HPC systems
    - Commodity clusters
      - A large number of regular servers connected together
        - Usually a standard Linux OS
        - Possible to even mix and match components from different vendors
      - May include some special components
        - High-performance interconnect network
        - Parallel filesystems
      - Low-end and midrange systems
      - Vendors: IBM, HP, Sun etc.
    - Proprietary supercomputers
      - Designed from the ground up for HPC
        - Custom interconnect network
        - Customized OS & software
        - Vendor-specific components
      - High-end supercomputers and special applications
      - Examples: Cray XT-series, IBM BlueGene
  • The Three Walls: there are three "walls" which CPU design is hitting now
    - Memory wall
      - Processor clock rates have grown faster than memory clock rates
    - Power wall
      - Processors consume an increasing amount of power
      - The increase is non-linear: +13% performance = +73% power consumption
    - Microarchitecture wall
      - Adding more complexity to the CPUs is not helping that much
        - Pipelining, branch prediction etc.
  • A Typical HPC System
    - Built from commodity servers
      - 1U or blade form factor
      - 1-10 management nodes
      - 1-10 login nodes
        - Program development, compilation
      - 10s of storage nodes
        - Hosting the parallel filesystem
      - 100s of compute nodes
        - 2-4 CPU sockets per node (4-24 cores), AMD Opteron or Intel Xeon
    - Linux OS
    - Connected with InfiniBand or Gigabit Ethernet
    - Programs are written in C/C++ or Fortran and parallelized using the MPI (Message Passing Interface) API (a sketch follows below)
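To make the MPI programming model above concrete, here is a minimal C + MPI sketch; the rank-summing example is made up for illustration and is not from the talk. Each task runs as a separate process, and all data exchange happens through explicit library calls:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime       */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's id (0..size-1)  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks       */

    /* Each task computes a partial result on its own... */
    double partial = (double)rank;
    double total   = 0.0;

    /* ...and the library handles the communication: sum over all tasks. */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d tasks, sum of ranks = %.0f\n", size, total);

    MPI_Finalize();
    return 0;
}
```

Built with mpicc and launched with mpirun -np N, this starts N independent processes; as the programming-languages slide below notes, the failure of any one of them normally brings down the whole job.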
  • The Exaflop system
    - Target: 2015-2018
      - 10^18 (million trillion) floating-point operations per second
      - Current system: 0.00165 Exaflops
    - Expectations with current technology evolution
      - Power draw 100-300 MW
        - 15-40% of a nuclear reactor (Olkiluoto I)!
        - $1M/MW/year!
        - Need to bring it down to 30-50 MW (see the sketch below)
      - 500 000 - 5 000 000 processor cores
      - Memory 30-100 PB
      - Storage 1 Exabyte
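Those power envelopes fix the energy efficiency an exaflop machine has to reach, and the $1M/MW/year figure fixes the electricity bill. A small C sketch of that arithmetic, using only the numbers on the slide:

```c
#include <stdio.h>

int main(void)
{
    const double exaflop    = 1e18;                       /* target rate, flop/s */
    const double draws_mw[] = { 300.0, 100.0, 50.0, 30.0 };

    for (int i = 0; i < 4; i++) {
        double watts = draws_mw[i] * 1e6;
        /* Required efficiency, plus yearly cost at $1M per MW per year. */
        printf("%3.0f MW: %5.1f GFlop/s per Watt needed, ~$%.0fM per year\n",
               draws_mw[i], exaflop / watts / 1e9, draws_mw[i]);
    }
    return 0;
}
```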
  • Programming Languages
    - Current trend (C/C++/Fortran + MPI)
      - Difficult to write portable and efficient code
      - MPI is not fault tolerant by default (one task dies and the whole job crashes)
    - PGAS languages to the rescue? (a sketch follows below)
      - Partitioned Global Address Space
      - Looks like global shared memory
        - But possible to define task-local regions
        - The compiler generates communication code
      - Current standards
        - UPC - Unified Parallel C
        - CAF - Co-Array Fortran
      - Languages under development
        - Titanium, Fortress, X10, Chapel
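As a rough illustration of the PGAS idea, a minimal UPC-style sketch (still C, assuming a UPC compiler run with a fixed thread count; the shared array and its initialization are invented for illustration, not taken from the slides). The array looks like ordinary global memory, while the affinity expression in upc_forall keeps each thread on its own portion and lets the compiler generate whatever communication is needed:

```c
#include <upc.h>      /* UPC: shared qualifier, MYTHREAD, THREADS, upc_forall */
#include <stdio.h>

#define N 1024

/* One logically global array, physically distributed across the threads. */
shared int a[N];

int main(void)
{
    int i;

    /* The affinity expression &a[i] makes each thread handle the
       iterations whose data it owns, so most accesses stay local. */
    upc_forall (i = 0; i < N; i++; &a[i])
        a[i] = i * i;

    upc_barrier;

    if (MYTHREAD == 0)
        printf("a[10] = %d, computed by %d threads\n", a[10], THREADS);
    return 0;
}
```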
  • What to do with an exaflop?
    - Long-term climate-change modelling
    - High-resolution weather forecasts
      - Prediction by city block
      - Extreme weather
    - Large protein folding
      - Alzheimer's, cancer, Parkinson's etc.
    - Simulation of a human brain
    - Very realistic virtual environments
    - Design of nanostructures
      - Carbon nanotubes, nanobots
    - Beating a human pro player at 19x19 Go
  • Accelerators: GPGPU
    - General Purpose computing on Graphics Processing Units
    - Nvidia Tesla/Fermi, ATI FireStream, IBM Cell, Intel Larrabee
    - Advantages
      - High-volume production rates, low price
      - High memory bandwidth on the GPU (>100 GB/s vs. 10-30 GB/s for system RAM)
      - High flop rate, for certain applications
    - Disadvantages
      - Low performance in double-precision (64-bit) computation
      - Getting data to the GPU memory is a bottleneck (8 GB/s PCI Express; see the sketch below)
      - Vendors have different programming languages
        - Now: Nvidia CUDA, ATI Stream, Intel Ct, Cell etc.
        - Future: OpenCL on everything (hopefully!)
      - Does not work for all types of applications
        - Branching, random memory access, huge datasets etc.
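A minimal CUDA C sketch of the offload pattern behind GPGPU (the vector addition is a made-up example, not from the talk); the explicit cudaMemcpy calls are exactly where the PCI Express transfer bottleneck noted above shows up:

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Each GPU thread adds one element of the vectors. */
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *a = (float *)malloc(bytes);
    float *b = (float *)malloc(bytes);
    float *c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);

    /* Host -> device copies go over PCI Express: the bandwidth bottleneck. */
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

    /* Launch enough 256-thread blocks to cover all n elements. */
    vec_add<<<(n + 255) / 256, 256>>>(da, db, dc, n);

    /* Device -> host copy of the result, again over PCI Express. */
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %.1f\n", c[0]);   /* expect 3.0 */

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(a); free(b); free(c);
    return 0;
}
```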
  • Case: Nvidia Fermi
    - Announced last month (September 2009), available in 2010
    - New HPC-oriented features
      - Error-correcting memory
      - High double-precision performance
    - 512 compute cores, ~3 billion transistors
      - 750 GFlops (double precision)
      - 1.5 TFlops (single precision)
    - 2011: Fermi-based Cray supercomputer at Oak Ridge National Laboratory
      - "10 times faster than the current state of the art": ~20 Petaflops
  • Case: Intel Larrabee
    - Intel's new GPU architecture, available in 2010
    - Based on Pentium x86 processor cores
      - Initially tens of cores per GPU
      - Pentium cores with vector units
      - Compatible with x86 programs
    - Cores connected with a ring bus
  • Accelerators: FPGA
    - Field Programmable Gate Arrays
    - Vendors: ClearSpeed, Mitrionics, Convey, Nallatech
    - A chip with programmable logic units
      - Units connected with a programmable network
    - Advantages
      - Very low power consumption
      - Arbitrary precision
      - Very efficient in search algorithms
      - Several in-socket implementations
        - The FPGA sits directly in the CPU socket
    - Disadvantages
      - Difficult to program
      - Limited number of logic blocks
  • Performance: bar chart of single- and double-precision GFlop/s for the Nvidia GeForce GTX 280, Nvidia Tesla C1060, Nvidia Tesla S1070, ATI Radeon 4870, ATI Radeon X2 4870, ATI FireStream 9250, ClearSpeed e710, ClearSpeed CATS700, IBM PowerXCell 8i and AMD Opteron (Barcelona).
  • Power efficiency: bar chart of single- and double-precision GFlop/s per Watt for the same devices.
  • 3D Integrated Circuits
    - Wafers stacked on top of each other
    - Layers connected with through-silicon "vias"
    - Many benefits
      - High bandwidth and low latency
      - Saves space and power
      - Added freedom in circuit design
      - The stack may consist of different types of wafers
    - Several challenges
      - Heat dissipation
      - Complex design and manufacturing
    - HPC killer app: memory stacked on top of a CPU
  • Other Technologies To Watch
    - SSD (Solid State Disk)
      - Fast transactions, low power, improving reliability
      - Fast checkpointing and restarting of programs
    - Optics on silicon
      - Lightpaths both on a chip and on the PCB
    - New memory technologies
      - Phase-change memory etc.
      - Low power, low latency, high bandwidth
    - Green datacenter technologies
    - DNA computing
    - Quantum computing
  • Conclusions
    - The difference between clusters and proprietary supercomputers is diminishing
    - Accelerator technology is promising
      - Simple, vendor-independent programming models are needed
    - Lots of programming challenges in parallelisation
      - Mainstream computing faces similar challenges today
    - Getting to an Exaflop will be very tough
      - Innovation is needed in both software and hardware
  • Questions