STAR-CCM+ and STAR-CD Performance Benchmark and Profiling
March 2009

Slide 2: Note
- The following research was performed under the HPC Advisory Council activities
  - Participating vendors: AMD, Dell, Mellanox
  - Compute resource: HPC Advisory Council Cluster Center
- The participating members would like to thank CD-adapco for their support and guidelines
- For more info please refer to www.mellanox.com, www.dell.com/hpc, www.amd.com

Slide 3: STAR-CCM+ and STAR-CD
- STAR-CCM+
  - An engineering process-oriented CFD tool
  - Client-server architecture, object-oriented programming
  - Delivers the entire CFD process in a single integrated software environment
- STAR-CD
  - An integrated platform for multi-physics simulations
  - A long-established platform for industrial CFD simulation
  - Bridging the gap between CFD and structural mechanics
- Developed by CD-adapco

Slide 4: Objectives
- The presented research was done to provide best practices
  - STAR-CCM+ and STAR-CD performance benchmarking
  - Interconnect performance comparisons
  - Ways to increase STAR-CCM+ and STAR-CD productivity
  - Understanding STAR-CD communication patterns
  - MPI library comparisons

Slide 5: Test Cluster Configuration
- Dell PowerEdge SC 1435 24-node cluster
- Quad-Core AMD Opteron 2382 (Shanghai) CPUs
- Mellanox InfiniBand ConnectX 20Gb/s (DDR) HCAs
- Mellanox InfiniBand DDR Switch
- Memory: 16GB memory, DDR2 800MHz per node
- OS: RHEL5U2, OFED 1.4 InfiniBand SW stack
- MPI: Platform MPI 5.6.5, HP-MPI 2.3
- Application: STAR-CCM+ Version 3.06, STAR-CD Version 4.08
- Benchmark Workload
  - STAR-CD: A-Class (Turbulent Flow around A-Class Car)
  - STAR-CCM+: Auto Aerodynamics test

Slide 6: Mellanox InfiniBand Solutions
- Industry Standard
  - Hardware, software, cabling, management
  - Design for clustering and storage interconnect
- Performance
  - 40Gb/s node-to-node
  - 120Gb/s switch-to-switch
  - 1us application latency
  - Most aggressive roadmap in the industry
- Reliable with congestion management
- Efficient
  - RDMA and Transport Offload
  - Kernel bypass
  - CPU focuses on application processing
- Scalable for Petascale computing & beyond
- End-to-end quality of service
- Virtualization acceleration
- I/O consolidation, including storage
[Figure: "InfiniBand Delivers the Lowest Latency"]
[Figure: "The InfiniBand Performance Gap is Increasing"; InfiniBand compared with Fibre Channel and Ethernet; labeled data rates: 20Gb/s, 40Gb/s, 60Gb/s, 80Gb/s (4X), 120Gb/s, 240Gb/s (12X)]

Slide 7: Quad-Core AMD Opteron Processor
- Performance
  - Quad-Core
    - Enhanced CPU IPC
    - 4x 512K L2 cache
    - 6MB L3 Cache
  - Direct Connect Architecture
    - HyperTransport Technology
    - Up to 24 GB/s peak per processor
  - Floating Point
    - 128-bit FPU per core
    - 4 FLOPS/clk peak per core
  - Integrated Memory Controller
    - Up to 12.8 GB/s
    - DDR2-800 MHz or DDR2-667 MHz
- Scalability
  - 48-bit Physical Addressing
- Compatibility
  - Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron processor
[Diagram, dated November 5, 2007: Quad-Core AMD Opteron processor platform with PCI-E bridges, I/O hubs, USB, PCI, dual-channel registered DDR2, and 8 GB/s links]

Slide 8: Dell PowerEdge Servers helping Simplify IT
- System Structure and Sizing Guidelines
  - 24-node cluster built with Dell PowerEdge SC 1435 Servers
  - Servers optimized for High Performance Computing environments
  - Building Block Foundations for best price/performance and performance/watt
- Dell HPC Solutions
  - Scalable Architectures for High Performance and Productivity
  - Dell's comprehensive HPC services help manage the lifecycle requirements
  - Integrated, Tested and Validated Architectures
- Workload Modeling
  - Optimized System Size, Configuration and Workloads
  - Test-bed Benchmarks
  - ISV Applications Characterization
  - Best Practices & Usage Analysis

Slide 9: STAR-CCM+ Benchmark Results - Interconnect
- Input Dataset
  - Auto Aerodynamics test
- InfiniBand DDR delivers higher performance and scalability
  - For any cluster size
  - Up to 136% faster run time
[Figure: STAR-CCM+ Benchmark Results (Auto Aerodynamics Test); x: Number of Nodes (1, 2, 4, 8, 16, 24); y: Elapsed Time/Iteration; series: 10GigE, InfiniBand DDR; HP MPI; lower is better]

Slide 10: STAR-CCM+ Productivity Results
- Two cases are presented
  - Single job over the entire system
  - Four jobs, each on two cores per server
- Productivity increases by allowing multiple jobs to run simultaneously
  - Up to 30% increase in system productivity
[Figure: STAR-CCM+ Productivity (Auto Aerodynamics Test); x: Number of Nodes (4, 8, 12, 16, 20, 24); y: Iterations/Day; series: 1 Job, 4 Jobs; HP MPI over InfiniBand; higher is better]

Slide 11: STAR-CCM+ Productivity Results - Interconnect
- Test case
  - Four jobs, each on two cores per server
- InfiniBand DDR provides higher productivity compared to 10GigE
  - Up to 25% more iterations per day
- InfiniBand maintains consistent scalability as cluster size increases
[Figure: STAR-CCM+ Productivity (Auto Aerodynamics Test); x: Number of Nodes (4, 8, 12, 16, 24); y: Iterations/Day; series: 10GigE, InfiniBand DDR; HP MPI; higher is better]

Slide 12: STAR-CD Benchmark Results - Interconnect
- Test case
  - Single job over the entire system
  - Input Dataset (A-Class)
- InfiniBand DDR enhances performance and scalability
  - Up to 34% and 10% more jobs/day compared to GigE and 10GigE respectively
  - Performance advantage of InfiniBand increases as cluster size scales
[Figure: STAR-CD Performance Advantage of InfiniBand and 10GigE over GigE (A-Class); x: Number of Nodes (4, 8, 16, 20, 24); y: Percentage; series: 10GigE, InfiniBand DDR; Platform MPI; higher is better]

Slide 13: Maximize STAR-CD Performance per Core
- Test case
  - Single job over the entire system
  - Using one, two or four cores in each quad-core AMD processor
  - Remaining cores kept idle
- Using only some of the cores per simulation improves single-job run time
- It is recommended to run multiple simulations simultaneously
  - For maximum productivity (slide 10)
[Figure: STAR-CD Benchmark Results (A-Class); x: Number of Nodes, grouped as 16 Cores, 32 Cores, 48 Cores; y: Total Elapsed Time (s); series: 4-cores/proc, 2-cores/proc, 1-core/proc; Platform MPI; lower is better]

Slide 14: STAR-CCM+ Profiling - MPI Functions
- MPI_Testall, MPI_Bcast, and MPI_Recv are the most used MPI functions
[Figure: STAR-CCM+ MPI Profiling; x: MPI function (MPI_Testall, MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Recv); y: Number of Messages (log scale); series: 4, 8, 16, 22 Nodes]

Slide 15: STAR-CCM+ Profiling - Data Transferred
- Most MPI messages are within 4KB in size
- Number of messages increases with cluster size
[Figure: STAR-CCM+ MPI Profiling; x: Message Size ([0..64B], [65..256B], [257B..1KB], [1..4KB], [4..16KB], [16..64KB], [64..256KB], [256KB..1M], [1..4M], [4M..infinity]); y: Number of Messages (Millions); series: 4, 8, 16, 22 Nodes]

Slide 16: STAR-CCM+ Profiling - Timing
- MPI_Testall, MPI_Bcast, and MPI_Recv have relatively large overhead
- Overhead increases with cluster size
[Figure: STAR-CCM+ MPI Profiling; x: MPI function (MPI_Testall, MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Recv); y: Total Overhead (s); series: 4, 8, 16, 22 Nodes]

Slide 17: STAR-CCM+ Profiling Summary
- STAR-CCM+ was profiled to determine networking dependency
- Most used message sizes

Slide 18: STAR-CD Profiling - MPI Functions
- MPI_Sendrecv and MPI_Allreduce are the most used MPI functions
[Figure: STAR-CD MPI Profiling; x: MPI Functions (MPI_Allgather, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Sendrecv); y: Number of Messages (Millions); series: 4, 8, 12, 16, 20, 24 Nodes]

Slide 19: STAR-CD Profiling - Data Transferred
- Most point-to-point MPI messages are between 1KB and 8KB in size
- Number of messages increases with cluster size
[Figure: STAR-CD MPI Profiling (MPI_Sendrecv); x: Message Size ([0-128B>, [128B-1K>, [1K-8K>, [8K-256K>, [256K-1M>); y: Number of Messages (Millions); series: 4, 8, 12, 16, 20, 22 Nodes]

Slide 20: STAR-CD Profiling - Data Transferred
- Most MPI collective messages are smaller than 128 bytes
- Number of messages increases with cluster size
[Figure: STAR-CD MPI Profiling (MPI_Allreduce); x: Message Size ([0-128B>, [128B-1K>, [1K-8K>, [8K-256K>, [256K-1M>); y: Number of Messages (Millions); series: 4, 8, 12, 16, 20, 24 Nodes]

Slide 21: STAR-CD Profiling - Timing
- MPI_Allreduce and MPI_Sendrecv have large overhead
- Overhead increases with cluster size
[Figure: STAR-CD MPI Profiling; x: MPI Functions (MPI_Allgather, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Sendrecv); y: Total Overhead (s); series: 4, 8, 12, 16, 20, 24 Nodes]

Slide 22: MPI Performance Comparison - MPI_Sendrecv
- HP MPI demonstrates better performance for large messages
[Figures: MPI_Sendrecv (32 Processes) and MPI_Sendrecv (192 Processes); x: Message Size (Bytes) (128, 256, 512, 1024, 2048, 4096, 8192); y: Bandwidth (MB/sec); series: Platform MPI, HP-MPI; InfiniBand DDR; higher is better]

Slide 23: MPI Performance Comparison - MPI_Allreduce
- HP MPI demonstrates lower MPI_Allreduce runtime for small messages
[Figures: MPI_Allreduce (32 Processes) and MPI_Allreduce (192 Processes); x: Message Size (Bytes); y: Total Time (usec); series: Platform MPI, HP-MPI; InfiniBand DDR; lower is better]

Slide 24: STAR-CD Benchmark Results - MPI
- Test case
  - Single job over the entire system
  - Input Dataset (A-Class)
- HP-MPI has slightly better performance with CPU affinity enabled
[Figure: STAR-CD Benchmark Results (A-Class); x: Number of Nodes (4, 8, 16, 20, 24); y: Total Elapsed Time (s); series: Platform MPI, HP-MPI; lower is better]

Slide 25: STAR-CD Profiling Summary
- STAR-CD was profiled to determine networking dependency
- Majority of data transferred between compute nodes
  - Medium size messages
  - Data transferred increases with cluster size
- Most used message sizes

Slide 26: STAR-CD Performance with Power Management
- Test Scenario
  - 24 servers, 4-Cores/Proc
- Nearly identical performance with power management enabled or disabled
[Figure: STAR-CD Benchmark Results (A-Class); x: Power Management Enabled, Power Management Disabled; y: Total Elapsed Time (s); InfiniBand DDR; lower is better]

Slide 27: STAR-CD Benchmark - Power Consumption
- Power management reduces total system power consumption by 2%
[Figure: STAR-CD Benchmark Results (A-Class); x: Power Management Enabled, Power Management Disabled; y: Power Consumption (Watt); InfiniBand DDR; lower is better]

Slide 28: Power Cost Savings with Power Management
- Power management saves $248/year for the 24-node cluster
- As cluster size increases, bigger savings are expected
- $/year = Total power consumption/year (KWh) * $0.20
- For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
[Figure: STAR-CD Benchmark Results, Power Cost Comparison; x: Power Management Enabled, Power Management Disabled; y: $/Year; 24 Node Cluster; InfiniBand DDR]

Slide 29: Power Cost Savings with Different Interconnect
- InfiniBand saves ~$1200 and ~$4000 in power cost to finish the same number of STAR-CD jobs compared to 10GigE and GigE
  - On a yearly basis for the 24-node cluster
- As cluster size increases, more power can be saved
- $ = KWh * $0.20
- For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
[Figure: Power Cost Savings (InfiniBand vs 10GigE and GigE); x: 10GigE, GigE; y: Power Cost Savings ($); 24 Node Cluster; Power Management enabled]

Slide 30: Conclusions
- STAR-CD and STAR-CCM+ are widely used CFD simulation software
- Performance and productivity rely on
  - Scalable HPC systems and interconnect solutions
  - Low latency and high throughput interconnect technology
  - Reasonable process distribution can dramatically improve performance per core
- Interconnect comparison shows
  - InfiniBand delivers superior performance in every cluster size
  - Low latency InfiniBand enables unparalleled scalability
- Power management provides 2% saving in power consumption
  - Per 24-node system with InfiniBand
  - $248 power savings per year for 24-node cluster
  - Power saving increases with cluster size
- InfiniBand saves power cost
  - Based on the number of jobs that can be finished by InfiniBand per year
  - InfiniBand enables $1200 and $4000 power savings compared to 10GigE and GigE

Slide 31: Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "As-Is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.
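
The power-cost formula quoted on the power management slides ($/year = total power consumption/year in KWh * $0.20) can be checked with a short calculation. The following is a minimal sketch in Python, not part of the original study. It assumes the cluster runs continuously for 8760 hours per year, which the slides do not state, and the function names are hypothetical. It converts the reported $248/year saving back into an implied average power reduction and, using the reported 2% figure, an implied total draw. The derived wattages are back-of-envelope estimates, not measurements from the benchmark.

```python
# Back-of-envelope check of the slide formula:
#   $/year = total power consumption per year (kWh) * $0.20
# ASSUMPTION: the cluster runs 24x7 (8760 h/year); the slides do not say this.

HOURS_PER_YEAR = 24 * 365   # assumed continuous operation, non-leap year
PRICE_PER_KWH = 0.20        # $/kWh, as quoted on the slides


def yearly_cost(avg_power_watts: float, price_per_kwh: float = PRICE_PER_KWH) -> float:
    """Yearly electricity cost in dollars for a constant average power draw."""
    kwh_per_year = avg_power_watts / 1000.0 * HOURS_PER_YEAR
    return kwh_per_year * price_per_kwh


def implied_power_watts(yearly_dollars: float, price_per_kwh: float = PRICE_PER_KWH) -> float:
    """Inverse of yearly_cost: average power (W) that costs yearly_dollars per year."""
    kwh_per_year = yearly_dollars / price_per_kwh
    return kwh_per_year * 1000.0 / HOURS_PER_YEAR


if __name__ == "__main__":
    saving_dollars = 248.0   # reported yearly saving for the 24-node cluster
    saving_fraction = 0.02   # reported reduction in total system power

    saving_watts = implied_power_watts(saving_dollars)
    total_watts = saving_watts / saving_fraction  # derived estimate, not reported

    print(f"Implied average power saving: {saving_watts:.1f} W")
    print(f"Implied total cluster draw:   {total_watts:.0f} W")
    print(f"Implied yearly power cost:    ${yearly_cost(total_watts):.0f}")
```

Under the continuous-operation assumption, the implied total draw and yearly cost fall inside the axis ranges of the power consumption and power cost charts, which suggests the 2% and $248 figures are mutually consistent.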