
  • STAR-CCM+ and STAR-CD Performance Benchmark and Profiling

    March 2009

  • 2

    Note

    • The following research was performed under the HPC Advisory Council activities

    – Participating vendors: AMD, Dell, Mellanox

    – Compute resource - HPC Advisory Council Cluster Center

    • The participating members would like to thank CD-adapco for their support and guidance

    • For more info please refer to

    – www.mellanox.com, www.dell.com/hpc, www.amd.com

  • 3

    STAR-CCM+ and STAR-CD

    • STAR-CCM+

    – An engineering process-oriented CFD tool

    – Client-server architecture, object-oriented programming

    – Delivers the entire CFD process in a single integrated software environment

    • STAR-CD

    – An integrated platform for multi-physics simulations

    – A long established platform for industrial CFD simulation

    – Bridging the gap between CFD and structural mechanics

    • Developed by CD-adapco

  • 4

    Objectives

    • The presented research was done to provide best practices

    – STAR-CCM+ and STAR-CD performance benchmarking

    – Interconnect performance comparisons

    – Ways to increase STAR-CCM+ and STAR-CD productivity

    – Understanding STAR-CD communication patterns

    – MPI library comparisons

  • 5

    Test Cluster Configuration

    • Dell™ PowerEdge™ SC 1435 24-node cluster

    • Quad-Core AMD Opteron™ 2382 (“Shanghai”) CPUs

    • Mellanox® InfiniBand ConnectX® 20Gb/s (DDR) HCAs

    • Mellanox® InfiniBand DDR Switch

    • Memory: 16GB DDR2 800MHz per node

    • OS: RHEL5U2, OFED 1.4 InfiniBand SW stack

    • MPI: Platform MPI 5.6.5, HP-MPI 2.3

    • Application: STAR-CCM+ Version 3.06, STAR-CD Version 4.08

    • Benchmark Workload

    – STAR-CD: A-Class (Turbulent Flow around A-Class Car)

    – STAR-CCM+: Auto Aerodynamics test

  • 6

    Mellanox InfiniBand Solutions

    • Industry Standard
      – Hardware, software, cabling, management
      – Designed for clustering and storage interconnect

    • Performance
      – 40Gb/s node-to-node
      – 120Gb/s switch-to-switch
      – 1us application latency
      – Most aggressive roadmap in the industry

    • Reliable with congestion management

    • Efficient
      – RDMA and transport offload
      – Kernel bypass
      – CPU focuses on application processing

    • Scalable for Petascale computing & beyond

    • End-to-end quality of service

    • Virtualization acceleration

    • I/O consolidation, including storage

    [Figure: "InfiniBand Delivers the Lowest Latency" / "The InfiniBand Performance Gap is Increasing" — bandwidth roadmap comparing InfiniBand (20Gb/s to 240Gb/s 12X) with Ethernet and Fibre Channel.]

  • 7

    • Performance
      – Quad-Core: enhanced CPU IPC, 4x 512K L2 cache, 6MB L3 cache
      – Direct Connect Architecture: HyperTransport™ Technology, up to 24 GB/s peak per processor
      – Floating Point: 128-bit FPU per core, 4 FLOPS/clk peak per core
      – Integrated Memory Controller: up to 12.8 GB/s, DDR2-800 MHz or DDR2-667 MHz

    • Scalability
      – 48-bit Physical Addressing

    • Compatibility
      – Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron™ processor

    [Diagram: Quad-Core AMD Opteron™ Processor — Direct Connect Architecture block diagram showing PCI-E® bridges, I/O hubs, USB, PCI, dual-channel registered DDR2 memory, and 8 GB/s HyperTransport links.]

  • 8

    Dell PowerEdge Servers helping Simplify IT

    • System Structure and Sizing Guidelines – 24-node cluster built with Dell PowerEdge™ SC 1435 Servers

    – Servers optimized for High Performance Computing environments

    – Building Block Foundations for best price/performance and performance/watt

    • Dell HPC Solutions – Scalable Architectures for High Performance and Productivity

    – Dell's comprehensive HPC services help manage the lifecycle requirements.

    – Integrated, Tested and Validated Architectures

    • Workload Modeling – Optimized System Size, Configuration and Workloads

    – Test-bed Benchmarks

    – ISV Applications Characterization

    – Best Practices & Usage Analysis

  • 9

    STAR-CCM+ Benchmark Results - Interconnect

    • Input Dataset – Auto Aerodynamics test

    • InfiniBand DDR delivers higher performance and scalability – For any cluster size – Up to 136% faster run time

    [Chart: STAR-CCM+ Benchmark Results (Auto Aerodynamics Test) — elapsed time per iteration vs. number of nodes (1-24), 10GigE vs. InfiniBand DDR, HP MPI. Lower is better.]

  • 10

    STAR-CCM+ Productivity Results

    • Two cases are presented
      – Single job over the entire system
      – Four jobs, each on two cores per server

    • Productivity increases by allowing multiple jobs to run simultaneously
      – Up to 30% increase in system productivity

    [Chart: STAR-CCM+ Productivity (Auto Aerodynamics Test) — iterations/day vs. number of nodes (4-24), single job vs. four simultaneous jobs, HP MPI over InfiniBand. Higher is better.]

  • 11

    STAR-CCM+ Productivity Results - Interconnect

    • Test case – Four jobs, each on two cores per server

    • InfiniBand DDR provides higher productivity compared to 10GigE – Up to 25% more iterations per day

    • InfiniBand maintains consistent scalability as cluster size increases

    [Chart: STAR-CCM+ Productivity (Auto Aerodynamics Test) — iterations/day vs. number of nodes (4-24), 10GigE vs. InfiniBand DDR, HP MPI. Higher is better.]

  • 12

    STAR-CD Benchmark Results - Interconnect

    • Test case – Single job over the entire system – Input Dataset (A-Class)

    • InfiniBand DDR enhances performance and scalability – Up to 34% and 10% more jobs/day compared to GigE and 10GigE respectively – Performance advantage of InfiniBand increases as cluster size scales

    [Chart: STAR-CD performance advantage (%) of 10GigE and InfiniBand DDR over GigE (A-Class) vs. number of nodes (4-24), Platform MPI. Higher is better.]

  • 13

    Maximize STAR-CD Performance per Core

    [Chart: STAR-CD Benchmark Results (A-Class) — total elapsed time (s) for 16, 32, and 48 total cores when using four, two, or one core(s) per processor, Platform MPI. Lower is better.]

    • Test case
      – Single job over the entire system
      – Using one, two, or four cores in each quad-core AMD processor; remaining cores kept idle

    • Using partial cores per simulation improves single-job run time

    • Running multiple simulations simultaneously is recommended for maximum productivity (slide 10)


  • 14

    STAR-CCM+ Profiling – MPI Functions

    • MPI_Testall, MPI_Bcast, and MPI_Recv are the most heavily used MPI functions (a profiling sketch follows the chart below)

    [Chart: STAR-CCM+ MPI Profiling — number of messages (log scale) per MPI function (MPI_Testall, MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Recv) for 4, 8, 16, and 22 nodes.]
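    As an illustration of how per-function call counts like those in the chart can be collected (this is a hypothetical sketch, not the profiler used in this study), a minimal PMPI interposition layer in C can intercept an MPI call, count it, and forward it to the real implementation. The wrapped function (MPI_Bcast) and the variable names are ours; a real tool would wrap every function of interest and aggregate across ranks.

```c
/* Minimal PMPI sketch: count MPI_Bcast calls and bytes on each rank.
 * Link this ahead of (or preload it over) the MPI application; the MPI_*
 * symbols here shadow the library's, and PMPI_* reaches the real ones. */
#include <mpi.h>
#include <stdio.h>

static long bcast_calls = 0;   /* MPI_Bcast invocations on this rank */
static long bcast_bytes = 0;   /* total payload broadcast by this rank */

int MPI_Bcast(void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm)
{
    int type_size;
    PMPI_Type_size(type, &type_size);
    bcast_calls += 1;
    bcast_bytes += (long)count * type_size;
    return PMPI_Bcast(buf, count, type, root, comm);   /* forward to the real call */
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: MPI_Bcast calls=%ld bytes=%ld\n", rank, bcast_calls, bcast_bytes);
    return PMPI_Finalize();
}
```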

  • 15

    STAR-CCM+ Profiling – Data Transferred

    [Chart: STAR-CCM+ MPI Profiling — number of messages (millions) by message size bucket, from [0..64B] through [4M..infinity], for 4, 8, 16, and 22 nodes.]

    • Most MPI messages are within 4KB in size

    • The number of messages increases with cluster size

  • 16

    STAR-CCM+ Profiling – Timing

    • MPI_Testall, MPI_Bcast, and MPI_Recv have relatively large overhead (see the polling sketch after the chart)

    • Overhead increases with cluster size

    [Chart: STAR-CCM+ MPI Profiling — total overhead (s) per MPI function (MPI_Testall, MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Recv) for 4, 8, 16, and 22 nodes.]
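    For context on why MPI_Testall dominates: codes that overlap communication with computation typically post nonblocking sends and receives and then poll them repeatedly, and every pass through the polling loop is one MPI_Testall call. The ring exchange below is a hypothetical, self-contained example of that pattern (it is not STAR-CCM+ code).

```c
/* Nonblocking ring exchange polled with MPI_Testall. */
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank, size, done = 0;
    long polls = 0;
    double sendbuf[N], recvbuf[N];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (int i = 0; i < N; i++) sendbuf[i] = rank;

    /* Post a halo-style exchange with the ring neighbours. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, (rank - 1 + size) % size, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, (rank + 1) % size, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Poll until both requests complete; local work would be overlapped here. */
    while (!done) {
        MPI_Testall(2, reqs, &done, MPI_STATUSES_IGNORE);
        polls++;
        /* ... interior computation could run between polls ... */
    }

    printf("rank %d finished the exchange after %ld MPI_Testall calls\n", rank, polls);
    MPI_Finalize();
    return 0;
}
```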

  • 17

    STAR-CCM+ Profiling Summary

    • STAR-CCM+ was profiled to determine its networking dependency

    • Most-used message sizes: the majority of MPI messages are under 4KB (slide 15)

  • 18

    STAR-CD Profiling – MPI Functions

    • MPI_Sendrecv and MPI_Allreduce are the most heavily used MPI functions

    [Chart: STAR-CD MPI Profiling — number of messages (millions) per MPI function (MPI_Allgather, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Sendrecv) for 4, 8, 12, 16, 20, and 24 nodes.]

  • 19

    STAR-CD Profiling – Data Transferred

    • Most point-to-point MPI messages are between 1KB and 8KB in size (a size-bucketing sketch follows the chart)

    • The number of messages increases with cluster size

    [Chart: STAR-CD MPI Profiling (MPI_Sendrecv) — number of messages (millions) by message size bucket ([0-128B>, [128B-1K>, [1K-8K>, [8K-256K>, [256K-1M>) for 4-22 nodes.]
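    The size buckets above can be reproduced with a small PMPI wrapper around MPI_Sendrecv. The sketch below is illustrative only (it assumes MPI-3 const-correct prototypes and is not the profiler used here): it bins the send payload of every MPI_Sendrecv into the chart's ranges and prints rank 0's local histogram at shutdown; a real tool would also reduce the counts across ranks.

```c
/* PMPI sketch: histogram MPI_Sendrecv payload sizes into the chart's buckets. */
#include <mpi.h>
#include <stdio.h>

/* Upper bucket edges: [0-128B>, [128B-1K>, [1K-8K>, [8K-256K>, [256K-1M>.
 * Messages of 1MB or more are not counted in this sketch. */
static const long edges[5] = { 128, 1024, 8192, 262144, 1048576 };
static long counts[5];

int MPI_Sendrecv(const void *sbuf, int scount, MPI_Datatype stype, int dest, int stag,
                 void *rbuf, int rcount, MPI_Datatype rtype, int src, int rtag,
                 MPI_Comm comm, MPI_Status *status)
{
    int type_size;
    PMPI_Type_size(stype, &type_size);
    long bytes = (long)scount * type_size;
    for (int b = 0; b < 5; b++)
        if (bytes < edges[b]) { counts[b]++; break; }
    return PMPI_Sendrecv(sbuf, scount, stype, dest, stag,
                         rbuf, rcount, rtype, src, rtag, comm, status);
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)   /* rank 0's local histogram only */
        printf("MPI_Sendrecv sizes: <128B=%ld <1K=%ld <8K=%ld <256K=%ld <1M=%ld\n",
               counts[0], counts[1], counts[2], counts[3], counts[4]);
    return PMPI_Finalize();
}
```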

  • 20

    STAR-CD Profiling – Data Transferred (MPI_Allreduce)

    • Most MPI collective messages are smaller than 128 bytes

    • The number of messages increases with cluster size

    [Chart: STAR-CD MPI Profiling (MPI_Allreduce) — number of messages (millions) by message size bucket ([0-128B> through [256K-1M>) for 4, 8, 12, 16, 20, and 24 nodes.]

  • 21

    STAR-CD Profiling – Timing

    [Chart: STAR-CD MPI Profiling — total overhead (s) per MPI function (MPI_Allgather, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Sendrecv) for 4, 8, 12, 16, 20, and 24 nodes.]

    • MPI_Allreduce and MPI_Sendrecv have large overhead

    • Overhead increases with cluster size

  • 22

    MPI Performance Comparison – MPI_Sendrecv

    • HP MPI demonstrates better performance for large messages (a micro-benchmark sketch follows the charts)

    [Charts: MPI_Sendrecv bandwidth (MB/s) vs. message size (128B-8KB) at 32 and 192 processes, Platform MPI vs. HP-MPI over InfiniBand DDR. Higher is better.]
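    A pairwise bandwidth test in the spirit of the curves above (not the actual benchmark used here) can be written directly against MPI_Sendrecv; the message sizes follow the 128B-8KB range plotted. Run with an even number of ranks under either MPI library, it reports one-direction bandwidth per message size as seen by rank 0.

```c
/* Pairwise MPI_Sendrecv bandwidth sketch for 128B-8KB messages. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size % 2 != 0) {                       /* the pairing below needs an even rank count */
        if (rank == 0) fprintf(stderr, "run with an even number of ranks\n");
        MPI_Finalize();
        return 1;
    }

    int peer = rank ^ 1;                       /* pair ranks 0-1, 2-3, ... */
    for (int bytes = 128; bytes <= 8192; bytes *= 2) {
        char *sendbuf = malloc(bytes), *recvbuf = malloc(bytes);
        memset(sendbuf, 0, bytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++)
            MPI_Sendrecv(sendbuf, bytes, MPI_CHAR, peer, 0,
                         recvbuf, bytes, MPI_CHAR, peer, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double elapsed = MPI_Wtime() - t0;
        if (rank == 0)                         /* one-direction bandwidth on rank 0 */
            printf("%5d bytes: %8.1f MB/s\n", bytes,
                   (double)bytes * ITERS / elapsed / 1e6);
        free(sendbuf);
        free(recvbuf);
    }
    MPI_Finalize();
    return 0;
}
```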

  • 23

    MPI Performance Comparison – MPI_Allreduce

    • HP MPI demonstrates lower MPI_Allreduce runtime for small messages (a latency sketch follows the charts)

    [Charts: MPI_Allreduce total time (usec) vs. small message sizes at 32 and 192 processes, Platform MPI vs. HP-MPI over InfiniBand DDR. Lower is better.]
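    Small-message MPI_Allreduce latency, as compared above, can be measured with a simple timing loop like the hypothetical sketch below (not the exact tool used for these charts); it averages the time per call over many iterations for a few small element counts.

```c
/* Average MPI_Allreduce latency for small double-precision vectors. */
#include <mpi.h>
#include <stdio.h>

#define ITERS 10000

int main(int argc, char **argv)
{
    int rank;
    double in[8], out[8];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 8; i++) in[i] = rank;

    for (int count = 1; count <= 8; count *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++)
            MPI_Allreduce(in, out, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double usec = (MPI_Wtime() - t0) / ITERS * 1e6;
        if (rank == 0)                 /* average latency observed by rank 0 */
            printf("%d doubles (%zu bytes): %.2f usec per MPI_Allreduce\n",
                   count, count * sizeof(double), usec);
    }
    MPI_Finalize();
    return 0;
}
```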

  • 24

    STAR-CD Benchmark Results - MPI

    • Test case – Single job over the entire system – Input Dataset (A-Class)

    • HP-MPI has slightly better performance with CPU affinity enabled (a core-pinning sketch follows the chart)

    [Chart: STAR-CD Benchmark Results (A-Class) — total elapsed time (s) vs. number of nodes (4-24), Platform MPI vs. HP-MPI. Lower is better.]
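    CPU affinity here means pinning each MPI process to a fixed core so the scheduler does not migrate it between cores; HP-MPI and Platform MPI expose their own binding options for this. As a standalone illustration of the idea only (a naive placement, not the settings used in this test), a rank can pin itself on Linux as follows:

```c
/* Naive per-rank core pinning sketch (Linux-specific). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long ncores = sysconf(_SC_NPROCESSORS_ONLN);   /* cores visible on this node */
    int core = rank % (int)ncores;                 /* naive: assumes ranks are dense per node */

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)   /* pin the calling process */
        perror("sched_setaffinity");

    printf("rank %d pinned to core %d of %ld\n", rank, core, ncores);
    MPI_Finalize();
    return 0;
}
```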

  • 25

    STAR-CD Profiling Summary

    • STAR-CD was profiled to determine its networking dependency

    • The majority of data is transferred between compute nodes
      – Mostly medium-size messages
      – The amount of data transferred increases with cluster size

    • Most-used message sizes: point-to-point messages of 1KB-8KB and collective messages under 128 bytes (slides 19-20)

  • 26

    STAR-CD Performance with Power Management

    • Test Scenario
      – 24 servers, 4-Cores/Proc

    • Nearly identical performance with power management enabled or disabled

    [Chart: STAR-CD Benchmark Results (A-Class) — total elapsed time (s) with power management enabled vs. disabled, InfiniBand DDR. Lower is better.]

  • 27

    STAR-CD Benchmark – Power Consumption

    • Power management reduces total system power consumption by 2%

    [Chart: STAR-CD Benchmark Results (A-Class) — power consumption (Watt) with power management enabled vs. disabled, InfiniBand DDR. Lower is better.]

  • 28

    Power Cost Savings with Power Management

    • Power management saves $248/year for the 24-node cluster

    • As cluster size increases, bigger savings are expected

    24-node cluster, InfiniBand DDR. $/year = total power consumption per year (kWh) × $0.20. For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
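    As a rough sanity check of these figures (assuming an average draw of roughly 7kW for the 24-node cluster, consistent with the measured power consumption on the previous slide, and the $0.20/kWh rate above): a 2% saving is about 140W; 140W over 8,760 hours is roughly 1,226 kWh per year; at $0.20/kWh that is about $245/year, in line with the ~$248/year savings quoted above.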

    [Chart: STAR-CD Benchmark Results — power cost comparison ($/year) with power management enabled vs. disabled.]

  • 29

    Power Cost Savings with Different Interconnect

    • InfiniBand saves ~$1200 and ~$4000 in power costs to finish the same number of STAR-CD jobs, compared to 10GigE and GigE respectively
      – On a yearly basis for the 24-node cluster

    • As cluster size increases, more power can be saved

    [Chart: Power cost savings ($) of InfiniBand vs. 10GigE and GigE — 24-node cluster, power management enabled.]

    $/year = total energy consumed per year (kWh) × $0.20. For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf

  • 30

    Conclusions

    • STAR-CD and STAR-CCM+ are widely used CFD simulation software packages

    • Performance and productivity rely on
      – Scalable HPC systems and interconnect solutions
      – Low-latency, high-throughput interconnect technology
      – Reasonable process distribution, which can dramatically improve performance per core

    • Interconnect comparison shows
      – InfiniBand delivers superior performance at every cluster size
      – Low-latency InfiniBand enables unparalleled scalability

    • Power management provides a 2% saving in power consumption
      – Per 24-node system with InfiniBand
      – $248 power savings per year for the 24-node cluster
      – Power savings increase with cluster size

    • InfiniBand saves power cost
      – Based on the number of jobs that can be finished with InfiniBand per year
      – InfiniBand enables $1200 and $4000 power savings compared to 10GigE and GigE respectively

  • 31

    Thank You HPC Advisory Council

    All trademarks are property of their respective owners. All information is provided “as is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.
