
CUDPP and MPS

CUDPP (CUDA Data Parallel Primitives) is an older open-source CUDA library that provides parallel prefix sums (scans), sorts, and other data-parallel primitives. In LAMMPS, building with CUDPP_OPT=yes uses this library for GPU-based binning during neighbor-list construction and related tasks. For most modern GPUs (Pascal and later), however, CUDPP is generally unnecessary and not recommended. Note that CUDPP_OPT and CUDA MPS support are mutually exclusive: you cannot enable both in the same LAMMPS build.
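As a sketch, a CMake configuration enabling the GPU package with CUDPP might look like the following. The build-directory layout and the GPU_ARCH value are placeholders; check the LAMMPS build documentation for the option names supported by your version.

```shell
# Run from a build directory inside the LAMMPS source tree (hypothetical layout).
# CUDPP_OPT and CUDA_MPS_SUPPORT cannot both be "yes".
cmake ../cmake \
  -D PKG_GPU=on \
  -D GPU_API=cuda \
  -D GPU_ARCH=sm_60 \
  -D CUDPP_OPT=yes \
  -D CUDA_MPS_SUPPORT=no
cmake --build . --parallel
```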

 

MPS (Multi-Process Service) is an NVIDIA feature allowing multiple processes (e.g., multiple MPI ranks) to share a single GPU more efficiently. If you enable CUDA_MPS_SUPPORT=yes, LAMMPS applies special tweaks so it can properly run under an active nvidia-cuda-mps daemon. Typically, you only need this if you plan on running multiple MPI processes per GPU. If you don’t specifically need multi-process GPU sharing, you can leave MPS support turned off.

 

1. One GPU, multiple MPI ranks:

If you have a single NVIDIA GPU and start multiple MPI processes (e.g., 2, 4, 8 ranks, etc.) that all try to use that same GPU, enabling MPS often helps.

MPS allows these processes to run GPU kernels concurrently without high overhead from context switching.

2. One GPU, one MPI rank:

If your typical workflow is just one MPI process per node/GPU, MPS provides no benefit: only one process uses the GPU, so there is no sharing overhead for MPS to reduce.

3. Extra steps to use MPS:

You need to enable and run the nvidia-cuda-mps daemon on the system.

You must build LAMMPS with -D CUDA_MPS_SUPPORT=yes (and CUDPP_OPT=no).

 

In other words, for multi-MPI-rank setups sharing one GPU, MPS can yield better utilization. Otherwise, you can safely leave MPS off.
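Putting the steps above together, a minimal session might look like this. It assumes a CUDA-enabled LAMMPS executable named lmp and an input script in.lj; both names are placeholders.

```shell
# 1. Start the MPS control daemon (often done by an administrator).
nvidia-cuda-mps-control -d

# 2. Build LAMMPS with MPS support on and CUDPP off.
cmake ../cmake -D PKG_GPU=on -D GPU_API=cuda \
      -D CUDA_MPS_SUPPORT=yes -D CUDPP_OPT=no
cmake --build . --parallel

# 3. Run several MPI ranks that share one GPU through MPS.
mpirun -np 4 ./lmp -sf gpu -pk gpu 1 -in in.lj

# 4. Stop the daemon when finished.
echo quit | nvidia-cuda-mps-control
```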

 

Factors to Consider

1. GPU Memory and Large Problems

A single high-performance GPU usually has a larger memory pool.

Large HPC simulations (like those in LAMMPS) can benefit from this larger memory for bigger systems.

Multiple mid-range GPUs each have less memory. If your simulation needs more memory than a single mid-range card provides, the per-card limit can become the bottleneck even when the aggregate memory across all cards would be sufficient.

2. Performance Scaling

LAMMPS can run in parallel across multiple GPUs through its MPI domain decomposition, with each rank offloading its subdomain's work to a GPU (provided the build includes GPU support).

However, there is overhead from communication between GPUs (especially if they are on separate PCIe connections, or if they are in different nodes).

For some workloads, weak scaling or strong scaling across multiple GPUs might not provide linear speedup.

3. Single Large Simulation vs. Multiple Smaller Jobs

If you only ever run one large simulation at a time, a single high-performance GPU is often simpler and faster.

If you need to run several smaller simulations in parallel (e.g., parameter sweeps, multiple independent replicates), multiple mid-range GPUs can help you process them simultaneously.
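For example, independent jobs in a parameter sweep can each be pinned to their own card with CUDA_VISIBLE_DEVICES. The input-file and log names here are placeholders.

```shell
# Each job sees exactly one GPU, so the two jobs run concurrently
# without contending for the same device.
CUDA_VISIBLE_DEVICES=0 ./lmp -sf gpu -pk gpu 1 -in in.sweep_a > log.a &
CUDA_VISIBLE_DEVICES=1 ./lmp -sf gpu -pk gpu 1 -in in.sweep_b > log.b &
wait
```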

4. Cost and Availability

Sometimes a single top-tier GPU (e.g., an NVIDIA A100 or RTX 4090) might be more expensive (and possibly more power-hungry) than several mid-range GPUs.

You may get more aggregate GPU memory by using multiple cards. But then you have to consider board space (PCIe slots), cooling, and power constraints.

5. MPI + GPU Offload

Just because you use mpiexec does not necessarily mean each MPI process needs its own GPU. You could also have multiple MPI ranks sharing one GPU (via MPS or just concurrent usage, although sharing typically has overhead).

For high throughput, ideally you run one MPI rank per GPU (or a small number of MPI ranks per GPU).
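As a sketch, both layouts can be expressed with the LAMMPS command-line switches (executable and input names are placeholders):

```shell
# One rank per GPU: 2 ranks spread over 2 GPUs.
mpirun -np 2 ./lmp -sf gpu -pk gpu 2 -in in.lj

# Several ranks sharing GPUs: 8 ranks split across the same 2 GPUs
# (a case where MPS is recommended).
mpirun -np 8 ./lmp -sf gpu -pk gpu 2 -in in.lj
```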

 

When One High-End GPU Is Better

You have one large simulation that needs a lot of GPU memory and benefits from maximum single-GPU FLOPs (floating-point operations per second).

You want less complexity in terms of multi-GPU communication or scheduling.

You have enough budget to afford the high-end GPU.

 

When Multiple Mid-Range GPUs Are Better

You run multiple smaller jobs at the same time (e.g., a parameter sweep of many smaller LAMMPS simulations).

Your code (LAMMPS + custom modifications, for instance) scales well across multiple GPUs and the overhead is not too large.

You need a large aggregate GPU memory across multiple cards (and the combined memory is bigger than a single GPU’s memory).

A single high-end GPU is not available or is cost-prohibitive, whereas multiple mid-range GPUs fit your budget better.

 

Conclusion

For single, large HPC-type simulations in LAMMPS, a single high-performance GPU generally wins on simplicity and raw performance.

If you run many smaller simulations in parallel or can scale across multiple GPUs effectively, then multiple mid-range GPUs might provide better overall throughput.

 

Ultimately, the choice depends on your simulation size, memory needs, parallel scaling, and budget.
