Resources (12)
Matrix Methods for Analysis of Structure in Data Sets
2014-06-04
MPI-CUDA implementation for Flow Computations on Multi-GPU Clusters
Modern graphics processing units (GPUs) with many-core architectures have emerged as general-purpose parallel computing platforms that can accelerate simulation science applications tremendously. While multi-GPU workstations with several TeraFLOPS of peak computing power are available to accelerate computational problems, larger problems require even more resources. Conventional clusters of central processing units (CPUs) are now being augmented with multiple GPUs in each compute node to tackle large problems. The heterogeneous architecture of a multi-GPU cluster with a deep memory hierarchy creates unique challenges in developing scalable and efficient simulation codes. In this study, we pursue mixed MPI-CUDA implementations and investigate three strategies to probe the efficiency and scalability of incompressible flow computations on the Lincoln Tesla cluster at the National Center for Supercomputing Applications (NCSA). We exploit some of the advanced features of MPI and CUDA programming to overlap both GPU data transfers and MPI communications with computations on the GPU. We sustain approximately 2.4 TeraFLOPS on the 64 nodes of the NCSA Lincoln Tesla cluster using 128 GPUs with a total of 30,720 processing elements. Our results demonstrate that multi-GPU clusters can substantially accelerate computational fluid dynamics (CFD) simulations.
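The overlap of GPU data transfers and MPI communication with kernel execution described above is, at heart, a double-buffered pipeline: while chunk i is being computed, chunk i+1 is already in flight. A minimal sketch of that scheduling idea, using host threads with placeholder `transfer` and `compute` functions standing in for `cudaMemcpyAsync` and kernel launches (all names here are illustrative, not from the paper):

```python
from concurrent.futures import ThreadPoolExecutor

def transfer(chunk):
    # stand-in for an asynchronous host-to-device copy
    return list(chunk)

def compute(chunk):
    # stand-in for a GPU kernel; here it just squares each element
    return [x * x for x in chunk]

def pipelined(chunks):
    """Overlap the transfer of chunk i+1 with the compute of chunk i."""
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = pool.submit(transfer, chunks[0])
        for nxt in chunks[1:]:
            staged = pending.result()             # wait for previous transfer
            pending = pool.submit(transfer, nxt)  # start next transfer...
            results.extend(compute(staged))       # ...while computing current
        results.extend(compute(pending.result()))
    return results
```

The same structure applies whether the "transfer" is a PCIe copy or an MPI halo exchange: the key is that the wait for chunk i+1 happens concurrently with the work on chunk i.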
2011-10-24
Hybrid CUDA, OpenMP, and MPI parallel programming on multicore GPU clusters
Nowadays, NVIDIA's CUDA is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores; scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a parallel programming approach using hybrid CUDA, OpenMP, and MPI programming, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one Tesla C1060 and one Tesla S1070. Loop iterations assigned to an MPI process are then executed in parallel by CUDA on the GPUs of the same compute node.
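The partitioning step that the abstract describes amounts to splitting a loop's iteration space into contiguous blocks, one per MPI rank. A minimal sketch of such a scheme, with a hypothetical `partition` helper (the paper's actual decomposition may differ):

```python
def partition(n_iters, n_ranks, rank):
    """Return the half-open range [start, stop) of loop iterations owned by
    one MPI rank; remainder iterations go to the lowest-numbered ranks."""
    base, extra = divmod(n_iters, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop
```

In the hybrid setting, each MPI process would call this with its own rank, then hand its slice to a CUDA kernel (or an OpenMP-parallel loop) on its local node.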
2011-10-24
UMFPACK 5.5.1 User Guide
UMFPACK is a set of routines for solving unsymmetric sparse linear systems Ax = b.
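UMFPACK itself is a C library (called via the umfpack_di_symbolic / umfpack_di_numeric / umfpack_di_solve sequence), but the same solve can be sketched from Python through SciPy, whose `spsolve` dispatches to UMFPACK when scikit-umfpack is installed and otherwise falls back to SuperLU. A small sketch on the 5x5 unsymmetric system used in UMFPACK's own demo:

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import spsolve

# The 5x5 unsymmetric test system from the UMFPACK demo program.
A = csc_matrix(np.array([
    [2.0,  3.0,  0.0, 0.0, 0.0],
    [3.0,  0.0,  4.0, 0.0, 6.0],
    [0.0, -1.0, -3.0, 2.0, 0.0],
    [0.0,  0.0,  1.0, 0.0, 0.0],
    [0.0,  4.0,  2.0, 0.0, 1.0],
]))
b = np.array([8.0, 45.0, -3.0, 3.0, 19.0])

# Sparse direct solve; UMFPACK is used if scikit-umfpack is available.
x = spsolve(A, b)
```

CSC (compressed sparse column) is the storage format UMFPACK expects, which is why the matrix is built as `csc_matrix` here.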
2011-03-08