This is the fourth post in the CUDA Refresher series, which has the goal of refreshing key concepts in CUDA, tools, and optimization for beginning or intermediate developers.

The CUDA programming model provides an abstraction of GPU architecture that acts as a bridge between an application and its possible implementation on GPU hardware. This post outlines the main concepts of the CUDA programming model by showing how they are exposed in general-purpose programming languages like C/C++.

Let me introduce two terms widely used in the CUDA programming model: host and device.

The host is the CPU available in the system. The system memory associated with the CPU is called host memory. The GPU is called a device, and GPU memory is likewise called device memory.

To execute any CUDA program, there are three main steps:

Copy the input data from host memory to device memory, also known as host-to-device transfer.
Load the GPU program and execute, caching data on-chip for performance.
Copy the results from device memory to host memory, also called device-to-host transfer.
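
The three steps above can be sketched as follows. This is a minimal illustration rather than a canonical implementation: the kernel doubleElements and the helpers check and doubleOnDevice are hypothetical names introduced here, and the block size of 256 is an arbitrary choice. Only standard CUDA runtime calls are used (cudaMalloc, cudaMemcpy, cudaGetLastError, cudaFree).

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles each element in place.
__global__ void doubleElements(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= 2.0f;
}

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Runs the three steps on a copy of `input` and returns the doubled values.
std::vector<float> doubleOnDevice(const std::vector<float> &input)
{
    const int n = static_cast<int>(input.size());
    const size_t bytes = n * sizeof(float);
    std::vector<float> output(n);
    if (n == 0)
        return output;

    float *dData = nullptr;
    check(cudaMalloc(&dData, bytes), "cudaMalloc");

    // Step 1: host-to-device transfer.
    check(cudaMemcpy(dData, input.data(), bytes, cudaMemcpyHostToDevice),
          "host-to-device copy");

    // Step 2: launch the GPU program with enough blocks to cover n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    doubleElements<<<blocks, threads>>>(dData, n);
    check(cudaGetLastError(), "kernel launch");

    // Step 3: device-to-host transfer. On the default stream, this copy
    // runs only after the kernel above has finished.
    check(cudaMemcpy(output.data(), dData, bytes, cudaMemcpyDeviceToHost),
          "device-to-host copy");

    check(cudaFree(dData), "cudaFree");
    return output;
}

int main()
{
    std::vector<float> input(1000, 1.5f);
    std::vector<float> output = doubleOnDevice(input);
    printf("output[0] = %.1f\n", output[0]);
    return 0;
}
```

Assuming the file is saved as three_steps.cu (a name chosen here), it can be built with nvcc three_steps.cu -o three_steps.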

CUDA kernel and thread hierarchy

A CUDA kernel is a function that is executed on the GPU (Figure 1). The parallel portion of your application is executed K times in parallel by K different CUDA threads, rather than only once, as with a regular C/C++ function.

*Figure 1. The kernel is a function executed on the GPU.*

Every CUDA kernel starts with a __global__ declaration specifier. Programmers give each thread a unique global ID by combining built-in variables.

*Figure 2. CUDA kernels are subdivided into blocks.*

A group of threads is called a CUDA block. CUDA blocks are grouped into a grid. A kernel is executed as a grid of blocks of threads (Figure 2).
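
To make the unique global ID concrete, here is a small sketch for a one-dimensional grid. The kernel recordGlobalId and the host helper collectGlobalIds are hypothetical names used only for this illustration, and the sizes (30 elements, 8 threads per block) are arbitrary. Each thread combines its block index, the block size, and its index within the block, then writes the result to the matching slot.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread writes its own global ID into the slot with that index.
__global__ void recordGlobalId(int *ids, int n)
{
    // Global ID = (index of this block) * (threads per block)
    //           + (index of this thread within its block).
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    if (globalId < n)   // the last block may contain surplus threads
        ids[globalId] = globalId;
}

// Hypothetical host helper: launches enough blocks to cover n elements and
// returns the IDs the threads recorded (an empty vector on any error).
std::vector<int> collectGlobalIds(int n, int threadsPerBlock)
{
    if (n <= 0 || threadsPerBlock <= 0)
        return {};

    int *dIds = nullptr;
    if (cudaMalloc(&dIds, n * sizeof(int)) != cudaSuccess)
        return {};

    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    recordGlobalId<<<blocks, threadsPerBlock>>>(dIds, n);
    cudaError_t launchErr = cudaGetLastError();

    std::vector<int> ids(n, -1);
    cudaError_t copyErr = cudaMemcpy(ids.data(), dIds, n * sizeof(int),
                                     cudaMemcpyDeviceToHost);
    cudaFree(dIds);
    if (launchErr != cudaSuccess || copyErr != cudaSuccess)
        return {};
    return ids;
}

int main()
{
    // 30 elements with 8 threads per block gives a grid of 4 blocks
    // (32 threads in total, 2 of which fail the bounds check).
    std::vector<int> ids = collectGlobalIds(30, 8);
    bool ok = ids.size() == 30;
    for (size_t i = 0; ok && i < ids.size(); ++i)
        ok = (ids[i] == static_cast<int>(i));
    printf("%s\n", ok ? "every thread had a unique global ID"
                      : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```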

Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in the GPU (except during preemption, debugging, or CUDA dynamic parallelism). One SM can run several CUDA blocks concurrently, depending on the resources the blocks need. Each kernel is executed on one device, and CUDA supports running multiple kernels on a device at one time. Figure 3 shows kernel execution and its mapping onto the hardware resources available in the GPU.

CUDA defines built-in 3D variables for threads and blocks. Threads are indexed using the built-in 3D variable threadIdx. Three-dimensional indexing provides a natural way to index elements in vectors, matrices, and volumes, and makes CUDA programming easier. Similarly, blocks are indexed using the built-in 3D variable blockIdx.
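
As an illustration of three-dimensional indexing, the sketch below fills a small volume so that every element encodes its own (x, y, z) coordinates. The names fillVolume and makeVolume, the 8 x 8 x 4 block shape, and the row-major layout of the volume are assumptions made for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread handles one voxel, located by its 3D block and thread indices.
__global__ void fillVolume(float *vol, int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < nx && y < ny && z < nz) {
        // Assumed layout: x varies fastest, then y, then z.
        int linear = (z * ny + y) * nx + x;
        vol[linear] = x + 10.0f * y + 100.0f * z;
    }
}

// Hypothetical host helper: returns the filled volume (empty on error).
std::vector<float> makeVolume(int nx, int ny, int nz)
{
    if (nx <= 0 || ny <= 0 || nz <= 0)
        return {};
    size_t count = static_cast<size_t>(nx) * ny * nz;

    float *dVol = nullptr;
    if (cudaMalloc(&dVol, count * sizeof(float)) != cudaSuccess)
        return {};

    dim3 threadsPerBlock(8, 8, 4);   // 256 threads per block
    dim3 numBlocks((nx + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (ny + threadsPerBlock.y - 1) / threadsPerBlock.y,
                   (nz + threadsPerBlock.z - 1) / threadsPerBlock.z);
    fillVolume<<<numBlocks, threadsPerBlock>>>(dVol, nx, ny, nz);
    cudaError_t launchErr = cudaGetLastError();

    std::vector<float> vol(count);
    cudaError_t copyErr = cudaMemcpy(vol.data(), dVol, count * sizeof(float),
                                     cudaMemcpyDeviceToHost);
    cudaFree(dVol);
    if (launchErr != cudaSuccess || copyErr != cudaSuccess)
        return {};
    return vol;
}

int main()
{
    const int nx = 5, ny = 3, nz = 2;
    std::vector<float> vol = makeVolume(nx, ny, nz);
    if (vol.empty()) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    // Print the voxel at (x=4, y=2, z=1), which encodes its own coordinates.
    printf("vol(4,2,1) = %.0f\n", vol[(1 * ny + 2) * nx + 4]);
    return 0;
}
```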

A few points are worth noting:

The CUDA architecture limits the number of threads per block to 1024.
The dimension of the thread block is accessible within the kernel through the built-in blockDim variable.
All threads within a block can be synchronized using the intrinsic function __syncthreads. With __syncthreads, every thread in the block must reach the barrier before any of them can proceed.
The number of threads per block and the number of blocks per grid specified in the <<<…>>> syntax can be of type int or dim3. These triple angle brackets mark a call from host code to device code, which is also called a kernel launch.
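
The role of __syncthreads can be sketched with a kernel that reverses the elements handled by each block. This is an illustrative example, not code from the original post: reverseWithinBlock, reverseBlocks, and kBlockSize are hypothetical names, and the array length is assumed to be an exact multiple of the block size. The barrier guarantees that every thread has written its element to the block's shared tile before any thread reads a different element back. The launch here passes plain int values inside the triple angle brackets.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlockSize = 64;   // threads per block (arbitrary choice)

// Reverses the kBlockSize elements owned by each block.
// Assumes the launch uses exactly kBlockSize threads per block.
__global__ void reverseWithinBlock(int *data)
{
    __shared__ int tile[kBlockSize];   // visible to all threads in the block

    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];

    // Barrier: no thread continues until every thread in the block has
    // stored its element, so the read below never sees a stale value.
    __syncthreads();

    data[base + t] = tile[blockDim.x - 1 - t];
}

// Hypothetical host helper: fills 0..n-1, reverses each block-sized chunk
// on the GPU, and returns the result (empty on error).
std::vector<int> reverseBlocks(int numBlocks)
{
    if (numBlocks <= 0)
        return {};
    const int n = numBlocks * kBlockSize;
    std::vector<int> host(n);
    for (int i = 0; i < n; ++i)
        host[i] = i;

    int *dData = nullptr;
    if (cudaMalloc(&dData, n * sizeof(int)) != cudaSuccess)
        return {};
    cudaError_t err = cudaMemcpy(dData, host.data(), n * sizeof(int),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        reverseWithinBlock<<<numBlocks, kBlockSize>>>(dData);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), dData, n * sizeof(int),
                         cudaMemcpyDeviceToHost);
    cudaFree(dData);
    if (err != cudaSuccess)
        return {};
    return host;
}

int main()
{
    std::vector<int> result = reverseBlocks(2);
    if (result.empty()) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    printf("first element of block 0: %d\n", result[0]);
    printf("first element of block 1: %d\n", result[kBlockSize]);
    return 0;
}
```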

The CUDA program below adds two matrices and shows multi-dimensional blockIdx and threadIdx along with other variables such as blockDim. A 2D block is chosen for ease of indexing, and each block has 256 threads, 16 each in the x and y directions. The total number of blocks is computed by dividing the data size by the size of each block, rounding up so that the whole matrix is covered.

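// Kernel - Adding two matrices MatA and MatB
// N is the matrix dimension, assumed to be a compile-time constant
// defined elsewhere.
__global__ void MatAdd(float MatA[N][N], float MatB[N][N],
                       float MatC[N][N])
{
    // Global 2D index of the element this thread is responsible for
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    // Threads that fall outside the matrix do nothing
    if (i < N && j < N)
        MatC[i][j] = MatA[i][j] + MatB[i][j];
}

int main()
{
    ...
    // Matrix addition kernel launch from host code.
    // MatA, MatB, and MatC must refer to memory accessible from the device.
    dim3 threadsPerBlock(16, 16);
    // Round up so that partially filled blocks cover the matrix edges
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(MatA, MatB, MatC);
    ...
}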

Memory hierarchy

CUDA-capable GPUs have a memory hierarchy as depicted in Figure 4.

The following memories are exposed by the GPU architecture:

Registers: These are private to each thread, which means that registers assigned to a thread are not visible to other threads. The compiler makes decisions about register utilization.
L1/Shared memory (SMEM): Every SM has a fast, on-chip scratchpad memory that can be used as L1 cache and shared memory. All threads in a CUDA block can share shared memory, and all CUDA blocks running on a given SM can share the physical memory resource provided by the SM.
Read-only memory: Each SM has an instruction cache, constant memory, texture memory, and RO cache, which are read-only to kernel code.
L2 cache: The L2 cache is shared across all SMs, so every thread in every CUDA block can access this memory. The NVIDIA A100 GPU has increased the L2 cache size to 40 MB, compared to 6 MB in V100 GPUs.
Global memory: This is the framebuffer of the GPU, that is, the DRAM sitting in the GPU.
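
The sizes of several of these memories can be read at runtime from the cudaDeviceProp structure. The sketch below is only an illustration: printMemorySizes is a hypothetical helper, and device 0 is assumed to be the GPU of interest. The reported values depend on the GPU and are not shown here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints memory-related properties of one device.
// Returns false if the properties could not be queried.
bool printMemorySizes(int device)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return false;
    }

    printf("Device %d: %s\n", device, prop.name);
    printf("  32-bit registers per block: %d\n", prop.regsPerBlock);
    printf("  Shared memory per block:    %zu bytes\n", prop.sharedMemPerBlock);
    printf("  Constant memory:            %zu bytes\n", prop.totalConstMem);
    printf("  L2 cache:                   %d bytes\n", prop.l2CacheSize);
    printf("  Global memory:              %zu bytes\n", prop.totalGlobalMem);
    return true;
}

int main()
{
    // Assumption: device 0 is the GPU of interest.
    return printMemorySizes(0) ? 0 : 1;
}
```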

The NVIDIA CUDA compiler does a good job of optimizing the use of memory resources, but an expert CUDA developer can choose to use this memory hierarchy explicitly to optimize CUDA programs as needed.
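
As one hedged example of choosing a memory space explicitly, the sketch below places polynomial coefficients in constant memory with the __constant__ qualifier and copies them there with cudaMemcpyToSymbol. The names coeffs, evalQuadratic, and evaluate are hypothetical, and whether constant memory helps in a real program depends on the access pattern. Local scalars such as i and v are typically held in registers by the compiler, while x and y reside in global memory.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Constant memory: read-only to kernel code, written from the host.
__constant__ float coeffs[3];

// Evaluates c0 + c1*x + c2*x^2 for every input element.
__global__ void evalQuadratic(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];   // load from global memory into a local value
        y[i] = coeffs[0] + v * (coeffs[1] + v * coeffs[2]);
    }
}

// Hypothetical host helper: returns the polynomial values (empty on error).
std::vector<float> evaluate(const std::vector<float> &xs, const float (&c)[3])
{
    const int n = static_cast<int>(xs.size());
    if (n == 0)
        return {};
    const size_t bytes = n * sizeof(float);

    float *dX = nullptr;
    float *dY = nullptr;
    if (cudaMalloc(&dX, bytes) != cudaSuccess)
        return {};
    if (cudaMalloc(&dY, bytes) != cudaSuccess) {
        cudaFree(dX);
        return {};
    }

    std::vector<float> ys(n);
    // Copy the coefficients into constant memory, then the inputs into
    // global memory.
    cudaError_t err = cudaMemcpyToSymbol(coeffs, c, sizeof(c));
    if (err == cudaSuccess)
        err = cudaMemcpy(dX, xs.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int threads = 128;
        evalQuadratic<<<(n + threads - 1) / threads, threads>>>(dX, dY, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(ys.data(), dY, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dX);
    cudaFree(dY);
    if (err != cudaSuccess)
        return {};
    return ys;
}

int main()
{
    const float c[3] = {1.0f, 2.0f, 3.0f};   // 1 + 2x + 3x^2
    std::vector<float> ys = evaluate({0.0f, 1.0f, 2.0f}, c);
    if (ys.empty()) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    for (float y : ys)
        printf("%.1f\n", y);
    return 0;
}
```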

Compute capability

The compute capability of a GPU determines its general specifications and available features supported by the GPU hardware. This version number can be used by applications at runtime to determine which hardware features or instructions are available on the present GPU.

Every GPU comes with a version number denoted as X.Y, where X is the major revision number and Y is the minor revision number. The minor revision number corresponds to an incremental improvement to the architecture, possibly including new features.

For more information about the compute capability of any CUDA-enabled device, see the CUDA sample code deviceQuery. This sample enumerates the properties of the CUDA devices present in the system.
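
A much smaller sketch in the same spirit is shown below; it is not the deviceQuery sample itself. It lists each device's compute capability using the major and minor fields of cudaDeviceProp and shows how an application might gate a code path at runtime. The helper hasComputeCapability is a hypothetical name, and the 7.0 threshold is an arbitrary example rather than a requirement of any particular feature.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if compute capability major.minor is at least
// reqMajor.reqMinor.
bool hasComputeCapability(int major, int minor, int reqMajor, int reqMinor)
{
    return major > reqMajor || (major == reqMajor && minor >= reqMinor);
}

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess)
            continue;

        printf("Device %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);

        // Example runtime gate; the 7.0 threshold is arbitrary.
        if (hasComputeCapability(prop.major, prop.minor, 7, 0))
            printf("  meets the example 7.0 threshold\n");
        else
            printf("  below the example 7.0 threshold\n");
    }
    return 0;
}
```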

Summary

The CUDA programming model provides a heterogeneous environment where the host code runs the C/C++ program on the CPU and the kernel runs on a physically separate GPU device. The CUDA programming model also assumes that the host and the device maintain their own separate memory spaces, referred to as host memory and device memory, respectively. CUDA code also provides for data transfer between host and device memory over the PCIe bus.

CUDA exposes many built-in variables and provides the flexibility of multi-dimensional indexing to ease programming. It also manages different memories, including registers, shared memory and L1 cache, L2 cache, and global memory. Advanced developers can use some of these memories efficiently to optimize a CUDA program.
