
How to Access Global Memory Efficiently in CUDA C/C++ Kernels


In the previous two posts we looked at how to move data efficiently between the host and device. In this sixth post of our CUDA C/C++ series we discuss how to efficiently access device memory, in particular global memory, from within kernels.

There are several kinds of memory on a CUDA device, each with different scope, lifetime, and caching behavior. So far in this series we have used global memory, which resides in device DRAM, for transfers between the host and device as well as for the data input to and output from kernels. The name global here refers to scope, as it can be accessed and modified from both the host and the device. Global memory can be declared in global (variable) scope using the __device__ declaration specifier as in the first line of the following code snippet, or dynamically allocated using cudaMalloc() and assigned to a regular C pointer variable as in line 7. Global memory allocations can persist for the lifetime of the application. Depending on the compute capability of the device, global memory may or may not be cached on the chip.

__device__ int globalArray[256];

void foo()
{
    ...
    int *myDeviceMemory = 0;
    cudaError_t result = cudaMalloc(&myDeviceMemory, 256 * sizeof(int));
    ...
}

Before we go into global memory access performance, we need to refine our understanding of the CUDA execution model. We have discussed how threads are grouped into thread blocks, which are assigned to multiprocessors on the device. During execution there is a finer grouping of threads into warps. Multiprocessors on the GPU execute instructions for each warp in SIMD (Single Instruction Multiple Data) fashion. The warp size (effectively the SIMD width) of all current CUDA-capable GPUs is 32 threads.
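
As a purely illustrative example of this mapping (the kernel name whoAmI and the snippet itself are not part of the benchmark code below), the following sketch computes each thread's warp and lane indices within a one-dimensional thread block; threads with consecutive values of threadIdx.x fall in the same warp. Note that device-side printf requires compute capability 2.0 or higher.

#include <stdio.h>

// Illustrative only: how threads of a 1D block map to warps and lanes.
// warpSize is a built-in device variable (32 on all current CUDA GPUs).
__global__ void whoAmI()
{
  int globalIdx = blockDim.x * blockIdx.x + threadIdx.x;
  int laneId    = threadIdx.x % warpSize;  // position within the warp (0..31)
  int warpId    = threadIdx.x / warpSize;  // which warp of the block this thread belongs to
  if (laneId == 0)
    printf("block %d, warp %d starts at global index %d\n",
           blockIdx.x, warpId, globalIdx);
}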

Global Memory Coalescing

Grouping of threads into warps is not only relevant to computation, but also to global memory accesses. The device coalesces global memory loads and stores issued by threads of a warp into as few transactions as possible, minimizing the DRAM bandwidth consumed (on older hardware of compute capability less than 2.0, transactions are coalesced within half warps of 16 threads rather than whole warps). To make clear the conditions under which coalescing occurs across CUDA device architectures, we run some simple experiments on three Tesla cards: a Tesla C870 (compute capability 1.0), a Tesla C1060 (compute capability 1.3), and a Tesla C2050 (compute capability 2.0).

We run two experiments that use variants of an increment kernel shown in the following code (also available on GitHub), one with an array offset that can cause misaligned accesses to the input array, and the other with strided accesses to the input array.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

// Convenience function for checking CUDA runtime API results
// can be wrapped around any runtime API call. No-op in release builds.
inline
cudaError_t checkCuda(cudaError_t result)
{
#if defined(DEBUG) || defined(_DEBUG)
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
#endif
  return result;
}

template <typename T>
__global__ void offset(T* a, int s)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x + s;
  a[i] = a[i] + 1;
}

template <typename T>
__global__ void stride(T* a, int s)
{
  int i = (blockDim.x * blockIdx.x + threadIdx.x) * s;
  a[i] = a[i] + 1;
}

template <typename T>
void runTest(int deviceId, int nMB)
{
  int blockSize = 256;
  float ms;

  T *d_a;
  cudaEvent_t startEvent, stopEvent;

  int n = nMB*1024*1024/sizeof(T);

  // NB:  d_a(33*nMB) for stride case
  checkCuda( cudaMalloc(&d_a, n * 33 * sizeof(T)) );

  checkCuda( cudaEventCreate(&startEvent) );
  checkCuda( cudaEventCreate(&stopEvent) );

  printf("Offset, Bandwidth (GB/s):\n");

  offset<<<n/blockSize, blockSize>>>(d_a, 0); // warm up

  for (int i = 0; i <= 32; i++) {
    checkCuda( cudaMemset(d_a, 0, n * sizeof(T)) );

    checkCuda( cudaEventRecord(startEvent,0) );
    offset<<<n/blockSize, blockSize>>>(d_a, i);
    checkCuda( cudaEventRecord(stopEvent,0) );
    checkCuda( cudaEventSynchronize(stopEvent) );

    checkCuda( cudaEventElapsedTime(&ms, startEvent, stopEvent) );
    printf("%d, %f\n", i, 2*nMB/ms);
  }

  printf("\n");
  printf("Stride, Bandwidth (GB/s):\n");

  stride<<<n/blockSize, blockSize>>>(d_a, 1); // warm up
  for (int i = 1; i <= 32; i++) {
    checkCuda( cudaMemset(d_a, 0, n * sizeof(T)) );

    checkCuda( cudaEventRecord(startEvent,0) );
    stride<<<n/blockSize, blockSize>>>(d_a, i);
    checkCuda( cudaEventRecord(stopEvent,0) );
    checkCuda( cudaEventSynchronize(stopEvent) );

    checkCuda( cudaEventElapsedTime(&ms, startEvent, stopEvent) );
    printf("%d, %f\n", i, 2*nMB/ms);
  }

  checkCuda( cudaEventDestroy(startEvent) );
  checkCuda( cudaEventDestroy(stopEvent) );
  cudaFree(d_a);
}

int main(int argc, char **argv)
{
  int nMB = 4;
  int deviceId = 0;
  bool bFp64 = false;

  for (int i = 1; i < argc; i++) {    
    if (!strncmp(argv[i], "dev=", 4))
      deviceId = atoi((char*)(&argv[i][4]));
    else if (!strcmp(argv[i], "fp64"))
      bFp64 = true;
  }

  cudaDeviceProp prop;

  checkCuda( cudaSetDevice(deviceId) );
  checkCuda( cudaGetDeviceProperties(&prop, deviceId) );
  printf("Device: %s\n", prop.name);
  printf("Transfer size (MB): %d\n", nMB);

  printf("%s Precision\n", bFp64 ? "Double" : "Single");

  if (bFp64) runTest<double>(deviceId, nMB);
  else       runTest<float>(deviceId, nMB);
}

This code runs both the offset and stride kernels, in single precision by default or in double precision when the “fp64” command line option is passed. Each kernel takes two arguments: an input array and an integer representing the offset or stride used to access the elements of the array. The kernels are called in loops over a range of offsets and strides.

Misaligned Data Accesses

The results for the offset kernel on the Tesla C870, C1060, and C2050 appear in the following figure.

Arrays allocated in device memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. For the C870 or any other device with a compute capability of 1.0, any misaligned access by a half warp of threads (or aligned access where the threads of the half warp do not access memory in sequence) results in 16 separate 32-byte transactions. Since only 4 bytes are requested per 32-byte transaction, one would expect the effective bandwidth to be reduced by a factor of eight, which is roughly what we see in the figure above (brown line) for offsets that are not a multiple of 16 elements, corresponding to one half warp of threads.

For the Tesla C1060 or other devices with compute capability of 1.2 or 1.3, misaligned accesses are less problematic. Basically, the misaligned accesses of contiguous data by a half warp of threads are serviced in a few transactions that “cover” the requested data. There is still a performance penalty relative to the aligned case due both to unrequested data being transferred and to some overlap of data requested by different half-warps, but the penalty is far less than for the C870.

Devices of compute capability 2.0, such as the Tesla C2050, have an L1 cache in each multiprocessor with a 128-byte line size. The device coalesces accesses by threads in a warp into as few cache lines as possible, resulting in negligible effect of alignment on throughput for sequential memory accesses across threads.

Strided Memory Access

The results of the stride kernel appear in the following figure.

For strided global memory access we have a different picture. For large strides, the effective bandwidth is poor regardless of architecture version. This should not be surprising: when concurrent threads simultaneously access memory addresses that are very far apart in physical memory, then there is no chance for the hardware to combine the accesses. You can see in the figure above that on the Tesla C870 any stride other than 1 results in drastically reduced effective bandwidth. This is because compute capability 1.0 and 1.1 hardware requires linear, aligned accesses across threads for coalescing, so we see the familiar 1/8 bandwidth that we also saw in the offset kernel. Compute capability 1.2 and higher hardware can coalesce accesses that fall into aligned segments (32, 64, or 128 byte segments on CC 1.2/1.3, and 128-byte cache lines on CC 2.0 and higher), so this hardware results in a smooth bandwidth curve.
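
A rough back-of-the-envelope calculation, illustrative only and not part of the benchmark code, shows why bandwidth falls off roughly as the inverse of the stride: one warp of the stride kernel touches addresses spanning warpSize * stride * sizeof(T) bytes, so with 4-byte elements and 128-byte segments the number of segments touched grows with the stride while the amount of useful data per warp stays fixed.

#include <stddef.h>

// Illustrative helper (not part of the benchmark): bytes spanned by one warp
// of the stride kernel for 4-byte elements.
inline size_t bytesSpannedByWarp(int stride)
{
  const int warpSize = 32;
  return (size_t)warpSize * stride * sizeof(float);
}

// bytesSpannedByWarp(1)  ->  128 bytes:  1 segment,  every byte moved is used
// bytesSpannedByWarp(2)  ->  256 bytes:  2 segments, half the bytes moved are used
// bytesSpannedByWarp(32) -> 4096 bytes: 32 segments, 1/32 of the bytes moved are used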

When accessing multidimensional arrays it is often necessary for threads to index the higher dimensions of the array, so strided access is simply unavoidable. We can handle these cases by using a type of CUDA memory called shared memory. Shared memory is an on-chip memory shared by all threads in a thread block. One use of shared memory is to extract a 2D tile of a multidimensional array from global memory in a coalesced fashion into shared memory, and then have contiguous threads stride through the shared memory tile. Unlike global memory, there is no penalty for strided access of shared memory. We will cover shared memory in detail in the next post.
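
As a preview of that pattern, here is a minimal sketch (not the tuned transpose kernel from the upcoming posts), assuming a square matrix whose width is a multiple of 32 and a 32x32 thread block, which requires a device supporting 1024 threads per block: each block reads a tile from global memory with coalesced, row-wise accesses, and the strided, column-wise accesses then happen in shared memory, where they carry no penalty.

// Minimal sketch of tiling through shared memory (illustrative only).
// Assumes a square width x width matrix, width a multiple of 32, and a
// 32x32 thread block.
__global__ void transposeTileSketch(float *odata, const float *idata, int width)
{
  __shared__ float tile[32][32];

  int x = blockIdx.x * 32 + threadIdx.x;
  int y = blockIdx.y * 32 + threadIdx.y;

  // Coalesced read: threads with consecutive threadIdx.x read consecutive
  // global addresses within a row.
  tile[threadIdx.y][threadIdx.x] = idata[y * width + x];

  __syncthreads();

  // Swap block indices for the output tile.
  x = blockIdx.y * 32 + threadIdx.x;
  y = blockIdx.x * 32 + threadIdx.y;

  // The strided (column-wise) access happens in shared memory; the global
  // write below is again coalesced.
  odata[y * width + x] = tile[threadIdx.x][threadIdx.y];
}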

Summary

In this post we discussed some aspects of how to efficiently access global memory from within CUDA kernel code. Global memory access on the device shares performance characteristics with data access on the host; namely, that data locality is very important. In early CUDA hardware, memory access alignment was as important as locality across threads, but on recent hardware alignment is not much of a concern. On the other hand, strided memory access can hurt performance, which can be alleviated using on-chip shared memory. In the next post we will explore shared memory in detail, and in the post after that we will show how to use shared memory to avoid strided global memory accesses during a matrix transpose.
