
NVIDIA GPUs have enormous compute power and typically must be fed data at high speed to deploy that power. That is possible, in principle, because GPUs also have high memory bandwidth, but sometimes they need your help to saturate that bandwidth.

In this post, we examine one specific method to accomplish that: prefetching. We explain the circumstances under which prefetching can be expected to work well, and how to find out whether these circumstances apply to your workload.

Context

NVIDIA GPUs derive their power from massive parallelism. Many warps of 32 threads can be placed on a streaming multiprocessor (SM), awaiting their turn to execute. When one warp is stalled for whatever reason, the warp scheduler switches to another with zero overhead, making sure the SM always has work to do.

On the high-performance NVIDIA Ampere Architecture A100 GPU, up to 64 active warps can share an SM, each with its own resources. On top of that, A100 has 108 SMs that can all execute warp instructions simultaneously.

Most instructions must operate on data, and that data almost always originates in the device memory (DRAM) attached to the GPU. One of the main reasons why even the abundance of warps on an SM can run out of work is that they are waiting for data to arrive from memory.

If this happens, and the bandwidth to memory is not fully utilized, it may be possible to reorganize the program to improve memory access and reduce warp stalls, which in turn makes the program complete faster. This is called latency hiding.

Prefetching

A technology commonly supported in hardware on CPUs is called prefetching. The CPU sees a stream of memory requests arriving, figures out the access pattern, and starts fetching data before it is actually needed. While that data travels to the execution units of the CPU, other instructions can be executed, effectively hiding the travel costs (memory latency).

Prefetching is a useful technique but expensive in terms of silicon area on the chip. These costs would be even higher, relatively speaking, on a GPU, which has many more execution units than the CPU. Instead, the GPU uses excess warps to hide memory latency. When that is not enough, you may employ prefetching in software. It follows the same principle as hardware-supported prefetching but requires explicit instructions to fetch the data.

To determine if this technique can help your program run faster, use a GPU profiling tool such as NVIDIA Nsight Compute to check the following:

  1. Confirm that not all memory bandwidth is being used.
  2. Confirm the main reason warps are blocked is Stall Long Scoreboard, which means that the SMs are waiting for data from DRAM.
  3. Confirm that these stalls are concentrated in sizeable loops whose iterations do not depend on each other.

Unrolling

Consider the simplest possible optimization of such a loop, called unrolling. If the loop is short enough, you can tell the compiler to unroll it completely and the iterations are expanded explicitly. Because the iterations are independent, the compiler can issue all requests for data (“loads”) upfront, provided that it assigns distinct registers to each load.

These requests can be overlapped with each other, so that the whole set of loads experiences only a single memory latency, not the sum of all individual latencies. Even better, part of the single latency is hidden by the succession of load instructions itself. This is a near-optimal situation, but it may require a lot of registers to receive the results of the loads.

If the loop is too long, it could be unrolled partially. In that case, batches of iterations are expanded, and then you follow the same general strategy as before. Work on your part is minimal (but you may not be that lucky).
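
With nvcc, you can request either form with a pragma. The following is a minimal sketch on a generic loop with independent iterations; the kernel, array names, and unroll factors are placeholders for illustration, not taken from the application discussed below:

// Hypothetical kernel fragment illustrating compiler-driven unrolling.
__global__ void unroll_example(const double* arr, double* out, int n) {
  double sum = 0.0;

  // Full unroll: the trip count (8) is known at compile time, so the
  // compiler can issue all eight loads up front into distinct registers.
  #pragma unroll
  for (int k = 0; k < 8; ++k) {
    sum += arr[threadIdx.x + k * blockDim.x];
  }

  // Partial unroll: expand iterations in batches of 4; the loads within a
  // batch overlap, so each batch pays roughly one memory latency.
  #pragma unroll 4
  for (int k = 8; k < n; ++k) {
    sum += arr[threadIdx.x + k * blockDim.x];
  }

  out[threadIdx.x] = sum;
}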

If the loop contains many other instructions whose operands need to be stored in registers, even just partial unrolling may not be an option. In that case, and after you have confirmed that the earlier conditions are satisfied, you must make some decisions based on further information.

Prefetching means bringing data closer to the SMs’ execution units. Registers are closest of all. If enough are available, which you can find out using the Nsight Compute occupancy view, you can prefetch directly into registers.

Consider the following loop, where array arr is stored in global memory (DRAM). It implicitly assumes that just a single, one-dimensional thread block is being used, which is not the case for the motivating application from which it was derived. However, that simplification reduces code clutter and does not change the argument.

In all code examples in this post, uppercase variables are compile-time constants. BLOCKDIMX assumes the value of the predefined variable blockDim.x. For some purposes, it must be a constant known at compile time; for other purposes, it is useful for avoiding computations at run time.
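
For example, the constant could simply be fixed at compile time and reused in the launch configuration (the value 256 and the kernel name are placeholders for illustration, not taken from the application):

#define BLOCKDIMX 256                    // hypothetical thread-block size
// ...
// kernel<<<1, BLOCKDIMX>>>(arr, imax);  // single 1D block, as assumed above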

for (i=threadIdx.x; i<imax; i+= BLOCKDIMX) {
  double locvar = arr[i];
  <lots of instructions using locvar, for example, transcendentals>
}

Imagine that you have eight registers to spare for prefetching; this is a tuning parameter. The following code fetches four double-precision values, occupying eight 4-byte registers, at the start of every fourth iteration and uses them one by one, until the batch is depleted, at which time you fetch a new batch.

To keep track of the batches, introduce a counter (ctr) that increments with each successive iteration executed by a thread. For convenience, assume that the number of iterations per thread is divisible by 4.

double v0, v1, v2, v3;
for (i=threadIdx.x, ctr=0; i<imax; i+= BLOCKDIMX, ctr++) {
  ctr_mod = ctr%4;
  if (ctr_mod==0) { // only fill the buffer each 4th iteration
    v0=arr[i+0* BLOCKDIMX]; 
    v1=arr[i+1* BLOCKDIMX]; 
    v2=arr[i+2* BLOCKDIMX]; 
    v3=arr[i+3* BLOCKDIMX];
  }
  switch (ctr_mod) { // pull one value out of the prefetched batch
    case 0: locvar = v0; break;
    case 1: locvar = v1; break;
    case 2: locvar = v2; break;
    case 3: locvar = v3; break;
  }
  <lots of instructions using locvar, for example, transcendentals>
}

Typically, the more values that can be prefetched, the more effective the method is. While the preceding example is not complex, it is a little cumbersome: whenever the number of prefetched values (PDIST, or prefetch distance) changes, you have to add or delete lines of code.

It is easier to store the prefetched values in shared memory, because you can use array notation and vary the prefetch distance without any effort. However, shared memory is not as close to the execution units as registers. It requires an extra instruction to move the data from there into a register when it is ready for use. For convenience, we introduce macro vsmem to simplify indexing the array in shared memory:

#define vsmem(index)  v[index+PDIST*threadIdx.x]
__shared__ double v[PDIST* BLOCKDIMX];
for (i=threadIdx.x, ctr=0; i<imax; i+= BLOCKDIMX, ctr++) {
  ctr_mod = ctr%PDIST;
  if (ctr_mod==0) {
    for (k=0; k<PDIST; ++k) vsmem(k) = arr[i+k* BLOCKDIMX];
  }
  locvar = vsmem(ctr_mod);
  <more instructions using locvar, for example, transcendentals>
}

Instead of prefetching in batches, you can also do a “rolling” prefetch. In that case, you fill the prefetch buffer before entering the main loop and subsequently prefetch exactly one value from memory during each loop iteration, to be used PDIST iterations later. The next example implements rolling prefetching, using array notation and shared memory.

__shared__ double v[PDIST* BLOCKDIMX];
for (k=0; k<PDIST; ++k) vsmem(k) = arr[threadIdx.x+k* BLOCKDIMX];
for (i=threadIdx.x, ctr=0; i<imax; i+= BLOCKDIMX, ctr++) {
  ctr_mod= ctr%PDIST;
  locvar = vsmem(ctr_mod);
  if ( i<imax-PDIST* BLOCKDIMX) vsmem(ctr_mod) = arr[i+PDIST* BLOCKDIMX]; 
  <more instructions using locvar, for example, transcendentals>
}

Contrary to the batched method, the rolling prefetch does not suffer any further memory latencies during the execution of the main loop, provided the prefetch distance is sufficiently large. It also uses the same amount of shared memory or register resources, so it would appear to be preferred. However, a subtle issue may limit its effectiveness.

A synchronization within the loop (for example, __syncthreads()) constitutes a memory fence and forces the loading of arr to complete at that point within the same iteration, not PDIST iterations later. The fix is to use asynchronous loads into shared memory, the simplest version of which is explained in the Pipeline interface section of the CUDA C++ Programming Guide. These asynchronous loads do not need to complete at a synchronization point, but only when they are explicitly waited on.

Here’s the corresponding code:

#include <cuda_pipeline_primitives.h>
__shared__ double v[PDIST* BLOCKDIMX];
for (k=0; k<PDIST; ++k) { // fill the prefetch buffer asynchronously
  __pipeline_memcpy_async(&vsmem(k), &arr[threadIdx.x+k* BLOCKDIMX], 8);
  __pipeline_commit();
}
for (i=threadIdx.x, ctr=0; i<imax; i+= BLOCKDIMX, ctr++) {
  __pipeline_wait_prior(PDIST-1); //wait on needed prefetch value
  ctr_mod= ctr%PDIST;
  locvar = vsmem(ctr_mod);
  if ( i<imax-PDIST* BLOCKDIMX) { // prefetch one new value
    __pipeline_memcpy_async(&vsmem(ctr_mod), &arr[i+PDIST* BLOCKDIMX], 8);
    __pipeline_commit();
  }
  <more instructions using locvar, for example, transcendentals>
}

As each __pipeline_wait_prior instruction must be matched by a __pipeline_commit instruction, we put the latter inside the loop that prefills the prefetch buffer, before entering the main computational loop, to keep bookkeeping of matching instruction pairs simple.

Performance results

Figure 1 shows, for various prefetch distances, the performance improvement of a kernel taken from a financial application under the five algorithmic variations described earlier.

  • Batched prefetch into registers (scalar batched)
  • Batched prefetch into shared memory (smem batched)
  • Rolling prefetch into registers (scalar rolling)
  • Rolling prefetch into shared memory (smem rolling)
  • Rolling prefetch into shared memory using asynchronous memory copies (smem rolling async)
Graph shows that smem rolling async speeds up by ~60% at a prefetch distance of 6.
Figure 1. Kernel speedups for different prefetch strategies

Clearly, the rolling prefetching into shared memory with asynchronous memory copies gives good benefit, but it is uneven as the prefetch buffer size grows.

A closer inspection of the results, using Nsight Compute, shows that bank conflicts occur in shared memory, which cause a warp's worth of asynchronous loads to be split into more successive memory requests than strictly necessary. The classical optimization approach of padding the array size in shared memory to avoid bad strides works in this case. The value of PADDING is chosen such that the sum of PDIST and PADDING equals a power of two plus 1. Apply it to all variations that use shared memory:

#define vsmem(index) v[index+(PDIST+PADDING)*threadIdx.x]
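
For the padded indexing to stay within bounds, the shared-memory array must grow by the same amount; a sketch, assuming PADDING is, like PDIST, a compile-time constant:

__shared__ double v[(PDIST+PADDING)* BLOCKDIMX];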

This leads to the improved shared memory results shown in Figure 2. A prefetch distance of just 6, combined with asynchronous memory copies in a rolling fashion, is sufficient to obtain optimal performance at almost 60% speedup over the original version of the code. We could actually have arrived at this performance improvement without resorting to padding by changing the indexing scheme of the array in shared memory, which is left as an exercise for the reader.

Graph shows speedup percentages where scalar rolling alone slows performance by ~60% and the other rolling/batched strategies show speedups of 20-30%.
Figure 2. Kernel speedups for different prefetch strategies with shared memory padding
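
As a hint for the indexing exercise mentioned above, one possible scheme (a sketch, not necessarily the intended solution) keeps the original array size but interleaves the per-thread buffers, so that for a given buffer slot consecutive threads touch consecutive shared-memory locations:

#define vsmem(index) v[(index)* BLOCKDIMX+threadIdx.x]

Because ctr, and therefore ctr_mod, has the same value for all threads of a warp in a given iteration, each warp then reads and writes consecutive addresses, which spread across the shared-memory banks without any padding.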

A variation of prefetching not yet discussed moves data from global memory to the L2 cache, which may be useful if space in shared memory is too small to hold all data eligible for prefetching. This type of prefetching is not directly accessible in CUDA and requires programming at the lower PTX level.
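
For completeness, here is a hedged sketch of what such a prefetch could look like using inline PTX; prefetch.global.L2 is the relevant PTX instruction, and the wrapper name is an invention for illustration:

// Hypothetical helper: hint the memory system to stage the cache line
// containing ptr in L2. The hint does not block the issuing thread.
__device__ __forceinline__ void prefetch_to_l2(const void* ptr) {
  asm volatile("prefetch.global.L2 [%0];" :: "l"(ptr));
}

Such a helper could be called in a rolling fashion, analogous to the shared-memory variant, with the data later read from global memory as usual and, ideally, served from L2.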

Summary

In this post, we showed you examples of localized changes to source code that may speed up memory accesses. These do not change the amount of data being moved from memory to the SMs, only their timing. You may be able to optimize more by rearranging memory accesses such that data is reused many times after it arrives on the SM.
