Simulation / Modeling / Design

CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran.

In the previous two posts we looked at how to move data efficiently between the host and device. In this sixth post of our CUDA Fortran series we discuss how to efficiently access device memory, in particular global memory, from within kernels.

There are several kinds of memory on a CUDA device, each with different scope, lifetime, and caching behavior. So far in this series we have used global memory, which resides in device DRAM, for transfers between the host and device as well as for the data input to and output from kernels. The name global here refers to scope, as it can be accessed and modified from both the host and the device. Global memory is declared in host code via the device variable attribute and can persist for the lifetime of the application. Depending on the compute capability of the device, global memory may or may not be cached on the chip.
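
As a quick reminder of how this looks in code, the following minimal sketch (the names are hypothetical) declares an array in global memory with the device attribute and moves data to and from it with simple assignments, as covered earlier in this series.

program deviceArrayExample
  use cudafor
  implicit none
  integer, parameter :: n = 1024
  real :: a(n)               ! host array
  real, device :: a_d(n)     ! array residing in global (device) memory

  a = 1.0
  a_d = a                    ! host-to-device copy via assignment
  ! ... launch kernels that read and write a_d here ...
  a = a_d                    ! device-to-host copy via assignment
end program deviceArrayExample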

Before we go into how global memory is accessed, we need to refine our understanding of the CUDA execution model. We have discussed how threads are grouped into thread blocks, which are assigned to multiprocessors on the device. During execution there is a finer grouping of threads into warps. Multiprocessors on the GPU execute instructions for each warp in SIMD (Single Instruction, Multiple Data) fashion. The warp size (effectively the SIMD width) of all current CUDA-capable GPUs is 32 threads.
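
To make this grouping concrete, here is a small, hypothetical kernel (not part of the benchmark code below; it would live in a module such as kernels_m shown later) that records each thread's warp and lane within its block. The predefined constant warpsize is available in device code and equals 32 on current hardware.

attributes(global) subroutine warpInfo(warpId, laneId)
  implicit none
  integer :: warpId(*), laneId(*)
  integer :: tid

  tid = (blockIdx%x-1)*blockDim%x + threadIdx%x   ! global 1-based thread index
  warpId(tid) = (threadIdx%x-1)/warpsize + 1      ! which warp within the block
  laneId(tid) = mod(threadIdx%x-1, warpsize) + 1  ! lane, i.e. position within the warp (1..32)
end subroutine warpInfo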

Global Memory Coalescing

Grouping of threads into warps is not only relevant to computation, but also to global memory accesses. The device coalesces global memory loads and stores issued by the threads of a warp into as few transactions as possible in order to minimize DRAM bandwidth usage (on older hardware of compute capability less than 2.0, transactions are coalesced within half warps of 16 threads rather than whole warps). To elucidate the conditions under which coalescing occurs across CUDA device architectures, we run some simple experiments on three Tesla cards: a Tesla C870 (compute capability 1.0), a Tesla C1060 (compute capability 1.3), and a Tesla C2050 (compute capability 2.0).

We run two experiments that use variants of an increment kernel shown in the following code, one with an array offset that can cause misaligned accesses to the input array, and the other with strided accesses to the input array.

module kernels_m
  ! kind parameters; set fp_kind to doublePrecision to run in double precision
  integer, parameter :: singlePrecision = kind(0.0)
  integer, parameter :: doublePrecision = kind(0.0d0)

  integer, parameter :: fp_kind = singlePrecision
contains
  ! each thread increments one element, shifted from its natural index by s
  attributes(global) subroutine offset(a, s)
    real (fp_kind) :: a(*)
    integer, value :: s
    integer :: i
    i = blockDim%x*(blockIdx%x-1) + threadIdx%x + s
    a(i) = a(i) + 1
  end subroutine offset

  ! consecutive threads increment elements that are s elements apart
  attributes(global) subroutine stride(a, s)
    real (fp_kind) :: a(*)
    integer, value :: s
    integer :: i
    i = 1 + (blockDim%x*(blockIdx%x-1) + threadIdx%x - 1) * s
    a(i) = a(i) + 1
  end subroutine stride
end module kernels_m

program offsetAndStride
  use cudafor
  use kernels_m

  implicit none

  integer, parameter :: nMB = 4   ! NB: a_d(33*nMB) for stride case
  integer, parameter :: blockSize = 256
  integer :: n
  real (fp_kind), device, allocatable :: a_d(:)
  type(cudaEvent) :: startEvent, stopEvent
  type(cudaDeviceProp) :: prop
  integer :: i, istat
  real(4) :: time

  istat = cudaGetDeviceProperties(prop, 0)
  write(*,'(/,"Device: ",a)') trim(prop%name)
  write(*,'("Transfer size (MB): ",i0)') nMB

  if (kind(a_d) == singlePrecision) then
    write(*,'(a,/)') 'Single Precision'
  else
    write(*,'(a,/)') 'Double Precision'
  endif
  n = nMB*1024*1024/fp_kind   ! fp_kind (4 or 8) also equals the element size in bytes here
  allocate(a_d(n*33))

  istat = cudaEventCreate(startEvent)
  istat = cudaEventCreate(stopEvent)

  write(*,*) 'Offset, Bandwidth (GB/s):'
  call offset<<<n/blockSize, blockSize>>>(a_d, 0)   ! warm up
  do i = 0, 32
    a_d = 0.0
    istat = cudaEventRecord(startEvent,0)
    call offset<<<n/blockSize, blockSize>>>(a_d, i)
    istat = cudaEventRecord(stopEvent,0)
    istat = cudaEventSynchronize(stopEvent)

    istat = cudaEventElapsedTime(time, startEvent, stopEvent)
    write(*,*) i, 2*nMB/time*(1.e+3/1024)
  enddo

  write(*,*)
  write(*,*) 'Stride, Bandwidth (GB/s):'
  call stride<<<n/blockSize, blockSize>>>(a_d, 1)   ! warm up
  do i = 1, 32
    a_d = 0.0
    istat = cudaEventRecord(startEvent,0)
    call stride<<<n/blockSize, blockSize>>>(a_d, i)
    istat = cudaEventRecord(stopEvent,0)
    istat = cudaEventSynchronize(stopEvent)
    istat = cudaEventElapsedTime(time, startEvent, stopEvent)
    write(*,*) i, 2*nMB/time*(1.e+3/1024)
  enddo

  istat = cudaEventDestroy(startEvent)
  istat = cudaEventDestroy(stopEvent)
  deallocate(a_d)

end program offsetAndStride

This code can run both the offset and stride kernels in either single or double precision by changing the fp_kind parameter at the top of the code. Each kernel takes two arguments, an input array and an integer representing the offset or stride used to access the elements of the array. The kernels are called in loops over a range of offsets and strides.
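
The bandwidth printed in each iteration, 2*nMB/time*(1.e+3/1024), breaks down as follows (cudaEventElapsedTime reports time in milliseconds):

  factor of 2       each element is both read and written by the kernel
  nMB               megabytes of data touched per direction
  *1.e+3 / time     converts the elapsed milliseconds into a per-second rate
  /1024             converts MB/s into GB/s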

Misaligned Data Accesses

The results for the offset kernel on the Tesla C870, C1060, and C2050 are shown in the following figure.

Arrays allocated (either explicitly or implicitly) in device memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. For the C870 or any other device with a compute capability of 1.0, any misaligned access by a half warp of threads (or aligned access where the threads of the half warp do not access memory in sequence) results in 16 separate 32-byte transactions. Since only 4 bytes are requested per 32-byte transaction, one would expect the effective bandwidth to be reduced by a factor of eight, which is roughly what we see in the figure above (brown line) for offsets that are not a multiple of 16 elements, corresponding to one half warp of threads.
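
The factor of eight follows from a quick count for one half warp on compute capability 1.0 with 4-byte elements:

  bytes requested   = 16 threads x 4 bytes          =  64 bytes
  bytes transferred = 16 separate 32-byte segments  = 512 bytes
  useful fraction   = 64 / 512                      = 1/8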

For the Tesla C1060 or other devices with compute capability of 1.2 or 1.3, misaligned accesses are less problematic. Basically, the misaligned accesses of contiguous data by a half warp of threads are serviced in a few transactions that “cover” the requested data. There is still a performance penalty relative to the aligned case due to both unrequested data being transferred and some overlap of data requested by different half-warps, but the penalty is far less than for the C870.

Devices of compute capability 2.0, such as the Tesla C2050, have an L1 cache in each multiprocessor with a 128-byte line size. Accesses by threads in a warp are coalesced into as few cache lines as possible, resulting in a negligible effect of alignment on throughput for sequential memory accesses across threads.

Strided Memory Access

The results of the stride kernel are shown below:

For strided global memory access we have a different picture. For large strides, the effective bandwidth is poor regardless of the architecture version. This should not be surprising: when concurrent threads simultaneously access memory addresses that are very far apart in physical memory, there is no chance for the hardware to combine the accesses. You can see in the figure above that on the C870 any stride other than 1 results in drastically reduced effective bandwidth. This is because compute capability 1.0 and 1.1 hardware requires linear, aligned accesses across threads for coalescing, so we see the familiar 1/8 bandwidth that we also saw in the offset kernel. Compute capability 1.2 and higher hardware can coalesce accesses that fall into aligned segments (32-, 64-, or 128-byte segments on CC 1.2/1.3, and 128-byte cache lines on CC 2.0 and higher), so this hardware exhibits a smoother bandwidth curve.
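
A rough, back-of-the-envelope model (alignment and caching shift the exact numbers) shows why the fall-off looks roughly like 1/stride on cached hardware such as the C2050:

  useful bytes per 32-thread warp                    = 32 x 4 bytes = 128 bytes
  128-byte cache lines touched at stride s (s <= 32) ~ s, i.e. about 128*s bytes fetched
  useful fraction of fetched data                    ~ 1/s, bottoming out near 1/32 once s >= 32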

When accessing multidimensional arrays it is often necessary for threads to index the higher dimensions of the array, so strided access is simply unavoidable. We can handle these cases by using a type of CUDA memory called shared memory. Shared memory is an on-chip memory which is shared by all threads in a thread block. One use of shared memory is to extract a 2D tile of a multidimensional array from global memory in a coalesced fashion into shared memory, and then have contiguous threads stride through the shared memory tile. Unlike global memory, there is no penalty for strided access of shared memory. We will cover shared memory in detail in the next post.
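
As a preview, the following minimal sketch shows the idea (the module, kernel name, tile size, and launch configuration are illustrative assumptions; a tuned version of this pattern is the matrix transpose covered two posts from now): a tile is read from global memory with coalesced accesses, the transposed, strided accesses are then served from shared memory, and the write back to global memory is again coalesced.

module transposeSketch_m
  integer, parameter :: TILE_DIM = 16   ! 16x16 = 256 threads per block, valid on all devices used above
contains
  attributes(global) subroutine transposeTile(odata, idata, n)
    implicit none
    integer, value :: n
    real :: odata(n,n), idata(n,n)
    real, shared :: tile(TILE_DIM, TILE_DIM)
    integer :: x, y

    x = (blockIdx%x-1)*TILE_DIM + threadIdx%x
    y = (blockIdx%y-1)*TILE_DIM + threadIdx%y

    ! coalesced read: threadIdx%x varies fastest and idata is column-major,
    ! so contiguous threads read contiguous global memory locations
    tile(threadIdx%x, threadIdx%y) = idata(x, y)
    call syncthreads()

    ! the transposed (strided) access is served from shared memory, and the
    ! global memory write below is again contiguous across the threads of a warp
    x = (blockIdx%y-1)*TILE_DIM + threadIdx%x
    y = (blockIdx%x-1)*TILE_DIM + threadIdx%y
    odata(x, y) = tile(threadIdx%y, threadIdx%x)
  end subroutine transposeTile
end module transposeSketch_m

With n a multiple of TILE_DIM, a launch of the form call transposeTile<<<dim3(n/TILE_DIM,n/TILE_DIM,1), dim3(TILE_DIM,TILE_DIM,1)>>>(b_d, a_d, n) covers the whole matrix.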

Summary

In this post we discussed some aspects of how to efficiently access global memory from within CUDA kernel code. Global memory access on the device shares performance characteristics with data access on the host; namely, data locality is very important. In early CUDA hardware, memory access alignment was as important as locality across threads, but on recent hardware alignment is not much of a concern. On the other hand, strided memory access can hurt performance, which can be alleviated using on-chip shared memory. In the next post we will explore shared memory in detail, and in the post after that we will show how shared memory can be used to avoid strided global memory accesses during a matrix transpose.
