

One of the great pastimes of graphics developers and enthusiasts is comparing specifications of GPUs and marveling at the ever-increasing counts of shader cores, RT cores, teraflops, and overall computational power with each new generation. Achieving the maximum theoretical performance represented by those numbers is a major focus in the world of graphics programming. Massive amounts of rendering data, such as triangles, pixels, and rays, flow through the immensely parallel GPU computation pipeline as if on a group of assembly lines in a manufacturing plant. Maximum throughput requires the factory to be humming, with no interrupted work or idle equipment. 

This post covers several of the new features in Nsight Graphics 2024.3 to help you understand and manage these virtual assembly lines, and create optimally parallelized workloads for games and graphics applications.

Active Threads per Warp histogram

A warp is a group of 32 threads that forms the fundamental unit of execution for programmable shaders. Ray tracing, compute, vertex, pixel, and other types of shaders written in HLSL or GLSL are compiled down to machine instructions and eventually run on the hardware in warp-sized groups. The threads in the warp run in parallel, and hundreds of warps themselves run in parallel. When programmable shading is the limiting factor of a workload, having warps run efficiently is vital to reaching peak performance.

Warps run on a hardware unit called the Streaming Multiprocessor (SM), where they execute using a computational model called Single Instruction, Multiple Threads (SIMT). Each warp issues one instruction at a time across all of its threads, with each thread having its own operands. For example, if a line of shader code adds two numbers, then within a warp the addition instruction begins running simultaneously in all 32 threads, producing 32 unique sums from 64 inputs.
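The addition example can be modeled in a few lines of Python. This is only a sketch of the SIMT idea, not of how an SM is implemented: the name `warp_add` and the operand values are made up for illustration, and the list comprehension runs sequentially where real hardware runs the lanes in parallel.

```python
WARP_SIZE = 32  # threads per warp, as described above

def warp_add(a, b):
    """Model one warp-wide add: a single instruction, one result per thread."""
    assert len(a) == len(b) == WARP_SIZE
    return [x + y for x, y in zip(a, b)]

# Each thread (lane) supplies its own pair of operands (illustrative values).
a = [float(lane) for lane in range(WARP_SIZE)]
b = [0.5 * lane for lane in range(WARP_SIZE)]

sums = warp_add(a, b)
print(len(a) + len(b), "inputs ->", len(sums), "sums")
```

The point of the model is the shape of the data: one instruction, 32 lanes, two operands per lane.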

One way that warps can run suboptimally is through thread divergence due to branches in shader code from if-statements or control flow. The compiler might avoid branching entirely by running all blocks of code under an if-statement and then ignoring the unused results, or by using unrolled loops. But when a true dynamic branch is compiled down and encountered at execution time, and the conditional expression is not uniform across all threads in a warp, the warp must execute both the if-body and the else-body one at a time. 

Again, only one instruction at a time can be issued in the SIMT model, so the SM must mask off the threads not applicable to the active side of the branch. For a more complete description of warp instruction scheduling in modern NVIDIA GPUs, see NVIDIA Tesla V100 GPU Architecture (page 26).

Figure 1 shows a simplified visualization of thread divergence in a warp with a hypothetical size of eight threads. Capital letters represent statements in the program pseudocode. Throughput is reduced due to the idle lanes of thread execution at the time each instruction is issued.

Some brief pseudocode for an if-else statement in a shader (left) and a graphic showing eight lines of execution representing threads split into two chunks of four threads during the if-else blocks (right).
Figure 1. A simplified visualization of thread scheduling under the SIMT warp execution model
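To put a number on the idle lanes, here is a minimal Python sketch of the same eight-thread warp. The function names, the instruction counts for each block, and the even four/four split are assumptions chosen for illustration. The model counts only how many lanes are active at each instruction issue and ignores everything else the hardware does.

```python
def issue_trace(cond, if_len, else_len, common_len=1):
    """Return the number of active lanes at each instruction issue.

    cond       -- per-lane result of the branch condition
    if_len     -- instructions in the if-body (hypothetical count)
    else_len   -- instructions in the else-body (hypothetical count)
    common_len -- instructions before and after the branch
    """
    warp = len(cond)
    taken = sum(cond)
    trace = [warp] * common_len                 # all lanes active before the branch
    if taken:
        trace += [taken] * if_len               # if-body: else-lanes masked off
    if warp - taken:
        trace += [warp - taken] * else_len      # else-body: if-lanes masked off
    trace += [warp] * common_len                # lanes reconverge after the branch
    return trace

def lane_efficiency(trace, warp):
    """Fraction of lane slots that did useful work."""
    return sum(trace) / (len(trace) * warp)

divergent = [True] * 4 + [False] * 4            # half the warp takes each side
uniform = [True] * 8                            # the whole warp takes the if-body

print(issue_trace(divergent, 2, 2))             # [8, 4, 4, 4, 4, 8]
print(lane_efficiency(issue_trace(divergent, 2, 2), 8))
print(lane_efficiency(issue_trace(uniform, 2, 2), 8))
```

In this toy case the divergent warp needs six issues instead of four and uses only two thirds of its lane slots, while the uniform warp skips the else-body entirely.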

Note that this is different from running both blocks in an if-statement in a branchless manner as previously mentioned. True branching avoids any side effects of unexecuted threads. More importantly, the dynamic branch can be a win when there is a favorable warp-level distribution of the conditional expression that drives the branch. The more warps that have a homogeneous result for the conditional expression, the less divergence there will be, thus having less impact on throughput. In vertex and pixel shaders, warps will be grouped based on locality of vertices and pixels, respectively. In ray tracing and compute shaders, the user has more explicit control over how the work gets grouped.
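The trade-off between the two approaches can be sketched with a simple issue-count model. Everything here is an assumption for illustration: the function names, the eight-lane toy warps, and the instruction counts. The model also ignores the overhead of the branch instructions themselves, so a fully divergent workload can be somewhat worse with a true branch than this suggests.

```python
def branch_cost(warps, if_len, else_len):
    """Instructions issued with a true dynamic branch.

    A warp pays for a side of the branch only if at least one lane takes it.
    """
    total = 0
    for cond in warps:
        if any(cond):
            total += if_len
        if not all(cond):
            total += else_len
    return total

def branchless_cost(warps, if_len, else_len):
    """Instructions issued when both blocks always run and unused results are dropped."""
    return len(warps) * (if_len + else_len)

IF_LEN, ELSE_LEN = 10, 2  # hypothetical instruction counts

# Four toy warps of eight lanes each, with the same overall 50/50 split.
coherent = [[True] * 8, [True] * 8, [False] * 8, [False] * 8]
scattered = [[True, False] * 4 for _ in range(4)]

print(branch_cost(coherent, IF_LEN, ELSE_LEN))      # 24
print(branch_cost(scattered, IF_LEN, ELSE_LEN))     # 48
print(branchless_cost(coherent, IF_LEN, ELSE_LEN))  # 48
```

Both inputs have the same fraction of threads taking the branch. Only the warp-level grouping differs, and that alone halves the issue count in the coherent case.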

Other factors in the performance impact of thread divergence include the overall percentage of shader instructions under any branch, and the specific machine instructions and their operand types. The only way to know the true impact is to be able to track warp thread efficiency when measuring overall performance and correlate changes in one to the other.

This is where the new Active Threads per Warp histogram comes in. This compact graphic is now available throughout the Shader Profiler views within the GPU Trace tool in Nsight Graphics, including Shader Pipelines, Top-Down, Bottom-Up, Hot Spots, and Source/Disassembly. It illustrates the aggregate impact of thread divergence for any given shader, function, or individual line of source code. 

As shown in Figure 2, values on the right of the histogram (closer to 32) indicate more efficient instruction execution. The values shown are approximated from the sampling of performance counters at the time of the execution of each code block. A popup tooltip shows the histogram in greater detail. When launching GPU Trace, the Timeline Metrics setting must be set to either Top-Level Triage or Ray Tracing Triage (if available), and the Real-Time Shader Profiler must be enabled.

The Active Threads per Warp column is shown in the Hot Spots view of GPU Trace. Each row shows information about the lines of shader code using up the most samples in the recorded interval. Active Threads per Warp is a small colored histogram with a blue marker indicating the average value.
Figure 2. The Active Threads per Warp histogram within the GPU Trace tool in Nsight Graphics
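To make the reading of such a histogram concrete, the following Python sketch builds one from a handful of 32-bit active-lane masks. The masks, the bin count, and the function name are invented for illustration, and this is not a description of how Nsight Graphics samples or aggregates its counters.

```python
def active_threads_histogram(masks, bins=8, warp_size=32):
    """Bucket the popcount of each active-lane mask into equal-width bins.

    Returns (bin counts, average active threads per sample).
    """
    width = warp_size // bins
    counts = [0] * bins
    total = 0
    for mask in masks:
        active = bin(mask & 0xFFFFFFFF).count("1")   # lanes enabled in this sample
        total += active
        idx = min((active - 1) // width, bins - 1) if active else 0
        counts[idx] += 1
    return counts, total / len(masks)

# Hypothetical samples: two full warps, one half-full, one quarter-full.
samples = [0xFFFFFFFF, 0xFFFFFFFF, 0x0000FFFF, 0x000000FF]

counts, average = active_threads_histogram(samples)
print(counts)    # [0, 1, 0, 1, 0, 0, 0, 2]
print(average)   # 22.0
```

The average of 22 hides the fact that half of the samples ran at full width, which is the kind of detail the spread of the histogram exposes.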

If a function is a performance bottleneck and has poor Active Threads per Warp, you should consider strategies to improve warp coherence, or to reduce branching. For ray tracing workloads, look at Shader Execution Reordering (SER), which was designed specifically to address thread and data divergence issues in ray tracing shaders. Other algorithmic changes may improve thread execution coherence; for example, using a different ray sampling pattern. For any type of shader, it may also be possible to improve efficiency by converting branches into warp-aware shader code in D3D12 or Vulkan.

The spread of the histogram reveals whether the behavior of the code block was consistent, and seeing Active Threads per Warp at or near 100% may also validate that thread divergence is not a limiting factor when it was originally suspected to be.

Figure 3 illustrates how advanced lighting techniques such as path tracing cause shader divergence as secondary rays bounce off objects in the scene. SER improves execution coherence because rays using the same hit shader make better use of SIMT at the warp level. When SER is working, you should see Active Threads per Warp improve.

A graphic showing a side view of a room and ray tracing rays entering the room from a camera direction. Initially the rays are bundled together whereby neighboring rays shade neighboring objects. A second panel shows the rays after one bounce, after which they are mixed up such that neighboring rays don’t have any coherence. Shader Execution Reordering then reorders the rays to be bundled coherently.
Figure 3. Advanced lighting techniques such as path tracing cause shader divergence as secondary rays bounce off objects in the scene, denoted by various colors
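The effect of grouping rays by hit shader can be approximated with a small Python model. A plain sort stands in for the reordering here, which is purely an illustrative assumption and not how SER is implemented. The ray count, the four hit shader IDs, and the function name are also made up.

```python
WARP_SIZE = 32

def avg_active_threads(shader_ids):
    """Average active lanes per issue when each warp runs its hit shaders one at a time."""
    issues = []
    for start in range(0, len(shader_ids), WARP_SIZE):
        warp = shader_ids[start:start + WARP_SIZE]
        for sid in set(warp):
            # Only the lanes whose ray hit this shader are active for this pass.
            issues.append(warp.count(sid))
    return sum(issues) / len(issues)

# 128 rays cycling through 4 hypothetical hit shaders: every warp sees all four.
rays = [i % 4 for i in range(128)]

before = avg_active_threads(rays)
after = avg_active_threads(sorted(rays))   # the sort stands in for reordering
print(before, after)                       # 8.0 32.0
```

In this idealized case each warp goes from four passes at a quarter of its width to a single pass at full width, which is the direction of change to look for in the Active Threads per Warp histogram.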

At a higher level, improving overall shading time involves understanding and reducing warp stalls. Warps stall when they reach long latency operations, especially things like memory accesses and texture fetches. When one warp stalls, another warp can be scheduled for the next instruction. However, this can only buy so much parallelism. If too many warps are stalled, then the SM sits underutilized or even idle. The length of a stall can depend on many factors such as which level of cache is hit, if any, for a memory lookup. 
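A toy scheduler model shows why switching between warps only buys so much. The assumptions are deliberately crude and all names are hypothetical: one scheduler issues at most one instruction per cycle, and every instruction is followed by a fixed stall. Real SMs have multiple schedulers and highly variable latencies.

```python
def sm_utilization(num_warps, stall_cycles, total_cycles=10_000):
    """Fraction of cycles in which some warp was able to issue an instruction."""
    ready_at = [0] * num_warps           # cycle at which each warp can issue again
    issued = 0
    for cycle in range(total_cycles):
        for w in range(num_warps):
            if ready_at[w] <= cycle:
                # The warp issues one instruction, then stalls (e.g. on a memory access).
                ready_at[w] = cycle + 1 + stall_cycles
                issued += 1
                break                    # at most one issue per cycle
    return issued / total_cycles

for warps in (4, 8, 16):
    print(warps, sm_utilization(warps, stall_cycles=15))
# 4 0.25
# 8 0.5
# 16 1.0
```

Under these assumptions, a 15-cycle stall needs 16 resident warps to keep the issue slot busy. With fewer warps, or longer stalls, the modeled SM sits idle for part of the time.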

GPU Trace has always provided tools for analyzing these stalls, but a new arrow in the quiver is the Warp Latency histogram. Warp latency was previously presented only as a single Average Warp Latency cycle count. Seeing the distribution of warp latency provides greater insight into the variability of shader timings, providing hints as to whether early exits were taken, and whether the arguments to multiple shader invocations resulted in different behaviors. Note that the histogram currently only contains separate latency data points for disjoint regions in the timeline calling into the same shader.

The Average Warp Latency column is shown in the Shader Pipelines view of GPU Trace. Each row shows information about the shaders using up the most samples in the recorded interval. Average Warp Latency is a small colored histogram with a blue marker indicating the average value.
Figure 4. The Average Warp Latency histogram within the GPU Trace tool in Nsight Graphics
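A short Python sketch illustrates why a distribution says more than a single average. The latency values are invented: one data set is uniform, the other mimics a shader where half of the invocations exit early. The function name and the bin width are likewise assumptions.

```python
from statistics import mean

def latency_histogram(latencies, bin_width):
    """Count samples per bin, keyed by the lower edge of each bin (in cycles)."""
    hist = {}
    for cycles in latencies:
        lower_edge = (cycles // bin_width) * bin_width
        hist[lower_edge] = hist.get(lower_edge, 0) + 1
    return dict(sorted(hist.items()))

uniform = [1000] * 8                   # every invocation takes about the same time
bimodal = [200] * 4 + [1800] * 4       # half exit early, half take the long path

print(mean(uniform), mean(bimodal))            # 1000 1000
print(latency_histogram(uniform, 500))         # {1000: 8}
print(latency_histogram(bimodal, 500))         # {0: 4, 1500: 4}
```

Both data sets report the same average of 1000 cycles, but only the histogram reveals that the second one is really two different behaviors.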

For more detailed information about optimizing GPU workloads, check out these resources:

D3D12 Work Graphs

Intercommunication between CPU and GPU is another common bottleneck in graphics pipelines. Even when bulk data is left resident on the GPU, the act of issuing rendering instructions from the CPU can create bubbles where the GPU is sitting idle. Work Graphs are a new feature in D3D12 that aim to decrease the dependency on the CPU to schedule GPU work. GPU-driven scheduling has been around for a while, but Work Graphs introduce more advanced capabilities than existing methods such as ExecuteIndirect. For an overview of Work Graphs, see Advancing GPU-Driven Rendering with Work Graphs in Direct3D 12.

Initial support for profiling Work Graph nodes as a whole in the Shader Profiler was introduced in Nsight Graphics 2024.2. In 2024.3, the Shader Profiler now supports source correlation for Work Graphs, enabling the full functionality of line-by-line analysis in the Shader Source view and Hot Spots list. As Work Graphs are a new feature in D3D12, this capability should help developers to explore and better understand Work Graph performance characteristics. Note that source correlation requires the newest R565 series driver.

A list showing the names of Work Graph shaders and the corresponding source code line numbers is shown in the Hot Spots view in GPU Trace, ordered by the most expensive lines of code.
Figure 5. Specific lines from Work Graph shaders shown in the Hot Spots view in GPU Trace. Clicking on any line opens the full source view of that shader showing per-line performance statistics

Likewise, the Nsight Aftermath SDK 2024.3 adds support for tracking shaders used for Work Graphs and providing contextual information to aid in narrowing down related GPU faults originating in Work Graph workloads.

Vulkan updates

The recently released Vulkan 1.4 standard promotes over a dozen previously optional extensions into the required extension set and introduces increased minimum hardware limits. For more information, see Khronos Streamlines Development and Deployment of GPU-Accelerated Applications with Vulkan 1.4. Nsight Graphics 2024.3 is shipping with Vulkan 1.4 support in the Frame Debugger. For beta drivers supporting 1.4, visit Vulkan Driver Support.
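When checking whether a driver reports Vulkan 1.4, it helps to know how the API version is packed into a single 32-bit value, such as the one reported in VkPhysicalDeviceProperties::apiVersion. The Python sketch below mirrors the bit layout used by the VK_MAKE_API_VERSION and VK_API_VERSION_* macros. The Python function names are hypothetical helpers, not part of any Vulkan binding.

```python
def vk_make_api_version(variant, major, minor, patch):
    """Pack a version using the same bit layout as VK_MAKE_API_VERSION."""
    return (variant << 29) | (major << 22) | (minor << 12) | patch

def vk_decode_api_version(version):
    """Unpack (variant, major, minor, patch) from a packed version value."""
    return (
        version >> 29,
        (version >> 22) & 0x7F,
        (version >> 12) & 0x3FF,
        version & 0xFFF,
    )

def supports_vulkan_1_4(api_version):
    """True if the packed version is Vulkan 1.4 or newer (variant is ignored)."""
    _, major, minor, _ = vk_decode_api_version(api_version)
    return (major, minor) >= (1, 4)

v1_4 = vk_make_api_version(0, 1, 4, 0)
v1_3 = vk_make_api_version(0, 1, 3, 280)   # the patch number is an arbitrary example

print(vk_decode_api_version(v1_4))                         # (0, 1, 4, 0)
print(supports_vulkan_1_4(v1_4), supports_vulkan_1_4(v1_3))  # True False
```

In a real application the same comparison would be done in C or C++ against the value returned by the driver.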

Even if you’re not using Vulkan 1.4 directly, all of the newly promoted extensions are now supported in Nsight Graphics. Support for many other extensions has been added as well, including VK_NV_inherited_viewport_scissor and VK_NV_device_generated_commands_compute. For the complete list, see the NVIDIA Nsight Graphics User Guide.

This release also adds support for Frame Debugging and GPU Trace of applications that use Vulkan SC on the Windows and Linux desktop platforms. For more information about Vulkan SC and driver support, visit Vulkan Driver Support.

Conclusion 

You’ll want to have a foundational understanding of the GPU computing model before you start optimizing for parallelism. Yet, how the theory translates to practice can be hard to predict due to varying data patterns, compiler and hardware optimizations, and many other second-order influences. Have a strategy based on hypothesis testing and records of measurements in the tools. Understand what the metrics mean and then track how they change as you make adjustments. While it’s not practical to achieve 100% utilization of every hardware unit on the GPU simultaneously, incremental improvements can help you reach the performance requirements of your application.

Nsight Graphics 2024.3 is now available. Tell us about your experience with these new features using the Feedback button located at the top right of the Nsight Graphics window. 

Learn more about Nsight Developer Tools and explore tutorials for Nsight Tools. Ask questions, provide feedback, and engage with the graphics developer community on the Nsight Graphics Developer forums.

Acknowledgments

For their contributions to this post, I’d like to thank Avinash Baliga, Jeff Kiel, Axel Mamode, Aurelio Reis, and Louis Bavoil.
