
nvmath-python (Beta) is an open-source Python library, providing Python programmers with access to high-performance mathematical operations from NVIDIA CUDA-X math libraries. nvmath-python provides both low-level bindings to the underlying libraries and higher-level Pythonic abstractions. It is interoperable with existing Python packages, such as PyTorch and CuPy.

In this post, I show how to use epilogs with matrix multiplication in nvmath-python. Epilogs are operations that can be fused with the mathematical operation being performed, like FFT or matrix multiplication. Available epilogs cover the most common deep-learning computations. I demonstrate their usage by implementing the common forward and backward pass operations of a simple neural network.

To install nvmath-python, follow the installation instructions.

Optimizing the forward pass with the RELU_BIAS epilog

In this section, I demonstrate how to use epilogs to implement a forward pass of a simple linear layer. This layer first multiplies the input vectors by a weights matrix, then adds a bias to each element of the resulting matrix, and finally applies the ReLU activation function.

ReLU, short for Rectified Linear Unit, is a commonly used activation function that replaces negative values with zeros while leaving positive values unchanged.
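
As a minimal illustration (not part of nvmath-python), ReLU applied element-wise to a CuPy array can be written as follows:

import cupy

def relu(v):
    # Replace negative entries with 0; keep positive entries unchanged.
    return cupy.maximum(v, 0)

print(relu(cupy.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]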

In terms of matrix operations, the layer can be expressed as follows:

relu(Wx + B)

In this equation, the following definitions hold:

  • x is a batch of input vectors of shape n \times b:
    • n is the number of the layer’s inputs.
    • b is the batch size.
  • W is the weight matrix of shape m \times n:
    • m is the number of the layer’s outputs.
    • n is the number of its inputs.
  • B is the bias vector of length m, which is added to each column of the resulting matrix.

Assume that you have your inputs, weights, and bias as CuPy arrays:

import cupy

num_inputs, num_outputs = 784, 100
batch_size = 256

weights = cupy.random.rand(num_outputs, num_inputs)
bias = cupy.random.rand(num_outputs)
x = cupy.zeros((num_inputs, batch_size))

In the most basic version, you can implement this linear layer by using nvmath-python for calculating Wx, and then handling bias and ReLU manually, as in the following code example.

In this example, I use a stateful API, in which you can separate initialization and planning from the actual execution of the multiplication. I recommend this approach when you must perform multiple similar multiplications, as it enables you to amortize the initial cost of planning. For more information about Matmul, see nvmath.linalg.advanced.Matmul.

from nvmath.linalg.advanced import Matmul

mm = Matmul(weights, x)
mm.plan()

def forward():
    y = mm.execute()
    y += bias[:, cupy.newaxis]  # add the bias to each column
    y[y < 0] = 0                # ReLU: zero out negative entries
    return y

To improve the performance of the code, take advantage of the RELU_BIAS epilog to perform all three operations in a single, fused cuBLAS operation. This epilog first adds the bias to the result of the multiplication and then applies the ReLU function.

You can specify the epilog using the epilog argument of the Matmul.plan method. Some epilogs, including RELU_BIAS, take extra inputs, which can be specified in the epilog_inputs dictionary. For more information about epilogs, see nvmath.linalg.advanced.Matmul.

from nvmath.linalg.advanced import MatmulEpilog

mm = Matmul(weights, x)
mm.plan(epilog=MatmulEpilog.RELU_BIAS, epilog_inputs={"bias": bias})

def forward():
    y = mm.execute()
    return y
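
As a quick sanity check (a sketch that assumes the weights, x, and bias arrays defined earlier), you can compare the fused result against the manual bias-plus-ReLU version:

# Result of the fused RELU_BIAS epilog.
y_fused = forward()

# Manual reference: multiply, add bias, apply ReLU.
y_ref = weights @ x
y_ref += bias[:, cupy.newaxis]
y_ref[y_ref < 0] = 0

# The two should agree up to floating-point tolerance.
print(cupy.allclose(y_fused, y_ref))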

As I explain later, to backpropagate through the ReLU function, you must know which inputs to the ReLU were positive and which ones were negative. This auxiliary information, called the ReLU mask, can be obtained with the RELU_AUX_BIAS epilog.

When an epilog with auxiliary outputs is used, Matmul.execute returns a tuple containing the actual result and a dictionary of auxiliary outputs. In the case of RELU_AUX_BIAS, the auxiliary output dictionary has one key, relu_aux, which contains the ReLU mask. This mask is bit-encoded and might be hard to read, but there are dedicated epilogs that consume it for you during the backward pass.

from nvmath.linalg.advanced import MatmulEpilog

mm = Matmul(weights, x)
mm.plan(epilog=MatmulEpilog.RELU_AUX_BIAS, epilog_inputs={"bias": bias})

relu_mask = None

def forward():
    global relu_mask
    y, aux_outputs = mm.execute()
    relu_mask = aux_outputs["relu_aux"]
    return y

Figure 1. Operations of the forward pass (multiplying by the weights, adding the bias, and applying ReLU) covered by Matmul with the RELU_AUX_BIAS epilog, which also produces the ReLU mask as an auxiliary output

The implementation using the RELU_AUX_BIAS epilog is significantly faster than its naive counterpart, as Figure 2 shows.

Figure 2. Performance comparison of forward pass implementations: the naive implementation reaches 62.8% of peak TFLOP/s, while the RELU_AUX_BIAS implementation reaches 79.7%

Figure 2 shows the performance of a matrix multiplication of float16 matrices of sizes (65536, 16384) and (16384, 8192), followed by bias addition and ReLU. The performance was measured on an NVIDIA H200 GPU.
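
If you want to reproduce this kind of comparison on your own hardware, a simple timing sketch using CuPy's benchmarking helper could look as follows. The names and matrix sizes below are illustrative placeholders, smaller than the ones used for Figure 2:

import cupy
from cupyx.profiler import benchmark
from nvmath.linalg.advanced import Matmul, MatmulEpilog

m, k, n = 4096, 4096, 4096  # illustrative sizes
a16 = cupy.random.rand(m, k).astype(cupy.float16)
b16 = cupy.random.rand(k, n).astype(cupy.float16)
bias16 = cupy.random.rand(m).astype(cupy.float16)

# Naive version: plain matmul, then bias and ReLU by hand.
mm_naive = Matmul(a16, b16)
mm_naive.plan()

def naive_forward():
    y = mm_naive.execute()
    y += bias16[:, cupy.newaxis]
    y[y < 0] = 0
    return y

# Fused version: bias and ReLU handled by the RELU_BIAS epilog.
mm_fused = Matmul(a16, b16)
mm_fused.plan(epilog=MatmulEpilog.RELU_BIAS, epilog_inputs={"bias": bias16})

print(benchmark(naive_forward, n_repeat=10))
print(benchmark(mm_fused.execute, n_repeat=10))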

Optimizing the backward pass with the DRELU_BGRAD epilog

During the backward pass of a neural network, the gradient of the loss function with respect to the output is propagated back through the network layers to compute the gradients for each parameter.

Intuitively, for each operation, when the effect of its output on the loss is known, it becomes possible to determine how its inputs and parameters (such as the values in a weight matrix) influence the loss. For more information, see Backpropagation.

In this section, I assume that several linear layers are stacked together. I implement backpropagation over a sequence of operations that normally belong to different layers: adding bias, applying ReLU, and multiplying by the weights.

Figure 3. Operations implemented in the forward pass, with the part covered by the backward pass (adding bias, applying ReLU, and multiplying by the weights) marked

Let t_0 be the input to the part of the network shown earlier, and denote the intermediate results by t_1, t_2, and t_3, respectively:

  • t_1 = t_0 + B
  • t_2 = relu(t_1)
  • t_3 = Wt_2

In backpropagation, when you know how the loss function L is affected by t_3, which is \frac{\partial L}{\partial t_3}, it is possible to calculate the gradients with respect to other parameters. For more information about the derivations of the formulas used to compute the gradients, see Automatic Differentiation and Neural Networks.

  • \frac{\partial L}{\partial W} = \frac{\partial L}{\partial t_3} t_2^T
  • \frac{\partial L}{\partial t_2} = W^T \frac{\partial L}{\partial t_3}
  • \frac{\partial L}{\partial t_1} = 0 where t_1 was negative, and \frac{\partial L}{\partial t_1} = \frac{\partial L}{\partial t_2} where t_1 was non-negative (the ReLU mask contains exactly this information)
  • \frac{\partial L}{\partial B} is \frac{\partial L}{\partial t_1}, summed over the batch dimension (see the plain CuPy sketch after this list)
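
The following sketch spells these formulas out in plain CuPy for a small random example; the names (t0, grad_t3, and so on) are illustrative and not part of nvmath-python:

import cupy

m, n, b = 100, 784, 256            # outputs, inputs, batch size (illustrative)
W = cupy.random.rand(m, n)
B = cupy.random.rand(n)
t0 = cupy.random.rand(n, b)        # input to this part of the network
grad_t3 = cupy.random.rand(m, b)   # incoming gradient dL/dt3

# Forward intermediates.
t1 = t0 + B[:, cupy.newaxis]
t2 = cupy.maximum(t1, 0)
t3 = W @ t2

# Backward formulas from the list above.
grad_W = grad_t3 @ t2.T                     # dL/dW
grad_t2 = W.T @ grad_t3                     # dL/dt2
grad_t1 = cupy.where(t1 >= 0, grad_t2, 0)   # dL/dt1: ReLU mask applied
grad_B = grad_t1.sum(axis=1)                # dL/dB: summed over the batch dimension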

Figure 4. Operations of the backward pass, with the operations covered by the DRELU_BGRAD epilog marked: computing the gradients for t_2 (multiplying by the weights), t_1 (applying the ReLU mask), and B (batch sum); the gradient for W is not covered by this epilog

The operations required to compute \frac{\partial L}{\partial B} and \frac{\partial L}{\partial t_1} can be implemented naively by using Matmul just for the matrix multiplication, and then handling the masking and the batch sum manually. Assume that the gradient \frac{\partial L}{\partial t_3} is available in a CuPy array grad:

mm = Matmul(weights.T, grad)
mm.plan()

def backward():
    grad_t1 = mm.execute()
    grad_t1[mask] = 0  # assuming that `mask = (t1 < 0)`
    grad_bias = cupy.sum(grad_t1, axis=1)
    return grad_t1, grad_bias

To optimize your backward pass, use the DRELU_BGRAD epilog. This epilog expects one epilog input, relu_aux, containing the mask returned by the RELU_AUX_BIAS epilog, and applies this mask to the result of the multiplication. It also returns an auxiliary output with the result summed over the batch dimension, which is exactly \frac{\partial L}{\partial B}.

mm = Matmul(weights.T, grad)
mm.plan(epilog=MatmulEpilog.DRELU_BGRAD, epilog_inputs={"relu_aux": relu_mask})

def backward():
    grad_t1, aux_outputs = mm.execute()
    grad_bias = aux_outputs["drelu_bgrad"]
    return grad_t1, grad_bias
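
As with the forward pass, you can sanity-check the fused version against the naive one. This sketch assumes the grad array, the boolean mask from the naive example (mask = (t1 < 0)), and the DRELU_BGRAD-based backward function defined above, with both masks coming from the same forward pass:

# Naive reference: plain matmul, manual masking, manual batch sum.
grad_t1_ref = weights.T @ grad
grad_t1_ref[mask] = 0
grad_bias_ref = cupy.sum(grad_t1_ref, axis=1)

# Fused version from the DRELU_BGRAD example.
grad_t1_fused, grad_bias_fused = backward()

# Both pairs should agree up to floating-point tolerance.
print(cupy.allclose(grad_t1_ref, grad_t1_fused))
print(cupy.allclose(grad_bias_ref, grad_bias_fused))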

Figure 5. Performance comparison of backward pass implementations: the naive implementation reaches 56.9% of peak TFLOP/s, while the DRELU_BGRAD implementation reaches 66.4%

Figure 5 shows the performance of a matrix multiplication of float16 matrices of sizes (65536, 16384) and (16384, 8192), followed by the application of the ReLU mask and the bias gradient computation. The performance was measured on an NVIDIA H200 GPU.

Conclusion

With the epilogs of nvmath-python, you can fuse common deep learning computations in your Python code, which can greatly improve performance. For more information, see the nvmath-python: Unleashing the Full Capabilities of NVIDIA Math Libraries within Python documentation. For an end-to-end implementation of a simple neural network with nvmath-python, see the Backpropagation Jupyter notebook on GitHub.

nvmath-python is an open-source library, so feel free to visit the /NVIDIA/nvmath-python GitHub repo and reach out to us there.
