Feedforward Network Layer

1. Feedforward Network Layer

The feedforward layer of the Llama 3 model is implemented as follows:

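from typing import Optional

import torch.nn as nn
import torch.nn.functional as F
from fairscale.nn.model_parallel.layers import ColumnParallelLinear, RowParallelLinear


class FeedForward(nn.Module):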
    def __init__(
        self,
        dim: int,
        hidden_dim: int,
        multiple_of: int,
        ffn_dim_multiplier: Optional[float],
    ):
        """
        Initialize the FeedForward module.

        Args:
            dim (int): Input dimension.
            hidden_dim (int): Hidden dimension of the feedforward layer.
            multiple_of (int): Value to ensure hidden dimension is a multiple of this value.
            ffn_dim_multiplier (float, optional): Custom multiplier for hidden dimension. Defaults to None.

        Attributes:
            w1 (ColumnParallelLinear): Linear transformation for the first layer.
            w2 (RowParallelLinear): Linear transformation for the second layer.
            w3 (ColumnParallelLinear): Linear transformation for the third layer.

        """
        super().__init__()
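        # Shrink to 2/3 of the requested size so that three matrices (w1, w2, w3)
        # cost about as many parameters as a standard two-matrix FFN.
        hidden_dim = int(2 * hidden_dim / 3)
        # custom dim factor multiplier
        if ffn_dim_multiplier is not None:
            hidden_dim = int(ffn_dim_multiplier * hidden_dim)
        # Round up to the nearest multiple of `multiple_of`.
        hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)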

        self.w1 = ColumnParallelLinear(
            dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
        )
        self.w2 = RowParallelLinear(
            hidden_dim, dim, bias=False, input_is_parallel=True, init_method=lambda x: x
        )
        self.w3 = ColumnParallelLinear(
            dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
        )

    def forward(self, x):
        # SiLU(w1(x)) gates w3(x) element-wise; w2 projects back to `dim`.
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

Figure: Llama 3 feedforward layer
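The hidden-size arithmetic in `__init__` is easy to check by hand. The sketch below repeats only that computation as a standalone function. The helper name `ffn_hidden_dim` is hypothetical, and two things are assumptions rather than something stated in this post: that the caller passes `hidden_dim = 4 * dim`, and that the example values match the published Llama 3 8B and Llama 2 7B configurations.

```python
from typing import Optional


def ffn_hidden_dim(
    dim: int,
    multiple_of: int,
    ffn_dim_multiplier: Optional[float] = None,
) -> int:
    """Repeat the hidden-size computation from FeedForward.__init__."""
    hidden_dim = 4 * dim  # assumption: the caller passes 4 * dim
    hidden_dim = int(2 * hidden_dim / 3)
    if ffn_dim_multiplier is not None:
        hidden_dim = int(ffn_dim_multiplier * hidden_dim)
    # Round up to the nearest multiple of `multiple_of`.
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)


# Assumed Llama 3 8B style values: dim=4096, multiple_of=1024, multiplier=1.3
print(ffn_hidden_dim(4096, 1024, 1.3))  # 14336
# Assumed Llama 2 7B style values: dim=4096, multiple_of=256, no multiplier
print(ffn_hidden_dim(4096, 256))  # 11008
```

Step by step for the first call: 4 * 4096 = 16384, two thirds of that truncates to 10922, multiplying by 1.3 truncates to 14198, and rounding up to a multiple of 1024 gives 14336.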

GELU activation function

Activation functions commonly used in deep learning networks:

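Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$

tanh: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

ReLU: $\mathrm{ReLU}(x) = \max(0, x)$

Leaky ReLU: $\mathrm{LeakyReLU}(x) = \max(\alpha x, x)$, with a small slope $0 < \alpha < 1$ (for example $\alpha = 0.01$)

ELU (Exponential Linear Unit): $\mathrm{ELU}(x) = x$ for $x > 0$, and $\alpha (e^{x} - 1)$ for $x \le 0$

GELU: $\mathrm{GELU}(x) = x \, P(X \le x) = x \, \Phi(x)$

where $X$ follows the standard normal distribution and $\Phi$ is its cumulative distribution function.

Note that the Llama code above calls `F.silu`, not GELU. SiLU (also called Swish) is $\mathrm{SiLU}(x) = x \, \sigma(x)$, and its curve is similar in shape to GELU.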

Figure: activation functions
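To make the list concrete, here is a minimal scalar sketch of the standard definitions of these activation functions in plain Python. It is meant for checking values by hand, not as a replacement for the PyTorch versions. The default `alpha` values are common choices assumed here, not something this post specifies, and `silu` is included because it is the function the Llama code calls.

```python
import math


def sigmoid(x: float) -> float:
    # Split on the sign of x so that exp() never overflows.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x: float) -> float:
    return math.tanh(x)


def relu(x: float) -> float:
    return max(0.0, x)


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    # alpha is the slope for negative inputs (assumed default).
    return x if x > 0 else alpha * x


def elu(x: float, alpha: float = 1.0) -> float:
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def gelu(x: float) -> float:
    # x * Phi(x), with Phi the standard normal CDF written via erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x: float) -> float:
    # Also called Swish: x * sigmoid(x).
    return x * sigmoid(x)


for name, fn in [("relu", relu), ("gelu", gelu), ("silu", silu)]:
    print(name, [round(fn(v), 4) for v in (-2.0, -1.0, 0.0, 1.0, 2.0)])
```

Unlike ReLU, both GELU and SiLU are smooth and dip slightly below zero for small negative inputs, which the printed values make visible.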

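Gating mechanism

w1 and w3 produce two independent projections of the input. The w1 branch passes through SiLU and is multiplied element-wise (*) with the w3 branch, which produces the gating effect. w2 then projects the gated features back to the original dimension.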
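The data flow through w1, w3 and w2 can be sketched without PyTorch or model parallelism. The version below works on a single input vector with weights stored as nested lists. The function names and the tiny weight matrices are made up for illustration and only show the shapes involved: w1 and w3 are `hidden x dim`, w2 is `dim x hidden`.

```python
import math


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def gated_ffn(x, w1, w2, w3):
    """Compute w2(silu(w1 x) * (w3 x)) for one input vector."""
    gate = [silu(h) for h in matvec(w1, x)]        # gate branch, length hidden
    value = matvec(w3, x)                          # value branch, length hidden
    gated = [g * v for g, v in zip(gate, value)]   # element-wise product
    return matvec(w2, gated)                       # back to length dim


# Illustrative weights: dim = 2, hidden = 3.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w3 = [[1.0, 1.0], [1.0, -1.0], [0.0, 1.0]]
w2 = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]

print(gated_ffn([1.0, -1.0], w1, w2, w3))
```

Where the gate branch is close to zero, the corresponding hidden feature is suppressed regardless of the value branch, which is what "gating" refers to here.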

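GLU, the Gated Linear Unit, was proposed in 2016 in the paper "Language Modeling with Gated Convolutional Networks". In 2020, Google proposed several variants in "GLU Variants Improve Transformer". The variant used in the code above, with SiLU (Swish) as the gate activation, is called SwiGLU.

All models in the Llama series use this gating mechanism.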
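For reference, the variants differ only in the activation applied to the gate branch. The block below writes them without bias terms, following the notation of the GLU variants paper as best recalled here. The correspondence to the code above ($W$ = w1, $V$ = w3, $W_2$ = w2) is an interpretation, not something the papers state about Llama.

```latex
\begin{aligned}
\mathrm{GLU}(x)    &= \sigma(xW) \otimes xV \\
\mathrm{ReGLU}(x)  &= \max(0,\, xW) \otimes xV \\
\mathrm{GEGLU}(x)  &= \mathrm{GELU}(xW) \otimes xV \\
\mathrm{SwiGLU}(x) &= \mathrm{Swish}_{1}(xW) \otimes xV,
  \qquad \mathrm{Swish}_{1}(z) = z\,\sigma(z) = \mathrm{SiLU}(z) \\
\mathrm{FFN}_{\mathrm{SwiGLU}}(x) &= \bigl(\mathrm{Swish}_{1}(xW) \otimes xV\bigr)\, W_{2}
\end{aligned}
```

The last line is the same computation as `self.w2(F.silu(self.w1(x)) * self.w3(x))`.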

MoE (Mixture of Experts)

MoE - papers
MoE - history
The paper "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (2017) introduced the sparsely-gated mixture-of-experts layer.
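As a rough illustration of what "sparsely gated" means, the sketch below keeps only the top-k gate logits, normalizes them with a softmax, and evaluates only the selected experts. It is a simplified sketch, not the paper's full method: the noise term and load-balancing losses are left out, the gate logits are passed in directly instead of being computed from the input, and all names are hypothetical.

```python
import math


def top_k_gates(logits, k):
    """Softmax over the k largest logits; every other gate is exactly 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


def moe_forward(x, gate_logits, experts, k=2):
    """Weighted sum of the outputs of the k selected experts."""
    gates = top_k_gates(gate_logits, k)
    out = [0.0] * len(x)
    for g, expert in zip(gates, experts):
        if g == 0.0:
            continue  # unselected experts are never evaluated
        y = expert(x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out


# Toy experts: each maps a vector to a vector of the same length.
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [-a for a in v],
    lambda v: [a + 1.0 for a in v],
]
print(top_k_gates([1.0, 3.0, 2.0, 0.0], k=2))
print(moe_forward([1.0, 2.0], [0.0, 5.0, 1.0], experts, k=1))
```

In a Transformer, each expert would typically be a feedforward block like the one described earlier, so the parameter count grows with the number of experts while the per-token compute depends only on k.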

https://wenzhaoabc.github.io/llm/feedforward/

Author: wenzhaoabc

Published: March 26, 2025
