Story Details

  • Transformers Without Normalization

    Posted: 2025-03-15 03:12:39

    This paper introduces Dynamic Tanh (DyT), an element-wise operation proposed as a drop-in replacement for the normalization layers in Transformers. In place of Layer Normalization, DyT computes tanh(αx) with a learnable scalar α, followed by the usual learnable per-channel scale and shift. The design is motivated by the observation that layer normalization in trained Transformers often produces tanh-like, S-shaped input-output mappings. Experiments across tasks ranging from image recognition and generation to language modeling are reported to show that Transformers with DyT match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. Because DyT is element-wise and does not compute activation statistics, it may also reduce computational overhead. The approach offers a promising alternative to traditional normalization layers and challenges the assumption that they are indispensable for training Transformers.
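
    As a rough illustration of replacing layer normalization with an element-wise operation, the sketch below assumes the Dynamic Tanh form from the linked paper, gamma * tanh(alpha * x) + beta, and compares it with a plain layer normalization on one feature vector. The function names, the example input, and the starting value alpha = 0.5 are assumptions made for illustration. In a real model alpha, gamma, and beta would be learned parameters; here they are fixed constants.

```python
import math


def layer_norm(x, eps=1e-5):
    """Reference layer normalization over one token's feature vector.

    Needs the mean and variance of the whole vector.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def dyt(x, alpha=0.5, gamma=None, beta=None):
    """Dynamic Tanh sketch: gamma * tanh(alpha * x) + beta, element-wise.

    No statistics are computed; each element is squashed independently.
    alpha is a single scalar; gamma and beta are per-channel (assumed
    to default to ones and zeros here).
    """
    n = len(x)
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n
    return [g * math.tanh(alpha * v) + b for v, g, b in zip(x, gamma, beta)]


# Hypothetical activations with one large outlier.
x = [-4.0, -1.0, 0.0, 2.0, 10.0]
print("layer_norm:", layer_norm(x))
print("dyt:       ", dyt(x))
```

    Both functions keep the outlier from dominating the output, but only layer_norm depends on the other elements of the vector; dyt treats each element on its own, which is what makes it cheap to compute.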

    Summary of Comments (24)
    https://news.ycombinator.com/item?id=43369633

    Hacker News users discussed the implications of removing layer normalization in Transformers, as proposed in the linked paper. Several commenters expressed skepticism, questioning the generalizability of the results beyond the specific tasks and datasets tested. Some pointed out potential issues with the proposed dynamic weight initialization and its computational cost. Others were more optimistic, finding the idea intriguing and wondering about its potential application in other architectures like RNNs. The robustness of the approach to different batch sizes was also a topic of discussion, with concerns about its performance with small batches. Finally, a few commenters questioned the necessity of removing layer normalization altogether, suggesting that simpler adjustments or alternative normalization methods might suffice.