Story Details

  • DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL

    Posted: 2025-02-11 19:59:00

    Researchers have trained DeepScaleR, a 1.5 billion parameter language model, using reinforcement learning (RL) on reasoning problems. They demonstrate that scaling RL training is crucial for performance improvements and that their model surpasses OpenAI's o1-preview on several benchmarks, including competition math. DeepScaleR achieves this with a scaling recipe built around high-quality RL training data, training stability, and iteratively increasing the generation context length, enabling a small model to be trained efficiently to strong reasoning performance. This work suggests that continued scaling of RL holds significant promise for further advancements in language model capabilities.
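
    For context, the "scaling RL" recipe summarized above is outcome-reward reinforcement learning: sample several completions per problem, score each against the known answer, and push the policy toward the higher-scoring samples. Below is a minimal sketch of that idea using a group-relative (GRPO-style) advantage and a verifiable \boxed{} answer reward; this is one common recipe for this kind of training, and the helper names and reward format are illustrative assumptions, not the authors' code.

      import re
      import torch

      def extract_final_answer(completion):
          # Take the last \boxed{...} span in the completion as the model's answer.
          matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
          return matches[-1].strip() if matches else None

      def verifiable_reward(completion, reference):
          # Binary outcome reward: 1.0 only when the final answer matches the reference.
          answer = extract_final_answer(completion)
          return 1.0 if answer is not None and answer == reference.strip() else 0.0

      def group_relative_advantages(rewards):
          # Standardize rewards within the group of samples drawn for one prompt,
          # so no learned value model is needed (GRPO-style baseline).
          return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

      # Toy usage: four sampled completions for a prompt whose reference answer is "42".
      completions = [
          "... so the result is \\boxed{42}",
          "... therefore \\boxed{41}",
          "... giving \\boxed{42}",
          "The answer is unclear.",
      ]
      rewards = torch.tensor([verifiable_reward(c, "42") for c in completions])
      advantages = group_relative_advantages(rewards)
      print(rewards, advantages)  # correct samples get positive advantage, incorrect negative

    In a full training loop these advantages would weight a clipped policy-gradient loss over the sampled tokens, and the staged increase in generation context length mentioned above would appear as a schedule on the maximum completion length.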

    Summary of Comments (99)
    https://news.ycombinator.com/item?id=43017599

    HN commenters discuss DeepScaleR's impressive performance but question the computational cost of the RL training behind it. Several point out the diminishing returns of scaling, suggesting that smaller, more efficient models might achieve similar results with further optimization. The lack of open-sourcing and limited details about the training process also draw criticism, hindering reproducibility and wider community evaluation. Some express skepticism about the real-world applicability of such a model and call for more focus on robustness and safety in reinforcement learning research. Finally, there's a discussion around the environmental impact of training these models and the need for more sustainable approaches.