DeepSeek's 3FS is a distributed file system designed for large language models (LLMs) and AI training, prioritizing throughput over latency. It achieves this by utilizing a custom kernel bypass network stack and RDMA to minimize overhead. 3FS employs a metadata service for file discovery and a scale-out object storage approach with configurable redundancy. Preliminary benchmarks demonstrate significantly higher throughput compared to NFS and Ceph, particularly for large files and sequential reads, making it suitable for the demanding I/O requirements of large-scale AI workloads.
The paper "File Systems Unfit as Distributed Storage Back Ends" argues that relying on traditional file systems for distributed storage systems leads to significant performance and scalability bottlenecks. It identifies fundamental limitations in file systems' metadata management, consistency models, and single points of failure, particularly in large-scale deployments. The authors propose that purpose-built storage systems designed with distributed principles from the ground up, rather than layered on top of existing file systems, are necessary for achieving optimal performance and reliability in modern cloud environments. They highlight how issues like metadata scalability, consistency guarantees, and failure handling are better addressed by specialized distributed storage architectures.
HN commenters generally agree with the paper's premise that traditional file systems are poorly suited for distributed storage backends. Several highlighted the impedance mismatch between POSIX semantics and distributed systems, citing issues with consistency, metadata management, and performance bottlenecks. Some questioned the novelty of the paper's findings, arguing these limitations are well-known. Others discussed alternative approaches like object storage and databases, emphasizing the importance of choosing the right tool for the job. A few commenters offered anecdotal experiences supporting the paper's claims, while others debated the practicality of replacing existing file system-based infrastructure. One compelling comment suggested that the paper's true contribution lies in quantifying the performance overhead, rather than merely identifying the issues. Another interesting discussion revolved around whether "cloud-native" storage solutions truly address these problems or merely abstract them away.
Summary of Comments ( 35 )
https://news.ycombinator.com/item?id=43716058
Hacker News users discuss DeepSeek's new distributed file system, focusing on its performance and design choices. Several commenters question the need for a new distributed file system given existing solutions like Ceph and GlusterFS, prompting discussion around DeepSeek's specific niche targeting AI workloads. Performance claims are met with skepticism, with users requesting more detailed benchmarks and comparisons to established systems. The decision to use Rust is praised by some for its performance and safety features, while others express concerns about the relatively small community and potential debugging challenges. Some commenters also delve into the technical details of the system, particularly its metadata management and consistency guarantees. Overall, the discussion highlights a cautious interest in DeepSeek's offering, with a desire for more data and comparisons to validate its purported advantages.
The Hacker News post titled "An Intro to DeepSeek's Distributed File System" (linking to https://maknee.github.io/blog/2025/3FS-Performance-Journal-1/) has generated several comments discussing various aspects of the presented file system.
One commenter questions the choice of Go for implementing the file system, expressing concerns about Go's garbage collection potentially impacting tail latency for critical operations. They suggest Rust or C++ as alternatives that might offer more predictable performance. This sparked a small discussion, with another commenter suggesting that while Go's GC might be a concern in some high-performance scenarios, optimizations and careful tuning could mitigate its impact, especially given the focus on throughput over latency in this particular file system.
Another thread of discussion focuses on the architectural decisions of 3FS, particularly the claimed efficiency advantages of shared-nothing and avoiding POSIX compliance. A commenter praises the approach of eschewing POSIX for a cleaner, more performant design, contrasting it with the complexities and overhead often associated with POSIX compliance. Another user chimes in, expressing skepticism about the ability to completely avoid POSIX compatibility in practice, especially if broader adoption is a goal, suggesting that the eventual need to interact with POSIX-compliant tools and workflows might necessitate some level of integration down the line.
The author of the blog post (and presumably the file system) engages in the comments, responding to several inquiries. They clarify specific design choices, providing context around the target workloads and performance goals. They also address the POSIX compatibility concerns, acknowledging the potential need for a translation layer in the future while emphasizing the current focus on optimizing for their specific use case.
Furthermore, a commenter raises questions about the availability and resilience of the system, particularly in the face of hardware failures. They inquire about the mechanisms in place for data replication and recovery, emphasizing the importance of robust failure handling in a distributed file system.
Overall, the comments section demonstrates a mix of curiosity, skepticism, and praise for the presented file system. The commenters delve into technical details, offering informed opinions on the design choices and potential tradeoffs. The author's active participation adds valuable context and clarifies several aspects of the system.