The author details their multi-layered approach to combating bot traffic on their small, independent website. Instead of relying on a single, potentially bypassable solution like CAPTCHA, they employ a combination of smaller, less intrusive techniques: rate limiting, hidden honeypot fields, user-agent string analysis, and JavaScript checks. This strategy aims to make automated form submission more difficult and resource-intensive for bots while minimizing friction for legitimate users. The author acknowledges it isn't foolproof but believes the cumulative effect of these small hurdles deters most unwanted bot activity.
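As a rough illustration of how such layers can stack on a single form endpoint, the sketch below combines a per-IP rate limit, a honeypot field, and a user-agent check in a small Flask handler. The framework, field names, and thresholds are assumptions chosen for illustration, not the author's actual code.

```python
# Minimal sketch of layered bot checks on a form endpoint (Flask).
# Field names, thresholds, and the framework itself are illustrative
# assumptions, not the author's implementation.
import time
from collections import defaultdict, deque

from flask import Flask, request, abort

app = Flask(__name__)

# Rough per-IP rate limiter: at most 5 submissions per 60 seconds.
RATE_LIMIT, WINDOW = 5, 60
recent_hits = defaultdict(deque)

SUSPICIOUS_AGENTS = ("python-requests", "curl", "wget", "scrapy")

@app.route("/contact", methods=["POST"])
def contact():
    ip = request.remote_addr or "unknown"
    now = time.time()

    # Layer 1: rate limiting per source IP.
    hits = recent_hits[ip]
    while hits and now - hits[0] > WINDOW:
        hits.popleft()
    if len(hits) >= RATE_LIMIT:
        abort(429)
    hits.append(now)

    # Layer 2: honeypot field -- hidden via CSS in the rendered form,
    # so a real user leaves it empty while naive bots fill it in.
    if request.form.get("website", ""):
        abort(400)

    # Layer 3: crude user-agent check.
    ua = (request.headers.get("User-Agent") or "").lower()
    if not ua or any(token in ua for token in SUSPICIOUS_AGENTS):
        abort(403)

    # All hurdles cleared; handle the submission normally.
    return "Thanks!", 200
```

No single layer here is hard to defeat on its own; the point, as the author argues, is that each one filters out a different slice of low-effort automation at negligible cost to real visitors.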
The blog post discusses the increasing trend of websites using JavaScript-based "proof of work" systems to deter web scraping. These systems force clients to perform computationally expensive calculations in the browser before content is served, making automated scraping slower and more resource-intensive. The author argues this approach is ultimately flawed: while it might slow down unsophisticated scrapers, determined adversaries can reverse-engineer the JavaScript, bypass the proof of work, or simply use headless browsers that render the page fully. The author concludes that these systems primarily harm legitimate users, particularly those with low-powered devices or slow internet connections, while providing only a superficial barrier to dedicated scrapers.
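To make the mechanism concrete, the sketch below shows the general shape of a hash-based proof of work: the server issues a challenge that is cheap to verify, and the client must burn CPU searching for a nonce that satisfies it. It is written in Python for brevity (real deployments run the solver in browser JavaScript), and the challenge format and difficulty are illustrative assumptions rather than any particular product's scheme.

```python
# Conceptual sketch of a hash-based proof-of-work challenge.
# Shown in Python for brevity; real systems run solve() as JavaScript
# in the visitor's browser. Format and difficulty are assumptions.
import hashlib
import secrets

DIFFICULTY = 20  # required number of leading zero bits in the hash

def issue_challenge() -> str:
    """Server side: hand the client a random challenge string."""
    return secrets.token_hex(16)

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce until the hash clears the difficulty."""
    nonce = 0
    target = 1 << (256 - DIFFICULTY)
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    """Server side: checking a solution costs one hash; finding one costs many."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - DIFFICULTY))

if __name__ == "__main__":
    challenge = issue_challenge()
    nonce = solve(challenge)          # the expensive step the client must pay
    assert verify(challenge, nonce)   # the cheap step the server performs
    print(f"solved {challenge} with nonce {nonce}")
```

The asymmetry (many hashes to solve, one hash to verify) is what makes scraping at scale more expensive, and it is also why the cost falls hardest on slow client devices, which is the author's central objection.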
HN commenters discuss the effectiveness and ethics of JavaScript "proof of work" anti-scraper systems. Some argue that these systems are easily bypassed by sophisticated scrapers, while inconveniencing legitimate users, particularly those with older hardware or disabilities. Others point out the resource cost these systems impose on both clients and servers. The ethical implications of blocking access to public information are also raised, with some arguing that if the data is publicly accessible, scraping it shouldn't be artificially hindered. The conversation also touches on alternative anti-scraping methods like rate limiting and fingerprinting, and the general cat-and-mouse game between website owners and scrapers. Several users suggest that a better approach is to offer an official API for data access, thus providing a legitimate avenue for obtaining the desired information.
Summary of Comments (56)
https://news.ycombinator.com/item?id=44142761
HN users generally agreed with the author's approach of using multiple small tools to combat bots. Several commenters shared their own similar strategies, emphasizing the effectiveness and lower maintenance overhead of combining smaller, specialized tools over relying on large, complex solutions. Some highlighted specific tools like Fail2ban and CrowdSec. Others discussed the philosophical appeal of this approach, likening it to the Unix philosophy. A few questioned the long-term viability, anticipating bots adapting to these measures. The overall sentiment, however, favored the practicality and efficiency of this "death by a thousand cuts" bot mitigation strategy.
The Hacker News post "Using lots of little tools to aggressively reject the bots" sparked a discussion with a moderate number of comments, focusing primarily on the effectiveness and practicality of the author's approach to bot mitigation.
Several commenters expressed skepticism about the long-term viability of the author's strategy. They argued that relying on numerous small, easily bypassed hurdles merely slows down sophisticated bots temporarily. These commenters suggested focusing on robust authentication and stricter validation methods as more effective long-term solutions. One commenter specifically pointed out that CAPTCHAs, while annoying to users, present a more significant challenge to bots than minor inconveniences like hidden form fields.
Another line of discussion revolved around the trade-off between bot mitigation and user experience. Some commenters felt the author's approach, while effective against some bots, could negatively impact the experience of legitimate users. They argued that the cumulative effect of multiple small hurdles could create friction and frustration for real people.
A few commenters offered alternative or complementary approaches to bot mitigation. Suggestions included rate limiting, analyzing user behavior patterns, and using honeypots to trap bots. One commenter suggested that a combination of different techniques, including the author's small hurdles approach, would likely be the most effective strategy.
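One of those behaviour-based checks lends itself to a short illustration: embedding a signed timestamp in the form lets the server reject submissions that come back faster than a person could plausibly type. The HMAC scheme, field handling, and three-second threshold below are assumptions for the sketch, not a technique attributed to any specific commenter.

```python
# Sketch of a timing-based behaviour check: reject form submissions that
# return faster than a human could fill the form. The secret, token format,
# and threshold are illustrative assumptions.
import hashlib
import hmac
import time

SECRET = b"replace-with-a-real-secret"
MIN_FILL_SECONDS = 3

def issue_timestamp_token() -> str:
    """Embed this value in a hidden field when the form is rendered."""
    ts = str(int(time.time()))
    sig = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    return f"{ts}:{sig}"

def submitted_too_fast(token: str) -> bool:
    """Return True if the token is missing, forged, or came back too quickly."""
    try:
        ts, sig = token.split(":", 1)
    except ValueError:
        return True
    expected = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return True
    return time.time() - int(ts) < MIN_FILL_SECONDS
```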
Some commenters also questioned the motivation and sophistication of the bots targeting the author's website. They speculated that the bots might be relatively simple and easily deterred, making the author's approach sufficient in that specific context. However, they cautioned that this approach might not be enough to protect against more sophisticated, determined bots.
Finally, a few commenters shared their own experiences with bot mitigation, offering anecdotal evidence both supporting and contradicting the author's claims. These personal experiences highlighted the varied nature of bot activity and the need for tailored solutions depending on the specific context and target audience. Overall, the comments presented a balanced perspective on the author's approach, acknowledging its potential benefits while also highlighting its limitations and potential drawbacks.