Wikipedia offers free downloads of its database in various formats. These include compressed XML dumps of all textual content (articles, metadata, and edit histories), available both as current-revisions-only snapshots and as full-history versions, plus smaller, specialized extracts such as article text only or individual language editions. Users can also access the data through alternative interfaces like the Wikipedia API or third-party tools. The download page provides detailed instructions and links to resources for working with the large datasets, along with warnings about server load and responsible usage.
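For many use cases, pulling individual articles through the MediaWiki Action API mentioned above is enough and avoids the dumps entirely. A minimal sketch in Python, assuming the third-party `requests` package (the User-Agent string is a placeholder; Wikimedia's API etiquette asks for real contact details):

```python
# Fetch one article's current wikitext via the MediaWiki Action API.
import requests

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "wiki-experiments/0.1 (you@example.com)"}  # placeholder

def fetch_wikitext(title: str) -> str:
    """Return the current wikitext of a single article."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "formatversion": "2",
        "titles": title,
    }
    resp = requests.get(API, params=params, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    page = resp.json()["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]

print(fetch_wikitext("Wikipedia:Database download")[:300])
```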
The interim U.S. attorney for the District of Columbia, Ed Martin, has questioned the Wikimedia Foundation's nonprofit status. In a letter to the foundation, Martin raised concerns about potential misuse of donations, citing large reserves, high executive compensation, and expenditures on projects seemingly unrelated to its core mission of freely accessible knowledge. He suggested these activities could amount to private inurement or private benefit, violations that could jeopardize the foundation's tax-exempt status. The letter requests information about the foundation's finances and governance and sets a deadline for a response. While Wikimedia maintains confidence in its compliance, the inquiry represents a significant challenge to its operational model.
Several Hacker News commenters express skepticism about the US Attorney's investigation into Wikimedia's non-profit status, viewing it as politically motivated and based on a misunderstanding of how Wikipedia operates. Some highlight the absurdity of the claims, pointing out the vast difference in resources between Wikimedia and for-profit platforms like Google and Facebook. Others question the letter's focus on advertising, arguing that the fundraising banners are non-intrusive and essential for maintaining a free and open encyclopedia. A few commenters suggest that the investigation could be a pretext for more government control over online information. There's also discussion about the potential impact on Wikimedia's fundraising efforts and the broader implications for online non-profits. Some users point out the irony of the US government potentially hindering a valuable resource it frequently utilizes.
Professional photographers are contributing high-quality portraits to Wikipedia to replace the often unflattering or poorly lit images currently used for many celebrity entries. Driven by a desire to improve the visual quality of the encyclopedia and provide a more accurate representation of these public figures, these photographers are donating their work or releasing it under free licenses. They aim to create a more respectful and professional image for Wikipedia while offering a readily available resource for media outlets and the public.
HN commenters generally agree that Wikipedia's celebrity photos are often unflattering or outdated. Several suggest that the issue isn't solely the photographers' fault, pointing to Wikipedia's stringent image licensing requirements and complex upload process as significant deterrents for professional photographers contributing high-quality work. Some commenters discuss the inherent challenges of representing public figures, balancing the desire for flattering images with the need for neutral and accurate representation. Others debate the definition of "bad" photos, arguing that some unflattering images simply reflect reality. A few commenters highlight the role of automated tools and bots in perpetuating the problem by automatically selecting images based on arbitrary criteria. Finally, some users share personal anecdotes about attempting to upload better photos to Wikipedia, only to be met with bureaucratic hurdles.
Summary of Comments (39)
https://news.ycombinator.com/item?id=43811732
Hacker News users discussed various aspects of downloading and using Wikipedia's database. Several commenters highlighted the resource intensity of processing the full database, with mentions of multi-terabyte storage requirements and the need for significant processing power. Some suggested alternative approaches for specific use cases, such as using Wikipedia's API or pre-processed datasets like the one offered by the Wikimedia Foundation. Others discussed the challenges of keeping a local copy updated and the potential legal implications of redistributing the data. The value of having a local copy for offline access and research was also acknowledged. There was some discussion around specific tools and formats for working with the downloaded data, including tips for parsing and querying the XML dumps.
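To make the parsing tips concrete, one common pattern is to stream a compressed pages-articles dump through an incremental XML parser rather than decompressing it first. This is a sketch of that idea, not any specific commenter's code; the filename is illustrative, and the export-schema version in the namespace varies between dumps:

```python
# Stream (title, wikitext) pairs out of a compressed dump without
# ever holding the whole file, or the whole tree, in memory.
import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"  # illustrative local path
NS = "{http://www.mediawiki.org/xml/export-0.11/}"  # check your dump's header

def iter_pages(path):
    """Yield (title, wikitext) one page at a time."""
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                yield title, text
                elem.clear()  # free each parsed page, or RAM use grows unbounded

for i, (title, _) in enumerate(iter_pages(DUMP)):
    print(title)
    if i == 4:  # just show the first few pages
        break
```

The `elem.clear()` call is the crux: without it, iterparse quietly builds the full document tree, and a multi-gigabyte dump will exhaust memory.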
The Hacker News post titled "Wikipedia: Database Download" (https://news.ycombinator.com/item?id=43811732) drew a moderate number of comments (39 at the time of this summary) discussing various aspects of downloading and using Wikipedia's database dumps.
Several comments focus on the practical challenges and considerations related to downloading and processing the large datasets. One user points out the significant disk space requirements, even for compressed versions of the dumps, advising potential downloaders to carefully assess their storage capacity. Another comment highlights the computational resources needed to process the data, mentioning the RAM and processing power required for tasks like parsing and indexing. A separate thread discusses the various download options, including using BitTorrent for faster downloads and the availability of smaller, more specific dumps for those not needing the entire dataset.
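One cheap way to assess storage before committing to a download is an HTTP HEAD request, which reports the compressed size without transferring the file. A sketch (the URL follows the naming pattern on dumps.wikimedia.org, but consult the listing for current files):

```python
# Check a dump's compressed size before downloading it.
import requests

url = ("https://dumps.wikimedia.org/enwiki/latest/"
       "enwiki-latest-pages-articles.xml.bz2")

resp = requests.head(url, allow_redirects=True, timeout=30)
resp.raise_for_status()
size_gb = int(resp.headers["Content-Length"]) / 1e9
print(f"Compressed download: {size_gb:.1f} GB")
# Per the thread's warnings: budget several times this figure for the
# decompressed XML, plus more for any index built on top of it.
```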
Some users discuss the utility of having a local copy of Wikipedia. One comment mentions using the Kiwix offline reader, which allows access to a local copy of Wikipedia without the need for complex processing. Others discuss the potential for using the data for research, natural language processing tasks, and personal projects like building a local search engine. A particular comment thread delves into the technical details of setting up a local search index using tools like Xapian and Lucene.
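As a rough illustration of the local-search idea: the thread names Xapian and Lucene, but the same pattern can be sketched with SQLite's built-in FTS5 full-text index, which needs no external services and ships with most Python builds. This assumes the hypothetical `iter_pages()` helper and `DUMP` path from the dump-parsing sketch above:

```python
# Build a small full-text index over dump contents with SQLite FTS5
# (a stand-in here for the Xapian/Lucene setups discussed in the thread).
import sqlite3

con = sqlite3.connect("wiki_index.db")
con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(title, body)")

with con:  # one transaction for the bulk insert
    for title, text in iter_pages(DUMP):  # from the earlier sketch
        con.execute("INSERT INTO pages VALUES (?, ?)", (title, text))

# Query: best-ranked matches first.
for (title,) in con.execute(
        "SELECT title FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT 5",
        ("open source encyclopedia",)):
    print(title)
```

FTS5 handles tokenization and ranking out of the box; dedicated engines like Xapian or Lucene become worth the setup cost when you need finer control over stemming and scoring or indexes larger than a single SQLite file handles comfortably.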
The licensing of the Wikipedia data is also a topic of discussion. A user clarifies that the text is available under the Creative Commons Attribution-ShareAlike (CC BY-SA) license, emphasizing the importance of proper attribution when reusing the content.
A few comments touch on the history of Wikipedia dumps and how the process has evolved over time. One user reminisces about downloading Wikipedia dumps on DVDs in the past.
While no single comment stands out, the discussion as a whole offers useful insight into the practicalities and potential uses of the Wikipedia database dumps, covering hardware requirements, software tools, licensing, and the history of data availability. Taken together, the comments make a solid starting point for anyone considering working with Wikipedia's data offline.