Mullvad Leta is a new, free, open-source, privacy-focused search engine from Mullvad, currently in alpha. It protects user privacy by neither logging searches nor personalizing results. Rather than running its own crawler, Leta proxies queries to an upstream search API and caches the results, so repeated queries can be served without re-contacting the upstream engine. While currently limited in features and scope compared to established search engines, it aims to offer a viable alternative focused on privacy and transparency.
This blog post details building a basic search engine in Python. It focuses on core concepts, walking through the creation of an inverted index from a collection of web pages fetched with the requests library. The index maps words to the pages they appear on, enabling keyword search. The implementation prioritizes simplicity and educational value over performance or scalability, employing straightforward data structures like dictionaries and lists. It covers tokenization, stemming with NLTK, and basic scoring based on term frequency. Ultimately, the project demonstrates the fundamental logic behind search engine functionality in a clear and accessible manner.
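To make the mechanics concrete, here is a minimal sketch in the spirit of the post (not the author's actual code), assuming a small in-memory corpus: tokenize with a regular expression, stem with NLTK's PorterStemmer, and score matches by raw term frequency.

```python
import re
from collections import defaultdict

from nltk.stem import PorterStemmer  # pip install nltk; PorterStemmer needs no corpus download

stemmer = PorterStemmer()

def tokenize(text):
    # Lowercase, split into word tokens, and stem each token.
    return [stemmer.stem(token) for token in re.findall(r"\w+", text.lower())]

def build_index(pages):
    # Inverted index: stemmed term -> {url: term frequency on that page}.
    index = defaultdict(lambda: defaultdict(int))
    for url, text in pages.items():
        for term in tokenize(text):
            index[term][url] += 1
    return index

def search(index, query):
    # Score each page by summing the frequencies of all query terms it contains.
    scores = defaultdict(int)
    for term in tokenize(query):
        for url, freq in index.get(term, {}).items():
            scores[url] += freq
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

pages = {
    "https://example.com/a": "Search engines map words to the documents that contain them.",
    "https://example.com/b": "An inverted index maps each word to its documents.",
}
print(search(build_index(pages), "inverted index"))
```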
Hacker News users generally praised the simplicity and educational value of the described search engine. Several commenters appreciated the author's clear explanation of the underlying concepts and the accessible code example. Some suggested improvements, such as using a stemmer for better search relevance, or exploring alternative ranking algorithms like BM25. A few pointed out the limitations of such a basic approach for real-world applications, emphasizing the complexities of handling scale and spam. One commenter shared their experience building a similar project and recommended resources for further learning. Overall, the discussion focused on the project's pedagogical merits rather than its practical utility.
Y Combinator's amicus brief argues that Google's dominance in search and its preferential treatment of its own vertical search services harm competition and innovation, ultimately hurting consumers and startups. They contend that Google leverages its search monopoly to stifle competition in adjacent markets, preventing startups from reaching consumers and diminishing the incentive for innovation. This behavior creates a closed ecosystem that favors Google's own products, even when superior alternatives exist. YC highlights the difficulty startups face in competing against Google's self-preferencing and emphasizes the importance of a competitive search landscape for the continued dynamism of the internet and the broader economy.
HN commenters discuss YC's amicus brief, largely agreeing with its arguments against Google's anti-competitive practices in search. Several highlight the brief's focus on how Google's dominance stifles innovation by controlling distribution and manipulating search results to favor its own vertical search products. Some express skepticism about the government's chances of success, citing the difficulty of proving consumer harm and the power of Google's lobbying efforts. Others see the brief as a strong defense of startup ecosystems and a necessary challenge to Google's monopolistic behavior. The potential impact on AI competition is also mentioned, with concerns about Google leveraging its search dominance to control access to AI models. A few commenters critique specific aspects of the brief or suggest alternative approaches to regulation.
JudyRecords offers a free, full-text search engine for US federal and state court records. It indexes PACER documents, making them accessible without the usual PACER fees. The site aims to promote transparency and accessibility to legal information, allowing users to search across jurisdictions and case types using keywords, judge names, or party names. While the database is constantly growing, it acknowledges it may not contain every record. Users can download documents in their original format and the platform provides features like saved searches and email alerts.
Hacker News users discussed the legality and ethics of JudyRecords' full-text search of US court records, with concerns raised about the potential for misuse and abuse of sensitive information. Some questioned the legality of scraping PACER data, particularly given its paywalled nature. Others highlighted the privacy implications of making court records easily searchable, especially for individuals involved in sensitive cases like divorce or domestic violence. While acknowledging the potential benefits of increased access to legal information, commenters emphasized the need for careful consideration of the ethical implications and potential harms of such a service. Several suggested alternative approaches like focusing on specific legal areas or partnering with existing legal databases to mitigate these risks. The lack of clarity regarding JudyRecords' data sources and business model also drew criticism, with some suspecting the involvement of exploitative practices like data harvesting for marketing purposes.
Kagi's AI assistant, previously in beta, is now available to all users. It aims to provide a more private and personalized search experience by focusing on factual answers, incorporating user feedback, and avoiding generic chatbot responses. Key features include personalized summarization of search results, the ability to ask clarifying questions, and ad-free, unbiased information retrieval powered by Kagi's independent search index. Users can access the assistant directly from the search bar or a dedicated sidebar.
Hacker News users discussed Kagi Assistant's public release with cautious optimism. Several praised its speed and accuracy compared to alternatives like ChatGPT and Perplexity, particularly for coding tasks and factual queries. Some expressed concerns about the long-term viability of a subscription model for search, wondering if Kagi could maintain quality and compete with free, ad-supported giants. The integration with Kagi's existing search engine was generally seen as a positive, though some questioned its usefulness for simpler searches. A few commenters noted the potential for bias and the importance of transparency regarding the underlying model and training data. Others brought up the small company size and the challenge of scaling the service while maintaining performance and privacy. Overall, the sentiment was positive but tempered by pragmatic considerations about the future of paid search assistants.
A federal judge ruled that Google holds a monopoly in the online advertising technology market, echoing the Justice Department's claims in its antitrust lawsuit. The judge found Google's dominance in various aspects of the ad tech ecosystem, including ad buying tools for publishers and advertisers, as well as the ad exchange that connects them, gives the company an unfair advantage and harms competition. This ruling is a significant victory for the government in its effort to rein in Google's power and could potentially lead to structural changes in the company's ad tech business.
Hacker News commenters largely agree with the judge's ruling that Google holds a monopoly in online ad tech. Several highlight the conflict of interest inherent in Google simultaneously owning the dominant ad exchange and representing both buyers and sellers. Some express skepticism that structural separation, as suggested by the Department of Justice, is the right solution, arguing it could stifle innovation and benefit competitors more than consumers. A few point out the irony of the government using antitrust laws to regulate a company built on "free" products, questioning if Google's dominance truly harms consumers. Others discuss the potential impact on ad revenue for publishers and the broader implications for the digital advertising landscape. Several commenters express cynicism about the effectiveness of antitrust actions in the long run, expecting Google to adapt and maintain its substantial market power. A recurring theme is the complexity of the ad tech ecosystem, making it difficult to predict the actual consequences of any intervention.
Meilisearch is an open-source, easy-to-use search engine API. It features a typo-tolerant, fast search experience and offers AI-powered hybrid search capabilities combining keyword and semantic search for more relevant results. Developers can easily integrate Meilisearch into their applications using various SDKs and customize ranking rules, synonyms, and other settings for optimal performance and tailored search experiences.
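As a quick illustration of the integration story, here is a hedged sketch using the official meilisearch Python SDK against a local instance; the index name, documents, and master key are placeholders, and the task-waiting helper may differ slightly between SDK versions.

```python
import meilisearch

# Assumes a Meilisearch server running locally on the default port.
client = meilisearch.Client("http://localhost:7700", "aMasterKey")
index = client.index("articles")

# Indexing is asynchronous: add_documents returns a task to wait on.
task = index.add_documents([
    {"id": 1, "title": "Getting started with Meilisearch"},
    {"id": 2, "title": "Hybrid search: keywords plus embeddings"},
])
client.wait_for_task(task.task_uid)  # helper name may vary across SDK versions

# Typo-tolerant search: "Meilisarch" still matches "Meilisearch".
print(index.search("Meilisarch", {"limit": 5}))
```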
Hacker News users discussed Meilisearch's pivot towards an AI-powered hybrid search, expressing skepticism and concern. Several commenters questioned the value proposition, noting that the core competency of a search engine is accurate retrieval, not AI-powered features. Some worried that adding AI features would increase complexity and resource consumption without significantly improving search relevance. Others highlighted potential issues with cost and vendor lock-in with OpenAI's API. There was a general sentiment that focusing on core search functionality and performance would be a more beneficial direction for Meilisearch. A few commenters offered alternative solutions, like using a vector database alongside Meilisearch for semantic search capabilities. The overall tone was cautiously pessimistic, with many expressing disappointment in the shift away from a simple and performant search solution.
PostgreSQL's full-text search functionality is often unfairly labeled as slow. This perception stems from common misconfigurations and inefficient usage. The blog post demonstrates that with proper setup, including using appropriate data types (tsvector for indexed documents and tsquery for search terms), utilizing GIN indexes on tsvector columns, and leveraging stemming and other linguistic features, PostgreSQL's full-text search can be extremely performant, even on large datasets. Furthermore, optimizing queries by using appropriate operators and understanding how ranking works can significantly improve search speed. The post emphasizes that understanding and correctly implementing these techniques are key to unlocking PostgreSQL's full-text search potential.
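For illustration, a minimal version of that setup might look like the following, assuming a docs(id, body) table, PostgreSQL 12+ (for generated columns), and psycopg2; this is a sketch of the general technique, not the post's exact code.

```python
import psycopg2  # assumes a running PostgreSQL instance and an existing docs(id, body) table

conn = psycopg2.connect("dbname=demo")
cur = conn.cursor()

# Precompute a tsvector as a generated column and index it with GIN.
cur.execute("""
    ALTER TABLE docs ADD COLUMN IF NOT EXISTS search tsvector
        GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
""")
cur.execute("CREATE INDEX IF NOT EXISTS docs_search_idx ON docs USING GIN (search)")
conn.commit()

# Match with a tsquery and rank with ts_rank; the GIN index serves the @@ operator.
cur.execute("""
    SELECT id, ts_rank(search, q) AS rank
    FROM docs, websearch_to_tsquery('english', %s) AS q
    WHERE search @@ q
    ORDER BY rank DESC
    LIMIT 10
""", ("postgres full text search",))
print(cur.fetchall())
```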
Hacker News users generally agreed with the article's premise that PostgreSQL full-text search can be performant if implemented correctly. Several commenters shared their own positive experiences, highlighting the importance of proper indexing and configuration. Some pointed out that while PostgreSQL's full-text search might not outperform specialized solutions like Elasticsearch or Algolia for very large datasets or complex queries, it's more than adequate for many use cases. A few cautioned against using stemming without careful consideration, as it can lead to unexpected results. The discussion also touched upon the benefits of using pg_trgm for fuzzy matching and the trade-offs between different indexing strategies.
The author argues that Google's search quality has declined due to a prioritization of advertising revenue and its own products over relevant results. This manifests in excessive ads, low-quality content from SEO-driven websites, and a tendency to push users towards Google services like Maps and Flights, even when external options might be superior. The post criticizes the cluttered and information-poor nature of modern search results pages, lamenting the loss of a cleaner, more direct search experience that prioritized genuine user needs over Google's business interests. This degradation, the author claims, is driving users away from Google Search and towards alternatives.
HN commenters largely agree with the author's premise that Google search quality has declined. Many attribute this to increased ads, irrelevant results, and a focus on Google's own products. Several commenters shared anecdotes of needing to use specific search operators or alternative search engines like DuckDuckGo or Bing to find desired information. Some suggest the decline is due to Google's dominant market share, arguing they lack the incentive to improve. A few pushed back, attributing perceived declines to changes in user search habits or the increasing complexity of the internet. Several commenters also discussed the bloat of Google's other services, particularly Maps.
Anthropic has announced that its AI assistant, Claude, now has access to real-time web search capabilities. This allows Claude to access and process information from the web, enabling more up-to-date and comprehensive responses to user prompts. This new feature enhances Claude's abilities across various tasks, including summarization, creative writing, Q&A, and coding, by grounding its responses in current information. Users can now expect Claude to deliver more factually accurate and contextually relevant answers by leveraging the vast knowledge base available online.
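On the API side, Anthropic exposes web search as a server-side tool in its Messages API; the sketch below is a hedged illustration only, and the tool type string, model alias, and parameter names are assumptions to verify against Anthropic's current documentation.

```python
import anthropic  # assumes the official SDK and ANTHROPIC_API_KEY in the environment

client = anthropic.Anthropic()

# The tool type/name strings follow Anthropic's documented web search tool at the
# time of writing; treat them as assumptions and check the current API reference.
message = client.messages.create(
    model="claude-3-7-sonnet-latest",
    max_tokens=1024,
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 3}],
    messages=[{"role": "user", "content": "Summarize this week's PostgreSQL release notes."}],
)
print(message.content)  # text blocks grounded in search results, with citations
```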
HN commenters discuss Claude's new web search capability, with several expressing excitement about its potential to challenge Google's dominance. Some praise Claude's more conversational and contextual search results compared to traditional keyword-based approaches. Concerns were raised about the lack of source links in the initial version, potentially hindering fact-checking and further exploration. However, Anthropic quickly responded to this criticism, stating they were actively working on incorporating source links and planned to release the feature soon. Several users noted Claude's strengths in summarizing and synthesizing information, suggesting its potential usefulness for research and complex queries. Comparisons were made to Perplexity AI, another conversational search engine, with some users finding Claude more conversational and less prone to hallucinations. There's general optimism about the future of AI-powered search and Claude's role in it.
Ecosia's founders have legally restructured the company to prevent it from ever being sold, even by future owners. This ensures that Ecosia's profits will always be used to plant trees and pursue its environmental mission. The change involves a new legal structure called a "steward ownership model" and a purpose foundation that holds all voting rights. This effectively makes selling Ecosia for profit impossible, guaranteeing its long-term commitment to environmental sustainability.
Hacker News users generally praised Ecosia's commitment to its mission, viewing the legal restructuring as a positive move. Some expressed skepticism about the long-term viability of the business model and wondered how Ecosia would adapt to future challenges without the option of selling. Others questioned the specific legal mechanisms employed and compared them to other charitable structures. A few commenters also raised concerns about potential future leadership changes and how those could impact Ecosia's stated commitment. Several users shared their personal experiences with the search engine, generally positive, and discussed the tradeoffs between Ecosia and other search options.
Ecosia and Qwant, two European search engines prioritizing privacy and sustainability, are collaborating to build a new, independent European search index called the European Open Web Search (EOWS). This joint effort aims to reduce reliance on non-European indexes, promote digital sovereignty, and offer a more ethical and transparent alternative. The project is open-source and seeks community involvement to enrich the index and ensure its inclusivity, providing European users with a robust and relevant search experience powered by European values.
Several Hacker News commenters express skepticism about Ecosia and Qwant's ability to compete with Google, citing Google's massive data advantage and network effects. Some doubt the feasibility of building a truly independent index and question whether the joint effort will be significantly different from using Bing. Others raise concerns about potential bias and censorship, given the European focus. A few commenters, however, offer cautious optimism, hoping the project can provide a viable privacy-respecting alternative and contribute to a more decentralized internet. Some also express interest in the technical challenges involved in building such an index.
The Department of Justice is reportedly still pushing for Google to sell off parts of its Chrome business, even as it prepares its main antitrust lawsuit against the company for trial. Sources say the DOJ believes Google's dominance in online advertising is partly due to its control over Chrome and that divesting the browser, or portions of it, is a necessary remedy. This potential divestiture could include parts of Chrome's ad tech business and potentially even the browser itself, a significantly more aggressive move than previously reported. While the DOJ's primary focus remains its existing ad tech lawsuit, pressure for a Chrome divestiture continues behind the scenes.
HN commenters are largely skeptical of the DOJ's potential antitrust suit against Google regarding Chrome. Many believe it's a misguided effort, arguing that Chrome is free, open-source (Chromium), and faces robust competition from other browsers like Firefox and Safari. Some suggest the DOJ should focus on more pressing antitrust issues, like Google's dominance in search advertising and its potential abuse of Android. A few commenters discuss the potential implications of such a divestiture, including the possibility of a fork of Chrome or the browser becoming part of another large company. Some express concern about the potential negative impact on user privacy. Several commenters also point out the irony of the government potentially mandating Google divest from a free product.
The author attempted to build a free, semantic search engine for GitHub using a Sentence-BERT model and FAISS for vector similarity search. While initial results were promising, scaling proved insurmountable due to the massive size of the GitHub codebase and associated compute costs. Indexing every repository became computationally and financially prohibitive, particularly as the model struggled with context fragmentation from individual code snippets. Ultimately, the project was abandoned due to the unsustainable balance between cost, complexity, and the limited resources of a solo developer. Despite the failure, the author gained valuable experience in large-scale data processing, vector databases, and the limitations of current semantic search technology when applied to a vast and diverse codebase like GitHub.
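As a rough sketch of that architecture (not the author's code), pairing sentence-transformers embeddings with a flat FAISS index shows the core retrieval loop; the model choice and snippet granularity here are assumptions.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Model choice is an assumption for illustration, not the author's actual setup.
model = SentenceTransformer("all-MiniLM-L6-v2")
snippets = [
    "def quicksort(xs): ...",
    "class LRUCache: ...",
    "async def fetch_url(session, url): ...",
]

# L2-normalized embeddings + inner product = cosine similarity.
embeddings = np.asarray(model.encode(snippets, normalize_embeddings=True), dtype="float32")
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

query = np.asarray(
    model.encode(["cache with least-recently-used eviction"], normalize_embeddings=True),
    dtype="float32",
)
scores, ids = index.search(query, 2)
print([(snippets[i], round(float(s), 3)) for i, s in zip(ids[0], scores[0])])
```

A flat index like this scans every vector; at GitHub scale you would need sharded approximate indexes (e.g. IVF or HNSW), which is where the compute and storage costs the author describes start to bite.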
HN commenters largely praised the author's transparency and detailed write-up of their project. Several pointed out the inherent difficulties and nuances of semantic search, particularly within the vast and diverse codebase of GitHub. Some suggested alternative approaches, like focusing on a smaller, more specific domain within GitHub or utilizing existing tools like Elasticsearch with careful tuning. The cost of running such a service and the challenges of monetization were also discussed, with some commenters skeptical of the free model. A few users shared their own experiences with similar projects, echoing the author's sentiments about the complexity and resource intensity of semantic search. Overall, the comments reflected an appreciation for the author's journey and the lessons learned, contributing further insights into the challenges of building and scaling a semantic search engine.
A new Safari extension allows users to set ChatGPT as their default search engine. The extension intercepts search queries entered in the Safari address bar and redirects them to ChatGPT, providing a conversational AI-powered search experience directly within the browser. This offers an alternative to traditional search engines, leveraging ChatGPT's ability to synthesize information and respond in natural language.
Hacker News users discussed the practicality and privacy implications of using a ChatGPT extension as a default search engine. Several questioned the value proposition, arguing that search engines are better suited for information retrieval while ChatGPT excels at generating text. Privacy concerns were raised regarding sending every search query to OpenAI. Some commenters expressed interest in using ChatGPT for specific use cases, like code generation or creative writing prompts, but not as a general search replacement. Others highlighted potential benefits, like more conversational search results and the possibility of bypassing paywalled content using ChatGPT's summarization abilities. The potential for bias and manipulation in ChatGPT's responses was also mentioned.
Phind 2, a new AI search engine, significantly upgrades its predecessor with enhanced multi-step reasoning capabilities and the ability to generate visual answers, including diagrams and code flowcharts. It utilizes a novel method called "grounded reasoning" which allows it to access and process information from multiple sources to answer complex questions, offering more comprehensive and accurate responses. Phind 2 also features an improved conversational mode and an interactive code interpreter, making it a more powerful tool for both technical and general searches. This new version aims to provide clearer, more insightful answers than traditional search engines, moving beyond simply listing links.
Hacker News users discussed Phind 2's potential, expressing both excitement and skepticism. Some praised its ability to synthesize information and provide visual aids, especially for coding-related queries. Others questioned the reliability of its multi-step reasoning and cited instances where it hallucinated or provided incorrect code. Concerns were also raised about the lack of source citations and the potential for over-reliance on AI tools, hindering deeper learning. Several users compared it favorably to other AI search engines like Perplexity AI, noting its cleaner interface and improved code generation capabilities. The closed-source nature of Phind 2 also drew criticism, with some advocating for open-source alternatives. The pricing model and potential for future monetization were also points of discussion.
Google altered its Super Bowl ad for its Bard AI chatbot after the chatbot provided inaccurate information in a demo. The ad showcased Bard's ability to simplify complex topics, but Bard incorrectly stated that the James Webb Space Telescope took the very first pictures of a planet outside our solar system. Google corrected the error before airing the ad, highlighting the ongoing challenges of ensuring accuracy in AI chatbots, even in highly publicized marketing campaigns.
Hacker News commenters generally expressed skepticism about Google's Bard AI and the implications of the ad's factual errors. Several pointed out the irony of needing to edit an ad showcasing AI's capabilities because the AI itself got the facts wrong. Some questioned the ethics of heavily promoting a technology that's clearly still flawed, especially given Google's vast influence. Others debated the significance of the errors, with some suggesting they were minor while others argued they highlighted deeper issues with the technology's reliability. A few commenters also discussed the pressure Google is under from competitors like Bing and the potential for AI chatbots to confidently hallucinate incorrect information. A recurring theme was the difficulty of balancing the hype around AI with the reality of its current limitations.
DeepSeek, a platform offering encoder APIs for developers, chose to open-source its core technology due to the inherent difficulty in building trust with users regarding data privacy and security when handling sensitive information like codebases and internal documentation. By open-sourcing, DeepSeek aims to foster transparency and allow users to self-host, ensuring complete control over their data. This approach mitigates concerns around vendor lock-in and allows the community to contribute to the project's development and security, ultimately building greater trust and fostering wider adoption.
Hacker News users discussed the open-sourcing of DeepSeek, primarily focusing on the challenges of monetizing open-source AI infrastructure. Many commenters were skeptical of DeepSeek's business model, questioning how the company could successfully build a proprietary offering on top of an open-source core, especially given the intense competition in the vector database space. Some suggested that open-sourcing DeepSeek was a necessary move due to the difficulty of attracting paying customers for a closed-source product. Others pointed out potential advantages, such as faster iteration and community contributions, but remained unconvinced of long-term viability. Several users expressed a desire for more technical details about DeepSeek's implementation and performance compared to existing solutions. The most compelling comments revolved around the inherent tension between open-sourcing and profitability in the current AI landscape.
Marginalia is a search engine designed to surface non-commercial content, prioritizing personal websites, blogs, and other independently published works often overshadowed by commercial results in mainstream search. It aims to rediscover the original spirit of the web by focusing on unique, human-generated content and fostering a richer, more diverse online experience. The search engine utilizes a custom index built by crawling sites linked from curated sources, filtering out commercial and spammy domains. Marginalia emphasizes quality over quantity, presenting a smaller, more carefully selected set of results to help users find hidden gems and explore lesser-known corners of the internet.
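As a toy sketch of that crawl-and-filter idea (every heuristic and helper name below is hypothetical, not Marginalia's actual logic):

```python
from urllib.parse import urlparse

SEED_URLS = ["https://curated-links.example/blogroll"]   # hypothetical curated source
COMMERCIAL_HINTS = ("shop", "store", "deals", "coupon")  # toy heuristic only

def looks_commercial(url: str) -> bool:
    # Crude domain-level filter; a real system would use far richer signals.
    host = urlparse(url).netloc.lower()
    return any(hint in host for hint in COMMERCIAL_HINTS)

def crawl(fetch_links, max_pages=1000):
    # Breadth-first crawl from curated seeds, dropping commercial-looking domains.
    # fetch_links(url) is a caller-supplied function returning a page's outbound links.
    frontier, seen, kept = list(SEED_URLS), set(), []
    while frontier and len(kept) < max_pages:
        url = frontier.pop(0)
        if url in seen or looks_commercial(url):
            continue
        seen.add(url)
        kept.append(url)
        frontier.extend(fetch_links(url))
    return kept
```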
Hacker News users generally praised Marginalia's concept of prioritizing non-commercial content, viewing it as a refreshing alternative to mainstream search engines saturated with ads and SEO-driven results. Several commenters expressed enthusiasm for the focus on personal websites, blogs, and academic resources. Some questioned the long-term viability of relying solely on donations, while others suggested potential improvements like user accounts, saved searches, and more granular control over source filtering. There was also discussion around the definition of "non-commercial," with some users highlighting the inherent difficulty in objectively classifying content. A few commenters shared their initial search experiences, noting both successes in finding unique content and instances where the results were too niche or limited. Overall, the sentiment leaned towards cautious optimism, with many expressing hope that Marginalia could carve out a valuable space in the search landscape.
IRCDriven is a new search engine specifically designed for indexing and searching IRC (Internet Relay Chat) logs. It aims to make exploring and researching public IRC conversations easier by offering full-text search capabilities, advanced filtering options (like by channel, nick, or date), and a user-friendly interface. The project is actively seeking feedback and contributions from the IRC community to improve its features and coverage.
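IRCDriven's internals aren't described, but as a generic sketch of full-text search with channel, nick, and date filters, SQLite's FTS5 module covers the idea (this requires an SQLite build with FTS5, which most modern Python distributions include):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table: `message` is full-text indexed; the other columns are
# stored UNINDEXED so they remain available as exact-match filters.
conn.execute("""
    CREATE VIRTUAL TABLE logs USING fts5(
        message, channel UNINDEXED, nick UNINDEXED, ts UNINDEXED
    )
""")
conn.executemany(
    "INSERT INTO logs (message, channel, nick, ts) VALUES (?, ?, ?, ?)",
    [
        ("anyone tried the new kernel patch?", "#linux", "alice", "2024-01-03"),
        ("the patch fixes the scheduler bug", "#linux", "bob", "2024-01-03"),
        ("patch notes are on the wiki", "#games", "carol", "2024-02-10"),
    ],
)

# Full-text match plus channel and date-range filtering, as the site describes.
rows = conn.execute(
    "SELECT nick, ts, message FROM logs "
    "WHERE logs MATCH ? AND channel = ? AND ts BETWEEN ? AND ?",
    ("patch", "#linux", "2024-01-01", "2024-01-31"),
).fetchall()
print(rows)
```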
Commenters on Hacker News largely praised IRCDriven for its clean interface and fast search, finding it a useful tool for rediscovering old conversations and information. Some expressed a nostalgic appreciation for IRC and the value of archiving its content. A few suggested potential improvements, such as adding support for more networks, allowing filtering by nick, and offering date-range restrictions in search. One commenter noted the difficulty of indexing IRC due to its decentralized and ephemeral nature, commending the creator for tackling the challenge. Others discussed the historical significance of IRC and the potential for such archives to serve as valuable research resources.
Birls.org is a new search engine specifically designed for accessing US veteran records. It offers a streamlined interface to search across multiple government databases and also provides a free, web-based system for submitting Freedom of Information Act (FOIA) requests to the National Archives via fax, simplifying the often cumbersome process of obtaining these records.
HN users generally expressed skepticism and concern about the project's viability and potential security issues. Several commenters questioned the need for faxing FOIA requests, highlighting existing online portals and email options. Others worried about the security implications of handling sensitive veteran data, particularly with a fax-based system. The project's reliance on OCR was also criticized, with users pointing out its inherent inaccuracy. Some questioned the search engine's value proposition, given the existence of established genealogy resources. Finally, the lack of clarity surrounding the project's funding and the developer's qualifications raised concerns about its long-term sustainability and trustworthiness.
Hacker News users generally praised Mullvad Leta for its privacy-focused approach to search, particularly its commitment to not storing user data. Several commenters appreciated the technical explanation of how Leta works, including its use of a PostgreSQL database and its indexing methods. Some expressed skepticism about its ability to compete with established search engines like Google in terms of search quality and comprehensiveness. Others discussed the challenges of balancing privacy with functionality, acknowledging that some trade-offs are inevitable. A few commenters mentioned alternative privacy-focused search engines like Brave Search and SearX, comparing their features and functionalities to Leta. Some users pointed out limitations with current language support. There was some discussion about the cost model and whether Leta would eventually incorporate ads or other monetization strategies, with some hoping it would remain a free service.
The Hacker News post titled "Mullvad Leta," linking to leta.mullvad.net, generated several comments exploring various aspects of the proposed search engine. Many commenters expressed cautious optimism and interest in the project.
A recurring theme was Mullvad's reputation for privacy and trustworthiness. Several commenters highlighted this as a key differentiator, suggesting that even if the search engine wasn't perfect initially, Mullvad's commitment to privacy would make it a viable alternative to existing options. One user explicitly stated their trust in Mullvad, emphasizing the company's track record with their VPN service. Another comment echoed this sentiment, pointing out that Mullvad's existing reputation makes them more likely to prioritize user privacy in their search engine.
Several comments delved into the technical details and challenges of building a private search engine. Discussions around indexing, the use of third-party APIs (particularly for image search), and the balance between privacy and functionality were prominent. One commenter questioned the feasibility of offering a fully private image search, given the reliance on external sources. Another comment acknowledged the difficulty of competing with established search giants, emphasizing the massive resources required for indexing and maintaining a comprehensive search index.
The open-source nature of the project also drew attention, with some commenters expressing enthusiasm for the potential for community contributions and audits. The ability to inspect the code was seen as a significant advantage in terms of transparency and trust.
Some skepticism was expressed regarding the potential effectiveness and reach of the search engine. One commenter wondered about the long-term viability of such a project, considering the dominance of existing players. Another comment questioned the actual improvement in privacy compared to using existing search engines with privacy-focused browsers or extensions.
Finally, several users discussed alternative privacy-focused search engines and compared their features and limitations with Mullvad Leta. SearXNG and Brave Search were mentioned as examples, with commenters analyzing their strengths and weaknesses in relation to Mullvad's offering.
Overall, the comments reflected a mixture of excitement, cautious optimism, and pragmatic concerns about the challenges of building a truly private and effective search engine. The discussion revolved around Mullvad's reputation, technical feasibility, open-source nature, and comparisons with existing alternatives.