The post "Everyone knows all the apps on your phone" argues that the extensive data collection practices of mobile advertising networks effectively reveal which apps individuals use, even without explicit permission. Through deterministic and probabilistic methods linking device IDs, IP addresses, and other signals, these networks can create detailed profiles of app usage across devices. This information is then packaged and sold to advertisers, data brokers, and even governments, allowing them to infer sensitive information about users, from their political affiliations and health concerns to their financial status and personal relationships. The post emphasizes the illusion of privacy in the mobile ecosystem, arguing that the current opt-out model is inadequate and calling for a more robust approach to data protection.
Theophile Cantelo has created Foudinge, a knowledge graph connecting restaurants and chefs. Leveraging Large Language Models (LLMs), Foudinge extracts information from various online sources like blogs, guides, and social media to establish relationships between culinary professionals and the establishments they've worked at or own. This allows for complex queries, such as finding all restaurants where a specific chef has worked, discovering connections between different chefs through shared work experiences, and exploring the culinary lineage within the restaurant industry. Currently focused on French gastronomy, the project aims to expand its scope geographically and improve data accuracy through community contributions and additional data sources.
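The kinds of queries described above can be sketched with a tiny in-memory graph. This is a minimal illustration, not Foudinge's actual data model or API, and the chef and restaurant names are invented:

```python
from collections import defaultdict

# Edges: (chef, restaurant), meaning "this chef worked at or owns this restaurant".
# All names below are hypothetical placeholders.
worked_at = [
    ("Chef A", "Le Bistro"),
    ("Chef A", "La Table"),
    ("Chef B", "La Table"),
    ("Chef B", "Chez Nous"),
]

chef_to_restaurants = defaultdict(set)
restaurant_to_chefs = defaultdict(set)
for chef, restaurant in worked_at:
    chef_to_restaurants[chef].add(restaurant)
    restaurant_to_chefs[restaurant].add(chef)

def restaurants_of(chef):
    """All restaurants where a given chef has worked."""
    return sorted(chef_to_restaurants[chef])

def colleagues_of(chef):
    """Chefs connected to this one through a shared restaurant."""
    return sorted(
        other
        for r in chef_to_restaurants[chef]
        for other in restaurant_to_chefs[r]
        if other != chef
    )
```

Here `colleagues_of("Chef A")` returns `["Chef B"]` because both worked at "La Table"; a real knowledge graph would additionally attach provenance and time periods to each edge.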
Hacker News users generally expressed skepticism about the value proposition of the presented knowledge graph of restaurants and chefs. Several commenters questioned the accuracy and completeness of the data, especially given its reliance on LLMs. Some doubted the usefulness of connecting chefs to restaurants without further context, like the time period they worked there. Others pointed out the existing prevalence of this information on platforms like Wikipedia and guide sites, questioning the need for a new platform. The lack of a clear use case beyond basic information retrieval was a recurring theme, with some suggesting potential applications like tracking career progression or identifying emerging culinary trends, but ultimately finding the current implementation insufficient. A few commenters appreciated the technical effort, but overall the reception was lukewarm, focused on the need for demonstrable practical application and improved data quality.
Researchers introduced SWE-Lancer, a new benchmark designed to evaluate large language models (LLMs) on realistic software engineering tasks. Sourced from Upwork job postings, the benchmark comprises 417 diverse tasks covering areas like web development, mobile development, data science, and DevOps. SWE-Lancer focuses on practical skills by requiring LLMs to generate executable code, write clear documentation, and address client requests. It moves beyond simple code generation by incorporating problem descriptions, client communications, and desired outcomes to assess an LLM's ability to understand context, extract requirements, and deliver complete solutions. This benchmark provides a more comprehensive and real-world evaluation of LLM capabilities in software engineering than existing benchmarks.
HN commenters discuss the limitations of the SWE-Lancer benchmark, particularly its focus on smaller, self-contained tasks representative of Upwork gigs rather than larger, more complex projects typical of in-house software engineering roles. Several point out the prevalence of "specification gaming" within the dataset, where successful solutions exploit loopholes or ambiguities in the prompt rather than demonstrating true problem-solving skills. The reliance on GPT-4 for evaluation is also questioned, with concerns raised about its ability to accurately assess code quality and potential biases inherited from its training data. Some commenters also suggest the benchmark's usefulness is limited by its narrow scope, and call for more comprehensive benchmarks reflecting the broader range of skills required in professional software development. A few highlight the difficulty in evaluating "soft" skills like communication and collaboration, essential aspects of real-world software engineering often absent in freelance tasks.
A US judge ruled in favor of Thomson Reuters in its copyright suit against legal AI startup Ross Intelligence, establishing a significant precedent in AI copyright law. Ross had used editorial headnotes from Westlaw, Thomson Reuters' legal research platform, to train its AI-powered legal search tool. The judge rejected Ross's fair use defense, finding the copying commercial and non-transformative: Ross used the material to build a competing product aimed at the same market as Westlaw. The decision indicates that using copyrighted data for AI training may not be permissible when the resulting product serves as a substitute for the original source material.
HN commenters generally agree that Westlaw's terms of service likely prohibit scraping, regardless of copyright implications. Several point out that training data is generally considered fair use, and question whether the judge's decision will hold up on appeal. Some suggest the ruling might create a chilling effect on open-source LLMs, while others argue that large companies will simply absorb the licensing costs. A few commenters see this as a positive outcome, forcing AI companies to pay for the data they use. The discussion also touches upon the potential for increased competition and innovation if smaller players can access data more affordably than licensing Westlaw's content.
Cosine similarity, while popular for comparing vectors, can be misleading when vector magnitudes carry significant meaning. The blog post demonstrates that cosine similarity measures only the angle between vectors, discarding their lengths. This can produce counterintuitive results: in a recommendation system, for instance, a vector encoding heavy, consistent user engagement scores no higher than a short, noisy vector that happens to point in the same direction, because magnitude differences are ignored entirely. The author advocates considering alternatives like the dot product or Euclidean distance, especially when vector magnitude represents important information like purchase count or user engagement. Ultimately, the choice of similarity metric should depend on the specific application and the meaning encoded within the vector data.
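The post's core point fits in a few lines of Python. This is a self-contained sketch with invented vectors: cosine similarity scores two same-direction vectors identically regardless of magnitude, while the dot product does not.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 1.0]           # e.g. a taste profile
light_user = [2.0, 2.0]      # same direction, little engagement
heavy_user = [200.0, 200.0]  # same direction, heavy engagement

# Cosine discards magnitude: both candidates look identical.
print(cosine(query, light_user))   # ≈ 1.0
print(cosine(query, heavy_user))   # ≈ 1.0

# The dot product preserves it: heavy engagement scores far higher.
print(dot(query, light_user))      # 4.0
print(dot(query, heavy_user))      # 400.0
```

If magnitude encodes something like purchase count, the dot product ranks `heavy_user` far above `light_user`, whereas cosine similarity cannot distinguish them at all.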
Hacker News users generally agreed with the article's premise, cautioning against blindly applying cosine similarity. Several commenters pointed out that the effectiveness of cosine similarity depends heavily on the specific use case and data distribution. Some highlighted the importance of normalization and feature scaling, noting that cosine similarity is sensitive to these factors. Others offered alternative methods, such as Euclidean distance or Manhattan distance, suggesting they might be more appropriate in certain situations. One compelling comment underscored the importance of understanding the underlying data and problem before choosing a similarity metric, emphasizing that no single metric is universally superior. Another emphasized how important preprocessing is, highlighting TF-IDF and BM25 as helpful techniques for text analysis before using cosine similarity. A few users provided concrete examples where cosine similarity produced misleading results, further reinforcing the author's warning.
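The commenters' point about feature scaling is easy to demonstrate: cosine similarity is invariant to scaling a whole vector, but rescaling a single feature (a unit change, say, or TF-IDF reweighting) changes the scores. A minimal sketch with invented numbers:

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

a = [1.0, 3.0]
b = [3.0, 1.0]
print(round(cosine(a, b), 3))  # 0.6

# Multiply feature 0 by 10 in both vectors (e.g. a change of units):
a_scaled = [10.0, 3.0]
b_scaled = [30.0, 1.0]
print(round(cosine(a_scaled, b_scaled), 3))  # 0.967
```

The two vectors go from moderately similar to nearly identical purely because one feature now dominates the norm, which is why preprocessing choices matter as much as the metric itself.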
IRCDriven is a new search engine specifically designed for indexing and searching IRC (Internet Relay Chat) logs. It aims to make exploring and researching public IRC conversations easier by offering full-text search capabilities, advanced filtering options (like by channel, nick, or date), and a user-friendly interface. The project is actively seeking feedback and contributions from the IRC community to improve its features and coverage.
Commenters on Hacker News largely praised IRCDriven for its clean interface and fast search, finding it a useful tool for rediscovering old conversations and information. Some expressed a nostalgic appreciation for IRC and the value of archiving its content. A few suggested potential improvements, such as adding support for more networks, allowing filtering by nick, and offering date range restrictions in search. One commenter noted the difficulty in indexing IRC due to its decentralized and ephemeral nature, commending the creator for tackling the challenge. Others discussed the historical significance of IRC and the potential for such archives to serve as valuable research resources.
Summary of Comments (392)
https://news.ycombinator.com/item?id=43518866
Hacker News users discussed the privacy implications of app usage data being readily available to mobile advertising networks and how this data can be used for targeted advertising and even more nefarious purposes. Some commenters highlighted the ease with which this data can be accessed, not just by corporations but also by individuals with basic technical skills. The discussion also touched upon the ineffectiveness of current privacy regulations and the lack of real control users have over their data. A few users pointed out the potential for this data to reveal sensitive information like health conditions or financial status based on app usage patterns. Several commenters expressed a sense of resignation and apathy, suggesting the fight for data privacy is already lost, while others advocated for stronger regulations and user control over data sharing.
The Hacker News post "Everyone knows all the apps on your phone" (linking to a Substack article about app usage data collection) generated a lively discussion with several compelling comments.
Many commenters discussed the technical mechanisms behind this data collection, pointing out that it goes beyond simply tracking app store downloads. Several highlighted the role of "device graphs," which link together various devices and online identities belonging to the same individual through sophisticated cross-referencing of information like IP addresses, advertising identifiers, and shared accounts. This allows companies to build a comprehensive picture of a user's app usage even across different devices. Some elaborated on how this data is packaged and sold, emphasizing the scale and pervasiveness of this practice.
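The deterministic side of a "device graph" amounts to identity resolution: any two observations that share an identifier get merged into one cluster. A minimal sketch using union-find follows; the devices and identifiers are invented for illustration, and real brokers layer probabilistic matching on top of this:

```python
from collections import defaultdict

# Union-find over devices and identifiers: sharing any identifier
# (IP address, advertising ID, account) merges two records.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# Each observation links a device to identifiers seen alongside it (all hypothetical).
observations = [
    ("phone-1",  ["ip:203.0.113.7", "adid:aaaa"]),
    ("tablet-1", ["ip:203.0.113.7", "acct:user@example.com"]),
    ("laptop-1", ["acct:user@example.com"]),
    ("phone-2",  ["ip:198.51.100.9"]),  # no shared identifier: stays separate
]
for device, identifiers in observations:
    for ident in identifiers:
        union(device, ident)

# Devices reachable through any chain of shared identifiers form one "person".
clusters = defaultdict(set)
for device, _ in observations:
    clusters[find(device)].add(device)
```

Here the phone, tablet, and laptop end up in one cluster (phone and tablet share an IP; tablet and laptop share an account), which is exactly the cross-device linkage commenters described.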
A recurring theme was the lack of genuine informed consent. Commenters argued that the current opt-out mechanisms are often buried in complex privacy policies or presented in a way that discourages users from exercising their choices. Some expressed skepticism about the effectiveness of privacy-focused operating systems or VPNs in fully mitigating this tracking, given the sophisticated techniques employed by data brokers.
Several commenters discussed the implications of this data collection, ranging from targeted advertising to potential misuse by governments or malicious actors. Some raised concerns about the chilling effect this surveillance could have on freedom of expression and association. The potential for discrimination based on inferred characteristics from app usage was also mentioned.
A few commenters offered practical advice on mitigating this tracking, such as regularly clearing advertising identifiers and being selective about the permissions granted to apps. However, there was a general consensus that individual efforts are insufficient and that stronger regulatory measures are needed to address the systemic nature of this data collection.
Some of the more compelling comments included specific examples of how this data is used, anecdotes about unexpected data linkages, and technical deep dives into the methods employed by data brokers. The discussion also touched upon the ethical implications of this practice and the broader societal consequences of widespread digital surveillance. While some comments offered a resigned acceptance of this reality, others expressed a desire for greater transparency and control over personal data.