hackslash dot org

I got fooled by AI-for-science hype–here's what it taught me

Posted: 2025-05-20 04:57:00

The author, initially enthusiastic about AI's potential to revolutionize scientific discovery, realized that current AI/ML tools are primarily useful for accelerating specific, well-defined tasks within existing scientific workflows, rather than driving paradigm shifts or independently generating novel hypotheses. While AI excels at tasks like optimizing experiments or analyzing large datasets, its dependence on existing data and human-defined parameters limits its capacity for true scientific creativity. The author concludes that focusing on augmenting scientists with these powerful tools, rather than replacing them, is a more realistic and beneficial approach, acknowledging that genuine scientific breakthroughs still rely heavily on human intuition and expertise.

The author, reflecting on their initial exuberant embrace of the "AI for science" paradigm, recounts a personal journey marked by both excitement and subsequent disillusionment. They initially perceived artificial intelligence as a potential revolutionary force in scientific discovery, envisioning a future where machine learning models would autonomously generate novel hypotheses, design experiments, and analyze data, thereby accelerating scientific progress at an unprecedented pace. This optimistic outlook was fueled by the prevalent narrative surrounding AI's transformative potential and the impressive demonstrations of its capabilities in other domains.

However, the author's practical experience applying these techniques to real-world scientific problems revealed a more nuanced and complex reality. They discovered that the successful application of AI in science requires far more than simply applying existing algorithms to scientific datasets. A deep understanding of the underlying scientific principles and the specific challenges of the domain proved crucial, as did careful consideration of the limitations and potential biases inherent in the data and the models themselves. The author emphasizes that, contrary to the hype, AI is not a magical solution that can replace human scientific expertise. Instead, it is a powerful tool that can augment and enhance human capabilities, but only when wielded judiciously and with a clear understanding of its strengths and weaknesses.

The author's disillusionment stemmed from the realization that many of the publicized successes in AI for science were often overstated or selectively presented, failing to acknowledge the significant human effort and domain expertise required to achieve those results. They observed a tendency to focus on showcasing the potential of AI while downplaying the practical challenges and limitations, creating an inflated sense of its current capabilities. Furthermore, the author highlights the importance of distinguishing between truly novel scientific discoveries driven by AI and the application of AI to automate existing scientific workflows, arguing that the former remains elusive while the latter, although valuable, is less revolutionary.

The author concludes by advocating for a more realistic and balanced perspective on the role of AI in science. They encourage a shift away from the hype-driven narrative towards a more pragmatic approach that emphasizes collaboration between AI experts and domain scientists, rigorous validation of AI-driven insights, and a focus on addressing the specific challenges and limitations of applying AI to different scientific disciplines. While acknowledging that AI holds immense potential to transform scientific research, the author stresses the importance of tempering expectations and recognizing that its successful integration requires careful consideration, domain expertise, and a nuanced understanding of both the power and limitations of these technologies. They propose that focusing on augmenting human intelligence, rather than replacing it, is the key to unlocking the true potential of AI for scientific advancement.

Summary of Comments ( 200 )
https://news.ycombinator.com/item?id=44037941

Several commenters on Hacker News agreed with the author's sentiment about the hype surrounding AI in science, pointing out that the "low-hanging fruit" has already been plucked and that significant advancements are becoming increasingly difficult. Some highlighted the importance of domain expertise and the limitations of relying solely on AI, emphasizing that AI should be a tool used by experts rather than a replacement for them. Others discussed the issue of reproducibility and the "black box" nature of some AI models, making scientific validation challenging. A few commenters offered alternative perspectives, suggesting that AI still holds potential but requires more realistic expectations and a focus on specific, well-defined problems. The misleading nature of visualizations generated by AI was also a point of concern, with commenters noting the potential for misinterpretations and the need for careful validation.

The Hacker News post titled "I got fooled by AI-for-science hype–here's what it taught me" generated a moderate discussion with several insightful comments. Many commenters agreed with the author's core premise that AI hype in science, particularly regarding drug discovery and materials science, often oversells the current capabilities.

Several users highlighted the distinction between using AI for discovery versus optimization. One commenter pointed out that AI excels at optimizing existing solutions, making incremental improvements based on vast datasets. However, they argued it's less effective at genuine discovery, where novel concepts and breakthroughs are needed. This was echoed by another who mentioned that drug discovery often involves an element of "luck" and creative leaps that AI struggles to replicate.

Another recurring theme was the "garbage in, garbage out" problem. Commenters stressed that AI models are only as good as the data they're trained on. In scientific domains, this can be problematic due to limited, biased, or noisy datasets. One user specifically discussed materials science, explaining that the available data is often incomplete or inconsistent, hindering the effectiveness of AI models. Another mentioned that even within drug discovery, datasets are often proprietary and not shared, further limiting the potential of large-scale AI applications.

Some commenters offered a more nuanced perspective, acknowledging the hype while also recognizing the potential of AI. One suggested that AI could be a valuable tool for scientists, particularly for automating tedious tasks and analyzing complex data, but it shouldn't be seen as a replacement for human expertise and intuition. Another commenter argued that AI's role in science is still evolving, and while current applications may be overhyped, future breakthroughs are possible as the technology matures and datasets improve.

A few comments also touched on the economic incentives driving the AI hype. One user suggested that venture capital and media attention create pressure to exaggerate the potential of AI, leading to unrealistic expectations and inflated claims. Another mentioned the "publish or perish" culture in academia, which can incentivize researchers to oversell their results to secure funding and publications.

Overall, the comments section presents a generally skeptical view of the current state of AI-for-science, highlighting the limitations of existing approaches and cautioning against exaggerated claims. However, there's also a recognition that AI holds promise as a scientific tool, provided its limitations are acknowledged and expectations are tempered.

Kerosene did not save the sperm whale (2024)

permalink

Posted: 2025-04-04 15:59:28

The claim that kerosene saved sperm whales from extinction is a myth. While kerosene replaced sperm whale oil in lamps and other applications, this shift occurred after whale populations had already drastically declined due to overhunting. The demand for whale oil, not its eventual replacement, drove whalers to hunt sperm whales to near-extinction. Kerosene's rise simply made continued whaling less profitable, not less damaging up to that point. The article emphasizes that technological replacements rarely save endangered species; rather, conservation efforts are crucial.

The Substack post "Kerosene Did Not Save the Sperm Whale (2024)" by Ed Conway delves into the persistent, yet inaccurate, narrative surrounding the historical use of kerosene to refloat stranded whales. The article meticulously dissects the origins and perpetuation of this misconception, tracing it back to a singular, often-cited 19th-century anecdote concerning a beached sperm whale near Boston. Conway systematically demonstrates how this single incident, recounted in various publications with increasing embellishments over time, has morphed into a broadly accepted "fact," despite a conspicuous lack of corroborating evidence.

He argues that the story's appeal lies in its simplicity and perceived ingenuity – a readily available substance providing a seemingly miraculous solution to a dramatic predicament. This narrative resonates with a human desire for straightforward answers and triumphs over nature. Furthermore, the story's persistence is facilitated by its repeated inclusion in reputable publications and educational materials, which lends it an aura of credibility. This cyclical reinforcement, where the frequent repetition of the story serves as its own validation, has solidified its place in popular understanding.

Conway emphasizes the importance of scrutinizing such historical anecdotes and challenging their veracity, especially when they lack robust supporting evidence. He supports his argument by highlighting the practical improbability of kerosene significantly altering a whale's buoyancy given the sheer mass of the animal. He also examines the potential ecological harm caused by introducing large quantities of kerosene into the sensitive intertidal zone where strandings occur. Furthermore, he underscores the absence of any documented instances of successful whale refloating using this method beyond the single, questionable Boston incident.

The author concludes by advocating for a more critical approach to seemingly established historical narratives, urging readers to question the sources and evidence behind such claims. The kerosene-whale anecdote serves as a cautionary tale illustrating how easily misinformation can spread and become entrenched, even within scientific and educational circles. This underscores the necessity of rigorous fact-checking and the importance of relying on verifiable evidence rather than appealing narratives when discussing historical events, particularly those concerning scientific or natural phenomena. The broader implication is a call for greater skepticism and a commitment to evidence-based understanding, particularly in an age of readily disseminated information.

Summary of Comments ( 60 )
https://news.ycombinator.com/item?id=43584303

HN users generally agree with the author's debunking of the "kerosene saved the sperm whales" myth. Several commenters provide further details on whale oil uses beyond lighting, such as lubricants and industrial processes, reinforcing the idea that declining demand was more complex than a single replacement. Some discuss the impact of petroleum on other industries and the historical context of resource transitions. A few express appreciation for the well-researched article and the author's clear writing style, while others point to additional resources and related historical narratives, including the history of whaling and the environmental impacts of different industries. A small side discussion touches on the difficulty of predicting technological advancements and their impact on existing markets.

The Hacker News post titled "Kerosene did not save the sperm whale (2024)" has generated a number of comments discussing the linked article. Several commenters focus on the historical context and accuracy of the claims made in the article about the use of kerosene to refloat beached whales.

One commenter points out that while kerosene might not have been the primary factor in successful refloatings, historical accounts suggest it was used, citing examples from the 19th century. They acknowledge that the article's point about kerosene's potential harm is valid but emphasize the need to differentiate between its use as a flotation aid and its potential harm to the whale's skin. This commenter emphasizes the complexity of historical practices, suggesting that simply dismissing kerosene's use outright might be an oversimplification.

Another commenter digs into the specific example mentioned in the article about a whale refloated in New Jersey in 1902, highlighting the presence of other contributing factors, such as the rising tide. They argue that this demonstrates the difficulty in attributing success solely to kerosene. This commenter focuses on the multifaceted nature of whale rescue attempts, emphasizing that multiple factors likely play a role in any given situation.

Several commenters also discuss the challenges inherent in historical research, particularly the limitations and potential biases present in anecdotal evidence and newspaper reports. They acknowledge that while historical accounts might mention the use of kerosene, this doesn't necessarily prove its effectiveness. This emphasizes the importance of critical analysis when interpreting historical data.

Finally, some comments touch upon the broader topic of whale strandings and the various theories surrounding their causes. They mention factors like navigational errors, changes in ocean currents, and even military sonar as potential contributing factors. This discussion broadens the scope beyond the specific use of kerosene to encompass the larger issue of whale strandings and the complexities of understanding them.

Overall, the comments on Hacker News offer a nuanced perspective on the article's claims, exploring the historical context, the limitations of available evidence, and the complexities involved in interpreting historical accounts of whale rescue attempts. They generally agree with the article's main premise about the dubious effectiveness of kerosene, but also caution against oversimplification and emphasize the need for careful analysis of historical data.

SQLite-on-the-server is misunderstood: Better at hyper-scale than micro-scale

permalink

Posted: 2025-03-03 17:29:12

The blog post argues that SQLite, often perceived as a lightweight embedded database, is surprisingly well-suited for large-scale server deployments, even outperforming traditional client-server databases in certain scenarios. It posits that SQLite's simplicity, file-based nature, and lack of a separate server process translate to reduced operational overhead, easier scaling through horizontal sharding, and superior performance for read-heavy workloads, especially when combined with efficient caching mechanisms. While acknowledging limitations for complex joins and write-heavy applications, the author contends that SQLite's strengths make it a compelling, often overlooked option for modern web backends, particularly those focusing on serving static content or leveraging serverless functions.

The blog post "SQLite-on-the-server is misunderstood: Better at hyper-scale than micro-scale" argues against the common perception that SQLite, a lightweight embedded database, is only suitable for small-scale applications or client-side usage. The author contends that SQLite's unique architecture actually makes it a compelling choice for very large, high-throughput systems, even outperforming traditional client-server databases in specific scenarios. This counterintuitive claim rests on several key arguments.

Firstly, the post emphasizes the inherent scalability of SQLite when deployed in a "one database per service" model, a microservices architectural pattern. In this approach, each individual service or component within a larger application interacts with its own dedicated SQLite database file. This eliminates contention and locking issues that often become bottlenecks in centralized database systems as the application grows. Because each service handles its own isolated data, requests don't compete for the same resources, allowing for parallel processing and significant performance gains at scale.

Secondly, the author highlights the performance advantages stemming from SQLite's file-based nature. Being a library that directly manipulates a single file, SQLite avoids the overhead of inter-process communication (IPC) inherent in client-server database setups. This streamlined communication path translates to faster query execution and lower latency, especially beneficial in environments handling numerous, small, frequent requests. The post further elaborates that modern operating systems are highly optimized for file system operations, making this approach even more efficient.

The post acknowledges that managing numerous SQLite files might seem complex. However, it suggests leveraging modern containerization and orchestration technologies like Kubernetes to automate the deployment and management of these databases. This allows for easy scaling by simply spinning up more containers, each with its own dedicated SQLite database, distributing the load and maintaining high performance.

Furthermore, the author tackles the concern of data consistency and transactions across multiple SQLite databases. While admitting that distributed transactions are not natively supported, the post argues that this complexity can be managed at the application level using techniques like eventual consistency or the Saga pattern. These approaches provide ways to maintain data integrity without requiring complex distributed transaction coordination, thus preserving the performance benefits of the isolated database approach.

Finally, the blog post positions SQLite as a particularly advantageous solution for read-heavy workloads. The self-contained nature of each database file allows for easy replication and distribution across multiple servers, leading to significant improvements in read performance and availability. By simply copying the database file to multiple locations, read requests can be distributed, effectively scaling read capacity horizontally.

In essence, the author proposes a paradigm shift in thinking about SQLite. Instead of perceiving it solely as a small-scale solution, they advocate for considering its strengths in highly distributed, microservices-based architectures, where its file-based nature, lack of IPC overhead, and ease of replication can translate to significant performance and scalability advantages, particularly in read-heavy scenarios.

Summary of Comments ( 136 )
https://news.ycombinator.com/item?id=43244307

Hacker News users discussed the practicality and nuance of using SQLite as a server-side database, particularly at scale. Several commenters challenged the author's assertion that SQLite is better at hyper-scale than micro-scale, pointing out that its single-writer nature introduces bottlenecks in heavily write-intensive applications, precisely the kind often found at smaller scales. Some argued the benefits of SQLite, like simplicity and ease of deployment, are more valuable in microservices and serverless architectures, where scale is addressed through horizontal scaling and data sharding. The discussion also touched on the benefits of SQLite's reliability and its suitability for read-heavy workloads, with some users suggesting its effectiveness for data warehousing and analytics. Several commenters offered their own experiences, some highlighting successful use cases of SQLite at scale, while others pointed to limitations encountered in production environments.

The Hacker News post discussing the Rivet blog post "SQLite-on-the-server is misunderstood: Better at hyper-scale than micro-scale" generated a moderate amount of discussion, with several commenters offering insightful perspectives.

A key point of contention revolved around the interpretation of "hyperscale" and "microscale." Several commenters challenged the author's assertion that SQLite is better at hyperscale, arguing that the blog post conflated hyperscale with horizontal scalability. They pointed out that true hyperscale systems require sophisticated distributed consensus mechanisms and fault tolerance, which SQLite lacks. They clarified that SQLite's strength lies in its simplicity and ease of use for smaller, single-server deployments, making it more suitable for the microscale.

Another commenter emphasized the importance of data consistency and durability, suggesting that while SQLite might excel in read-heavy workloads, it's crucial to acknowledge the potential performance bottlenecks and data integrity risks when writing to the database at scale. This aligns with the blog post's acknowledgment of SQLite's single-writer nature, which some commenters considered a significant limitation.

The discussion also touched upon alternative approaches for achieving scalability, such as using a replicated SQLite setup or incorporating a caching layer to offload read traffic. While acknowledging the potential benefits of these strategies, commenters also highlighted the added complexity and operational overhead involved.

Several users shared their personal experiences using SQLite in various contexts, ranging from embedded systems to web applications. These anecdotes provided valuable practical insights into the strengths and weaknesses of SQLite, demonstrating its versatility as a database solution. One commenter, for instance, discussed using SQLite for a read-heavy application with a complex data schema, emphasizing the ease of schema evolution compared to other database systems.

Finally, the discussion briefly explored the trade-offs between using SQLite and other database technologies. While SQLite is praised for its simplicity and low barrier to entry, commenters noted that adopting a more robust database solution like PostgreSQL might be more appropriate for applications with complex data relationships, high write throughput, or stringent consistency requirements.

Overall, the comments on Hacker News offered a nuanced and balanced perspective on the suitability of SQLite for different scales and use cases. While the blog post's claims about hyperscale applicability were met with skepticism, the comments affirmed the value of SQLite as a powerful and versatile database for various applications, particularly in the microscale.

Vpternlog: When three is 100% more than two

permalink

Posted: 2025-01-19 05:24:25

The blog post "Vpternlog: When three is 100% more than two" explores the confusion surrounding ternary logic's perceived 50% increase in information capacity compared to binary. The author argues that while a ternary digit (trit) can hold three values versus a bit's two, this represents a 100% increase (three being twice as much as 1.5, which is the midpoint between 1 and 2) in potential values, not 50%. The post delves into the logarithmic nature of information capacity and uses the example of how many bits are needed to represent the same range of values as a given number of trits, demonstrating that the increase in capacity is closer to 63%, calculated using log base 2 of 3. The core point is that measuring increases in information capacity requires logarithmic comparison, not simple subtraction or division.

The blog post "Vpternlog: When three is 100% more than two" delves into a nuanced exploration of percentage calculations and their potential for misinterpretation, particularly when applied to ternary logic in the context of computer science. The author posits that a common misconception arises when comparing binary (two-state) systems to ternary (three-state) systems. Specifically, the erroneous assumption is frequently made that ternary logic offers a 50% increase in capacity or efficiency over binary logic. This assumption stems from the straightforward observation that three is 50% larger than two.

However, the author argues that this simplification overlooks the fundamental nature of percentage change calculations. A proper assessment requires considering the relative change in capacity. To illustrate, the author demonstrates that moving from two states to three states represents a 100% increase, not a 50% increase. This is because the increase (one additional state) is calculated relative to the original number of states (two), and one is 100% of two.

Further elaborating on this concept, the author emphasizes that percentages are inherently multiplicative factors, representing changes relative to an initial value. Therefore, an increase of 50% implies multiplying the original value by 1.5 (1 + 0.5), while an increase of 100% implies multiplying by 2 (1 + 1). In the case of transitioning from two states to three, the multiplication factor is indeed 1.5, but the percentage increase corresponding to this factor is 50%, not the other way around. The author elucidates this point with a clear mathematical breakdown of the percentage change formula: [(new value - old value) / old value] * 100%.

Finally, the post underscores the importance of precision in language and calculations, particularly when dealing with technical concepts like percentage change. The seemingly small difference between a 50% increase and a 100% increase can have significant implications in the realm of computer science and engineering, where even fractional differences in efficiency can translate to substantial real-world gains. The author's ultimate message is a cautionary one, urging readers to carefully consider the underlying mathematics when making comparisons based on percentages.

Summary of Comments ( 1 )
https://news.ycombinator.com/item?id=42753953

Hacker News users discuss the nuances of ternary logic's efficiency compared to binary. Several commenters point out that the article's claim of ternary being "100% more" than binary is misleading. They argue that the relevant metric is information density, calculated using log base 2, which shows ternary as only about 58% more efficient. Discussions also revolved around practical implementation challenges of ternary systems, citing issues with noise margins and the relative ease and maturity of binary technology. Some users mention the historical use of ternary computers, like Setun, while others debate the theoretical advantages and whether these outweigh the practical difficulties. A few also explore alternative bases beyond ternary and binary.

The Hacker News post "Vpternlog: When three is 100% more than two" (linking to an article about ternary logic) generated a moderate amount of discussion, with several commenters exploring different facets of ternary computing.

One of the most compelling threads revolved around the practical applications of ternary logic. A commenter pointed out the historical use of ternary in the Setun computer, highlighting its potential advantages in terms of efficiency for certain operations. This sparked further discussion about the reasons why ternary computing hasn't become mainstream, with theories ranging from the difficulty in manufacturing reliable ternary hardware to the entrenched dominance of binary logic in the computing industry. The challenges in designing ternary logic circuits were also mentioned, emphasizing the complexity compared to their binary counterparts.

Another interesting discussion thread emerged around the interpretation of the article's title. Some users debated the mathematically correct way to express the relationship between two and three, while others focused on the nuances of the percentage increase calculation. This led to a clarification about the difference between saying "three is 100% more than two" versus "three is 50% larger than two," highlighting the importance of precise language when discussing mathematical concepts.

Furthermore, a commenter brought up the topic of balanced ternary, a system that uses -1, 0, and 1 as its three states. They explained how this system simplifies certain mathematical operations and offered an example of representing numbers in balanced ternary. This introduced a different perspective on the potential benefits of ternary logic beyond the simple 0, 1, and 2 system.

Some users also discussed the potential benefits of ternary logic in specific applications, such as representing fractional values and optimizing certain algorithms. While acknowledging the challenges in widespread adoption, they suggested that ternary could hold promise for niche applications where its unique properties could be leveraged.

Finally, there was a brief mention of other alternative number systems beyond binary and ternary, acknowledging the broader landscape of computational possibilities and the ongoing exploration of different approaches to information processing.

Don't use cosine similarity carelessly

permalink

Posted: 2025-01-14 21:23:21

Cosine similarity, while popular for comparing vectors, can be misleading when vector magnitudes carry significant meaning. The blog post demonstrates how cosine similarity focuses solely on the angle between vectors, ignoring their lengths. This can lead to counterintuitive results, particularly in scenarios like recommendation systems where a small, highly relevant vector might be ranked lower than a large, less relevant one simply due to magnitude differences. The author advocates for considering alternatives like dot product or Euclidean distance, especially when vector magnitude represents important information like purchase count or user engagement. Ultimately, the choice of similarity metric should depend on the specific application and the meaning encoded within the vector data.

The blog post "Don't use cosine similarity carelessly" cautions against the naive application of cosine similarity, particularly in machine learning and recommendation systems, without a thorough understanding of its implications and potential pitfalls. The author meticulously illustrates how cosine similarity, while effective in certain scenarios, can produce misleading or undesirable results when the underlying data possesses specific characteristics.

The core argument revolves around the fact that cosine similarity solely focuses on the angle between vectors, effectively disregarding the magnitude or scale of those vectors. This can be problematic when comparing items with drastically different scales of interaction or activity. For instance, in a movie recommendation system, a user who consistently rates movies highly will appear similar to another user who rates movies highly, even if their taste in genres is vastly different. This is because the large magnitude of their ratings dominates the cosine similarity calculation, obscuring the nuanced differences in their preferences. The author underscores this with an example of book recommendations, where a voracious reader may appear similar to other avid readers regardless of their preferred genres simply due to the high volume of their reading activity.

The author further elaborates this point by demonstrating how cosine similarity can be sensitive to "bursts" of activity. A sudden surge in interaction with certain items, perhaps due to a promotional campaign or temporary trend, can disproportionately influence the similarity calculations, potentially leading to recommendations that are not truly reflective of long-term preferences.

The post provides a concrete example using a movie rating dataset. It showcases how users with different underlying preferences can appear deceptively similar based on cosine similarity if one user has rated many more movies overall. The author emphasizes that this issue becomes particularly pronounced in sparsely populated datasets, common in real-world recommendation systems.

The post concludes by suggesting alternative approaches that consider both the direction and magnitude of the vectors, such as Euclidean distance or Manhattan distance. These metrics, unlike cosine similarity, are sensitive to differences in scale and are therefore less susceptible to the pitfalls described earlier. The author also encourages practitioners to critically evaluate the characteristics of their data before blindly applying cosine similarity and to consider alternative metrics when magnitude plays a crucial role in determining true similarity. The overall message is that while cosine similarity is a valuable tool, its limitations must be recognized and accounted for to ensure accurate and meaningful results.

Summary of Comments ( 70 )
https://news.ycombinator.com/item?id=42704078

Hacker News users generally agreed with the article's premise, cautioning against blindly applying cosine similarity. Several commenters pointed out that the effectiveness of cosine similarity depends heavily on the specific use case and data distribution. Some highlighted the importance of normalization and feature scaling, noting that cosine similarity is sensitive to these factors. Others offered alternative methods, such as Euclidean distance or Manhattan distance, suggesting they might be more appropriate in certain situations. One compelling comment underscored the importance of understanding the underlying data and problem before choosing a similarity metric, emphasizing that no single metric is universally superior. Another emphasized how important preprocessing is, highlighting TF-IDF and BM25 as helpful techniques for text analysis before using cosine similarity. A few users provided concrete examples where cosine similarity produced misleading results, further reinforcing the author's warning.

The Hacker News post "Don't use cosine similarity carelessly" (https://news.ycombinator.com/item?id=42704078) sparked a discussion with several insightful comments regarding the article's points about the pitfalls of cosine similarity.

Several commenters agreed with the author's premise, emphasizing the importance of understanding the implications of using cosine similarity. One commenter highlighted the issue of scale invariance, pointing out that two vectors can have a high cosine similarity even if their magnitudes are vastly different, which can be problematic in certain applications. They used the example of comparing customer purchase behavior where one customer buys small quantities frequently and another buys large quantities infrequently. Cosine similarity might suggest they're similar, ignoring the significant difference in total spending.

Another commenter pointed out that the article's focus on document comparison and TF-IDF overlooks common scenarios like comparing embeddings from large language models (LLMs). They argue that in these cases, magnitude does often carry significant semantic meaning, and normalization can be detrimental. They specifically mentioned the example of sentence embeddings, where longer sentences tend to have larger magnitudes and often carry more information. Normalizing these embeddings would lose this information. This commenter suggested that the article's advice is too general and doesn't account for the nuances of various applications.

Expanding on this, another user added that even within TF-IDF, the magnitude can be a meaningful signal, suggesting that document length could be a relevant factor for certain types of comparisons. They suggested that blindly applying cosine similarity without considering such factors can be problematic.

One commenter offered a concise summary of the issue, stating that cosine similarity measures the angle between vectors, discarding information about their magnitudes. They emphasized the need to consider whether magnitude is important in the specific context.

Finally, a commenter shared a personal anecdote about a machine learning competition where using cosine similarity instead of Euclidean distance drastically improved their results. They attributed this to the inherent sparsity of the data, highlighting that the appropriateness of a similarity metric heavily depends on the nature of the data.

In essence, the comments generally support the article's caution against blindly using cosine similarity. They emphasize the importance of considering the specific context, understanding the implications of scale invariance, and recognizing that magnitude can often carry significant meaning depending on the application and data.

Stories with Tag misconceptions

I got fooled by AI-for-science hype–here's what it taught me

Summary of Comments ( 200 ) https://news.ycombinator.com/item?id=44037941

Kerosene did not save the sperm whale (2024)

Summary of Comments ( 60 ) https://news.ycombinator.com/item?id=43584303

SQLite-on-the-server is misunderstood: Better at hyper-scale than micro-scale

Summary of Comments ( 136 ) https://news.ycombinator.com/item?id=43244307

Vpternlog: When three is 100% more than two

Summary of Comments ( 1 ) https://news.ycombinator.com/item?id=42753953

Don't use cosine similarity carelessly

Summary of Comments ( 70 ) https://news.ycombinator.com/item?id=42704078

Summary of Comments ( 200 )
https://news.ycombinator.com/item?id=44037941

Summary of Comments ( 60 )
https://news.ycombinator.com/item?id=43584303

Summary of Comments ( 136 )
https://news.ycombinator.com/item?id=43244307

Summary of Comments ( 1 )
https://news.ycombinator.com/item?id=42753953

Summary of Comments ( 70 )
https://news.ycombinator.com/item?id=42704078