The paper "Auto-Differentiating Any LLM Workflow: A Farewell to Manual Prompting" introduces a method to automatically optimize LLM workflows. By representing prompts and other workflow components as differentiable functions, the authors enable gradient-based optimization of arbitrary metrics like accuracy or cost. This eliminates the need for manual prompt engineering, allowing users to simply specify their desired outcome and let the system learn the best prompts and parameters automatically. The approach, called DiffPrompt, uses a continuous relaxation of discrete text and employs efficient approximate backpropagation through the LLM. Experiments demonstrate the effectiveness of DiffPrompt across diverse tasks, showcasing improved performance compared to manual prompting and other automated methods.
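The core idea — treating the prompt as a continuous parameter and following gradients of a task metric — can be illustrated with a toy stand-in. Here the frozen "model" is just a differentiable scoring function and the "prompt" a small vector; the real system backpropagates (approximately) through an actual LLM, which this sketch does not attempt. All names and the quadratic loss are invented for illustration.

```python
# Toy sketch: gradient descent on a continuous "soft prompt".
# The "model" here is a stand-in quadratic scoring function, not a real LLM.

def model_loss(prompt, target):
    # How far the soft prompt is from the (toy) task optimum.
    return sum((p - t) ** 2 for p, t in zip(prompt, target))

def loss_grad(prompt, target):
    # Analytic gradient of the quadratic loss w.r.t. the prompt vector.
    return [2 * (p - t) for p, t in zip(prompt, target)]

def optimize_prompt(prompt, target, lr=0.1, steps=200):
    for _ in range(steps):
        grad = loss_grad(prompt, target)
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt

soft_prompt = optimize_prompt([0.0, 0.0, 0.0], target=[1.0, -2.0, 0.5])
print([round(p, 3) for p in soft_prompt])  # converges to the target
```

The point of the sketch is only the shape of the loop: prompt parameters updated by gradients of a scalar objective, with the model itself left frozen.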
OpenAI alleges that DeepSeek AI, a Chinese AI company, improperly used one of its GPT-series large language models to train DeepSeek's own competing large language model, "DeepSeek Coder." OpenAI claims to have found substantial code overlap and distinctive formatting patterns suggesting DeepSeek scraped outputs from OpenAI's model and used them as training data. This suspected unauthorized use violates OpenAI's terms of service, and OpenAI is reportedly considering legal action. The incident highlights growing concerns around intellectual property protection in the rapidly evolving AI field.
Several Hacker News commenters express skepticism of OpenAI's claims against DeepSeek, questioning the strength of their evidence and suggesting the move is anti-competitive. Some argue that reproducing the output of a model doesn't necessarily imply direct copying of the model weights, and point to the possibility of convergent evolution in training large language models. Others discuss the difficulty of proving copyright infringement in machine learning models and the broader implications for open-source development. A few commenters also raise concerns about the legal precedent this might set and the chilling effect it could have on future AI research. Several commenters call for OpenAI to release more details about their investigation and evidence.
Simon Willison achieved impressive code generation results using DeepSeek's new R1 model, running locally on consumer hardware via llama.cpp. He found R1, despite being smaller than other leading models, generated significantly better Python and JavaScript code, producing functional outputs on the first try more consistently. While still exhibiting some hallucination tendencies, particularly with external dependencies, R1 showed a promising ability to reason about code context and follow complex instructions. This performance, combined with its efficient local execution, positions R1 as a potentially game-changing tool for developer workflows.
Hacker News users discuss the potential of running DeepSeek R1 locally, particularly its performance under llama.cpp. Several commenters express excitement about the accessibility and affordability this offers for local LLM experimentation. Some raise questions about power consumption and whether the advertised performance holds up in real-world scenarios. Others note the rapid pace of hardware development in this space and anticipate even more powerful and efficient options soon. A few commenters share their experiences with similar local setups, highlighting the practical challenges and limitations, such as memory bandwidth constraints. There's also discussion about the broader implications of affordable, powerful local LLMs, including potential privacy and security benefits.
DeepSeek-R1 is a specialized AI model designed for complex search tasks within massive, unstructured datasets like codebases, technical documentation, and scientific literature. It employs a retrieval-augmented generation (RAG) architecture, combining a powerful retriever model to pinpoint relevant document chunks with a large language model (LLM) that synthesizes information from those chunks into a coherent response. DeepSeek-R1 boasts superior performance compared to traditional keyword search and smaller LLMs, delivering more accurate and comprehensive answers to complex queries. It achieves this through a novel "sparse memory attention" mechanism, allowing it to process and contextualize information from an extensive collection of documents efficiently. The model's advanced capabilities promise significant improvements in navigating and extracting insights from vast knowledge repositories.
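The retrieve-then-generate pattern described above can be sketched with a toy retriever and a stub in place of the LLM. Real systems use dense embeddings and a genuine language model; the word-overlap scoring and all names below are illustrative assumptions, not the paper's method.

```python
# Toy retrieve-then-generate (RAG) sketch: score chunks by word overlap,
# then hand the top-k chunks to a stand-in "LLM" for synthesis.

def tokens(text):
    return {w.strip(".,").lower() for w in text.split()}

def retrieve(query, chunks, k=2):
    q = tokens(query)
    return sorted(chunks, key=lambda c: len(q & tokens(c)), reverse=True)[:k]

def generate(query, context):
    # Stand-in for the LLM synthesis step.
    return f"Q: {query} | Context: " + " / ".join(context)

chunks = [
    "The retriever selects relevant document chunks.",
    "Bananas are yellow.",
    "The language model synthesizes answers from retrieved chunks.",
]
top = retrieve("how does the model use retrieved chunks", chunks)
print(generate("how does the model use retrieved chunks", top))
```

Only the division of labor matters here: a cheap retriever narrows the corpus so the expensive generator sees just the relevant context.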
Hacker News users discussed DeepSeek-R1's impressive multimodal capabilities, particularly its ability to connect text and images in complex ways. Some questioned the practicality and cost of training such a large model, while others wondered about its specific applications and potential impact on fields like robotics and medical imaging. Several commenters expressed skepticism about the claimed zero-shot performance, highlighting the potential for cherry-picked examples and the need for more rigorous evaluation. There was also interest in the model's architecture and training data, with some requesting more technical details. A few users compared DeepSeek-R1 to other multimodal models like Gemini and pointed out the rapid advancements happening in this area.
Google's TokenVerse introduces a novel approach to personalized image generation called multi-concept personalization. By modulating tokens within a diffusion model's latent space, users can inject multiple personalized concepts, like specific objects, styles, and even custom trained concepts, into generated images. This allows for fine-grained control over the generative process, enabling the creation of diverse and highly personalized visuals from text prompts. TokenVerse offers various personalization methods, including direct token manipulation and training personalized "DreamBooth" concepts, facilitating both explicit control and more nuanced stylistic influences. The approach boasts strong compositionality, allowing multiple personalized concepts to be seamlessly integrated into a single image.
HN users generally expressed skepticism about the practical applications of TokenVerse, Google's multi-concept personalization method for image editing. Several commenters questioned the real-world usefulness and pointed out the limited scope of demonstrated edits, suggesting the examples felt more like parlor tricks than a significant advancement. The computational cost and complexity of the technique were also raised as concerns, with some doubting its scalability or viability for consumer use. Others questioned the necessity of this approach compared to existing, simpler methods. There was some interest in the underlying technology and potential future applications, but overall the response was cautious and critical.
The author details their evolving experience using AI coding tools, specifically Cline and large language models (LLMs), for professional software development. Initially skeptical, they've found LLMs invaluable for tasks like generating boilerplate, translating between languages, explaining code, and even creating simple functions from descriptions. While acknowledging limitations such as hallucinations and the need for careful review, they highlight the significant productivity boost and learning acceleration achieved through AI assistance. The author emphasizes treating LLMs as advanced coding partners, requiring human oversight and understanding, rather than complete replacements for developers. They also anticipate future advancements will further blur the lines between human and AI coding contributions.
HN commenters generally agree with the author's positive experience using LLMs for coding, particularly for boilerplate and repetitive tasks. Several highlight the importance of understanding the code generated, emphasizing that LLMs are tools to augment, not replace, developers. Some caution against over-reliance and the potential for hallucinations, especially with complex logic. A few discuss specific LLM tools and their strengths, and some mention the need for improved prompting skills to achieve better results. One commenter points out the value of LLMs for translating code between languages, which the author hadn't explicitly mentioned. Overall, the comments reflect a pragmatic optimism about LLMs in coding, acknowledging their current limitations while recognizing their potential to significantly boost productivity.
DeepSeek-R1 introduces a novel reinforcement learning (RL) framework to enhance reasoning capabilities in Large Language Models (LLMs). It addresses the limitations of standard supervised fine-tuning by employing a reward model trained to evaluate the reasoning quality of generated text. This reward model combines human-provided demonstrations with self-consistency checks, leveraging chain-of-thought prompting to generate multiple reasoning paths and rewarding agreement among them. Experiments on challenging logical reasoning datasets demonstrate that DeepSeek-R1 significantly outperforms supervised learning baselines and other RL approaches, producing more logical and coherent explanations. The proposed framework offers a promising direction for developing LLMs capable of complex reasoning.
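The self-consistency ingredient of the reward signal can be sketched in a few lines: sample several chain-of-thought completions, then reward agreement with the majority answer. This is only one component — the paper combines it with human demonstrations — and the sampled answers below are invented for illustration.

```python
# Toy self-consistency reward: fraction of sampled reasoning paths whose
# final answer agrees with the modal answer.
from collections import Counter

def consistency_reward(sampled_answers):
    counts = Counter(sampled_answers)
    majority, n = counts.most_common(1)[0]
    return majority, n / len(sampled_answers)

answers = ["42", "42", "41", "42", "42"]  # final answers from 5 CoT samples
ans, reward = consistency_reward(answers)
print(ans, reward)  # → 42 0.8
```

Agreement across independently sampled reasoning paths serves as a cheap proxy for reasoning quality when no ground-truth label is available.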
Hacker News users discussed the difficulty of evaluating reasoning ability separate from memorization in LLMs, with some questioning the benchmark used in the paper. Several commenters highlighted the novelty of directly incentivizing reasoning steps as a valuable contribution. Concerns were raised about the limited scope of the demonstrated reasoning, focusing on simple arithmetic and symbolic manipulation. One commenter suggested the approach might be computationally expensive and doubted its scalability to more complex reasoning tasks. Others noted the paper's focus on chain-of-thought prompting, viewing it as a promising, though nascent, area of research. The overall sentiment seemed cautiously optimistic, recognizing the work as a step forward while also acknowledging its limitations.
Scale AI's "Humanity's Last Exam" benchmark evaluates large language models (LLMs) on complex, multi-step reasoning tasks across various domains like math, coding, and critical thinking, going beyond typical benchmark datasets. The results revealed that while top LLMs like GPT-4 demonstrate impressive abilities, even the best models still struggle with intricate reasoning, logical deduction, and robust coding, highlighting the significant gap between current LLMs and human-level intelligence. The benchmark aims to drive further research and development in more sophisticated and robust AI systems.
HN commenters largely criticized the "Humanity's Last Exam" framing as hyperbolic and marketing-driven. Several pointed out that the exam's focus on reasoning and logic, while important, doesn't represent the full spectrum of human intelligence and capabilities crucial for navigating complex real-world scenarios. Others questioned the methodology and representativeness of the "exam," expressing skepticism about the chosen tasks and the limited pool of participants. Some commenters also discussed the implications of AI surpassing human performance on such benchmarks, with varying degrees of concern about potential societal impact. A few offered alternative perspectives, suggesting that the exam could be a useful tool for understanding and improving AI systems, even if its framing is overblown.
The blog post explores using traditional machine learning (specifically, decision trees) to interpret and refine the output of less capable or "dumb" Large Language Models (LLMs). The author describes a scenario where an LLM is tasked with classifying customer service tickets, but its performance is unreliable. Instead of relying solely on the LLM's classification, a decision tree model is trained on the LLM's output (probabilities for each classification) along with other readily available features of the ticket, like length and sentiment. This hybrid approach leverages the LLM's initial analysis while allowing the decision tree to correct inaccuracies and improve overall classification performance, ultimately demonstrating how simpler models can bolster the effectiveness of flawed LLMs in practical applications.
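The hybrid idea above — a simple supervised model trained on the LLM's class probabilities plus cheap ticket features — can be sketched with a one-split decision stump. The post's setup presumably uses a full decision-tree library; the features, thresholds, and labels below are invented for illustration.

```python
# Minimal decision stump fit on (llm_prob_urgent, ticket_length), used to
# correct an unreliable LLM classification.

def fit_stump(rows, labels):
    # rows: list of feature tuples; labels: 0/1 ground truth.
    best = None
    for f in range(len(rows[0])):
        for thresh in sorted({r[f] for r in rows}):
            for flip in (0, 1):
                preds = [(r[f] >= thresh) ^ flip for r in rows]
                acc = sum(int(p) == y for p, y in zip(preds, labels)) / len(labels)
                if best is None or acc > best[0]:
                    best = (acc, f, thresh, flip)
    return best  # (accuracy, feature index, threshold, flip)

# Hand-made training data: (LLM's P(urgent), ticket length in characters).
rows = [(0.9, 40), (0.8, 35), (0.6, 300), (0.4, 320), (0.2, 30), (0.55, 310)]
labels = [1, 1, 1, 1, 0, 1]  # ground truth can disagree with raw LLM confidence
acc, feature, thresh, flip = fit_stump(rows, labels)
print(acc, feature, thresh, flip)
```

A real deployment would use a proper tree learner and held-out validation, but the mechanism is the same: the downstream model learns where the LLM's raw probabilities need correcting.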
Hacker News users discuss the practicality and limitations of the proposed decision-tree approach to mitigate LLM "hallucinations." Some express skepticism about its scalability and maintainability, particularly with the rapid advancement of LLMs, suggesting that improving prompt engineering or incorporating retrieval mechanisms might be more effective. Others highlight the potential value of the decision tree for specific, well-defined tasks where accuracy is paramount and the domain is limited. The discussion also touches on the trade-off between complexity and performance, and the importance of understanding the underlying limitations of LLMs rather than relying on patches. A few commenters note the similarity to older expert systems and question if this represents a step back in AI development. Finally, some appreciate the author's honest exploration of alternative solutions, acknowledging that relying solely on improving LLM accuracy might not be the optimal path forward.
Flame is a new programming language designed specifically for spreadsheet formulas. It aims to improve upon existing spreadsheet formula systems by offering stronger typing, better modularity, and improved error handling. Flame programs are compiled to a low-level bytecode, which allows for efficient execution. The authors demonstrate that Flame can express complex spreadsheet tasks more concisely and clearly than traditional formulas, while also offering performance comparable to or exceeding existing spreadsheet software. This makes Flame a potential candidate for replacing or augmenting current formula systems in spreadsheets, leading to more robust and maintainable spreadsheet applications.
Hacker News users discussed Flame, a language model designed for spreadsheet formulas. Several commenters expressed skepticism about the practicality and necessity of such a tool, questioning whether natural language is truly superior to traditional formula syntax for spreadsheet tasks. Some argued that existing formula syntax, while perhaps not intuitive initially, offers precision and control that natural language descriptions might lack. Others pointed out potential issues with ambiguity in natural language instructions. There was some interest in the model's ability to explain existing formulas, but overall, the reception was cautious, with many doubting the real-world usefulness of this approach. A few commenters expressed interest in seeing how Flame handles complex, real-world spreadsheet scenarios, rather than the simplified examples provided.
PolyChat is a web app that lets you compare responses from multiple large language models (LLMs) simultaneously. You can enter a single prompt and receive outputs from a variety of models, including open-source and commercial options like GPT-4, Claude, and several others, making it easy to evaluate their different strengths and weaknesses in real-time for various tasks. The platform aims to provide a convenient way to experiment with and understand the nuances of different LLMs.
HN users generally expressed interest in the multi-LLM chat platform PolyChat, praising its clean interface and ease of use. Several commenters focused on potential use cases, such as comparing different models' outputs for specific tasks like translation or code generation. Some questioned the long-term viability of offering so many models, particularly given the associated costs, and suggested focusing on a curated selection. There was also a discussion about the ethical implications of using jailbroken models and whether such access should be readily available. Finally, a few users requested features like chat history saving and the ability to adjust model parameters.
Luke Plant explores the potential uses and pitfalls of Large Language Models (LLMs) in Christian apologetics. While acknowledging LLMs' ability to quickly generate content, summarize arguments, and potentially reach wider audiences, he cautions against over-reliance. He argues that LLMs lack genuine understanding and the ability to engage with nuanced theological concepts, risking misrepresentation or superficial arguments. Furthermore, the persuasive nature of LLMs could prioritize rhetorical flourish over truth, potentially deceiving rather than convincing. Plant suggests LLMs can be valuable tools for research, brainstorming, and refining arguments, but emphasizes the irreplaceable role of human reason, spiritual discernment, and authentic faith in effective apologetics.
HN users generally express skepticism towards using LLMs for Christian apologetics. Several commenters point out the inherent contradiction in using a probabilistic model based on statistical relationships to argue for absolute truth and divine revelation. Others highlight the potential for LLMs to generate superficially convincing but ultimately flawed arguments, potentially misleading those seeking genuine understanding. The risk of misrepresenting scripture or theological nuances is also raised, along with concerns about the LLM potentially becoming the focus of faith rather than the divine itself. Some acknowledge potential uses in generating outlines or brainstorming ideas, but ultimately believe relying on LLMs undermines the core principles of faith and reasoned apologetics. A few commenters suggest exploring the philosophical implications of using LLMs for religious discourse, but the overall sentiment is one of caution and doubt.
Kimi K1.5 is a reinforcement learning (RL) system designed for scalability and efficiency by leveraging Large Language Models (LLMs). It utilizes a novel approach called "LLM-augmented world modeling" where the LLM predicts future world states based on actions, improving sample efficiency and allowing the RL agent to learn with significantly fewer interactions with the actual environment. This prediction happens within a "latent space," a compressed representation of the environment learned by a variational autoencoder (VAE), which further enhances efficiency. The system's architecture integrates a policy LLM, a world model LLM, and the VAE, working together to generate and evaluate action sequences, enabling the agent to learn complex tasks in visually rich environments with fewer real-world samples than traditional RL methods.
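The "learning in imagination" idea above can be sketched without any neural network: a stand-in world model predicts the next (here, scalar) latent state, and the agent scores candidate action sequences entirely inside that model, never touching the real environment. The dynamics, actions, and goal below are invented for illustration.

```python
# Toy imagination rollout: plan by simulating actions in a stand-in world model.
from itertools import product

def world_model(latent, action):
    # Invented dynamics: actions nudge the latent state left or right.
    return latent + {"left": -1.0, "right": 1.0}[action]

def imagined_return(latent, actions, goal=10.0):
    for a in actions:
        latent = world_model(latent, a)
    return -abs(goal - latent)  # closer to the goal = higher return

def plan(latent, horizon=3):
    candidates = product(["left", "right"], repeat=horizon)
    return max(candidates, key=lambda seq: imagined_return(latent, seq))

print(plan(7.0))  # best 3-step plan from latent state 7.0
```

The sample-efficiency claim follows from this structure: every candidate rollout above is free, whereas a model-free agent would pay one real environment interaction per step.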
Hacker News users discussed Kimi K1.5's approach to scaling reinforcement learning with LLMs, expressing both excitement and skepticism. Several commenters questioned the novelty, pointing out similarities to existing techniques like hindsight experience replay and prompting language models with desired outcomes. Others debated the practical applicability and scalability of the approach, particularly concerning the cost and complexity of training large language models. Some highlighted the potential benefits of using LLMs for reward modeling and generating diverse experiences, while others raised concerns about the limitations of relying on offline data and the potential for biases inherited from the language model. Overall, the discussion reflected a cautious optimism tempered by a pragmatic awareness of the challenges involved in integrating LLMs with reinforcement learning.
The post argues that individual use of ChatGPT and similar AI models has a negligible environmental impact compared to other everyday activities like driving or streaming video. While large language models require significant resources to train, the energy consumed during individual inference (i.e., asking it questions) is minimal. The author uses analogies to illustrate this point, comparing the training process to building a road and individual use to driving on it. Therefore, focusing on individual usage as a source of environmental concern is misplaced and distracts from larger, more impactful areas like the initial model training or even more general sources of energy consumption. The author encourages engagement with AI and emphasizes the potential benefits of its widespread adoption.
Hacker News commenters largely agree with the article's premise that individual AI use isn't a significant environmental concern compared to other factors like training or Bitcoin mining. Several highlight the hypocrisy of focusing on individual use while ignoring the larger impacts of data centers or military operations. Some point out the potential benefits of AI for optimization and problem-solving that could lead to environmental improvements. Others express skepticism, questioning the efficiency of current models and suggesting that future, more complex models could change the environmental cost equation. A few also discuss the potential for AI to exacerbate existing societal inequalities, regardless of its environmental footprint.
The blog post argues that while Large Language Models (LLMs) have significantly impacted Natural Language Processing (NLP), reports of traditional NLP's death are greatly exaggerated. LLMs excel in tasks requiring vast amounts of data, like text generation and summarization, but struggle with specific, nuanced tasks demanding precise control and explainability. Traditional NLP techniques, like rule-based systems and smaller, fine-tuned models, remain crucial for these scenarios, particularly in industry applications where reliability and interpretability are paramount. The author concludes that LLMs and traditional NLP are complementary, offering a combined approach that leverages the strengths of both for comprehensive and robust solutions.
HN commenters largely agree that LLMs haven't killed traditional NLP, but significantly shifted its focus. Several argue that traditional NLP techniques are still crucial for tasks where explainability, fine-grained control, or limited data are factors. Some point out that LLMs themselves are built upon traditional NLP concepts. Others suggest a new division of labor, with LLMs handling general tasks and traditional NLP methods used for specific, nuanced problems, or refining LLM outputs. A few more skeptical commenters believe LLMs will eventually subsume most NLP tasks, but even they acknowledge the current limitations regarding cost, bias, and explainability. There's also discussion of the need for adapting NLP education and the potential for hybrid approaches combining the strengths of both paradigms.
Transformer² introduces a novel approach to Large Language Models (LLMs) called "self-adaptive prompting." Instead of relying on fixed, hand-crafted prompts, Transformer² uses a smaller, trainable "prompt generator" model to dynamically create optimal prompts for a larger, frozen LLM. This allows the system to adapt to different tasks and input variations without retraining the main LLM, improving performance on complex reasoning tasks like program synthesis and mathematical problem-solving while reducing computational costs associated with traditional fine-tuning. The prompt generator learns to construct prompts that elicit the desired behavior from the frozen LLM, effectively personalizing the interaction for each specific input. This modular design offers a more efficient and adaptable alternative to current LLM paradigms.
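The modular split described above — a small trainable component that shapes prompts for a frozen model — can be caricatured as template selection. The templates, scoring, and stub model below are invented; the actual system trains a neural prompt generator rather than picking from a fixed list.

```python
# Toy sketch: a "prompt generator" picks the template that scores best with a
# frozen model, whose weights are never updated.

def frozen_model(prompt):
    # Stand-in LLM: responds well only to prompts that ask for steps.
    return 1.0 if "step by step" in prompt else 0.2

TEMPLATES = [
    "Answer: {q}",
    "Think step by step, then answer: {q}",
]

def train_prompt_generator(validation_questions):
    # "Training" here = choose the template with the best average score.
    def avg_score(template):
        scores = [frozen_model(template.format(q=q)) for q in validation_questions]
        return sum(scores) / len(scores)
    return max(TEMPLATES, key=avg_score)

best = train_prompt_generator(["2+2?", "capital of France?"])
print(best)
```

The design point survives the caricature: all adaptation lives in the cheap prompt-producing component, so the expensive LLM never needs retraining.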
HN users discussed the potential of Transformer², particularly its adaptability to different tasks and modalities without retraining. Some expressed skepticism about the claimed improvements, especially regarding reasoning capabilities, emphasizing the need for more rigorous evaluation beyond cherry-picked examples. Several commenters questioned the novelty, comparing it to existing techniques like prompt engineering and hypernetworks, while others pointed out the potential for increased computational cost. The discussion also touched upon the broader implications of adaptable models, including their potential for misuse and the challenges of ensuring safety and alignment. Several users expressed excitement about the potential of truly general-purpose AI models that can seamlessly switch between tasks, while others remained cautious, awaiting more concrete evidence of the claimed advancements.
The blog post explores using entropy as a measure of the predictability and "surprise" of Large Language Model (LLM) outputs. It explains how to calculate entropy character-by-character and demonstrates that higher entropy generally corresponds to more creative or unexpected text. The author argues that while tools like perplexity exist, entropy offers a more granular and interpretable way to analyze LLM behavior, potentially revealing insights into the model's internal workings and helping identify areas for improvement, such as reducing repetitive or predictable outputs. They provide Python code examples for calculating entropy and showcase its application in evaluating different LLM prompts and outputs.
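The character-level entropy calculation the post describes follows directly from the standard Shannon formula; the snippet below is reimplemented from that formula, not copied from the post's code.

```python
# Character-level Shannon entropy of a string, in bits per character.
import math
from collections import Counter

def char_entropy(text):
    counts = Counter(text)
    total = len(text)
    ent = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return max(0.0, ent)  # clamp the -0.0 produced by a single-symbol string

print(char_entropy("aaaa"))             # 0.0 — perfectly predictable
print(round(char_entropy("abab"), 3))   # 1.0 — two equally likely symbols
```

Low values indicate repetitive, predictable text; higher values indicate more varied output, which is the axis the post uses to probe LLM behavior.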
Hacker News users discussed the relationship between LLM output entropy and interestingness/creativity, generally agreeing with the article's premise. Some debated the best metrics for measuring "interestingness," suggesting alternatives like perplexity or considering audience-specific novelty. Others pointed out the limitations of entropy alone, highlighting the importance of semantic coherence and relevance. Several commenters offered practical applications, like using entropy for prompt engineering and filtering outputs, or combining it with other metrics for better evaluation. There was also discussion on the potential for LLMs to maximize entropy for "clickbait" generation and the ethical implications of manipulating these metrics.
Anthropic's post details their research into building more effective "agents," AI systems capable of performing a wide range of tasks by interacting with software tools and information sources. They focus on improving agent performance through a combination of techniques: natural language instruction, few-shot learning from demonstrations, and chain-of-thought prompting. Their experiments, using tools like web search and code execution, demonstrate significant performance gains from these methods, particularly chain-of-thought reasoning which enables complex problem-solving. Anthropic emphasizes the potential of these increasingly sophisticated agents to automate workflows and tackle complex real-world problems. They also highlight the ongoing challenges in ensuring agent reliability and safety, and the need for continued research in these areas.
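A tool-using agent of the kind described can be reduced to a simple harness loop: the model emits either a tool call or a final answer, and the harness executes tools and feeds observations back until the task is done. This is a generic sketch, not Anthropic's implementation; the tool names, protocol strings, and scripted model are all invented.

```python
# Toy agent loop: dispatch tool calls emitted by a (scripted) model.

def calc_tool(expr):
    # Toy only; never eval untrusted input in real code.
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"calc": calc_tool}

def run_agent(model, task, max_steps=5):
    transcript = [f"TASK: {task}"]
    for _ in range(max_steps):
        action = model(transcript)
        if action.startswith("FINAL:"):
            return action[len("FINAL:"):].strip()
        name, _, arg = action.partition(":")
        transcript.append(f"OBSERVATION: {TOOLS[name.strip()](arg.strip())}")
    return "gave up"

def scripted_model(transcript):
    # Stand-in for an LLM: one tool call, then a final answer.
    if not any(line.startswith("OBSERVATION") for line in transcript):
        return "calc: 6 * 7"
    return "FINAL: " + transcript[-1].removeprefix("OBSERVATION: ")

print(run_agent(scripted_model, "multiply 6 by 7"))  # → 42
```

The reliability questions the post raises live in this loop: a real LLM can emit malformed actions, call the wrong tool, or never reach `FINAL:`, which is why the step cap and careful tool sandboxing matter.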
Hacker News users discuss Anthropic's approach to building effective "agents" by chaining language models. Several commenters express skepticism towards the novelty of this approach, pointing out that it's essentially a sophisticated prompt chain, similar to existing techniques like Auto-GPT. Others question the practical utility given the high cost of inference and the inherent limitations of LLMs in reliably performing complex tasks. Some find the concept intriguing, particularly the idea of using a "natural language API," while others note the lack of clarity around what constitutes an "agent" and the absence of a clear problem being solved. The overall sentiment leans towards cautious interest, tempered by concerns about overhyping incremental advancements in LLM applications. Some users highlight the impressive engineering and research efforts behind the work, even if the core concept isn't groundbreaking. The potential implications for automating more complex workflows are acknowledged, but the consensus seems to be that significant hurdles remain before these agents become truly practical and widely applicable.
The article argues that integrating Large Language Models (LLMs) directly into software development workflows, aiming for autonomous code generation, faces significant hurdles. While LLMs excel at generating superficially correct code, they struggle with complex logic, debugging, and maintaining consistency. Fundamentally, LLMs lack the deep understanding of software architecture and system design that human developers possess, making them unsuitable for building and maintaining robust, production-ready applications. The author suggests that focusing on augmenting developer capabilities, rather than replacing them, is a more promising direction for LLM application in software development. This includes tasks like code completion, documentation generation, and test case creation, where LLMs can boost productivity without needing a complete grasp of the underlying system.
Hacker News commenters largely disagreed with the article's premise. Several argued that LLMs are already proving useful for tasks like code generation, refactoring, and documentation. Some pointed out that the article focuses too narrowly on LLMs fully automating software development, ignoring their potential as powerful tools to augment developers. Others highlighted the rapid pace of LLM advancement, suggesting it's too early to dismiss their future potential. A few commenters agreed with the article's skepticism, citing issues like hallucination, debugging difficulties, and the importance of understanding underlying principles, but they represented a minority view. A common thread was the belief that LLMs will change software development, but the specifics of that change are still unfolding.
The paper "A Taxonomy of AgentOps" proposes a structured classification system for the emerging field of Agent Operations (AgentOps). It defines AgentOps as the discipline of deploying, managing, and governing autonomous agents at scale. The taxonomy categorizes AgentOps challenges across four key dimensions: Agent Lifecycle (creation, deployment, operation, and retirement), Agent Capabilities (perception, planning, action, and communication), Operational Scope (individual, collaborative, and systemic), and Management Aspects (monitoring, control, security, and ethics). This framework aims to provide a common language and understanding for researchers and practitioners, enabling them to better navigate the complex landscape of AgentOps and develop effective solutions for building and managing robust, reliable, and responsible agent systems.
Hacker News users discuss the practicality and scope of the proposed "AgentOps" taxonomy. Some express skepticism about its novelty, arguing that many of the described challenges are already addressed within existing DevOps and MLOps practices. Others question the need for another specialized "Ops" category, suggesting it might contribute to unnecessary fragmentation. However, some find the taxonomy valuable for clarifying the emerging field of agent development and deployment, particularly highlighting the focus on autonomy, continuous learning, and complex interactions between agents. The discussion also touches upon the importance of observability and debugging in agent systems, and the need for robust testing frameworks. Several commenters raise concerns about security and safety, particularly in the context of increasingly autonomous agents.
Summary of Comments (15)
https://news.ycombinator.com/item?id=42861815
Hacker News users discuss the potential of automatic differentiation for LLM workflows, expressing excitement but also raising concerns. Several commenters highlight the potential for overfitting and the need for careful consideration of the objective function being optimized. Some question the practical applicability given the computational cost and complexity of differentiating through large LLMs. Others express skepticism about abandoning manual prompting entirely, suggesting it remains valuable for high-level control and creativity. The idea of applying gradient descent to prompt engineering is generally seen as innovative and potentially powerful, but the long-term implications and practical limitations require further exploration. Some users also point out potential misuse cases, such as generating more effective spam or propaganda. Overall, the sentiment is cautiously optimistic, acknowledging the theoretical appeal while recognizing the significant challenges ahead.
The Hacker News post titled "Auto-Differentiating Any LLM Workflow: A Farewell to Manual Prompting" (linking to the arXiv paper at https://arxiv.org/abs/2501.16673) generated a moderate discussion with a mix of excitement and skepticism.
Several commenters expressed interest in the potential of automatically optimizing LLM workflows through differentiation. They saw it as a significant step towards making prompt engineering more systematic and less reliant on trial and error. The idea of treating prompts as parameters that can be learned resonated with many, as manual prompt engineering is often perceived as a tedious and time-consuming process. Some envisioned applications beyond simple prompt optimization, such as fine-tuning entire workflows involving multiple LLMs or other components.
However, skepticism was also present. Some questioned the practicality of the approach, particularly regarding the computational cost of differentiating through complex LLM pipelines. The concern was raised that the resources required for such optimization might outweigh the benefits, especially for smaller projects or individuals with limited access to computational power. The reliance on differentiable functions within the workflow was also pointed out as a potential limitation, restricting the types of operations that could be included in the optimized pipeline.
Another point of discussion revolved around the black-box nature of LLMs. Even with automated optimization, understanding why a particular prompt or workflow performs well remains challenging. Some commenters argued that this lack of interpretability could hinder debugging and further development. The potential for overfitting to specific datasets or benchmarks was also mentioned as a concern, emphasizing the need for careful evaluation and generalization testing.
Finally, some commenters drew parallels to existing techniques in machine learning, such as hyperparameter optimization and neural architecture search. They questioned whether the proposed approach offered significant advantages over these established methods, suggesting that it might simply be a rebranding of familiar concepts within the context of LLMs. Despite the potential benefits, some believed that manual prompt engineering would still play a crucial role, especially in defining the initial structure and objectives of the LLM workflow.