hackslash dot org

LLMs can see and hear without any training

Posted: 2025-04-26 13:38:25

Facebook researchers have introduced Modality-Independent Large-Scale models (MILS), demonstrating that large language models can process and understand information from diverse modalities like audio and images without requiring explicit training on those specific data types. By leveraging the rich semantic representations learned from text, MILS can directly interpret image pixel values and audio waveform amplitudes as if they were sequences of tokens, similar to text. This suggests a potential pathway towards truly generalist AI models capable of seamlessly integrating and understanding information across different modalities.

The Facebook AI Research (FAIR) team has introduced a groundbreaking advancement in Large Language Models (LLMs) with their Multimodal In-context Learning and Synthesizing (MILS) framework. This innovative approach empowers LLMs to process and understand diverse modalities, including images and audio, without requiring any explicit training on these specific data types. This represents a significant departure from traditional multimodal models, which typically necessitate extensive pre-training on massive datasets of paired multimodal data. MILS achieves this feat by leveraging the inherent in-context learning capabilities already present within pre-trained LLMs. Instead of directly training the model on visual or auditory data, MILS transforms these inputs into a textual format that the LLM can readily interpret. This textual representation effectively describes the multimodal input, allowing the LLM to process it as if it were processing any other text-based information.

The core of MILS lies in its utilization of pre-trained "perceptual experts." These experts are specialized models, distinct from the core LLM, trained to generate descriptive text captions for images or audio. For instance, an image perceptual expert might analyze a photograph and generate a detailed caption describing the objects, actions, and relationships present within the scene. Similarly, an audio perceptual expert could transcribe spoken words or describe the sounds present in an audio clip. These text descriptions, generated by the perceptual experts, are then provided to the LLM as input. Essentially, the LLM "sees" and "hears" through the lens of these textual descriptions, effectively bypassing the need for direct sensory processing.

This innovative approach allows LLMs to perform a variety of multimodal tasks without any specific training on those modalities. For example, MILS can enable an LLM to answer questions about an image, generate descriptive captions for audio clips, or even translate speech into another language. The flexibility and adaptability of MILS stem from the fact that the LLM remains unchanged. The only modification lies in the introduction of the perceptual experts, which act as intermediaries, translating non-textual information into a language the LLM can understand. This approach significantly simplifies the process of incorporating new modalities, as it only requires training a new perceptual expert for the desired data type, leaving the core LLM untouched. This opens up a vast landscape of possibilities for integrating LLMs into diverse multimodal applications without the computational expense and complexity associated with traditional multimodal training.

Summary of Comments ( 37 )
https://news.ycombinator.com/item?id=43803518

Hacker News users discussed the implications of Meta's ImageBind, which allows LLMs to connect various modalities (text, image/video, audio, depth, thermal, and IMU data) without explicit training on those connections. Several commenters expressed excitement about the potential applications, including robotics, accessibility features, and richer creative tools. Some questioned the practical utility given the computational cost and raised concerns about the potential for misuse, such as creating more sophisticated deepfakes. Others debated the significance of the research, with some arguing it's a substantial step towards more general AI while others viewed it as an incremental improvement over existing techniques. A few commenters highlighted the lack of clear explanations of the emergent behavior and called for more rigorous evaluation.

The Hacker News post titled "LLMs can see and hear without any training" (linking to the GitHub repository for Facebook Research's MILS project) sparked a discussion with several interesting comments.

Several commenters expressed skepticism about the claim of "zero-shot" capability. One commenter pointed out that while the models haven't been explicitly trained on image, video, or audio data, they have been trained on a massive text corpus, which likely contains descriptions and textual representations of such multimedia content. This implicit exposure could explain their apparent ability to process these modalities. This commenter argued that calling it "zero-shot" is misleading and obscures the indirect training the models have received.

Another commenter echoed this sentiment, emphasizing the vastness of the training data for LLMs and suggesting that it likely contains enough text describing images and sounds to give the models a rudimentary understanding of these modalities. They drew an analogy to a human learning about a concept solely through textual descriptions, arguing that while direct experience is different, a significant amount of knowledge can still be gleaned from text alone.

A different line of discussion focused on the potential applications of this research. One commenter speculated about the possibilities of using LLMs for tasks like generating image descriptions for visually impaired individuals or transcribing audio in real-time. They saw the potential for significant accessibility improvements.

Some comments delved into the technical aspects of the research. One commenter questioned the specifics of the model's architecture and how it handles different modalities. They were particularly interested in understanding how the model integrates information from different sources, such as text and images. Another technical comment questioned the scalability of the approach, wondering how well it would perform with larger and more complex datasets.

Finally, a few comments offered a more cautious perspective. One commenter noted that while the research is interesting, it’s important to remember that it's still early days. They cautioned against overhyping the capabilities of LLMs and emphasized the need for further research and evaluation. Another commenter pointed out the potential ethical implications of this technology, particularly regarding privacy and potential misuse.

In summary, the comments on the Hacker News post reflect a mixture of excitement, skepticism, and cautious optimism about the research. Many commenters questioned the "zero-shot" framing, highlighting the implicit learning from the massive text corpora used to train LLMs. Others explored potential applications and technical details, while some emphasized the need for further research and consideration of ethical implications.

The Llama 4 herd

permalink

Posted: 2025-04-05 18:33:56

Meta has announced Llama 4, a collection of foundational models that boast improved performance and expanded capabilities compared to their predecessors. Llama 4 is available in various sizes and has been trained on a significantly larger dataset of text and code. Notably, Llama 4 introduces multimodal capabilities, allowing it to process both text and images. This empowers the models to perform tasks like image captioning, visual question answering, and generating more detailed image descriptions. Meta emphasizes their commitment to open innovation and responsible development by releasing Llama 4 under a non-commercial license for research and non-commercial use, aiming to foster broader community involvement in AI development and safety research.

Meta's Artificial Intelligence research division has unveiled the latest iteration of their Large Language Model (LLM), Llama 4, marking a significant advancement in multimodal intelligence. This new model represents a substantial leap beyond purely text-based interactions, demonstrating a sophisticated capability to process and generate content across various modalities, including images, audio, and video, in addition to text. This multimodal proficiency allows Llama 4 to understand and respond to complex queries and tasks involving diverse data formats, opening up a wide range of potential applications previously inaccessible to single-modality models.

One of the key innovations within Llama 4 is its enhanced visual understanding. The model can not only identify objects and scenes within images but also interpret complex visual relationships and context, enabling it to answer intricate questions about visual content. This sophisticated visual processing capability is further amplified by the model's ability to generate detailed captions and descriptions for images, effectively bridging the gap between visual and textual information. Furthermore, Llama 4 exhibits the impressive capacity to answer questions pertaining to images, demonstrating a deep understanding of the depicted content.

Beyond image comprehension, Llama 4 showcases nascent capabilities in other modalities. While still under development, the model's ability to process audio and video signals suggests a future where seamless interaction with multimedia content is commonplace. This expansion beyond text unlocks the potential for richer, more nuanced human-computer interactions and lays the groundwork for groundbreaking applications in fields such as content creation, accessibility, and personalized learning experiences.

Meta emphasizes the rigorous safety evaluations conducted on Llama 4, highlighting their commitment to responsible AI development. The model has undergone extensive testing and fine-tuning to mitigate potential risks associated with large language models, such as generating harmful or biased content. This meticulous approach to safety is paramount given the model's advanced capabilities and the potential impact of its widespread deployment.

While specific technical details regarding the model's architecture and training data remain limited in the announcement, Meta underscores the significant improvements in performance and efficiency compared to previous iterations. This suggests advancements in model design and training methodologies that contribute to Llama 4's enhanced capabilities and multimodal proficiency. The release of Llama 4 signifies a notable step towards more intelligent and versatile AI systems, promising transformative advancements in how we interact with and leverage the power of information across multiple modalities.

Summary of Comments ( 561 )
https://news.ycombinator.com/item?id=43595585

Hacker News users discussed the implications of Llama 2's multimodal capabilities, particularly its image understanding. Some expressed excitement about potential applications like image-based Q&A and generating alt-text for accessibility. Skepticism arose around Meta's closed-source approach with Llama 2, contrasting it with the fully open Llama 1. Several commenters debated the competitive landscape, comparing Llama 2 to Google's Gemini and open-source models, questioning whether Llama 2 offered significant advantages. The closed nature also raised concerns about reproducibility of research and community contributions. Others noted the rapid pace of AI advancement and speculated on future developments. A few users highlighted the potential for misuse, such as generating misinformation.

The Hacker News post "The Llama 4 herd" discussing Meta's Llama 4 multimodal model has generated a fair number of comments, exploring various aspects and implications of the announcement.

Several commenters express skepticism about the "open source" nature of Llama 4, pointing out that the model's commercial use is restricted for companies with over 700 million monthly active users. This restriction effectively prevents significant commercial competitors from using the model, raising questions about Meta's motivations and the true openness of the release. Some speculate that this might be a strategic move to gain market share and potentially monetize the model later.

A recurring theme is the comparison between Llama 4 and Google's Gemini. Some users suggest that Meta's release is a direct response to Gemini and a bid to remain competitive in the generative AI landscape. Comparisons are drawn between the capabilities of both models, with some commenters arguing for Gemini's superiority in certain aspects. Others express anticipation for benchmark comparisons to provide a clearer picture of the relative strengths and weaknesses of each model.

The multimodal capabilities of Llama 4, specifically its ability to process both text and images, draw significant interest. Commenters discuss the potential applications of this technology, including content creation, accessibility improvements, and enhanced user interfaces. However, some also raise concerns about potential misuse, such as generating deepfakes or facilitating the spread of misinformation.

The closed-source nature of specific model weights, particularly those for the larger Llama 4 models, is a point of discussion. Some users express disappointment that these weights are not publicly available, limiting the research and development opportunities for the broader community. The lack of transparency is criticized, with speculation about the reasons behind Meta's decision.

Several commenters dive into technical details, discussing aspects such as the model's architecture, training data, and performance characteristics. There's interest in understanding the specifics of the multimodal integration and how it contributes to the model's overall capabilities. Some users also inquire about the computational resources required to run the model and its potential accessibility for researchers and developers with limited resources.

Finally, there's discussion about the broader implications of the increasing accessibility of powerful AI models like Llama 4. Concerns are raised about the potential societal impact, including job displacement, ethical considerations, and the need for responsible development and deployment of such technologies. The conversation reflects a mix of excitement about the potential advancements and apprehension about the potential risks associated with widespread adoption of generative AI.

16-Bit to 1-Bit: Visual KV Cache Quantization for Efficient Multimodal LLMs

permalink

Posted: 2025-03-05 16:09:26

This paper introduces Visual Key-Value (KV) Cache Quantization, a technique for compressing the visual features stored in the key-value cache of multimodal large language models (MLLMs). By aggressively quantizing these 16-bit features down to 1-bit representations, the memory footprint of the visual cache is significantly reduced, enabling efficient storage and faster retrieval of visual information. This quantization method employs a learned codebook specifically designed for visual features and incorporates techniques to mitigate the information loss associated with extreme compression. Experiments demonstrate that this approach maintains competitive performance on various multimodal tasks while drastically reducing memory requirements, paving the way for more efficient and scalable deployment of MLLMs.

The paper "16-Bit to 1-Bit: Visual KV Cache Quantization for Efficient Multimodal LLMs" addresses the growing computational demands of multimodal Large Language Models (LLMs), particularly those incorporating visual information. These models, while powerful, face challenges regarding memory and computational costs, especially when handling long sequences of visual data in tasks like video understanding or visual dialogue. Storing and accessing the Key-Value (KV) cache, a crucial component for maintaining context in LLMs, becomes a bottleneck due to the high dimensionality of visual features.

The authors propose a novel quantization technique focused on compressing the visual features stored within the KV cache, reducing memory footprint and accelerating retrieval. Instead of the standard 16-bit floating-point representation, they explore aggressive quantization down to 1-bit, representing each value with a single binary digit. This dramatic reduction in precision, while potentially introducing information loss, offers significant efficiency gains.

The core of their approach revolves around a learned, data-dependent quantization scheme. Rather than relying on standard uniform quantization methods, they introduce a trainable binary quantizer specifically tailored for visual features within the KV cache. This learned quantizer maps the high-dimensional floating-point vectors to binary codes, optimizing the preservation of crucial information for model performance.

The paper explores two specific variants of this learned binary quantization: vector-wise and dimension-wise quantization. Vector-wise quantization treats each vector as a whole, learning a single threshold for binarization, while dimension-wise quantization learns individual thresholds for each dimension of the feature vector, allowing for finer-grained control. The authors hypothesize that dimension-wise quantization, although requiring more learned parameters, might better capture the varying importance of different feature dimensions.

The effectiveness of their proposed method is evaluated on several multimodal benchmarks, including video question answering and visual dialogue. They demonstrate that even with extreme quantization down to 1-bit, the performance degradation remains surprisingly small, especially when employing the dimension-wise quantization strategy. This suggests that the crucial contextual information within the KV cache can be effectively represented with significantly reduced precision, leading to substantial savings in both memory and computational resources. The paper concludes that this aggressive quantization technique provides a promising pathway for deploying efficient and scalable multimodal LLMs, paving the way for broader adoption and application of these powerful models.

Summary of Comments ( 1 )
https://news.ycombinator.com/item?id=43268477

HN users discuss the tradeoffs of quantizing key/value caches in multimodal LLMs. Several express skepticism about the claimed performance gains, questioning the methodology and the applicability to real-world scenarios. Some point out the inherent limitations of 1-bit quantization, particularly regarding accuracy and retrieval quality. Others find the approach interesting, but highlight the need for further investigation into the impact on different model architectures and tasks. The discussion also touches upon alternative quantization techniques and the importance of considering memory bandwidth alongside storage capacity. A few users share relevant resources and personal experiences with quantization in similar contexts.

The Hacker News post titled "16-Bit to 1-Bit: Visual KV Cache Quantization for Efficient Multimodal LLMs" (https://news.ycombinator.com/item?id=43268477) has a modest number of comments, sparking a discussion around the trade-offs between performance and efficiency in multimodal large language models (LLMs).

Several commenters focus on the practicality and implications of the proposed quantization technique. One user questions the actual memory savings achieved, pointing out that while the key-value cache might be reduced, other components like the model weights remain large. This raises the issue of whether the reduction in KV cache size significantly impacts the overall memory footprint, especially in the context of inference on resource-constrained devices.

Another commenter highlights the potential impact on inference speed. While acknowledging the memory savings, they wonder if the quantization introduces computational overhead during retrieval, potentially negating the benefits of reduced memory usage. This leads to a discussion about the balance between memory efficiency and inference latency, a crucial consideration for real-world applications.

The discussion also touches upon the broader trend of optimizing LLMs for deployment. One commenter observes that these optimization efforts are becoming increasingly important as models grow larger and more complex. The need to run these models efficiently on edge devices and in other resource-limited environments drives the exploration of techniques like quantization.

Finally, there's a brief exchange about the applicability of the technique to different hardware platforms. One user speculates about its potential benefits on specialized hardware designed for low-bit operations. This raises the question of whether such hardware could unlock even greater efficiency gains from quantization methods.

While the discussion isn't extensive, it provides valuable insights into the challenges and opportunities surrounding LLM optimization. The comments reflect the practical considerations developers face when deploying these models, emphasizing the ongoing search for effective strategies to balance performance, efficiency, and hardware constraints. They also highlight the growing interest in specialized hardware that could further accelerate these advancements.

Step-Video-T2V: The Practice, Challenges, and Future of Video Foundation Model

permalink

Posted: 2025-02-17 09:54:46

Step-Video-T2V explores the emerging field of video foundation models, specifically focusing on text-to-video generation. The paper introduces a novel "step-by-step" paradigm where video generation is decomposed into discrete, controllable steps. This approach allows for finer-grained control over the generation process, addressing challenges like temporal consistency and complex motion representation. The authors discuss the practical implementation of this paradigm, including model architectures, training strategies, and evaluation metrics. Furthermore, they highlight existing limitations and outline future research directions for video foundation models, emphasizing the potential for advancements in areas such as long-form video generation, interactive video editing, and personalized video creation.

The arXiv preprint "Step-Video-T2V: The Practice, Challenges, and Future of Video Foundation Model" explores the emerging field of video foundation models, specifically focusing on text-to-video (T2V) generation. The authors meticulously analyze the current state of the art, highlighting both the significant advancements and the persistent challenges that hinder the creation of truly robust and versatile video generation models.

The paper begins by establishing the context of foundation models within the broader AI landscape, emphasizing their transformative potential across various modalities, including text, image, and now, video. It then delves into the specific complexities inherent in video generation, distinguishing it from image generation. These complexities include the temporal dimension, necessitating the modeling of motion, transitions, and dynamic changes over time; the increased computational burden associated with processing and generating sequences of frames; and the intricacies of maintaining consistency and coherence across the generated video.

The core contribution of the paper lies in its detailed examination of the "Step-Video-T2V" framework. This framework encapsulates a progressive approach to video generation, breaking down the complex task into manageable steps. The authors meticulously dissect each step, explaining the rationale behind it and the techniques employed. They discuss various methodologies for motion modeling, including diffusion models, autoregressive models, and transformer-based architectures, highlighting the strengths and weaknesses of each approach.

A significant portion of the paper is dedicated to the challenges that currently plague video foundation models. These challenges encompass issues like generating high-fidelity videos with fine-grained details, ensuring temporal consistency and avoiding flickering or unrealistic movements, controlling the length and content of the generated video according to user prompts, and mitigating the computational demands of training and inference. The authors provide in-depth analyses of these obstacles, offering potential solutions and directions for future research.

Furthermore, the paper emphasizes the importance of evaluating video generation models, proposing a comprehensive set of evaluation metrics that go beyond simple visual quality assessment. These metrics address aspects like semantic fidelity, temporal coherence, and alignment with user intent. The authors advocate for the adoption of standardized evaluation protocols to facilitate meaningful comparisons between different models and track progress within the field.

Finally, the paper concludes with a forward-looking perspective on the future of video foundation models. It anticipates further advancements in model architectures, training methodologies, and evaluation techniques, paving the way for more sophisticated and versatile video generation capabilities. The authors envision a future where video foundation models can be readily applied to a wide range of applications, including content creation, virtual reality, and scientific visualization, unlocking unprecedented creative and analytical possibilities. They also acknowledge the ethical considerations associated with the development and deployment of such powerful technologies, emphasizing the importance of responsible innovation.

Summary of Comments ( 2 )
https://news.ycombinator.com/item?id=43077074

Several Hacker News commenters express skepticism about the claimed novelty of the "Step-Video-T2V" model. They point out that the core idea of using diffusion models for video generation is not new, and question whether the proposed "step-wise" approach offers significant advantages over existing techniques. Some also criticize the paper's evaluation metrics, arguing that they don't adequately demonstrate the model's real-world performance. A few users discuss the potential applications of such models, including video editing and content creation, but also raise concerns about the computational resources required for training and inference. Overall, the comments reflect a cautious optimism tempered by a desire for more rigorous evaluation and comparison to existing work.

The Hacker News post titled "Step-Video-T2V: The Practice, Challenges, and Future of Video Foundation Model" (linking to the arXiv paper at https://arxiv.org/abs/2502.10248) has a moderate number of comments discussing various aspects of the proposed video generation model and its broader implications.

Several commenters express excitement about the potential of video generation models and the rapid advancements in the field. They highlight the impressive capabilities showcased in the paper and anticipate future developments leading to even more realistic and controllable video generation.

Some comments delve into the technical details of the model, discussing the use of diffusion models and the challenges associated with training such large models. They touch upon the computational resources required and the difficulties in ensuring consistency and coherence in generated videos. One commenter specifically mentions the importance of addressing the temporal consistency challenge, which is crucial for generating realistic and believable videos.

The ethical implications of readily accessible video generation technology are also raised. Commenters express concerns about the potential for misuse, particularly in creating deepfakes and spreading misinformation. The need for responsible development and deployment of such powerful tools is emphasized.

A few commenters draw parallels to the development and adoption of image generation models, suggesting that video generation might follow a similar trajectory. They anticipate similar challenges and opportunities, including the potential for creative applications and the need to address ethical concerns.

One commenter notes the potential for such models to revolutionize various fields, such as entertainment, education, and advertising. They envision a future where creating personalized video content becomes as easy as creating text or images.

Finally, some comments point to the ongoing research and development in the field, indicating that the current state-of-the-art is constantly evolving. They encourage readers to explore related work and stay updated on the latest advancements in video generation.

All-in-one embedding model for interleaved text, images, and screenshots

permalink

Posted: 2024-11-17 07:42:08

Voyage has released Voyage Multimodal 3 (VMM3), a new embedding model capable of processing text, images, and screenshots within a single model. This allows for seamless cross-modal search and comparison, meaning users can query with any modality (text, image, or screenshot) and retrieve results of any other modality. VMM3 boasts improved performance over previous models and specialized embedding spaces tailored for different data types, like website screenshots, leading to more relevant and accurate results. The model aims to enhance various applications, including code search, information retrieval, and multimodal chatbots. Voyage is offering free access to VMM3 via their API and open-sourcing a smaller, less performant version called MiniVMM3 for research and experimentation.

Voyage, an AI company specializing in conversational agents for games, has announced the release of Voyage Multimodal 3 (VMM3), a groundbreaking all-in-one embedding model designed to handle a diverse range of input modalities, including text, images, and screenshots, simultaneously. This represents a significant advancement in multimodal understanding, moving beyond previous models that often required separate embeddings for each modality and complex downstream processing to integrate them. VMM3, in contrast, generates a single, unified embedding that captures the combined semantic meaning of all input types concurrently. This streamlined approach simplifies the development of applications that require understanding across multiple modalities, eliminating the need for elaborate integration pipelines.

The model is particularly adept at understanding the nuances of video game screenshots, a challenging domain due to the complex visual information present, such as user interfaces, character states, and in-game environments. VMM3 excels in this area, allowing developers to create more sophisticated and responsive in-game agents capable of reacting intelligently to the visual context of the game. Beyond screenshots, VMM3 demonstrates proficiency in handling general images and text, providing a versatile solution for various applications beyond gaming. This broad applicability extends to scenarios like multimodal search, where users can query with a combination of text and images, or content moderation, where the model can analyze both textual and visual content for inappropriate material.

Voyage emphasizes that VMM3 is not just a research prototype but a production-ready model optimized for real-world applications. They have focused on minimizing latency and maximizing throughput, crucial factors for interactive experiences like in-game agents. The model is available via API, facilitating seamless integration into existing systems and workflows. Furthermore, Voyage highlights the scalability of VMM3, making it suitable for handling large volumes of multimodal data.

The development of VMM3 stemmed from Voyage's experience building conversational AI for games, where the need for a model capable of understanding the complex interplay of text and visuals became evident. They highlight the limitations of prior approaches, which often struggled with the unique characteristics of game screenshots. VMM3 represents a significant step towards more immersive and interactive gaming experiences, powered by AI agents capable of comprehending and responding to the rich multimodal context of the game world. Beyond gaming, the potential applications of this versatile embedding model extend to numerous other fields requiring sophisticated multimodal understanding.

Summary of Comments ( 31 )
https://news.ycombinator.com/item?id=42162622

The Hacker News post titled "All-in-one embedding model for interleaved text, images, and screenshots" discussing the Voyage Multimodal 3 model announcement has generated a moderate amount of discussion. Several commenters express interest and cautious optimism about the capabilities of the model, particularly its ability to handle interleaved multimodal data, which is a common scenario in real-world applications.

One commenter highlights the potential usefulness of such a model for documentation and educational materials where text, images, and code snippets are frequently interwoven. They see value in being able to search and analyze these mixed-media documents more effectively. Another echoes this sentiment, pointing out the common problem of having separate search indices for text and images, making comprehensive retrieval difficult. They express hope that a unified embedding model like Voyage Multimodal 3 could address this issue.

Some skepticism is also present. One user questions the practicality of training a single model to handle such diverse data types, suggesting that specialized models might still perform better for individual modalities like text or images. They also raise concerns about the computational cost of running such a large multimodal model.

Another commenter expresses a desire for more specific details about the model's architecture and training data, as the blog post focuses mainly on high-level capabilities and potential applications. They also wonder about the licensing and availability of the model for commercial use.

The discussion also touches upon the broader implications of multimodal models. One commenter speculates on the potential for these models to improve accessibility for visually impaired users by providing more nuanced descriptions of visual content. Another anticipates the emergence of new user interfaces and applications that can leverage the power of multimodal embeddings to create more intuitive and interactive experiences.

Finally, some users share their own experiences working with multimodal data and express interest in experimenting with Voyage Multimodal 3 to see how it compares to existing solutions. They suggest potential use cases like analyzing product reviews with images or understanding the context of screenshots within technical documentation. Overall, the comments reflect a mixture of excitement about the potential of multimodal models and a pragmatic awareness of the challenges that remain in developing and deploying them effectively.

Stories with Tag multimodal learning

LLMs can see and hear without any training

Summary of Comments ( 37 ) https://news.ycombinator.com/item?id=43803518

The Llama 4 herd

Summary of Comments ( 561 ) https://news.ycombinator.com/item?id=43595585

16-Bit to 1-Bit: Visual KV Cache Quantization for Efficient Multimodal LLMs

Summary of Comments ( 1 ) https://news.ycombinator.com/item?id=43268477

Step-Video-T2V: The Practice, Challenges, and Future of Video Foundation Model

Summary of Comments ( 2 ) https://news.ycombinator.com/item?id=43077074

All-in-one embedding model for interleaved text, images, and screenshots

Summary of Comments ( 31 ) https://news.ycombinator.com/item?id=42162622

Summary of Comments ( 37 )
https://news.ycombinator.com/item?id=43803518

Summary of Comments ( 561 )
https://news.ycombinator.com/item?id=43595585

Summary of Comments ( 1 )
https://news.ycombinator.com/item?id=43268477

Summary of Comments ( 2 )
https://news.ycombinator.com/item?id=43077074

Summary of Comments ( 31 )
https://news.ycombinator.com/item?id=42162622