hackslash dot org

Show HN: Kreuzberg – Modern async Python library for document text extraction

Posted: 2025-02-15 10:07:23

Kreuzberg is a new Python library for asynchronous document text extraction. Built on asyncio, it supports PDF, DOCX, and several image formats, relying on OCR engines such as Tesseract for scanned content. The library aims for a clean, straightforward API that lets developers extract text from multiple documents concurrently, which can improve overall processing speed. It also offers automatic OCR language detection and is designed to fit into existing async Python codebases.

Summary of Comments ( 38 )
https://news.ycombinator.com/item?id=43057375

Hacker News users discussed Kreuzberg's potential, praising its modern, async approach and clean API. Several questioned its advantages over existing libraries like unstructured and langchain, prompting the author to clarify Kreuzberg's focus on smaller documents and ease of use for specific tasks like title and metadata extraction. Some expressed interest in benchmarks and broader language support, while others appreciated its minimalist design and MIT license. The small size of the library and its reliance on readily available packages like beautifulsoup4 and selectolax were also highlighted as positive aspects. A few commenters pointed to the lack of support for complex layouts and OCR, suggesting areas for future development.

The Hacker News post about Kreuzberg, a modern async Python library for document text extraction, has several comments discussing its merits and potential drawbacks.

One commenter expresses enthusiasm for the project, praising its async-based approach and the performance improvements it promises. They also appreciate the clear, well-structured documentation, which they say makes the library easy to understand and use.

Another commenter questions the need for another text extraction library, given existing options like textract and Apache Tika. They ask whether Kreuzberg offers any significant advantages over these established tools and request specific examples where it outperforms them. This prompts a discussion about the limitations of existing libraries, particularly in handling large files or specific document formats. The author of Kreuzberg responds that, in the author's own tests, existing tools struggled with large PDF files containing scanned images. Kreuzberg was developed to address these shortcomings, offering better performance and memory efficiency in these scenarios by using OCR and processing documents asynchronously. The author acknowledges that textract might be sufficient for simpler use cases, but emphasizes Kreuzberg's focus on handling complex and large documents more efficiently.

Further discussion revolves around the benchmark comparisons provided. One commenter suggests incorporating Tesseract's page segmentation modes into the benchmarks to provide a more comprehensive performance evaluation. Another user points out the lack of benchmarks for common file types like DOCX and emphasizes the importance of including these in future comparisons.

The conversation also touches upon the practical implications of asynchronous processing for text extraction, with commenters discussing the scenarios where it offers the most significant benefits. Some suggest that async is particularly useful for processing multiple documents concurrently, leading to substantial time savings.
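The concurrency pattern the commenters describe can be sketched with the standard library alone. This is a minimal illustration, not Kreuzberg's API: `extract_text` is a hypothetical stand-in that just reads a text file in a worker thread, and `extract_many` and `max_concurrency` are names made up for this example.

```python
import asyncio
from pathlib import Path


async def extract_text(path: Path) -> str:
    """Hypothetical stand-in for an async extraction call (not Kreuzberg's API).

    Reads the file in a worker thread so the event loop stays free.
    """
    return await asyncio.to_thread(path.read_text, encoding="utf-8")


async def extract_many(paths: list[Path], max_concurrency: int = 4) -> list[str]:
    """Extract several documents concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(path: Path) -> str:
        # The semaphore caps how many extractions are in flight at once.
        async with semaphore:
            return await extract_text(path)

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(bounded(p) for p in paths))
```

A caller would run it with something like `asyncio.run(extract_many([Path("a.txt"), Path("b.txt")]))`. The time savings come from overlapping waits across documents; for CPU-heavy steps such as OCR, the work would still need to run in threads or subprocesses for the overlap to help.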

Finally, a few commenters express interest in the underlying technologies used in Kreuzberg, specifically the OCR engine and PDF parsing libraries. The author clarifies their choice of Tesseract OCR and explains the rationale behind using a specific Python library for PDF handling.
