Story Details

  • Show HN: Kreuzberg – Modern async Python library for document text extraction

    Posted: 2025-02-15 10:07:23

    Kreuzberg is a new Python library for asynchronous document text extraction. It is built on asyncio and supports several file formats, including PDF, DOCX, and common image types, with image text handled through OCR engines such as Tesseract. The library aims for a clean, straightforward API that lets developers extract text from multiple documents concurrently, which can shorten total processing time compared with handling files one at a time. It also offers automatic OCR language detection and is designed to fit into existing async Python codebases.
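
    The concurrent-extraction pattern described above can be sketched with the standard library alone. This is a minimal illustration, not Kreuzberg's actual API: `extract_text` is a hypothetical stand-in that just reads a UTF-8 text file in a worker thread, and `extract_many` and its `limit` parameter are likewise names invented for this example.

```python
import asyncio
from pathlib import Path


async def extract_text(path: Path) -> str:
    """Hypothetical stand-in for a library's async extraction call."""
    # Run the blocking file read in a worker thread so the event loop stays free.
    return await asyncio.to_thread(path.read_text, encoding="utf-8")


async def extract_many(paths: list[Path], limit: int = 4) -> list[str]:
    """Extract text from all paths concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(path: Path) -> str:
        # The semaphore caps how many extractions are in flight at once.
        async with semaphore:
            return await extract_text(path)

    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*(bounded(p) for p in paths))
```

    Calling `asyncio.run(extract_many(paths))` from synchronous code returns one string per input path. Bounding concurrency with a semaphore is a design choice for this sketch; it keeps a large batch from opening every file at once.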

    Summary of Comments (38)
    https://news.ycombinator.com/item?id=43057375

    Hacker News users discussed Kreuzberg's potential, praising its modern, async approach and clean API. Several questioned its advantages over existing libraries like unstructured and langchain, prompting the author to clarify Kreuzberg's focus on smaller documents and ease of use for specific tasks like title and metadata extraction. Some expressed interest in benchmarks and broader language support, while others appreciated its minimalist design and MIT license. The small size of the library and its reliance on readily available packages like beautifulsoup4 and selectolax were also highlighted as positive aspects. A few commenters pointed to the lack of support for complex layouts and OCR, suggesting areas for future development.