Story Details

  • Understanding Surrogate Pairs: Why Some Windows Filenames Can't Be Read

    Posted: 2025-02-24 12:19:40

    Some Windows filenames appear unreadable because of how characters outside the Basic Multilingual Plane (BMP) are encoded. NTFS stores filenames as sequences of 16-bit code units, and Windows interprets them as UTF-16, which represents each character outside the BMP with a surrogate pair: a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF), two 16-bit code units that together encode a single code point. If a filename contains such a character and is handled by a system or application that does not interpret surrogate pairs correctly, for example software written for the older UCS-2 encoding that treats every 16-bit unit as a complete character, the intended character cannot be reconstructed and the filename appears garbled or unreadable. The issue arises mainly with older software or when files are transferred between systems that handle Unicode differently.
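
    As a rough illustration of the arithmetic behind surrogate pairs, the Python sketch below splits a code point above U+FFFF into its two 16-bit code units, recombines them, and flags surrogates that have no partner. It is not taken from the original post, and the function names are hypothetical, chosen only for this example.

```python
from __future__ import annotations

# Illustrative sketch only; the function names are hypothetical and not from the post.

def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point outside the BMP into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def find_unpaired_surrogates(units: list[int]) -> list[int]:
    """Return the indices of surrogate code units that lack a valid partner."""
    unpaired = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            unpaired.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate that was not consumed by a preceding high surrogate.
            unpaired.append(i)
        i += 1
    return unpaired

if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F600)  # U+1F600 GRINNING FACE
    print(f"{high:04X} {low:04X}")          # prints "D83D DE00"
    print(hex(from_surrogate_pair(high, low)))
    print(find_unpaired_surrogates([0x41, 0xD83D, 0xDE00, 0xD83D, 0x42, 0xDC00]))
```

    Software that treats each 16-bit unit as a complete character would see 0xD83D and 0xDE00 as two separate, meaningless values, which is the kind of garbling described above.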

    Summary of Comments (44)
    https://news.ycombinator.com/item?id=43158696

    HN users discuss surrogate pairs and Unicode handling more broadly. Several commenters point out how much the details differ between programming languages and operating systems, and some describe specific problems they have hit when processing or displaying these characters on Windows and in other environments. The discussion also covers the history of surrogate pairs, the differences between UTF-16 and UTF-8, and the importance of encoding and decoding correctly. A few commenters offer practical advice and point to libraries and tools for working with surrogate pairs. There is general agreement that handling Unicode correctly takes careful attention and a solid understanding of its intricacies.