Some Windows filenames appear unreadable because of how characters outside the Basic Multilingual Plane (BMP) are handled. Windows stores filenames as sequences of 16-bit code units and interprets them as UTF-16, which represents these supplementary characters with surrogate pairs. A surrogate pair is two reserved 16-bit code units that together encode a single character outside the BMP. If a filename contains such a character and is read by a system or application that does not interpret surrogate pairs correctly, the intended character cannot be reconstructed and the filename appears garbled or unreadable. The issue arises mainly with older software or when transferring files between systems with different Unicode handling capabilities.
This blog post delves into the intricacies of character encoding, specifically within the Windows operating system, and explains why certain filenames might appear unreadable or cause issues. It centers around the concept of "surrogate pairs," a mechanism used to represent characters outside the Basic Multilingual Plane (BMP) of Unicode. The BMP encompasses the most commonly used characters, each representable by a single 16-bit code unit. However, Unicode extends beyond the BMP to include less common characters, such as emojis, musical symbols, and characters from ancient scripts. These supplementary characters require more than 16 bits for representation.
To handle these supplementary characters within systems primarily designed for 16-bit code units, Unicode employs surrogate pairs. A surrogate pair consists of two 16-bit code units, a high surrogate and a low surrogate, which together represent a single supplementary character. These surrogate code units are specifically reserved within the Unicode standard and, when encountered sequentially, are interpreted as a single character. The post emphasizes that these individual surrogate code units have no meaning on their own and should only be considered as components of a complete pair.
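The split into a high and a low surrogate is plain bit arithmetic, and a minimal sketch may make it concrete. The function name below is hypothetical and not taken from the post; the constants are the standard UTF-16 surrogate bases.

```python
def encode_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not a supplementary character")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


high, low = encode_surrogate_pair(0x1F600)   # U+1F600 GRINNING FACE
print(f"{high:04X} {low:04X}")               # D83D DE00
```

Neither D83D nor DE00 identifies a character by itself, which is why the two units only make sense as a pair.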
The core problem addressed in the post is the incompatibility of certain Windows API functions with surrogate pairs. While newer APIs correctly handle supplementary characters represented by surrogates, older APIs often treat the two code units of a surrogate pair as two separate characters. This can lead to several issues, including incorrect filename display, inability to access files with supplementary characters in their names, and potential security vulnerabilities. The post provides a concrete example of this issue using the command-line tool dir, demonstrating how it might misinterpret a filename containing a surrogate pair.
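To illustrate what "treating the two code units as separate characters" means in practice, here is a small sketch in Python. It does not reproduce the post's dir example; the filename is made up, and the code only shows that counting or cutting by 16-bit code unit disagrees with counting by character.

```python
def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


name = "\U0001F600.txt"          # hypothetical filename starting with U+1F600

print(len(name))                 # 5 code points
print(utf16_code_units(name))    # 6 UTF-16 code units

# Cutting after the first code unit splits the surrogate pair in half,
# so the remaining bytes are no longer valid UTF-16.
first_unit = name.encode("utf-16-le")[:2]
try:
    first_unit.decode("utf-16-le")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
print(split_ok)                  # False
```

Code that indexes, truncates, or measures a name by code unit can therefore end up with half of a pair, which is one way a garbled name can come about.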
The author further explains the technical details of how surrogate pairs are encoded, providing the specific code point ranges for high and low surrogates. This helps in understanding how to identify and handle them programmatically. The post also touches on the importance of using appropriate API functions that correctly support supplementary characters to avoid these encoding-related problems. It highlights the distinction between UTF-16, which uses surrogate pairs, and UTF-32, which represents all characters with a fixed 32-bit code unit, thereby eliminating the need for surrogates. Finally, the post suggests using newer, Unicode-aware API functions in Windows for robust and correct handling of all Unicode characters, including those represented by surrogate pairs, in filenames and other text strings. This ensures compatibility and avoids the potential pitfalls associated with older, 16-bit character-centric API functions.
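The ranges in question are U+D800 to U+DBFF for high surrogates and U+DC00 to U+DFFF for low surrogates. As a rough sketch of how they can be used programmatically, the helper below scans a list of 16-bit code units and reports any surrogate that is not part of a well-formed pair. The function names are illustrative assumptions, not an API from the post or from Windows.

```python
def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def find_unpaired_surrogates(units: list[int]) -> list[int]:
    """Return the indexes of surrogate code units that lack a matching partner."""
    unpaired = []
    i = 0
    while i < len(units):
        unit = units[i]
        if is_high_surrogate(unit):
            if i + 1 < len(units) and is_low_surrogate(units[i + 1]):
                i += 2               # well-formed pair, skip both units
                continue
            unpaired.append(i)       # high surrogate with no low surrogate after it
        elif is_low_surrogate(unit):
            unpaired.append(i)       # low surrogate with no high surrogate before it
        i += 1
    return unpaired


print(find_unpaired_surrogates([0xD83D, 0xDE00]))        # []
print(find_unpaired_surrogates([0x0041, 0xD83D, 0x0042]))  # [1]
```

In UTF-32 the same check is unnecessary, because every code point occupies exactly one 32-bit unit.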
Summary of Comments ( 44 )
https://news.ycombinator.com/item?id=43158696
HN users discuss various aspects of surrogate pairs and Unicode. Several commenters highlight the complexity and nuances of Unicode handling, particularly in different programming languages and operating systems. Some mention the challenges of correctly processing and displaying these characters, with specific examples of issues encountered in Windows and other environments. The discussion also touches upon the historical context of surrogate pairs, the difference between UTF-16 and UTF-8, and the importance of proper encoding and decoding. A few commenters offer practical advice and resources for dealing with surrogate pairs, including libraries and tools. There's a general agreement that handling Unicode correctly requires careful attention and a deep understanding of its intricacies.
The Hacker News post titled "Understanding Surrogate Pairs: Why Some Windows Filenames Can't Be Read" linking to an article about surrogate pairs in Windows filenames generated a moderate discussion with several interesting points.
Several commenters discussed the challenges and inconsistencies surrounding surrogate pairs in different programming languages and operating systems. One commenter highlighted the complexity arising from UTF-16's variable-width encoding, where supplementary characters require two code units (a surrogate pair), causing issues if systems don't handle them as a single entity. They compared this with UTF-8, which is also variable-length but uses 8-bit code units, so that a character occupies 1 to 4 bytes. This difference often leads to confusion and bugs, especially when transferring data between systems or using libraries that don't fully support UTF-16.
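The comparison can be checked directly. The following sketch, using sample characters chosen for illustration, reports how many UTF-8 bytes and how many UTF-16 code units each one needs.

```python
def encoded_sizes(char: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 code unit count) for a single character."""
    utf8_bytes = len(char.encode("utf-8"))
    utf16_units = len(char.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units


# U+0041, U+00E9, U+20AC are in the BMP; U+1F600 is a supplementary character.
for char in ["A", "\u00E9", "\u20AC", "\U0001F600"]:
    print(f"U+{ord(char):04X}", encoded_sizes(char))
# U+0041 (1, 1)
# U+00E9 (2, 1)
# U+20AC (3, 1)
# U+1F600 (4, 2)
```

Only the last character needs a surrogate pair in UTF-16, while UTF-8 already varies in length for the earlier ones.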
Another user discussed the specific problem of filenames on Windows, noting that NTFS technically does support these supplementary characters. However, the Win32 API layer often fails to handle them correctly, leading to the inability to access or manipulate files with such names. This commenter offered a workaround using the "\\?\" prefix, which tells the Win32 API to skip its usual path parsing and pass the path through to the file system largely unchanged. They further explained that using std::filesystem::path::native() might be more portable than manually adding the prefix.
A separate commenter highlighted the overall complexity of character encoding and the difficulties many programmers face in fully grasping it. They pointed to the numerous related challenges that arise, such as combining characters, grapheme clusters, and the nuances of different Unicode normalization forms. They emphasized that even seasoned developers can struggle with these concepts.
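As one small illustration of the combining-character and normalization issues mentioned here, the sketch below compares two spellings of the same visible letter. The example strings are chosen for illustration and do not come from the discussion.

```python
import unicodedata

composed = "\u00E9"       # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT, two code points


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(composed), len(decomposed))        # 1 2
print(composed == decomposed)                # False
print(same_after_nfc(composed, decomposed))  # True
```

Two names that look identical on screen can therefore differ at the code point level, independently of any surrogate pair handling.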
One commenter recounted their personal experience dealing with similar filename encoding issues on Windows with Chinese characters. They described the frustration of files being inaccessible due to encoding mismatches and the lack of clear error messages.
Some comments delved into the technical details of UTF-16 and how surrogate pairs function. One user clarified that supplementary characters are encoded as a "high surrogate" followed by a "low surrogate," and how these pairs form a single code point representing characters beyond the Basic Multilingual Plane (BMP).
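Going in the decoding direction, a high surrogate followed by a low surrogate can be combined back into one code point. This is a minimal sketch with a hypothetical function name, using the standard UTF-16 constants.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate into the code point they encode."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first unit is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second unit is not a low surrogate")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = decode_surrogate_pair(0xD83D, 0xDE00)
print(f"U+{code_point:04X}")   # U+1F600
```

The range checks matter: if either unit falls outside its range, there is no valid pair to decode.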
Finally, a commenter touched upon the historical context, suggesting that the limitations in the Win32 API's handling of surrogate pairs are likely due to its age, predating the widespread adoption and understanding of supplementary characters. They speculated that updating the API would be a significant undertaking with potential compatibility issues.
In summary, the comments on the Hacker News post explored the technical intricacies of surrogate pairs, their implications for Windows filenames, the inconsistencies across different systems and programming languages, and the overall challenges developers face when dealing with Unicode characters. Several comments offered practical advice and workarounds for handling these issues, while others provided valuable context and personal anecdotes.