Quantcast
Viewing all articles
Browse latest Browse all 10

#1,007 – Getting Length of String that Contains Surrogate Pairs

You can use the string.Length property to get the length (number of characters) of a string.  This only works, however, for Unicode code points that are no larger than U+FFFF.  This set of code points is known as the Basic Multilingual Plane (BMP).

Unicode code points outside of the BMP are represented in UTF-16 using 4 byte surrogate pairs, rather than using 2 bytes.

To correctly count the number of characters in a string that may contain code points higher than U+FFFF, you can use the StringInfo class (from System.Globalization).

            // 3 Latin (ASCII) characters
            string simple = "abc";

            // 3 character string where one character
            //  is a surrogate pair
            string containsSurrogatePair = "A𠈓C";

            // Length=3 (correct)
            Console.WriteLine(string.Format("Length 1 = {0}", simple.Length));

            // Length=4 (not quite correct)
            Console.WriteLine(string.Format("Length 2 = {0}", containsSurrogatePair.Length));

            // Better, reports Length=3
            StringInfo si = new StringInfo(containsSurrogatePair);
            Console.WriteLine(string.Format("Length 3 = {0}", si.LengthInTextElements));

Image may be NSFW.
Clik here to view.
1007-001


Filed under: Strings Tagged: C#, Length, Strings, Surrogate Pairs, Unicode, UTF-16 Image may be NSFW.
Clik here to view.
Image may be NSFW.
Clik here to view.

Viewing all articles
Browse latest Browse all 10

Trending Articles