Unicode maps each character to a code point, i.e. a numeric value that identifies that character. A character encoding scheme then dictates how each code point is represented as a sequence of bits so that it can be stored in memory or on disk.
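As a rough illustration of the difference between a code point and its encoding, the sketch below looks up the code point of one character and then encodes it two ways. It uses Python rather than C# purely for brevity, and the sample character is an arbitrary choice, not something from the original post.

```python
# A code point is the number Unicode assigns to a character.
ch = "é"
code_point = ord(ch)
print(f"U+{code_point:04X}")  # U+00E9

# An encoding decides how that number is laid out as bytes.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")  # big-endian, no byte order mark
print(utf8_bytes.hex())   # c3a9
print(utf16_bytes.hex())  # 00e9
```

The same code point yields different byte sequences depending on the encoding scheme chosen.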
Recall that UTF-16 encoding uses either 2 or 4 bytes to represent each code point.
Code points above U+FFFF are represented using 4 bytes, i.e. two 16-bit values known as a surrogate pair.
- The first two bytes (lead surrogate) are in the range D800 – DBFF
- The third and fourth bytes (trail surrogate) are in the range DC00 – DFFF
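The ranges above follow from how a code point is split into two 10-bit halves. The following is a minimal sketch of that arithmetic in Python (not C#); the function name `to_surrogate_pair` and the sample code point U+1F600 are illustrative choices, not part of the original post.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (lead, trail) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    lead = 0xD800 + (offset >> 10)         # top 10 bits
    trail = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return lead, trail

lead, trail = to_surrogate_pair(0x1F600)
print(f"{lead:04X} {trail:04X}")  # D83D DE00

# Cross-check against Python's own UTF-16 encoder.
encoded = "\U0001F600".encode("utf-16-be")
print(encoded.hex())  # d83dde00

# A code point at or below U+FFFF takes 2 bytes; one above takes 4.
print(len("A".encode("utf-16-be")), len(encoded))  # 2 4
```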
These surrogate pairs in turn represent code points in the range U+10000 – U+10FFFF. Since there are 1,024 possible lead surrogates and 1,024 possible trail surrogates, this gives 1,024 × 1,024 = 1,048,576 additional values that can be encoded as surrogate pairs (though the Unicode standard currently defines only a subset of these values).
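The count can be checked directly from the two surrogate ranges, and the mapping can be run in reverse to confirm that the first and last pairs land on U+10000 and U+10FFFF. This is a small sketch in Python (not C#), and `from_surrogate_pair` is a hypothetical helper name used only for illustration.

```python
LEAD_COUNT = 0xDBFF - 0xD800 + 1    # 1024 lead surrogates
TRAIL_COUNT = 0xDFFF - 0xDC00 + 1   # 1024 trail surrogates
pair_count = LEAD_COUNT * TRAIL_COUNT
print(pair_count)  # 1048576

def from_surrogate_pair(lead, trail):
    """Combine a (lead, trail) surrogate pair back into a code point."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)

print(hex(from_surrogate_pair(0xD800, 0xDC00)))  # 0x10000
print(hex(from_surrogate_pair(0xDBFF, 0xDFFF)))  # 0x10ffff
```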