Swift: Bytes for Beginners (Part III)


Beyond UTF8

UTF8 is fine for plain ASCII text, but as soon as special characters are inserted into a string, a single character can span more than one byte. If we then shuffle or reverse the byte array, the bytes of those special characters (emoji, for example) are no longer held together, and we end up with placeholder characters or nil being returned.
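
To see the problem concretely, here is a minimal sketch, assuming a current Swift toolchain (Swift 5 or later) with Foundation available. The sample string and the variable names are my own, chosen purely for illustration.

```swift
import Foundation

// Assumption: an illustrative string containing one emoji.
let text = "Hi 😀"
let bytes = [UInt8](text.utf8)

// The emoji is one Character but four UTF8 bytes,
// so the byte count is larger than the character count.
print(text.count)   // 4
print(bytes.count)  // 7

// Reversing the bytes pulls the emoji's four bytes out of order.
let reversedBytes = Array(bytes.reversed())

// Strict decoding rejects the invalid byte sequence.
let strict = String(bytes: reversedBytes, encoding: .utf8)
print(strict as Any)  // nil

// Lossy decoding substitutes U+FFFD placeholder characters instead.
let lossy = String(decoding: reversedBytes, as: UTF8.self)
print(lossy)
```

The two decoders correspond to the two failure modes mentioned above: one returns nil, the other returns a string with placeholders in it.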

UTF16 Encoding

We could at this point jump forward to the most flexible encoding (Unicode Scalar Representation), but for completeness I'm going to include the transformation of a UTF16 string first. It doesn't solve the special character problem, but it might be selected as a format for other reasons:
let str = "Hello, playground"
var buff = [UInt16](str.utf16)
var revbuff = reverse(buff)
// You now have a buffer loaded with 16-bit code units!!

if let aStr = String.stringWithBytesNoCopy(&revbuff, length:revbuff.count*sizeof(UInt16), encoding: NSUTF16LittleEndianStringEncoding, freeWhenDone:false) {
    aStr // returns "dnuorgyalp ,olleH"
}
There are a few things to note here. First, we're using a UInt16 array, because we're now working with UTF16 code units, which are 16 bits wide. Second, we can no longer use the simple stringWithBytes() method, so we pass a pointer to the buffer using the ampersand. Third, we have to specify the length of the buffer in bytes. Fourth, we specify an NSUTF16 encoding. Fifth, we specify that we don't want the buffer to be freed when done.
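
The code above is written against an early version of Swift. In current toolchains the free function reverse(), sizeof() and String.stringWithBytesNoCopy are no longer available, so here is a rough equivalent, assuming Swift 5 or later with Foundation. It is a sketch rather than a drop-in replacement: going through Data copies the bytes, which the original deliberately avoided.

```swift
import Foundation

let str = "Hello, playground"

// Reverse the 16-bit code units, then fix their byte order to little-endian
// so the raw bytes match the encoding named below on any host.
let revbuff = [UInt16](str.utf16).reversed().map { $0.littleEndian }

// Length in bytes = number of code units * size of one code unit.
let byteCount = revbuff.count * MemoryLayout<UInt16>.size

// Copy the raw bytes of the buffer into a Data value.
let data = revbuff.withUnsafeBufferPointer { Data(buffer: $0) }

if let aStr = String(data: data, encoding: .utf16LittleEndian) {
    print(aStr)  // dnuorgyalp ,olleH
}
```

The same ingredients are still there: a UInt16 buffer, a pointer to it, a length in bytes and an explicit UTF16 encoding.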

I haven't looked too closely at why stringWithBytes() can't be used on a UInt16 array, so I can't provide a full explanation of why this is the case. I'm going to presume the concept of a pointer is familiar to most people reading this post. The encoding choice I will leave you to experiment with. (You will quickly see why the Little Endian one was chosen.)
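
As a starting point for that experiment, here is a small sketch of what byte order means for a single code unit, assuming Swift 5 or later. The variable names are my own. Macs store integers in little-endian order, which is why the bytes of a UInt16 array line up with the Little Endian encoding.

```swift
// U+0048 is "H"; as a UTF16 code unit it occupies two bytes.
let unit: UInt16 = 0x0048

// The same value laid out in memory in each byte order.
let littleEndianBytes = withUnsafeBytes(of: unit.littleEndian) { Array($0) }
let bigEndianBytes = withUnsafeBytes(of: unit.bigEndian) { Array($0) }
print(littleEndianBytes)  // [72, 0]
print(bigEndianBytes)     // [0, 72]

// Read the little-endian bytes back with the right and the wrong byte order.
let readAsLittle = UInt16(littleEndianBytes[0]) | (UInt16(littleEndianBytes[1]) << 8)
let readAsBig = (UInt16(littleEndianBytes[0]) << 8) | UInt16(littleEndianBytes[1])

print(String(decoding: [readAsLittle], as: UTF16.self))  // H
print(String(decoding: [readAsBig], as: UTF16.self))     // U+4800, a different character entirely
```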

Moving on to the more important point: the length is calculated by multiplying the size of one code unit in bytes by the array count. As we go from UInt8 to UInt16 and up, the code units increase in size, so two arrays with the same number of items can have different lengths in bytes. This is often important to be able to calculate. I'm not providing a full explanation here, but I will look more closely at the significance of using 8-bit, 16-bit and 32-bit integers in the next post.
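
The calculation can be sketched as follows, assuming Swift 5 or later, where MemoryLayout<T>.size takes the place of sizeof(). The byteLength(of:) helper is hypothetical, written only for this illustration.

```swift
let str = "Hello, playground"

let utf8Units = [UInt8](str.utf8)
let utf16Units = [UInt16](str.utf16)
let scalarValues = str.unicodeScalars.map { $0.value }  // [UInt32]

// Hypothetical helper: length in bytes = item count * size of one item.
func byteLength<T>(of buffer: [T]) -> Int {
    return buffer.count * MemoryLayout<T>.size
}

// Same number of items in each array, different lengths in bytes.
print(utf8Units.count, byteLength(of: utf8Units))        // 17 17
print(utf16Units.count, byteLength(of: utf16Units))      // 17 34
print(scalarValues.count, byteLength(of: scalarValues))  // 17 68
```

The counts only match across the three arrays because this string is plain ASCII. With special characters in the mix, the item counts themselves start to differ.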

For an explanation of how the bytes are stored, see Apple's own documentation.

Unicode Scalar Representation

Now that we've briefly looked at UTF16, let's look at the more flexible Unicode scalar representation. Here is the code for repeating the earlier string manipulation:
let str = "Hello, "
let str2 = "playground."
// create unicode scalar array
var arr = [UnicodeScalar](str.unicodeScalars)
let arr2 = [UnicodeScalar](str2.unicodeScalars)
arr += arr2
arr = reverse(arr)

var str3 = String()
// extract characters from array
for a in arr {
    str3.append(a)
}

str3 // returns ".dnuorgyalp ,olleH"
As you'll see, we can append scalars to a string directly, which means this final piece of code is "pure Swift", whereas up to now we've been relying on the bridge between String and NSString.
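
For anyone following along in a current toolchain, here is a rough equivalent, assuming Swift 5 or later, where reverse() is a method on the array and scalars are appended through the string's unicodeScalars view. I've added an emoji to the second string (my own addition) to show that it survives the round trip.

```swift
let str = "Hello, "
let str2 = "playground 😀"

// create unicode scalar array
var arr = [Unicode.Scalar](str.unicodeScalars)
arr += str2.unicodeScalars
arr.reverse()

var str3 = String()
// append each scalar to the string's scalar view
for a in arr {
    str3.unicodeScalars.append(a)
}

print(str3)  // 😀 dnuorgyalp ,olleH
```

One caveat: characters that are built from several scalars, such as flags or emoji with skin tone modifiers, can still be rearranged by a scalar-level reverse. Reversing by Character, with String(text.reversed()), keeps those intact.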

Note: there is a post on StackOverflow demonstrating how a UInt16 array can be transformed into a string without NSString, and I recommend it for any Swift purists working with a UTF16 string encoding.
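
I can't reproduce that post here, so what follows is only one possible approach, assuming Swift 5 or later: feed the code units through the standard library's UTF16 decoder and append each decoded scalar to a string. The decodeUTF16 helper is hypothetical, named for this illustration.

```swift
// Hypothetical helper: decode UTF16 code units one scalar at a time.
func decodeUTF16(_ units: [UInt16]) -> String {
    var decoder = UTF16()
    var iterator = units.makeIterator()
    var result = String()

    decoding: while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            // A valid scalar, possibly assembled from a surrogate pair.
            result.unicodeScalars.append(scalar)
        case .emptyInput:
            // No code units left.
            break decoding
        case .error:
            // Invalid input, such as an unpaired surrogate: use a placeholder.
            result.unicodeScalars.append("\u{FFFD}")
        }
    }
    return result
}

let units = [UInt16]("Hi 😀".utf16)
print(decodeUTF16(units))  // Hi 😀

// Reversing the code units separates the emoji's two surrogates,
// so the decoder reports errors and placeholders appear.
print(decodeUTF16(Array(units.reversed())))
```

The standard library also offers String(decoding: units, as: UTF16.self), which does the same job in one call. The loop is spelled out here to show where the special character problem surfaces in UTF16.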

Next step

Next time I'm going to explain the relationship between UInt numbers and binary numbers, because I think you're ready. We'll then continue on to even bigger and better things.

In the meantime, I recommend copying and pasting the code into a playground and then throwing in as many special characters as possible. (Use Ctrl + Cmd + Space to bring up the character picker on a Mac. Source: Natasha the Robot.)