Ruby: Limiting a UTF-8 string by byte-length
This RabbitMQ page states:
Queue names may be up to 255 bytes of UTF-8 characters.
In Ruby (1.9.3), how would I truncate a UTF-8 string by byte count without breaking in the middle of a character? The resulting string should be the longest possible valid UTF-8 string that fits within the byte limit.
String#bytesize gives you the length of the string in bytes, while character-based operations such as slice won't split a multibyte character (as long as the string's encoding is set properly).
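To make the distinction concrete, here is a small sketch comparing character counts with byte counts. The sample string is only an illustration; byteslice is included to show what a byte-based cut can do to a multibyte character.

```ruby
s = "\u03B4og"                   # "δog": δ (U+03B4) takes 2 bytes in UTF-8

puts s.length                    # 3 characters
puts s.bytesize                  # 4 bytes

# Character-based slicing never splits a multibyte character.
first_char = s.slice(0, 1)
puts first_char.valid_encoding?  # true

# Byte-based slicing can: this keeps only the first byte of δ.
first_byte = s.byteslice(0, 1)
puts first_byte.valid_encoding?  # false
```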
A simple process would be to just iterate through the string:

    s.each_char.each_with_object('') do |char, result|
      if result.bytesize + char.bytesize > 255
        break result
      else
        result << char
      end
    end
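If you wanted to be crafty, you could copy the first 63 characters directly, since any Unicode character is at most 4 bytes in UTF-8 (63 × 4 = 252, which is within the 255-byte limit), and only check byte sizes for the characters after that.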
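A rough sketch of that shortcut, assuming the input is valid UTF-8. The method name truncate_bytes and its limit parameter are made up for this example:

```ruby
# Hypothetical helper; the name and the default limit are illustrative.
def truncate_bytes(str, limit = 255)
  # A UTF-8 character is at most 4 bytes, so the first limit / 4
  # characters always fit and can be copied without checking.
  safe_chars = limit / 4
  result = str[0, safe_chars]

  # Only the remaining characters need a byte-size check.
  rest = str[safe_chars..-1] || ''
  rest.each_char do |char|
    break if result.bytesize + char.bytesize > limit
    result << char
  end
  result
end

puts truncate_bytes("\u03B4og" * 70).bytesize   # 255
```

For short limits the shortcut saves little, but it avoids per-character checks for most of a long string.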
Note that this is still not perfect. For example, imagine that the last 3 bytes of your string are the character 'e' (1 byte) followed by a combining acute accent (U+0301, 2 bytes). Slicing off the last 2 bytes produces a string that is still valid UTF-8, but in terms of what the user sees it would change the output from 'é' to 'e', which could change the meaning of the text. This is probably not a huge deal when you're just naming RabbitMQ queues, but it could be important in other circumstances. For example, in French a newsletter headline reading 'Un policier tué' means 'A policeman was killed', whereas 'Un policier tue' means 'A policeman kills'.
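The effect is easy to reproduce. The sketch below uses String#unicode_normalize, which is only available from Ruby 2.2 onwards (so not in 1.9.3), to compose the decomposed pair into a single character before any truncation. This helps only when a precomposed form exists; it is not a general fix for every combining sequence.

```ruby
decomposed = "tue\u0301"         # 't', 'u', 'e' + U+0301 combining acute accent
puts decomposed.bytesize         # 5: three ASCII letters plus 2 bytes for U+0301

# Dropping the last 2 bytes leaves valid UTF-8, but the accent is gone.
cut = decomposed.byteslice(0, 3)
puts cut                         # tue
puts cut.valid_encoding?         # true

# NFC normalization composes 'e' + U+0301 into the single character U+00E9,
# so a later byte-based cut keeps or drops the accented letter as a whole.
composed = decomposed.unicode_normalize(:nfc)
puts composed.bytesize           # 4
puts composed == "tu\u00E9"      # true
```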
For Rails >= 3.0 you have the ActiveSupport::Multibyte::Chars#limit method.
From API docs:
- (Object) limit(limit)
Limit the byte size of the string to a number of bytes without breaking characters. Usable when the storage for a string is limited for some reason.
'こんにちは'.mb_chars.limit(7).to_s # => "こん"
How about this:
    s = "δogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδog"
    count = 0
    while true
      # "a<N>" unpacks the first N bytes as one binary string;
      # unpack returns an array, so take its first element.
      s2 = s.unpack("a#{255 - count}").first
      s2.force_encoding 'utf-8'
      if s2.valid_encoding?
        break
      else
        # The cut landed inside a character: try one byte fewer.
        count += 1
      end
    end
    puts s2
I think I found something that works.
    def limit_bytesize(str, size)
      str.encoding.name == 'UTF-8' or raise ArgumentError, "str must have UTF-8 encoding"

      # Change to canonical unicode form (compose any decomposed characters).
      # Works only if you're using active_support
      str = str.mb_chars.compose.to_s if str.respond_to?(:mb_chars)

      # Start with a string of the correct byte size, but
      # with a possibly incomplete char at the end.
      new_str = str.byteslice(0, size)

      # We need to force_encoding from utf-8 to utf-8 so ruby will re-validate
      # (idea from halfelf).
      until new_str[-1].force_encoding('utf-8').valid_encoding?
        # remove the invalid char
        new_str = new_str.slice(0..-2)
      end
      new_str
    end
    >> limit_bytesize("abc\u2014d", 4)
    => "abc"
    >> limit_bytesize("abc\u2014d", 5)
    => "abc"
    >> limit_bytesize("abc\u2014d", 6)
    => "abc—"
    >> limit_bytesize("abc\u2014d", 7)
    => "abc—d"
Decomposed behavior without active_support:
    >> limit_bytesize("abc\u0065\u0301d", 4)
    => "abce"
    >> limit_bytesize("abc\u0065\u0301d", 5)
    => "abce"
    >> limit_bytesize("abc\u0065\u0301d", 6)
    => "abcé"
    >> limit_bytesize("abc\u0065\u0301d", 7)
    => "abcéd"
Decomposed behavior with active_support:
    >> limit_bytesize("abc\u0065\u0301d", 4)
    => "abc"
    >> limit_bytesize("abc\u0065\u0301d", 5)
    => "abcé"
    >> limit_bytesize("abc\u0065\u0301d", 6)
    => "abcéd"
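On newer Rubies the same idea can probably be written more compactly. The sketch below assumes Ruby 2.1 or later (for String#scrub) and a valid UTF-8 input; the method name limit_bytesize_scrub is made up for this example. It does no composition of decomposed characters.

```ruby
# Hypothetical variant; assumes str is valid UTF-8 to begin with.
def limit_bytesize_scrub(str, size)
  # byteslice may leave an incomplete character at the end;
  # scrub('') replaces any invalid byte sequence with nothing.
  str.byteslice(0, size).scrub('')
end

puts limit_bytesize_scrub("abc\u2014d", 4)   # abc
puts limit_bytesize_scrub("abc\u2014d", 5)   # abc
puts limit_bytesize_scrub("abc\u2014d", 6)   # abc—
puts limit_bytesize_scrub("abc\u2014d", 7)   # abc—d
```

Note that scrub('') also drops invalid bytes anywhere else in the string, so if the input might not be valid UTF-8 this will silently remove more than the trailing fragment.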