Ruby: Limiting a UTF-8 string by byte-length

This RabbitMQ page states:

Queue names may be up to 255 bytes of UTF-8 characters.

In Ruby (1.9.3), how would I truncate a UTF-8 string by byte count without breaking in the middle of a character? The resulting string should be the longest possible valid UTF-8 string that fits in the byte limit.

Answers


bytesize gives you the length of the string in bytes, while character-based operations such as slice won't mangle the string (as long as the string's encoding is set properly).
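To illustrate the difference between character-based and byte-based operations, here is a small sketch. The helper name fits_in_bytes? is made up for this example and is not part of Ruby; byteslice is used only to show what a byte-based cut can do.

```ruby
# Hypothetical helper: does the string already fit in the byte limit?
def fits_in_bytes?(s, limit = 255)
  s.bytesize <= limit
end

s = "h\u00E9llo"                         # U+00E9 takes two bytes in UTF-8

puts s.length                            # number of characters
puts s.bytesize                          # number of bytes
puts fits_in_bytes?(s, 5)                # the 5 characters need 6 bytes

# A character-based slice never splits a character...
puts s[0, 2].valid_encoding?
# ...but a byte-based slice can cut one in half.
puts s.byteslice(0, 2).valid_encoding?
```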

A simple approach is to iterate through the string character by character:

s.each_char.each_with_object('') do |char, result|
  if result.bytesize + char.bytesize > 255
    break result
  else
    result << char
  end
end

If you were being crafty, you'd copy the first 63 characters directly, since any Unicode character is at most 4 bytes in UTF-8 (63 × 4 = 252 bytes, which always fits in the 255-byte limit), and only iterate over the rest.
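That shortcut could look roughly like the sketch below. The method name truncate_bytes and its limit parameter are made up for illustration; the idea is only that the first limit / 4 characters can be copied without checking their size.

```ruby
# Hypothetical helper: longest prefix of s (in whole characters)
# that fits in `limit` bytes.
def truncate_bytes(s, limit = 255)
  # A UTF-8 character is at most 4 bytes, so the first limit / 4
  # characters are guaranteed to fit and can be copied in one step.
  safe = limit / 4
  result = s[0, safe]

  # Walk the remaining characters one by one until the limit is reached.
  s[safe..-1].to_s.each_char do |char|
    break if result.bytesize + char.bytesize > limit
    result << char
  end
  result
end

puts truncate_bytes("\u03B4og" * 70).bytesize
```

For short strings s[safe..-1] can be nil, which is why the sketch calls to_s before iterating.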

Note that this is still not perfect. For example, imagine that your string ends with the characters 'e' and a combining acute accent (U+0301), which together take 3 bytes. Slicing off the last 2 bytes produces a string that is still valid UTF-8, but in terms of what the user sees it would change the output from 'é' to 'e', which could change the meaning of the text. This is probably not a huge deal when you're just naming RabbitMQ queues, but it could be important in other circumstances. For example, in French a newsletter headline reading 'Un policier tué' means 'A policeman was killed' whereas 'Un policier tue' means 'A policeman kills'.
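If you are on a much newer Ruby than 1.9.3, one way to avoid splitting a base letter from its combining mark might be to iterate over grapheme clusters instead of characters. This is only a sketch: it assumes Ruby 2.5 or later (for String#each_grapheme_cluster), and the method name truncate_bytes_by_grapheme is made up for this example.

```ruby
# Hypothetical helper; requires Ruby >= 2.5 for each_grapheme_cluster.
def truncate_bytes_by_grapheme(s, limit = 255)
  result = String.new("", encoding: s.encoding)
  s.each_grapheme_cluster do |cluster|
    # A cluster such as "e" + U+0301 is kept or dropped as a whole.
    break if result.bytesize + cluster.bytesize > limit
    result << cluster
  end
  result
end

headline = "Un policier tue\u0301"   # decomposed form, 17 bytes
puts truncate_bytes_by_grapheme(headline, 16)
```

With a 16-byte limit the per-character loop would keep the bare 'e', while this version drops the whole accented letter. The result can therefore be shorter than the longest valid UTF-8 prefix, which is the trade-off for not changing what the user sees.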


For Rails >= 3.0 you have the ActiveSupport::Multibyte::Chars#limit method.

From API docs:

- (Object) limit(limit) 

Limit the byte size of the string to a number of bytes without breaking characters. Usable when the storage for a string is limited for some reason.

Example:

'こんにちは'.mb_chars.limit(7).to_s # => "こん"

How about this:

s = "δogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδog"
count = 0
while true
  # "a<N>" unpacks the first N bytes as a raw string.
  more_truncate = "a" + (255 - count).to_s
  s2 = s.unpack(more_truncate)[0]
  s2.force_encoding 'utf-8'

  # Done once the last character is complete; otherwise drop one more byte.
  if s2[-1].valid_encoding?
    break
  else
    count += 1
  end
end

puts s2

I think I found something that works.

def limit_bytesize(str, size)
  str.encoding.name == 'UTF-8' or raise ArgumentError, "str must have UTF-8 encoding"

  # Change to canonical unicode form (compose any decomposed characters).
  # Works only if you're using active_support
  str = str.mb_chars.compose.to_s if str.respond_to?(:mb_chars)

  # Start with a string of the correct byte size, but
  # with a possibly incomplete char at the end.
  new_str = str.byteslice(0, size)

  # We need to force_encoding from utf-8 to utf-8 so ruby will re-validate
  # (idea from halfelf). Stop once the string is empty, because
  # new_str[-1] would then be nil.
  until new_str.empty? || new_str[-1].force_encoding('utf-8').valid_encoding?
    # remove the invalid char
    new_str = new_str.slice(0..-2)
  end
  new_str
end

Usage:

>> limit_bytesize("abc\u2014d", 4)
=> "abc"
>> limit_bytesize("abc\u2014d", 5)
=> "abc"
>> limit_bytesize("abc\u2014d", 6)
=> "abc—"
>> limit_bytesize("abc\u2014d", 7)
=> "abc—d"

Update...

Decomposed behavior without active_support:

>> limit_bytesize("abc\u0065\u0301d", 4)
=> "abce"
>> limit_bytesize("abc\u0065\u0301d", 5)
=> "abce"
>> limit_bytesize("abc\u0065\u0301d", 6)
=> "abcé"
>> limit_bytesize("abc\u0065\u0301d", 7)
=> "abcéd"

Decomposed behavior with active_support:

>> limit_bytesize("abc\u0065\u0301d", 4)
=> "abc"
>> limit_bytesize("abc\u0065\u0301d", 5)
=> "abcé"
>> limit_bytesize("abc\u0065\u0301d", 6)
=> "abcéd"
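On Ruby 2.1 or later (so not on 1.9.3), the trimming loop can probably be replaced by String#scrub, which drops invalid byte sequences. The sketch below assumes the input is already valid UTF-8; if it is not, scrub would also remove invalid bytes from the middle of the string. The method name limit_bytesize_scrub is made up for this example, and no normalization step is included.

```ruby
# Hypothetical variant of limit_bytesize; requires Ruby >= 2.1 for scrub.
def limit_bytesize_scrub(str, size)
  # byteslice may leave an incomplete character at the end;
  # scrub('') removes those leftover bytes.
  str.byteslice(0, size).scrub('')
end

puts limit_bytesize_scrub("abc\u2014d", 5)
puts limit_bytesize_scrub("abc\u2014d", 6)
```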
