دسمبر 19, 2018 —
Posted by Laurence Moroney and Edward Loper
TensorFlow now supports Unicode, a standard encoding system used to represent characters from almost all languages. When processing natural language, it can be important to understand how characters are encoded in byte strings. In languages with small character sets like English, each character can be encoded in a single byte using the ASCII encoding. B…
0
and 0x10FFFF
. When code points are put in sequence, a Unicode string is formed."语言处理"
(which means “language processing” in Chinese) using code points, UTF-8 and UTF-16 respectively.
# Unicode string, represented as a vector of code points.
text_chars = tf.constant([35821, 35328, 22788, 29702])
# Unicode string, represented as a UTF-8-encoded string scalar.
text_utf8 = tf.constant('\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86')
# Unicode string, represented as a UTF-16-BE-encoded string scalar.
text_utf16be = tf.constant('\x8b\xed\x8a\x00Y\x04t\x06')
Naturally you may need to convert between representations — and TensorFlow 1.13 has added functions to do this:
tf.strings.unicode_decode
: Converts an encoded string scalar to a vector of code points.tf.strings.unicode_encode
: Converts a vector of code points to an encoded string scalar.tf.strings.unicode_transcode
: Converts an encoded string scalar to a different encoding.>>> tf.strings.unicode_decode(text_utf8, input_encoding='UTF-8')
tf.Tensor([35821 35328 22788 29702], shape=(4,), dtype=int32)
When decoding a Tensor
containing multiple strings, the strings may have differing lengths. unicode_decode
returns the result as a RaggedTensor
, where the length of the inner dimension varies depending on the number of characters in each string.
>>> hello = [b"Hello", b"你好", b"こんにちは", b"Bonjour"]
>>> tf.strings.unicode_decode(hello, input_encoding='UTF-8')
To learn more about Unicode support in TensorFlow, check out the Unicode tutorial colab and browse the tf.strings
documentation.
دسمبر 19, 2018
—
Posted by Laurence Moroney and Edward Loper
TensorFlow now supports Unicode, a standard encoding system used to represent characters from almost all languages. When processing natural language, it can be important to understand how characters are encoded in byte strings. In languages with small character sets like English, each character can be encoded in a single byte using the ASCII encoding. B…