June 10, 2019 —
Posted by Robby Neale, Software Engineer
TensorFlow provides a wide breadth of ops that greatly aid in building models from images and video. However, there are many models that begin with text, and the language models built from these require some preprocessing before the text can be fed into the model. For example, the Text Classification tutorial that uses the IMDB set begins with text data th…
import tensorflow as tf
import tensorflow_text

tokenizer = tensorflow_text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])
print(tokens.to_list())
[['everything', 'not', 'saved', 'will', 'be', 'lost.'], ['Sad\xe2\x98\xb9']]
The initial release also includes a Unicode script tokenizer, which splits UTF-8 strings at Unicode script boundaries. Unicode scripts are collections of characters and symbols that have historically related language derivations. See the International Components for Unicode (ICU) UScriptCode values for the complete set of enumerations. This tokenizer is similar to the whitespace tokenizer; the most apparent difference is that it splits punctuation (USCRIPT_COMMON) from language text (e.g. USCRIPT_LATIN, USCRIPT_CYRILLIC, etc.).
tokenizer = tensorflow_text.UnicodeScriptTokenizer()
tokens = tokenizer.tokenize(['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])
print(tokens.to_list())
[['everything', 'not', 'saved', 'will', 'be', 'lost', '.'], ['Sad', '\xe2\x98\xb9']]
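To make the splitting rule above more concrete, here is a rough pure-Python sketch. It is an approximation, not the TF.Text implementation: the standard library has no Unicode script property, so the sketch swaps in Unicode general categories and treats letters and combining marks as "language text" and everything else as "common". Unlike the real tokenizer, it will not split adjacent scripts such as Latin and Cyrillic. The function names are hypothetical.

```python
import unicodedata


def char_class(ch):
    """Classify a character as 'space', 'letter', or 'common' (approximation)."""
    if ch.isspace():
        return "space"
    # Letters (L*) and combining marks (M*) stand in for "language text".
    if unicodedata.category(ch)[0] in ("L", "M"):
        return "letter"
    # Punctuation, symbols, digits, etc. stand in for USCRIPT_COMMON.
    return "common"


def split_letters_from_common(text):
    """Split on whitespace and wherever the character class changes."""
    tokens = []
    current = ""
    current_class = None
    for ch in text:
        cls = char_class(ch)
        if cls == "space":
            # Whitespace ends the current token and is dropped.
            if current:
                tokens.append(current)
            current, current_class = "", None
            continue
        if current and cls != current_class:
            # Class boundary: close the running token, start a new one.
            tokens.append(current)
            current = ""
        current += ch
        current_class = cls
    if current:
        tokens.append(current)
    return tokens


print(split_letters_from_common("everything not saved will be lost."))
print(split_letters_from_common("Sad☹"))
```

On the two example strings this yields the same token boundaries as the output shown above, with "." and "☹" separated from the neighboring words.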
The final tokenizer provided in the TF.Text launch is a wordpiece tokenizer. It is an unsupervised text tokenizer that requires a predetermined vocabulary to further split tokens into subwords (prefixes and suffixes). Wordpiece is commonly used in BERT models.
def _CreateTable(vocab, num_oov=1):
  init = tf.lookup.KeyValueTensorInitializer(
      vocab,
      tf.range(tf.size(vocab, out_type=tf.int64), dtype=tf.int64),
      key_dtype=tf.string,
      value_dtype=tf.int64)
  return tf.lookup.StaticVocabularyTable(
      init, num_oov, lookup_key_dtype=tf.string)
vocab_table = _CreateTable(["great", "they", "the", "##'", "##re", "##est"])
tokens = [["they're", "the", "greatest"]]
tokenizer = tensorflow_text.WordpieceTokenizer(
    vocab_table, token_out_type=tf.string)
result = tokenizer.tokenize(tokens)
print(result.to_list())
[[['they', "##'", '##re'], ['the'], ['great', '##est']]]
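For readers curious how a vocabulary drives the subword split, the following is a minimal pure-Python sketch of greedy longest-match-first matching, the strategy usually associated with wordpiece. It is an illustration under assumptions, not the TF.Text implementation: the function name `wordpiece_tokenize` is hypothetical, and the `"##"` suffix marker and `"[UNK]"` fallback are choices made for this sketch.

```python
def wordpiece_tokenize(token, vocab, unknown_token="[UNK]", suffix_indicator="##"):
    """Split one token into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = token[start:end]
            if start > 0:
                # Pieces that do not begin the token carry the suffix marker.
                candidate = suffix_indicator + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No vocabulary entry covers this position: give up on the token.
            return [unknown_token]
        pieces.append(match)
        start = end
    return pieces


vocab = {"great", "they", "the", "##'", "##re", "##est"}
result = [wordpiece_tokenize(t, vocab) for t in ["they're", "the", "greatest"]]
print(result)
# [['they', "##'", '##re'], ['the'], ['great', '##est']]
```

Note that "they're" matches "they" rather than the shorter "the", because the longest candidate is tried first.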
Each of these tokenizers operates on UTF-8 encoded strings and includes an option for getting byte offsets into the original string. This tells the caller which bytes of the original string each token came from.
tokenizer = tensorflow_text.UnicodeScriptTokenizer()
(tokens, offset_starts, offset_limits) = tokenizer.tokenize_with_offsets(['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])
print(tokens.to_list())
print(offset_starts.to_list())
print(offset_limits.to_list())
[['everything', 'not', 'saved', 'will', 'be', 'lost', '.'], ['Sad', '\xe2\x98\xb9']]
[[0, 11, 15, 21, 26, 29, 33], [0, 3]]
[[10, 14, 20, 25, 28, 33, 34], [3, 6]]
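The offsets above count bytes of the UTF-8 encoding, not characters, which is why "☹" spans bytes 3 to 6. As a small illustration of that bookkeeping, here is a pure-Python sketch that computes start and limit byte offsets for a plain whitespace split. It is not the TF.Text implementation; the function name is hypothetical, and only ASCII whitespace is treated as a separator.

```python
import re


def whitespace_tokenize_with_offsets(text):
    """Return (tokens, starts, limits) with offsets measured in UTF-8 bytes."""
    data = text.encode("utf-8")
    tokens, starts, limits = [], [], []
    # With a bytes pattern, \S matches any byte that is not ASCII whitespace,
    # so multi-byte UTF-8 sequences stay inside a single token.
    for match in re.finditer(rb"\S+", data):
        tokens.append(match.group())
        starts.append(match.start())  # offset of the token's first byte
        limits.append(match.end())    # offset one past the token's last byte
    return tokens, starts, limits


tokens, starts, limits = whitespace_tokenize_with_offsets("Sad☹")
print(tokens, starts, limits)
# [b'Sad\xe2\x98\xb9'] [0] [6]
```

Slicing the encoded bytes with `data[start:limit]` recovers each token exactly, which is the property that makes offsets useful for aligning tokens back to the source text.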
pip install tensorflow-text