December 12, 2018 —
Posted by Laurence Moroney
In many scenarios, data doesn’t come evenly divided into uniformly-shaped arrays that can be loaded into tensors. A classic case is in training and processing text. For example, if you look at the Text Classification tutorial that uses the IMDB dataset, you’ll see a major part of your data preparation is in shaping your data to a normalized size. In that case, every rev…
speech = tf.ragged.constant(
[['All', 'the', 'world', 'is', 'a', 'stage'],
['And', 'all', 'the', 'men', 'and', 'women', 'merely', 'players'],
['They', 'have', 'their', 'exits', 'and', 'their', 'entrances']])
When printing this out, we can see that it is created from a list of lists, with each being of variable length:
<tf.RaggedTensor [['All', 'the', 'world', 'is', 'a', 'stage'], ['And', 'all', 'the', 'men', 'and', 'women', 'merely', 'players'], ['They', 'have', 'their', 'exits', 'and', 'their', 'entrances']]>
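To make the "variable length" point concrete, here is a minimal plain-Python sketch of one way such rows can be stored: a single flat list of values plus a list of row boundaries. A RaggedTensor exposes a comparable pair as its `values` and `row_splits` properties, but the helpers `to_flat` and `row` below are hypothetical names used only for illustration, not part of TensorFlow.

```python
def to_flat(rows):
    """Flatten variable-length rows into (values, row_splits)."""
    values, row_splits = [], [0]
    for r in rows:
        values.extend(r)
        # Each entry marks where the next row starts in `values`.
        row_splits.append(len(values))
    return values, row_splits


def row(values, row_splits, i):
    """Recover row i from the flat representation."""
    return values[row_splits[i]:row_splits[i + 1]]


speech_rows = [
    ['All', 'the', 'world', 'is', 'a', 'stage'],
    ['And', 'all', 'the', 'men', 'and', 'women', 'merely', 'players'],
    ['They', 'have', 'their', 'exits', 'and', 'their', 'entrances'],
]

values, row_splits = to_flat(speech_rows)
print(row_splits)                  # [0, 6, 14, 21]
print(row(values, row_splits, 0))  # ['All', 'the', 'world', 'is', 'a', 'stage']
```

No padding values appear anywhere in this representation; the row lengths (6, 8 and 7) are implied by the differences between consecutive boundaries.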
Most operations that you’d expect to be supported with normal tensors are available to ragged tensors too. For example, Python-style indexing to access slices of the tensor works as expected:
>>> print(speech[0])
tf.Tensor(['All', 'the', 'world', 'is', 'a', 'stage'], shape=(6,), dtype=string)
The tf.ragged package also defines a number of operations that are specific to ragged tensors. For example, the tf.ragged.map_flat_values operation can be used to efficiently transform the individual values in a ragged tensor, while keeping its shape the same:
>>> print(tf.ragged.map_flat_values(tf.strings.regex_replace, speech, pattern="([aeiouAEIOU])", rewrite=r"{\1}"))
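The idea behind this operation can be sketched in plain Python, assuming the flat-values-plus-boundaries layout: the function is applied to the flat list of values, and the row boundaries are reused unchanged. This is only an illustration of the concept, not TensorFlow's implementation; `map_flat_values` and `brace_vowels` below are hypothetical stand-ins, with Python's `re.sub` playing the role of `tf.strings.regex_replace`.

```python
import re


def map_flat_values(op, values, row_splits):
    """Apply op to the flat values; the row boundaries are kept as they are."""
    new_values = op(values)
    # The op must not add or drop values, or the boundaries would be wrong.
    assert len(new_values) == len(values)
    return new_values, row_splits


def brace_vowels(words):
    """Wrap every vowel in braces, e.g. 'stage' -> 'st{a}g{e}'."""
    return [re.sub(r"([aeiouAEIOU])", r"{\1}", w) for w in words]


# First two rows of `speech`, stored as flat values plus row boundaries.
values = ['All', 'the', 'world', 'is', 'a', 'stage',
          'And', 'all', 'the', 'men', 'and', 'women', 'merely', 'players']
row_splits = [0, 6, 14]

new_values, new_splits = map_flat_values(brace_vowels, values, row_splits)
print(new_values[new_splits[0]:new_splits[1]])
# ['{A}ll', 'th{e}', 'w{o}rld', '{i}s', '{a}', 'st{a}g{e}']
```

Because the boundaries are untouched, the result has exactly the same ragged shape as the input.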
You can learn more about which ops are supported here.
import math
import tensorflow as tf
tf.enable_eager_execution()
# Set up the embeddings
num_buckets = 1024
embedding_size = 16
embedding_table = tf.Variable(
    tf.truncated_normal([num_buckets, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)),
    name="embedding_table")
# Input tensor.
queries = tf.ragged.constant([
['Who', 'is', 'Dan', 'Smith'],
['Pause'],
['Will', 'it', 'rain', 'later', 'today']])
# Look up embedding for each word. map_flat_values applies an operation to each value in a RaggedTensor.
word_buckets = tf.strings.to_hash_bucket_fast(queries, num_buckets)
word_embeddings = tf.ragged.map_flat_values(
tf.nn.embedding_lookup, embedding_table, word_buckets) # ①
# Add markers to the beginning and end of each sentence.
marker = tf.fill([queries.nrows(), 1], '#')
padded = tf.concat([marker, queries, marker], axis=1) # ②
# Build word bigrams & look up embeddings.
bigrams = tf.string_join(
[padded[:, :-1], padded[:, 1:]], separator='+') # ③
bigram_buckets = tf.strings.to_hash_bucket_fast(bigrams, num_buckets)
bigram_embeddings = tf.ragged.map_flat_values(
tf.nn.embedding_lookup, embedding_table, bigram_buckets) # ④
# Find the average embedding for each sentence
all_embeddings = tf.concat([word_embeddings, bigram_embeddings], axis=1) # ⑤
avg_embedding = tf.reduce_mean(all_embeddings, axis=1) # ⑥
print(word_embeddings)
print(bigram_embeddings)
print(all_embeddings)
print(avg_embedding)
This is illustrated in the following diagram. Note that the numbers are for illustrative purposes only. For the real values in the embedding, check out the values output at the end of the code block.
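Steps ② and ③ of the listing above (adding the `#` markers and building word bigrams) can also be traced by hand. The following is a minimal plain-Python sketch of those two steps on the same `queries`; the helper names `pad_with_markers` and `bigrams` are hypothetical, and the sketch deliberately leaves out the hashing and embedding lookup.

```python
queries = [
    ['Who', 'is', 'Dan', 'Smith'],
    ['Pause'],
    ['Will', 'it', 'rain', 'later', 'today'],
]


def pad_with_markers(row, marker='#'):
    """Add a marker to the beginning and end of one sentence."""
    return [marker] + row + [marker]


def bigrams(row, separator='+'):
    """Join each word with its right-hand neighbour."""
    return [a + separator + b for a, b in zip(row[:-1], row[1:])]


padded = [pad_with_markers(q) for q in queries]
bigram_rows = [bigrams(p) for p in padded]

print(bigram_rows[0])  # ['#+Who', 'Who+is', 'is+Dan', 'Dan+Smith', 'Smith+#']
print(bigram_rows[1])  # ['#+Pause', 'Pause+#']

# Each sentence contributes its words plus its bigrams to the average in step ⑥.
print([len(q) + len(b) for q, b in zip(queries, bigram_rows)])  # [9, 3, 11]
```

A sentence of n words yields n + 1 bigrams once the markers are added, so every row keeps its own length throughout.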
tf.Tensor, since no time or space is wasted on the padding values; and are more flexible and convenient than using a tf.SparseTensor, since they support a wide variety of operations, with the correct semantics for variable-length lists.
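The padding overhead can be estimated with a rough element count. The sketch below compares the `queries` example stored as a padded rectangle with the same data stored as flat values plus row boundaries. It is back-of-the-envelope arithmetic under that assumed layout, not a measurement of TensorFlow's actual memory use.

```python
queries = [
    ['Who', 'is', 'Dan', 'Smith'],
    ['Pause'],
    ['Will', 'it', 'rain', 'later', 'today'],
]

lengths = [len(q) for q in queries]

# Padded layout: every row is stretched to the longest row.
padded_cells = len(queries) * max(lengths)
wasted_cells = padded_cells - sum(lengths)

# Ragged layout (assumed): one flat list of values plus row boundaries.
ragged_values = sum(lengths)
boundary_entries = len(queries) + 1

print(padded_cells, wasted_cells)       # 15 5
print(ragged_values, boundary_entries)  # 10 4
```

Here a third of the padded cells hold nothing but padding, and the gap widens as sentence lengths vary more.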