February 27, 2020 —
Guest post by Hamza Tahir of maiot, along with Robert Crowe and Tris Warkentin on behalf of the TFX team
Introduction

Principal Component Analysis (PCA) is a dimensionality reduction technique, useful in many different machine learning scenarios. In essence, PCA reduces the dimension of input vectors in a way that retains the maximal variance in your dataset. Reducing the dimensionality of the model input can also reduce the cost of training and serving.
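As a quick illustration of that variance-retaining property (using scikit-learn here purely for demonstration; it is not part of TFX), the explained_variance_ratio_ of a fitted PCA shows how much of the total variance each retained component captures:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 examples with 5 features that are mostly driven by 2 latent factors.
latent = rng.normal(size=(200, 2))
data = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)
pca.fit(data)
# Fraction of the dataset's total variance captured by each component.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())  # close to 1.0 for this near-rank-2 data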
In TFX, feature engineering is expressed as a preprocessing_fn function that gets fed into the Transform component of the pipeline. This function accepts a dictionary of feature tensors and outputs a dictionary of features with the relevant transformations applied. While you can use normal TensorFlow code here, many fundamental transformations come built in, out of the box, with TensorFlow Transform (e.g., normalize, bucketize, compute vocabularies). Find the full list of out-of-the-box transforms here.
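As a small illustration, a preprocessing_fn built only from such out-of-the-box transforms might look like the sketch below; the feature names are hypothetical:

import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # Hypothetical features: numeric 'age' and 'fare', string-valued 'cabin'.
    return {
        'age_normalized': tft.scale_to_z_score(inputs['age']),
        'fare_bucketized': tft.bucketize(inputs['fare'], num_buckets=4),
        'cabin_vocab': tft.compute_and_apply_vocabulary(inputs['cabin']),
    }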
Among these is the tft.pca transform, which we will use to compute the PCA of our dataset. Here is how you can utilize this transform in a preprocessing_fn function:
import tensorflow as tf
import tensorflow_transform as tft


def preprocessing_fn(inputs):
    features = []
    outputs = {}
    for feature_tensor in inputs.values():
        # Standard scaling (z-score) is a prerequisite for PCA.
        features.append(tft.scale_to_z_score(feature_tensor))
    # Concatenate the scaled features into one matrix for PCA to run over.
    feature_matrix = tf.concat(features, axis=1)
    # Get the orthonormal vector matrix (the principal axes of the data).
    orthonormal_vectors = tft.pca(feature_matrix, output_dim=2, dtype=tf.float32)
    # Multiply the feature matrix by the orthonormal vectors to get the
    # projected transformation.
    pca_examples = tf.linalg.matmul(feature_matrix, orthonormal_vectors)
    # Unstack the projection and add each component to the output dict.
    pca_examples = tf.unstack(pca_examples, axis=1)
    outputs['Principal Component 1'] = pca_examples[0]
    outputs['Principal Component 2'] = pca_examples[1]
    return outputs
Note: In this example, we have assumed that all input features are numerical and are all fed into the PCA transform. If needed, only a subset of the input features may be used.

In the preprocessing_fn above, each feature is first scaled to a z-score, and the scaled features are concatenated into a single feature matrix, which is passed to the tft.pca function. This calculates the orthonormal vector matrix of our data. As explained in the tft.pca documentation, this matrix can be used to calculate the final projection of our data, which we do by multiplying it with the feature matrix. The final step is to 'unstack' the projection matrix, separating out the individual principal components, which we then return in the output dictionary.
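To make the projection step concrete, here is a minimal NumPy sketch (purely illustrative, not part of the pipeline) of what this amounts to conceptually: the orthonormal vectors are the leading eigenvectors of the covariance matrix of the standardized data, and multiplying by them projects each example into the component space.

import numpy as np

# Toy standardized feature matrix: 5 examples, 3 features.
X = np.array([[ 1.0,  0.5, -1.2],
              [-0.3,  1.1,  0.4],
              [ 0.7, -0.9,  0.8],
              [-1.4,  0.2, -0.1],
              [ 1.0, -0.9,  0.1]])

# Covariance matrix of the features (columns).
cov = np.cov(X, rowvar=False)

# Eigenvectors of a symmetric matrix are orthonormal; eigh returns them in
# ascending order of eigenvalue, so take the last two columns, reversed.
eigvals, eigvecs = np.linalg.eigh(cov)
orthonormal_vectors = eigvecs[:, ::-1][:, :2]

# Project the examples, analogous to
# tf.linalg.matmul(feature_matrix, orthonormal_vectors) above.
projected = X @ orthonormal_vectors
print(projected.shape)  # (5, 2)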
While this may look like a simple preprocessing_fn, a lot goes on under the hood that is abstracted away. To perform distributed processing on a compute cluster, TFX creates a distributed Apache Beam pipeline which computes the relevant covariances and the orthonormal vector matrix. It also creates a normal TensorFlow graph with this transformation embedded, which becomes part of your trained model, so that you can apply the PCA transformation at serving time. The result of PCA is a new vector space with fewer dimensions; at serving time, new data is projected from the original higher-dimensional space into that lower-dimensional space.
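If you want to see that Beam pipeline in action outside of a full TFX pipeline, the tf.Transform Beam implementation can run the same preprocessing_fn directly. A minimal sketch, assuming two illustrative float features 'f1' and 'f2' and the preprocessing_fn defined above (in a real pipeline, the schema would come from SchemaGen rather than being written by hand):

import tensorflow as tf
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

# Illustrative schema with two float features of shape [1].
raw_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        'f1': tf.io.FixedLenFeature([1], tf.float32),
        'f2': tf.io.FixedLenFeature([1], tf.float32),
    }))

raw_data = [
    {'f1': [1.0], 'f2': [2.0]},
    {'f1': [3.0], 'f2': [1.0]},
    {'f1': [5.0], 'f2': [4.0]},
]

with tft_beam.Context(temp_dir='/tmp/tft'):
    # Analyze computes the z-score statistics and the PCA matrix over the
    # whole dataset; Transform applies the resulting TensorFlow graph.
    (transformed_data, _), transform_fn = (
        (raw_data, raw_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))

print(transformed_data)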