Introducing TensorFlow Data Validation: Data Understanding, Validation, and Monitoring At Scale

September 10, 2018

Posted by Clemens Mewald (Product Manager) and Neoklis Polyzotis (Research Scientist)

Today we are launching TensorFlow Data Validation (TFDV), an open-source library that helps developers understand, validate, and monitor their ML data at scale. Both academia and industry have paid much attention to ML algorithms and their performance, but all this optimization effort goes to waste if the input data are wrong. Understanding and validating data may seem like a trivial task for tiny amounts of data that can be inspected manually. In practice, however, the data are too large for manual inspection and often arrive continuously and in big chunks, so it becomes necessary to automate and scale the tasks of data analysis, validation, and monitoring.

TFDV is part of the TFX Platform and this technology is used to analyze and validate petabytes of data at Google every day. It has a proven track record in catching data errors early and thus in helping TFX users maintain the health of their ML pipelines.

Figure 1: TensorFlow Data Validation is used for Data Analysis and Validation in TFX

TensorFlow Data Validation in a Notebook

Early in designing TFDV we made the decision to enable its use from a notebook environment. We found it important to allow data scientists and engineers to use the TFDV libraries as early as possible within their workflows, to ensure that they could inspect and validate their data, even if they were doing exploration with only a small subset of their data. This would make the subsequent transition to a large-scale production deployment easier.

Computing and Visualizing Descriptive Statistics

At the basis of TFDV lies a powerful library to compute descriptive statistics over your ML data. These statistics are used both to help developers investigate and understand their data and to infer a schema (more on this later).

The TFDV APIs are designed to enable connectors to different data formats, and to provide flexibility and scale.

Connectors: TFDV uses Apache Beam to define and process its data pipelines. As a result, existing Beam IO connectors, as well as user-defined PTransforms, can be used to process different formats and representations of data. We provide two helper functions, one for CSV files and one for TFRecord files of serialized tf.Examples.

import tensorflow_data_validation as tfdv

# compute statistics for a CSV file
train_stats = tfdv.generate_statistics_from_csv(TRAIN_DATA)
# alternatively, compute statistics for TFRecord files of serialized tf.Examples
train_stats = tfdv.generate_statistics_from_tfrecord(TRAIN_DATA)

Flexibility: The API also allows for the computation of custom statistics (in addition to the standard statistics computed by TFDV), so long as this computation can be expressed as Apache Beam transformations. These custom statistics are serialized in the same statistics.proto and can be consumed by downstream libraries.
Scale: TFDV creates an Apache Beam pipeline that is executed with a DirectRunner in the notebook environment. The same pipeline can be distributed with other runners, e.g. with the DataflowRunner on Google Cloud Platform. The Apache Flink and Apache Beam communities are also nearing completion of a Flink Runner. Follow the corresponding JIRA ticket, Apache Beam blog, or mailing lists to get notifications about availability of the Flink Runner.

The statistics are stored in a statistics.proto and can be visualized from within the notebook.

Figure 2: The statistics.proto can be visualized and inspected inline (using Facets Overview)
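
To make the notion of descriptive statistics concrete, here is a minimal, self-contained sketch of the kind of per-feature summary such a pass produces. It is a toy illustration in plain Python, not TFDV's implementation (which runs as an Apache Beam pipeline and emits a statistics.proto). The sample data, the column names, and the describe helper are assumptions made up for this example.

```python
import csv
import io
from collections import Counter

# Toy CSV standing in for TRAIN_DATA; the columns are hypothetical.
SAMPLE_CSV = """fare,payment_type
12.5,Cash
7.0,Credit Card
,Cash
30.0,Cash
"""


def _is_number(text):
    """Return True if the CSV cell parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def describe(csv_text):
    """Return simple per-column statistics for a small in-memory CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    stats = {}
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        present = [v for v in values if v != ""]
        column = {"count": len(values), "missing": len(values) - len(present)}
        if present and all(_is_number(v) for v in present):
            # Numeric feature: summarize the value range and the mean.
            numbers = [float(v) for v in present]
            column.update(type="FLOAT", min=min(numbers), max=max(numbers),
                          mean=sum(numbers) / len(numbers))
        else:
            # Categorical feature: count distinct values and find the mode.
            counts = Counter(present)
            column.update(type="STRING", unique=len(counts),
                          top=counts.most_common(1)[0][0] if counts else None)
        stats[name] = column
    return stats


stats = describe(SAMPLE_CSV)
for feature_name, summary in stats.items():
    print(feature_name, summary)
```

Summaries like these (missing counts, value ranges, most frequent values) are what a developer scans for surprises before training, and they are also the input from which a schema can be inferred.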

Inferring a Schema

Based on the statistics, TFDV infers a schema (described by the schema.proto) that is meant to reflect the stable characteristics of the data. We will explain below how the schema drives data validation within TFDV. In addition, this schema format is also used as the interface to other components in the TFX ecosystem, e.g. it can automate the parsing of data within TensorFlow Transform.

# Infer schema based on statistics
schema = tfdv.infer_schema(train_stats)
# Display schema inline in table format
tfdv.display_schema(schema)

Figure 3: The schema.proto can be visualized and inspected inline

As with computing descriptive statistics over small amounts of data, writing a schema that describes the expectations of training data may seem trivial for a small number of features. In practice, however, training data may contain thousands of features. infer_schema helps developers author a first version of the schema that they can then refine and update manually.
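
The "infer first, refine by hand" workflow can be sketched in a few lines of plain Python. This is a deliberately simplified stand-in, not how tfdv.infer_schema works internally: the infer_toy_schema helper, the dictionary layout, and the feature names are all hypothetical, and the real schema.proto carries far more detail.

```python
def infer_toy_schema(rows):
    """Infer a per-feature type, presence requirement, and string domain."""
    names = sorted({name for row in rows for name in row})
    schema = {}
    for name in names:
        values = [row[name] for row in rows if row.get(name) is not None]
        is_numeric = all(isinstance(v, (int, float)) for v in values)
        feature = {
            "type": "NUMERIC" if is_numeric else "STRING",
            # Required only if every training example carried a value.
            "required": len(values) == len(rows),
        }
        if not is_numeric:
            # For string features, record the set of values seen so far.
            feature["domain"] = sorted({str(v) for v in values})
        schema[name] = feature
    return schema


# Hypothetical training examples.
train_rows = [
    {"fare": 12.5, "payment_type": "Cash", "company": "Blue Cab"},
    {"fare": 7.0, "payment_type": "Credit Card", "company": None},
    {"fare": 30.0, "payment_type": "Cash", "company": "Blue Cab"},
]

schema = infer_toy_schema(train_rows)
for name, feature in schema.items():
    print(name, feature)

# Manual refinement after inference: the author knows "No Charge" is a valid
# value even though it did not occur in this sample.
schema["payment_type"]["domain"].append("No Charge")
```

The inferred result is only a starting point: anything the sample did not happen to contain, such as a rare but valid category, has to be added by someone who knows the data.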

Validating New Data

Given a schema, TFDV can validate a new set of data against the expectations expressed in the schema.

# Compute statistics over a new set of data
new_stats = tfdv.generate_statistics_from_csv(NEW_DATA)
# Compare how new data conforms to the schema
anomalies = tfdv.validate_statistics(new_stats, schema)
# Display anomalies inline
tfdv.display_anomalies(anomalies)

Figure 4: The anomalies report gives an overview of discrepancies between new data and the schema

The output of validate_statistics is stored in an anomalies.proto and describes how the data deviate from the constraints encoded in the schema. Developers can examine this output and take action to either fix errors in their data or update the schema if the detected anomaly is a natural evolution of the data (e.g. a new valid string value in a categorical feature).
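
The validate, inspect, then fix-or-update loop described above can be illustrated with a small plain-Python sketch. Note the assumptions: TFDV validates statistics against a schema.proto, whereas this toy version checks raw rows against a dictionary, and the validate_toy helper, the anomaly messages, and the feature names are invented for illustration.

```python
def validate_toy(rows, schema):
    """Return {feature: [short descriptions]} for every schema violation."""
    anomalies = {}

    def report(name, message):
        # Record each distinct message once per feature.
        if message not in anomalies.setdefault(name, []):
            anomalies[name].append(message)

    for row in rows:
        for name in row:
            if name not in schema:
                report(name, "New column not in schema")
        for name, feature in schema.items():
            value = row.get(name)
            if value is None:
                if feature.get("required"):
                    report(name, "Missing value for required feature")
            elif "domain" in feature and value not in feature["domain"]:
                report(name, "Unexpected string value: " + str(value))
    return anomalies


# Hypothetical hand-written schema and new batch of data.
schema = {
    "fare": {"required": True},
    "payment_type": {"required": True, "domain": ["Cash", "Credit Card"]},
}
new_rows = [
    {"fare": 9.0, "payment_type": "Cash"},
    {"fare": None, "payment_type": "Prcard"},
    {"fare": 4.5, "payment_type": "Prcard", "tip": 1.0},
]

anomalies = validate_toy(new_rows, schema)
print(anomalies)

# If "Prcard" turns out to be a legitimate new value, update the schema
# instead of fixing the data, then validate again.
schema["payment_type"]["domain"].append("Prcard")
anomalies_after_update = validate_toy(new_rows, schema)
print(anomalies_after_update)
```

After the schema update, the "Prcard" anomaly disappears while the missing fare and the unknown tip column are still reported, which mirrors the choice between evolving the schema and fixing the data.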

TensorFlow Data Validation in Production Pipelines

Outside of a notebook environment the same TFDV libraries can be used to analyze and validate data at scale. Two common use-cases of TFDV within TFX pipelines are validation of continuously arriving data and training/serving skew detection. In addition, for users of TensorFlow Transform, the inferred schema can be used to parse data into the preprocessing function.

Validation of Continuously Arriving Data

In the case of continuously arriving data, new data need to be validated against the expectations codified within the schema. In a typical setup the schema is maintained across time, statistics are computed over new data, and those statistics are used to validate those data against the original schema. As mentioned above, the schema is inferred initially as a convenience, but it can evolve over time.

Figure 5: Schematic view of using validate_statistics to validate newly arriving data

The statistics from different sets of data (or different days of data) can also be compared visually using the visualize_statistics function.

Figure 6: Using the Facets overview visualization to compare two sets of data. Here we compare the feature ‘fare’ between two days. Because DAY2 only has half of the number of examples of DAY1, we switch to percentages. Then we click ‘expand’ to zoom into the visualization. The ‘Quantiles’ view shows that the distribution is similar with the exception of an outlier in the DAY1 dataset.

TFDV can also detect distribution drifts between successive versions of the training data. As with the rest of validation, constraints on drift can be expressed in the schema. TFDV uses these constraints to compare the statistics between successive data versions. If drift is detected, an appropriate message is included in the generated anomalies.
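
As a rough illustration of a drift constraint, the sketch below compares the distribution of a categorical feature across two successive days using the L-infinity distance (the largest absolute difference in the relative frequency of any single value) and flags drift when it exceeds a threshold. This is a plain-Python approximation under stated assumptions, not TFDV's code: the function names, the sample values, and the threshold of 0.1 are hypothetical.

```python
from collections import Counter


def normalized_counts(values):
    """Map each distinct value to its relative frequency (non-empty input)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def l_infinity_distance(previous_values, current_values):
    """Largest absolute change in relative frequency of any single value."""
    previous = normalized_counts(previous_values)
    current = normalized_counts(current_values)
    return max(abs(previous.get(v, 0.0) - current.get(v, 0.0))
               for v in set(previous) | set(current))


# Hypothetical values of a 'payment_type' feature on two successive days.
day1 = ["Cash"] * 6 + ["Credit Card"] * 4
day2 = ["Cash"] * 3 + ["Credit Card"] * 7

DRIFT_THRESHOLD = 0.1  # hypothetical constraint, chosen for illustration

distance = l_infinity_distance(day1, day2)
drift_detected = distance > DRIFT_THRESHOLD
print("L-infinity distance:", round(distance, 3), "drift:", drift_detected)
```

Here the share of "Cash" falls from 60% to 30%, so the distance is about 0.3 and the toy check reports drift. Choosing the threshold is a judgment call that depends on how much day-to-day variation is normal for the feature.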

Training/Serving Skew Detection

Training/Serving skew refers to a difference in the feature values or distributions between the data used to train a model and the data observed by the serving system. Similar to the continuous validation of training data, TFDV can compute statistics of serving logs and perform validation with the schema, taking into account any expected differences between training and serving data (e.g. the fact that a label is present in the training data but absent in serving logs, or that the distribution may shift by a small amount). In addition, Facets can visualize side-by-side the statistics of training and serving data and thus highlight potential errors or drifts.

Figure 7: Schematic view of comparing training and serving data with validate_statistics
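
One way to picture "expected differences" between training and serving data is to tag each feature in the schema with the environments in which it should appear, so that a label missing from serving logs is not reported as an error. The sketch below is a plain-Python illustration of that idea only; the missing_features helper, the environment names, and the feature names are assumptions, not TFDV's API.

```python
def missing_features(schema, observed_features, environment):
    """List features expected in `environment` but absent from the data."""
    expected = [name for name, feature in schema.items()
                if environment in feature["environments"]]
    return [name for name in expected if name not in observed_features]


# Hypothetical schema: 'tips' is the label, so it is expected only in training.
schema = {
    "fare": {"environments": {"TRAINING", "SERVING"}},
    "payment_type": {"environments": {"TRAINING", "SERVING"}},
    "tips": {"environments": {"TRAINING"}},
}

serving_features = {"fare", "payment_type"}   # label absent, as expected
training_features = {"fare", "tips"}          # 'payment_type' is missing

serving_problems = missing_features(schema, serving_features, "SERVING")
training_problems = missing_features(schema, training_features, "TRAINING")
print("serving:", serving_problems)
print("training:", training_problems)
```

The serving check passes even though the label is absent, while the training check still catches a feature that is genuinely missing. Distribution-level skew would be handled separately, for example with a distance threshold between training and serving statistics.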

TFDV and TensorFlow Transform

TensorFlow Transform (TFT) is an open source library for TensorFlow that allows users to define preprocessing pipelines and run these using large scale data processing frameworks, while also exporting the pipeline in a way that can be run as part of a TensorFlow graph. Users define a pipeline by composing modular Python functions, which tf.Transform then executes with Apache Beam, a framework for large-scale, efficient, distributed data processing.

TFT requires the specification of a schema to parse the data into tensors. Instead of manually specifying a schema (by specifying the type of each feature), using the TFDV inferred schema radically simplifies the use of TFT.

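from tensorflow_transform.tf_metadata import dataset_schema, schema_utils
# Convert the TFDV-inferred schema into the dataset schema that TFT uses to parse data
feature_spec = schema_utils.schema_as_feature_spec(schema).feature_spec
schema = dataset_schema.from_feature_spec(feature_spec)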

How to get started with TensorFlow Data Validation

We’ve open sourced TFDV and published it on GitHub at github.com/tensorflow/data-validation under the Apache 2.0 License. This release includes an example notebook of how to use the TFDV libraries within a notebook.

We’ve also updated our end-to-end example, showcasing how TFDV can be used alongside TensorFlow Transform, TensorFlow Estimators, TensorFlow Model Analysis, and TensorFlow Serving. We suggest reading and trying out this example to get started with TFX.

We would like to acknowledge core contributions by Sudip Roy, Paul Suganthan, Ming Zhong, and Martin Zinkevich. We would also like to thank the following colleagues: Ahmet Altay, Deepak Bhaduria, Robert Bradshaw, Mike Case, Charles Chen, Yifei Feng, Chuan Yu Foo, Robbie Haertel, Abhijit Karmarkar, Gus Katsiapis, Lak Lakshmanan, Billy Lamberta, Raz Mathias, Kester Tong, Zohar Yahav, Xing Yan, Paul Yang, Xin Zhang, Lana Webb, Jarek Wilkiewicz.
