September 15, 2021 — Posted by Gal Moran, Iris Shmuel, and Daniel Marcous (Data Scientists at Waze)
Waze is the world's largest community-based traffic and navigation app. It uses real-time data to help users circumvent literal and figurative bumps in the road. On top of mobile navigation, Waze offers a web platform, a carpool app, partnership services, an advertisement platform and more. Such a broad portfolio brings along diverse technological challenges and many different use cases.
Waze relies on many ML solutions across these products.
But getting solutions like these right and truly "production grade" is not easy. Projects of this kind commonly require complex surrounding infrastructure before they reach production, and hence multiple engineers (data scientists, software engineers and site reliability engineers) and a lot of time. Even more so when you mix in Waze-y requirements like large-scale data, low-latency (real-time, actually) inference, diverse use cases, and a whole lot of geospatial data.
The above is a good reason why opportunistically starting to do ML created a chaotic state at Waze. For us it manifested in several ways.
Overall, data scientists ended up spending much of their time on ops and monitoring instead of focusing on the actual modelling and data processing. At a certain level of growth we decided to organize the chaos and invest in automation and processes so we could scale faster, dramatically increasing velocity and quality by adopting a full cycle data science philosophy. In this new world we wanted to build, a single data scientist is able to close the product cycle from research to a production grade service.
Data scientists now directly contribute to production to maximize impact. They focus on modelling and data processing and get much of the infrastructure and ops work out-of-the-box. While we are not yet at the end of this journey, fully realizing the above vision, we feel that the effort laid out here was crucial in putting us on the right track.
Translating the above philosophy to a tech spec, we were set on creating an easy, stable, automated and uniform way of building ML pipelines at Waze.
Deep diving into the tech requirements, we came up with a set of criteria.
For the above reasons we’ve landed on TFX and the power of its built-in components to deliver these capabilities mostly out of the box.
It's worth saying - Waze runs its tech stack on Google Cloud Platform (GCP). GCP offers a suite of ML tools called Vertex AI, which is the ML infrastructure platform Waze is building on top of. While we use many of Vertex AI's managed services, we will focus here on Vertex Pipelines - a framework for ML pipelines that helps us encapsulate TFX (or any pipeline) complexity and setup.
Together with our data tech stack, the overall ML architecture at Waze (all managed, scaled, pythonic etc.) is as follows:
Careful readers will notice the alleged caveat here - we go all in on TensorFlow.
TFX means TensorFlow (even though that's not exactly true anymore, let's assume it is).
It might be a little scary at first when you have many different use cases.
Fortunately, the TF ecosystem is rich, and Waze has the merit of data large enough that neural nets converge.
Since starting this journey we've yet to find a use case that TF magic does not solve as well as or better than other frameworks (and we're not talking about micro percentage points - we're not trying to win a Kaggle competition here, just get something to production).
You might think that landing on TFX and Vertex Pipelines solved all our problems, but that’s not exactly true.
In order to make things truly simple, we had to write some "glue code" (integrating the various products in the above architecture diagram) and abstract away enough details that the common data scientist could use this stuff effectively and fast.
That resulted in an easy-to-use Python package called "waze-data-tfx". On top of it, we provided a super detailed walkthrough, usage guides and code templates to our data scientists, so the common DS workflow is: fork, change config, tweak the code a little, deploy.
For reference, this is what a simple waze-data-tfx pipeline looks like:
_DATASET_NAME = 'tfx_examples'
_TABLE_NAME = 'simple_template_data'
_LABEL_KEY = 'label'

_CATEGORICAL_INT_FEATURES = {
    "categorical_calculated": 2,
}
_DENSE_FLOAT_FEATURE_KEYS = ["numeric_feature1", "numeric_feature2"]
_BUCKET_FEATURES = {
    "numeric_feature1": 5,
}
_VOCAB_FEATURES = {
    "categorical_feature": {
        'top_k': 5,
        'num_oov_buckets': 3
    }
}

_TRAIN_BATCH_SIZE = 128
_EVAL_BATCH_SIZE = 128
_NUM_EPOCHS = 250
_TRAINING_ARGS = {
    'dnn_hidden_units': [6, 3],
    'optimizer': tf.keras.optimizers.Adam,
    'optimizer_kwargs': {
        'learning_rate': 0.01
    },
    'layer_activation': None,
    'metrics': ["Accuracy"]
}
_EVAL_METRIC_SPEC = create_metric_spec([
    mse_metric(upper_bound=25, absolute_change=1),
    accuracy_metric()
])
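To give a feel for how constants like these drive the pipeline, here is a rough, pure-Python illustration of how a config walker might map each feature to the transforms it will receive (the helper below is hypothetical and ours, not part of waze-data-tfx):

```python
def summarize_transforms(dense_features, vocab_features, bucket_features):
    """Map each feature name to the list of transforms a pipeline applies.

    Hypothetical helper for illustration only; the real features_transform()
    in waze-data-tfx is internal.
    """
    summary = {}
    for name in dense_features:
        # dense numeric features are typically normalized
        summary.setdefault(name, []).append("scale_to_z_score")
    for name, spec in vocab_features.items():
        # categorical features get a vocabulary with OOV buckets
        summary.setdefault(name, []).append(
            "vocabulary(top_k=%d, oov=%d)"
            % (spec["top_k"], spec["num_oov_buckets"]))
    for name, num_buckets in bucket_features.items():
        # bucketized features are discretized into quantile buckets
        summary.setdefault(name, []).append("bucketize(%d)" % num_buckets)
    return summary


print(summarize_transforms(
    dense_features=["numeric_feature1", "numeric_feature2"],
    vocab_features={"categorical_feature": {"top_k": 5, "num_oov_buckets": 3}},
    bucket_features={"numeric_feature1": 5},
))
```

Note that a single feature (here "numeric_feature1") can legitimately appear in more than one group, e.g. both normalized and bucketized.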
def preprocessing_fn(inputs):
    """tf.transform's callback function for preprocessing inputs.

    Args:
      inputs: map from feature keys to raw not-yet-transformed features.

    Returns:
      Map from string feature key to transformed feature operations.
    """
    outputs = features_transform(
        inputs=inputs,
        label_key=_LABEL_KEY,
        dense_features=_DENSE_FLOAT_FEATURE_KEYS,
        vocab_features=_VOCAB_FEATURES,
        bucket_features=_BUCKET_FEATURES,
    )
    return outputs
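Under the hood, features_transform() applies standard tf.Transform operations. As a rough, pure-Python sketch of what a vocabulary transform with OOV buckets does (the function name here is illustrative; real pipelines use tft.compute_and_apply_vocabulary):

```python
def apply_vocabulary(values, top_k_vocab, num_oov_buckets):
    """Map raw values to integer ids the way a tft vocabulary does.

    In-vocabulary terms get ids 0..len(top_k_vocab)-1; everything else is
    hashed into one of num_oov_buckets ids after the vocabulary range.
    Pure-Python illustration only.
    """
    index = {term: i for i, term in enumerate(top_k_vocab)}
    vocab_size = len(top_k_vocab)
    out = []
    for v in values:
        if v in index:
            out.append(index[v])
        else:
            # out-of-vocabulary: deterministic bucket within this process
            out.append(vocab_size + (hash(v) % num_oov_buckets))
    return out
```

With our config (top_k=5, num_oov_buckets=3), every value maps into 8 possible ids, so downstream embedding or one-hot layers have a fixed, known input space.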
def _build_keras_model(**training_args):
    """Build a Keras model.

    Args:
      **training_args: training configuration, e.g. dnn_hidden_units
        ([int], the layer sizes of the DNN, input layer first).

    Returns:
      A Keras model.
    """
    feature_columns = prepare_feature_columns(
        dense_features=_DENSE_FLOAT_FEATURE_KEYS,
        vocab_features=_VOCAB_FEATURES,
        bucket_features=_BUCKET_FEATURES,
    )
    return _dnn_regressor(
        deep_columns=list(feature_columns.values()),
        dnn_hidden_units=training_args.get("dnn_hidden_units"),
        dense_features=_DENSE_FLOAT_FEATURE_KEYS,
        vocab_features=_VOCAB_FEATURES,
        bucket_features=_BUCKET_FEATURES,
    )
pipeline_run = WazeTFXPipelineOperator(
    dag=dag,
    task_id='pipeline_run',
    model_name='basic_pipeline_template',
    package=tfx_pipeline_basic,
    pipeline_project_id=EnvConfig.get_value('gcp-project-infra'),
    table_project_id=EnvConfig.get_value('gcp-project-infra'),
    project_utils_filename='utils.py',
    gcp_conn_id=gcp_conn_id,
    enable_pusher=True,
)
Simple, right?
When you commit a configuration file to the code base, it gets deployed and sets up continuous training and a full-blown pipeline, including all the TFX and Vertex AI magic: data validation, transforms deployed to Dataflow, monitoring, etc.
We knew we were up to something good when one of our data scientists came back from a long leave and had to use this new framework for a use case. She said that she was able to spin up a full production-ready pipeline in hours, something that before her leave would have taken her weeks to do.
Going forward, there is much we still plan to bake into `waze-data-tfx`. A key advantage we see in having this common infrastructure is that once a feature is added, everyone enjoys it "for free". For example, we plan on adding additional components to the pipeline, such as Infra Validator and Fairness Indicators. Once these are supported, every new or existing ML pipeline will get these components out-of-the-box, no extra code needed.
Additional improvements we are planning are around deployment. We wish to provide deployment quality assurance while automating as much as possible.
One way we are currently exploring doing so is using canary deployments. A data scientist will simply need to configure an evaluation metric and the framework (using Vertex Prediction traffic splitting capabilities and other continuous evaluation magic) would test the new model in production and gradually deploy or rollback according to the evaluated metrics.
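The gating logic we have in mind can be sketched roughly as follows (a simplified illustration in our own names; a real implementation would lean on Vertex Prediction traffic splitting and managed continuous evaluation rather than this hand-rolled function):

```python
def next_canary_split(canary_pct, canary_metric, baseline_metric,
                      max_regression, step_pct=10):
    """Return the canary model's next traffic share, or -1 to roll back.

    Simplified sketch of a metric-gated gradual rollout: if the canary's
    evaluation metric (higher is better here) regresses more than
    max_regression versus the baseline, roll back; otherwise ramp traffic
    up by step_pct until the canary serves 100%.
    """
    if canary_metric < baseline_metric - max_regression:
        return -1  # roll back: shift all traffic to the baseline model
    return min(100, canary_pct + step_pct)
```

A scheduler would call something like this after each evaluation window and update the endpoint's traffic split accordingly, so the data scientist only ever configures the metric and the tolerance.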