Why Karrot Uses TFX, and How to Improve Productivity on ML Pipeline Development
May 06, 2022
Posted by Ukjae Jeong, Gyoung-yoon Yoo, and Myeonghyeon Song from Karrot

Karrot (the global service of Danggeun Market in Korea) is a local community service app that connects neighbors based on a secondhand marketplace. Danggeun Market was launched in 2015, and over 23 million people in Korea are using Danggeun Market in their local communities. Currently, Karrot is operated in 440 local communities in four countries: the U.K., the U.S., Canada, and Japan. In our service, scrolling through feeds to find inexpensive and useful items has become a daily pleasure for users. For better user experiences, we’ve been applying several machine learning models such as recommendation models.

We are also working on ways to effectively and efficiently apply ML models. In particular, we’re putting lots of effort into building machine learning pipelines for periodic deployment, rapid experiments, and continuous model improvement.

For the ML pipelines, we’ve been using TFX (TensorFlow Extended) for production. So in this article, we will briefly introduce why we use TFX, and how we utilize TFX to improve productivity.

Machine Learning in Karrot

There are many ML projects inside Karrot, and ML models run inside our services. For example, we use automation models to detect fraud, and recommendation models to improve the user experience on our app feed. If you are interested in detailed descriptions of the models, please refer to our team blogs, which are written in Korean.

We had been using Kubeflow for our ML models, so we were already able to periodically train, experiment with, and deploy models, but we still had some pain points. When we started to use TFX with Kubeflow last year, TFX pushed this line further and made it easy for the team to work with our production pipelines.

How TFX helps us with production ML

TFX helps us build and deploy production ML pipelines easily, thanks to its open and extensible design.

TFX, completely open-sourced in 2019, is an end-to-end platform for production ML pipelines. It supports writing ML workflows as components, which can then be run in multiple environments - Apache Beam, Dataflow, Kubeflow, and Airflow. It also comes with well-written standard components for data ingestion/transformation, training, and deployment.

Standard Components

TFX provides several standard components. For data ingestion, there are CsvExampleGen for local CSV files, and PrestoExampleGen and BigQueryExampleGen for reading directly from Presto and BigQuery; many other sources can be covered with some customization. So you can easily process data from multiple sources just by connecting pre-built components to your TFX pipelines.
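For instance, swapping data sources is mostly a matter of swapping the ExampleGen component. Here is a minimal sketch assuming a TFX 1.x-style API; the file path and query are placeholders:

# Ingest training data from local CSV files...
from tfx import v1 as tfx

example_gen = tfx.components.CsvExampleGen(input_base="/path/to/csv_dir")  # placeholder path

# ...or read directly from BigQuery with the GCP extension component.
from tfx.extensions.google_cloud_big_query.example_gen.component import BigQueryExampleGen

example_gen = BigQueryExampleGen(query="SELECT * FROM `project.dataset.table`")  # placeholder query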

It can also handle large-scale data processing smoothly. Since the Transform component that performs feature engineering is implemented on Apache Beam, you can execute it on GCP Dataflow or another compute cluster in a distributed manner.
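As a rough illustration, the feature engineering itself is just a preprocessing_fn written with TensorFlow Transform; the feature names below are hypothetical, but the same function runs unchanged whether the Beam job executes locally or on Dataflow.

# preprocessing_fn sketch: Transform runs this over Apache Beam, so the same
# code scales out on Dataflow without modification.
import tensorflow_transform as tft


def preprocessing_fn(inputs):
    outputs = {}
    # Hypothetical feature names, for illustration only.
    outputs["price_scaled"] = tft.scale_to_z_score(inputs["price"])
    outputs["category_id"] = tft.compute_and_apply_vocabulary(inputs["category"])
    return outputs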

Of course, many other convenient components exist and are added constantly.

Component Reusability

To adapt TFX to our product, we need custom components. TFX has a well-structured component design that lets us create custom components naturally and connect them to existing TFX pipelines easily. A simple Python function or container can be turned into a TFX component, or you can write a whole component in exactly the same way the standard components are written. For more details, check out the custom component guide.
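As a rough sketch of the function-based flavor (the component name and its logic are made up for illustration, and exact import paths can vary by TFX version):

from tfx import v1 as tfx


@tfx.dsl.components.component
def DataValidatorComponent(
    examples: tfx.dsl.components.InputArtifact[tfx.types.standard_artifacts.Examples],
    threshold: tfx.dsl.components.Parameter[float],
) -> None:
    # Hypothetical logic: a plain Python function becomes a pipeline component
    # just by annotating its inputs and applying the decorator.
    print(f"Validating examples at {examples.uri} with threshold={threshold}")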

To spread these benefits and enhance our productivity, we share custom components that cover similar use cases across our ML pipelines as an internal library at Karrot Market.

Various Runners are Supported

TFX is compatible with a variety of environments. It can run locally on your laptop or on Dataflow, GCP's data processing service compatible with Apache Beam. You can also visualize the outputs by running each component manually in a Jupyter Notebook. TFX supports Kubeflow and Vertex AI as well, which was recently released with new features. So the pipeline code is written once and can then be run almost anywhere, letting us set up development, experiment, and deployment environments at once. For that reason, the burden of deploying models to production was significantly reduced by using TFX for our services.
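As a sketch of what that looks like in practice (assuming a create_pipeline factory like the one in the template's pipeline.py shown later, and TFX 1.x import paths):

from tfx import v1 as tfx

pipeline = create_pipeline(config)  # hypothetical factory returning a tfx.dsl.Pipeline

# Run locally while developing and debugging...
tfx.orchestration.LocalDagRunner().run(pipeline)

# ...or compile the same definition for Kubeflow Pipelines.
tfx.orchestration.experimental.KubeflowDagRunner().run(pipeline)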

Technical lessons

As we set up our ML pipelines with TFX, our code quality and model development experience have improved.

However, there were difficulties. We didn't have a uniform project structure or shared best practices within the team, perhaps because TFX itself is relatively new and we had been using it since before version 1. It became harder to understand the code and start contributing. As the pipelines grew larger and more complex, it got harder to understand the meaning of custom components, their corresponding config values, and their dependencies. In addition, it was difficult to introduce some of the latest features to the team.

Improving the Development Experience

We decided to create and use a template for TFX pipelines to make it easier to understand each other's code, implement pipelines with the same pattern, and share know-how with each other. We merged components frequently used in Karrot and put them in a shared library so that ML pipelines can be developed very quickly.

It was expected that the template would accelerate the development of new projects. In addition, as mentioned above, we expected that each project would have a similar structure, making it easier to understand each other's projects.

So far, we have briefly introduced the template project. Here are some of the considerations we made to get more out of TFX in it.

Configuration first

We put configuration first: reading a pipeline's configuration should be enough to understand how it works. If specific settings are easy to understand, we can set up various experiments and take them all the way to A/B testing.

example_gen_config.proto, written in Protocol Buffers (Protobuf), defines the configuration schema; config.pbtxt holds the values, and pipeline.py builds the pipeline from them.

// config.pbtxt
example_gen_config {
    big_query_example_gen_config {
        query: "# query for example gen"
    }

    ...
}

...
// example_gen_config.proto
message ExampleGenConfig {
    oneof config {
        BigQueryExampleGenConfig big_query_example_gen_config = 1;
        CsvExampleGenConfig csv_example_gen_config = 2;
    }

    ...
}

// When BigQueryExampleGen is used
message BigQueryExampleGenConfig {
    optional string query = 1;
}

// When CsvExampleGen is used
message CsvExampleGenConfig {
    optional string input_base = 1;
}

# pipeline.py
from configs import example_gen_config_pb2  # hypothetical import path for the generated module


def create_pipeline(config):
    ...
    example_gen = _create_example_gen(config.example_gen_config)
    ...


def _create_example_gen(config: example_gen_config_pb2.ExampleGenConfig):
    ...

    # Each component is built purely from its own config message.
    if config.HasField("big_query_example_gen_config"):
        ...
        return ...

    if config.HasField("csv_example_gen_config"):
        ...
        return ...

    raise ...

All configuration of ExampleGen is determined by a single ExampleGenConfig message. Similarly, every pipeline component depends only on its config and is created from it. This way, you can understand the structure of the pipeline just by looking at the configuration file. Separating out the part that creates each component also makes customization and code comprehension easier.

For example, suppose the Transform component needs to support various data processing methods so that we can experiment with data transformations later. Say you want to add a data augmentation step to the Transform component: you do that by adding a config related to the data augmentation function. In the same way, you can extend the predefined Protobuf specification to easily support multiple processing methods and make it obvious which one is in use.
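Here is a hypothetical sketch of that pattern, in the same style as pipeline.py above; the augmentation field and the TransformConfig message are illustrative, not our actual code:

from tfx import v1 as tfx


def _create_transform(config, example_gen, schema_gen):
    # `config` is a hypothetical TransformConfig message. The component is still
    # built purely from its config, so new behavior just means a new config field.
    module_file = config.module_file

    if config.HasField("augmentation"):
        # Use a preprocessing module that also applies augmentation.
        module_file = config.augmentation.module_file

    return tfx.components.Transform(
        examples=example_gen.outputs["examples"],
        schema=schema_gen.outputs["schema"],
        module_file=module_file,
    )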

Managing Configs with Protobuf

Looking at the example code above, some people may wonder why we use Protobuf as a configuration tool. There are several reasons, which we will explain by comparing it with YAML, one of the most common choices for configuration.

First, Protobuf has a robust interface, and validation such as type checking is convenient. There is no need to check whether a field is defined, since Protobuf specifies the object structure in advance. Its support for backward/forward compatibility is also useful in a project under active development.
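For example, loading config.pbtxt goes through the generated message class, so mistakes surface at parse time. A minimal sketch, where the pipeline_config_pb2 module and PipelineConfig message names are hypothetical:

from google.protobuf import text_format

from configs import pipeline_config_pb2  # hypothetical generated Protobuf module

with open("config.pbtxt") as f:
    config = text_format.Parse(f.read(), pipeline_config_pb2.PipelineConfig())

# Typed access: a misspelled or mistyped field in config.pbtxt fails at parse
# time instead of silently returning nothing, as an untyped YAML/dict lookup might.
query = config.example_gen_config.big_query_example_gen_config.query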

Also, you can easily see the pipeline structure. YAML is hierarchical too, but with Hydra, which is often used in the machine learning ecosystem, the settings for each stage (e.g. production, dev, alpha) are split across several files, so we felt that Protobuf offers better stability and visibility.

If you use Protobuf as your configuration tool, you can also reuse many of the Protobuf definitions that TFX itself defines.

TensorFlow Ecosystem with Bazel

Bazel is a language-independent build system that is easy to extend and supports a variety of languages and tools. From simple projects to large projects using multiple languages and tools, it can be used quickly and concisely in most situations. For more information, please refer to Bazel Vision on the Bazel documentation page.

Using Bazel in a Python project is an uncommon choice, but we used Bazel as the build system for the TFX template project. Here is a brief explanation of why.

First of all, it works really well with Protobuf. Because Bazel is a language-independent build system, you can easily wire Protobuf build artifacts in as dependencies of other builds without worry. In addition, the Protocol Buffers repository itself uses Bazel, so it is easy to integrate into a Bazel-based project.
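As a rough sketch of that wiring (load() statements are omitted, and rule names and attributes vary by ruleset and workspace setup; this is not our actual BUILD file):

# BUILD (sketch): the Protobuf schema, its generated Python module, and the
# pipeline code are just Bazel targets depending on one another.
proto_library(
    name = "example_gen_config_proto",
    srcs = ["example_gen_config.proto"],
)

py_proto_library(
    name = "example_gen_config_py_pb2",
    deps = [":example_gen_config_proto"],
)

py_library(
    name = "pipeline",
    srcs = ["pipeline.py"],
    deps = [":example_gen_config_py_pb2"],
)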

The second reason is the special environment of the TensorFlow ecosystem. Many projects in the TensorFlow ecosystem use Bazel, and TFX also uses Bazel, so you can easily link builds with other projects (TensorFlow, TFX) using Bazel.

Internal Custom TFX Modules

As mentioned before, we’ve been building an internal library for the custom TFX modules (especially the custom components) that are frequently used across multiple projects. Anyone in Karrot can add their components and share them with the team.

For example, we are using ArgoCD to manage applications (e.g. TF Serving) in Kubernetes clusters, so if someone develops a component for deploying with ArgoCD, we can easily share it via the internal library. The library now contains several custom modules that improve our team's productivity.
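A hypothetical sketch of what reusing such a shared component looks like from a pipeline's point of view (the package, component, and application names are made up):

from karrot_tfx.components import ArgoCDPusher  # hypothetical internal library

# `trainer` is the upstream Trainer component of the pipeline.
pusher = ArgoCDPusher(
    model=trainer.outputs["model"],
    argocd_app="feed-recommender-serving",  # hypothetical ArgoCD application
)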

We can share our custom features as an internal library largely thanks to TFX's modular structure. This has made it easy to improve the team's overall productivity: we can reuse most of the components developed across several projects and start new projects very quickly.

Conclusion

TFX provides lots of great features for developing production ML pipelines. We're using TFX on Kubeflow to develop, experiment with, and deploy our ML pipelines in a better way, and it brings us many benefits, so we decided to share how we are using TFX in this blog post.

To learn more about Karrot, check out our website (Korea, US, and Canada). To learn more about TFX, check out the TFX documentation page.
