Researching Uber's Cadence

May 27 2021 3:10 PM | Software Engineering, Data Engineering | uber, cadence, go, aws, airflow, grpc, thrift

I spent some time doing a preliminary review of Uber's Cadence product. After a minor issue, I was able to get it running locally on my MacBook Pro. Here are my thoughts:

+ Created by the authors of Amazon's Simple Workflow Service (SWF) and written in Go (compiled)
+ Fault-oblivious design
+ Scalable
+ Supports multiple languages for the worker logic: Go and Java through the official clients, with Python, Ruby, and .NET available through community clients
+ Workflows / workers can be written in Go (compiled)
+ Utilizes Apache Thrift / gRPC
+ Handles external-to-internal communication via Signals (better than Airflow)
- Go client samples can be verbose / boilerplate-y (but that's just Go in general)
- Poor Web UI, though you are encouraged to write your own REST API to provide programmatic control over your workflows / workers
- Learning curve may be steep for Python developers unfamiliar with Go
- Cadence community not as established or as large as Airflow community
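
To make the Signals point a bit more concrete, here is a minimal sketch in plain Go. It does not use the Cadence client at all: a channel stands in for the signal, and `approvalWorkflow`, `signalTimeout`, and the "alice" payload are names I made up for illustration. The idea it shows is simply that a running workflow can block until something outside of it pushes a message in, or give up after a timeout.

```go
package main

import (
	"fmt"
	"time"
)

// signalTimeout is an arbitrary value chosen for this sketch.
const signalTimeout = 50 * time.Millisecond

// approvalWorkflow is a hypothetical stand-in for a workflow function.
// It blocks until an external signal arrives or the timeout expires.
func approvalWorkflow(signals <-chan string, timeout time.Duration) string {
	select {
	case who := <-signals:
		return "approved by " + who
	case <-time.After(timeout):
		return "timed out"
	}
}

func main() {
	signals := make(chan string, 1) // buffered so the sender never blocks

	// Something outside the workflow (a REST handler, a CLI, another
	// service) delivers the signal.
	signals <- "alice"

	fmt.Println(approvalWorkflow(signals, signalTimeout))
}
```

In a real deployment the signal would travel through the Cadence service rather than an in-process channel, so the workflow could be waiting on a different machine, or not even be loaded in memory, when the signal arrives.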

I found this blog post to be a good source: Building your first Cadence workflow.

Here are some Cadence vs. Airflow points, quoted from the two sources credited below each excerpt:

Airflow requires you to express workflows in Python, as opposed to Cadence, which is language agnostic.

Airflow seems to compute the graph statically and persist it in a SQLAlchemy-based store (MySQL or Postgres). Cadence just maintains workflow state as mutable state and an append-only list of events stored in Cassandra. Our persistence layer is pluggable, but today we only provide a Cassandra-based persistence implementation.

Cadence provides an at-most-once task dispatch guarantee, meaning either the task is delivered to a worker or it will time out, giving the workflow logic an opportunity to either retry the task or run some compensation logic. I didn't find any details on Airflow around those. My guess is it provides an at-least-once guarantee, meaning all your tasks need to be idempotent.

I didn't see any perf and scale numbers for Airflow, but considering it is backed by MySQL I have a feeling it will have limitations around the number of active workflows and workflow executions per second. Cadence is backed by Cassandra and we are targeting much larger scale, both in terms of the number of active workflows and workflow executions per second.

Airflow seems to have a pretty good experience around managing and visualizing pipelines. Cadence only provides a bare-bones visibility API.

– From https://github.com/uber/cadence/issues/331#issuecomment-324688649
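
The "mutable state plus append-only list of events" idea from the quote above is easier to see in code. This is my own minimal sketch, not Cadence's implementation: the `Event` and `State` types and the event names are hypothetical. It shows how the current state of a workflow can always be rebuilt by replaying its history from the start.

```go
package main

import "fmt"

// Event is one entry in a workflow's append-only history.
// The event types used here are illustrative, not Cadence's real ones.
type Event struct {
	Type     string // "ActivityScheduled", "ActivityCompleted", "WorkflowCompleted"
	Activity string
}

// State is the mutable view derived from the history.
type State struct {
	Pending   map[string]bool // activities scheduled but not yet completed
	Completed []string        // activities completed, in completion order
	Done      bool
}

// Replay rebuilds the state by applying every event in order.
func Replay(history []Event) State {
	s := State{Pending: map[string]bool{}}
	for _, e := range history {
		switch e.Type {
		case "ActivityScheduled":
			s.Pending[e.Activity] = true
		case "ActivityCompleted":
			delete(s.Pending, e.Activity)
			s.Completed = append(s.Completed, e.Activity)
		case "WorkflowCompleted":
			s.Done = true
		}
	}
	return s
}

func main() {
	history := []Event{
		{"ActivityScheduled", "extract"},
		{"ActivityCompleted", "extract"},
		{"ActivityScheduled", "load"},
	}
	s := Replay(history)
	fmt.Println(len(s.Pending), s.Completed, s.Done) // prints: 1 [extract] false
}
```

Because events are only ever appended, a worker that crashes mid-workflow loses nothing: another worker can replay the same history and arrive at the same state.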

Airflow executes a static DAG. The DAG is generated by code but cannot be mutated during execution. Cadence executes the workflow code directly, which gives unlimited possibilities. For example, Airflow has a hard time with the following scenario: read a number of partitions, execute an activity for each partition, wait for 80% of activities to complete, then cancel the rest and execute an upload-results activity. This would be a pretty trivial Cadence workflow to implement.

Airflow has pretty limited scalability. Cadence was built as a highly scalable cloud service from the beginning. It was tested up to 200 million parallel workflows and 10+k events per second. But it can go much higher given more hardware.

Airflow is Python only. Cadence already supports writing workflows and activities in Go and Java. Python and C# clients are under active development. Any other language clients can be added.

I'm biased, but I think programming languages are a much cleaner way to specify complex business logic than DAGs or any other JSON/XML/YAML-based languages.

Cadence supports an unlimited number of internal queues for routing activities to workers. This supports a lot of cool scenarios like routing tasks to specific boxes or pools of machines.

Cadence supports cross-dc (cross region in Amazon terminology) asynchronous replication. So even a complete region outage wouldn't bring down a system.

– From https://www.reddit.com/r/golang/comments/d2vv1p/ubercadence_cadence_is_a_distributed_scalable/f00ewxd
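
The "wait for 80%, cancel the rest" scenario from the quote above is worth sketching, because it shows why ordinary code is a comfortable way to express this kind of logic. The following is plain Go with goroutines and `context`, not Cadence workflow code: `processPartition` and `runUntilQuorum` are hypothetical names, and the sleep stands in for real work. A real workflow would use the client library's own primitives so that progress survives a crash.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// processPartition is a hypothetical activity. It reports true if it
// finished and false if it was cancelled first. Higher ids are slower.
func processPartition(ctx context.Context, id int) bool {
	select {
	case <-time.After(time.Duration(id) * 5 * time.Millisecond):
		return true
	case <-ctx.Done():
		return false
	}
}

// runUntilQuorum starts one goroutine per partition, waits until at least
// percent% of them have finished, cancels the rest, and reports the counts.
func runUntilQuorum(partitions, percent int) (completed, cancelled int) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	needed := (partitions*percent + 99) / 100 // round up
	if needed > partitions {
		needed = partitions // guard against percent > 100
	}
	results := make(chan bool, partitions) // buffered: workers never block on send

	var wg sync.WaitGroup
	for id := 1; id <= partitions; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			results <- processPartition(ctx, id)
		}(id)
	}

	// Wait for the quorum. Nothing is cancelled yet, so every result is true.
	for completed < needed {
		if <-results {
			completed++
		}
	}

	// Cancel the stragglers, wait for them to exit, then count what is left.
	cancel()
	wg.Wait()
	close(results)
	for ok := range results {
		if ok {
			completed++ // finished just before the cancel landed
		} else {
			cancelled++
		}
	}
	return completed, cancelled
}

func main() {
	completed, cancelled := runUntilQuorum(10, 80)
	fmt.Printf("completed=%d cancelled=%d, now uploading results\n", completed, cancelled)
}
```

The exact split between completed and cancelled can vary by a partition or so depending on timing, but at least 80% will always have completed before the upload step runs.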

I hope this helps!