Why and How We Chose Kafka

Yaniv Preiss
8 min read · Nov 1, 2017
Franz Kafka, 1910 (source)

The requirement:

We’re a small startup company.

We had a booking system for bus tickets, running in production. It was a monolith with an invoicing component inside it.

A whole new booking system was about to be deployed.

We were asked to build a new invoicing component that could handle both systems, deprecating the old invoicing component. There was a deadline, due to the legal invoicing period and the already announced “go live” of the new booking system.

The invoicing component would need information about bookings, cancellations and providers of different kinds.

Option 1 — Same Database

Build the new component as a new application, but have it use the existing databases of the old and new systems to read the bookings and cancellations. This is easy, as all the information is already accessible there.

Why is this so bad?

The booking systems and the new invoicing component would be coupled through the databases, making deployments and migrations difficult and forcing changes onto components that don’t need them, just to keep things from breaking.

Option 2 — API Calls

The new component is a new application with its own database, communicating with the booking systems via API calls.

This would mean that the booking systems need to notify the new component about bookings, cancellations and changes to the providers.

Why is this so bad?

The booking systems would have to know the new component explicitly. As more components are added, they would need to know all of them, and the flow would become slow due to making several HTTP requests per event. (We could use a different protocol, but the calls would still accumulate.)

Moreover, when the interface of the new component changes, we need to update the booking systems. This is not so cumbersome if done correctly with versioned API endpoints (i.e. the monolith uses the v1 endpoints of invoicing until invoicing v2 is deployed).
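Versioning can be as simple as routing by path prefix. A minimal sketch of the idea in Ruby; the paths, payload fields and handlers here are illustrative, not taken from the actual system:

```ruby
# Version-aware dispatch: the booking system keeps calling the v1 path
# until the v2 endpoints are deployed and adopted. All names are made up
# for illustration.
HANDLERS = {
  "/v1/invoices" => ->(payload) { "v1 invoice for booking #{payload[:booking_id]}" },
  "/v2/invoices" => ->(payload) { "v2 invoice for booking #{payload[:booking_id]} (#{payload[:currency]})" }
}

def handle(path, payload)
  handler = HANDLERS[path] or raise "unknown endpoint: #{path}"
  handler.call(payload)
end
```

Old callers keep working against `/v1/invoices` while new callers move to `/v2/invoices`, so the two systems can migrate independently.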

Option 3 — Distributed System

The booking systems will notify a single “someone” about new bookings, cancellations, and changes to the providers, and whoever is interested, in this case, the new invoicing component, will “listen” and update its own data. At the heart of this architecture is a messaging solution.

The benefit is decoupling of the architectural components, which also eases deployments.

My team agreed internally that messaging was the right approach. We presented our opinion to the CTO, who agreed with our “revolutionary” idea and listed the criteria for the messaging solution:

Criteria

  • Be able to handle a few thousand messages per day, with the option to scale (handle more) in the future as the business grows
  • Produce messages fast or asynchronously, so that the front-end server is relieved quickly; if messaging is down we do not want to fail the booking request
  • Development effort (learning curve, implementation, DevOps) is not huge
  • Delivery time of a message is up to a few minutes
  • Simplicity of the overall architecture after it has been implemented — i.e. should be easy to add or change components
  • Budget constraints — purchasing or paying for cloud machines, developing, maintaining and licenses
  • DevOps need to feel comfortable with the solution
  • Integration with Ruby, up-to-date and maintained gems
  • 100% reliability — can be “at least once”; we don’t care about the order of messages and will make sure consumers are idempotent (“idempotent” means that the result of a successfully performed request is independent of the number of times it is executed)
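Since delivery can be “at least once”, the same message may arrive twice, and an idempotent consumer makes redelivery harmless. A minimal Ruby sketch, with an in-memory set standing in for what would be a persistent store (all names are illustrative):

```ruby
require "set"

# Idempotent consumer: each message carries a unique id, and processing
# is skipped if that id was already seen, so redelivery changes nothing.
class IdempotentConsumer
  def initialize
    @seen = Set.new   # in production: a DB table with a unique index
    @invoices = []
  end

  attr_reader :invoices

  def handle(message)
    return unless @seen.add?(message[:id])  # Set#add? returns nil on duplicates
    @invoices << "invoice for booking #{message[:booking_id]}"
  end
end
```

Handling the same message twice produces exactly one invoice, which is what lets us tolerate at-least-once delivery.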

How to Choose a Solution

1. List all possible solutions

2. Write pros and cons for each

3. Eliminate the obvious no-gos

4. Evaluate remaining solutions and choose the winner

Possible Solutions

“Invented Here” using current architecture

As a business event occurs in the booking systems, a message is saved in the current database; a background worker sends it to the invoicing component at a fixed interval, then deletes it from the current storage upon success.
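This is essentially a hand-rolled “outbox”. A sketch of the worker loop, with an in-memory array standing in for the database table and a pluggable sender standing in for the HTTP call to invoicing (all names are illustrative):

```ruby
# Poor man's outbox worker: drain pending messages, send each one, and
# delete from the outbox only after the send succeeds, so failures are
# retried on the next interval.
class OutboxWorker
  def initialize(outbox, sender)
    @outbox = outbox   # stands in for a DB table of pending messages
    @sender = sender   # stands in for an HTTP client calling invoicing
  end

  def run_once
    @outbox.dup.each do |message|
      begin
        @sender.call(message)
        @outbox.delete(message)  # delete only on success
      rescue StandardError
        # leave the message in place; it will be retried next run
      end
    end
  end
end
```

Note that even this simple loop forces us to reimplement delivery guarantees and retries in both booking systems, which is exactly the “poor man’s messaging” con listed below.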

Pros

  • Simple and quick to build
  • Does not require extra machines for messaging

Cons

  • Couples the current apps with the invoicing component
  • We have to build and test our own “poor man’s” messaging solution in both booking systems
  • Does not scale very well — one request per single booking/cancellation
  • Does not help setting a common ground for “big picture” architecture

“Invented here” — using S3

After every business event in the booking systems, a message is saved in their database, and a worker uploads the data to S3 buckets. The invoicing component reads it at a fixed interval and deletes it from the storage upon success.

For example, upload all “last hour” messages to S3; the invoicing component downloads and processes them every hour.
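The hourly batching amounts to both sides agreeing on a key naming scheme they can compute independently. A sketch of such a scheme; the bucket prefix and layout are made up for illustration:

```ruby
require "time"

# Both producer and consumer derive the same S3 object key from the hour,
# so the consumer knows exactly which batch to download and delete.
# The "invoicing-events" prefix is illustrative.
def hourly_batch_key(time)
  "invoicing-events/#{time.utc.strftime('%Y/%m/%d/%H')}.json"
end
```

The producer uploads under `hourly_batch_key(Time.now)`, and an hour later the consumer fetches that same key, with no direct communication between the two.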

Pros

  • Does not require extra machines for messaging, just create an S3 bucket

Cons

  • We have to build and test our own “poor man’s” messaging solution
  • Does not help setting a common ground for “big picture” architecture

Beanstalk

Create “tubes”, place jobs in them, and build workers that pick them up, placing failed jobs in “failing” tubes for retries by other workers.

Pros

  • Has retry tubes (but we need to build it)

Cons

  • Not maintained anymore (since 2014)
  • Need to know the invoicing interface
  • Doesn’t scale (no clustering)

Sidekiq

Background job mechanism: upon a business event in the booking systems, they place a job in the work queue. The job makes API calls in the background to the new invoicing component.

Pros

  • Easy to implement
  • We already had Sidekiq implemented for other features

Cons

  • Need to know the invoicing interface
  • Doesn’t have a solid retry mechanism
  • Not 100% reliable — jobs are held in memory

Amazon SQS

Place messages in a cloud queue upon business events, to be picked up by the new invoicing component.

Pros

  • Decouples booking systems from the invoicing component

Cons

  • On-premise components would need to communicate with cloud components, which require maintenance (unlike S3)
  • No AWS or SQS knowledge in the company

Amazon Kinesis

Place messages in a cloud stream upon business events, to be picked up by the new invoicing component.

Pros

  • Decouples booking systems from the invoicing component

Cons

  • On-premise components would need to communicate with cloud components, which require maintenance (unlike S3)
  • No AWS or Kinesis knowledge in the company

RabbitMQ

A messaging solution.

Place messages in a message queue upon business events, to be picked up by the new invoicing component.

Pros

  • Components are decoupled
  • Reliable and resilient due to clustering
  • Throughput of 20k+/sec messages
  • HTTP-API, command line tool, and UI for managing and monitoring
  • A lot of tools/plugins https://www.rabbitmq.com/devtools.html
  • Messages placed onto the queue are stored until the consumer retrieves them
  • Could use CloudAMQP (hosted RabbitMQ as a service) and then not worry about availability problems
  • Up to date Ruby library
  • ACK in client and in server for reliability

Cons

  • Supports complicated routing scenarios (e.g. content based) that we don’t need, adding complexity
  • No experience with Erlang eco-system
  • Pay more fees if we use CloudAMQP (https://www.cloudamqp.com/plans.html)

Apache Kafka

A messaging solution.

Place messages in a message log upon business events (a.k.a. “produce”), to be picked up by the new invoicing component (a.k.a. “consume”).

Set up a Kafka cluster (including a ZooKeeper cluster to manage the zoo).

Components will be decoupled from each other; they will communicate by producing messages and consuming them. Producers and consumers do not know each other.

Kafka supports several serialization formats, e.g. MsgPack and Avro. Avro, for example, requires a message schema; thus all messages that are produced are guaranteed to be valid. If a message is invalid, an exception is raised, and either way the invalid message is never produced. Serialization formats without schema validation may carry invalid data and tend to become a mess. The ones with a schema act as the de facto API between the components.
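In our setup the real validation is done by Avro against a registered schema; the sketch below stands in for that idea with a hand-rolled required-fields check, so the field names and types are illustrative:

```ruby
# Stand-in for Avro schema validation: a message is only "produced" if it
# matches the agreed schema; otherwise an exception is raised and nothing
# invalid ever reaches the log. Real Avro does far more (types, evolution).
BOOKING_SCHEMA = { booking_id: Integer, amount_cents: Integer, currency: String }

class SchemaViolation < StandardError; end

def validate!(message, schema = BOOKING_SCHEMA)
  schema.each do |field, type|
    value = message[field]
    raise SchemaViolation, "#{field} must be a #{type}" unless value.is_a?(type)
  end
  message
end
```

Because producers raise before writing anything invalid, consumers can trust every message they read, which is what makes the schema the de facto API.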

Kafka guarantees “at least once” delivery (the default setup) — we need to make sure that our consumers are idempotent.

Kafka has “topics” to produce messages into, “partitions” within each topic, “producers” that write messages, “consumers” that read them, and “consumer groups” that share the consuming work. Kafka replicates each partition for resilience (configurable).
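The reason message keys matter: equal keys always map to the same partition, so per-booking ordering is preserved within that partition. A simplified stand-in for Kafka’s partitioner (Kafka’s default uses murmur2 hashing; CRC32 here just illustrates the property):

```ruby
require "zlib"

# Simplified key-based partitioner: a stable hash of the key modulo the
# partition count. Equal keys always land in the same partition, so all
# events for one booking are consumed in order by one consumer.
def partition_for(key, num_partitions)
  Zlib.crc32(key) % num_partitions
end
```

This is also why we can say we “don’t care about order of messages” globally: order only needs to hold per key, and partitioning gives us that for free.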

Kafka eco-system is Scala/Java, thus requires JVM.

Pros

  • Components are decoupled
  • Reliable and resilient due to clustering
  • Scalable — will support huge traffic (a few million messages per minute) depending on the setup, both for producing and consuming
  • Can batch messages for efficiency
  • Will allow easily adding a BI solution in the future (e.g. Hadoop) with Kafka Connectors
  • Allows replaying of messages
  • Some vendors offer cloud Kafka, paid per use

Cons

  • Requires setting up more machines for resilience — ZooKeeper cluster, Schema Registry for schema validation (not mandatory) and Kafka instances. This means more money on infrastructure and more time to install, configure and maintain
  • May be an overkill for low traffic
  • No DevOps knowledge in the company

As reliability was critical, we decided to go for battle proven tools, even though it would have been much quicker to build some naïve mechanism.

We were left with Kafka and RabbitMQ.

How to Choose Between the Two?

Hold a competition! The competition, a.k.a. PoC, was not to prove that Kafka or RabbitMQ work (we know they do), but to see how easy or difficult each was to set up and work with.

The PoC was conducted by two developers, one for Kafka and one for RabbitMQ, excluding any DevOps work, as both solutions would require more or less the same resources.

The PoC:

  • Not more than one day
  • Set up local Kafka/RabbitMQ on docker with all required dependencies
  • Implement a producer (i.e. place a few thousand messages)
  • Implement a consumer (i.e. read a few thousand messages)
  • Define and register a schema
  • Implement a schema-aware producer
  • Implement a schema-aware consumer
  • Demonstrate on a local machine and share learnings

Result:

  • Both Kafka and RabbitMQ had up-to-date Ruby clients
  • Both were easy to write and the code was readable
  • Both are messaging solutions that don’t care what message is passed. A complementary solution is required for the schema validation. Kafka’s eco-system includes Schema Registry, a schema validation component built by Confluent.
  • Both solutions were running end-to-end within a few hours

The Winner

The team recommended using Kafka for the following reasons:

  • RabbitMQ has a more complex routing mechanism (e.g. content based) that we do not need and we actually consider an anti-pattern for our use case
  • Kafka’s eco-system is JVM based, and we’re more comfortable with it than with RabbitMQ’s Erlang eco-system, so we would be quicker
  • Schema Registry, which handles all the schema awareness on top, is made by Confluent, a company that builds tooling for Kafka, and it plays nicely with it; there is a gem that deals with it automatically, instead of us implementing it ourselves, and we got it working quickly during the PoC. Schemas are versioned, so we can do gradual deployments when changing a message schema
  • Some engineers already had some experience and background about Kafka and tooling
  • As a bonus, Kafka can handle huge future throughput

Implementation

DevOps built a Kafka cluster with ZooKeeper in all environments.

My team built the consumers and producers and finished the project on time.

The learning curve did not prove to be big, and Kafka has become another “tool in the toolbox” for the engineers involved (yes, they added it to their skills on LinkedIn).

Other teams have started using Kafka, and new components make use of it. It has become the default communication method where applicable, and we’re considering moving more things to it, for example metrics. Kafka has become a crucial architectural component, which makes it easy to communicate between components in a decoupled manner.
