Why and How We Chose Kafka

Yaniv Preiss
8 min read · Nov 1, 2017
Franz Kafka, 1910 (source)

The requirement:

We’re a small startup company.

We had a booking system for bus tickets, running in production. It was a monolith with an invoicing component inside it.

A whole new booking system was about to be deployed.

We were asked to build a new invoicing component that could handle both systems, deprecating the old invoicing component. There was a deadline, due to the legal invoicing period and the already announced “go live” of the new booking system.

The invoicing component would need information about bookings, cancellations and providers of different kinds.

Option 1 — Same Database

Build the new component as a new application, but have it use the existing databases of the old and new systems to read the bookings and cancellations. This is easy, as all the information is already accessible there.

Why is this so bad?

The booking systems and the new invoicing component would be coupled through the databases, making deployments and migrations difficult and forcing changes onto components that don’t need them, just to keep things from breaking.

Option 2 — API Calls

The new component is a new application with its own database, communicating with the booking systems via API calls.

This would mean that the booking systems need to notify the new component about bookings, cancellations and changes to the providers.

Why is this so bad?

The booking systems would have to know the new component explicitly. As more components are added, they would need to know all of them, and the flow would become slow due to making several HTTP requests per event. (We could use a different protocol, but the calls would still accumulate.)

Moreover, when the interface of the new component changes, we need to update the booking systems. This is not so cumbersome if done correctly with versioned API endpoints (i.e. the monolith uses the v1 endpoints of invoicing until invoicing v2 is deployed).
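Versioning can be as simple as routing by path prefix. A minimal sketch of the idea in Ruby; the paths, payload fields and handlers here are illustrative, not taken from the actual system:

```ruby
# Version-aware dispatch: the booking system keeps calling the v1 path
# until the v2 endpoints are deployed and adopted. All names are made up
# for illustration.
HANDLERS = {
  "/v1/invoices" => ->(payload) { "v1 invoice for booking #{payload[:booking_id]}" },
  "/v2/invoices" => ->(payload) { "v2 invoice for booking #{payload[:booking_id]} (#{payload[:currency]})" }
}

def handle(path, payload)
  handler = HANDLERS[path] or raise "unknown endpoint: #{path}"
  handler.call(payload)
end
```

Old callers keep working against `/v1/invoices` while new callers move to `/v2/invoices`, so the two systems can migrate independently.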

Option 3 — Distributed System

The booking systems will notify a single “someone” about new bookings, cancellations, and changes to the providers, and whoever is interested, in this case, the new invoicing component, will “listen” and update its own data. At the heart of this architecture is a messaging solution.

The benefit is decoupling of the architectural components, which also eases deployments.

My team agreed internally that messaging was the right approach. We presented our opinion to the CTO, who agreed with our “revolutionary” idea and listed the criteria for the messaging solution:

Criteria

  • Be able to handle a few thousand messages per day, with the option to scale (handle more) in the future as the business grows
  • Produce messages fast or asynchronously, so that the front-end server is relieved quickly; if messaging is down we do not want to fail the booking request
  • Development effort (learning curve, implementation, DevOps) is not huge
  • Delivery time of a message is up to a few minutes
  • Simplicity of the overall architecture after it has been implemented — i.e. should be easy to add or change components
  • Budget constraints — purchasing or paying for cloud machines, developing, maintaining and licenses
  • DevOps need to feel comfortable with the solution
  • Integration with Ruby, up-to-date and maintained gems
  • 100% reliability — can be “at least once”; we don’t care about the order of messages and will make sure consumers are idempotent (“idempotent” means that the result of a successfully performed request is independent of the number of times it is executed)
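Since delivery can be “at least once”, the same message may arrive twice, and an idempotent consumer makes redelivery harmless. A minimal Ruby sketch, with an in-memory set standing in for what would be a persistent store (all names are illustrative):

```ruby
require "set"

# Idempotent consumer: each message carries a unique id, and processing
# is skipped if that id was already seen, so redelivery changes nothing.
class IdempotentConsumer
  def initialize
    @seen = Set.new   # in production: a DB table with a unique index
    @invoices = []
  end

  attr_reader :invoices

  def handle(message)
    return unless @seen.add?(message[:id])  # Set#add? returns nil on duplicates
    @invoices << "invoice for booking #{message[:booking_id]}"
  end
end
```

Handling the same message twice produces exactly one invoice, which is what lets us tolerate at-least-once delivery.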

How to Choose a Solution

1. List all possible solutions

2. Write pros and cons for each

3. Eliminate the obvious no-gos

4. Evaluate remaining solutions and choose the winner

Possible Solutions

“Invented Here” using current architecture

As a business event occurs in the booking systems, a message is saved in the current database; a background worker sends it to the invoicing component at a fixed interval, then deletes it from the current storage upon success.
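This is essentially a hand-rolled “outbox”. A sketch of the worker loop, with an in-memory array standing in for the database table and a pluggable sender standing in for the HTTP call to invoicing (all names are illustrative):

```ruby
# Poor man's outbox worker: drain pending messages, send each one, and
# delete from the outbox only after the send succeeds, so failures are
# retried on the next interval.
class OutboxWorker
  def initialize(outbox, sender)
    @outbox = outbox   # stands in for a DB table of pending messages
    @sender = sender   # stands in for an HTTP client calling invoicing
  end

  def run_once
    @outbox.dup.each do |message|
      begin
        @sender.call(message)
        @outbox.delete(message)  # delete only on success
      rescue StandardError
        # leave the message in place; it will be retried next run
      end
    end
  end
end
```

Note that even this simple loop forces us to reimplement delivery guarantees and retries in both booking systems, which is exactly the “poor man’s messaging” con listed below.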

Pros

  • Simple and quick to build
  • Does not require extra machines for messaging

Cons

  • Couples the current apps with the invoicing component
  • We have to build and test our own “poor man’s” messaging solution in both booking systems
  • Does not scale very well — one request per single booking/cancellation
  • Does not help setting a common ground for “big picture” architecture

“Invented here” — using S3

After every business event in the booking systems, a message is saved in their database, and a worker uploads the data to S3 buckets. The invoicing component reads it at a fixed interval and deletes it from the storage upon success.

For example, upload all “last hour” messages to S3; the invoicing component downloads and processes them every hour.
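The hourly batching amounts to both sides agreeing on a key naming scheme they can compute independently. A sketch of such a scheme; the bucket prefix and layout are made up for illustration:

```ruby
require "time"

# Both producer and consumer derive the same S3 object key from the hour,
# so the consumer knows exactly which batch to download and delete.
# The "invoicing-events" prefix is illustrative.
def hourly_batch_key(time)
  "invoicing-events/#{time.utc.strftime('%Y/%m/%d/%H')}.json"
end
```

The producer uploads under `hourly_batch_key(Time.now)`, and an hour later the consumer fetches that same key, with no direct communication between the two.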

Pros

  • Does not require extra machines for messaging, just create an S3 bucket

Cons

  • We have to build and test our own “poor man’s” messaging solution
  • Does not help setting a common ground for “big picture” architecture

Beanstalk

Create “tubes”, place jobs in them, and build workers that pick them up, placing failed jobs in “failing” tubes for retries by other workers.

Pros

  • Has retry tubes (but we need to build it)

Cons

  • Not maintained anymore (since 2014)
  • Need to know the invoicing interface
  • Doesn’t scale (no clustering)

Sidekiq

Background job mechanism: upon a business event in the booking systems, they place a job in the work queue. The job makes API calls in the background to the new invoicing component.

Pros

  • Easy to implement
  • We already had Sidekiq implemented for other features

Cons

  • Need to know the invoicing interface
  • Doesn’t have a solid retry mechanism
  • Not 100% reliable — jobs are held in memory

Amazon SQS

Place messages in a cloud queue upon business events, to be picked up by the new invoicing component.

Pros

  • Decouples booking systems from the invoicing component

Cons

  • On-premise components would need to communicate with cloud components, which require maintenance (unlike S3)
  • No AWS or SQS knowledge in the company

Amazon Kinesis

Place messages in a cloud stream upon business events, to be picked up by the new invoicing component.

Pros

  • Decouples booking systems from the invoicing component

Cons

  • On-premise components would need to communicate with cloud components, which require maintenance (unlike S3)
  • No AWS or Kinesis knowledge in the company

RabbitMQ

A messaging solution.

Place messages in a message queue upon business events, to be picked up by the new invoicing component.

Pros

  • Components are decoupled
  • Reliable and resilient due to clustering
  • Throughput of 20k+/sec messages
  • HTTP-API, command line tool, and UI for managing and monitoring
  • A lot of tools/plugins https://www.rabbitmq.com/devtools.html
  • Messages placed onto the queue are stored until the consumer retrieves them
  • Could use CloudAMQP (hosted RabbitMQ as a service) and then not worry about availability problems
  • Up to date Ruby library
  • ACK in client and in server for reliability

Cons

  • Supports complicated routing scenarios (e.g. content based) that we don’t need, adding complexity
  • No experience with Erlang eco-system
  • Pay more fees if we use CloudAMQP (https://www.cloudamqp.com/plans.html)

Apache Kafka

A messaging solution.

Place messages in a message log upon business events (a.k.a. “produce”), to be picked up by the new invoicing component (a.k.a. “consume”).

Set up a Kafka cluster (including a ZooKeeper cluster to manage the zoo).

Components will be decoupled from each other; they will communicate by producing messages and consuming them. Producers and consumers do not know each other.

Kafka supports several serialization formats, e.g. MsgPack and Avro. Avro, for example, requires a message schema; thus all messages that are produced are guaranteed to be valid. If a message is invalid, an exception is raised, and either way the invalid message is never produced. Serialization formats without schema validation may carry invalid data and tend to become a mess. The ones with a schema act as the de facto API between the components.
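In our setup the real validation is done by Avro against a registered schema; the sketch below stands in for that idea with a hand-rolled required-fields check, so the field names and types are illustrative:

```ruby
# Stand-in for Avro schema validation: a message is only "produced" if it
# matches the agreed schema; otherwise an exception is raised and nothing
# invalid ever reaches the log. Real Avro does far more (types, evolution).
BOOKING_SCHEMA = { booking_id: Integer, amount_cents: Integer, currency: String }

class SchemaViolation < StandardError; end

def validate!(message, schema = BOOKING_SCHEMA)
  schema.each do |field, type|
    value = message[field]
    raise SchemaViolation, "#{field} must be a #{type}" unless value.is_a?(type)
  end
  message
end
```

Because producers raise before writing anything invalid, consumers can trust every message they read, which is what makes the schema the de facto API.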

Kafka guarantees “at least once” delivery (the default setup) — we need to make sure that our consumers are idempotent.

Kafka has “topics” to produce messages into, “partitions” within each topic, “producers” that write messages, “consumers” that read them, and “consumer groups” that share the consuming work. Kafka replicates each partition for resilience (configurable).
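The reason message keys matter: equal keys always map to the same partition, so per-booking ordering is preserved within that partition. A simplified stand-in for Kafka’s partitioner (Kafka’s default uses murmur2 hashing; CRC32 here just illustrates the property):

```ruby
require "zlib"

# Simplified key-based partitioner: a stable hash of the key modulo the
# partition count. Equal keys always land in the same partition, so all
# events for one booking are consumed in order by one consumer.
def partition_for(key, num_partitions)
  Zlib.crc32(key) % num_partitions
end
```

This is also why we can say we “don’t care about order of messages” globally: order only needs to hold per key, and partitioning gives us that for free.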

Kafka eco-system is Scala/Java, thus requires JVM.

Pros

  • Components are decoupled
  • Reliable and resilient due to clustering
  • Scalable — will support huge traffic (a few million messages per minute) depending on the setup, both for producing and consuming
  • Can batch messages for efficiency
  • Will allow easily adding a BI solution in the future (e.g. Hadoop) with Kafka Connectors
  • Allows replaying of messages
  • Some vendors offer cloud Kafka, paid per use

Cons

  • Requires setting up more machines for resilience — ZooKeeper cluster, Schema Registry for schema validation (not mandatory) and Kafka instances. This means more money on infrastructure and more time to install, configure and maintain
  • May be an overkill for low traffic
  • No DevOps knowledge in the company

As reliability was critical, we decided to go for battle proven tools, even though it would have been much quicker to build some naïve mechanism.

We were left with Kafka and RabbitMQ.

How to Choose Between the Two?

Hold a competition! The competition, a.k.a. PoC, was not to prove that Kafka or RabbitMQ work (we know they do), but to see how easy or difficult each was to set up and work with.

The PoC was conducted by two developers, one for Kafka and one for RabbitMQ, excluding any DevOps work, as both solutions would require more or less the same resources.

The PoC:

  • Not more than one day
  • Set up local Kafka/RabbitMQ on docker with all required dependencies
  • Implement a producer (i.e. place a few thousand messages)
  • Implement a consumer (i.e. read a few thousand messages)
  • Define and register a schema
  • Implement a schema-aware producer
  • Implement a schema-aware consumer
  • Demonstrate on a local machine and share learnings

Result:

  • Both Kafka and RabbitMQ had up-to-date Ruby clients
  • Both were easy to write and the code was readable
  • Both are messaging solutions that don’t care what message is passed. A complementary solution is required for the schema validation. Kafka’s eco-system includes Schema Registry, a schema validation component built by Confluent.
  • Both solutions were running end-to-end within a few hours

The Winner

The team recommended using Kafka for the following reasons:

  • RabbitMQ has a more complex routing mechanism (e.g. content based) that we do not need and we actually consider an anti-pattern for our use case
  • Kafka’s eco-system is JVM based, and we’re more comfortable with it than with RabbitMQ’s Erlang eco-system, so we would be quicker
  • Schema Registry, which handles all the schema awareness on top, is made by Confluent, a company that builds tooling for Kafka, and it plays nicely with it; there is a gem that deals with it automatically, instead of us implementing it ourselves, and we got it working quickly during the PoC. Schemas are versioned, so we can do gradual deployments when changing a message schema
  • Some engineers already had some experience and background about Kafka and tooling
  • As a bonus, Kafka can handle huge future throughput

Implementation

DevOps built a Kafka cluster with ZooKeeper in all environments.

My team built the consumers and producers and finished the project on time.

The learning curve did not prove to be big, and Kafka has become another “tool in the toolbox” for the engineers involved (yes, they added it to their skills on LinkedIn).

Other teams have started using Kafka, and new components make use of it. It has become the default communication method where applicable, and we’re considering moving more things to it, for example metrics. Kafka has become a crucial architectural component, which makes it easy to communicate between components in a decoupled manner.
