
What’s New in Cloudera’s Distribution of Apache Kafka?

Cloudera’s distribution of Kafka (now at release 2.0) is based on Apache Kafka 0.9 and includes various new features (especially for security and usability), enhancements, and bug fixes.

Kafka is rapidly gaining momentum in enterprise Apache Hadoop deployments and has become the de facto messaging bus in most Big Data technology stacks. During this period of rapid adoption (and since Cloudera began shipping Kafka in February 2015), a lot has changed with respect to what users expect of Kafka, and how Kafka should deliver.

In this post, we’ll explain how the major features now shipping with Cloudera’s new distribution of Kafka reflect some of those changes.

Security

As more enterprises adopt Kafka, security has become a major concern. With this new release, Cloudera is closing several gaps by providing capabilities to encrypt data over the wire and to authenticate users using Kerberos or TLS client certificates.

The new producer and consumer APIs in this release support these security features. Pre-existing producer and consumer applications cannot use the new features unless they migrate to the new APIs.

Over-the-wire encryption and authentication are mutually independent, and you can choose from the following combinations of secure/insecure communication when configuring Kafka (a client-side example follows the list):

  • PLAINTEXT, neither over-the-wire encryption nor authentication
  • SSL, over-the-wire encryption without authentication
  • SASL_PLAINTEXT, authentication without over-the-wire encryption
  • SASL_SSL, both authentication and over-the-wire encryption
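
For illustration, a client selects one of these modes through the security.protocol setting in its configuration; the value shown here is just an example:

    # Client-side knob selecting the communication mode
    # (one of the four values listed above)
    security.protocol=SASL_SSL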

Over-the-Wire Encryption (SSL)

The new release supports encryption over the wire. This is important if your data is sensitive but travels over an untrusted network, as is often the case for cross-data-center collection and IoT applications collecting sensor data. Over-the-wire encryption, when combined with the at-rest encryption capabilities provided by Cloudera Navigator Encrypt and native HDFS encryption in Hadoop, enables end-to-end encryption of data streams flowing from Kafka to Cloudera Enterprise.

Because over-the-wire encryption has a performance overhead, it is turned off by default. The performance hit due to SSL can be up to 50%, but is usually significantly less, depending on CPU, JVM, and so on. Over-the-wire encryption is supported between Kafka clients and Kafka brokers, and/or among Kafka brokers. You can choose to enable SSL only between clients and brokers by setting the security.inter.broker.protocol configuration accordingly.
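
As a minimal sketch (keystore paths, passwords, and hostnames below are placeholders), enabling SSL involves adding an SSL listener and keystore/truststore settings on the broker, plus a matching truststore on the client:

    # Broker (server.properties)
    listeners=PLAINTEXT://broker1.example.com:9092,SSL://broker1.example.com:9093
    ssl.keystore.location=/path/to/kafka.server.keystore.jks
    ssl.keystore.password=keystore-secret
    ssl.key.password=key-secret
    ssl.truststore.location=/path/to/kafka.server.truststore.jks
    ssl.truststore.password=truststore-secret
    # Also encrypt broker-to-broker traffic; leave as PLAINTEXT to
    # encrypt client-broker traffic only
    security.inter.broker.protocol=SSL

    # Client (producer/consumer properties)
    security.protocol=SSL
    ssl.truststore.location=/path/to/kafka.client.truststore.jks
    ssl.truststore.password=truststore-secret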

Authentication (SASL/Kerberos)

With this new release, you can now configure Kafka to authenticate users with Kerberos or TLS client certificates. When enabled, Kafka authenticates every client request before responding, allowing both the sending and receiving of Kafka messages to be restricted using a common identity and security infrastructure, typically shared across other Hadoop services. As with encryption, you can enable authentication between clients and Kafka brokers and/or between Kafka brokers.

Cloudera’s release 2.0 also enables authenticated access to the Kafka metadata stored in Apache ZooKeeper by leveraging the SASL authentication feature of ZooKeeper. This feature enables Kafka to work with Kerberos-enabled ZooKeeper ensembles that disallow un-credentialed connections.
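
A rough sketch of a Kerberos setup follows (principals, keytab paths, and hostnames are placeholders; the JAAS file is supplied to the JVM via -Djava.security.auth.login.config):

    # Broker (server.properties)
    listeners=SASL_PLAINTEXT://broker1.example.com:9094
    security.inter.broker.protocol=SASL_PLAINTEXT
    sasl.kerberos.service.name=kafka

    # Broker JAAS file (kafka_server_jaas.conf)
    KafkaServer {
        com.sun.security.auth.module.Krb5LoginModule required
        useKeyTab=true
        storeKey=true
        keyTab="/path/to/kafka.keytab"
        principal="kafka/broker1.example.com@EXAMPLE.COM";
    };

    # Client (producer/consumer properties); the client needs its own
    # KafkaClient JAAS section with a ticket cache or keytab
    security.protocol=SASL_PLAINTEXT
    sasl.kerberos.service.name=kafka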

New Consumer API

This new release includes a redesigned consumer client. Unlike the old consumer, which exposed distinct APIs for the high-level ZooKeeper-based consumer and the low-level SimpleConsumer, the new consumer client does not require a separate set of APIs based on its intended usage. Instead, it provides a unified API for a simplified development experience.

The new consumer provides the capabilities of both the old high-level and low-level consumers, and much more. Even with this broader and more powerful feature set, the new API allows consumer clients to be simpler and thinner. Enhancements to the new consumer API include the following (a short usage sketch follows the list):

  • Better group management
  • Faster rebalancing
  • No dependency on ZooKeeper
  • Pluggable partition assignment amongst the members of a consumer group
  • Pluggable offset management support that allows users to choose between default Kafka-backed offset management and offset management through an external data store
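
To give a feel for the unified API, here is a minimal sketch of the new consumer; the broker address, group id, and topic name are placeholders:

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class MinimalConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1.example.com:9092"); // placeholder broker
            props.put("group.id", "example-group");                     // consumer group id
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            // Group management is handled by the brokers; no ZooKeeper dependency
            consumer.subscribe(Arrays.asList("example-topic"));
            while (true) {
                // poll() drives group coordination, rebalancing, and fetching
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }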

It’s important to note that existing Kafka clients (those using the old consumer API and libraries based on Kafka 0.8.x) are binary forward-compatible, so you only need to update them when you want the new features. However, the same is not true of newer clients (built against the Kafka 0.9 API and/or libraries): these clients cannot communicate with brokers running Apache Kafka versions earlier than 0.9 (or Cloudera’s distribution of Kafka earlier than 2.0). Thus, you should update your Kafka infrastructure before updating clients.

Stay tuned for an in-depth post about the new Consumer API in the near future.

User-defined Quotas

With this new release, Kafka is one step closer to supporting true multi-tenancy. Previously, Kafka could not throttle or rate-limit producers or consumers. As a result, a single consumer could consume extremely quickly, monopolizing broker resources and saturating the network. Likewise, a single producer could push extremely large amounts of data, causing memory pressure and I/O contention on brokers. Either scenario could temporarily lower throughput or raise latency throughout the system.

With the new support for user-defined quotas, you can enforce quotas on a per-client basis, keyed by client id. Moreover, each per-client quota is enforced on a per-broker basis. Each client receives a default quota (for example, 10 MBps read and 5 MBps write), which you can override per client dynamically (without downtime). Producer-side quotas are defined in bytes written per second per client id; consumer quotas, in bytes read per second per client id.
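
To make this concrete (the rates and client id below are placeholders, and exact tooling may vary by distribution), default quotas are broker settings, while per-client overrides can be applied at runtime with the kafka-configs tool:

    # Broker (server.properties): default quotas, in bytes per second
    quota.producer.default=5242880     # 5 MBps write per client id
    quota.consumer.default=10485760    # 10 MBps read per client id

    # Dynamic per-client override, applied without broker downtime
    bin/kafka-configs.sh --zookeeper zk1.example.com:2181 --alter \
      --entity-type clients --entity-name reporting-app \
      --add-config 'producer_byte_rate=1048576,consumer_byte_rate=2097152'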

If a client exceeds its quota, Kafka starts throttling its fetch/produce requests. The client receives no errors while being throttled; the slowdown is, however, visible via metrics in Cloudera Manager. Throttling one client does not affect any other client interacting with the Kafka cluster; rather, it prevents a malicious client from hogging cluster resources at the expense of the others.

Conclusion

As you can see, Cloudera’s new distribution (release 2.0) of Kafka reflects the evolving nature of Kafka deployments in production, including a pronounced focus on security and usability. We invite you to get started via the following resources:

As always, we welcome your feedback via the Data Ingestion area in the Cloudera Community.

Ashish Singh is a Software Engineer at Cloudera, and a contributor to Apache Kafka, Apache Hive, Apache Sentry (incubating), and Apache Parquet.
