Apache Kafka

Apache Kafka already serves as the real-time data backbone in many companies.

I used to think of data processing as a binary choice: either in parallel or with transactional guarantees. Kafka showed me that there is often a pragmatic middle way. Does it really matter whether patient A's or patient B's data is processed first? In many cases no, as long as all data belonging to a single patient is processed in the correct order.

Or in Kafka terms: the data is partitioned by patient ID, so all records within a partition are processed in order and transactionally, in parallel to other patients' data in other partitions.
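
To make this concrete, here is a minimal sketch using Kafka's Java producer client. The topic name patient-events and the event payloads are made up for illustration; the point is that the record key, the patient ID, determines the partition.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PatientEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The record key decides the partition: all events for the same
            // patient hash to the same partition and keep their order.
            producer.send(new ProducerRecord<>("patient-events", "patient-A", "blood-pressure=120/80"));
            producer.send(new ProducerRecord<>("patient-events", "patient-A", "heart-rate=72"));
            // patient-B may land on a different partition and is then
            // processed in parallel with patient-A's events.
            producer.send(new ProducerRecord<>("patient-events", "patient-B", "heart-rate=65"));
        }
    }
}
```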

That got me hooked on Apache Kafka. Its other concepts, such as load balancing via consumer groups and Kafka Streams, are clever as well.
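
The load balancing is visible on the consuming side: every consumer that subscribes with the same group.id gets a share of the topic's partitions, and within each assigned partition the records arrive in order. A minimal sketch, again with made-up names:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PatientEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // All instances sharing this group.id split the topic's
        // partitions among themselves -- Kafka's load balancing.
        props.put("group.id", "patient-processors");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("patient-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Records of one patient arrive here in the order
                    // they were produced.
                    System.out.printf("patient=%s event=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```

Starting a second instance with the same group.id makes Kafka rebalance the partitions across both processes, without any change to the code.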

SAP with Big Data

There are many options to combine Big Data with SAP, including inexpensive ones.

SAP follows the concept of openness, hence the wide range of integration options.

The downside of this freedom is having to choose the right approach: which products to use, which technologies are involved, what the business needs, and how to involve the users.

Examples:

  • An open-source-minded person might do all the transformations in, say, Apache Spark and load the results into Hana via SQL Datasets (see the sketch after this list).
  • A Hana team might connect to Hadoop using the Spark Connector.
  • An SAP person might suggest using SAP Data Hub for everything.
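
To illustrate the first approach, here is a rough sketch of writing Spark transformation results into Hana over JDBC. The input path, host, credentials, schema, and table names are placeholders, and it assumes the SAP Hana JDBC driver (ngdbc.jar, driver class com.sap.db.jdbc.Driver) is on the Spark classpath.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SparkToHana {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SparkToHana")
                .getOrCreate();

        // Hypothetical transformation: aggregate raw events per patient.
        Dataset<Row> results = spark.read().parquet("/data/raw_events")
                .groupBy("patient_id")
                .count();

        // Load the results into a Hana table via JDBC.
        // Host, credentials, and table name are placeholders.
        results.write()
                .mode(SaveMode.Append)
                .format("jdbc")
                .option("url", "jdbc:sap://hana-host:30015")
                .option("driver", "com.sap.db.jdbc.Driver")
                .option("dbtable", "ANALYTICS.PATIENT_EVENT_COUNTS")
                .option("user", "SPARK_LOADER")
                .option("password", "<secret>")
                .save();

        spark.stop();
    }
}
```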

All approaches have pros and cons. Navigating through this minefield requires a lot of background knowledge and experience.