Streaming Data Lakehouse Integration: Moving from Kafka Topics to Cloud Data Lakes in AWS & Azure via Confluent Connectors

Author(s): Girish Rameshbabu

Publication #: 2601028

Date of Publication: 29.01.2026

Country: United States

Pages: 1-9

Published In: Volume 12 Issue 1 January-2026

DOI: https://doi.org/10.62970/IJIRCT.v12.i1.2601028

Abstract

The emergence of the Data Lakehouse—fusing the scale of cloud object storage (AWS S3, Azure ADLS Gen2) with the transactional governance of a data warehouse—necessitates a robust mechanism for handling high-velocity event streams from Apache Kafka. This paper addresses the architectural challenge of reliably bridging Kafka topics to Lakehouse formats (Delta Lake, Apache Hudi, Apache Iceberg) while ensuring three critical constraints are met: consistency (ACID), timeliness (low latency), and schema integrity. We propose and evaluate two primary ingestion patterns utilizing Confluent Sink Connectors: Direct Micro-Batch Ingestion (Pattern 1) and Stream Processor-Aided Ingestion (Pattern 2). The analysis demonstrates that Pattern 2, which incorporates an intermediate stream processor for inline deduplication and enrichment, is superior for high-churn and Change Data Capture (CDC) workloads. Furthermore, we establish that transactional consistency and graceful schema evolution are maintained through the layered cooperation of the Confluent Schema Registry (enforcing data contracts pre-landing) [11] and the Lakehouse Table Format (implementing atomic commits post-landing) [4]. The evaluation also confirms that this streaming integration strategy is architecturally portable across both the AWS and Azure cloud ecosystems, providing a standardized blueprint for real-time analytics and machine learning applications.
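As an illustration of the Direct Micro-Batch Ingestion pattern described above, the following is a minimal sketch of a Confluent Amazon S3 Sink Connector configuration that lands Avro-encoded Kafka records as Parquet files in time-based partitions. The bucket name, topic name, and Schema Registry URL are placeholders, and the specific property values (flush size, rotation interval) are illustrative assumptions, not recommendations from the paper.

```json
{
  "name": "s3-lakehouse-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "orders",
    "s3.bucket.name": "example-lakehouse-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
    "partition.duration.ms": "3600000",
    "locale": "en-US",
    "timezone": "UTC",
    "flush.size": "10000",
    "rotate.interval.ms": "60000",
    "schema.compatibility": "BACKWARD",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}
```

The `schema.compatibility` and converter settings are where the Schema Registry's pre-landing data contracts take effect; the table format's atomic commit then governs consistency once files land in object storage.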
