Streaming Data

Master plan for building real-time data streaming pipelines and event-driven architectures. Covers message brokers, stream processing frameworks, event sourcing patterns, and operational best practices for high-throughput data systems.

Key Topics

  • Message Broker Architectures: Kafka, Pulsar, RabbitMQ, NATS, and when to use each
  • Stream Processing Frameworks: Apache Flink, Kafka Streams, Apache Spark Structured Streaming, Apache Storm
  • Event Sourcing Patterns: Command sourcing, event stores, projections, and replay mechanisms
  • Exactly-Once Semantics: Idempotency, the transactional outbox pattern, and distributed transactions (see the producer sketch after this list)
  • Schema Management: Schema Registry (Avro, Protobuf, JSON Schema), schema evolution, and compatibility rules (Avro example under Schema Management below)
  • Partitioning Strategies: Key-based partitioning, sticky partitioning, and consumer group management (consumer example under Message Brokers below)
  • Backpressure Handling: Flow control, buffering strategies, and rate limiting
  • Windowing and Aggregations: Tumbling, sliding, and session windows, plus watermarks for late-arriving events (windowed-count example under Stream Processing below)
  • Change Data Capture (CDC): Debezium, database log tailing, and sync strategies (envelope example under Change Data Capture below)
  • Real-Time Analytics: Streaming SQL, materialized views, and time-series aggregations
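
The exactly-once entry above is concrete enough to sketch. In Kafka it rests on two producer settings: idempotence, so the broker deduplicates retries, and transactions, so a group of sends becomes visible atomically. A minimal sketch, assuming a local broker and a hypothetical `orders` topic (the keyed sends also illustrate key-based partitioning):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionalProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption: local broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("enable.idempotence", "true");            // broker dedups producer retries
        props.put("transactional.id", "orders-producer-1"); // stable id enables transactions

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();                    // register txn id, fence stale instances
            producer.beginTransaction();
            try {
                // Same key => same partition, preserving per-key ordering.
                producer.send(new ProducerRecord<>("orders", "customer-42", "order-created"));
                producer.send(new ProducerRecord<>("orders", "customer-42", "order-paid"));
                producer.commitTransaction();               // both records become visible together
            } catch (Exception e) {
                producer.abortTransaction();                // neither record is exposed
                throw e;
            }
        }
    }
}
```

Consumers only see the atomicity if they set `isolation.level=read_committed`; otherwise records from aborted transactions are still delivered.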

Primary Tools & Technologies

Message Brokers:

  • Apache Kafka (industry standard for event streaming)
  • Apache Pulsar (cloud-native alternative with built-in multi-tenancy)
  • RabbitMQ (traditional message queue patterns)
  • AWS Kinesis, Azure Event Hubs, Google Pub/Sub (managed services)
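
As referenced under Key Topics, consumer groups are how Kafka scales reads: every consumer sharing a `group.id` is assigned a disjoint subset of the topic's partitions, and committing offsets only after processing yields at-least-once delivery. A minimal sketch with placeholder topic and group names:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerGroupSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
        props.put("group.id", "orders-processors");        // members split the partitions
        props.put("enable.auto.commit", "false");          // commit manually, after processing
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));         // coordinator assigns partitions
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
                consumer.commitSync();                     // at-least-once: commit after handling
            }
        }
    }
}
```

Starting a second process with the same `group.id` triggers a rebalance that redistributes the partitions between the two instances.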

Stream Processing:

  • Apache Flink (stateful stream processing)
  • Kafka Streams (lightweight, embedded processing)
  • Apache Spark Structured Streaming (micro-batch processing)
  • ksqlDB (SQL-based stream processing)
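
To make the windowing topic concrete, here is a Kafka Streams sketch that counts events per key in one-minute tumbling windows, with a grace period so late-but-in-grace records still update their window. The topic name and application id are placeholders:

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.TimeWindows;

public class WindowedCountSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pageview-counter"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("page-views")         // assumption: String-keyed topic
                .groupByKey()
                .windowedBy(TimeWindows.ofSizeAndGrace(      // 1-minute tumbling windows;
                        Duration.ofMinutes(1),               // 10s grace admits late events
                        Duration.ofSeconds(10)))
                .count()                                     // per-key count per window
                .toStream()
                .foreach((windowedKey, count) ->
                        System.out.println(windowedKey + " -> " + count));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }
}
```

Sliding and session windows swap in via `SlidingWindows` and `SessionWindows`; Kafka Streams has no explicit watermarks, with stream time plus the grace period playing that role.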

Schema Management:

  • Confluent Schema Registry
  • Pulsar Schema Registry
  • AWS Glue Schema Registry
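
A sketch of what registry-backed serialization looks like with Confluent's Avro serializer: the producer registers (or looks up) the schema on first use, and a later change such as adding a field with a default keeps the subject backward compatible. Broker, registry URL, topic, and schema are all placeholders:

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerSketch {
    // Adding e.g. {"name": "coupon", "type": "string", "default": ""} later
    // would be a backward-compatible evolution of this schema.
    private static final String ORDER_SCHEMA = """
            {"type": "record", "name": "Order", "fields": [
               {"name": "id", "type": "string"},
               {"name": "amountCents", "type": "long"}
            ]}""";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // assumption: local registry

        Schema schema = new Schema.Parser().parse(ORDER_SCHEMA);
        GenericRecord order = new GenericData.Record(schema);
        order.put("id", "o-1001");
        order.put("amountCents", 4999L);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "o-1001", order));
        }
    }
}
```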

Change Data Capture:

  • Debezium (Kafka Connect-based CDC)
  • Maxwell's Daemon (MySQL CDC)
  • Airbyte (data integration with CDC support)
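
Debezium wraps every row change in an envelope whose `op` field says what happened (`c` create, `u` update, `d` delete, `r` snapshot read) alongside `before` and `after` row images. A sketch of unwrapping that envelope from a consumed JSON value with Jackson; the topic it would arrive on follows Debezium's `server.schema.table` naming, but everything here is illustrative:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DebeziumEnvelopeSketch {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** Handle one Debezium change event consumed from e.g. "pg.public.orders". */
    static void handle(String json) throws Exception {
        JsonNode envelope = MAPPER.readTree(json);
        // With Debezium's default JSON converter the envelope sits under "payload";
        // with schemas disabled it is the root object.
        JsonNode payload = envelope.has("payload") ? envelope.get("payload") : envelope;

        String op = payload.get("op").asText();  // c=create, u=update, d=delete, r=snapshot
        JsonNode before = payload.get("before"); // row image before the change (null on create)
        JsonNode after = payload.get("after");   // row image after the change (null on delete)

        switch (op) {
            case "c", "r" -> System.out.println("upsert: " + after);
            case "u" -> System.out.println("update: " + before + " -> " + after);
            case "d" -> System.out.println("delete: " + before);
            default -> System.out.println("unknown op: " + op);
        }
    }
}
```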

Monitoring & Operations:

  • Kafka UI, Kafdrop, Conduktor
  • Prometheus + Grafana for metrics
  • Datadog, New Relic for observability
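
The dashboards above all chart consumer lag (log-end offset minus committed offset, per partition), and it can also be computed directly with Kafka's AdminClient. A minimal sketch with a placeholder group id:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for every partition the group consumes.
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("orders-processors") // placeholder group id
                    .partitionsToOffsetAndMetadata().get();

            // Log-end offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latest = new HashMap<>();
            committed.keySet().forEach(tp -> latest.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(latest).all().get();

            committed.forEach((tp, meta) -> System.out.printf("%s lag=%d%n",
                    tp, ends.get(tp).offset() - meta.offset()));
        }
    }
}
```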

Integration Points

Upstream Dependencies:

  • Data Architecture: Event schema design and versioning strategies
  • API Design: Async API specifications for event-driven systems
  • Secret Management: Broker credentials and TLS certificate handling

Downstream Consumers:

  • Data Transformation: Real-time ETL consuming streaming data
  • Performance Engineering: Throughput optimization and latency tuning
  • SQL Optimization: Materialized views and streaming analytics

Cross-Functional:

  • Microservices: Event-driven communication and saga patterns
  • Monitoring & Alerting: Stream lag monitoring and data quality checks
  • Disaster Recovery: Replication strategies and backup/restore

Status

Master Plan Available: Comprehensive guidance for real-time data streaming architectures, covering Kafka, Pulsar, Flink, and event-driven design patterns.


Part of the Data Engineering skill collection focused on building scalable, real-time data pipelines.