Data Architecture 101: Kappa (Real-Time Data)

May 15, 2024
3,226 views

All things being equal, I think we'd all want access to source data in real time.
This is the holy grail of data engineering and removes most of the delay to insights.
One architecture approach that makes this possible is known as the Kappa Architecture.
It focuses on real-time data loading & processing rather than waiting on batches (see the sketch below).
No-brainer, right?
Well, back here in the real world, things aren't made equal.
Different architecture approaches require different levels of complexity (aka skill requirements).
Which impacts design & maintenance time.
Which all tends to mean a higher overall cost of ownership.
All that to say, just because it's technically possible doesn't always mean it's the best choice for your team.
Truthfully, I find it's usually not necessary.
BUT that doesn't mean you shouldn't be aware of it and be able to evaluate it.
So in this video we'll review at a high level what the Kappa architecture is about.
By the end, you'll understand the key points and be able to decide whether or not it makes sense for your team.
Enjoy!
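
For a concrete feel of that single-path idea, here's a minimal sketch of a Kappa-style consumer in Python. It assumes a Kafka topic named "events" and the kafka-python client; the topic name and the sink are illustrative, not from the video.

    # Minimal Kappa-style consumer (hypothetical names throughout).
    # One streaming path serves everything: "reprocessing" is just replaying
    # the log from the earliest offset instead of running a separate batch job.
    import json
    from kafka import KafkaConsumer  # pip install kafka-python

    consumer = KafkaConsumer(
        "events",                          # illustrative source topic
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",      # replaying from the start stands in for "batch"
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )

    for message in consumer:
        event = message.value
        # Transform each event as it arrives and upsert into a serving store.
        # (The store write is elided; any low-latency sink would go here.)
        print(f"offset {message.offset}: {event}")

The key design point is that there is no separate batch codebase: when logic changes, you replay the same stream through the same code. That single-codebase appeal is exactly where the skill and cost trade-offs above come in.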
►► The Starter Guide for The Modern Data Stack (Free PDF)
Simplify the “modern” data stack + better understand common tools & components → bit.ly/starter-mds
Timestamps:
0:00 - Intro
0:23 - What is a Kappa Architecture
1:28 - General Considerations
2:53 - Example Architecture
Title & Tags:
Data Architecture 101: Kappa (Real-Time Data)
#kahandatasolutions #dataengineering #dataarchitecture

Comments
  • ►► The Starter Guide for Modern Data → bit.ly/starter-mds | Simplify "modern" architectures + better understand common tools & components

    @KahanDataSolutions · 6 months ago
  • Good summary, but having gone through this fight recently, I would say it can get really complex.

    First, it should be use-case driven. You may want to lean toward this type of architecture if there is a real need in your company for real-time data, which tends to be more operational than analytical in nature. For those use cases, how much of your data does it represent? If you need 5% of your data immediately, and from one source, you don't need to kludge the other 95% into a Kappa architecture just because you have to pick one.

    You also need to look at the sources and how data is presented to you. In my case, the Kappa advocates get an hourly file with millions of records. They wanted to break it down into individual messages and run it through their event broker as if it were Kappa, because at some point in the future it might change.

    For real-time data you need to look at latency requirements, and often just stream directly to a consuming application (sketched below), because running it through your whole lake --> transform --> model process adds lag. This was shown in the Lambda model but applies to any case where you need to react in seconds or minutes.

    Finally, if you are in a cloud environment, you need to consider the cost of running your ingest servers 24/7 rather than just spinning up compute to run your ETL once a day. You end up with significant optimization problems trying to minimize unutilized capacity to reduce costs.

    In summary: don't fall in love with any specific pattern. Break down your processing plan based on needs and challenges, and use whatever pattern is appropriate (considering maintainability as well).

    @gatorpika · 6 months ago
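
To make that "stream directly to a consuming application" point concrete, here's a hypothetical sketch in Python. The topic name ("alerts"), broker address, and handler are made up for illustration; the idea is that only the genuinely latency-sensitive slice goes straight from the broker to the consumer, skipping the lake --> transform --> model hop.

    # Hypothetical: serve the low-latency slice directly from the stream.
    import json
    from kafka import KafkaConsumer  # pip install kafka-python

    consumer = KafkaConsumer(
        "alerts",                      # illustrative topic for the time-critical slice
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )

    def notify_operations(event: dict) -> None:
        # Placeholder for the consuming application (pager, dashboard, API call).
        print(f"ALERT: {event}")

    for message in consumer:
        notify_operations(message.value)  # react in-stream; no lake hop, no batch lag

The rest of the data can still land via an ordinary scheduled batch load, which also sidesteps the 24/7 ingest-compute cost mentioned above.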