---
title: "Fluss × Iceberg (Part 1): Why Your Lakehouse Isn't a Streamhouse Yet"
authors: [mehulbatra, yuxia]
date: 2025-12-11
tags: [streaming-lakehouse, apache-iceberg, real-time-analytics, apache-fluss]
---
As software and data engineers, we've witnessed Apache Iceberg revolutionize analytical data lakes with ACID transactions, time travel, and schema evolution. Yet when we try to push Iceberg into real-time workloads such as sub-second streaming queries, high-frequency CDC updates, and primary key semantics, we hit fundamental architectural walls. This blog explores how the Fluss × Iceberg integration works and delivers a true real-time lakehouse.
Apache Fluss represents a new architectural approach: the **Streamhouse** for real-time lakehouses. Instead of stitching together separate streaming and batch systems, the Streamhouse unifies them under a single architecture. In this model, Apache Iceberg continues to serve exactly the role it was designed for: a highly efficient, scalable cold storage layer for analytics, while Fluss fills the missing piece: a hot streaming storage layer with sub-second latency, columnar storage, and built-in primary-key semantics.
After working on the Fluss-Iceberg lakehouse integration and deploying this architecture at a massive scale, including Alibaba's 3 PB production deployment processing 40 GB/s, we're ready to share the architectural lessons learned. Specifically, why existing systems fall short, how Fluss and Iceberg naturally complement each other, and what this means for finally building true real-time lakehouses.

<!-- truncate -->
## The Real-Time Lakehouse Imperative
### Why Real-Time Lakehouses Matter Now
Four converging forces are driving the need for sub-second data infrastructure:
**1. Business Demand for Speed:** Modern businesses operate in real-time. Pricing decisions, inventory management, and fraud detection all require immediate response. Batch-oriented systems with T+1 day or even T+1 hour latency can't keep up with operational tempo.
**2. Immediate Decision Making:** Operational analytics demands split-second insights. Manufacturing lines, delivery logistics, financial trading, and customer service all need to react to events as they happen, not hours or days later.
**3. AI/ML Needs Fresh Data:** Here's the critical insight: **You can't build the next TikTok recommender system on traditional lakehouses, which lack real-time streaming data for AI.** Modern AI applications, such as personalized recommendations, real-time content ranking, and dynamic ad placement, require continuous model inference on fresh data.
**4. Agentic AI Requires Real-Time Context:** AI agents need immediate access to the current system state to make decisions. Whether it's autonomous trading systems, intelligent routing agents, or customer service bots, agents can't operate effectively on stale data.

### The Evolution of Data Freshness
**Traditional Batch Era (T+1 day):** Hive-based data warehouses, daily ETL jobs run overnight, next-day readiness acceptable for reporting.
**Lakehouse Era (T+1 hour):** Modern lakehouse formats (Iceberg, Delta Lake, Hudi), hourly micro-batch processing, better for near-real-time dashboards.
**Streaming Lakehouse Era (T+1 minute):** Streaming integration with lakehouses (Paimon), minute-level freshness through continuous ingestion, suitable for operational analytics.
**The Critical Gap - Second-Level Latency:** File-system-based lakehouses inherently face minute-level latency as their practical upper limit. This isn't a limitation of specific implementations; it's fundamental. File commits, metadata operations, and object storage consistency guarantees create unavoidable overhead.
Yet critical use cases demand sub-second to second-level latency: search and recommendation systems with real-time personalization, advertisement attribution tracking, anomaly detection for fraud and security monitoring, operational intelligence for manufacturing/logistics/ride-sharing, and Gen AI model inference requiring up-to-the-second features. The industry needs a **hot real-time layer** sitting in front of the lakehouse.

## What is Fluss × Iceberg?
### The Core Concept: Hot/Cold Unified Storage
The Fluss architecture delivers millisecond-level end-to-end latency for real-time data writing and reading. Its **Tiering Service** continuously offloads data into standard lakehouse formats like Apache Iceberg, enabling external query engines to analyze data directly. This streaming/lakehouse unification simplifies the ecosystem, ensures data freshness for critical use cases, and combines real-time and historical data seamlessly for comprehensive analytics.
**Unified Data Locality:** Fluss aligns partitions and buckets across both streaming and lakehouse layers, ensuring consistent data layout. This alignment enables direct Arrow-to-Parquet conversion without network shuffling or repartitioning, dramatically reducing I/O overhead and improving pipeline performance.
Think of your data as having two thermal zones:
**Hot Tier (Fluss):** Last 1 hour of data, NVMe/SSD storage, sub-second latency, primary key indexed (RocksDB), streaming APIs, Apache Arrow columnar format. High-velocity writes, frequent updates, sub-second query latency requirements.
**Cold Tier (Iceberg):** Historical data (hours to years), S3/HDFS object storage, minute-level latency, Parquet columnar format, ACID transactions, analytical query engines. Infrequent updates, optimized for analytical scans, stored cost-efficiently.
Traditional architectures force you to maintain **separate systems** for these zones: Kafka/Kinesis for streaming (hot), Iceberg for analytics (cold), complex ETL pipelines to move data between them, and applications writing to both systems (dual-write problem).

**Fluss × Iceberg unifies these as tiered storage with Kappa architecture:** Applications write once to Fluss. A stateless Tiering Service (Flink job) automatically moves data from hot to cold storage based on configured freshness (e.g., 30 seconds, 5 minutes). Query engines see a single table that seamlessly spans both tiers, eliminating the dual-write complexity of Lambda architecture.
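To make this concrete, here is a minimal Flink SQL sketch of how such a table might be declared. The `table.datalake.freshness` property is the one discussed in this post; the catalog options and the `table.datalake.enabled` flag are assumptions based on the Fluss quickstart, so verify the exact names against the documentation.

```sql
-- Minimal sketch: a Fluss table whose data is automatically tiered to Iceberg.
-- Catalog options and the 'table.datalake.enabled' flag are assumptions; the
-- freshness property is the knob described in this post.
CREATE CATALOG fluss_catalog WITH (
  'type' = 'fluss',
  'bootstrap.servers' = 'coordinator-server:9123'   -- hypothetical address
);
USE CATALOG fluss_catalog;

-- The application writes only to this table; the Tiering Service moves data
-- to Iceberg once it is older than the configured freshness.
CREATE TABLE orders (
  order_id   BIGINT,
  sku_id     BIGINT,
  amount     DECIMAL(10, 2),
  event_time TIMESTAMP(3)
) WITH (
  'table.datalake.enabled'   = 'true',
  'table.datalake.freshness' = '30s'
);
```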
### Why This Architecture Matters
**Single write path:** Your application writes to Fluss. Period. No dual-write coordination, no consistency headaches across disconnected systems.
**Automatic lifecycle management:** Data naturally flows from hot → cold based on access patterns and configured retention. Freshness is configurable in minutes via table properties.
**Auto table creation:** Supports both append and primary key tables, with automatic schema mapping and unified data locality via partitioning.
**Auto table maintenance:** Enable it with a single flag (`table.datalake.auto-maintenance=true`). The tiering service automatically detects small files during writes and applies bin-packing compaction to merge them into optimally sized files.
**Query flexibility:** Run streaming queries on hot data (Fluss), analytical queries on cold data (Iceberg), or union queries that transparently span both tiers.

## What Iceberg Misses Today
Apache Iceberg was architected for batch-optimized analytics. While it supports streaming ingestion, fundamental design decisions create unavoidable limitations for real-time workloads.
### Gap 1: Metadata Overhead Limits Write Frequency
Every Iceberg commit rewrites `metadata.json` and manifest list files. For analytics with commits every 5-15 minutes, this overhead is negligible. For streaming with high-frequency commits, it becomes a bottleneck.
**The Math:** Consider a streaming table with 100 events/second, committing every second:
- Each commit adds a manifest list entry
- After 1 hour: **3,600 manifest lists**
- After 1 day: **86,400 manifest lists**
- `metadata.json` grows to megabytes
- Individual commit latency stretches to multiple seconds
**Compounding Factor:** The problem compounds with partitioning. A 128-partition table with per-partition commits can generate thousands of metadata operations per second. **Metadata becomes the bottleneck, not data throughput.**
**Real-World Evidence - Snowflake's acknowledgment:** Snowflake's Iceberg streaming documentation explicitly warns about this, defaulting `MAX_CLIENT_LAG` to 30 seconds (versus 1 second for native tables). The metadata overhead makes sub-second latency impractical.
### Gap 2: Polling-Based Reads Create Latency Multiplication
Iceberg doesn't have a push-based notification system. Streaming readers poll for new snapshots.
**Latency Breakdown:**
```
1. Writer commits snapshot                   → 0ms
2. Metadata hits S3 (eventual consistency)   → 0-5,000ms
3. Reader polls (5-10s interval)             → 5,000-10,000ms
4. Reader discovers snapshot                 → 5,000-15,000ms
5. Reader fetches data files from S3         → 5,100-15,500ms
──────────────────────────────────────────────────────────────
Total end-to-end latency: 5+ to 15+ seconds
```
Compare this to a push model where producers write and consumers immediately receive notifications. End-to-end latency drops to **single-digit milliseconds**.
For real-time dashboards, fraud detection, or operational analytics requiring sub-second freshness, this polling latency is a non-starter.
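For contrast, this is roughly what a polling-based streaming read looks like with the Iceberg Flink connector; `monitor-interval` is the polling knob behind step 3 above (a sketch using the connector's documented read options, with hypothetical catalog and table names).

```sql
-- Streaming read directly from an Iceberg table in Flink SQL.
-- New snapshots are discovered only by polling, so end-to-end latency is
-- bounded below by the commit interval plus the monitor interval.
SELECT order_id, amount, event_time
FROM iceberg_catalog.db.orders
  /*+ OPTIONS('streaming' = 'true', 'monitor-interval' = '10s') */
WHERE amount > 100;
```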
### Gap 3: Primary Key Support Is Declarative, Not Enforced
Iceberg V2 tables accept PRIMARY KEY declarations in DDL:
```sql
CREATE TABLE users (
  user_id BIGINT,
  email STRING,
  created_at TIMESTAMP,
  PRIMARY KEY (user_id) -- This is a hint, not a constraint
);
```
However, Iceberg **does not enforce** primary keys:
- ❌ No uniqueness validation on write
- ❌ No built-in deduplication
- ❌ No indexed lookups (point queries scan entire table)
**The Consequence:** For CDC workloads, you must implement deduplication logic in your streaming application (typically using Flink state). For tables with billions of rows, this state becomes enormous: **50-100+ TB in production scenarios**.
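As an illustration of that hidden cost, the sketch below shows the standard Flink SQL deduplication pattern a pipeline ends up carrying when the table format does not enforce the key (table names are hypothetical).

```sql
-- Without enforced primary keys, every consumer deduplicates on its own.
-- This keeps one row of state per key in Flink, which is the state that
-- grows to tens of terabytes for billion-row CDC tables.
INSERT INTO iceberg_catalog.db.users_deduped
SELECT user_id, email, created_at
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY user_id        -- the declared-but-unenforced key
           ORDER BY created_at DESC    -- keep the latest version per key
         ) AS row_num
  FROM cdc_users                       -- hypothetical CDC source table
)
WHERE row_num = 1;
```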
### Gap 4: High-Frequency Updates Create Write Amplification
Iceberg supports updates via Merge-On-Read (MOR) with delete files:
**Equality deletes:** Store all column values for deleted rows. For CDC updates (`-U` records), this means writing full row content to delete files before writing the updated version. For wide tables (50+ columns), this **doubles the write volume**.
**Position deletes:** More efficient but require maintaining file-level position mappings. For streaming updates scattered across many files, position deletes proliferate rapidly.
**The Small File Problem:** Streaming workloads naturally create many small files. Production teams report cases where **500 MB of CDC data exploded into 2 million small files** before compaction, slowing queries by **10-100x**.
## How Fluss Fills These Gaps
### Solution 1: Log-Indexed Streaming Storage with Push-Based Reads
**Addresses:** Gap 1 (Metadata Overhead) & Gap 2 (Polling-Based Reads)
Fluss reimagines streaming storage using **Apache Arrow IPC columnar format** with **Apache Kafka's battle-tested replication protocol**.
**How it solves metadata overhead:**
Iceberg's metadata bottleneck occurs when you commit frequently. Fluss sidesteps this entirely:
1. **High-frequency writes go to Fluss:** append-only log segments with no global metadata coordination
2. **Iceberg receives batched commits:** the tiering service aggregates minutes of data into single, well-formed Parquet files
3. **Configurable freshness:** `table.datalake.freshness = '1min'` means Iceberg sees ~1 commit per minute, not thousands
**Result:** Iceberg operates exactly as designed: periodic batch commits with clean manifest evolution. The streaming complexity stays in Fluss.
**How it solves polling latency:**
- **Push-based real-time:** Consumers receive millisecond-latency push notifications when new data arrives. No polling intervals.
- **End-to-end latency:** Sub-second, typically single-digit milliseconds
- Real-time queries hit Fluss directly; they don't wait for Iceberg snapshots (see the sketch below)
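In streaming mode, a plain Flink SQL query over the Fluss table keeps receiving new rows as they are pushed, with no snapshot-polling loop; a minimal sketch, reusing the hypothetical `orders` table from earlier:

```sql
-- Continuous query over the hot tier: results update as new rows arrive,
-- instead of waiting for the next Iceberg snapshot to be discovered.
SET 'execution.runtime-mode' = 'streaming';

SELECT order_id, amount, event_time
FROM orders
WHERE amount > 100;
```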
### Solution 2: Primary Key Tables with Native Upsert Support
**Addresses:** Gap 3 (Primary Key Not Enforced) & Gap 4 (Update Write Amplification)
Fluss provides first-class primary key semantics using an LSM tree architecture:
```
┌────────────────────────────────────────────────────┐
│ Primary Key Table (e.g., inventory)                │
├────────────────────────────────────────────────────┤
│ KV Tablet (RocksDB - current state):               │
│   sku_id=101 → {quantity: 50, ...}                 │
│   sku_id=102 → {quantity: 23, ...}                 │
│                                                    │
│ Log Tablet (changelog - replicated 3x):            │
│   +I[101, 100, ...]                      // Insert │
│   -U[101, 100, ...] +U[101, 50, ...]     // Update │
│   -D[102, 23, ...]                       // Delete │
│                                                    │
│ Features:                                          │
│ - Enforced uniqueness (PK constraint)              │
│ - 500K+ QPS point queries (RocksDB index)          │
│ - Pre-deduplicated changelog (CDC)                 │
│ - Read-your-writes consistency                     │
│ - Changelog Read for streaming consumers           │
│ - Lookup Join for dimensional enrichment           │
└────────────────────────────────────────────────────┘
```
**How it works:**
When you write to a primary key table, Fluss:
1. Reads current value from RocksDB (if exists)
2. Determines change type: `+I` (insert), `-U/+U` (update), `-D` (delete)
3. Appends changelog to replicated log tablet (WAL)
4. Waits for log commit (high watermark)
5. Flushes to RocksDB
6. Acknowledges client
**Guarantees:** Read-your-writes consistency. If you see a change in the log, you can query it; if you see a record in the table, its change exists in the changelog.
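Seen from the SQL layer, the write path above boils down to upsert semantics plus indexed lookups. The sketch below is illustrative Flink SQL against a Fluss catalog: Flink's DDL requires the `NOT ENFORCED` keyword, the computed `PROCTIME()` column is assumed to be supported, and the table names are hypothetical.

```sql
-- Hypothetical primary key table; Fluss keys the KV tablet on sku_id.
CREATE TABLE inventory (
  sku_id   BIGINT,
  quantity INT,
  PRIMARY KEY (sku_id) NOT ENFORCED
);

-- Writing the same key twice produces an update (-U/+U in the changelog),
-- not a duplicate row.
INSERT INTO inventory VALUES (101, 100);
INSERT INTO inventory VALUES (101, 50);

-- Lookup join for dimensional enrichment: point lookups hit the
-- RocksDB-indexed KV tablet instead of an external KV store.
CREATE TABLE orders_stream (
  order_id  BIGINT,
  sku_id    BIGINT,
  proc_time AS PROCTIME()
);

SELECT o.order_id, o.sku_id, i.quantity
FROM orders_stream AS o
JOIN inventory FOR SYSTEM_TIME AS OF o.proc_time AS i
  ON o.sku_id = i.sku_id;
```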
**Critical Capabilities:**
- **Point queries by primary key** use RocksDB index, achieving **500,000+ QPS on single tables** in production
- This replaces external KV stores (Redis, DynamoDB) for dimension table serving
- **CDC without deduplication:** The changelog is already deduplicated at write time
- Downstream Flink consumers read pre-processed CDC events with **no additional state required**
**Tiering Primary Key Tables to Iceberg:**
When tiering PK tables, Fluss:
1. First writes the snapshot (no delete files)
2. Then writes changelog records
3. For changelog entries (`-D` or `-U` records), Fluss writes equality-delete files containing all column values
This design enables the Iceberg data to serve as the Fluss changelog: complete records can be reconstructed from the equality-delete entries for downstream consumption.
### Solution 3: Unified Ingestion with Automatic Tiering
**Addresses:** All gaps by offloading real-time complexity from Iceberg
**Architecture Flow:**
```
Application
    │
    └── Fluss Table (single write)
          │
          ├── Real-time consumers (Flink, StarRocks, etc.)
          │     - Sub-second latency
          │     - Column-projected streaming reads
          │     - Primary key lookups
          │
          └── Tiering Service (Flink job, automatic)
                └── Apache Iceberg
                      - Parquet files
                      - Atomic commits
                      - Historical analytics
```
**Tiering Service Architecture:**
Stateless Flink jobs: Multiple jobs register with the Fluss Coordinator Server. The coordinator assigns tables to jobs via an in-memory queue. Each job processes one table at a time, commits results, and requests the next table.
**Key Capabilities:**
- **Auto-create lake table:** Automatically provisions the corresponding Iceberg table
- **Handles both table types seamlessly:** append-only Log Tables for event streams and Primary Key Tables with full upsert/delete support
- **Auto mapping schema:** Translates Fluss schema to Iceberg schema with system columns
- **Arrow → Parquet conversion:** Transforms columnar Arrow batches to Parquet format
- **Freshness in minutes:** Configurable via `table.datalake.freshness` property
**Benefits:**
- **Elastic scaling:** Deploy 3 jobs for 3x throughput, stop idle jobs to reclaim resources
- **No single point of failure:** Job failure doesn't block all tables
- **Load balancing:** Automatic distribution based on sync lag
**Integrated Compaction:**
While tiering data, the service optionally performs bin-packing compaction:
1. Scans manifests to identify small files in the current bucket
2. Schedules background rewrite (small files β large files)
3. Waits for compaction before committing
4. Produces atomic snapshot: new data files + compacted files
**Configuration:** `table.datalake.auto-maintenance=true`
**Result:** Streaming workloads avoid small file proliferation without separate maintenance jobs.
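As a configuration sketch, both properties from this section could be applied to an existing table with standard Flink DDL; whether they can be changed after table creation is an assumption to verify against the Fluss documentation.

```sql
-- Freshness bounds how far Iceberg may lag behind Fluss; auto-maintenance
-- lets the tiering service compact small files as part of its commits.
ALTER TABLE orders SET (
  'table.datalake.freshness'        = '1min',
  'table.datalake.auto-maintenance' = 'true'
);
```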
### Solution 4: Union Read for Seamless Query Across Tiers
**Enables:** Querying hot + cold data as a single logical table
The architectural breakthrough enabling a real-time lakehouse is **client-side stitching with metadata coordination**. This is what makes Fluss truly a **Streaming Lakehouse**: it unlocks real-time data for the lakehouse by unioning the most recent minutes of the delta log held in Fluss.
### How Union Read Works

Union Read seamlessly combines hot and cold data through intelligent offset coordination, as illustrated above:
**The Example:** Consider a query that needs records for users Jark, Mehul, and Yuxia:
1. **Offset Coordination:** Fluss CoordinatorServer provides Snapshot 06 as the Iceberg boundary. At this snapshot, Iceberg contains `{Jark: 30, Yuxia: 20}`.
2. **Hot Data Supplement:** Fluss's real-time layer holds the latest updates beyond the snapshot: `{Jark: 30, Mehul: 20, Yuxia: 20}` (including Mehul's new record).
3. **Union Read in Action:** The query engine performs a union read:
- Reads `{Jark: 30, Yuxia: 20}` from Iceberg (Snapshot 06)
- Supplements with `{Mehul: 20}` from Fluss (new data after the snapshot)
4. **Sort Merge:** Results are merged and deduplicated, producing the final unified view: `{Jark: 30, Mehul: 20, Yuxia: 20}` (Yuxia's update is already in Iceberg, so the Fluss copy is deduplicated away).
**Key Benefit:** The application queries a single logical table while the system intelligently routes between Iceberg (historical) and Fluss (real-time) with zero gaps or overlaps.
**Union Read Capabilities:**
- **Query both Historical & Real-time Data:** Seamlessly access cold and hot tiers
- **Exchange using Arrow-native format:** Efficient data transfer between tiers
- **Efficient processing and integration for query engines:** Optimized for Flink, StarRocks, and Spark
**SQL Syntax (Apache Flink):**
```sql
-- Automatic union read (default behavior)
SELECT * FROM orders WHERE event_time > NOW() - INTERVAL '1' HOUR;
-- Seamlessly reads: Iceberg (historical) + Fluss (real-time)
-- Explicit cold-only read (Iceberg only)
SELECT * FROM orders$lake WHERE order_date < CURRENT_DATE;
```
**Consistency Guarantees:**
The tiering service embeds Fluss offset metadata in Iceberg snapshot summaries:
```json
{
"commit-user": "__fluss_lake_tiering",
"fluss-bucket-offset": "[{0: 5000}, {1: 5123}, {2: 5087}, ...]"
}
```
Fluss coordinator persists this mapping. When clients query, they receive the exact offset boundary. Union read logic ensures:
- **No overlaps:** Iceberg handles offsets ≤ boundary
- **No gaps:** Fluss handles offsets > boundary
- **Total ordering preserved** per bucket
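To see this bookkeeping for yourself, Iceberg's `snapshots` metadata table exposes each snapshot's summary map; a Spark SQL sketch with hypothetical catalog and table names:

```sql
-- Spark SQL: the tiering service's commits carry the Fluss bucket offsets
-- in the snapshot summary, which the coordinator uses as the union boundary.
SELECT
  committed_at,
  snapshot_id,
  summary['commit-user']         AS commit_user,
  summary['fluss-bucket-offset'] AS fluss_bucket_offset
FROM my_catalog.db.orders.snapshots
ORDER BY committed_at DESC;
```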
## Architecture Benefits
### Cost-Efficient Historical Storage

Automatic tiering optimizes storage and analytics: efficient backfill, projection/filter pushdown, high Parquet compression, and S3 throughput.
### Real-Time Analytics

Union Read delivers sub-second lakehouse freshness: union delta log on Fluss, Arrow-native exchange, and seamless integration with Flink, Spark, Trino, and StarRocks.
### Key Takeaways
**Single write, two read modes:** Write once to Fluss. Query two ways: the `$lake` suffix for cost-efficient historical batch analysis, or the default table (no suffix) for a unified view with second-level freshness.
**Automatic data movement:** Data automatically tiers from Fluss → Iceberg after the configured freshness interval (e.g., 1 minute). No manual ETL jobs or data pipelines to maintain. Freshness is configurable per table.
**Unified table abstraction:** Applications see a single logical table. Query engine transparently routes to appropriate tier(s). Offset-based coordination ensures no gaps or duplicates.
## Getting Started
The lakehouse quickstart gives you a working streaming lakehouse environment in minutes. Visit: [https://fluss.apache.org/docs/quickstart/lakehouse/](https://fluss.apache.org/docs/quickstart/lakehouse/)
## Conclusion: The Path Forward
Apache Fluss and Apache Iceberg represent a fundamental rethinking of real-time lakehouse architecture. Instead of forcing Iceberg to become a streaming platform (which it was never designed to be), Fluss embraces Iceberg for its strengths, namely cost-efficient analytical storage with ACID guarantees, while adding the missing hot streaming layer.
The result is a Streamhouse that delivers:
- **Sub-second query latency** for real-time workloads
- **Second-level freshness** for analytical queries (versus T+1 hour)
- **80% cost reduction** by eliminating data duplication and system complexity
- **Single write path** ending dual-write consistency problems
- **Automatic lifecycle management** from hot to cold tiers
For software/data engineers building real-time analytics platforms, the question isn't whether to use Fluss or Iceberg; it's recognizing they solve complementary problems. Fluss handles what happens in the last hour (streaming, updates, real-time queries). Iceberg handles everything before that (historical analytics, ML training, compliance).
### When to Adopt
**Strong signals you need Fluss:**
- Requirement for sub-second query latency on streaming data
- High-frequency CDC workloads (100+ updates/second per key)
- Need for primary key semantics with indexed lookups
- Large Flink stateful jobs (10TB+ state) that could be externalized
- Desire to unify real-time and historical queries
- Tired of maintaining dual infrastructure: one system for batch, another for real-time
### Next Steps
1. **Explore the documentation:** [fluss.apache.org](https://fluss.apache.org)
2. **Review FIP-3:** Detailed Iceberg integration specification
3. **Try the quickstart:** Deploy locally with Docker Compose
4. **Join the community:** Apache Fluss mailing lists, Slack, and GitHub
5. **Evaluate Iceberg integration:** Production-ready today, same architectural patterns
---
We've covered **what** Fluss × Iceberg is and **how** it works: the architecture eliminates dual-write complexity, delivers sub-second freshness, and unifies streaming and batch under a single table abstraction.
But here's the elephant in the room: **Apache Kafka dominates event streaming. Tableflow handles Kafka-to-Iceberg materialization. Why introduce another system?**
**Stay tuned for Part 2 as it tackles this question head-on** by comparing Fluss with existing technologies.