Is Aggregate Key table suitable for OLTP-like middle state lookup/update workload? #63405
Replies: 2 comments
-
|
During the incident, BE CPU usage was also very high. The machine is We later checked the BE runtime config and found that the compaction related settings had been changed to very conservative values: At that time, frequent small writes caused tablet versions to accumulate quickly, and Doris started rejecting writes with: version count: 4001, exceed limit: 4000 We then adjusted the runtime BE config without restarting Doris: curl -X POST 'http://be_host:8040/api/update_config?max_cumu_compaction_threads=8' After this, the failed writes started to recover, but the BE load was still very high. Example metrics: doris_be_compaction_used_permits 21 So the full picture is:
Questions:
补充一些现场信息: 事故期间, BE 的 CPU 也一直很高. 这台机器是 后面我们检查 BE 运行时配置, 发现 compaction 相关配置被调得比较保守: 当时频繁小写入导致 tablet version 快速堆积, Doris 开始拒绝写入: version count: 4001, exceed limit: 4000 随后我们在不重启 Doris 的情况下动态调整了 BE 配置: curl -X POST 'http://be_host:8040/api/update_config?max_cumu_compaction_threads=8' 调整后, 失败写入开始恢复, 但是 BE 的负载依然很高. 当时 metrics 大致如下: doris_be_compaction_used_permits 21 所以完整情况是:
想补充请教:
|
Beta Was this translation helpful? Give feedback.
-
|
codex 给的建议是: 想不出来有什么办法可以缓解, cpu 几天了, 一直在 90% 以上下不来, 内存也一直在 70% ~ 85% 之间 |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Aggregate Key table becomes bottleneck for point lookup and frequent small writes at ~30M rows
Problem
We are using Apache Doris as a statistics/analytics store. One of our intermediate state tables is an Aggregate Key table, but it has become a severe bottleneck for point queries and frequent small writes.
The table is used as a middle state table for user statistics. The logical key is
(app_id, user_id). The workload looks like OLTP-style state lookup/update, but the table was originally created in Doris because the downstream statistics are all in Doris.Recently, the table grew to about 30 million rows. After that, simple point queries and small insert/delete operations became very slow. Some single-row operations took 2-7 seconds, which caused our Kafka consumer to exceed
max.poll.interval.msand repeatedly rebalance.Environment
2.1.101 FE, 1 BE32C 64G(app_id, user_id)Table Usage
The table stores user-level middle state, for example:
The typical query pattern is:
For a special business rule, when a user logs in again after more than 30 days, we need to treat the login as a new user. The old row is backed up with a new user_id suffix, then the original row is deleted before writing the new login event.
Example:
Symptoms
After the table reached around 30 million rows:
tablet writer write failed
failed to init rowset builder
version count: 4001, exceed limit: 4000
Please reduce the frequency of loading data or adjust the max_tablet_version_num in be.conf to a larger value.
CommitFailedException: Commit cannot be completed since the group has already rebalanced
Question
Is this expected for an Aggregate Key table in Doris?
For this type of table:
What is the recommended Doris table model and design?
Should this type of state table not be stored in Doris at all, and instead be stored in MySQL/OLTP storage, while Doris only handles analytical/statistical aggregation?
If Doris can support this pattern, what table design, bucket strategy, compaction settings, or write pattern should we use?
What we tried
We moved this middle state table to MySQL and only kept final analytical/statistical processing in Doris. After moving the point lookup/state update part to MySQL, the Kafka lag disappeared.
This suggests the bottleneck is specifically the Doris middle state table workload, not Kafka or application logic.
Aggregate Key 表在 3000 万行左右的点查和频繁小写入场景下成为瓶颈
问题描述
我们现在把 Apache Doris 用作统计分析库, 其中有一张中间状态表是 Aggregate Key 表. 这张表主要用于用户统计的中间状态维护, 逻辑主键是 (app_id, user_id).
这个访问模式比较接近 OLTP 风格的状态查询和状态更新. 当初把这张表放在 Doris 里, 是因为后续统计分析也都在 Doris 中完成.
最近这张表增长到大约 3000 万行后, 出现了比较严重的性能问题. 一些简单的点查和小写入操作耗时达到了 2-7 秒, 导致应用侧 Kafka consumer 超过 max.poll.interval.ms, 然后 consumer group 反复 rebalance.
环境信息
表使用方式
这张表存储用户级别的中间状态, 字段大致包括:
典型查询是按 (app_id, user_id) 做点查:
另外有一个特殊业务逻辑: 如果用户超过 30 天再次登录, 需要把这次登录当成新用户. 当前做法是先把原来的 user_id 追加当天日期后备份一条记录, 再删除原来的记录, 然后再写入新的登录事件.
示例:
现象
当这张表增长到大约 3000 万行后, 出现了以下问题:
tablet writer write failed
failed to init rowset builder
version count: 4001, exceed limit: 4000
Please reduce the frequency of loading data or adjust the max_tablet_version_num in be.conf to a larger value.
CommitFailedException: Commit cannot be completed since the group has already rebalanced
想请教的问题
Aggregate Key 表出现这种情况是符合预期的吗?
对于这种场景:
Doris 推荐的表模型和设计方式是什么?
这种中间状态数据是否本来就不应该放在 Doris 中, 而应该放在 MySQL 这类 OLTP 存储里, Doris 只负责最终统计分析?
如果 Doris 可以支持这种场景, 应该如何设计表结构, 分桶策略, compaction 参数, 或写入方式?
我们尝试过的处理
我们后来把这类中间状态表迁移到了 MySQL, Doris 只保留最终统计分析类数据. 迁移后 Kafka 堆积很快消失.
这说明瓶颈主要在 Doris 中间状态表的访问模式上, 不是 Kafka 或应用逻辑本身.
Beta Was this translation helpful? Give feedback.
All reactions