Troubleshooting Common TORC Compact Failures

2024-04-02 16:43:56

Abstract: This article explains how to resolve common failures involving TORC tables and how to use the compaction blacklist mechanism.

Related links:

ORC Transaction Compact: How It Works and How to Use It

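Background

An ORC transaction table is an ORC table in Inceptor that supports CRUD operations. Its basic principle is that every CRUD operation (insert, update, delete, merge into) generates a corresponding version, and a compact mechanism in the system compacts each ORC transaction table, merging multiple versions into one.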

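Common errors and solutions

1. What to do when a table is on the blacklist

If the show compact blacklist command shows that the table is on the blacklist, compaction has already failed on it several times. If the logs of the earlier failures are available, analyze the errors there. If they cannot be found, first remove the table or partition from the blacklist (alter table table_name enable compact;), then trigger compaction manually (alter table table_name compact 'major';), and analyze the outcome against the cases below.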
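
The recovery order described above can be sketched as a small helper that only assembles the statements from the text. The function name and the identifier check are illustrative assumptions; the script does not connect to Inceptor, and the statements would still be run by hand in beeline.

```python
import re

def blacklist_recovery_statements(table_name):
    """Return, in order, the statements the text suggests for a blacklisted table."""
    # Hypothetical safety check: accept only plain identifiers (optionally db.table).
    if not re.fullmatch(r"[A-Za-z_]\w*(\.[A-Za-z_]\w*)?", table_name):
        raise ValueError(f"unexpected table name: {table_name!r}")
    return [
        "show compact blacklist;",                     # confirm the table is listed
        f"alter table {table_name} enable compact;",   # remove it from the blacklist
        f"alter table {table_name} compact 'major';",  # trigger compaction manually
    ]

if __name__ == "__main__":
    for statement in blacklist_recovery_statements("table_name"):
        print(statement)
```

If the manual compaction fails again, the new error in the log is what should be matched against the cases that follow.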

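2. Error: NPE or dir is null, the table or partition is invalid for compaction at present

When the log shows an NPE (java.lang.NullPointerException) or dir is null, the table or partition is invalid for compaction at present, the cause is usually high concurrency: if some tables or partitions stay in the open state for a long time (neither committed nor rolled back), none of the tables in the system can meet the compaction conditions. In that case, wait until concurrency drops or until operations on ORC transaction tables are idle, and then trigger compaction.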

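3. Data skew

On the YARN page, most tasks finish quickly while a few run for a very long time. Opening one of the long-running tasks shows that it was retried and that every retry failed. This is caused by MapReduce speculative execution: when the retried attempt fails, the whole task is killed. Setting the YARN parameter mapreduce.map.speculative=false and restarting YARN and Inceptor resolves the problem.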
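
As a rough aid for spotting the pattern described above, the sketch below flags tasks whose duration is far above the median. It assumes the task durations (in seconds) have been copied by hand from the YARN page; the function name and the factor of 5 are arbitrary illustrative choices, not values taken from Inceptor or YARN.

```python
from statistics import median

def find_stragglers(durations_sec, factor=5.0):
    """Return indices of tasks whose duration exceeds factor x the median duration."""
    if not durations_sec:
        return []
    typical = median(durations_sec)
    return [i for i, d in enumerate(durations_sec) if d > factor * typical]

if __name__ == "__main__":
    # Hypothetical durations: four quick tasks and one very slow one.
    print(find_stragglers([10, 12, 11, 9, 600]))
```

A non-empty result is only a hint that the job is skewed; whether speculative execution is killing the retried attempts still has to be confirmed in the task details.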

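Fault diagnosis

Normally, most ORC transaction tables in the system have a base directory. If an ORC transaction table has many delta directories (more than 10) but has not been compacted automatically, something has already gone wrong in the system. The following steps give an initial diagnosis.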
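
The check described above can be sketched as follows. It assumes the directory names under a table or partition path have already been collected (for example from an HDFS listing) and that they follow the base_N and delta_N_N naming that appears later in this article; the function name and the returned keys are illustrative.

```python
import re

DELTA_RE = re.compile(r"^delta_\d+_\d+$")
BASE_RE = re.compile(r"^base_\d+$")

def summarize_versions(dir_names, max_deltas=10):
    """Count delta directories and flag the table when there are more than max_deltas."""
    delta_count = sum(1 for name in dir_names if DELTA_RE.match(name))
    has_base = any(BASE_RE.match(name) for name in dir_names)
    return {
        "delta_count": delta_count,
        "has_base": has_base,
        # More than 10 deltas without automatic compaction is the warning sign in the text.
        "suspicious": delta_count > max_deltas,
    }

if __name__ == "__main__":
    names = ["base_0000005"] + [f"delta_{i:07d}_{i:07d}" for i in range(6, 18)]
    print(summarize_versions(names))
```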

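Diagnosis procedure

1. First, run a select statement in Inceptor to check whether the table can be queried normally. If the select fails, the problem lies in the ORC table itself. If the select succeeds, the problem lies in the compact system.

2. When the compact system is at fault, first take a jstack of the process that hosts the compact service and check whether it contains the following three kinds of threads: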

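a. Initiator thread: checks whether each ORC transaction table meets the compaction conditions
b. Worker thread: submits MapReduce jobs to YARN to compact tables
c. Cleaner thread: removes redundant versions; for example, once base_0246405 has been generated, delta_0246404_0246404 and delta_0246405_0246405 can be deleted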
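
To make the cleaner's role concrete, here is a minimal sketch of which delta directories become redundant once a base exists. It assumes that a delta is covered when its upper version number is not greater than the newest base version, which matches the base_0246405 example above but is not confirmed as Inceptor's exact rule; the function name is hypothetical.

```python
import re

def obsolete_deltas(dir_names):
    """Return delta directories whose versions are already contained in the newest base."""
    base_versions = []
    for name in dir_names:
        m = re.fullmatch(r"base_(\d+)", name)
        if m:
            base_versions.append(int(m.group(1)))
    if not base_versions:
        return []  # nothing has been compacted yet, so no delta is redundant
    newest_base = max(base_versions)
    covered = []
    for name in dir_names:
        m = re.fullmatch(r"delta_(\d+)_(\d+)", name)
        # Assumption: the second number is the highest version stored in the delta.
        if m and int(m.group(2)) <= newest_base:
            covered.append(name)
    return covered

if __name__ == "__main__":
    print(obsolete_deltas([
        "base_0246405",
        "delta_0246404_0246404",
        "delta_0246405_0246405",
        "delta_0246406_0246406",
    ]))
```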

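Without an initiator thread, the system cannot compact tables automatically, although compaction can still be triggered manually through alter table. Without a worker thread, the system cannot compact at all. In both cases, information about ORC transaction tables piles up in the metastore, which severely degrades performance and can even bring the system down. The cleaner thread is less critical, but without it too many redundant versions accumulate on HDFS. All three threads should therefore be running normally in the metastore process; if any of them is missing, restart the metastore.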
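
The thread check can be partly automated once the jstack output has been saved to a file. The sketch below scans thread header lines (jstack prints each thread's name in double quotes at the start of its header line) for three keywords. The keyword patterns are assumptions, since the exact thread names depend on the Inceptor version, and a loose pattern such as "worker" can also match unrelated threads, so the patterns should be adjusted to the names actually seen in your dump.

```python
import re

# Assumed patterns; replace them with the thread names seen in your own jstack output.
DEFAULT_PATTERNS = {
    "initiator": r"initiator",
    "worker": r"worker",
    "cleaner": r"cleaner",
}

def missing_compactor_threads(jstack_text, patterns=DEFAULT_PATTERNS):
    """Return the roles for which no thread header line matches the given pattern."""
    headers = [line for line in jstack_text.splitlines() if line.startswith('"')]
    missing = []
    for role, pattern in patterns.items():
        matcher = re.compile(pattern, re.IGNORECASE)
        if not any(matcher.search(header) for header in headers):
            missing.append(role)
    return missing

if __name__ == "__main__":
    import sys
    # Usage: python check_threads.py metastore.jstack
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        print(missing_compactor_threads(handle.read()))
```

An empty list means every pattern matched at least one thread; any role that is listed is a reason to inspect the dump by hand before deciding to restart the metastore.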

Taking the metastore's jstack as an example, these threads appear in the jstack output as follows:

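3. If all three threads in the process hosting compact are normal, then for MapReduce-based compaction also check whether the compaction jobs on YARN are healthy. In one case encountered before, YARN had allocated too many resources to Inceptor, so the compaction jobs could never finish. In that situation, adjust YARN's resource allocation to make sure enough resources are available for compaction.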

4. For partitioned ORC transaction tables, compaction can be triggered manually with the following commands:

4.1 Single-key partition

alter table single_key_partition_table partition col compact 'major'; -- compact partition spec col
alter table single_key_partition_table partition(col='value') compact 'major'; -- compact partition spec col=value

4.2 Multi-key partition

alter table multi_key_partition_table partition(col1='value1', col2='value2') compact 'major';
alter table multi_key_partition_table partition col compact 'major'; -- compact partition spec col

4.3 Single-key range partition

alter table range_partition_table partition part_key compact 'major'; -- the single quotes ' ' are required
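
When many partitions have to be compacted by hand, the statements above can be generated instead of typed. This is a minimal sketch that only builds strings in the three forms shown above; the function name and its parameters are hypothetical, and partition values containing single quotes are not escaped.

```python
def compact_statement(table, partition=None, kind="major"):
    """Build an alter table ... compact statement.

    partition may be None (whole table), a dict of key -> value
    (single-key or multi-key partition), or a string (a partition name,
    as in the range partition form).
    """
    if partition is None:
        spec = ""
    elif isinstance(partition, dict):
        pairs = ", ".join(f"{key}='{value}'" for key, value in partition.items())
        spec = f" partition({pairs})"
    else:
        spec = f" partition {partition}"
    # The compaction type keeps its single quotes, as the text requires.
    return f"alter table {table}{spec} compact '{kind}';"

if __name__ == "__main__":
    print(compact_statement("multi_key_partition_table",
                            {"col1": "value1", "col2": "value2"}))
    print(compact_statement("range_partition_table", "part_key"))
```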

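5. Manual compaction fails, and the metastore log shows the following error:

This happens because the table is already on the blacklist; blacklisted tables cannot be compacted.

A table lands on the blacklist after automatic compaction has failed many times, and it has to be released from the blacklist manually.

alter table table_name_xxxxxx enable compact; -- replace 'table_name_xxxxxx' with the table name

To make sure the statement above succeeds, and to prevent compaction from failing in the extreme case of too many deltas, adjust the following parameter to cap the number of delta versions merged in one pass:

SET hive.compactor.max.num.delta=50; -- default is 500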

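Compaction blacklist mechanism

If compaction of a table or partition has been attempted and has failed several times, the compaction service assumes that further attempts on that table or partition will fail as well. To avoid wasting resources, the compaction service adds the table or partition to the compaction blacklist.

The number of failures is currently controlled by the parameter orc.compact.blacklist.threshold, whose default value is 3.

Once a table or partition is on the blacklist, no compact operation is executed on it, whether compaction is triggered automatically or manually.

There is currently no mechanism for automatic removal from the compaction blacklist.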
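
The behaviour described above can be modelled in a few lines. This is a toy simulation, not Inceptor's implementation: the class and method names are invented, and it assumes that a target is blacklisted when its failure count reaches the threshold and that enabling compaction resets the count, neither of which the text specifies.

```python
class CompactionBlacklist:
    """Toy model of the blacklist: failure counting, blocking, manual enable and disable."""

    def __init__(self, threshold=3):
        # Mirrors the default of orc.compact.blacklist.threshold stated in the text.
        self.threshold = threshold
        self.failures = {}
        self.blacklisted = set()

    def record_failure(self, target):
        self.failures[target] = self.failures.get(target, 0) + 1
        # Assumption: reaching the threshold (not exceeding it) triggers blacklisting.
        if self.failures[target] >= self.threshold:
            self.blacklisted.add(target)

    def can_compact(self, target):
        # Applies to automatic and manual compaction alike.
        return target not in self.blacklisted

    def enable(self, target):
        # Counterpart of "alter table ... enable compact"; there is no automatic removal.
        self.blacklisted.discard(target)
        self.failures.pop(target, None)  # assumption: the counter starts over

    def disable(self, target):
        # Counterpart of "alter table ... disable compact".
        self.blacklisted.add(target)

if __name__ == "__main__":
    blacklist = CompactionBlacklist()
    for _ in range(3):
        blacklist.record_failure("db.t1")
    print(blacklist.can_compact("db.t1"))
    blacklist.enable("db.t1")
    print(blacklist.can_compact("db.t1"))
```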

Viewing the blacklist

In beeline, view it with the following command:

show compact blacklist

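Removing from the blacklist

A table or partition can be removed from the compaction blacklist in either of the following ways.

In beeline, run:

alter table table_name enable compact; -- table
alter table table_name partition (pt='xxx') enable compact; -- single-value partition
alter table table_name partition range_name enable compact; -- range partition

Alternatively, delete the corresponding records from COMPACTION_BLACKLIST directly in MySQL.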

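Manually adding to the blacklist

A table or partition can be added to the blacklist manually to keep it from being compacted.

alter table table_name disable compact; -- table
alter table table_name partition (pt='xxx') disable compact; -- single-value partition
alter table table_name partition range_name disable compact; -- range partition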

Published by: 星小环分享号 (official)
