Preface
The previous article summarized rollback via Spark SQL. Hudi CLI can also perform rollbacks, so this article covers installing and configuring the Hudi CLI and the problems encountered along the way.
Official documentation
https://hudi.apache.org/cn/docs/cli/
Versions
Hudi 0.13.0
Spark 3.2.3
Build
mvn clean package -DskipTests -Drat.skip=true -Dscala-2.12 -Dspark3.2 -Pflink-bundle-shade-hive3 -Dflink1.15
Create the hudi-cli directory
mkdir hudi-cli
Upload
Upload the following build outputs to the hudi-cli directory (a hedged upload sketch follows this list):
conf
hudi-cli-bundle_2.12-0.13.0.jar
hudi-cli-with-bundle.sh
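A minimal sketch, assuming the artifacts sit in the standard Hudi source-tree locations under packaging/hudi-cli-bundle after the build; the remote host and the target path /opt/hudi-cli are placeholders:
# Adjust paths to your checkout and environment before running.
scp -r packaging/hudi-cli-bundle/conf \
    packaging/hudi-cli-bundle/target/hudi-cli-bundle_2.12-0.13.0.jar \
    packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh \
    user@server:/opt/hudi-cli/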

Fix file permissions and line endings
chmod -R 777 hudi-cli-with-bundle.sh
dos2unix ./hudi-cli-with-bundle.sh
dos2unix conf/hudi-env.sh
Environment variables
conf/hudi-env.sh
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop/conf"}
export SPARK_CONF_DIR=${SPARK_CONF_DIR:-"/etc/spark2/conf"}
#export CLIENT_JAR=${CLIENT_JAR}
export CLI_BUNDLE_JAR=/opt/hudi-cli/hudi-cli-bundle_2.12-0.13.0.jar
export SPARK_BUNDLE_JAR=/usr/hdp/3.1.0.0-78/spark/jars/hudi-spark3.2-bundle_2.12-0.13.0.jar
export SPARK_HOME=/usr/hdp/3.1.0.0-78/spark
Usage
At this point the CLI is ready to use:
./hudi-cli-with-bundle.sh
connect /tmp/hudi/test_flink_mor
commits show --desc true --limit 10

Resolving warnings and exceptions
Modifying the script
The modified script is as follows:
#!/usr/bin/env bash
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
JAKARTA_EL_VERSION=3.0.3
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
echo "DIR is ${DIR}"
. "${DIR}"/conf/hudi-env.sh
if [ -z "$CLI_BUNDLE_JAR" ]; then
echo "Inferring CLI_BUNDLE_JAR path assuming this script is under Hudi repo"
CLI_BUNDLE_JAR=`ls $DIR/target/hudi-cli-bundle*.jar | grep -v source | grep -v javadoc`
fi
if [ -z "$SPARK_BUNDLE_JAR" ]; then
echo "Inferring SPARK_BUNDLE_JAR path assuming this script is under Hudi repo"
SPARK_BUNDLE_JAR=`ls $DIR/../hudi-spark-bundle/target/hudi-spark*-bundle*.jar | grep -v source | grep -v javadoc`
fi
echo "CLI_BUNDLE_JAR: $CLI_BUNDLE_JAR"
echo "SPARK_BUNDLE_JAR: $SPARK_BUNDLE_JAR"
HUDI_CONF_DIR="${DIR}"/conf
# hudi aux lib contains jakarta.el jars, which need to be put directly on class path
#HUDI_AUX_LIB="${DIR}"/auxlib
if [ ! -d $HUDI_AUX_LIB ]; then
echo "Downloading necessary auxiliary jars for Hudi CLI"
wget https://repo1.maven.org/maven2/org/glassfish/jakarta.el/$JAKARTA_EL_VERSION/jakarta.el-$JAKARTA_EL_VERSION.jar -P auxlib
wget https://repo1.maven.org/maven2/jakarta/el/jakarta.el-api/$JAKARTA_EL_VERSION/jakarta.el-api-$JAKARTA_EL_VERSION.jar -P auxlib
fi
if [ -z "$CLI_BUNDLE_JAR" ] || [ -z "$SPARK_BUNDLE_JAR" ]; then
echo "Make sure to generate both the hudi-cli-bundle.jar and hudi-spark-bundle.jar before running this script."
exit
fi
if [ -z "$SPARK_HOME" ]; then
echo "SPARK_HOME not set, setting to /usr/local/spark"
export SPARK_HOME="/usr/local/spark"
fi
echo "Running : java -cp ${HUDI_CONF_DIR}:${HUDI_AUX_LIB}/*:${SPARK_HOME}/*:${SPARK_HOME}/jars/*:${HADOOP_CONF_DIR}:${SPARK_CONF_DIR}:${CLI_BUNDLE_JAR}:${SPARK_BUNDLE_JAR} -DSPARK_CONF_DIR=${SPARK_CONF_DIR} -DHADOOP_CONF_DIR=${HADOOP_CONF_DIR} org.apache.hudi.cli.Main $@"
java -cp ${HUDI_CONF_DIR}:${HUDI_AUX_LIB}/*:${SPARK_HOME}/*:${SPARK_HOME}/jars/*:${HADOOP_CONF_DIR}:${SPARK_CONF_DIR}:${CLI_BUNDLE_JAR}:${SPARK_BUNDLE_JAR} -DSPARK_CONF_DIR=${SPARK_CONF_DIR} -DHADOOP_CONF_DIR=${HADOOP_CONF_DIR} org.apache.hudi.cli.Main $@
Downloading the HUDI_AUX_LIB jars manually
Since the company intranet cannot reach the Maven repository, you can download the jars manually and upload them to the corresponding directory, which also avoids re-downloading them on every run (see the sketch after the command below):
mkdir auxlib
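A minimal sketch of fetching the two jakarta.el jars named in the script (version 3.0.3, matching JAKARTA_EL_VERSION) on a machine with internet access and uploading them into auxlib; the host and target path are placeholders:
# Run on a machine with internet access.
wget https://repo1.maven.org/maven2/org/glassfish/jakarta.el/3.0.3/jakarta.el-3.0.3.jar
wget https://repo1.maven.org/maven2/jakarta/el/jakarta.el-api/3.0.3/jakarta.el-api-3.0.3.jar
# Upload into the auxlib directory next to hudi-cli-with-bundle.sh (placeholder host/path).
scp jakarta.el-3.0.3.jar jakarta.el-api-3.0.3.jar user@server:/opt/hudi-cli/auxlib/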

Log configuration parsing errors
Error log
ERROR StatusLogger Unrecognized format specifier [d]
ERROR StatusLogger Unrecognized conversion specifier [d] starting at position 16 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [thread]
ERROR StatusLogger Unrecognized conversion specifier [thread] starting at position 25 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [level]
ERROR StatusLogger Unrecognized conversion specifier [level] starting at position 35 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [logger]
ERROR StatusLogger Unrecognized conversion specifier [logger] starting at position 47 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [msg]
ERROR StatusLogger Unrecognized conversion specifier [msg] starting at position 54 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [n]
ERROR StatusLogger Unrecognized conversion specifier [n] starting at position 56 in conversion pattern.
Solution
Fix: delete the Log4j2Plugins.dat file from hive-exec-2.3.9-core.jar under the Spark jars directory (see the sketch below).
Root cause: a Log4j2Plugins.dat file conflict.
Reference: https://www.cnblogs.com/yeyang/p/10485790.html
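A minimal sketch of removing the entry in place with zip -d; back up the jar first, and note that the in-jar path below is the standard Log4j 2 plugin-descriptor location, assumed here rather than taken from the original article:
JAR=/usr/hdp/3.1.0.0-78/spark/jars/hive-exec-2.3.9-core.jar
cp $JAR $JAR.bak  # keep a backup before modifying the jar
zip -d $JAR 'META-INF/org/apache/logging/log4j/core/config/plugins/Log4j2Plugins.dat'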

How the conflicting file was found
Based on the CLI startup command line:
java -cp /opt/hudi-cli/conf:/opt/hudi-cli/auxlib/*:/usr/hdp/3.1.0.0-78/spark/*:/usr/hdp/3.1.0.0-78/spark/jars/*:/etc/hadoop/conf:/etc/spark2/conf:/opt/hudi-cli/hudi-cli-bundle_2.12-0.13.0.jar:/usr/hdp/3.1.0.0-78/spark/jars/hudi-spark3.2-bundle_2.12-0.14.1.jar -DSPARK_CONF_DIR=/etc/spark2/conf -DHADOOP_CONF_DIR=/etc/hadoop/conf org.apache.hudi.cli.Main
Search every path on that classpath for a file containing Log4j2Plugins.dat. This eventually turned up Log4j2Plugins.dat inside hive-exec-2.3.9-core.jar under the Spark jars directory; deleting it from the jar resolved the problem.
grep -rl Log4j2Plugins.dat /usr/hdp/3.1.0.0-78/spark/jars/*
/usr/hdp/3.1.0.0-78/spark/jars/hive-exec-2.3.9-core.jar
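To double-check the exact path of the entry inside the matching jar before deleting it, listing the archive works; a small sketch:
unzip -l /usr/hdp/3.1.0.0-78/spark/jars/hive-exec-2.3.9-core.jar | grep Log4j2Plugins.dat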

Final result

Rollback
Usage
hudi->help
AVAILABLE COMMANDS
Archived Commits Command
trigger archival: trigger archival
show archived commits: Read commits from archived files and show details
show archived commit stats: Read commits from archived files and show details
Bootstrap Command
bootstrap run: Run a bootstrap action for current Hudi table
bootstrap index showmapping: Show bootstrap index mapping
bootstrap index showpartitions: Show bootstrap indexed partitions
Built-In Commands
help: Display help about available commands
stacktrace: Display the full stacktrace of the last error.
clear: Clear the shell screen.
quit, exit: Exit the shell.
history: Display or save the history of previously run commands
version: Show version info
script: Read and execute commands from a file.
Cleans Command
cleans show: Show the cleans
clean showpartitions: Show partition level details of a clean
cleans run: run clean
Clustering Command
clustering run: Run Clustering
clustering scheduleAndExecute: Run Clustering. Make a cluster plan first and execute that plan immediately
clustering schedule: Schedule Clustering
Commits Command
commits compare: Compare commits with another Hoodie table
commits sync: Sync commits with another Hoodie table
commit showpartitions: Show partition level details of a commit
commits show: Show the commits
commits showarchived: Show the archived commits
commit showfiles: Show file level details of a commit
commit show_write_stats: Show write stats of a commit
Compaction Command
compaction run: Run Compaction for given instant time
compaction scheduleAndExecute: Schedule compaction plan and execute this plan
compaction showarchived: Shows compaction details for a specific compaction instant
compaction repair: Renames the files to make them consistent with the timeline as dictated by Hoodie metadata. Use when compaction unschedule fails partially.
compaction schedule: Schedule Compaction
compaction show: Shows compaction details for a specific compaction instant
compaction unscheduleFileId: UnSchedule Compaction for a fileId
compaction validate: Validate Compaction
compaction unschedule: Unschedule Compaction
compactions show all: Shows all compactions that are in active timeline
compactions showarchived: Shows compaction details for specified time window
Diff Command
diff partition: Check how file differs across range of commits. It is meant to be used only for partitioned tables.
diff file: Check how file differs across range of commits
Export Command
export instants: Export Instants and their metadata from the Timeline
File System View Command
show fsview all: Show entire file-system view
show fsview latest: Show latest file-system view
HDFS Parquet Import Command
hdfsparquetimport: Imports Parquet table to a hoodie table
Hoodie Log File Command
show logfile records: Read records from log files
show logfile metadata: Read commit metadata from log files
Hoodie Sync Validate Command
sync validate: Validate the sync by counting the number of records
Kerberos Authentication Command
kerberos kdestroy: Destroy Kerberos authentication
kerberos kinit: Perform Kerberos authentication
Markers Command
marker delete: Delete the marker
Metadata Command
metadata stats: Print stats about the metadata
metadata list-files: Print a list of all files in a partition from the metadata
metadata list-partitions: List all partitions from metadata
metadata validate-files: Validate all files in all partitions from the metadata
metadata delete: Remove the Metadata Table
metadata create: Create the Metadata Table if it does not exist
metadata init: Update the metadata table from commits since the creation
metadata set: Set options for Metadata Table
Repairs Command
repair deduplicate: De-duplicate a partition path contains duplicates & produce repaired files to replace with
rename partition: Rename partition. Usage: rename partition --oldPartition <oldPartition> --newPartition <newPartition>
repair overwrite-hoodie-props: Overwrite hoodie.properties with provided file. Risky operation. Proceed with caution!
repair migrate-partition-meta: Migrate all partition meta file currently stored in text format to be stored in base file format. See HoodieTableConfig#PARTITION_METAFILE_USE_DATA_FORMAT.
repair addpartitionmeta: Add partition metadata to a table, if not present
repair deprecated partition: Repair deprecated partition ("default"). Re-writes data from the deprecated partition into __HIVE_DEFAULT_PARTITION__
repair show empty commit metadata: show failed commits
repair corrupted clean files: repair corrupted clean files
Rollbacks Command
show rollback: Show details of a rollback instant
commit rollback: Rollback a commit
show rollbacks: List all rollback instants
Savepoints Command
savepoint rollback: Savepoint a commit
savepoints show: Show the savepoints
savepoint create: Savepoint a commit
savepoint delete: Delete the savepoint
Spark Env Command
set: Set spark launcher env to cli
show env: Show spark launcher env by key
show envs all: Show spark launcher envs
Stats Command
stats filesizes: File Sizes. Display summary stats on sizes of files
stats wa: Write Amplification. Ratio of how many records were upserted to how many records were actually written
Table Command
table update-configs: Update the table configs with configs with provided file.
table recover-configs: Recover table configs, from update/delete that failed midway.
refresh, metadata refresh, commits refresh, cleans refresh, savepoints refresh: Refresh table metadata
create: Create a hoodie table if not present
table delete-configs: Delete the supplied table configs from the table.
fetch table schema: Fetches latest table schema
connect: Connect to a hoodie table
desc: Describe Hoodie Table properties
Temp View Command
temp_query, temp query: query against created temp view
temps_show, temps show: Show all views name
temp_delete, temp delete: Delete view name
Timeline Command
metadata timeline show incomplete: List all incomplete instants in active timeline of metadata table
metadata timeline show active: List all instants in active timeline of metadata table
timeline show incomplete: List all incomplete instants in active timeline
timeline show active: List all instants in active timeline
Upgrade Or Downgrade Command
downgrade table: Downgrades a table
upgrade table: Upgrades a table
Utils Command
utils loadClass: Load a class
hudi->help commit rollback
NAME
commit rollback - Rollback a commit
SYNOPSIS
commit rollback [--commit String] --sparkProperties String --sparkMaster String --sparkMemory String --rollbackUsingMarkers String
OPTIONS
--commit String
Commit to rollback
[Mandatory]
--sparkProperties String
Spark Properties File Path
[Optional]
--sparkMaster String
Spark Master
[Optional]
--sparkMemory String
Spark executor memory
[Optional, default = 4G]
--rollbackUsingMarkers String
Enabling marker based rollback
[Optional, default = false]
Example
connect /tmp/hudi/test_flink_mor
commit rollback --commit 20240606144849707
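The optional flags from the synopsis can also be passed explicitly; a hedged variant (the Spark master and memory values here are illustrative, not taken from the original run):
commit rollback --commit 20240606144849707 --sparkMaster local[2] --sparkMemory 4G --rollbackUsingMarkers false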


Problems
Problem 1
Only the first rollback succeeds; the second and any subsequent rollbacks fail:

A failed commit cannot be rolled back either:

The root cause is the same as in the previous article, but upgrading Hudi to 0.14.1 did not fix it, which suggests the relevant logic in the Hudi CLI code has not been patched.
Problem 2
There are no Spark history logs, so the History Server UI cannot display the application and returns a 404 error.
Cause: the source code sets spark.eventLog.enabled=false. Even though spark-defaults.conf sets spark.eventLog.enabled=true, the setting made in code takes precedence, so no history logs are written.
Solution: modify the source to remove the default spark.eventLog.enabled setting so that the spark-defaults.conf value from the environment takes effect, then rebuild.
Source location: SparkUtil.getDefaultConf
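A quick way to locate the hard-coded setting in a Hudi source checkout before removing it; a sketch, since the exact file location may vary across versions:
# Run from the root of the Hudi source tree.
grep -rn "spark.eventLog.enabled" --include="*.java" --include="*.scala" .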



Installation package
Link: https://pan.baidu.com/s/1dbO8bh20R_9wFgf7_m9zhQ?pwd=4alo
Extraction code: 4alo