ES: Detecting Whether the Kibana Port Is Down and Restarting the Process Automatically

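This article describes a problem I recently ran into while operating a production ES cluster: the Kibana process kept going down, which made the management UI hard to reach.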

Part 1: Background

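I currently operate a production elasticsearch cluster with 10 nodes across 5 servers. Kibana was originally deployed on one of those servers, but from time to time memory pressure and similar issues made Kibana slow or unreachable. To reduce the single point of failure, I deployed a second Kibana on another server and tied the two together behind a domain name as a primary/backup pair, so that if one Kibana node goes down, the management UI can still be reached through the domain name.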

Part 2: Problem Description

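As more and more applications were onboarded to ES, the load on this cluster kept growing. I recently tuned the JDK used by ES, upgrading it from JDK 1.8 to OpenJDK 16, and ES performance improved after the upgrade. Around the same time, however, one of the Kibana instances began going down frequently, many times a day. After checking the Kibana logs and some references I had looked up, the likely cause appeared to be that too little memory was allocated to Kibana. It had been running with the default memory size (Kibana runs on Node.js, so the relevant limit is the Node.js heap rather than a Java heap). I raised Kibana's memory to 4 GB by editing the startup script bin/kibana. The modified script is as follows: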

#!/bin/sh
SCRIPT=$0

# SCRIPT may be an arbitrarily deep series of symlinks. Loop until we have the concrete path.
while [ -h "$SCRIPT" ] ; do
  ls=$(ls -ld "$SCRIPT")
  # Drop everything prior to ->
  link=$(expr "$ls" : '.*-> \(.*\)$')
  if expr "$link" : '/.*' > /dev/null; then
    SCRIPT="$link"
  else
    SCRIPT=$(dirname "$SCRIPT")/"$link"
  fi
done

DIR="$(dirname "${SCRIPT}")/.."
CONFIG_DIR=${KBN_PATH_CONF:-"$DIR/config"}
NODE="${DIR}/node/bin/node"
test -x "$NODE"
if [ ! -x "$NODE" ]; then
  echo "unable to find usable node.js executable."
  exit 1
fi

if [ -f "${CONFIG_DIR}/node.options" ]; then
  KBN_NODE_OPTS="$(grep -v ^# < ${CONFIG_DIR}/node.options | xargs)"
fi

NODE_OPTIONS="--no-warnings --max-http-header-size=65536 --max-old-space-size=4096 --tls-min-v1.0 $KBN_NODE_OPTS $NODE_OPTIONS" NODE_ENV=production exec "${NODE}" "${DIR}/src/cli/dist" ${@}

## Compared with the original kibana startup script, this adds --max-old-space-size=4096
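
The script above already reads extra Node.js flags from ${CONFIG_DIR}/node.options when that file exists, so the same heap setting could probably be placed there instead of being hard-coded in bin/kibana. The following is a minimal sketch under that assumption; the helper name add_node_option and the example path are illustrative and not part of Kibana.

```shell
#!/bin/sh
# Hypothetical helper (not part of Kibana): append a Node.js flag to a
# node.options file unless an identical line is already present.
add_node_option() {
    options_file=$1
    option=$2
    touch "$options_file"
    # -x whole-line match, -F fixed string, -q quiet;
    # "--" stops option parsing because the flag itself starts with dashes.
    if ! grep -qxF -- "$option" "$options_file"; then
        printf '%s\n' "$option" >> "$options_file"
    fi
}

# Example call (the path is an assumption based on the install directory
# used by the monitoring script later in this article):
# add_node_option /home/esuser/deploy/kibana/config/node.options --max-old-space-size=4096
```

Because the helper checks for an existing identical line first, running it repeatedly leaves a single copy of the flag in the file.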

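After the startup memory size was changed, the backup Kibana node went down less often, but the process still died occasionally. The Kibana log showed the following at the time of a crash: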

{"type":"response","@timestamp":"2024-05-22T11:53:22+08:00","tags":[],"pid":19881,"method":"get","statusCode":200,"req":{"url":"/ui/fonts/inter_ui/Inter-UI-SemiBold.woff2","method":"get","headers":{"connection":"upgrade","host":"fa.vemic.com","sec-ch-ua":"\"Google Chrome\";v=\"125\", \"Chromium\";v=\"125\", \"Not.A/Brand\";v=\"24\"","origin":"https://fa.vemic.com","sec-ch-ua-mobile":"?0","user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36","sec-ch-ua-platform":"\"Windows\"","accept":"*/*","sec-fetch-site":"same-origin","sec-fetch-mode":"cors","sec-fetch-dest":"font","referer":"https://fa.vemic.com/flog/login?next=%2Fflog%2Fs%2Ffsp%2Fapp%2Fdiscover%23%2F%3F_g%3D%28filters%3A%21%28%29%2CrefreshInterval%3A%28pause%3A%21t%2Cvalue%3A0%29%2Ctime%3A%28from%3Anow-15m%2Cto%3Anow%29%29%26_a%3D%28columns%3A%21%28fql%2Cusername%29%2Cfilters%3A%21%28%28%27%24state%27%3A%28store%3AappState%29%2Cmeta%3A%28alias%3A%21n%2Cdisabled%3A%21f%2Cindex%3A%279c5a70b0-257b-11eb-ab0e-6bf15e41beab%27%2Ckey%3Ausername.keyword%2Cnegate%3A%21f%2Cparams%3A%28query%3Apc_qc_keyword_company_search%29%2Ctype%3Aphrase%29%2Cquery%3A%28match_phrase%3A%28username.keyword%3Apc_qc_keyword_company_search%29%29%29%29%2Cindex%3A%279c5a70b0-257b-11eb-ab0e-6bf15e41beab%27%2Cinterval%3Aauto%2Cquery%3A%28language%3Akuery%2Cquery%3A%27%27%29%2Csort%3A%21%28%21%28date%2Cdesc%29%29%29&msg=SESSION_EXPIRED","accept-encoding":"gzip, deflate, br, zstd","accept-language":"zh-CN,zh;q=0.9","x-polyfill-debug":"673069"},"remoteAddress":"192.168.16.12","userAgent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 
Safari/537.36","referer":"https://fa.vemic.com/flog/login?next=%2Fflog%2Fs%2Ffsp%2Fapp%2Fdiscover%23%2F%3F_g%3D%28filters%3A%21%28%29%2CrefreshInterval%3A%28pause%3A%21t%2Cvalue%3A0%29%2Ctime%3A%28from%3Anow-15m%2Cto%3Anow%29%29%26_a%3D%28columns%3A%21%28fql%2Cusername%29%2Cfilters%3A%21%28%28%27%24state%27%3A%28store%3AappState%29%2Cmeta%3A%28alias%3A%21n%2Cdisabled%3A%21f%2Cindex%3A%279c5a70b0-257b-11eb-ab0e-6bf15e41beab%27%2Ckey%3Ausername.keyword%2Cnegate%3A%21f%2Cparams%3A%28query%3Apc_qc_keyword_company_search%29%2Ctype%3Aphrase%29%2Cquery%3A%28match_phrase%3A%28username.keyword%3Apc_qc_keyword_company_search%29%29%29%29%2Cindex%3A%279c5a70b0-257b-11eb-ab0e-6bf15e41beab%27%2Cinterval%3Aauto%2Cquery%3A%28language%3Akuery%2Cquery%3A%27%27%29%2Csort%3A%21%28%21%28date%2Cdesc%29%29%29&msg=SESSION_EXPIRED"},"res":{"statusCode":200,"responseTime":8,"contentLength":94752},"message":"GET /ui/fonts/inter_ui/Inter-UI-SemiBold.woff2 200 8ms - 92.5KB"}
Error: read ETIMEDOUT
    at TCP.onStreamRead (internal/stream_base_commons.js:209:20)

Part 3: Solution

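Raising Kibana's startup memory further might reduce how often it goes down, but the backup Kibana server also hosts many other Java processes, so I did not want to keep increasing it. The initial plan was therefore to monitor the Kibana process: when port 5601 is found not to be listening, start Kibana automatically, and run the script as a scheduled job that checks every 2 minutes. After several rounds of debugging, the script that checks whether the Kibana port is down and restarts Kibana looks like this: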

#!/bin/bash
# Script location and name:
# /home/esuser/monitor_kibana.sh

# Variables
KIBANA_PORT=5601
KIBANA_START_SCRIPT="/home/esuser/deploy/kibana/bin/kibana"
LOG_FILE="/home/esuser/kibana_monitor.log"

# Return 0 if something is listening on the Kibana port.
# The trailing [[:space:]] keeps :5601 from also matching ports such as :56010.
check_port() {
    netstat -an | grep LISTEN | grep -q ":${KIBANA_PORT}[[:space:]]"
}

# Start Kibana in the background
start_kibana() {
    echo "$(date): Kibana is not running. Starting Kibana..." >> "$LOG_FILE"
    nohup "$KIBANA_START_SCRIPT" >> "$LOG_FILE" 2>&1 &
    # Note: $? here only reflects launching the background job,
    # not whether Kibana has finished starting up.
    if [ $? -eq 0 ]; then
        echo "$(date): Kibana started successfully." >> "$LOG_FILE"
    else
        echo "$(date): Failed to start Kibana." >> "$LOG_FILE"
    fi
}

# Main function
main() {
    if check_port; then
        echo "$(date): Kibana is running on port $KIBANA_PORT." >> "$LOG_FILE"
    else
        start_kibana
    fi
}

# Run the main function
main
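
One limitation of the script above is that the exit status checked right after a command launched with & only says the background job was started, so the "started successfully" message can be logged even if Kibana exits a few seconds later. A stricter check could poll the port until it is actually listening. The sketch below assumes the same netstat-based detection; wait_for_port and its timeout argument are hypothetical names, not something from the original script.

```shell
#!/bin/bash
# Hypothetical helper: poll until a TCP port is in LISTEN state or a timeout
# (in seconds, default 60) expires. Returns 0 on success, 1 on timeout.
wait_for_port() {
    local port=$1 timeout=${2:-60} waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # Same detection idea as check_port: a LISTEN line whose local
        # address ends in :<port> followed by whitespace.
        if netstat -ant 2>/dev/null | grep LISTEN | grep -q "[:.]${port}[[:space:]]"; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}

# Possible use inside start_kibana, after the nohup line:
# if wait_for_port "$KIBANA_PORT" 90; then
#     echo "$(date): Kibana started successfully." >> "$LOG_FILE"
# else
#     echo "$(date): Failed to start Kibana." >> "$LOG_FILE"
# fi
```

The 90-second timeout in the usage comment is only a placeholder; the right value depends on how long Kibana takes to bind its port on the host in question.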

The scheduled job is configured as follows:

*/2 * * * * /home/esuser/monitor_kibana.sh
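
For the cron entry above to work, the script must be executable (for example via chmod +x /home/esuser/monitor_kibana.sh), because cron invokes the path directly. The entry can also be installed without opening an editor. The sketch below is one hedged way to do that; with_cron_entry is a hypothetical helper that only transforms text, and the actual crontab commands are shown as a commented usage line.

```shell
#!/bin/sh
# Hypothetical helper: read an existing crontab on stdin and print it with
# exactly one copy of the given entry at the end.
with_cron_entry() {
    entry=$1
    # Drop any identical existing line; grep exits 1 when no lines remain,
    # which is not an error here.
    grep -vxF -- "$entry" || true
    printf '%s\n' "$entry"
}

# Usage (run as the user that owns the script):
# crontab -l 2>/dev/null | with_cron_entry '*/2 * * * * /home/esuser/monitor_kibana.sh' | crontab -
```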

The script's log output looks like this:

Tue Jun 11 14:45:01 CST 2024: Kibana is running on port 5601.
Tue Jun 11 14:50:02 CST 2024: Kibana is running on port 5601.
Tue Jun 11 14:55:02 CST 2024: Kibana is running on port 5601.
Tue Jun 11 15:00:01 CST 2024: Kibana is running on port 5601.
Tue Jun 11 15:05:02 CST 2024: Kibana is running on port 5601.
Tue Jun 11 15:10:02 CST 2024: Kibana is running on port 5601.
Tue Jun 11 15:15:01 CST 2024: Kibana is running on port 5601.
Tue Jun 11 15:20:01 CST 2024: Kibana is running on port 5601.
Tue Jun 11 15:25:01 CST 2024: Kibana is running on port 5601.
Tue Jun 11 15:30:02 CST 2024: Kibana is running on port 5601.
Tue Jun 11 15:35:02 CST 2024: Kibana is running on port 5601.
Tue Jun 11 15:40:02 CST 2024: Kibana is running on port 5601.
Tue Jun 11 15:45:01 CST 2024: Kibana is running on port 5601.
Tue Jun 11 15:50:02 CST 2024: Kibana is running on port 5601.
Tue Jun 11 15:55:01 CST 2024: Kibana is running on port 5601.
Tue Jun 11 16:00:01 CST 2024: Kibana is running on port 5601.
Tue Jun 11 16:05:01 CST 2024: Kibana is running on port 5601.
Tue Jun 11 16:10:02 CST 2024: Kibana is running on port 5601.
Tue Jun 11 16:15:01 CST 2024: Kibana is not running. Starting Kibana...
Tue Jun 11 16:15:01 CST 2024: Kibana started successfully.
{"type":"log","@timestamp":"2024-06-11T16:15:19+08:00","tags":["info","plugins-service"],"pid":57913,"message":"Plugin \"osquery\" is disabled."}
{"type":"log","@timestamp":"2024-06-11T16:15:19+08:00","tags":["warning","config","deprecation"],"pid":57913,"message":"Setting [elasticsearch.username] to \"elastic\" is deprecated. You should use the \"kibana_system\" user instead."}
{"type":"log","@timestamp":"2024-06-11T16:15:19+08:00","tags":["warning","config","deprecation"],"pid":57913,"message":"Config key [monitoring.cluster_alerts.email_notifications.email_address] will be required for email notifications to work in 8.0.\""}
{"type":"log","@timestamp":"2024-06-11T16:15:19+08:00","tags":["warning","config","deprecation"],"pid":57913,"message":"Setting [monitoring.username] to \"elastic\" is deprecated. You should use the \"kibana_system\" user instead."}
{"type":"log","@timestamp":"2024-06-11T16:15:20+08:00","tags":["info","plugins-system"],"pid":57913,"message":"Setting up [100] plugins: [taskManager,licensing,globalSearch,globalSearchProviders,banners,code,usageCollection,xpackLegacy,telemetryCollectionManager,telemetry,telemetryCollectionXpack,kibanaUsageCollection,mapsLegacy,securityOss,share,newsfeed,kibanaLegacy,translations,legacyExport,embeddable,uiActionsEnhanced,esUiShared,expressions,charts,bfetch,data,home,observability,apmOss,console,consoleExtensions,painlessLab,searchprofiler,grokdebugger,management,indexPatternManagement,advancedSettings,fileUpload,savedObjects,visualizations,visTypeTable,regionMap,visTypeVislib,visTypeTimelion,timelion,features,licenseManagement,graph,watcher,canvas,visTypeTagcloud,visTypeVega,visTypeMetric,visTypeMarkdown,tileMap,visTypeXy,dashboard,dashboardEnhanced,visualize,visTypeTimeseries,inputControlVis,discover,discoverEnhanced,savedObjectsManagement,spaces,security,reporting,savedObjectsTagging,maps,lens,cloud,upgradeAssistant,snapshotRestore,enterpriseSearch,lists,dataEnhanced,encryptedSavedObjects,fleet,indexManagement,remoteClusters,crossClusterReplication,rollup,indexLifecycleManagement,dashboardMode,beatsManagement,transform,ingestPipelines,eventLog,actions,alerts,triggersActionsUi,stackAlerts,ml,securitySolution,case,infra,monitoring,logstash,apm,uptime]"}
{"type":"log","@timestamp":"2024-06-11T16:15:20+08:00","tags":["info","plugins","taskManager"],"pid":57913,"message":"TaskManager is identified by the Kibana UUID: 4227be37-8c2a-4580-be72-233bafd4d332"}
{"type":"log","@timestamp":"2024-06-11T16:15:20+08:00","tags":["warning","plugins","security","config"],"pid":57913,"message":"Generating a random key for xpack.security.encryptionKey. To prevent sessions from being invalidated on restart, please set xpack.security.encryptionKey in the kibana.yml or use the bin/kibana-encryption-keys command."}
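
Since the monitor log records every restart, it can also be used to see how often Kibana actually goes down. The sketch below counts restart events per day. It assumes each monitor line begins with the default date output seen above (for example "Tue Jun 11 16:15:01 CST 2024: ..."); restarts_per_day is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count "Kibana is not running" events per day in the
# monitor log. With the default `date` format, fields 2, 3 and 6 are the
# month, the day and the year followed by a colon.
restarts_per_day() {
    awk '/Kibana is not running/ {
             year = $6
             sub(/:$/, "", year)          # strip the colon after the year
             count[$2 " " $3 " " year]++
         }
         END { for (day in count) print day ": " count[day] }' "$1"
}

# Usage:
# restarts_per_day /home/esuser/kibana_monitor.log
```

When the log covers several days, the order of the output lines is not guaranteed, so piping the result through sort may be useful.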

