1. Introduction
1.1 About MHA
MHA, short for Master High Availability Manager and Tools for MySQL, is a script-based management tool written in Perl by a Japanese MySQL expert. It works only with two-tier MySQL replication topologies, and its purpose is to keep the master highly available.
MHA (Master High Availability) is a software package for automatic master failover and slave promotion. It is built on standard MySQL replication (asynchronous or semi-synchronous).
MHA consists of two parts: MHA Manager (the management node) and MHA Node (the data node).
MHA Manager can be deployed on a dedicated machine to manage multiple master-slave clusters, or on one of the slaves. It probes the node servers in the cluster; when it finds the master has failed, it automatically promotes the slave with the most recent data to be the new master and then points all other slaves at it. The entire failover process is transparent to the application.
MHA Node runs on every MySQL server (master/slave/manager); its scripts can parse and purge logs, which speeds up failover.
1.2 MHA features
(1) Master failover within 10-30s (9-12s to detect the master failure, 7-10s to power off the host to avoid split-brain, then a very short time to apply the differential logs).
(2) Simple to deploy, with no changes to the existing master-slave topology (at least 3 servers, so a master-slave structure remains after switchover).
(3) Supports manual online switchover (e.g. for host hardware maintenance) with very short downtime, about 0.5-2s.
(4) Guarantees data consistency across the slaves after failover.
(5) Fully automated failover and fast recovery of the replication topology (one master, multiple slaves).
(6) The recovery process: choose the new master, reconcile relay-log differences between the slaves, apply the necessary statements on the new master, sync the differential statements to the other slaves, and re-establish replication.
2. How it works
Compared with other HA software, MHA's goal is to keep the master in a MySQL replication setup highly available. Its biggest strength is that it can reconcile the differential logs among the slaves so that all slaves end up with consistent data, then choose one of them as the new master and point the other slaves at it. The failover steps are:
- Save the binary log events (binlog events) from the crashed master.
- Identify the slave with the latest updates.
- Apply the differential relay logs to the other slaves.
- Apply the binlog events saved from the master.
- Promote one slave to be the new master.
- Point the other slaves at the new master and resume replication.
3. MHA toolset
(1) Manager tools:
- masterha_check_ssh : checks MHA's SSH configuration.
- masterha_check_repl : checks MySQL replication.
- masterha_manager : starts MHA.
- masterha_check_status : reports the current MHA status.
- masterha_master_monitor : monitors whether the master is down.
- masterha_master_switch : controls failover (automatic or manual).
- masterha_conf_host : adds or removes configured server entries.
(2) Node tools (normally triggered by MHA Manager scripts; no manual operation needed):
- save_binary_logs : saves and copies the master's binary logs.
- apply_diff_relay_logs : identifies differential relay-log events and applies them to the other slaves.
- filter_mysqlbinlog : strips unnecessary ROLLBACK events (no longer used by MHA).
- purge_relay_logs : purges relay logs (without blocking the SQL thread).
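purge_relay_logs deserves a note: because MHA expects relay logs to survive until failover, the usual practice per the MHA documentation is to disable MySQL's automatic relay-log purging and run purge_relay_logs from cron on each slave instead. A hedged sketch; the schedule, paths, and credentials below are examples, not taken from this setup:

```shell
# Hypothetical crontab entry (edit with `crontab -e` on each slave):
# run purge_relay_logs every 4 hours; --disable_relay_log_purge keeps
# MySQL's own relay_log_purge off, so relay logs are removed only by this job.
# 0 */4 * * * /usr/local/bin/purge_relay_logs --user=manager --password=manager_1234 --disable_relay_log_purge --workdir=/var/tmp >> /var/log/purge_relay_logs.log 2>&1
```

Staggering the minute field across slaves avoids purging relay logs on all slaves at the same moment.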
4. Host layout

IP address     | MHA role | MySQL role
192.168.52.129 | Node     | Master
192.168.52.130 | Node     | Slave
192.168.52.131 | Manager  | Slave
5. Set up passwordless SSH keys
5.1 Generate a key pair on 192.168.52.129
ssh-keygen -t dsa -P '' -f id_dsa
id_dsa.pub is the public key and id_dsa the private key. Next, append the public key into an authorized_keys file; this step is required:
cat id_dsa.pub >> authorized_keys
5.2 Generate a key pair on 192.168.52.130
ssh-keygen -t dsa -P '' -f id_dsa
cat id_dsa.pub >> authorized_keys
5.3 Generate a key pair on 192.168.52.131
ssh-keygen -t dsa -P '' -f id_dsa
cat id_dsa.pub >> authorized_keys
5.4 Build one shared authorized_keys for all three hosts
On 192.168.52.129:
cd /root/.ssh  # copy the keys over from 130 and 131
scp 192.168.52.130:/root/.ssh/id_dsa.pub ./id_dsa.pub.130
scp 192.168.52.131:/root/.ssh/id_dsa.pub ./id_dsa.pub.131
cat id_dsa.pub.130 >> authorized_keys
cat id_dsa.pub.131 >> authorized_keys
Check the merged key file:
[root@data01 .ssh]# cat authorized_keys
ssh-dss AAAAB3NzaC1kc3MAAACBAKe9oTz+hQ3vAUE+x7s2EIT6RSrlPoD2VHuSsDo+ZmAyv6+DD6/eVhVXrCyqzjQPJa6UI0PYjlPMk2r2wqdvC/YqQaLhFuQmsBQwrVA2xNHwhB3ic+Om44GVoiZFM7+bGAtfhQ9DLK2+sjfaa/oQfuDvTJ2SI/f0oG3wDGmokgdLAAAAFQC/O4R1yX1FxW7+dCKHhV+LQHWWHQAAAIADR5yqZGZCx8VB8Q6kAYg3cuUCCo3gF/hA5bHY+2xO5CcGAD1mq/l5v55QPUlGAH7btdmfip1tiaQ+V3N+Ektf2psM0uap/LKvbV2jQYKc2UZrpfF6R0lG+x9rpkxWMce1TJ4yANGOasjNjUV6Lg0RVDxLiMT4Ja4+edQVduYt2AAAAIBPNfJlyglZ5CwsI+v753tD8WT4GaH8ssNLpIKrH9qJU6OuT9MmniKE1RqZr+e5MCvkUAAXHFPq0DhHQlPKWqhIpVlu0E8Zsn9a5tv728JpIYz1umB5VLo2J5fhpobefD3AhjEHSyaubJgQG4Gu+jdwsR0H21xLNx0VoP8YPbYkAQ== root@data01
ssh-dss AAAAB3NzaC1kc3MAAACBAPtU+mTL9an88U1wfSwxdlHBg9n8eB9l218sXGLHwJyxNJN0pq4iPCLMfRxhM6T30HxUnyhghxBF2XvkAkMEjZ+IuCoA0mwRi1CcYSMh72SXqfRdgE2GpRBZDRRhlCIE5feNswWZdC7fIDmgNHtK5CFmJLcl+9Bkb3kfGqu8JOxbAAAAFQDo2YRMd5ZsfBRvPZcCYWcOsuy2oQAAAIEA4pGH2w7luN9JhJ0V6sDUUySg4R488yloKR/k4yD33qPXTejCiDUKUbanMLs6obQqxpbVvlhRT6cyo/le7PO6H8IzRHjFy65EPL0omn7J06PmHBUMqCn4jXo27EGXlRLavnonUf3tFeaVo7GxXerj71NdBKkQX7e/bgzD4d5v0PMAAACBAIhx1X50hlYzqPEZEDXTJmntLRXsMB20DvngvUcQLlgLozwfaNdJAQuYSGqklXYTquSmsyaTNJsxj8EkKG4XbM/7hHtNbV8KuAMJGT4Dm7jEdiKClyle2tIvVtonYaL41KeZVdn6Lk4lRYIFxpDoQHKXXr+XEFhLjoniT8urPISl root@data02
ssh-dss AAAAB3NzaC1kc3MAAACBAJtC3j4Gq+zR7adyKFco/1hELblR65Af+Cgi81mfL+sJAFuOqPcFvDANhymBQ9ltH1N2/eDq1CrD0U9tRMxSwBvgiWZW9brkMpl5ix6oJSCBHdyqL6iuREk7CZ3V/y7P2V+TDCc+am5onMWDG1Af9o6CeA7CP92CHaUoltcXd7L7AAAAFQCqpeVqjbEs/lC/J1utfWrpGDxt8QAAAIB1aeB6C3S9t0dU3bEZs06DaooO46foBFMOI7u0w7uEWvj48UQF7W5Y++vjsiARxr6clk2g2T70n0fJmZCtMMiHqD6vyy8OOy8PzdUcQVAUW2GZQ8hn7M1U2GOz2KPo6uUbPiKkXilEfh9YRsyZyxMdmC4raPjPea8sj6favK8RbgAAAIAima6hWfBFsFcE1oh02xk7ydavHGJbHAlxeH9g9glE0FPmzPgWMFkYQnLsyV2z+ouMPFmERUPYzg1k/8Dhoxjp9j4JB6bIcPNtKdlS660NcFLxRtFAhrnzvLOTzXYzeIuZOlE0WcjeQGNpx8JHAef/3WzpHnVmnhbmlkBrZ8X/OQ== root@oraclem1
[root@data01 .ssh]#
The authorized_keys file now contains 3 lines: the public keys of data01 (192.168.52.129), oraclem1 (192.168.52.131), and data02 (192.168.52.130). Copy this authorized_keys file to the same directory on oraclem1 (192.168.52.131) and data02 (192.168.52.130) with scp:
scp authorized_keys 192.168.52.130:/root/.ssh/
scp authorized_keys 192.168.52.131:/root/.ssh/
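The merge in 5.4 is plain concatenation; the invariant is one public-key line per host. A minimal local sketch of that invariant, using placeholder key material in a scratch directory (not /root/.ssh):

```shell
# Simulate the three hosts' public keys with placeholder content, then merge
# them exactly as in step 5.4 and confirm one key line per host.
workdir=$(mktemp -d)
printf 'ssh-dss AAAA...129 root@data01\n'   > "$workdir/id_dsa.pub"
printf 'ssh-dss AAAA...130 root@data02\n'   > "$workdir/id_dsa.pub.130"
printf 'ssh-dss AAAA...131 root@oraclem1\n' > "$workdir/id_dsa.pub.131"
cat "$workdir/id_dsa.pub" "$workdir/id_dsa.pub.130" "$workdir/id_dsa.pub.131" \
    > "$workdir/authorized_keys"
chmod 600 "$workdir/authorized_keys"   # sshd rejects loosely-permissioned key files
wc -l < "$workdir/authorized_keys"     # one key line per host
```

If a host appears twice (for example from re-running the `cat ... >> authorized_keys` steps), login still works, but the file grows without bound, so the line count is a useful sanity check.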
5.5 Verify the setup:
[root@oraclem1 ~]# ssh 192.168.52.130
Last login: Tue Apr 7 02:40:40 2015 from data01
[root@data02 ~]# ssh 192.168.52.131
Last login: Tue Apr 7 02:40:56 2015 from 192.168.52.131
[root@oraclem1 ~]# ssh 192.168.52.130
Last login: Tue Apr 7 02:41:11 2015 from 192.168.52.131
[root@data02 ~]# ssh 192.168.52.129
Last login: Tue Apr 7 02:40:01 2015 from 192.168.52.131
[root@data01 ~]# ssh 192.168.52.131
Last login: Tue Apr 7 02:41:18 2015 from 192.168.52.130
[root@oraclem1 ~]# ssh 192.168.52.129
Last login: Tue Apr 7 02:41:26 2015 from data02
[root@data01 ~]# ssh 192.168.52.131
Last login: Tue Apr 7 02:41:31 2015 from 192.168.52.129
[root@oraclem1 ~]# ssh 192.168.52.130
Last login: Tue Apr 7 02:41:21 2015 from 192.168.52.131
[root@data02 ~]# ssh 192.168.52.129
Last login: Tue Apr 7 02:41:42 2015 from 192.168.52.131
[root@data01 ~]#
OK, the three servers can now ssh into each other in every direction without entering a password.
PS: if any pair of hosts cannot log in to each other without a password, the later steps are likely to fail.
5.6 Enable login by hostname
On all three servers, edit /etc/hosts, append the following lines, then save and exit:
192.168.52.129 data01
192.168.52.130 data02
192.168.52.131 oraclem1
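The append can be made idempotent so that re-running the setup never duplicates entries; a sketch against a temporary stand-in file rather than the real /etc/hosts:

```shell
# Append each mapping only if it is not already present (grep -qF matches the
# line as a literal string). A mktemp file stands in for /etc/hosts here.
hosts_file=$(mktemp)
for entry in '192.168.52.129 data01' '192.168.52.130 data02' '192.168.52.131 oraclem1'; do
    grep -qF "$entry" "$hosts_file" || printf '%s\n' "$entry" >> "$hosts_file"
done
cat "$hosts_file"
```

Running the loop a second time changes nothing, which is exactly the property you want in a setup script applied to all three servers.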
Verify hostname-based login:
[root@data02 ~]# ssh oraclem1
The authenticity of host 'oraclem1 (192.168.52.131)' can't be established.
RSA key fingerprint is e5:f0:ae:e3:14:35:2f:09:1f:88:dd:31:c3:1a:e1:73.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'oraclem1' (RSA) to the list of known hosts.
Last login: Tue Apr 7 02:51:59 2015 from data01
[root@oraclem1 ~]# ssh data02
Last login: Tue Apr 7 02:52:10 2015 from data01
[root@data02 ~]# ssh oraclem1
Last login: Tue Apr 7 02:52:18 2015 from data02
[root@oraclem1 ~]# vim /etc/hosts
[root@oraclem1 ~]# ssh data01
Last login: Tue Apr 7 02:52:06 2015 from data02
[root@data01 ~]# ssh data02
Last login: Tue Apr 7 02:52:21 2015 from oraclem1
[root@data02 ~]# ssh data01
Last login: Tue Apr 7 02:55:13 2015 from oraclem1
[root@data01 ~]#
OK, passwordless login by hostname now works.
6. Prepare the MySQL master-slave environment
The topology is one master with two slaves:
Master 192.168.52.129 → slave 192.168.52.130
Master 192.168.52.129 → slave 192.168.52.131
Create the account used by MHA for management; run on every MySQL server:
GRANT SUPER,RELOAD,REPLICATION CLIENT,SELECT ON *.* TO manager@'192.168.52.%' IDENTIFIED BY 'manager_1234';
GRANT CREATE,INSERT,UPDATE,DELETE,DROP ON *.* TO manager@'192.168.52.%';
Create the replication account; run on every MySQL server:
GRANT RELOAD, SUPER, REPLICATION SLAVE ON *.* TO 'repl'@'192.168.52.%' IDENTIFIED BY 'repl_1234';
7. Install MHA
MHA consists of manager and data nodes. The data nodes are the hosts in the existing MySQL replication topology, at least 3 (1 master, 2 slaves) so that a master-slave structure survives a master failover; they only need the node package installed. The manager server runs the monitoring scripts and is responsible for monitoring and auto-failover; it needs both the node package and the manager package.
To save machines, you can pick an "idle" slave from the existing replication topology as the manager server, for example a slave that serves no read traffic and exists only as a candidate master or for backups.
7.1 Install MHA on the data nodes
First install the Perl MySQL driver:
yum -y install perl-DBD-MySQL
Then build and install the node package:
tar -zxvpf mha4mysql-node-0.56.tar.gz
cd mha4mysql-node-0.56
perl Makefile.PL
make && make install
7.2 Install MHA on the manager node
First install the Perl MySQL packages:
yum install -y perl-DBD-MySQL
yum install -y perl-Config-Tiny
yum install -y perl-Log-Dispatch
yum install -y perl-Parallel-ForkManager
yum install -y perl-Config-IniFiles
If some of these fail to install, you can download the rpm packages directly:
wget ftp://ftp.muug.mb.ca/mirror/centos/5.10/os/x86_64/CentOS/perl-5.8.8-41.el5.x86_64.rpm
wget ftp://ftp.muug.mb.ca/mirror/centos/6.5/os/x86_64/Packages/compat-db43-4.3.29-15.el6.x86_64.rpm
wget http://downloads.naulinux.ru/pub/NauLinux/6x/i386/sites/School/RPMS/perl-Log-Dispatch-2.27-1.el6.noarch.rpm
wget http://dl.fedoraproject.org/pub/epel/6/i386/perl-Parallel-ForkManager-0.7.9-1.el6.noarch.rpm
wget http://dl.fedoraproject.org/pub/epel/6/i386/perl-Mail-Sender-0.8.16-3.el6.noarch.rpm
wget http://dl.fedoraproject.org/pub/epel/6/i386/perl-Mail-Sendmail-0.79-12.el6.noarch.rpm
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/perl-Time-HiRes-1.9721-136.el6.x86_64.rpm
After downloading, install the rpm packages one by one.
If installation still fails, try perl CPAN:
perl -MCPAN -e shell
cpan[1]> install Log::Dispatch
Then run perl Makefile.PL to check MHA's Perl environment, as shown below:
[root@oraclem1 mha4mysql-manager-0.56]#perl Makefile.PL
*** Module::AutoInstall version 1.03
*** Checking for Perl dependencies...
[Core Features]
- DBI ...loaded. (1.609)
- DBD::mysql ...loaded. (4.013)
- Time::HiRes ...loaded. (1.9726)
- Config::Tiny ...loaded. (2.12)
- Log::Dispatch ...loaded. (2.44)
- Parallel::ForkManager ...loaded. (0.7.9)
- MHA::NodeConst ...loaded. (0.56)
*** Module::AutoInstall configuration finished.
Generating a Unix-style Makefile
Writing Makefile for mha4mysql::manager
Writing MYMETA.yml and MYMETA.json
[root@oraclem1 mha4mysql-manager-0.56]#
Then extract and install the manager package:
tar -xvf mha4mysql-manager-0.56.tar.gz
cd mha4mysql-manager-0.56
perl Makefile.PL
make && make install
7.3 Edit the manager node configuration
On the management node, 192.168.52.129:
[root@data01 mha4mysql-manager-0.56]# vim/etc/masterha/app1.cnf
[server default]
manager_workdir=/var/log/masterha/app1
manager_log=/var/log/masterha/app1/manager.log
ssh_user=root  # account for passwordless SSH login
repl_user=repl  # MySQL replication account, used to sync binary logs between master and slaves
repl_password=repl_1234
ping_interval=1  # ping interval for checking whether the master is alive
[server1]
hostname=192.168.52.129
candidate_master=1  # when the master goes down, prefer this host as the new master
master_binlog_dir=/home/data/mysql/binlog/
[server2]
hostname=192.168.52.130
#candidate_master=1
master_binlog_dir=/home/data/mysql/binlog/
[server3]
hostname=192.168.52.131
#candidate_master=1
master_binlog_dir=/home/data/mysql/binlog/
#[server4]
#hostname=host4
#no_master=1
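The app1.cnf layout above is a plain INI file: one [server default] block plus one [serverN] block per host. A quick structural sanity check is to list the section headers; a minimal sketch using an inline stand-in so it is self-contained (the real file lives at /etc/masterha/app1.cnf):

```shell
# Build a stand-in config and list its section headers with line numbers.
cnf=$(mktemp)
cat > "$cnf" <<'EOF'
[server default]
manager_workdir=/var/log/masterha/app1
[server1]
hostname=192.168.52.129
[server2]
hostname=192.168.52.130
[server3]
hostname=192.168.52.131
EOF
grep -n '^\[' "$cnf"   # expect [server default] plus three [serverN] sections
```

A missing or duplicated [serverN] header is one of the most common causes of masterha_check_repl configuration errors, so this is worth a glance before running the MHA checks.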
7.4 Check SSH with the MHA tool
Check that SSH works with: masterha_check_ssh --conf=/etc/masterha/app1.cnf, as shown below:
[root@data01 ~]# masterha_check_ssh --conf=/etc/masterha/app1.cnf
Tue Apr 7 02:56:12 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Apr 7 02:56:12 2015 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Tue Apr 7 02:56:12 2015 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Tue Apr 7 02:56:12 2015 - [info] Starting SSH connection tests..
Tue Apr 7 02:56:15 2015 - [debug]
Tue Apr 7 02:56:12 2015 - [debug] Connecting via SSH from root@192.168.52.129(192.168.52.129:22) to root@192.168.52.130(192.168.52.130:22)..
Warning: Permanently added '192.168.52.129' (RSA) to the list of known hosts.
Tue Apr 7 02:56:13 2015 - [debug] ok.
Tue Apr 7 02:56:13 2015 - [debug] Connecting via SSH from root@192.168.52.129(192.168.52.129:22) to root@192.168.52.131(192.168.52.131:22)..
Tue Apr 7 02:56:15 2015 - [debug] ok.
Tue Apr 7 02:56:15 2015 - [debug]
Tue Apr 7 02:56:13 2015 - [debug] Connecting via SSH from root@192.168.52.130(192.168.52.130:22) to root@192.168.52.129(192.168.52.129:22)..
Tue Apr 7 02:56:14 2015 - [debug] ok.
Tue Apr 7 02:56:14 2015 - [debug] Connecting via SSH from root@192.168.52.130(192.168.52.130:22) to root@192.168.52.131(192.168.52.131:22)..
Tue Apr 7 02:56:15 2015 - [debug] ok.
Tue Apr 7 02:56:16 2015 - [debug]
Tue Apr 7 02:56:13 2015 - [debug] Connecting via SSH from root@192.168.52.131(192.168.52.131:22) to root@192.168.52.129(192.168.52.129:22)..
Tue Apr 7 02:56:15 2015 - [debug] ok.
Tue Apr 7 02:56:15 2015 - [debug] Connecting via SSH from root@192.168.52.131(192.168.52.131:22) to root@192.168.52.130(192.168.52.130:22)..
Tue Apr 7 02:56:16 2015 - [debug] ok.
Tue Apr 7 02:56:16 2015 - [info] All SSH connection tests passed successfully.
[root@data01 ~]#
7.5 Check the replication environment with the MHA tool
The check command is: masterha_check_repl --conf=/etc/masterha/app1.cnf; the result is as follows:
[root@oraclem1 ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Fri Apr 10 01:02:18 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Fri Apr 10 01:02:18 2015 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Fri Apr 10 01:02:18 2015 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Fri Apr 10 01:02:18 2015 - [info] MHA::MasterMonitor version 0.56.
Fri Apr 10 01:02:18 2015 - [info] Multi-master configuration is detected. Current primary(writable) master is 192.168.52.129(192.168.52.129:3306)
Fri Apr 10 01:02:18 2015 - [info] Master configurations are as below:
Master 192.168.52.130(192.168.52.130:3306), replicating from 192.168.52.129(192.168.52.129:3306), read-only
Master 192.168.52.129(192.168.52.129:3306), replicating from 192.168.52.130(192.168.52.130:3306)
Fri Apr 10 01:02:18 2015 - [info] GTID failover mode = 0
Fri Apr 10 01:02:18 2015 - [info] Dead Servers:
Fri Apr 10 01:02:18 2015 - [info] Alive Servers:
Fri Apr 10 01:02:18 2015 - [info] 192.168.52.129(192.168.52.129:3306)
Fri Apr 10 01:02:18 2015 - [info] 192.168.52.130(192.168.52.130:3306)
Fri Apr 10 01:02:18 2015 - [info] 192.168.52.131(192.168.52.131:3306)
Fri Apr 10 01:02:18 2015 - [info] Alive Slaves:
Fri Apr 10 01:02:18 2015 - [info] 192.168.52.130(192.168.52.130:3306) Version=5.6.12-log (oldest major version between slaves) log-bin:enabled
Fri Apr 10 01:02:18 2015 - [info] Replicating from 192.168.52.129(192.168.52.129:3306)
Fri Apr 10 01:02:18 2015 - [info] 192.168.52.131(192.168.52.131:3306) Version=5.6.12-log (oldest major version between slaves) log-bin:enabled
Fri Apr 10 01:02:18 2015 - [info] Replicating from 192.168.52.129(192.168.52.129:3306)
Fri Apr 10 01:02:18 2015 - [info] Current Alive Master: 192.168.52.129(192.168.52.129:3306)
Fri Apr 10 01:02:18 2015 - [info] Checking slave configurations..
Fri Apr 10 01:02:18 2015 - [info] Checking replication filtering settings..
Fri Apr 10 01:02:18 2015 - [info] binlog_do_db= user_db, binlog_ignore_db= information_schema,mysql,performance_schema,test
Fri Apr 10 01:02:18 2015 - [info] Replication filtering check ok.
Fri Apr 10 01:02:18 2015 - [info] GTID (with auto-pos) is not supported
Fri Apr 10 01:02:18 2015 - [info] Starting SSH connection tests..
Fri Apr 10 01:02:20 2015 - [info] All SSH connection tests passed successfully.
Fri Apr 10 01:02:20 2015 - [info] Checking MHA Node version..
Fri Apr 10 01:02:23 2015 - [info] Version check ok.
Fri Apr 10 01:02:23 2015 - [info] Checking SSH publickey authentication settings on the current master..
Fri Apr 10 01:02:23 2015 - [info] HealthCheck: SSH to 192.168.52.129 is reachable.
Fri Apr 10 01:02:23 2015 - [info] Master MHA Node version is 0.56.
Fri Apr 10 01:02:23 2015 - [info] Checking recovery script configurations on 192.168.52.129(192.168.52.129:3306)..
Fri Apr 10 01:02:23 2015 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/home/data/mysql/binlog/ --output_file=/var/tmp/save_binary_logs_test --manager_version=0.56 --start_file=mysql-bin.000183
Fri Apr 10 01:02:23 2015 - [info] Connecting to root@192.168.52.129(192.168.52.129:22)..
Creating /var/tmp if not exists.. ok.
Checking output directory is accessible or not..
ok.
Binlog found at /home/data/mysql/binlog/, up to mysql-bin.000183
Fri Apr 10 01:02:23 2015 - [info] Binlog setting check done.
Fri Apr 10 01:02:23 2015 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Fri Apr 10 01:02:23 2015 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='manager' --slave_host=192.168.52.130 --slave_ip=192.168.52.130 --slave_port=3306 --workdir=/var/tmp --target_version=5.6.12-log --manager_version=0.56 --relay_dir=/home/data/mysql/data --current_relay_log=mysqld-relay-bin.000011 --slave_pass=xxx
Fri Apr 10 01:02:23 2015 - [info] Connecting to root@192.168.52.130(192.168.52.130:22)..
Checking slave recovery environment settings..
Relay log found at /home/data/mysql/data, up to mysqld-relay-bin.000013
Temporary relay log file is /home/data/mysql/data/mysqld-relay-bin.000013
Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Fri Apr 10 01:02:24 2015 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='manager' --slave_host=192.168.52.131 --slave_ip=192.168.52.131 --slave_port=3306 --workdir=/var/tmp --target_version=5.6.12-log --manager_version=0.56 --relay_log_info=/home/data/mysql/data/relay-log.info --relay_dir=/home/data/mysql/data/ --slave_pass=xxx
Fri Apr 10 01:02:24 2015 - [info] Connecting to root@192.168.52.131(192.168.52.131:22)..
Checking slave recovery environment settings..
Opening /home/data/mysql/data/relay-log.info ... ok.
Relay log found at /home/data/mysql/data, up to mysql-relay-bin.000023
Temporary relay log file is /home/data/mysql/data/mysql-relay-bin.000023
Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Fri Apr 10 01:02:26 2015 - [info] Slaves settings check done.
Fri Apr 10 01:02:26 2015 - [info]
192.168.52.129(192.168.52.129:3306) (current master)
+--192.168.52.130(192.168.52.130:3306)
+--192.168.52.131(192.168.52.131:3306)
Fri Apr 10 01:02:26 2015 - [info] Checking replication health on 192.168.52.130..
Fri Apr 10 01:02:26 2015 - [info] ok.
Fri Apr 10 01:02:26 2015 - [info] Checking replication health on 192.168.52.131..
Fri Apr 10 01:02:26 2015 - [info] ok.
Fri Apr 10 01:02:26 2015 - [warning] master_ip_failover_script is not defined.
Fri Apr 10 01:02:26 2015 - [warning] shutdown_script is not defined.
Fri Apr 10 01:02:26 2015 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
[root@oraclem1 ~]#
8. Operating MHA
8.1 Start the manager
nohup masterha_manager --conf=/etc/masterha/app1.cnf < /dev/null >/logs/mha/app1/manager.log 2>&1 &
The command runs in the background, as shown:
[root@oraclem1 mha4mysql-manager-0.56]# nohup masterha_manager --conf=/etc/masterha/app1.cnf < /dev/null > /logs/mha/app1/manager.log 2>&1 &
[1] 8973
[root@oraclem1 mha4mysql-manager-0.56]#
[root@oraclem1 mha4mysql-manager-0.56]# tail -f /logs/mha/app1/manager.log
Fri Apr 10 02:46:43 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Fri Apr 10 02:46:43 2015 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Fri Apr 10 02:46:43 2015 - [info] Reading server configuration from /etc/masterha/app1.cnf..
8.2 Check the status with masterha_check_status
[root@oraclem1 ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:8973) is running(0:PING_OK),master:192.168.52.129
[root@oraclem1 ~]#
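For scripting (e.g. a monitoring wrapper around MHA), the current master can be pulled out of that status line with plain shell; a sketch that assumes the one-line format shown above, which may vary across MHA versions:

```shell
# Extract the master address from a masterha_check_status output line using
# parameter expansion: strip everything up to and including "master:".
status_line='app1 (pid:8973) is running(0:PING_OK),master:192.168.52.129'
master_ip=${status_line##*master:}
echo "$master_ip"
```

The same expansion works on the live output via `status_line=$(masterha_check_status --conf=/etc/masterha/app1.cnf)`.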
Manual operations:
8.3 Stop the manager
Command: masterha_stop --conf=/etc/masterha/app1.cnf
[root@oraclem1 mha4mysql-manager-0.56]# masterha_stop --conf=/etc/masterha/app1.cnf
Stopped app1 successfully.
[1]+  Exit 1                  nohup masterha_manager --conf=/etc/masterha/app1.cnf < /dev/null > /logs/mha/app1/manager.log 2>&1
[root@oraclem1 mha4mysql-manager-0.56]#
8.4 Test automatic failover when the master dies
On the MySQL master 52.129, crash the host with: echo c > /proc/sysrq-trigger
Result: the master becomes 52.130, and the slave 52.131 is repointed to the new master 52.130.
The manager completed the switchover automatically.
8.5 Manual master switchover
First stop the manager: masterha_stop --conf=/etc/masterha/app1.cnf
Add the crontab -e tasks on the candidate slave and the master,
then switch the master manually with the following commands:
masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=dead --dead_master_host=192.168.52.129
masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=192.168.52.130
First try declaring the original master dead, as shown below:
[root@oraclem1 ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=dead --dead_master_host=192.168.52.129
--dead_master_ip=is not set. Using 192.168.52.129.
--dead_master_port=is not set. Using 3306.
Fri Apr 10 04:19:36 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Fri Apr 10 04:19:36 2015 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Fri Apr 10 04:19:36 2015 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Fri Apr 10 04:19:36 2015 - [info] MHA::MasterFailover version 0.56.
Fri Apr 10 04:19:36 2015 - [info] Starting master failover.
Fri Apr 10 04:19:36 2015 - [info]
Fri Apr 10 04:19:36 2015 - [info] * Phase 1: Configuration Check Phase..
Fri Apr 10 04:19:36 2015 - [info]
Fri Apr 10 04:19:36 2015 - [info] Multi-master configuration is detected. Current primary(writable) master is 192.168.52.129(192.168.52.129:3306)
Fri Apr 10 04:19:36 2015 - [info] Master configurations are as below:
Master 192.168.52.130(192.168.52.130:3306), replicating from 192.168.52.129(192.168.52.129:3306), read-only
Master 192.168.52.129(192.168.52.129:3306), replicating from 192.168.52.130(192.168.52.130:3306)
Fri Apr 10 04:19:36 2015 - [info] GTID failover mode = 0
Fri Apr 10 04:19:36 2015 - [info] Dead Servers:
Fri Apr 10 04:19:36 2015 - [error][/usr/local/share/perl5/MHA/MasterFailover.pm, ln187] None of server is dead. Stop failover.
Fri Apr 10 04:19:36 2015 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/local/bin/masterha_master_switch line 53
[root@oraclem1 ~]#
Then run the switch with the master alive. During the switch, answer yes to the confirmation prompts; at the end, the message "Switching master to 192.168.52.130(192.168.52.130:3306) completed successfully." indicates the manual switchover succeeded, as shown below:
[root@oraclem1 ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=192.168.52.130
Fri Apr 10 04:28:06 2015 - [info] MHA::MasterRotate version 0.56.
Fri Apr 10 04:28:06 2015 - [info] Starting online master switch..
Fri Apr 10 04:28:06 2015 - [info]
Fri Apr 10 04:28:06 2015 - [info] * Phase 1: Configuration Check Phase..
Fri Apr 10 04:28:06 2015 - [info]
Fri Apr 10 04:28:06 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Fri Apr 10 04:28:06 2015 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Fri Apr 10 04:28:06 2015 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Fri Apr 10 04:28:06 2015 - [info] Multi-master configuration is detected. Current primary(writable) master is 192.168.52.129(192.168.52.129:3306)
Fri Apr 10 04:28:06 2015 - [info] Master configurations are as below:
Master 192.168.52.130(192.168.52.130:3306), replicating from 192.168.52.129(192.168.52.129:3306), read-only
Master 192.168.52.129(192.168.52.129:3306), replicating from 192.168.52.130(192.168.52.130:3306)
Fri Apr 10 04:28:06 2015 - [info] GTID failover mode = 0
Fri Apr 10 04:28:06 2015 - [info] Current Alive Master: 192.168.52.129(192.168.52.129:3306)
Fri Apr 10 04:28:06 2015 - [info] Alive Slaves:
Fri Apr 10 04:28:06 2015 - [info] 192.168.52.130(192.168.52.130:3306) Version=5.6.12-log (oldest major version between slaves) log-bin:enabled
Fri Apr 10 04:28:06 2015 - [info] Replicating from 192.168.52.129(192.168.52.129:3306)
Fri Apr 10 04:28:06 2015 - [info] 192.168.52.131(192.168.52.131:3306) Version=5.6.12-log (oldest major version between slaves) log-bin:enabled
Fri Apr 10 04:28:06 2015 - [info] Replicating from 192.168.52.129(192.168.52.129:3306)
It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 192.168.52.129(192.168.52.129:3306)? (YES/no): YES
Fri Apr 10 04:28:09 2015 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Fri Apr 10 04:28:09 2015 - [info] ok.
Fri Apr 10 04:28:09 2015 - [info] Checking MHA is not monitoring or doing failover..
Fri Apr 10 04:28:09 2015 - [info] Checking replication health on 192.168.52.130..
Fri Apr 10 04:28:09 2015 - [info] ok.
Fri Apr 10 04:28:09 2015 - [info] Checking replication health on 192.168.52.131..
Fri Apr 10 04:28:09 2015 - [info] ok.
Fri Apr 10 04:28:09 2015 - [info] 192.168.52.130 can be new master.
Fri Apr 10 04:28:09 2015 - [info]
From:
192.168.52.129(192.168.52.129:3306) (current master)
+--192.168.52.130(192.168.52.130:3306)
+--192.168.52.131(192.168.52.131:3306)
To:
192.168.52.130(192.168.52.130:3306) (new master)
+--192.168.52.131(192.168.52.131:3306)
Starting master switch from 192.168.52.129(192.168.52.129:3306) to 192.168.52.130(192.168.52.130:3306)? (yes/NO): yes
Fri Apr 10 04:28:13 2015 - [info] Checking whether 192.168.52.130(192.168.52.130:3306) is ok for the new master..
Fri Apr 10 04:28:13 2015 - [info] ok.
Fri Apr 10 04:28:13 2015 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Apr 10 04:28:13 2015 - [info]
Fri Apr 10 04:28:13 2015 - [info] * Phase 2: Rejecting updates Phase..
Fri Apr 10 04:28:13 2015 - [info]
master_ip_online_change_script is not defined. If you do not disable writes on the current master manually, applications keep writing on the current master. Is it ok to proceed? (yes/NO): yes
Fri Apr 10 04:28:18 2015 - [info] Locking all tables on the orig master to reject updates from everybody (including root):
Fri Apr 10 04:28:18 2015 - [info] Executing FLUSH TABLES WITH READ LOCK..
Fri Apr 10 04:28:18 2015 - [info] ok.
Fri Apr 10 04:28:18 2015 - [info] Orig master binlog:pos is mysql-bin.000185:120.
Fri Apr 10 04:28:18 2015 - [info] Waiting to execute all relay logs on 192.168.52.130(192.168.52.130:3306)..
Fri Apr 10 04:28:18 2015 - [info] master_pos_wait(mysql-bin.000185:120) completed on 192.168.52.130(192.168.52.130:3306). Executed 0 events.
Fri Apr 10 04:28:18 2015 - [info] done.
Fri Apr 10 04:28:18 2015 - [info] Getting new master's binlog name and position..
Fri Apr 10 04:28:18 2015 - [info] mysql-bin.000058:578
Fri Apr 10 04:28:18 2015 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.52.130', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000058', MASTER_LOG_POS=578, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Fri Apr 10 04:28:18 2015 - [info] Setting read_only=0 on 192.168.52.130(192.168.52.130:3306)..
Fri Apr 10 04:28:18 2015 - [info] ok.
Fri Apr 10 04:28:18 2015 - [info]
Fri Apr 10 04:28:18 2015 - [info] * Switching slaves in parallel..
Fri Apr 10 04:28:18 2015 - [info]
Fri Apr 10 04:28:18 2015 - [info] -- Slave switch on host 192.168.52.131(192.168.52.131:3306) started, pid: 14563
Fri Apr 10 04:28:18 2015 - [info]
Fri Apr 10 04:28:18 2015 - [info] Log messages from 192.168.52.131 ...
Fri Apr 10 04:28:18 2015 - [info]
Fri Apr 10 04:28:18 2015 - [info] Waiting to execute all relay logs on 192.168.52.131(192.168.52.131:3306)..
Fri Apr 10 04:28:18 2015 - [info] master_pos_wait(mysql-bin.000185:120) completed on 192.168.52.131(192.168.52.131:3306). Executed 0 events.
Fri Apr 10 04:28:18 2015 - [info] done.
Fri Apr 10 04:28:18 2015 - [info] Resetting slave 192.168.52.131(192.168.52.131:3306) and starting replication from the new master 192.168.52.130(192.168.52.130:3306)..
Fri Apr 10 04:28:18 2015 - [info] Executed CHANGE MASTER.
Fri Apr 10 04:28:18 2015 - [info] Slave started.
Fri Apr 10 04:28:18 2015 - [info] End of log messages from 192.168.52.131 ...
Fri Apr 10 04:28:18 2015 - [info]
Fri Apr 10 04:28:18 2015 - [info] -- Slave switch on host 192.168.52.131(192.168.52.131:3306) succeeded.
Fri Apr 10 04:28:18 2015 - [info] Unlocking all tables on the orig master:
Fri Apr 10 04:28:18 2015 - [info] Executing UNLOCK TABLES..
Fri Apr 10 04:28:18 2015 - [info] ok.
Fri Apr 10 04:28:18 2015 - [info] All new slave servers switched successfully.
Fri Apr 10 04:28:18 2015 - [info]
Fri Apr 10 04:28:18 2015 - [info] * Phase 5: New master cleanup phase..
Fri Apr 10 04:28:18 2015 - [info]
Fri Apr 10 04:28:18 2015 - [info] 192.168.52.130: Resetting slave info succeeded.
Fri Apr 10 04:28:18 2015 - [info] Switching master to 192.168.52.130(192.168.52.130:3306) completed successfully.
[root@oraclem1 ~]#
PS: after a manual switchover, masterha_check_repl can no longer be run with the original /etc/masterha/app1.cnf; use a new app2.cnf instead, because app1.cnf still records the old master and the check would report a replication failure. Generating app2.cnf is simple:
(1) Copy the original app1.cnf to a new app2.cnf:
cp /etc/masterha/app1.cnf /etc/masterha/app2.cnf
(2) Edit app2.cnf and swap the IPs of server1 and server2, i.e. the two hosts involved in the switch, as shown below:
[server1]
hostname=192.168.52.130
candidate_master=1
master_binlog_dir=/home/data/mysql/binlog/
[server2]
hostname=192.168.52.129
#candidate_master=1
#master_binlog_dir=/home/data/mysql/binlog/
(3) Then run masterha_check_repl --conf=/etc/masterha/app2.cnf; the check now shows that the post-switch MHA master-slave setup is OK.
9. Error log summary
Error 1:
[root@data01 ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Tue Apr 7 22:31:06 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Apr 7 22:31:07 2015 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Tue Apr 7 22:31:07 2015 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Tue Apr 7 22:31:07 2015 - [info] MHA::MasterMonitor version 0.56.
Tue Apr 7 22:31:07 2015 - [error][/usr/local/share/perl5/MHA/Server.pm, ln303] Getting relay log directory or current relay logfile from replication table failed on 192.168.52.130(192.168.52.130:3306)!
Tue Apr 7 22:31:07 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations. at /usr/local/share/perl5/MHA/ServerManager.pm line 315
Tue Apr 7 22:31:07 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Tue Apr 7 22:31:07 2015 - [info] Got exit code 1 (Not master dead).
MySQL Replication Health is NOT OK!
[root@data01 ~]#
Fix: on 192.168.52.130, edit /etc/my.cnf (vim /etc/my.cnf) and add:
relay-log=/home/data/mysql/binlog/mysql-relay-bin
Then restart MySQL and re-establish the slave connection:
STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO MASTER_HOST='192.168.52.129',MASTER_USER='repl',MASTER_PASSWORD='repl_1234',MASTER_LOG_FILE='mysql-bin.000178',MASTER_LOG_POS=459;
START SLAVE;
OK, that fixes it.
Error 2:
[root@data01 perl]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Thu Apr 9 00:54:32 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Apr 9 00:54:32 2015 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Thu Apr 9 00:54:32 2015 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Thu Apr 9 00:54:32 2015 - [info] MHA::MasterMonitor version 0.56.
Thu Apr 9 00:54:32 2015 - [error][/usr/local/share/perl5/MHA/Server.pm, ln306] Getting relay log directory or current relay logfile from replication table failed on 192.168.52.130(192.168.52.130:3306)!
Thu Apr 9 00:54:32 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations. at /usr/local/share/perl5/MHA/ServerManager.pm line 315
Thu Apr 9 00:54:32 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Thu Apr 9 00:54:32 2015 - [info] Got exit code 1 (Not master dead).
MySQL Replication Health is NOT OK!
[root@data01 perl]#
Fix:
In /etc/masterha/app1.cnf, both user and repl_user are MySQL accounts that must exist. Here only repl_user had been created, not the user account:
user=manager
password=manager_1234
repl_user=repl
repl_password=repl_1234
On each MySQL node, create the manager account that allows the manager to access the database, mainly for SHOW SLAVE STATUS and RESET SLAVE; run:
GRANT SUPER,RELOAD,REPLICATION CLIENT,SELECT ON *.* TO manager@'192.168.52.%' IDENTIFIED BY 'manager_1234';
Error 3:
[root@oraclem1 ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Thu Apr 9 23:09:05 2015 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Apr 9 23:09:05 2015 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Thu Apr 9 23:09:05 2015 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Thu Apr 9 23:09:05 2015 - [info] MHA::MasterMonitor version 0.56.
Thu Apr 9 23:09:05 2015 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln781] Multi-master configuration is detected, but two or more masters are either writable (read-only is not set) or dead! Check configurations for details. Master configurations are as below:
Master 192.168.52.130(192.168.52.130:3306), replicating from 192.168.52.129(192.168.52.129:3306)
Master 192.168.52.129(192.168.52.129:3306), replicating from 192.168.52.130(192.168.52.130:3306)
Thu Apr 9 23:09:05 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations. at /usr/local/share/perl5/MHA/MasterMonitor.pm line 326
Thu Apr 9 23:09:05 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Thu Apr 9 23:09:05 2015 - [info] Got exit code 1 (Not master dead).
MySQL Replication Health is NOT OK!
[root@oraclem1 ~]#
Solution: in this dual-master loop, set read_only on the node that should act as the slave:
mysql> set global read_only=1;
Query OK, 0 rows affected (0.00 sec)
mysql>
Error record 4:
Thu Apr 9 23:54:32 2015 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Thu Apr 9 23:54:32 2015 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='manager' --slave_host=192.168.52.130 --slave_ip=192.168.52.130 --slave_port=3306 --workdir=/var/tmp --target_version=5.6.12-log --manager_version=0.56 --relay_dir=/home/data/mysql/data --current_relay_log=mysqld-relay-bin.000011 --slave_pass=xxx
Thu Apr 9 23:54:32 2015 - [info] Connecting to root@192.168.52.130(192.168.52.130:22)..
Can't exec "mysqlbinlog": No such file or directory at /usr/local/share/perl5/MHA/BinlogManager.pm line 106.
mysqlbinlog version command failed with rc 1:0, please verify PATH, LD_LIBRARY_PATH, and client options
at /usr/local/bin/apply_diff_relay_logs line 493
Thu Apr 9 23:54:32 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm,ln205] Slaves settings check failed!
Thu Apr 9 23:54:32 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm,ln413] Slave configuration failed.
Thu Apr 9 23:54:32 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm,ln424] Error happened on checking configurations. at /usr/local/bin/masterha_check_repl line 48
Thu Apr 9 23:54:32 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm,ln523] Error happened on monitoring servers.
Thu Apr 9 23:54:32 2015 - [info] Got exit code 1 (Not master dead).
MySQL Replication Health is NOT OK!
[root@oraclem1 ~]#
Solution: mysqlbinlog is not on the PATH seen by non-interactive SSH sessions, so create a symlink:
[root@data02 ~]# type mysqlbinlog
mysqlbinlog is /usr/local/mysql/bin/mysqlbinlog
[root@data02 ~]#
[root@data02 ~]# ln -s /usr/local/mysql/bin/mysqlbinlog /usr/bin/mysqlbinlog
Error record 5:
Thu Apr 9 23:57:24 2015 - [info] Connecting to root@192.168.52.130(192.168.52.130:22)..
Checking slave recovery environment settings..
Relay log found at /home/data/mysql/data, up to mysqld-relay-bin.000013
Temporary relay log file is /home/data/mysql/data/mysqld-relay-bin.000013
Testing mysql connection and privileges..sh: mysql: command not found
mysql command failed with rc 127:0!
at /usr/local/bin/apply_diff_relay_logs line 375
main::check() called at /usr/local/bin/apply_diff_relay_logs line 497
eval {...} called at /usr/local/bin/apply_diff_relay_logs line 475
main::main() called at /usr/local/bin/apply_diff_relay_logs line 120
Thu Apr 9 23:57:24 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm,ln205] Slaves settings check failed!
Thu Apr 9 23:57:24 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm,ln413] Slave configuration failed.
Thu Apr 9 23:57:24 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm,ln424] Error happened on checking configurations. at /usr/local/bin/masterha_check_repl line 48
Thu Apr 9 23:57:24 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm,ln523] Error happened on monitoring servers.
Thu Apr 9 23:57:24 2015 - [info] Got exit code 1 (Not master dead).
MySQL Replication Health is NOT OK!
Solution: link the mysql client into /usr/bin as well:
ln -s /usr/local/mysql/bin/mysql /usr/bin/mysql
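Both of the preceding "command not found" fixes come down to the same cause: MHA's node scripts run over non-interactive SSH, where /usr/local/mysql/bin is often missing from PATH. A quick local sanity check on each node (a sketch; the binary names are the two this document links into /usr/bin):

```shell
# Verify that the client binaries MHA's node scripts shell out to are resolvable.
for bin in mysql mysqlbinlog; do
  if command -v "$bin" >/dev/null 2>&1; then
    echo "$bin: found at $(command -v "$bin")"
  else
    echo "$bin: MISSING - create a symlink in /usr/bin or extend PATH"
  fi
done
```

Run this via `ssh <node> 'sh -s' < check.sh` rather than interactively, so it sees the same PATH that apply_diff_relay_logs does.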
Error record 6:
Fri Apr 10 00:58:36 2015 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='manager' --slave_host=192.168.52.130 --slave_ip=192.168.52.130 --slave_port=3306 --workdir=/var/tmp --target_version=5.6.12-log --manager_version=0.56 --relay_dir=/home/data/mysql/data --current_relay_log=mysqld-relay-bin.000011 --slave_pass=xxx
Fri Apr 10 00:58:36 2015 - [info] Connecting to root@192.168.52.130(192.168.52.130:22)..
Checking slave recovery environment settings..
Relay log found at /home/data/mysql/data, up to mysqld-relay-bin.000013
Temporary relay log file is /home/data/mysql/data/mysqld-relay-bin.000013
Testing mysql connection and privileges.. Warning: Using a password on the command line interface can be insecure.
ERROR 1142 (42000) at line 1: CREATE command denied to user 'manager'@'192.168.52.130' for table 'apply_diff_relay_logs_test'
mysql command failed with rc 1:0!
at /usr/local/bin/apply_diff_relay_logs line 375
main::check() called at /usr/local/bin/apply_diff_relay_logs line 497
eval {...} called at /usr/local/bin/apply_diff_relay_logs line 475
main::main() called at /usr/local/bin/apply_diff_relay_logs line 120
Fri Apr 10 00:58:37 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln205] Slaves settings check failed!
Fri Apr 10 00:58:37 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln413] Slave configuration failed.
Fri Apr 10 00:58:37 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations. at /usr/local/bin/masterha_check_repl line 48
Fri Apr 10 00:58:37 2015 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Fri Apr 10 00:58:37 2015 - [info] Got exit code 1 (Not master dead).
MySQL Replication Health is NOT OK!
Solution:
Run the following GRANT statement; the manager account also needs privileges to create and drop its test table:
GRANT CREATE,INSERT,UPDATE,DELETE,DROP ON *.* TO manager@'192.168.52.%';
10, References
Reference: http://blog.csdn.net/largetalk/article/details/10006899
Reference: http://www.tuicool.com/articles/NV3U3i
Multi-instance reference: http://www.cnblogs.com/wingsless/p/4033093.html
Reference for a simulated-failure walkthrough:
http://www.mysqlsystems.com/2012/03/figure-out-process-of-autofailover-on-mha.html#more-2172
MySQL 5.7 MHA (GTID+ROW) deployment with failover and online_change drills
This article is structured as follows:
1. MHA introduction
MHA (Master High Availability Manager and Tools for MySQL) is a script-based management tool written in Perl by a Japanese MySQL expert. It applies only to MySQL replication environments and aims to keep the master highly available. MHA is a software package for automatic master failover and slave promotion, built on standard MySQL replication (asynchronous or semi-synchronous).
Download:
https://github.com/yoshinorim/mha4mysql-manager/releases https://github.com/yoshinorim/mha4mysql-node/releases
Other required packages:
http://rpm.pbone.net/index.php3?stat=3&limit=5&srodzaj=1&dl=40&search=perl-Module-Install&field=
2. MHA principle
MHA aims to keep the master of a MySQL replication setup highly available. Its biggest strength is that it can reconcile the differential relay logs among multiple slaves so that all slaves end up consistent, then pick one to act as the new master and point the other slaves at it. When the master fails, MHA compares the binlog positions read by each slave's I/O thread and chooses the closest slave as the candidate master; the other slaves generate differential relay logs against it. MHA applies the binlog saved from the original master on the candidate, promotes it to master, then applies the corresponding differential relay logs on the remaining slaves and restarts replication from the new master.
3. MHA pros and cons
3.1. Pros
1. During failover it determines which slave is closest to the master's data and switches to it, minimizing data loss and preserving consistency.
2. It supports a binlog server, which improves binlog transfer efficiency and further reduces the risk of data loss.
3. Combined with MySQL 5.7's enhanced semi-synchronous replication, it can ensure no data is lost during failover.
3.2. Cons
1. The bundled switchover scripts are simplistic and dated; expect to extend them over time.
2. The MHA architecture requires passwordless SSH trust between the Linux hosts, which is a real consideration for system security.
4. MHA toolkit
1. Manager tools
masterha_check_ssh: checks MHA's SSH configuration.
masterha_check_repl: checks MySQL master-slave replication.
masterha_manager: starts the MHA service.
masterha_check_status: checks the current MHA running status.
masterha_master_monitor: monitors whether the master is down.
masterha_master_switch: controls failover (manual or automatic).
masterha_conf_host: adds or removes configured server entries.
2. Node tools
save_binary_logs: saves and copies the master's binary logs.
apply_diff_relay_logs: identifies differential relay log events and applies them to other slaves.
filter_mysqlbinlog: strips unnecessary ROLLBACK events (no longer used by MHA).
purge_relay_logs: purges relay logs (without blocking the SQL thread).
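purge_relay_logs deserves a note: MHA relies on relay logs staying around for differential recovery, so slaves typically run with relay_log_purge=0 and purge periodically from cron instead. A sketch of such a crontab entry (the account, workdir, log path, and schedule here are illustrative assumptions, not from the original):

```shell
# Build the crontab line for a nightly relay-log purge on each slave.
# --disable_relay_log_purge keeps relay_log_purge=0 after the run finishes.
cron_line='0 4 * * * /usr/local/bin/purge_relay_logs --user=manager --password=manager_1234 --disable_relay_log_purge --workdir=/var/tmp >> /var/log/purge_relay_logs.log 2>&1'
echo "$cron_line"
```

Stagger the schedule across slaves so they do not all purge at the same moment.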
5. MySQL MHA installation and drills

| IP | Hostname | Role | Database version | OS version |
| --- | --- | --- | --- | --- |
| 172.16.10.22 | rac2 | master node | MySQL 5.7.20 | RHEL 6.7 |
| 172.16.10.61 | racdg1 | slave node1 | MySQL 5.7.20 | RHEL 6.7 |
| 172.16.10.62 | racdg2 | slave node2 | MySQL 5.7.20 | RHEL 6.7 |
| 172.16.10.30 | VIP | | | |
5.1. Configure SSH trust
Set up mutual SSH trust among the three machines. On the master:
cd
mkdir ~/.ssh
cd /root/.ssh/
ssh-keygen -t dsa -P '' -f id_dsa
id_dsa.pub is the public key and id_dsa is the private key. Next, append the public key to the authorized_keys file:
cat id_dsa.pub >> authorized_keys
On slave 1:
ssh-keygen -t dsa -P '' -f id_dsa
cat id_dsa.pub >> authorized_keys
On slave 2:
ssh-keygen -t dsa -P '' -f id_dsa
cat id_dsa.pub >> authorized_keys
Then copy the slave keys to the master:
scp /root/.ssh/id_dsa.pub 172.16.10.22:/root/.ssh/id_dsa.pub.61
scp /root/.ssh/id_dsa.pub 172.16.10.22:/root/.ssh/id_dsa.pub.62
On the master, merge the received keys and distribute the combined file:
cat id_dsa.pub.61 >> authorized_keys
cat id_dsa.pub.62 >> authorized_keys
scp /root/.ssh/authorized_keys 172.16.10.61:/root/.ssh/
scp /root/.ssh/authorized_keys 172.16.10.62:/root/.ssh/
Then add name resolution on all three servers:
vi /etc/hosts
172.16.10.22 rac2
172.16.10.61 racdg1
172.16.10.62 racdg2
Verify the trust:
On the master:
ssh racdg1 date
ssh racdg2 date
On slave 1:
ssh rac2 date
ssh racdg2 date
On slave 2:
ssh rac2 date
ssh racdg1 date
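The pairwise checks above can be generated instead of typed by hand. A small sketch that prints every host-to-host `ssh ... date` probe for the three hostnames used in this document (a dry run: it only echoes the commands you would run on each box):

```shell
# Enumerate every ordered host pair, skipping self-to-self,
# and print the trust probe to run from each source host.
hosts="rac2 racdg1 racdg2"
for src in $hosts; do
  for dst in $hosts; do
    if [ "$src" != "$dst" ]; then
      echo "on $src: ssh $dst date"
    fi
  done
done
```

Every probe must complete without a password prompt; any line that hangs or asks for a password points at the host pair whose keys were not merged.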
5.2. Build one master with two slaves
The replication environment uses MySQL 5.7, built on GTID with ROW-format binlogs.
Create the replication account on all machines:
create user 'rep'@'172.16.10.%' identified by 'mysql';
grant replication slave on *.* to 'rep'@'172.16.10.%';
flush privileges;
show grants for 'rep'@'172.16.10.%';
Create the management account on all hosts:
create user 'zs'@'172.16.10.%' identified by '123456';
grant all privileges on *.* to 'zs'@'172.16.10.%';
flush privileges;
Configure and start replication.
Initialization:
/usr/local/mysql5.7/bin/mysqldump -S /tmp/mysql3307.sock --single-transaction -uroot -pmysql --master-data=2 -A > slave.sql
Note: the --master-data=2 option is required so that the dump records the binlog file and position at the moment of backup, which is needed to set up replication. Check the dump file for the recorded binlog file and position.
scp slave.sql 172.16.10.61:/root
scp slave.sql 172.16.10.62:/root
Note: if the master and slave GTID sets differ while the data is consistent, you can run:
set global gtid_purged='*******';
The master needs the following settings:
gtid_mode=on
enforce_gtid_consistency=on
log_bin=on
The slaves need the following settings:
server-id (must differ between master and slaves)
gtid_mode=on
enforce_gtid_consistency=on
log_slave_updates=1
Import the dump on each slave:
mysql -S /tmp/mysql3307.sock -uroot -pmysql < slave.sql
On the slaves, run:
change master to master_host='172.16.10.22',master_port=3307,master_user='rep',master_password='mysql',master_auto_position=1;
start slave;
show slave status\G
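After start slave, the two thread flags in SHOW SLAVE STATUS are what matter. A sketch of checking them from a captured status dump (the here-string below stands in for real `mysql -e 'show slave status\G'` output):

```shell
# Sample of the fields we care about from SHOW SLAVE STATUS\G output.
status='Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0'

# Extract each thread state and require both to be "Yes".
io=$(printf '%s\n' "$status"  | awk -F': ' '/Slave_IO_Running/  {print $2}')
sql=$(printf '%s\n' "$status" | awk -F': ' '/Slave_SQL_Running/ {print $2}')
if [ "$io" = "Yes" ] && [ "$sql" = "Yes" ]; then
  echo "replication healthy"
else
  echo "replication broken: IO=$io SQL=$sql"
fi
```

In practice pipe the live client output into the same awk filters; masterha_check_repl performs a stricter version of this check across all nodes.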
5.3. Install the MHA nodes
Install the Perl environment on all MHA servers:
yum install perl-DBD-MySQL
Note: the latest release is 0.58, which targets CentOS/RHEL 7. On older releases use 0.57 (I have already hit this pitfall).
a) Install the MHA node package
Run the following on every node:
tar -zxvf mha4mysql-node-0.57.tar.gz
yum install perl-CPAN*
cd mha4mysql-node-0.57
perl Makefile.PL
make && make install
ls -lh /usr/local/bin/
Inspect the generated files.
b) Install and configure the MHA manager node
Run the following on slave2.
Install the manager node; first install the prerequisite packages:
yum install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes -y
yum -y install perl*
Check for, and download, the dependency packages that still need installing.
First check:
rpm -qa|grep perl-Config-Tiny
rpm -qa|grep perl-Email-Date-Format
rpm -qa|grep perl-Log-Dispatch
rpm -qa|grep perl-Mail-Sender
rpm -qa|grep perl-Mail-Sendmail
rpm -qa|grep perl-MIME-Lite
rpm -qa|grep perl-Time-HiRes
rpm -qa|grep perl-Parallel-ForkManager
Then install:
rpm -ivh perl-Config-Tiny-2.12-1.el6.rfx.noarch.rpm
rpm -ivh perl-Email-Date-Format-1.002-1.el6.rfx.noarch.rpm
rpm -ivh perl-Parallel-ForkManager-0.7.9-1.el6.noarch.rpm
rpm -ivh perl-Mail-Sendmail-0.79_16-4.2.noarch.rpm
rpm -ivh perl-Mail-Sender-0.8.16-3.el6.noarch.rpm
rpm -ivh perl-MIME-Lite-3.029-1.el6.rfx.noarch.rpm
rpm -ivh perl-Log-Dispatch-2.27-1.el6.noarch.rpm
Then install the manager package:
tar -zxvf mha4mysql-manager-0.57.tar.gz
cd mha4mysql-manager-0.57
perl Makefile.PL
make
make install
ls -lh /usr/local/bin
Configure MHA on the management node.
On slave2:
mkdir -p /usr/local/mha
mkdir -p /etc/mha
cd /etc/mha/
vim mha.conf
[server default]
user=zs
password=123456
manager_workdir=/usr/local/mha
manager_log=/usr/local/mha/manager.log
remote_workdir=/usr/local/mha
ssh_user=root
repl_user=rep
repl_password=mysql
ping_interval=1
master_ip_failover_script=/usr/local/scripts/master_ip_failover
master_ip_online_change_script=/usr/local/scripts/master_ip_online_change
[server1]
hostname=172.16.10.22
ssh_port=22
master_binlog_dir=/mydata/mysql/mysql3307/logs/
candidate_master=1
port=3307
[server2]
hostname=172.16.10.61
ssh_port=22
master_binlog_dir=/mydata/mysql/mysql3307/logs/
candidate_master=1
port=3307
[server3]
hostname=172.16.10.62
ssh_port=22
master_binlog_dir=/mydata/mysql/mysql3307/logs/
no_master=1
port=3307
Notes:
manager_workdir=/usr/local/mha // the manager's working directory
manager_log=/usr/local/mha/manager.log // the manager's log file
ping_interval=1 // interval in seconds between ping probes to the master (default 3); after three failed attempts MHA starts an automatic failover
master_ip_failover_script=/usr/local/scripts/master_ip_failover // switchover script used during automatic failover
master_ip_online_change_script=/usr/local/scripts/master_ip_online_change // switchover script used during manual switchover
candidate_master=1 // marks the host as a candidate master; on failover this slave is promoted even if it is not the most up-to-date slave in the cluster
check_repl_delay=0 // by default MHA will not pick a slave as the new master if it is more than 100MB of relay logs behind, since recovery would take too long; setting check_repl_delay=0 makes MHA ignore replication delay when choosing the new master. This is useful together with candidate_master=1, because it guarantees the candidate becomes the new master during the switch
no_master=1 // this server will never become the new master; use it to mark hosts that must not be promoted
Create the failover switchover script:
mkdir -p /usr/local/scripts
cd /usr/local/scripts
vim master_ip_failover
The script content is as follows:
#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use Getopt::Long;
my (
$command, $ssh_user, $orig_master_host, $orig_master_ip,
$orig_master_port, $new_master_host, $new_master_ip, $new_master_port
);
my $vip = '172.16.10.30/24';
my $key = '1';
#my $ssh_start_vip = "/sbin/ip addr add $vip dev eth0";
my $ssh_start_vip = "/sbin/ifconfig eth0:$key $vip";
#my $ssh_stop_vip = "/sbin/ifconfig eth0:$key down";
my $ssh_stop_vip = "/sbin/ip addr del $vip dev eth0";
GetOptions(
'command=s' => \$command,
'ssh_user=s' => \$ssh_user,
'orig_master_host=s' => \$orig_master_host,
'orig_master_ip=s' => \$orig_master_ip,
'orig_master_port=i' => \$orig_master_port,
'new_master_host=s' => \$new_master_host,
'new_master_ip=s' => \$new_master_ip,
'new_master_port=i' => \$new_master_port,
);
exit &main();
sub main {
print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";
if ( $command eq "stop" || $command eq "stopssh" ) {
my $exit_code = 1;
eval {
print "Disabling the VIP on old master: $orig_master_host \n";
&stop_vip();
$exit_code = 0;
};
if ($@) {
warn "Got Error: $@\n";
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "start" ) {
my $exit_code = 10;
eval {
print "Enabling the VIP - $vip on the new master - $new_master_host \n";
&start_vip();
$exit_code = 0;
};
if ($@) {
warn $@;
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "status" ) {
print "Checking the Status of the script.. OK \n";
exit 0;
}
else {
&usage();
exit 1;
}
}
sub start_vip() {
`ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
sub stop_vip() {
return 0 unless ($ssh_user);
`ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}
sub usage {
print
"Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}
chmod +x master_ip_failover
vi master_ip_online_change
The script content is as follows:
#!/usr/bin/env perl
# Copyright (C) 2011 DeNA Co.,Ltd.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
## Note: This is a sample script and is not complete. Modify the script based on your environment.
use strict;
use warnings FATAL => 'all';
use Getopt::Long;
use MHA::DBHelper;
use MHA::NodeUtil;
use Time::HiRes qw( sleep gettimeofday tv_interval );
use Data::Dumper;
my $_tstart;
my $_running_interval = 0.1;
my $vip = '172.16.10.30';
my $key = "2";
my $ssh_start_vip = "/sbin/ifconfig eth0:$key $vip";
my $ssh_stop_vip = "/sbin/ip addr del $vip dev eth0";
my $ssh_send_garp = "/sbin/arping -U $vip -I eth0 -c 1";
my (
$command, $orig_master_is_new_slave, $orig_master_host,
$orig_master_ip, $orig_master_port, $orig_master_user,
$orig_master_password, $orig_master_ssh_user, $new_master_host,
$new_master_ip, $new_master_port, $new_master_user,
$new_master_password, $new_master_ssh_user,
);
GetOptions(
'command=s' => \$command,
'orig_master_is_new_slave' => \$orig_master_is_new_slave,
'orig_master_host=s' => \$orig_master_host,
'orig_master_ip=s' => \$orig_master_ip,
'orig_master_port=i' => \$orig_master_port,
'orig_master_user=s' => \$orig_master_user,
'orig_master_password=s' => \$orig_master_password,
'orig_master_ssh_user=s' => \$orig_master_ssh_user,
'new_master_host=s' => \$new_master_host,
'new_master_ip=s' => \$new_master_ip,
'new_master_port=i' => \$new_master_port,
'new_master_user=s' => \$new_master_user,
'new_master_password=s' => \$new_master_password,
'new_master_ssh_user=s' => \$new_master_ssh_user,
);
exit &main();
sub start_vip(){
`ssh $new_master_ssh_user\@$new_master_host \" $ssh_start_vip \"`;
`ssh $new_master_ssh_user\@$new_master_host \" $ssh_send_garp \"`;
}
sub stop_vip(){
`ssh $orig_master_ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}
sub current_time_us {
my ( $sec, $microsec ) = gettimeofday();
my $curdate = localtime($sec);
return $curdate . " " . sprintf( "%06d", $microsec );
}
sub sleep_until {
my $elapsed = tv_interval($_tstart);
if ( $_running_interval > $elapsed ) {
sleep( $_running_interval - $elapsed );
}
}
sub get_threads_util {
my $dbh = shift;
my $my_connection_id = shift;
my $running_time_threshold = shift;
my $type = shift;
$running_time_threshold = 0 unless ($running_time_threshold);
$type = 0 unless ($type);
my @threads;
my $sth = $dbh->prepare("SHOW PROCESSLIST");
$sth->execute();
while ( my $ref = $sth->fetchrow_hashref() ) {
my $id = $ref->{Id};
my $user = $ref->{User};
my $host = $ref->{Host};
my $command = $ref->{Command};
my $state = $ref->{State};
my $query_time = $ref->{Time};
my $info = $ref->{Info};
$info =~ s/^\s*(.*?)\s*$/$1/ if defined($info);
next if ( $my_connection_id == $id );
next if ( defined($query_time) && $query_time < $running_time_threshold );
next if ( defined($command) && $command eq "Binlog Dump" );
next if ( defined($user) && $user eq "system user" );
next
if ( defined($command)
&& $command eq "Sleep"
&& defined($query_time)
&& $query_time >= 1 );
if ( $type >= 1 ) {
next if ( defined($command) && $command eq "Sleep" );
next if ( defined($command) && $command eq "Connect" );
}
if ( $type >= 2 ) {
next if ( defined($info) && $info =~ m/^select/i );
next if ( defined($info) && $info =~ m/^show/i );
}
push @threads, $ref;
}
return @threads;
}
sub main {
if ( $command eq "stop" ) {
## Gracefully killing connections on the current master
# 1. Set read_only= 1 on the new master
# 2. DROP USER so that no app user can establish new connections
# 3. Set read_only= 1 on the current master
# 4. Kill current queries
# * Any database access failure will result in script die.
my $exit_code = 1;
eval {
## Setting read_only=1 on the new master (to avoid accident)
my $new_master_handler = new MHA::DBHelper();
# args: hostname, port, user, password, raise_error(die_on_error)_or_not
$new_master_handler->connect( $new_master_ip, $new_master_port,
$new_master_user, $new_master_password, 1 );
print current_time_us() . " Set read_only on the new master.. ";
$new_master_handler->enable_read_only();
if ( $new_master_handler->is_read_only() ) {
print "ok.\n";
}
else {
die "Failed!\n";
}
$new_master_handler->disconnect();
# Connecting to the orig master, die if any database error happens
my $orig_master_handler = new MHA::DBHelper();
$orig_master_handler->connect( $orig_master_ip, $orig_master_port,
$orig_master_user, $orig_master_password, 1 );
## Drop application user so that nobody can connect. Disabling per-session binlog beforehand
$orig_master_handler->disable_log_bin_local();
# print current_time_us() . " Drpping app user on the orig master..\n";
#drop_app_user($orig_master_handler);
## Waiting for N * 100 milliseconds so that current connections can exit
my $time_until_read_only = 15;
$_tstart = [gettimeofday];
my @threads = get_threads_util( $orig_master_handler->{dbh},
$orig_master_handler->{connection_id} );
while ( $time_until_read_only > 0 && $#threads >= 0 ) {
if ( $time_until_read_only % 5 == 0 ) {
printf
"%s Waiting all running %d threads are disconnected.. (max %d milliseconds)\n",
current_time_us(), $#threads + 1, $time_until_read_only * 100;
if ( $#threads < 5 ) {
print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
foreach (@threads);
}
}
sleep_until();
$_tstart = [gettimeofday];
$time_until_read_only--;
@threads = get_threads_util( $orig_master_handler->{dbh},
$orig_master_handler->{connection_id} );
}
## Setting read_only=1 on the current master so that nobody(except SUPER) can write
print current_time_us() . " Set read_only=1 on the orig master.. ";
$orig_master_handler->enable_read_only();
if ( $orig_master_handler->is_read_only() ) {
print "ok.\n";
}
else {
die "Failed!\n";
}
## Waiting for M * 100 milliseconds so that current update queries can complete
my $time_until_kill_threads = 5;
@threads = get_threads_util( $orig_master_handler->{dbh},
$orig_master_handler->{connection_id} );
while ( $time_until_kill_threads > 0 && $#threads >= 0 ) {
if ( $time_until_kill_threads % 5 == 0 ) {
printf
"%s Waiting all running %d queries are disconnected.. (max %d milliseconds)\n",
current_time_us(), $#threads + 1, $time_until_kill_threads * 100;
if ( $#threads < 5 ) {
print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
foreach (@threads);
}
}
sleep_until();
$_tstart = [gettimeofday];
$time_until_kill_threads--;
@threads = get_threads_util( $orig_master_handler->{dbh},
$orig_master_handler->{connection_id} );
}
## Terminating all threads
print current_time_us() . " Killing all application threads..\n";
$orig_master_handler->kill_threads(@threads) if ( $#threads >= 0 );
print current_time_us() . " done.\n";
$orig_master_handler->enable_log_bin_local();
$orig_master_handler->disconnect();
## Droping the VIP
print "Disabling the VIP an old master: $orig_master_host \n";
&stop_vip();
## After finishing the script, MHA executes FLUSH TABLES WITH READ LOCK
$exit_code = 0;
};
if ($@) {
warn "Got Error: $@\n";
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "start" ) {
## Activating master ip on the new master
# 1. Create app user with write privileges
# 2. Moving backup script if needed
# 3. Register new master's ip to the catalog database
# We don't return error even though activating updatable accounts/ip failed so that we don't interrupt slaves' recovery.
# If exit code is 0 or 10, MHA does not abort
my $exit_code = 10;
eval {
my $new_master_handler = new MHA::DBHelper();
# args: hostname, port, user, password, raise_error_or_not
$new_master_handler->connect( $new_master_ip, $new_master_port,
$new_master_user, $new_master_password, 1 );
## Set read_only=0 on the new master
$new_master_handler->disable_log_bin_local();
print current_time_us() . " Set read_only=0 on the new master.\n";
$new_master_handler->disable_read_only();
## Creating an app user on the new master
#print current_time_us() . " Creating app user on the new master..\n";
# create_app_user($new_master_handler);
print "Enabling the VIP $vip on the new master: $new_master_host \n";
&start_vip();
$new_master_handler->enable_log_bin_local();
$new_master_handler->disconnect();
## Update master ip on the catalog database, etc
$exit_code = 0;
};
if ($@) {
warn "Got Error: $@\n";
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "status" ) {
# do nothing
exit 0;
}
else {
&usage();
exit 1;
}
}
sub usage {
  print
"Usage: master_ip_online_change --command=start|stop|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
die;
}
chmod +x master_ip_online_change
Check SSH connectivity with the MHA tooling.
Install the required package:
yum -y install perl-Time-HiRes
Run the check:
/usr/local/bin/masterha_check_ssh --conf=/etc/mha/mha.conf
If every line reports OK, the SSH check succeeded.
Check the whole replication topology:
/usr/local/bin/masterha_check_repl --conf=/etc/mha/mha.conf
This step fails easily.
Note: update mysql.user set Host='%' where User='root'; (the master and slaves need to be able to log in to each other as root)
5.4. Add the VIP
On the master (172.16.10.22), add the VIP manually (the first time only):
ip addr add 172.16.10.30/24 dev eth0:1
ip addr show
To remove it: ip addr del 172.16.10.30/24 dev eth0
5.5. Start the MHA service
Start MHA on the management node, slave2 (172.16.10.62):
nohup /usr/local/bin/masterha_manager --conf=/etc/mha/mha.conf > /tmp/mha_manager.log < /dev/null 2>&1 &
Note: after a failover test has already run once, start MHA like this instead:
nohup /usr/local/bin/masterha_manager --conf=/etc/mha/mha.conf --ignore_last_failover > /tmp/mha_manager.log < /dev/null 2>&1 &
Verify the startup and check the status:
/usr/local/bin/masterha_check_status --conf=/etc/mha/mha.conf
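The --ignore_last_failover flag exists because masterha_manager records each completed failover in a file named <app>.failover.complete under manager_workdir and, by default, refuses to start another failover within 8 hours of the last one. A sketch of clearing that lock instead of passing the flag (a temp directory stands in for /usr/local/mha here so the sketch is safe to run anywhere):

```shell
# Simulate clearing the failover-complete lock that MHA leaves behind.
workdir=$(mktemp -d)                       # stand-in for manager_workdir=/usr/local/mha
touch "$workdir/mha.failover.complete"     # what a completed failover leaves behind
rm -f "$workdir/mha.failover.complete"     # remove it so masterha_manager can start cleanly
[ ! -e "$workdir/mha.failover.complete" ] && echo "failover lock cleared"
rmdir "$workdir"
```

On the real manager node the file lives under the workdir configured in mha.conf; remove it (or use --ignore_last_failover) before restarting the monitor after a drill.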
5.6. Failover drill
Simulate a crash of the master (172.16.10.22) by stopping its MySQL service:
mysqladmin -S /tmp/mysql3307.sock -uroot -pmysql shutdown
The original slave1 automatically acquires the VIP 172.16.10.30, as shown by:
ip addr
That is, slave1 has become the new master.
Slave2 now points at the new master.
The MHA process on the management node (172.16.10.62) stops automatically:
ps -ef |grep MHA
root 11998 5673 0 09:01 pts/1 00:00:00 grep MHA
/usr/local/bin/masterha_check_status --conf=/etc/mha/mha.conf
5.7. Restore the original master
Start the original master:
mysqld_safe --defaults-file=/etc/my3307.cnf &
Reconfigure replication.
The data must be consistent first (in production, restore from a backup if it is not). The original master becomes a slave of the new one-master-two-slave topology:
change master to master_host='172.16.10.61',master_port=3307,master_user='rep',master_password='mysql',master_auto_position=1;
Manual online switchover:
In my test environment I ran:
/usr/local/bin/masterha_master_switch --conf=/etc/mha/mha.conf --master_state=alive --new_master_host=172.16.10.22 --new_master_port=3307 --orig_master_is_new_slave --running_updates_limit=10000
Note: the monitor process does not need to be running for the switch back.
The command prompts for YES partway through.
The original slave1 (currently the master) is switched away, and checking with ip addr shows the VIP has floated back; the switch back succeeded.
After the switch succeeds, start MHA again:
nohup /usr/local/bin/masterha_manager --conf=/etc/mha/mha.conf --ignore_last_failover > /tmp/mha_manager.log < /dev/null 2>&1 &
Chapter 1: MHA architecture
MHA (Master High Availability) is currently a relatively mature solution for MySQL high availability. Developed in Japan by Yoshinori Matsunobu, it is an excellent piece of software for failover and master promotion in MySQL high-availability environments. During a failover, MHA can complete the database switch automatically within 0-30 seconds, and in the process it preserves data consistency to the greatest extent possible, achieving genuine high availability.
MHA consists of two parts: the MHA Manager (management node) and MHA Node (data node). The MHA Manager can be deployed on a dedicated machine to manage multiple master-slave clusters, or on one of the slaves.
When the master fails, MHA automatically promotes the slave with the most recent data to the new master and repoints all other slaves at it. The whole failover is completely transparent to applications.
1.1 Caveats
During an automatic failover, MHA tries to save the binary logs from the crashed master to minimize data loss, but this is not always possible.
For example, if the master has a hardware fault or is unreachable over SSH, MHA cannot save the binary logs and must fail over without the latest data.
Note: if only the MySQL service dies, the binary logs can still be copied off the server. But if the hardware is down or SSH is unreachable, the latest binlog cannot be retrieved, and with replication lag this means data loss.
Using MySQL 5.5 semi-synchronous replication greatly reduces the risk of data loss, and MHA can be combined with it. If even one slave has received the latest binary log events, MHA can apply them to all the other slaves and keep them consistent.
The latest release, 0.56, adds GTID support; it is recommended for MySQL 5.6 and later. For MySQL 5.5, use manager version 0.55 and node version 0.54.
1.2 Applicable scenarios
MHA currently mainly supports one-master-many-slaves topologies. Building MHA requires a replication cluster of at least three database servers: one master, one candidate master, and one slave. For cost reasons Taobao adapted the design, and its TMHA now supports one master with a single slave.
1.3 How MHA works
1. Save the binary log events (binlog events) from the crashed master;
2. Identify the slave with the most recent updates;
3. Apply the differential relay logs to the other slaves;
4. Apply the binary log events saved from the master;
5. Promote one slave to be the new master;
6. Point the other slaves at the new master and resume replication.
1.4 MHA components
The MHA software consists of two parts, the Manager toolkit and the Node toolkit:
1. Manager toolkit:
- masterha_check_ssh: checks MHA's SSH configuration.
- masterha_check_repl: checks MySQL replication health.
- masterha_manager: starts MHA.
- masterha_check_status: checks the current MHA running status.
- masterha_master_monitor: detects whether the master is down.
- masterha_master_switch: controls failover (automatic or manual).
- masterha_conf_host: adds or removes configured server entries.
2. Node toolkit (normally triggered by MHA Manager scripts; no manual operation needed):
- save_binary_logs: saves and copies the master's binary logs.
- apply_diff_relay_logs: identifies differential relay log events and applies them to other slaves.
- filter_mysqlbinlog: strips unnecessary ROLLBACK events (deprecated).
- purge_relay_logs: purges relay logs (without blocking the SQL thread).
Important: to minimize data loss from a master hardware failure, it is strongly recommended to configure MySQL 5.5 semi-synchronous replication alongside MHA.
Further thought: to guarantee consistency, MySQL replication setups often set sync_binlog on the master to make the binlog durable, at a cost of roughly 10-20% in disk I/O. An alternative is semi-synchronous replication, which acknowledges writes from the slave's memory; it also has a performance cost, but compared with the master's extra disk I/O it is often the better trade-off. According to the tests in section 10.9 of High Performance MySQL, 3rd edition, writing to a remote machine's memory (one slave's acknowledgement) is faster than writing to the local disk (write plus flush), and semi-synchronous replication roughly doubles performance compared with strong durability on the master alone.
Chapter 2: Environment
2.1 Base environment, part one
OS: CentOS 6.5 64-bit; SSH: default port;
User: root
Write VIP: 192.168.1.20; read VIP: 192.168.1.21
2.2 基础环境情况二
角色 |
IP 地址 |
主机名 |
Server-ID |
类型 |
Master |
192.168.1.200 |
Server-A |
168001200 |
写 |
Candicate Master |
192.168.1.201 |
Server-B |
168001201 |
读 |
Slave |
192.168.1.202 |
Server-C |
168001202 |
读 |
Monitor host |
192.168.1.203 |
Server-D |
MySQL 主库的故障转移 | |
LVS+Keepalived-A |
192.168.1.204 |
LVS-A |
读操作的负载均衡 | |
LVS+Keepalived-B |
192.168.1.205 |
LVS-B |
LVS 高可用备机 |
2.3 Software deployment

| Role | Manager | Node | Keepalived | LVS |
| --- | --- | --- | --- | --- |
| Master | - | deployed | - | - |
| Candidate Master | - | deployed | - | - |
| Slave | - | deployed | - | - |
| Monitor host | deployed | deployed | - | - |
| LVS+Keepalived-A | - | - | deployed | deployed |
| LVS+Keepalived-B | - | - | deployed | deployed |
2.4 Topology and how it works
2.4.1 Topology
(diagram omitted; shows the LVS standby)
2.4.2 How it works
1. Reads
1) LVS load-balances read operations;
2) Keepalived manages LVS above it and health-checks the two slaves (via a user-defined check script);
3) When a slave fails, Keepalived removes it from the load-balancing pool.
2. Writes
1) The write VIP is bound on the Master (done by script once MHA starts);
2) MHA monitors the Master's state; when the Master fails (crash, replication stopped),
3) the failover script unbinds the write VIP from the Master,
4) binds it on the Candidate Master and promotes it to master;
5) the differential logs are synchronized and applied, and the slaves are repointed at the new master.
Question: once MHA has switched the master to the Candidate Master, how does LVS handle the reads it was sending there? Answer: Keepalived's script periodically checks the Candidate Master's replication state (sync status, SQL thread, I/O thread). Once the Candidate Master is promoted, those states disappear, so Keepalived automatically removes it from the load-balancing pool.
Chapter 3: Install MySQL and configure replication
3.1 Build the replication environment
1. mysqldump the master's data with --master-data=2;
2. import it on the slaves;
3. change master to master_host='192.168.1.200',master_user='repl',master_password='replpassword',master_port=3306,master_log_file='mysql-bin.00000x',master_log_pos=xxx;
4. start slave;
5. set the slaves to read-only.
Chapter 4: Install and deploy MHA
4.1 Install MHA Node
Install it on all the MySQL servers, and also install the Node package on the MHA management node.
4.1.1 Configure a YUM repository (the dependencies are painful to resolve otherwise)
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-6.repo
4.1.2 Install the Perl module (DBD::mysql) on the MySQL servers
1. Method one: install with YUM (recommended)
yum install -y perl-DBD-MySQL.x86_64 perl-DBI.x86_64
2. Method two: install with cpanm (does not work well on CentOS 5.5)
Write a script DBD_install.sh:
#!/bin/bash
wget http://xrl.us/cpanm --no-check-certificate
mv cpanm /usr/bin/
chmod 755 /usr/bin/cpanm
cat >/root/list<
for package in `cat /root/list`
do
cpanm $package
done
On a second install, Config::mysql and Log::Dispatch failed to install. Since the machine had been restored from a snapshot, the system time was the suspected cause; after fixing it, a manual cpanm install Config::Tiny succeeded.
4.1.3 Install MHA Node on all nodes:
tar xf mha4mysql-node-0.53.tar.gz
cd mha4mysql-node
perl Makefile.PL
make && make install
4.2 Install MHA Manager
MHA Manager mainly consists of several administrative command-line tools, which also depend on some Perl modules.
1. Method one: install with YUM (recommended)
yum install -y perl-Config-Tiny.noarch perl-Log-Dispatch.noarch perl-Parallel-ForkManager.noarch perl-DBD-MySQL perl-DBI
2. Method two: install with cpanm. Note: the system time must be correct.
Write a script DBD_install.sh:
#!/bin/bash
wget http://xrl.us/cpanm --no-check-certificate
mv cpanm /usr/bin/
chmod 755 /usr/bin/cpanm
cat >/root/list<
for package in `cat /root/list`
do
cpanm $package
done
4.2.2 Install MHA Node on all nodes:
tar xf mha4mysql-node-0.53.tar.gz
cd mha4mysql-node
perl Makefile.PL
make && make install
4.2.3 Install MHA Manager
tar zxf mha4mysql-manager-0.53.tar.gz
cd mha4mysql-manager-0.53
perl Makefile.PL
make
make install
Several issues came up during installation. For example, on CentOS 5 the cpanm method does not work well, so YUM is a better idea, and MySQL must be installed before building MHA, otherwise Makefile.PL fails like this:
Cannot find the file 'mysql_config'! Your execution PATH doesn't seem not contain the path to mysql_config. Resorting to guessed values!
Can't exec "mysql_config": No such file or directory at Makefile.PL line 473.
Can't find mysql_config. Use --mysql_config option to specify where mysql_config is located
4.3 Configure passwordless SSH login
On the Manager, set up passwordless access to all the Node hosts:
ssh-keygen -t rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub "root@192.168.1.200"
ssh-copy-id -i /root/.ssh/id_rsa.pub "root@192.168.1.201"
ssh-copy-id -i /root/.ssh/id_rsa.pub "root@192.168.1.202"
On the Master, set up passwordless access to the other Node hosts:
ssh-keygen -t rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub "root@192.168.1.201"
ssh-copy-id -i /root/.ssh/id_rsa.pub "root@192.168.1.202"
On the Candidate Master, set up passwordless access to the other Node hosts:
ssh-keygen -t rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub "root@192.168.1.200"
ssh-copy-id -i /root/.ssh/id_rsa.pub "root@192.168.1.202"
On the Slave, set up passwordless access to the other Node hosts:
ssh-keygen -t rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub "root@192.168.1.200"
ssh-copy-id -i /root/.ssh/id_rsa.pub "root@192.168.1.201"
Configure the hostnames:
cat >>/etc/hosts<< EOF
192.168.1.200 ip200
192.168.1.201 ip201
192.168.1.202 ip202
192.168.1.203 ip203
EOF
Create the monitoring account on the Master:
grant all on *.* to 'root'@'192.168.1.%' identified by 'rootpassword';
Otherwise, the MHA replication check reports errors like the following:
[root@MHA mha]# masterha_check_repl --conf=/etc/mha/app1.cnf
Tue Dec 23 16:21:52 2014 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Dec 23 16:21:52 2014 - [info] Reading application default configurations from /etc/mha/app1.cnf..
Tue Dec 23 16:21:52 2014 - [info] Reading server configurations from /etc/mha/app1.cnf..
Tue Dec 23 16:21:52 2014 - [info] MHA::MasterMonitor version 0.55.
Tue Dec 23 16:21:52 2014 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln255] Got MySQL error when connecting 192.168.1.202(192.168.1.202:3306) :1045 :Access denied for user 'root'@'192.168.1.203' (using password: YES), but this is not mysql crash. Check MySQL server settings.
at /usr/local/share/perl5/MHA/ServerManager.pm line 251
Tue Dec 23 16:21:52 2014 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln255] Got MySQL error when connecting 192.168.1.200(192.168.1.200:3306) :1045:Access denied for user 'root'@'192.168.1.203' (using password: YES), but this is not mysql crash. Check MySQL server settings.
at /usr/local/share/perl5/MHA/ServerManager.pm line 251
Tue Dec 23 16:21:52 2014 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln255] Got MySQL error when connecting 192.168.1.201(192.168.1.201:3306) :1045:Access denied for user 'root'@'192.168.1.203' (using password: YES), but this is not mysql crash. Check MySQL server settings.
at /usr/local/share/perl5/MHA/ServerManager.pm line 251
Tue Dec 23 16:21:52 2014 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln263] Got fatal error, stopping operations
Tue Dec 23 16:21:52 2014 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln386] Error happend on checking configurations. at /usr/local/share/perl5/MHA/MasterMonitor.pm line 300
Tue Dec 23 16:21:52 2014 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln482] Error happened on monitoring servers.
Tue Dec 23 16:21:52 2014 - [info] Got exit code 1 (Not master dead).
4.4 Configure MHA (the ports were reset to the defaults because of earlier port problems)
4.4.1 Create the configuration file on the Manager
mkdir -p /etc/masterha
vim /etc/masterha/app1.cnf
Note: the master_ip_failover script referenced by the configuration does not exist out of the box, so startup will report an error; copy the corresponding sample script from mha4mysql-manager-0.5X/samples/scripts to the configured location.
[server default]
manager_log=/etc/mha/app1/app1.log
manager_workdir=/etc/mha/app1/
master_binlog_dir=/var/lib/mysql
master_ip_failover_script=/etc/mha/master_ip_failover
ping_interval=1
remote_workdir=/tmp
secondary_check_script=/usr/local/bin/masterha_secondary_check -s ip201 -s ip200 --user=root --master_host=ip200 --master_ip=192.168.1.200 --master_port=3306
ssh_user=root
user=root
password=rootpassword
repl_user=repl
repl_password=replpassword

[server1]
hostname=192.168.1.200
port=3306

[server2]
candidate_master=1
check_repl_delay=0
hostname=192.168.1.201
port=3306

[server3]
hostname=192.168.1.202
port=3306
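The app1.cnf above is plain INI syntax (MHA itself reads it with Perl's Config::Tiny). A quick way to sanity-check such a file before handing it to MHA is to parse it; below is a sketch using Python's configparser on a trimmed copy of the configuration:

```python
import configparser

# Trimmed copy of the app1.cnf sections above.
APP1_CNF = """
[server default]
user=root
repl_user=repl
ping_interval=1

[server1]
hostname=192.168.1.200
port=3306

[server2]
candidate_master=1
hostname=192.168.1.201
port=3306
"""

cfg = configparser.ConfigParser()
cfg.read_string(APP1_CNF)

# Per-server sections do not inherit automatically; MHA merges
# [server default] into each [serverN] itself.
servers = [s for s in cfg.sections() if s != "server default"]
print(servers)                                        # ['server1', 'server2']
print(cfg["server2"].getboolean("candidate_master"))  # True
```

Since the file format is strict (no trailing spaces after values), a parse like this catches most typos before masterha_check_repl does.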
4.4.2 The failover script
[root@MHA mha]# cat master_ip_failover
#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use Getopt::Long;

my (
    $command,        $ssh_user,         $orig_master_host,
    $orig_master_ip, $orig_master_port, $new_master_host,
    $new_master_ip,  $new_master_port,
);
my $vip = '192.168.1.20/24';    # Virtual IP
my $key = "1";
my $ssh_start_vip = "/sbin/ifconfig eth1:$key $vip";
my $ssh_stop_vip  = "/sbin/ifconfig eth1:$key down";
$ssh_user = "root";

GetOptions(
    'command=s'          => \$command,
    'ssh_user=s'         => \$ssh_user,
    'orig_master_host=s' => \$orig_master_host,
    'orig_master_ip=s'   => \$orig_master_ip,
    'orig_master_port=i' => \$orig_master_port,
    'new_master_host=s'  => \$new_master_host,
    'new_master_ip=s'    => \$new_master_ip,
    'new_master_port=i'  => \$new_master_port,
);

exit &main();

sub main {
    print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";
    if ( $command eq "stop" || $command eq "stopssh" ) {
        # $orig_master_host, $orig_master_ip, $orig_master_port are passed.
        # If you manage the master IP address in a global catalog database,
        # invalidate orig_master_ip here.
        my $exit_code = 1;
        eval {
            print "Disabling the VIP on old master: $orig_master_host \n";
            &stop_vip();
            # updating global catalog, etc.
            $exit_code = 0;
        };
        if ($@) {
            warn "Got Error: $@\n";
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "start" ) {
        # All arguments are passed.
        # If you manage the master IP address in a global catalog database,
        # activate new_master_ip here.
        # You can also grant write access (create user, set read_only=0, etc.) here.
        my $exit_code = 10;
        eval {
            print "Enabling the VIP - $vip on the new master - $new_master_host \n";
            &start_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn $@;
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "status" ) {
        print "Checking the Status of the script.. OK \n";
        `ssh $ssh_user\@$orig_master_ip \" $ssh_start_vip \"`;
        exit 0;
    }
    else {
        &usage();
        exit 1;
    }
}

# A simple system call that enables the VIP on the new master
sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}

# A simple system call that disables the VIP on the old master
sub stop_vip() {
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
    print "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}
# the end.
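Stripped of its SSH plumbing, the Perl script above is a three-way dispatch on --command. The same control flow as a Python sketch, where the `run` callback stands in for the backtick ssh calls (function name and callback are ours, command strings are the ones from the script):

```python
def master_ip_failover(command, orig_master, new_master,
                       run=lambda host, cmd: None):
    """Mimic the stop/start/status dispatch of master_ip_failover.
    Returns the script's exit code; `run(host, cmd)` stands in for
    `ssh host "cmd"`."""
    vip = "192.168.1.20/24"
    start_vip = f"/sbin/ifconfig eth1:1 {vip}"
    stop_vip = "/sbin/ifconfig eth1:1 down"
    if command in ("stop", "stopssh"):
        run(orig_master, stop_vip)    # take the VIP off the dead master
        return 0
    if command == "start":
        run(new_master, start_vip)    # bring the VIP up on the promoted slave
        return 0
    if command == "status":
        run(orig_master, start_vip)   # the original re-asserts the VIP here
        return 0
    return 1                          # unknown command: usage error

calls = []
rc = master_ip_failover("start", "192.168.1.200", "192.168.1.201",
                        run=lambda h, c: calls.append((h, c)))
print(rc, calls[0][0])  # 0 192.168.1.201
```

Note that the real script returns 10 from "start" if the eval dies; the sketch omits error paths to keep the dispatch visible.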
4.4.3 Switch the slaves to manual relay log purging
set global relay_log_purge = 0;
Add the same setting to the configuration file as well.
4.4.4 Add a cron job on the slaves to purge relay logs
echo "0 4 * * * /usr/bin/purge_relay_logs --user=root --password=123.com --disable_relay_log_purge --port=3306 --workdir=/tmp >> /usr/local/masterha/log/purge_relay_logs.log 2>&1" > /etc/cron.d/purge_relay_logs
Whether this actually takes effect still needs to be verified.
Note: the /etc/cron.d/ directory holds additional crontab files or scripts to execute. Test the purge script manually:
[root@Server-C mysql]# /usr/bin/purge_relay_logs --user=root --password=123.com --disable_relay_log_purge --port=3306 --workdir=/tmp
2014-11-18 10:01:37: purge_relay_logs script started.
 Found relay_log.info: /var/lib/mysql/relay-log.info
 Opening /var/lib/mysql/mysql-relay-bin.000001 ..
 Opening /var/lib/mysql/mysql-relay-bin.000002 ..
 Executing SET GLOBAL relay_log_purge=1; FLUSH LOGS; sleeping a few seconds so that SQL thread can delete older relay log files (if it keeps up); SET GLOBAL relay_log_purge=0; .. ok.
2014-11-18 10:01:40: All relay log purging operations succeeded.
4.4.5 Add mysqlbinlog to the PATH on all MySQL nodes
echo "export PATH=$PATH:/usr/local/mysql/bin" >> /etc/bashrc
source /etc/bashrc
Without this, MHA's replication check fails as follows:
Tue Dec 23 16:22:18 2014 - [info] Master MHA Node version is 0.54.
Tue Dec 23 16:22:18 2014 - [info] Checking recovery script configurations on the current master..
Tue Dec 23 16:22:18 2014 - [info]   Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/var/lib/mysql --output_file=/tmp/save_binary_logs_test --manager_version=0.55 --start_file=mysql-bin.000003
Tue Dec 23 16:22:18 2014 - [info]   Connecting to root@192.168.1.200(192.168.1.200)..
 Creating /tmp if not exists..    ok.
 Checking output directory is accessible or not..   ok.
 Binlog found at /var/lib/mysql, up to mysql-bin.000003
Tue Dec 23 16:22:19 2014 - [info] Master setting check done.
Tue Dec 23 16:22:19 2014 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Tue Dec 23 16:22:19 2014 - [info]   Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=192.168.1.201 --slave_ip=192.168.1.201 --slave_port=3306 --workdir=/tmp --target_version=5.5.32-log --manager_version=0.55 --relay_log_info=/var/lib/mysql/relay-log.info --relay_dir=/var/lib/mysql/ --slave_pass=xxx
Tue Dec 23 16:22:19 2014 - [info] Connecting to root@192.168.1.201(192.168.1.201:22)..
Can't exec "mysqlbinlog": No such file or directory at /usr/local/share/perl5/MHA/BinlogManager.pm line 99. mysqlbinlog version not found!
at /usr/local/bin/apply_diff_relay_logs line 482
Tue Dec 23 16:22:20 2014 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln195] Slaves settings check failed!
Tue Dec 23 16:22:20 2014 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln375] Slave configuration failed.
Tue Dec 23 16:22:20 2014 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln386] Error happend on checking configurations. at /usr/local/bin/masterha_check_repl line 48
Tue Dec 23 16:22:20 2014 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln482] Error happened on monitoring servers.
Tue Dec 23 16:22:20 2014 - [info] Got exit code 1 (Not master dead).
MySQL Replication Health is NOT OK!
(The corresponding error output is missing here.)
4.4.6 Check the SSH, replication, and MHA Manager configuration
Check SSH: masterha_check_ssh --conf=/etc/mha/app1.cnf
Check replication: masterha_check_repl --conf=/etc/mha/app1.cnf
Check MHA status: masterha_check_status --conf=/etc/mha/app1.cnf
1. Check SSH
[root@Server-D app1]# masterha_check_ssh --conf=/etc/masterha/app1.cnf
Thu Dec 11 21:33:06 2014 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Dec 11 21:33:06 2014 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Thu Dec 11 21:33:06 2014 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Thu Dec 11 21:33:06 2014 - [info] Starting SSH connection tests..
Thu Dec 11 21:33:06 2014 - [debug]
Thu Dec 11 21:33:06 2014 - [debug] Connecting via SSH from root@192.168.1.200(192.168.1.200:22) to root@192.168.1.201(192.168.1.201:22)..
Thu Dec 11 21:33:06 2014 - [debug] ok.
Thu Dec 11 21:33:06 2014 - [debug] Connecting via SSH from root@192.168.1.200(192.168.1.200:22) to root@192.168.1.202(192.168.1.202:22)..
Thu Dec 11 21:33:06 2014 - [debug] ok.
Thu Dec 11 21:33:07 2014 - [debug]
Thu Dec 11 21:33:06 2014 - [debug] Connecting via SSH from root@192.168.1.201(192.168.1.201:22) to root@192.168.1.200(192.168.1.200:22)..
Thu Dec 11 21:33:06 2014 - [debug] ok.
Thu Dec 11 21:33:06 2014 - [debug] Connecting via SSH from root@192.168.1.201(192.168.1.201:22) to root@192.168.1.202(192.168.1.202:22)..
Thu Dec 11 21:33:07 2014 - [debug] ok.
Thu Dec 11 21:33:07 2014 - [debug]
Thu Dec 11 21:33:07 2014 - [debug] Connecting via SSH from root@192.168.1.202(192.168.1.202:22) to root@192.168.1.200(192.168.1.200:22)..
Thu Dec 11 21:33:07 2014 - [debug] ok.
Thu Dec 11 21:33:07 2014 - [debug] Connecting via SSH from root@192.168.1.202(192.168.1.202:22) to root@192.168.1.201(192.168.1.201:22)..
Thu Dec 11 21:33:07 2014 - [debug] ok.
Thu Dec 11 21:33:07 2014 - [info] All SSH connection tests passed successfully.
Note: "Warning: Permanently added '192.168.1.201' (RSA) to the list of known hosts." simply records the host's RSA key in the known_hosts file; it is harmless, and the checks here all passed.
2. Check replication
[root@Server-D app1]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Thu Dec 11 21:35:53 2014 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.   (there is no default configuration file)
Thu Dec 11 21:35:53 2014 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Thu Dec 11 21:35:53 2014 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Thu Dec 11 21:35:53 2014 - [info] MHA::MasterMonitor version 0.56.
Thu Dec 11 21:35:53 2014 - [info] GTID failover mode = 0
Thu Dec 11 21:35:53 2014 - [info] Dead Servers:
Thu Dec 11 21:35:53 2014 - [info] Alive Servers:
Thu Dec 11 21:35:53 2014 - [info]   192.168.1.200(192.168.1.200:3306)
Thu Dec 11 21:35:53 2014 - [info]   192.168.1.201(192.168.1.201:3306)
Thu Dec 11 21:35:53 2014 - [info]   192.168.1.202(192.168.1.202:3306)
Thu Dec 11 21:35:53 2014 - [info] Alive Slaves:
Thu Dec 11 21:35:53 2014 - [info]   192.168.1.201(192.168.1.201:3306)  Version=5.5.32-log (oldest major version between slaves) log-bin:enabled
Thu Dec 11 21:35:53 2014 - [info]     Replicating from 192.168.1.200(192.168.1.200:3306)
Thu Dec 11 21:35:53 2014 - [info]     Primary candidate for the new Master (candidate_master is set)
Thu Dec 11 21:35:53 2014 - [info]   192.168.1.202(192.168.1.202:3306)  Version=5.5.32-log (oldest major version between slaves) log-bin:enabled
Thu Dec 11 21:35:53 2014 - [info]     Replicating from 192.168.1.200(192.168.1.200:3306)
Thu Dec 11 21:35:53 2014 - [info] Current Alive Master: 192.168.1.200(192.168.1.200:3306)
Thu Dec 11 21:35:53 2014 - [info] Checking slave configurations..
Thu Dec 11 21:35:53 2014 - [info] Checking replication filtering settings..
Thu Dec 11 21:35:53 2014 - [info]  binlog_do_db= , binlog_ignore_db=
Thu Dec 11 21:35:53 2014 - [info]  Replication filtering check ok.
Thu Dec 11 21:35:53 2014 - [info] GTID (with auto-pos) is not supported   (GTID is not supported)
... (many lines omitted) ...
Checking the Status of the script.. OK
Thu Dec 11 21:35:56 2014 - [info] OK.
Thu Dec 11 21:35:56 2014 - [warning] shutdown_script is not defined.   (this script is not configured)
Thu Dec 11 21:35:56 2014 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
Error fix 1: ln -s /usr/local/mysql/bin/* /usr/local/bin/
Error fix 2: the IP addresses had been changed, but /etc/hosts on the other three machines had not been updated; after fixing the hosts files the check passed. Before the fix it failed like this:
[root@Server-D opt]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Fri Dec 19 11:09:50 2014 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Fri Dec 19 11:09:50 2014 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
Fri Dec 19 11:09:50 2014 - [info] Reading server configurations from /etc/masterha/app1.cnf..
Fri Dec 19 11:09:50 2014 - [info] MHA::MasterMonitor version 0.53.
Fri Dec 19 11:09:58 2014 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln188] There is no alive server.
We can't do failover
Fri Dec 19 11:09:58 2014 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln383] Error happend on checking configurations. at /usr/local/share/perl5/MHA/MasterMonitor.pm line 298
Fri Dec 19 11:09:58 2014 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln478] Error happened on monitoring servers.
Fri Dec 19 11:09:58 2014 - [info] Got exit code 1 (Not master dead).
3. Check the MHA Manager
[root@Server-D app1]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
This means the MHA Manager is not running.
4. Start the MHA Manager
nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /masterha/app1/manager.log 2>&1 &
If the nohup process exits after running for a while, check app1.log for the error.
... (lines omitted) ...
Thu Dec 11 21:44:48 2014 - [info] Checking master_ip_failover_script status:
Thu Dec 11 21:44:48 2014 - [info]   /opt/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.1.200 --orig_master_ip=192.168.1.200 --orig_master_port=3306
Important: "Unknown option: orig_master_ssh_port". My SSH daemon runs on port 52020, and it is not clear where this option error comes from; the Perl script probably needs modifying. To be investigated.
IN SCRIPT TEST====sudo /sbin/ifconfig eth0:1 down==sudo /sbin/ifconfig eth0:1 192.168.1.10/24===
Checking the Status of the script.. OK
Thu Dec 11 21:44:48 2014 - [info] OK.
Thu Dec 11 21:44:48 2014 - [warning] shutdown_script is not defined.
Thu Dec 11 21:44:48 2014 - [info] Set master ping interval 1 seconds.
Thu Dec 11 21:44:48 2014 - [info] Set secondary check script: /usr/local/bin/masterha_secondary_check -s ip201 -s ip200 --user=root --master_host=ip200 --master_ip=192.168.1.200 --master_port=3306
Thu Dec 11 21:44:48 2014 - [info] Starting ping health check on 192.168.1.200(192.168.1.200:3306)..
Thu Dec 11 21:44:48 2014 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
Note: if you KILL the manager process, the following error appears on restart:
Thu Dec 11 21:47:45 2014 - [info] MHA::MasterMonitor version 0.56.
Thu Dec 11 21:47:45 2014 - [warning] /masterha/app1/app1.master_status.health already exists. You might have killed manager with SIGKILL(-9), may run two or more monitoring process for the same application, or use the same working directory. Check for details, and consider setting --workdir separately.
The cause: after a SIGKILL, the app1.master_status.health file left under /masterha/app1/ is not removed. If you kill the manager, you must also delete this file before restarting.
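The SIGKILL warning above comes down to a stale state-file check: the manager refuses to start while a previous run's health file is still on disk. A sketch of that guard (the file name is the one from the log; the helper and its policy are our illustration, not MHA code):

```python
import os
import tempfile

def can_start_monitor(workdir, app="app1"):
    """Refuse to start if a previous monitor left its health file behind,
    e.g. because it was killed with SIGKILL and never cleaned up."""
    health = os.path.join(workdir, f"{app}.master_status.health")
    return not os.path.exists(health), health

with tempfile.TemporaryDirectory() as wd:
    ok, path = can_start_monitor(wd)
    print(ok)                    # True: fresh workdir, safe to start
    open(path, "w").close()      # simulate a manager killed with -9
    ok, path = can_start_monitor(wd)
    print(ok)                    # False: operator must remove the file first
```

This is why the fix in the text is simply to delete app1.master_status.health before restarting the manager.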
5. Check the MHA Manager status again
[root@Server-D app1]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 monitoring program is now on initialization phase(10:INITIALIZING_MONITOR). Wait for a while and try checking again.
A few seconds after startup:
[root@Server-D app1]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:2624) is running(0:PING_OK), master:192.168.1.200
Important: MHA deployment is now complete. Stop the MHA service for the moment, then move on to deploying LVS + Keepalived.
Aside: quickly deleting large files on Linux (the hard-link method)
Principle: with hard links, multiple directory entries point to the same inode. While the inode's link count N > 1, deleting any one entry is extremely fast, because only a pointer to the inode is removed. When N = 1, the delete must also free all of the file's data blocks, which is what makes it slow.
Test:
root@ # ln stock.ibd stock.id.hdlk
root@ # ls stock.* -l
-rw-rw---- 1 mysql mysql        9196 Apr 14 23:03 stock.frm
-rw-r--r-- 2 mysql mysql 19096666112 Apr 15 09:55 stock.ibd
-rw-r--r-- 2 mysql mysql 19096666112 Apr 15 09:55 stock.id.hdlk
The link count of stock.ibd is now 2. Next, drop the table:
root@127.0.0.1 : test 21:44:37> show table status like 'stock' \G
*************************** 1. row ***************************
           Name: stock
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 49916863
 Avg_row_length: 356
    Data_length: 17799577600
Max_data_length: 0
   Index_length: 1025507328
      Data_free: 4194304
 Auto_increment: NULL
    Create_time: 2011-05-18 14:55:08
    Update_time: NULL
     Check_time: NULL
      Collation: utf8_general_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.23 sec)
root@127.0.0.1 : test 21:39:34> drop table stock ;
Query OK, 0 rows affected (0.99 sec)
The DROP TABLE finished in under a second, so it no longer hangs. The table is gone, but the data file remains, so it still has to be removed afterwards:
root # ll
total 19096666112
-rw-r--r-- 2 mysql mysql 19096666112 Apr 15 09:55 stock.id.hdlk
root # rm stock.id.hdlk
For this final removal of the original large file you can also use rsync, which is much faster than rm.
Error 1
sh: /opt/master_ip_failover: Permission denied
Tue Dec 9 23:14:08 2014 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln226] Failed to get master_ip_failover_script status with return code 126:0.
Tue Dec 9 23:14:08 2014 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations. at /usr/local/bin/masterha_manager line 50
Tue Dec 9 23:14:08 2014 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Cause: the master_ip_failover script lacks execute permission (add it with chmod +x).
Error 2
Sun Mar 2 13:13:46 2014 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln178] Got ERROR: Use of uninitialized value $msg in scalar chomp at /usr/share/perl5/vendor_perl/MHA/ManagerConst.pm line 90.
To fix this error, add one line before line 90 (chomp $msg) of /usr/share/perl5/vendor_perl/MHA/ManagerConst.pm:
$msg = "" unless($msg);
Chapter 5: Installing and deploying LVS + Keepalived
5.1 Deploying LVS
CentOS 5 uses ipvsadm 1.24; CentOS 6 uses ipvsadm 1.26.
ln -s /usr/src/kernels/2.6.18-371.el5-x86_64/ /usr/src/linux
yum install libnl* popt* -y
wget http://www.linuxvirtualserver.org/software/kernel-2.6/ipvsadm-1.26.tar.gz
tar xzf ipvsadm-1.26.tar.gz
cd ipvsadm-1.26
make
echo $?
make install
/sbin/ipvsadm
lsmod | grep ip_vs
5.2 Deploying Keepalived
Keepalived runs on top of LVS, so it must be deployed on the same server as LVS. In this architecture, Keepalived performs the health checks and manages the load-balanced real servers.
yum -y install openssl-devel
wget http://www.keepalived.org/software/keepalived-1.1.19.tar.gz
tar xzf keepalived-1.1.19.tar.gz
cd keepalived-1.1.19
./configure
make
make install
Copy the relevant files into place:
/bin/cp /usr/local/etc/rc.d/init.d/keepalived /etc/init.d/
/bin/cp /usr/local/etc/sysconfig/keepalived /etc/sysconfig/
mkdir -p /etc/keepalived
/bin/cp /usr/local/etc/keepalived/keepalived.conf /etc/keepalived/
/bin/cp /usr/local/sbin/keepalived /usr/sbin/
5.3 Configure the Keepalived configuration file (read load-balancing part)
5.3.1 Configure CMaster and the Slave
1. Run the Rvip script on CMaster.
2. Run the Rvip script on the Slave.
The Rvip script:
[root@Server-B opt]# cat lvs-start-client-read.sh
#!/bin/bash
# real_server.sh
vip=192.168.1.21
open() {
    ifconfig lo:Rvip ${vip}/32 up
    sysctl -w net.ipv4.conf.lo.arp_announce=2
    sysctl -w net.ipv4.conf.lo.arp_ignore=1
    sysctl -w net.ipv4.conf.all.arp_announce=2
    sysctl -w net.ipv4.conf.all.arp_ignore=1
}
close() {
    ifconfig lo:Rvip down
    sysctl -w net.ipv4.conf.lo.arp_announce=0
    sysctl -w net.ipv4.conf.lo.arp_ignore=0
    sysctl -w net.ipv4.conf.all.arp_announce=0
    sysctl -w net.ipv4.conf.all.arp_ignore=0
}
case $1 in
    start)
        open
        ;;
    stop)
        close
        ;;
    *)
        echo "Usage: $0 need argument [start|stop]"
        ;;
esac
3. Edit the Keepalived configuration file
[root@LVS-A opt]# cat /etc/keepalived/keepalived.conf
! Configuration File for keepalived

global_defs {
    notification_email {
        jackbillow@gmail.com
    }
    notification_email_from alert-noreply@test.com.cn
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
    router_id MHA-A
}

# db Read
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150
    advert_int 6
    authentication {
        auth_type PASS
        auth_pass 123qwe
    }
    virtual_ipaddress {
        192.168.1.21/24
    }
}

# VIP 192.168.1.21
virtual_server 192.168.1.21 3306 {
    delay_loop 10
    lb_algo rr
    lb_kind DR
    nat_mask 255.255.255.0
    protocol TCP
    #sorry_server 192.168.1.201 3306
    real_server 192.168.1.201 3306 {
        weight 1
        TCP_CHECK {
            connect_port 3306
            connect_timeout 10
            nb_get_retry 3
            delay_before_retry 5
        }
        MISC_CHECK {
            misc_path "/etc/keepalived/check_slave.pl 192.168.1.201 3306"
            misc_dynamic
        }
    }
    real_server 192.168.1.202 3306 {
        weight 1
        TCP_CHECK {
            connect_port 3306
            connect_timeout 10
            nb_get_retry 3
            delay_before_retry 5
        }
        MISC_CHECK {
            misc_path "/etc/keepalived/check_slave.pl 192.168.1.202 3306"
            misc_dynamic
        }
    }
}
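With lb_algo rr and equal weights, the virtual server simply alternates new connections between the two real servers. A minimal sketch of that selection over the pool defined above:

```python
from itertools import cycle

# The read pool from the virtual_server block (lb_algo rr, weight 1 each).
real_servers = ["192.168.1.201:3306", "192.168.1.202:3306"]
rr = cycle(real_servers)

# The first four "connections" alternate between the two slaves.
picks = [next(rr) for _ in range(4)]
print(picks)
```

Real ipvs round-robin also honors weights and skips servers the health checks have evicted; this sketch only shows the equal-weight rotation.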
4. Configure the check script
Upload it to the /etc/keepalived directory on the LVS server.
cat /etc/keepalived/check_slave.pl
#!/usr/bin/perl -w
use DBI;
use DBD::mysql;

# CONFIG VARIABLES
$SBM  = 200;
$db   = "information_schema";
$host = $ARGV[0];
$port = $ARGV[1];
$user = "repl_user";
$pw   = "testpasswd";

# SQL query
$query = "show slave status";

$dbh = DBI->connect("DBI:mysql:$db:$host:$port", $user, $pw, { RaiseError => 0, PrintError => 0 });
#print "$db, $host , $port , $user, $pw";

if (!defined($dbh)) {
    #print "connect fail.";
    exit 1;
}

$sqlQuery = $dbh->prepare($query);
$sqlQuery->execute;

$Slave_IO_Running      = "";
$Slave_SQL_Running     = "";
$Seconds_Behind_Master = "";

while (my $ref = $sqlQuery->fetchrow_hashref()) {
    $Slave_IO_Running      = $ref->{'Slave_IO_Running'};
    $Slave_SQL_Running     = $ref->{'Slave_SQL_Running'};
    $Seconds_Behind_Master = $ref->{'Seconds_Behind_Master'};
}

#print "Slave_IO_Running = $Slave_IO_Running\n";
#print "Slave_SQL_Running = $Slave_SQL_Running\n";
#print "Seconds_Behind_Master = $Seconds_Behind_Master\n";
$sqlQuery->finish;
$dbh->disconnect();

if ( $Slave_IO_Running eq "No" || $Slave_SQL_Running eq "No" ) {
    #print "Slave_IO_Running or Slave_SQL_Running is No";
    exit 1;
}
else {
    if ( $Seconds_Behind_Master > $SBM ) {
        #print "Seconds_Behind_Master > SBM";
        exit 1;
    }
    else {
        #print "Seconds_Behind_Master < SBM";
        exit 0;
    }
}
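check_slave.pl reduces to a three-field decision on SHOW SLAVE STATUS output. Here is the same logic as a Python sketch; the exit codes are the ones Keepalived's MISC_CHECK interprets, and the 200-second SBM threshold comes from the script (the function name is ours):

```python
def slave_health_exit_code(io_running, sql_running, seconds_behind, sbm=200):
    """0 = keep the real server in the pool, 1 = evict it
    (the exit codes Keepalived's MISC_CHECK expects)."""
    if io_running == "No" or sql_running == "No":
        return 1          # a replication thread has stopped
    if seconds_behind > sbm:
        return 1          # lagging beyond the allowed threshold
    return 0

print(slave_health_exit_code("Yes", "Yes", 3))    # 0: healthy
print(slave_health_exit_code("Yes", "No", 0))     # 1: SQL thread stopped
print(slave_health_exit_code("Yes", "Yes", 500))  # 1: lag over SBM=200
```

Note that, like the Perl original, this returns 0 when SHOW SLAVE STATUS yields no row at all (all three fields stay at their defaults), which is exactly the blind spot the failure tests later run into.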
5. Check the LVS status
watch -n1 ipvsadm -Ln
6. Start the MHA Manager
Start MHA; during startup the write VIP is configured on the Master.
nohup masterha_manager --conf=/etc/mha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /masterha/app1/manager.log 2>&1 &
Check the MHA status:
masterha_check_status --conf=/etc/mha/app1.cnf
7. Check the Master status
8. Test the Master
Connect with a client.
9. Problems observed when simulating a Master failure
Scenario:
1. After a simulated Master failure, MHA completed the failover from Master to CMaster.
2. Keepalived checks the MySQL slave state through the check_slave.pl script, which only examines the SQL thread, the I/O thread, and the replication delay.
Problem: after MHA performs the switch, CMaster's former slave configuration disappears entirely (show slave status\G returns nothing), but Keepalived's check script cannot detect this change of role, so CMaster is never removed from the read load balancer.
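One way to close this gap: also treat a node whose SHOW SLAVE STATUS returns no rows as "no longer a replica" and evict it from the read pool. This is our suggested policy, not part of MHA or Keepalived, and the helper below is an illustration only:

```python
def read_pool_member_ok(slave_status_rows, sbm=200):
    """slave_status_rows: list of dicts, one per SHOW SLAVE STATUS row.
    An empty list means the node is not replicating, e.g. it was just
    promoted to master by MHA, so it must leave the read load balancer."""
    if not slave_status_rows:
        return False                      # promoted node: evict from reads
    row = slave_status_rows[0]
    if row["Slave_IO_Running"] == "No" or row["Slave_SQL_Running"] == "No":
        return False
    return (row["Seconds_Behind_Master"] or 0) <= sbm

print(read_pool_member_ok([]))            # False: the CMaster case above
print(read_pool_member_ok([{"Slave_IO_Running": "Yes",
                            "Slave_SQL_Running": "Yes",
                            "Seconds_Behind_Master": 0}]))  # True
```

Porting this one extra branch back into check_slave.pl (exit 1 when the result set is empty) would let Keepalived drop the promoted CMaster from the read pool automatically.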
Chapter 6: Deploying LVS high availability
6.1 Deploying LVS
CentOS 5 uses ipvsadm 1.24; CentOS 6 uses ipvsadm 1.26.
ln -s /usr/src/kernels/2.6.32-358.el6.x86_64/ /usr/src/linux
yum install libnl* popt* -y
wget http://www.linuxvirtualserver.org/software/kernel-2.6/ipvsadm-1.26.tar.gz
tar xzf ipvsadm-1.26.tar.gz
cd ipvsadm-1.26
make
echo $?
make install
/sbin/ipvsadm
lsmod | grep ip_vs
6.2 Deploying Keepalived
Keepalived runs on top of LVS, so it must be deployed on the same server as LVS. In this architecture, Keepalived performs the health checks and manages the load-balanced real servers.
yum -y install openssl-devel
wget http://www.keepalived.org/software/keepalived-1.1.19.tar.gz
tar xzf keepalived-1.1.19.tar.gz
cd keepalived-1.1.19
./configure
make
make install
Copy the relevant files into place:
/bin/cp /usr/local/etc/rc.d/init.d/keepalived /etc/init.d/
/bin/cp /usr/local/etc/sysconfig/keepalived /etc/sysconfig/
mkdir -p /etc/keepalived
/bin/cp /usr/local/etc/keepalived/keepalived.conf /etc/keepalived/
/bin/cp /usr/local/sbin/keepalived /usr/sbin/
6.3 Configure the backup Keepalived configuration file
[root@LVS-B keepalived]# cat keepalived.conf
! Configuration File for keepalived

global_defs {
    notification_email {
        jackbillow@gmail.com
    }
    notification_email_from alert-noreply@test.com.cn
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
    router_id MHA-B
}

# db Read
vrrp_instance VI_1 {
    state BACKUP
    interface eth2
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 123qwe
    }
    virtual_ipaddress {
        192.168.1.21/24
    }
}

# VIP 192.168.1.21
virtual_server 192.168.1.21 3306 {
    delay_loop 10
    lb_algo rr
    lb_kind DR
    nat_mask 255.255.255.0
    protocol TCP
    #sorry_server 192.168.1.201 3306
    real_server 192.168.1.201 3306 {
        weight 1
        TCP_CHECK {
            connect_port 3306
            connect_timeout 10
            nb_get_retry 3
            delay_before_retry 5
        }
        MISC_CHECK {
            misc_path "/etc/keepalived/check_slave.pl 192.168.1.201 3306"
            misc_dynamic
        }
    }
    real_server 192.168.1.202 3306 {
        weight 1
        TCP_CHECK {
            connect_port 3306
            connect_timeout 10
            nb_get_retry 3
            delay_before_retry 5
        }
        MISC_CHECK {
            misc_path "/etc/keepalived/check_slave.pl 192.168.1.202 3306"
            misc_dynamic
        }
    }
}
6.4 Configure the check script
[root@LVS-B keepalived]# cat check_slave.pl
#!/usr/bin/perl -w
use DBI;
use DBD::mysql;

# CONFIG VARIABLES
$SBM  = 200;
$db   = "information_schema";
$host = $ARGV[0];
$port = $ARGV[1];
$user = "repl";
$pw   = "replpassword";

# SQL query
$query = "show slave status";

$dbh = DBI->connect("DBI:mysql:$db:$host:$port", $user, $pw, { RaiseError => 0, PrintError => 0 });
#print "$db, $host , $port , $user, $pw";

if (!defined($dbh)) {
    #print "connect fail.";
    exit 1;
}

$sqlQuery = $dbh->prepare($query);
$sqlQuery->execute;

$Slave_IO_Running      = "";
$Slave_SQL_Running     = "";
$Seconds_Behind_Master = "";

while (my $ref = $sqlQuery->fetchrow_hashref()) {
    $Slave_IO_Running      = $ref->{'Slave_IO_Running'};
    $Slave_SQL_Running     = $ref->{'Slave_SQL_Running'};
    $Seconds_Behind_Master = $ref->{'Seconds_Behind_Master'};
}

#print "Slave_IO_Running = $Slave_IO_Running\n";
#print "Slave_SQL_Running = $Slave_SQL_Running\n";
#print "Seconds_Behind_Master = $Seconds_Behind_Master\n";
$sqlQuery->finish;
$dbh->disconnect();

if ( $Slave_IO_Running eq "No" || $Slave_SQL_Running eq "No" ) {
    #print "Slave_IO_Running or Slave_SQL_Running is No";
    exit 1;
}
else {
    if ( $Seconds_Behind_Master > $SBM ) {
        #print "Seconds_Behind_Master > SBM";
        exit 1;
    }
    else {
        #print "Seconds_Behind_Master < SBM";
        exit 0;
    }
}
Chapter 7: Simulated failure tests
7.1 Master failure
Because the check_slave.pl script used by Keepalived is incomplete, it only recognizes a stopped SQL thread, a stopped I/O thread, or excessive replication delay. When the Master fails and MHA automatically promotes CMaster to primary, CMaster's "show slave status\G" output becomes empty and Keepalived cannot detect the role change, so CMaster ends up serving both reads and writes. Given the efficiency limits of shell-style scripting, consider rewriting the check in Python or improving the Perl script.
7.1.1 Stop the Master; MHA switches the VIP and applies the differential data
1. Master status
2. CMaster status
3. CMaster status after the database on the Master is stopped
4. MHA log
5. State on the LVS (CMaster 201 is still listed)
7.2 Slave failure
7.2.1 After the Slave's SQL thread or I/O thread is stopped, it is evicted from the cluster
1. Initial state
2. Stop the threads:
stop slave io_thread;
stop slave sql_thread;
3. Check the state
4. Restart the threads:
start slave io_thread;
start slave sql_thread;
5. Check the state
7.3 CMaster failure
If CMaster fails, you should immediately configure the Slave as the candidate master, to avoid the situation where the Master later goes down and no failover happens. Stop MHA and add the candidate_master settings to the Slave's section of the configuration.
7.4 LVS-A failure
7.4.1 After Keepalived is stopped on LVS-A, LVS-B takes over all of LVS-A's work
1. LVS-A status (connections simulated with a management tool)
2. LVS-B status
3. Stop LVS-A:
/etc/init.d/keepalived stop
4. Check the LVS-B status
5. Refresh the management tool's connection (it now connects through to the Slave)
6. Check the LVS-B status
References:
https://blog.csdn.net/wzy0623/article/details/81304654
https://www.cnblogs.com/gomysql/p/3675429.html
Implementing MySQL replication high availability with MHA
I. MHA overview
II. Experimental architecture design
1. Base environment
2. Architecture design
III. MHA installation and configuration
1. Configure master-slave replication
2. Install Perl and other dependency modules
3. Configure passwordless SSH login
4. Install MHA Node
5. Install MHA Manager
6. Configure MHA
7. Create the related scripts
IV. Check the MHA configuration
1. Check the SSH configuration
2. Check the whole replication environment
3. Check the MHA Manager status
4. View the startup log
V. Functional tests
1. Bind the VIP initially
2. Test automatic failover
3. Test manual failover
4. Test online switchover
5. Repair the crashed Master
References
I. MHA overview
MHA (Master High Availability) is currently a relatively mature solution for MySQL high availability. It was developed by youshimaton of the Japanese company DeNA (now at Facebook) as an excellent suite for failover and slave promotion in MySQL high-availability environments. During a MySQL failover, MHA completes the database failover automatically within 0 to 30 seconds, and while switching it preserves data consistency to the greatest extent possible, achieving high availability in the true sense.
The software has two components: MHA Manager (the management node) and MHA Node (the data nodes). MHA Manager can be deployed on a dedicated machine to manage several master-slave clusters, or on one of the slave nodes. MHA Node runs on every MySQL server. MHA Manager polls the cluster's master periodically; when the master fails, it automatically promotes the slave with the most recent data to be the new master and then repoints all other slaves at it. The whole failover process is completely transparent to applications.
During an automatic failover, MHA tries to save the binary logs from the crashed master to minimize data loss, but this is not always possible. For example, if the master's hardware has failed or it is unreachable over SSH, MHA cannot save the binary logs and fails over anyway, losing the most recent data. Using MySQL 5.5's semi-synchronous replication greatly reduces this risk, and MHA can be combined with it: as long as one slave has received the latest binary log events, MHA can apply them to all the other slaves, keeping every node's data consistent.
At present MHA mainly supports a one-master, multi-slave architecture. Building an MHA cluster requires at least three database servers in one replication group: a master, a standby master, and another slave. Because at least three servers are needed, Taobao adapted MHA for cost reasons; their TMHA now supports one master with a single slave. (From "MySQL In Depth, 2nd edition".) At the code level MHA is just a set of Perl scripts, so with Alibaba's engineering resources, changing MHA to support one master and one slave was presumably not difficult.
Figure 1 shows the MHA architecture:
(Figure 1)
MHA's working principle can be summarized as follows:
Save the binary log events (binlog events) from the crashed master;
Identify the slave with the most recent updates;
Apply the differential relay log to the other slaves;
Apply the binlog events saved from the master;
Promote one slave to be the new master;
Point the other slaves at the new master and resume replication.
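The six steps above can be sketched as a tiny simulation: pick the slave with the most applied events, bring the others up to its level, top everyone up from the master's saved binlog, then promote and repoint. Event counts and names below are made up for illustration:

```python
def failover(master_binlog, slaves):
    """slaves: {name: number_of_events_applied}.
    Returns (new_master, final_position, remaining_replicas)."""
    # Steps 1-2: the master's binlog has been saved; find the slave
    # with the most recent updates.
    new_master = max(slaves, key=slaves.get)
    latest = slaves[new_master]
    # Step 3: apply differential relay logs so every slave reaches `latest`.
    synced = {name: latest for name in slaves}
    # Step 4: apply the binlog events saved from the crashed master.
    final = max(latest, master_binlog)
    synced = {name: final for name in synced}
    # Steps 5-6: promote, and repoint the rest at the new master.
    replicas = [n for n in synced if n != new_master]
    return new_master, final, replicas

nm, pos, reps = failover(master_binlog=120,
                         slaves={"slave1": 118, "slave2": 110})
print(nm, pos, reps)   # slave1 120 ['slave2']
```

The real algorithm works on binlog file/position pairs (or GTIDs) rather than counters, but the ordering of the steps is the same.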
Official introduction: https://code.google.com/archive/p/mysql-master-ha/
II. Experimental architecture design
1. Base environment
OS version: CentOS Linux release 7.2.1511 (Core)
MySQL version: 5.6.14
VIP (virtual IP): 172.16.1.100
2. Architecture design
The experimental architecture is shown in Figure 2.
(Figure 2)
hdp1 acts as the MHA Manager; the other three hosts form a MySQL one-master, two-slave replication cluster and act as MHA Nodes.
III. MHA installation and configuration
1. Configure master-slave replication
Configuring MySQL replication is straightforward; see the MySQL documentation for the details, omitted here. For a brand-new setup, enable the binlog on the Master, point the Slave with CHANGE MASTER at the recorded file and position, and run START SLAVE. To add a slave to a database that already exists and is in use, there are two approaches: record the master's file and position with mysqldump's master-data option, which can stall the database, or, better, build the slave online with innobackupex, as follows:
(1) Prerequisites
Install the dependency packages on both the master and the slaves:
yum install perl perl-DBI perl-DBD-MySQL perl-IO-Socket-SSL perl-Time-HiRes
Install percona-xtrabackup on both the master and the slaves.
Set the PATH environment variable, for example:
.:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/home/mysql/mysql-5.6.14/bin:/home/mysql/percona-xtrabackup-2.2.4-Linux-x86_64/bin:/home/mysql/bin
(2) Configure passwordless SSH from the master to the slave
On the master, run as the mysql user:
ssh-keygen
... press Enter through all the prompts ...
ssh-copy-id <slave IP or hostname>
(3) Back up and stream the backup to the slave
For example, on the master run as the mysql user:
innobackupex --user root --password 123456 --defaults-file=/home/mysql/mysql-5.6.14/my.cnf --no-lock --socket=/home/mysql/mysql-5.6.14/mysql.sock --port 3306 --stream=tar ./ | ssh mysql@172.16.1.126 \ "cat - > /home/mysql/backup.tar"
(4) Restore the backup
On the slave, run as the mysql user:
# Extract the backup
tar -ixvf backup.tar -C /home/mysql/mysql-5.6.14/data
# Apply the logs
innobackupex --apply-log /home/mysql/mysql-5.6.14/data/
# Check the binlog file name and position
cat /home/mysql/mysql-5.6.14/data/xtrabackup_binlog_info
# Edit my.cnf
vi /etc/my.cnf
# Start MySQL; the directory layout must match the master's
service mysql start
mysql -uroot -p123456 -P3306 -h127.0.0.1
# Configure replication
reset master;
reset slave all;
change master to
master_host='172.16.1.127',
master_port=3306,
master_user='repl',
master_password='123456',
master_log_file='mysql-bin.000001',
master_log_pos=120;
# master_log_file and master_log_pos take the values found in /home/mysql/mysql-5.6.14/data/xtrabackup_binlog_info.
# Start the slave
start slave;
# Check the slave status
show slave status\G
(5) Follow-up work
Back up my.cnf, scripts, crontab entries, and so on.
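Step (4) above glues the file/position pair from xtrabackup_binlog_info into the CHANGE MASTER statement. A sketch of that glue; the helper is ours, the file format is xtrabackup's tab-separated name/position, and the host, user, and password defaults are the ones used in this walkthrough:

```python
def change_master_sql(binlog_info_line, host="172.16.1.127", port=3306,
                      user="repl", password="123456"):
    """Build a CHANGE MASTER statement from one line of
    xtrabackup_binlog_info (binlog file name, TAB, position)."""
    log_file, log_pos = binlog_info_line.split()[:2]
    return (f"CHANGE MASTER TO master_host='{host}', master_port={port}, "
            f"master_user='{user}', master_password='{password}', "
            f"master_log_file='{log_file}', master_log_pos={int(log_pos)};")

# Example contents of /home/mysql/mysql-5.6.14/data/xtrabackup_binlog_info
print(change_master_sql("mysql-bin.000001\t120"))
```

Scripting this step avoids the classic mistake of copying the position from the wrong backup's info file.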
2. Install Perl and other dependency modules
Run the following as root on all four nodes.
# Install an EPEL yum repository
wget -O /etc/yum.repos.d/epel-7.repo http://mirrors.aliyun.com/repo/epel-7.repo
# Install the dependencies with yum
yum install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes -y
3. Configure passwordless SSH login
On hdp1 172.16.1.124 (Monitor), run as root:
ssh-keygen -t rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.125
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.126
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.127
On hdp4 172.16.1.127 (Master), run as root:
ssh-keygen -t rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.125
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.126
On hdp3 172.16.1.126 (slave1), run as root:
ssh-keygen -t rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.125
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.127
On hdp2 172.16.1.125 (slave2), run as root:
ssh-keygen -t rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.126
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.127
4. Install MHA Node
Download: https://github.com/yoshinorim/mha4mysql-manager/wiki/Downloads
Run the following as root on hdp2, hdp3, and hdp4.
rpm -ivh mha4mysql-node-0.56-0.el6.noarch.rpm
After installation, the following MHA-related tools are present in /usr/bin/:
apply_diff_relay_logs
filter_mysqlbinlog
purge_relay_logs
save_binary_logs
These tools are normally triggered by the MHA Manager's scripts and require no manual operation. Tool descriptions:
apply_diff_relay_logs: identifies differential relay log events and applies them to the other slaves.
filter_mysqlbinlog: strips unnecessary ROLLBACK events (MHA no longer uses this tool).
purge_relay_logs: purges relay logs (without blocking the SQL thread).
save_binary_logs: saves and copies the master's binary logs.
5. Install MHA Manager
Download: https://github.com/yoshinorim/mha4mysql-manager/wiki/Downloads
Run the following as root on hdp1.
rpm -ivh mha4mysql-manager-0.56-0.el6.noarch.rpm
After installation, the following MHA-related tools are present in /usr/bin/:
masterha_check_repl
masterha_check_ssh
masterha_check_status
masterha_conf_host
masterha_manager
masterha_master_monitor
masterha_master_switch
masterha_secondary_check
masterha_stop
apply_diff_relay_logs
filter_mysqlbinlog
purge_relay_logs
save_binary_logs
6. Configure MHA
Run steps (1), (2), and (3) below as root on hdp1.
(1) Create the configuration directory
mkdir -p /etc/masterha
(2) Create the configuration file /etc/masterha/app1.cnf with the following content:
[server default]
manager_log=/var/log/masterha/app1/manager.log
manager_workdir=/var/log/masterha/app1
master_binlog_dir=/data
master_ip_failover_script=/usr/bin/master_ip_failover
master_ip_online_change_script=/usr/bin/master_ip_online_change
password=123456
ping_interval=1
remote_workdir=/tmp
repl_password=123456
repl_user=repl
secondary_check_script=/usr/bin/masterha_secondary_check -s hdp4 -s hdp3 --user=root --master_host=hdp4 --master_ip=172.16.1.127 --master_port=3306
shutdown_script=""
ssh_user=root
user=root
[server1]
hostname=172.16.1.127
port=3306
[server2]
candidate_master=1
check_repl_delay=0
hostname=172.16.1.126
port=3306
[server3]
hostname=172.16.1.125
port=3306
[server default]段是manager的一些基本配置参数;[server1]、[server2]、[server3]分别对应复制中的master、第一个slave、第二个slave。该文件的语法要求严格,变量值后不要有多余的空格。主要配置项说明如下。
manager_log:设置manager的日志文件。
manager_workdir:设置manager的工作目录。
master_binlog_dir:设置master保存binlog的位置,以便MHA可以找到master的日志,这里也就是mysql的数据目录。
master_ip_failover_script:设置自动failover时候的切换脚本。
master_ip_online_change_script:设置手动切换时候的切换脚本。
password:设置mysql中root用户的密码。
ping_interval:设置监控主库时发送ping包的时间间隔,默认是3秒,尝试三次没有回应时自动进行failover。
remote_workdir:设置远端mysql在发生切换时binlog的保存位置。
repl_password:设置复制用户的密码。
repl_user:设置复制环境中的复制用户名。
secondary_check_script:当MHA Manager到master(hdp4)的直接监控出现问题时,将尝试通过-s参数指定的主机再次确认hdp4的状态,以避免网络抖动造成误判。
shutdown_script:设置故障发生后关闭故障主机的脚本。该脚本的主要作用是关闭主机以防止发生脑裂,这里没有使用。
ssh_user:设置ssh的登录用户名。
user:设置监控用户为root。
candidate_master:设置为候选master。设置该参数以后,发生主从切换时会将此从库提升为主库,即使这个从库不是集群中数据最新的slave。
check_repl_delay:默认情况下,如果一个slave落后master 100MB以上的relay logs,MHA将不会选择该slave作为新的master,因为恢复这个slave需要花费很长时间。设置check_repl_delay=0后,MHA在选择新master时将忽略复制延时。该参数对设置了candidate_master=1的主机非常有用,可以保证这个候选主库在切换过程中一定会成为新的master。
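由于该配置文件语法要求严格、变量值后不能有多余空格,编辑完成后可以先做一次简单检查。下面是一个示意脚本(check_trailing_space为本文虚构的辅助函数):

```shell
#!/bin/bash
# 示意脚本:检查配置文件中是否存在行尾多余空格
# 有问题时打印出对应行号和内容并返回1,干净时返回0
check_trailing_space() {
  grep -nE '[[:space:]]+$' "$1" && return 1 || return 0
}
# 用法示例:构造一个带问题的临时配置文件
cfg=$(mktemp)
printf '[server default]\nuser=root \nssh_user=root\n' > "$cfg"   # 第2行故意带行尾空格
check_trailing_space "$cfg" || echo "发现行尾多余空格,请修正"
rm -f "$cfg"
```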
(3)建立软连接
ln -s /home/mysql/mysql-5.6.14/bin/mysqlbinlog /usr/bin/mysqlbinlog
ln -s /home/mysql/mysql-5.6.14/bin/mysql /usr/bin/mysql
(4)设置复制中Slave的relay_log_purge参数
在hdp3和hdp2上用mysql用户执行:
mysql -uroot -p123456 -e "set global relay_log_purge=0"
注意:MHA在切换过程中,从库的恢复依赖于relay log的相关信息,所以要将relay log的自动清除设置为OFF,改用手动清除的方式。默认情况下,从服务器上的中继日志会在SQL线程执行完毕后被自动删除,但在MHA环境中,这些中继日志在恢复其他从服务器时可能会被用到,因此需要禁用其自动删除功能。定期清除中继日志时还要考虑复制延时的问题:在ext3文件系统下,删除大文件需要一定的时间,会导致严重的复制延时。为了避免这一点,可以先为中继日志创建硬链接,因为在linux系统中通过硬链接删除大文件速度很快(在mysql数据库中删除大表时,通常也采用建立硬链接的方式)。
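硬链接删除大文件的思路可以用下面的示意脚本演示(simulate_hardlink_purge为本文虚构的函数,文件名为演示用的假设值)。删除原文件名的开销很小,数据块要等硬链接也被删除后才真正释放:

```shell
#!/bin/bash
# 示意脚本:模拟"先建硬链接再删除原文件名"的处理方式
simulate_hardlink_purge() {
  local d=$1
  dd if=/dev/zero of="$d/relay-bin.000001" bs=1024 count=16 2>/dev/null  # 模拟一个relay log
  ln "$d/relay-bin.000001" "$d/relay-bin.000001.hardlink"   # 在同一文件系统内建硬链接
  rm "$d/relay-bin.000001"   # 删除原文件名的代价很小,数据块仍被硬链接引用
}
# 用法示例
d=$(mktemp -d)
simulate_hardlink_purge "$d"
ls -l "$d"                  # 此时只剩硬链接文件,低峰期再删除它即可真正回收空间
rm -rf "$d"
```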
7. 创建相关脚本
(1)创建定期清理relay脚本
在hdp3、hdp2两台slave上建立/root/purge_relay_log.sh 文件,内容如下:
#!/bin/bash
. /home/mysql/.bashrc
user=root
passwd=123456
port=3306
log_dir='/data'
work_dir='/data'
purge='/usr/bin/purge_relay_logs'
if [ ! -d $log_dir ]
then
mkdir $log_dir -p
fi
$purge --user=$user --password=$passwd --disable_relay_log_purge --port=$port --workdir=$work_dir >> $log_dir/purge_relay_logs.log 2>&1
purge_relay_logs的参数说明:
--user:MySQL用户名。
--password:MySQL用户密码。
--port:MySQL端口号。
--workdir:指定创建relay log硬链接的位置,默认是/var/tmp。由于跨文件系统创建硬链接会失败,故需要指定硬链接的具体位置;脚本成功执行后,硬链接的中继日志文件即被删除。
--disable_relay_log_purge:默认情况下,如果relay_log_purge=1,脚本会什么都不清理,自动退出。设定这个参数后,脚本会临时将relay_log_purge设置为1并执行FLUSH LOGS以清理relay log,清理完成后再将该参数设置回0(OFF)。
将脚本设置为可执行:
chmod 755 purge_relay_log.sh
手工执行/root/purge_relay_log.sh,在控制台输出:
2018-07-31 12:45:20: purge_relay_logs script started.
Found relay_log.info: /data/relay-log.info
Opening /data/hdp2-relay-bin.000001 ..
Opening /data/hdp2-relay-bin.000002 ..
Executing SET GLOBAL relay_log_purge=1; FLUSH LOGS; sleeping a few seconds so that SQL thread can delete older relay log
files (if it keeps up); SET GLOBAL relay_log_purge=0; .. ok.
2018-07-31 12:45:23: All relay log purging operations succeeded.
添加到crontab中:
0 4 * * * /bin/bash /root/purge_relay_log.sh
(2)创建自动failover脚本
在hdp1上创建/usr/bin/master_ip_failover文件,内容如下:
#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use Getopt::Long;
my (
$command, $ssh_user, $orig_master_host, $orig_master_ip,
$orig_master_port, $new_master_host, $new_master_ip, $new_master_port
);
my $vip = '172.16.1.100'; # Virtual IP
my $key = "1";
my $ssh_start_vip = "/sbin/ifconfig ens32:$key $vip";   # 新主库hdp3的网卡为ens32
my $ssh_stop_vip = "/sbin/ifconfig ens160:$key down";   # 旧主库hdp4的网卡为ens160
GetOptions(
'command=s' => \$command,
'ssh_user=s' => \$ssh_user,
'orig_master_host=s' => \$orig_master_host,
'orig_master_ip=s' => \$orig_master_ip,
'orig_master_port=i' => \$orig_master_port,
'new_master_host=s' => \$new_master_host,
'new_master_ip=s' => \$new_master_ip,
'new_master_port=i' => \$new_master_port,
);
exit &main();
sub main {
print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";
if ( $command eq "stop" || $command eq "stopssh" ) {
# $orig_master_host, $orig_master_ip, $orig_master_port are passed.
# If you manage master ip address at global catalog database,
# invalidate orig_master_ip here.
my $exit_code = 1;
eval {
print "Disabling the VIP on old master: $orig_master_host \n";
&stop_vip();
$exit_code = 0;
};
if ($@) {
warn "Got Error: $@\n";
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "start" ) {
# all arguments are passed.
# If you manage master ip address at global catalog database,
# activate new_master_ip here.
# You can also grant write access (create user, set read_only=0, etc) here.
my $exit_code = 10;
eval {
print "Enabling the VIP - $vip on the new master - $new_master_host \n";
&start_vip();
$exit_code = 0;
};
if ($@) {
warn $@;
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "status" ) {
print "Checking the Status of the script.. OK \n";
`ssh $ssh_user\@$orig_master_host \" $ssh_start_vip \"`;
exit 0;
}
else {
&usage();
exit 1;
}
}
# A simple system call that enable the VIP on the new master
sub start_vip() {
`ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
# A simple system call that disable the VIP on the old_master
sub stop_vip() {
`ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}
sub usage {
  print
"Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}
注意脚本中VIP漂移的部分。
(3)创建手动failover脚本
在hdp1上创建/usr/bin/master_ip_online_change文件,内容如下:
#!/usr/bin/env perl
## Note: This is a sample script and is not complete. Modify the script based on your environment.
use strict;
use warnings FATAL => 'all';
use Getopt::Long;
use MHA::DBHelper;
use MHA::NodeUtil;
use Time::HiRes qw(sleep gettimeofday tv_interval);
use Data::Dumper;
my $_tstart;
my $_running_interval = 0.1;
my (
$command, $orig_master_host, $orig_master_ip,
$orig_master_port, $orig_master_user,
$new_master_host, $new_master_ip, $new_master_port,
$new_master_user,
);
my $vip = '172.16.1.100'; # Virtual IP
my $key = "1";
my $ssh_start_vip = "/sbin/ifconfig ens32:$key $vip";   # 新主库hdp3的网卡为ens32
my $ssh_stop_vip = "/sbin/ifconfig ens160:$key down";   # 旧主库hdp4的网卡为ens160
my $ssh_user = "root";
my $new_master_password = "123456";
my $orig_master_password = "123456";
GetOptions(
'command=s' =>\$command,
#'ssh_user=s' => \$ssh_user,
'orig_master_host=s' =>\$orig_master_host,
'orig_master_ip=s' =>\$orig_master_ip,
'orig_master_port=i' =>\$orig_master_port,
'orig_master_user=s' =>\$orig_master_user,
#'orig_master_password=s' => \$orig_master_password,
'new_master_host=s' =>\$new_master_host,
'new_master_ip=s' =>\$new_master_ip,
'new_master_port=i' =>\$new_master_port,
'new_master_user=s' =>\$new_master_user,
#'new_master_password=s' =>\$new_master_password,
);
exit &main();
sub current_time_us {
  my ( $sec, $microsec ) = gettimeofday();
  my $curdate = localtime($sec);
  return $curdate . " " . sprintf( "%06d", $microsec );
}
sub sleep_until {
  my $elapsed = tv_interval($_tstart);
  if ( $_running_interval > $elapsed ) {
    sleep( $_running_interval - $elapsed );
  }
}
sub get_threads_util {
  my $dbh                    = shift;
  my $my_connection_id       = shift;
  my $running_time_threshold = shift;
  my $type                   = shift;
  $running_time_threshold = 0 unless ($running_time_threshold);
  $type                   = 0 unless ($type);
  my @threads;
  my $sth = $dbh->prepare("SHOW PROCESSLIST");
  $sth->execute();
  while ( my $ref = $sth->fetchrow_hashref() ) {
    my $id         = $ref->{Id};
    my $user       = $ref->{User};
    my $host       = $ref->{Host};
    my $command    = $ref->{Command};
    my $state      = $ref->{State};
    my $query_time = $ref->{Time};
    my $info       = $ref->{Info};
    $info =~ s/^\s*(.*?)\s*$/$1/ if defined($info);
    next if ( $my_connection_id == $id );
    next if ( defined($query_time) && $query_time < $running_time_threshold );
    next if ( defined($command)    && $command eq "Binlog Dump" );
    next if ( defined($user)       && $user eq "system user" );
    next
      if ( defined($command)
      && $command eq "Sleep"
      && defined($query_time)
      && $query_time >= 1 );
    if ( $type >= 1 ) {
      next if ( defined($command) && $command eq "Sleep" );
      next if ( defined($command) && $command eq "Connect" );
    }
    if ( $type >= 2 ) {
      next if ( defined($info) && $info =~ m/^select/i );
      next if ( defined($info) && $info =~ m/^show/i );
    }
    push @threads, $ref;
  }
  return @threads;
}
sub main {
  if ( $command eq "stop" ) {
    ## Gracefully killing connections on the current master
    # 1. Set read_only = 1 on the new master
    # 2. DROP USER so that no app user can establish new connections
    # 3. Set read_only = 1 on the current master
    # 4. Kill current queries
    # * Any database access failure will result in script die.
    my $exit_code = 1;
    eval {
      ## Setting read_only=1 on the new master (to avoid accident)
      my $new_master_handler = new MHA::DBHelper();
      # args: hostname, port, user, password, raise_error(die_on_error)_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );
      print current_time_us() . " Set read_only on the new master..";
      $new_master_handler->enable_read_only();
      if ( $new_master_handler->is_read_only() ) {
        print "ok.\n";
      }
      else {
        die "Failed!\n";
      }
      $new_master_handler->disconnect();
      # Connecting to the orig master, die if any database error happens
      my $orig_master_handler = new MHA::DBHelper();
      $orig_master_handler->connect( $orig_master_ip, $orig_master_port,
        $orig_master_user, $orig_master_password, 1 );
      ## Drop application user so that nobody can connect. Disabling per-session binlog beforehand
      #$orig_master_handler->disable_log_bin_local();
      #print current_time_us() . " Dropping app user on the orig master..\n";
      #FIXME_xxx_drop_app_user($orig_master_handler);
      ## Waiting for N * 100 milliseconds so that current connections can exit
      my $time_until_read_only = 15;
      $_tstart = [gettimeofday];
      my @threads = get_threads_util( $orig_master_handler->{dbh},
        $orig_master_handler->{connection_id} );
      while ( $time_until_read_only > 0 && $#threads >= 0 ) {
        if ( $time_until_read_only % 5 == 0 ) {
          printf
"%s Waiting all running %d threads are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_read_only * 100;
          if ( $#threads < 5 ) {
            print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
              foreach (@threads);
          }
        }
        sleep_until();
        $_tstart = [gettimeofday];
        $time_until_read_only--;
        @threads = get_threads_util( $orig_master_handler->{dbh},
          $orig_master_handler->{connection_id} );
      }
      ## Setting read_only=1 on the current master so that nobody (except SUPER) can write
      print current_time_us() . " Set read_only=1 on the orig master..";
      $orig_master_handler->enable_read_only();
      if ( $orig_master_handler->is_read_only() ) {
        print "ok.\n";
      }
      else {
        die "Failed!\n";
      }
      ## Waiting for M * 100 milliseconds so that current update queries can complete
      my $time_until_kill_threads = 5;
      @threads = get_threads_util( $orig_master_handler->{dbh},
        $orig_master_handler->{connection_id} );
      while ( $time_until_kill_threads > 0 && $#threads >= 0 ) {
        if ( $time_until_kill_threads % 5 == 0 ) {
          printf
"%s Waiting all running %d queries are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_kill_threads * 100;
          if ( $#threads < 5 ) {
            print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
              foreach (@threads);
          }
        }
        sleep_until();
        $_tstart = [gettimeofday];
        $time_until_kill_threads--;
        @threads = get_threads_util( $orig_master_handler->{dbh},
          $orig_master_handler->{connection_id} );
      }
      print "Disabling the VIP on old master: $orig_master_host \n";
      &stop_vip();
      ## Terminating all threads
      print current_time_us() . " Killing all application threads..\n";
      $orig_master_handler->kill_threads(@threads) if ( $#threads >= 0 );
      print current_time_us() . " done.\n";
      #$orig_master_handler->enable_log_bin_local();
      $orig_master_handler->disconnect();
      ## After finishing the script, MHA executes FLUSH TABLES WITH READ LOCK
      $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "start" ) {
    ## Activating master ip on the new master
    # 1. Create app user with write privileges
    # 2. Moving backup script if needed
    # 3. Register new master's ip to the catalog database
    # We don't return error even though activating updatable accounts/ip failed so that we don't interrupt slaves' recovery.
    # If exit code is 0 or 10, MHA does not abort
    my $exit_code = 10;
    eval {
      my $new_master_handler = new MHA::DBHelper();
      # args: hostname, port, user, password, raise_error_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );
      ## Set read_only=0 on the new master
      #$new_master_handler->disable_log_bin_local();
      print current_time_us() . " Set read_only=0 on the new master.\n";
      $new_master_handler->disable_read_only();
      ## Creating an app user on the new master
      #print current_time_us() . " Creating app user on the new master..\n";
      #FIXME_xxx_create_app_user($new_master_handler);
      #$new_master_handler->enable_log_bin_local();
      $new_master_handler->disconnect();
      ## Update master ip on the catalog database, etc
      print "Enabling the VIP - $vip on the new master - $new_master_host \n";
      &start_vip();
      $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "status" ) {
    # do nothing
    exit 0;
  }
  else {
    &usage();
    exit 1;
  }
}
# A simple system call that enables the VIP on the new master
sub start_vip() {
  `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
# A simple system call that disables the VIP on the old master
sub stop_vip() {
  `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}
sub usage {
  print
"Usage: master_ip_online_change --command=start|stop|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
  die;
}
注意脚本中VIP漂移的部分。
四、检查MHA配置
1. 检查SSH配置
在hdp1上用root用户操作。
[root@hdp1~]#masterha_check_ssh --conf=/etc/masterha/app1.cnf
Tue Jul 31 12:50:22 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Jul 31 12:50:22 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Tue Jul 31 12:50:22 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Tue Jul 31 12:50:22 2018 - [info] Starting SSH connection tests..
Tue Jul 31 12:50:23 2018 - [debug]
Tue Jul 31 12:50:22 2018 - [debug] Connecting via SSH from root@172.16.1.127(172.16.1.127:22) to root@172.16.1.126(172.16.1.126:22)..
Tue Jul 31 12:50:22 2018 - [debug] ok.
Tue Jul 31 12:50:22 2018 - [debug] Connecting via SSH from root@172.16.1.127(172.16.1.127:22) to root@172.16.1.125(172.16.1.125:22)..
Tue Jul 31 12:50:23 2018 - [debug] ok.
Tue Jul 31 12:50:24 2018 - [debug]
Tue Jul 31 12:50:23 2018 - [debug] Connecting via SSH from root@172.16.1.126(172.16.1.126:22) to root@172.16.1.127(172.16.1.127:22)..
Tue Jul 31 12:50:23 2018 - [debug] ok.
Tue Jul 31 12:50:23 2018 - [debug] Connecting via SSH from root@172.16.1.126(172.16.1.126:22) to root@172.16.1.125(172.16.1.125:22)..
Tue Jul 31 12:50:23 2018 - [debug] ok.
Tue Jul 31 12:50:25 2018 - [debug]
Tue Jul 31 12:50:23 2018 - [debug] Connecting via SSH from root@172.16.1.125(172.16.1.125:22) to root@172.16.1.127(172.16.1.127:22)..
Tue Jul 31 12:50:23 2018 - [debug] ok.
Tue Jul 31 12:50:23 2018 - [debug] Connecting via SSH from root@172.16.1.125(172.16.1.125:22) to root@172.16.1.126(172.16.1.126:22)..
Tue Jul 31 12:50:24 2018 - [debug] ok.
Tue Jul 31 12:50:25 2018 - [info] All SSH connection tests passed successfully.
[root@hdp1~]#
可以看到各个节点ssh验证都是ok的。
2. 检查整个复制环境状况
在hdp1上用root用户操作。
[root@hdp1~]#masterha_check_repl --conf=/etc/masterha/app1.cnf
Tue Jul 31 12:52:19 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Jul 31 12:52:19 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Tue Jul 31 12:52:19 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Tue Jul 31 12:52:19 2018 - [info] MHA::MasterMonitor version 0.56.
Tue Jul 31 12:52:21 2018 - [info] GTID failover mode = 0
Tue Jul 31 12:52:21 2018 - [info] Dead Servers:
Tue Jul 31 12:52:21 2018 - [info] Alive Servers:
Tue Jul 31 12:52:21 2018 - [info] 172.16.1.127(172.16.1.127:3306)
Tue Jul 31 12:52:21 2018 - [info] 172.16.1.126(172.16.1.126:3306)
Tue Jul 31 12:52:21 2018 - [info] 172.16.1.125(172.16.1.125:3306)
Tue Jul 31 12:52:21 2018 - [info] Alive Slaves:
Tue Jul 31 12:52:21 2018 - [info] 172.16.1.126(172.16.1.126:3306) Version=5.6.14-log (oldest major version between slaves) log-bin:enabled
Tue Jul 31 12:52:21 2018 - [info] Replicating from 172.16.1.127(172.16.1.127:3306)
Tue Jul 31 12:52:21 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Jul 31 12:52:21 2018 - [info] 172.16.1.125(172.16.1.125:3306) Version=5.6.14-log (oldest major version between slaves) log-bin:enabled
Tue Jul 31 12:52:21 2018 - [info] Replicating from 172.16.1.127(172.16.1.127:3306)
Tue Jul 31 12:52:21 2018 - [info] Current Alive Master: 172.16.1.127(172.16.1.127:3306)
Tue Jul 31 12:52:21 2018 - [info] Checking slave configurations..
Tue Jul 31 12:52:21 2018 - [info] read_only=1 is not set on slave 172.16.1.126(172.16.1.126:3306).
Tue Jul 31 12:52:21 2018 - [info] Checking replication filtering settings..
Tue Jul 31 12:52:21 2018 - [info] binlog_do_db= , binlog_ignore_db=
Tue Jul 31 12:52:21 2018 - [info] Replication filtering check ok.
Tue Jul 31 12:52:21 2018 - [info] GTID (with auto-pos) is not supported
Tue Jul 31 12:52:21 2018 - [info] Starting SSH connection tests..
Tue Jul 31 12:52:23 2018 - [info] All SSH connection tests passed successfully.
Tue Jul 31 12:52:23 2018 - [info] Checking MHA Node version..
Tue Jul 31 12:52:24 2018 - [info] Version check ok.
Tue Jul 31 12:52:24 2018 - [info] Checking SSH publickey authentication settings on the current master..
Tue Jul 31 12:52:24 2018 - [info] HealthCheck: SSH to 172.16.1.127 is reachable.
Tue Jul 31 12:52:24 2018 - [info] Master MHA Node version is 0.56.
Tue Jul 31 12:52:24 2018 - [info] Checking recovery script configurations on 172.16.1.127(172.16.1.127:3306)..
Tue Jul 31 12:52:24 2018 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data --output_file=/tmp/save_binary_logs_test --manager_version=0.56 --start_file=mysql-bin.000001
Tue Jul 31 12:52:24 2018 - [info] Connecting to root@172.16.1.127(172.16.1.127:22)..
Creating /tmp if not exists.. ok.
Checking output directory is accessible or not..
ok.
Binlog found at /data, up to mysql-bin.000001
Tue Jul 31 12:52:25 2018 - [info] Binlog setting check done.
Tue Jul 31 12:52:25 2018 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Tue Jul 31 12:52:25 2018 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=172.16.1.126 --slave_ip=172.16.1.126 --slave_port=3306 --workdir=/tmp --target_version=5.6.14-log --manager_version=0.56 --relay_log_info=/data/relay-log.info --relay_dir=/data/ --slave_pass=xxx
Tue Jul 31 12:52:25 2018 - [info] Connecting to root@172.16.1.126(172.16.1.126:22)..
Checking slave recovery environment settings..
Opening /data/relay-log.info ... ok.
Relay log found at /data, up to hdp3-relay-bin.000003
Temporary relay log file is /data/hdp3-relay-bin.000003
Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Tue Jul 31 12:52:25 2018 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=172.16.1.125 --slave_ip=172.16.1.125 --slave_port=3306 --workdir=/tmp --target_version=5.6.14-log --manager_version=0.56 --relay_log_info=/data/relay-log.info --relay_dir=/data/ --slave_pass=xxx
Tue Jul 31 12:52:25 2018 - [info] Connecting to root@172.16.1.125(172.16.1.125:22)..
Checking slave recovery environment settings..
Opening /data/relay-log.info ... ok.
Relay log found at /data, up to hdp2-relay-bin.000003
Temporary relay log file is /data/hdp2-relay-bin.000003
Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Tue Jul 31 12:52:25 2018 - [info] Slaves settings check done.
Tue Jul 31 12:52:25 2018 - [info]
172.16.1.127(172.16.1.127:3306) (current master)
+--172.16.1.126(172.16.1.126:3306)
+--172.16.1.125(172.16.1.125:3306)
Tue Jul 31 12:52:25 2018 - [info] Checking replication health on 172.16.1.126..
Tue Jul 31 12:52:25 2018 - [info] ok.
Tue Jul 31 12:52:25 2018 - [info] Checking replication health on 172.16.1.125..
Tue Jul 31 12:52:25 2018 - [info] ok.
Tue Jul 31 12:52:25 2018 - [info] Checking master_ip_failover_script status:
Tue Jul 31 12:52:25 2018 - [info] /usr/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=172.16.1.127 --orig_master_ip=172.16.1.127 --orig_master_port=3306
IN SCRIPT TEST====/sbin/ifconfig ens160:1 down==/sbin/ifconfig ens32:1 172.16.1.100===
Checking the Status of the script.. OK
SIOCSIFADDR: No such device
ens32:1: ERROR while getting interface flags: No such device
Tue Jul 31 12:52:25 2018 - [info] OK.
Tue Jul 31 12:52:25 2018 - [warning] shutdown_script is not defined.
Tue Jul 31 12:52:25 2018 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
没有明显报错,只有几个警告而已,复制显示正常。
3. 检查MHA Manager的状态
在hdp1上用root用户操作。
[root@hdp1~]#masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
[root@hdp1~]#
显示"NOT_RUNNING",这代表MHA监控没有开启。执行下面的命令后台启动MHA。
mkdir -p /var/log/masterha/app1/
nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &
启动参数说明:
remove_dead_master_conf:该参数代表当发生主从切换后,老的主库的ip将会从配置文件中移除。
manager_log:设置manager日志的存放位置。
ignore_last_failover:在缺省情况下,如果MHA检测到连续发生宕机,且两次宕机间隔不足8小时,则不会进行failover,这样限制是为了避免ping-pong效应。该参数代表忽略上次MHA触发切换产生的文件。默认情况下,MHA发生切换后会在manager_workdir目录下产生app1.failover.complete文件,下次再次切换时如果发现该目录下存在该文件,将不允许触发切换,除非在第一次切换后手动删除该文件。为了方便,这里设置为--ignore_last_failover。
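如果不想使用--ignore_last_failover,也可以在重启MHA Manager前手动清理上次切换留下的标记文件。下面是一个示意脚本(clean_failover_flag为本文虚构的辅助函数,workdir路径取决于app1.cnf中manager_workdir的实际设置):

```shell
#!/bin/bash
# 示意脚本:清理上次failover产生的 <app>.failover.complete 标记文件
clean_failover_flag() {
  local workdir=$1 app=$2
  local flag="${workdir}/${app}.failover.complete"
  if [ -f "$flag" ]; then
    rm -f "$flag" && echo "removed: $flag"
  else
    echo "no flag file: $flag"
  fi
}
# 用法示例(路径为本文环境的假设值)
clean_failover_flag /var/log/masterha/app1 app1
```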
再次检查MHA Manager的状态:
[root@hdp1~]#masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:298237) is running(0:PING_OK), master:172.16.1.127
[root@hdp1~]#
可以看见已经在监控了,而且master的主机为172.16.1.127。
4. 查看启动日志
在hdp1上用root用户操作。
[root@hdp1~]#tail -n20 /var/log/masterha/app1/manager.log
Tue Jul 31 12:57:06 2018 - [info]
172.16.1.127(172.16.1.127:3306) (current master)
+--172.16.1.126(172.16.1.126:3306)
+--172.16.1.125(172.16.1.125:3306)
Tue Jul 31 12:57:06 2018 - [info] Checking master_ip_failover_script status:
Tue Jul 31 12:57:06 2018 - [info] /usr/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=172.16.1.127 --orig_master_ip=172.16.1.127 --orig_master_port=3306
IN SCRIPT TEST====/sbin/ifconfig ens160:1 down==/sbin/ifconfig ens32:1 172.16.1.100===
Checking the Status of the script.. OK
SIOCSIFADDR: No such device
ens32:1: ERROR while getting interface flags: No such device
Tue Jul 31 12:57:06 2018 - [info] OK.
Tue Jul 31 12:57:06 2018 - [warning] shutdown_script is not defined.
Tue Jul 31 12:57:06 2018 - [info] Set master ping interval 1 seconds.
Tue Jul 31 12:57:06 2018 - [info] Set secondary check script: /usr/bin/masterha_secondary_check -s hdp4 -s hdp3 --user=root --master_host=hdp4 --master_ip=172.16.1.127 --master_port=3306
Tue Jul 31 12:57:06 2018 - [info] Starting ping health check on 172.16.1.127(172.16.1.127:3306)..
Tue Jul 31 12:57:06 2018 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
[root@hdp1~]#
五、功能测试
1. 初始绑定VIP
在hdp4 172.16.1.127(master)上用root用户执行:
/sbin/ifconfig ens160:1 172.16.1.100/24
查看VIP:
[root@hdp4~]#ip a
1: lo:
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens160:
link/ether 00:50:56:a5:49:7f brd ff:ff:ff:ff:ff:ff
inet 172.16.1.127/24 brd 172.16.1.255 scope global ens160
valid_lft forever preferred_lft forever
inet 172.16.1.100/16 brd 172.16.255.255 scope global ens160:1
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:fea5:497f/64 scope link
valid_lft forever preferred_lft forever
[root@hdp4~]#
2. 测试自动切换
(1)在slave1库(172.16.1.126)上停掉slave IO线程,模拟主从延时:
mysql -uroot -p123456 -e "stop slave io_thread;"
(2)在master库(172.16.1.127)上安装sysbench并生成测试数据:在sbtest库下生成10张sbtest表,每表10000行,共10万条记录。
# 用root用户安装sysbench
yum install sysbench -y
# 用mysql用户建立sbtest 数据库
mysql -uroot -p123456 -e "create database sbtest;"
# 用mysql用户执行sysbench生成数据
sysbench /usr/share/sysbench/tests/include/oltp_legacy/oltp.lua --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-user=root --mysql-password=123456 --oltp-test-mode=complex --oltp-tables-count=10 --oltp-table-size=10000 --threads=10 --time=120 --report-interval=10 --db-driver=mysql prepare
(3)用root用户停止master的mysql服务。
service mysql stop
(4)验证VIP漂移。
在hdp3上用root用户操作。
[root@hdp3~]#ip a
1: lo:
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens32:
link/ether 00:50:56:a5:0f:77 brd ff:ff:ff:ff:ff:ff
inet 172.16.1.126/24 brd 172.16.1.255 scope global ens32
valid_lft forever preferred_lft forever
inet 172.16.1.100/16 brd 172.16.255.255 scope global ens32:1
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:fea5:f77/64 scope link
valid_lft forever preferred_lft forever
[root@hdp3~]#
在hdp4上用root用户操作。
[root@hdp4~]#ip a
1: lo:
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens160:
link/ether 00:50:56:a5:49:7f brd ff:ff:ff:ff:ff:ff
inet 172.16.1.127/24 brd 172.16.1.255 scope global ens160
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:fea5:497f/64 scope link
valid_lft forever preferred_lft forever
[root@hdp4~]#
可以看到VIP已经从hdp4 172.16.1.127(master)漂移到了hdp3 172.16.1.126(slave1)。
(5)客户端用VIP访问数据库
C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.100 -e "show databases; use sbtest; show tables; select count(*) from sbtest1; select count(*) from sbtest10;"
mysql: [Warning] Using a password on the command line interface can be insecure.
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| sbtest |
| source |
| test |
+--------------------+
+------------------+
| Tables_in_sbtest |
+------------------+
| sbtest1 |
| sbtest10 |
| sbtest2 |
| sbtest3 |
| sbtest4 |
| sbtest5 |
| sbtest6 |
| sbtest7 |
| sbtest8 |
| sbtest9 |
+------------------+
+----------+
| count(*) |
+----------+
| 10000 |
+----------+
+----------+
| count(*) |
+----------+
| 10000 |
+----------+
C:\WINDOWS\system32>
在还没创建sbtest库的时候,172.16.1.126就已经停掉了slave的IO线程。在新的master 172.16.1.126上查看数据,可以看到落后的数据也同步过来了,数据没有丢失。
(6)查看复制的主从切换
C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.125 -e "show slave status\G"
mysql: [Warning] Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.16.1.126
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 19093607
Relay_Log_File: hdp2-relay-bin.000002
Relay_Log_Pos: 283
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 19093607
Relay_Log_Space: 455
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 126
Master_UUID: fadd5b7d-7d9f-11e8-90b4-13ccc7802b56
Master_Info_File: /data/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.126 -e "show slave status\G"
mysql: [Warning] Using a password on the command line interface can be insecure.
C:\WINDOWS\system32>
可以看到,172.16.1.126成为新的master,而172.16.1.125也指向了这个新的master。
(7)检查MHA Manager的状态
在hdp1上用root用户执行下面的操作。
[root@hdp1~]#masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
[1]+ Done nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1
[root@hdp1~]#
发现在执行了一次自动failover后,MHA Manager进程停止了。官方对这种情况的建议是:借助daemontools之类的进程管理工具,结合脚本对masterha_manager进程进行守护,使其在退出后能够自动重启。
3. 测试手工切换
首先要还原环境。
还原数据库复制:
-- 在hdp4、hdp3、hdp2上重置master、slave
stop slave;
drop database sbtest;
reset master;
reset slave all;
-- 在hdp3、hdp2上重新指向hdp4为master
change master to
master_host='172.16.1.127',
master_port=3306,
master_user='repl',
master_password='123456',
master_log_file='mysql-bin.000001',
master_log_pos=120;
start slave;
show slave status\G
还原VIP绑定:
# 在hdp3上用root用户执行
/sbin/ifconfig ens32:1 down
# 在hdp4上用root用户执行
/sbin/ifconfig ens160:1 172.16.1.100
还原配置文件:
编辑在hdp1上/etc/masterha/app1.cnf,将[server1]段添加回去。
启动MHA Manage:
# 在hdp1上用root用户执行
nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &
至此环境还原完毕,可以开始测试手工切换。当主服务器故障时,人工手动调用MHA来进行故障切换操作,步骤如下。
(1)停止MHA Manage
在hdp1上用root用户操作。
masterha_stop --conf=/etc/masterha/app1.cnf
(2)关闭master
在hdp4上用root用户操作。
service mysql stop
(3)执行手工切换
在hdp1上用root用户操作。
masterha_master_switch --master_state=dead --conf=/etc/masterha/app1.cnf --dead_master_host=172.16.1.127 --dead_master_port=3306 --new_master_host=172.16.1.126 --new_master_port=3306 --ignore_last_failover
(4)验证VIP漂移到172.16.1.126
[root@hdp3~]#ip a
1: lo:
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens32:
link/ether 00:50:56:a5:0f:77 brd ff:ff:ff:ff:ff:ff
inet 172.16.1.126/24 brd 172.16.1.255 scope global ens32
valid_lft forever preferred_lft forever
inet 172.16.1.100/16 brd 172.16.255.255 scope global ens32:1
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:fea5:f77/64 scope link
valid_lft forever preferred_lft forever
[root@hdp3~]#
(5)验证复制关系
C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.125 -e "show slave status\G"
mysql: [Warning] Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.16.1.126
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 120
Relay_Log_File: hdp2-relay-bin.000002
Relay_Log_Pos: 283
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 120
Relay_Log_Space: 455
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 126
Master_UUID: fadd5b7d-7d9f-11e8-90b4-13ccc7802b56
Master_Info_File: /data/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.126 -e "show slave status\G"
mysql: [Warning] Using a password on the command line interface can be insecure.
C:\WINDOWS\system32>
(6) Verify client access through the VIP
C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.100 -e "show variables like 'server_id'; show databases;"
mysql: [Warning] Using a password on the command line interface can be insecure.
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id | 126 |
+---------------+-------+
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| source |
| test |
+--------------------+
C:\WINDOWS\system32>
4. Test online switchover
In many cases it is necessary to migrate an existing master to another machine: the master's hardware is failing, its RAID controller needs rebuilding, or it is being moved to a faster server. Maintenance on the master degrades performance and causes downtime during which, at the very least, no data can be written. Moreover, blocking or killing currently running sessions can leave the old and new masters inconsistent. MHA provides a fast switchover that blocks writes gracefully: the switch itself takes only about 0.5-2 s, during which data cannot be written. In many cases a 0.5-2 s write block is acceptable, so switching masters does not even require scheduling a maintenance window.
MHA's online switchover proceeds roughly as follows:
- Check the replication settings and identify the current master
- Determine the new master
- Block writes to the current master
- Wait for all slaves to catch up with replication
- Enable writes on the new master
- Repoint the slaves
Note two issues the application architecture must handle during an online switchover:
- Automatically identifying master and slaves (the master may move to another machine); using a VIP largely solves this.
- Load balancing (roughly defining the read/write ratio and the load share each machine can carry; when a machine leaves the cluster, the remaining load must be redistributed).
To guarantee full data consistency and complete the switch as quickly as possible, MHA's online switchover succeeds only if all of the following conditions hold; otherwise it fails:
- The IO thread is running on every slave.
- The SQL thread is running on every slave.
- Seconds_Behind_Master in show slave status is at most running_updates_limit seconds on every slave (if running_updates_limit is not specified for the switch, it defaults to 1 second).
- On the master, show processlist reports no update running longer than running_updates_limit seconds.
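The slave-side conditions above can be checked mechanically from `show slave status\G` output before attempting a switch. A minimal sketch (the slave_ready helper and the sample file are our own illustration):

```shell
#!/bin/sh
# Verify the slave-side preconditions for an online switch from
# `show slave status\G` output: IO thread running, SQL thread running,
# and Seconds_Behind_Master no greater than the given limit.
slave_ready() {
    # $1: file with `show slave status\G` output, $2: max allowed lag in seconds
    io=$(awk -F': ' '/Slave_IO_Running:/ {print $2}' "$1")
    sql=$(awk -F': ' '/Slave_SQL_Running:/ {print $2}' "$1")
    lag=$(awk -F': ' '/Seconds_Behind_Master:/ {print $2}' "$1")
    case "$lag" in ''|*[!0-9]*) return 1 ;; esac   # NULL/missing lag -> not ready
    [ "$io" = "Yes" ] && [ "$sql" = "Yes" ] && [ "$lag" -le "$2" ]
}

# Sample mirroring the status output shown earlier; in practice:
#   mysql -u... -p... -h<slave> -e 'show slave status\G' > /tmp/slave_status.txt
cat > /tmp/slave_status.txt <<'EOF'
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 0
EOF

slave_ready /tmp/slave_status.txt 1 && echo "slave ready for switchover"
# prints: slave ready for switchover
```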
Before testing, restore the environment using the same pre-test steps as in "test manual switchover" above (manual switchover does not modify the /etc/masterha/app1.cnf configuration file), then test online switchover as follows:
(1) Stop the MHA Manager
Run as root on hdp1.
masterha_stop --conf=/etc/masterha/app1.cnf
(2) Run the online switchover command
Run as root on hdp1.
masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=172.16.1.126 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
(3) Verify the replication topology
Check the slave status on hdp2, hdp3, and hdp4:
C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.125 -e "show slave status\G"
mysql: [Warning] Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.16.1.126
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 120
Relay_Log_File: hdp2-relay-bin.000002
Relay_Log_Pos: 283
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 120
Relay_Log_Space: 455
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 126
Master_UUID: fadd5b7d-7d9f-11e8-90b4-13ccc7802b56
Master_Info_File: /data/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.126 -e "show slave status\G"
mysql: [Warning] Using a password on the command line interface can be insecure.
C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.127 -e "show slave status\G"
mysql: [Warning] Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.16.1.126
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 120
Relay_Log_File: hdp4-relay-bin.000002
Relay_Log_Pos: 283
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 120
Relay_Log_Space: 455
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 126
Master_UUID: fadd5b7d-7d9f-11e8-90b4-13ccc7802b56
Master_Info_File: /data/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
C:\WINDOWS\system32>
As the output shows, hdp3 (172.16.1.126) has become the new master, while hdp2 (172.16.1.125) and hdp4 (172.16.1.127) are now slaves pointing at it.
(4) Verify that the VIP drifted automatically
[root@hdp3~]#ip a
1: lo:
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens32:
link/ether 00:50:56:a5:0f:77 brd ff:ff:ff:ff:ff:ff
inet 172.16.1.126/24 brd 172.16.1.255 scope global ens32
valid_lft forever preferred_lft forever
inet 172.16.1.100/16 brd 172.16.255.255 scope global ens32:1
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:fea5:f77/64 scope link
valid_lft forever preferred_lft forever
[root@hdp3~]#
(5) Verify client access to the database through the VIP
C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.100 -e "show variables like 'server_id'"
mysql: [Warning] Using a password on the command line interface can be insecure.
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id | 126 |
+---------------+-------+
C:\WINDOWS\system32>
5. Repair the downed master
After an automatic failover the original master is usually written off. Once its host has been repaired, and provided the data is intact, you may want to bring the original master back as a slave of the new master. The MHA log recorded at failover time makes this possible; extract the relevant line with:
grep -i "All other slaves should start" /var/log/masterha/app1/manager.log
The output looks similar to:
All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.1.126', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000005', MASTER_LOG_POS=120, MASTER_USER='repl', MASTER_PASSWORD='123456';
That is, once the master host has been repaired, run this CHANGE MASTER statement on it to make it a slave of the new master.
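Extracting that statement from the log can be scripted as well. A minimal sketch (extract_change_master is our own helper; the sample file reproduces the log line shown above):

```shell
#!/bin/sh
# Pull the CHANGE MASTER statement that MHA printed at failover time out of
# the manager log, so it can be replayed on the repaired master.
extract_change_master() {
    # $1: manager log file
    sed -n 's/.*Statement should be: \(CHANGE MASTER TO .*;\).*/\1/p' "$1"
}

# Sample line copied from the manager log excerpt above; in practice:
#   extract_change_master /var/log/masterha/app1/manager.log
cat > /tmp/manager_log_sample.txt <<'EOF'
All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.1.126', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000005', MASTER_LOG_POS=120, MASTER_USER='repl', MASTER_PASSWORD='123456';
EOF

extract_change_master /tmp/manager_log_sample.txt
```

After running the extracted statement on the repaired master, execute start slave;, verify with show slave status\G, and add the [server1] section back to app1.cnf before restarting the manager.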
About Me
Author: 小麦苗 (public account: xiaomaimiaolhr); parts of this article were compiled from the web. Also published on itpub (http://blog.itpub.net/26736162), 博客园 (http://www.cnblogs.com/lhrbest), and CSDN (https://blog.csdn.net/lihuarongaini). Written in Xi'an, February 2020.