Compared with the HA configuration on RHEL 6 / CentOS 6, RHEL 7 replaces the old ccs and luci tools with pcs. Combining pcs + pacemaker + corosync with Oracle, we can build an Oracle HA configuration and migrate the data-sharing-platform database onto it. corosync acts as the cluster messaging layer (Message Layer), pacemaker as the cluster resource manager (Cluster Resource Management), and pcs as the management interface to the CRM.
Edit the hosts file:
pcs01: 10.228.110.126
pcs02: 10.228.110.128
VIP:   192.168.110.133
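A minimal /etc/hosts sketch for both nodes, assuming the names and addresses above (add a VIP entry as well if you want it resolvable by name):
10.228.110.126   pcs01
10.228.110.128   pcs02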
Set up passwordless SSH trust between the nodes; the following script and command can be used:
./sshUserSetup.sh -user root -hosts "hacluster01 hacluster02" -advanced -noPromptPassphrase
Disable the firewall
Check the firewall status: systemctl status firewalld.service
Stop the firewall: systemctl stop firewalld.service
Check whether the firewall starts on boot: systemctl is-enabled firewalld.service
Do not start it on boot: systemctl disable firewalld.service
Alternatively, keep firewalld running and allow the high-availability service through it:
firewall-cmd --permanent --add-service=high-availability
firewall-cmd --add-service=high-availability
firewall-cmd --reload
Disable SELinux
vi /etc/selinux/config      (set SELINUX=disabled)
setenforce 0
Disable NetworkManager
systemctl stop NetworkManager
systemctl disable NetworkManager
systemctl status NetworkManager
Mount the ISO
mount /home/patrol/rhel-server-7.4-x86_64-dvd.iso /mnt
Alternatively, attach the ISO image through the server's iBMC management port, then run: mount /dev/sr0 /mnt
Configure yum
[root@pcs02 yum.repos.d]# cat ha.repo
[rhel-yum]
name=ha2                                      --- this name must be the same
baseurl=file:///mnt/addons/HighAvailability   --- location of the HA packages inside the ISO
enabled=1
gpgcheck=0
[ha]
name=ha2
baseurl=file:///mnt
enabled=1
gpgcheck=0
Install the software
1. With the yum repository configured, corosync, pacemaker and pcs can be installed directly:
yum install pcs pacemaker fence-agents-all
-- or: yum install /iso/addons/HighAvailability/* -y
yum install lvm2-cluster gfs2-utils
-- In my case I copied the gfs2-utils rpm onto the system and installed it with rpm -ivh gfs2-utils.rpm; the lvm2-cluster package was not found, so it was not installed.
2. Alternatively, when installing the Linux OS, select the "HIGH AVAILABILITY ADD-ON" option in the software selection screen. The cluster software is then installed together with the OS, none of the steps above are needed, and the hacluster user is created automatically.
yum install pcs pacemaker fence-agents-all
Change the hacluster user's password
Change the hacluster password on both nodes:
passwd hacluster
Set the same password on both nodes.
Start the services
Start the pcsd service on both nodes:
systemctl start pcsd.service
systemctl enable pcsd.service
Authenticate the hacluster user
Syntax: pcs cluster auth node1 node2 ... nodeN
(On RHEL 8 the equivalent is pcs host auth, e.g. [root@z1 ~]# pcs host auth z1.example.com z2.example.com)

[root@pcs01 patrol]# pcs cluster auth pcs01 pcs02
Username: hacluster
Password:
pcs01: Authorized
pcs02: Authorized
[root@pcs01 patrol]#

[root@pcs02 Packages]# pcs cluster auth pcs01 pcs02
pcs01: Already authorized
pcs02: Already authorized
[root@pcs02 Packages]#

Create the cluster
Create a cluster named mycluster with both nodes as cluster members. This command creates and starts the cluster in one step. You only need to run this from one node in the cluster, because pcs configuration commands take effect for the entire cluster. On one node in the cluster, run:

# pcs cluster setup --name mycluster --start pcs01 pcs02
(The form [root@z1 ~]# pcs cluster setup my_cluster --start z1.example.com z2.example.com is the newer RHEL 8 syntax; on RHEL 7 the --name option is required.)
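Once the cluster is created, it can be started and enabled on both nodes and then checked; a short sketch using standard pcs commands (not listed in the original steps):
pcs cluster start --all        # start corosync and pacemaker on all nodes
pcs cluster enable --all       # have the cluster start automatically at boot
pcs cluster status             # confirm both nodes are online
pcs status corosync            # check corosync membership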
Web UI login
https://nodename:2224 (nodename can be replaced with the node's IP address)
Add nodes and create the cluster
Add the two nodes in the web UI, then join them into a single cluster.
Or from the command line: pcs cluster setup --name mycluster pcs01 pcs02
Add the VIP resource
pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=192.168.0.120 cidr_netmask=24 op monitor interval=30s
Resource id: the resource name (arbitrary)
IP: the floating IP address
Nic: the NIC the floating IP is bound to (leave at the default, or specify one)
(A CLI sketch follows below.)
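A hedged CLI equivalent adapted to this deployment (the group name dsdb and resource name VIP come from the pcs status output shown later; the address, netmask and NIC name here are assumptions to be replaced with your own values):
pcs resource create VIP ocf:heartbeat:IPaddr2 ip=10.228.110.133 cidr_netmask=24 nic=eth0 op monitor interval=30s --group dsdb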
Add the LVM resource
Volgrpname: the VG name
Add the Filesystem resource
Device: the device backing the file system, e.g. /dev/ftpvg/lv_backup
Directory: the mount point, e.g. /reportdata.backup
Fstype: the file system type, e.g. ext4
(A CLI sketch follows below.)
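A hedged CLI sketch of the same LVM and Filesystem configuration (the resource names LVM and orahome appear in the pcs status output later in this document; the VG name, device, mount point and group are the examples above and should match your own layout):
pcs resource create LVM ocf:heartbeat:LVM volgrpname=ftpvg exclusive=true --group dsdb
pcs resource create orahome ocf:heartbeat:Filesystem device=/dev/ftpvg/lv_backup directory=/reportdata.backup fstype=ext4 --group dsdb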
Start/stop ordering dependencies between resources can also be configured at the same time.
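Resources placed in the same pcs group already start in the order they were added and stop in reverse, so explicit constraints are mainly needed for resources outside the group. A hedged example using the resource names above:
pcs constraint order start LVM then start orahome
pcs constraint colocation add orahome with LVM INFINITY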
Add the Oracle resources
The Oracle resources consist of two parts: the TNS listener and the RDBMS instance.
Add the listener resource
Add and test the listener resource as follows (added through the web UI):
type: select oralsnr
resource group: the group the resource is assigned to
resource_id: the resource name
sid: the alias in tnsnames.ora
Home: the database ORACLE_HOME
User: the owner of the database software installation
tns_admin: $ORACLE_HOME/network/admin
(A CLI sketch follows below.)
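A hedged CLI equivalent (the resource name lsnrctl, listener name listener_dsdb, SID and ORACLE_HOME are taken from output shown later in this document; adjust them to your installation):
pcs resource create lsnrctl ocf:heartbeat:oralsnr sid=dsdb home=/sjgxpt/oracle/product/11.2/db_1 user=oracle listener=listener_dsdb --group dsdb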
Example of the completed configuration:
Database resource
Add and test the database resource as follows (added through the web UI):
Resource_id: the resource name
SID: the database instance name
Home: the database ORACLE_HOME value
User: the owner of the database software installation
(A CLI sketch follows below.)
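A hedged CLI equivalent (the resource name database, SID dsdb and ORACLE_HOME are taken from output shown later in this document):
pcs resource create database ocf:heartbeat:oracle sid=dsdb home=/sjgxpt/oracle/product/11.2/db_1 user=oracle --group dsdb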
Below is the resource information as actually deployed:
Add the fence device
Configure each node's IMM management port IP, username and password.
A Red Hat High Availability cluster requires that you configure fencing for the cluster. The reasons for this requirement are described in Fencing in a Red Hat High Availability Cluster. For this introduction, however, which is intended to show only how to use the basic Pacemaker commands, disable fencing by setting the stonith-enabled cluster option to false.

WARNING: The use of stonith-enabled=false is completely inappropriate for a production cluster. It tells the cluster to simply pretend that failed nodes are safely powered off.

# pcs property set stonith-enabled=false
Port: the node name controlled by this fence device
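For a production cluster, a hedged sketch of creating one of the IPMI fence devices from the command line (the address, login and password match the fence_ipmilan test command listed later; the resource name ibmca appears in the pcs status output; the assumption that ibmca fences pcs01, and the location rule keeping a fence device off the node it fences, are common practice rather than statements from the original):
pcs stonith create ibmca fence_ipmilan ipaddr=10.228.230.98 login=USERID passwd=PASSW0RD lanplus=1 pcmk_host_list=pcs01 op monitor interval=60s
pcs constraint location ibmca avoids pcs01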
Cluster management
Checking resources
pcs status
pcs status --full
pcs status resources --full
pcs status resources arch
pcs status resources dsdb
pcs resource show
pcs resource show vip
Configure resource location preferences
pcs constraint location ora prefers pcs01=150
pcs constraint location ora prefers pcs02=50
pcs constraint location arch prefers pcs01=150
pcs constraint location arch prefers pcs02=50
pcs constraint location data prefers pcs01=150
pcs constraint location data prefers pcs02=50
pcs constraint location VIP prefers pcs01=150
pcs constraint location VIP prefers pcs02=50
pcs constraint location LVM prefers pcs01=150
pcs constraint location LVM prefers pcs02=50
pcs constraint location lsnrctl prefers pcs01=150
pcs constraint location lsnrctl prefers pcs02=50
pcs constraint location database prefers pcs01=150
pcs constraint location database prefers pcs02=50

Disable and enable resources
pcs resource enable resource_id
pcs resource disable resource_id    -- stop the resource and do not start it again

Switch the resource group
pcs resource clear dsdb
pcs resource move dsdb
pcs resource move dsdb pcs01
pcs resource move dsdb pcs02
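After setting the preferences, the configured constraints can be reviewed with a standard pcs command (not part of the original notes):
pcs constraint list --full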
Other commands
1. pcs resource move <resource/group name>    -- move a resource or resource group to the other node
2. pcs resource clear / pcs resource cleanup <resource/group name>    -- clear the location constraint left on a node by a move; until it is cleared, the resource or group cannot be switched back to that node, because the node will not accept it.
pcs status                     -- overall cluster status
pcs status nodes ftp2          -- show the status of node ftp2
pcs status groups ftpgroup     -- show the status of a resource group
pcs config                     -- show all configuration information
pcs resource show --full       -- show all resource information
pcs resource show VIP          -- show one resource
fence_ipmilan -a 10.228.230.98 -l USERID -p PASSW0RD -o status    -- test the fence device
pcs cluster stop [--all] [node] [...]    -- stop the cluster
pcs cluster start              -- start the cluster
pcs cluster standby z1.example.com       -- put z1.example.com into standby mode
pcs cluster unstandby z1.example.com     -- take z1 out of standby mode
Cluster service management:
pcs cluster status
To make sure corosync, pacemaker and pcsd are all enabled at boot:
systemctl enable corosync pacemaker pcsd
Failure simulation
Listener failure
To simulate a listener failure, stop the listener manually; after a short while (governed by the resource's monitor interval) the cluster software restarts the listener automatically.
[oracle@pcs01 trace]$ lsnrctl stop listener_dsdb        <-- simulate the listener failure
LSNRCTL for Linux: Version 11.2.0.4.0 - Production on 09-NOV-2018 17:53:44
Copyright (c) 1991, 2013, Oracle.  All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=10.228.110.133)(PORT=1521)))
The command completed successfully
[oracle@pcs01 trace]$ ps -ef|grep tns
root       183     2  0 11月06 ?      00:00:00 [netns]
oracle    4393 46314  0 17:53 pts/0  00:00:00 grep --color=auto tns
[oracle@pcs01 trace]$ ps -ef|grep tns
root       183     2  0 11月06 ?      00:00:00 [netns]
oracle    4468 46314  0 17:53 pts/0  00:00:00 grep --color=auto tns
[oracle@pcs01 trace]$ ps -ef|grep tns
root       183     2  0 11月06 ?      00:00:00 [netns]
oracle    4513 46314  0 17:53 pts/0  00:00:00 grep --color=auto tns
[oracle@pcs01 trace]$ ps -ef|grep tns
root       183     2  0 11月06 ?      00:00:00 [netns]
oracle    4516 46314  0 17:53 pts/0  00:00:00 grep --color=auto tns
[oracle@pcs01 trace]$ ps -ef|grep tns
root       183     2  0 11月06 ?      00:00:00 [netns]
oracle    4777 46314  0 17:53 pts/0  00:00:00 grep --color=auto tns
[oracle@pcs01 trace]$ ps -ef|grep tns            <-- the listener has been restarted
root       183     2  0 11月06 ?      00:00:00 [netns]
oracle    5110     1  0 17:53 ?      00:00:00 /sjgxpt/oracle/product/11.2/db_1/bin/tnslsnr listener_dsdb -inherit
oracle    5233 46314  0 17:53 pts/0  00:00:00 grep --color=auto tns
[oracle@pcs01 trace]$ ps -ef|grep tns
root       183     2  0 11月06 ?      00:00:00 [netns]
oracle    5110     1  1 17:53 ?      00:00:00 /sjgxpt/oracle/product/11.2/db_1/bin/tnslsnr listener_dsdb -inherit
oracle    5376 46314  0 17:53 pts/0  00:00:00 grep --color=auto tns
[oracle@pcs01 trace]$
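How quickly the listener comes back is governed by the monitor operation on the listener resource; a hedged sketch of inspecting and tuning it (resource name lsnrctl as in the pcs status output, interval and timeout values are only examples):
pcs resource show lsnrctl
pcs resource update lsnrctl op monitor interval=30s timeout=60s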
Database failure
Simulate a database crash using SHUTDOWN ABORT
[oracle@pcs01 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Mon Nov 12 11:43:05 2018
Copyright (c) 1982, 2013, Oracle.  All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> shutdown abort
ORACLE instance shut down.
SQL> exit

The database alert log is as follows:

Mon Nov 12 11:44:05 2018
Shutting down instance (abort)
License high water mark = 33
USER (ospid: 8659): terminating the instance
Instance terminated by USER, pid = 8659
Mon Nov 12 11:44:07 2018
Instance shutdown complete
Mon Nov 12 11:46:03 2018
Adjusting the default value of parameter parallel_max_servers from 1280 to 970 due to the value of parameter processes (1000)
Starting ORACLE instance (normal)
************************ Large Pages Information *******************
Per process system memlock (soft) limit = 13 GB
Total Shared Global Region in Large Pages = 0 KB (0%)
Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 0 (0 KB)
Large Pages configured system wide = 0 (0 KB)
Large Page size = 2048 KB
RECOMMENDATION:
Total System Global Area size is 20 GB. For optimal performance, prior to the next instance restart:
1. Increase the number of unused large pages by at least 10241 (page size 2048 KB, total size 20 GB) system wide to get 100% of the System Global Area allocated with large pages
2. Large pages are automatically locked into physical memory. Increase the per process memlock (soft) limit to at least 20 GB to lock 100% System Global Area's large pages into physical memory
********************************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 32
Number of processor cores in the system is 16
Number of processor sockets in the system is 4
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
IMODE=BR
ILAT =168
LICENSE_MAX_USERS = 0
SYS auditing is disabled
NUMA system with 8 nodes detected
Oracle NUMA support not enabled
The parameter _enable_NUMA_support should be set to TRUE to enable Oracle NUMA support
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options.
ORACLE_HOME = /sjgxpt/oracle/product/11.2/db_1 System name: Linux Node name: pcs01 Release: 3.10.0-693.el7.x86_64 Version: #1 SMP Thu Jul 6 19:56:57 EDT 2017 Machine: x86_64 Using parameter settings in server-side spfile /sjgxpt/oracle/product/11.2/db_1/dbs/spfiledsdb.ora System parameters with non-default values: processes = 1000 sessions = 1536 sga_target = 20G control_files = "/sjgxpt/oracle/data/dsdb/control01.ctl" control_files = "/sjgxpt/oracle/data/dsdb/control02.ctl" db_block_size = 8192 compatible = "11.2.0.4.0" log_archive_dest_1 = "LOCATION=/sjgxpt/oracle/arch" log_archive_format = "%t_%s_%r.dbf" undo_tablespace = "UNDOTBS1" remote_login_passwordfile= "EXCLUSIVE" db_domain = "" dispatchers = "(PROTOCOL=TCP) (SERVICE=dsdbXDB)" local_listener = "(DESCRIPTION_LIST = (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST =10.228.110.133)(PORT = 1521)) (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC1521)) ) )" java_jit_enabled = TRUE audit_file_dest = "/sjgxpt/oracle/product/admin/dsdb/adump" audit_trail = "DB" db_name = "dsdb" open_cursors = 300 pga_aggregate_target = 12875M diagnostic_dest = "/sjgxpt/oracle/product" Mon Nov 12 11:46:03 2018 PMON started with pid=2, OS id=16298 Mon Nov 12 11:46:04 2018 PSP0 started with pid=3, OS id=16300 Mon Nov 12 11:46:05 2018 VKTM started with pid=4, OS id=16309 at elevated priority VKTM running at (1)millisec precision with DBRM quantum (100)ms Mon Nov 12 11:46:05 2018 GEN0 started with pid=5, OS id=16313 Mon Nov 12 11:46:05 2018 DIAG started with pid=6, OS id=16315 Mon Nov 12 11:46:05 2018 DBRM started with pid=7, OS id=16317 Mon Nov 12 11:46:05 2018 DIA0 started with pid=8, OS id=16319 Mon Nov 12 11:46:05 2018 MMAN started with pid=9, OS id=16321 Mon Nov 12 11:46:05 2018 DBW0 started with pid=10, OS id=16323 Mon Nov 12 11:46:05 2018 DBW1 started with pid=11, OS id=16325 Mon Nov 12 11:46:05 2018 DBW2 started with pid=12, OS id=16327 Mon Nov 12 11:46:05 2018 DBW3 started with pid=13, OS id=16329 Mon Nov 12 11:46:05 2018 LGWR started with pid=14, OS id=16331 Mon Nov 12 11:46:05 2018 CKPT started with pid=15, OS id=16333 Mon Nov 12 11:46:05 2018 SMON started with pid=16, OS id=16335 Mon Nov 12 11:46:05 2018 RECO started with pid=17, OS id=16337 Mon Nov 12 11:46:05 2018 MMON started with pid=18, OS id=16339 Mon Nov 12 11:46:05 2018 MMNL started with pid=19, OS id=16341 starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'... starting up 1 shared server(s) ... ORACLE_BASE not set in environment. 
It is recommended that ORACLE_BASE be set in the environment Reusing ORACLE_BASE from an earlier startup = /sjgxpt/oracle/product Mon Nov 12 11:46:05 2018 ALTER DATABASE MOUNT Successful mount of redo thread 1, with mount id 605261437 Database mounted in Exclusive Mode Lost write protection disabled Completed: ALTER DATABASE MOUNT Mon Nov 12 11:46:09 2018 alter database open Beginning crash recovery of 1 threads parallel recovery started with 31 processes Started redo scan Completed redo scan read 47 KB redo, 30 data blocks need recovery Started redo application at Thread 1: logseq 250, block 12222 Recovery of Online Redo Log: Thread 1 Group 1 Seq 250 Reading mem 0 Mem# 0: /sjgxpt/oracle/data/dsdb/redo01.log Completed redo application of 0.02MB Completed crash recovery at Thread 1: logseq 250, block 12317, scn 4277695 30 data blocks read, 30 data blocks written, 47 redo k-bytes read LGWR: STARTING ARCH PROCESSES Mon Nov 12 11:46:10 2018 ARC0 started with pid=54, OS id=16641 ARC0: Archival started LGWR: STARTING ARCH PROCESSES COMPLETE ARC0: STARTING ARCH PROCESSES Thread 1 advanced to log sequence 251 (thread open) Mon Nov 12 11:46:10 2018 ARC1 started with pid=55, OS id=16667 Mon Nov 12 11:46:10 2018 ARC2 started with pid=56, OS id=16683 Mon Nov 12 11:46:10 2018 ARC3 started with pid=57, OS id=16697 ARC1: Archival started ARC2: Archival started ARC1: Becoming the 'no FAL' ARCH ARC1: Becoming the 'no SRL' ARCH ARC2: Becoming the heartbeat ARCH Thread 1 opened at log sequence 251 Current log# 2 seq# 251 mem# 0: /sjgxpt/oracle/data/dsdb/redo02.log Successful open of redo thread 1 MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set SMON: enabling cache recovery ARC3: Archival started ARC0: STARTING ARCH PROCESSES COMPLETE Archived Log entry 244 added for thread 1 sequence 250 ID 0x23e8dee0 dest 1: [16498] Successfully onlined Undo Tablespace 2. Undo initialization finished serial:0 start:501291664 end:501291704 diff:40 (0 seconds) Verifying file header compatibility for 11g tablespace encryption.. Verifying 11g file header compatibility for tablespace encryption completed SMON: enabling tx recovery Database Characterset is AL32UTF8 No Resource Manager plan active replication_dependency_tracking turned off (no async multimaster replication found) Starting background process QMNC Mon Nov 12 11:46:10 2018 QMNC started with pid=58, OS id=16794 Completed: alter database open Starting background process CJQ0 Mon Nov 12 11:46:11 2018 CJQ0 started with pid=59, OS id=16892
Host failure
Simulate by rebooting the host:
[root@pcs01 ~]# reboot
Connection to 10.228.110.126 closed by remote host.
Connection to 10.228.110.126 closed.
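To confirm the failover, check the cluster from the surviving node; the dsdb resource group should report as Started on pcs02 (a hedged check, assuming pcs02 is the surviving node):
[root@pcs02 ~]# pcs status
[root@pcs02 ~]# pcs status resources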
Resource maintenance
Applying PSU and JVM patches
When the Oracle database needs maintenance that involves stopping the database or the listener, put the resources into maintenance (unmanaged) mode first, so that the cluster does not restart or fail them over:
Enter maintenance mode: pcs resource unmanage resource1
Leave maintenance mode: pcs resource manage resource1
Check resource status with pcs status
[root@pcs01 ~]# pcs status
Cluster name: cluster1
Stack: corosync
Current DC: pcs02 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Thu May 30 10:07:00 2019
Last change: Wed Apr 3 15:40:08 2019 by hacluster via crmd on pcs01

2 nodes configured
9 resources configured

Online: [ pcs01 pcs02 ]

Full list of resources:

 ibmca      (stonith:fence_ipmilan):        Stopped
 ibmcb      (stonith:fence_ipmilan):        Started pcs01
 Resource Group: dsdb
     VIP        (ocf::heartbeat:IPaddr):        Started pcs01
     LVM        (ocf::heartbeat:LVM):           Started pcs01
     orahome    (ocf::heartbeat:Filesystem):    Started pcs01
     arch       (ocf::heartbeat:Filesystem):    Started pcs01
     data       (ocf::heartbeat:Filesystem):    Started pcs01
     database   (ocf::heartbeat:oracle):        Started pcs01
     lsnrctl    (ocf::heartbeat:oralsnr):       Started pcs01

Failed Actions:
* ibmca_start_0 on pcs02 'unknown error' (1): call=239, status=Timed Out, exitreason='none',
    last-rc-change='Wed Apr 3 14:42:27 2019', queued=0ms, exec=20177ms
* ibmcb_monitor_60000 on pcs01 'unknown error' (1): call=182, status=Timed Out, exitreason='none',
    last-rc-change='Wed May 29 19:30:59 2019', queued=0ms, exec=20136ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@pcs01 ~]#
Put the resources into maintenance (unmanaged) mode, then carry out the maintenance:
[oracle@pcs01 20190415psu_11.2.0.4]$ lsnrctl stop listener_dsdb
LSNRCTL for Linux: Version 11.2.0.4.0 - Production on 30-MAY-2019 10:10:11
Copyright (c) 1991, 2013, Oracle.  All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=10.228.110.133)(PORT=1521)))
The command completed successfully
[oracle@pcs01 20190415psu_11.2.0.4]$
[oracle@pcs01 29141056]$ opatch apply
-- then apply the JVM patch with opatch apply, and bring the database up with startup upgrade
Exit maintenance mode (pcs resource manage), then relocate the resource group as needed:
pcs resource clear dsdb
pcs resource move dsdb pcs01
pcs resource move dsdb pcs02