The database instance Crash because the CPU High ?

本站文章除注明转载外，均为本站原创：转载自love wife & love life —Roger 的Oracle技术博客

本文链接地址: The database instance Crash because the CPU High ?

某系统的其中一个RAC节点的db实例被干掉并自动重启了，如下是该实例的alert log信息：

Wed Oct 08 12:59:51 CST 2014Thread 1 advanced to log sequence 543 (LGWR switch)Current log# 1 seq# 543 mem# 0: +DGSYS/hisdb/onlinelog/group_1.261.856010381Wed Oct 08 13:13:45 CST 2014WARNING: inbound connection timed out (ORA-3136)Wed Oct 08 13:16:56 CST 2014LMON (ospid: 17329) is not heartbeating for 204 seconds.LMON is not healthy and has no heartbeat.Please check LMD0/LMS0 and DIAG trace files for detail.Wed Oct 08 13:17:02 CST 2014LMS0 (ospid: 17333) is terminating the instance.LMS0: terminating instance due to error 484Wed Oct 08 13:17:05 CST 2014System state dump is made for local instanceSystem State dumped to trace file /opt/oracle/admin/hisdb/bdump/hisdb1_diag_17325.trcWed Oct 08 13:17:06 CST 2014Shutting down instance (abort)License high water mark = 1068Wed Oct 08 13:17:07 CST 2014Instance terminated by LMS0, pid = 17333Wed Oct 08 13:17:11 CST 2014Instance terminated by USER, pid = 15788Wed Oct 08 13:17:17 CST 2014Starting ORACLE instance (normal)LICENSE_MAX_SESSION = 0LICENSE_SESSIONS_WARNING = 0Interface type 1 eth1 10.0.0.0 configured from OCR for use as a cluster interconnectInterface type 1 eth0 10.3.36.192 configured from OCR for use as  a public interfacePicked latch-free SCN scheme 3Wed Oct 08 13:17:27 CST 2014Using LOG_ARCHIVE_DEST_1 parameter default value as /opt/oracle/product/10.2.0/db/dbs/archLICENSE_MAX_USERS = 0SYS auditing is disabledksdpec: called for event 13740 prior to event group initializationStarting up ORACLE RDBMS Version: 10.2.0.5.0.System parameters with non-default values:。。。。。。。Wed Oct 08 13:17:39 CST 2014ALTER DATABASE OPENPicked broadcast on commit scheme to generate SCNsWed Oct 08 13:17:39 CST 2014SUCCESS: diskgroup MSDATA was mountedSUCCESS: diskgroup DGDATA was mountedWed Oct 08 13:17:39 CST 2014Thread 1 opened at log sequence 544Current log# 2 seq# 544 mem# 0: +DGSYS/hisdb/onlinelog/group_2.262.856010381Successful open of redo thread 1Wed Oct 08 13:17:39 CST 2014MTTR advisory is disabled because FAST_START_MTTR_TARGET is not setWed Oct 08 13:17:39 CST 2014SMON: enabling cache recoveryWed Oct 08 13:17:40 CST 2014Successfully onlined Undo Tablespace 1.Wed Oct 08 13:17:40 CST 2014SMON: enabling tx recoveryWed Oct 08 13:17:40 CST 2014Database Characterset is ZHS16GBKOpening with internal Resource Manager planreplication_dependency_tracking turned off (no async multimaster replication found)Starting background process QMNCQMNC started with pid=28, OS id=16432Wed Oct 08 13:17:42 CST 2014Completed: ALTER DATABASE OPEN

我们可以看到，该实例在13:17:07 秒被LMS0进程强行终止掉了，接着该实例在13:17:17 被正常启动。在该实例被强行终止之前，有一点关键信息是值得我们注意的，如下：

Wed Oct 08 13:13:45 CST 2014WARNING: inbound connection timed out (ORA-3136)Wed Oct 08 13:16:56 CST 2014LMON (ospid: 17329) is not heartbeating for 204 seconds.LMON is not healthy and has no heartbeat.Please check LMD0/LMS0 and DIAG trace files for detail.Wed Oct 08 13:17:02 CST 2014LMS0 (ospid: 17333) is terminating the instance.LMS0: terminating instance due to error 484

从这部分信息，我们可以大致判断，在13:16:56时，Oracle已经发现LMON进程长时间没有检测到心跳了，这个时间长达204秒。如果根据时间向前推进，在13:13:32时间点，实际上Lmon进程就开始出现异常了。我们也可以看到在13:13:45时间点，出现了一个ora-3136错误。一般来讲，这个waring跟系统的负载可能有极大的关系，例如资源使用极高，可能出现超时的情况。

从alert log信息来看，Oracle 让我们去查看LMD0/LMS0 以及diag的信息来进行进一步的分析。那么我们首先就来看一下LMD0进程的信息：

*** 2014-10-08 12:47:22.077Setting 3-way CR grants to 1 global-lru off? 0*** 2014-10-08 13:16:58.621KJM_HISTORY: RCVR STALL OP(12) context 5 elapsed 202974048 usKJM HIST LMD0:12:5:202974048 7:6 6:3 10:4 2:6 1:2 11:2 15:2 12:80846 7:46:4 10:9 11:2 15:4 12:78700 7:8 6:1 10:8 11:2 15:212:79509 7:3 6:1 10:3 11:2 15:2 12:78710 7:2 6:2 10:411:3 15:1 12:79073 7:3 6:2 10:5 11:2 15:2 12:79846 7:66:1 10:14 11:2 15:5 12:78560 7:6 6:3 10:9 11:2 15:512:79708 7:5 6:1 10:6 11:0 15:1 12:79261 7:3 6:1 10:1011:1 15:4 12:79432 7:4----------------------------------------SO: 0x478e7ff10, type: 4, owner: 0x4798e0088, flag: INIT/-/-/0x00(session) sid: 3301 trans: (nil), creator: 0x4798e0088, flag: (51) USR/- BSY/-/-/-/-/-DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000txn branch: (nil)oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYSlast wait for 'ges remote message' wait_time=3 min 22 sec, seconds since wait started=258waittime=40, loop=0, p3=0blocking sess=0x(nil) seq=8230Dumping Session Wait Historyfor 'ges remote message' count=1 wait_time=3 min 22 secwaittime=40, loop=0, p3=0for 'ges remote message' count=1 wait_time=0.080830 secwaittime=40, loop=0, p3=0for 'ges remote message' count=1 wait_time=0.078682 secwaittime=40, loop=0, p3=0for 'ges remote message' count=1 wait_time=0.079500 secwaittime=40, loop=0, p3=0for 'ges remote message' count=1 wait_time=0.078702 secwaittime=40, loop=0, p3=0for 'ges remote message' count=1 wait_time=0.079063 secwaittime=40, loop=0, p3=0for 'ges remote message' count=1 wait_time=0.079834 secwaittime=40, loop=0, p3=0for 'ges remote message' count=1 wait_time=0.078517 secwaittime=40, loop=0, p3=0for 'ges remote message' count=1 wait_time=0.079689 secwaittime=40, loop=0, p3=0for 'ges remote message' count=1 wait_time=0.079250 secwaittime=40, loop=0, p3=0

从LMD0进程的信息来看，可以发现该进程（lmd0）已经等待了258秒，等待事件为ges remote message，除此之外，没有其他的信息了。

既然这样，那我们继续来看下LMS0进程的信息：

*** 2014-09-30 10:01:46.825DRM(3) win(1) lms 0 finished replaying gcs resourceslms 0 finished fixing gcs write protocolDRM(3) win(2) lms 0 finished replaying gcs resourceslms 0 finished fixing gcs write protocolDRM(3) win(3) lms 0 finished replaying gcs resourceslms 0 finished fixing gcs write protocolDRM(3) win(4) lms 0 finished replaying gcs resourceslms 0 finished fixing gcs write protocolDRM(3) win(5) lms 0 finished replaying gcs resourceslms 0 finished fixing gcs write protocolDRM(3) win(6) lms 0 finished replaying gcs resourceslms 0 finished fixing gcs write protocolDRM(3) win(7) lms 0 finished replaying gcs resourceslms 0 finished fixing gcs write protocolDRM(3) win(8) lms 0 finished replaying gcs resourceslms 0 finished fixing gcs write protocol*** 2014-10-08 13:16:56.712kjfdchklmon: LMON is not heartbeating for 204 sec (1412745209.1412745413.0).kjfdchklmon: Dumping callstack of lmonSubmitting asynchronized dump request [20]kjfdchklmon: LMON is not healthy. kill instance.ksuitm: waiting up to [5] seconds before killing DIAG(17325)

LMS0进程的信息似乎也没有太多的价值。最后我们只能看看diag的信息了，Oracle在将所有进程kill之前进行了一次dump。

*** 2014-10-08 13:16:56.894Process diagnostic dump for oracle@hisdbs1 (LMON), OS id=17329,pid: 5, proc_ser: 1, sid: 3302, sess_ser: 1-------------------------------------------------------------------------------loadavg : 62.17 35.29 18.12Memory (Avail / Total) = 5976.90M / 64455.59MSwap (Avail / Total) = 21215.88M /  24003.36MF S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMDShort stack dump: ORA-00072: ?? "Unix process pid: 17329, image: oracle@hisdbs1 (LMON)" ???-------------------------------------------------------------------------------Process diagnostic dump actual duration=8.010000 sec(max dump time=30.000000 sec)*** 2014-10-08 13:17:04.908----------------------------------------SO: 0x47e8f2a38, type: 2, owner: (nil), flag: INIT/-/-/0x00(process) Oracle pid=5, calls cur/top: 0x4762511b8/0x4762511b8, flag: (6) SYSTEMint error: 0, call error: 0, sess error: 0, txn error 0(post info) last post received: 0 0 91last post received-location: kjfzack: post pmon for dead process cleanuplast process to post me: 4798e0088 1 6last post sent: 0 0 108last post sent-location: kjmpost: post lmdlast process posted by me: 4798e0088 1 6(latch info) wait_event=0 bits=0Location from where call was made: ksqgel: create enqueue: parent objContext saved from call: 19162010040waiting for 600098c8 enqueues level=5Location from where latch is held: ksqgel: create enqueue: parent objContext saved from call: 16790760304state=busy, wlstate=freewaiters [orapid (seconds since: put on list, posted, alive check)]:5 (216, 1412745425, 6)           ---LMON774 (216, 1412745425, 216)15 (216, 1412745425, 153)        ---ckpt813 (216, 1412745425, 216)18 (216, 1412745425, 141)        ---CJQ025 (216, 1412745425, 216)        ---ASMB886 (27, 1412745425, 27)280 (15, 1412745425, 15)783 (9, 1412745425, 9)800 (9, 1412745425, 9)waiter count=10gotten 100670517 times wait, failed first 9774 sleeps 746gotten 0 times nowait, failed: 0possible holder pid = 792 ospid=15095on wait list for 600098c8Process Group: DEFAULT, pseudo proc: 0x478a24a58O/S info: user: oracle, term: UNKNOWN, ospid: 17329 (DEAD)OSD pid info: Unix process pid: 17329, image: oracle@hisdbs1 (LMON)PSO child state object changes :Dump of memory from 0x000000047E8930D8 to 0x000000047E8932E047E8930D0                   00000008 00000000          [........]。。。。。。Repeat 23 times----------------------------------------SO: 0x477e7c9c0, type: 4, owner: 0x47e8f2a38, flag: INIT/-/-/0x00(session) sid: 3302 trans: (nil), creator: 0x47e8f2a38, flag: (51) USR/- BSY/-/-/-/-/-DID: 0001-0005-00000006, short-term DID: 0000-0000-00000000txn branch: (nil)oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYSwaiting for 'latch free' wait_time=0, seconds since wait started=210address=600098c8, number=13, tries=2bblocking sess=0x(nil) seq=28098Dumping Session Wait Historyfor 'latch free' count=1 wait_time=4.885477 secaddress=600098c8, number=13, tries=2afor 'latch free' count=1 wait_time=4.902795 secaddress=600098c8, number=13, tries=29for 'latch free' count=1 wait_time=4.884726 secaddress=600098c8, number=13, tries=28for 'latch free' count=1 wait_time=4.883317 secaddress=600098c8, number=13, tries=27for 'latch free' count=1 wait_time=4.887379 secaddress=600098c8, number=13, tries=26for 'latch free' count=1 wait_time=4.885649 secaddress=600098c8, number=13, tries=25for 'latch free' count=1 wait_time=4.886556 secaddress=600098c8, number=13, tries=24for 'latch free' count=1 wait_time=4.888166 secaddress=600098c8, number=13, tries=23for 'latch free' count=1 wait_time=4.885367 secaddress=600098c8, number=13, tries=22for 'latch free' count=1 wait_time=4.885849 secaddress=600098c8, number=13, tries=21。。。。。。------------process 0x0x474316de0--------------------proc version      : 0Local node        : 0pid               : 17329lkp_node          : 0svr_mode          : 0proc state        : KJP_NORMALLast drm hb acked : 0Total accesses    : 2Imm.  accesses    : 1Locks on ASTQ     : 0Locks Pending AST : 0Granted locks     : 0AST_Q:PENDING_Q:GRANTED_Q:Performing diagnostic data dump for this instance

从前面的load来看，load信息如下：loadavg : 62.17 35.29 18.12

这里简单解释一下，这3个loadavg值表示该主机分别在1分钟，5分钟，15分钟内的平均load值。显然，这里我们不看后面2个指标了。根据时间点，我们这里看第一个loadavg值是62.17. 这个值已经超出CPU的数倍了，由此我们不难判断在13:16:56向前推进1分钟的时间内系统的load是极高的。

对于LMON进程，主要是监控RAC的GES信息，当然其作用不仅仅局限于此，其还负责检查集群中每个节点的健康情况，当有节点出现故障是，负责进行reconfig以及GRD(global resource Directory)的恢复等等。我们知道AC的脑裂机制，如果IO fencing是Oracle本身来完成，也就是说由CLusterware来完成。那么Lmon进程检查到实例级别出现脑裂时，会通知clusterware来进行脑裂操作，然而其并不会等待clusterware的处理结果。当等待超过一定时间，那么LMON进程会自动触发IMR(instance membership recovery)，这实际上也就是我们所说的Instance membership reconfig。

对于报错中提到的LMON进程无法获得心跳了，那么LMON进程的心跳是什么呢？ lmon进程主要通过2种心跳机制来判断集群节点的状态：

1) 网络心跳（主要是通过ping进行检测）2) 控制文件磁盘心跳，类似ckpt进程每3s更新一次controlfile的机制。

那么什么情况下，会导致LMON无法获得心跳呢？

一般来讲有可能是该进程无法获得latch或者系统的资源不足，导致LMON进程无法正常工作。而这里我们也确实发现系统的load极高。从上面的进程详细信息来看，我们也确实发现该进程一直在等待一个latch，同时有其他的10个进程正在等待LMON进程，这其中的等待就包括LMON,ASMB，ckpt进程等等。从diag的dump来看，提示lmon进程已经等待了216秒。

从process dump我们还可以看到该进程一直无法获得的latch是可能是被如下进程所持有：

possible holder pid = 792 ospid=15095

不过可惜的是，我搜索diag日志，居然找不到pid 792这个会话。我们知道diag的trace中的process state dump信息是根据PID排序的。我发现dump内容到process 69就结束了。从PID来看，该实例在出问题的时候，起码都800个process以上，dump的内容不可能只有这么一点。

从2节点的核心进程的日志也看不到有价值的信息。后面监控了一段时间发现该节点cpu几乎都是90%以上，很多时候甚至100%。

检查发现是CBC latch，而且都是一些报表的SQL。

对于这个case，我仍然感觉可能是资源异常导致LMON进程异常，最后Oracle 检测到该进程长时间没有心跳，最终终止了该实例。

基于这个假设，我打算在自己的VM Rac环境(Linux +10.2.0.5)测试一下，如果某个节点CPU使用率极高的情况下，是否会导致LMON进程异常。

不过遗憾的是，我并没有模拟出这种情况，如下是我的模拟过程：

利用shell 来模拟cpu的消耗：

[root@rac1 ~]# cat kill_cpu.sh#! /bin/sh# filename killcpu.shfor i in `seq $1`doecho -ne "i=0;while truedoi=i+1;done" | /bin/sh &pid_array[$i]=$! ;donefor i in "${pid_array[@]}"; doecho 'kill ' $i ';';done

调用该脚本即可：[root@rac1 ~]# ./kill_cpu.sh 1kill? 17449 ;

这样测试了10分钟，发现Rac毫无反应，于是我又模拟了latch: cache buffer chains. 以此来尽可能的将该节点的CPU消耗光。

SQL> create table killdb(no int,object_name varchar2(50));Table created.SQL> declare i int;2  begin3  for i in 1..5 loop4  insert into killdb  select  rownum as no,object_name from dba_objects;5  end loop;6  end;7  /PL/SQL procedure successfully completed.SQL> select count(1) from killdb;COUNT(1)----------250645SQL> commit;Commit complete.SQL> create or replace procedure full_scan is2   i int;3   icount int;4   begin5   for i in 1..1000 loop6   select count(*) into icount from killdb;7   end loop;8   end;9   /Procedure created.SQL> var job_no number;SQL> begin2  for idx in 1..30 loop3  dbms_job.submit(:job_no,'full_scan;');4  commit;5  end loop ;6  end;7  /PL/SQL procedure successfully completed.SQL> select event,count(1) from v$session where wait_class#<>6 group by event;EVENT                                                              COUNT(1)---------------------------------------------------------------- ----------os thread startup                                                         1null event                                                                1latch: cache buffers chains                                              26gc current request                                                        1SQL*Net message to client                                                 1

如下是该节点的CPU使用情况：

08:42:47 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s08:42:48 PM  all   99.01    0.00    0.99    0.00    0.00    0.00    0.00    0.00   1005.9408:42:49 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   1044.44......省略部分内容08:43:08 PM  all   99.01    0.00    0.99    0.00    0.00    0.00    0.00    0.00   1003.9608:43:09 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   1038.3808:43:10 PM  all   99.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00   1019.0008:43:11 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   1032.0008:43:12 PM  all   99.01    0.00    0.99    0.00    0.00    0.00    0.00    0.00   1000.9908:43:13 PM  all   95.96    0.00    4.04    0.00    0.00    0.00    0.00    0.00   1202.0208:43:14 PM  all   93.07    0.00    5.94    0.00    0.00    0.99    0.00    0.00   1057.4308:43:15 PM  all   86.32    0.00   12.63    0.00    0.00    1.05    0.00    0.00   1080.0008:43:16 PM  all   96.97    0.00    3.03    0.00    0.00    0.00    0.00    0.00   1058.5908:44:48 PM  all   99.25    0.00    0.61    0.00    0.08    0.07    0.00    0.00   1026.5008:46:16 PM  all   99.24    0.00    0.61    0.00    0.05    0.10    0.00    0.00   1024.4908:46:44 PM  all   99.06    0.00    0.73    0.00    0.07    0.15    0.00    0.00   1042.5708:46:45 PM  all   91.43    0.00    7.62    0.00    0.00    0.95    0.00    0.00   1140.9508:46:46 PM  all   82.61    0.00   17.39    0.00    0.00    0.00    0.00    0.00   1042.3908:46:47 PM  all   57.14    0.00   42.86    0.00    0.00    0.00    0.00    0.00   1068.2508:46:48 PM  all   52.00    0.00   48.00    0.00    0.00    0.00    0.00    0.00   1092.0008:46:49 PM  all   84.34    0.00   14.46    0.00    0.00    1.20    0.00    0.00   1090.3608:46:50 PM  all   90.82    0.00    7.14    0.00    1.02    1.02    0.00    0.00   1137.7608:46:51 PM  all   89.69    0.00   10.31    0.00    0.00    0.00    0.00    0.00   1040.2108:46:52 PM  all   95.65    0.00    4.35    0.00    0.00    0.00    0.00    0.00   1052.1708:46:53 PM  all   99.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00   1012.0008:46:54 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   1028.0008:46:55 PM  all   99.01    0.00    0.99    0.00    0.00    0.00    0.00    0.00   1007.9208:46:56 PM  all   98.92    0.00    1.08    0.00    0.00    0.00    0.00    0.00   1052.69

我们可以看到在08:43:16 到 08:46:44 之间，rac 节点1其实hang了很长时间，可惜这个数据库节点都没有挂。

我只能说Oracle 太强悍了，我檫。

BUG 10008092 caused instance crashdatabase crash with ora-0049410gR2 rac(asm) crash with ora-15064single instance to 3 node racabout Undo Tablespace used High ?

本站文章除注明转载外，均为本站原创：转载自love wife & love life —Roger 的Oracle技术博客本文链接地址: The database instance Crash because the CPU High ? 某系统的其中一个RAC节点的db实例被干掉并自动重启了，如下是该实例的alert log信息： Wed Oct 08 12:59:51 CST 2014 Thread 1 advanced to log sequence 543 (LGWR switch) Current log# 1 seq# 543 mem# 0: +DGSYS/hisdb/onlinelog/group_1.261.856010381 Wed Oct 08 13:13:45 CST 2014 WARNING: inbound connection timed out (ORA-3136) […] 却还是，会愚蠢的选择相互敌视的方式。即使背脊相抵，

相关文章：

你感兴趣的文章：

标签云：