I/O slave wait 导致CPU 100%

本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客

本文链接地址: I/O slave wait 导致CPU 100%

某客户的一个rac 节点CPU使用率为100%,导致整体系统运行缓慢。??? 通过mpstat命令检查发现系统CPU使用率为100%,系统的CPU idle几乎为0. 导致系统整体负载极高,如下:

 [oracle@ecdbs1 ~]$ mpstat 1 10Linux 2.6.9-89.5AXS2largesmp (ecdbs1)   2014年10月04日09时52分39秒  CPU   %user   %nice %system %iowait    %irq   %soft   %idle    intr/s09时52分40秒  all   82.42    0.00   17.52    0.00    0.00    0.06    0.00   5321.0009时52分41秒  all   82.63    0.00   17.30    0.00    0.00    0.06    0.00   3562.0009时52分42秒  all   81.52    0.00   18.36    0.00    0.00    0.13    0.00   6107.0709时52分43秒  all   82.36    0.00   17.45    0.00    0.06    0.13    0.00   4194.0609时52分44秒  all   82.47    0.00   17.53    0.00    0.00    0.00    0.00   3551.5209时52分45秒  all   82.63    0.00   17.24    0.00    0.00    0.13    0.00   3553.5409时52分46秒  all   83.29    0.00   16.71    0.00    0.00    0.00    0.00   3526.7309时52分47秒  all   82.48    0.00   17.40    0.00    0.00    0.13    0.00   3553.5409时52分48秒  all   82.47    0.00   17.35    0.00    0.06    0.13    0.00   6888.0009时52分49秒  all   82.15    0.00   17.78    0.00    0.00    0.06    0.00   5739.60Average:        all   82.44    0.00   17.46    0.00    0.01    0.08    0.00   4601.00
由于这个主机节点上跑了2个实例,其中一个是rac的实例,另外一个是单实例。通过top 检查发现cpu基本上都是被该节点上的
单实例的进程所消耗,如下:
top - 16:34:59 up 747 days, 20:49,  9 users,  load average: 139.68, 140.21, 140.37Tasks: 861 total, 141 running, 720 sleeping,   0 stopped,   0 zombieCpu(s): 82.4% us, 17.5% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi,  0.1% siMem:  66002524k total, 63895780k used,  2106744k free,   165428k buffersSwap: 12289716k total,   503928k used, 11785788k free,  4302948k cached  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND12066 oracle    25   0 10.1g 4.1g 4.1g R 13.5  6.5   1573:36 oracle12232 oracle    25   0 10.1g 4.2g 4.2g R 13.5  6.7   1030:02 oracle........省略部分内容 9543 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.6 379:27.64 oracle10963 oracle    25   0 10.1g 4.3g 4.3g R 13.2  6.9 544:54.84 oracle11483 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.5   1425:30 oracle13387 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.5   1509:03 oracle13737 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.5   1051:15 oracle14163 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.6 227:22.63 oracle14842 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.5   1051:56 oracle15413 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.5   1043:22 oracle16816 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.6 382:09.71 oracle17000 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.5   1033:54 oracle17524 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.6 389:02.27 oracle18087 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.5 814:07.48 oracle18613 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.5 811:07.11 oracle19750 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.5 585:21.47 oracle19886 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.5   1434:08 oracle23159 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.5   1556:51 oracle23242 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.5   1492:47 oracle23361 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.5 608:33.05 oracle24589 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.5   1037:20 oracle26459 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.5 602:28.33 oracle27194 oracle    25   0 10.2g 4.4g 4.4g R 13.2  7.0 785:01.96 oracle27907 oracle    25   0 10.1g 4.1g 4.1g R 13.2  6.5 787:36.62 oracle

根据进程号,定位到相关的user process:

SQL> oradebug setospid 20377Oracle pid: 115, Unix process pid: 20377, image: oracle@ecdbs1SQL> select s.sid,s.serial#,p.spid from v$process p,v$session s where p.addr=s.paddr and p.spid=17000;       SID    SERIAL# SPID---------- ---------- ------------------------------------      2586        959 17000SQL> !ps -ef|grep 17000oracle    5438  3305  0 17:11 pts/7    00:00:00 /bin/bash -c ps -ef|grep 17000oracle    5440  5438  0 17:11 pts/7    00:00:00 grep 17000oracle   17000     1 14 Sep27 ?        17:18:09 oracleewms (LOCAL=NO)SQL> oradebug setospid 17000Oracle pid: 40, Unix process pid: 17000, image: oracle@ecdbs1SQL> show parameter instance_nameNAME                                 TYPE                              VALUE------------------------------------ --------------------------------- ------------------------------instance_name                        string                            ewmsSQL> oradebug dump processstate 10Statement processed.SQL> oradebug close_traceoradStatement processed.SQL> ebug tracefile_name/opt/oracle/admin/ewms/udump/ewms_ora_17000.trcSQL> !

通过检查发现,存在大量的I/O salve wait 等待事件,如下:

SQL> select distinct sql_id,event,count(1) from v$session where event like 'i/o%' group by sql_id,event order by 3;SQL_ID                                  EVENT                     COUNT(1)--------------------------------------- ----------------------- ----------bh329g5m6646j                           i/o slave wait                   17ac3yq9zzw272                           i/o slave wait                   1du6nsuvjrpgpr                           i/o slave wait                   13msays95wyank                           i/o slave wait                   17q2at1qmgzpwm                           i/o slave wait                   111gwubk12a25h                           i/o slave wait                   1206xt31taaha2                           i/o slave wait                   1f9kc9t3gghckm                           i/o slave wait                   1dpzmy9wbndrst                           i/o slave wait                   16nyq5ss342du4                           i/o slave wait                   4c5c115p47jyh1                           i/o slave wait                  37bh1uwqum800ab                           i/o slave wait                  85

更要命的是,我从top 命令的结果中挑选了几个cpu消耗高的进程,检查发现其对应的等待事件都是这个。

从上面的查询来看,这部分会话执行的SQL都类似,开始我怀疑难道是会话执行的SQL有问题 ? 通过dump定位到SQL语句和执行计划,发现执行计划是ok,效率还挺高。

============Plan Table============-------------------------------------------------------------------------+-----------------------------------+| Id  | Operation                             | Name                     | Rows  | Bytes | Cost  | Time      |-------------------------------------------------------------------------+-----------------------------------+| 0   | SELECT STATEMENT                      |                          |       |       |   708 |           || 1   |  TEMP TABLE TRANSFORMATION            |                          |       |       |       |           || 2   |   LOAD AS SELECT                      |                          |       |       |       |           || 3   |    VIEW                               |                          |     1 |   154 |   681 |  00:00:09 || 4   |     NESTED LOOPS                      |                          |     1 |   137 |   681 |  00:00:09 || 5   |      VIEW                             |                          |  1468 |  126K |   681 |  00:00:09 || 6   |       MERGE JOIN                      |                          |  1468 |  100K |   681 |  00:00:09 || 7   |        TABLE ACCESS BY INDEX ROWID    | MLOS_SM_INSTIINFO        |     1 |    27 |    10 |  00:00:01 || 8   |         INDEX FULL SCAN               | PK_INSTIINFO             |    46 |       |     2 |  00:00:01 || 9   |        SORT JOIN                      |                          |   16K |  697K |   671 |  00:00:09 || 10  |         TABLE ACCESS FULL             | EWMS_INV_TRANSACTION     |   16K |  697K |   669 |  00:00:09 || 11  |      TABLE ACCESS BY INDEX ROWID      | EWMS_BASE_CARGO          |     1 |    49 |     0 |           || 12  |       INDEX UNIQUE SCAN               | PK_EWMS_BASE_CARGO       |     1 |       |     0 |           || 13  |   COUNT                               |                          |       |       |       |           || 14  |    VIEW                               |                          |     1 |   398 |    27 |  00:00:01 || 15  |     SORT ORDER BY                     |                          |     1 |   383 |    27 |  00:00:01 || 16  |      HASH JOIN OUTER                  |                          |     1 |   383 |    26 |  00:00:01 || 17  |       HASH JOIN OUTER                 |                          |     1 |   317 |    17 |  00:00:01 || 18  |        HASH JOIN OUTER                |                          |     1 |   251 |    14 |  00:00:01 || 19  |         HASH JOIN OUTER               |                          |     1 |   185 |    10 |  00:00:01 || 20  |          HASH JOIN OUTER              |                          |     1 |   119 |     7 |  00:00:01 || 21  |           VIEW                        |                          |     1 |    53 |     3 |  00:00:01 || 22  |            HASH GROUP BY              |                          |     1 |    62 |     3 |  00:00:01 || 23  |             VIEW                      |                          |     1 |    62 |     2 |  00:00:01 || 24  |              TABLE ACCESS FULL        | SYS_TEMP_0FD9D6835_CC9CBC|     1 |    90 |     2 |  00:00:01 || 25  |           VIEW                        |                          |     1 |    66 |     3 |  00:00:01 || 26  |            HASH GROUP BY              |                          |     1 |    78 |     3 |  00:00:01 || 27  |             VIEW                      |                          |     1 |    78 |     2 |  00:00:01 || 28  |              TABLE ACCESS FULL        | SYS_TEMP_0FD9D6835_CC9CBC|     1 |    90 |     2 |  00:00:01 || 29  |          VIEW                         |                          |     1 |    66 |     3 |  00:00:01 || 30  |           HASH GROUP BY               |                          |     1 |    92 |     3 |  00:00:01 || 31  |            FILTER                     |                          |       |       |       |           || 32  |             VIEW                      |                          |     1 |    92 |     2 |  00:00:01 || 33  |              TABLE ACCESS FULL        | SYS_TEMP_0FD9D6835_CC9CBC|     1 |    90 |     2 |  00:00:01 || 34  |         VIEW                          |                          |     1 |    66 |     3 |  00:00:01 || 35  |          HASH GROUP BY                |                          |     1 |    92 |     3 |  00:00:01 || 36  |           FILTER                      |                          |       |       |       |           || 37  |            VIEW                       |                          |     1 |    92 |     2 |  00:00:01 || 38  |             TABLE ACCESS FULL         | SYS_TEMP_0FD9D6835_CC9CBC|     1 |    90 |     2 |  00:00:01 || 39  |        VIEW                           |                          |     1 |    66 |     3 |  00:00:01 || 40  |         HASH GROUP BY                 |                          |     1 |    78 |     3 |  00:00:01 || 41  |          VIEW                         |                          |     1 |    78 |     2 |  00:00:01 || 42  |           TABLE ACCESS FULL           | SYS_TEMP_0FD9D6835_CC9CBC|     1 |    90 |     2 |  00:00:01 || 43  |       VIEW                            |                          |     1 |    66 |     8 |  00:00:01 || 44  |        HASH GROUP BY                  |                          |     1 |   109 |     8 |  00:00:01 || 45  |         NESTED LOOPS                  |                          |     1 |   109 |     7 |  00:00:01 || 46  |          NESTED LOOPS                 |                          |     1 |    82 |     6 |  00:00:01 || 47  |           TABLE ACCESS FULL           | EWMS_INV_LOGINVENTORY    |   640 |   21K |     6 |  00:00:01 || 48  |           TABLE ACCESS BY INDEX ROWID | EWMS_BASE_CARGO          |     1 |    49 |     0 |           || 49  |            INDEX UNIQUE SCAN          | PK_EWMS_BASE_CARGO       |     1 |       |     0 |           || 50  |          TABLE ACCESS BY INDEX ROWID  | MLOS_SM_INSTIINFO        |     1 |    27 |     1 |  00:00:01 || 51  |           INDEX UNIQUE SCAN           | PK_INSTIINFO             |     1 |       |     0 |           |-------------------------------------------------------------------------+-----------------------------------+

由此可以判断,应该跟SQL本身的效率是没有关系的。从dump的processstate strace还发现了一点有价值的信息:

WARNING:io_submit failed due to kernel limitations MAXAIO for process=0 pending aio=0WARNING:asynch I/O kernel limits is set at AIO-MAX-NR=65536 AIO-NR=65529WARNING:1 Oracle process running out of OS kernelI/O resources aiolimit=0ksfdgo()+1488<-ksfdaio1()+9848<-kfkUfsIO()+594<-kfkDoIO()+631<-kfkIOPriv()+616<-............

根据这个AIO的告警,mos上一搜,果然是跟AIO有关系。例如存在类似这样的bug:

Bug 9949948 Linux: Process spin under ksfdrwat0 if OS Async IO not configured high enough

最后建议调整Linux AIO参数之后目前一切正常。

简单记录一下,供大家参考!

No related posts.

本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客 本文链接地址: I/O slave wait 导致CPU 100% 某客户的一个rac 节点CPU使用率为100%,导致整体系统运行缓慢。??? 通过mpstat命令检查发现系统CPU使用率为100%,系统的CPU idle几乎为0. 导致系统整体负载极高,如下: [oracle@ecdbs1 ~]$ mpstat 1 10 Linux 2.6.9-89.5AXS2largesmp (ecdbs1) 2014年10月04日 09时52分39秒 CPU %user %nice %system %iowait %irq %soft %idle intr/s 09时52分40秒 all 82.42 0.00 17.52 0.00 0.00 0.06 0.00 5321.00 09时52分41秒 all 82.63 0.00 17.30 0.00 0.00 […] 既有美妙的风景,也会有称不上景、只有风的地方。

I/O slave wait 导致CPU 100%

相关文章:

你感兴趣的文章:

标签云: