oracle rac instance crash case

这是昨天帮北区同事分析的一个数据库故障,这里跟大家分享。故障发生在昨天下午,数据库实例的一个节点自动crash,并且自动启动时无法启动,最后通过手工启动成功。首先我们来分析一下为什么crash之后,crs无能自动将数据库实例open,如下是alert log的信息:

Sun May 29 18:02:16 2016
Errors in file /u01/app/oracle/admin/sundb/bdump/sundb1_lmsa_38077024.trc:
ORA-00600: internal error code, arguments: [kjdrisRMnovalid:msgRMno], [4294967295], [0], [0], [], [], [], []
.....
Errors in file /u01/app/oracle/admin/sundb/bdump/sundb1_lms2_38536062.trc:
ORA-00600: internal error code, arguments: [kjdrisRMnovalid:msgRMno], [4294967295], [0], [0], [], [], [], []
Sun May 29 18:02:16 2016
ALTER DATABASE   MOUNT
Sun May 29 18:02:16 2016
Errors in file /u01/app/oracle/admin/sundb/bdump/sundb1_lmse_37159724.trc:
ORA-00600: internal error code, arguments: [kjdrisRMnovalid:msgRMno], [4294967295], [0], [0], [], [], [], []
Sun May 29 18:02:16 2016
Errors in file /u01/app/oracle/admin/sundb/bdump/sundb1_lmsa_38077024.trc:
ORA-00600: internal error code, arguments: [kjdrisRMnovalid:msgRMno], [4294967295], [0], [0], [], [], [], []
Sun May 29 18:02:16 2016
Starting background process ASMB
......
Errors in file /u01/app/oracle/admin/sundb/bdump/sundb1_lmsb_38142882.trc:
ORA-00600: internal error code, arguments: [kjdrisRMnovalid:msgRMno], [4294967295], [0], [0], [], [], [], []
Sun May 29 18:02:21 2016
Errors in file /u01/app/oracle/admin/sundb/bdump/sundb1_lms0_39257032.trc:
ORA-00600: internal error code, arguments: [kjdrisRMnovalid:msgRMno], [4294967295], [0], [0], [], [], [], []
Sun May 29 18:02:21 2016
LMS0: terminating instance due to error 484
Sun May 29 18:02:21 2016
Dump system state for local instance only
System State dumped to trace file /u01/app/oracle/admin/sundb/bdump/sundb1_diag_39519196.trc
Sun May 29 18:02:21 2016
Trace dumping is performing id=[cdmp_20160529180221]
Sun May 29 18:02:22 2016
Shutting down instance (abort)
License high water mark = 1

 

我们可以发现,数据库实例在mount的过程中其实就没有成功。从日志上来看,几乎均为lms进程的错误,这里我们暂且不论这个ora-00600 错误是什么含义。首先先分析日志文件,看看其中的内容是什么?

Unix process pid: 38077024, image: oracle@cxhxrac1 (LMSa)

*** SERVICE NAME:() 2016-05-29 18:02:13.888
*** SESSION ID:(145.1) 2016-05-29 18:02:13.888
 0 GCS shadows cancelled, 0 closed
 0 GCS resources traversed, 0 cancelled
 LMS 10: 149696 GCS resources on freelist, 150016 on array, 150021 allocated
 0 GCS shadows traversed, 0 replayed, 0 duplicates
 lms 10 finished replaying gcs resources
 0 write requests issued in 14291 GCS resources
 1 PIs marked suspect, 0 flush PI msgs
 0 write requests issued in 0 GCS resources
 0 PIs marked suspect, 0 flush PI msgs
 lms 10 finished fixing gcs write protocol
 FUSION MSG 11076caf0,34 from[1,30146664] ver[10,-1] ln 128 sq[1,8]
      CLOSE [0x178155.12a0000, 298] shadow [0,0] seq 0x2 act 1
        client [7000002bbffca60,155] reqid 163 ordered 0
        grant 1 convert 0 role 0
        pi [0x0.0x0] flags 0x0 state 0x20
        disk scn 0x0.0 writereq scn 0x0.0 rreqid 0
        msgRM# -1 bkt# 298 drmbkt# 298
   pkey 298
   hv 2 [stat 0x0, 0->0, wm 32767, RMno 0, reminc 10, dom 0]
   kjga st 0x4, step 0.0.0, cinc 10, rmno 0, flags 0x0
   lb 0, hb 0, myb 298, drmb 298, apifrz 0
*** 2016-05-29 18:02:16.034
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kjdrisRMnovalid:msgRMno], [4294967295], [0], [0], [], [], [], []

 

上述信息是lmsa 进程的trace文件内容,我们不难看出,这部分信息跟Oracle DRM操作有关。实际上,我们也知道Oracle lms进程会参与DRM的操作,关于这一点,要从lms进程的原理说起。网上有很多文章描述,这里我们不多说。

首先从上述trace 内容关键字pkey 298我们可以看出,这是10g的rac,这是drm信息的标示之一。另外其实从kjdrisRMnovalid:msgRMno也可以看出,这是DRM的操作。既然如此,那么是不是drm的参数配置有问题呢?因为这是一个rac集群,所以我们继续来看下alert log内容即可。通过分析发现在上次数据库实例启动时为4月28号,启动的参数如下:

Thu Apr 28 22:36:58 2016
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
.....
  standby_file_management  = AUTO
  _gc_undo_affinity        = FALSE
  _gc_affinity_time        = 0

 

可以看出,配置了drm的参数,也就是屏蔽了drm功能。而这次数据库实例crash后,crs在自动拉数据库实例时并没有启用这2个参数。毫无疑问,这说明客户之前用的pfile,而crs启动数据库实例时使用的是spfile。由于spfile和pfile的内容不一致,也就出现来这个问题。

对于实例宕机后无法自动启动的疑问,我们已经能够解释清楚了。那么最开始的问题呢?数据库实例为什么会宕机呢?这才是关键关键。下面我们继续来看alert log:

Sun May 29 17:24:03 2016
Errors in file /u01/app/oracle/admin/xxxx/bdump/xxxx1_diag_35455882.trc:
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Sun May 29 17:24:41 2016
Restarting dead background process DIAG
DIAG started with pid=3, OS id=35848586
Sun May 29 17:25:00 2016
Errors in file /u01/app/oracle/admin/xxxx/bdump/xxxx1_diag_35848586.trc:
ORA-00600: internal error code, arguments: [17147], [0x70000090C5F9378], [], [], [], [], [], []
Sun May 29 17:29:56 2016
Restarting dead background process DIAG
DIAG started with pid=262, OS id=34472420
Sun May 29 17:30:15 2016
Errors in file /u01/app/oracle/admin/xxxx/bdump/xxxx1_diag_34472420.trc:
ORA-00600: internal error code, arguments: [17147], [0x70000090C5F9378], [], [], [], [], [], []
Sun May 29 17:32:36 2016
Thread 1 advanced to log sequence 22645
  Current log# 10 seq# 22645 mem# 0: +NEWDATA/xxxx/onlinelog/group_10.1016.886632517
Sun May 29 17:35:14 2016
Restarting dead background process DIAG
DIAG started with pid=573, OS id=1835604
Sun May 29 17:35:33 2016
Errors in file /u01/app/oracle/admin/xxxx/bdump/xxxx1_diag_1835604.trc:
ORA-00600: internal error code, arguments: [17147], [0x70000090C5F9378], [], [], [], [], [], []
Sun May 29 17:38:32 2016
Errors in file /u01/app/oracle/admin/xxxx/udump/xxxx1_ora_64421944.trc:
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Sun May 29 17:38:34 2016
......
Errors in file /u01/app/oracle/admin/xxxx/bdump/xxxx1_diag_12452748.trc:
ORA-00600: internal error code, arguments: [17147], [0x70000090C5F9378], [], [], [], [], [], []
Sun May 29 17:45:59 2016
Restarting dead background process DIAG
DIAG started with pid=153, OS id=17367786
Sun May 29 17:46:17 2016
Errors in file /u01/app/oracle/admin/xxxx/bdump/xxxx1_diag_17367786.trc:
ORA-00600: internal error code, arguments: [17147], [0x70000090C5F9378], [], [], [], [], [], []
Sun May 29 17:51:23 2016
Restarting dead background process DIAG
DIAG started with pid=859, OS id=13369506
Sun May 29 17:51:41 2016
Errors in file /u01/app/oracle/admin/xxxx/bdump/xxxx1_diag_13369506.trc:
ORA-00600: internal error code, arguments: [17147], [0x70000090C5F9378], [], [], [], [], [], []
Sun May 29 17:56:32 2016
Restarting dead background process DIAG
DIAG started with pid=665, OS id=11535338
Sun May 29 17:56:50 2016
Errors in file /u01/app/oracle/admin/xxxx/bdump/xxxx1_diag_11535338.trc:
ORA-00600: internal error code, arguments: [17147], [0x70000090C5F9378], [], [], [], [], [], []
Sun May 29 18:01:39 2016
Errors in file /u01/app/oracle/admin/xxxx/udump/xxxx1_ora_54133096.trc:
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Sun May 29 18:01:41 2016
Errors in file /u01/app/oracle/admin/xxxx/bdump/xxxx1_pmon_49414536.trc:
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Sun May 29 18:01:45 2016
DBW4: terminating instance due to error 472
Sun May 29 18:01:45 2016
Errors in file /u01/app/oracle/admin/xxxx/bdump/xxxx1_lmd0_6947108.trc:
ORA-00472: PMON  process terminated with error
......
Errors in file /u01/app/oracle/admin/xxxx/bdump/xxxx1_psp0_33620412.trc:
ORA-00472: PMON  process terminated with error
Sun May 29 18:01:45 2016
Shutting down instance (abort)

 

我们可以看出,数据库实例之所以会crash掉,是因为Oracle PMON进程出现了异常,最后dbw4进程强行将数据库实例终止。很明显,我们需要继续分析pmon进程为什么会出现异常,如下是pmon 进程的trace 内容:

*** 2016-05-29 18:01:41.700
ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [] [] [] [] [] []
----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              FFFFFFFFFFFC820 ?
                                                   5F707472676C0001 ?
ksedmp+0290          bl       ksedst               1048E0580 ?
ssexhd+030c          bl       ksedmp               300001D15 ?
000048BC             ?        00000000             
kglobfr+01f0         bl       kglobcl              000000022 ? 000000009 ?
                                                   7000009E4E3D240 ? 1054EC358 ?
                                                   FFFFFFFFFFFD1B0 ?
kgllccl+1414         bl       kglobfr              FFFFFFFFFFFD290 ? 000000000 ?
                                                   1003C5ACC ? 110224BC8 ?
kgllcu+01b4          bl       kgllccl              10011C920 ? 000000000 ?
                                                   FFFFFFFFFFFD350 ?
kslilcr+0240         bl       _ptrgl               
ksl_cleanup+063c     bl       kslilcr              7000009E4E3D240 ? 3000000B8 ?
                                                   FFFFFFFFFFFD7E0 ?
ksuxfl+0678          bl       ksl_cleanup          000000000 ? 700000A1F5A3A38 ?
                                                   FFFFFFFFFFFDA30 ?
ksuxda+0060          bl       ksuxfl               A00000000000A ?
ksucln+0814          bl       ksuxda               
ksbrdp+0408          bl       _ptrgl               
opirip+03fc          bl       03F5A0B4             
opidrv+0448          bl       opirip               11028BD90 ? 41028D6D0 ?
                                                   FFFFFFFFFFFF950 ?
sou2o+0090           bl       opidrv               3202AB3E7C ? 4A006F198 ?
                                                   FFFFFFFFFFFF950 ?
opimai_real+0150     bl       01FC5034             
main+0098            bl       opimai_real          000000000 ? 000000000 ?
__start+0070         bl       main                 000000000 ? 000000000 ?

 

从上面这部分call stack内容可以看出,pmon的异常出现在函数kglobfr调用上,该函数表示

KGL OBject Free all heaps。通过分析发现Oracle Bug 8420775 : PMON CRASHED WITH ORA-07445的描述与pmon 的call stack堆栈基本一致;该文档中oracle sr认为该现象与oracle bug 5721821 一致

同时我又分析了故障期间的server process的ora-07445错误,发现基本上都在kglobcl操作上出现异常,基于这点分析发现;Oracle Bug 5754574 : ORA-7445 [KGLOBCL+178] AND PMON CRASH 的描述与pmon trace描述也很接近,同时该bug的描述中 Oracle SR也提到可能为Oracle bug 5721821

基于上述分析,我认为第一次数据库crash 为Oracle bug 5721821导致,而且该Bug也有针对10.2.0.3 for aix的Patch。因此我们建议客户安装该Patch,然后进行观察。

发表评论

电子邮件地址不会被公开。 必填项已用*标注