Handling the OGG-01705 Error

陈海   December 1, 2016

–Reference document:
OGG Extract / Replicat Checkpoint RBA Is Larger than Local Trail Size (Doc ID 1138409.1)
–TAG : ogg replicat OGG-01705 logdump
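This morning one of a customer's hosts went down unexpectedly. The node runs both the database and OGG. After the database came back up normally, I started the OGG replication processes; one of them abended and could not be started.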

GGSCI (crmcxdb1) 1> info all

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER RUNNING
REPLICAT ABENDED RPK_01G 00:06:53 04:39:02
REPLICAT RUNNING RPK_10G 03:48:11 00:00:03
REPLICAT RUNNING RPK_20G 00:57:59 00:00:01
REPLICAT RUNNING RPK_30G 04:34:23 00:00:09
REPLICAT RUNNING RPK_MAX 04:38:01 00:00:04

 

 

Check the detailed error report for this process.

GGSCI (crmcxdb1) 7> view report rpk_01G

***********************************************************************
Oracle GoldenGate Delivery for Oracle
Version 11.2.1.0.27 19591627 OGGCORE_11.2.1.0.0OGGBP_PLATFORMS_141006.1156_FBO
Linux, x64, 64bit (optimized), Oracle 11g on Oct 6 2014 17:02:45

Copyright (C) 1995, 2014, Oracle and/or its affiliates. All rights reserved.

Starting at 2016-12-01 09:35:09
***********************************************************************
... (omitted) ...
2016-12-01 09:35:12 ERROR OGG-01705 Input checkpoint position 28224338 for input trail file '/ogg1/dirdat/a1/a1046974' is greater than the size of the file (22402029). Please consult Oracle Knowledge Management Doc ID 1138409.1. for instructions.
... (omitted) ...

 

 

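–The error says that the checkpoint position is greater than the size of the trail file, and it points to a reference document.
–That document offers three solutions; I tried the first one first.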
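The numbers needed later (checkpoint position, trail file, actual file size) are all in the error message itself. A minimal sketch of pulling them out with Python follows. The function name `parse_ogg_01705` is hypothetical, and the pattern assumes the message wording shown in the report above; other OGG versions may word it differently.

```python
import re

# Pattern based on the OGG-01705 wording seen in this report (an assumption:
# other versions may phrase the message differently).
OGG_01705 = re.compile(
    r"Input checkpoint position (\d+) for input trail file '([^']+)' "
    r"is greater than the size of the file \((\d+)\)"
)

def parse_ogg_01705(message):
    """Return (checkpoint_rba, trail_file, file_size), or None if no match."""
    # Report lines are hard-wrapped mid-word, so drop line breaks first.
    match = OGG_01705.search(message.replace("\n", ""))
    if match is None:
        return None
    return int(match.group(1)), match.group(2), int(match.group(3))

message = (
    "2016-12-01 09:35:12 ERROR OGG-01705 Input checkpoint position 28224338 "
    "for input trail file '/ogg1/dirdat/a1/a1046974' is greater than the size "
    "of the file (22402029)."
)
rba, trail, size = parse_ogg_01705(message)
# Prints the three values plus how far the checkpoint overshoots the file.
print(rba, trail, size, rba - size)
```

Here the checkpoint overshoots the end of the file by 28224338 - 22402029 = 5822309 bytes.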
==Option 1:
Use the following formula to find the new RBA for the failed process (the document gives a worked example; what follows is my own run).
New datapump / Replicat RBA = Reader’s too-big checkpoint RBA(A) + First record RBA in the new trail file (after the restart abend)(B) – RBA of the matching record in the trail file referred in checkpoint file(C)
Reader’s too-big checkpoint RBA — A
First record in the new trail file (after the restart abend) — B
RBA of the matching record in the trail file referred in checkpoint file — C

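Following the document's explanation, find these three values one at a time and write each of them down.
1. Look at the failed process's status; the RBA shown here is recorded as A. A=28224338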

GGSCI (crmcxdb1) 9> info rpk_01g

REPLICAT RPK_01G Last Started 2016-12-01 09:59 Status ABENDED
Checkpoint Lag 00:06:53 (updated 05:27:13 ago)
Log Read Checkpoint File /ogg1/dirdat/a1/a1046974
2016-12-01 04:48:11.004299 RBA 28224338 <--- this RBA is recorded as A

 

2. Using the checkpoint information, find the corresponding trail file.

-rw-r-----. 1 ogg oinstall 99998668 Dec  1 04:41 a1046972
-rw-r-----. 1 ogg oinstall 99999198 Dec  1 04:50 a1046973
-rw-r-----. 1 ogg oinstall 22402029 Dec  1 09:28 a1046974         <--- this is the file named in the error; its size matches the error message.
/* ERROR   OGG-01705  Input checkpoint position 28224338 for input trail file '/ogg1/dirdat/a1/a1046974' is greater than the size of the file (22402029)     */
-rw-r-----. 1 ogg oinstall 99999892 Dec  1 09:28 a1046975
-rw-r-----. 1 ogg oinstall 99999135 Dec  1 09:29 a1046976
-rw-r-----. 1 ogg oinstall 99999561 Dec  1 09:29 a1046977

 

-bash-4.1$ ./logdump

Oracle GoldenGate Log File Dump Utility for Oracle
Version 11.2.1.0.27 19591627 OGGCORE_11.2.1.0.0OGGBP_PLATFORMS_141006.1156

Copyright (C) 1995, 2014, Oracle and/or its affiliates. All rights reserved.

Logdump 1 >open /ogg1/dirdat/a1/a1046975 -- open trail file 46975
Current LogTrail is /ogg1/dirdat/a1/a1046975
Logdump 2 >ghdr on -- display the record header with each record
Logdump 3 >n -- n is short for next

2016/12/01 09:28:37.223.795 FileHeader Len 1176 RBA 0
Name: *FileHeader*
3000 01bf 3000 0008 4747 0d0a 544c 0a0d 3100 0002 | 0...0...GG..TL..1...
0003 3200 0004 2000 0000 3300 0008 02f2 68c2 3255 | ..2... ...3.....h.2U
3573 3400 001a 0018 7572 693a 6372 6d32 6462 313a | 5s4.....uri:crm2db1:
3a6f 6767 3a44 504b 5f30 3147 3500 001e 3500 001a | :ogg:DPK_01G5...5...
0018 7572 693a 6372 6d32 6462 313a 3a6f 6767 3a45 | ..uri:crm2db1::ogg:E
504b 5f30 3147 3600 001a 0018 2f6f 6767 312f 6469 | PK_01G6...../ogg1/di
7264 6174 2f61 312f 6131 3034 3639 3735 3700 0001 | rdat/a1/a10469757...

Logdump 4 >n
___________________________________________________________________
Hdr-Ind : E (x45) Partition : . (x00)
UndoFlag : . (x00) BeforeAfter: A (x41)
RecLength : 0 (x0000) IO Time : 2016/12/01 09:28:37.102.738
IOType : 150 (x96) OrigNode : 0 (x00)
TransInd : . (x03) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 0 AuditPos : 0
Continued : N (x00) RecCount : 0 (x00)

2016/12/01 09:28:37.102.738 RestartAbend Len 0 RBA 1184
Name:
After Image: Partition 0 G s

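Logdump 5 >n -- The document says to record the RBA of the first record. The first two n commands returned the file header and the RestartAbend marker; this one returns a record for an actual object, so its RBA, 1246, is recorded as B.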
___________________________________________________________________
Hdr-Ind : E (x45) Partition : . (x04)
UndoFlag : . (x00) BeforeAfter: B (x42)
RecLength : 14 (x000e) IO Time : 2016/12/01 04:47:41.886.892
IOType : 3 (x03) OrigNode : 255 (xff)
TransInd : . (x00) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 158419 AuditPos : 730417760
Continued : N (x00) RecCount : 1 (x01)

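2016/12/01 04:47:41.886.892 Delete Len 14 RBA 1246
Name: OM.WQ_FINISH_ERROR
Before Image: Partition 4 G b
0000 000a 0000 0000 0000 3f2c 1036 | ..........?,.6

Logdump 1 >open /ogg1/dirdat/a1/a1046974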
Current LogTrail is /ogg1/dirdat/a1/a1046974
Logdump 2 >ghdr on
Logdump 3 >filter include AuditRBA 158419 -- the AuditRBA value taken from the record in the previous step
Logdump 4 >filter include filename OM.WQ_FINISH_ERROR -- the object name must be given here
Logdump 5 >filter match all
Logdump 6 >n
Scanned 10000 records, RBA 11976922, 2016/12/01 04:46:56.004.130 <--- 11976922 is recorded as C (my output differs from the MOS example; this is verified further below)
Filtering suppressed 18797 records

Logdump 7 >

 

The formula from the document now gives a value: A+B-C=28224338+1246-11976922=16248662
New datapump / Replicat RBA = Reader’s too-big checkpoint RBA(A) + First record RBA in the new trail file (after the restart abend)(B) – RBA of the matching record in the trail file referred in checkpoint file(C)
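The Option 1 arithmetic is easy to get wrong by hand, so here is a small sketch that redoes it. The function name `option1_rba` is hypothetical; the three inputs are the A, B and C values collected above.

```python
def option1_rba(checkpoint_rba, first_record_rba, matching_record_rba):
    """Option 1 from Doc ID 1138409.1 as quoted above: new RBA = A + B - C."""
    return checkpoint_rba + first_record_rba - matching_record_rba

A = 28224338  # too-big checkpoint RBA reported by "info rpk_01g"
B = 1246      # RBA of the first record after RestartAbend in a1046975
C = 11976922  # RBA taken from the filtered scan of a1046974

print(option1_rba(A, B, C))  # 16248662
```

The result is only a candidate position; it still has to be checked in logdump before it is used.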

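–The document also carries a note: make sure the record at this RBA has a TransInd of x00 or x03.
Ensure that you have a good record at RBA 16248662 with TransInd x00 or x03
Use logdump to position at RBA 16248662 and check whether the TransInd there meets this condition.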

Logdump 10 >open /ogg1/dirdat/a1/a1046975
Current LogTrail is /ogg1/dirdat/a1/a1046975
Logdump 11 >ghdr on
Logdump 12 >pos 16248662
Reading forward from RBA 16248662
Logdump 13 >n
Bad record found at RBA 16248662, format 5.50 Header token)
7368 4f72 | shOr
Logdump 14 >n
Bad record found at RBA 16248662, format 5.50 Header token)
7368 4f72 | shOr

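–The data at this position does not meet the requirement (The result shows it’s not a good record). I also tried altering the failed process to this RBA anyway, but the process still could not start.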

GGSCI (crmcxdb1) 11> alter rep RPK_01G, extseqno 46975, extrba 16248662
REPLICAT altered.

GGSCI (crmcxdb1) 12> start rpk_01g
Sending START request to MANAGER ...
REPLICAT RPK_01G starting

----The process failed again, so this position is wrong; restore the previous position
alter rep RPK_01G, extseqno 46974, extrba 28224338

–As the document advises, move on to the second method: If the RBA which we get does not point you to a good record, please proceed to Option 2.

==Option 2:
The second method is also a formula; find each of its values as follows.
New datapump / Replicat RBA = (Reader’s too-big checkpoint RBA) – (Actual size of datapump / replicat trail file (seqno X)) + First record in the new trail file (after the restart abend)

 

GGSCI (crmcxdb1) 9> info rpk_01g

REPLICAT RPK_01G Last Started 2016-12-01 09:59 Status ABENDED
Checkpoint Lag 00:06:53 (updated 05:27:13 ago)
Log Read Checkpoint File /ogg1/dirdat/a1/a1046974
2016-12-01 04:48:11.004299 RBA 28224338 <--- A: 28224338

-rw-r-----. 1 ogg oinstall 99998668 Dec  1 04:41 a1046972
-rw-r-----. 1 ogg oinstall 99999198 Dec  1 04:50 a1046973
-rw-r-----. 1 ogg oinstall 22402029 Dec  1 09:28 a1046974         << where the checkpoint is pointing     <--- B: 22402029
-rw-r-----. 1 ogg oinstall 99999892 Dec  1 09:28 a1046975         << the next available trail file
-rw-r-----. 1 ogg oinstall 99999135 Dec  1 09:29 a1046976
-rw-r-----. 1 ogg oinstall 99999561 Dec  1 09:29 a1046977

Logdump 1 >open /ogg1/dirdat/a1/a1046975              -- open trail file 46975
Current LogTrail is /ogg1/dirdat/a1/a1046975 
Logdump 2 >ghdr on                                                   -- display the record header with each record
Logdump 3 >n                                                             -- n is short for next

2016/12/01 09:28:37.223.795 FileHeader           Len  1176 RBA 0 
Name: *FileHeader* 
 3000 01bf 3000 0008 4747 0d0a 544c 0a0d 3100 0002 | 0...0...GG..TL..1...  
 0003 3200 0004 2000 0000 3300 0008 02f2 68c2 3255 | ..2... ...3.....h.2U  
 3573 3400 001a 0018 7572 693a 6372 6d32 6462 313a | 5s4.....uri:crm2db1:  
 3a6f 6767 3a44 504b 5f30 3147 3500 001e 3500 001a | :ogg:DPK_01G5...5...  
 0018 7572 693a 6372 6d32 6462 313a 3a6f 6767 3a45 | ..uri:crm2db1::ogg:E  
 504b 5f30 3147 3600 001a 0018 2f6f 6767 312f 6469 | PK_01G6...../ogg1/di  
 7264 6174 2f61 312f 6131 3034 3639 3735 3700 0001 | rdat/a1/a10469757...  
 
Logdump 4 >n
___________________________________________________________________ 
Hdr-Ind    :     E  (x45)     Partition  :     .  (x00)  
UndoFlag   :     .  (x00)     BeforeAfter:     A  (x41)  
RecLength  :     0  (x0000)   IO Time    : 2016/12/01 09:28:37.102.738   
IOType     :   150  (x96)     OrigNode   :     0  (x00) 
TransInd   :     .  (x03)     FormatType :     R  (x52) 
SyskeyLen  :     0  (x00)     Incomplete :     .  (x00) 
AuditRBA   :          0       AuditPos   : 0 
Continued  :     N  (x00)     RecCount   :     0  (x00) 

2016/12/01 09:28:37.102.738 RestartAbend         Len     0 RBA 1184 
Name:  
After  Image:                                             Partition 0   G  s   
   
Logdump 5 >n                                                              
___________________________________________________________________ 
Hdr-Ind    :     E  (x45)     Partition  :     .  (x04)  
UndoFlag   :     .  (x00)     BeforeAfter:     B  (x42)  
RecLength  :    14  (x000e)   IO Time    : 2016/12/01 04:47:41.886.892   
IOType     :     3  (x03)     OrigNode   :   255  (xff) 
TransInd   :     .  (x00)     FormatType :     R  (x52) 
SyskeyLen  :     0  (x00)     Incomplete :     .  (x00) 
AuditRBA   :     158419       AuditPos   : 730417760 
Continued  :     N  (x00)     RecCount   :     1  (x01) 

2016/12/01 04:47:41.886.892 Delete               Len    14 RBA 1246                    <--- C=1246
Name: OM.WQ_FINISH_ERROR 
Before Image:                                             Partition 4   G  b   
 0000 000a 0000 0000 0000 3f2c 1036                | ..........?,.6

Applying the formula gives A-B+C=28224338-22402029+1246=5823555
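The same calculation as a short sketch, with one sanity check added. The function name `option2_rba` and the guard are my own additions, not part of the MOS document: if the checkpoint RBA is not larger than the file, the OGG-01705 situation does not apply and the formula should not be used.

```python
def option2_rba(checkpoint_rba, actual_file_size, first_record_rba):
    """Option 2 as quoted above: new RBA = checkpoint RBA - file size + first record RBA."""
    if checkpoint_rba <= actual_file_size:
        # Guard added for illustration: the checkpoint lies inside the file,
        # so there is nothing to recompute.
        raise ValueError("checkpoint RBA is not beyond the end of the trail file")
    return checkpoint_rba - actual_file_size + first_record_rba

new_rba = option2_rba(
    checkpoint_rba=28224338,    # RBA from "info rpk_01g"
    actual_file_size=22402029,  # size of a1046974 on disk
    first_record_rba=1246,      # first record after RestartAbend in a1046975
)
print(new_rba)  # 5823555
```

In other words, the part of the checkpoint that overshoots the old file is carried over into the next trail file, offset by where that file's first real record begins.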

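As before, use logdump to verify that this position holds a good record. The check below shows that it does, so the RBA can then be altered with the ALTER command. Ensure that you have a good record at sequence number 46975 and RBA 5823555 with TransInd x00 or x03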

Logdump 1 >open /ogg1/dirdat/a1/a1046975
Current LogTrail is /ogg1/dirdat/a1/a1046975
Logdump 2 >detail data
Logdump 3 >fileheader detail
Logdump 4 >ghdr on
Logdump 5 >pos 5823555
Reading forward from RBA 5823555
Logdump 8 >n
___________________________________________________________________
Hdr-Ind : E (x45) Partition : . (x04)
UndoFlag : . (x00) BeforeAfter: B (x42)
RecLength : 14 (x000e) IO Time : 2016/12/01 04:48:10.887.776
IOType : 3 (x03) OrigNode : 255 (xff)
TransInd : . (x00) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 158419 AuditPos : 771814976
Continued : N (x00) RecCount : 1 (x01)

2016/12/01 04:48:10.887.776 Delete Len 14 RBA 5823555
Name: OM.WQ_FINISH_ERROR
Before Image: Partition 4 G b
0000 000a 0000 0000 0000 3f34 07fb | ..........?4..
Column 0 (x0000), Len 10 (x000a)
0000 0000 0000 3f34 07fb | ......?4..
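The TransInd condition can also be checked mechanically when the logdump output has been saved as text. The sketch below is an illustration only: `trans_ind` and `is_good_record` are hypothetical helpers, and they test nothing beyond the TransInd value printed by `ghdr on`. Whether logdump could parse the record at all still has to be read from its output.

```python
import re

# TransInd values the MOS note accepts for the new position.
GOOD_TRANS_IND = {"x00", "x03"}

def trans_ind(header_text):
    """Return the hex TransInd (e.g. 'x00') from ghdr output, or None if absent."""
    match = re.search(r"TransInd\s*:\s*\S+\s*\((x[0-9a-fA-F]{2})\)", header_text)
    return match.group(1).lower() if match else None

def is_good_record(header_text):
    """True only when a TransInd field is present and is x00 or x03."""
    return trans_ind(header_text) in GOOD_TRANS_IND

good = "TransInd : . (x00) FormatType : R (x52)"
bad = "Bad record found at RBA 16248662, format 5.50 Header token)"
print(is_good_record(good), is_good_record(bad))  # True False
```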

 

–Alter the RBA and start the process.
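The value passed to extseqno is the numeric part of the trail file name with its leading zeros dropped (a1046975 becomes 46975). A small sketch of deriving it and assembling the command text follows. The helper names are hypothetical, and the two-character prefix length is an assumption that matches the trail files in this environment.

```python
import os

def trail_seqno(trail_path, prefix_len=2):
    """Sequence number of a trail file, e.g. '/ogg1/dirdat/a1/a1046975' -> 46975.

    Assumes the file name is a fixed-length prefix followed only by digits.
    """
    name = os.path.basename(trail_path)
    digits = name[prefix_len:]
    if not digits.isdigit():
        raise ValueError(f"unexpected trail file name: {name}")
    return int(digits)

def alter_command(group, trail_path, rba):
    """Build the GGSCI command text; it is only printed here, not executed."""
    return f"alter replicat {group}, extseqno {trail_seqno(trail_path)}, extrba {rba}"

print(alter_command("RPK_01G", "/ogg1/dirdat/a1/a1046975", 5823555))
# alter replicat RPK_01G, extseqno 46975, extrba 5823555
```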

GGSCI (crmcxdb1) 40> alter rep RPK_01G, extseqno 46975, extrba 5823555
REPLICAT altered.
GGSCI (crmcxdb1) 42> start rpk_01G

Sending START request to MANAGER ...
REPLICAT RPK_01G starting

GGSCI (crmcxdb1) 45> !
info all

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER RUNNING
REPLICAT RUNNING RPK_01G 00:00:00 00:16:42
REPLICAT RUNNING RPK_10G 01:58:37 00:00:02
REPLICAT RUNNING RPK_20G 00:04:04 00:00:01
REPLICAT RUNNING RPK_30G 01:53:27 00:00:08
REPLICAT RUNNING RPK_MAX 02:25:09 00:00:09

GGSCI (crmcxdb1) 2> !
info all

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER RUNNING
REPLICAT RUNNING RPK_01G 04:35:38 00:00:02
REPLICAT RUNNING RPK_10G 02:19:15 00:00:02
REPLICAT RUNNING RPK_20G 00:00:00 00:00:06
REPLICAT RUNNING RPK_30G 01:51:55 00:00:04
REPLICAT RUNNING RPK_MAX 02:28:20 00:00:01

 

–EOF
