HelloDBA [English]
搜索Internet 搜索 HelloDBABA
  Oracle技术站。email: fuyuncat@gmail.com  MSN: fuyuncat@hotmail.com   acoug  acoug 

高级复制中mlog$表的HWM过高导致性能问题

[English]

作者: fuyuncat

来源: www.HelloDBA.com

日期: 2008-06-07 14:50:13

分享到  新浪微博 腾讯微博 人人网 i贴吧 开心网 豆瓣 淘宝 推特 Facebook GMail Blogger Orkut Google Bookmarks

某系统升级后,报告事务端的数据库反应非常慢,用户大量投诉。

让现场查了一下系统,CPU占用到了80%(平时该系统正常时只有10%~20%)。wio也非常高(说明磁盘读写严重)。一段时间后,现场将statspack报告发回来了:

STATSPACK report for
 
DB Name         DB Id    Instance     Inst Num Release     Cluster Host
------------ ----------- ------------ -------- ----------- ------- ------------
OLTPDB        3781951398 OLTPDB              1 9.2.0.5.0   NO      jnaip21
 
            Snap Id     Snap Time      Sessions Curs/Sess Comment
            ------- ------------------ -------- --------- -------------------
Begin Snap:       3 27-Sep-04 13:15:58       32      83.4
  End Snap:       4 27-Sep-04 14:10:36       37      83.4
   Elapsed:               54.63 (mins)
 
Cache Sizes (end)
~~~~~~~~~~~~~~~~~
               Buffer Cache:       720M      Std Block Size:         8K
           Shared Pool Size:       112M          Log Buffer:       512K
 
Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                   ---------------       ---------------
                  Redo size:             11,763.97              7,400.17
              Logical reads:              8,251.68              5,190.75
              Block changes:                 83.17                 52.32
             Physical reads:              2,488.23              1,565.23
            Physical writes:                  6.15                  3.87
                 User calls:                411.88                259.09
                     Parses:                 56.57                 35.59
                Hard parses:                  0.29                  0.18
                      Sorts:                 14.59                  9.18
                     Logons:                  0.12                  0.08
                   Executes:                 80.04                 50.35
               Transactions:                  1.59
 
  % Blocks changed per Read:    1.01    Recursive Call %:    28.14
 Rollback per transaction %:    0.17       Rows per Sort:  1818.01
 
Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:   99.99       Redo NoWait %:  100.00
            Buffer  Hit   %:   69.88    In-memory Sort %:   99.99
            Library Hit   %:   99.78        Soft Parse %:   99.50
         Execute to Parse %:   29.32         Latch Hit %:   99.98
Parse CPU to Parse Elapsd %:   96.72     % Non-Parse CPU:   99.02
 
 Shared Pool Statistics        Begin   End
                               ------  ------
             Memory Usage %:   93.60   93.71
    % SQL with executions>1:   20.28   19.05
  % Memory for SQL w/exec>1:   25.59   25.88
 
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
db file scattered read                            599,637       3,886    63.79
CPU time                                                        1,142    18.74
db file sequential read                           238,983         629    10.33
enqueue                                               263         366     6.00
log file sync                                       5,154          23      .37
          -------------------------------------------------------------
Wait Events for DB: OLTPDB    Instance: OLTPDB    Snaps: 3 -4
-> s  - second
-> cs - centisecond -     100th of a second
-> ms - millisecond -    1000th of a second
-> us - microsecond - 1000000th of a second
-> ordered by wait time desc, waits desc (idle events last)
 
                                                                   Avg
                                                     Total Wait   wait    Waits
Event                               Waits   Timeouts   Time (s)   (ms)     /txn
---------------------------- ------------ ---------- ---------- ------ --------
db file scattered read            599,637          0      3,886      6    115.1
db file sequential read           238,983          0        629      3     45.9
enqueue                               263          4        366   1390      0.1
log file sync                       5,154          0         23      4      1.0
log file parallel write            15,143      8,060         20      1      2.9
db file parallel write              1,008          0         12     12      0.2
control file parallel write         1,065          0          4      4      0.2
buffer busy waits                   3,195          0          4      1      0.6
 
... ...
 
 
SQL ordered by Gets for DB: OLTPDB    Instance: OLTPDB    Snaps: 3 -4
-> End Buffer Gets Threshold:   10000
-> Note that resources reported for PL/SQL includes the resources used by
   all SQL statements called within the PL/SQL code.  As individual SQL
   statements are also reported, it is possible and valid for the summed
   total % to exceed 100
 
                                                     CPU      Elapsd
  Buffer Gets    Executions  Gets per Exec  %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
      6,509,469        1,060        6,141.0   24.1    66.35     64.61 2833840857
Module: JDBC Thin Client
select lib.code, lib.name, lib.info, leaf.leafdirindex, tp.tpname, from t_commend com, t_leafinfo leaf, t_lib lib,t_tpinfo tp where 
 
      2,258,385          343        6,584.2    8.3    65.41   1289.82 3307794804
Module: oracle@qdaip12 (TNS V1-V3)
SELECT /*+ */ "A2"."PHONENUMBER","A2"."TGROUPID","A2"."LOOPNO
","A2"."TCODE" FROM "RU"."T_LOOP" "A2", (SELECT /*+ */
DISTINCT "A3"."LOOPNO" "LOOPNO","A3"."TGROUPID" "TGROUPID"
 FROM "RU"."MLOG$_T_LOOP" "A3" WHERE "A3"."SNAPTIME$$">:1
AND "A3"."DMLTYPE$$"<>'D') "A1" WHERE "A2"."LOOPNO"="A1"."LOOPNO
 
      2,251,917          343        6,565.4    8.3    66.51   1306.09 1606105459
Module: oracle@qdaip12 (TNS V1-V3)
SELECT /*+ */ DISTINCT "A1"."LOOPNO","A1"."TGROUPID" FROM "RI
NG"."MLOG$_T_LOOP" "A1" WHERE (("A1"."LOOPNO","A1"."TGROU
PID")<>ALL (SELECT "A2"."LOOPNO","A2"."TGROUPID" FROM "RU".
"T_LOOP" "A2" WHERE "A1"."LOOPNO"="A2"."LOOPNO" AND "A1"."TO
NEGROUPID"="A2"."TGROUPID")) AND "A1"."SNAPTIME$$">:1 AND "A1
 
      2,249,895          343        6,559.5    8.3    66.34   1098.20 4039622144
delete from "RU"."MLOG$_T_LOOP" where snaptime$$ <= :1
 
 
... ...
 
 
Physical Reads  Executions  Reads per Exec %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
      2,129,843          343        6,209.5   26.1    66.34   1098.20 4039622144
delete from "RU"."MLOG$_T_LOOP" where snaptime$$ <= :1
 
      2,129,004          343        6,207.0   26.1    65.41   1289.82 3307794804
Module: oracle@qdaip12 (TNS V1-V3)
SELECT /*+ */ "A2"."PHONENUMBER","A2"."TGROUPID","A2"."LOOPNO
","A2"."TCODE" FROM "RU"."T_LOOP" "A2", (SELECT /*+ */
DISTINCT "A3"."LOOPNO" "LOOPNO","A3"."TGROUPID" "TGROUPID"
 FROM "RU"."MLOG$_T_LOOP" "A3" WHERE "A3"."SNAPTIME$$">:1
AND "A3"."DMLTYPE$$"<>'D') "A1" WHERE "A2"."LOOPNO"="A1"."LOOPNO
 
      2,123,001          343        6,189.5   26.0    66.51   1306.09 1606105459
Module: oracle@qdaip12 (TNS V1-V3)
SELECT /*+ */ DISTINCT "A1"."LOOPNO","A1"."TGROUPID" FROM "RI
NG"."MLOG$_T_LOOP" "A1" WHERE (("A1"."LOOPNO","A1"."TGROU
PID")<>ALL (SELECT "A2"."LOOPNO","A2"."TGROUPID" FROM "RU".
"T_LOOP" "A2" WHERE "A1"."LOOPNO"="A2"."LOOPNO" AND "A1"."TO
NEGROUPID"="A2"."TGROUPID")) AND "A1"."SNAPTIME$$">:1 AND "A1
 
      1,724,081          278        6,201.7   21.1    54.05    887.05 2153620124
update "RU"."MLOG$_T_LOOP" set snaptime$$ = :1  where snap
time$$ > to_date('2100-01-01:00:00:00','YYYY-MM-DD:HH24:MI:SS')

 

关注statspack中以下信息:Buffer Hit非常低,只有69.88%;在Top Event的等待事件中,db file scattered read的等待最高,占到了63.79%。很明显,这是由于数据文件扫描导致的buffer cache下降非常严重。而就肯定是由于存在全表扫描(全表扫描到的数据块不会被cachebuffer cache,而是会已最快的速度被置换出内存,这就是为什么全表扫描会导致buffer hit下降)语句所致。

再看top sql中的内容,我们发现,大多数top sql都涉及到一个奇特的表:MLOG$_T_LOOP。这张表的使用是不会出现在应用程序中。因此相关语句都是被系统调用的。那么这张表是做什么用的呢?

我们的系统为了将业务处理与业务查询分离,通过高级复制将业务处理端的某些表同步到业务查询端去。而mlog$_表就是记录最后一次刷新后,被同步表的数据变化的日志表,当查询端(即快照端)的刷新作业(我们设置10秒钟间隔运行)通过mlog$_读取到源表的数据变化,更新查询端相应的表。而业务处理端检测到所有快照端都更新过这些数据后,就会将这些数据删除。此外,mlog表上是没有任何索引,因此对mlog表的查询肯定是全表扫描。但是由于我们的刷新作业10秒钟更新一次,而10秒钟所产生的数据变化量非常小,在正确情况下,这样的全表扫描是不会影响到系统性能的。

我查了一下MLOG$_T_LOOP,发现它记录的数据确实非常少(一般只有几条数据),但是对它查询钟,physical reads却达到了6200多,这显然不正常。说明这张表的高水位(HWM,高水位记录的是表曾经达到的最大数据块,就像洪水过后留下的水位线。而对表的扫描并不是以当前的实际水位为基准的,而是以高水位为基准的,也就是说会扫描表曾经占用过的所有数据块)非常高。

通过现场了解,我们得知现场在做升级时,并没有停止同步刷新作业,而是将刷新作业间隔延长到2个小时。升级过程中,两边的数据库都没有停止。但是,期间在业务处理端做了一些数据处理,特别是表T_LOOP有大量的数据更新。

问题很明确了!由于升级过程中T_LOOP有大量的数据更新,而快照没有被刷新。因此mlog$_t_loop中保存了两个小时内的所有数据变化记录,导致高水位上升!而刷新作业再次恢复到10秒间隔后,尽管mlog$_t_loop中的数据量下降了,但是它的高水位并没有下降!

于是安排现场将业务处理端的客户连接断开,然后停止快照端的刷新作业,手工刷新一次同步后,检查业务处理端的mlog$表,确保这些表为空,然后truncate mlog$表,恢复同步作业,最后恢复业务处理端的客户连接。经过一段时间观察后,数据库恢复正常。

Top

Copyright ©2005,HelloDBA.Com 保留一切权利

申明
by fuyuncat