Thursday, July 16, 2015

BMC REORG using REBALANCE hidden glitch


Problem: 

Attempted BMC ROERG using option REBALANCE with SHRLEVEL CHANGE between partitions to resolve space issue on a table partition, as application needed 24/7 availability, which put the table partition in Recovery pending as the job failed during ALTER Table limit key stage after REORG Switch phase. 

 

Reason:

During utility terminate phase of BMC REORG with REBALACNE option it completes the switch to shadow dataset first and then attempt to alter statement, where the control is passed to DB2.   DB2 must invalidate dependent plans and packages on that object to complete the ALTER successfully.

The BMC Reorg was holding a DRAIN on the object and successfully completed the switch phase, But that DRAIN did not prevent dependent packages from running, as long as they are not currently accessing this object. And if the packages are running, then DB2 cannot invalidate them.  It is also likely in this case that this thread was waiting on the Reorg plus DRAIN in order to access this object, resulting in a deadlock. To put it another way, Reorg Plus is waiting on the thread in order to invalidate the package in order to complete the ALTER, and the thread is waiting on Reorg Plus to access the object, so its package cannot be invalidated, and so the ALTER cannot be completed.


The BMC recommended default option ALTRFAIL=RCVRPEND caused the table to be put in Recovery pending status to allow it to be restarted to complete the REORG.  And this option ALTRFAIL is an installation option so it cannot be overridden with REORG Control card. Either you need to change the Default ALTRFAIL=TERM or create new DOPTS module which takes time to make and needs system DBA support.

With ALTRFAIL=TERM, REORG PLUS automatically backs out the reorganization when a failure occurs during processing of limit-key ALTER statements


Fix:

We resolve the space issue with IBM REORG as it build good compression dictionary and does better space compression than BMC REORG.

 

Lessons Learnt:  

BMC REORG REBALANCE with SHRLEVEL CHANGE even though say no outage in the manual, it is not really a solution for an application that is active and busy 24/7.

Always have a user module with ALTRFAIL=TERM and use it to avoid outage to the application if at all you are going to try to attempt BMC REORG REBALANACE with SHRLEVL CHANGE for any immediate fix without an outage.

BMC REORG – Build Compression Dictionary inefficiency


Problem:

Application batch job failed with -904 resource (Reason 00D70014) not available due to table space partition reached maximum limit. This job used to do mass insert but the business data volume was as usual and very much below the maximum limit.   

Note: All table space in our application are compressed.

Reason:

On analysis, we found BMC REORG that ran on that particular partition of the table had built a compression dictionary which is not efficient, causing the partition space reach maximum limit.  Found in the same table highest records stored is over 50 Million in a single partition, whereas, due to poor dictionary built in this particular partition caused to it reach maximum space of 4GB  when it had only 42 Million records.

 

Fix:  

Run IBM REORG on the same partition which brought down space usage to 3GB from 4GB limit.

 

Lessons Learnt:

IBM REORG builds better compression dictionary and provided good space saving compared to BMC REORG.


Use BMC Option ORDER YES and KEEP DICTIONARY NO to get similar compression as that of IBM REORG. But you also see increase in CPU usage time and run time.

Thursday, May 28, 2015

BMC REORG –Options Parameter - Defaults not set properly




Problem:   We had encountered issue with Online BMC DB2 Reorg job, the job was in Terminate utility phase and went into RETRY mode, and able to complete the REORG as some thread was holding the REORG job.  The job was in retry mode for a long time and the table was in locked status, due to the BMC defaults options in the system.  As the table locks was one of the critical table impact was huge.
Reason:
The BMC REORG job was not cancellable by anyone, as BMC have made their job non-cancellable during utility termination phase.  We were not be able to track the thread that was holding the BMC REORG job as ours is a very busy system and even DB2 Omegamon tool was not showing any lock conflicts. There were no application batch jobs other than REORG batch jobs, as in our system we have mechanism to hold all application batch during DBA REORG to avoid contentions.  Display thread on the table shows huge list of threads as it was one on the critical table in the system.

Fix: Followed below steps
  1. Terminate the utility.  The utility was terminated but still kept retry due to the default setting.
  2. Cancelling the reorg – BMC had made the REORG job uncancellable in Utility terminate phase when DRNWAIT is set to UTIL.
  3. Performed DB2 rebounce – table in question was still in recovery pending mode after that.
  4. Restarted the DB2 table in question after health check the table. This caused an outage for about an hour.
  5. We were not able to identify the DB2 Thread causing the issue. (side issue)


However, the best fix would be to identify the thread that is holding the REORG job and cancel it.
Lesson Learnt: 
The combination of DRNRETRY = 255 and DRNWAIT=UTIL had caused the Reorg to retry 255 times every 5 mins during which terminating would not have effect and Cancellation would not be allowed.
Best values to use are
DRNWAIT=NONE  (this was set to UTIL by default) à Which means that the drain request issued by REORG PLUS times out immediately if the drain cannot acquire the lock. NONE prevents any application transactions from being queued during the drain process. BMC recommends that we specify NONE in high-transaction environments.
DSPLOCKS=RETRY which will display the locks and claimers after every RETRY. This will be useful information to identify a common thread across all RETRYs to narrow down the offending thread.  
TIMEOUT=TERM à This parameter leaves the objects in their original states (RW) and terminates the job.
BMC Recommend the default options for DRNWAIT=NONE and TIMEOUT=TERM in there manual.