Sunday, December 8, 2013

Application Slow Down – High Lock/Latch Contention (QVLSLC24): Buffer Pool and EDM Pool Tuning

Problem:


This was one of the more interesting problems we have worked on. The application slowed down at times, mostly when the workload was very high, and high page-in rates were observed. During a slowdown, CICS transactions would time out for about a minute, and then the system would recover on its own within a minute or two. We had no clue about the root cause, as the slowdowns occurred at different times, once or twice a week, and we did not observe any common DB2 activity during them. We started the analysis from several angles, and it indicated private page-in waits across many CICS regions.

Analysis:

General analysis of the SMF and RMF records was not helpful, as we did not find anything wrong in them. When the DB2 lab was engaged, they suggested changing the SMF statistics interval from every 15 minutes to every minute. After the change, we were able to see that lock/latch contention (QVLSLC24) was very high (>0.5 million) for a single minute whenever a slowdown occurred. This was our first clue to what was causing the slowdown, other than the high page-in rate. Latch class 24 indicated that the problem could be in the EDM pool or in prefetch. We did not see any failures in the EDM pool.

QVLSLC24  QTXASLAT  TIME
    3313        45  23OCT2013:09:34:48.71
    3872        52  23OCT2013:09:35:48.71
    3931        71  23OCT2013:09:36:48.71
   32893       229  23OCT2013:09:37:48.71
  999423       818  23OCT2013:09:38:48.71  <== Slow Response
    6083       126  23OCT2013:09:39:48.71
    3275        91  23OCT2013:09:40:48.71
    4551        90  23OCT2013:09:41:48.71



We started analysing the application batch jobs that had significant lock counts and large units of work (UOW), and reduced the UOW size to increase the commit frequency of those jobs. Several application batch jobs were tuned, but this did not stop the slowdowns.
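The UOW reduction amounts to committing every N rows instead of once at the end of the job. The sketch below illustrates the pattern only; the real jobs were tuned in their own batch code, and `FakeConn` and `COMMIT_INTERVAL` are hypothetical names, not anything from our environment:

```python
# A sketch of the commit-frequency change, assuming a hypothetical
# connection object; the real jobs were tuned in their own batch code.
COMMIT_INTERVAL = 1000  # rows per unit of work (UOW)

class FakeConn:
    """Minimal stand-in for a DB2 connection, for illustration only."""
    def __init__(self):
        self.commits = 0
    def apply(self, row):
        pass  # the real job would update rows here, acquiring locks
    def commit(self):
        self.commits += 1  # ends the UOW and releases the locks

def process_batch(rows, conn):
    """Process rows, committing every COMMIT_INTERVAL rows so locks
    are held for one short UOW instead of the whole job."""
    pending = 0
    for row in rows:
        conn.apply(row)
        pending += 1
        if pending >= COMMIT_INTERVAL:
            conn.commit()
            pending = 0
    if pending:          # commit the final partial UOW
        conn.commit()
    return conn.commits

# 2,500 rows with a 1,000-row UOW -> commits after rows 1000 and 2000,
# plus a final commit for the last 500 rows.
print(process_batch(range(2500), FakeConn()))  # 3
```

The trade-off is classic: more frequent commits mean slightly more logging overhead, but far shorter lock-hold times for concurrent work.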

We added an extra 1 GB of central storage to the LPAR, and the BP8K0 buffer pool was increased from 1,000 to 6,000 pages, based on a suggestion from the IBM performance management team. Even this did not stop the slowdowns.

We then started collecting as much data as possible whenever a slowdown hit, including lock/latch contention reports from the PDB files, CICS response time reports, transaction counts, the batch jobs active at the time, and DB2 PM lock reports. As latch class 24 spiked for just one minute and the system recovered very quickly, enabling a trace was not practical.

We found one application batch job issuing a huge number of get pages that was contributing to the slowdown at a particular time. Closer analysis showed that it performs huge sequential get pages against the application-critical buffer pool: it was doing a table space scan on a very large table, and whenever that job's buffer pool hit ratio was very poor, we observed the slowdown. As it is a critical job that has to process every record in the table, not much could be tuned in the job itself. One option was to move that table to a different buffer pool, away from the critical one, so that the online regions would not be impacted.

We started monitoring all the buffer pools and their hit ratios during every slowdown, which hinted that the buffer pool holding the most active application table spaces should be enlarged. Collecting buffer pool statistics by DB2 thread for every slowdown window gave us a better understanding of buffer pool usage in the system.

Close monitoring of the buffer pools showed that the real problem was with the system directory buffer pool BP8K0: it had a negative buffer pool hit ratio, its I/O rate was higher than that of any application buffer pool, and the I/O was mainly against SPT01. The PT load percentage was around 30% during peak business hours and around 10% the rest of the time; for a healthy system, it should be kept below 10%.
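A "negative hit ratio" sounds odd, but it follows directly from the usual buffer pool hit ratio formula, because pages brought in by prefetch count as reads. A quick sketch, with illustrative counter values (the real numbers come from the DB2 statistics trace):

```python
def bp_hit_ratio(getpages, sync_reads, async_reads):
    """Buffer pool hit ratio (%): (getpages - pages read) / getpages.

    Prefetch (async) pages count as reads, so a pool flooded by
    sequential prefetch can read more pages from disk than it
    receives getpage requests for, driving the ratio negative --
    exactly what we saw on BP8K0.
    """
    pages_read = sync_reads + async_reads
    return 100.0 * (getpages - pages_read) / getpages

# A healthy application pool: most getpages found in the pool.
print(bp_hit_ratio(1_000_000, 40_000, 10_000))   # 95.0
# BP8K0 during a slowdown: total reads exceed getpages.
print(bp_hit_ratio(100_000, 30_000, 90_000))     # -20.0
```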
Final Resolution:

We increased the most active buffer pool, the one holding all the main table spaces, by 50K pages. As a result, the slowdown was not seen on one of the month's heaviest data-processing days.

Additionally, we suggested tripling the size of BP8K0 to reduce the I/O against it, and doubled the value of EDM_SKELETON_POOL in the ZPARM to reduce the PT load percentage. The results were significant: the I/O rate on BP8K0 dropped from 33% to below 3%, and the PT load percentage came down below 5%.
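The PT load percentage quoted above is simply the share of package-table requests that had to be loaded from SPT01 instead of being satisfied from the EDM skeleton pool. A sketch, with illustrative counter values (the real numbers come from the DB2 EDM pool statistics):

```python
def pt_load_pct(pt_requests, pt_loads_from_spt01):
    """PT load %: share of package-table (PT) requests that were
    loaded from SPT01 rather than found in the EDM skeleton pool.
    Each such load is extra I/O against SPT01."""
    return 100.0 * pt_loads_from_spt01 / pt_requests

# Before tuning (peak hours): roughly 30 loads per 100 requests.
print(pt_load_pct(100_000, 30_000))  # 30.0
# After doubling EDM_SKELETON_POOL: well under the 10% guideline.
print(pt_load_pct(100_000, 4_000))   # 4.0
```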

The CPU savings were also impressive: around 20% on the LPAR, which is significant.

The immediate benefits of buffer pool tuning are improved online response times, greater system throughput, reduced CPU utilization, reduced I/O workloads, and better use of DB2 memory resources.



Lessons Learnt

- In DB2 V9, the main EDM pool is below the 2-GB bar, while the EDM pool storage for database descriptors (DBDs), dynamic statements, and the skeleton cursor tables (SKCTs) and skeleton package tables (SKPTs) is above the 2-GB bar.

- If the skeleton pool is too small, response times increase because SKCTs and SKPTs must be repeatedly loaded, I/O against SPT01 increases, and the pages in use by CTs/PTs fall below 50%.

- Increase the skeleton pool size as needed to keep the CT/PT hit ratio in the 95-99% range, to keep the system in a healthy state.

- Buffer pool tuning is one of the easiest and most effective ways to improve DB2 performance. Tune by separating table spaces from indexes, and by separating randomly accessed objects from sequentially accessed ones.

- Tuning is an ongoing process. To sustain the benefits, it is important to implement proper buffer pool monitoring procedures.

- Proper buffer pool tuning can improve application response times by reducing I/O wait times and the CPU resources required for physical I/O.
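The skeleton-pool sizing rule above can be expressed as a simple health check on the CT/PT counters. The function names and sample values below are illustrative only; the real numbers come from the DB2 statistics records:

```python
def ctpt_hit_ratio(requests, loads):
    """CT/PT hit ratio (%): share of CT/PT requests satisfied from
    the skeleton pool without a load from the directory."""
    return 100.0 * (requests - loads) / requests

def skeleton_pool_healthy(requests, loads, threshold=95.0):
    """True when the CT/PT hit ratio meets the 95-99% guideline
    from the lessons above."""
    return ctpt_hit_ratio(requests, loads) >= threshold

# Healthy: 196,000 of 200,000 requests avoided a load (98.0%).
print(skeleton_pool_healthy(200_000, 4_000))   # True
# Unhealthy: only 70% hit ratio -> grow the skeleton pool.
print(skeleton_pool_healthy(200_000, 60_000))  # False
```

Running such a check against every statistics interval, rather than eyeballing reports after a slowdown, is what "proper monitoring procedures" meant for us in practice.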
