Friday, November 27, 2009

DB2 - NUMLKUS Max reached

Problem:

We had an application job abend with SQL – 904 (Resource Unavailable) with below reason code :

DSNT501I -SSID DSNILMCL RESOURCE UNAVAILABLE 378
CORRELATION-ID=XXXXXXXXXXXXXX
CONNECTION-ID=IMPX
LUW-ID=XXXXX.AP1BDBPC.C30C05F02778=0
REASON 00C90096
TYPE 00000302
NAME DBDNAME.SPACENAM .X'A817D9'


Explanation:

Explanation for this error is follows:

- The page, row or LOB lock on the page or sub page identified by NAME in message DSNT500I or DSNT501I caused the total number of page, row or LOB locks concurrently held to reach the installation maximum number of page, row or LOB locks (NUMLKUS) allowed for a single agent in the system.

Analysis:
It is due to number of locks per user reaching the installation limit.

We observed that in our region the current values are :

NUMLKTS : Max page locks per tablespace : 2000
NUMLKUS : Max page locks per user . . . : 5000

So, a user / batchid can acquire only 5000 locks at any particular time. If the program is inserting/updating into multiple tables the maximum number of locks acquired across all tables should be less than 5000. When this limit exceeds you get SQL ERROR -904 with reason code of C90096.
This type of error may occur when we have either mass insert or delete or updates.
In this case the number of locks for the batch id XXXXXXX could have crossed the threshold limit of 5000. This is not related to any particular table or partition, it is the total number of locks the batch-id had at particular time at subsystem level.

Approach that did not resolve the problem
Suggestion 1:

To avoid this error application team was requested to reduce the check point frequency in the program to 500 from 1000. We requested for a Restart, but job failed again with same reason code.
The batch job issues checkpoint once the counter for INSERT reaches 1000. Update/Delete statements are based on WHERE condition which may even fetch more than 5000 matching rows at a time.

Conclusion 1:
Reducing Check Point frequency should be used only when DML activity is done one row at a time.

Suggestion 2:
Current definition of tablespace :
CREATE TABLESPACE . . .
. . .
BUFFERPOOL BP0
LOCKSIZE PAGE
LOCKMAX 0
LOCKPART YES
CLOSE NO
CCSID EBCDIC;

This was changed to : LOCKSIZE ANY LOCKMAX SYSTEM, to enable lock escalation when threshold limit reaches for any particular tablespace.
This did not solve the purpose, as the job abended again with same error.

Conclusion 2:
LOCKSIZE ANY LOCKMAX SYSTEM will help only when NUMLKTS threshold is reached.

Suggestion 3: Temporary Fix
Batch program access around 6 tables doing update / delete based on WHERE condition.

As the number of updates/deletes on the tables are more, temporarily we changed the lock size parameter from PAGE to TABLESPACE for tables that have high DML activity:

This would make user to acquire only one lock till all inserts/updates/deletes are done.
But the disadvantage is the concurrency; the other users may not be able to insert/update the table when batch job is running.

Note 1:
LOCKPART parameter was changed to NO to make LOCKSIZE as TABLESPACE, as LOCKPART YES is not allowed with LOCKSIZE TABLESPACE.

Note 2: To alter the LOCKPART option, you must stop the entire table space with the
STOP DATABASE command.

After modification the DDL of tablespace look like below:
CREATE TABLESPACE . . .
. . .
BUFFERPOOL BP0
LOCKSIZE TABLESPACE
LOCKPART NO
CLOSE NO
CCSID EBCDIC;

Conclusion 3:
This fixed the problem temporarily.

Concurrency Issue :
This change in LOCKSIZE resulted in Time-Out for some online programs as the TABLESPACE is locked by the batch job. When we looked at the DB2 log we found there were contention of Batch job with OWB online:

02.54.01 STC06697 DSNT375I -SSID PLAN=DISTSERV WITH 176
176 CORRELATION-ID=java
176 CONNECTION-ID=SERVER
176 LUW-ID=A438AA3E.A261.0259C5063726=4655
176 THREAD-INFO=ORDERMB:usahgmowbas003:ordermb:java
176 IS DEADLOCKED WITH PLAN=XXXXXXX WITH
176 CORRELATION-ID=XXXXXXXXXXX
176 CONNECTION-ID=IMPX
176 LUW-ID=XXXXXX.AP1BDBPC.C30C3237F2FF=5183
176 THREAD-INFO=XXXXXXX:*:*:*
176 ON MEMBER SSID
02.54.01 STC06697 DSNT501I -SSID DSNILMCL RESOURCE UNAVAILABLE 177

So, once the batch job completed successfully, we reverted back to LOCKSIZE PAGE


SET CURRENT SQLID='XXXXX';
ALTER TABLESPACE dbname.tsname LOCKPART YES LOCKSIZE PAGE;


Permanent fix:

If the application should be able to run with the current page, row or LOB locking protocol and SQL statements, increase the NUMLKUS value for the installation to allow a higher
limit of page, row or LOB locks to be concurrently held by a single application. If a utility job encountered this resource unavailable condition, the NUMLKUS value must be increased to accommodate the utility, because utilities are programmed to use the minimum number of page, row or LOB locks possible.

We have raised a incident ticket and assigned to System DBA to arrive at a optimum NUMLKUS value to avoid this problem in future.

2 comments:

  1. Nice info... I have one question. From application developer perspective, is there any way to check the value of NUMLKUS value set in dsnzparm? (like running a sql query in qmf or spufi)

    ReplyDelete
  2. Hi Rikdeb, I do not log on to this site everyday. I post my learning when I get free time, thinking it be useful for people who face similar problems .The NUMLKUS value cannot be obtained using SQL . You get the value my call DB2 Store procedure (CALL "SYSPROC".ADMIN_INFO_SYSPARM(null,0,'')) or You can get the DSNZPARM values by running DB2PM program.

    Easiest would be to ask your DBA...:-)

    ReplyDelete