Oracle 11g database with Data Guard active.
A switch of the sites (from production to disaster recovery node) is done and all seems good. Suddenly one day after the switch (more than one month) the enterprise application using it cannot connect any-more (connection pool getting filled up). The underlying issue that causes this is reported as:
[#|2021-05-15T11:39:28.304+0300|WARN|glassfish3.1.2|org.hibernate.engine.jdbc.spi.SqlExceptionHelper|_ThreadID=155;_ThreadName=Thread-2;|SQL Error: 600, SQLState: 60000|#] [#|2021-05-15T11:39:28.312+0300|ERROR|glassfish3.1.2|org.hibernate.engine.jdbc.spi.SqlExceptionHelper|_ThreadID=157;_ThreadName=Thread-2;|ORA-00600: internal error code, arguments: [ktbdchk1: bad dscn], , , , , , , , , , ,  |#]
Looking around Oracle support it can be seen that this is an Oracle known bug.
Oracle Database – Enterprise Edition – Version 22.214.171.124 to 126.96.36.199 [Release 11.1 to 11.2]
Oracle Database – Enterprise Edition – Version 188.8.131.52 to 184.108.40.206 [Release 12.1]
Oracle Database Cloud Schema Service – Version N/A and later
Oracle Database Exadata Express Cloud Service – Version N/A and later
Gen 1 Exadata Cloud at Customer (Oracle Exadata Database Cloud Machine) – Version N/A and later
Information in this document applies to any platform.
Also states that this is a Data Guard specific issue:
May Occur in a Physical Standby database after a switchover.https://support.oracle.com/knowledge/Oracle%20Database%20Products/1577824_1.html
It looks like the affected tables are those that contain blob data that end up having rows with invalid SCN (System Change Number). Somehow you end up with any interaction with the affected table that has blobs returning that error.
One of the guarantees of the Oracle database is that “there are no dirty reads of user data”. If necessary, Oracle will rebuild each row to the stable state at instant in time when the query was issued.
Oracle does it nicely and very accurately through a “timer” of its own, known as the System Change Number, AKA SCN.
Oracle instructs to apply some patch to fix the issue.
If somehow you cannot install the patch due to support issues an upgrade to latest Oracle 12c is a better alternative. Because 11g is no longer actively supported by Oracle, some of the prerequisite patches require extended support licence so you may be out of luck.
Still the patch is not enough. One blog resource in fact instructs that you have to set some hidden parameter _ktb_debug_flags that is supposed to heal blocks having invalid dependent scn’s on switch-over
bash-4.3$ sqlplus “/as sysdba”
SQL> alter system set “_ktb_debug_flags”=8 scope=both sid=’*’;https://rajiboracle.blogspot.com/2017/09/ora-00600-internal-error-code-arguments.html
Note that this will not be 100% effective. You may still end up with entries with blob data that are indeed corrupted with an invalid SCN and cannot be recovered, but at least is not throwing the error message for all the entries.
Note that news generated data will be OK but if that new data touches or uses old data that is still in those rows with invalid SCN’s the error may still occur in the logs.
My supposition on what is happening is this:
- After site switch somehow some Oracle meta-data table gets corrupted and SCNs are not replicated for some rows.
- The patch fixes the problem and the alter is somehow repairing those rows without SCNs.
- It seems that if a row was changed several times Oracle is able to recover a SCN for the row and all is good but if the row is really at the first insert and the SCN is still null, then we still have a problem.
- Interesting fact we still had the issue even the next day with some isolated cases (2 transactions from some thousands) but this time after a fist try that failed, at the second try the database transaction succeeded.