Of all the things that can ruin your day as a DBA and possibly get you fired, database corruption ranks right up there with backups and security enforcement of the database. Database corruption can be a resume generating event! A corrupt database can affect system stability, lead to unnecessary downtime, and possibly the loss of some data. To manage the risk you must know what database corruption is, how to diagnose, how to fix, and how to prevent.
Read on to learn how to understand and diagnose database corruption.
UPDATE: Part 2 on Fixing Database Corruption
Understanding Database Corruption – Baby Don’t Hurt Me?
Database corruption is errors when reading, writing, moving, or processing data inside a database. When the consistency of the database state is inconsistent then corruption exists.
ACID and CIA Models
It helps to understand some of the theory before delving into the practice. We all remember the basic tenants of a relational database management system (RDBMS):
Database corruption affects the consistency of the database. Normally when a transaction tries to change data SQL Server goes from a stable state to a state of flux (your change) and back to a stable state. With database corruption it is never returned to a stable consistent state.
Database corruption affects the consistency of the database
From a information security engineering perspective we have the CIA model that helps us think about data security.
- Confidentiality – people see only the data they should see and no other, information access (think Access Control Lists (ACLs)).
- Integrity – the trustworthiness of the data, kept accurate and not meddled with
- Availability – systems are up and functional as they should be, open for business
In this framework, database corruption threatens Integrity and possibly Availability.
What Causes Database Corruption
The cause lies in layers below SQL Server. The most common are hardware faults; in particular, issues with the I/O Subsystem. Any component in the I/O Subsystem can fail and be the cause of database corruption: disks, controllers, CPU, memory, network switch, network cables, SAN, etc.
Database corruption cannot entirely be prevented. It is not a matter of if but rather when
Disks go bad. So do NICs, cables, routers, and everything else physical below the SQL Server Instance. This is why it is important to know that we cannot entirely prevent corruption – only deal with and mitigate it.
Some points to make about what does not cause corruption:
- Applications do not cause corruption
- Running CHECKDB cannot cause corruption
- Creating a database snapshot cannot cause corruption
- Stopping a shrink operation cannot cause corruption
- Stopping an index rebuild operation cannot cause corruption
Here are some possible things that can cause corruption (not intended as an include all list)
- Storing database files in compressed folders or volumes
- Faulty hard disks (physical spinning disk or SSD)
- ECC memory failure
- Windows file system errors
- Network component failure
Having databases in compressed folders or volumes can cause database corruption
When discussing corruption consider asking the following questions:
- When did you first notice corruption?
- Has the corruption reappeared (if it went away)?
- How often does it strike (if not current)?
- Do you notice any pattern in the recurrences?
- Is it just 1 database or multiple?
- Do you have good recent backups?
NOTE: backups can be corrupted too. It is not surprising for a system rife with corruption to also have corrupt backups. If the customer is not checking their databases regularly for corruption then the also probably are not testing backups by restoring them and running consistency checks.
Diagnosing Database Corruption
There are a few ways we can detect database corruption:
- SQL Server error logs
- Crash dumps
- DBCC CHECKDB
- Windows event logs
NOTE: the root cause is important to understand. Example: if a disk is bad, then fixing the corruption will not help beyond the immediate because it will happen again.
SQL Server Error Logs
By default the SQL Server error logs are located: C:\Program Files\Microsoft SQL Server\<instance name>\MSSQL\Log
Reading logs can be dry and boring but keep with it because there is much to be learned here. SQL Server error logs can tell us information about the corruption and what was happening around when it occurred.
Some keywords you can search are:
- Looking for severity >= 16
- Anything >= 20 is very bad and will likely have crash dumps associated with it
Some examples of suspicious activity:
When we see crash dumps this can indicate either a SQL Server bug or corruption. We cannot properly analyze them. The crash dumps must be analyzed by Microsoft to understand it.
SQL Server crash dumps can be found in the same location as the error logs.
This is the main method of learning the details of the corruption. CHECKDB will output all of the logical checks against database pages and report back what is incorrect. After the SQL Error Log analysis this is the next job to tackle.
Running this command will indicate if there is corruption and tell us some details about it beyond that which is contained in the SQL Server error logs. We want to run CHECKDB on at least the suspected database and preferably on all databases on the SQL Instance with the problem (including all user database and the master system database – you can ignore tempdb).
The last time CHECKDB was run can be known by querying:
DBCC DBINFO(‘<your database name>’) WITH TABLERESULTS
Look in the output for the field “dbi_dbccLastKnownGood” – its value will be the date of the last execution.
After CHECKDB finishes then save the output. Scroll to the bottom and look for a summary line. This will tell you all of what it discovered.
You can run this programmatically over your databases using the techniques I laid out here.
Windows Event Logs
Collect the Windows event logs as they may have some corruption details. This is less important than the SQL error log and CHECKDB output but still worth checking. This is especially important to pick up on disk errors and other OS errors that might lead you to the root cause analysis (RCA).
This system view show us potentially bad pages. They are suspected of being bad – but all database pages are presumed innocent before the grand court of SQL Server.
Typically high severity disk errors show up here: 823 and 824.
Consider this as supplementary to CHECKDB and SQL error logs. Some shops alert on activity here.
The Saga Continues
Stay tuned for my next post on fixing and preventing database corruption.
Thanks for reading!
If you liked this post then you might also like my series on SQL Server Environmental Diagnostics Guide.