Of all the things that can ruin your day as a DBA and possibly get you fired, database corruption ranks right up there with backups and security enforcement of the database. Database corruption can be a resume generating event! A corrupt database can affect system stability, lead to unnecessary downtime, and possibly the loss of some data. To manage the risk you must know what database corruption is, how to diagnose, how to fix, and how to prevent.

Read on to learn how to understand and diagnose database corruption.

UPDATE: Part 2 on Fixing Database Corruption

Understanding Database Corruption – Baby Don’t Hurt Me?

Database corruption is errors when reading, writing, moving, or processing data inside a database. When the consistency of the database state is inconsistent then corruption exists.

ACID and CIA Models

It helps to understand some of the theory before delving into the practice. We all remember the basic tenants of a relational database management system (RDBMS):

Atomicity
Consistency
Isolation
Durability

Database corruption affects the consistency of the database. Normally when a transaction tries to change data SQL Server goes from a stable state to a state of flux (your change) and back to a stable state. With database corruption it is never returned to a stable consistent state.

Database corruption affects the consistency of the database

From a information security engineering perspective we have the CIA model that helps us think about data security.

Confidentiality – people see only the data they should see and no other, information access (think Access Control Lists (ACLs)).
Integrity – the trustworthiness of the data, kept accurate and not meddled with
Availability – systems are up and functional as they should be, open for business

In this framework, database corruption threatens Integrity and possibly Availability.

What Causes Database Corruption

The cause lies in layers below SQL Server. The most common are hardware faults; in particular, issues with the I/O Subsystem. Any component in the I/O Subsystem can fail and be the cause of database corruption: disks, controllers, CPU, memory, network switch, network cables, SAN, etc.

Database corruption cannot entirely be prevented. It is not a matter of if but rather when

Disks go bad. So do NICs, cables, routers, and everything else physical below the SQL Server Instance. This is why it is important to know that we cannot entirely prevent corruption – only deal with and mitigate it.

Some points to make about what does not cause corruption:

Applications do not cause corruption
Running CHECKDB cannot cause corruption
Creating a database snapshot cannot cause corruption
Stopping a shrink operation cannot cause corruption
Stopping an index rebuild operation cannot cause corruption

Here are some possible things that can cause corruption (not intended as an include all list)

Storing database files in compressed folders or volumes
Faulty hard disks (physical spinning disk or SSD)
ECC memory failure
Windows file system errors
Network component failure

Having databases in compressed folders or volumes can cause database corruption

When discussing corruption consider asking the following questions:

When did you first notice corruption?
Has the corruption reappeared (if it went away)?
How often does it strike (if not current)?
Do you notice any pattern in the recurrences?
Is it just 1 database or multiple?
Do you have good recent backups?

NOTE: backups can be corrupted too. It is not surprising for a system rife with corruption to also have corrupt backups. If the customer is not checking their databases regularly for corruption then the also probably are not testing backups by restoring them and running consistency checks.

Diagnosing Database Corruption

There are a few ways we can detect database corruption:

SQL Server error logs
Crash dumps
DBCC CHECKDB
Windows event logs

NOTE: the root cause is important to understand. Example: if a disk is bad, then fixing the corruption will not help beyond the immediate because it will happen again.

SQL Server Error Logs

By default the SQL Server error logs are located: C:\Program Files\Microsoft SQL Server\<instance name>\MSSQL\Log

Reading logs can be dry and boring but keep with it because there is much to be learned here. SQL Server error logs can tell us information about the corruption and what was happening around when it occurred.

Some keywords you can search are:

Corrupt
Severity:
- Looking for severity >= 16
- Anything >= 20 is very bad and will likely have crash dumps associated with it
Error
Dump

Some examples of suspicious activity:

Crash Dumps

When we see crash dumps this can indicate either a SQL Server bug or corruption. We cannot properly analyze them. The crash dumps must be analyzed by Microsoft to understand it.

SQL Server crash dumps can be found in the same location as the error logs.

DBCC CHECKDB

This is the main method of learning the details of the corruption. CHECKDB will output all of the logical checks against database pages and report back what is incorrect. After the SQL Error Log analysis this is the next job to tackle.

Usage for this command: https://docs.microsoft.com/en-us/sql/t-sql/database-console-commands/dbcc-checkdb-transact-sql?view=sql-server-2017

Running this command will indicate if there is corruption and tell us some details about it beyond that which is contained in the SQL Server error logs. We want to run CHECKDB on at least the suspected database and preferably on all databases on the SQL Instance with the problem (including all user database and the master system database – you can ignore tempdb).

The last time CHECKDB was run can be known by querying:

DBCC DBINFO(‘<your database name>’) WITH TABLERESULTS

1	DBCC DBINFO(‘<your database name>’) WITH TABLERESULTS

Look in the output for the field “dbi_dbccLastKnownGood” – its value will be the date of the last execution.

After CHECKDB finishes then save the output. Scroll to the bottom and look for a summary line. This will tell you all of what it discovered.

You can run this programmatically over your databases using the techniques I laid out here.

Windows Event Logs

Collect the Windows event logs as they may have some corruption details. This is less important than the SQL error log and CHECKDB output but still worth checking. This is especially important to pick up on disk errors and other OS errors that might lead you to the root cause analysis (RCA).

sys.suspect_pages

This system view show us potentially bad pages. They are suspected of being bad – but all database pages are presumed innocent before the grand court of SQL Server.

Typically high severity disk errors show up here: 823 and 824.

Consider this as supplementary to CHECKDB and SQL error logs. Some shops alert on activity here.