In episode 217, Paul Randal from the Microsoft SQL Server team talked to us about CHECKDB, a built-in tool for checking the integrity of SQL Server databases.
Carl Franklin: Hey Paul, on your blog recently you had a post, “How long does *your* CHECKDB take?”, in which you asked people to send in their results. Did you get a lot of responses?
Paul Randal: I have had 10 to 20 responses so far.
Carl Franklin: Have there been any surprises?
Paul Randal: Not that I have seen so far, no. I mean, there’s a lot of variety in the size of databases, and lots of good-sized databases of half a terabyte or more. In fact, I did a boot camp with Kimberly [Tripp] a couple of weeks ago, where we did internals training for a bunch of Microsoft partners and some hardware partners, and one of the guys from HP who was there offered to run a CHECKDB on the three-terabyte data warehouse they were using for TPC-H benchmarking runs, and they got it to run in eight hours on three terabytes.
Richard Campbell: And it’s funny to say it that way, but offering to run that really is a gesture. It is a major thing to run a CHECKDB on a database that big.
Paul Randal: Right. I mean, if your hardware can’t handle it, it’s going to take a long, long time. There are a ton of different factors. Do you want to go through the factors that determine how long it takes?
Carl Franklin: Sure yeah.
Paul Randal: I mean, these are questions I get asked: How long does CHECKDB take? How long is it going to take to run on my database? The answer I like to give to lots of questions is, “Hey, it depends.”
Richard Campbell: That’s a correct answer to everything.
Paul Randal: Right, absolutely. I mean, it’s a good way of getting yourself off the hook, especially if you’re busy. But anyway, the things that are going to affect how long CHECKDB takes to run: obviously the size of the database, that’s a no-brainer, right? And then you have got to think about whether you’ve got any other load on the server: any concurrent I/O load, any concurrent CPU activity. CHECKDB is going to read every single allocated page in the database, so it’s going to put a huge I/O load on the server.
Richard Campbell: And you are not just talking about I/O load from SQL Server, but if there is anything else running on that server.
Paul Randal: Oh, absolutely, yeah. It’s going to put a load on your I/O subsystem, so if there is anything else using it, then you don’t want to be doing a big CHECKDB at the same time. CPU activity: some of the algorithms in CHECKDB are very CPU-intensive, because they’re reconciling facts from all different parts of the database. There’s one algorithm, for instance, that checks that for every row in a base table, let’s say a clustered index, there’s exactly one matching row in each non-clustered index, and vice versa. If you’ve used the options that allow that to run, it can take up to 30% of the CPU that CHECKDB uses.
Carl Franklin: Wow!
Paul Randal: So, one of the things I say to VLDB customers to get their run times down is to try running with the PHYSICAL_ONLY option, because that turns off all of these deeper logical checks and makes the run time a lot faster. It turns it from a CPU-bound process into an I/O-bound process.
Richard Campbell: Just physically go and touch all my allocations.
Paul Randal: Yeah: look at every single allocated page, audit the page, and if there are any page checksums on the page, check them. That’s a really good way of checking for any gross hardware errors.
Richard Campbell: Right, as opposed to the actual CRC work of making sure everything is consistent.
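For reference, the physical-only check is just an option on the regular command; a minimal sketch, with a placeholder database name:

    -- Audits allocated pages and verifies page checksums,
    -- skipping the deep logical checks.
    DBCC CHECKDB (N'YourDatabase') WITH PHYSICAL_ONLY;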
Paul Randal: Right. So, those are a couple of the factors. The next is any concurrent update activity on the database, okay? The first thing that we have to do when we do a CHECKDB is get a consistent view of the database. For SQL Server 2000, we did what’s called log analysis: at the end of reading through all the pages in the database, we have to look and see what transactions happened during the time we were reading the pages, because those could have introduced apparent inconsistencies, and so what we have to do is internally run recovery on all that transaction log. So, if you put through a whole bunch of update activity, you are going to generate tons of transaction log, and the internal recovery of all that transaction log is going to take a long time.
Richard Campbell: That’s because you’re doing transactions while CHECKDB is going on.
Paul Randal: Now, CHECKDB is online, so you can do transactions, but if you’ve got a really heavy transaction workload, then it is going to take a lot longer. Then there are the throughput capabilities of the I/O subsystem; that one is obvious. We are going to read all these pages, so we are going to generate tons of I/O. So, if your I/O subsystem can only do, I don’t know, say, 10 megabytes a second, it’s just going to take a long, long time to do a 10-terabyte database.
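To put rough numbers on that: at 10 megabytes a second, one pass over a 10-terabyte database works out to

    10 TB ≈ 10,000,000 MB
    10,000,000 MB ÷ 10 MB/s = 1,000,000 seconds ≈ 278 hours ≈ 11.5 days

so the raw throughput of the I/O subsystem puts a hard floor under the run time.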
Richard Campbell: Yeah, and if it’s a problem, it’s going to show up in disk queuing. It’s very, very busy.
Paul Randal: Yeah, your disk-queue lengths are going to be really long. Then, the number of CPUs on the box, and whether you are running Enterprise Edition: on Enterprise Edition, on 2000 and 2005, CHECKDB can run in parallel. The way we do that is we actually use the query processor internally to drive the actions that CHECKDB takes, so the query processor can decide how far to parallelize CHECKDB. So if you have, say, eight CPUs and a moderate load, then the query processor may decide to split CHECKDB over four CPUs. So, if you have a multi-CPU box and it’s not very heavily loaded, then you can get a much faster run time, because CHECKDB is going to parallelize, and it’s parallelizing in terms of threads. It’s also going to parallelize the I/O, so you’ll get better I/O throughput too.
Richard Campbell: So, it’s like you are working multiple tables simultaneously while you are doing this?
Carl Franklin: Can that be a problem though, running with multiple CPUs?
Paul Randal: Oh, absolutely it can. We find that some application vendors, for instance SAP, like to encourage their customers to turn off parallelism while they are running CHECKDB, so they get a more predictable run time and a more predictable load on the server. So there is a documented trace flag, 2528 (it’s documented in the SQL Server Books Online for CHECKDB), that actually prevents CHECKDB from ever going parallel.
Carl Franklin: Cool.
Richard Campbell: And this is only an issue with the Enterprise Edition anyway; any other edition is single-threaded, no matter what?
Paul Randal: Absolutely, yup.
Richard Campbell: Okay.
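A sketch of what that looks like; the trace flag is documented, and the database name is a placeholder:

    -- Trace flag 2528 disables parallel checking by DBCC CHECKDB
    -- (and CHECKTABLE/CHECKFILEGROUP).
    DBCC TRACEON (2528);            -- session scope; add -1 for server-wide
    DBCC CHECKDB (N'YourDatabase');
    DBCC TRACEOFF (2528);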
Paul Randal: So the next factor is the speed of the disks where TempDB is placed; this one is non-obvious. To understand the way that CHECKDB works, think about how you could check the consistency of a very large B-tree. The obvious brute-force method is: you read a page, you see that there is a bunch of links to other pages, so you follow the links, make sure they are correct, and so on and so on; you are doing a kind of brute-force depth-first check. Now think of the I/O characteristics of doing that: you are going to do a whole bunch of random page reads, all over the place, so it’s not very efficient.
Richard Campbell: You can go all over the drive?
Paul Randal: Yeah, you’ll go all over the drive. So what we try to do instead is make our I/Os as efficient as possible. We want to do one single scan across the database, so we actually read everything in allocation order, and we generate what we call facts about things that we have seen. For instance, imagine a very, very simple B-tree with a single root page and a couple of child pages at the leaf level, with links between the various pages. We might read the root page, which is the parent, and generate a fact saying, “Hey, we saw this parent page; this page is the parent of, say, page A at the leaf level and page B at the leaf level.” Then at some later point we may read those leaf-level pages and generate facts about those, saying, “Hey, we saw page A, we saw page B, page A points to page B and vice versa.” So we have these facts about the different pages that we have seen, and we throw all of those through the query processor: we sort them by page ID and so on, get them back out in a sorted list, and do what’s called aggregation on them. So, for instance, when checking the linkage in a B-tree, every page that we see has to have three facts about it: it has to have a parent page pointing to it, it has to have what we call an actual fact, which means we read that page, and it has to have the linkage fact at the leaf level.
Richard Campbell: Right.
Paul Randal: So by generating these facts, we can read the pages in any order, and we can still do all the reconciliation and aggregation to make sure they are consistent.
Richard Campbell: So, more or less, you are reading them in the order they were written, so the read is fast, knowing that eventually it’s all going to match up as if you had read it depth-first…
Paul Randal: Yup.
Richard Campbell: …and if it doesn’t, then you have found a problem.
Carl Franklin: Absolutely, it’s about indexing and getting it right. Do you find that it’s better to defrag first, before you run CHECKDB?
Paul Randal: It makes no difference at all, because we read in allocation order. What we actually do is take a bunch of IAM chains (I will explain what those are in a minute if you want) and merge them all together, so we read pages for multiple indexes and tables all at the same time. We generate all these different facts, and because they are keyed by page ID, object ID, and index ID, none of them get mixed up. So, going back to the TempDB disk-speed issue: we are generating all these facts and we have got to store them somewhere, so we store them in a worktable in memory. But if you have a very, very large database, it’s possible that that worktable is going to get bigger than the amount of memory you have, so it has to spill, and it’s going to spill to TempDB. So we have got this fact table spilling out to TempDB, which we are reading from and writing to, so if your TempDB disks are really crappy, then that’s going to slow down CHECKDB performance, okay?
Carl Franklin: Sure.
Richard Campbell: I’m finding TempDB on SQL 2005 boxes is just way busier.
Paul Randal: It is. There is a bunch of extra stuff that uses TempDB. We put out a white paper on working with TempDB in SQL Server 2005 that Sunil Agarwal on my team wrote with a couple of other people; it describes a whole bunch of DMVs that you can use to see what the load is on TempDB, where the allocations are, and so on, so you can do a whole bunch of tuning. In fact, Sunil now has a very good presentation on tuning and troubleshooting TempDB that he’s going to give at SQL Connections in Orlando, for Microsoft Day on the 26th of March, so you want to check that out.
Carl Franklin: Awesome.
Richard Campbell: And we will be there too.
Carl Franklin: We will be there, that’s why we’re not doing anything that day as a matter of fact.
Richard Campbell: And my little acronym police call: DMV is Dynamic Management View.
Paul Randal: Absolutely, yup.
Richard Campbell: Good.
Carl Franklin: Thanks Richard.
Richard Campbell: Because he slips them in there fast.
Carl Franklin: Yeah.
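One example of the DMVs that white paper covers; a sketch against the documented sys.dm_db_file_space_usage view (spilled worktables, like CHECKDB’s fact storage, count as internal objects):

    -- How tempdb's pages are being used, summed across its files:
    SELECT SUM(user_object_reserved_page_count)     AS user_object_pages,
           SUM(internal_object_reserved_page_count) AS internal_object_pages, -- worktables live here
           SUM(version_store_reserved_page_count)   AS version_store_pages,
           SUM(unallocated_extent_page_count)       AS free_pages
    FROM tempdb.sys.dm_db_file_space_usage;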
Paul Randal: Okay, three more things affect the speed of CHECKDB. Complexity of the database schema: I described how we generate all these different facts about different things in the database; the more complex the schema is, the more things we have to generate facts about, and the more checks we have to run. For instance, I said earlier that the non-clustered index checking algorithm can take up to 30% of the CPU; if you don’t have any non-clustered indexes, then we are not going to run that algorithm, okay? The same goes for a bunch of other things.
Richard Campbell: But if you don’t have any non-clustered indexes, you’ve probably got problems.
Paul Randal: You’ve probably got problems anyway, so…
Carl Franklin: Is there any way to quantify, and I know everything is relative, but is there any way to quantify how complex a given schema is? Is it possible to overdo it?
Paul Randal: Well, it’s not really possible to say what’s too complex or not; it depends on what your application needs. It is possible to go overboard, though. For instance, there’s a bunch of new features in SQL Server 2005, such as partitioning and row-overflow, which is the ability to have rows that are greater than 8K. I have seen customers change every column in their schema to VARCHAR(8000), or VARCHAR(MAX) for instance, so that rows can spill over, and they go nuts and have hundreds of partitions. Think about the kind of range-scan performance that we touched on earlier, when Kimberly was so nice to join us. Range-scan performance doesn’t just depend on fragmentation, okay? If you have rows that are over 8K, then one of the columns has been pushed off into overflow storage, so if you are doing a range scan, the read-ahead for the index or table the rows live in may be wonderful, but if you are having to pull in column values from row-overflow storage, then each one of those values is a random I/O.
Richard Campbell: Right, because it doesn’t know how long it is.
Carl Franklin: Sure.
Paul Randal: And it doesn’t know where it is on the disk, so it is very, very difficult to do read-ahead on these randomly placed things, okay? So you may think you’ve got perfect fragmentation, but your range-scan performance still sucks, because you have allowed larger-than-8K rows, and of course there’s no way to predict which rows are going to have columns that are off-row. The column that gets pushed off-row for one row may be different from the column that gets pushed off-row for another row. So you have got to be careful: even though there are all these new features, you don’t just use them for the hell of it. You have got to …
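A hypothetical illustration of the schema pattern being warned about here: widen the columns enough and a single row exceeds the roughly 8K in-row limit, so one column’s value gets pushed into row-overflow storage.

    -- Two wide varchar columns; together a row can exceed the ~8,060-byte in-row limit.
    CREATE TABLE dbo.WideRows (
        id INT IDENTITY PRIMARY KEY,
        a  VARCHAR(8000),
        b  VARCHAR(8000)
    );

    -- About 10,000 bytes of data: one column's value goes off-row, and a later
    -- range scan pays a random I/O to fetch it back.
    INSERT INTO dbo.WideRows (a, b)
    VALUES (REPLICATE('x', 5000), REPLICATE('y', 5000));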
Richard Campbell: This is traditional Microsoft behavior though. Give us enough rope to hang ourselves with...
Paul Randal: Yeah, I know. One of the things that we don’t do very well, and we are working on it, is giving really good guidelines on what to use and when, and best practices and so on.
Richard Campbell: It’s always remarkable how much it’s your fault when they misuse your tool.
Paul Randal: Yeah, I mean, we give people nice spangly features to use, and we don’t give them a whole bunch of information on when they should really use them, when not to use them, and what the tradeoffs are.
Richard Campbell: This will punish you if you do dumb things with it.
Carl Franklin: Sounds like going to a pharmacy and getting warnings on the bottle: “Warning! Don’t be dumb.”
Paul Randal: Yeah. Do not take an overdose; this is going to kill you, kind of thing.
Richard Campbell: Yeah.
Paul Randal: So that’s complexity of the database schema, right; what else? Oh, which options you specify. I said there’s the PHYSICAL_ONLY option, which cuts out all the logical checks and makes things run faster; that’s a good example. There’s also an option that turns off just the non-clustered index checks, which is NOINDEX; use that and it cuts down the run time too.
Richard Campbell: But you are also cutting down the validation of CHECKDB?
Paul Randal: It is cutting down the validation, but if you use PHYSICAL_ONLY and you have got page checksums turned on, then you are going to find any hardware-caused corruption, as long as the page checksums have been written. And that’s another thing; keep that on the stack and I will talk about page checksums in a minute.
Carl Franklin: All right.
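A sketch of the two things mentioned here, with a placeholder database name: the NOINDEX option skips the non-clustered index checks, and page checksums are enabled per database so that a physical-only run has something to verify.

    -- Skip the non-clustered index checks (faster, but less coverage):
    DBCC CHECKDB (N'YourDatabase', NOINDEX);

    -- Ensure page checksums are written from now on, so physical-only
    -- runs can catch hardware-caused corruption:
    ALTER DATABASE YourDatabase SET PAGE_VERIFY CHECKSUM;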
Paul Randal: And then, I guess, the last thing is the number and type of corruptions that we find. There are some algorithms that don’t get run unless we find corruptions; for instance, the non-clustered index checking algorithm. Instead of going and doing the lookup in the non-clustered index to see whether each row really exists, we use a kind of complicated hashing algorithm; it’s the same kind of fact-generation algorithm that I described before. So if we find that there are any mismatched facts, there’s no way to match those back to a particular index or table row.
Richard Campbell: All you know now is that there’s something wrong.
Paul Randal: Something is wrong somewhere. So what we have to do is rescan all the rows again, looking for rows that match that mismatched fact, and then we do what we call the deep dive, where we actually go and do the exact lookup in the table or index for the row that we think is missing. That algorithm is extremely expensive to run, because you are doing a whole bunch of random I/Os.
Richard Campbell: But that’s also the algorithm you wanted run-you have now found a problem.
Paul Randal: We have now found a problem, and the thing is, if we find a problem, it could trigger one of these long-running algorithms.