Friday, December 10, 2010

My top 10 SQL Perfmon Counters.

Feleryan on Twitter asked this morning what our top 15 perfmon counters for SQL were. Since my ADD makes 15 too high a number of things to monitor at once, without going completely mental, I thought I'd just go ahead and share my top ten. As usual, any feedback or suggestions are welcome.

Now, for a lot of these counters, I prefer to measure against a baseline. How do you know that a consistent 30% processor time for a given CPU isn't normal? Doesn't mean it's good, but neither does that mean it's out of the norm. In the absence of a baseline, there are general recommendations, so I'll note those.

The first four counters are system-general, meaning things outside of SQL Server could potentially be affecting these counters, especially if the server itself is being shared with processes other than SQL. The rest are SQL-specific.
  1. CPU: % processor time - Total. Gives a good overall impression of how hard the server is working, and how heavy the query load is. MS recommends watching anything over 30%, but anything over 50% usually grabs my attention.
  2. Memory: Available MBytes - I like to have at least a few hundred MB available in RAM for burst processing, but I'm comfortable with anything over 150 MB. MS says anything over 100MB should be fine.
  3. Physical Disk: Avg. Sec\Read (and Write) - A good indication of how fast the disks are responding to read/write requests. In addition to this, I also check the io_stall counters in sys.dm_io_virtual_file_stats view to see how the files are performing. MS says that these counters are considered 'good' if the average is 8ms (.008) or less, but if the counters are around .020, which is considered high, and io_stall is low, then I wouldn't sweat it.
  4. Network Interface: Bytes Total\sec - A quick indication of how much traffic is being passed through the NIC. You'll want to keep this proportional to how busy SQL is. Again, this is one where having a baseline really helps, but I'd start being concerned when more than 60% of the available bandwith is being used. This is a good place to quick-tune, if you can dedicate a NIC to backup or other IO intensive operations. If you're using iSCSI, see if you can dedicate a NIC to storage, as well.
  5. SQL Server: Server Statistics, Batch Requests \ Sec. - Again, baseline is crucial here, but anything over 1000 should give you pause to look at the server a little closer and see how it is performing overall. I have servers that regularly get up in the 2000 - 2500 range with no issues, and some that start choking around 500.
  6. SQL Server: Access Methods, Full Scans \ Sec. - How often table scans are being performed. Here's where knowing your data a bit helps. Do you have a lot of small tables? If so, this counter may be artificially high. Lots of "LIKE '%value%'" queries? Again, this obviates indexing, and can also inflate this counter. However, this counter works best when paired with...
  7. SQL Server: Access Methods, Index Searches \ Sec. - You're looking for a high ratio of index seeks to table scans, with the recommended value for this counter being 1000 for every Full Scan. So a 1000\1 ratio. I agree completely with this one.
  8. SQL Server: Buffer Manager, Page Life Expectancy - The number of seconds a page is expected to stay in cache. You want this as high as possible, but anything under five minutes (300 seconds) is the default cause for alarm. I tend to start paying attention at around 15 minutes, or 900 seconds. This one usually only shows up as a significant issue on extremely busy systems, so if you see this issue - get to work.
  9. SQL Server: Buffer Manager, Lazy Writes \ Sec. - If you're getting a lot of buffer flushes to disk every second, with the default alarm level being more than 20, then it could indicate that you need more memory in the system. This should be as near 0 as possible on most systems.
  10. SQL Server: Buffer Manager, Buffer Cache Hit Ratio - What you want is for this counter to be as near 100% as possible. For this counter, look for big changes. 99.8 to 99.3 is not a big deal, but a change from 99.8 to 75.4 is worth investigating. Some people also note that if you're checking this counter, you should be checking Page Life Expectancy as well, since a large change in PLE could only show up as a small change in Cache Hit Ratio or vice/versa.
That's my ten. Please feel free to comment below, especially if you disagree with anything here - always looking to learn something new! :-)

Thanks,
-David.

2 comments:

Dirk Hondong (@Feleryan) said...

Hi David,

thank you for this really fast blog post as a response to my question.
The SQL Server community is awesome.

These informations are a good starting point.

Regards
Dirk

D said...

Glad to be of service. :-)