Tuning the 3ware 9550 series of SATA RAID controllers on Linux


This post is dedicated to David Newman. 🙂

For the second time in the last three weeks, I’ve come across an older model 3ware 9550 controller that was causing the system to lurch under load, presumably due to the high await times that I know are common with that model of controller.

[root@foo ~]# lsscsi
[0:0:0:0]    disk    AMCC     9550SXU-4L DISK  3.08  /dev/sda
[0:0:1:0]    disk    AMCC     9550SXU-4L DISK  3.08  /dev/sdb

Unfortunately, on el5.x systems sar does not log await data by default, so I can’t tell how bad the problem has been historically. At least sar can show us that the system has had a high iowait percentage over the last several hours. Something else to note about el5.x is that tuned is not available, so we need to tune I/O parameters directly instead of via a tuned profile.
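
For reference, you can check whether an existing sar data file already contains block device statistics with something like the following (a sketch; saNN here is a placeholder for a day-of-month data file under /var/log/sa):

sar -d -f /var/log/sa/saNN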

[root@foo ~]# date
Fri Mar  8 16:21:12 MST 2013
[root@foo ~]# sar -s 12:00:00
Linux 2.6.18-164.el5 (foo.example.org)  03/08/2013

12:00:01 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
12:10:19 PM       all      1.88      0.00      7.09     69.10      0.00     21.93
12:20:16 PM       all      2.17      0.00      7.19     72.27      0.00     18.37
12:30:18 PM       all      1.98      0.00      7.27     69.64      0.00     21.10
12:40:16 PM       all      1.88      0.00      7.61     83.86      0.00      6.65
12:50:01 PM       all     10.69      0.00      5.55     42.02      0.00     41.74
01:00:02 PM       all     13.76      0.00      4.62     23.19      0.00     58.43
01:10:03 PM       all     11.52      0.00      5.32     31.08      0.00     52.08
01:20:01 PM       all     11.39      0.00      3.75     19.26      0.00     65.59
01:30:02 PM       all     10.11      0.00      2.84     14.51      0.00     72.54
01:40:02 PM       all     11.26      0.00      3.45     16.37      0.00     68.91
01:50:01 PM       all     13.31      0.00      3.13     23.15      0.00     60.41
02:00:01 PM       all     13.41      0.00      3.34     21.79      0.00     61.45
02:10:02 PM       all     11.70      0.00      2.85     16.74      0.00     68.71
02:20:02 PM       all      7.09      0.00      3.27     26.90      0.00     62.73
02:30:02 PM       all      9.59      0.00      3.31     27.77      0.00     59.34
02:40:01 PM       all     11.54      0.00      3.30     30.34      0.00     54.82
02:50:02 PM       all      9.36      0.00      2.75     28.55      0.00     59.35
03:00:01 PM       all      9.56      0.00      2.66     28.71      0.00     59.07
03:10:02 PM       all      9.19      0.00      2.70     29.36      0.00     58.75
03:20:02 PM       all     10.05      0.00      3.12     29.17      0.00     57.67
03:30:01 PM       all     10.60      0.00      2.67     27.90      0.00     58.83
03:40:02 PM       all      9.00      0.00      3.42     29.89      0.00     57.69
03:50:02 PM       all     10.17      0.00      2.57     29.13      0.00     58.13
04:00:01 PM       all      8.88      0.00      2.71     28.05      0.00     60.36
04:10:01 PM       all      7.03      0.00      1.78     25.86      0.00     65.34
04:20:01 PM       all      6.69      0.00      3.07     22.06      0.00     68.19
Average:          all      8.98      0.00      3.90     33.35      0.00     53.76

Spot checking the system with iostat -x sda sdb 1 100 does show high await times. Note in the sample below that sdb’s await is roughly 300 ms while its svctm is under 2 ms; requests are spending nearly all of their time waiting in the queue rather than being serviced.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.75   24.25    0.00   75.00

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  2.00     0.00    24.00    12.00     0.38  291.50 191.00  38.20
sdb              16.00  9742.00 15.00 522.00  3592.00 98904.00   190.87   139.19  303.95   1.86 100.10

Let’s look at how the arrays are set up with tw_cli.

//foo> show

Ctl   Model        Ports   Drives   Units   NotOpt   RRate   VRate   BBU
------------------------------------------------------------------------
c0    9550SXU-4LP  4       4        2       0        1       1       -        

//foo> /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    SINGLE    OK             -       -       -       931.312   ON     OFF    
u1    RAID-5    OK             -       -       64K     3725.27   ON     OFF    

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     931.51 GB   1953525168    5QJ00G2V            
p1     OK               u1     1.82 TB     3907029168    9WM4X8NK            
p2     OK               u1     1.82 TB     3907029168    9WM4YTLA            
p3     OK               u1     1.82 TB     3907029168    9WM4TY84            

Note that there is no BBU installed but the Cache is enabled. This is a global flag covering both read and write caching. While this appears risky from a data-protection-during-power-loss perspective, I know that the controller will, by default, not enable write caching without a BBU. Write caching is a critical feature of any RAID controller for speeding up flush operations, and it’s particularly critical on this model of controller, which was underpowered even when it was released. We need to squeeze everything we can out of this model of controller and, let’s be honest, you shouldn’t be keeping your only copy of critical data on a decade-old RAID controller.
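
If the unit cache ever did need to be toggled by hand, the syntax mirrors the storsave commands below (a sketch; I didn’t need to run it here since the Cache column already reads ON):

//foo> /c0/u0 set cache=on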

We also want to check that the “Storsave Policy” is set to perform. This overrides the controller’s default behavior of disabling write caching when no BBU is installed. This is dangerous, but as argued above it’s the only way to get acceptable write performance out of this card.

//foo> /c0/u0 show storsave
/c0/u0 Storsave Policy = protection

//foo> /c0/u1 show storsave
/c0/u1 Storsave Policy = protection

//foo> /c0/u0 set storsave=perform
Warning: Unit /c0/u0 storsave policy is set to Performance which may cause data loss in the event of power failure.
Do you want to continue ? Y|N [N]: y
Setting Command Storsave Policy for unit /c0/u0 to [perform] ... Done.

//foo> /c0/u1 set storsave=perform
Warning: Unit /c0/u1 storsave policy is set to Performance which may cause data loss in the event of power failure.
Do you want to continue ? Y|N [N]: y
Setting Command Storsave Policy for unit /c0/u1 to [perform] ... Done.

//foo> /c0/u0 show storsave
/c0/u0 Storsave Policy = performance 

//foo>  /c0/u1 show storsave
/c0/u1 Storsave Policy = performance

We also want to make sure that SATA NCQ is enabled.

//foo> /c0/u0 show qpolicy
/c0/u0 Command Queuing Policy = on

//foo> /c0/u1 show qpolicy
/c0/u1 Command Queuing Policy = on
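
Had either unit reported the queuing policy as off, it could have been enabled with the same set syntax (a sketch, untested here since both units already report on):

//foo> /c0/u0 set qpolicy=on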

That’s about all we can do on the controller itself. Now we need to look at the kernel block device parameters.

The usual suspects:

[root@foo ~]# cat /sys/block/sd[ab]/queue/scheduler 
noop anticipatory [deadline] cfq 
noop anticipatory [deadline] cfq
[root@foo ~]# cat /sys/block/sd[ab]/queue/max_hw_sectors_kb 
128
128
[root@foo ~]# cat /sys/block/sd[ab]/queue/max_sectors_kb 
128
128
[root@foo ~]# cat /sys/block/sd[ab]/queue/read_ahead_kb
128
128
[root@foo ~]# cat /sys/block/sd[ab]/queue/nr_requests
128
128
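
For convenience, the same parameters can be dumped in one pass with a small loop (a sketch using only the sysfs paths shown above):

for p in scheduler max_hw_sectors_kb max_sectors_kb read_ahead_kb nr_requests; do
    echo "== $p =="
    cat /sys/block/sd[ab]/queue/$p
done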

I see that I’ve touched this host before and already changed the block I/O scheduler from the default of cfq to deadline. The maximum possible block request size is already being used by default with this controller. The read ahead size is very small, and nr_requests could use some tuning. The typical recommendation from 3ware/LSI is to set the read ahead to something like 16MB and to change nr_requests to a larger value like 256 or 512. However, I have a lot of experience with this model of card and know that, counterintuitively, reducing the size of the request queue improves performance. I can speculate that this has something to do with how the card decides to reorder requests, but I honestly don’t know why it improves interactivity to the degree that it does. Via Google, I turned up this page with lots of links to articles discussing this model of controller: http://makarevitch.org/rant/raid/

I’m going to adjust the parameters called out above via entries in /etc/rc.local.

echo "deadline" > /sys/block/sda/queue/scheduler
echo "deadline" > /sys/block/sdb/queue/scheduler
# blockdev --setra takes a count of 512-byte sectors; 32768 sectors == 16MB
blockdev --setra 32768 /dev/sda
blockdev --setra 32768 /dev/sdb
echo 32 > /sys/block/sda/queue/nr_requests
echo 32 > /sys/block/sdb/queue/nr_requests
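
As a quick sanity check (a sketch), blockdev --getra should hand the readahead value back in 512-byte sectors, which corresponds to the 16384 KB that shows up in read_ahead_kb below:

blockdev --getra /dev/sda   # expect 32768 sectors == 16384 KB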

Note: it’s a mystery to me why you can’t simply echo values into /sys/block/X/queue/read_ahead_kb.

Those commands also need to be cut’n’pasted into a root shell to apply them, or rc.local needs to be sourced via . /etc/rc.local. The latter is preferred, as it verifies that you’ve set up the file correctly.

[root@foo ~]# . /etc/rc.local 
[root@foo ~]# cat /sys/block/sd[ab]/queue/scheduler 
noop anticipatory [deadline] cfq
noop anticipatory [deadline] cfq
[root@foo ~]# cat /sys/block/sd[ab]/queue/read_ahead_kb
16384
16384
[root@foo ~]# cat /sys/block/sd[ab]/queue/nr_requests
32
32

At this point, the iowait percentage has dropped considerably since I started working on this system. I’d like to take credit for it, but there aren’t many reads/writes happening right now, so I won’t be able to see any real difference in await values until the workload goes back up.

The last thing I’m going to do is adjust sar so that await values get logged and change the polling interval from 10 minutes to 1 minute.

Pre-modification:

[root@foo ~]# cat /etc/sysconfig/sysstat
# How long to keep log files (days), maximum is a month
HISTORY=7
[root@foo ~]# cat /etc/cron.d/sysstat 
# run system activity accounting tool every 10 minutes
*/10 * * * * root /usr/lib/sa/sa1 1 1
# generate a daily summary of process accounting at 23:53
53 23 * * * root /usr/lib/sa/sa2 -A

Adjusted:

[root@foo ~]# cat /etc/sysconfig/sysstat
# How long to keep log files (days), maximum is a month
HISTORY=7
SADC_OPTIONS="-d"
[root@foo ~]# cat /etc/cron.d/sysstat 
# run system activity accounting tool every 1 minute
* * * * * root /usr/lib/sa/sa1 1 1
# generate a daily summary of process accounting at 23:53
53 23 * * * root /usr/lib/sa/sa2 -A

Then run the sa1 script manually to make sure the config file doesn’t contain an error.

[root@foo ~]# /usr/lib/sa/sa1 1 1
[root@foo ~]# 

Argh – it looks like -d does not cause sa1 / sadc to log await values on el5.x. I will update this post if I find a resolution.
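
For what it’s worth, on later sysstat releases (e.g. the 9.x series that ships with el6), collecting with SADC_OPTIONS="-d" does make per-device await available; it can then be displayed with something like:

sar -d -p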
