This post is dedicated to David Newman.
For the second time in the last three weeks, I’ve come across an older model 3ware 9550 controller that was causing the system to lurch under load with high. Presumably due to high await
times that I know are common with that model of controller.
[root@foo ~]# lsscsi [0:0:0:0] disk AMCC 9550SXU-4L DISK 3.08 /dev/sda [0:0:1:0] disk AMCC 9550SXU-4L DISK 3.08 /dev/sdb
Unfortunately, on el5.x systems sar
does not by default log await
data so I can’t tell how bad the problem has been historically. At least sar
can show us that the system has had a high iowait
percentage over the last several hours. Something else to note about el5.x is that tuned
is not available so we need to tune I/O things directly instead of via a tuned
profile.
[root@foo ~]# date Fri Mar 8 16:21:12 MST 2013 [root@foo ~]# sar -s 12:00:00 Linux 2.6.18-164.el5 (foo.example.org) 03/08/2013 12:00:01 PM CPU %user %nice %system %iowait %steal %idle 12:10:19 PM all 1.88 0.00 7.09 69.10 0.00 21.93 12:20:16 PM all 2.17 0.00 7.19 72.27 0.00 18.37 12:30:18 PM all 1.98 0.00 7.27 69.64 0.00 21.10 12:40:16 PM all 1.88 0.00 7.61 83.86 0.00 6.65 12:50:01 PM all 10.69 0.00 5.55 42.02 0.00 41.74 01:00:02 PM all 13.76 0.00 4.62 23.19 0.00 58.43 01:10:03 PM all 11.52 0.00 5.32 31.08 0.00 52.08 01:20:01 PM all 11.39 0.00 3.75 19.26 0.00 65.59 01:30:02 PM all 10.11 0.00 2.84 14.51 0.00 72.54 01:40:02 PM all 11.26 0.00 3.45 16.37 0.00 68.91 01:50:01 PM all 13.31 0.00 3.13 23.15 0.00 60.41 02:00:01 PM all 13.41 0.00 3.34 21.79 0.00 61.45 02:10:02 PM all 11.70 0.00 2.85 16.74 0.00 68.71 02:20:02 PM all 7.09 0.00 3.27 26.90 0.00 62.73 02:30:02 PM all 9.59 0.00 3.31 27.77 0.00 59.34 02:40:01 PM all 11.54 0.00 3.30 30.34 0.00 54.82 02:50:02 PM all 9.36 0.00 2.75 28.55 0.00 59.35 03:00:01 PM all 9.56 0.00 2.66 28.71 0.00 59.07 03:10:02 PM all 9.19 0.00 2.70 29.36 0.00 58.75 03:20:02 PM all 10.05 0.00 3.12 29.17 0.00 57.67 03:30:01 PM all 10.60 0.00 2.67 27.90 0.00 58.83 03:40:02 PM all 9.00 0.00 3.42 29.89 0.00 57.69 03:50:02 PM all 10.17 0.00 2.57 29.13 0.00 58.13 04:00:01 PM all 8.88 0.00 2.71 28.05 0.00 60.36 04:10:01 PM all 7.03 0.00 1.78 25.86 0.00 65.34 04:20:01 PM all 6.69 0.00 3.07 22.06 0.00 68.19 Average: all 8.98 0.00 3.90 33.35 0.00 53.76
Spot checking the system with iostat -x sda sdb 1 100
does show high await
times.
avg-cpu: %user %nice %system %iowait %steal %idle 0.00 0.00 0.75 24.25 0.00 75.00 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 0.00 0.00 2.00 0.00 24.00 12.00 0.38 291.50 191.00 38.20 sdb 16.00 9742.00 15.00 522.00 3592.00 98904.00 190.87 139.19 303.95 1.86 100.10
Lets look at how the arrays are setup with tw_cli
.
//foo> show Ctl Model Ports Drives Units NotOpt RRate VRate BBU ------------------------------------------------------------------------ c0 9550SXU-4LP 4 4 2 0 1 1 - //foo> /c0 show Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy ------------------------------------------------------------------------------ u0 SINGLE OK - - - 931.312 ON OFF u1 RAID-5 OK - - 64K 3725.27 ON OFF Port Status Unit Size Blocks Serial --------------------------------------------------------------- p0 OK u0 931.51 GB 1953525168 5QJ00G2V p1 OK u1 1.82 TB 3907029168 9WM4X8NK p2 OK u1 1.82 TB 3907029168 9WM4YTLA p3 OK u1 1.82 TB 3907029168 9WM4TY84
Note that there is no BBU installed but the Cache is enabled. This is a global flag for both read and write caching. While this appears risky from a data protection during lower loss perspective, I know that the controller will by default not enable write caching without a BBU. Write caching is a critical feature of any RAID controller to speed up flush operations and it’s particularly critical on this model of controller which was underpowered even when it was released. We need to try to squeeze everything we can out of this model of controller and lets be honest, you shouldn’t be keeping your only copy of critical data on a decade old RAID controller.
We also want to check that the “Storsave Policy” is set to perform
. This will override the the controller disabling write caching with a BBU being installed. This is dangerous but
//foo> /c0/u0 show storsave /c0/u0 Storsave Policy = protection //foo> /c0/u1 show storsave /c0/u1 Storsave Policy = protection //foo> /c0/u0 set storsave=perform Warning: Unit /c0/u0 storsave policy is set to Performance which may cause data loss in the event of power failure. Do you want to continue ? Y|N [N]: y Setting Command Storsave Policy for unit /c0/u0 to [perform] ... Done. //foo> /c0/u1 set storsave=perform Warning: Unit /c0/u1 storsave policy is set to Performance which may cause data loss in the event of power failure. Do you want to continue ? Y|N [N]: y Setting Command Storsave Policy for unit /c0/u1 to [perform] ... Done. //foo> /c0/u0 show storsave /c0/u0 Storsave Policy = performance //foo> /c0/u1 show storsave /c0/u1 Storsave Policy = performance
We also want to make sure that SATA NCQ is enabled.
//foo> /c0/u0 show qpolicy /c0/u0 Command Queuing Policy = on //foo> /c0/u1 show qpolicy /c0/u1 Command Queuing Policy = on
That’s about all we can do on the controller itself. Now we need to look at the kernel block device parameters.
The usual suspects:
[root@foo ~]# cat /sys/block/sd[ab]/queue/scheduler noop anticipatory [deadline] cfq noop anticipatory [deadline] cfq [root@foo ~]# cat /sys/block/sd[ab]/queue/max_hw_sectors_kb 128 128 [root@foo ~]# cat /sys/block/sd[ab]/queue/max_sectors_kb 128 128 [root@foo ~]# cat /sys/block/sd[ab]/queue/read_ahead_kb 128 128 [root@foo ~]# cat /sys/block/sd[ab]/queue/nr_requests 128 128
I see that I’ve touched this host before and already changed the block I/O scheduler from the default of cfq
to deadline
. The maximum possible size block requests are already being used by default with this controller. Thee read ahead size is very small and nr_requests
could use some tuning. The typical recommendation from 3ware/LSI is set the read ahead to something like 16MB and to change nr_requests
to a larger value like 256 or 512. However, I have a lot of experience with this model or card and no that, counter intuitively, reducing the size of the request queue improves performance. I can speculate that this has something to do with how the card decides to reorder requests but I honestly don’t know why this improves interactivity to the degree that it does. Via Google, I turned up this page with lots of links to articles discussing this model of controller http://makarevitch.org/rant/raid/.
I’m going to adjust the the parameters I called out needed adjustment via entries in /etc/rc.local
.
echo "deadline" > /sys/block/sda/queue/scheduler echo "deadline" > /sys/block/sdb/queue/scheduler # blockdev --setra takes a count blocks blockdev --setra 32768 /dev/sda blockdev --setra 32768 /dev/sdb echo 32 > /sys/block/sda/queue/nr_requests echo 32 > /sys/block/sdb/queue/nr_requests
Note: it’s a mystery to me as to why you can’t use echo
values into /sys/block/X/queue/read_ahead_kb
Those commands also need to be cut’n’paste into a root shell to apply then or rc.local
needs to be sourced via . /etc/rc.local
. The later is preferred as it makes sure you’ve setup that file correctly.
[root@foo ~]# . /etc/rc.local [root@foo ~]# cat /sys/block/sd[ab]/queue/scheduler noop anticipatory [deadline] cfq noop anticipatory [deadline] cfq [root@foo ~]# cat /sys/block/sd[ab]/queue/read_ahead_kb 16384 16384 [root@foo ~]# cat /sys/block/sd[ab]/queue/nr_requests 32 32
At this point, the iowait
percentage has dropped seriously since I started working on this system. I’d like to take credit for it but there’s no many reads/writes happening so I’m not going to be able to see any real difference in await
values until the work load goes back up.
The last thing I’m going to do is adjust sar
so that await
values get logged and change the polling interval from 10 minutes to 1 minute.
Pre-modification:
[root@foo ~]# cat /etc/sysconfig/sysstat # How long to keep log files (days), maximum is a month HISTORY=7 [root@foo ~]# cat /etc/cron.d/sysstat # run system activity accounting tool every 10 minutes */10 * * * * root /usr/lib/sa/sa1 1 1 # generate a daily summary of process accounting at 23:53 53 23 * * * root /usr/lib/sa/sa2 -A
Adjusted:
[root@foo ~]# cat /etc/sysconfig/sysstat # How long to keep log files (days), maximum is a month HISTORY=7 SADC_OPTIONS="-d" [root@foo ~]# cat /etc/cron.d/sysstat # run system activity accounting tool every 10 minutes * * * * * root /usr/lib/sa/sa1 1 1 # generate a daily summary of process accounting at 23:53 53 23 * * * root /usr/lib/sa/sa2 -A
Then run the sa1 script manually to make sure the config file doesn’t contain an error.
[root@foo ~]# /usr/lib/sa/sa1 1 1 [root@foo ~]#
Argh – it looks like -d does not cause sa1
/ sadc
to log await
values on el5.x. I will update this post if I find a resolution.