Hi,
I have a server on which resides a fairly visited web app.
It has a raid1 of 2 HDDs, 64MB Buffer, 7200 RPM.
Today it started throwing out errors like:
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
kernel: ata2.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
kernel: ata2.00: status: { DRDY }
kernel: ata2: hard resetting link
kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
kernel: ata2.00: max_sectors limited to 256 for NCQ
kernel: ata2.00: max_sectors limited to 256 for NCQ
kernel: ata2.00: configured for UDMA/133
kernel: sd 1:0:0:0: timing out command, waited 7s
kernel: ata2: EH complete
kernel: SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
kernel: sda: Write Protect is off
kernel: SCSI device sda: drive cache: write back
All day long it has been in load higher than 10-15.
I am monitoring it with atop and it gives some bizarre readings:
DSK | sda | busy 100% | read 2 | write 208 | KiB/r 16 | KiB/w 32 | MBr/s 0.00 | MBw/s 0.65 | avq 86.17 | avio 47.6 ms |
DSK | sdb | busy 1% | read 10 | write 117 | KiB/r 17 | KiB/w 5 | MBr/s 0.02 | MBw/s 0.07 | avq 4.86 | avio 1.04 ms |
I frankly don't understand why only sda is taking all the hit. I do have one process that is constantly writing with 1-2megs but what the hell.. 100% iowait?
Update:
smartctl -A /dev/sda
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 239 239 021 Pre-fail Always - 8050
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 22
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 595
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 21
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 20
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 22
194 Temperature_Celsius 0x0022 118 106 000 Old_age Always - 32
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
-
Looks like a faulty disk to me...
w00t : I have updated my primary post with some smartctl output. As you may see, no errors/pending sectors. It is also a fairly new drive (~1 month old)user6738237482 : Drive, cable, or controller is bad on SDA. You can see that the drive is going offline, requiring a hard reset to bring it back online. Fail the drive in mdadm and your IO wait problems will disappear. Then, replace SDA with another drive. Check your smartctl numbers against SDB to see if the startstop count on SDB is around 22.Jeff Hengesbach : The fact the drive is fairly new actually raises suspicion that it is bad or that maybe it wasn't firmly plugged in.w00t : Start/stop count for sda is 26 and for sdb 8. I thought about failing sda, but then again I need redundancy, and if I re-add sda it would take fairly much to resync.From Massimo -
It looks like the drive may be failing - the system is resetting the drive / connection trying to get through to it. I'm guessing since you can see the status of both sda and sdb it is a software raid so you should be able to check /proc/mdstat to see what may be going on with the software raid.
The IOWait is because the software raid is being held up waiting to write to sda. This in turn is holding up processes and causing the high load number.
You should replace your sda ASAP.
w00t : Yes, software raid. I checked mdstat and they are in sync, none was kicked off from the array. It is weird because I thought I should see md1_raid1 in top as it eats my resources away, but it is kind of idle. Instead, the first 3 apps in top are my web ones - httpd, java, varnish. Also, smartctl reports no errors whatsover.Jeff Hengesbach : It would be freak to be a cable but it is possible. It is probably not the controller since sdb is still fine. Check the cable connections to be sure they are firm - both ends - even unplug and replug if you have access and the downtime to spare. For the cost of a drive and the value of your data I'd still look for a replacement - you'll need one sooner or later!user6738237482 : top reports highest CPU by default. A wait for IO doesn't need to be cpu intensive and may not show up on top. The process itself would probably be in the D state (Disk wait), but, would consume very little CPU.w00t : I was about to mention if a faulty cable could be guilty. I will replace the current cable and see if anything changes.w00t : I changed the SATA cable and then I moved the HDD to another slot as that one seemed suspicious (it took some time to detect my HDD drives) and afterward it was ok. I now have in atop: sda busy 55%, sdb busy 23%. So I still don't think it is entirely solved.w00t : I changed my mind, I see that read/writes are fairly equal - sda 153 sdb 143 for writes and sda 32 sdb 25 for reads with some spikes along the way, but I guess that's normal given the fact that sdb has to synchronize after sda. Right?From Jeff Hengesbach -
sda is not processing I/O requests promptly. If it is not the drive, then you have an imbalance in I/O. Do you have a swap partition on sda? If so you could have memory issues resulting in thrashing of memory pages in and out of swap. Make sure you have swap space allocate on both drives at the same priority. The program sar may be helpful.
w00t : Yes, I do have a swap partition set up. But there is no swapping being made. In fact, I am now watching munin graphs and no swapping has been made for 2 days, which is kind of weird because the history shows a constant use of swap.BillThor : What does sar say about paging? I/O by partition? Look at munin memory for unused and cache. Both should have a fair bit of memory allocated.From BillThor
0 comments:
Post a Comment