Thursday, January 14, 2010

RAID Volume Rebuild

How To: Steps to take on a failed RAID-1 array, where one drive appears to have failed. This article assumes you have previously installed an Intel Desktop mirrored RAID array with two disks.

Related Article: Desktop RAID-1 Mirrors - Installation

Note: Many searches arriving at this article are asking this question: "Can you reboot a workstation while the RAID is rebuilding?" The short answer is yes, but the RAID rebuild will start over and the drives are at risk. It is best to leave the computer on. See below for other details.

The other question: "Can I use the computer while the RAID is rebuilding?" Yes, with no restrictions.



When a desktop's mirrored RAID reports a "degraded" or, more commonly, a "failed" drive, it sounds like a horrible calamity, when in fact it is probably a minor event -- but it may still require your attention. In a RAID-1 array, the computer has two "mirrored" hard drives, where one drive is a copy of the other. If either fails, the other will continue operating as if nothing were wrong. When a RAID reports a failure, one of the drives is usually out of sync or, more rarely, has physically failed.


When a RAID fails, it is seldom a physical failure


Reasons:

Usually an array goes south after a power failure or an operating system crash (which is why I had to write this article). Opponents of RAID would say, "See, this is why you shouldn't bother with RAID; half the time it is the problem." I counter that with only one disk, the odds of corrupting it are much higher, and that rebuilding the RAID is easy. When the computer goes down hard, logical damage to the disk is likely.

But almost always there is no physical damage -- usually one of the drives just became "out-of-sync" with the other.

Symptoms:

The Windows System Tray shows a failed RAID, with one of several possible messages. For example, an Intel RAID controller, found on almost all desktop computers from the past several years, shows a two-disk drive icon (the mirror), where one disk is red.

This illustration shows a failed array that is in the process of rebuilding
  • If it shows "Degraded," it is in the process of rebuilding and there is nothing you need to do but wait. I would leave the computer running while it rebuilds. You can still work, open files and surf the net while this is happening.
  • A "Failed" drive requires action on your part; follow the steps outlined in this article.

Steps:

1. In the System Tray (you may need to show hidden icons), choose the RAID icon.

If you do not have this icon, look in the Start Menu for "Intel Matrix Storage Manager" or "Intel Rapid Storage Manager." You can confirm the program is installed by looking in the Control Panel's 'Programs and Features' (Add/Remove Programs). (Since this article was written, this utility was renamed from "Intel Matrix Storage" to "Intel Rapid Storage Technology"; see the download link at the end of this article.)

2. "Mark the failed drive": Once the panel opens, drill down the tree to locate the failed drive.

"Other-mouse-click" and choose "Mark as Normal".

This should start the rebuild. The System Tray icon will show "A RAID volume is being rebuilt. Data redundancy is being restored."

3. Watch the progress by opening the Volumes folder (illustration above, locating the Yellow icon). Other-mouse-click and choose "Show Rebuild Progress".

The indicated time is a reasonable estimate; a 1-terabyte drive can take about 1.5 hours to rebuild.
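
As a rough sanity check on the estimate, a rebuild copies the whole drive, so the time is roughly the capacity divided by the sustained copy speed. The Python sketch below is only an illustration: the function names and the 100 MB/s figure are assumptions made up for this example, not values reported by the Intel utility.

```python
def implied_throughput_mb_s(capacity_bytes: float, hours: float) -> float:
    """Average copy speed (MB/s) implied by rebuilding a drive in the given time."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return capacity_bytes / (hours * 3600) / 1e6


def rebuild_hours(capacity_bytes: float, throughput_mb_s: float) -> float:
    """Hours needed to copy the whole drive at a steady speed (MB/s)."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    return capacity_bytes / (throughput_mb_s * 1e6) / 3600


ONE_TB = 1e12  # a decimal terabyte, as drive makers count it

# Speed implied by the 1.5-hour rebuild mentioned above.
print(round(implied_throughput_mb_s(ONE_TB, 1.5)))  # 185

# Assumed example: the same drive at a steady 100 MB/s.
print(round(rebuild_hours(ONE_TB, 100.0), 1))  # 2.8
```

If the utility's estimate is far beyond what numbers like these suggest, something other than raw copy speed is probably slowing the rebuild.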


While Rebuilding - Rebooting, etc.:

While the RAID is rebuilding, you can still use the computer as you would normally, but it is best (but not required) that you leave the machine powered on until the rebuild completes. During this time, the computer may behave a little slower and there will be much disk activity.

If you shut down the computer, the rebuild starts over on the next boot. While the array is degraded, the remaining drive holds the only good copy of your data, so a failure of that drive, however unlikely, would be serious. If power problems caused the RAID to fail, be careful about stopping the rebuild prematurely.

If the computer goes to sleep, the rebuild is also suspended. Consider disabling sleep and screen savers during the rebuild.


Performance Improvement: Hard Disk Cache

If the RAID is rebuilding too slowly (more than 4 hours), check the local Hard Disk Cache setting.
  • From the Storage Management Console, open the Volumes folder.
  • Other-mouse-click "SysRaid" and choose "Enable Volume Write-Back Cache". This change can be made on the fly.
  • Alternatively, the same setting can be found in Control Panel, Disk Drives: locate the hard disk "SysRaid", select Properties, and check "[x] Enable Write Caching on the Device."

This setting changed my Rebuild from 4 hours to 1.5 hours. This same setting has other benefits and all disk-IO activity will be faster.

Really Long Rebuild Times:

Before looking at these steps, be sure to see the section directly above.
Some readers have reported 80+ hours for a 1-terabyte rebuild. I am waiting on their replies to see if the cache setting (above) helped; if you have seen this problem, your results are welcome.
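
To judge whether a rebuild is on track, you can extrapolate from the percentage shown so far. This is a minimal Python sketch that assumes the rebuild progresses linearly, which may not hold in practice; the function name is made up for this example.

```python
def remaining_hours(elapsed_hours: float, percent_complete: float) -> float:
    """Estimate hours left, assuming progress continues at the same rate."""
    if elapsed_hours < 0:
        raise ValueError("elapsed_hours cannot be negative")
    if not 0 < percent_complete <= 100:
        raise ValueError("percent_complete must be above 0 and at most 100")
    total_hours = elapsed_hours * 100.0 / percent_complete
    return total_hours - elapsed_hours


# One commenter below reported 0.4% complete after nearly 4 hours.
print(round(remaining_hours(4, 0.4)))  # 996
```

An estimate in the hundreds of hours, as in that case, is a sign that something is wrong rather than a time to wait out.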

I would also wonder if the machine is infected with viruses. Consider disconnecting the network cable, disabling the virus scanner, and trying the rebuild again. If this does not improve the rebuild times, leave the RAID broken while you do a more thorough test for viruses. Consider this Keyliner virus article: Removing Win-Viruses.


More Serious RAID Failures:

Rarely, after a particularly brutal power failure, the motherboard may not be able to detect the "failed" drive, and you may not see it in the Matrix software.

Try these steps:

a. Use the utility to note the failed drive's serial number, then power off the computer and disconnect that drive.

b. Boot the computer, then shut down normally.

c. Re-connect the failed drive and attempt the rebuild again.

(Note: This power-off/disconnect solution has not helped those with 80+ hour rebuilds.)


Recurring Problems:

If the RAID array fails frequently, there could be several causes. Consider the following, where the most common are listed first:
  • Bad power; intermittent power; low-voltage power; get a UPS.
  • A weak PC power supply (replace/upgrade to a larger supply)
  • Failing circuitry on the main motherboard (especially if you have been having power problems)
  • An actual drive failure

If the drive has physically failed (note: I have not seen a real failed drive in many years; it seems much rarer than it was in the mid-to-late '90s), you must replace it with a similar drive of the same capacity (or larger).

The replacement drive does not have to be the same brand and model (although that is my preference, when possible).

RAID is not a Cure-all

Remember, a RAID does not replace backups. It will not protect your data if a virus strikes, the computer is stolen, or the house burns down. This is where an off-site backup is nice to have.

Consider the RAID as insurance against a drive failure (which is admittedly rare). More likely, it is insurance against a disk corrupted by power failures and other hard crashes. As I discovered this week, a power failure caused one drive to drop out while the second survived. The RAID saw the problem and still allowed the computer to boot.

This was the behavior I wanted to see during this type of event. I will never know if a single drive would have survived the outage, but I do know all was safe. I simply re-built the RAID and was back to normal a few hours later with nothing more than a few mouse clicks.


Related articles:
Desktop RAID-1 Mirrors - Installation
Acronis vs Ghost
Maxtor External USB


Links:
Intel Download Page (Choose your operating system and computer for best results)

As of 2010.08:
Intel Download: Rapid Storage Technology (SATA ver 9.6.0.1014 2010.03.23 formerly called Intel Storage Manager. Choose the 'AllOS' version, at the bottom)

25 comments:

  1. Reply to Dave: 9hrs seems very long. I'd let it run and check on it in the morning; what would be the harm?

    If it fails again, try the following: Boot the machine normally (and don't bother to rebuild or cancel if you can). Make a backup of your drive.

    Then electronically remove the failed drive from the raid (exact steps -- I can't recall), re-attach and rebuild.

    If that fails, I've done this:
    Power off and Physically remove the disk.
    Boot the computer (with only the good drive)
    Open the RAID control panel and remove the failed drive, essentially turning the system into a non-raid.
    Boot again to make sure everything works properly.

    Then, re-insert the "failed" drive and rebuild the RAID using my "software" instructions. Obviously, when removing and re-installing disks, power off the computer at appropriate times.

    I wish I had the exact steps documented for this and this is worthy of a second article, using a test machine. In any case, I doubt the drive really failed; these disks are very reliable.

    Write back with your results.

  2. Hi Tim, thanks for the reply. Here's an update on what's going on:

    I tried rebuilding again and things were looking good (but very slow)...I reached 25% and I noticed the music I was playing stopped and became choppy when it came back on. Then I got this error:

    A failure occurred during RAID volume verification. Inserting the original target hard drive back into the system, while it is powered off, will allow the application to automatically restart the migration. If you insert a new drive, you will need to manually select the option to rebuild that drive.

    Well I was hopeful that if I shut down, let the computer sit for a few minutes, and turn it back on, it would resume at 25%...unfortunately not, it started rebuilding back at 0%.

    Question - If I need to rebuild the RAID, will I have to use the backup to restore the data? In other words, if I deleted the faulty drive within the RAID bios and tried to recreate the RAID, would I lose all my data?

    Thanks again for your help. Hopefully I'll have this resolved soon.

    Dave

  3. Tim,

    Every time I try to rebuild through the Intel Matrix Storage Console, I get the error at 25%. This has happened 4 times. I'm going to try to update the hard drive firmware now to see if that helps.

    Question - I'm almost at the point where I need to follow these suggestions of yours:

    If that fails, I've done this:
    Power off and Physically remove the disk.
    Boot the computer (with only the good drive)
    Open the RAID control panel and remove the failed drive, essentially turning the system into a non-raid.
    Boot again to make sure everything works properly.

    Then, re-insert the "failed" drive and rebuild the RAID using my "software" instructions. Obviously, when removing and re-installing disks, power off the computer at appropriate times.

    I went into BIOS with CTRL+I and tried to convert to a non-RAID, and it said the disk would be wiped. Is that the case? I would really like to get through this without having to do a full reinstall. In fact, if it came to that, I would probably scrap the entire RAID idea. It has caused me nothing but trouble so far.

    Thanks,

    Dave

    Replies
    1. It scares me to make a recommendation, but I think the wiped drive will be the mirror, not the original. An image backup of the C drive would be highly recommended.

      Of course, the biggest mistake would be picking drive 2 as the master... I'm a huge fan of an image backup before any RAID work.

      On my own system, the new motherboard surprisingly does not support RAID. I'd still use it if I could.

  4. Sorry. No good answer on why this is. Sounds like a bad cluster on the disk. This is a good reason to pull the disk and do my one suggestion where you boot the machine without the disk; destroy the raid, then re-insert the disk.

    When you destroy the raid, your data will survive on the good disk (remember, the RAID is happening at the hardware layer and there are no data changes on the disk). The only worry here is to be sure you pull the bad disk. The RAID screens will show you the serial numbers of the disks.

    If it fails after this, abandon that disk.

    The backup is just for safety -- data is always worth more than hardware.

    I wish I had more precise steps; I've done these before, but did not document the exact steps. I'm probably going to build a new machine next week and I'll play with the idea.

  5. More: If you have a backup that you trust, you should have no fears. (I use Acronis; be sure to make a full-disk "image" backup, not a file-backup.)

    However, messing with RAIDS with or without a backup is always scary and I understand the nervousness.

    On the other hand, the RAID did its job and you should be able to rebuild from the good disk without a backup. Be sure you rebuild the RAID using the SOFTWARE steps from my other article (assuming an Intel RAID controller). Best of luck to you.

  6. hi Tim, after a moment of panic with a drive failure I came across this article and had a huge sigh of relief. However, with a volume size of 1.4Tb the Volume Rebuild progress on the Intel Matrix Storage Console is at 2% with 85hrs to go and counting. Will this actually take around 2hrs going by your estimate, or for whatever reason will it actually take this long? I am not running anything else on my PC and I have an i7 Quad Core setup with 12 gigs of ram. I hope it will start to speed up!

  7. Rorxy, you are the second person to write with this same problem; I do not have a good answer for you and I hope you write back with your progress. 85hrs is far too long for the RAID to rebuild.

    In another article, I wrote this paragraph -- and I wonder if this would help your problem. Try this out and tell me the results:

    Hard Disk Cache:

    As an aside, the local Hard Disk has a similar setting, which I also enable on my own computers. In Control Panel, Disk Drives, locate the hard disk, select Properties, "[x] Enable Write Caching on the Device." I only recommend doing this on battery-powered laptops and on desktops with UPS protection.

  8. Hi, I am having a similar problem with a Dell Precision workstation running Win7-64bit (on a UPS!). The RAID volume "degrades" randomly, roughly every 1-2 weeks. The array consists of 3x1 Tb drives...a different drive degrades each time. It takes about 45 hours to rebuild. With Win 7, it is not obvious to me how to check write caching. Norton antivirus scans are negative.

    Thanks for any suggestions...Dell says to just keep rebuilding

  9. Brad: My first suspicion would be a weak internal powersupply -- and the UPS will not help there.

    If you have a small 300W powersupply, along with 3 hard drives and a nice video-card, with a dual-core chip, I'd look at the Powersupply first. If the powersupply is larger, it may be weak.

    Unfortunately, this is hard to test without actually trying a new powersupply.

    I say this especially because the drives are randomly failing -- it can't be the drives are malfunctioning.

    In all of my experiences, RAIDS failing like this are always power-related.

  10. Hi,

    Like Brad, I am having frequent (almost once a week) RAID 1 failures that are resolved simply by marking the defective drive as normal and rebuilding.

    Your comment on the power supply being the cause of such failures is really thought provoking. At first I was just researching if I should use a separate RAID card rather than the onboard Intel RAID but now I think I will try out with a new power supply. Problem is, my current power supply is actually 450W and I am only running 2 x 1TB WD HDD in RAID 1, an i7 Quad Core x980 CPU, an entry-level graphics card and 3x2GB RAM. How much power do I need actually?

  11. Yee and Power Needs: I do not know how to calculate the power requirements, but only two drives and a quad isn't that much. I would think 450 is large enough. If you had a high-end video card, then I would begin to suspect the power supply.

    The trouble with my recommendation is this: It is expensive to test, unless you happen to have a spare power supply laying around.

    By chance, is the PC connected to a battery UPS? In my prior experience, each time I printed on the laser printer, the RAID would fail. While this was not a powersupply problem, it definitely was a power issue. If I moved the printer to another circuit, all would be well.

    I've seen vacuum cleaners also trigger power problems.

  12. I too had a Raid 1 Degraded message. Checked the wiring, pulled and reconnected both drive cables. Then when the system booted up I got both drives recognized and a Rebuild message at startup.
    Intel's Rapid Storage Technology app is running, rebuilding the drive, but after nearly 4 hours I am still only at 0.4%, which means I have nearly 800 hours to go if the rebuild is linear.
    At this point I'm ready to kill RAID entirely as it's been nothing but problems (constantly rebuilding, albeit not this slowly), but am unsure how to proceed without losing everything.
    System is a Dell Inspiron, Windows 7, 2 X 256Gb drives in RAID 1 configuration.
    Any suggestions?
    Thanks

  13. Canajun: You are one of several who have responded with the same type of problem. On my computer, fixing the Hard Disk Cache, as described above, did wonders for improving my speed.

    But I believe others, even after making this change, still see slow speeds. It is possible you have an actual bad drive.

    To completely dissolve the Raid, confirm you have a good full-disk-image backup. Then in BIOS, disable the RAID feature, then remove the (failed) drive. The existing drive should work as-before. Remember, with hardware raid, there is nothing on the physical media that tells it is on a RAID.

    I still wonder if you actually have a bad drive?

    Please reply with your actions and what you find.

  14. Tim - Thanks for your response. I expect you're right as I have also started getting Port 0 errors on startup for that one drive. As soon as I can get a full disk image (assume I'll need Norton Ghost or equivalent), I'll pull the RAID and install a new hard drive.
    I'll be sure to post how it goes.
    Thanks again.

  15. Tim - I did have a failed (failing) drive. I did a full disk image backup then disabled raid and removed the bad drive. When I restarted the system, it took about 20 minutes to load, and then, unfortunately, crashed taking the remaining drive down as well.

    Then I found out the recovery application couldn't access the backup files, so I had to rebuild from scratch. Fortunately I was able to recover some of my data from the 2 raid drives after the system was restored, and I had previously backed up critical data files so it wasn't a total loss.

    Suffice it to say I'm done with RAID. I'll handle my own backups from here on in - at least I know what I'm dealing with there.

    Anyway, thanks again for your help and for maintaining this blog.

  16. All I can say is SHAME on INTEL for not allowing a mirrored disk to be created from the BIOS CONSOLE when creating a RAID array from a non-RAID state. If the problem is booting Windows and trying to run the software version of Rapid Storage while in the middle of a crisis, you may be in an unrecoverable situation. I had a situation where Windows would run for about 30 seconds and BSOD. All that effort invested in the somewhat weak "Recovery" mode rather than something much more beneficial, imo. Maybe next release...

  17. Anony Shame: At least on my hardware, I have to disagree with your comment about it being bad that the RAID is defined in BIOS -- I am a strong believer that is exactly where it should be. Software raid is a very dangerous creature, subject to all kinds of failures. By having RAID in hardware, software can be totally corrupted and the RAID will survive. (Remember, RAID is not meant to keep software problems from happening -- if you delete your OS, it will happen on both disks....)

    With Intel, the Windows RAID component is only needed to conveniently monitor the RAID. Without the software, the hardware RAID will still function normally.

    Replies
    1. I think you should have read the comment you replied to twice. In my case I use the Intel controller with a RAID 5, and I have to start Windows to rebuild the RAID. It cannot be rebuilt from the Intel Matrix Storage Manager BIOS. The problem is, while rebuilding the RAID, I get a lot of BSODs. What I have learned is: you shouldn't install Windows on a RAID without having a separate OS on a separate HD in the computer for repairs, unless you know for sure that your RAID BIOS is capable of repairing a damaged RAID on its own.
      PS: Sorry for my bad english.

  18. Aloha Tim: I'm sitting here on Christmas Day reading articles on rebuilding a RAID 1 array, as I have discovered a problem with the uniqueID of one of my two 500GB drives that caused it to go offline some time ago. I found an article describing the ease of using 'diskpart' to fix the uniqueID of the offline drive, but now I'm wondering if I will safely be able to change the ID, reboot, and have the RAID rebuild on its own. I do not have the Intel software tool you refer to, as I am running an AMD Quad Core on a Gigabyte board. I don't know of any real RAID control program, unless it is something included with my Win 7 64-bit OS. Any input on whether it would be safe to proceed by renaming the uniqueID and rebooting to see what happens? The challenge is I do not have a valid backup image and no extra drive or space large enough to create one. Danger - Danger!! :)
    Gary

  19. Gary, An interesting problem. From your separately sent email, I saw more details. Disk Manager did report a duplicate disk-ID conflict: "The disk is offline because it has a signature collision with another disk that is online."

    This does look like a problem with the disk-ID -- I was not even aware this was a changeable-thing. I don't know the answer to your question, but I would surmise the following.

    Your suggestion of changing the disk-id seems reasonable, but you might try this idea first. Normally you can match the drive's disk-ID, as seen in the RAID software or in DiskManager, with a printed label. In your case, I would:

    1. Shut down the computer and match the ID with the printed labels

    2. Locate the (good) drive, the one with the same printed label as the two identical drives.

    3. Then, remove the offending, mismatched drive, leaving the good drive.

    4. Reboot and confirm the operating system boots.

    (If the boot fails, simply swap the two drives and boot from the other. If this does not work, I would be confused.)

    5. Once booted, consider backing up your data in case something goes wrong.

    6. Then, destroy the mirror-raid (e.g. discontinue the RAID (however you do that on your AMD board)). Hopefully, this can all be done in hardware/bios (I hope you are not using a Software Mirror).

    7. Re-attach the (bad) drive and rebuild the RAID, as-if-new.

    If the drive-ID still conflicts, do your rename step.

    Write back with what you find.

  20. Gary, the other thing I noticed on your attached illustration, is your second drive is considerably larger than the first. I have only mirrored two identical drives.

    If your mirror was successful, wouldn't that waste a lot of disk space?

  21. Aloha Tim! - thanks for the input... I will have some time over the next few days to try a few things.
    As to your second reply: I have two 500GB drives, labeled on the printout as Disks #0 and #2. Disk #1 on the printout is my 300GB backup drive, which is empty at the moment. So the two mirrored drives are currently listed as Disk 0 and Disk 2, both the same 500GB. I'll let you know what I discover...

    Mahalo!

    Replies
    1. Aloha Tim - Problem solved!!
      After more research and literally tearing down the system to confirm the location of each of my 4 drives (2-500GB Main & 2-320GB Backup, each set up in RAID1 configurations), I used the cmd.exe DISKPART to rename the drives so that each had its own 'uniqueID'. I downloaded AMD's RAIDXpert management program from the AMD website and proceeded to use it to rebuild each of the two separate RAID arrays. I tested the program on my Backup drive first, and when that went without incident I ran it again to rebuild my primary drive. It took just over 3 hours to rebuild the 500GB main drive and everything seems to be back to normal and 100%! Whew!!!

      Again, thank you for your speedy reply and input.
      Mahalo!
      Gary

  22. Hi Tim,
    Great blog - thanks for keeping this article.
    I've got a RAID 1 of two WD HDDs (640 Mbs) on an ASUS P5Q dlx. Everything was going OK - even though during the years of use (more than 4 years...I'm thinking of an upgrade anyway) there were some instances where I had to restart the system (freezing) and noticed in the right side the icon with the Array rebuilding (checked with the Intel Matrix sw).
    Yesterday though, my system not only froze but it made some annoying repetitive sound (not HDD click) and I had to hard reboot. However, after having to step out right after that, I came back in a few minutes and noticed a black screen with two options - F1 to go into BIOS and see what's wrong (something like that) or F2 to load the default BIOS. By mistake (I realize now) I chose F2. All the BIOS settings were as from the factory - and because (another fault of mine) I had not written down all the changes...I didn't remember exactly what to change - so left it like that. Of course, the PC didn't boot (insert a CD/DVD bootable) - got back to BIOS, noticed the boot sequence, changed to 1st - the HDD. The SATA configuration though, I noticed, was on IDE (the default).
    After that - the PC booted, Win XP started recognizing a new HDD...and now I have two HDDs (C: and E:) instead of "one" (C:\). Both are working (tried a few random files from both and everything was OK). Tried to run the Intel Matrix SW - didn't show anything - mentioning not being initialized.
    I would like to have the RAID back. The question is what are the steps and the risk?
    Thanks in advance
    Sorin
