Orthogonal Thought | Random musings from the creator of Cooking For Engineers and Lead Architect of Fanpop


RAIDs and Drive Remapping in Vista

Posted 13 August, 2007 at 11:13pm by Michael Chu
(Filed under: Personal Computers)

So a couple months ago I built a new computer around a video card I bought, but I did it with the vision that it would eventually replace my current desktop setup. That meant eventually I'd have to build a new RAID set. For data redundancy reasons, I run a RAID 5 for my primary data storage partition. (Right out of college, I lost a massive amount of data - maybe 60 GB - after striping two disks for increased performance (RAID 0) for video editing. Ever since, I have been running a RAID 5 of some form or other.) My current (until yesterday) solution is one I've had for over six years now - the Adaptec ATA-RAID 2400A, a full size card that supports four Parallel ATA drives (we just called them IDE back then since there weren't two types of IDE drives) and was known for reliability and cost-effectiveness (I bought it in 2000 or 2001 for $350). The reliability of the board has certainly been proven over the last six or seven years of continuous usage and three sets of drives. Unfortunately, the board was also known for its relatively low performance in RAID 5 mode. I figured since I had just built a brand new computer and was running out of storage space (and would have to upgrade my drive set soon anyway), it was time to upgrade from last generation's technology to a modern RAID controller.

A Brief Description of Popular RAID Configurations

A RAID is a Redundant Array of Inexpensive Disks (although the acronym is now most often expanded as Redundant Array of Independent Disks). The term was coined in 1987 in a paper that explored using multiple drives not only to reduce the cost of large data storage solutions but also to provide reliability and redundancy. The paper defined five possible configurations (RAID 1 through 5) and their theoretical costs and benefits. The most common RAID configurations you'll see today are RAID 0, RAID 1, RAID 5, and, increasingly, RAID 6.

RAID 0 simply stripes data between the drives in the RAID set. For example, with two disks, blocks are written in alternation on each disk. Think of it as having all the odd blocks on one disk and the even ones on the other. Theoretically, reading consecutive data from a RAID 0 set can be twice as fast as reading it from one drive since you can read from two drives at the same time. The downside of this configuration is that if one drive fails, the entire set is irrecoverable since half your data is missing (or, on a set with n members, 1/n of your data is missing). With this setup, a drive failure is also more likely simply because you have more drives. In fact, since complete data loss occurs when ANY of the drives fails, the probability of failure grows roughly in proportion to the number of disks in the set. Bad news. (Since the set isn't redundant at all, I think it should really be called AID or AID 0 instead of RAID 0…)
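
To put some rough numbers on that, here's a back-of-the-envelope sketch in Python. The 3% annual failure rate is a made-up number for illustration - real rates vary by drive model and age - and the math assumes drives fail independently:

# Odds of a RAID 0 set dying within a year.
# Assumes independent failures and a hypothetical 3% annual
# failure rate per drive (real rates vary by model and age).
p = 0.03

for n in (1, 2, 4, 8):
    # The set is lost if ANY member fails: 1 - P(all survive)
    p_loss = 1 - (1 - p) ** n
    print(f"{n} drive(s): {p_loss:.1%} chance of losing everything")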

RAID 1 is the simplest of the redundant structures. The data is simply mirrored on a second drive. If one drive fails, the second drive takes over and no data loss occurs. Of course, if the second drive also fails, then the data is lost, so the set is vulnerable for the time it takes to replace the failed disk and rebuild its contents (one full drive's worth of data). Pretty good, but the problem is that with two hard drives, you only have the storage capacity of one drive. (For example, with two 500GB disks in RAID 1, you only have 500GB of storage space even though you paid for 1000GB! Of course, your data is safer than it was before.) Since all the data is available on every disk, you can theoretically get the same read performance gains as you would with RAID 0. Unfortunately, most consumer market controllers I've seen don't do this and simply copy the data onto the second disk. You can also run RAID 1 with more than two disks, which drastically increases both your read speed (assuming your controller supports this) and the reliability of the set. Again, the downside is that your total storage space is the same as one disk.

There are also RAID 0 and RAID 1 combinations such as RAID 10 (four disks - two mirrored sets striped together) and Intel's Matrix RAID (two disks - part of each drive is mirrored and the other part is striped to form a fast partition and a redundant partition).

RAID 5 is one of those great balancing acts that helps you trade off storage space, performance, and reliability. Naturally, nothing is more reliable than a multiple disk RAID 1, but that's usually not economical. A RAID 5 set stripes data blocks across disks (like RAID 0) but inserts a parity block as well. The parity block is a mathematically calculated block of data that, combined with the blocks on all the remaining disks, can reconstruct the data from any one missing disk. For example, if you were to write two numbers a and b (each from 0-99) in two boxes on a sheet of paper, the parity could be (a+b) mod 100 (generally we also want the parity to be the same size as the data - in this case, also a number from 0-99). If a=32 and b=60, the parity would be 92. If we covered up box a and only saw that b=60 and the parity was 92, you could calculate that a must have been (92-60) mod 100 = 32. In a similar way, parity is stored at the block level in a RAID 5 configuration. The parity blocks are purposely striped across the disks as well, so the parity for the first set of blocks is on the last disk, the parity for the next set of blocks is on the second to last disk, and so on in a repeating pattern. The access speed of RAID 5 is almost as high as that of RAID 0 (or a RAID 1 that allows striped access) since the data is distributed across many disks. The total capacity of the array is a little less than striped but much greater than mirrored - the sum of all the disks except one (assuming the disks are the same size; otherwise we treat every disk as if it were the same size as the smallest disk in the array). So, a RAID 5 set of four 200GB drives (3 disks is the minimum - a 2 disk RAID 5 is almost like mirroring but with more math involved) yields a total storage size of 600GB with redundancy. If one of the disks fails, the data is still intact. (The technical term for an array where one disk has failed is "degraded".) If another drive fails while the array is degraded, the data is irrecoverable. So, a RAID 5 failure occurs when a second disk fails during the time it takes to install a new drive and to calculate and write one whole disk's worth of data onto it. Time-wise, that's about the same window of "death" as a RAID 1 with two disks.
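
If you want to convince yourself that the toy example works, here are the same few steps in Python:

# The toy parity example from above: parity = (a + b) mod 100
a, b = 32, 60
parity = (a + b) % 100           # 92 - same "size" as the data (0-99)

# Cover up box a and reconstruct it from b and the parity
recovered_a = (parity - b) % 100
assert recovered_a == 32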

RAID 6, like RAID 0, was not part of the original five RAID types (1-5), but was introduced as RAID 5 sets grew larger and larger. You can make a RAID 5 array with any number of drives, but at some point it becomes impractical. In addition, the humongous drives (1TB!) available today make a second drive failure while a degraded array is rebuilding even more likely. With 16 disks in one array, the chance of two drives failing in the set is much higher than with a smaller 4 disk set. If they were all 1 TB disks, you could potentially lose 15 TB of data if two drives fail within hours (or even a day) of each other, since it takes quite a long time to recalculate and write 1 TB worth of data (remember, the controller needs to read from all the other 15 disks to calculate the data for the 16th). Enter RAID 6 - designed for larger RAID sets, RAID 6 is basically RAID 5 with two parity blocks per stripe. In the 16 disk, 1TB per disk example, only 14 TB of storage space would be available, but the array would survive a two disk failure. RAID 6 is usually only available on controllers that support 8 or more drives.
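
Here's a rough illustration of why big sets scare people, again assuming a made-up 3% annual failure rate, independent failures, and a one-day rebuild window:

# Odds of a second drive failing while a degraded array rebuilds.
# Assumes independent failures, a hypothetical 3% annual failure
# rate per drive, and a one-day rebuild.
p_day = 0.03 / 365

for survivors in (3, 15):  # 4-disk set vs. 16-disk set, one disk down
    p_second = 1 - (1 - p_day) ** survivors
    print(f"{survivors} surviving disks: {p_second:.3%} chance during rebuild")

With these assumptions, the 16 disk set is about five times more likely than the 4 disk set to lose a second drive mid-rebuild - and each rebuild takes much longer, which widens the window further.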

I like having a separate card for my RAID since parity calculations (usually done with bitwise XOR rather than the modular arithmetic example I gave) do take CPU time if you use a software solution. A RAID controller has a calculation engine built onto the card, so the operating system simply sees the array as one giant disk and all the parity math is offloaded onto the RAID hardware.
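
For the curious, here's a minimal sketch of XOR parity in Python - not any particular controller's implementation, just the idea:

import functools
import operator

# Three "data disks", one block each; the parity block is the
# byte-wise XOR of all of them.
blocks = [b"\x10\x20\x30", b"\xaa\xbb\xcc", b"\x01\x02\x03"]

def xor_blocks(bs):
    # XOR corresponding bytes across all the blocks
    return bytes(functools.reduce(operator.xor, col) for col in zip(*bs))

parity = xor_blocks(blocks)

# "Lose" disk 1: XOR the surviving blocks with the parity to rebuild it
rebuilt = xor_blocks([blocks[0], blocks[2], parity])
assert rebuilt == blocks[1]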

And now, back to my story…

Over the years, hard disk capacities have grown fairly rapidly and so has the size of the data files that I store. From documents to videos to digital camera JPEGs to DV captures to digital SLR NEF (RAW) files, my storage needs for irreplaceable data keep increasing. I usually decide to update my drive set when I start running low on disk space AND hard drive density has progressed enough that a reasonably priced disk exists that can store the entire contents of my RAID on one drive. I back up the entire RAID onto that one disk, build a new set with the other four, and copy the data back. The backup disk then sits idle in case of a drive failure. This isn't really a smart way to do things - with hard drive economics, it's almost better to wait for a drive failure and pick up a brand new drive (invariably a lot bigger but still a lot cheaper than the drives in your original set) and rebuild onto that one. Of course, you're risking data loss if another drive goes down while you're wasting time driving to Fry's.

So, I picked up a 3Ware 9650SE SATA II RAID controller (PCIe x4) that is both low profile and half width (about 1/4 the surface area of my old RAID controller) for $350. I then grabbed four 750GB SATA drives and reworked the position of the drives in my computer chassis so everything would fit. I like to physically label my disks with a label sticker printer in case I need to move the drive set to another computer in the future (or rebuild the system in another case). The 3Ware card was a piece of cake to set up (like my Adaptec): in the card's BIOS, you just select the disks and tell it to make a RAID 5. Boot into Windows, load the driver, and go to Disk Management to format the drive. A couple minutes later (assuming you selected Quick Format), the array is ready to rock and roll.

In preparation for copying 700GB worth of pictures (it's amazing how many food pictures I take when preparing articles for Cooking For Engineers), video, and other data, I picked up a Gigabit Ethernet card for my old computer (the one with the Adaptec RAID). It didn't really help - the Adaptec RAID couldn't read fast enough to take advantage of the faster connection. Although it would occasionally burst over 100Mbps, over the duration of the massive network copy it averaged about 3.75 MB per second (around 30Mbps).
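
To put that in perspective, a little arithmetic (treating 1 GB as 1000 MB) shows why the copy took days rather than hours:

# 700 GB at an average of ~3.75 MB/s
hours = 700 * 1000 / 3.75 / 3600
print(f"{hours:.0f} hours")  # roughly 52 hours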

Remapping a Folder in Windows Vista to a Drive Letter (also works in Windows XP)

Now came the tricky part (turns out it's not so tricky). I had set up several programs on the new computer to work across the network. These programs (the most important of which was Adobe Lightroom) expected the original RAID to be mapped to drive W: and the photos directory to be mapped to drive X:. I disconnected the network drives and set up the RAID on the new computer as W:, but how to mimic the drive X: thing? (Lightroom remembers the original locations of all the images, and I did not look forward to altering the location for over 50,000 images…)

My first thought was to simply map a network drive (X:) to the photos directory, but I doubted that Windows Vista would be smart enough NOT to route the access through the network stack. Sure, it wouldn't be that slow going out to a full-duplex Gigabit switch and back in, but it's less than ideal. Luckily for me, Windows Vista still supports the good old DOS command subst.

Typing at the command prompt:

subst X: W:\pics\digital\nikon

gave me the exact drive mapping that I needed without having to go through a network connection.
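
If you ever need to check or undo the mapping, running subst with no arguments lists the current substitutions, and the /D switch deletes one:

subst
subst X: /D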

To get it to run automatically at login, I added an entry to my Windows Registry by running regedit.exe and updating

Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Run

I added a new String Value that I named SUBST (any name will do) and provided the subst statement above as the value.
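
If you'd rather skip regedit, the same value can be created from the command line with the reg tool (run it from an elevated prompt since it writes to HKEY_LOCAL_MACHINE):

reg add "HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Run" /v SUBST /t REG_SZ /d "subst X: W:\pics\digital\nikon"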

On login, the subst is executed (a black background command window opens briefly as the command runs) and a few seconds later the remapped drive letter is active.

4 comments to RAIDs and Drive Remapping in Vista

Anonymous, August 14th, 2007 at 12:19 am:

  • calculating parity is cheap. the memory transactions are expensive.

Michael Chu, August 14th, 2007 at 9:02 am:

  • True, true. To calculate parity, you need to load all the relevant blocks, calculate, and write it back. Doing it on the controller level keeps the system free from this data intensive operation.

Cent, February 8th, 2008 at 8:26 am:

  • Ok, subst works fine. I was also using this command because the restore CD on my laptop only allows creating one partition, and I have some projects which require drive D: to exist. BUT: if you plug in a USB pendrive or any other hot-pluggable device that allocates a drive letter, Windows will tell you that it fails - because drive D: is subst'ed. Then, if you go to the Drive Manager you'll see your USB drive there, and if you change its drive letter to something other than the substed D:, then it's all ok. And if you plug in another drive, the story begins again: go to the Drive Manager, change the drive letter… Annoying incompatibility of Microsoft subst with Microsoft XP ;)

Mike T., October 13th, 2009 at 12:08 pm:

  • I'd suggest that you get hold of 'robocopy' (part of the Windows Resource Kit tools) - it can copy files on NTFS partitions faster than just about anything else (way, way quicker than Windows Explorer).
