By Paul Monday, August 2007
This article compares the cost of local disk backup solutions to that of offsite
storage utilities and offers a sample implementation guide.
Overview
Backup and archiving serve a variety of purposes and are commonly used throughout companies.
Traditional backup and archiving operations target tape libraries, like the
Sun StorageTek SL500 Modular Library System. Increasingly, companies are using
local disk storage to reduce perceived access time and latency as well
as to create a "simple" solution. With lower access time and latency, archives can remain highly
accessible and restore operations can meet increasingly aggressive service-level agreements.
Smaller companies want the advantages of local disk solutions
(performance and online browsing features) and the simplicity of
working only with their own disk solutions, combined with a level of offsite
backup and archiving for disaster recovery purposes. These companies are finding
that storage utilities provide highly competitive offerings.
Companies such as SmugMug and elephantdrive are using storage
utilities to reduce their hardware costs and offload the system administration
costs to the storage utility. The storage utility, in turn, can create better
economies of scale: it can use hierarchical storage management, build
large systems with a lower cost per gigabyte, and purchase or build
tools for managing large amounts of data, amortizing
these costs over many customers. These savings are passed back to the smaller
companies. Further, a company well versed in data management can provide a
safer and more reliable solution than a small company whose core competency
is in application software or in managing store inventory.
Considerations and Requirements
When looking at hosting solutions and storage utilities, there are many things
to keep in mind. Many of these things are measurable, but some are not, such
as the core expertise of your company. Here are some considerations and
requirements that I laid out as I took a quick look at building my own
solution versus leveraging a storage utility. I wanted my solution to provide these benefits:
- Up to 12 terabytes of accessible storage with offsite backup (assume 1TB per month utilized)
- A lower monthly cost for backups than a locally hosted solution amortized over one year, including the following costs:
- Hardware expense
- Power expense
- Remote hosting facility expense
- Bandwidth expense
- System administration costs
Assumptions
I then made a few assumptions about my storage consumption and environment. Note
how round all the numbers are. As a company, you would certainly want to be
much more aggressive in mapping your expectations, because every dollar counts and every
point of accuracy will help you manage those dollars. I assume the following:
- Monthly usage: Assume 1TB backed up per month
- Restores: Assume that two 500GB restores must occur in separate months throughout the year
- Access time: Assume "normal" network latency is acceptable
- Access rate: Assume 100 Mbps network connection is acceptable
- Reliability: Assume colocated solution has same availability as storage utility
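As a quick sanity check on the access-rate assumption, the following sketch estimates how long the monthly 1TB backup would take over a 100 Mbps link. It assumes decimal units (1TB = 1000GB, matching the cost figures later in this article) and a fully utilized link with no protocol overhead, so real transfers would take longer. The class and method names are illustrative only.

```java
// Rough transfer-time estimate under the assumptions listed above.
// Assumes 1 TB = 1000 GB and a link that is fully used with no overhead.
class TransferTime {
    // Seconds needed to move the given number of gigabytes over the link.
    static double seconds(double gigabytes, double megabitsPerSecond) {
        double bits = gigabytes * 1e9 * 8;               // gigabytes -> bits
        double bitsPerSecond = megabitsPerSecond * 1e6;  // Mbps -> bits/second
        return bits / bitsPerSecond;
    }

    public static void main(String[] args) {
        double backup = seconds(1000, 100);   // monthly 1TB backup at 100 Mbps
        double restore = seconds(500, 100);   // a 500GB restore, the size used in the bandwidth cost
        System.out.printf("Monthly 1TB backup: %.1f hours%n", backup / 3600);
        System.out.printf("500GB restore: %.1f hours%n", restore / 3600);
    }
}
```

Under these idealized assumptions, the monthly backup occupies the link for roughly 22 hours, which is worth keeping in mind when deciding how often to send snapshots.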
Out of Scope
Of course, there are many things I simply felt were out of scope as I tried
to create an apples-to-apples comparison. As individuals become more sophisticated
with their storage management abilities, more solutions become readily available that can help decrease costs.
My goal was simplicity, so here are some things I didn't even consider to be in scope:
- Complex local storage and network solutions (assumption is a single X4500 being backed up over the network)
- Tape backup (cohosting solution does not go down in price for hosting tape)
Architecture Overview
I assume a "simple" point-to-point file-based backup infrastructure is
in place, as shown in the illustration below.
[Figure: A "simple" point-to-point file-based backup infrastructure.]
Implementation Guide for ZFS to Amazon S3
Implementation Overview
The ZFS file system was chosen simply as a practical starting point. Other file systems
should be usable but could provide different performance characteristics and
certainly different snapshot and restore abilities. If you find a substantial
difference in a particular file system, please feel free to send me the
implementation and I will post it on my blog in your name. We are also piloting
a wiki site and I intend to make this pattern "editable" so you could branch
the pattern and add your own solution as well, or change this one.
Amazon Simple Storage Service (Amazon S3) is an interesting choice for a
storage utility. I chose it for the following reasons:
- High usage in the industry
- Stable and predictable pricing model that is the lowest in the industry, so this represents a good "best case"
- Trusted source of storage
The costs are as follows for hosting at Amazon S3:
- Consume 1TB per month, aggregating to 12TB in the last month of the year
- Hardware expense: $0
- Power expense: $0 (included in remote hosting cost)
- Storage Utility expense: $11,700 (1TB + 2TB + ... + 12TB @ $0.15 / GB-Month: $150 + $300 + ... + $1,800)
- Bandwidth expense: $1,380 (Inbound: 12 * 1,000GB * $0.10/GB = $1,200; Outbound: 2 * 500GB * $0.18/GB = $180)
- System administration costs: $0
- Total Cost of Ownership (TCO) for first year: $13,080
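The arithmetic behind the list above can be reproduced with a short program. This is a minimal sketch that uses only the rates and volumes quoted in this article ($0.15 per GB-month, $0.10/GB inbound, $0.18/GB outbound, 1TB = 1000GB); the class and method names are illustrative, and you would substitute your own rates and volumes.

```java
// First-year Amazon S3 cost under the article's assumptions.
class S3FirstYearCost {
    static final double STORAGE_PER_GB_MONTH = 0.15; // rate quoted in the article
    static final double INBOUND_PER_GB = 0.10;       // rate quoted in the article
    static final double OUTBOUND_PER_GB = 0.18;      // rate quoted in the article
    static final int GB_PER_TB = 1000;               // decimal units, as in the article

    // Storage grows by tbPerMonth each month, so month m is billed for m * tbPerMonth.
    static double storageCost(int months, int tbPerMonth) {
        double total = 0;
        for (int m = 1; m <= months; m++) {
            total += (double) m * tbPerMonth * GB_PER_TB * STORAGE_PER_GB_MONTH;
        }
        return total;
    }

    // Inbound traffic for the monthly backups plus outbound traffic for the restores.
    static double bandwidthCost(int months, int tbPerMonth, int restores, int restoreGb) {
        double inbound = (double) months * tbPerMonth * GB_PER_TB * INBOUND_PER_GB;
        double outbound = (double) restores * restoreGb * OUTBOUND_PER_GB;
        return inbound + outbound;
    }

    static double totalCost() {
        return storageCost(12, 1) + bandwidthCost(12, 1, 2, 500);
    }

    public static void main(String[] args) {
        System.out.printf("Storage:   $%,.0f%n", storageCost(12, 1));
        System.out.printf("Bandwidth: $%,.0f%n", bandwidthCost(12, 1, 2, 500));
        System.out.printf("Total:     $%,.0f%n", totalCost());
    }
}
```

Note that the storage term dominates because each month is billed for everything stored so far, not just for that month's new terabyte.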
One of the original requirements of this particular pattern was to lower the
cost for our storage backups. SmugMug did a simple analysis of their savings by using
the Amazon S3 storage utility versus purchasing their own disk, and they saved about
$340,000 in seven months.
It is extremely important that you spend time to do a full cost
analysis based on your own internal information. The Amazon S3 rate per gigabyte-month
is flat, so the charge for storing 12TB recurs every month. Over the course of several years,
the cumulative Amazon S3 cost for storing 12TB of data will cross the cost of owning a server
with 12TB of storage.
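To see roughly where that crossover falls for your own numbers, a sketch like the following can help. The only figure taken from this article is the Amazon S3 storage rate ($0.15 per GB-month, or $1,800 per month for 12TB). The purchase price and monthly running cost of an owned server are hypothetical placeholders that you must replace with your own data, and the model ignores bandwidth charges and the first-year ramp-up for simplicity.

```java
// Simplified crossover estimate: Amazon S3 holding 12TB versus an owned server.
class CrossoverEstimate {
    // Returns the first month in which cumulative S3 storage fees exceed the
    // cumulative cost of ownership (upfront plus monthly running cost),
    // or -1 if S3 is never the more expensive option.
    static int monthsToCrossover(double s3Monthly, double ownedUpfront, double ownedMonthly) {
        if (s3Monthly <= ownedMonthly) {
            return -1;
        }
        return (int) Math.floor(ownedUpfront / (s3Monthly - ownedMonthly)) + 1;
    }

    public static void main(String[] args) {
        double s3Monthly = 12 * 1000 * 0.15;  // 12TB at the article's $0.15 per GB-month
        double ownedUpfront = 30000;          // HYPOTHETICAL purchase price, not from the article
        double ownedMonthly = 300;            // HYPOTHETICAL power, hosting, and administration
        System.out.println("S3 becomes more expensive in month "
            + monthsToCrossover(s3Monthly, ownedUpfront, ownedMonthly));
    }
}
```

A fuller analysis would also account for hardware refreshes, staff time, and changes in utility pricing over the same period.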
Implementation Assumptions
My implementation makes the following assumptions:
Bill of Materials
Implementation Steps
This version of the implementation is going to take the "simple is elegant"
route. We use the built-in capabilities of ZFS to take snapshots,
save them, and restore them, and we couple this with simple pipes to move data
between our file system and Amazon S3. A "snapshot" is defined by
the Storage Networking Industry Association (SNIA)
to be: "A fully usable copy of a defined collection of data that contains an
image of the data as it appeared at the point in time at which the copy was
initiated. A snapshot may be either a duplicate or a replicate of the data
it represents." The built-in ZFS snapshot capabilities are very good and are not
an add-on, as they are for most file systems.
We can also restore our file system from Amazon S3 back to ZFS by reversing
the flow of the process described here.
Step 1: Basic Setup
To set up our system:
- Create one storage pool:
- Create a file system within the media storage pool:
    zfs create media/mypictures
- Change the mountpoint to /export:
    zfs set mountpoint=/export/media/mypictures media/mypictures
- Set to share over NFS:
    zfs set sharenfs=on media/mypictures
- Turn on compression:
    zfs set compression=on media/mypictures
- Copy a set of pictures into the
/export/media/mypictures directory (enough to create an acceptable snapshot).
Step 2: Snapshot and Store
Our first process consists of creating a snapshot and sending it to
Amazon's S3 for backup. There are some assumptions that must be made at this point:
- You have an existing file system or volume that you would like to back up (as created above)
- The snapshot size will fit within the constraints of Amazon's S3 license or your own service contract with Amazon S3
The steps we take are:
- Create a snapshot of the file system
- Turn the snapshot into a data stream
- Compress the snapshot
- Send it to Amazon S3 with appropriate metadata
To create a snapshot and store it:
- To create a snapshot, use the zfs snapshot command. I will snapshot
the entire /export/media/mypictures directory and name the snapshot "20070607" using this command:
    zfs snapshot media/mypictures@20070607
The snapshot should initially take up no additional space in my file system.
As files change, the snapshot space grows as well, because the original versions of the
changed data must be kept. Still, saving the snapshot requires the full amount
of space because I am creating a "file" full of the snapshot of the data (which
happens to be all of the original data).
- It is relatively easy to turn the snapshot itself into a stream of data by using the
zfs send command to send the snapshot to a file:
    zfs send media/mypictures@20070607 > ~/backups/20070607
- We can also insert compression into the pipe, so the actual command I used is this one:
    zfs send media/mypictures@20070607 | gzip > ~/backups/20070607.gz
Mileage varies with compression on snapshots.
- Finally, we can send the snapshot to Amazon S3.
I assume that a "bucket" is already created
and that we are merely sending the final compressed snapshot to the Amazon S3 bucket.
To be honest, I tried using Curl, Perl, and a variety of other things but I
couldn't quickly get the right libraries to create the signatures. I just
hate scrounging around the Internet for the right this or that and changing
compilation flags and recompiling and so on. So, I went with the Java REST approach.
Use the Amazon S3 Library for REST in Java.
This has classes for doing all of your favorite Amazon S3 operations
and is quite easy to use. I created the following "simple" program that takes
a key and the location of my compressed snapshot for upload. (This program is based on the samples from Amazon S3.)
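    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Arrays;
    import java.util.Map;
    import java.util.TreeMap;

    // AWSAuthConnection and S3Object come from the Amazon S3 Library for REST in Java;
    // import them from the package used by your copy of the library.
    public class SendSnapshot {
        // Placeholders: fill in your own credentials and bucket name.
        static final String awsAccessKeyId = "<INSERT YOUR AWS ACCESS KEY ID HERE>";
        static final String awsSecretAccessKey = "<INSERT YOUR AWS SECRET ACCESS KEY HERE>";
        static final String bucketName = "<INSERT YOUR BUCKET NAME HERE>";

        public static void main(String args[]) throws Exception {
            if (awsAccessKeyId.startsWith("<INSERT")) {
                System.err.println("Please examine SendSnapshot.java and update it with your credentials");
                System.exit(-1);
            }
            if (args.length < 2) {
                System.err.println("Send snapshot key and location with program: SendSnapshot key path");
                System.exit(-1);
            }
            AWSAuthConnection conn =
                new AWSAuthConnection(awsAccessKeyId, awsSecretAccessKey);
            System.out.println("----- putting object -----");
            S3Object object = null;
            try {
                File file = new File(args[1]);
                long length = file.length();
                // The whole file is held in one byte array, so it must fit in an int-sized array.
                if (length > Integer.MAX_VALUE) {
                    System.err.println("File too large: " + args[1]);
                    System.exit(-1);
                }
                byte[] bytes = new byte[(int) length];
                InputStream is = new FileInputStream(file);
                try {
                    // Keep reading until the array is full or the stream ends.
                    int offset = 0;
                    int numRead = 0;
                    while (offset < bytes.length
                            && (numRead = is.read(bytes, offset, bytes.length - offset)) >= 0) {
                        offset += numRead;
                    }
                    // Ensure all the bytes have been read in
                    if (offset < bytes.length) {
                        throw new IOException("Could not completely read file " + args[1]);
                    }
                } finally {
                    // Close the input stream even if the read fails
                    is.close();
                }
                object = new S3Object(bytes, null);
            } catch (IOException ioe) {
                System.err.println("Error reading file: " + args[1]);
                System.exit(-1);
            }
            // Content type kept from the Amazon sample; a binary type may suit a gzipped snapshot better.
            Map headers = new TreeMap();
            headers.put("Content-Type", Arrays.asList(new String[] { "text/plain" }));
            // Upload the snapshot under the key given as the first argument.
            System.out.println(
                conn.put(bucketName, args[0], object, headers).connection.getResponseMessage()
            );
            System.out.println("----- listing bucket -----");
            System.out.println(conn.listBucket(bucketName, null, null, null, null).entries);
        }
    }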
Send up your snapshot and you are GOOD TO GO!
Step 3: Retrieve and Restore
The process of retrieving and restoring the snapshot when you lose your data or want to return to a previous time in your history is relatively
simple as well. Simply reverse the process above. Here is the Java code (using the Amazon S3 Java REST libraries again).
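    import java.io.FileOutputStream;

    // AWSAuthConnection comes from the Amazon S3 Library for REST in Java;
    // import it from the package used by your copy of the library.
    public class GetSnapshot {
        // Placeholders: fill in your own credentials and bucket name.
        static final String awsAccessKeyId = "<INSERT YOUR AWS ACCESS KEY ID HERE>";
        static final String awsSecretAccessKey = "<INSERT YOUR AWS SECRET ACCESS KEY HERE>";
        static final String bucketName = "<INSERT YOUR BUCKET NAME HERE>";

        public static void main(String args[]) throws Exception {
            if (awsAccessKeyId.startsWith("<INSERT")) {
                System.err.println("Please examine GetSnapshot.java and update it with your credentials");
                System.exit(-1);
            }
            if (args.length < 2) {
                System.err.println("Get snapshot and write data needs 2 parameters: GetSnapshot key path");
                System.exit(-1);
            }
            AWSAuthConnection conn =
                new AWSAuthConnection(awsAccessKeyId, awsSecretAccessKey);
            System.out.println("----- getting object -----");
            // Fetch the object stored under the key given as the first argument.
            byte[] bytes = conn.get(bucketName, args[0], null).object.data;
            // Write the bytes to the path given as the second argument.
            try (FileOutputStream fos = new FileOutputStream(args[1])) {
                fos.write(bytes);
                fos.flush();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }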
When you have your gzipped snapshot, decompress it before using it (for example, with gunzip,
which turns 20070607.gz into 20070607).
Now you have to decide what to do with your snapshot. I moved the existing
mypictures file system aside and restored my old snapshot into its place to give me a complete "time travel"
back to my snapshot. Here are the commands:
    # zfs rename media/mypictures media/mypictures.old
    # zfs receive media/mypictures < 20070607
That's it! Going to /export/media/mypictures brings me to the pictures I snapshotted on June 7, 2007!
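If you would rather keep the decompress step in Java alongside the download program, the standard library can do it. The following is a minimal sketch, not part of my original process: the class name GunzipSnapshot and its arguments are illustrative, and it assumes the downloaded file is a plain gzip stream such as the one produced by the gzip pipe in Step 2.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Decompresses a gzipped snapshot so that it can be fed to "zfs receive".
class GunzipSnapshot {
    static void gunzip(Path compressed, Path target) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(compressed));
             OutputStream out = Files.newOutputStream(target)) {
            // Copy in fixed-size chunks so the snapshot never has to fit in memory.
            byte[] buffer = new byte[64 * 1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: GunzipSnapshot compressed-file output-file");
            System.exit(1);
        }
        gunzip(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

For example, running it with the arguments 20070607.gz and 20070607 would produce the file that the zfs receive command above reads.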
Billing
Using Amazon S3 was simple for billing. I entered my credit information before
using it, and I receive a monthly bill. Here is a copy of the bill I received
for all of the storage used in the process of building this article.
Greetings from Amazon Web Services,
This e-mail confirms that your latest billing statement is available on the AWS
web site. Your account will be charged the following:
Total: $0.09
Please see the Account Activity area of the AWS web site for detailed account
information:
http://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=activity-summary
Thank you for your continuing interest in Amazon Web Services.
Sincerely,
Amazon Web Services
This message was produced and distributed by Amazon Web Services LLC, 1200 12th
Avenue South, Seattle, Washington 98144-2734
Issues With Implementation
Feel free to tackle any of these issues and feed the solutions back to me, or
add to the list of issues if you perceive there to be more.
- Amazon S3 size limitations: In the "Terms of Use" Amazon S3
specifies the following: "You may not, however, store 'objects'
(as described in the user documentation) that contain more than 5Gb of
data, or own more than 100 'buckets' (as described in the user
documentation) at any one time." As a result, one would want to slice up
the snapshot appropriately so as to conform to the Amazon S3
limitations, or possibly work with Amazon S3 to find another solution. The limitation is
reasonable, though, given the practical difficulty of transferring very large objects in a
single HTTPS request.
- Encryption: Data should be encrypted appropriately before being
sent to a third-party storage site.
- Cron Job: Timely snapshots could be made automatically.
- Non-Java: It would be nice to do the whole process from scripts,
but I got hung up on generating the request signatures, so I hopped to my native tongue (Java code).
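As a starting point for the first issue in the list above, a compressed snapshot can be sliced into parts that each stay under the object size limit before upload. This is a minimal sketch under my own assumptions: the class name SnapshotSplitter, the ".partNNN" naming scheme, and the 4GB default part size are all illustrative choices, not anything Amazon S3 prescribes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Splits a file into numbered parts of at most maxPartBytes bytes each.
class SnapshotSplitter {
    static List<Path> split(Path source, long maxPartBytes) throws IOException {
        if (maxPartBytes <= 0) {
            throw new IllegalArgumentException("maxPartBytes must be positive");
        }
        List<Path> parts = new ArrayList<>();
        byte[] buffer = new byte[64 * 1024];
        try (InputStream in = Files.newInputStream(source)) {
            OutputStream out = null;
            long writtenInPart = 0;
            try {
                while (true) {
                    // Close the current part once it has reached the size limit.
                    if (out != null && writtenInPart == maxPartBytes) {
                        out.close();
                        out = null;
                    }
                    long room = (out == null) ? maxPartBytes : maxPartBytes - writtenInPart;
                    int n = in.read(buffer, 0, (int) Math.min(buffer.length, room));
                    if (n < 0) {
                        break; // end of the source file
                    }
                    // Open the next part only when there is data to put in it.
                    if (out == null) {
                        Path part = Paths.get(String.format("%s.part%03d", source, parts.size()));
                        parts.add(part);
                        out = Files.newOutputStream(part);
                        writtenInPart = 0;
                    }
                    out.write(buffer, 0, n);
                    writtenInPart += n;
                }
            } finally {
                if (out != null) {
                    out.close();
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: SnapshotSplitter file [maxPartBytes]");
            System.exit(1);
        }
        // Default of 4GB is an arbitrary choice that stays below the limit quoted above.
        long max = args.length > 1 ? Long.parseLong(args[1]) : 4L * 1024 * 1024 * 1024;
        for (Path part : split(Paths.get(args[0]), max)) {
            System.out.println(part);
        }
    }
}
```

Each part can then be uploaded under its own key. On restore, the parts must be concatenated in numeric order before the snapshot is decompressed.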