Back in 2012, Amazon introduced “Amazon Glacier”. It lets you store large amounts of data very cheaply (at the moment, $0.004 per GB per month). That is an incredibly low price compared to S3 (at the moment, $0.023 per GB per month). So where is the catch? Well, there is one important detail: retrieving your data can take up to several hours.
Amazon Glacier is an online data storage service provided by AWS. Just like AWS’s popular S3 service, Glacier provides users with simple, secure, cloud-based data storage that can quickly be scaled up or down as needed. But unlike S3, which is designed to give users quick access to their data, Amazon Glacier is designed for the long-term storage of inactive data that will not need to be retrieved quickly (a retrieval from Glacier commonly takes between three and five hours). This long-term, slow-moving approach is known as cold storage, hence the name Glacier.
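To put the per-gigabyte prices quoted above in perspective, here is a minimal sketch that compares the monthly storage bill for the same archive on both services. The 10 TB archive size is a made-up example, the S3 price is assumed to be per month, and the calculation covers storage only; request, retrieval and transfer fees are ignored.

```python
# Prices quoted in this post (USD per GB per month). They change over time.
GLACIER_PRICE_PER_GB_MONTH = 0.004
S3_PRICE_PER_GB_MONTH = 0.023


def monthly_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Storage-only monthly cost; ignores request, retrieval and transfer fees."""
    return size_gb * price_per_gb_month


if __name__ == "__main__":
    size_gb = 10_000  # hypothetical archive of 10 TB (decimal terabytes)
    glacier = monthly_cost(size_gb, GLACIER_PRICE_PER_GB_MONTH)
    s3 = monthly_cost(size_gb, S3_PRICE_PER_GB_MONTH)
    print(f"Glacier: ${glacier:.2f} per month")
    print(f"S3:      ${s3:.2f} per month")
    print(f"S3 costs about {s3 / glacier:.2f}x more for storage alone")
```

At these prices the gap is a bit under six times, which is why the slow retrieval can be an acceptable trade-off for data you rarely touch.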
As a disclaimer: I am not an Amazon employee, and there is no information about Glacier’s architecture available in any public source. So I can only guess at how it might actually work.
Let’s imagine that we want to build a service like Glacier. First of all, we would need lots of storage hardware. And it must be pretty cheap hardware (in terms of cost per gigabyte), because we want to sell the storage for so little money. Only two types of hardware fit these requirements: hard disk drives and magnetic tape. Tape is much cheaper, but less reliable because the magnetic layer degrades over time, which means the data has to be refreshed periodically to prevent loss. As for hard drives, Amazon may use custom high-capacity drives with slow access times, simply because speed is not critical here. That would make the overall solution even cheaper. I don’t know what kind of storage hardware Amazon uses, but I think hard drives are the somewhat more likely option.
The second component of a big data warehouse is the infrastructure that connects users with their data and makes it available within the timeframe described in the SLA. That means networking, power supplies, cooling and lots of other things you can find in a modern datacenter. If you were building a service like S3, the infrastructure cost would be even bigger than the storage cost. But there is one important difference between S3 and Glacier: you don’t have to provide access to the data quickly. It means that you don’t have to keep your hard drives powered on, which reduces power consumption. It means that you don’t even have to keep your hard drives plugged into a server! They could be stored in a simple locker. All you need is an employee who is responsible for finding the right hard drive and plugging it into a server when a user asks for access to their data. And several hours is definitely enough time to do that, even for a human being.
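The locker idea can be sketched as a tiny simulation. This is purely an illustration of the guess made in this post, not a description of how Glacier really works: the class names, the three-hour fetch time and the simulated clock are all assumptions invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalJob:
    archive_id: str
    drive_id: str
    ready_at_hours: float  # simulated time at which the data becomes readable


@dataclass
class ColdStore:
    # Every name and number here is hypothetical, for illustration only.
    fetch_hours: float = 3.0  # assumed time to find a drive and plug it in
    archive_to_drive: dict = field(default_factory=dict)
    online_drives: set = field(default_factory=set)
    now_hours: float = 0.0  # simulated clock, no real waiting involved

    def store(self, archive_id: str, drive_id: str) -> None:
        """Record which (offline) drive holds the archive."""
        self.archive_to_drive[archive_id] = drive_id

    def request_retrieval(self, archive_id: str) -> RetrievalJob:
        """Queue a retrieval; offline drives add the fetch delay."""
        drive_id = self.archive_to_drive[archive_id]
        delay = 0.0 if drive_id in self.online_drives else self.fetch_hours
        return RetrievalJob(archive_id, drive_id, self.now_hours + delay)

    def advance(self, hours: float, pending: list) -> list:
        """Move the clock forward and return the jobs that are now ready."""
        self.now_hours += hours
        ready = [job for job in pending if job.ready_at_hours <= self.now_hours]
        for job in ready:
            self.online_drives.add(job.drive_id)  # the drive is plugged in now
        return ready


if __name__ == "__main__":
    locker = ColdStore()
    locker.store("backup-2012", "drive-17")
    job = locker.request_retrieval("backup-2012")
    print(locker.advance(1.0, [job]))  # nothing ready after one hour
    print(locker.advance(2.0, [job]))  # the job is ready after three hours
```

The point of the sketch is that the user-facing API only needs a "request now, download later" shape. Whatever happens between those two moments, whether a robot, a tape library or a person walking to a locker, is hidden behind the delay.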
Sounds crazy? Well, let’s look at this solution from the other side. What is Amazon, first of all? A cloud vendor? Nope. It is a retail company, one of the biggest in the world, and it probably has the best logistics and warehouse infrastructure in the world. Let’s imagine you order a hard drive on the Amazon website. How long does it usually take Amazon to find it in a warehouse, pack it and send it to you? Several hours? Now imagine that they don’t send it, but plug it into a server and turn it on instead. It sounds like a pretty similar task, doesn’t it?
It is amazing how Amazon integrates its businesses with each other. AWS was a side product of the main retail business, something Amazon started to sell just because it realized that the product had value not only for its own business but also for others. And now we can see how AWS could use Amazon’s offline infrastructure to provide an absolutely new kind of service. A fantastic fusion of online and offline infrastructure working together to create something new!