Disk Failure Investigations at the Internet Archive
Goals and methodology
The Internet Archive is a nonprofit organization based in San Francisco, established to preserve Web sites by taking “regular snapshots”. It has since extended its mission to preserve as much digital and digitizable data as possible. Its data set currently grows at about 25 TB/month. Its Wayback Machine not only gives future historians access to data that would otherwise not be archived, but has already become a daily tool for investigating such matters as trademark disputes. The Archive complements its small staff with highly dedicated and educated volunteers.
The Internet Archive stores its archival data at several sites, including Alexandria, Egypt; Amsterdam, Netherlands; and several locations in San Francisco, CA. For cost reasons, it stores data on desktop ATA disks. These are now housed in four-disk nodes with a pizza-box form factor, which replace older nodes with a bulkier form factor. Of the four disks in a node, one is the Linux OS boot disk, while the other three store only data and are configured as a JBOD. There are 40 storage nodes in a rack, and the San Francisco clusters have 36 racks. The environmental quality of the sites has varied over time.
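To give a sense of scale, the node and rack figures above can be turned into a rough disk count for the San Francisco clusters. The sketch below is only illustrative. It assumes that the 36 racks are the total across the San Francisco clusters, that every rack is fully populated with 40 nodes, and that every node is a four-disk node with one boot disk. The function name and the dictionary keys are hypothetical and are not taken from any Internet Archive tooling.

```python
# Figures quoted in the text above.
DISKS_PER_NODE = 4        # one Linux boot disk plus three JBOD data disks
BOOT_DISKS_PER_NODE = 1
NODES_PER_RACK = 40
RACKS = 36                # assumed to be the San Francisco total


def cluster_inventory(racks=RACKS,
                      nodes_per_rack=NODES_PER_RACK,
                      disks_per_node=DISKS_PER_NODE,
                      boot_disks_per_node=BOOT_DISKS_PER_NODE):
    """Estimate node and disk counts, assuming fully populated racks."""
    nodes = racks * nodes_per_rack
    disks = nodes * disks_per_node
    boot_disks = nodes * boot_disks_per_node
    return {
        "nodes": nodes,
        "disks": disks,
        "boot_disks": boot_disks,
        "data_disks": disks - boot_disks,
    }


if __name__ == "__main__":
    for name, count in cluster_inventory().items():
        print(f"{name}: {count}")
```

Under these assumptions the San Francisco clusters would hold 1,440 nodes and 5,760 disks, of which 4,320 store data. The real population may be smaller if some racks are partially filled or still contain older nodes.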