The Internet Archive and the Wayback Machine

Some years ago, I posted an article here about Google Books. At that time, I was very impressed with what Google had done and had high hopes for its future. Sadly, since that time, I have become increasingly disappointed with the quality and availability of the Google Book collections. However, that deficit is much less frustrating for me since I learned about the Internet Archive. The discovery of the Wayback Machine was an added bonus.

Why you want to know about the Internet Archive and the Wayback Machine …

The Internet Archive is older than Google Books, having been founded in 1996, while Google Books was first introduced in 2004. The Internet Archive is also a much more ambitious project, since its stated aim " … is building a digital library of Internet sites and other cultural artifacts in digital form." This means they archive not only print content, but also audio and visual materials as well as entire web sites. One can think of the Internet Archive as the digital version of the ancient Royal Library of Alexandria, which was the largest library in the ancient world, holding a copy of nearly every book produced during its era. Quite appropriately, the mirror site for the Internet Archive is the Bibliotheca Alexandrina.

Though the Internet Archive is working hard to preserve film, music and electronic records of all kinds, my focus here is its massive Text collection. They have nearly two dozen scanning centers in five different countries and it is estimated that between them they scan at least 1,000 books every day. From 2006 to 2008, Microsoft Corporation was partnering with the Internet Archive in its now defunct Live Search Books project. Microsoft provided financial support as well as superior scanning equipment for this effort. Over the course of those two years, more than 300,000 books were scanned, all of which were added to the collections of the Internet Archive. When Microsoft closed down its Live Search Books project, in May of 2008, they donated their scanning equipment to the Internet Archive, where it continues to be used at the scanning centers operated by Internet Archive members.

It is that specialized book scanning equipment which makes the scanned texts at the Internet Archive light-years better than anything done by Google Books. Though I am not a noted fan of Microsoft, the scanning equipment which they developed and donated to the Internet Archive effort was carefully and intelligently designed and manufactured. I have what I consider a great advantage with regard to my ongoing Regency research: I work less than a block from the main branch of the Boston Public Library, which operates one of the Internet Archive scanning centers. Over a year ago, I was introduced to the director of the scanning center there and was very fortunate to be offered a tour of the facility. That tour included an opportunity to watch one of the technicians at work scanning a book with the Microsoft scanning equipment. In that case, it was a book from the personal library of John Adams, which is owned by the Boston Public Library. At the time, they were in the process of scanning John Adams’ entire book collection into digital format, thus making it available to the world for study, while keeping the original volumes safe in their climate-controlled storage vault.

Unlike the scanning process for Google, in which books are apparently laid flat on a table and held open by a human (all those inadvertent images of thumbs and fingers on Google book pages attesting to that method), the Internet Archive scanners do not require human hands on the book pages while they are scanned. Instead, the scanner has an adjustable V-shaped cradle into which the book is laid so that it is fully supported with no more pressure on its often fragile spine than is necessary. Once the book has been placed into the cradle, a matching V-shaped pair of glass panes is lowered onto the book, the point of the V-shaped glass cover pressing into the gutter of the book just firmly enough to hold the book in the cradle and keep the two visible pages flat without putting undue pressure on the spine of the book. There is a pair of lenses set above this book cradle and glass cover unit, each of which is set at an angle so that it focuses precisely on the plane of one side of the V-glass cover. When the scanning technician triggers the scanner, an image is taken of each open page. With a foot pedal, they can then raise the glass cover, turn the page, lower the cover and scan another two pages and so on, until the entire book has been scanned.

Not only does the Internet Archive scanning method put less pressure on a book’s delicate spine, since it is not forced fully open on a flat surface by occasionally careless humans, but the image of each page is also clean and crisp. The glass cover of the scanner holds the pages completely flat and still while they are scanned, so there are never any pages with hideously warped and twisted text, or parts in focus and other parts blurry. Each page of a book which has been scanned by the Internet Archive is clean, in sharp focus and fully legible, with nary a human finger or thumb in sight. [Author’s Note: There is a caveat which must be included here. It is estimated that, in the fall of 2007, approximately 900,000 books from the Google Book project were uploaded to the Internet Archive. These are all full copies of books, but sadly, due to Google’s sloppy scanning practices, not all of the books in that group will have pages which are fully legible. However, in most cases, search results at the Internet Archive note if the book was scanned by Google, so you will be aware of that fact and can select a copy from another source, if it is available.]

Another advantage of the Internet Archive is that it offers only full copies of books, since they only scan books which are out of copyright and in the public domain. There are no teasing "Preview" views, which might appear during a search on Google Books, often with the information you need not visible. Nor are there any of those annoying "Snippet" pages which only offer metadata on the book, but none of its contents. And unlike Google Books, the Internet Archive pays close attention to books with multiple volumes, making the effort to clearly differentiate each volume in a set. This can be crucial to those of us who might want to read a complete Regency-era novel, most of which were published in three volumes.

Most of the books at the Internet Archive are available in multiple file formats, which provides more options for reading those books on various devices. But even better, you can read the books online. And when you do so, the book can be viewed on your full computer screen, unlike Google Books, which restricts your viewing area to a small portion of the screen. When Google Books was first introduced, the text of the book was available in about three-quarters of the screen. But now, when trying to read a book online at Google Books, the actual text of the book is restricted to less than half the screen. The rest of the screen is cluttered up with ads and controls which take up entirely too much of your screen real estate. That is not a problem at the Internet Archive. When you choose to view a book online, the book fills your whole screen, which is much easier on the eyes when doing in-depth research.

Search at the Internet Archive can easily be refined based on the type of media you are seeking. Since I only go there looking for books, I select "Texts" from the "Media Types" pick list before I run my search. A texts search can be refined even more by selecting from other criteria on the pick list, such as American Libraries, University Libraries, Project Gutenberg or Children’s Library. These refinements will reduce the number of search results that are returned, since a smaller portion of the database will have to be searched. However, since I am not always certain where books I am seeking might be found, I prefer to run a wider search. I do get a larger results set, but I prefer scrolling through that longer list to taking the chance that I might miss something by using a tighter search. There is also an option to search all media types, which will return an even longer list of search results, but covers all media types in the Internet Archive collection. An advanced search option is also available, if you want to more closely refine your search for a specific item based on any keywords you are using.
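For readers who are comfortable with a little scripting, the idea of restricting a keyword search to "Texts" can be sketched in Python. This is only an illustration under stated assumptions: the `advancedsearch.php` address and its `q`, `fl[]`, `rows` and `output` parameters reflect the Internet Archive's public advanced search as I understand it, and the function name `build_text_search_url` is my own invention. The sketch only builds the search address; it does not contact the site.

```python
from urllib.parse import urlencode

# Assumption: address of the Internet Archive's public advanced search.
ADVANCED_SEARCH = "https://archive.org/advancedsearch.php"


def build_text_search_url(keywords, rows=50):
    """Build a search address limited to the "texts" media type.

    `keywords` is whatever you would type into the search box.
    `rows` caps how many results are requested (hypothetical default).
    """
    # Narrowing to texts mirrors choosing "Texts" from the pick list.
    query = f"({keywords}) AND mediatype:texts"
    params = [
        ("q", query),
        ("fl[]", "identifier"),  # ask for each item's identifier ...
        ("fl[]", "title"),       # ... and its title
        ("rows", rows),
        ("output", "json"),
    ]
    return f"{ADVANCED_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_text_search_url("Ackermann Repository"))
```

Dropping the `AND mediatype:texts` clause would correspond to the wider, all-media search, with the longer results list that goes with it.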

One of my favorite features of the Internet Archive is the Wayback Machine. It got its name from the machine that the very smart dog, Mr. Peabody, built for his friend Sherman, in the Peabody’s Improbable History segments which were part of the Rocky and Bullwinkle cartoons from the 1960s. Since 1996, the Internet Archive has been archiving as many web pages as they can, and all of those archived pages are available for search using the Wayback Machine. The Wayback Machine makes it possible for you to see how quite a lot of web sites looked in the past. Would you like to see how http://www.janeauten.org or http://www.georgetteheyer.com looked five or ten years ago? Just type that URL into the search box of the Wayback Machine and click the "Take Me Back" button. In fact, the Wayback Machine has become the only way to see the Good Ton web site once that wonderful traditional Regency resource went offline.
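The same lookup can be expressed as a web address instead of a trip through the search box. The Python sketch below is a hedged illustration, not an official recipe: it assumes the commonly seen `https://web.archive.org/web/<date>/<address>` pattern for archived pages and the `https://archive.org/wayback/available` lookup service, and the helper names `snapshot_url` and `availability_query` are hypothetical. It only assembles the addresses and does not fetch anything.

```python
from datetime import date
from urllib.parse import urlencode

# Assumptions: the usual address prefix for archived pages and the
# lookup service that reports the capture closest to a given date.
WAYBACK_PREFIX = "https://web.archive.org/web"
AVAILABILITY_API = "https://archive.org/wayback/available"


def snapshot_url(site_url, when):
    """Address asking for the capture of `site_url` nearest to `when`."""
    stamp = when.strftime("%Y%m%d")  # year, month, day run together
    return f"{WAYBACK_PREFIX}/{stamp}/{site_url}"


def availability_query(site_url, when):
    """Address asking whether any capture exists near `when`."""
    params = {"url": site_url, "timestamp": when.strftime("%Y%m%d")}
    return f"{AVAILABILITY_API}?{urlencode(params)}"


if __name__ == "__main__":
    target = "http://www.georgetteheyer.com"
    print(snapshot_url(target, date(2005, 6, 1)))
    print(availability_query(target, date(2005, 6, 1)))
```

If no capture exists for the exact date requested, my understanding is that the Wayback Machine shows the nearest one it has, so the date works as a hint and need not match a capture exactly.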

Once you enter a URL into the Wayback Machine search box and click the button, you will be presented with a unique search results page. Across the top is a grid of years and below that is a twelve-month calendar. When you click on a specific year, blue dots will appear on various dates on the calendar below. Each of those blue dots is a link to a snapshot of the web site for that specific date. Keyword searching is not currently supported at the Wayback Machine, so you do need the correct URL for any web site you would like to see. It must be noted that the Wayback Machine does respect robots.txt files. These are files which a web site uses to tell search engines and other automated crawlers which pages to stay away from. Any web site whose robots.txt file tells crawlers to stay away will not be included in the Internet Archive web site database. Therefore, the Internet Archive cannot make every web site of the past available to searchers, but it does have quite a lot of them available for viewing. Would you like to see how the web site of your favorite Regency author looked when they first started writing? Just type their URL into the Wayback Machine search box and go back in time to have a look.
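To make the robots.txt idea concrete, here is a small Python sketch using the standard library's `urllib.robotparser`. It is an illustration only: the sample file contents are made up, `crawler_allowed` is a hypothetical helper, and the crawler name `ia_archiver` is my assumption about how the Internet Archive's crawler has been addressed in such files. The check runs entirely on the sample text and never contacts a web site.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: one crawler is shut out completely, while
# every other crawler is only kept out of the /private/ folder.
EXAMPLE_ROBOTS = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def crawler_allowed(robots_txt, user_agent, url):
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # read rules from the text itself
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/index.html"
    print(crawler_allowed(EXAMPLE_ROBOTS, "ia_archiver", page))
    print(crawler_allowed(EXAMPLE_ROBOTS, "SomeOtherBot", page))
```

A site with rules like the first block in the sample would be one of those missing from the archived collection, while a site with only the second block would still have its public pages collected.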

One can sign up for a free Virtual Library Card with the Internet Archive. With an account, you can create bookmarks of materials within the collection that you are using for research. You can also sign up for a monthly newsletter from the Internet Archive which will keep you apprised of new additions to the collections. If you live in the San Francisco area, the home of the Internet Archive, you can also sign up for email notifications of local Internet Archive events which are held in the Bay Area.

Earlier this week, I received my copy of the Internet Archive monthly newsletter, in which they announced a most impressive statistic. They now have over two million books scanned and available online for researchers. And they are already at work scanning more books for their next million. The Internet Archive is a rich resource for researchers and scholars around the world, regardless of the topics in which they are interested. All of the books you will find there are complete copies and are all available for download. The Internet Archive also has large audio and visual archives, which include popular music and television news and entertainment programs, along with all of the web sites which can be viewed using the Wayback Machine. Take some time to look over the offerings at the Internet Archive; you are certain to find something there of interest, whether or not it is related to the Regency.

About Kathryn Kane

Historian with a particular interest in the English Regency era. An avid reader of novels set in that time, holding strong opinions on the historical accuracy to be found in said novels.
This entry was posted in Reviews.

18 Responses to The Internet Archive and the Wayback Machine

  1. This is all a bit technical for me… I will read it through a few times.

    • Kathryn Kane says:

      Fortunately, the Internet Archive has a fairly simple interface, so if your primary interest is in books, you can just go to the site, select “Texts” from the Media Type pick list, enter your keywords into the search box and click the “Go” button. That will return a list of the books in the collection which contain your keywords or phrases.

      In the US, most librarians are very familiar with the Internet Archive, and I suspect that many British librarians are as well. So, you may be able to call your local library for assistance when you are running a search. They should be able to walk you through it on the phone. Once you have run a few searches, you will probably get the hang of it.

      Good Luck!

      =^..^=

      • Excellent, thank you… it all looked very technical and scary. I never have trouble finding Google Books or Project Gutenberg online from a general search, I get scared of sites with bells and whistles! [too locked in the past lol… especially as I’m head down onto chapter 27 of the current WIP!]

  2. I’m a dork… I blame cold cocking myself with the poker this morning [long story] … I’ve been using this to look at Ackermann’s repository and La Belle Assemblee, only I got there by a different route…

    • Kathryn Kane says:

      I am sorry to hear about the poker incident. I do hope there are no lasting ill-effects!!!

      Sounds like you have already found a couple of the real gems of the Internet Archive collections, IMHO. They have more copies of both Ackermann’s Repository and La Belle Assemblee than does Google, and those copies which did not come from Google are much more legible.

      =^..^=

      • Rather! but some copies are missing the fashion plates and you have to turn to Google to find them – or occasionally ebay, because people WILL tear out the prints to mount and sell… grrrr….

        • Kathryn Kane says:

          I am right there with you on that one! I wish people had more respect for the integrity of the book or magazine they are destroying than their own financial gain. But I suppose that is asking too much from those philistines who pillage old documents for their personal profit. However, I do hope there is an especially nasty spot in hell reserved for those people!!!!

          =^..^=

  3. Ah, great reference! I tried it and got lost in the Internet Archive this afternoon, researching, reading and grinning to myself. I like The Internet Archive better than The Gutenberg Project. Both technology and the readability are great – indeed even better than google.

    Here is one of the treasures I stumbled on and would like to share:

    “Some one the other day asked the Prince of Wales at the Antient Music wether he did not think some girl pretty. ‘Girl!’ answered he, ‘Girls are not to my taste. I don’t like lamb; but mutton dressed like lamb!’ ” (from: An Irish beauty of the regency; – Frances Pery Calvert (1767-1859), p. 177)

    I shall only be allowed to visit the Internet Archive for a limited time. Otherwise, my novel will never ever be finished…

    • Kathryn Kane says:

      I have gotten lost in the Internet Archive myself a time or three, so I quite understand why one must set limits.

      Thank you very much for that quote from Prinny. It is priceless! I downloaded the Irish Beauty book a couple of weeks ago, but have not had time to read it. It just moved up a few notches on my list.

      Regards,

      Kat

  4. KWillow says:

    Thank you so much for this information! Thank-yew!Thank-yew!Thank-yew!

    I was regretful when Google Books seemed to deflate or fall apart. Many university websites are available online, but hard to navigate (possibly on purpose). This one has a rather crammed & cluttered interface, but it is worth a bit of extra effort!

    Your blog is one of the most interesting, informative, and useful I’ve ever read.

    • Kathryn Kane says:

      You are quite welcome! It was my pleasure!

      I like your choice of words; “deflate” is pretty much what seemed to happen to Google Books. I was so delighted when I first learned of it and spent hours there doing research with books to which I would never have had access otherwise. Then, in the past couple of years, it did just seem to deflate. More and more books which were once available disappeared, more often than not leaving behind those annoying and useless Snippet pages.

      It got so bad that I bought an external hard drive and now, every time I find a full copy of a book on Google Books, I download a copy on the spot because I am not sure it will be there the next time I try to view it. And, since Google has so restricted the viewing window for reading books online, by downloading a copy, I can view it full screen, which is much less frustrating and makes reading much easier.

      You are quite right about the Internet Archive; its home page is far more cluttered than it needs to be. But since I am only interested in searching for printed materials or using the Wayback Machine, I just ignore everything except the main search box and the Wayback search box. Once I have run a search, I find the results listings fairly clean and easy to use. However, if I were searching their music or film files, I think it would be much more aggravating and would almost certainly make me very cranky!

      I am glad you enjoy the blog. My primary goal is to try to ferret out interesting snippets of Regency history and put them online for the benefit of Regency authors who do not have access to the rich resources which are available to me here in Boston.

      Regards,

      Kat

  5. ULTRAGOTHA says:

    This is one of the most helpful research posts I’ve read in ages! Thank you so very much.

  6. Pingback: Using Archive.org to Research Your Novel » Mark Lord's Historical and Fantasy Fiction

  7. D.E. Schaefer says:

    The Good Ton website is back at thenonesuch.org. I wasn’t able to bring up the archive of the previous site, so I can’t compare, but it appears to have a wealth of material regarding Regencies from the glorious era.

    • Kathryn Kane says:

      YAAAAAAAAAAAYYYYYYYYYYYYYYY!!!!!!!!!!!! 🙂

      This is wonderful news!!! Thank you so much for sharing this information!!!

      You are quite right, Good Ton is a wonderful resource for those of us who love the traditional Regencies of the past. Many left comments about its demise this past summer. I am so glad to know it is back online.

      Regards,
      Kat

  8. Pingback: The New British Traveller by James Dugdale | The Regency Redingote

  9. Pingback: Rural Residences by John B. Papworth | The Regency Redingote
