Archiving the Web

Introduction to Web Archiving:

According to Wikipedia, “web archiving is the process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public”. NYARC uses the proprietary, subscription-based software Archive-It, developed by the Internet Archive. Archive-It uses a web crawler, Heritrix, to capture sites (settings are configured by the institutional subscriber to ensure the scope and extent of captured content are appropriate and desired). Among its many features, Heritrix also creates WARC files. We have decided not to go too in depth on the technical side of web archiving and instead will focus more on NYARC’s collection development and the Quality Assurance process that we spent the majority of our time learning and practicing. A brief article by our supervisor, Sumitra Duncan, that gives a nice overview of the web archiving program at NYARC can be found here. Also see NYARC’s web archiving page.

After the 2012 pilot study, it became clear to members of NYARC that relevant and important web content connected to the study of art history may be in jeopardy of being lost. According to an article from 2014, websites have a lifespan of somewhere between 44 and 75 days, meaning that within that time period a page may disappear, be deleted, or simply change. In the fall of 2013, NYARC was awarded a Mellon Foundation grant to initiate a web archiving program. With the program established, NYARC’s collection development policy now includes ten collection categories (including each institutional website). They are:

  1. Art Resources
  2. Artists’ Websites
  3. Auction Houses
  4. Brooklyn Museum
  5. Catalogue Raisonnés
  6. Museum of Modern Art
  7. New York Art Resources Consortium
  8. New York City Art Galleries
  9. Restitution of Lost and Looted Art
  10. The Frick Collection

More information about these collections can be found on NYARC’s Archive-It page. Below is an illustration of NYARC’s web archiving workflow.

[Diagram: NYARC’s web archiving workflow]

As NYARC web archiving Pratt fellows, we focused the majority of our work on Quality Assurance (QA), shown in the six-o’clock position in the illustration above.

Website nomination is also part of collection development. NYARC has developed a workflow for website nominations; see the diagram below, created by Sumitra Duncan.

[Diagram: website nomination workflow]

NYARC is open to considering websites for inclusion in their collection. To nominate a site, fill out the online form.

Part of the nomination process is writing permission letters. NYARC follows a non-intrusive approach similar to the one Columbia Libraries developed. This process involves sending an initial letter with detailed background information on NYARC’s web archiving initiative. If there is no response, a second letter is sent. If there is still no response, a third letter is sent informing the institution/website author that NYARC intends to harvest the site and include it in the archive. NYARC honors all requests to remove archived content as well. Sydney and I did not participate in this part of the process, but we are familiar with the procedure and are aware that NYARC maintains a database with records of all letters sent and responses received.


QA Workflow:

The majority of our time was spent learning how to QA and putting it into practice. Every website is unique, which means each site has unique requirements regarding the best workflow for success. Once a website has been chosen for inclusion, the first step is to do a test crawl. This helps ensure that the scope settings capture the desired content. When a test crawl is complete, Archive-It generates a crawl report that allows for review and, if needed, adjustment. Once the full crawl has been completed, the Quality Assurance truly begins.

Each QA workstation is set up with two monitors so that QA technicians (NYARC interns) can view the site in Wayback on one screen and in Proxy Mode on the other. To track progress when doing QA, each technician develops a content inventory spreadsheet (a minimal sketch of one follows the list below). Generally included in a content inventory:

  • a unique ID
  • navigation title
  • page name
  • URL
  • content hierarchy
  • collection # assigned from Archive-It
  • any help ticket documentation
  • any other fields deemed important by QA technician
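
For illustration, here is a minimal Python sketch of how such an inventory might be started as a CSV file; the column names, file name, and sample values are hypothetical placeholders rather than a NYARC-mandated schema.

import csv

# Hypothetical inventory columns, mirroring the fields listed above.
FIELDS = ["id", "navigation_title", "page_name", "url", "content_hierarchy",
          "archive_it_collection", "help_ticket", "notes"]

# One illustrative row; in practice a row is added for every page reviewed.
rows = [
    {"id": 1, "navigation_title": "Home", "page_name": "Home",
     "url": "http://example.org/", "content_hierarchy": "1.0",
     "archive_it_collection": "0000", "help_ticket": "", "notes": ""},
]

# Write the inventory so progress can be picked up on the next QA day.
with open("content_inventory.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)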

Everyone who does QA has their own style of doing content inventories; there is no right or wrong way as long as the site’s progress is being tracked. We found this very helpful since we were often in only one day a week: it ensured we knew exactly where we had left off.

An important note: QA technicians, especially in the beginning, refer frequently to the NYARC wiki, particularly section 6 (developed by NDSR fellow Karl-Rainer Blumenthal).

Overview of the QA process:

  • Test Crawl
    • a set amount of time must be allotted for the crawler to run
  • Evaluate
    • was that a sufficient amount of time to capture the entire website? should this be adjusted?
  • Adjust Scope (if necessary)
    • are there any host sites which should be excluded or included?
  • Run full crawl of site
  • Compare live site to archived version (see the sketch after this list)
  • Make sure everything has been captured successfully
    • should anything be adjusted for future crawls? are future crawls necessary?
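
As a rough illustration of the compare step, the following sketch spot-checks that archived copies of a few pages respond alongside the live versions. It assumes the requests library, a placeholder collection number, and the public Archive-It Wayback URL pattern; every URL and value shown is hypothetical.

import requests

# Assumed Archive-It Wayback URL pattern; the collection number is a placeholder.
COLLECTION_ID = "0000"
WAYBACK_PREFIX = f"https://wayback.archive-it.org/{COLLECTION_ID}/*/"

# Hypothetical pages to spot-check.
pages = ["http://example.org/", "http://example.org/exhibitions/"]

for live_url in pages:
    live = requests.get(live_url, timeout=30)
    archived = requests.get(WAYBACK_PREFIX + live_url, timeout=30)
    print(f"{live_url}: live {live.status_code}, archived {archived.status_code}")

Matching status codes alone do not prove a capture is complete; the side-by-side visual comparison in Wayback and Proxy Mode is still the heart of QA.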

Common Issues:

  • Robots.txt (see the sketch after this list)
  • Dynamic Content
    • Flash
    • Video
  • Complex site structures
  • Link rot
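
Of these, robots.txt exclusions are the easiest to check ahead of time. Below is a minimal sketch using Python’s standard urllib.robotparser; the site URL, paths, and crawler name are illustrative assumptions.

from urllib import robotparser

# Load a (hypothetical) site's robots.txt and test a few paths against it.
rp = robotparser.RobotFileParser()
rp.set_url("http://example.org/robots.txt")
rp.read()

for path in ["/", "/images/", "/exhibitions/archive/"]:
    allowed = rp.can_fetch("example-crawler", "http://example.org" + path)
    print(f"{path}: {'allowed' if allowed else 'blocked by robots.txt'}")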

Patch Crawling:

In essence, patch crawling is selecting missing web content (URLs) that was deemed out of scope by Heritrix, the Archive-It web crawler. During the process of QA’ing a site, missing URLs are triggered. It is crucial that every link on a page be explored/clicked to confirm its inclusion in the capture and, if it was missed, to trigger it as missing. Our general practice is to set up a patch crawl of missing URLs at the end of the day. Oftentimes the patch crawl only takes a few minutes, but we have found that it is best reviewed 24 hours after completion. Another reason we do this at the end of the day is that we have found page-by-page patch crawling can tie up Archive-It’s web crawlers. Ideally we patch crawl 40-100+ URLs at a time. If patch crawls don’t resolve issues, QA technicians reach out to Archive-It with a support ticket, which opens a conversation with Archive-It staff and sometimes brings in their engineers to assist.
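
As a small illustration of that end-of-day batching, the following sketch deduplicates the missing URLs noted during a QA session into a single list that can be submitted as one patch crawl; the file names are hypothetical.

# Consolidate missing URLs noted during QA into one deduplicated batch.
missing = set()
with open("missing_urls_today.txt") as f:  # hypothetical per-day notes file
    for line in f:
        url = line.strip()
        if url:
            missing.add(url)

with open("patch_crawl_batch.txt", "w") as out:  # one batch, one patch crawl
    out.write("\n".join(sorted(missing)) + "\n")

print(f"{len(missing)} unique URLs queued for the end-of-day patch crawl")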


For web developer tips on creating archivable sites, see the NYARC wiki here.


Citing a Web Archived Source:

In addition to our QA work, we were tasked with updating NYARC’s FAQ page on how to cite a web page.

NYARC

It is important for researchers to know how to cite born-digital works such as an archived web resource. Since there is not yet a standard format for resources such as the Wayback Machine, NYARC enlisted the MLA to assist in creating a citation formula for archived websites. The final example consists of the normal webpage citation with the information obtained from the Wayback Machine added at the end, as follows:

Author, “article or page name”, website name, date published, date archived, [live website URL], Internet Archive, [Wayback Machine URL]

NYARC additionally states that, if the date the information was updated is missing, one should simply use the closest date available in the Wayback Machine, followed by the date when the page was retrieved.

HGSE – Harvard Graduate School of Education

Harvard’s Graduate School of Education has published examples of web citations:

Boston Public Schools (n.d.). Boston Public Schools: Focus on children. Retrieved May 31, 2008, from http://www.bostonpublicschools.org/

Name of the site (date if available). “Page name or title.” Retrieved month day, year, from URL

Purdue OWL

Since there is no official standard for citing archived websites, one must provide the bibliographic information for a traditional website and add the Wayback Machine information as previously discussed. The following is the general webpage citation template as laid out by MLA and Purdue OWL:

Editor, author, or compiler name (if available). Name of Site. Version number. Name of institution/organization affiliated with the site (sponsor or publisher), date of resource creation (if available). Medium of publication. Date of access.

It is important to note that not all websites provide all of the noted information, and citations should be adjusted slightly to fit the information available. If there is no publication date provided, indicate this with n.d.

There are also several variations on the citation depending on which part of the site one is referencing. The template above can be used for websites in their entirety, while the following apply only in specific situations.

Course/Department Websites:

Author, Title of course, department name, school name, date published, date accessed.

Webpage:

Author, “article or page title”, website name, date published, date accessed.

Images:

Artist (last, first), name of work, date of creation, home institution, city. Website name, date accessed.

Adaptations

Now that the groundwork has been laid for the various ways to cite a website and its parts, we have adapted the information provided by Purdue (MLA), adding to our entry the necessary Wayback Machine information as discussed by NYARC. The result is the following:

General Websites:

Author, “article or page name”, website name, date published, date archived, [live website URL], Internet Archive, [Wayback Machine URL]

Course/Department Websites:

Author, Title of course, department name, school name, date published (if none – n.d.), date archived, [live website URL], Internet Archive, [Wayback Machine URL]

Webpage:

Author, “article or page title”, website name, date published (if none – n.d.), date archived, [live website URL], Internet Archive, [Wayback Machine URL]

Images:

Artist (last, first), name of work, date of creation, home institution, city. Website name, date archived, [live website URL], Internet Archive, [Wayback Machine URL]


Above: Mary, Lady van Dyck, née Ruthven, Anthony van Dyck, 1640
