Notes on Hahn 2008

Commercial companies like Google do not seem as invested in long-term preservation as librarians are; librarians need to set the best practices.
Her concern is with mass digitization programs run by Google, Yahoo, and others.
Five areas of concern:
  1. Pace
     1. Is everything happening too fast to think through policy?
     2. She lists many projects that are moving very quickly (need to find out the current status of these projects)
  2. Foolish risk versus vision
     1. The general paradigm is full speed ahead, clean up any mess later
     2. Google's opt-out policy: people have to tell Google they do not want to be included
  3. Justification for digitizing
     1. She asks: why digitize the books? Who is using books for research anyway?
     2. Unusual collections are being digitized to give access to people around the world (p. 21)
     3. Google is digitizing everything: "the more of it, the better" (p. 21)
     4. She wants us to stop and think, prioritize, and make sure we are meeting our long-term preservation goals
  4. Trust and quality
     1. Many argue that digitization is the best form of insurance (p. 21)
     2. She argues that current digitization is too slapdash and does not follow the standards and values of preservation
     3. Libraries need Google as a partner, but need Google to do a better job and should remember that it follows a corporate model
     4. Quality: she asks, is mass digitization preservation? (p. 21-22) Others argue that it isn't meant to be preservation but simply access
     5. Google is not following good preservation values
        1. Secrecy: Google is secretive about what it is doing
        2. Stability: could Google disappear or go bankrupt?
  5. Leadership
     1. A 2004 report from the Association of Research Libraries endorses digitization as a preservation method
     2. She argues that information professionals must uphold standards and best practices (p. 23)
     3. No hits for "preservation" and Google in search engines

Questions
  1. What is the current situation with Google's digitization project?
  2. Is there no concern with copyright issues? She states that Google and publishers agree that "intellectual property rights should be respected" (p. 20). However, many lawsuits suggest people aren't convinced.
  3. Open access? Who will own what?
  4. How is the semantic web/linked data involved?
Maybe she is coming at this from the wrong angle: Google is not in the business of preservation, so why focus on that?


Responses
Piper, 2013
"We should not look to google to do what libraries do best" (p. 23)
Instead lets look at HathiTrust and Digital Public Library of America whose goals are "to preserve, digitally, the great and rare collections housed in libraries, musems and other cultural institutions" (p. 23) while also allowing free access
Google's purpose, "to create a comprehensive, searchable, virtual card catalog of all books in all langauges that helps users discover new books and publishers discover new readers" (p. 23)
Different goals!
HathiTrust was created in 2008 by 12 universities of the Committee on Institutional Cooperation and the University of California library consortium, and today has 60+ member libraries. Libraries can join if they have content that contributes to the whole and/or if they are willing to pay. So far it is all academic libraries plus the New York Public Library, the Library of Congress, and the Getty Research Institute (feels very exclusive--one way of keeping poor people from access). Some of the early Google-scanned documents it includes are poor quality, but HathiTrust is committed to rescanning them. It contains at least 11 million volumes, 31% of which are in the public domain and can be viewed or downloaded by anyone.
The DPLA launched in 2010, but there is some question about what counts as a "public" library.
Piper defines a public library:
"Public libraries are many things to many people, and a large percentage of these are social. Library as space has always been a critical function of a public library, as is its role within a community. Public libraries host numerous events, discussions, and training. In addition, they are gathering spots, nodes of connectivity, and (in some cases) refuges. They are arbiters of censorship and privacy." (p. 24).
The San Francisco Public Library is a member of the DPLA. It does not really have stated goals.

My response
Both Rothenberg (1999) and Hahn (2008) raise interesting issues about what role the digital library can/will play in preserving records.
Questions I have
  1. What is the current situation with Google's digitization project? And what alternatives are out there?
  2. What are the concerns with copyright issues? She states that Google and publishers agree that "intellectual property rights should be respected" (p. 20). However, many lawsuits suggest people aren't convinced.
  3. Open access? Who will own what?
  4. How is the semantic web/linked data involved?
  5. What is already lost?

Goethals, Oury, Pearson, Sierman, and Steinke (2015) discuss the work of the IIPC (International Internet Preservation Consortium) preservation working group, noting that there are still many challenges to digital preservation. They surveyed their 46 members (mostly libraries) and found that while most (78%) have a preservation policy, 18% do not. Within the group that has a preservation policy, about 33% say the policy has a specific web archiving focus. The article presents some very interesting information about the actual preservation strategies the libraries are using and their risk factors, raising the trust issue that Hart and Liu (2003) discuss in "Trust in the Preservation of Digital Information".

And they raise an additional interesting question: how should the preservation materials be accessed and presented to the user?

Another interesting issue I came across when reading a New Yorker article on preserving web materials (Lepore, 2015) is the process of identifying web materials to be archived. Lepore describes the Internet Archive's Wayback Machine in San Francisco (sounds like something from a sci-fi flick), and the article provides real-life examples of the reasons for archiving. People can always erase web information, and it is difficult if not impossible to retrieve the original. Lepore notes that "BuzzFeed deleted more than four thousand of its staff writers' early posts, apparently because, as time passed, they looked stupider and stupider".

But that raises another issue: should we be archiving every page on the web, every day? Every hour? Lepore (2015) notes that the average life of a web page is 100 days. The Internet Archive, owner of the Wayback Machine, is apparently trying to archive the web using a combination of robots that crawl the internet, copying the web pages they find, and librarians who submit suggestions. The Wayback Machine, on average, takes a snapshot of each web page it locates about every two months, which raises the question: is that often enough?
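To make the access side of this concrete, here is a minimal sketch (my own, not from Lepore's article) that asks the Internet Archive's public Wayback Machine availability endpoint whether a given URL has ever been captured and when. The example URL and the helper name are placeholders I chose for illustration.

```python
# Minimal sketch: look up the closest Wayback Machine capture of a URL
# via the Internet Archive's public availability endpoint
# (https://archive.org/wayback/available). Example URL is a placeholder.
import json
import urllib.parse
import urllib.request

def closest_snapshot(url, timestamp=None):
    """Return the closest archived snapshot for `url`, or None if none exists."""
    params = {"url": url}
    if timestamp:  # e.g. "20150601" to look for captures near a given date
        params["timestamp"] = timestamp
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(f"https://archive.org/wayback/available?{query}") as resp:
        data = json.load(resp)
    # The endpoint returns {"archived_snapshots": {}} when nothing is archived.
    return data.get("archived_snapshots", {}).get("closest")

if __name__ == "__main__":
    snap = closest_snapshot("example.com")
    if snap:
        print(f"Closest capture: {snap['timestamp']} -> {snap['url']}")
    else:
        print("No capture found")
```

Running something like this against a page of interest shows the timestamp of the nearest capture, which is one quick way to see how stale the most recent snapshot of a page actually is.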

Goethals, A., Oury, C., Pearson, D., Sierman, B., & Steinke, T. (2015). Facing the challenge of web archives preservation collaboratively: The role and work of the IIPC preservation working group. D-Lib Magazine, 21(5), 4.
Hahn, T. B. (2008). Mass digitization: Implications for preserving the scholarly record. Library Resources & Technical Services, 52(1), 18.
Hart, P. E., & Liu, Z. (2003). Trust in the preservation of digital information. Communications of the ACM, 46(6), 93-97.

My response 2
After a bit of research, I did find that the Authors Guild is taking the mass digitization case to the Supreme Court, though it isn't clear whether the Supreme Court will hear the case or not (O'Brien, 2015). Currently, it seems that mass digitization efforts tend to be given the benefit of the doubt in the courts (O'Brien, 2015). O'Brien mentions the HathiTrust Digital Library (HTDL) as the library alternative to Google Books; apparently it was built on the Google Books scanning project and currently holds 13.7 million volumes, almost 40% of which are available to the public. Another mass project is the Internet Archive (IA), which started in 1996.
