social media question

Wilhelmina Randtke randtke at GMAIL.COM
Mon Feb 11 13:22:23 EST 2013


Here are three simple technologies to preserve online information:

1)  Print to PDF.  At the simplest level, when you are on a website, use
Ctrl+P to print the page to a PDF.  I have seen blogs put into
institutional repositories like this - one long PDF of all posts.  The most
significant collection development aspect of this is that it is easy to
explain to a potential donor, or to employees or subject librarians who
have a casual relationship with web archiving and can't be expected to
learn and follow detailed policies.  They can make a PDF that they feel is
representative of the digital object being archived.  (A scripted version
is sketched after this list.)

2)  Firefox extension DownThemAll.  You install this extension, then
you can right-click on a page and download all the files linked from that
page at once.  This is good for places where you have a big dump of PDFs
and a single splash page with links to those PDFs.  For example, where the
university has posted many old student newspapers, but these are not in a
CMS that can share Dublin Core records, you can grab all the PDFs at once
and then process the pages that link to them to connect metadata with the
files.  (See the second sketch after this list.)

3)  WinHTTrack.  This is a program that spiders a website.  It will
start at the first page, then go through all the linked pages, the pages
each of those links to, and so on, and save them all to disk.  You can also
set it to save only the first page it hits when it leaves the website you
gave it for another domain.  (That's important for microblogs, where the
content a link points to is a central part of the message.)  The third
sketch after this list shows the same crawl logic.
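
For 1), the same print-to-PDF step can be scripted when there are many
pages.  A minimal sketch in Python, assuming the pdfkit package and the
wkhtmltopdf tool are installed (the URL is a placeholder):

    # Print a page to PDF in bulk, instead of Ctrl+P by hand.
    # Assumes pdfkit (pip install pdfkit) plus the wkhtmltopdf
    # binary it wraps; the URL below is a placeholder.
    import pdfkit

    pdfkit.from_url("http://example.edu/blog/", "blog.pdf")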
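
For 2), the bulk download can also be scripted.  A rough sketch, assuming
the requests and beautifulsoup4 packages are installed (the splash page
URL is a placeholder):

    # Download every PDF linked from one splash page, keeping the
    # link text so it can be matched to the file later as metadata.
    import os
    from urllib.parse import urljoin
    import requests
    from bs4 import BeautifulSoup

    page_url = "http://example.edu/student-newspapers/"  # placeholder
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
    for a in soup.find_all("a", href=True):
        if a["href"].lower().endswith(".pdf"):
            pdf_url = urljoin(page_url, a["href"])
            name = os.path.basename(pdf_url)
            with open(name, "wb") as f:
                f.write(requests.get(pdf_url).content)
            print(name, "<-", a.get_text(strip=True))  # link text = rough metadata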
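
For 3), the crawl behavior described above looks roughly like this in
Python - a sketch of the idea, not a replacement for WinHTTrack, again
assuming requests and beautifulsoup4; "save" stands in for whatever
writes pages to disk:

    # Spider one site: follow links within the starting domain, but
    # save only the first page reached on any outside domain (that
    # matters for microblogs, where the linked content is part of
    # the message).  "save" is any callable that writes to disk.
    from urllib.parse import urljoin, urlparse
    import requests
    from bs4 import BeautifulSoup

    def crawl(start_url, save):
        home = urlparse(start_url).netloc
        queue, seen = [start_url], set()
        while queue:
            url = queue.pop()
            if url in seen:
                continue
            seen.add(url)
            page = requests.get(url)
            save(url, page.content)        # archive this page
            if urlparse(url).netloc != home:
                continue                   # off-site: keep it, go no deeper
            soup = BeautifulSoup(page.text, "html.parser")
            for link in soup.find_all("a", href=True):
                queue.append(urljoin(url, link["href"]))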


When you pick a technology to use, the most important thing is to
understand the general architecture of the website you will scrape, and
then what you will and won't be able to capture.

15 years ago, most websites were static HTML, and a simple spider could
make an exact copy of the website.  Now, most websites are database
driven.  The text you see on a WordPress, Drupal, or Facebook page is all
stored in a database.  There is no HTML page.  There is just a program that
detects the URL you want, processes it, then connects to the database to
write some HTML on the fly.  Without being able to talk with the
webmaster/developer, you will never actually get an exact copy.  (Usually
files live where you find them, so a URL pointing to a PDF/JPG/etc. means
the PDF/JPG/etc. really is there, rather than a set of rules retrieving
the correct file when the URL is entered.)  Talking to the developer and
coordinating a copy is technically harder because it requires a deeper
understanding of that website's architecture; it also takes time and may
not be feasible.
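
To make the "no HTML page" point concrete, here is a toy sketch of what a
database-driven site does.  This is hypothetical Flask and sqlite3 code,
not the internals of WordPress or Drupal:

    # A toy database-driven site: the URL is routed to a function
    # that queries a database and writes HTML on the fly.  No
    # post.html file ever exists on disk, so a spider only ever
    # sees one rendering, never the underlying data.
    import sqlite3
    from flask import Flask

    app = Flask(__name__)

    @app.route("/post/<int:post_id>")
    def show_post(post_id):
        conn = sqlite3.connect("site.db")  # hypothetical database file
        title, body = conn.execute(
            "SELECT title, body FROM posts WHERE id = ?", (post_id,)
        ).fetchone()
        return "<h1>%s</h1><p>%s</p>" % (title, body)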

For most projects you will have to use a technology that makes an
imperfect copy.  You need quality control to make sure that the copy you
are making is good enough.
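
Part of that quality control can be automated.  A rough pass, assuming the
copy was mirrored into a local directory named mirror/ (the size cutoff
and error string are guesses to tune by hand):

    # Flag saved pages that are empty, suspiciously small, or look
    # like error pages rather than real content; check those by hand.
    import pathlib

    for path in pathlib.Path("mirror").rglob("*.html"):
        text = path.read_text(errors="ignore")
        if len(text) < 500 or "Page not found" in text:
            print("check by hand:", path)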

-Wilhelmina Randtke


On Mon, Feb 11, 2013 at 9:23 AM, <dlgreen2 at fhsu.edu> wrote:

> I received a question from the person in charge of University Relations
> and I'm unsure how to answer.
>
> "Is there a library technique to preserve FHSU's history on the web or
> through social media? Much of the historic information UR uses came from
> yearbooks and campus papers. Now that those are gone, we need to be sure to
> preserve our history in a different way."
>
> What are others doing in relation to such a question? Or is anything being
> done at all? Any ideas or comments would be helpful.
> Thanks,
> Deborah
>
>
> Deborah L. Green, MLIS
> Digital Collections Librarian
> Fort Hays State University
> dlgreen2 at fhsu.edu
> (785) 628-5713 - office
> (785) 639-6179 - work cell
