[Web4lib] Lies, damn lies, and usage statistics?

Wed Mar 14 08:43:12 EST 2007

Stacy:

I have been wondering about the accuracy of those statistics for over
a year now.  When we implemented our federated search engine (fse), some of
the statistics for HTML article views left me wondering about the
interaction between the source and the federated search engine.  It is not
clear to me whether HTML article views are being counted for citations
displayed by the fse.

I have also noticed a growing discrepancy between the Counter-compliant
and native Ebsco reports over the past 3 years.  Looking at
the total number of full-text articles retrieved:  in 2004 the
Counter-compliant results were 95% of the Ebsco number; in 2005 that dropped
to 74%, and in 2006 it dropped further to 65%.  Ebsco's abstracts viewed
also seemed to go way up in the last 12 months for no apparent reason.  Did
anyone else notice that?
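
For anyone who wants to repeat this comparison on their own reports, here is a minimal sketch in Python.  The function name and the yearly totals are made-up placeholders, not figures from any actual report; they were chosen only so that the output reproduces the 95%, 74% and 65% ratios quoted above.

```python
def counter_share(counter_total, native_total):
    """Return the Counter-compliant count as a percentage of the native count."""
    if native_total <= 0:
        raise ValueError("native total must be positive")
    return 100.0 * counter_total / native_total


# Placeholder totals (Counter-compliant, native) per year.
# These are illustrative numbers, NOT real report data.
yearly_totals = {
    2004: (9500, 10000),
    2005: (7400, 10000),
    2006: (6500, 10000),
}

for year, (counter_total, native_total) in sorted(yearly_totals.items()):
    share = counter_share(counter_total, native_total)
    print(f"{year}: Counter-compliant is {share:.0f}% of native")
```

Tracking the ratio year over year, rather than either raw total, is what makes a drift between the two report types visible.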

The auditing done for Counter compliance is key to statistics
being useful measures.  Ultimately these statistical reports influence our
purchasing decisions.  I agree with the people who want to know more about
this.

Katy

Kathryn K. Silberger
Automation Resources Librarian
James A. Cannavino Library
Marist College
3399 North Road
Poughkeepsie, NY  12601
Kathryn.Silberger at marist.edu
(845) 575-3000 x.2419

Stacy Pober <stacy.pober at manhattan.edu>
Sent by: web4lib-bounces at webjunction.org
03/08/2007 09:38 PM
To: web4lib at webjunction.org
Subject: [Web4lib] Lies, damn lies, and usage statistics?

I have been downloading annual database usage statistics for our
library's electronic databases.

Looking at the statistics from one of our vendors (EBSCOhost), I
noticed a peculiar thing.  Some of the databases in the report were ones
for which we had no subscriptions and no access.  Yet the report showed
usage for those databases, and it was for multiple months in several
databases, so it was not a one-time computer foul-up.

When I contacted their technical support and reported this, they said that
this was a "known issue" and explained that we could deselect those
databases when generating the usage report.  But I didn't simply want
to make the obviously bad data invisible; I wanted to know why it was
there and whether the other figures, those that were not so obviously
fictional, were accurate.

When pressed for more information on the exact nature of the problem, the
helpful support person did not elaborate, but wrote:

   "I have filed a Service Issue (think of it as a work order) to have
   your statistics "scrubbed" so that you will only be left with your
   actual statistics."

When asked for the specific reason that we were seeing fictional
usage statistics for several databases, he again assured me it was a
"known issue."  (I don't know if he thought that this was a good
substitute for a detailed explanation.  It is not.)  He sent no
technical details and wrote:

   "please rest assured that this type of problem is rare, and that
    the statistics gathered by the system are quite accurate."

Which seems to miss the point.  If we don't know what caused the
problem, why would we assume any of the usage statistics are
accurate?  The erasure of glaringly wrong figures isn't a reason
to believe that the remaining information in a report is correct.

This isn't the only vendor that provided inaccurate usage data this year.
Another vendor's statistics showed zero usage after our subscription
started.  Since I had used it at the beginning of the subscription period,
it was clear something was wrong.  When this anomaly was reported, the
vendor never explained what the problem was, but sent us some
completely different (and - surprise! - much higher) usage figures.

In the past, I never really thought about this issue, and just assumed that
most of the database usage information provided by our vendors was
reasonably accurate.  This was an inappropriately optimistic assumption.
As far as I know, there is no way to validate most of the statistics
provided to us by database vendors.

Some independent data can be obtained from our EZproxy logs, as
they show the number of times users accessed particular databases.
However, though the EZproxy server has some detailed information
about off-campus use, our on-campus users don't interact with it past
the initial database link selection.

Even if all of our usage were routed through the EZproxy server, those
logs aren't kept for that purpose, and I don't think they show some of
the most useful information, such as the number of abstracts and
full-text documents accessed.  For the databases with full-text, the
number of full-text articles or documents used is a significant figure.
The EZproxy logs can be analyzed to show PDF downloads, but
many of our databases offer much of the full-text as HTML.
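
As a rough illustration of that kind of log analysis, the Python sketch below counts successful PDF requests per vendor host.  It assumes the proxy log uses an NCSA common-log style line (host, ident, user, bracketed time, quoted request, status, bytes); the actual format depends on how the server is configured.  The function name, host names and sample lines are all hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed line layout: host ident user [time] "METHOD URL PROTOCOL" status bytes
LOG_LINE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def count_pdf_downloads(lines):
    """Count successful GET requests for .pdf paths, grouped by target host."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        if match.group("method") != "GET" or match.group("status") != "200":
            continue
        parts = urlsplit(match.group("url"))
        if parts.path.lower().endswith(".pdf"):
            counts[parts.hostname or "unknown"] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample_log = [
    '10.0.0.5 - - [08/Mar/2007:21:38:00 -0500] "GET http://vendor.example.com:80/articles/123.pdf HTTP/1.1" 200 48213',
    '10.0.0.5 - - [08/Mar/2007:21:39:10 -0500] "GET http://vendor.example.com:80/articles/123.html HTTP/1.1" 200 9120',
    '10.0.0.7 - - [08/Mar/2007:21:40:02 -0500] "GET http://other.example.org:80/fulltext/A9.PDF?sid=1 HTTP/1.1" 200 77102',
    '10.0.0.7 - - [08/Mar/2007:21:41:00 -0500] "GET http://other.example.org:80/fulltext/B2.pdf HTTP/1.1" 404 312',
]

for host, total in sorted(count_pdf_downloads(sample_log).items()):
    print(host, total)
```

Note that the HTML request in the sample is not counted at all, which is exactly the gap described above: a URL pattern can pick out PDF files, but an HTML full-text view looks like any other page request.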

Our OpenURL system offers some statistics on full-text retrievals,
but that system only works with full-text access across different
databases.  The OpenURL system won't come into play for those
sessions where the search and the full-text are in the same database.

Aside from the limited nature of the independent usage statistics
available, checking the accuracy of the vendor-supplied statistics
would be a major pain to do on a regular basis.

I'm just bringing this up as a concern.  I'm sure that I am not the only
librarian who assumed in the past that the vendor-supplied usage data
was correct.  Since we use that data as an important factor in our
database acquisition and renewal decisions, it would be nice to have
some independent assurance of the accuracy of the data we're getting
from vendors.

I don't really think that our database providers are using Ouija boards
to produce our usage reports.  The question is whether they are routinely
checking the validity of the figures they collect and supply to us.
Apparently some of them are not doing logic and accuracy testing of the
software they use to produce the usage statistics.
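
One simple logic check of this kind, sketched in Python under stated assumptions: flag any report row that shows usage for a database the library does not subscribe to.  The row layout (database, month, count), the function name and the sample data are all hypothetical; a real report would first have to be parsed into this shape.

```python
def usage_for_unsubscribed(report_rows, subscribed):
    """Return rows reporting nonzero usage for databases outside the subscribed set."""
    return [
        (database, month, count)
        for database, month, count in report_rows
        if count > 0 and database not in subscribed
    ]


# Hypothetical subscription list and report rows, for illustration only.
subscribed = {"Database A", "Database B"}
report_rows = [
    ("Database A", "2006-01", 120),
    ("Database C", "2006-01", 14),
    ("Database C", "2006-02", 9),
    ("Database B", "2006-02", 0),
]

for row in usage_for_unsubscribed(report_rows, subscribed):
    print("suspect row:", row)
```

A check like this only catches the obviously impossible figures; it says nothing about whether the plausible-looking rows are correct.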

Has anyone checked the accuracy of vendor-supplied database usage
data?  If you have, how did you do it and what results did you find?

--
Stacy Pober
Information Alchemist
Manhattan College
O'Malley Library
Riverdale, NY 10471
stacy.pober at manhattan.edu

"If you want to inspire confidence, give plenty of statistics.
It does not matter that they should be accurate, or even intelligible,
as long as there is enough of them."  - Lewis Carroll