followup to Chris on proxy servers

Richard L. Goerwitz III richard at goon.stg.brown.edu
Mon Jan 24 13:18:48 EST 2000


A couple of additional comments on Chris' note, mostly just reaffirming
and extending what he said.  This is a long posting, but I hope it will
give libraries considering proxy facilities something to chew on.

Chris Zagar wrote:

> I have to admit that I have learned a great deal about the true HTTP and
> HTML specifications as a result of writing this program.  It's all there
> in the spec, but how many of you really knew that an href value of
> "//www.cnn.com/WORLD" without the http protocol header was a perfectly
> valid way to say "transfer this file using the same protocol under which
> the current document was loaded"?

At least one library-database vendor actually uses this construct, too.
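
To make that concrete, here is a minimal sketch (not our rewriter's actual
code) of what resolving such a scheme-relative href against the protocol
of the page being rewritten looks like in JavaScript:

    // Sketch: "//host/path" means "same protocol as the current document".
    function resolveHref(href, pageProtocol) {
      if (href.indexOf('//') === 0) {
        return pageProtocol + ':' + href;
      }
      return href;
    }

    // A page fetched over https yields an https URL:
    // resolveHref('//www.cnn.com/WORLD', 'https')
    //   -> 'https://www.cnn.com/WORLD'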

> JavaScript is indeed one thing that can throw these servers off....
> From a practical standpoint, it hasn't been a problem for any institution
> using EZproxy, but that is the type of thing that could throw off this
> type of proxy server.

The Chadwyck-Healey databases (because of a bug in how they assemble URLs
via JavaScript) do seem to cause a problem here, albeit not a fatal one.
It shows up when the pass-through proxy is using SSL (https) and the user
is displaying full-text search results.  C-H's JavaScript uses a literal
string 'http' when it reassembles URLs, on the assumption that their URLs
will always begin with 'http' (a reasonable, but false, assumption once
the pass-through proxy has filtered its pages).
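
A simplified illustration of that failure mode (this is not C-H's actual
code, just a sketch of the assumption involved):

    // Broken: hard-codes the literal 'http', so it misbehaves once the
    // page is being served to the user over https by the proxy.
    function brokenBuildUrl(host, path) {
      return 'http://' + host + path;
    }

    // Proxy-friendly: inherit whatever protocol the current page uses.
    // (document.location.protocol is e.g. 'https:', colon included.)
    function safeBuildUrl(host, path) {
      return document.location.protocol + '//' + host + path;
    }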

> URLs that are contained in other formats such as PDF are likely to be
> beyond what one of these products can handle.

At Brown we also do not rewrite PDF files.  We could theoretically develop
plug-ins for every media type, but we have a small IT department, and it's
just not feasible for us to maintain a lot of plug-ins.  So far it has been
much, much easier for us to explain these sorts of limitations to the few
users who have run into them than to fall back to a standard proxy and try
to teach our users how to alter their browsers to accommodate it.  We also
just can't stomach the lousy authentication facilities that regular proxies
offer.  Pass-through proxies like ours and EZproxy offer much better
facilities here.  Brown's pass-through proxy uses Kerberos, but it can
easily be reconfigured to use any other widely available authentication
method.  EZproxy is apparently going to be able to use Innovative's own
user/PIN database for authentication.  Innovative happens to be Brown's
PAC vendor (note, by the way, that I use the term PAC below in a
different sense).

> > Depending on the particular rewriting server, they may not even handle
> > such simple constructs as URLs in (or to) cascading style sheets.  To
> > handle domain-based cookies seems to require that the rewriting proxy
> > server maintain a notion of a client session.
> 
> Both of these are definitely issues that must be factored into this type
> of server, and both are handled by EZproxy.

EZproxy has a notion of a session that Brown's proxy does not.  This
enables it to forward domain-restricted cookies on behalf of clients,
which works around the following problem: if you rewrite, say,
web.lexis-nexis.com as revproxy.institution.edu:1099, cookies set for
distribution to all lexis-nexis.com machines won't work.  You can't simply
reset them for the domain in which revproxy.institution.edu lies
(institution.edu), because doing that for more than just a few vendors
will create a cookie storm any time the user accesses an institution.edu
machine.
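
Here is a rough sketch (in JavaScript) of what a session-aware rewriting
proxy has to do with such cookies.  This is not EZproxy's actual code, and
the cookie name below is made up; it just shows the idea of a server-side,
per-session cookie jar that gets replayed against the real origin host:

    // Sketch only: per-session cookie jar kept on the proxy itself.
    var jars = {};   // sessionId -> array of {name, value, domain}

    function storeCookie(sessionId, name, value, domain) {
      if (!jars[sessionId]) jars[sessionId] = [];
      jars[sessionId].push({ name: name, value: value, domain: domain });
    }

    // Build the Cookie header to send to, say, web.lexis-nexis.com on
    // the user's behalf.  The browser never sees these cookies at all.
    function cookieHeaderFor(sessionId, originHost) {
      var jar = jars[sessionId] || [];
      var pairs = [];
      for (var i = 0; i < jar.length; i++) {
        var c = jar[i];
        var suffix = c.domain.charAt(0) === '.' ? c.domain : '.' + c.domain;
        if (originHost === c.domain ||
            originHost.slice(-suffix.length) === suffix) {
          pairs.push(c.name + '=' + c.value);
        }
      }
      return pairs.join('; ');
    }

    // storeCookie('abc123', 'VENDORSESSION', 'xyz', '.lexis-nexis.com');
    // cookieHeaderFor('abc123', 'web.lexis-nexis.com')
    //   -> 'VENDORSESSION=xyz'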

> > The rewriting
> > proxy servers that we might look at have generally been tested against
> > the sites we care about (a few database providers); as a practical
> > matter, do they handle all the special cases that are relevant to
> > those sites, and can we count on those sites not using newfangled HTML
> > constructs that the rewriting proxy doesn't understand?
> 
> The risks are clearly much greater in this area than any other for URL
> rewriting proxy servers over traditional proxy servers.

There is one solution we've used (one that creates some of its own
problems): the use of a proxy autoconfiguration (PAC) file.  When we
do builds of our internal proxy database (the list of vendor machines we
proxy), we automatically create a PAC file that users can configure into
their browsers in much the same way that they'd configure a traditional
proxy.  What the PAC file does is trap any references to proxied machines
that 'leak' through the rewriting mechanism.

This solves the problem of JavaScript URLs created on the fly (and many
others to boot).
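
For the curious, here is roughly what such a PAC file looks like.  Our
real file is generated automatically from the proxy database; the second
vendor domain and the port number below are made up for illustration:

    // Illustrative PAC file (not our generated one).  Requests that
    // 'leak' through to licensed vendor hosts get sent to the proxy;
    // everything else goes out directly.
    function FindProxyForURL(url, host) {
      if (dnsDomainIs(host, ".lexis-nexis.com") ||
          dnsDomainIs(host, ".examplevendor.com")) {
        return "PROXY revproxy.institution.edu:8080";
      }
      return "DIRECT";
    }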

The problem the PAC approach creates is that users have trouble setting
it up.  And often it won't work for users behind firewalled ISPs.

It's an option, however.  A number of our users make use of PAC files
from home, and seem to be happy with the setup.  We don't use it as our
first line of defense, though.  We merely document its use for people
who understand what it does.  We have had only a few complaints about it
so far, most of them from users who are annoyed that, when the proxy
server is down, their browsers can't fetch the PAC file, and they then
see a variety of problems that are hard to self-diagnose.

In my opinion, these issues are all minor.  The filtering pass-through
proxy methodology is the only one to use if you are part of a larger
institution that cares about the security of its authentication methods
and the convenience of its user base.  In reality, the main problem we
have faced with the pass-through proxy has nothing to do with its
coverage or reliability, but rather with an issue that hasn't been raised
here: you have to maintain a list of machines to be proxied.

Unlike a traditional proxy, a filtering pass-through proxy only proxies
what you tell it to.  If you license a bunch of new IP-authenticated
journal websites, you have to forward a list of the machine names those
journals use to the person maintaining the proxy.  Although this might
seem an easy task (when the serials department catalogs a resource, it
can just forward the location on to the technical staff), the fact is
that cataloguers aren't used to doing things this way.  Most of the
burden falls on our library webmaster, who checks the new URLs for
serials as they come in to be sure they're being proxied.

Ideally, libraries should have their serials databases set up for outside
(SQL) access, so that URLs can be lifted out of the database automatically
by the proxy server.  I doubt many institutions have reached that point,
however, so there will generally be some unpleasant overhead involved in
maintaining a pass-through proxy.

Despite this, the pass-through proxy has been an extremely popular service
here.  Through it we get some really detailed statistics (which the
vendors don't always provide) on internal database usage.  Right now we
can generate exhaustive lists not only of who is using which vendors'
databases, but also of what URLs they are accessing within the vendors'
websites.  We naturally mask off actual user names, replacing them with
opaque ID numbers.  (Libraries are traditionally very careful about
protecting the privacy of their patrons.)
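
As a rough sketch (not our actual logging code), the masking can be as
simple as a table that maps each user name to an opaque sequential number
before the log line is written:

    // Sketch: the username-to-ID table stays private to the proxy; only
    // the opaque number ever appears in the usage logs.
    var nextId = 1;
    var idTable = {};            // username -> opaque ID

    function opaqueId(username) {
      if (!(username in idTable)) {
        idTable[username] = nextId++;
      }
      return idTable[username];
    }

    // opaqueId('jqpublic') might yield 42, so the log would record
    // something like "user 42 GET http://web.lexis-nexis.com/..."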

Our networking services people are very happy that we're leveraging the
existing Kerberos infrastructure, and that we're not doing it over a
plain-text link.  This would not be possible with a traditional proxy,
which (as noted above) is, in practical terms, restricted to a
frighteningly insecure method of authentication.

Like Chris, I don't see pass-through proxying as a permanent solution
to the problem.  But for now our proxy is holding up remarkably well.
The fact that we can pull it off with our small (albeit dedicated)
library and central IT staff shows that pass-through proxying is a
viable option.

====

Richard Goerwitz

