Squid is a free, high-speed Internet proxy-caching program. So, what is a "proxy cache"?
According to Project Gutenberg's Online version of Webster's Unabridged Dictionary:
Proxy. An agent that has authority to act for another.
Cache. A hiding place for concealing and preserving provisions which it is inconvenient to carry.
Squid acts as an agent, accepting requests from clients (such as browsers)
and passing them to the appropriate Internet server. It stores a copy of
the returned data in an on-disk cache. The real benefit of Squid emerges
when the same data is requested more than once: the copy on disk is
returned to the client directly, speeding up Internet access and saving
bandwidth. Even small amounts of disk space can have a significant impact
on bandwidth usage and browsing speed.
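To make this concrete, here is a minimal squid.conf sketch showing how disk space is allocated to the cache. The path and sizes are assumptions chosen for illustration only, and the exact cache_dir syntax differs between Squid versions.

    # Hypothetical example: use up to 1000 MB of disk under
    # /usr/local/squid/cache, with 16 first-level and 256 second-level
    # directories (Squid 2.x 'ufs' storage syntax; adjust for your version).
    cache_dir ufs /usr/local/squid/cache 1000 16 256

    # Don't cache single objects larger than about 4 MB (an arbitrary
    # value picked for this example).
    maximum_object_size 4096 KB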
Internet firewalls (which are used to protect company networks) often have a proxy component. What makes the Squid proxy different from a firewall proxy? Most firewall proxies do not store copies of the returned data; instead, they re-fetch requested data from the remote Internet server each time.
Squid differs from firewall proxies in other ways too:
It supports many protocols (firewalls often use a separate proxy for each protocol, since it is difficult to ensure the security of one large program).
Proxies can be arranged in hierarchies with complex relationships between them (sketched briefly below).
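As a rough sketch of such a hierarchy, a cache can be told about its neighbours with the cache_peer directive in squid.conf; the host names and ports below are invented for illustration.

    # Hypothetical two-level hierarchy: forward cache misses to a parent
    # cache, and first ask a nearby sibling (via ICP) whether it already
    # holds the object.
    cache_peer parent-cache.example.net  parent  3128 3130
    cache_peer sibling-cache.example.net sibling 3128 3130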
When we refer to a 'cache', we are referring to a 'caching proxy'
- something that keeps copies of returned data. A 'proxy', on the other
hand, is a program that does not cache replies.
The web consists of HTML pages, graphics and sound files (to name but
a few!). Since only a very small portion of the web is made up of text,
referring to all cached data as pages is misleading. To avoid
ambiguity, caches store objects, not pages.
Many Internet servers support more than one protocol. A web server uses the Hyper Text Transfer Protocol (HTTP) to serve data, and an older protocol, the File Transfer Protocol (FTP), often runs on the same machine. Muddling the two up would be bad: caching an FTP response and returning that data to a client on a subsequent HTTP request would be incorrect. Squid therefore uses the complete URL to uniquely identify everything stored in the cache.
To avoid returning out-of-date data to clients, objects must be expired. Squid therefore allows you to set refresh times for objects, ensuring that old data is not returned to clients.
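Refresh times are set with refresh_pattern rules in squid.conf. The values below are only a sketch of the commonly quoted defaults, not tuning advice for any particular site.

    # refresh_pattern <regex> <min (minutes)> <percent> <max (minutes)>
    # FTP objects rarely change: treat them as fresh for at least a day
    # and at most a week.
    refresh_pattern ^ftp:   1440  20%  10080
    # Everything else: no guaranteed freshness, cached for up to three days.
    refresh_pattern .       0     20%  4320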
Squid is based on software developed for the Harvest project, which developed its 'cached' (pronounced 'Cache-Dee') as a side project. Squid development is funded by the National Laboratory for Applied Network Research (NLANR), which is in turn funded by the National Science Foundation (NSF). Squid is 'open source' software: although development is done mainly with NSF funding, features are added and bugs fixed by a team of online collaborators.
Why Cache?
In the USA
Small Internet Service Providers (ISPs) cache to reduce their line costs, since a large portion of their operating costs is infrastructural rather than staff-related.
Companies and content providers (such as AOL) have recently
started caching. These organizations are not short of bandwidth
(indeed, they often have as much bandwidth as a small country), but
their customers occasionally see slow response. There are numerous
reasons for this:
Origin Server Load
Raw bandwidth is increasing faster than overall computer performance.
These days, many servers act together as the back-end for a single site,
load-balancing incoming requests. Where this is not done, the result is
slow response. If you have ever received a call complaining about slow
response, you will know the benefit of caching: in many cases the user's
mind is already made up - it's your fault.
Quick Abort
Squid can be configured to continue fetching objects (within certain
size limits) even when the person who started a download aborts it.
Since there is a chance of more than one person wanting the same file,
it is useful to have a copy of the object in your cache, even if the
first user aborts. Where you have plenty of bandwidth, this continued
fetching ensures that you will have a local copy of the object available
in case someone else wants it. This can dramatically reduce latency, at
the cost of higher bandwidth usage.
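The knobs that control this behaviour are the quick_abort settings in squid.conf. The values below are a sketch, assuming you are willing to spend some bandwidth to finish nearly complete transfers.

    # If less than 64 KB remains when the client aborts, finish the fetch.
    quick_abort_min 64 KB
    # If more than 16 MB still remains, give up straight away.
    quick_abort_max 16384 KB
    # Also finish any transfer that is already more than 90% complete.
    quick_abort_pct 90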
Peer Congestion
As bandwidth increases, router speed needs to increase at the same
rate. Many peering points (where huge volumes of Internet traffic are
exchanged) do not have the router horsepower to support their
ever-increasing load. You may invest vast sums of money to maintain
a network that stays ahead of the growth curve, only to have all your
effort wasted the moment packets move off your network onto a large
peering point, or onto another service provider's network.
Traffic spikes
Large sporting, television and political events can cause spikes in
Internet traffic. Events like The Olympics, the Soccer World Cup, and
the Starr report on the Clinton-Lewinsky issue create large traffic
spikes.
You can plan ahead for sports events, but it's difficult to estimate the
load that they will eventually cause. If you are a local ISP, and a local
team reaches the finals, you are likely to get a huge peak in traffic.
Companies can also be affected by traffic spikes, with bulk transfer of
large databases or presentations flooding lines at random intervals.
Though caching cannot completely solve this problem, it can reduce the
impact.
Unreachable sites
If Squid attempts to connect to an origin server, only to find that it
is down, it will log an error and return the object from disk, even
though there is a chance of sending out-of-date data to the client. This
reduces the impact of a large-scale Internet outage, and can help when a
backhoe digs up a major segment of your network backbone.
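This stale-on-failure behaviour is built in, but a related (and more drastic) squid.conf option worth knowing about is offline_mode, sketched below; whether you want it depends entirely on how you weigh staleness against availability.

    # Aggressive variant: never try to revalidate cached objects against
    # origin servers; serve whatever is already on disk. Handy during a
    # prolonged outage, but stale data will be served until it is turned
    # off again.
    offline_mode on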
Outside of the USA
Outside of the USA, bandwidth is expensive, and latency on the very
long-haul links is high.
Costs
Outside of the USA and Canada, bandwidth is expensive, so saving it
reduces Internet infrastructure costs significantly. Because
connectivity is so costly, ISPs and their customers alike use caches to
reduce their bandwidth requirements.
Latency
Although reduction of latency is not normally the major reason for
introducing caching in these countries, the problems experienced in
the USA are exacerbated by the high latency and lower speed of the
lines to the USA.