Are you afraid of robots?
Don't be afraid: the newest robots in your life are not threatening
to take over the world. A web robot looks nothing like R2-D2 of Star
Wars, nor is it a machine at all. A web robot is software - a computer
program. It is designed to automatically roam through the World Wide Web,
snooping around on web sites, collecting information and doing chores.
So where does the word "robot" come from? It appeared in Karel Capek's
1921 play "R.U.R.,"
which stands for Rossum's Universal Robots. "Robota" is a Czech word for "work."
Sound dangerous? Rest assured: Web Bots, as they are also known, have been operating on the Web since about 1994 and are generally quite benign. They are used for tasks like collecting information for the major Internet search engines, which is what makes it possible for you to search the web on sites like WebCrawler, Altavista, Lycos and others.
Does it sound like a virus to you? It isn't. A software virus is a program that replicates itself through different computers and networks. A web robot, on the other hand, doesn't actually move or install itself. It simply marches, or crawls, from one web site to the next by requesting web pages.
A robot is programmed to visit a web site, retrieve the home page and then follow all the links contained in the site. Various robots use different schemes for this and may not do all their travels in a single visit.
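To make that visiting pattern concrete, here is a minimal sketch in Python of a robot that starts at a home page and follows links within the same site. It is only one possible scheme, not how any particular search engine's robot works. The names crawl, LinkCollector and max_pages are made up for this illustration, and fetch is a function you would supply yourself to retrieve a page.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Visit start_url, then follow links that stay on the same host.

    fetch is a caller-supplied function (an assumption of this sketch):
    it takes a URL and returns the page's HTML as a string, or None if
    the page cannot be retrieved.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])   # pages waiting to be visited
    seen = {start_url}           # pages already queued, to avoid loops
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            link = urljoin(url, href)  # resolve relative links
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing fetch in as a parameter keeps the sketch testable without touching the network; a real robot would also pause between requests so that it does not overload the server.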
What are robots used for?
Web bots have been written for many different tasks including:
Some other Internet creature names:
Are there any unfriendly robots?
Some robots have been built, either intentionally or unintentionally, in ways that can be quite annoying. On rare occasions a badly written robot program has overloaded networks and servers and caused systems to crash. Almost all of the time, though, robots work safely and invisibly behind the scenes.
Block That Robot! - Can I stop robots snooping around my web site?
As we've said, most web bots have been silently creeping through web sites for years, mostly unnoticed. That should give you an indication that they haven't been causing too much havoc.
However, for those who have web sites or areas of web sites they wish to keep private, there are a few ways to stop most robots cold in their tracks and outside of your door.
One basic way is to use server side security on web pages or sites. If your web pages require a password to access, the robots won't be able to get in. For those who just don't want robots collecting information from their web site or parts of it, the Robots Exclusion Protocol has been developed and is implemented by all well designed web bots. It allows administrators to indicate to visiting robots which parts of their site, if any, are off limits to all or particular robots.
When a well behaved, well formed robot contacts a web site, it first checks for a file called robots.txt in the root of the web server. If the file is present, the robot analyzes it to see if there are any "Do Not Enter" requests.
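As a small illustration of where a robot looks for that file, the Python sketch below works out the robots.txt address for any page on a site. The function name robots_txt_url is invented for this example, and www.example.com is just a placeholder address.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the address of robots.txt for the site that hosts page_url.

    The file always sits at the root of the web server, whatever page
    the robot was originally heading for.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Placeholder address, used only for illustration.
print(robots_txt_url("http://www.example.com/private/report.html"))
```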
An example of a robots.txt file that tells every robot to "just go away"
- in other words, "Robots not Welcome!" - would be:
# This is a comment line in robots.txt
# User-agent - specifies the robot program name, * = all robots
# The Disallow command lists directories from which robots are banned
User-agent: *
Disallow: /
A more detailed example shows how you can permit certain robots access
but restrict which directories they can access:
User-agent: Nastybot
Disallow: /

User-agent: *
Disallow: /private
Disallow: /finance
Disallow: /cgi-bin/orfderfrm.htm
In this example, the robot called Nastybot is not allowed any access to the web site, but any other robot is welcome. However, none of those other bots should access the /private and /finance directories, nor the web page "orfderfrm.htm" in the /cgi-bin/ directory.
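If you would like to check how a well behaved robot would read rules like these, Python's standard library includes a robots.txt parser. The sketch below is one way to try it out. The rules in ROBOTS_TXT are written out from the description in the paragraph above, and the helper name is_allowed, the robot name FriendlyBot and the www.example.com addresses are assumptions made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules written out from the description above (an assumption of this
# sketch, not a copy of any real site's robots.txt).
ROBOTS_TXT = """\
User-agent: Nastybot
Disallow: /

User-agent: *
Disallow: /private
Disallow: /finance
Disallow: /cgi-bin/orfderfrm.htm
"""


def is_allowed(robot_name, url, robots_txt=ROBOTS_TXT):
    """Return True if robot_name may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(robot_name, url)


# FriendlyBot is a hypothetical robot name used only for this example.
for name, page in [
    ("Nastybot", "http://www.example.com/index.html"),
    ("FriendlyBot", "http://www.example.com/index.html"),
    ("FriendlyBot", "http://www.example.com/private/notes.html"),
]:
    print(name, page, is_allowed(name, page))
```

Keep in mind that robots.txt is only a request. The check happens inside the robot, so a badly behaved robot can simply skip it.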
Blocking a robot at a particular web page: the Robots META tag
The Robots META tag allows you to indicate to visiting robots that they should not index a web page or harvest more links from it. Unfortunately, not all robots implement this feature yet, and some may barge in even though you politely ask them not to. But most major robots abide by the Robots META tag.
Like any META tag it should be placed in the HEAD section of your HTML
page.
<HTML>
<HEAD>
<TITLE>Sample Web Page which Blocks Robots using Robots META Tag</TITLE>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</HEAD>
The Robots META tag example above tells all well behaved robots that they should neither index this document, nor follow any of the included links.
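For the curious, here is a rough Python sketch of how a robot might read that tag once it has retrieved a page. It is an illustration only, not the code of any real robot, and the names RobotsMetaParser and robots_meta are invented for it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Gathers the directives from any <META NAME="ROBOTS"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Tag and attribute names arrive lowercased; values do not.
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower()
                for part in content.split(",")
                if part.strip()
            )


def robots_meta(html):
    """Report whether a page may be indexed and its links followed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {
        "index": "noindex" not in parser.directives,
        "follow": "nofollow" not in parser.directives,
    }
```

A page with no Robots META tag at all is treated here as open for both indexing and link following, which is the usual default.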
How do I get a robot to visit my web site?
Several search engines have robots that constantly roam the internet, indexing web sites 24 hours a day. A round trip across the web currently takes about a month. Most indexing services - for example Lycos.com - allow you to submit a URL for their robot to visit and index for their search engine.
Want your site to be indexed by a robot? You can submit it to any particular search engine, usually at their home page. There is also a free service, Submit-it (http://www.siteowner.com), that will let you submit your web site to seven popular search engines all at once. Their spiders will visit your site and add it to their indexes within a few weeks.
Can I have my own web robot?
There are many bots you can download and put to work for you searching
the web for airfare bargains, web page updates, job openings, etc.
You can even write your own bot program, but it can be a lot of work even for advanced programmers. One of the best ways to start working with web bots is to try Harvest, which you can download for free from http://www.tardis.ed.ac.uk/~harvest/.
Want to know more?
There are several practical and ethical issues involved in the use of robots that are still under debate. Check out the Web robots web page at http://info.webcrawler.com/mak/projects/robots/robots.html
Feedback
Hate this column? Love this column? Have ideas for what should be covered?
Send suggestions for Internet Basics topics by email to basics@y2kegypt.com
or kilenm@bigfoot.com.
The best suggestion gets a PC World-Egypt T-shirt!
Kilen Matthews (kilenm@bigfoot.com) is an Internet and Year 2000 Consultant for Y2KEgypt LLC (www.y2kegypt.com).