Just like our universe, that constellation
of computers known as the Internet is expanding at an incredible
rate. The collapse in the cost of electronic publishing has resulted
in an explosion of "free" information - or rather it would be
free if you could find it quickly. Unfortunately, the average
Internet user's experience of looking for information often entails
expending a great deal of that other social denominator of cost:
time.
Want to search the Web? Forget it -
nobody does. When you use a search engine, what you are in fact searching
is a database of indexed webpages that are available on the Web.
Research published in the April 1998 issue of Science showed that
the percentage of the publicly available web indexed by the various
search engines ranged from an abysmal 3 per cent for Lycos to
28 per cent for AltaVista, with the top slot going to HotBot at
34 per cent coverage. Also, as search engines only index HTML
webpages, they don't even touch what's referred to as "the invisible
web" - information that resides in databases and is accessed via
the web rather than on it.
The fact is that search engines simply
can't keep up with the growth of the Web. However, whilst the
percentage of the Web they index is decreasing or remaining static,
the number of Web pages indexed continues to grow as a whole.
This greater availability of indexed pages doesn't equate to greater
relevancy and there are only so many pages of irrelevant results
users are prepared to scroll through. Consequently, search engine
technology is focusing not on increasing the size of search engine
databases, but on improving search capabilities and the relevance
of results.
Search engines work by matching the
location and frequency of users' search terms against the indexed
Web pages in their databases and presenting a list of results
ranked by relevance. They don't consider the context of the search
term, and by looking for exact, literal matches, fail to consider
semantics. Nor do they allow for the fact that a Web page is also
likely to be important if it is popular, has lots of hyperlinks
connecting to it, or itself refers to other related Web pages.
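As a rough illustration of matching on location and frequency, the sketch below scores a page by counting how often each search term appears, with extra weight for terms in the title. The function names, the page fields and the title weighting of 3 are assumptions made for this example, not a description of any particular search engine.

```python
def score_page(query_terms, title, body, title_weight=3.0):
    """Score one page: frequency of each term, weighted by location."""
    title_words = title.lower().split()
    body_words = body.lower().split()
    score = 0.0
    for term in query_terms:
        term = term.lower()
        # A match in the title counts for more than a match in the body.
        score += title_weight * title_words.count(term)
        score += body_words.count(term)
    return score


def rank_pages(query, pages):
    """Return the URLs of matching pages, best score first."""
    terms = query.split()
    scored = [(score_page(terms, p["title"], p["body"]), p["url"]) for p in pages]
    # Pages with no literal match are dropped entirely.
    scored = [(s, u) for s, u in scored if s > 0]
    scored.sort(key=lambda pair: -pair[0])
    return [u for _, u in scored]
```

Because the match is purely literal, a page about "repairing" a washing machine would score nothing for the query "fix washing machine" unless it also happened to contain those exact words.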
Several key technologies have emerged
to exploit these factors. Two new search engines launched in the
summer, Ask Jeeves and The Electric Monk (now defunct), use natural
language searching, which allows you to search for information
exactly as you would ask it: "How do I do business in Russia?"
Ask Jeeves compares your question to
its database of seven million questions whilst the Electric Monk
uses artificial intelligence to conduct a syntactical and semantic
analysis of your query, converting it to a complex search strategy
which is submitted to AltaVista. In other words, "How do I fix
my washing machine?" will also look for words such as "repair",
"mend" and "manual". The results for Electric Monk are reassuringly
accurate and the technology behind it is now directly available
from AltaVista. No more silly brackets, plus signs or quotes when
searching.
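The idea of turning a plain-English question into a broader search can be sketched in a few lines. This is only a toy: the synonym table, the stop-word list and the Boolean output format are all invented for illustration, and the Electric Monk's actual analysis is not described here in enough detail to reproduce.

```python
# Hypothetical synonym table and stop-word list, for illustration only.
SYNONYMS = {"fix": ["repair", "mend"]}
STOP_WORDS = {"how", "do", "i", "my", "the", "a", "to", "in"}


def expand_query(question):
    """Split a question into groups of interchangeable search terms."""
    words = [w.strip("?.,!\"'").lower() for w in question.split()]
    groups = []
    for word in words:
        if not word or word in STOP_WORDS:
            continue
        # Each group holds the original word plus any known synonyms.
        groups.append([word] + SYNONYMS.get(word, []))
    return groups


def to_boolean(groups):
    """Render the groups as a generic Boolean query string."""
    parts = []
    for group in groups:
        if len(group) > 1:
            parts.append("(" + " OR ".join(group) + ")")
        else:
            parts.append(group[0])
    return " AND ".join(parts)
```

With these assumed tables, "How do I fix my washing machine?" becomes `(fix OR repair OR mend) AND washing AND machine`.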
Two other technologies, Google and Clever,
are based on analysing the link structure of the Web. Developed
by students at Stanford University, Google
crawls the Web, analysing how websites link to each other and
ranking the results on importance - how many Web pages link to
each particular website. If you, as a website author, have included
hyperlinks to other Web pages or websites that you deem important
then you have exercised an editorial judgement. The text that
you may have written around this hyperlink is your editorial commentary.
By analysing hyperlinks and their surrounding text, Google seeks
to capitalise on the editorial judgement and commentary of thousands
of Web authors worldwide.
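The simplest version of this idea, counting how many other pages link to each page, can be sketched as follows. It is a deliberately crude stand-in and not Google's actual ranking method; the `links` structure and function names are assumptions for the example.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count distinct pages linking to each page.

    `links` maps each page to the list of pages it links to.
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):   # repeated links count once
            if target != source:      # a page cannot vote for itself
                counts[target] += 1
    return counts


def rank_by_inbound(links, candidates):
    """Order candidate pages by inbound link count, highest first."""
    counts = inbound_link_counts(links)
    return sorted(candidates, key=lambda page: -counts[page])
```

Each link is treated as one editorial vote. A fuller treatment would also weigh who is casting the vote and what the surrounding text says, which this sketch ignores.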
Meanwhile at IBM, a team of researchers
examining search engine effectiveness have developed a system
which was initially referred to as HITS (Hyperlink-Induced Topic
Search). Then the marketing staff became involved with the project
and it was branded as 'Clever'.
Like Google, Clever analyses hyperlinks
and their surrounding text. Unlike Google, however, Clever first
submits your query to a search engine and then conducts its analysis
on the results which have been produced. This analysis divides
Web pages into two categories: authorities - pages about a particular
topic that have lots of links to them (i.e. they are authoritative
sources of information) and hubs - pages which are a guide to,
or list, authoritative sources.
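The hub and authority split lends itself to a short sketch. The version below follows the commonly described form of the HITS iteration: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The input format, the fixed 20 iterations and the function name are assumptions, and IBM's Clever system may well differ in detail.

```python
import math


def hits(links, iterations=20):
    """Compute hub and authority scores for a small link graph.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link here.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in set(targets):
                auth[target] += hub[source]
        # Hub: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in set(links.get(p, []))) for p in pages}
        # Normalise so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm

    return hub, auth
```

In practice the graph would be built from the results a search engine returns for the query, plus the pages they link to and from, rather than from the whole Web.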
IBM has been experimenting with Clever
to automatically develop Yahoo!-style Web directories. Whilst
Clever is not yet available for general release, IBM is currently
seeking to license this technology to both portal sites and to organisations
with large intranets seeking to create their own internal directories.
Whilst Google is a search engine, Clever
and Direct Hit are supplements to search engines. Already licensed
for use with HotBot, Lycos, Apple's Sherlock search utility and
Netscape Communicator 4.5, Direct Hit is based on the concept of
popularity. It monitors which
websites users are visiting from search engines and ranks how
popular they are. For example, provided their search term is a
popular one, say "Bill Clinton", HotBot users can perform a second
level analysis of their search results by clicking on the option
"Top 10 most visited websites for Bill Clinton". The White House
website will be top of the list. Direct Hit will also give an
extra boost to "hidden gems" - any website buried further down
in the list of search results that a user visits - the next time
it appears in someone else's search results.
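A popularity filter of this kind can be sketched as a click counter keyed by search term. The class and method names below are invented for illustration and the example URLs are placeholders; how Direct Hit actually records and weighs visits is not described here.

```python
from collections import defaultdict


class ClickTracker:
    """Track which results users visit for each search term."""

    def __init__(self):
        # query -> url -> number of recorded visits
        self.clicks = defaultdict(lambda: defaultdict(int))

    def record_click(self, query, url):
        self.clicks[query.lower()][url] += 1

    def rerank(self, query, results):
        """Move frequently visited results up; ties keep their order."""
        counts = self.clicks.get(query.lower(), {})
        return sorted(results, key=lambda url: -counts.get(url, 0))

    def top_visited(self, query, n=10):
        """Return the n most visited URLs recorded for this query."""
        counts = self.clicks.get(query.lower(), {})
        return sorted(counts, key=lambda url: -counts[url])[:n]
```

A result that starts near the bottom of the list but is visited by one user will sit above unvisited results the next time the same query is run, which is the "hidden gem" boost in miniature.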
Direct Hit provides an excellent second-level
filter to identify which of your search results are really relevant,
as will Clever when it becomes available. However, with their
emphasis on "popularity" and "importance", they have one disturbing
element in common: they all reinforce the gravitational effect
exerted by large portal sites. On the Internet, content and commerce
are inextricably linked and whoever controls the distribution
of content is guaranteed substantial revenue from ecommerce.
This article is reprinted in its entirety with permission from
The Independent. All material copyright The Independent.