David Green BA (Hons), PgDipLIS, MCLIP    

Section four - Search technology evolves

Online Information Review - Apr 00 - The evolution of web searching

Anyone who has ever seen a diagrammatic representation of the evolution of life on our planet, as we currently understand it, would notice that basic cellular lifeforms were around for a very long time before the evolution of more complex biological entities. However, once this point had been reached, the rapid diversification of life into ever more organised and intelligent forms occurred in ever decreasing timescales.

The same can be said for web search technology. By focusing their efforts on ecommerce and portalisation, the first generation of search sites - the 'big 5' - neglected their core search functionality. Whilst they reigned supreme for several years, this neglect, and their failure to adapt appropriately to a changing environment, created niche opportunities which were soon exploited by new types of search providers…

Meta search | Popularity based | Natural language | Links analysis | Newsgroups | Subject specific

Meta Search engines
Meta Search engines enable the user to search across several search engines and web directories simultaneously. Some of the most popular meta search engines include:

Dogpile - searches fourteen different engines and directories but doesn't eliminate duplicates. It was acquired in August 1999 by Go2Net for US$40M in stock and a further US$15M in cash. At the time of acquisition, Dogpile had five employees! (8)

Mamma - searches seven engines but de-duplicates and re-orders results according to its own relevance ranking algorithm.
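
The difference between simply pooling results and de-duplicating and re-ordering them can be illustrated with a minimal sketch. It is not the algorithm of any engine named above: the URL normalisation rule, the 1/rank scoring and the function names `normalise` and `merge_results` are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List
from urllib.parse import urlsplit


def normalise(url: str) -> str:
    """Reduce a URL to a comparison key (a deliberately crude, assumed rule)."""
    parts = urlsplit(url.lower())
    host = parts.netloc
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def merge_results(result_lists: Dict[str, List[str]]) -> List[str]:
    """Merge ranked URL lists from several engines into one de-duplicated list.

    Each URL earns 1/rank from every engine that returned it, so pages that
    several engines rank highly rise to the top (an assumed scoring scheme).
    """
    scores: Dict[str, float] = defaultdict(float)
    first_seen: Dict[str, str] = {}
    for urls in result_lists.values():
        counted = set()  # avoid counting a URL twice within one engine's list
        for rank, url in enumerate(urls, start=1):
            key = normalise(url)
            if key in counted:
                continue
            counted.add(key)
            scores[key] += 1.0 / rank
            first_seen.setdefault(key, url)
    # Highest score first; ties are broken alphabetically by key.
    ranked = sorted(scores, key=lambda k: (-scores[k], k))
    return [first_seen[k] for k in ranked]
```

A service that does not de-duplicate would simply concatenate the lists; here the same page returned by two engines appears once, with a combined score.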

Popularity based analysis
The first generation of search engines created indexes by spidering web sites, analysing the location and frequency of words. Web directories were compiled manually. Launched in April 1998, Direct Hit represented a radical new departure from these approaches, and dubbed its methodology 'the third way'.


The system claimed to be 'user-controlled' as the ranking of results is based on web sites that users have visited. Like many of the second-generation search technologies, it is not a separate search engine with its own index that can be accessed directly. Instead it provides a second-level analysis of search results where it is incorporated within existing search engines, one being HotBot.

Prior to licensing Direct Hit, HotBot returned a list of results based on the standard methodology of matching search terms with content on the websites in its index. Now, Direct Hit will run a second level analysis on the user's set of results. From its database it will identify those web sites which are 'popular', according to the number of visits that each web site has received, and then re-rank the search results accordingly, with the most popular websites that match your search term presented first in the list of results.

However, the popularity of a web site can be largely determined by its search engine rankings, and there are all sorts of ways to manipulate those if you have a good understanding of how search engines work. Direct Hit tries to compensate for this by boosting 'hidden gems'. For example, a web site could provide lots of valuable information about a particular topic but feature further down the list of results of search engines. If a searcher has been tenacious enough to dig down as far as result number 100 (they'd probably be an information professional!), and click on it, then Direct Hit's algorithms will give this site a big boost up the list of results the next time it appears in other searchers' lists of results. If other users don't click on this 'hidden gem' website, then it will drop down the list of results for subsequent searchers - because it didn't prove popular! (9)
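
The re-ranking behaviour described above can be sketched as follows. This is a toy model only: Direct Hit's actual algorithm is not described in detail here, so the class name `PopularityReranker`, the position-weighted boost and the `decay` factor are all assumptions.

```python
from collections import defaultdict


class PopularityReranker:
    """Toy click-popularity re-ranker (hypothetical; not Direct Hit's algorithm)."""

    def __init__(self, decay=0.5):
        self.decay = decay  # assumed penalty for results shown but not clicked
        self.scores = defaultdict(float)  # (query, url) -> popularity score

    def record(self, query, shown, clicked):
        """Update scores after one search.

        `shown` is the ranked list presented to the user and `clicked` is the
        set of URLs the user actually visited.
        """
        for position, url in enumerate(shown, start=1):
            key = (query, url)
            if url in clicked:
                # A click far down the list counts for more: the 'hidden gem' boost.
                self.scores[key] += position
            else:
                # Shown but ignored: popularity fades.
                self.scores[key] *= self.decay

    def rerank(self, query, results):
        """Most popular first; the engine's own order breaks ties (stable sort)."""
        return sorted(results, key=lambda url: -self.scores.get((query, url), 0.0))
```

In this sketch a single click on the third result lifts it above two unclicked results, and it sinks again if later searchers ignore it.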

Since its launch, the company has successfully licensed its technology to ten search sites including AOL, HotBot, Lycos, MSN and LookSmart. The technology is also available within Netscape Communicator 4.5 and Apple's Sherlock search utility.

Natural language searching
As already discussed, the first generation of search engines operated by matching the keywords submitted by the user to the contents of the web pages in their databases. They did not consider the context of the search terms, i.e. the syntactical relationships between the search terms and other vocabulary within their index. Furthermore, they searched for literal, exact matches and therefore failed to consider semantics or use thesauri (7). Most search engines also automatically ignore frequently used words such as 'or', 'to', 'not' etc.

June 1998 represented a great landmark in addressing these limitations. Two new search engines launched within weeks of one another. Both offered natural language searching, but adopted different philosophies in developing their solutions. Strangely both were named after characters in books.

Ask Jeeves
Launched June 1st 1998. Billed as 'the first natural language search agent on the Internet', it operates by matching a user's query against a database of seven million template questions. If there is no match then the user is presented with the nearest alternatives from the database and asked to select the most appropriate. It will also conduct a metasearch across AltaVista, Go (Infoseek), Lycos and Yahoo! Has now been licensed by AltaVista for its own search site. However, Artificial Intelligence (AI) experts have criticised the company's 'natural language' claims. Named after the butler in a P. G. Wodehouse novel.
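
The template-matching idea can be sketched in a few lines. This is an illustration only, not Ask Jeeves' implementation: the word-overlap similarity measure and the name `match_question` are assumptions, and a real system would hold millions of templates rather than three.

```python
import re


def tokens(text):
    """Lower-case word set of a question, ignoring punctuation."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def match_question(query, templates, top_n=3):
    """Return ('exact', [template]) or ('suggest', nearest templates).

    Similarity is the share of words two questions have in common
    (an assumed measure chosen for simplicity).
    """
    query_tokens = tokens(query)
    scored = []
    for template in templates:
        template_tokens = tokens(template)
        if template_tokens == query_tokens:
            return "exact", [template]
        union = query_tokens | template_tokens
        overlap = len(query_tokens & template_tokens) / len(union) if union else 0.0
        scored.append((overlap, template))
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep template order
    return "suggest", [t for score, t in scored[:top_n] if score > 0]
```

When no template matches exactly, the caller would show the suggested questions and ask the user to pick one, as described above.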

The Electric Monk (now defunct)
Launched a few weeks later, this search service conducts a syntactical analysis of the query using natural language algorithms. These algorithms will also make use of thesauri to consider alternative related words. The 'natural language' search is then translated into a complex boolean query and submitted to AltaVista. Named after a character in a Douglas Adams novel.
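
The translation of a plain-English question into a boolean query might look something like the sketch below. It is much cruder than the approach described: there is no syntactical analysis, only stop-word removal and synonym expansion, and the `STOP_WORDS` and `THESAURUS` tables are tiny, made-up examples.

```python
import re

# Hypothetical word lists; a real service would use far larger resources.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "in", "what", "how",
              "do", "i", "for", "or", "not"}
THESAURUS = {"car": ["automobile"], "buy": ["purchase"]}


def to_boolean_query(question, thesaurus=THESAURUS):
    """Turn a plain-English question into an AND/OR keyword query."""
    keywords = []
    for word in re.findall(r"[a-z0-9]+", question.lower()):
        if word not in STOP_WORDS and word not in keywords:
            keywords.append(word)
    clauses = []
    for word in keywords:
        alternatives = [word] + thesaurus.get(word, [])
        if len(alternatives) > 1:
            # Related words are OR-ed together so that any of them may match.
            clauses.append("(" + " OR ".join(alternatives) + ")")
        else:
            clauses.append(word)
    return " AND ".join(clauses)
```

The resulting string is the kind of query that could then be submitted to a keyword-based engine.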

Links-based analysis
The first-generation search engines have focused on building huge indexes with the goal of answering every possible kind of general query. They focus on the content of each specific page they visit with little consideration of how these pages inter-relate and connect. As already discussed, the indexing methodologies they use fail to consider the complexity of human language - syntax (sentence structure), synonyms (different words for the same meaning) and polysemy (different meanings to the same word).

Links-based analysis attempts to overcome these problems by examining the relationships between pages - the one billion or so hyperlinks that weave the web together. (1) By examining how web pages link together, links-based analysis offers methodologies for identifying authoritative sources of topic-specific information, eliciting quality, highly relevant results to users' queries. Not surprisingly, links-based analysis has quickly gained prominence amongst Internet users and is attracting a lot of attention from both computer information scientists and corporate Internet investors.

Google
Like Yahoo!, Google was developed by students at Stanford University. This technology uses a methodology known as PageRank (named after Larry Page, one of its creators) to crawl the web and analyse how websites link to each other. Results are ranked on importance i.e. how many other websites link to them. If you, as a website author, have included hyperlinks to other sites that you deem important, then you have exercised some editorial judgement. In the same way that web directories such as Yahoo! are compiled by editors on a manual basis, Google seeks to capitalise on the editorial judgement of millions of website authors on an automated basis.

As a result, of course, it can analyse far more websites than the humans who build directories such as Yahoo!. In fact, unlike search engines that become less useful the larger their index of websites becomes, Google claims to return even better results with a bigger index. Google also seeks to capitalise on the accompanying editorial commentary by processing the text around each hyperlink. (9)
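
The idea of ranking pages by the links they receive can be sketched with the commonly published form of the PageRank calculation, in which a link from an important page counts for more than a link from an obscure one. This is a simplified illustration on a toy graph, not Google's production system: the damping value of 0.85, the iteration count and the treatment of pages without outgoing links are assumptions, and the processing of text around links is not modelled.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate link-based importance for a small web graph.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is shared among all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each page passes its rank on equally to the pages it links to.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank
```

In a toy graph where three pages link to one page, that page receives the highest score, and the page it links to in turn inherits much of that importance.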

Links-based analysis does feature in the relevance ranking algorithms of some search engine providers such as Excite and HotBot. However, Google is the only publicly available web-wide search engine that is exclusively focused on links-based searching. The company estimates that its index contains between 70M and 100M pages but, through the links analysis, enables users to reach an estimated 300M web pages. Google's combination of extensive reach and greater accuracy of results has quickly catapulted this relative late-comer to top-ten status in search engine popularity. Data released by Nielsen Net Ratings in August 1999 showed that Google gained the largest month-on-month increase in unique audience figures. Visits to Google increased by a massive 88% compared to the average of 2.1% for the other top ten search engines for that month. Later that month Google signed its first licensing deal with AOL subsidiary Netscape, to be the main search provider on the Netcenter portal.

Clever
A team of IBM researchers examining search engine effectiveness developed a system that was referred to internally as HITS (Hyperlink-Induced Topic Search). The project later became known as 'Clever'.

Related to the scientific citation index (the study of how scientific papers refer to one another), Clever examines the hypertext context of a keyword search. Like Google, Clever examines hyperlinks and the surrounding commentary. Unlike Google, which crawls the web, Clever first submits the query to a search engine such as AltaVista, and then conducts its links analysis on a set of pages from the results produced by that search engine - typically about 200 pages. By adding all the pages that link to and from these 200 pages, Clever creates what is called a root set - usually between 1,000 and 5,000 pages. Using linear algebraic analysis, Clever then begins an iterative process of analysing this root set of results to divide them into two categories: authorities and hubs. (1)

Authorities are webpages about a particular topic that have lots of links to them, i.e. they are authoritative sources of information.

Hubs are webpages which are a guide to, or list, authoritative sources, i.e. they do the most citing.

Hubs are similar to portals in that they act as a jump point for anyone interested in the particular topic they cover. Unlike Google, which retains rankings for individual websites in its index, independently of the user's search query, Clever will always create a new root set for each query and prioritise each page according to the context of the specific search statement. Whilst Clever is not yet available for web-wide searching, IBM's research team is currently refining the search engine and has been experimenting with it to automatically develop web directories.
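
The iterative division of a root set into hubs and authorities can be sketched as below. This follows the generally published outline of the HITS calculation rather than IBM's own code: the function name `hits`, the fixed iteration count and the normalisation step are assumptions, and the gathering of the root set from a search engine is left out.

```python
import math


def hits(links, iterations=50):
    """Compute hub and authority scores for a small set of pages.

    `links` maps each page to the list of pages it links to.
    Returns two dictionaries: (hub scores, authority scores).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {page: 0.0 for page in pages}
        for page, targets in links.items():
            for target in targets:
                auth[target] += hub[page]
        # A page's hub score is the sum of the authority scores it links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Rescale so the scores stay bounded from one round to the next.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hub, auth
```

Because the scores are recomputed from whatever root set a query produces, the same page can rank differently for different queries, in line with the contrast drawn with Google above.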

Focused Crawler
This is another search engine technology that is being developed by IBM. However, it is not yet as developed as Clever. Unlike other search engines (including Google and Clever), which perform an analysis after they have crawled through a collection of hyperlinks, Focused Crawler, as its name suggests, seeks to identify collections of data that are highly relevant to a specific topic by crawling the web with a specific goal, ignoring irrelevant sections of the web. In other words, it only crawls websites of relevance to the user's query, rather than identifying a subset of relevant websites through an analysis of a larger set of crawled sites.

Focused Crawler crawls the web guided by a relevance and popularity mechanism that has two parts: a classifier that evaluates the relevance of a web page to the user's search query, and a distiller that identifies 'hypertext nodes that are great access points to many relevant pages within a few links'. (10)
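
The classifier-guided part of this approach can be sketched against a small in-memory 'web' so that no network access is needed. This is an illustration only, not IBM's system: the word-overlap `relevance` function stands in for a real classifier, the `threshold` and `max_pages` values are assumptions, and the distiller is not modelled.

```python
import heapq


def relevance(text, topic_words):
    """Stand-in classifier: fraction of topic words that appear in the page text."""
    if not topic_words:
        return 0.0
    words = set(text.lower().split())
    return len(words & topic_words) / len(topic_words)


def focused_crawl(web, seeds, topic_words, threshold=0.5, max_pages=10):
    """Crawl outwards from `seeds`, following links only from relevant pages.

    `web` maps a URL to a (page text, list of outgoing links) pair.
    Returns the relevant URLs in the order they were visited.
    """
    # Links found on more relevant pages are visited first (negated for a max-heap).
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue
        text, outlinks = web[url]
        fetched += 1
        score = relevance(text, topic_words)
        if score < threshold:
            continue  # irrelevant page: its links are not followed
        relevant.append(url)
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return relevant
```

Pages reachable only through irrelevant pages are never fetched, which is the saving that distinguishes this style of crawling from crawling everything and filtering afterwards.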

Newsgroup searching
The Internet delivers two primary benefits: content and connectivity. Whilst distinct, the two are often closely inter-related. Portals are a perfect example - they represent the synergistic exploitation of both content and connectivity to create ecommerce opportunities.

However, whilst the web is the primary repository of human knowledge on the Internet, it is not the only one. Newsgroups, where collections of individuals share their experiences, knowledge and opinions on a subject of common interest, constitute an important area of consideration for information retrieval. The distinction between the web and newsgroups is that the web primarily represents a large body of explicit human knowledge whilst newsgroups primarily represent a large body of implicit knowledge. Explicit, codifiable, knowledge can help individuals and organisations learn from the past to prepare for the future, but it is implicit knowledge - the realm of experience, creativity and ideas - that offers the greatest potential of adaptability for the flux that is the future. In an increasingly knowledge-based information society, it will be implicit knowledge that will be needed to successfully exploit explicit knowledge to create new opportunities and develop adaptability.

Considering this, the role of specialised newsgroup search engines will become more important as individuals use the Internet to seek out experts (or indeed anyone who is qualified) to help with their problems. This prediction is not merely based on a belief in human altruism on the part of the author, but also on phenomena such as the emergent sociology of citations on the web (10), the explosive growth of the volunteer-based Open Directory (see appendix) and the emphasis on people/expert connectivity in many corporate intranets.

There are literally thousands of newsgroups covering all manner of topics. These are organised in a tree-like structure with eight main categories: Comp, Rec, Sci, Soc, Talk, News, Alt and Misc. Due to the huge number of groups available, specialised search engines emerged to identify groups and postings relevant to users' information needs:

Deja News
Once the most widely known newsgroup search engine. It went out of service in 2001, but its Usenet archives were acquired by Google and re-introduced as Google Groups.

Reference.com
Similar to Deja News, but also enables searches in web forums (web-based bulletin boards) and mailing lists (where each posting is sent to your email address). Users also have the option to save searches for later re-use.

Subject-specific indexes

Company Information - There are many sites (usually from company and business information providers) that any researcher can visit. The amount and quality of information that will be provided for free varies. However, all such sites are web-enabled versions of commercial databases, rather than true search indexes. In a test on the performance of leading search engines and directories to deliver relevant results for searches on company names, conducted by the online industry publication 'Search Engine Watch', HotBot and Google were ranked joint first among search engines whilst Netscape Search was ranked first among web directories. (11) However, despite this impressive performance, company research is not the exclusive focus of these search sites. Launched in August 1999, 1Jump is a specialised search index that focuses exclusively on information and news about companies. In addition to providing news, this search engine also provides details of company executives (titles, age, background and email addresses), patents (every patent owned by a company) and 'Peers' (subsidiaries, parents and related companies). It also enables the user to visit other web pages that are relevant to a particular company, e.g. an industry association.

Multimedia and image files - According to industry analyst organisation Future Image in its report 'Comparative Evaluation of Web Image Search Engines', almost 70% of the web is non-textual. Considering that humans assimilate and process information in visual format more readily than in textual format, and given the greater availability of broadband capacity in the near future, the role of multimedia search engines will continue to grow. The three main specialised search engines in this area are Ditto, Scour (see the Wikipedia article on Scour) and AltaVista PhotoFinder.

Some other specialised search indexes include:
Finding People - www.whowhere.com
Law - www.lexisweb.com
Health - www.drkoop.com
Movies - www.imdb.com/search
Information Please - www.infoplease.com
