Section four - Search technology evolves

Online Information Review - Apr 00 - The evolution of web searching

Anyone who has ever seen a diagrammatic representation of the evolution of life on our planet, as we currently understand it, will have noticed that basic cellular lifeforms were around for a very long time before more complex biological entities evolved. However, once this point had been reached, life diversified rapidly into ever more organised and intelligent forms, over ever decreasing timescales.

The same can be said for web search technology. By focusing their efforts on ecommerce and portalisation, the first generation of search sites - the 'big 5' - neglected their core search functionality. Whilst they reigned supreme for several years, this neglect, and their failure to adapt to a changing environment, created niche opportunities which were soon exploited by new types of search providers:

Meta search | Popularity based | Natural language | Links analysis | Newsgroups | Subject specific

Dogpile - searches fourteen different engines and directories but doesn't eliminate duplicates. It was acquired in August 1999 by Go2Net for US$40M in stock and a further US$15M in cash. At the time of acquisition, Dogpile had five employees! (8)

Popularity based analysis
Prior to licensing Direct Hit, HotBot returned a list of results based on the standard methodology of matching search terms with content on the websites in its index. Now, Direct Hit runs a second level of analysis on the user's set of results. From its database it identifies those websites which are 'popular', according to the number of visits that each website has received, and then re-ranks the search results accordingly, with the most popular websites that match the search term presented first in the list of results.

However, the popularity of a website can be largely determined by its search engine rankings, and there are all sorts of ways to manipulate those if you have a good understanding of how search engines work. Direct Hit tries to compensate for this by boosting 'hidden gems'. For example, a website could provide lots of valuable information about a particular topic but feature far down the list of results returned by search engines. If a searcher has been tenacious enough to dig down as far as result number 100 (they'd probably be an information professional!) and click on it, then Direct Hit's algorithms will give this site a big boost up the list the next time it appears in other searchers' results. If other users don't click on this 'hidden gem', it will drop back down the list for subsequent searchers - because it didn't prove popular! (9)

Since its launch, the company has successfully licensed its technology to ten search sites including AOL, HotBot, Lycos, MSN and LookSmart, and the technology is also available within Netscape Communicator 4.5 and Apple's Sherlock search utility.
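
The re-ranking idea described above can be illustrated with a minimal Python sketch. It is not Direct Hit's actual algorithm, which is not documented here. The function name `rerank_by_clicks`, the click-log format and the rule that a click counts in proportion to how deep the result was buried are all assumptions made for illustration.

```python
from collections import defaultdict


def rerank_by_clicks(results, click_log):
    """Re-rank text-matched results using recorded clicks (illustrative only).

    results:   list of URLs in the order the text-matching engine returned them.
    click_log: iterable of (url, position) pairs, one per recorded click, where
               position is the 1-based rank the URL had when it was clicked.
    """
    score = defaultdict(float)
    for url, position in click_log:
        # Assumption: a click on a deeply buried result counts for more,
        # which is one simple way to boost a 'hidden gem'.
        score[url] += position

    # Highest click score first. sorted() is stable, so results with equal
    # scores (including unclicked ones) keep their original relative order.
    return sorted(results, key=lambda url: -score[url])


# Hypothetical example data: two clicks on the top result, one on the fourth.
results = ["a.example", "b.example", "c.example", "d.example"]
clicks = [("a.example", 1), ("a.example", 1), ("d.example", 4)]
print(rerank_by_clicks(results, clicks))
# prints ['d.example', 'a.example', 'b.example', 'c.example']
```

In this sketch a single click at position 4 outweighs two clicks at position 1, so the buried page moves to the top. Making a 'hidden gem' drop again when later users ignore it would need something extra, such as letting old clicks decay over time, which is left out here.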
Natural language searching

June 1998 represented a great landmark in addressing these limitations. Two new search engines launched within weeks of one another. Both offered natural language searching, but adopted different philosophies in developing their solutions. Strangely, both were named after characters in books.

Ask Jeeves
The Electric Monk (now defunct)

Links based analysis

Links-based analysis attempts to overcome these problems by examining the relationships between pages - the one billion or so hyperlinks that weave the web together. (1) By examining how web pages link together, links-based analysis offers methodologies for identifying authoritative sources of topic-specific information, eliciting high quality, highly relevant results to users' queries. Not surprisingly, links-based analysis has quickly gained prominence amongst Internet users and is attracting a lot of attention from both computer information scientists and corporate Internet investors.

Google
As a result, of course, it can analyse far more websites than the humans who build directories such as Yahoo!. In fact, unlike search engines that become less useful the larger their index of websites becomes, Google claims to return even better results with a bigger index. Google also seeks to capitalise on the accompanying editorial commentary by processing the text around each hyperlink. (9)

Links-based analysis does feature in the relevance ranking algorithms of some search engine providers such as Excite and HotBot. However, Google is the only search engine currently publicly available for web-wide searching that is exclusively focused on links-based searching. The company estimates that its index holds between 70M and 100M pages but that, through its links analysis, it enables users to reach an estimated 300M web pages.

Google's combination of extensive reach and greater accuracy of results has quickly catapulted this relative late-comer to top-ten status in search engine popularity. Data released by Nielsen Net Ratings in August 1999 showed that Google gained the largest month-on-month increase in unique audience figures. Visits to Google increased by a massive 88%, compared to an average of 2.1% for the other top ten search engines for that month. Later that month Google signed its first licensing deal with AOL subsidiary Netscape, to be the main search provider on the Netcenter portal.

Clever

Related to the scientific citation index (the study of how scientific papers refer to one another), Clever examines the hypertext context of a keyword search. Like Google, Clever examines hyperlinks and the surrounding commentary. Unlike Google, which crawls the web, Clever first submits the query to a search engine such as AltaVista, and then conducts its links analysis on a set of pages from the results produced by that search engine - typically about 200 pages.
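
As a rough illustration of the kind of iterative links analysis Clever performs on such a set of pages, the Python sketch below scores each page twice: as an authority (a page that many good guides link to) and as a hub (a page that links to many good authorities). It follows the general hubs-and-authorities procedure on which Clever is based, not IBM's actual implementation. The function name, the dictionary-based link graph and the fixed iteration count are assumptions made for illustration.

```python
def hubs_and_authorities(links, iterations=50):
    """Score pages as hubs and authorities (illustrative sketch).

    links: dict mapping each page to the list of pages it links to.
    Returns two dicts, (hub, auth), mapping every page to its score.
    """
    # Collect every page that appears as a link source or a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]

        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}

        # Normalise so the scores stay bounded from one pass to the next.
        auth_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        hub_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}

    return hub, auth


# Hypothetical link graph: two guide pages cite the same two sites.
links = {
    "guide1": ["siteA", "siteB"],
    "guide2": ["siteA", "siteB"],
    "other": ["siteB"],
}
hub, auth = hubs_and_authorities(links)
print(max(auth, key=auth.get))  # prints siteB
```

Each pass lets the two scores reinforce one another: pages cited by strong hubs gain authority, and pages citing strong authorities gain hub weight. In the example, siteB is linked from all three other pages and ends up as the top authority, while the two guide pages score higher as hubs than the page with a single outgoing link.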
By adding all the pages that link to and from these 200 pages, Clever creates what is called a root set - usually between 1,000 and 5,000 pages. Using linear algebraic analysis, Clever then begins an iterative process of analysing this root set of results to divide them into two categories: authorities and hubs. (1)

Authorities are webpages about a particular topic that have lots of links to them, i.e. they are authoritative sources of information. Hubs are webpages which are a guide to, or list, authoritative sources, i.e. they do the most citing. Hubs are similar to portals in that they act as a jump point for anyone interested in the particular topic they cover.

Unlike Google, which retains rankings for individual websites in its index independently of the user's search query, Clever will always create a new root set for each query and prioritise each page according to the context of the specific search statement. Whilst Clever is not yet available for web-wide searching, IBM's research team is currently refining the search engine and has been experimenting with using it to develop web directories automatically.

Focused Crawler

Focused Crawler crawls the web guided by a relevance and popularity mechanism that has two parts: a classifier that evaluates the relevance of a web page to the user's search query, and a distiller that identifies 'hypertext nodes that are great access points to many relevant pages within a few links'. (10)

Newsgroup searching

However, whilst the web is the primary repository of human knowledge on the Internet, it is not the only one. Newsgroups, where collections of individuals share their experiences, knowledge and opinions on a subject of common interest, constitute an important area of consideration for information retrieval. The distinction between the web and newsgroups is that the web primarily represents a large body of explicit human knowledge whilst newsgroups primarily represent a large body of implicit knowledge.

Explicit, codifiable knowledge can help individuals and organisations learn from the past to prepare for the future, but it is implicit knowledge - the realm of experience, creativity and ideas - that offers the greatest potential for adapting to the flux that is the future. In an increasingly knowledge-based information society, it will be implicit knowledge that is needed to exploit explicit knowledge successfully, to create new opportunities and to develop adaptability.

Considering this, the role of specialised newsgroup search engines will become more important as individuals use the Internet to seek out experts (or indeed anyone who is qualified) to help with their problems. This prediction is not merely based on the author's belief in human altruism, but also on phenomena such as the emergent sociology of citations on the web (10), the explosive growth of the volunteer-based Open Directory (see appendix) and the emphasis on people/expert connectivity in many corporate intranets.

There are literally thousands of newsgroups covering all manner of topics. These are organised in a tree-like structure with eight main categories: Comp, Rec, Sci, Soc, Talk, News, Alt and Misc. Due to the huge number of groups available, specialised search engines emerged to identify the groups and postings relevant to users' information needs:

Deja News
Reference.com
Company Information - There are many sites (usually from company and business information providers) that any researcher can visit. The amount and quality of information that is provided for free varies. However, all such sites are web-enabled versions of commercial databases, rather than true search indexes.

In a test of how well leading search engines and directories deliver relevant results for searches on company names, conducted by the online industry publication 'Search Engine Watch', HotBot and Google were ranked joint first among search engines whilst Netscape Search was ranked first among web directories. (11) However, despite this impressive performance, company research is not the exclusive focus of these search sites.

Launched in August 1999, 1Jump is a specialised search index that focuses exclusively on information and news about companies. In addition to providing news, this search engine also provides details of company executives (titles, age, background and email addresses), patents (every patent owned by a company) and 'Peers' (subsidiaries, parents and related companies). It also enables the user to visit other web pages that are relevant to a particular company, e.g. an industry association.

Multimedia and image files - According to industry analyst organisation Future Image, in its report 'Comparative Evaluation of Web Image Search Engines', almost 70% of the web is non-textual. Considering that humans assimilate and process information in visual format more readily than in textual format, and that broadband capacity will become more widely available in the near future, the role of multimedia search engines will continue to grow. The three main specialised search engines in this area are:

Ditto
Scour (see Wikipedia article on Scour)
AltaVista PhotoFinder

Some other specialised search indexes include:

Next: Section five - Search Utilities and Intelligent Agents