David Green BA (Hons), PgDipLIS, MCLIP    

Section six - XML

Online Information Review - Apr 00 - The evolution of web searching

Since XML (Extensible Markup Language) was completed by the World Wide Web Consortium (W3C, the body responsible for developing technical standards for the web) in early 1998, it has attracted an almost hysterically evangelical response. So just what is it, and what are its implications for web searching?

Most web pages are currently produced in HyperText Markup Language (HTML). Whilst HTML's ease of use fuelled its widespread adoption, it is limited in that it is primarily concerned with the design and layout of a webpage, rather than the information that actually appears on that page. Considering that a primary use of the web is information retrieval, this design focus is something of a drawback.

HTML is a spin-off from a much more robust mark-up language, Standard Generalized Markup Language (SGML), which was approved by the International Organization for Standardization (ISO) in 1986. However, SGML is too complex for the web. Seeking to address the limitations of HTML, the W3C developed XML: a subset of SGML that works on the web and addresses the semantic and structural considerations of information retrieval and exchange.

XML is an open technology that offers tremendous possibilities for electronic publishing, ecommerce, information retrieval and data exchange. It consists of rules that enable anyone to create their own mark-up language. XML describes information using pairs of tags that are nested inside one another to multiple levels. (13) These create a tree structure of nested hierarchies. This convention allows users direct access to just the particular segment of the information that they are interested in; for example, a hyperlink can go through to the relevant section of a document rather than to the entire document. It also enables powerful structured searching akin to database field searching, but on textual web pages.

In other words, XML not only enables explicit description of webpage content, but also describes the rules for manipulating each data set contained within the information. This enables a small program, such as a JavaScript script, to process the information on the user's own computer according to their requirements, rather than the user requesting a new web page from the central server. Multiplied by millions of web users, this capability will dramatically decrease the demands on web servers and reduce network traffic. (14) Based on open standards, XML will allow data exchange between different computer systems regardless of operating system or hardware.
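The nested tag structure and the idea of jumping straight to one segment of a document can be sketched in a few lines of Python. This is only an illustration using the standard library's `xml.etree.ElementTree`; the tag names (`article`, `section`, `heading`, `para`), the `id` attribute and the helper functions are all invented for the example and do not come from any agreed standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tag names and attribute values are invented.
DOC = """<article>
  <title>The evolution of web searching</title>
  <section id="six">
    <heading>XML</heading>
    <para>XML describes information using nested pairs of tags.</para>
  </section>
  <section id="seven">
    <heading>The Future</heading>
    <para>Placeholder text.</para>
  </section>
</article>"""

root = ET.fromstring(DOC)


def section_heading(root, section_id):
    """Return the heading of one section without touching the rest."""
    section = root.find(f"./section[@id='{section_id}']")
    return None if section is None else section.findtext("heading")


def depth(element):
    """Number of nesting levels in the tree rooted at element (a leaf is 1)."""
    return 1 + max((depth(child) for child in element), default=0)


print(section_heading(root, "six"))  # heading of the requested section only
print(depth(root))                   # article -> section -> heading/para
```

Because every segment is explicitly tagged, the lookup addresses one branch of the tree directly instead of scanning the whole text, which is the property the paragraph above relies on.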

As XML is also based on Unicode, a character encoding system that supports the intermingling of text in all of the world's major languages, it will also allow the exchange of information across national and cultural boundaries. (13)
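As a small illustration of the Unicode point, the sketch below parses a single XML document that mixes three scripts. It uses Python's standard `xml.etree.ElementTree`; the `phrases` and `phrase` tags and the `lang` attribute are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing Latin, Greek and Japanese text.
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<phrases>
  <phrase lang="en">Hello</phrase>
  <phrase lang="el">Καλημέρα</phrase>
  <phrase lang="ja">こんにちは</phrase>
</phrases>"""

# Encode to bytes first, as the document would arrive from a file or server;
# the parser then decodes it using the declared UTF-8 encoding.
root = ET.fromstring(DOC.encode("utf-8"))

# Map each language code to its text, whatever script it is written in.
by_lang = {p.get("lang"): p.text for p in root.iter("phrase")}

for lang, text in by_lang.items():
    print(lang, text)
```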

Using various style sheets written in XSL (Extensible Stylesheet Language), publishers will also be able to automatically re-purpose their content for various devices. There are even style sheets that will read the text of the webpage aloud, which is of great benefit to the visually impaired.

However, whilst XML will deliver great benefits for searching, publishing and exchanging information, these benefits will not be realised without some effort:

  • Firstly, each industry will need to agree standards for the tags used to describe information that is specific to its discipline. Mathematicians, genealogists and chemists have already agreed standards to facilitate the realisation of XML's benefits. In other areas, standards are yet to be agreed and there will be struggles over who controls the standard. (15)

  • Web publishers will require greater sophistication than simply knowledge of HTML, graphics and a few other applications. They will need new XML tools, along with computer programmers and information scientists who can interpret the content of the information being published and provide the appropriate data trees/nested hierarchies, hyperlink structures, metadata, style sheets and document type definitions (DTDs).

  • Search engines will need to learn the standard tag structure that has been agreed by each industry/interest group. They will also need to change their search interfaces to offer users the choice between text searching and field/tag searching. Currently, text-based search engines return a list of documents that contain some information relating to the user's request. XML-enabled query searching, like any other query language, will return the relevant data extracted from a document, rather than the entire document. Such query-based searching can also be used to perform computational analysis on retrieved data items and to manipulate their presentation. (15)
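The difference between text searching and field/tag searching described in the last point can be sketched as follows. This is a toy illustration in Python's standard `xml.etree.ElementTree`, not a description of how any real search engine works; the catalogue, its tag names, its values and the three helper functions are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue: records, tag names and values are invented.
CATALOGUE = """<catalogue>
  <book>
    <title>Green Chemistry</title>
    <author>A. Smith</author>
    <price>30.00</price>
  </book>
  <book>
    <title>Organic Synthesis</title>
    <author>B. Green</author>
    <price>45.50</price>
  </book>
</catalogue>"""

root = ET.fromstring(CATALOGUE)


def text_search(root, term):
    """Text search: titles of records that mention the term anywhere."""
    term = term.lower()
    return [book.findtext("title") for book in root.iter("book")
            if term in "".join(book.itertext()).lower()]


def field_search(root, field, term):
    """Field/tag search: match the term only inside the named tag."""
    term = term.lower()
    return [book.findtext("title") for book in root.iter("book")
            if term in (book.findtext(field) or "").lower()]


def total_price(root):
    """Simple computation over extracted data items rather than whole records."""
    return sum(float(price.text) for price in root.iter("price"))


print(text_search(root, "green"))             # matches title or author
print(field_search(root, "author", "green"))  # matches the author tag only
print(total_price(root))
```

The text search cannot tell a book about green chemistry from a book by an author named Green, whereas the field search can, and the final function shows a calculation performed on the extracted values alone.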

To facilitate the transition to XML, the W3C released XHTML 1.0, a hybrid of HTML 4.0 and XML, for review in August 1999. It is highly unlikely that there will ever be an HTML 5.0. Earlier in April, IBM launched the Internet's first search engine that is exclusively focused on XML data, called xCentral. This search engine is available from IBM's XML website.


Related article: XML and Information Management, published in Information World Review, Dec 01
