David Green BA (Hons), PgDipLIS, MCLIP    

XML - the semantic web and information management

Information World Review - Dec 01

HTML is DEAD. There have been no new standards developments to HTML for some time now. It is being replaced by XML. Whilst easy to learn, HTML was limiting in that it only addressed the design and layout of information, and not its meaning. Given that most Internet users and systems are primarily concerned with information retrieval and exchange, this limitation was quite a handicap.

The migration from HTML to XML as the de facto web publishing mechanism will have far-reaching implications for information professionals and publishers alike. I was originally briefed to write an article on why technologists have dominated XML developments to date, with less input from information specialists. In researching it I quickly concluded that this wasn't really such an issue: most work to date has been on developing and agreeing open standards. In recent months the World Wide Web Consortium (W3C), an international industry consortium that sets open standards for the web, has finally rubber-stamped the remaining publishing and linking standards that complement XML. XML will now move into a new phase - wide-scale implementation. It is here that information professionals' skills in information management, classification schemes and indexing, searching and records management will be called upon.

XML defined
XML is a semantically focused open technology that offers far greater possibilities than mere metadata. Not only does it enable explicit description of content but, through related technology standards (more below), it also controls how data is formatted and output. This takes the web page beyond a flat display of data and allows the user to manipulate that data. Every major player in the technology industry is touting this XML-driven interoperable future. Indeed, in the database arena, XML has already become the standard approach for distributing data from one application to another.
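
To make the point about explicit description concrete, here is a minimal sketch in Python. The record and its tag names (article, title, publication, issue, author) are invented for illustration and do not follow any published vocabulary. Because each tag names what its content is, a program can pick out a field by meaning rather than by its position on the page.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: these tag names are assumptions made for this example.
ARTICLE_XML = """
<article>
  <title>XML - the semantic web and information management</title>
  <publication>Information World Review</publication>
  <issue>Dec 01</issue>
  <author>David Green</author>
</article>
""".strip()


def describe(xml_text):
    """Return a dict mapping each child tag name to its text content."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


record = describe(ARTICLE_XML)
print(record["author"])
print(record["issue"])
```

The equivalent HTML would say only that these strings are headings or paragraphs, so a program would have to guess which one is the author.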

Business information vendors
For the business information industry, which has consolidated into three mega-players (Thomson Financial, Lexis-Nexis owner Reed Elsevier, and Factiva), distribution has become a key area of competitive edge. We no longer think about internal and external sources; the goal is to aggregate these seamlessly into unified information, delivered through the Enterprise Information Portal. For example, Factiva Select is an XML content feed that allows corporate customers to host and integrate news into their intranet environment. In many ways MAID's LiveIntranet product pre-dated this. However, it was fundamentally flawed in that it was based on MAID's proprietary InfoSort indexing technology rather than on an open standard such as XML. Nobody wants to be locked into a single supplier if it can be helped.

Information management
Information management can be said to follow a cycle of discovery, acquisition, cataloguing and dissemination. XML content management systems (e.g. Interwoven) will allow information managers to manage independent content stores centrally. Data can be pulled from several sources and aggregated, and documents (web pages or other formats) generated 'on the fly'. Agent-based indexing and retrieval tools such as Autonomy can also add value by identifying related terms within and between documents and data sets, and by automatically generating XML-based hyperlinks. Just as XML is a technology standard, there is much scope for it to become a knowledge management standard as well. For example, a taxonomy would be integral to supplying the rules for automatically tagging internal data in XML.
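
As a rough illustration of a taxonomy supplying the tagging rules, the Python sketch below pulls items from two content stores, tags each one against a tiny taxonomy, and generates a single XML digest on the fly. The taxonomy terms, the source names and the digest/item/subject tags are all hypothetical, and the word matching is deliberately naive. It is not how Interwoven or Autonomy work.

```python
import xml.etree.ElementTree as ET

# Hypothetical taxonomy: subject term -> words that trigger it.
TAXONOMY = {
    "mergers": ["acquisition", "merger", "takeover"],
    "markup": ["xml", "html"],
}

# Hypothetical content pulled from two independent stores.
SOURCES = {
    "newsfeed": ["Vendor announces takeover of rival"],
    "intranet": ["Guidelines for XML authoring"],
}


def tag_item(text):
    """Return the taxonomy terms whose trigger words appear in the text."""
    words = text.lower().split()
    return sorted(
        term
        for term, triggers in TAXONOMY.items()
        if any(trigger in words for trigger in triggers)
    )


def build_digest(sources):
    """Aggregate every item into one XML document with subject tags."""
    digest = ET.Element("digest")
    for source, items in sources.items():
        for text in items:
            item = ET.SubElement(digest, "item", source=source)
            ET.SubElement(item, "text").text = text
            for term in tag_item(text):
                ET.SubElement(item, "subject").text = term
    return digest


digest = build_digest(SOURCES)
print(ET.tostring(digest, encoding="unicode"))
```

The rules live in the taxonomy rather than in the code, so an information professional could maintain them without touching the program.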

The inter-operable semantic web
Although XML tags content that scripts can then manipulate in complex ways, until recently the system interrogating the data needed to know in advance what each tag was used for. In other words, XML allowed users to add arbitrary structure to documents without saying what that structure meant. This has been resolved by the W3C's release of XML Schema: schemas define shared mark-up vocabularies and provide hooks to associate semantics with them.
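
The idea of a shared vocabulary can be sketched as follows. The Python example reads a small schema fragment and lists the element names and types it declares, then reports any tags in a document that fall outside that vocabulary. The price and currency declarations and the quote document are invented for illustration. This is only a toy check on tag names, assuming a flat document. It is not schema validation, which needs a validating parser.

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C XML Schema language, in ElementTree's {uri} notation.
XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema fragment declaring a two-element vocabulary.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="price" type="xs:decimal"/>
  <xs:element name="currency" type="xs:string"/>
</xs:schema>
""".strip()


def declared_vocabulary(schema_text):
    """Map each top-level element name in the schema to its declared type."""
    root = ET.fromstring(schema_text)
    return {el.get("name"): el.get("type") for el in root.findall(XS + "element")}


def unknown_tags(instance_text, vocabulary):
    """List child tags of the document root that the vocabulary lacks."""
    root = ET.fromstring(instance_text)
    return sorted({child.tag for child in root if child.tag not in vocabulary})


vocab = declared_vocabulary(SCHEMA)
print(vocab)
print(unknown_tags("<quote><price>9.99</price><cost>1</cost></quote>", vocab))
```

Two systems that agree on the same schema no longer need private knowledge of each other's tags.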

To reiterate, the central tenet of XML is that it addresses semantics. Tim Berners-Lee, director of the W3C and often referred to as 'the godfather' of the Internet, has been working on 'the semantic web', which he describes as an extension of the Internet as it is today. The semantic web will allow programs to browse around and exchange data without human intervention, in effect turning the Internet into a single giant computer. Microsoft is also placing a multi-million dollar bet on this vision of the near-future inter-operable Internet with its .NET project. This will allow for the automatic exchange of content and messages between software programs, applications and databases and, where appropriate, with people. Clearly this raises the requirement for verification and authentication of information sources in order to address data security and personal privacy concerns. XML Schema will allow for better validation and assurance in information exchange (e.g. e-commerce transactions) through digital signatures and other verification tools.

Publishing formats and records management
Another recently issued W3C standard has removed the other outstanding impediment to XML's general adoption. Extensible Stylesheet Language (XSL) makes complex formatting of documents possible. This allows authors to write once and publish many times to many platforms, e.g. the same content formatted differently for print, web and mobile channels. In the future, documents will be nebulous entities generated 'on the fly'. Automated personalised editions could be created for each customer. This may allow for the optimal storage of data, but using virtual data repositories to generate multiple documents raises a records management issue. As with any other electronic document management system, there will be a need to save transactional documents for legal, regulatory or business purposes, as opposed to saving only the base data elements. These documents must be as accessible as today's hardcopy documents. This is another area of XML implementation that information professionals are best placed to address for their organisations.
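
The write-once, publish-many idea can be sketched without XSL itself. In the Python example below, two ordinary functions stand in for stylesheets and render the same stored content as a web fragment and as a plain-text edition. The story/headline/body tags and both output layouts are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical stored content: written once, with no layout information.
STORY = (
    "<story>"
    "<headline>Rates &amp; markets</headline>"
    "<body>Rates held steady.</body>"
    "</story>"
)


def to_html(story):
    """Render the story for the web channel, escaping special characters."""
    return "<h1>{}</h1><p>{}</p>".format(
        escape(story.findtext("headline")), escape(story.findtext("body"))
    )


def to_text(story):
    """Render the story as a plain-text edition with an underlined headline."""
    headline = story.findtext("headline")
    return "{}\n{}\n{}".format(
        headline.upper(), "=" * len(headline), story.findtext("body")
    )


story = ET.fromstring(STORY)
html_page = to_html(story)
text_page = to_text(story)
print(html_page)
print(text_page)
```

The records management question is visible even here: the rendered outputs exist only while the program runs, so a transactional copy has to be saved deliberately if it is to be retained.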

XML also augments developments with peer-to-peer computing and information exchange and has clear ramifications for Internet search engines. Whilst there hasn't been room in this article to explore these issues, you may also find it useful to read a previous XML article that I wrote in the Feb 99 issue of IWR ('Here come the X files') which is available from the archive at the IWR website.

Related material: Section six of research paper 'The evolution of web searching' examines XML and web searching

Related articles:
The semantic web, Information World Review, Dec 02
Here come the X files, Information World Review, Feb 99


Information World Review is Europe's leading information industry publication. This article is reprinted in its entirety with permission from Learned Information Europe Ltd. All material copyright Learned Information Europe Ltd.
