The XS3 Prototype for Comparing XML Documents and Grammars

With the ever-increasing size of the Web, XML has emerged as a common data representation model that simplifies the tasks of interoperation and integration among heterogeneous data sources and data management systems.
XML represents hierarchically structured, self-describing information made of atomic and complex elements (i.e., elements containing sub-elements) as well as atomic attributes, thus incorporating structure and semantically rich data in one entity. In addition, XML documents usually conform to predefined grammars (i.e., DTDs or XML Schemas), which identify the elements and attributes of the corresponding XML documents, as well as their structural arrangement and the rules they adhere to in the document. As with schemas in traditional DBMSs, XML grammars are valuable for the protection, indexing, querying and retrieval of the corresponding documents [Bertino et al., 2004] [Nierman and Jagadish, 2002].
Yet, with the rapidly growing amount of heterogeneous XML information on the Web, i.e., documents originating from different data sources and not conforming to the same grammar, there is a pressing need to process those documents automatically for data integration and, consequently, for information extraction, retrieval and search. All these applications require, in one way or another, XML document and grammar similarity evaluation. In this area, most work has focused on estimating similarity between XML documents (i.e., the data layer), which is relevant in several scenarios such as change management [Chawathe et al., 1996] [Cobéna et al., 2002], XML structural querying (finding and ranking results according to their similarity) [Schlieder, 2001] [Zhang et al., 2003], and the structural clustering of XML documents gathered from the Web [Nierman and Jagadish, 2002] [Dalamagas et al., 2006]. Nonetheless, few efforts have been dedicated to comparing XML grammars (i.e., the type layer), which is useful for data integration, in particular the integration of DTDs/XML Schemas that contain nearly or exactly the same information but are constructed using different structures [Doan et al., 2001] [Melnik et al., 2002]. XML grammar comparison is also exploited in data warehousing (mapping data sources to warehouse schemas), message translation (central in B2B applications), and XML data maintenance and schema evolution, where differences/updates between versions of a given grammar must be detected in order to revalidate the corresponding XML documents [Rahm and Bernstein, 2001].
In this study, we focus on the XML grammar comparison problem, i.e., comparing DTDs and/or XML Schemas, based on their most common characteristics. Our main goal is to develop an effective grammar matching method that minimizes the amount of manual work needed to perform the match task. This requires i) considering the various characteristics and constraints of the XML grammars being matched (in contrast to existing ‘grammar simplifying’ approaches), and ii) providing a flexible and extensible framework for combining different matching criteria (in contrast to existing static methods) that is adapted to the semi-structured nature of XML grammars (in contrast to relatively generic approaches).

Below, we provide links to various technical reports related to our study, detailing certain components and algorithms as well as the experimental results. The whole study will be available online soon.

Joe TEKLI
SOE, Dept. of Electrical & Computer Eng., Lebanese American University, 36 Byblos, LEBANON
joe.tekli@lau.edu.lb, www.lau.edu.lb

Richard CHBEIR
UPPA Laboratory, IUT of Bayonne, University of Pau and Adour Countries, 64600 Anglet, FRANCE
richard.chbeir@univ-pau.fr, www.univ-pau.fr