An Extensible User-based XML Grammar Matching and Comparison Framework

Joe TEKLI   Richard CHBEIR
SOE, Dept. of Electrical & Computer Eng.
Lebanese American University
36 Byblos, LEBANON
  UPPA Laboratory, IUT of Bayonne
University of Pau and Adour Countries
64600 Anglet, FRANCE
joe.tekli@lau.edu.lb
www.lau.edu.lb
  richard.chbeir@univ-pau.fr
www.univ-pau.fr

    Abstract

    With the ever-increasing size of the Web, XML has emerged as a common data representation model that simplifies the tasks of interoperation and integration among heterogeneous data sources and data management systems.
    In fact, XML underlines hierarchically structured self-describing information, made of atomic and complex elements (i.e., containing sub-elements) as well as atomic attributes, thus incorporating structure and semantically rich data in one entity. In addition, XML documents usually conform to predefined grammars (i.e., DTDs or XML Schemas) which identify corresponding XML document elements and attributes, as well as element/attribute structural dispositions and the rules they adhere to in the XML document. Similarly to schemas in traditional DBMS, XML grammars are valuable for the protection, indexing, querying and retrieval of corresponding documents [Bertino et al. 2004] [Nierman and Jagadish 2002].
    Yet, with the rapidly growing amount of heterogeneous XML information on the Web, i.e., documents originated from different data-sources and not conforming to the same grammar, there is an overwhelming need to automatically process those documents for data integration, and consequently information extraction, retrieval and search functions. All these applications require, in some way or another, XML document and grammar similarity evaluation. In this area, most work has focused on estimating similarity between XML documents (i.e., data layer), which is relevant in several scenarios such as change management [Chawathe et al., 1996] [Cobéna et al., 2002], XML structural querying (finding and ranking results according to their similarity) [Schlieder T., 2001] [Zhang et al., 2003] as well as the structural clustering of XML documents gathered from the web [Nierman and Jagadish, 2002] [Dalamagas et al., 2006]. Nonetheless, few efforts have been dedicated to comparing XML grammars (i.e., type layer), useful for data integration purposes, in particular the integration of DTDs/XML schemas that contain nearly or exactly the same information but are constructed using different structures [Doan et al., 2001] [Melnik et al., 2002)]. XML grammar comparison is also exploited in data warehousing (mapping data sources to warehouse schemas), message translation (central in B2B applications) as well as XML data maintenance and schema evolution where there is a need to detect differences/updates between different versions of a given grammar to consequently revalidate corresponding XML documents [Rahm and Bernstein 2001].
    In this study, we focus on the XML grammar comparison problem, i.e., comparing DTDs and/or XML Schemas, based on their most common characteristics.Our main goal is to develop an effective grammar matching method minimizing the amount of manual work needed to perform the match task. This requires i) considering the various characteristics and constraints of XML grammars being matched (in comparison with existing ‘grammar simplifying’ approaches), and ii) providing a flexible and extensible framework for combining different matching criterions (in comparison with existing static methods) that is adapted to the semi-structured nature of XML grammars (in comparison with relatively generic approaches).

    Hereunder, we provide links to various technical reports related to our study, detailing certain components and algorithms, as well as the experimental results. The whole study will be available online soon.


    References

    1. Bertino E., Guerrini G., Mesiti M., A Matching Algorithm for Measuring the Structural Similarity between an XML Documents and a DTD and its Applications, Elsevier Computer Science, 29 (23-46), 2004.
    2. Chawathe S., Rajaraman A., Garcia-Molina H., and Widom J., Change Detection in Hierarchically Structured Information. In Proc. of SIGMOD, 1996.
    3. Cobéna G., Abiteboul S. and Marian A., Detecting Changes in XML Documents. In Proc. of the IEEE Int. Conf. on Data Engineering, 2002.
    4. Dalamagas, T., Cheng, T., Winkel, K., and Sellis, T. 2006. A methodology for clustering XML documents by structure. IS Journal, 31, 3, pp. 187-228, 2006.
    5. Doan A., Domingos P. and Halevy A.Y., Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach. In Proc. of the SIGMOD Conference, 2001
    6. Melnik S., Garcia-Molina H. and Rahm E., Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching. In Proc. of ICDE, 2002.
    7. Nierman A. and Jagadish H. V., Evaluating structural similarity in XML documents. In Proc. of ACM SIGMOD WebDB, pp. 61-66, 2002.
    8. Rahm E. and Bernstein P.A., A Survey of Approaches to Automatic Schema Matching. The VLDB Journal, 10:334-350, 2001.
    9. Schlieder T., Similarity Search in XML Data Using Cost-based Query Transformations. In Proc. of ACM SIGMOD WebDB, pp. 19-24, 2001.
    10. Zhang Z., Li R., Cao S. and Zhu Y., Similarity Metric in XML Documents. Knowledge and Experience Management Workshop, 2003.