UPPA        
         

 

SemIndex: Semantic Aware Inverted Index

Download prototype: SemIndex (console mode) or SemIndex+ (graphical, with alternative algorithms)

Joe Tekli, Christian Kallas & Marc Al Assad   Richard Chbeir   Yi Luo and Kokou Yetongnon  
SOE, Dept. of Electrical & Computer Eng.
Lebanese American University
36 Byblos, Lebanon
  UPPA Laboratory, IUT of Bayonne
University of Pau & Pays Adour
64600 Anglet, France
  LE2I Laboratory UMR-CNRS
University of Bourgogne
21000 Dijon, France
[email protected]
marc.alassad.lau.edu

www.lau.edu.lb
  [email protected]
www.univ-pau.fr
  [email protected]
[email protected]

www.u-bourgogne.fr

Carlos Raymundo Ibanez   Caetano Traina Jr. and Agma J.M. Traina
ICMC, Computer Science and Statistics Department
University of Sao Paulo
Sao Carlos, BRAZIL
  Universidad Peruana de Ciencias Aplicadas
Computer Science Department
Lima, PERU
[email protected]
[email protected]

www.icmc.usp.br
  [email protected]
www.upc.edu.pe

    I. Introduction

    Processing keyword-based queries is a fundamental problem in the domain of Information Retrieval (IR), where several studies have been done to develop effective keyword-based search techniques [5, 6, 7]. A standard containment keyword-based query, which retrieves textual identities containing a set of keywords, is generally supported by a full-text index. The inverted index is considered as one of the most useful full-text indexing techniques for very large textual collections [32], supported by many RDBMSs . It is also increasingly used on semi-structured [6] and unstructured data [5] to support keyword-based queries. Yet, the standard inverted index, which only supports exact term matching, cannot deal with data semantics.

    Various approaches combining different types of data and semantic knowledge have been proposed to enhance query processing. In this study, we develop a new approach called SemIndex integrating domain knowledge into an inverted index to support semantic-aware querying. Major benefits of our work over existing methods include:


    Our semantic-aware inverted index, SemIndex, maps two data resources into a single data structure, a textual data collection (represented as a traditional inverted index), and a semantic knowledge base (represented as a traditional semantic network). An extended query model with different levels of semantic awareness is defined, so that both semantic-aware queries and standard containment queries are processed within the same framework. Fig. 1 depicts the overall framework of our approach and its main components. Briefly, the Indexer manages SemIndex generation and maintenance, while the Query Processor processes and answers the semantic-aware (or standard) queries issued by the user.

    II. System Architecture

    We have implemented our SemIndex framework using Java. We also have used MySQL 5.6 as an RDBMS, and WordNet 3.0 as a knowledge base. In addition to the two basic SemIndex components consisting of: i) the indexer, and ii) the query processor, our implementation also includes: iii) a lemmatizer used to transform index terms into their lemmas, as well as iv) extensible weight computation components which are called upon within the indexer and/or query processor to compute edge/node weights as needed (weights are computed initially during indexing, and are then updated dynamically during querying). The RDBMS initially holds the input textual data collection and the knowledge base in the form of native RDBs . Java is used to send SQL queries to the RDBMS in the following order required to build SemIndex.

    Fig. 1. Simplified activity diagram describing our SemIndex framework

    In an initial battery of experiments, we evaluated the quality of our indexing approach by assessing four main criteria: i) index building time, ii) index size and characteristics, ii) query processing time, and iii) the number of returned results. We used the IMBD movies table as an input textual collection, including the attributes movie_id and (title, plot) concatenated in one column (cf. Table 1) with a total size of around 75 MBytes including more then 7 million rows. WordNet 3.0 had a total size of around 26 Mbytes, including more than 117k synsets (senses). The early prototype system and experimental results can be downloaded from the following links:

    In a subsequent study, we have extended the above framework toward SemIndex+, allowing to search, select, and rank unstructured, structured (relational) and partly structured (NoSQL) textual data. At the indexer level, we added: i) an extension of SemIndex 's logical design to handle varying multi-attribute datasets (using attribute sensitive indexers), ii) a dedicated algorithm to handle terms with missing semantic connections (which we designate as missing terms ), and iii) a mathematical model for weighting SemIndex+ entries (i.e., the graph's nodes and edges). At the query processing level, we developed: v) four alternative query processing algorithms (including a parallelized version of the core algorithm), coupled with vi) a dedicated relevance scoring measure, required in the query evaluation process in order to retrieve and rank relevant query answers.

    Fig. 2. Simplified activity diagram describing our SemIndex framework

    In addition, we conducted an extensive experimental study comparing SemIndex+ 's effectiveness and efficiency with various generic approaches (including inverted index search , query relaxation , query disambiguation , and query refinement ). The new prototype system and experimental results can be downloaded from the following links:

    References

    1. Burton-Jones A.; Storey V.C.; Sugumaran V. and Purao S., A Heuristic-Based Methodology for Semantic Augmentation of User Queries on the Web. In Proceedings ot the International Conference on Conceptual Modeling (ER'03), 2003. pp. 476\96489
    2. Carpineto C.; Romano G. and Giannini V., Improving Retrieval Feedback with Multiple Term-Ranking Function Combination. ACM Transactions on Information Systems (TOIS), 2002. 20(3):259-290
    3. Chandramouli K., Kliegr T. , Nemrava J., and S. V., Query Refinement and user Relevance Feedback for contextualized image retrieval. 5th International Conference on Visual Information Engineering (VIE), 2008. pp. 453 - 458
    4. Cimiano P.; Handschuh S. and Staab S., Towards the Self-Annotating Web. In Proceedings of the International World Wide Web Conference (WWW'04), 2004. pp. 462-471
    5. Das S., e.a., Making unstructured data sparql using semantic indexing in oracle database. In Proceedings of 29th IEEE ICDE Conf., 2012. pp. 1405\961416
    6. Florescu D., Kossmann D., and Manolescu I., Integrating Keyword Search into Xml Query Processing, . Computer Networks, 2000. 33(1-6):119\96135
    7. Frakes W.B. and R.A. Baeza-Yates, Information retrieval: Data structures and algorithms. Prentice-Hall, 1992
    8. Li Y., Yang H., and Jagadish H.V., Term Disambiguation in Natural Language Query for XML. In Proceedings of the International Conference on Flexible Query Answering Systems (FQAS), 2006. LNAI 4027, pp. 133\96146
    9. Martinenghi D. and Torlone R., Taxonomy-based relaxation of query answering in relational databases. VLDB Journal, 2014. 23(5):747-769
    10. Mishra C. and Koudas N., Interactive Query Refinement International Conference on Extending Database Technology (EDBT'09), 2009. pp. 862-873
    11. de Lima E.F. and Pedersen J.O., Phrase Recognition and Expansion for Short, Precision biased Queries based on a Query Log. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999. pp. 145-152, Berkeley, CA
    12. Navigli R., Word Sense Disambiguation: a Survey. ACM Computing Surveys, 2009. 41(2):1\9669
    13. Navigli R. and Crisafulli G., Inducing Word Senses to Improve Web Search Result Clustering. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 2010. pp. 116\96126, MIT, USA
    14. Voorhees E.M., Query Expansion Using Lexical-Semantic Relations. In Proceedings of the 17th International ACM/SIGIR Conference on Research and Development in Information Retrieval, 1994. pp. 61-69
    15. Sinh Hoa Nguyen, Wojciech Swieboda, and G. Jaskiewicz, Semantic Evaluation of Search Result Clustering Methods. Intelligent Tools for Building a Scientific Information Platform, Studies in Computational Intelligence Volume 467, 2013. 467(393-414)
    16. Wen H., Huang G.S., and L. Z., Clustering web search results using semantic information International Conference on Machine Learning and Cybernetics, 2009. 3(1504 - 1509)