SemIndex: Semantic Aware Inverted Index
Download prototype: SemIndex (console mode) or SemIndex+ (graphical, with alternative algorithms)
Joe Tekli, Christian Kallas & Marc Al Assad | Richard Chbeir | Yi Luo and Kokou Yetongnon
SOE, Dept. of Electrical & Computer Eng., Lebanese American University, 36 Byblos, Lebanon | UPPA Laboratory, IUT of Bayonne, University of Pau & Pays Adour, 64600 Anglet, France | LE2I Laboratory UMR-CNRS, University of Bourgogne, 21000 Dijon, France
[email protected] marc.alassad.lau.edu www.lau.edu.lb | [email protected] www.univ-pau.fr | [email protected] [email protected] www.u-bourgogne.fr
Carlos Raymundo Ibanez | Caetano Traina Jr. and Agma J.M. Traina
ICMC, Computer Science and Statistics Department, University of Sao Paulo, Sao Carlos, BRAZIL
Universidad Peruana de Ciencias Aplicadas, Computer Science Department, Lima, PERU
[email protected] [email protected] www.icmc.usp.br
[email protected] www.upc.edu.pe
I. Introduction
Processing keyword-based queries is a fundamental problem in Information Retrieval (IR), and many studies have developed effective keyword-based search techniques [5, 6, 7]. A standard containment keyword-based query, which retrieves textual identities containing a set of keywords, is generally supported by a full-text index. The inverted index is considered one of the most useful full-text indexing techniques for very large textual collections [32] and is supported by many RDBMSs. It is also increasingly used on semi-structured [6] and unstructured data [5] to support keyword-based queries. Yet the standard inverted index, which only supports exact term matching, cannot deal with data semantics.
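To make the exact-matching limitation concrete, here is a minimal sketch of a standard inverted index with a containment query. The function names and the three sample documents are hypothetical and chosen only for illustration; tokenization is a plain whitespace split, which is far simpler than what a real full-text index does.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def containment_query(index, keywords):
    """Return the ids of documents containing every keyword (exact term match only)."""
    postings = [index.get(keyword.lower(), set()) for keyword in keywords]
    return set.intersection(*postings) if postings else set()

# Hypothetical toy collection.
docs = {
    1: "a car chase through the city",
    2: "an automobile race at night",
    3: "the car race",
}
index = build_inverted_index(docs)

# Document 2 is not returned for "car", although "automobile" is a synonym.
print(containment_query(index, ["car"]))
print(containment_query(index, ["car", "race"]))
```

The query for "car" never reaches document 2, because the index stores only the literal terms it has seen. This is the gap a semantic-aware index aims to close.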
Various approaches combining different types of data and semantic knowledge have been proposed to enhance query processing. In this study, we develop a new approach called SemIndex, which integrates domain knowledge into an inverted index to support semantic-aware querying. Major benefits of our work over existing methods include:
II. System Architecture
Fig. 1. Simplified activity diagram describing our SemIndex framework
In an initial battery of experiments, we evaluated the quality of our indexing approach by assessing four main criteria: i) index building time, ii) index size and characteristics, iii) query processing time, and iv) the number of returned results. We used the IMDB movies table as the input textual collection, including the attributes movie_id and (title, plot) concatenated in one column (cf. Table 1), with a total size of around 75 MBytes and more than 7 million rows. WordNet 3.0 had a total size of around 26 MBytes, including more than 117k synsets (senses). The early prototype system and experimental results can be downloaded from the following links:
In a subsequent study, we extended the above framework into SemIndex+, which supports searching, selecting, and ranking unstructured, structured (relational), and partly structured (NoSQL) textual data. At the indexer level, we added: i) an extension of SemIndex's logical design to handle varying multi-attribute datasets (using attribute-sensitive indexers), ii) a dedicated algorithm to handle terms with missing semantic connections (which we designate as missing terms), and iii) a mathematical model for weighting SemIndex+ entries (i.e., the graph's nodes and edges). At the query processing level, we developed: iv) four alternative query processing algorithms (including a parallelized version of the core algorithm), coupled with v) a dedicated relevance scoring measure, required in the query evaluation process to retrieve and rank relevant query answers.
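The general idea of querying a weighted graph that links index terms, senses, and data objects can be illustrated with a small sketch. This is not the SemIndex+ weighting model or relevance measure, which are not specified on this page: the graph contents, the edge weights, the `max_hops` bound, and the product-of-weights score are all assumptions made for illustration.

```python
# Hypothetical toy graph: each node maps to a list of (neighbor, edge weight) pairs.
# "term:" nodes are index terms, "sense:" nodes are senses, "doc:" nodes are data objects.
GRAPH = {
    "term:car": [("doc:1", 1.0), ("sense:motor_vehicle", 0.8)],
    "term:automobile": [("doc:2", 1.0), ("sense:motor_vehicle", 0.8)],
    "sense:motor_vehicle": [("term:car", 0.8), ("term:automobile", 0.8)],
}

def semantic_search(graph, keyword, max_hops=3):
    """Expand from a keyword for up to max_hops edges and rank reachable documents.

    A node's score is the best product of edge weights over any path of at
    most max_hops edges from the keyword (an assumed scoring rule).
    """
    start = f"term:{keyword.lower()}"
    best = {start: 1.0}
    frontier = {start: 1.0}
    for _ in range(max_hops):
        next_frontier = {}
        for node, score in frontier.items():
            for neighbor, weight in graph.get(node, []):
                new_score = score * weight
                if new_score > best.get(neighbor, 0.0):
                    best[neighbor] = new_score
                    next_frontier[neighbor] = new_score
        frontier = next_frontier
    docs = [(node, score) for node, score in best.items() if node.startswith("doc:")]
    return sorted(docs, key=lambda item: -item[1])

# With three hops, "car" also reaches doc:2 through the shared sense, at a lower score.
for doc, score in semantic_search(GRAPH, "car"):
    print(doc, round(score, 2))

# With one hop, the search degenerates to exact term matching.
print(semantic_search(GRAPH, "car", max_hops=1))
```

Bounding the number of hops keeps the expansion cheap and lets a caller trade recall for precision: a bound of one reproduces the plain inverted index, while larger bounds pull in semantically related terms at decreasing scores.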
Fig. 2. Simplified activity diagram describing our SemIndex framework
In addition, we conducted an extensive experimental study comparing SemIndex+'s effectiveness and efficiency with various generic approaches (including inverted index search, query relaxation, query disambiguation, and query refinement). The new prototype system and experimental results can be downloaded from the following links:
References