SemIndex

Download prototype: SemIndex (console mode) or SemIndex+ (graphical, with alternative algorithms)

I. Introduction

Processing keyword-based queries is a fundamental problem in the domain of Information Retrieval (IR), where several studies have been done to develop effective keyword-based search techniques [5, 6, 7]. A standard containment keyword-based query, which retrieves textual identities containing a set of keywords, is generally supported by a full-text index. The inverted index is considered as one of the most useful full-text indexing techniques for very large textual collections [32], supported by many RDBMSs . It is also increasingly used on semi-structured [6] and unstructured data [5] to support keyword-based queries. Yet, the standard inverted index, which only supports exact term matching, cannot deal with data semantics.

Various approaches combining different types of data and semantic knowledge have been proposed to enhance query processing. In this study, we develop a new approach called SemIndex integrating domain knowledge into an inverted index to support semantic-aware querying. Major benefits of our work over existing methods include:

Pre-processing the index: Existing works use semantic knowledge to pre-process queries, such as query rewriting/relaxation and query suggestion [1, 4, 9], or to post-process the query results, such as using semantic result clustering [13, 15, 16]. Our work proposes a better alternative considering the semantic gap and enclosing semantic knowledge directly into an inverted index, so that the semantic-aware processing tasks can be done prior to query processing: at the most basic index level, where more opportunities can be explored toward both speed-ups and semantic-based filtering.
User involvement: Most existing works introduce some predefined parameters (heuristics) or user feedback to rewrite queries such that users are only involved in the query refinement (expansion, filtering, etc.) process after providing the first round of results [2, 3, 10]. In our work, we aim at allowing end-users to write, using the same framework, classical queries as well as semantically enriched queries, according to their needs. The users can get themselves more involved in the whole process: during the initial query writing, and then performing query rewriting.
Providing more relevant results: Most existing works focus on understanding the meaning of data/queries through semantic disambiguation [1, 8, 13], which: i) is usually a complex process requiring substantial processing time [12], and ii) depends on the query/data context which is not always sufficiently available, and thus does not guarantee correct results [11, 14]. The goal of our work is, with the help of semantic knowledge, to find more semantically relevant results than what a traditional inverted index or semantic-aware approaches can provide, while doing it more efficiently than existing disambiguation techniques .

Our semantic-aware inverted index, SemIndex, maps two data resources into a single data structure, a textual data collection (represented as a traditional inverted index), and a semantic knowledge base (represented as a traditional semantic network). An extended query model with different levels of semantic awareness is defined, so that both semantic-aware queries and standard containment queries are processed within the same framework. Fig. 1 depicts the overall framework of our approach and its main components. Briefly, the Indexer manages SemIndex generation and maintenance, while the Query Processor processes and answers the semantic-aware (or standard) queries issued by the user.

II. System Architecture

We have implemented our SemIndex framework using Java. We also have used MySQL 5.6 as an RDBMS, and WordNet 3.0 as a knowledge base. In addition to the two basic SemIndex components consisting of: i) the indexer, and ii) the query processor, our implementation also includes: iii) a lemmatizer used to transform index terms into their lemmas, as well as iv) extensible weight computation components which are called upon within the indexer and/or query processor to compute edge/node weights as needed (weights are computed initially during indexing, and are then updated dynamically during querying). The RDBMS initially holds the input textual data collection and the knowledge base in the form of native RDBs . Java is used to send SQL queries to the RDBMS in the following order required to build SemIndex.

Fig. 1. Simplified activity diagram describing our SemIndex framework

In an initial battery of experiments, we evaluated the quality of our indexing approach by assessing four main criteria: i) index building time, ii) index size and characteristics, ii) query processing time, and iii) the number of returned results. We used the IMBD movies table as an input textual collection, including the attributes movie_id and (title, plot) concatenated in one column (cf. Table 1) with a total size of around 75 MBytes including more then 7 million rows. WordNet 3.0 had a total size of around 26 Mbytes, including more than 117k synsets (senses). The early prototype system and experimental results can be downloaded from the following links:

SemIndex prototype
Experimental results

In a subsequent study, we have extended the above framework toward SemIndex+, allowing to search, select, and rank unstructured, structured (relational) and partly structured (NoSQL) textual data. At the indexer level, we added: i) an extension of SemIndex 's logical design to handle varying multi-attribute datasets (using attribute sensitive indexers), ii) a dedicated algorithm to handle terms with missing semantic connections (which we designate as missing terms ), and iii) a mathematical model for weighting SemIndex+ entries (i.e., the graph's nodes and edges). At the query processing level, we developed: v) four alternative query processing algorithms (including a parallelized version of the core algorithm), coupled with vi) a dedicated relevance scoring measure, required in the query evaluation process in order to retrieve and rank relevant query answers.

Fig. 2. Simplified activity diagram describing our SemIndex framework

In addition, we conducted an extensive experimental study comparing SemIndex+ 's effectiveness and efficiency with various generic approaches (including inverted index search , query relaxation , query disambiguation , and query refinement ). The new prototype system and experimental results can be downloaded from the following links:

SemIndex+ prototype
New experimental results

References

Burton-Jones A.; Storey V.C.; Sugumaran V. and Purao S., A Heuristic-Based Methodology for Semantic Augmentation of User Queries on the Web. In Proceedings ot the International Conference on Conceptual Modeling (ER'03), 2003. pp. 476\96489
Carpineto C.; Romano G. and Giannini V., Improving Retrieval Feedback with Multiple Term-Ranking Function Combination. ACM Transactions on Information Systems (TOIS), 2002. 20(3):259-290
Chandramouli K., Kliegr T. , Nemrava J., and S. V., Query Refinement and user Relevance Feedback for contextualized image retrieval. 5th International Conference on Visual Information Engineering (VIE), 2008. pp. 453 - 458
Cimiano P.; Handschuh S. and Staab S., Towards the Self-Annotating Web. In Proceedings of the International World Wide Web Conference (WWW'04), 2004. pp. 462-471
Das S., e.a., Making unstructured data sparql using semantic indexing in oracle database. In Proceedings of 29th IEEE ICDE Conf., 2012. pp. 1405\961416
Florescu D., Kossmann D., and Manolescu I., Integrating Keyword Search into Xml Query Processing, . Computer Networks, 2000. 33(1-6):119\96135
Frakes W.B. and R.A. Baeza-Yates, Information retrieval: Data structures and algorithms. Prentice-Hall, 1992
Li Y., Yang H., and Jagadish H.V., Term Disambiguation in Natural Language Query for XML. In Proceedings of the International Conference on Flexible Query Answering Systems (FQAS), 2006. LNAI 4027, pp. 133\96146
Martinenghi D. and Torlone R., Taxonomy-based relaxation of query answering in relational databases. VLDB Journal, 2014. 23(5):747-769
Mishra C. and Koudas N., Interactive Query Refinement International Conference on Extending Database Technology (EDBT'09), 2009. pp. 862-873
de Lima E.F. and Pedersen J.O., Phrase Recognition and Expansion for Short, Precision biased Queries based on a Query Log. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999. pp. 145-152, Berkeley, CA
Navigli R., Word Sense Disambiguation: a Survey. ACM Computing Surveys, 2009. 41(2):1\9669
Navigli R. and Crisafulli G., Inducing Word Senses to Improve Web Search Result Clustering. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 2010. pp. 116\96126, MIT, USA
Voorhees E.M., Query Expansion Using Lexical-Semantic Relations. In Proceedings of the 17th International ACM/SIGIR Conference on Research and Development in Information Retrieval, 1994. pp. 61-69
Sinh Hoa Nguyen, Wojciech Swieboda, and G. Jaskiewicz, Semantic Evaluation of Search Result Clustering Methods. Intelligent Tools for Building a Scientific Information Platform, Studies in Computational Intelligence Volume 467, 2013. 467(393-414)
Wen H., Huang G.S., and L. Z., Clustering web search results using semantic information International Conference on Machine Learning and Cybernetics, 2009. 3(1504 - 1509)

Joe Tekli, Christian Kallas & Marc Al Assad	Richard Chbeir	Yi Luo and Kokou Yetongnon
SOE, Dept. of Electrical & Computer Eng. Lebanese American University 36 Byblos, Lebanon	UPPA Laboratory, IUT of Bayonne University of Pau & Pays Adour 64600 Anglet, France	LE2I Laboratory UMR-CNRS University of Bourgogne 21000 Dijon, France
[email protected] marc.alassad.lau.edu www.lau.edu.lb	[email protected] www.univ-pau.fr	[email protected] [email protected] www.u-bourgogne.fr

Carlos Raymundo Ibanez		Caetano Traina Jr. and Agma J.M. Traina
ICMC, Computer Science and Statistics Department University of Sao Paulo Sao Carlos, BRAZIL		Universidad Peruana de Ciencias Aplicadas Computer Science Department Lima, PERU
[email protected] [email protected] www.icmc.usp.br		[email protected] www.upc.edu.pe