Improved Count Suffix Trees for Natural Language Data

Guido  Sautter; Klemens B�hm

Call for Papers - Ongoing round of submission, notification and publication.

Home | Login or Register | Contact CSC

Home > CSC-OpenAccess Library > Manuscript Information

Full Text Available
(no registration required)

(1005.09KB)

-- CSC-OpenAccess Policy

-- Creative Commons Attribution NonCommercial 4.0 International License

>> COMPLETE LIST OF JOURNALS

EXPLORE PUBLICATIONS BY COUNTRIES


	EUROPE

	MIDDLE EAST

	ASIA

	AFRICA
.............................

	United States of America

	United Kingdom

	Canada

	Australia

	Italy

	France

	Brazil

	Germany

	Malaysia

	Turkey

	China

	Taiwan

	Japan

	Saudi Arabia

	Jordan

	Egypt

	United Arab Emirates

	India

	Nigeria

Improved Count Suffix Trees for Natural Language Data

Guido Sautter, Klemens B�hm

Pages - 1 - 27 | Revised - 15-01-2012 | Published - 21-02-2012

Published in International Journal of Data Engineering (IJDE)

Volume - 3 Issue - 1 | Publication Date - February 2012 Table of Contents

MORE INFORMATION

References | Cited By (1) | Abstracting & Indexing

KEYWORDS

Selectivity Estimation, Count Suffix Tree, Query Optimization

ABSTRACT

With more and more text data stored in databases, the problem of handling natural language query predicates becomes highly important. Closely related to query optimization for these predicates is the (sub)string estimation problem, i.e., estimating the selectivity of query terms before query execution based on small summary statistics. The Count Suffix Trees (CST) is the data structure commonly used to address this problem. While selectivity estimates based on CST tend to be good, they are computationally expensive to build and require a large amount of memory for storage. To fit CST into the data dictionary of database systems, they have to be pruned severely. Pruning techniques proposed so far are based on term (suffix) frequency or on the tree depth of nodes. In this paper, we propose new filtering and pruning techniques that reduce the building cost and the size of CST over natural-language texts. The core idea is to exploit the features of the natural language data over which the CST is built. In particular, we aim at regarding only those suffixes that are useful in a linguistic sense. We use (wellknown) IR techniques to identify them. The most important innovations are as follows: (a) We propose and use a new optimistic syllabification technique to filter out suffixes. (b) We introduce a new affix and prefix stripping procedure that is more aggressive than conventional stemming techniques, which are commonly used to reduce the size of indices. (c) We observe that misspellings and other language anomalies like foreign words incur an over-proportional growth of the CST. We apply state-of-the-art trigram techniques as well as a new syllable-based non-word detection mechanism to filter out such substrings. – Our evaluation with large English text corpora shows that our new mechanisms in combination decrease the size of a CST by up to 80%, already during construction, and at the same time increase the accuracy of selectivity estimates computed from the final CST by up to 70%.

CITED BY (1)

1	Tahir, d. n. international journal of data engineering (ijde).

ABSTRACTING & INDEXING

1	Google Scholar

2	CiteSeerX

3	Scribd

4	SlideShare

5	PdfSR

REFERENCES

D. D. Lewis. Reuters-21578 [online] Available at: http://www.daviddlewis.com/resources/testcollections/reuters21578/.

D. Graff. “The Aquaint corpus of english news text”. Linguistic Data Consortium, Philadelphia, 2002.

D. W. Cummings. “American English spelling: an informal description”. Johns Hopkins University Press, 1988.

E. M. McCreight. ”A space-economical suffix tree construction algorithm”. J. Assoc. Comput. Mach., 23(2): 262–272, 1976.

E. M. Zamora, J. Pollock, and A. Zamora. “The use of trigram analysis for spelling error detection”. Information of Processing and Management, 17: 305–316, 1981.

E. Ukkonen. “On-line construction of suffix trees”. Algorithmica, 14(3): 249–260, 1995.

F. M. Lian. “Word hy-phen-a-tion by com-put-er”. PhD thesis, Stanford University, Stanford, August 1983.

G. Sautter, C. Abba, K. Böhm. “Improved Count Suffix Trees for Natural Language Data”. In Proceedings of IDEAS 2008, Coimbra, Portugal, 2008

H. Jagadish, O. Kapitskaia, and D. Srivastava. “One-dimensional and multi-dimensional substring selectivity estimation”. The International Journal on Very Large Data Bases, 9(3): 214–230, 2000.

J. B. Lovins. “Development of a stemming algorithm”. Mechanical Translation and Computational Linguistics, 11:22–31, 1968.

J. Bae and S. Lee. “Substring count estimation in extremely long strings”. IEICE – Transactions in Information Systems, E89-D(3):1148–1156, 2006.

K. Kukich. “Techniques for automatically correcting words in text”. ACM Computer Surveys, 24:379–439, 1992.

M. F. Porter. ”An algorithm for suffix stripping”. Program, 14(3):130–137, 1980.

M. Lennon, D. Pierce, B. Tarry, and P. Willett. “An evaluation of some conflation algorithms for information retrieval”. Journal of Information Science, 3(177–183), 1981.

P. Krishnan, J. S. Vitter, and B. Iyer. “Estimating alphanumeric selectivity in the presence of wildcards”. In ACM SIGMOD International Conference on Management of Data, pages 12–13. ACM, 1996.

P. Weiner. “Linear pattern matching algorithms”. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pages 1–11, 1973.

Personal communication with Torsten Grabs, Microsoft SQL Server development team.

R. Baeza-Yates and B. Ribeiro-Neto. “Modern information retrieval”. Addison-Wesley Longman, 1. print. edition, 1999.

R. Giegerich, S. Kurtz, J. Stoye. “Efficient Implementation of Lazy Suffix Trees”. Software: Practice and Experience, Volume 33, No 11, John Wiley & Sons Ltd., 2003

R. Krovetz. “Viewing morphology as an inference process”. In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 191–203, 1993.

S. Bressan and R. Irawan. “Morphologic non-word error detection”. In Proceedings of the 15th International Workshop on Database and Expert Systems Applications (DEXA ’04), pages 31–35, 2004.

S. Chaudhuri, V. Ganti, and L. Gravano. “Selectivity estimation for string predicates: Overcoming the underestimation problem”. In Proceedings of ICDE 2004, Boston, MA, USA, 2004.

Y. Tian, S. Tata, R. A. Hankins, and J. M. Patel. “Practical methods for constructing suffix trees”. The International Journal on Very Large Data Bases, 14(3): 281–299, 2005.

Z. Chen, F. Korn, N. Koudas, S. Muthukrishnan. “Selectivity Estimation for Boolean Queries”. In Proceedings of PODS 2000, Dallas, TX, USA, 2000

MANUSCRIPT AUTHORS

Dr. Guido Sautter

KIT - Germany

sautter@ipd.uka.de

Professor Klemens B�hm

Karlsruhe Institute of Technology - Germany

CREATE AUTHOR ACCOUNT

LAUNCH YOUR SPECIAL ISSUE

View all special issues >>

PUBLICATION VIDEOS