Efficient multifaceted search in information retrieval systems

التفاصيل البيبلوغرافية
العنوان: Efficient multifaceted search in information retrieval systems
Patent Number: 8,032,532
تاريخ النشر: October 04, 2011
Appl. No: 12/124272
Application Filed: May 21, 2008
مستخلص: A method and system for querying multifaceted information. An inverted index is constructed to include unique indexed tokens associated with posting lists of one or more documents. An indexed token is either a facet token included in a document as an annotation or a path prefix of the facet token. The annotation indicates a path within a tree structure representing a facet that includes the document. The tree structure includes nodes representing categories of documents. A query is received that includes constraints on documents. The constraints are associated with indexed tokens and corresponding posting lists. An execution of the query includes identifying the corresponding posting lists by utilizing the constraints and the inverted index and intersecting the posting lists to obtain a query result.
Inventors: Broder, Andrei Z. (Bronx, NY, US); Eiron, Nadav (San Jose, CA, US); Fontoura, Felipe Marcus (San Jose, CA, US); Lempel, Ronny (Haifa, IL); Li, Ning (Raleigh, NC, US); McPherson, Jr., John Ai (San Jose, CA, US); Neumann, Andreas (Muelheim an der Ruhr, DE); Ofek-Koifman, Shila (Haifa, IL); Qi, Runping (Cupertino, CA, US); Shekita, Eugene J. (San Jose, CA, US)
Assignees: International Business Machines Corporation (Armonk, NY, US)
Claim: 1. A computer-implemented method of querying multifaceted information in an information retrieval system, comprising: constructing, by said information retrieval (IR) system, an inverted index having a plurality of unique indexed tokens associated with a plurality of posting lists in a one-to-one correspondence, each posting list including one or more documents of a plurality of documents, wherein an indexed token of said plurality of unique indexed tokens is one of a facet token included as an annotation in a document of said plurality of documents and a path prefix of said facet token, wherein said annotation indicates a path within a tree structure representing a facet that includes said document, said tree structure including a plurality of nodes representing a category and one or more sub-categories that categorize said document; receiving, by said IR system, a query that includes a plurality of constraints on said plurality of documents, said plurality of constraints being associated with multiple indexed tokens of said plurality of unique indexed tokens and multiple posting lists corresponding to said multiple indexed tokens; and executing said query by said IR system, said executing including: identifying said multiple posting lists via a utilization of said plurality of constraints and said inverted index, and intersecting said multiple posting lists to obtain a result of said query.
Claim: 2. The method of claim 1 , wherein said plurality of constraints includes one or more facet constraints and one or more free-text constraints, and wherein said identifying said multiple posting lists comprises: identifying, via said inverted index, a first set of one or more indexed tokens associated with said one or more facet constraints in a one-to-one correspondence and a second set of one or more indexed tokens associated with said one or more free-text constraints in a one-to-one correspondence, said first set and said second set of one or more indexed tokens included in said plurality of unique indexed tokens; and identifying, via said inverted index, a first group of one or more posting lists and a second group of one or more posting lists, said one or more posting lists of said first group associated with said one or more indexed tokens of said first set in a one-to-one correspondence and said one or more posting lists of said second group associated with said one or more indexed tokens of said second set in a one-to-one correspondence.
Claim: 3. The method of claim 1 , wherein said constructing said inverted index comprises: generating a full path token and a full path token posting list associated therewith by said inverted index, said full path token posting list including a plurality of identifiers representing said plurality of documents, wherein an identifier of said plurality of identifiers represents said document and includes a payload value, said payload value identifying a full path of said document in said tree structure, and said payload value including a set of full path indicators provided by a scheme that uniquely labels each sibling node of said tree structure.
Claim: 4. The method of claim 3 , further comprising: building a hierarchy of a plurality of counters, each counter being associated with a node of said plurality of nodes of said tree structure, wherein a counter of said plurality of counters is indexed by said set of full path indicators; and updating a value stored in said counter, said value indicating a count of one or more documents of said plurality of documents, said one or more documents categorized by a sub-category of a category or sub-category indicated by a constraint of said plurality of constraints.
Claim: 5. The method of claim 3 , wherein said scheme is a Dewey labeling scheme.
Claim: 6. The method of claim 1 , further comprising: receiving, by said IR system, an arithmetic expression included in said query, said arithmetic expression associated with at least one numeric field included in one or more documents of said plurality of documents; and computing an evaluation of said arithmetic expression, said computing performed per sub-category of a category indicated by a constraint of said plurality of constraints.
Claim: 7. The method of claim 6 , wherein said arithmetic expression includes at least one of an aggregate function and a basic formula, wherein said aggregate function includes at least one of a sum, a product, a maxima, a minima, and an average, and wherein said basic formula includes any combination of one or more numeric fields, one or more numeric constants, one or more arithmetic operators, and parentheses that indicate a parenthetical expression.
Claim: 8. The method of claim 1 , wherein said constructing said inverted index comprises: designating, for each document of said plurality of documents, a set of one or more indexed tokens of said plurality of unique indexed tokens as a set of one or more exact tokens, each exact token indicating a final sub-category categorizing one or more documents of said plurality of documents, wherein each document categorized by said final sub-category is not categorized by a child sub-category of said final sub-category; and identifying, via said query utilizing said set of one or more exact tokens, said document as residing in said final sub-category, but not in any child sub-category of said final sub-category.
Claim: 9. A computer system comprising: a central processing unit (CPU); a memory coupled to said CPU; a computer-readable, tangible storage device coupled to said CPU, said storage device containing instructions that are carried out by said CPU via said memory to implement a method of querying multifaceted information, said method comprising: constructing an inverted index having a plurality of unique indexed tokens associated with a plurality of posting lists in a one-to-one correspondence, each posting list including one or more documents of a plurality of documents, wherein an indexed token of said plurality of unique indexed tokens is one of a facet token included as an annotation in a document of said plurality of documents and a path prefix of said facet token, wherein said annotation indicates a path within a tree structure representing a facet that includes said document, said tree structure including a plurality of nodes representing a category and one or more sub-categories that categorize said document; receiving a query that includes a plurality of constraints on said plurality of documents, said plurality of constraints being associated with multiple indexed tokens of said plurality of unique indexed tokens and multiple posting lists corresponding to said multiple indexed tokens; and executing said query by identifying said multiple posting lists via a utilization of said plurality of constraints and said inverted index, and intersecting said multiple posting lists to obtain a result of said query.
Claim: 10. The system of claim 9 , wherein said plurality of constraints includes one or more facet constraints and one or more free-text constraints, and wherein said identifying said multiple posting lists comprises: identifying, via said inverted index, a first set of one or more indexed tokens associated with said one or more facet constraints in a one-to-one correspondence and a second set of one or more indexed tokens associated with said one or more free-text constraints in a one-to-one correspondence, said first set and said second set of one or more indexed tokens included in said plurality of unique indexed tokens; and identifying, via said inverted index, a first group of one or more posting lists and a second group of one or more posting lists, said one or more posting lists of said first group associated with said one or more indexed tokens of said first set in a one-to-one correspondence and said one or more posting lists of said second group associated with said one or more indexed tokens of said second set in a one-to-one correspondence.
Claim: 11. The system of claim 9 , wherein said constructing said inverted index comprises: generating a full path token and a full path token posting list associated therewith by said inverted index, said full path token posting list including a plurality of identifiers representing said plurality of documents, wherein an identifier of said plurality of identifiers represents said document and includes a payload value, said payload value identifying a full path of said document in said tree structure, and said payload value including a set of full path indicators provided by a scheme that uniquely labels each sibling node of said tree structure.
Claim: 12. The system of claim 11 , further comprising: building a hierarchy of a plurality of counters, each counter being associated with a node of said plurality of nodes of said tree structure, wherein a counter of said plurality of counters is indexed by said set of full path indicators; and updating a value stored in said counter, said value indicating a count of one or more documents of said plurality of documents, said one or more documents categorized by a sub-category of a category or sub-category indicated by a constraint of said plurality of constraints.
Claim: 13. The system of claim 11 , wherein said scheme is a Dewey labeling scheme.
Claim: 14. The system of claim 9 , further comprising: receiving, by said IR system, an arithmetic expression included in said query, said arithmetic expression associated with at least one numeric field included in one or more documents of said plurality of documents; and computing an evaluation of said arithmetic expression, said computing performed per sub-category of a category indicated by a constraint of said plurality of constraints.
Claim: 15. The system of claim 14 , wherein said arithmetic expression includes at least one of an aggregate function and a basic formula, wherein said aggregate function includes at least one of a sum, a product, a maxima, a minima, and an average, and wherein said basic formula includes any combination of one or more numeric fields, one or more numeric constants, one or more arithmetic operators, and one or more sets of parentheses that indicate parenthetical expressions.
Claim: 16. The system of claim 9 , wherein said constructing said inverted index comprises: designating, for each document of said plurality of documents, a set of one or more indexed tokens of said plurality of unique indexed tokens as a set of one or more exact tokens, each exact token indicating a final sub-category categorizing one or more documents of said plurality of documents, wherein each document categorized by said final sub-category is not categorized by a child sub-category of said final sub-category; and identifying, via said query utilizing said set of one or more exact tokens, said document as residing in said final sub-category, but not in any child sub-category of said final sub-category.
Claim: 17. A computer program product comprising a computer-readable, tangible storage device having a computer-readable program code stored therein, said computer-readable program code containing instructions that are carried out by a processor of a computer system to implement a method of querying multifaceted information in an information retrieval system, said method comprising: constructing, by said information retrieval (IR) system, an inverted index having a plurality of unique indexed tokens associated with a plurality of posting lists in a one-to-one correspondence, each posting list including one or more documents of a plurality of documents, wherein an indexed token of said plurality of unique indexed tokens is one of a facet token included as an annotation in a document of said plurality of documents and a path prefix of said facet token, wherein said annotation indicates a path within a tree structure representing a facet that includes said document, said tree structure including a plurality of nodes representing a category and one or more sub-categories that categorize said document; receiving, by said IR system, a query that includes a plurality of constraints on said plurality of documents, said plurality of constraints being associated with multiple indexed tokens of said plurality of unique indexed tokens and multiple posting lists corresponding to said multiple indexed tokens; and executing said query by said IR system by: identifying said multiple posting lists via a utilization of said plurality of constraints and said inverted index, and intersecting said multiple posting lists to obtain a result of said query.
Claim: 18. The program product of claim 17 , wherein said plurality of constraints includes one or more facet constraints and one or more free-text constraints, and wherein said identifying said multiple posting lists comprises: identifying, via said inverted index, a first set of one or more indexed tokens associated with said one or more facet constraints in a one-to-one correspondence and a second set of one or more indexed tokens associated with said one or more free-text constraints in a one-to-one correspondence, said first set and said second set of one or more indexed tokens included in said plurality of unique indexed tokens; and identifying, via said inverted index, a first group of one or more posting lists and a second group of one or more posting lists, said one or more posting lists of said first group associated with said one or more indexed tokens of said first set in a one-to-one correspondence and said one or more posting lists of said second group associated with said one or more indexed tokens of said second set in a one-to-one correspondence.
Claim: 19. The program product of claim 17 , wherein said constructing said inverted index comprises: generating a full path token and a full path token posting list associated therewith by said inverted index, said full path token posting list including a plurality of identifiers representing said plurality of documents, wherein an identifier of said plurality of identifiers represents said document and includes a payload value, said payload value identifying a full path of said document in said tree structure, and said payload value including a set of full path indicators provided by a scheme that uniquely labels each sibling node of said tree structure.
Claim: 20. The program product of claim 19 , wherein said method further comprises: building a hierarchy of a plurality of counters, each counter being associated with a node of said plurality of nodes of said tree structure, wherein a counter of said plurality of counters is indexed by said set of full path indicators; and updating a value stored in said counter, said value indicating a count of one or more documents of said plurality of documents, said one or more documents categorized by a sub-category of a category or sub-category indicated by a constraint of said plurality of constraints.
Current U.S. Class: 707/742
Patent References Cited: 5704060 December 1997 Del Monte
5787421 July 1998 Nomiyama
6212494 April 2001 Boguraev
6236985 May 2001 Aggarwal et al.
6243713 June 2001 Nelson et al.
6381354 April 2002 Mennie et al.
6490579 December 2002 Gao et al.
6519586 February 2003 Anick et al.
6665666 December 2003 Brown et al.
6745206 June 2004 Mandler et al.
6748387 June 2004 Garber et al.
6925608 August 2005 Neale et al.
6963871 November 2005 Hermansen et al.
7472347 December 2008 Cooper et al.
7499915 March 2009 Chandrasekar et al.
7836050 November 2010 Jing et al.
2002/0032672 March 2002 Keith, Jr.
2003/0018622 January 2003 Chau
2004/0167889 August 2004 Chang et al.
2004/0267700 December 2004 Dumais et al.
2005/0108200 May 2005 Meik et al.
2006/0112079 May 2006 Holt et al.
2006/0167930 July 2006 Witwer et al.
2006/0282411 December 2006 Fagin et al.
2006/0288039 December 2006 Acevedo-Aviles et al.
2007/0050753 March 2007 Holt et al.
2007/0055680 March 2007 Statchuk
2007/0106658 May 2007 Ferrari et al.
2007/0208738 September 2007 Morgan
2008/0010250 January 2008 Fontoura et al.
1716252A January 2006
2003091419 March 2003

Other References: Notice of Allowance (Mail Date Sep. 22, 2008) for U.S. Appl. No. 11/564,915, filed Nov. 30, 2006; Confirmation No. 3338. cited by other
Search Tools.Com (from Internet Jul. 13, 2006); XML and Search; 6 pages. cited by other
Assistant Examiner: Uddin, Mohammed R
Primary Examiner: Cottingham, John R.
Attorney, Agent or Firm: Schmeiser, Olsen & Watts
رقم الانضمام: edspgr.08032532
قاعدة البيانات: USPTO Patent Grants