Z 534: Search Assignment 2: Retrieval Algorithm and Evaluation


Task 1: Implement your first search algorithm

Based on the Lucene index, we can start to design and implement efficient retrieval algorithms. Let's start with an easy one. Please implement the following ranking function using the Lucene index we provided through Canvas (index.zip):

    score(q, doc) = SUM over t in q of [ c(t, doc) / length(doc) ] * log(1 + N / df(t))

where q is the user query, doc is the target (candidate) document in AP89, t is a query term, c(t, doc) is the count of term t in document doc, length(doc) is the length of document doc, N is the total number of documents in AP89, and df(t) is the total number of documents that contain the term t. Please use the Lucene API to get this information. From a retrieval viewpoint, c(t, doc) / length(doc) is called the normalized TF (term frequency), while log(1 + N / df(t)) is the IDF (inverse document frequency).

The following code (using the Lucene API) can be useful to help you implement the ranking function:

    // Get the preprocessed query terms
    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser("TEXT", analyzer);
    Query query = parser.parse(queryString);
    Set<Term> queryTerms = new LinkedHashSet<Term>();
    query.extractTerms(queryTerms);
    for (Term t : queryTerms) {
        System.out.println(t.text());
    }

    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(pathToIndex)));

    // Use DefaultSimilarity.decodeNormValue(...) to decode the normalized document length
    DefaultSimilarity dSimi = new DefaultSimilarity();

    // Get the segments of the index
    List<AtomicReaderContext> leafContexts = reader.getContext().reader().leaves();
    for (int i = 0; i < leafContexts.size(); i++) {
        AtomicReaderContext leafContext = leafContexts.get(i);
        int startDocNo = leafContext.docBase;
        int numberOfDoc = leafContext.reader().maxDoc();
        for (int docId = startDocNo; docId < startDocNo + numberOfDoc; docId++) {
            // Get the normalized length for each document
            float normDocLeng = dSimi.decodeNormValue(
                leafContext.reader().getNormValues("TEXT").get(docId - startDocNo));
            System.out.println("Normalized length for doc(" + docId + ") is " + normDocLeng);
        }

        // Get the term frequency of "new" within each document containing it for TEXT
        DocsEnum de = MultiFields.getTermDocsEnum(leafContext.reader(),
                MultiFields.getLiveDocs(leafContext.reader()), "TEXT", new BytesRef("new"));
        int doc;
        while ((doc = de.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
            System.out.println("\"new\" occurs " + de.freq() + " times in doc("
                    + (de.docID() + startDocNo) + ") for the field TEXT");
        }
    }

For each given query, your code should be able to:

1. Parse the query using the StandardAnalyzer (important: we need to use the SAME analyzer that we used for indexing to parse the query),
2. Calculate the relevance score for each query term, and
3. Calculate the overall relevance score score(q, doc).

The code for this task should be saved in a Java class: easySearch.java

Task 2: Test your search function with TREC topics

Next, we will need to test the search performance with the TREC standardized topic collections. You can download the query test topics from Canvas (topics.51-100). This collection provides 50 TREC topics, which can be employed as the candidate queries for search tasks. For example, one TREC topic begins:

    Tipster Topic Description
    Number: 054
    Domain: International Economics
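To make the Task 1 scoring step concrete, here is a minimal sketch of the ranking function itself, separated from the Lucene plumbing. All names (EasyScore, the maps, the sample N) are illustrative, not part of the Lucene API or the assignment handout; in your easySearch.java you would fill these values from the index as shown in the snippet above.

```java
import java.util.Map;

// Sketch of score(q, doc) = SUM over t in q of (c(t, doc) / length(doc)) * log(1 + N / df(t)).
// Inputs are assumed to have been read from the Lucene index beforehand.
public class EasyScore {

    // termFreqs: c(t, doc) for each query term t found in doc
    // docLength: length(doc), e.g. the decoded norm value
    // N: total number of documents in the collection
    // docFreqs: df(t) for each query term t
    static double score(Map<String, Integer> termFreqs,
                        float docLength,
                        int N,
                        Map<String, Integer> docFreqs) {
        double s = 0.0;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            if (df == 0) continue;                            // term not in the index
            double tf = e.getValue() / (double) docLength;    // normalized TF
            double idf = Math.log(1.0 + (double) N / df);     // IDF
            s += tf * idf;                                    // sum over query terms
        }
        return s;
    }

    public static void main(String[] args) {
        // Hypothetical counts for a two-term query; N is a placeholder,
        // not the real AP89 document count.
        Map<String, Integer> tf = Map.of("new", 3, "york", 1);
        Map<String, Integer> df = Map.of("new", 5000, "york", 800);
        System.out.println(score(tf, 10.0f, 100000, df));
    }
}
```

Note that a document missing one query term still accumulates score from the terms it does contain, which is why step 2 of the task asks for a per-term score before summing.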
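For Task 2 you will need the topic number and title text out of each topic record. The sketch below assumes the SGML-style markers used in the topics.51-100 file (<top>, <num> Number:, <title> Topic:); the embedded sample reuses the number and domain from the example above, but its title is a made-up placeholder, and the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pull (number, title) pairs out of a TREC topics file, assuming
// the SGML-style <num> Number: / <title> Topic: layout of topics.51-100.
public class TopicParser {

    static List<String[]> parseTopics(String text) {
        List<String[]> topics = new ArrayList<>();
        // Lazily match from a topic number to the next tag after its title
        Pattern p = Pattern.compile(
            "<num>\\s*Number:\\s*(\\d+).*?<title>\\s*Topic:\\s*(.*?)\\s*(?=<|$)",
            Pattern.DOTALL);
        Matcher m = p.matcher(text);
        while (m.find()) {
            topics.add(new String[] { m.group(1), m.group(2).trim() });
        }
        return topics;
    }

    public static void main(String[] args) {
        String sample =
            "<top>\n" +
            "<head> Tipster Topic Description\n" +
            "<num> Number: 054\n" +
            "<dom> Domain: International Economics\n" +
            "<title> Topic: Example Title\n" +   // hypothetical title
            "</top>\n";
        for (String[] t : parseTopics(sample)) {
            System.out.println(t[0] + " -> " + t[1]);   // e.g. "054 -> Example Title"
        }
    }
}
```

The extracted title can then be handed to the same QueryParser/StandardAnalyzer pipeline from Task 1, satisfying the requirement that queries are analyzed the same way as the index.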