Objectives:

o Experience using Solr

o Investigating ranking strategies

Preparation

In the previous exercise you used crawler4j to crawl a portion of the USC website. As a result of this crawl you should have downloaded and saved HTML/PDF/DOC files. In this exercise you will index those pages using Solr and then modify Solr to compare different ranking strategies.

Solr can be installed on Unix or Windows machines. However, it is much easier to run Solr on a Unix computer. Therefore, we have provided instructions for installing Ubuntu, a Unix-like (Linux) operating system, on a Windows computer. For details see

http://www-scf.usc.edu/~csci572/Exercises/UbuntuVirtualBoxFinal.pdf

Once Ubuntu is successfully installed, or if you are using a Mac or some other Unix computer, you can then follow the instructions for installing Solr directly, which can be found here

http://www-scf.usc.edu/~csci572/2016Spring/hw3/SolrInstallation.pdf

The above instructions are for solr-5.3.1.zip. You can either download the file from

http://lucene.apache.org/solr/downloads.html/

or from the class website at

http://www-scf.usc.edu/~csci572/software/solr-5.3.1-src.tgz

Once Solr is installed you need to have it index the web pages that you saved. Instructions for doing this can be found here

http://www-scf.usc.edu/~csci572/2016Spring/hw3/IndexingwithTIKAV3.pdf

Solr provides a simple user interface that you can use to explore your indexed web pages.

Description of the Exercise

Step 1

Now that your setup is complete you need to have access to a web server that can deliver web pages and run scripts. Using this web server you will create a web page with a text box that a user can load and use to enter a query. The user’s query will be processed by a program at your web server which formats the query and sends it to Solr. Solr will process the query and return some results in JSON format. A program on your web server will re-format the results and present them to the user as any search engine would do.

Below is a rough outline for how you could structure your solution to the exercise. All three elements (web browser, web server, and Solr) would be located on your laptop. Your web server might be the Apache web server coupled with the PHP programming language. An alternative solution would be to use node.js as the server/programming component. In the case of node.js, the programming language is JavaScript. Whatever you use, your program would send the query web page to the user, and then send the user’s query to Solr, which produces the results. Solr returns the results to the same web server, which converts them into a nice-looking web page that is eventually returned to the user.

The Solr server supports clients written in several different languages. Clients use requests to ask Solr to do things like perform queries or index documents. Client applications can reach Solr by creating HTTP requests and parsing the HTTP responses. Client APIs encapsulate much of the work of sending requests and parsing responses, which makes it much easier to write client applications.

Clients use Solr’s five fundamental operations to work with Solr. The operations are query, index, delete, commit, and optimize. Queries are executed by creating a URL that contains all the query parameters. Solr examines the request URL, performs the query, and returns the results. The other operations are similar, although in certain cases the HTTP request is a POST operation and contains information beyond whatever is included in the request URL. An index operation, for example, may contain a document in the body of the request.
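The difference between a query and an index operation can be sketched in Python using only the standard library. The host, port, core name `core_name`, and the sample document are placeholders rather than values from this handout, and the requests are only constructed here, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed base URL of a local Solr core; adjust host, port, and core name.
SOLR_CORE_URL = "http://localhost:8983/solr/core_name"


def build_query_request(query, start=0, rows=10):
    """A query is a GET whose URL carries every query parameter."""
    params = urlencode({"q": query, "start": start, "rows": rows, "wt": "json"})
    return Request(f"{SOLR_CORE_URL}/select?{params}")


def build_index_request(document):
    """An index operation is a POST that carries the document in its body."""
    body = json.dumps([document]).encode("utf-8")
    return Request(
        f"{SOLR_CORE_URL}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
    )


query_request = build_query_request("computer science")
index_request = build_index_request({"id": "example.html", "title": "Example page"})

print(query_request.get_method(), query_request.full_url)
print(index_request.get_method(), index_request.full_url)
```

Passing either request to `urllib.request.urlopen` would send it to a running Solr instance.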

There are several client APIs available for Solr; see https://wiki.apache.org/solr/IntegratingSolr. As an example, here we explain how to create a PHP client that accepts input from the user in an HTML form and sends the request to the Solr server. After the Solr server processes the query, it returns the results, which are parsed by the PHP program and formatted for display.

We are using the solr-php-client, which is available here: https://github.com/PTCInc/solr-php-client. Clone this repository on your computer in the folder where you are developing the user interface (git clone https://github.com/PTCInc/solr-php-client.git). Below is the sample code from the wiki of this repository.

<?php

// make sure browsers see this page as utf-8 encoded HTML
header('Content-Type: text/html; charset=utf-8');

$limit = 10;
$query = isset($_REQUEST['q']) ? $_REQUEST['q'] : false;
$results = false;

if ($query)
{
  // The Apache Solr Client library should be on the include path
  // which is usually most easily accomplished by placing in the
  // same directory as this script (. or current directory is a default
  // php include path entry in the php.ini)
  require_once('Apache/Solr/Service.php');

  // create a new solr service instance - host, port, and core name
  // path (all defaults in this example)
  $solr = new Apache_Solr_Service('localhost', 8983, '/solr/core_name/');

  // if magic quotes is enabled then stripslashes will be needed
  // (get_magic_quotes_gpc() was removed in PHP 8, so check that it exists)
  if (function_exists('get_magic_quotes_gpc') && get_magic_quotes_gpc() == 1)
  {
    $query = stripslashes($query);
  }

  // in production code you'll always want to use a try/catch for any
  // possible exceptions emitted by searching (i.e. connection
  // problems or a query parsing error)
  try
  {
    $results = $solr->search($query, 0, $limit);
  }
  catch (Exception $e)
  {
    // in production you'd probably log or email this error to an admin
    // and then show a special message to the user but for this example
    // we're going to show the full exception
    die("<html><head><title>SEARCH EXCEPTION</title></head><body><pre>{$e->__toString()}</pre></body></html>");
  }
}

?>
<html>
  <head>
    <title>PHP Solr Client Example</title>
  </head>
  <body>
    <form accept-charset="utf-8" method="get">
      <label for="q">Search:</label>
      <input id="q" name="q" type="text" value="<?php echo htmlspecialchars($query, ENT_QUOTES, 'utf-8'); ?>"/>
      <input type="submit"/>
    </form>
<?php

// display results
if ($results)
{
  $total = (int) $results->response->numFound;
  $start = min(1, $total);
  $end = min($limit, $total);
?>
    <div>Results <?php echo $start; ?> - <?php echo $end; ?> of <?php echo $total; ?>:</div>
    <ol>
<?php
  // iterate result documents
  foreach ($results->response->docs as $doc)
  {
?>
      <li>
        <table style="border: 1px solid black; text-align: left">
<?php
    // iterate document fields / values
    foreach ($doc as $field => $value)
    {
      // multi-valued fields arrive as arrays, so join them for display
      $text = is_array($value) ? implode(', ', $value) : $value;
?>
          <tr>
            <th><?php echo htmlspecialchars($field, ENT_NOQUOTES, 'utf-8'); ?></th>
            <td><?php echo htmlspecialchars($text, ENT_NOQUOTES, 'utf-8'); ?></td>
          </tr>
<?php
    }
?>
        </table>
      </li>
<?php
  }
?>
    </ol>
<?php
}
?>
  </body>
</html>

[Figure: rough outline of the solution. User’s web browser (query box) <-> your web server (format query; send to Solr; format results) <-> your Solr (process query; return results).]

In order to provide additional parameters to the Solr server, create an array of these parameters as

shown below:

$additionalParameters = array(
  'fq' => 'a filtering query',
  'facet' => 'true',
  // notice I use an array for a multi-valued parameter
  'facet.field' => array(
    'field_1',
    'field_2'
  )
);

$results = $solr->search($query, $start, $rows, $additionalParameters);

Step 2

In this step you are going to run a series of queries on the Solr index. The queries you are going to use are the same ones you used for Exercise #1. This time the comparison is between two different ranking algorithms.

Solr uses Lucene to facilitate ranking. Lucene uses a combination of the Vector Space Model and the Boolean model to determine how relevant a given document is to a user’s query. The Vector Space Model is based on term frequency. The Boolean model is first used to narrow down the documents that need to be scored, based on the use of Boolean logic in the query specification.

Solr permits us to change the ranking algorithm. This gives us an opportunity to use the PageRank algorithm and see how the results differ. There are several ways to manipulate the ranking of a document in Solr. Here, we explain the method which uses a field that refers to an External File that stores the PageRank scores. This external file basically contains a mapping from a key field to the field value.

In order to create this External File you must first carry out the PageRank process on your set of collected web pages. You can use the web graph structure that you created in your assignment #2. There is no need to re-compute the edge relationships (see pagerankdata.csv).

There are several libraries available to help you compute the PageRank given a graph. One of them is the NetworkX library:

http://networkx.github.io/documentation/networkx1.10/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html

This PageRank function takes a NetworkX graph (http://networkx.github.io/documentation/networkx1.10/reference/classes.digraph.html#networkx.DiGraph) as input and returns a dictionary of graph nodes with corresponding PageRank scores. These are the steps you would need to perform:

i. Compute the incoming and outgoing links to the web pages, and create a NetworkX graph

ii. Compute the Page Rank for this graph and store this in a file in the format

<document_id>=<page_rank_score>

Make sure the document id is the same as the one which is present in your index. In your Solr index, you would typically have the filename as the id, for example /home/solr-5.3.1/CrawlData/E0-001-086391282-0.html.

Once you have computed the PageRank scores, you need to place this file in the data folder of the core you have defined. This can be found in the path solr-5.3.1/server/solr/core_name/. The name of the file should be external_fieldname or external_fieldname.*. For this example the file could be named external_pageRankFile or external_pageRankFile.txt.
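The PageRank step and the external file format can be sketched as follows. This is a minimal illustration, not the handout's method: it uses a plain power iteration in place of the NetworkX function so that it has no dependencies, the edge list is a hypothetical stand-in for your own web graph, and the file is written to a temporary directory rather than to the core's data folder.

```python
import os
import tempfile


def pagerank(edges, damping=0.85, iterations=200, tolerance=1.0e-10):
    """Power-iteration PageRank over a list of (source, target) links."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(ranks[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {node: base for node in nodes}
        for source in nodes:
            targets = out_links[source]
            for target in targets:
                new_ranks[target] += damping * ranks[source] / len(targets)
        change = sum(abs(new_ranks[node] - ranks[node]) for node in nodes)
        ranks = new_ranks
        if change < tolerance:
            break
    return ranks


def write_external_file(ranks, path):
    """Write one <document_id>=<page_rank_score> line per document."""
    with open(path, "w", encoding="utf-8") as handle:
        for doc_id in sorted(ranks):
            handle.write(f"{doc_id}={ranks[doc_id]:.10f}\n")


# Hypothetical link graph: (page containing the link, page linked to).
edges = [
    ("a.html", "b.html"),
    ("a.html", "c.html"),
    ("b.html", "c.html"),
    ("c.html", "a.html"),
]
ranks = pagerank(edges)

output_path = os.path.join(tempfile.mkdtemp(), "external_pageRankFile.txt")
write_external_file(ranks, output_path)
print("wrote", output_path)
```

In a real run the node names would be the ids stored in your Solr index, and the resulting file would be copied into the core's data folder.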

The next step is to add the field in the schema.xml which refers to this score.

We are defining a field “pageRankFile” which is of the type “external”. The field name should be the suffix after “external_” you provided for your file in the previous step. The keyField attribute defines the key that will be defined in the external file. It is usually the unique key for the index. A defVal defines a default value that will be used if there is no entry in the external file for a particular document.

The valType attribute specifies the actual type of values that will be found in the file. The type specified must be a float field type, so valid values for this attribute are pfloat, float or tfloat. This attribute can be omitted.
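A definition along these lines might look like the fragment embedded in the short Python check below, which only verifies that the fragment is well-formed and internally consistent. The class name `solr.ExternalFileField` comes from Solr itself. The values `keyField="id"`, `defVal="0"`, `valType="float"`, and the `stored`/`indexed` settings are assumptions to adapt to your own schema.

```python
import xml.etree.ElementTree as ET

# Assumed schema.xml entries for the external PageRank field. keyField="id"
# and defVal="0" are illustrative choices, not values taken from the handout.
SCHEMA_FRAGMENT = """
<fieldType name="external" keyField="id" defVal="0"
           class="solr.ExternalFileField" valType="float"/>
<field name="pageRankFile" type="external" stored="false" indexed="false"/>
"""

# Wrap the fragment in a root element so that it can be parsed and checked.
root = ET.fromstring(f"<schema>{SCHEMA_FRAGMENT}</schema>")
field_type = root.find("fieldType")
field = root.find("field")

# The field must use the external type, and valType must name a float type.
assert field.get("type") == field_type.get("name")
assert field_type.get("valType") in {"pfloat", "float", "tfloat"}
print(field.get("name"), "->", field_type.get("class"))
```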

Once the field has been defined, we need to make sure that when the index is reloaded, it is able to access the rank file. In order to do that, there are some modifications required in the solrconfig.xml file. We need to define event listeners to reload the external file when either the first searcher is loaded or a new searcher is started. The searcher is basically a Lucene class which enables searching across the Lucene index. Define these listeners within the <query> element in the solrconfig.xml file.
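As a sketch, the listeners could look like the fragment embedded below. It assumes Solr's `org.apache.solr.schema.ExternalFileFieldReloader` class is used for both events, and the surrounding Python only checks that the fragment parses and lists the events it registers.

```python
import xml.etree.ElementTree as ET

# Assumed listener entries, placed inside the <query> element of solrconfig.xml.
SOLRCONFIG_FRAGMENT = """
<query>
  <listener event="newSearcher"
            class="org.apache.solr.schema.ExternalFileFieldReloader"/>
  <listener event="firstSearcher"
            class="org.apache.solr.schema.ExternalFileFieldReloader"/>
</query>
"""

query_element = ET.fromstring(SOLRCONFIG_FRAGMENT)
events = sorted(
    listener.get("event") for listener in query_element.findall("listener")
)
print(events)
```

In a real solrconfig.xml the <query> element already exists, so only the two listener lines would be added to it.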

Now reload the index, by going to the Solr Dashboard UI->Core Admin and clicking on the “Reload”

button.

Now, you can run the queries and compare the results with and without PageRank scores. In the Solr query view, there is a “sort” text field, where you can specify the field on which the results have to be sorted along with the order (ascending, descending).
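The same sort can be requested from a program by adding a `sort` parameter to the query URL. The sketch below assumes a local core named `core_name` and the field name pageRankFile used in this handout. It only builds the two URLs, one for each ranking.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed location of the select handler; adjust host, port, and core name.
SOLR_SELECT_URL = "http://localhost:8983/solr/core_name/select"


def build_select_url(query, sort=None, rows=10):
    """Build a select URL; without sort, Solr uses its default relevance ranking."""
    params = {"q": query, "rows": rows, "wt": "json", "fl": "id"}
    if sort:
        params["sort"] = sort
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


default_url = build_select_url("*:*")
pagerank_url = build_select_url("*:*", sort="pageRankFile desc")

print(default_url)
print(pagerank_url)
print(parse_qs(urlsplit(pagerank_url).query)["sort"])
```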

For example, in the screenshot below I have created an external file and added PageRank scores for a few documents in the index.

“/home/solr-5.3.1/CrawlData/E0-001-086391282-0.html” is the id of a file that has been indexed. The values indicate the PageRank scores.

The following screenshot shows the results produced by Solr’s internal ranking algorithm, which is based on tf-idf, for the default query “*:*”. For the sake of clarity, only the ids have been displayed.

The next screenshot is the result of the default query again, but with the ranking based on the PageRank scores that were defined in the external file.

Comparing the two results we can see that the order of the results has changed. The file ending with “E0-001-079680609-8.html” is the first result in the PageRank scoring results, as we had defined a high score (5.5) for it in the external file.

Once that is done you should run the same set of queries from the previous assignment, and save the results.

Step 3

Returning the results:

The results that you return should include the total number of documents found at the top of the page. Each result should include a title, author, date created, size in KB, and a link that you can click on to be re-directed to the corresponding document.
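One way to turn Solr's JSON response into such a page is sketched below. The field names `title`, `author`, `creation_date`, and `stream_size` are hypothetical; the fields actually present depend on how your documents were indexed, so check a sample response first.

```python
import html
import json


def first_value(doc, field, default="N/A"):
    """Return a field value, unwrapping Solr's multi-valued (list) fields."""
    value = doc.get(field, default)
    if isinstance(value, list):
        value = value[0] if value else default
    return value


def render_results(response_text):
    """Turn a Solr JSON response into a small HTML fragment."""
    response = json.loads(response_text)["response"]
    lines = [f"<p>Total documents found: {response['numFound']}</p>", "<ol>"]
    for doc in response["docs"]:
        title = html.escape(str(first_value(doc, "title")))
        author = html.escape(str(first_value(doc, "author")))
        created = html.escape(str(first_value(doc, "creation_date")))
        size = first_value(doc, "stream_size", None)
        size_kb = f"{int(size) / 1024:.1f} KB" if size is not None else "N/A"
        link = html.escape(str(doc["id"]), quote=True)
        lines.append(
            f'<li><a href="{link}">{title}</a> | {author} | {created} | {size_kb}</li>'
        )
    lines.append("</ol>")
    return "\n".join(lines)


# Hypothetical response containing a single document.
sample = json.dumps({
    "response": {
        "numFound": 1,
        "docs": [{
            "id": "page1.html",
            "title": ["Home <USC>"],
            "author": "Jane Doe",
            "creation_date": "2016-03-01T00:00:00Z",
            "stream_size": [2048],
        }],
    }
})
rendered = render_results(sample)
print(rendered)
```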

Comparing the results:

Compare the results of the two ranking algorithms. You can do this by returning to your assignment #1 and using the same queries that you used there. For each query, collect the results given by both ranking algorithms and compare them in a document.
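The handout does not prescribe a particular comparison measure. As one simple possibility, the overlap between the two result lists for a query can be computed as below. The ids are hypothetical.

```python
def overlap(default_ids, pagerank_ids):
    """Return the ids present in both result lists, in default-ranking order."""
    pagerank_set = set(pagerank_ids)
    return [doc_id for doc_id in default_ids if doc_id in pagerank_set]


# Hypothetical top results for one query under each ranking.
default_top = ["a.html", "b.html", "c.html", "d.html"]
pagerank_top = ["c.html", "e.html", "a.html", "f.html"]

shared = overlap(default_top, pagerank_top)
print(f"{len(shared)} of {len(default_top)} results shared:", shared)
```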

What needs to be submitted

There should be a report describing what you have done and analyzing the results of the two ranking algorithms.

Using the submit command you should provide the following files:

– the external_pageRankFile.txt containing the PageRank scores that was used as input to the PageRank-based ranking

– all source code that you wrote, including the code for creating the web page that accepts the query, the program that sends the query to Solr and the program that processes the Solr results

CSCI 572 Homework 3: Comparing Search Engine Ranking Algorithms
