Description
Objectives:
o Experience using Solr
o Investigating ranking strategies
Preparation
In the previous exercise you used crawler4j to crawl a portion of the USC website. As a result of this
crawl you should have downloaded and saved HTML/PDF/DOC files. In this exercise you will index those
pages using Solr and then modify Solr to compare different ranking strategies.
Solr can be installed on Unix or Windows machines. However, it is much easier to run Solr on a Unix
computer. Therefore, we have provided instructions for installing an implementation of Unix, called
Ubuntu, on a Windows computer. For details see
http://www-scf.usc.edu/~csci572/Exercises/UbuntuVirtualBoxFinal.pdf
Once Ubuntu is successfully installed, or if you are using a Mac or some other Unix computer, you can
then follow the instructions for installing Solr directly, which can be found here
http://www-scf.usc.edu/~csci572/2016Spring/hw3/SolrInstallation.pdf
The above instructions are for solr-5.3.1.zip. You can either download the file from
http://lucene.apache.org/solr/downloads.html/
or from the class website at
http://www-scf.usc.edu/~csci572/software/solr-5.3.1-src.tgz
Once Solr is installed you need to have it index the web pages that you saved. Instructions for doing this
can be found here
http://www-scf.usc.edu/~csci572/2016Spring/hw3/IndexingwithTIKAV3.pdf
Solr provides a simple user interface that you can use to explore your indexed web pages.
Description of the Exercise
Step 1
Now that your setup is complete you need to have access to a web server that can deliver web pages
and run scripts. Using this web server you will create a web page with a text box, which a user can
retrieve and use to enter a query. The user’s query will be processed by a program at your web server
which formats the query and sends it to Solr. Solr will process the query and return some results in JSON
format. A program on your web server will re-format the results and present them to the user as any
search engine would do.
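As a rough illustration of the re-formatting step, the following Python sketch pulls the total hit count and the document ids out of a Solr JSON response. It assumes the usual response layout with a top-level "response" object containing "numFound" and "docs", and that each document has an "id" field. The function name summarize_response is hypothetical and not part of any library.

```python
import json


def summarize_response(raw_json):
    """Return (total hits, list of document ids) from a Solr JSON response."""
    # Assumption: the body has the shape {"response": {"numFound": N, "docs": [...]}}.
    response = json.loads(raw_json)["response"]
    # Documents without an "id" field are reported with an empty string.
    ids = [doc.get("id", "") for doc in response["docs"]]
    return response["numFound"], ids
```

A web program would typically turn the returned ids (and any other fields it needs) into HTML before sending the page back to the browser.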
Below is a rough outline for how you could structure your solution to the exercise. All three elements:
web browser, web server, and Solr would be located on your laptop. Your web server might be the
Apache web server coupled with the PHP programming language. An alternative solution would be to
use node.js as the server/programming component. In the case of node.js, the programming language is
JavaScript. Whatever you use, your program would send the query web page to the user, and then send
the user’s query to Solr, which produces the results. The results are returned by Solr to the same web
server, which converts them into a nice-looking web page that is eventually returned to the user.
[Figure: the user’s web browser (query box) sends the query to your web server (format query; send to
Solr; format results), which passes it to your Solr (process query; return results).]
The Solr server supports several clients (different languages). Clients use requests to ask Solr to do things like
perform queries or index documents. Client applications can reach Solr by creating HTTP requests and
parsing the HTTP responses. Client APIs encapsulate much of the work of sending requests and parsing
responses, which makes it much easier to write client applications.
Clients use Solr’s five fundamental operations to work with Solr. The operations are query, index, delete,
commit, and optimize. Queries are executed by creating a URL that contains all the query parameters.
Solr examines the request URL, performs the query, and returns the results. The other operations are
similar, although in certain cases the HTTP request is a POST operation and contains information beyond
whatever is included in the request URL. An index operation, for example, may contain a document in
the body of the request.
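To make the idea of a query being "a URL that contains all the query parameters" concrete, here is a minimal Python sketch that builds such a URL. The host, port, and core name mirror the defaults used in the PHP example in this handout (localhost, 8983, core_name). The /select handler and the wt=json parameter are assumptions about a standard Solr setup, and build_select_url is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_select_url(query, rows=10, start=0,
                     host="localhost", port=8983, core="core_name"):
    """Return a Solr select URL that carries every query parameter."""
    params = {
        "q": query,      # the user's query string
        "start": start,  # offset of the first result to return
        "rows": rows,    # number of results to return
        "wt": "json",    # assumption: ask Solr for a JSON response
    }
    # urlencode escapes spaces and special characters in the values.
    return f"http://{host}:{port}/solr/{core}/select?{urlencode(params)}"
```

Sending an HTTP GET request to the returned URL would then perform the query; no request body is needed for this operation.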
There are several client APIs available for Solr; see https://wiki.apache.org/solr/IntegratingSolr . As an
example, here we explain how to create a PHP client that accepts input from the user in an HTML form
and sends the request to the Solr server. After the Solr server processes the query, it returns the results,
which are parsed by the PHP program and formatted for display.
We are using the solr-php-client which is available here https://github.com/PTCInc/solr-php-client .
Clone this repository on your computer in the folder where you are developing the User Interface (git
clone https://github.com/PTCInc/solr-php-client.git). Below is the sample code from the wiki of this
repository.
<?php
// make sure browsers see this page as utf-8 encoded HTML
header('Content-Type: text/html; charset=utf-8');

$limit = 10;
$query = isset($_REQUEST['q']) ? $_REQUEST['q'] : false;
$results = false;

if ($query)
{
    // The Apache Solr Client library should be on the include path
    // which is usually most easily accomplished by placing in the
    // same directory as this script ( . or current directory is a default
    // php include path entry in the php.ini)
    require_once('Apache/Solr/Service.php');

    // create a new solr service instance - host, port, and corename
    // path (all defaults in this example)
    $solr = new Apache_Solr_Service('localhost', 8983, '/solr/core_name/');

    // if magic quotes is enabled then stripslashes will be needed
    // (the function no longer exists in PHP 8, hence the function_exists check)
    if (function_exists('get_magic_quotes_gpc') && get_magic_quotes_gpc() == 1)
    {
        $query = stripslashes($query);
    }

    // in production code you'll always want to use a try/catch for any
    // possible exceptions emitted by searching (i.e. connection
    // problems or a query parsing error)
    try
    {
        $results = $solr->search($query, 0, $limit);
    }
    catch (Exception $e)
    {
        // in production you'd probably log or email this error to an admin
        // and then show a special message to the user but for this example
        // we're going to show the full exception
        die("<html><head><title>SEARCH EXCEPTION</title></head><body><pre>{$e->__toString()}</pre></body></html>");
    }
}
?>
<html>
<head>
<title>PHP Solr Client Example</title>
</head>
<body>
<form accept-charset="utf-8" method="get">
<label for="q">Search:</label>
<input id="q" name="q" type="text" value="<?php echo htmlspecialchars($query, ENT_QUOTES, 'utf-8'); ?>"/>
<input type="submit"/>
</form>
<?php
// display results
if ($results)
{
    $total = (int) $results->response->numFound;
    $start = min(1, $total);
    $end = min($limit, $total);
?>
<div>Results <?php echo $start; ?> - <?php echo $end; ?> of <?php echo $total; ?>:</div>
<ol>
<?php
    // iterate result documents
    foreach ($results->response->docs as $doc)
    {
?>
<li>
<table style="border: 1px solid black; text-align: left">
<?php
        // iterate document fields / values
        foreach ($doc as $field => $value)
        {
?>
<tr>
<th><?php echo htmlspecialchars($field, ENT_NOQUOTES, 'utf-8'); ?></th>
<td><?php echo htmlspecialchars($value, ENT_NOQUOTES, 'utf-8'); ?></td>
</tr>
<?php
        }
?>
</table>
</li>
<?php
    }
?>
</ol>
<?php
}
?>
</body>
</html>
In order to provide additional parameters to the Solr server, create an array of these parameters as
shown below:
$additionalParameters = array(
    'fq' => 'a filtering query',
    'facet' => 'true',
    // notice I use an array for a multi-valued parameter
    'facet.field' => array(
        'field_1',
        'field_2'
    )
);
$results = $solr->search($query, $start, $rows, $additionalParameters);
Step 2
In this step you are going to run a series of queries on the Solr index. The queries you are going to use
are the same ones you used for Exercise #1. The comparison now will be to use two different ranking
algorithms.
Solr uses Lucene to facilitate ranking. Lucene uses a combination of the Vector Space Model and the
Boolean model to determine how relevant a given document is to a user’s query. The vector space
model is based on term frequency. The Boolean model is first used to narrow down the documents that
need to be scored based on the use of Boolean logic in the query specification.
Solr permits us to change the ranking algorithm. This gives us an opportunity to use the PageRank
algorithm and see how the results differ. There are several ways to manipulate the ranking of a
document in Solr. Here, we explain the method which uses a field that refers to an External File that
stores the PageRank scores. This external file basically contains a mapping from a key field to the field
value.
In order to create this External File you must first carry out the PageRank process on your set of
collected web pages. You can use the web graph structure that you created in your assignment #2. There is
no need to re-compute the edge relationships (see pagerankdata.csv).
There are several libraries available to help you compute the PageRank given a graph. One of them is the
NetworkX library:
http://networkx.github.io/documentation/networkx1.10/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html
This PageRank function takes a NetworkX graph
(http://networkx.github.io/documentation/networkx1.10/reference/classes.digraph.html#networkx.DiGraph)
as input and returns a dictionary of graph nodes with corresponding PageRank scores. These are the
steps you would need to perform:
i. Compute the incoming and outgoing links to the web pages, and create a NetworkX graph
ii. Compute the Page Rank for this graph and store this in a file in the format
<document_id>=<page_rank_score>
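The two steps above could be sketched in Python roughly as follows. This is only an illustration under several assumptions: the link structure is available as a list of (source_id, target_id) pairs whose ids match the ids in your index, NetworkX is installed, and the function names pagerank_scores, external_file_lines, and write_external_file are hypothetical helpers rather than part of any library. How you read the pairs from your own edge data is left open.

```python
def pagerank_scores(edges):
    """Step i and ii: build a directed graph from link pairs and score it."""
    # Third-party dependency; imported here so the formatting helpers
    # below can be used without NetworkX installed.
    import networkx as nx

    graph = nx.DiGraph()
    graph.add_edges_from(edges)   # each pair is (source_id, target_id)
    return nx.pagerank(graph)     # dictionary: node id -> PageRank score


def external_file_lines(scores):
    """Format scores as <document_id>=<page_rank_score>, one per line."""
    # Sorting by id is a choice made here to keep the output deterministic.
    return [f"{doc_id}={score}" for doc_id, score in sorted(scores.items())]


def write_external_file(scores, path):
    """Write the formatted lines to the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(external_file_lines(scores)) + "\n")
```

For example, write_external_file(pagerank_scores(edges), "external_pageRankFile.txt") would produce a file in the format described above, using the example file name from this handout.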
Make sure the document id is the same as the one which is present in your index. In your Solr index, you
would typically have the filename as the id as shown below:
Once you have computed the PageRank scores, you need to place this file in the data folder of the core
you have defined. This can be found in the path solr-5.3.1/server/solr/core_name/. The name of the file
should be external_fieldname or external_fieldname.*. For this example the file could be
named external_pageRankFile or external_pageRankFile.txt.
The next step is to add the field in the schema.xml which refers to this score.
We are defining a field “pageRankFile” which is of the type “external”. The field name should be the
suffix after “external_” you provided for your file in the previous step. The keyField attribute defines the
key that will be defined in the external file. It is usually the unique key for the index. A defVal defines a
default value that will be used if there is no entry in the external file for a particular document.
The valType attribute specifies the actual type of values that will be found in the file. The type specified
must be a float field type, so valid values for this attribute are pfloat, float or tfloat. This attribute
can be omitted.
Once the field has been defined, we need to make sure that when the index is reloaded, it is able to
access the rank file. In order to do that, there are some modifications required in the solrconfig.xml file.
We need to define eventListeners to reload the external file when either a searcher is loaded or when a
new searcher is started. The searcher is basically a Lucene class which enables searching across the
Lucene index. Define these listeners within the <query> element in the solrconfig.xml file.
Now reload the index, by going to the Solr Dashboard UI->Core Admin and clicking on the “Reload”
button.
Now, you can run the queries and compare the results with and without PageRank scores. In the Solr
query view, there is a “sort” text field, where you can specify the field on which the results have to be
sorted along with the order (ascending, descending).
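When the queries are sent from a program rather than typed into the query view, the same choice can be expressed through a sort request parameter. The Python sketch below builds the query string for both variants. It assumes the external field is named pageRankFile as in this handout, that your Solr setup accepts sorting on that field in the form "pageRankFile desc", and that wt=json is wanted; ranking_query_strings is a hypothetical helper name.

```python
from urllib.parse import urlencode


def ranking_query_strings(query, pagerank_field="pageRankFile"):
    """Return (default ranking, PageRank ranking) query strings."""
    # Default ranking: no sort parameter, so Solr orders by its own score.
    base = {"q": query, "wt": "json"}
    # PageRank ranking: sort descending on the external field (assumption).
    by_pagerank = dict(base, sort=f"{pagerank_field} desc")
    return urlencode(base), urlencode(by_pagerank)
```

Each query string would be appended to the core's select URL; running the same query with both strings gives the two result lists to compare.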
For example, in the below screenshot I have created an external file and added PageRank scores for a
few documents in the index.
“/home/solr-5.3.1/CrawlData/E0-001-086391282-0.html” is the id of the file that has been indexed. The
values indicate the PageRank scores.
The following screenshot shows the results that use Solr’s internal ranking algorithm, based on
tf-idf, for the default query “*:*”. For the sake of clarity, only the ids have been displayed.
The next screenshot is the result of the default query again, but with the ranking based on PageRank
that was defined in the external file.
Comparing the two results we can see that the order of the results has changed. The file ending with
“E0-001-079680609-8.html” is the first result in the PageRank scoring results, as we had defined a high
score (5.5) for it in the external file.
Once that is done you should run the same set of queries from the previous assignment, and save the
results.
Step 3
Returning the results:
The results that you return should include the total number of documents found at the top of the page.
Each result should include a title, author, date created, size in KB, and a link that you can click on to be
re-directed to the corresponding document.
Comparing the results:
Compare the results of the two ranking algorithms. You can do this by returning to your assignment #1.
Use the same queries that you used in assignment #1. For each query collect the results given by both
ranking algorithms, and in a document compare them.
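One possible way to support the written comparison, offered only as a sketch and not as a required method, is to count how many documents the two rankings have in common for each query. The Python below assumes you have already collected the result ids for a query as two lists; the function names overlap and overlap_fraction are hypothetical.

```python
def overlap(default_ids, pagerank_ids):
    """Return ids present in both lists, in default-ranking order."""
    in_pagerank = set(pagerank_ids)
    return [doc_id for doc_id in default_ids if doc_id in in_pagerank]


def overlap_fraction(default_ids, pagerank_ids):
    """Return the shared count divided by the longer list's length."""
    longest = max(len(default_ids), len(pagerank_ids))
    if longest == 0:
        return 0.0  # no results from either ranking
    return len(overlap(default_ids, pagerank_ids)) / longest
```

Applied to the top results of each query, this gives a single number per query that can be tabulated next to the two result lists in the report.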
What needs to be submitted
There should be a report describing what you have done and analyzing the results of the two ranking
algorithms.
Using the submit command you should provide the following files:
– the external_PageRankFile.txt that was used as input to the PageRank algorithm
– all source code that you wrote, including the code for creating the web page that accepts the
query, the program that sends the query to Solr and the program that processes the Solr results

