
DEGREES OF MSc, MSci, MEng, BEng, BSc, MA and MA (Social Sciences)

Information Retrieval (M)

COMPSCI5011

Wednesday 10 May 2023

1.

(a) The following corpus of documents has been processed by an IR system where stemming is not applied:

Doc1: Real estate speculation is of interest.

Doc2: Interest rates are increasing interest in home costs.

Doc3: Students have no real interest in interest rates.

Doc4: As interest rates fall, the real estate market is heating up.

Doc5: The government is considering increasing interest rates.

(i) Assume that the following terms are stopwords: an, as, and, are, do, in, is, of, not, the, up. Give the vector space model representations of documents Doc1 and Doc2, assuming that you are using (raw) term frequency to weight the terms. Show clearly the terms of your vectors as well as their weights. [2]
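For illustration, a minimal Python sketch of building such raw term-frequency vectors (the stopword list is the one given above; the simple punctuation stripping is an assumption of this sketch):

from collections import Counter

STOPWORDS = {"an", "as", "and", "are", "do", "in", "is", "of", "not", "the", "up"}

def tf_vector(text):
    # Lowercase, strip trailing punctuation, drop stopwords, count raw tfs.
    tokens = [w.strip(".,").lower() for w in text.split()]
    return Counter(t for t in tokens if t and t not in STOPWORDS)

print(tf_vector("Real estate speculation is of interest."))
# Counter({'real': 1, 'estate': 1, 'speculation': 1, 'interest': 1})
print(tf_vector("Interest rates are increasing interest in home costs."))
# Counter({'interest': 2, 'rates': 1, 'increasing': 1, 'home': 1, 'costs': 1})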

(ii)       Consider the following query Q:

Q= interest rates

Provide the vector space model representation of Q, showing both the terms as well as their weights. [1]

Compute the cosine similarity between the query Q and Doc1 as well as the cosine similarity between Q and Doc2. Show your working. [2]
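A sketch of the cosine computation over sparse term-weight dictionaries (the Doc1 and Doc2 vectors are the ones derived in part (i)):

import math

def cosine(u, v):
    # Dot product over shared terms, divided by the Euclidean norms.
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q  = {"interest": 1, "rates": 1}
d1 = {"real": 1, "estate": 1, "speculation": 1, "interest": 1}
d2 = {"interest": 2, "rates": 1, "increasing": 1, "home": 1, "costs": 1}
print(cosine(q, d1))  # 1 / (sqrt(2) * 2) ~= 0.3536
print(cosine(q, d2))  # 3 / (sqrt(2) * sqrt(8)) = 0.75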

(iii) Assume the same list of stopwords as in (i) above. Construct an inverted file for all the documents of the corpus, showing clearly the dictionary and posting list components. Your inverted file needs to store sufficient information for computing a simple tf*idf term weight, where wij = tfij * log2(N/dfi). [5]
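A sketch of the index-construction step, storing the document frequency of each term (the dictionary) and the (doc, tf) pairs (the posting lists), which is exactly the information that the wij = tfij * log2(N/dfi) weight needs:

from collections import Counter, defaultdict

STOPWORDS = {"an", "as", "and", "are", "do", "in", "is", "of", "not", "the", "up"}
docs = {
    "Doc1": "Real estate speculation is of interest.",
    "Doc2": "Interest rates are increasing interest in home costs.",
    "Doc3": "Students have no real interest in interest rates.",
    "Doc4": "As interest rates fall, the real estate market is heating up.",
    "Doc5": "The government is considering increasing interest rates.",
}

postings = defaultdict(list)            # term -> [(doc_id, tf), ...]
for doc_id, text in docs.items():
    tokens = [w.strip(".,").lower() for w in text.split()]
    for term, tf in sorted(Counter(t for t in tokens if t not in STOPWORDS).items()):
        postings[term].append((doc_id, tf))

dictionary = {term: len(plist) for term, plist in postings.items()}  # term -> df
print(dictionary["interest"], postings["interest"])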

(iv) Assuming the use of a best match ranking algorithm, rank all documents of the corpus using their relevance scores for the following query:

real estate interest

Show your working. Note that log2(1.25) = 0.3219, log2(1.6666) = 0.7369, log2(2.5) = 1.3219 and log2(5) = 2.3219 (you may not need all of these). [3]
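A sketch of the best-match scoring itself, with the dfs and tfs read off the inverted file above (only the three query terms are shown):

import math

N = 5
df = {"real": 3, "estate": 2, "interest": 5}     # from the inverted file
tf = {                                           # per-document query-term tfs
    "Doc1": {"real": 1, "estate": 1, "interest": 1},
    "Doc2": {"interest": 2},
    "Doc3": {"real": 1, "interest": 2},
    "Doc4": {"real": 1, "estate": 1, "interest": 1},
    "Doc5": {"interest": 1},
}
query = ["real", "estate", "interest"]
scores = {d: sum(w.get(t, 0) * math.log2(N / df[t]) for t in query)
          for d, w in tf.items()}
for d, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(d, round(s, 4))   # note log2(5/5) = 0, so 'interest' contributes nothing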

(b) Consider the following non-interpolated recall-precision graph, showing the performance of an IR system on a given query Q. For this query Q, the IR system has returned 20 documents. Assume that Q has 16 relevant documents in the ground truth, not all of which have been retrieved by the system for the query.

[Figure: non-interpolated recall-precision graph not reproduced.]

(i) Compute the interpolated precision values for this query Q. Show your working. [4]
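A sketch of the interpolation rule, where interpolated precision at recall r is the maximum precision at any recall level >= r (the (recall, precision) pairs below are made up, since the graph is not reproduced here):

def interpolated(points):
    # points: (recall, precision) pairs from the ranked run.
    out, best = [], 0.0
    for r, p in sorted(points, reverse=True):
        best = max(best, p)       # max precision at any recall >= r
        out.append((r, best))
    return sorted(out)

print(interpolated([(0.1, 0.5), (0.2, 0.4), (0.3, 0.45), (0.4, 0.3)]))
# [(0.1, 0.5), (0.2, 0.45), (0.3, 0.45), (0.4, 0.3)]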

(ii) The IR system has returned 20 documents ranked from rank=1 to rank=20. In the tables below, indicate whether the document at each rank was relevant (R) or non-relevant (X). [2]

Rank: 1  2  3  4  5  6  7  8  9  10
R/X:  _  _  _  _  _  _  _  _  _  _

Rank: 11 12 13 14 15 16 17 18 19 20
R/X:  _  _  _  _  _  _  _  _  _  _

(iii) Compute the Average Precision (AP) for query Q. [1]
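A sketch of the AP computation; note the division by the total number of relevant documents (16 here), so relevant documents the system never retrieved pull AP down (the example ranking is hypothetical):

def average_precision(rels, num_relevant):
    # rels: True/False per rank position, top to bottom.
    hits, acc = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            acc += hits / rank    # precision at this relevant rank
    return acc / num_relevant

print(average_precision([True, False, True] + [False] * 17, 16))  # ~0.1042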

2.

(a) Consider a collection A with 5000 documents and a lexicon size of 10000 words (number of unique words). Consider a collection B with 11250 documents and a lexicon size of 15000 words. Suppose that all documents have an average length of 200 words. We then add 2200 documents to each collection.

Use Heaps' Law with β = 0.5 to estimate how many additional unique terms we would add to A and B. Show your working.

Explain whether the obtained estimates are what you would expect, given the number of documents in A and B, and why. [4]
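A worked sketch of the Heaps' Law estimate (V = k * n^beta, fitting k from each collection's known vocabulary size):

beta = 0.5

def added_terms(num_docs, vocab, extra_docs, avg_len=200):
    n = num_docs * avg_len              # tokens before the addition
    k = vocab / n ** beta               # fit k from the observed vocabulary
    n_new = (num_docs + extra_docs) * avg_len
    return k * n_new ** beta - vocab    # estimated additional unique terms

print(added_terms(5000, 10000, 2200))   # collection A: 2000.0
print(added_terms(11250, 15000, 2200))  # collection B: ~1401.2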

(b) Assume that a user's original query is: bargain books bargain DVDs really bargain books. The user examines two documents, d1 and d2. She judges d1, with content great books bargain bestsellers really bargain books, relevant, and d2, with content big big bargain DVDs, non-relevant.

Assume that the IR system is using the vector space model with raw term frequency (i.e. with no IDF). Moreover, for relevance feedback, assume the use of Rocchio's Relevance Feedback (RF) algorithm, with parameters α = 1, β = 0.75, γ = 0.25.

(i) First, provide the original query vector. Moreover, provide the vector representations for d1 and d2. You need to clearly show the terms of the vectors in addition to the weights. [2]

(ii) Next, provide the reformulated query that will be run by the IR system after the application of Rocchio's RF algorithm. Terms in the reformulated query with negative weights can be dropped, i.e. their weights can be changed back to 0. Show your working. [3]
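A minimal Rocchio sketch under the stated setting (α = 1, β = 0.75, γ = 0.25, raw tf, no IDF), clipping negative weights back to 0:

from collections import Counter

def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.25):
    terms = set(q) | {t for d in rel for t in d} | {t for d in nonrel for t in d}
    new_q = {}
    for t in terms:
        w = (alpha * q.get(t, 0)
             + beta / len(rel) * sum(d.get(t, 0) for d in rel)
             - gamma / len(nonrel) * sum(d.get(t, 0) for d in nonrel))
        new_q[t] = max(w, 0.0)   # negative weights are clipped back to 0
    return new_q

q  = Counter("bargain books bargain DVDs really bargain books".split())
d1 = Counter("great books bargain bestsellers really bargain books".split())
d2 = Counter("big big bargain DVDs".split())
print(rocchio(q, rel=[d1], nonrel=[d2]))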

(iii) The users of the IR system requested a new feature in the user interface, where next to each returned document they can click "Find documents like this one" to obtain more similar documents. You have been asked to re-use Rocchio's algorithm to deploy such a feature and return similar documents. Explain which weight settings for α, β and γ you would use to deploy this new feature. Justify your answer. [2]

(c) Consider the following six documents.

Doc1: I like Terrier Terrier Terrier

Doc2: I like Terrier and I like the course

Doc3: I like the course

Doc4: I don’t like Terrier

Doc5: I like dogs

Doc6: I like puppies

Assume an IR system that uses the probabilistic Binary Independence Retrieval Model. Recall that in this model, the relevance score of a document to a query is computed with the following RSV formula, summed over the query terms i that occur in the document:

RSVd = sum_i ci, where ci = log[ pi(1 - si) / (si(1 - pi)) ],

with pi = P(term i occurs | relevant document) and si = P(term i occurs | non-relevant document).

Assume a large corpus (N=1000), and that the terms like, Terrier, course, dogs and puppies occur only in the documents above (i.e. they do not occur in the remaining 994 documents).

(i) First, assume that there is no relevance information available to the system. Moreover, assume that pi is a constant, namely pi = 0.5; i.e., half of all relevant documents will contain each query term. Now, consider the following query Q:

Q = Terrier

Calculate the Retrieval Status Value (RSV) of the six documents above for query Q, and rank the documents. Use log base 10. Show your workings and justify your answer. [3]
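A sketch of the no-relevance-information case: pi = 0.5 makes the log(pi/(1-pi)) part vanish, and si is approximated by dfi/N, so each matching query term adds log10((N - dfi)/dfi):

import math

N = 1000
df = {"like": 6, "Terrier": 3, "course": 2, "dogs": 1, "puppies": 1}

def rsv(doc_terms, query_terms):
    return sum(math.log10((N - df[t]) / df[t])
               for t in query_terms if t in doc_terms)

docs = {
    "Doc1": {"I", "like", "Terrier"},
    "Doc2": {"I", "like", "Terrier", "and", "the", "course"},
    "Doc3": {"I", "like", "the", "course"},
    "Doc4": {"I", "don't", "like", "Terrier"},
    "Doc5": {"I", "like", "dogs"},
    "Doc6": {"I", "like", "puppies"},
}
for d in docs:
    print(d, round(rsv(docs[d], ["Terrier"]), 4))  # ~2.5216 for Terrier docs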

(ii) Next, assume that we have some information about term occurrences in the relevant and non-relevant documents for the above query Q. Indeed, through the use of relevance feedback, we were told that the only relevant documents for Q are: Doc1, Doc2, and Doc6. Everything else is non-relevant. Use this relevance information to compute the probabilities pi and si. Then, calculate the RSV between the query Q and the six documents, and rank the documents. [4]
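A sketch of the relevance-feedback estimates, using the common unsmoothed forms pi = ri/R and si = (dfi - ri)/(N - R), where ri counts relevant documents containing term i (some formulations add 0.5 smoothing to avoid zeros):

import math

N, R = 1000, 3                          # Doc1, Doc2, Doc6 are relevant
df_terrier, r_terrier = 3, 2            # Terrier occurs in Doc1, Doc2 (relevant) and Doc4

p = r_terrier / R                       # 2/3
s = (df_terrier - r_terrier) / (N - R)  # 1/997
c = math.log10(p * (1 - s) / (s * (1 - p)))
print(round(c, 4))                      # weight added by a 'Terrier' match, ~3.2993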

(iii) Briefly provide the main take-away messages you might draw from your results in Parts (i) and (ii). [2]

3.

(a) We discussed a number of IR models in the class. These models can make a number of assumptions to simplify the relevance estimation of a document given a query. Provide 4 examples of assumptions made by IR models which do not necessarily hold true. In particular, for each assumption, briefly provide a concrete and informative example showing why the assumption does not hold in a real-world search environment. [4]

(b) Explain why recording the upper bound contribution of a weighting model on a given posting list facilitates the deployment of techniques that improve the efficiency of a search engine. [4]
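A toy sketch of the pruning decision that such per-list upper bounds enable (MaxScore-style reasoning; the names and numbers are illustrative):

def can_skip(partial_score, remaining_term_upper_bounds, kth_best_score):
    # If even the best case cannot beat the current k-th best retrieved
    # score, the remaining postings for this document need not be scored.
    return partial_score + sum(remaining_term_upper_bounds) <= kth_best_score

print(can_skip(1.2, [0.3, 0.1], kth_best_score=2.0))  # True -> safe to skip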

(c) Consider a query q, which returns all web pages shown in the hyperlink structure below.

[Figure: hyperlink graph not reproduced.]

(i) Write the adjacency matrix A for the above graph. [1]

(ii) Using the iterative HITS algorithm, provide the hub and authority scores for all the webpages of the above graph by running the algorithm for two iterations. Show the hub and authority scores for each page after each iteration. Show your workings. [4]

You can write matrices in plain text like a table, or write matrices row by row, e.g. ([a,b];[c,d]) shows a matrix with two columns and two rows where the first row is [a,b] and the second row is [c,d].
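A sketch of two HITS iterations on a hypothetical adjacency matrix (A[i][j] = 1 if page i links to page j; the exam's actual graph is not reproduced here, and normalising by the maximum score is one common choice):

import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

h = np.ones(A.shape[0])   # initial hub scores
a = np.ones(A.shape[0])   # initial authority scores
for _ in range(2):
    a = A.T @ h           # authority: sum of hub scores of in-linking pages
    a /= a.max()
    h = A @ a             # hub: sum of authority scores of linked-to pages
    h /= h.max()
print("authority:", a.round(4))
print("hub:      ", h.round(4))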

(d) Explain why dense retrieval approaches cannot easily handle long documents, and hence passages are preferred. Discuss two methods for aggregating passage scores, and the intuitions these approaches encode. How would they handle a long document where the relevant content for the query is spread through the document? [7]
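A sketch of two common passage-score aggregators for long documents (the passage scores are made-up similarity values):

def max_p(scores):
    # MaxP: the document is as relevant as its single best passage,
    # robust when the relevant content is concentrated in one place.
    return max(scores)

def mean_p(scores):
    # Mean (or sum) aggregation rewards documents whose relevance
    # signal is spread across many passages.
    return sum(scores) / len(scores)

passage_scores = [0.31, 0.29, 0.33, 0.30]   # relevance spread through the doc
print(max_p(passage_scores), mean_p(passage_scores))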


