
DEGREES OF MSci, MEng, BEng, BSc, MA and MA (Social Sciences)

INFORMATION RETRIEVAL M

COMPSCI 5011

Wednesday 13 May 2020, 09:30 BST

SECTION A

1.    (a)

The following documents have been processed by an IR system where no stemming has been applied:

DocID   Text
doc1    france is world champion 1998
doc2    croatia and france played each other in the semifinal
doc3    croatia was in the semifinal 1998
doc4    croatia won the semifinal in russia 2018

(i)        Assume that the list of stopwords is the following: and, in, the, was, each, other. Build an inverted file for these documents, showing clearly the dictionary and posting list components. Your inverted file needs to store sufficient information for computing a simple tf*idf term weight, where w_ij = tf_ij * log2(N/df_i).           [5]
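For illustration (outside the exam itself), the inverted file asked for here could be sketched in Python as follows, with the dictionary holding each term's document frequency (df) and the posting lists holding (docID, tf) pairs; the documents and stopword list are taken from the question:

```python
from collections import Counter

STOPWORDS = {"and", "in", "the", "was", "each", "other"}

docs = {
    "doc1": "france is world champion 1998",
    "doc2": "croatia and france played each other in the semifinal",
    "doc3": "croatia was in the semifinal 1998",
    "doc4": "croatia won the semifinal in russia 2018",
}

def build_inverted_file(docs):
    """Return (dictionary, postings): term -> df, and term -> [(docID, tf), ...]."""
    postings = {}
    for doc_id, text in docs.items():
        tfs = Counter(t for t in text.split() if t not in STOPWORDS)
        for term, tf in tfs.items():
            postings.setdefault(term, []).append((doc_id, tf))
    # the dictionary component stores df = number of postings per term
    dictionary = {term: len(pl) for term, pl in postings.items()}
    return dictionary, postings

dictionary, postings = build_inverted_file(docs)
```

With this structure, tf comes from a posting entry and df from the dictionary, which is exactly what the w_ij = tf_ij * log2(N/df_i) weight needs.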

(ii)       Compute the term weights (w_ij) of the terms ‘world’ and ‘1998’ in doc1. Show your calculations.           [2]

(iii)      Assuming the use of a best match ranking algorithm, rank all documents after computing their relevance scores for the following query:

1998 semifinal

Show your calculations. Note that log2(0.75) = -0.4150 and log2(1.3333) = 0.4150.           [4]
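As a check on the hand calculations, a small sketch of the best-match scoring (tf and df values read off the collection above, N = 4 documents):

```python
import math

N = 4  # number of documents in the collection
# document frequencies after stopword removal
df = {"1998": 2, "semifinal": 3, "world": 1}
# term frequencies per document for the terms of interest
tf = {
    "doc1": {"1998": 1, "world": 1},
    "doc2": {"semifinal": 1},
    "doc3": {"1998": 1, "semifinal": 1},
    "doc4": {"semifinal": 1},
}

def weight(term, doc):
    """w_ij = tf_ij * log2(N / df_i)"""
    return tf[doc].get(term, 0) * math.log2(N / df[term])

def score(query_terms, doc):
    """Best-match score: sum of tf*idf weights over the query terms."""
    return sum(weight(t, doc) for t in query_terms)

query = ["1998", "semifinal"]
ranking = sorted(tf, key=lambda d: score(query, d), reverse=True)
```

Running it ranks doc3 first (score 1 + 0.4150 = 1.4150), followed by doc1, with doc2 and doc4 tied on 0.4150.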

(iv)      Typically, a log scale is applied to the tf (term frequency) component when scoring documents using a simple tf*idf term weighting scheme. Explain why this is the case, illustrating your answer with a suitable example in IR. Explain through examples how models such as BM25 and PL2 control the term frequency counts.           [4]

(b)     Assume that the set of relevant documents for a query q is Rq, where:

Rq = {doc1, doc2, doc5, doc7, doc8}

An IR system ranks the top 10 results for this query as follows (the leftmost item is the top ranked search result):

doc1  doc12  doc7  doc6  doc4  doc5  doc2  doc8  doc9  doc3

(i)        Show the computed precision and recall values at each rank of the produced ranking for the query q.           [3]

(ii)       Show the calculation of the interpolated precision values at the 11 standard recall levels for the ranking produced by the system for the query q.          [2]

(iii)      Compute the average precision of query q for the returned ranking. Show your calculations.         [2]

(iv)      Based on your understanding of the average precision measure, explain whether the obtained average precision score for query q was expected. Justify your answer.                    [3]
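The precision/recall values in (i) and the average precision in (iii) can be reproduced with a short sketch (relevant set and ranking taken from the question):

```python
relevant = {"doc1", "doc2", "doc5", "doc7", "doc8"}
ranking = ["doc1", "doc12", "doc7", "doc6", "doc4",
           "doc5", "doc2", "doc8", "doc9", "doc3"]

precision, recall, hits = [], [], 0
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
    precision.append(hits / rank)       # P@rank
    recall.append(hits / len(relevant)) # R@rank

# average precision: mean of the precision values at the ranks
# where a relevant document was retrieved
ap = sum(p for p, doc in zip(precision, ranking)
         if doc in relevant) / len(relevant)
```

The average precision works out to roughly 0.67, since all five relevant documents are retrieved within the top 8 ranks.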

(c)     In various DFR models, term frequency tf is replaced with a normalised term frequency tfn as follows:

tfn = tf * log2(1 + c * avg_l / l)

where l is the length of the document and avg_l is the average document length in the collection.

Using an example of at least two documents of varying length, explain why such a normalisation is needed. Describe in detail how the above formula will behave for your chosen example documents.

What is the role of the c parameter in the formula?          [5]
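As an illustrative sketch (assuming the standard DFR length normalisation tfn = tf * log2(1 + c * avg_l / l); the document lengths below are made up), the normalisation boosts occurrences in documents shorter than average and dampens them in longer ones:

```python
import math

def tfn(tf, doc_len, avg_len, c=1.0):
    """DFR normalised term frequency: tfn = tf * log2(1 + c * avg_len / doc_len)."""
    return tf * math.log2(1 + c * avg_len / doc_len)

# same raw tf = 5 in a short and a long document (hypothetical lengths)
avg = 500
short_doc = tfn(5, 100, avg)   # doc shorter than average: tfn inflated
long_doc = tfn(5, 1000, avg)   # doc longer than average: tfn deflated
```

Raising c increases the argument of the log for every document, so larger c weakens the relative penalty on long documents; c thus controls the strength of the length normalisation.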

2.      (a)     Consider a corpus of documents C written in English, where the frequency distribution of words approximately follows Zipf’s law r * p(w_r|C) = 0.1, where r = 1, 2, …, n is the rank of a word by decreasing order of frequency, w_r is the word at rank r, and p(w_r|C) is the probability of occurrence of word w_r in the corpus C. Compute the probability of the most frequent word in C. Compute the probability of the 2nd most frequent word in C. Justify your answers.       [4]
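The Zipf relation in the question rearranges to p(w_r|C) = 0.1 / r, which a one-line sketch makes concrete (the constant 0.1 comes from the question):

```python
def zipf_prob(rank, k=0.1):
    """Zipf's law with r * p(w_r|C) = k, so p(w_r|C) = k / r."""
    return k / rank
```

So the most frequent word (r = 1) has probability 0.1 and the second most frequent (r = 2) has probability 0.05.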

(b)     Suppose we have a corpus of documents with a dictionary of 6 words w1, ..., w6.

Consider the table below, which provides the estimated language model p(w|C) using the entire corpus of documents C (second column), as well as the word counts for doc1 (third column) and doc2 (fourth column), where ct(w, doci) is the count of word w (i.e. its term frequency) in document doci. Let the query q be the following:

q = w1 w2

Word   p(w|C)   ct(w, doc1)   ct(w, doc2)
w1     0.8      2             7
w2     0.1      3             1
w3     0.025    2             1
w4     0.025    2             1
w5     0.025    1             0
w6     0.025    0             0
SUM    1.0      10            10

(i)        Assume that we do not apply any smoothing technique to the language model for doc1  and doc2. Calculate the query likelihood for both doc1  and doc2, i.e. p(q|doc1) and p(q|doc2) (Do not compute the log-likelihood; i.e. do not apply any log scaling). Show your calculations.

Provide the resulting ranking of documents and state the document that would be ranked the highest.      [3]

(ii)       Suppose we now smooth the language model for doc1 and doc2 using Jelinek-Mercer Smoothing with λ = 0.1. Recalculate the likelihood of the query for both doc1  and doc2, i.e., p(q|doc1) and p(q|doc2) (Do not compute the log-likelihood; i.e. do not apply any log scaling). Show your calculations.

Provide the resulting ranking of documents and state the document that would be ranked the highest.           [4]
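A sketch of both likelihood computations (assuming the common Jelinek-Mercer convention in which λ weights the collection model, i.e. p(w|d) = (1-λ) * ct(w,d)/|d| + λ * p(w|C); the counts come from the table above, and both documents have length 10):

```python
p_C = {"w1": 0.8, "w2": 0.1}                       # collection model for the query words
counts = {"doc1": {"w1": 2, "w2": 3},
          "doc2": {"w1": 7, "w2": 1}}
DOC_LEN = 10
query = ["w1", "w2"]

def likelihood(doc, lam=0.0):
    """p(q|d) as a product over query words; lam = 0 gives the unsmoothed model."""
    p = 1.0
    for w in query:
        p *= (1 - lam) * counts[doc].get(w, 0) / DOC_LEN + lam * p_C[w]
    return p
```

Without smoothing, p(q|doc1) = 0.2 * 0.3 = 0.06 and p(q|doc2) = 0.7 * 0.1 = 0.07, so doc2 ranks first; with λ = 0.1 the smoothed values are 0.26 * 0.28 = 0.0728 and 0.71 * 0.10 = 0.071, so doc1 overtakes it.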

(iii)      Explain which document you think should reasonably be ranked higher (doc1 or doc2), and why.               [3]

(c)

Assume that a user’s original query is bargain books bargain DVDs really bargain books. The user examines two documents, d1 and d2. She judges d1, with the content books bargain bestsellers really bargain books, as relevant, and d2, with the content bargain great DVDs, as non-relevant.

Assume that the IR system is using raw term frequency (with no IDF). Moreover, for relevance feedback, suppose the use of Rocchio’s Relevance Feedback (RF) algorithm, with alpha = 1, beta = 0.75, gamma = 0.25.

First, provide the original query vector, clearly stating its dimensions. Provide also the vector representations for d1 and d2. Next, provide the reformulated query that will be run by the IR system after the application of Rocchio’s RF algorithm.

Terms in the reformulated query with negative weights can be dropped, i.e. their weights can be changed back to 0 (please list all the vector dimensions in alphabetical order). Show your calculations.        [7]
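A sketch of the Rocchio computation (vectors over the six vocabulary dimensions in alphabetical order, raw term frequencies as stated, one relevant and one non-relevant document):

```python
# vocabulary in alphabetical order: the dimensions of all vectors
vocab = ["bargain", "bestsellers", "books", "dvds", "great", "really"]

def vec(text):
    """Raw term-frequency vector over vocab."""
    words = text.split()
    return [words.count(t) for t in vocab]

q  = vec("bargain books bargain dvds really bargain books")
d1 = vec("books bargain bestsellers really bargain books")   # judged relevant
d2 = vec("bargain great dvds")                               # judged non-relevant

alpha, beta, gamma = 1.0, 0.75, 0.25
# Rocchio with |R| = |NR| = 1; negative weights are clipped back to 0
q_new = [max(0.0, alpha * qi + beta * r - gamma * nr)
         for qi, r, nr in zip(q, d1, d2)]
```

The reformulated query comes out as bargain: 4.25, bestsellers: 0.75, books: 3.5, dvds: 0.75, great: 0 (clipped from -0.25), really: 1.75.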

(d)     Explain, with a suitable example, why the terms added in a query that has been reformulated by a query expansion approach are weighted.        [4]

(e)     Using an appropriate analogy from Computer Science of your choosing, explain in detail why positive feedback is likely to be more useful than negative feedback to an information retrieval system. Illustrate your answer with an example from a suitable search scenario.           [5]

3.      (a)     A Web search engine receives a large number of ambiguous queries.  Briefly discuss three possible techniques a Web search engine might use to deal with such queries.       [4]

(b)     Using a suitable example, explain why Precision at Rank R (P@R) is not a good metric to use for learning an effective learning-to-rank model. What metric would you suggest to use instead and why?        [5]

(c)     A posting list for a term in an inverted index contains the following three entries:

id=3 tf=2    id=6 tf=2    id=8 tf=3

Applying delta compression to the ids, show the binary form of the posting list when compressed using unary compression for both ids and frequencies. What is the resulting (mean) compression rate, in bits per integer?      [5]
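A sketch of the encoding (assuming the unary convention in which integer n is written as n-1 one-bits followed by a terminating zero-bit; other conventions flip the bits):

```python
def unary(n):
    """Unary code for n >= 1: (n-1) one-bits followed by a terminating zero."""
    return "1" * (n - 1) + "0"

ids, tfs = [3, 6, 8], [2, 2, 3]
# delta-encode the doc ids: first id kept as-is, then the gaps
deltas = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
# interleave each (gap, tf) pair, as stored in the posting list
encoded = "".join(unary(d) + unary(tf) for d, tf in zip(deltas, tfs))
bits_per_int = len(encoded) / (len(ids) + len(tfs))
```

The gaps are 3, 3, 2 and the frequencies 2, 2, 3, giving 8 + 7 = 15 bits for 6 integers, i.e. 2.5 bits per integer on average.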

(d)     Given the following hyperlink structure among five web pages, and assuming teleportation, show the step-by-step construction of the Google matrix for PageRank with a teleportation factor λ = 0.1. You need to write down all the matrices and intermediate results, but you do not need to compute the final PageRank scores.

NB: You can indicate matrices in plain text like a table, or write matrices row by row, e.g. ([a,b];[c,d]) shows a matrix with two columns and two rows where the first row is [a,b] and the second row is [c,d].

[Figure not reproduced: hyperlink structure among the five web pages]

Explain which web page you think should intuitively have the highest PageRank value and why?         [6]
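Since the link-structure figure is not reproduced here, the construction itself can be sketched with a hypothetical 5-page adjacency matrix (the actual exam graph will differ). The steps are: row-normalise the adjacency matrix, replace dangling rows with a uniform distribution, then mix with the uniform teleportation matrix:

```python
import numpy as np

lam = 0.1   # teleportation factor from the question
N = 5
# hypothetical adjacency: A[i][j] = 1 if page i links to page j
A = np.array([[0, 1, 1, 0, 0],
              [0, 0, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 0, 0]], dtype=float)   # page 5 is dangling

# Step 1: row-normalise; dangling rows (no out-links) become uniform 1/N
S = A.copy()
for i in range(N):
    S[i] = S[i] / S[i].sum() if S[i].sum() > 0 else np.full(N, 1 / N)

# Step 2: mix with the uniform teleportation matrix J/N
G = (1 - lam) * S + lam * np.ones((N, N)) / N
```

Each row of G sums to 1, so G is row-stochastic, which is what the power iteration for PageRank requires.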

(e)     The University of Glasgow has deployed a new search engine to support students and staff accessing up-to-date information. The University’s search engine receives a large number of time-sensitive queries such as ‘exams timetable’ and ‘graduation ceremony’ as well as queries where the users aim to find a particular web page (e.g. ‘international student support appointments’ or ‘assessment scale’).

The University would like its search engine to be evaluated and appoints you to be in charge of this evaluation. Using a maximum of 750 words, explain how you will evaluate such a system. As part of your answer, you will need to describe a number of alternative  evaluation approaches you will suggest to the University, what resources you will require from the University and how you will practically deploy these approaches, describing their advantages and disadvantages as well as the evaluation metrics you will use.

Which evaluation approach would you recommend to the University, and why?         [10]

SECTION B

4.

(a) In the classical offline retrieval evaluation method, given a corpus of documents and a set of queries, the assessors judge the relevance of each document in relation to each query. The relevance of a given document is also assessed independently from other documents. Apart from the cost of developing a test collection, using suitable examples, briefly describe the three other main limitations of this type of evaluation.         [3]

(b) Consider the following TF*IDF ranking formula:

score(q, d) = Σ_{w ∈ q} ct(w, d) * IDF(w)

where ct(w,d) is the raw count of word w in document d (in other words, the term frequency TF of w in d), and IDF(w) is the IDF of word w.

(i)        Provide three reasons why the ranking formula above is unlikely to perform well.   [3]

(ii)       How would IDF(w) change (i.e., increase, decrease or stay the same) in each of the following cases: (1) adding the word w to a document; (2) making each document twice as long as its original length by concatenating the document with itself; (3) Adding some documents to the collection.   [4]

(c)   Suppose that on a four-point relevance scale, we give a 0 score for a non-relevant document, 1 for a partially-relevant document, 2 for a relevant document, and 3 for a perfect document. Suppose also that a query q returns five results doc1, doc7, doc20, doc15, doc2 (in that order) that are assessed as: perfect, relevant, non-relevant, perfect, and relevant, respectively.

Assume that 1/log2(rank) is used as the discount factor. You might wish to note that: log2 2 = 1; log2 3 = 1.59; log2 4 = 2; log2 5 = 2.32.

(i)        Provide the ideal DCG values for the perfect ranking for the given query q.       [2]

(ii)       Compute the Normalised Discounted Cumulative Gain (NDCG) at rank 5 for the given query q (provide NDCG with 2 decimal places).     Show your calculations.         [3]
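A sketch of the NDCG@5 computation (gains and discount convention taken from the question; rank 1 is left undiscounted, as is usual with a 1/log2(rank) discount):

```python
import math

def dcg(gains):
    """DCG with discount 1/log2(rank) for rank >= 2; rank 1 is undiscounted."""
    return gains[0] + sum(g / math.log2(r)
                          for r, g in enumerate(gains[1:], start=2))

gains = [3, 2, 0, 3, 2]               # perfect, relevant, non-relevant, perfect, relevant
ideal = sorted(gains, reverse=True)   # perfect ranking: [3, 3, 2, 2, 0]
ndcg5 = dcg(gains) / dcg(ideal)
```

DCG@5 is about 7.36 and the ideal DCG@5 about 8.26, giving NDCG@5 of roughly 0.89.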

(d)  Consider a query with two terms, whose posting lists are as follows:

term1 -> [id=3, tf=2], [id=8, tf=1], [id=12, tf=1]

term2 -> [id=3, tf=5], [id=12, tf=4]

Explain and provide the exact order in which the above two posting lists will be traversed by the TAAT and DAAT query evaluation strategies, and the memory requirements of both strategies for obtaining a result set of K documents from a corpus of N documents (K < N).       [5]
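A toy DAAT sketch (scores here are just summed term frequencies, an assumption for illustration): all posting lists are advanced together in docid order, so doc 3 is fully scored before doc 8 is touched, and only the current candidate plus a top-K heap need be held in memory. TAAT, by contrast, would traverse all of term1's list and then all of term2's, keeping a score accumulator per matching document:

```python
import heapq

postings = {
    "term1": [(3, 2), (8, 1), (12, 1)],
    "term2": [(3, 5), (12, 4)],
}

def daat(postings):
    """Document-at-a-time merge: yields (docid, score) in docid order."""
    order = []
    iters = {t: iter(pl) for t, pl in postings.items()}
    heap = []
    for t, it in iters.items():          # prime the heap with each list's head
        doc, tf = next(it)
        heapq.heappush(heap, (doc, t, tf))
    while heap:
        doc, score = heap[0][0], 0
        while heap and heap[0][0] == doc:    # drain all lists at this docid
            _, t, tf = heapq.heappop(heap)
            score += tf                      # toy score: sum of tfs
            nxt = next(iters[t], None)
            if nxt:
                heapq.heappush(heap, (nxt[0], t, nxt[1]))
        order.append((doc, score))
    return order
```

For the lists above, the DAAT visit order is doc 3 (both lists), doc 8 (term1 only), doc 12 (both lists).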

(e) Twitter is experiencing an increasing rate of queries during the COVID-19 pandemic, and seeks to provide an effective search functionality that better serves its customers’ information needs. Within Twitter, some data scientists are still contending that classical IR techniques such as TF*IDF or BM25 are holding up well and are sufficient as a basis for the search functionality, so there is no need for them to do more work. Other researchers within the company are advocating a new ranking functionality based on learning to rank. Twitter decides to recruit you as an IR consultant to help convince the data scientists that it is time to move to learning to rank and to help deploy the new search functionality. Discuss how you would convince the Twitter data scientists to move to learning to rank.

As part of your answer (750 word max):

-   Describe the major technical challenges for the deployment of a search functionality for Twitter.

-   What has to be changed in a classical TF*IDF-based system and why?

-   What type of learning to rank models will you use and why?

-   Provide three types of features the system should use to increase the effectiveness of searching for tweets, and illustrate your answer with suitable examples.

-   What other functionalities might you advise the company to add to their Twitter search functionality to support information seeking in relation to COVID-19 and why?        [10]



