
DEGREES OF MSc, MSci, MEng, BEng, BSc, MA and MA (Social Sciences)

TEXT AS DATA M

COMPSCI 5096

Monday 24 April 2023

1. Question on Tokenisation. (Total marks: 20)

(a) What is the difference between a word and a token? Why do text processing systems usually operate over tokens instead of words? [2]

(b) Why do tokenisers differ across languages? Provide an example where a tokeniser for English may not be suitable for another language. [2]

(c) Build a byte pair tokenisation vocabulary of size 8 for the text: “mississippi is pie”. Show all your steps. [8]
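For reference, a minimal sketch of the byte-pair merge loop in Python; the whitespace pre-tokenisation, the tie-breaking between equally frequent pairs, and counting the initial characters towards the vocabulary budget are assumptions, so a hand-worked answer may legitimately differ.

```python
from collections import Counter

def bpe_vocab(text, vocab_size):
    # Character-level segmentation of each whitespace-separated word.
    words = [list(w) for w in text.split()]
    vocab = set(ch for word in words for ch in word)
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs across all words.
        pairs = Counter()
        for word in words:
            for i in range(len(word) - 1):
                pairs[(word[i], word[i + 1])] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent pair (ties: first seen)
        merged = best[0] + best[1]
        vocab.add(merged)
        # Apply the merge wherever the pair occurs.
        for word in words:
            i = 0
            while i < len(word) - 1:
                if (word[i], word[i + 1]) == best:
                    word[i:i + 2] = [merged]
                else:
                    i += 1
    return vocab

print(sorted(bpe_vocab("mississippi is pie", 8)))
```

Under these assumptions the text starts with five distinct characters, so three merges are needed to reach a vocabulary of size 8; an answer should show each merge step explicitly.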

(d) Name two advantages of byte pair tokenisation over traditional rule-based tokenisation, and two advantages of rule-based tokenisation over byte pair tokenisation. [4]

(e) Consider a situation where you are building an automatic writing assistant. It aims to provide grammar suggestions for a human writer as they write. Would you use stemming, lemmatisation, and/or stopword removal in this application? Explain why or why not. [4]

2. Question on Language Models. (Total marks: 20)

Consider the bi-gram language models, Model X and Model Y, defined as follows:

                  Model X    Model Y
P(a | ⟨S⟩)          0.1        0.3
P(b | ⟨S⟩)          0.5        0.3
P(c | ⟨S⟩)          0.4        0.4
P(⟨E⟩ | ⟨S⟩)        0.0        0.0
P(a | a)            0.1        0.0
P(b | a)            0.3        0.1
P(c | a)            0.1        0.8
P(⟨E⟩ | a)          0.5        0.1
P(a | b)            0.3        0.3
P(b | b)            0.4        0.6
P(c | b)            0.2        0.0
P(⟨E⟩ | b)          0.1        0.1
P(a | c)            0.1        0.0
P(b | c)            0.6        0.4
P(c | c)            0.2        0.4
P(⟨E⟩ | c)          0.1        0.2

a, b, and c are the tokens for a particular language, and ⟨S⟩ and ⟨E⟩ indicate the start-of-sequence and end-of-sequence tokens, respectively.

(a) For Model X and Model Y without smoothing, calculate the perplexity of the sequence “⟨S⟩ a b b a ⟨E⟩”. Which model better represents the sequence? Show your work. [6]
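For reference, a minimal sketch of how bi-gram perplexity could be computed, assuming the usual definition PP = P(w_1 ... w_N)^(-1/N) with N counting every predicted token including ⟨E⟩; Model Y would be a second dictionary built the same way from the table.

```python
import math

# Model X from the table above; keys are (history, next_token).
MODEL_X = {('<S>', 'a'): 0.1, ('<S>', 'b'): 0.5, ('<S>', 'c'): 0.4, ('<S>', '<E>'): 0.0,
           ('a', 'a'): 0.1,   ('a', 'b'): 0.3,   ('a', 'c'): 0.1,   ('a', '<E>'): 0.5,
           ('b', 'a'): 0.3,   ('b', 'b'): 0.4,   ('b', 'c'): 0.2,   ('b', '<E>'): 0.1,
           ('c', 'a'): 0.1,   ('c', 'b'): 0.6,   ('c', 'c'): 0.2,   ('c', '<E>'): 0.1}

def perplexity(model, tokens):
    # tokens includes the start and end markers, e.g. ['<S>', 'a', 'b', 'b', 'a', '<E>'].
    log_p = sum(math.log(model[(h, w)]) for h, w in zip(tokens, tokens[1:]))
    n = len(tokens) - 1                      # number of predicted tokens
    return math.exp(-log_p / n)

print(perplexity(MODEL_X, ['<S>', 'a', 'b', 'b', 'a', '<E>']))   # ~3.54
```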

(b) Using Model X without smoothing, apply both greedy generation and beam search (with 2 beams), given the prefix: “⟨S⟩ a c”. The beam search should extend each beam using the two most likely tokens. Show your work, and provide the probability of each generated sequence. [8]
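For reference, a minimal sketch of greedy decoding and width-2 beam search over such a bi-gram table, reusing the MODEL_X dictionary from the perplexity sketch above; the length cap (needed because greedy decoding is not guaranteed to emit ⟨E⟩) and reporting only the probability of the continuation after the given prefix are assumptions.

```python
VOCAB = ['a', 'b', 'c', '<E>']

def greedy(model, prefix, max_len=10):
    seq = list(prefix)
    while seq[-1] != '<E>' and len(seq) < max_len:
        # Always take the single most likely next token.
        seq.append(max(VOCAB, key=lambda t: model.get((seq[-1], t), 0.0)))
    return seq

def beam_search(model, prefix, width=2, max_len=10):
    beams = [(list(prefix), 1.0)]            # (sequence, probability of the continuation)
    while any(s[-1] != '<E>' for s, _ in beams) and max(len(s) for s, _ in beams) < max_len:
        candidates = []
        for seq, p in beams:
            if seq[-1] == '<E>':
                candidates.append((seq, p))  # finished beam is carried forward unchanged
                continue
            # Extend with the `width` most likely next tokens only.
            tops = sorted(VOCAB, key=lambda t: model.get((seq[-1], t), 0.0), reverse=True)[:width]
            candidates.extend((seq + [t], p * model[(seq[-1], t)]) for t in tops)
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:width]
    return beams

print(greedy(MODEL_X, ['<S>', 'a', 'c']))
print(beam_search(MODEL_X, ['<S>', 'a', 'c']))
```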

(c) Why is smoothing often applied to language models? Which values of Model X and Model Y would most benefit from smoothing? What are the negative effects of too much smoothing? [3]
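For reference, one simple way to smooth a probability table like the ones above is to interpolate each conditional distribution with a uniform distribution; this is only an illustration using the MODEL_X dictionary from the sketches above, and add-k smoothing over raw counts is the more usual textbook form (the question provides probabilities rather than counts).

```python
def smooth(model, lam=0.9, histories=('<S>', 'a', 'b', 'c'), vocab=('a', 'b', 'c', '<E>')):
    # Mix each conditional distribution with a uniform one so that
    # zero-probability transitions become small but non-zero.
    v = len(vocab)
    return {(h, w): lam * model.get((h, w), 0.0) + (1 - lam) / v
            for h in histories for w in vocab}

print(smooth(MODEL_X)[('<S>', '<E>')])   # 0.0 becomes 0.025
```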

(d) For an n-gram language model, what are the problems of a very small or very large value of n? How would you select the best n to use for a particular problem? [3]

To decide an optimal value of n, an experiment should be set up with some held-back test data for the specific problem.

3. Question on the Factuality of Language Model Responses. (Total marks: 20)

Recall that neural language models can generate responses to prompts (such as questions), but that they often make factual errors. We prompt GPT3 with various questions related to this course and provide the response. Critique each response, and where applicable, correct any factual errors present.

(a) Question: Provide and explain the two probabilities that language models commonly measure. [3]

GPT3 Response:

1. Perplexity: Perplexity is a measure of how well a language model predicts the next word in a sentence. It is calculated by taking the average log probability of the words in the sentence. A lower perplexity score indicates that the language model is more accurate.

2. Coverage: Coverage is a measure of how many words in a corpus (a collection of written or spoken language) are correctly predicted by the language model. A higher coverage score indicates that the language model is able to accurately predict more words from the corpus.

(b) Question: Compare and contrast TF and IDF. [2]

GPT3 Response:

TF (term frequency) and IDF (inverse document frequency) are two methods of determining the importance or relevance of a given word or phrase within a document.

TF is a measure of how often a given word appears in a document. It is calculated by dividing the number of times the word appears in a document by the total number of words in the document. The higher the TF value, the more important the word is in the document.

IDF is a measure of how important a given word is in a collection of documents. It is calculated by taking the logarithm of the number of documents in the collection divided by the number of documents in the collection that contain the word. The higher the IDF value, the more important the word is in the collection.

In summary, TF measures the importance of a word within a single document, while IDF measures the importance of a word across a collection of documents.
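For reference, a minimal sketch following the definitions given in the response: raw-count TF normalised by document length, and IDF as the log of total documents over documents containing the term. The toy corpus is invented, and other normalisations and log bases are common.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # Term frequency: occurrences of the term divided by the document length.
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    # Inverse document frequency: log(total documents / documents containing the term).
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

corpus = [['text', 'as', 'data'], ['data', 'mining'], ['text', 'mining', 'text']]
print(tf('text', corpus[2]) * idf('text', corpus))   # TF-IDF weight of 'text' in the third document
```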

(c) Question: What role does dimensionality reduction play when building dense word embeddings from sparse word vectors? [4]

GPT3 Response:

Dimensionality reduction can help accelerate the training process of building dense word embeddings from sparse word vectors by reducing the number of dimensions of the input space, making it easier to work with and process. This can reduce the computational complexity of the training process, allowing more efficient and faster training. Additionally, by reducing the dimensions of the input space, the model can more easily discern the contextual information associated with each word, as the noise created by irrelevant dimensions is removed.
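For reference, a minimal sketch of one common instance of this idea, truncated SVD over a sparse word-context co-occurrence matrix; the matrix values are invented, and this is not necessarily the method the response has in mind.

```python
import numpy as np

# Toy word-context co-occurrence counts (rows = words, columns = context words).
cooc = np.array([[2., 0., 1., 0.],
                 [0., 3., 0., 1.],
                 [1., 0., 2., 0.],
                 [0., 1., 0., 2.]])

k = 2                                            # target number of dense dimensions
U, S, Vt = np.linalg.svd(cooc, full_matrices=False)
dense_embeddings = U[:, :k] * S[:k]              # one k-dimensional vector per word
print(dense_embeddings.shape)                    # (4, 2)
```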

(d) Question: Why are Hidden Markov Models used for Part-of-Speech Tagging? [3]

GPT3 Response:

Hidden Markov Models are used for Part-of-Speech Tagging because they are probabilistic models that can capture the probability of transitioning from one part-of-speech tag to another. This helps the model better estimate which tag is most likely given a certain context. Additionally, the Hidden Markov Model allows for the use of Viterbi algorithm, which further helps increase the accuracy of the tagging process.
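For reference, a minimal Viterbi sketch over a toy HMM; the tags, words, and all probabilities below are invented purely for illustration, and a real tagger would estimate them from an annotated corpus.

```python
import math

TAGS = ['NOUN', 'VERB']
START = {'NOUN': 0.6, 'VERB': 0.4}                        # P(tag | sentence start)
TRANS = {('NOUN', 'NOUN'): 0.3, ('NOUN', 'VERB'): 0.7,    # P(next tag | current tag)
         ('VERB', 'NOUN'): 0.8, ('VERB', 'VERB'): 0.2}
EMIT = {('NOUN', 'dogs'): 0.5, ('NOUN', 'bark'): 0.1,     # P(word | tag)
        ('VERB', 'dogs'): 0.1, ('VERB', 'bark'): 0.6}

def viterbi(words):
    # delta[t]: best log-probability of any tag sequence for the words so far ending in tag t.
    # back[i][t]: the previous tag on that best path at position i.
    delta = {t: math.log(START[t]) + math.log(EMIT.get((t, words[0]), 1e-12)) for t in TAGS}
    back = []
    for word in words[1:]:
        prev, delta, step = delta, {}, {}
        for t in TAGS:
            best_prev = max(prev, key=lambda p: prev[p] + math.log(TRANS[(p, t)]))
            delta[t] = (prev[best_prev] + math.log(TRANS[(best_prev, t)])
                        + math.log(EMIT.get((t, word), 1e-12)))   # 1e-12 floors unseen words
            step[t] = best_prev
        back.append(step)
    # Follow the back-pointers from the best final tag to recover the best path.
    path = [max(delta, key=delta.get)]
    for step in reversed(back):
        path.append(step[path[-1]])
    return list(reversed(path))

print(viterbi(['dogs', 'bark']))   # expected: ['NOUN', 'VERB']
```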

(e) Question: Give an example of a nested entity mention, and explain what the nested entity mentions are. [3]

GPT3 Response:

Example: “The CEO of Apple, Tim Cook, announced...”

Explanation: The nested entity mention in this sentence is “Tim Cook”, which is comprised of two separate entities - “CEO” and “Apple”. The combination of these two entities forms the nested entity mention “Tim Cook”.

(f) Question: Explain how GPT3 is able to provide knowledge (such as the current CEO of Apple is Tim Cook) in its answers. What are two challenges of a language model encoding knowledge? [5]

GPT3 Response:

GPT3 is able to provide knowledge by using a large-scale language model that is trained on a massive amount of data. The language model is able to capture patterns in the data and use them to generate answers to questions. For example, it can look at data about Apple and its executives and use that to answer questions about the current CEO of Apple. Two challenges of a language model encoding knowledge are accuracy and generalizability. Accuracy refers to how accurately the language model can answer questions. Generalizability refers to how well the language model can answer questions about topics that it has not seen before.

• The inability to provide a source for the knowledge.

• The ethical problem that we are unable to check what knowledge is encoded, or what its likely biases are.


