
Assignment Two

Sentiment Classification for Social Media

CS918: 2024-25

Submission:  12 pm (midday) Thursday 27 March 2025

Notes

a)   This exercise will contribute towards 30% of your overall mark.

b)   Submission should be made on Tabula and should include Python code written in a Jupyter Notebook and a report of 3-5 pages summarising the techniques and features you have used for the classification, as well as the performance results.

c)    You can use any Python libraries you like. Re-use of existing sentiment classifiers will receive lower marks. The idea is for you to build your own sentiment classifier.

Topic: Building a sentiment classifier for Twitter/X

SemEval competitions involve addressing different challenges pertaining to the extraction of meaning from text (semantics). The organisers of those competitions provide a dataset and a task, so that different participants can develop their systems. In this exercise, we will focus on Task 4 of SemEval 2017 (http://alt.qcri.org/semeval2017/task4/), and particularly on Subtask A, i.e. classifying the overall sentiment of a tweet as positive, negative or neutral.

As part of the classification task, you will need to preprocess the tweets. You are allowed (and in fact encouraged) to reuse and adapt the preprocessing code you developed for Coursework 1. You may want to tweak your preprocessing code to deal with particularities of tweets, e.g. #hashtags or @user mentions.
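As an illustration, tweet-specific normalisation could look like the sketch below. The placeholder tokens and the particular rules (URL/mention replacement, hashtag stripping, character-run collapsing) are illustrative assumptions, not requirements:

```python
import re

def preprocess_tweet(text):
    """Normalise tweet-specific artefacts before tokenisation."""
    text = text.lower()
    # Replace URLs and @mentions with placeholder tokens
    text = re.sub(r"https?://\S+|www\.\S+", "<url>", text)
    text = re.sub(r"@\w+", "<user>", text)
    # Keep hashtag content but drop the '#' so 'great' and '#great' match
    text = re.sub(r"#(\w+)", r"\1", text)
    # Collapse runs of 3+ repeated characters (e.g. 'loooove' -> 'loove')
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    return text.split()

print(preprocess_tweet("Loooove this!!! @bob check https://t.co/x #great"))
# -> ['loove', 'this!!', '<user>', 'check', '<url>', 'great']
```

Whether to keep punctuation, emoticons or elongated forms as features is a design choice worth discussing in your report.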

You are requested to produce a standalone Jupyter Notebook that somebody else could run on their computer, with the only requirement of having the SemEval data downloaded. Don't produce a Jupyter Notebook that runs on preprocessed files that only you have, as we will not be able to run it.

Exercise Guidelines

•    Data: The training, development and test sets can be downloaded from the module website (semeval-tweets.tar.bz2). This compressed archive includes 5 files: one used for training (twitter-training-data.txt), one for development (twitter-dev-data.txt), and another 3 used as different subsets for testing (twitter-test[1-3].txt). You may use the development set as the test set while you are developing your classifier, so that you can tweak your classifiers and features; the development set is also useful for tuning hyperparameters, where needed. The files are formatted as TSV (tab-separated values), with one tweet per row that includes the following values:

tweet-id<tab>sentiment<tab>tweet-text

where sentiment is one of {positive, negative, neutral}. The tweet IDs will be used as unique identifiers to build a Python dictionary with the predictions of your classifiers, e.g.:

predictions = {'163361196206957578': 'positive',
               '768006053969268950': 'negative',
               ...}

•    Classifier: You are requested to develop 3 classifiers that learn from the training data and are tested on each of the 3 test sets separately (i.e. evaluating on 3 different sets). You are given a code skeleton (sentiment-classifier.tar.bz2), with an evaluation script included, which will help you develop your system in a way that we will then be able to run on our computers. Evaluating on different test sets allows you to generalise your results: you may achieve an improvement over a particular test set just by chance (e.g. overfitting), but improvements over multiple test sets make it more likely to be a significant improvement.

You should develop at least 3 different classifiers, which you will then present and compare in your report. Please develop at least 2 classifiers based on traditional machine learning methods such as MaxEnt, SVM or Naïve Bayes trained on different sets of features (you could use the scikit-learn library). Then, train another classifier based on an LSTM using PyTorch (and optionally the torchtext library) by following the steps below:

a)     Download the GloVe word embeddings and map each word in the dataset into its pre-trained GloVe word embedding.

First go to https://nlp.stanford.edu/projects/glove/ and download the pre-trained embeddings trained on 2014 English Wikipedia (and Gigaword) into the "data" directory. It's an 822MB zip file named glove.6B.zip; the file glove.6B.100d.txt inside it contains 100-dimensional embedding vectors for 400,000 words (or non-word tokens). Unzip it, then parse the unzipped file (a plain text file) to build an index mapping words (as strings) to their vector representations (as number vectors).

Build an embedding matrix that will be loaded into an Embedding layer later. It must be a matrix of shape (max_words, embedding_dim), where each entry i contains the embedding_dim-dimensional vector for the word of index i in our reference word index (built during tokenization). Note that the index 0 is not supposed to stand for any word or token -- it's a placeholder.
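Step (a) can be sketched as follows, assuming NumPy and a word_index dictionary built during tokenisation (all names here are illustrative, not part of the skeleton):

```python
import numpy as np

def load_glove(path):
    """Parse glove.6B.100d.txt into a {word: vector} index."""
    index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            values = line.rstrip().split(" ")
            index[values[0]] = np.asarray(values[1:], dtype="float32")
    return index

def build_embedding_matrix(word_index, glove, max_words=5000, dim=100):
    """Row i holds the GloVe vector of the word with index i.
    Row 0 stays all-zero (placeholder), as do out-of-vocabulary words."""
    matrix = np.zeros((max_words, dim), dtype="float32")
    for word, i in word_index.items():
        if 0 < i < max_words and word in glove:
            matrix[i] = glove[word]
    return matrix
```

Leaving unknown words as zero vectors is one simple choice; you could also initialise them randomly and discuss the difference in your report.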

b)     Build and train a neural model built on LSTM.

Define a model which contains an Embedding layer with a maximum number of tokens of 5,000 and an embedding dimensionality of 100. Initialise the Embedding layer with the pre-trained GloVe word vectors. You need to determine a maximum length for each document (for padding and truncation). Add an LSTM layer, then add a Linear layer which is the classifier. Train the basic model with an 'Adam' optimiser. You need to freeze the embedding layer by setting its weight.requires_grad attribute to False so that its weights will not be updated during training.

•    Evaluation: You will compute and output the macro-averaged F1 score of your classifier for the positive and negative classes over the 3 test sets.

An evaluation script is provided and has to be used within the skeleton code. This script produces the macro-averaged F1 score you will need to report. You can also compute a confusion matrix, which will help you identify where your classifier can be improved as part of the error analysis.
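For reference, the metric described above is the mean of the per-class F1 scores of the positive and negative classes only (neutral does not contribute). A stdlib-only illustration follows; the provided evaluation script remains the authoritative implementation:

```python
def macro_f1_pos_neg(gold, pred):
    """Macro-averaged F1 over the positive and negative classes only."""
    f1s = []
    for cls in ("positive", "negative"):
        tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
        fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
        fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

Note that a classifier can still be rewarded or penalised on neutral tweets indirectly, since misclassifying a neutral tweet as positive inflates the positive class's false positives.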

If you perform error analysis that has led to improvements in the classifier, this should be described in the report.

To read more about the task and how others tackled it, see the task paper:

http://alt.qcri.org/semeval2017/task4/data/uploads/semeval2017-task4.pdf

Marking will be based on:

a.     Your performance on the task: good and consistent performance across the test sets. While you are given 3 test sets, we will be running your code on 5 test sets to assess its generalisability. Therefore, making sure that your code runs is very important. [25 marks]

b.     Clarity of the report. [20 marks]

c.      Producing runnable, standalone code. [20 marks]

d.     Innovation in the use of features and deep learning architectures (e.g.: BERT, prompting strategies with language models, etc.). [25 marks]

e.     Methodological innovation. [10 marks]

Total: 100 marks

