EMATM0067

Text Analytics Coursework

Spring 2024

Deadline: 13:00 on Wednesday 22nd May

Overview

This coursework is worth 50% of the unit. It will take you through several text analytics tasks to give you experience with applying and analysing the techniques taught during the labs and lectures. The work will be assessed through your written report, in which you should aim to demonstrate your understanding of text analytics methods, evaluate the methods critically and incorporate ideas from the lectures.

We recommend that you first get a basic implementation for all parts of the required assignment, then start writing your report with some results for all tasks. You can then gradually improve your implementation and results.

Total time required: 40 hours.

Support

The lecturers and teaching assistants are available to clarify what you are required to do for any part of the coursework. You can ask questions during our lab sessions, or post them on MS Teams or the Blackboard discussion forum. If you don’t want to share your question with the class, please contact Edwin by email ([email protected]).

Task 1: Emotion Classification in Tweets (max 59%)

People often express opinions and feelings on social media sites, and processing them automatically can help to identify patterns and trends, from medical symptoms to market sentiment or the popularity of a product. A key challenge is to recognise the emotions that the authors express.

Your task is to design, run and evaluate an emotion classifier for social media posts using the TweetEval Emotion dataset, which contains English tweets tagged with (0) anger, (1) joy, (2) optimism or (3) sadness. You may use any existing classifier implementations in libraries such as Scikit-learn, Gensim, NLTK and Transformers to achieve this. We provide a copy of the data and a ‘data_loader_demo’ Jupyter notebook containing code for loading the data. The notebook is available in our GitHub repository. Further information about the dataset is available on Hugging Face and in the paper by Barbieri et al., “TWEETEVAL: Unified Benchmark and Comparative Evaluation for Tweet Classification”, Findings of EMNLP 2020.

1.1. Train one non-neural method for classifying emotions in tweets. Refer to the labs, lecture materials and textbook to identify a suitable method. In your report:

.     Briefly explain how your chosen method works and its main strengths and limitations.

.     Describe the preprocessing steps and the features you use to represent each text instance.

.     Explain why you chose those features and preprocessing steps and hypothesise how they will affect your results.

.     Higher marks are given for good, well-justified classifier design.

(10 marks)
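
As an illustration of the kind of preprocessing and features that task 1.1 asks you to describe, the following is a minimal sketch using only the Python standard library. The normalisation rules (lowercasing, replacing URLs with "http" and user mentions with "@user") and the function names are assumptions chosen for the example, not a required design; you should justify your own choices in the report.

```python
import re
from collections import Counter

# Hypothetical normalisation rules, chosen only for illustration.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# A token is a word (optionally starting with # or @, optionally containing
# one apostrophe) or a single punctuation character.
TOKEN_RE = re.compile(r"[#@]?\w+(?:'\w+)?|[^\w\s]")


def preprocess(tweet):
    """Lowercase a tweet, mask URLs and user mentions, and split it into tokens."""
    text = tweet.lower()
    text = URL_RE.sub("http", text)
    text = MENTION_RE.sub("@user", text)
    return TOKEN_RE.findall(text)


def ngram_counts(tokens, max_n=2):
    """Count all word n-grams up to length max_n (a simple bag of n-grams)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


tokens = preprocess("@Bob I'm SO happy!!! https://t.co/x #blessed")
features = ngram_counts(tokens)
```

Count dictionaries like `features` could then be turned into vectors for a non-neural classifier. Whether to keep hashtags, punctuation or bigrams is exactly the kind of decision the report should motivate.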

1.2. Train one neural network-based method for classifying emotions in tweets. Refer to the labs, lecture materials and textbook to identify a suitable method. In your report:

.     Briefly explain how your method works, including details of the model architecture and how you chose this configuration.

.     Discuss any use of model transfer or transfer learning in your approach.

.    State the method’s strengths and limitations in comparison to your previous method.

.     Describe any preprocessing steps needed to prepare the data.

.     Plot the changes in the losses during training as learning curves. Explain what the learning curves show and how this information can be used during training.

.     Higher marks are given for good, well-justified classifier design.

(15 marks)
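
One common way to use learning curves during training is early stopping: keep the model from the epoch with the lowest validation loss and stop once that loss has not improved for a while. The sketch below assumes you have already recorded one training loss and one validation loss per epoch; the function names and the `patience` parameter are hypothetical and only illustrate the idea.

```python
def best_epoch(val_losses, patience=2):
    """Return the index of the epoch whose model early stopping would keep.

    Training is assumed to stop once the validation loss has failed to
    improve for `patience` consecutive epochs.
    """
    best_idx, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx


def generalisation_gap(train_losses, val_losses):
    """Per-epoch difference between validation and training loss.

    A gap that keeps widening while the training loss falls is a typical
    sign of overfitting.
    """
    return [v - t for t, v in zip(train_losses, val_losses)]


# Made-up loss values, used only to demonstrate the functions.
train = [1.2, 0.9, 0.7, 0.5, 0.3]
val = [1.1, 0.95, 0.9, 0.97, 1.05]
keep = best_epoch(val, patience=2)
gaps = generalisation_gap(train, val)
```

With these made-up values the validation loss is lowest at the third epoch (index 2) and rises afterwards while the training loss keeps falling, which is the pattern a plotted learning curve would make visible.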

1.3. Evaluate both methods, then interpret and discuss your results. Include the following points:

.     Define your performance metrics and state their limitations.

.     Describe the testing procedure (e.g., how you used each split of the dataset).

.    Show your results using suitable plots or tables.

.     How could you improve the method or experimental process? To inform this discussion, you may want to analyse some examples of misclassified texts.

(14 marks)
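
To make the definition of a metric concrete, here is a minimal sketch of accuracy and macro-averaged F1 written from scratch. It is one possible choice of metrics, not a prescribed one, and the function names are hypothetical; library implementations such as those in Scikit-learn would normally be used in practice.

```python
def accuracy(gold, pred):
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_class_f1(gold, pred, labels):
    """Compute the F1 score of each label, treating it as the positive class."""
    scores = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(gold, pred, labels)
    return sum(scores.values()) / len(labels)


# Made-up labels: class 0 is frequent, class 2 is never predicted.
gold = [0, 0, 0, 0, 1, 1, 2, 3]
pred = [0, 0, 0, 0, 0, 1, 0, 3]
acc = accuracy(gold, pred)                    # 0.75
macro = macro_f1(gold, pred, [0, 1, 2, 3])    # about 0.617
```

In this toy example accuracy is 0.75 but macro F1 is only about 0.62, because the missed minority class counts as much as the majority class. Differences like this are worth discussing when you state the limitations of your metrics.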

1.4. Using the dataset, can you identify topics that people appear to be optimistic or joyful about?

.     Explain the method you use to identify themes or topics.

.    Show your results (e.g., by listing or visualising example topics or themes).

.     Interpret the results and summarise the limitations of your approach.

(20 marks)
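
As a very simple starting point for task 1.4, the sketch below ranks terms by how much more likely they are in one group of tweets (for example, those labelled joy or optimism) than in the rest, using an add-one smoothed log-ratio. This is a keyword-ranking baseline, not a topic model, and it is an assumption of this example rather than the expected method; the function name and inputs are hypothetical.

```python
import math
from collections import Counter


def distinctive_terms(target_docs, other_docs, top_k=10):
    """Rank terms by the smoothed log-ratio of their probability in
    target_docs versus other_docs. Each document is a list of tokens."""
    target = Counter(tok for doc in target_docs for tok in doc)
    other = Counter(tok for doc in other_docs for tok in doc)
    vocab = set(target) | set(other)
    n_target, n_other, v = sum(target.values()), sum(other.values()), len(vocab)

    def score(term):
        # Add-one smoothing avoids division by zero for unseen terms.
        p_target = (target[term] + 1) / (n_target + v)
        p_other = (other[term] + 1) / (n_other + v)
        return math.log(p_target / p_other)

    # Sort by descending score, breaking ties alphabetically.
    return sorted(vocab, key=lambda t: (-score(t), t))[:top_k]


# Made-up token lists, used only to demonstrate the function.
joyful = [["love", "my", "new", "job"], ["new", "job", "today"]]
rest = [["hate", "my", "job"], ["my", "train", "late"]]
top_terms = distinctive_terms(joyful, rest, top_k=3)
```

A ranking like this tells you which words are characteristic of a class but does not group them into themes, so it is best treated as a sanity check alongside whichever topic identification method you choose.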

High performance figures are less important for getting high marks than motivating your method well and implementing and evaluating it correctly.

Suggested length of report for task 1: 4 pages.

Task 2: Named Entity Recognition (max. 41%)

In scientific research, information extraction can help researchers to discover relevant findings from across a wide body of literature. As a first step, your task is to build a tool for named entity recognition in scientific journal articles. We will be working with the BioCreative V dataset containing sentences from articles on PubMed, a database of biomedical research literature. Each sentence is annotated with mentions of chemicals and diseases. We provide a cache of the data and code for loading the data in ‘data_loader_demo’ in our GitHub repository. The data can be sourced from Hugging Face. More information can be found in Wei, Chih-Hsuan, et al. "Assessing the state of (CDR) task." Database 2016 (2016).

2.1. Design and run a sequence tagger for tagging chemicals and diseases in BioCreative V. Refer to the labs, lecture materials and textbook to identify a suitable method. You may choose any sequence tagging method you think is suitable, and you may wish to experiment with some variations in the choice of features or model architecture to help justify your design. In your report:

.     Explain how your chosen method works and its main strengths and limitations.

.     If your model uses its own tokenizer, explain how you align the tokens with tags (this step is only needed if you use a neural sequence tagger that requires a particular tokenizer).

.     Briefly explain how entity spans are encoded as tags for each token in a text.

.     Detail the features you have chosen, why you chose them, and hypothesise how your choice will affect your results.

.     Higher marks are given for good, well-justified model design.

(15 marks)
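
The two encoding steps mentioned in task 2.1 can be sketched as follows: converting entity spans into BIO tags, and expanding word-level tags to subword tokens. This is a minimal illustration under stated assumptions. The label names "Chemical" and "Disease", the example sentence and its subword split are made up, and `word_ids` is assumed to be a list giving, for each subword, the index of the word it came from (or None for special tokens), in the style of the word-index lists that Hugging Face fast tokenizers can return.

```python
def spans_to_bio(num_tokens, spans):
    """Encode entity spans as one BIO tag per token.

    spans is a list of (start, end_exclusive, entity_type) over token indices.
    """
    tags = ["O"] * num_tokens
    for start, end, ent_type in spans:
        tags[start] = f"B-{ent_type}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{ent_type}"          # remaining tokens of the entity
    return tags


def align_tags(word_tags, word_ids, ignore_label="IGNORE"):
    """Expand word-level tags to subword tokens.

    Only the first subword of each word keeps the word's tag; special tokens
    and later subwords receive ignore_label so they can be skipped in the loss.
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(ignore_label)
        else:
            aligned.append(word_tags[word_id])
        previous = word_id
    return aligned


# Made-up sentence: "Aspirin can cause stomach ulcers ."
word_tags = spans_to_bio(6, [(0, 1, "Chemical"), (3, 5, "Disease")])
# Hypothetical subword split in which words 0 and 4 are broken into two pieces.
word_ids = [None, 0, 0, 1, 2, 3, 4, 4, 5, None]
subword_tags = align_tags(word_tags, word_ids)
```

Labelling only the first subword is one convention among several; another is to repeat the tag (or its I- form) on every subword. Whichever you choose, the report should state it, since it affects both training and how predictions are mapped back to words.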

2.2. Evaluate your method, then interpret and discuss your results. Include the following points:

.     Explain your choice of performance metrics and their limitations.

.     Describe the testing procedure (e.g., how you used each split of the dataset).

.    Show your results using suitable plots and/or tables.

.     Do your methods make any particular kinds of error? Show some examples of mislabelled sentences and suggest how the methods could be improved in future.

(14 marks)
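
For sequence tagging, per-token accuracy can look high even when entities are only partly recovered, so entity-level scores are often reported as well. The sketch below computes exact-match span precision, recall and F1 for a single sentence. It is an illustrative assumption about how you might evaluate, with hypothetical function names, and it treats an I- tag without a preceding B- tag as the start of a new span.

```python
def bio_to_spans(tags):
    """Return the set of (start, end_exclusive, type) spans in a BIO sequence."""
    spans, start, ent_type = set(), None, None
    # The extra "O" acts as a sentinel that closes a span ending at the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, ent_type))
            start, ent_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, ent_type = i, tag[2:]
    return spans


def span_f1(gold_tags, pred_tags):
    """Exact-match precision, recall and F1 over entity spans."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up tags: the prediction truncates the two-token disease mention.
gold = ["B-Chemical", "O", "O", "B-Disease", "I-Disease", "O"]
pred = ["B-Chemical", "O", "O", "B-Disease", "O", "O"]
precision, recall, f1 = span_f1(gold, pred)   # 0.5, 0.5, 0.5
```

Here five of six tokens are tagged correctly, yet only one of the two entities matches exactly, so span F1 is 0.5. For a whole test set you would pool the true positive, predicted and gold span counts across sentences before computing the scores.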

2.3. This task requires you to apply techniques for computing similarity between words or phrases.

.    Select one disease entity from the test set as a “query”.

.     Use two techniques to identify five similar and five dissimilar diseases to your query.

.     Explain and compare the results from each technique. You may wish to use tables or figures to support your discussion.

.     Marks are given for correct use of the techniques, your understanding of them, and your interpretation of the results. If it supports your interpretation, you may include more than one query entity.

(12 marks)
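
To illustrate how two different similarity techniques can be applied to the same query, the sketch below implements a surface-level measure (Jaccard overlap of character trigrams) and cosine similarity between vectors. These are example choices only, and the function names are hypothetical. The vectors passed to `cosine` are assumed to come from some representation you have built or loaded yourself, such as word embeddings; none are provided here.

```python
import math


def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap between the character trigram sets of two strings."""
    x, y = char_ngrams(a), char_ngrams(b)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, candidates, similarity):
    """Sort candidates from most to least similar to the query."""
    return sorted(candidates, key=lambda c: similarity(query, c), reverse=True)


# Made-up candidate list; in the coursework these would be disease mentions.
ranked = rank("hepatitis", ["migraine", "hepatitis b", "nephritis"], jaccard)
```

The most and least similar items can be read from the two ends of the ranking. Character overlap rewards shared spelling ("hepatitis" and "nephritis" share a suffix) whether or not the diseases are related, which is one contrast with embedding-based similarity that you could examine in your comparison.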

Suggested length of report for task 2: 3 pages.

Implementation

The lab notebooks provide useful example Python code, which you may reuse. You may use libraries introduced in the labs, or others of your choice. You may write your code in either Jupyter notebooks or standard Python files.

Report Formatting

.    Absolute maximum 8 pages

o References do not count toward the page limit.

o Aim for quality rather than quantity: you do not have to use the maximum number of pages and will receive higher marks if you write concisely and clearly.

.    To set the page layout, fonts, margins, etc., we recommend using the template from an academic conference, such as LREC-COLING 2024, if writing the report in LaTeX.

o You can use this template directly to write in LaTeX or follow the formatting style in Word, LibreOffice, etc.

o You don’t need to include an abstract or introduction or conclusion.

o Please number your answers to each task clearly so that we can find them.

o No less than 11pt font

o Single line spacing

o A4 page format

.    The text in your figures must be big enough to read without zooming in.

Citations and References

Make sure to cite a relevant source when you introduce a method or discuss results from previous work. You can use the citation style given in the LREC-COLING 2024 style guide above. The details of the cited papers must be given at the end in the references section (no page limits on the references list). Please only include papers that you discuss in the main body of the report.

Google Scholar and similar tools are useful for finding relevant papers. The ‘cite’ link provides BibTeX code for use with LaTeX, as well as formatted references that you can copy, but beware that these often contain errors.

Submission

.     Deadline for report + code: 13:00 (GMT+1) on 22nd May.

.    On Blackboard under the “assessment, submission and feedback” link.

Please upload the following two files:

1.   Your report as a PDF with filename <student_number>.pdf, where “<student_number>” is replaced by your student number (not your username). Upload this to the submission point “Text Analytics Coursework (Turnitin)”.

2.    Your text analytics code inside a single zip file with filename <student_number>.zip. Inside the zip file there should be a single folder containing your code, with your student number as the folder name. Please remove datasets and other large files to minimise the upload size: we only need the code itself. Upload this file to the submission point “Code for Text Analytics Coursework”.

We will briefly review your Python code by eye – we do not need to run it. Your marks will be based on the contents of your report, with the code used to check how you carried out the experiments described in your report. We will not give marks for the coding style, comments, or organisation of the code.

Please do not include your name in the report text itself: to ensure fairness, we mark the reports anonymously.

Assessment Criteria

Your coursework will be evaluated based on your submitted report containing the presentation of methods, results and discussions for each task. To gain high marks, your report will need to demonstrate a thorough understanding of the tasks and the methods used, backed up by a clear explanation (including figures) of your results and error analysis. The exact structure of the report and what is included in it is your decision, and you should aim to write it in a professional and objective manner. Marks will be awarded for appropriately including concepts and techniques from the lectures.




