LM Data Mining and Machine Learning (2024)

Lab 1 – Text Retrieval

PART 1: TF-IDF BASED TEXT RETRIEVAL

Objective

The objective of this lab session is to apply the text-based Information Retrieval (IR) techniques which we have studied in lectures, namely:

1. Stop word removal

2. Stemming

3. Construction of the index – calculation of TF-IDF weights

4. Retrieval – calculating the similarity between a query and document

We will apply these techniques to a ‘toy’ corpus consisting of 112 documents – BEng final year project specifications.  These project specifications were submitted by staff in Word format, but I have converted them all into plain text files for the purposes of this lab.  However, I did not remove the formatting or the pieces of text which are common to all of the files.

Copy the zip archive lab1-2024 from Canvas and ‘unzip’ it.  You should end up with a new folder called lab1-2024 containing all of the files that you need to complete the lab, including a folder called docOrig which contains 112 text files.

The folder lab1-2024 will be the default folder that you work from.  Have a look at one of the text files in the docOrig folder. You should be able to identify the common formatting.

Processing of the documents

Before we can do IR we need to apply stop word removal and stemming to each of the documents in our corpus.  To do this you will use two executable (.exe) files, compiled from C programmes, that are in your lab1-2024 folder: stop.exe and porter-stemmer.exe.  Note that the C source programmes are also provided in case your computer runs a non-Windows operating system – in that case, you will need to compile the source C programmes (stop.c, porter-stemmer.c, index.c and retrieve.c).

Task 1: Stop word removal: The next task is to remove stop words from each of the documents. The 50-word stop word list stopList50 should already be in your lab1-2024 folder. Now run the program stop on one of the documents – AbassiM.txt for example. To run the program, just type the following in the Command Prompt window:

stop stoplist50 docOrig\AbassiM.txt

(Note that the above includes the path name to tell stop where AbassiM.txt is – this is the docOrig folder.)  This should cause a version of AbassiM.txt with stop words removed to be printed onto your screen.  You need to store this output in a text file AbassiM.stp.  To keep the ‘stopped’ documents separate from the original documents, there is a folder in lab1-2024 called docStop.  All of the ‘stopped’ documents should go in this folder.

You need to apply stop to all of the project description files.  To do this I have created a batch file called stopScript.bat, which you should have in your lab1-2024 folder.  In the Command Prompt window just type stopScript followed by ‘return’.  You need to be in the lab1-2024 folder when you do this.

You should now have 112 files in the docStop folder, each with a name of the form filename.stp.
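The effect of stop word removal, and the percentage reduction asked about in Question 1, can be illustrated with a minimal Python sketch. It assumes that words are separated by whitespace and compared case-insensitively; the exact tokenisation used by stop.exe may differ. The function names and the three-word stop list are hypothetical and are not part of the lab files.

```python
def remove_stop_words(text, stop_words):
    """Return the words of `text` that are not in `stop_words` (case-insensitive)."""
    stop = {w.lower() for w in stop_words}
    return [w for w in text.split() if w.lower() not in stop]


def percent_reduction(before, after):
    """Percentage by which a word count drops from `before` to `after`."""
    return 100.0 * (before - after) / before


# Toy example with a hypothetical 3-word stop list (not stopList50).
text = "the design of the circuit and the devices"
kept = remove_stop_words(text, ["the", "of", "and"])
print(kept)                                             # ['design', 'circuit', 'devices']
print(percent_reduction(len(text.split()), len(kept)))  # 62.5
```

For a real document you would count the words in the .txt file and in the corresponding .stp file and pass the two counts to percent_reduction.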

Question 1: What is the percentage reduction in the number of words in a document as a consequence of stop-word removal – specifically, what is the reduction in the case of the file AgricoleW.txt?

Task 2: Stemming: The next task is to apply the Porter stemmer to each ‘.stp’ file.  There is another folder in lab1-2024 called docStem.  This folder will contain a stemmed version of each file from the docStop folder.

Basically, for each .stp file you create a .stm file by typing, for example:

porter-stemmer docStop\AbassiM.stp

This causes a ‘stemmed’ version of AbassiM.stp to be printed on screen.  You need this data to be stored in a file called docStem\AbassiM.stm.  You need to do this for every .stp file.  To do this I have created another batch file called stemScript.bat, which you should have in your lab1-2024 folder.  In the Command Prompt window just type stemScript followed by ‘return’.  You need to be in the lab1-2024 folder when you do this.

Question 2: Find the file AgricoleW.stm. What are the results of applying the Porter stemmer to the words communications, sophisticated and transmissions?

You should now have:

-    112 original .txt documents in the folder docOrig

-    112 ‘stopped’ documents in the folder docStop

-    112 ‘stemmed’ documents in the folder docStem

Task 3: Create the document index files: If you’ve forgotten what the document index is, or what it is for, look again at the lecture slides.  The next task is to create 3 index files: one for the original .txt documents, one for the .stp documents, and one for the .stm documents.

You should have the executable index.exe in your lab1-2024 folder (or compile the program index.c if needed).

You should have a text file called textFileList in your lab1-2024 folder.  This is simply a list of all of the original .txt files – one file per line.  Type:

index textFileList

followed by ‘return’.  After a short pause a text version of the index file will be printed on your screen.  You need to store this data in a file called textIndex.  Type:

index textFileList > textIndex

followed by ‘return’.  Look at this index file (open it in a text editor such as Notepad) and try to understand the information it contains.  The lecture notes will help you.  The first part of the file gives the list of documents with their document lengths (this is not the length in bytes – see the lecture notes if you are unclear).  The second part of the file gives the list of all words (ordered by IDF) that occurred in the set of documents, together with information related to each word.  For each word (its position is indicated in front of the word), there is the total number of times the word appeared (wordCount), the number of documents it appeared in (docCount), and the IDF value of the word.  This is followed by the list of documents the word appeared in, each with the word’s count in that document and its calculated weight.
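The quantities stored in the index can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of index.exe: it assumes IDF = ln(N / docCount) with N the number of documents, a weight equal to count × IDF, and a ‘document length’ equal to the Euclidean norm of a document’s weights. Check these against the definitions in the lecture notes. The function name build_index and the toy documents are hypothetical.

```python
import math
from collections import Counter


def build_index(docs):
    """docs: dict mapping document name -> list of words.

    Returns (idf, weights, lengths), where weights[name][word] = count * idf
    and lengths[name] is the Euclidean norm of that document's weights.
    """
    n_docs = len(docs)
    counts = {name: Counter(words) for name, words in docs.items()}
    doc_count = Counter()                 # number of documents containing each word
    for c in counts.values():
        doc_count.update(c.keys())
    # Assumed definition: natural log of (number of documents / docCount).
    idf = {w: math.log(n_docs / df) for w, df in doc_count.items()}
    weights = {name: {w: tf * idf[w] for w, tf in c.items()}
               for name, c in counts.items()}
    lengths = {name: math.sqrt(sum(x * x for x in ws.values()))
               for name, ws in weights.items()}
    return idf, weights, lengths


# Two toy documents, for illustration only.
docs = {
    "d1": ["design", "circuit", "circuit"],
    "d2": ["design", "antenna"],
}
idf, weights, lengths = build_index(docs)
print(idf["design"])     # 0.0, because the word occurs in every document
print(idf["circuit"])    # ln(2/1)
print(lengths["d1"])     # only 'circuit' contributes: 2 * ln(2)
```

Under these assumptions, a word that appears in every document contributes nothing to any document length, however often it occurs.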

Now repeat this on the ‘stopped’ and ‘stemmed’ files:

index stopFileList > stopIndex

index stemFileList > stemIndex

Question 3: What are the ‘document lengths’ of the documents docOrig\DongP.txt, docStop\DongP.stp and docStem\DongP.stm?  Why are they different?  Why is the difference between the document lengths of docStem\DongP.stm and docOrig\DongP.txt greater than the difference between the document lengths of docStop\DongP.stp and docOrig\DongP.txt?

Question 4: The IDF of the term design is approx. 0.009.  Why is it so close to zero?

Question 5: Find the word algorithm in the three index files.  Explain why the entries for this word are different in the three files.

Task 4: Retrieval: The final task in this part of the lab is retrieval.  To do this you will need to create a query.  This is just a text file containing your query – you can create it using Notepad or Wordpad. An example query – in the file query – is in your lab1-2024 folder. This query just contains the text: circuits and devices

Next you need to apply stop word removal and stemming to the query:

stop stoplist50 query > query.stp

porter-stemmer query.stp > query.stm

You should have the executable retrieve.exe of the C program in your lab1-2024 folder (or compile the source C program if needed).  You can now do retrieval.

Start with the raw text files:

retrieve textIndex query

followed by ‘return’.  This will return a list of all the documents for which the similarity with the query is greater than 0.  It also tells you the identity of the most similar document.

Now repeat this for the stopped documents and stopped query, and the stemmed documents and stemmed query:

retrieve stopIndex query.stp

retrieve stemIndex query.stm
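The retrieval step can be sketched in Python as a ranking of documents by their similarity to the query. The sketch assumes that similarity is the cosine of the angle between TF-IDF vectors, which is a common choice; the measure implemented in retrieve.exe is defined in the lecture notes and may differ in detail. The function names and the weights in the example are hypothetical.

```python
import math


def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    q_len = math.sqrt(sum(w * w for w in query_vec.values()))
    d_len = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_len == 0.0 or d_len == 0.0:
        return 0.0
    return dot / (q_len * d_len)


def retrieve(query_vec, doc_vecs):
    """Return (name, similarity) pairs with similarity > 0, best first."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in doc_vecs.items()]
    hits = [(name, s) for name, s in scored if s > 0.0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical TF-IDF weights, for illustration only.
doc_vecs = {
    "d1": {"circuit": 2.0, "device": 1.0},
    "d2": {"antenna": 3.0},
    "d3": {"circuit": 1.0, "antenna": 1.0},
}
query_vec = {"circuit": 1.0, "device": 1.0}
results = retrieve(query_vec, doc_vecs)
for name, score in results:
    print(name, round(score, 3))
```

In this toy example d2 shares no terms with the query, so its similarity is 0 and it is not returned. A document can only match a query term if both contain exactly the same string, which is worth keeping in mind when comparing the raw, stopped and stemmed searches.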

Question 6: Compare the results of the above two searches (using .stp and .stm) with the result for the original raw text files.  What do you conclude?

Question 7: Repeat Task 4 with one query of your own and report the results.

PART 2: LATENT SEMANTIC ANALYSIS

Objective

The objective of the second part of the lab is to apply Latent Semantic Analysis (LSA) to the set of BEng final year project specifications in the docOrig folder. Look at the notes on LSA to remind yourself about the technique, to put the following sequence of tasks into context.

Task 1:  Create the Word-Document matrix

Recall that the Word-Document matrix W is an N x V matrix, where N is the number of documents and V is the vocabulary size (the number of different words in the corpus).  The nth row of W is the document vector vec(d_n) for the nth document.

The executable doc2vec.exe of the C program will create the matrix W (or compile the source C program if needed).  We will apply this program to the stemmed documents.  The command is:

doc2vec stemFileList.txt > WDM

This creates a document vector for each document in the docStem folder and stacks them to create the matrix in the file WDM.
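The layout of the matrix, with one row per document and one column per vocabulary word, can be sketched in Python. The sketch makes two simplifying assumptions that may not match doc2vec.exe: entries are raw word counts rather than weights, and the columns are in alphabetical order rather than in the order of the index file. The function name word_document_matrix is hypothetical.

```python
def word_document_matrix(docs):
    """docs: list of word lists, one per document.

    Returns (vocab, W): vocab is the sorted list of distinct words and
    W[n][v] is the number of times vocab[v] occurs in document n.
    """
    vocab = sorted({w for words in docs for w in words})
    column = {w: v for v, w in enumerate(vocab)}
    W = [[0] * len(vocab) for _ in docs]
    for n, words in enumerate(docs):
        for w in words:
            W[n][column[w]] += 1
    return vocab, W


# Two toy documents, so N = 2 and V = 3.
docs = [["circuit", "design", "circuit"], ["antenna", "design"]]
vocab, W = word_document_matrix(docs)
print(vocab)   # ['antenna', 'circuit', 'design']
print(W)       # [[0, 2, 1], [1, 0, 1]]
```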

Task 2:  Apply Singular Value Decomposition (SVD) to the Word-Document matrix

This is done in MATLAB.  You will need the following commands (note that MATLAB requires straight single quotes):

>>W=load('WDM');

This reads the data in WDM into the MATLAB matrix W.

>>[U,S,V]=svd(W);

This runs SVD on W, decomposing it as W = USV^T, where V^T denotes the transpose of V.
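To build some intuition for what the first singular value and the first column of V represent, the following Python sketch estimates them by power iteration on W^T W. This is a different and much simpler technique than the one MATLAB’s svd uses, and it recovers only the leading singular pair. It assumes that W is non-zero and that the starting vector is not orthogonal to the leading singular vector. The function name leading_singular_pair is hypothetical.

```python
import math


def leading_singular_pair(W, iterations=200):
    """Estimate the largest singular value of W and its right singular vector
    by power iteration on W^T W. W is a list of rows (N x V)."""
    n_rows, n_cols = len(W), len(W[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        # u = W v, then v_new = W^T u, so v_new = (W^T W) v
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in W]
        v_new = [sum(W[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v_new))
        v = [x / norm for x in v_new]
    # With v of unit length, the singular value is the length of W v.
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in W]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, v


# Toy matrix whose singular values are 4 and 3.
W = [[3.0, 0.0],
     [0.0, 4.0]]
sigma, v = leading_singular_pair(W)
print(round(sigma, 6))   # 4.0
```

In the toy example the returned vector points along the second coordinate, that is, along the direction in which W stretches vectors the most.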

Question 1: Are the matrices U and V as you would expect?  Explain.

Verify that the singular values, the diagonal elements of S, are ordered according to size.

Question 2: What are the values of the first 3 diagonal entries in S?

Now recall that the singular vectors, the ‘latent semantic classes’, correspond to the columns of V. You can access, for example, the first column of V and write it into the vector sv1 by using the MATLAB command:

>>sv1=V(:,1);

Do this for the first 3 columns of V, creating singular vectors sv1, sv2 and sv3.

Now you are going to try to interpret these vectors.  Intuitively, the most important words that determine the interpretation of the vector sv1 are those for which the corresponding coordinate of sv1 is biggest (positive or negative).

To find the biggest positive value in sv1 we can just use:

>>m=max(sv1);

But we don’t just want to know the size of the biggest number, we also need to know its position in the vector so that we know which word it corresponds to.  So use:

>>[m,am]=max(sv1);

In this case m is the maximum value in sv1 and am is its index (argmax).  Find the words that correspond to the three biggest values in sv1.  To achieve this you need to know the order in which the words occurred when the document vectors were constructed.  The program doc2vec.exe is based on index.exe, and the word order is the same in both programs.  So the nth component of a document vector corresponds to the nth word in the corresponding index file.  Hint: the most significant word for sv1 turns out to be ‘project’.
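The search for the most significant words can be sketched in Python as a sort of the vector’s components by absolute value, so that large negative entries are considered as well as large positive ones. The function name, the vector and the word list are hypothetical and serve only as an illustration.

```python
def top_k_by_magnitude(vector, words, k=3):
    """Return the k (word, value) pairs whose values are largest in absolute size."""
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    return [(words[i], vector[i]) for i in order[:k]]


# Hypothetical singular vector and word list, for illustration only.
sv = [0.10, -0.70, 0.05, 0.60, -0.30]
words = ["antenna", "radar", "circuit", "filter", "signal"]
top = top_k_by_magnitude(sv, words)
print(top)   # [('radar', -0.7), ('filter', 0.6), ('signal', -0.3)]
```

Note that Python indices start at 0 whereas MATLAB indices start at 1, so the nth word in the index file corresponds to position n in MATLAB and to position n-1 in a Python list.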

Question 3:  Find the three most significant words for each of the singular vectors sv1, sv2 and sv3.  What is your interpretation of the corresponding semantic classes?

