Programming Homework Help

Python and Natural Language Processing

Natural Language Processing

Q1: Define a tokenize function

Define a function tokenize that does the following in sequence:

takes a string as an input

converts the string into lowercase

segments the lowercased string into tokens. A token is defined as follows:

Each token has at least two characters.

The first and last characters can each only be a letter (a-z) or a digit (0-9).

In the middle, there are 0 or more characters, which can only be letters (a-z), digits (0-9), hyphens (“-”), underscores (“_”), dots (“.”), or “@” symbols.

lemmatizes all tokens using WordNetLemmatizer

removes stop words from the tokens (use English stop words list from NLTK)

generates a token frequency dictionary, where each unique token is a key and the frequency of the token is the value (hint: you can use nltk.FreqDist to create it)

returns the token frequency dictionary as the output
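The steps above can be sketched roughly as follows. This is one possible reading of the token definition, not a reference solution. The optional lemmatize and stop_words parameters are assumptions added here so the function can be tried without NLTK data; when they are omitted, the sketch falls back to WordNetLemmatizer and the NLTK English stop word list, which requires the "wordnet" and "stopwords" corpora to be downloaded. collections.Counter stands in for nltk.FreqDist, which is itself a Counter subclass.

```python
import re
from collections import Counter

# One reading of the token rule: first and last characters are a letter or
# digit; the middle is zero or more of letters, digits, "-", "_", ".", "@".
TOKEN_PATTERN = re.compile(r"[a-z0-9][a-z0-9_.@-]*[a-z0-9]")


def tokenize(text, lemmatize=None, stop_words=None):
    """Return a {token: frequency} dictionary for the input string.

    lemmatize and stop_words are hypothetical hooks (not part of the
    assignment); leave them as None to use NLTK as the assignment asks.
    """
    if lemmatize is None or stop_words is None:
        # Imported lazily so the sketch can run with stand-ins when NLTK
        # or its corpora are not installed.
        from nltk.corpus import stopwords
        from nltk.stem import WordNetLemmatizer

        if lemmatize is None:
            lemmatize = WordNetLemmatizer().lemmatize
        if stop_words is None:
            stop_words = set(stopwords.words("english"))

    # Lowercase, then segment into tokens of at least two characters.
    tokens = TOKEN_PATTERN.findall(text.lower())
    # Lemmatize first, then drop stop words, following the listed order.
    lemmas = [lemmatize(token) for token in tokens]
    kept = [token for token in lemmas if token not in stop_words]
    # Count occurrences of each remaining token.
    return dict(Counter(kept))
```

Calling tokenize(some_string) with no extra arguments uses the NLTK lemmatizer and stop word list. Passing a simple callable and a small set instead is only meant for quick experiments with the token pattern.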

Q2: Find duplicate questions by similarity

A data file ‘qa.txt’ has been provided for this question. This dataset has two columns, question and answer, as shown in the screenshot below. Here we only use the “question” column.

Define a function find_similar_doc as follows:

takes two inputs: a list of documents as strings (i.e. docs), and the index of a selected document as an integer (i.e. doc_id).

uses the “tokenize” function defined in Q1 to tokenize each document

generates a tf_idf matrix from the tokens (hint: refer to the tf_idf function defined in Section 7.5 of the lecture notes)

calculates the pairwise cosine distance of documents using the tf_idf matrix

for the selected document, finds the index of the most similar document (but not the selected document itself!) by the cosine similarity score

returns the index of the most similar document and the similarity score
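A minimal sketch of these steps is shown below, using only the standard library. Several things here are assumptions rather than the assignment's method: the tokenize parameter is a hypothetical hook for plugging in the Q1 function (the default is a plain whitespace split), the tf-idf weighting is a common smoothed variant that may differ from the function in Section 7.5 of the lecture notes, and only the similarities for the selected document are computed rather than the full pairwise matrix.

```python
import math
from collections import Counter


def find_similar_doc(docs, doc_id, tokenize=None):
    """Return (index, cosine similarity) of the document closest to docs[doc_id].

    tokenize is a hypothetical parameter: a callable mapping a string to a
    {token: count} dictionary, such as the function from Q1.
    """
    if tokenize is None:
        # Stand-in tokenizer so the sketch runs on its own.
        tokenize = lambda text: Counter(text.lower().split())

    counts = [tokenize(doc) for doc in docs]
    n_docs = len(docs)

    # Document frequency: number of documents containing each token.
    df = Counter()
    for doc_counts in counts:
        df.update(doc_counts.keys())

    # Smoothed inverse document frequency (an assumed weighting scheme).
    idf = {tok: math.log((1 + n_docs) / (1 + n)) + 1 for tok, n in df.items()}

    # Sparse tf-idf vectors, L2-normalized so a dot product is the cosine.
    vectors = []
    for doc_counts in counts:
        total = sum(doc_counts.values())
        vec = {t: (f / total) * idf[t] for t, f in doc_counts.items()} if total else {}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})

    target = vectors[doc_id]
    best_index, best_score = None, -1.0
    for i, vec in enumerate(vectors):
        if i == doc_id:
            continue  # skip the selected document itself
        score = sum(w * vec.get(t, 0.0) for t, w in target.items())
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score
```

With the question column of qa.txt loaded into a list named docs (a hypothetical variable), a call such as find_similar_doc(docs, 15, tokenize=tokenize) would return the index and score to inspect. Cosine distance is 1 minus this similarity, so the most similar document is also the one at the smallest distance.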

Test your function with two selected questions, i.e., doc_id = 15 and doc_id = 51.

Check the most similar questions discovered for each of them

Do you think this function can successfully find duplicate questions? Why does it work or not work? Write down your analysis in a document.
