C2W1.Assignment.Autocorrect.Part2

Theory lecture: C2W1.Auto-correct

Table of contents

3. Combining the edits
- 3.1 Exercise 8.Edit one letter
- 3.2 Exercise 9.Edit two letters
- 3.3 Exercise 10.suggest spelling suggestions
4. Minimum Edit Distance
- 4.1 Dynamic Programming
- Exercise 11
- Test All-in-one
5. Backtrace (Optional)


Part 1 is here

3. Combining the edits

With the four operations above in place, the next step is to implement the functions that apply one edit and two edits: edit_one_letter() and edit_two_letters().

3.1 Exercise 8.Edit one letter

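Implement edit_one_letter, which returns every string that is one edit away from the given word. Because switch is a less common edit operation, the function takes a parameter allow_switches that controls whether switch edits are included.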

# UNIT TEST COMMENT: Candidate for Table Driven Tests
# UNQ_C8 GRADED FUNCTION: edit_one_letter
def edit_one_letter(word, allow_switches = True):
    """
    Input:
        word: the string/word for which we will generate all possible words
              that are one edit away.
    Output:
        edit_one_set: a set of words with one possible edit. Please return a set, and not a list.
    """
    edit_one_set = set()

    ### START CODE HERE ###
    edit_one_set.update(delete_letter(word))
    if allow_switches:
        edit_one_set.update(switch_letter(word))
    edit_one_set.update(replace_letter(word))
    edit_one_set.update(insert_letter(word))
    ### END CODE HERE ###

    # return this as a set and not a list
    return set(edit_one_set)

Test:

tmp_word = "at"
tmp_edit_one_set = edit_one_letter(tmp_word)
# turn this into a list to sort it, in order to view it
tmp_edit_one_l = sorted(list(tmp_edit_one_set))
print(f"input word {tmp_word} \nedit_one_l \n{tmp_edit_one_l}\n")
print(f"The type of the returned object should be a set {type(tmp_edit_one_set)}")
print(f"Number of outputs from edit_one_letter('at') is {len(edit_one_letter('at'))}")

Result:

input word at
edit_one_l
['a', 'aa', 'aat', 'ab', 'abt', 'ac', 'act', 'ad', 'adt', 'ae', 'aet', 'af', 'aft', 'ag', 'agt', 'ah', 'aht', 'ai', 'ait', 'aj', 'ajt', 'ak', 'akt', 'al', 'alt', 'am', 'amt', 'an', 'ant', 'ao', 'aot', 'ap', 'apt', 'aq', 'aqt', 'ar', 'art', 'as', 'ast', 'ata', 'atb', 'atc', 'atd', 'ate', 'atf', 'atg', 'ath', 'ati', 'atj', 'atk', 'atl', 'atm', 'atn', 'ato', 'atp', 'atq', 'atr', 'ats', 'att', 'atu', 'atv', 'atw', 'atx', 'aty', 'atz', 'au', 'aut', 'av', 'avt', 'aw', 'awt', 'ax', 'axt', 'ay', 'ayt', 'az', 'azt', 'bat', 'bt', 'cat', 'ct', 'dat', 'dt', 'eat', 'et', 'fat', 'ft', 'gat', 'gt', 'hat', 'ht', 'iat', 'it', 'jat', 'jt', 'kat', 'kt', 'lat', 'lt', 'mat', 'mt', 'nat', 'nt', 'oat', 'ot', 'pat', 'pt', 'qat', 'qt', 'rat', 'rt', 'sat', 'st', 't', 'ta', 'tat', 'tt', 'uat', 'ut', 'vat', 'vt', 'wat', 'wt', 'xat', 'xt', 'yat', 'yt', 'zat', 'zt']

The type of the returned object should be a set <class 'set'>
Number of outputs from edit_one_letter('at') is 129
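
The count of 129 can be reproduced by counting each edit type separately. The sketch below is only an illustration: the four set comprehensions are stand-ins written for this example (the Part 1 functions are not shown in this part), and they assume a lowercase a-z alphabet and that a replace never returns the original word.

```python
import string

letters = string.ascii_lowercase
word = "at"

# every way to split the word into a left and a right part
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

# stand-in edit operations (assumed behavior, not the Part 1 code)
deletes = {L + R[1:] for L, R in splits if R}
switches = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
replaces = {L + c + R[1:] for L, R in splits if R for c in letters} - {word}
inserts = {L + c + R for L, R in splits for c in letters}

all_edits = deletes | switches | replaces | inserts
print(len(deletes), len(switches), len(replaces), len(inserts), len(all_edits))
# prints: 2 1 50 76 129
```

The insert count is 76 rather than 3 × 26 = 78 because 'aat' and 'att' can each be produced at two different positions, and the set keeps only one copy.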

3.2 Exercise 9.Edit two letters

Use edit_one_letter to implement edit_two_letters: apply one edit to the word, then apply one more edit to each of the results.

# UNIT TEST COMMENT: Candidate for Table Driven Tests
# UNQ_C9 GRADED FUNCTION: edit_two_letters
def edit_two_letters(word, allow_switches = True):
    '''
    Input:
        word: the input string/word
    Output:
        edit_two_set: a set of strings with all possible two edits
    '''
    edit_two_set = set()

    ### START CODE HERE ###
    edit_one = edit_one_letter(word, allow_switches)
    for w in edit_one:
        edit_two_set.update(edit_one_letter(w, allow_switches))
    ### END CODE HERE ###

    # return this as a set instead of a list
    return set(edit_two_set)

Test:

tmp_edit_two_set = edit_two_letters("a")
tmp_edit_two_l = sorted(list(tmp_edit_two_set))
print(f"Number of strings with edit distance of two: {len(tmp_edit_two_l)}")
print(f"First 10 strings {tmp_edit_two_l[:10]}")
print(f"Last 10 strings {tmp_edit_two_l[-10:]}")
print(f"The data type of the returned object should be a set {type(tmp_edit_two_set)}")
print(f"Number of strings that are 2 edit distances from 'at' is {len(edit_two_letters('at'))}")

3.3 Exercise 10.suggest spelling suggestions

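Implement get_corrections, which returns a list of 0 to n suggestion tuples of the form (word, probability_of_word).
The rough steps are as follows:
Step 1: Use the edit functions completed above to generate suggestions for the supplied word. Suggestions follow the principle that words reachable with fewer edits are more likely than words that need more edits. Specifically:

- If the word is in the vocabulary, suggest the word itself.
- Otherwise, if any suggestions from edit_one_letter are in the vocabulary, use those.
- Otherwise, if any suggestions from edit_two_letters are in the vocabulary, use those.
- Otherwise, suggest the input word.

To evaluate these rules efficiently, the code relies on Python's short-circuit evaluation. The following code shows the idea: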

# example of logical operation on lists or sets
print( [] and ["a","b"] )
print( [] or ["a","b"] )
#example of Short circuit behavior
val1 =  ["Most","Likely"] or ["Less","so"] or ["least","of","all"]  # selects first, does not evaluate remainder
print(val1)
val2 =  [] or [] or ["least","of","all"] # continues evaluation until there is a non-empty list
print(val2)

Result:
[]
['a', 'b']
['Most', 'Likely']
['least', 'of', 'all']

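In Python, the and and or operators are not limited to Boolean logic; they also work on non-Boolean values, including lists. The short-circuit behavior is summarized below:

print( [] and ["a","b"] )

Here the empty list [] is treated as false (falsy), while the non-empty list ["a","b"] is treated as true (truthy).
and stops at the first falsy value, so the expression evaluates to [].

print( [] or ["a","b"] )

or stops at the first truthy value.
[] is falsy, so evaluation moves on to ["a","b"], which is truthy, and the expression evaluates to ["a","b"].

val1 = ["Most","Likely"] or ["Less","so"] or ["least","of","all"]

This is a chain of or operations. or evaluates from left to right until it meets the first non-empty (truthy) value.
val1 is assigned ["Most","Likely"], because that is the first non-empty list.

val2 = [] or [] or ["least","of","all"]

or is chained again here, but the first two operands are empty lists (falsy).
or keeps evaluating until it reaches the non-empty list ["least","of","all"], which becomes the result.

In summary:

and stops at the first falsy value and returns it; if every operand is truthy, it returns the last operand.
or stops at the first truthy value and returns it; if every operand is falsy, it returns the last operand.
This behavior is called short-circuiting, because it skips evaluation of the rest of the expression once the result is known.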
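
The same trick carries over to sets, which is how the four priority rules of step 1 can be chained. The sketch below is a toy illustration only: toy_vocab, pick_candidates and the hand-written candidate sets are hypothetical and stand in for the real vocabulary and the output of the edit functions. One detail to watch is that `word in vocab` evaluates to a bool, so it is combined with `and {word}` to make the first branch yield a set as well.

```python
# Toy vocabulary (hypothetical, for illustration only)
toy_vocab = {"days", "dye", "the"}

def pick_candidates(word, one_edit, two_edits, vocab):
    """Return the first non-empty candidate set, in priority order."""
    return (
        (word in vocab and {word})   # rule 1: the word itself, wrapped in a set
        or (one_edit & vocab)        # rule 2: one-edit candidates found in vocab
        or (two_edits & vocab)       # rule 3: two-edit candidates found in vocab
        or {word}                    # rule 4: fall back to the input word
    )

print(pick_candidates("the", {"thy"}, {"they"}, toy_vocab))
print(sorted(pick_candidates("dys", {"dye", "days", "dyz"}, {"day"}, toy_vocab)))
print(pick_candidates("zzz", {"zz"}, {"z"}, toy_vocab))
```

Each later branch is evaluated only if every earlier one came out empty, so in the real function the comparatively expensive two-edit candidates would be generated only when needed.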

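Step 2: Create a dictionary best_words whose keys are the suggestions and whose values are the probabilities of those words in the vocabulary. If a word is not in the vocabulary, set its probability to 0.
Step 3: Select the n best suggestions. There may be fewer than n of them.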
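
Steps 2 and 3 amount to a dictionary lookup with a default, followed by a sort. A minimal sketch with made-up numbers: toy_probs and the suggestion list below are illustrative values, not taken from the real corpus.

```python
# Illustrative probabilities (hypothetical values)
toy_probs = {"days": 0.00041, "dye": 0.000019}
suggestions = ["dye", "days", "dyz"]   # "dyz" has no entry in toy_probs
n = 2

# Step 2: look up each suggestion, defaulting to 0 for unknown words
best_words = {w: toy_probs.get(w, 0) for w in suggestions}

# Step 3: sort by probability, highest first, and keep at most n entries
n_best = sorted(best_words.items(), key=lambda kv: kv[1], reverse=True)[:n]
print(n_best)
```

Slicing with [:n] is safe even when there are fewer than n suggestions; it returns whatever is available.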

# UNIT TEST COMMENT: Candidate for Table Driven Tests
# UNQ_C10 GRADED FUNCTION: get_corrections
def get_corrections(word, probs, vocab, n=2, verbose = False):
    '''
    Input:
        word: a user entered string to check for suggestions
        probs: a dictionary that maps each word to its probability in the corpus
        vocab: a set containing all the vocabulary
        n: number of possible word corrections you want returned in the dictionary
    Output:
        n_best: a list of tuples with the most probable n corrected words and their probabilities.
    '''
    suggestions = []
    n_best = []

    ### START CODE HERE ###
    # Step 1: create suggestions as described above.
    # `word in vocab` is only a bool and list(word) would split the word into letters,
    # so the first and last branches wrap the word in a list.
    suggestions = list((word in vocab and [word]) or edit_one_letter(word).intersection(vocab) or edit_two_letters(word).intersection(vocab) or [word])

    # Step 2: determine probability of suggestions (0 for words outside the vocabulary)
    best_words = {}
    for w in suggestions:
        best_words[w] = probs.get(w, 0) if w in vocab else 0

    # Step 3: Get all your best words and return the most probable top n_suggested words as n_best
    n_best = sorted(best_words.items(), key=lambda x: x[1], reverse=True)[:n]
    ### END CODE HERE ###

    if verbose: print("entered word = ", word, "\nsuggestions = ", suggestions)

    return n_best

Test:

# Test your implementation - feel free to try other words in my_word
my_word = 'dys'
tmp_corrections = get_corrections(my_word, probs, vocab, 2, verbose=True) # keep verbose=True
for i, word_prob in enumerate(tmp_corrections):
    print(f"word {i}: {word_prob[0]}, probability {word_prob[1]:.6f}")

# CODE REVIEW COMMENT: using "tmp_corrections" instead of "cors". "cors" is not defined
print(f"data type of corrections {type(tmp_corrections)}")

Result:
entered word = dys
suggestions = ['dye', 'days']
word 0: days, probability 0.000410
word 1: dye, probability 0.000019
data type of corrections <class 'list'>

4. Minimum Edit Distance

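Although auto-correct is now basically working, some questions remain open, for example:
How do we measure the similarity between two strings, such as "waht" and "what"?
How do we efficiently find the shortest path from the word "waht" to the word "what"?
This section uses dynamic programming to find the minimum number of edits needed to turn one string into another.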

4.1 Dynamic Programming

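Dynamic programming breaks a problem into subproblems whose solutions can be combined into the final solution. Here, given a source string source[0..i] and a target string target[0..j], we compute the edit distance for every combination of prefixes (i, j). To do this efficiently, we store the distances already computed for shorter prefixes in a table and reuse them to compute the distances for longer ones.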
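
The idea of reusing subproblem results can also be sketched top-down with memoization. This is a minimal illustration and not the assignment's solution: the name edit_distance_memo is hypothetical, and the default costs (insert 1, delete 1, replace 2) are assumed to match the defaults of min_edit_distance in Exercise 11.

```python
from functools import lru_cache

def edit_distance_memo(source, target, ins_cost=1, del_cost=1, rep_cost=2):
    """Minimum edit distance via memoized recursion over prefix lengths."""

    @lru_cache(maxsize=None)
    def d(i, j):
        # d(i, j): cost of turning source[:i] into target[:j]
        if i == 0:
            return j * ins_cost          # insert the first j target letters
        if j == 0:
            return i * del_cost          # delete the first i source letters
        r_cost = 0 if source[i - 1] == target[j - 1] else rep_cost
        return min(
            d(i - 1, j) + del_cost,      # delete source[i-1]
            d(i, j - 1) + ins_cost,      # insert target[j-1]
            d(i - 1, j - 1) + r_cost,    # replace, or keep a matching letter
        )

    return d(len(source), len(target))

print(edit_distance_memo("waht", "what"))  # 2: delete 'a', then insert 'a' after 'h'
print(edit_distance_memo("play", "stay"))  # 4
```

The cache plays the role of the table: each (i, j) pair is computed once and then looked up. The bottom-up matrix version fills in the same values in a fixed order and avoids deep recursion on long strings.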
Initialize a matrix and update each of its elements as follows:
$\text{Initialization}\\ \begin{aligned} D[0,0] &= 0 \\ D[i,0] &= D[i-1,0] + del\_cost(source[i]) \\ D[0,j] &= D[0,j-1] + ins\_cost(target[j]) \\ \end{aligned}\tag{4}$

$\text{Per Cell Operations}\\ \begin{aligned} \\ D[i,j] =min \begin{cases} D[i-1,j] + del\_cost\\ D[i,j-1] + ins\_cost\\ D[i-1,j-1] + \left\{\begin{matrix} rep\_cost; & if src[i]\neq tar[j]\\ 0 ; & if src[i]=tar[j] \end{matrix}\right. \end{cases} \end{aligned}\tag{5}$
Applying these formulas to convert play into stay (with insert and delete costs of 1 and a replace cost of 2) gives:

	#	s	t	a	y
#	0	1	2	3	4
p	1	2	3	4	5
l	2	3	4	5	6
a	3	4	5	4	5
y	4	5	6	5	4

The operations used are insert, delete and replace; the switch operation is not used.
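
As a check on formula 5, the cell in row a, column a of the table above ($i = 3$, $j = 3$) can be worked out by hand, assuming the costs $ins\_cost = del\_cost = 1$ and $rep\_cost = 2$:

$\begin{aligned} D[3,3] &= \min \begin{cases} D[2,3] + del\_cost = 5 + 1 = 6 \\ D[3,2] + ins\_cost = 5 + 1 = 6 \\ D[2,2] + 0 = 4 + 0 = 4 & (src[3] = tar[3] = a) \end{cases} \\ &= 4 \end{aligned}$

The diagonal term wins because the two letters match, so carrying over $D[2,2]$ costs nothing extra.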
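The figure below shows the initialization of the table/matrix (formula 4). Each cell holds the minimum edit cost/distance needed to turn the source prefix source[0:i] into the target prefix target[0:j]. The first column is initialized with the deletion cost of turning the source string "EER" into the empty target string "", as shown in the figure:
"" → "": nothing to delete, cost 0
"E" → "": delete one letter, cost 1
…
"EER" → "": delete three letters, cost 3
The first row is initialized with the insertion cost of turning the empty source string "" into the target string "NEAR", as shown in the figure:
"" → "N": insert one letter, cost 1
…
"" → "NEAR": insert four letters, cost 4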

[Figure: initialization of the cost matrix for source "EER" and target "NEAR"]
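
The initialization of the first column and first row can be sketched in a few lines. This is a minimal illustration using plain nested lists instead of a NumPy array, with insert and delete costs both assumed to be 1.

```python
source, target = "eer", "near"
del_cost, ins_cost = 1, 1
m, n = len(source), len(target)

# (m+1) x (n+1) matrix of zeros; row 0 and column 0 stand for the empty prefix
D = [[0] * (n + 1) for _ in range(m + 1)]

# first column: delete the first i letters of the source
for i in range(1, m + 1):
    D[i][0] = D[i - 1][0] + del_cost

# first row: insert the first j letters of the target
for j in range(1, n + 1):
    D[0][j] = D[0][j - 1] + ins_cost

for row in D:
    print(row)
# [0, 1, 2, 3, 4]
# [1, 0, 0, 0, 0]
# [2, 0, 0, 0, 0]
# [3, 0, 0, 0, 0]
```

The zeros that remain are placeholders; they are overwritten when the rest of the matrix is filled in.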
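Next, fill in the rest of the matrix using formula 5. Note that each cell $D [i, j]$ considers three cases and takes the minimum of them (this is done in the min_edit_distance() function).

The figure below gives a worked example of computing the edit cost with a replace operation.

Read the calculation next to the green arrows on the right of the figure carefully. The terms substitute/substitution in the figure mean the same thing as the replace operation described earlier; only the wording differs.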

Exercise 11

import numpy as np  # needed for np.zeros below

# UNQ_C11 GRADED FUNCTION: min_edit_distance
def min_edit_distance(source, target, ins_cost = 1, del_cost = 1, rep_cost = 2):
    '''
    Input:
        source: a string corresponding to the string you are starting with
        target: a string corresponding to the string you want to end with
        ins_cost: an integer setting the insert cost
        del_cost: an integer setting the delete cost
        rep_cost: an integer setting the replace cost
    Output:
        D: a matrix of len(source)+1 by len(target)+1 containing minimum edit distances
        med: the minimum edit distance (med) required to convert the source string to the target
    '''
    # use deletion and insert cost as 1
    m = len(source)
    n = len(target)
    # initialize cost matrix with zeros and dimensions (m+1,n+1)
    D = np.zeros((m+1, n+1), dtype=int)

    ### START CODE HERE ###
    # Fill in column 0, from row 1 to row m, both inclusive
    for row in range(1, m+1):
        D[row,0] = D[row-1,0] + del_cost

    # Fill in row 0, for all columns from 1 to n, both inclusive
    for col in range(1, n+1):
        D[0,col] = D[0,col-1] + ins_cost

    # Loop through row 1 to row m, both inclusive
    for row in range(1, m+1):
        # Loop through column 1 to column n, both inclusive
        for col in range(1, n+1):
            # Initialize r_cost to the 'replace' cost that is passed into this function
            r_cost = rep_cost
            # Check to see if source character at the previous row
            # matches the target character at the previous column
            if source[row-1] == target[col-1]:
                # Update the replacement cost to 0 if source and target are the same
                r_cost = 0
            # Update the cost at row, col based on previous entries in the cost matrix
            # Refer to the equation calculate for D[i,j] (the minimum of three calculated costs)
            D[row,col] = min([D[row-1,col]+del_cost, D[row,col-1]+ins_cost, D[row-1,col-1]+r_cost])

    # Set the minimum edit distance with the cost found at row m, column n
    med = D[m,n]
    ### END CODE HERE ###

    return D, med

Test:

# testing your implementation 
source =  'play'
target = 'stay'
matrix, min_edits = min_edit_distance(source, target)
print("minimum edits: ",min_edits, "\n")
idx = list('#' + source)
cols = list('#' + target)
df = pd.DataFrame(matrix, index=idx, columns= cols)
print(df)

Result:

minimum edits:  4

   #  s  t  a  y
#  0  1  2  3  4
p  1  2  3  4  5
l  2  3  4  5  6
a  3  4  5  4  5
y  4  5  6  5  4

# testing your implementation 
source =  'eer'
target = 'near'
matrix, min_edits = min_edit_distance(source, target)
print("minimum edits: ",min_edits, "\n")
idx = list(source)
idx.insert(0, '#')
cols = list(target)
cols.insert(0, '#')
df = pd.DataFrame(matrix, index=idx, columns= cols)
print(df)

Result:

minimum edits:  3

   #  n  e  a  r
#  0  1  2  3  4
e  1  2  1  2  3
e  2  3  2  3  4
r  3  4  3  4  3

Test All-in-one

Minimum edit distance for one-letter edits:

source = "eer"
targets = edit_one_letter(source,allow_switches = False)  #disable switches since min_edit_distance does not include them
for t in targets:
    _, min_edits = min_edit_distance(source, t,1,1,1)  # set ins, del, sub costs all to one
    if min_edits != 1: print(source, t, min_edits)

Result:
eer iir 2
eer bbr 2
eer zzr 2
eer qqr 2
eer ffr 2
eer jjr 2
eer ttr 2
eer vvr 2
eer ssr 2
eer llr 2
eer hhr 2
eer nnr 2
eer ddr 2
eer aar 2
eer oor 2
eer ppr 2
eer mmr 2
eer ccr 2
eer wwr 2
eer kkr 2
eer xxr 2
eer yyr 2
eer rrr 2
eer uur 2
eer ggr 2
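
Every string printed above has the form xxr: both occurrences of e were replaced, so a distance of 2 is correct for these strings. This suggests that the replace_letter from Part 1 replaces every occurrence of a letter (for example via str.replace) rather than the letter at a single position; with a per-position replace, this check would be expected to print nothing.

Minimum edit distance for two-letter edits: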

source = "eer"
targets = edit_two_letters(source,allow_switches = False) #disable switches since min_edit_distance does not include them
for t in targets:
    _, min_edits = min_edit_distance(source, t,1,1,1)  # set ins, del, sub costs all to one
    if min_edits != 2 and min_edits != 1: print(source, t, min_edits)