Theory notes: C2W1.Auto-correct
Table of contents
- 3. Combining the edits
- 3.1 Exercise 8.Edit one letter
- 3.2 Exercise 9.Edit two letters
- 3.3 Exercise 10.suggest spelling suggestions
- 4. Minimum Edit Distance
- 4.1 Dynamic Programming
- Exercise 11
- Test All-in-one
- 5. Backtrace (Optional)
Part 1 of these notes is in a separate post.
3. Combining the edits
With the four operations above in place, the next step is to implement the functions that apply one and two letter edits: edit_one_letter() and edit_two_letters().
3.1 Exercise 8.Edit one letter
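Complete edit_one_letter, which returns every word that is one edit away from the given word. Because switch is a less common operation than the other three, the function takes a parameter, allow_switches, that controls whether switch edits are included.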
# UNIT TEST COMMENT: Candidate for Table Driven Tests
# UNQ_C8 GRADED FUNCTION: edit_one_letter
def edit_one_letter(word, allow_switches = True):
    """
    Input:
        word: the string/word for which we will generate all possible words
              that are one edit away.
    Output:
        edit_one_set: a set of words with one possible edit. Please return a set, and not a list.
    """
    edit_one_set = set()

    ### START CODE HERE ###
    edit_one_set.update(delete_letter(word))
    if allow_switches:
        edit_one_set.update(switch_letter(word))
    edit_one_set.update(replace_letter(word))
    edit_one_set.update(insert_letter(word))
    ### END CODE HERE ###

    # return this as a set and not a list
    return set(edit_one_set)
Test:
tmp_word = "at"
tmp_edit_one_set = edit_one_letter(tmp_word)
# turn this into a list to sort it, in order to view it
tmp_edit_one_l = sorted(list(tmp_edit_one_set))
print(f"input word {tmp_word} \nedit_one_l \n{tmp_edit_one_l}\n")
print(f"The type of the returned object should be a set {type(tmp_edit_one_set)}")
print(f"Number of outputs from edit_one_letter('at') is {len(edit_one_letter('at'))}")
Output:
input word at
edit_one_l
['a', 'aa', 'aat', 'ab', 'abt', 'ac', 'act', 'ad', 'adt', 'ae', 'aet', 'af', 'aft', 'ag', 'agt', 'ah', 'aht', 'ai', 'ait', 'aj', 'ajt', 'ak', 'akt', 'al', 'alt', 'am', 'amt', 'an', 'ant', 'ao', 'aot', 'ap', 'apt', 'aq', 'aqt', 'ar', 'art', 'as', 'ast', 'ata', 'atb', 'atc', 'atd', 'ate', 'atf', 'atg', 'ath', 'ati', 'atj', 'atk', 'atl', 'atm', 'atn', 'ato', 'atp', 'atq', 'atr', 'ats', 'att', 'atu', 'atv', 'atw', 'atx', 'aty', 'atz', 'au', 'aut', 'av', 'avt', 'aw', 'awt', 'ax', 'axt', 'ay', 'ayt', 'az', 'azt', 'bat', 'bt', 'cat', 'ct', 'dat', 'dt', 'eat', 'et', 'fat', 'ft', 'gat', 'gt', 'hat', 'ht', 'iat', 'it', 'jat', 'jt', 'kat', 'kt', 'lat', 'lt', 'mat', 'mt', 'nat', 'nt', 'oat', 'ot', 'pat', 'pt', 'qat', 'qt', 'rat', 'rt', 'sat', 'st', 't', 'ta', 'tat', 'tt', 'uat', 'ut', 'vat', 'vt', 'wat', 'wt', 'xat', 'xt', 'yat', 'yt', 'zat', 'zt']
The type of the returned object should be a set <class 'set'>
Number of outputs from edit_one_letter('at') is 129
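The count of 129 can be reproduced with a self-contained sketch. The four helper functions from Part 1 are not shown in this section, so the compact set-based version below (all `sketch_` names are hypothetical) is an assumption written for illustration, not the course's reference solution.

```python
import string

LETTERS = string.ascii_lowercase


def sketch_splits(word):
    # Every way to cut the word into a (left, right) pair.
    return [(word[:i], word[i:]) for i in range(len(word) + 1)]


def sketch_edit_one_letter(word, allow_switches=True):
    splits = sketch_splits(word)
    deletes = {l + r[1:] for l, r in splits if r}
    switches = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in LETTERS}
    inserts = {l + c + r for l, r in splits for c in LETTERS}

    edits = deletes | replaces | inserts
    if allow_switches:
        edits |= switches
    # Replacing a letter with itself reproduces the input, which is not an edit.
    edits.discard(word)
    return edits


one_edit = sketch_edit_one_letter("at")
print(len(one_edit))  # 129
```

For "at" the breakdown is 2 deletes, 1 switch, 50 replaces and 76 distinct inserts ("aat" and "att" can each be produced in two ways). With `allow_switches=False` the set shrinks to 128, because "ta" is reachable only through a switch.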
3.2 Exercise 9.Edit two letters
Use the edit_one_letter function to implement edit_two_letters.
# UNIT TEST COMMENT: Candidate for Table Driven Tests
# UNQ_C9 GRADED FUNCTION: edit_two_letters
def edit_two_letters(word, allow_switches = True):
    '''
    Input:
        word: the input string/word
    Output:
        edit_two_set: a set of strings with all possible two edits
    '''
    edit_two_set = set()

    ### START CODE HERE ###
    edit_one = edit_one_letter(word, allow_switches)
    for w in edit_one:
        edit_two_set.update(edit_one_letter(w, allow_switches))
    ### END CODE HERE ###

    # return this as a set instead of a list
    return set(edit_two_set)
Test:
tmp_edit_two_set = edit_two_letters("a")
tmp_edit_two_l = sorted(list(tmp_edit_two_set))
print(f"Number of strings with edit distance of two: {len(tmp_edit_two_l)}")
print(f"First 10 strings {tmp_edit_two_l[:10]}")
print(f"Last 10 strings {tmp_edit_two_l[-10:]}")
print(f"The data type of the returned object should be a set {type(tmp_edit_two_set)}")
print(f"Number of strings that are 2 edit distances from 'at' is {len(edit_two_letters('at'))}")
3.3 Exercise 10.suggest spelling suggestions
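Implement the get_corrections function, which returns a list of 0 to n suggestion tuples of the form (word, probability_of_word).
The steps are roughly as follows:
Step 1: Use the edit functions completed above to generate suggestions for the supplied word. The guiding principle is that words reachable with fewer edits are more likely than words that need more edits. Specifically:
- If the word is in the vocabulary, suggest the word itself.
- Otherwise, if any results from edit_one_letter are in the vocabulary, use those.
- Otherwise, if any results from edit_two_letters are in the vocabulary, use those.
- Otherwise, suggest the input word.
To apply these rules efficiently, the code relies on Python's short-circuit evaluation. See the code: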
# example of logical operation on lists or sets
print( [] and ["a","b"] )
print( [] or ["a","b"] )
#example of Short circuit behavior
val1 = ["Most","Likely"] or ["Less","so"] or ["least","of","all"] # selects first, does not evaluate remainder
print(val1)
val2 = [] or [] or ["least","of","all"] # continues evaluation until there is a non-empty list
print(val2)
Output:
[]
['a', 'b']
['Most', 'Likely']
['least', 'of', 'all']
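In Python, the and and or operators are not limited to Boolean values; they also work on non-Boolean values, including lists. The short-circuit behavior used above works as follows:
print( [] and ["a","b"] )
Here the empty list [] is treated as False, while the non-empty list ["a","b"] is treated as True.
The and operator stops at the first False value, so the expression evaluates to [].
print( [] or ["a","b"] )
The or operator stops at the first True value.
Because [] is False and ["a","b"] is True, the expression evaluates to ["a","b"].
val1 = ["Most","Likely"] or ["Less","so"] or ["least","of","all"]
This is a chain of or operations, evaluated from left to right until the first non-empty (True) value is found.
val1 is assigned ["Most","Likely"], because it is the first non-empty list.
val2 = [] or [] or ["least","of","all"]
This is also a chain of or operations, but the first two operands are empty lists (False).
The or operator keeps evaluating until it reaches the final operand, the non-empty list ["least","of","all"].
In summary:
- The and operator stops at the first False value and returns that value; if every operand is True, it returns the last one.
- The or operator stops at the first True value and returns that value; if every operand is False, it returns the last one.
- This behavior is called short-circuiting, because it skips the unnecessary evaluation of the rest of the expression.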
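The "skips the rest of the expression" part can be made visible with a small sketch. The `candidates` helper below is hypothetical; it only records which operands were actually evaluated. In get_corrections this matters because the two-edit candidate set is far larger than the one-edit set and is only built when the earlier operands come back empty.

```python
calls = []


def candidates(label, result):
    # Record that this operand was evaluated, then hand back its result.
    calls.append(label)
    return result


chosen = (
    candidates("first", [])
    or candidates("second", ["b"])
    or candidates("third", ["c"])
)

print(chosen)  # ['b']
print(calls)   # ['first', 'second']  (the third operand never ran)
```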
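Step 2: Create a best_words dictionary in which each key is a suggestion and each value is the probability of that word in the vocabulary. If the word is not in the vocabulary, set its probability to 0.
Step 3: Select the n best suggestions. There may, of course, be fewer than n of them.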
# UNIT TEST COMMENT: Candidate for Table Driven Tests
# UNQ_C10 GRADED FUNCTION: get_corrections
def get_corrections(word, probs, vocab, n=2, verbose = False):
    '''
    Input:
        word: a user entered string to check for suggestions
        probs: a dictionary that maps each word to its probability in the corpus
        vocab: a set containing all the vocabulary
        n: number of possible word corrections you want returned in the dictionary
    Output:
        n_best: a list of tuples with the most probable n corrected words and their probabilities.
    '''
    suggestions = []
    n_best = []

    ### START CODE HERE ###
    entered_word = word
    # Step 1: create suggestions as described above
    # Every operand of `or` must be a collection of words, so the word itself is
    # wrapped in a list; a bare `word in vocab` would pass True to list(), and a
    # bare `word` as the last operand would be split into single characters.
    suggestions = list((word in vocab and [word]) or edit_one_letter(word).intersection(vocab) or edit_two_letters(word).intersection(vocab) or [word])

    # Step 2: determine probability of suggestions
    best_words = {}
    for word in suggestions:
        if word not in vocab:
            best_words[word] = 0
            continue
        best_words[word] = probs.get(word, 0)

    # Step 3: Get all your best words and return the most probable top n_suggested words as n_best
    n_best = sorted(best_words.items(), key=lambda x: x[1], reverse=True)[:n]
    ### END CODE HERE ###

    if verbose: print("entered word = ", entered_word, "\nsuggestions = ", suggestions)

    return n_best
Test:
# Test your implementation - feel free to try other words in my word
my_word = 'dys'
tmp_corrections = get_corrections(my_word, probs, vocab, 2, verbose=True) # keep verbose=True
# CODE REVIEW COMMENT: using "tmp_corrections" instead of "cors". "cors" is not defined
for i, word_prob in enumerate(tmp_corrections):
    print(f"word {i}: {word_prob[0]}, probability {word_prob[1]:.6f}")
print(f"data type of corrections {type(tmp_corrections)}")
Output:
entered word = dys
suggestions = ['dye', 'days']
word 0: days, probability 0.000410
word 1: dye, probability 0.000019
data type of corrections <class 'list'>
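One detail of the `or` chain in Step 1 is worth checking on its own: every operand has to be a collection of words. A bare `word in vocab` evaluates to `True`, which `list()` cannot iterate, and a bare string as the final fallback would be split into characters. The sketch below uses a toy vocabulary and a hypothetical helper, `pick_suggestions`; the one-edit and two-edit candidate sets are passed in directly (an assumption made to keep the snippet self-contained).

```python
def pick_suggestions(word, vocab, one_edit, two_edits):
    # Wrap the word in a list so that each operand of `or` is a collection.
    return list(
        (word in vocab and [word])
        or one_edit & vocab
        or two_edits & vocab
        or [word]
    )


toy_vocab = {"day", "days", "dye"}

known = pick_suggestions("days", toy_vocab, set(), set())
close = sorted(pick_suggestions("dys", toy_vocab, {"dye", "days", "dyz"}, set()))
unknown = pick_suggestions("zzzz", toy_vocab, set(), set())

print(known)    # ['days']
print(close)    # ['days', 'dye']
print(unknown)  # ['zzzz']

# What the unwrapped operands would do instead:
print(list("dys"))  # ['d', 'y', 's']
try:
    list(True)
except TypeError as err:
    print("TypeError:", err)
```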
4. Minimum Edit Distance
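Auto-correct is now largely working, but some questions remain open, for example:
How do we measure the similarity between two strings, such as "waht" and "what"?
How do we efficiently find the shortest path from the word "waht" to the word "what"?
This section uses dynamic programming to find the minimum number of edits needed to turn one string into another.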
4.1 Dynamic Programming
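Dynamic programming breaks a problem into subproblems whose solutions can be combined into the final solution. Here, given a source string source[0..i] and a target string target[0..j], we compute the edit distance for every combination of prefixes (i, j). To do this efficiently, a table stores the results for shorter prefixes, and those results are reused to compute the results for longer ones.
Initialize a matrix and update each of its elements as follows: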
Initialization:
$$
\begin{aligned}
D[0,0] &= 0 \\
D[i,0] &= D[i-1,0] + del\_cost(source[i]) \\
D[0,j] &= D[0,j-1] + ins\_cost(target[j])
\end{aligned}\tag{4}
$$
Per Cell Operations:
$$
D[i,j] = \min
\begin{cases}
D[i-1,j] + del\_cost\\
D[i,j-1] + ins\_cost\\
D[i-1,j-1] + \begin{cases} rep\_cost, & \text{if } src[i] \neq tar[j]\\ 0, & \text{if } src[i] = tar[j] \end{cases}
\end{cases}\tag{5}
$$
Applying these formulas to transform play into stay gives:
| | # | s | t | a | y |
|---|---|---|---|---|---|
| # | 0 | 1 | 2 | 3 | 4 |
| p | 1 | 2 | 3 | 4 | 5 |
| l | 2 | 3 | 4 | 5 | 6 |
| a | 3 | 4 | 5 | 4 | 5 |
| y | 4 | 5 | 6 | 5 | 4 |
The operations involved are insert, delete and replace; the switch operation is not used.
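Two cells of the play → stay table can be checked by hand with formula (5). The costs used here (insert 1, delete 1, replace 2) are an assumption inferred from the table values; they match the defaults of min_edit_distance in Exercise 11.

```latex
\begin{aligned}
D[1,1] &= \min\{\,D[0,1]+1,\ D[1,0]+1,\ D[0,0]+2\,\} = \min\{2,\,2,\,2\} = 2
  && (\texttt{p} \neq \texttt{s}) \\
D[3,3] &= \min\{\,D[2,3]+1,\ D[3,2]+1,\ D[2,2]+0\,\} = \min\{6,\,6,\,4\} = 4
  && (\texttt{a} = \texttt{a})
\end{aligned}
```

In the second line the diagonal term wins because the letters match, so the cell simply carries over the cost of turning "pl" into "st".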
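The figure below shows the initialization of the table/matrix (formula 4). Each cell holds the minimum edit cost/distance needed to turn the source prefix source[0:i] into the target prefix target[0:j]. The first column is initialized with the deletion cost of turning the source string "EER" into the empty target string "", as the figure shows:
"" → "": nothing to delete, cost 0
"E" → "": delete one letter, cost 1
…
"EER" → "": delete three letters, cost 3
The first row is initialized with the insertion cost of turning the empty source string "" into the target string "NEAR", as the figure shows:
"" → "N": insert one letter, cost 1
…
"" → "NEAR": insert four letters, cost 4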
Next, fill in the remaining cells using formula 5. Note that each cell $D[i,j]$ has to consider three cases and take the minimum of them (this is done in the min_edit_distance() function).
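The initialization and the first application of formula 5 can be sketched in plain Python for "eer" → "near". The costs (insert 1, delete 1, replace 2) are assumed to match the defaults of min_edit_distance, and only row 1 is filled here to keep the example short.

```python
source, target = "eer", "near"
ins_cost, del_cost, rep_cost = 1, 1, 2  # assumed costs

# (len(source)+1) x (len(target)+1) table of zeros
D = [[0] * (len(target) + 1) for _ in range(len(source) + 1)]

# Formula 4: first column accumulates deletions, first row accumulates insertions.
for i in range(1, len(source) + 1):
    D[i][0] = D[i - 1][0] + del_cost
for j in range(1, len(target) + 1):
    D[0][j] = D[0][j - 1] + ins_cost

# Formula 5 applied to row 1 (source prefix "e").
i = 1
for j in range(1, len(target) + 1):
    r_cost = 0 if source[i - 1] == target[j - 1] else rep_cost
    D[i][j] = min(
        D[i - 1][j] + del_cost,    # delete source[i-1]
        D[i][j - 1] + ins_cost,    # insert target[j-1]
        D[i - 1][j - 1] + r_cost,  # replace, or keep when the letters match
    )

first_column = [row[0] for row in D]
print(first_column)  # [0, 1, 2, 3]
print(D[0])          # [0, 1, 2, 3, 4]
print(D[1])          # [1, 2, 1, 2, 3]
```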
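The figure below gives a worked example of an edit cost that uses the replace operation.
Read the calculation next to the green arrows on the right carefully. The terms substitute/substitution in the figure mean the same thing as the replace operation mentioned earlier; only the wording differs.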
Exercise 11
# UNQ_C11 GRADED FUNCTION: min_edit_distance
import numpy as np  # provides np.zeros for the cost matrix

def min_edit_distance(source, target, ins_cost = 1, del_cost = 1, rep_cost = 2):
    '''
    Input:
        source: a string corresponding to the string you are starting with
        target: a string corresponding to the string you want to end with
        ins_cost: an integer setting the insert cost
        del_cost: an integer setting the delete cost
        rep_cost: an integer setting the replace cost
    Output:
        D: a matrix of len(source)+1 by len(target)+1 containing minimum edit distances
        med: the minimum edit distance (med) required to convert the source string to the target
    '''
    # use deletion and insert cost as 1
    m = len(source)
    n = len(target)
    # initialize cost matrix with zeros and dimensions (m+1,n+1)
    D = np.zeros((m+1, n+1), dtype=int)

    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # Fill in column 0, from row 1 to row m, both inclusive
    for row in range(1,m+1): # Replace None with the proper range
        D[row,0] = D[row-1,0] + del_cost

    # Fill in row 0, for all columns from 1 to n, both inclusive
    for col in range(1,n+1): # Replace None with the proper range
        D[0,col] = D[0,col-1] + ins_cost

    # Loop through row 1 to row m, both inclusive
    for row in range(1,m+1):

        # Loop through column 1 to column n, both inclusive
        for col in range(1,n+1):

            # Initialize r_cost to the 'replace' cost that is passed into this function
            r_cost = rep_cost

            # Check to see if source character at the previous row
            # matches the target character at the previous column,
            if source[row-1] == target[col-1]: # Replace None with a proper comparison
                # Update the replacement cost to 0 if source and target are the same
                r_cost = 0

            # Update the cost at row, col based on previous entries in the cost matrix
            # Refer to the equation calculate for D[i,j] (the minimum of three calculated costs)
            D[row,col] = min([D[row-1,col]+del_cost, D[row,col-1]+ins_cost, D[row-1,col-1]+r_cost])

    # Set the minimum edit distance with the cost found at row m, column n
    med = D[m,n]

    ### END CODE HERE ###
    return D, med
Test:
import pandas as pd  # used below to display the matrix with row and column labels
# testing your implementation
source = 'play'
target = 'stay'
matrix, min_edits = min_edit_distance(source, target)
print("minimum edits: ",min_edits, "\n")
idx = list('#' + source)
cols = list('#' + target)
df = pd.DataFrame(matrix, index=idx, columns= cols)
print(df)
Output:
minimum edits:  4

   #  s  t  a  y
#  0  1  2  3  4
p  1  2  3  4  5
l  2  3  4  5  6
a  3  4  5  4  5
y  4  5  6  5  4
# testing your implementation
source = 'eer'
target = 'near'
matrix, min_edits = min_edit_distance(source, target)
print("minimum edits: ",min_edits, "\n")
idx = list(source)
idx.insert(0, '#')
cols = list(target)
cols.insert(0, '#')
df = pd.DataFrame(matrix, index=idx, columns= cols)
print(df)
Output:
minimum edits:  3

   #  n  e  a  r
#  0  1  2  3  4
e  1  2  1  2  3
e  2  3  2  3  4
r  3  4  3  4  3
Test All-in-one
Minimum edit distance check for single-letter edits:
source = "eer"
targets = edit_one_letter(source,allow_switches = False) #disable switches since min_edit_distance does not include them
for t in targets:
    _, min_edits = min_edit_distance(source, t,1,1,1) # set ins, del, sub costs all to one
    if min_edits != 1: print(source, t, min_edits)
Output:
eer iir 2
eer bbr 2
eer zzr 2
eer qqr 2
eer ffr 2
eer jjr 2
eer ttr 2
eer vvr 2
eer ssr 2
eer llr 2
eer hhr 2
eer nnr 2
eer ddr 2
eer aar 2
eer oor 2
eer ppr 2
eer mmr 2
eer ccr 2
eer wwr 2
eer kkr 2
eer xxr 2
eer yyr 2
eer rrr 2
eer uur 2
eer ggr 2
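Every pair printed above has both e's of "eer" changed to the same new letter. A minimal sketch with a hypothetical pure-Python helper, `unit_edit_distance` (all three costs set to 1, as in the call above), confirms that such pairs are two edits apart, which is why the check prints them. Why they end up in the one-edit set depends on replace_letter from Part 1, which is not shown in this section.

```python
def unit_edit_distance(source, target):
    # Row-by-row dynamic programming; insert, delete and replace all cost 1.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            replace = previous[j - 1] + (s_char != t_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, replace))
        previous = current
    return previous[-1]


print(unit_edit_distance("eer", "ear"))  # 1  (one letter replaced)
print(unit_edit_distance("eer", "iir"))  # 2  (both e's replaced)
```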
Minimum edit distance check for two-letter edits:
source = "eer"
targets = edit_two_letters(source,allow_switches = False) #disable switches since min_edit_distance does not include them
for t in targets:
    _, min_edits = min_edit_distance(source, t,1,1,1) # set ins, del, sub costs all to one
    if min_edits != 2 and min_edits != 1: print(source, t, min_edits)
Output (truncated because the full listing is too long):
eer zzrs 3
eer qppr 3
eer aary 3
eer nnkr 3
eer pprf 3
eer zxzr 3
eer qkqr 3
eer iio 3
eer llrg 3
eer ikir 3
5. Backtrace (Optional)
Omitted.