05. Sequence Models W3. Sequence Models and Attention Mechanism (Assignments: Machine Translation + Trigger Word Detection)

Table of Contents

  • Assignment 1: Machine Translation
    • 1. Date Conversion
      • 1.1 Dataset
    • 2. Machine Translation with an Attention Model
      • 2.1 Attention Mechanism
    • 3. Visualizing Attention
  • Assignment 2: Trigger Word Detection
    • 1. Data Synthesis: Creating a Speech Dataset
      • 1.1 Listening to the Data
      • 1.2 Audio to Spectrogram
      • 1.3 Generating a Single Training Example
      • 1.4 Full Training Set
      • 1.5 Dev Set
    • 2. Model
      • 2.1 Building the Model
      • 2.2 Training
      • 2.3 Evaluating the Model
    • 3. Prediction
      • 3.3 Testing on the Dev Set
    • 4. Testing with Your Own Sample

Quiz: see the reference blog post

Notes: W3. Sequence Models and Attention Mechanism

Assignment 1: Machine Translation

Build a neural machine translation (NMT) model that translates human-readable dates ("25th of June, 2009") into machine-readable dates ("2009-06-25").

We will do this with an attention model, one of the most sophisticated sequence-to-sequence models.

Note: install the required packages first

pip install Faker==2.0.0
pip install babel
  • Import packages
from keras.layers import Bidirectional, Concatenate, Permute, Dot, Input, LSTM, Multiply
from keras.layers import RepeatVector, Dense, Activation, Lambda
from keras.optimizers import Adam
from keras.utils import to_categorical
from keras.models import load_model, Model
import keras.backend as K
import numpy as np
from faker import Faker
import random
from tqdm import tqdm
from babel.dates import format_date
from nmt_utils import *
import matplotlib.pyplot as plt
%matplotlib inline

1. Date Conversion

The model takes dates written in a variety of possible formats (e.g. "the 29th of August 1958", "03/30/1968", "24 JUNE 1987") and translates them into standardized, machine-readable dates (e.g. "1958-08-29", "1968-03-30", "1987-06-24"). We want the model to learn to output dates in the common machine-readable format YYYY-MM-DD.
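Such (human-readable, machine-readable) pairs are presumably synthesized with Faker and babel, the two packages installed above; the actual generation lives in nmt_utils.load_dataset. A minimal sketch of how one pair might be produced (the FORMATS list and the helper name below are illustrative assumptions, not the notebook's code; Faker, format_date, and random are already imported above):

fake = Faker()
FORMATS = ['short', 'medium', 'long', 'full',
           'd MMM yyyy', 'd MMMM yyyy', 'dd/MM/yyyy', 'EEEE d MMMM yyyy']

def make_date_pair():
    dt = fake.date_object()                                    # a random datetime.date
    human = format_date(dt, format=random.choice(FORMATS), locale='en_US')
    human = human.lower().replace(',', '')                     # normalize like the assignment data
    machine = dt.isoformat()                                   # 'YYYY-MM-DD'
    return human, machine

print(make_date_pair())   # e.g. ('9 may 1998', '1998-05-09')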

1.1 Dataset

  • 10,000 examples
m = 10000
dataset, human_vocab, machine_vocab, inv_machine_vocab = load_dataset(m)
  • Print a few examples
dataset[:10]

Output:

[('9 may 1998', '1998-05-09'),
 ('10.11.19', '2019-11-10'),
 ('9/10/70', '1970-09-10'),
 ('saturday april 28 1990', '1990-04-28'),
 ('thursday january 26 1995', '1995-01-26'),
 ('monday march 7 1983', '1983-03-07'),
 ('sunday may 22 1988', '1988-05-22'),
 ('08 jul 2008', '2008-07-08'),
 ('8 sep 1999', '1999-09-08'),
 ('thursday january 1 1981', '1981-01-01')]

The call above loads:

  • dataset: a list of (human-readable date, machine-readable date) tuples
  • human_vocab: a dict mapping characters used in human-readable dates to integer indices
  • machine_vocab: a dict mapping characters used in machine-readable dates to integer indices
  • inv_machine_vocab: a dict, the inverse mapping of machine_vocab, from indices back to characters
Tx = 30 # maximum input length; longer inputs are truncated
Ty = 10 # output date length: YYYY-MM-DD
X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)
print("X.shape:", X.shape)
print("Y.shape:", Y.shape)
print("Xoh.shape:", Xoh.shape)
print("Yoh.shape:", Yoh.shape)

Output:

X.shape: (10000, 30)
Y.shape: (10000, 10)
Xoh.shape: (10000, 30, 37) # 37 = len(human_vocab)
Yoh.shape: (10000, 10, 11) # 11 = number of characters in machine-readable dates: digits 0-9 and '-'
  • Inspect one example (inputs shorter than Tx are padded, so every x has length 30)
index = 52
print("Source date:", dataset[index][0])
print("Target date:", dataset[index][1])
print()
print("Source after preprocessing (indices):", X[index])
print("Target after preprocessing (indices):", Y[index])
print()
print("Source after preprocessing (one-hot):", Xoh[index])
print("Target after preprocessing (one-hot):", Yoh[index])

Output:

Source date: saturday october 9 1976
Target date: 1976-10-09

Source after preprocessing (indices): [29 13 30 31 28 16 13 34  0 26 15 30 26 14 17 28  0 12  0  4 12 10  9 36 36 36 36 36 36 36]
Target after preprocessing (indices): [ 2 10  8  7  0  2  1  0  1 10]

Source after preprocessing (one-hot):
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]]
Target after preprocessing (one-hot):
[[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
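The per-string conversion shown above (lower-casing, truncating to Tx characters, mapping each character through human_vocab, and padding up to length Tx) is handled by string_to_int in nmt_utils. A plausible sketch of that helper; the '<unk>' and '<pad>' token names are assumptions:

def string_to_int(string, length, vocab):
    string = string.lower().replace(',', '')
    if len(string) > length:                  # truncate inputs longer than Tx
        string = string[:length]
    rep = [vocab.get(ch, vocab['<unk>']) for ch in string]
    if len(string) < length:                  # pad short inputs up to Tx
        rep += [vocab['<pad>']] * (length - len(string))
    return rep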

2. Machine Translation with an Attention Model

2.1 Attention Mechanism


$$context^{<t>} = \sum_{t' = 0}^{T_x} \alpha^{<t,t'>} a^{<t'>}$$
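The attention weights are a softmax over the "energies" $e^{<t,t'>}$ produced by the small dense network defined below, so that the weights for each output step sum to 1 over the input time steps (this is the formula from the lecture):

$$\alpha^{<t,t'>} = \frac{\exp(e^{<t,t'>})}{\sum_{t'' = 0}^{T_x} \exp(e^{<t,t''>})}$$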

RepeatVector (repeat the input several times): https://keras.io/zh/layers/core/#repeatvector
Concatenate (concatenate input tensors along an axis): https://keras.io/zh/layers/merge/#concatenate_1
Bidirectional: https://keras.io/zh/layers/wrappers/#bidirectional

# Defined shared layers as global variables
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation = "tanh")
densor2 = Dense(1, activation = "relu")
activator = Activation(softmax, name='attention_weights') 
# We are using a custom softmax(axis = 1) loaded in this notebook
dotor = Dot(axes = 1)
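The `softmax` passed to `Activation` here (and later to the output `Dense` layer) is not the built-in Keras activation but a custom softmax over axis 1 (the Tx axis) that the notebook loads from nmt_utils. A sketch of what that helper presumably looks like (K is keras.backend, imported above):

def softmax(x, axis=1):
    """Softmax over the given axis (axis=1 here, i.e. over the Tx time steps),
    so that the attention weights for each output step sum to 1."""
    ndim = K.ndim(x)
    if ndim >= 2:
        e = K.exp(x - K.max(x, axis=axis, keepdims=True))   # subtract max for numerical stability
        s = K.sum(e, axis=axis, keepdims=True)
        return e / s
    raise ValueError('Cannot apply softmax to a 1D tensor')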
  • One step of attention
# GRADED FUNCTION: one_step_attention

def one_step_attention(a, s_prev):
    """
    Performs one step of attention: Outputs a context vector computed as a dot product of the
    attention weights "alphas" and the hidden states "a" of the Bi-LSTM.

    Arguments:
    a -- hidden state output of the Bi-LSTM, numpy-array of shape (m, Tx, 2*n_a)
    s_prev -- previous hidden state of the (post-attention) LSTM, numpy-array of shape (m, n_s)

    Returns:
    context -- context vector, input of the next (post-attention) LSTM cell
    """

    ### START CODE HERE ###
    # Use repeator to repeat s_prev to be of shape (m, Tx, n_s) so that you can concatenate it with all hidden states "a" (≈ 1 line)
    s_prev = repeator(s_prev)
    # Use concatenator to concatenate a and s_prev on the last axis (≈ 1 line)
    concat = concatenator(inputs=[a, s_prev])
    # Use densor1 to propagate concat through a small fully-connected neural network to compute the "intermediate energies" variable e. (≈ 1 line)
    e = densor1(concat)
    # Use densor2 to propagate e through a small fully-connected neural network to compute the "energies" variable energies. (≈ 1 line)
    energies = densor2(e)
    # Use "activator" on "energies" to compute the attention weights "alphas" (≈ 1 line)
    alphas = activator(energies)
    # Use dotor together with "alphas" and "a" to compute the context vector to be given to the next (post-attention) LSTM-cell (≈ 1 line)
    context = dotor([alphas, a])
    ### END CODE HERE ###

    return context
n_a = 32
n_s = 64
post_activation_LSTM_cell = LSTM(n_s, return_state = True)
output_layer = Dense(len(machine_vocab), activation=softmax)
# GRADED FUNCTION: model

def model(Tx, Ty, n_a, n_s, human_vocab_size, machine_vocab_size):
    """
    Arguments:
    Tx -- length of the input sequence
    Ty -- length of the output sequence
    n_a -- hidden state size of the Bi-LSTM
    n_s -- hidden state size of the post-attention LSTM
    human_vocab_size -- size of the python dictionary "human_vocab"
    machine_vocab_size -- size of the python dictionary "machine_vocab"

    Returns:
    model -- Keras model instance
    """

    # Define the inputs of your model with a shape (Tx,)
    # Define s0 and c0, initial hidden state for the decoder LSTM of shape (n_s,)
    X = Input(shape=(Tx, human_vocab_size))
    s0 = Input(shape=(n_s,), name='s0')
    c0 = Input(shape=(n_s,), name='c0')
    s = s0
    c = c0

    # Initialize empty list of outputs
    outputs = []

    ### START CODE HERE ###

    # Step 1: Define your pre-attention Bi-LSTM. Remember to use return_sequences=True. (≈ 1 line)
    a = Bidirectional(LSTM(n_a, return_sequences=True))(X)

    # Step 2: Iterate for Ty steps
    for t in range(Ty):

        # Step 2.A: Perform one step of the attention mechanism to get back the context vector at step t (≈ 1 line)
        context = one_step_attention(a, s)

        # Step 2.B: Apply the post-attention LSTM cell to the "context" vector.
        # Don't forget to pass: initial_state = [hidden state, cell state] (≈ 1 line)
        s, _, c = post_activation_LSTM_cell(context, initial_state=[s, c])

        # Step 2.C: Apply Dense layer to the hidden state output of the post-attention LSTM (≈ 1 line)
        out = output_layer(s)

        # Step 2.D: Append "out" to the "outputs" list (≈ 1 line)
        outputs.append(out)

    # Step 3: Create model instance taking three inputs and returning the list of outputs. (≈ 1 line)
    model = Model(inputs=[X, s0, c0], outputs=outputs)

    ### END CODE HERE ###

    return model
  • Build the model
model = model(Tx, Ty, n_a, n_s, len(human_vocab), len(machine_vocab))
  • Define the optimizer and compile the model
### START CODE HERE ### (≈2 lines)
opt = Adam(learning_rate=0.005, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
### END CODE HERE ###
  • Train
s0 = np.zeros((m, n_s))
c0 = np.zeros((m, n_s))
outputs = list(Yoh.swapaxes(0,1))
model.fit([Xoh, s0, c0], outputs, epochs=1, batch_size=100)
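The model returns a list of Ty separate outputs, so the targets must also be given as a list of Ty arrays of shape (m, len(machine_vocab)); that is exactly what the swapaxes/list conversion above produces. A quick shape check (shapes assume the m = 10000 examples loaded earlier):

print(Yoh.shape)                         # (10000, 10, 11): (m, Ty, len(machine_vocab))
print(len(outputs), outputs[0].shape)    # 10 (10000, 11): Ty arrays, one per output step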
  • To save time, pre-trained weights are provided
model.load_weights('models/model.h5')
EXAMPLES = ['5th Otc 2019', '5 April 09', '21th of August 2016', 'Tue 10 Jul 2007', 'Saturday May 9 2018', 'March 3 2001', 'March 3rd 2001', '1 March 2001']
for example in EXAMPLES:
    source = string_to_int(example, Tx, human_vocab)
    source = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), source))).swapaxes(0,1)
    source = source.transpose()                 # swap the two axes back
    source = np.expand_dims(source, axis=0)     # add a batch dimension
    prediction = model.predict([source, s0, c0])
    prediction = np.argmax(prediction, axis = -1)
    output = [inv_machine_vocab[int(i)] for i in prediction]
    print("source:", example)
    print("output:", ''.join(output))

Output:

source: 5th Otc 2019
output: 2019-10-05
source: 5 April 09
output: 2009-04-05
source: 21th of August 2016
output: 2016-08-20
source: Tue 10 Jul 2007
output: 2007-07-10
source: Saturday May 9 2018
output: 2018-05-09
source: March 3 2001
output: 2001-03-03
source: March 3rd 2001
output: 2001-03-03
source: 1 March 2001
output: 2001-03-01

3. Visualizing Attention

attention_map = plot_attention_map(model, human_vocab, inv_machine_vocab, "Tuesday 09 Oct 1993", num = 7, n_s = 64)


You can see that most of the attention is used when predicting the year.

Assignment 2: Trigger Word Detection

  • Import packages
import numpy as np
from pydub import AudioSegment
import random
import sys
import io
import os
import glob
import IPython
from td_utils import *
%matplotlib inline

1. Data Synthesis: Creating a Speech Dataset

1.1 Listening to the Data

There are three kinds of clips: positive audio ("activates", i.e. the trigger word), negative audio (words that are not the trigger word), and background noise.

IPython.display.Audio("./raw_data/backgrounds/1.wav")

1.2 Audio to Spectrogram

The audio is sampled at 44100 Hz and each clip is 10 seconds long.

x = graph_spectrogram("audio_examples/example_train.wav")



Each training example in this assignment is 10 seconds long and its spectrogram has 5511 time steps, so $T_x = 5511$.

_, data = wavfile.read("audio_examples/example_train.wav")
print("Time steps in audio recording before spectrogram", data[:,0].shape)
print("Time steps in input after spectrogram", x.shape)

Output:

Time steps in audio recording before spectrogram (441000,)
Time steps in input after spectrogram (101, 5511)
  • Define parameters
Tx = 5511 # The number of time steps input to the model from the spectrogram
n_freq = 101 # Number of frequencies input to the model at each time step of the spectrogram
Ty = 1375 # The number of time steps in the output of our model
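Where do 5511 and 101 come from? graph_spectrogram in td_utils computes the spectrogram with matplotlib. With a window of 200 samples and an overlap of 120, the 441,000 samples of a 10 s clip yield (441000 - 120) / (200 - 120) = 5511 time steps, and 200 / 2 + 1 = 101 frequency bins. The sketch below is an approximation of that helper; the parameter values are taken from the assignment, but treat the code as illustrative rather than the verbatim td_utils source:

from scipy.io import wavfile
import matplotlib.pyplot as plt

def graph_spectrogram_sketch(wav_file):
    """Compute (and plot) a spectrogram roughly the way td_utils.graph_spectrogram does."""
    rate, data = wavfile.read(wav_file)
    nfft = 200        # length of each window segment
    fs = 8000         # sampling frequency passed to specgram
    noverlap = 120    # overlap between windows
    if data.ndim == 2:            # stereo recording: keep a single channel
        data = data[:, 0]
    pxx, freqs, bins, im = plt.specgram(data, NFFT=nfft, Fs=fs, noverlap=noverlap)
    return pxx                    # shape (101, 5511) for a 10 s, 44100 Hz clip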

1.3 Generating a Single Training Example

  • Randomly pick a 10-second background noise clip
  • Randomly insert 0-4 "activate" (trigger word) clips into it
  • Randomly insert 0-2 negative (non-trigger-word) clips into it
# Load audio segments using pydub 
activates, negatives, backgrounds = load_raw_audio()

print("background len: " + str(len(backgrounds[0])))    # Should be 10,000, since it is a 10 sec clip
print("activate[0] len: " + str(len(activates[0])))     # Maybe around 1000, since an "activate" audio clip is usually around 1 sec (but varies a lot)
print("activate[1] len: " + str(len(activates[1])))     # Different "activate" clips can have different lengths 

Output:

background len: 10000
activate[0] len: 721
activate[1] len: 731
  • Get a random time segment from the background audio
def get_random_time_segment(segment_ms):
    """
    Gets a random time segment of duration segment_ms in a 10,000 ms audio clip.

    Arguments:
    segment_ms -- the duration of the audio clip in ms ("ms" stands for "milliseconds")

    Returns:
    segment_time -- a tuple of (segment_start, segment_end) in ms
    """

    segment_start = np.random.randint(low=0, high=10000-segment_ms)   # Make sure segment doesn't run past the 10sec background
    segment_end = segment_start + segment_ms - 1

    return (segment_start, segment_end)
  • Check whether the inserted clip overlaps existing segments
# GRADED FUNCTION: is_overlapping

def is_overlapping(segment_time, previous_segments):
    """
    Checks if the time of a segment overlaps with the times of existing segments.

    Arguments:
    segment_time -- a tuple of (segment_start, segment_end) for the new segment
    previous_segments -- a list of tuples of (segment_start, segment_end) for the existing segments

    Returns:
    True if the time segment overlaps with any of the existing segments, False otherwise
    """

    segment_start, segment_end = segment_time

    ### START CODE HERE ### (≈ 4 lines)
    # Step 1: Initialize overlap as a "False" flag. (≈ 1 line)
    overlap = False

    # Step 2: loop over the previous_segments start and end times.
    # Compare start/end times and set the flag to True if there is an overlap (≈ 3 lines)
    for previous_start, previous_end in previous_segments:
        if previous_end >= segment_start and previous_start <= segment_end:
            overlap = True
    ### END CODE HERE ###

    return overlap
  • Insert an audio clip
# GRADED FUNCTION: insert_audio_clip

def insert_audio_clip(background, audio_clip, previous_segments):
    """
    Insert a new audio segment over the background noise at a random time step, ensuring that the
    audio segment does not overlap with existing segments.

    Arguments:
    background -- a 10 second background audio recording.
    audio_clip -- the audio clip to be inserted/overlaid.
    previous_segments -- times where audio segments have already been placed

    Returns:
    new_background -- the updated background audio
    """

    # Get the duration of the audio clip in ms
    segment_ms = len(audio_clip)

    ### START CODE HERE ###
    # Step 1: Use one of the helper functions to pick a random time segment onto which to insert
    # the new audio clip. (≈ 1 line)
    segment_time = get_random_time_segment(segment_ms)

    # Step 2: Check if the new segment_time overlaps with one of the previous_segments. If so, keep
    # picking new segment_time at random until it doesn't overlap. (≈ 2 lines)
    while is_overlapping(segment_time, previous_segments):
        segment_time = get_random_time_segment(segment_ms)

    # Step 3: Add the new segment_time to the list of previous_segments (≈ 1 line)
    previous_segments.append(segment_time)
    ### END CODE HERE ###

    # Step 4: Superpose audio segment and background
    new_background = background.overlay(audio_clip, position = segment_time[0])

    return new_background, segment_time
  • Insert label 1s
# GRADED FUNCTION: insert_ones

def insert_ones(y, segment_end_ms):
    """
    Update the label vector y. The labels of the 50 output steps strictly after the end of the segment
    should be set to 1. By strictly we mean that the label of segment_end_y should be 0 while the
    50 following labels should be ones.

    Arguments:
    y -- numpy array of shape (1, Ty), the labels of the training example
    segment_end_ms -- the end time of the segment in ms

    Returns:
    y -- updated labels
    """

    # duration of the background (in terms of spectrogram time-steps)
    segment_end_y = int(segment_end_ms * Ty / 10000.0)

    # Add 1 to the correct index in the background label (y)
    ### START CODE HERE ### (≈ 3 lines)
    for i in range(segment_end_y + 1, segment_end_y + 51):
        if i < Ty:
            y[0, i] = 1
    ### END CODE HERE ###

    return y
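A quick sanity check of the labeling (illustrative values: an "activate" clip ending at 9700 ms maps to output step int(9700 * 1375 / 10000) = 1333, so steps 1334-1374 are labeled 1 and the rest of the 50-step window is clipped at Ty):

arr1 = insert_ones(np.zeros((1, Ty)), 9700)
plt.plot(arr1[0, :])
print(arr1[0][1333], arr1[0][1334], arr1[0][1374])   # 0.0 1.0 1.0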
  • Synthesize a training example
# GRADED FUNCTION: create_training_example

def create_training_example(background, activates, negatives):
    """
    Creates a training example with a given background, activates, and negatives.

    Arguments:
    background -- a 10 second background audio recording
    activates -- a list of audio segments of the word "activate"
    negatives -- a list of audio segments of random words that are not "activate"

    Returns:
    x -- the spectrogram of the training example
    y -- the label at each time step of the spectrogram
    """

    # Set the random seed
    np.random.seed(18)

    # Make background quieter
    background = background - 20

    ### START CODE HERE ###
    # Step 1: Initialize y (label vector) of zeros (≈ 1 line)
    y = np.zeros((1, Ty))

    # Step 2: Initialize segment times as empty list (≈ 1 line)
    previous_segments = []
    ### END CODE HERE ###

    # Select 0-4 random "activate" audio clips from the entire list of "activates" recordings
    number_of_activates = np.random.randint(0, 5)
    random_indices = np.random.randint(len(activates), size=number_of_activates)
    random_activates = [activates[i] for i in random_indices]

    ### START CODE HERE ### (≈ 3 lines)
    # Step 3: Loop over randomly selected "activate" clips and insert in background
    for random_activate in random_activates:
        # Insert the audio clip on the background
        background, segment_time = insert_audio_clip(background, random_activate, previous_segments)
        # Retrieve segment_start and segment_end from segment_time
        segment_start, segment_end = segment_time
        # Insert labels in "y"
        y = insert_ones(y, segment_end)
    ### END CODE HERE ###

    # Select 0-2 random negatives audio recordings from the entire list of "negatives" recordings
    number_of_negatives = np.random.randint(0, 3)
    random_indices = np.random.randint(len(negatives), size=number_of_negatives)
    random_negatives = [negatives[i] for i in random_indices]

    ### START CODE HERE ### (≈ 2 lines)
    # Step 4: Loop over randomly selected negative clips and insert in background
    for random_negative in random_negatives:
        # Insert the audio clip on the background
        background, _ = insert_audio_clip(background, random_negative, previous_segments)
    ### END CODE HERE ###

    # Standardize the volume of the audio clip
    background = match_target_amplitude(background, -20.0)

    # Export new training example
    file_handle = background.export("train" + ".wav", format="wav")
    print("File (train.wav) was saved in your directory.")

    # Get and plot spectrogram of the new recording (background with superposition of positive and negatives)
    x = graph_spectrogram("train.wav")

    return x, y
x, y = create_training_example(backgrounds[0], activates, negatives)

plt.plot(y[0])

1.4 Full Training Set

All of the training examples have already been preprocessed for us

# Load preprocessed training examples
X = np.load("./XY_train/X.npy")
Y = np.load("./XY_train/Y.npy")

1.5 Dev Set

The dev set uses audio recorded by real people

# Load preprocessed dev set examples
X_dev = np.load("./XY_dev/X_dev.npy")
Y_dev = np.load("./XY_dev/Y_dev.npy")

2. Model

  • Import packages
from keras.callbacks import ModelCheckpoint
from keras.models import Model, load_model, Sequential
from keras.layers import Dense, Activation, Dropout, Input, Masking, TimeDistributed, LSTM, Conv1D
from keras.layers import GRU, Bidirectional, BatchNormalization, Reshape
from keras.optimizers import Adam

2.1 Building the Model


The model starts with a 1-D convolution that extracts low-level features from the spectrogram and also speeds up the computation: the GRUs only have to process 1375 time steps instead of 5511 (see the quick calculation below).

Note: do not use a bidirectional RNN here. We want to trigger an action as soon as the trigger word is detected; with a bidirectional RNN we would have to wait for the full 10 s of audio to be recorded before making a prediction.
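How 5511 becomes 1375: with the Conv1D hyperparameters used below (kernel size 15, stride 4, 'valid' padding), the output length is floor((5511 - 15) / 4) + 1 = 1375, which is exactly Ty. A quick check:

# 'valid' Conv1D output length: floor((input_len - kernel_size) / stride) + 1
print((Tx - 15) // 4 + 1)   # (5511 - 15) // 4 + 1 = 1375 == Ty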

  • Some Keras references

conv1d https://keras.io/zh/layers/convolutional/#conv1d

BN https://keras.io/zh/layers/normalization/#batchnormalization

GRU https://keras.io/zh/layers/recurrent/#gru

timedistributed https://keras.io/zh/layers/wrappers/#timedistributed

# GRADED FUNCTION: model

def model(input_shape):
    """
    Function creating the model's graph in Keras.

    Argument:
    input_shape -- shape of the model's input data (using Keras conventions)

    Returns:
    model -- Keras model instance
    """

    X_input = Input(shape = input_shape)

    ### START CODE HERE ###

    # Step 1: CONV layer (≈4 lines)
    X = Conv1D(filters=196, kernel_size=15, strides=4)(X_input)   # CONV1D
    X = BatchNormalization()(X)                                   # Batch normalization
    X = Activation('relu')(X)                                     # ReLu activation
    X = Dropout(rate=0.8)(X)                                      # dropout (use 0.8)

    # Step 2: First GRU Layer (≈4 lines)
    X = GRU(128, return_sequences=True)(X)   # GRU (use 128 units and return the sequences)
    X = Dropout(rate=0.8)(X)                 # dropout (use 0.8)
    X = BatchNormalization()(X)              # Batch normalization

    # Step 3: Second GRU Layer (≈4 lines)
    X = GRU(128, return_sequences=True)(X)   # GRU (use 128 units and return the sequences)
    X = Dropout(rate=0.8)(X)                 # dropout (use 0.8)
    X = BatchNormalization()(X)              # Batch normalization
    X = Dropout(rate=0.8)(X)                 # dropout (use 0.8)

    # Step 4: Time-distributed dense layer (≈1 line)
    X = TimeDistributed(Dense(1, activation = "sigmoid"))(X)   # time distributed (sigmoid)

    ### END CODE HERE ###

    model = Model(inputs = X_input, outputs = X)

    return model
model = model(input_shape = (Tx, n_freq))
model.summary()

Output:

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         (None, 5511, 101)         0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 1375, 196)         297136    
_________________________________________________________________
batch_normalization_2 (Batch (None, 1375, 196)         784       
_________________________________________________________________
activation_2 (Activation)    (None, 1375, 196)         0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 1375, 196)         0         
_________________________________________________________________
gru_2 (GRU)                  (None, 1375, 128)         124800    
_________________________________________________________________
dropout_3 (Dropout)          (None, 1375, 128)         0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 1375, 128)         512       
_________________________________________________________________
gru_3 (GRU)                  (None, 1375, 128)         98688     
_________________________________________________________________
dropout_4 (Dropout)          (None, 1375, 128)         0         
_________________________________________________________________
batch_normalization_4 (Batch (None, 1375, 128)         512       
_________________________________________________________________
dropout_5 (Dropout)          (None, 1375, 128)         0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 1375, 1)           129       
=================================================================
Total params: 522,561
Trainable params: 521,657
Non-trainable params: 904

2.2 Training

Training is time-consuming, so a model that has already been trained on about 4000 examples is provided

model = load_model('./models/tr_model.h5')

We then train it for one more epoch on our dataset

opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
model.fit(X, Y, batch_size = 5, epochs=1)

2.3 Evaluating the Model

loss, acc = model.evaluate(X_dev, Y_dev)
print("Dev set accuracy = ", acc)

Output:

25/25 [==============================] - 1s 46ms/step
Dev set accuracy =  0.9427199959754944

However, accuracy is not a good metric here: most labels are 0, so a model that predicts 0 everywhere would still score high accuracy. Metrics such as the F1 score (or precision and recall) are more appropriate; a sketch is given below.
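A minimal sketch of computing precision, recall, and F1 over all dev-set time steps, assuming the per-step probabilities are binarized at a 0.5 threshold (the threshold and the element-wise evaluation are assumptions, not part of the assignment):

probs = model.predict(X_dev)                 # shape (num_examples, 1375, 1)
pred = (probs > 0.5).astype(np.float32)      # hypothetical 0.5 threshold

tp = np.sum((pred == 1) & (Y_dev == 1))      # true positives
fp = np.sum((pred == 1) & (Y_dev == 0))      # false positives
fn = np.sum((pred == 0) & (Y_dev == 1))      # false negatives

precision = tp / (tp + fp + 1e-8)
recall = tp / (tp + fn + 1e-8)
f1 = 2 * precision * recall / (precision + recall + 1e-8)
print("precision = %.3f, recall = %.3f, F1 = %.3f" % (precision, recall, f1))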

3. Prediction

def detect_triggerword(filename):
    plt.subplot(2, 1, 1)

    x = graph_spectrogram(filename)
    # the spectrogram outputs (freqs, Tx) and we want (Tx, freqs) to input into the model
    x = x.swapaxes(0, 1)
    x = np.expand_dims(x, axis=0)
    predictions = model.predict(x)

    plt.subplot(2, 1, 2)
    plt.plot(predictions[0, :, 0])
    plt.ylabel('probability')
    plt.show()
    return predictions

Once the probability of having detected the word "activate" has been estimated at every output step, you can play a "chiming" sound whenever that probability rises above some threshold. Also, y can be close to 1 for many consecutive steps right after "activate" is said, but we only want to chime once, so we insert a chime at most once every 75 output steps. This prevents two chimes for a single instance of "activate" (similar to non-max suppression in computer vision).

chime_file = "audio_examples/chime.wav"
def chime_on_activate(filename, predictions, threshold):
    audio_clip = AudioSegment.from_wav(filename)
    chime = AudioSegment.from_wav(chime_file)
    Ty = predictions.shape[1]
    # Step 1: Initialize the number of consecutive output steps to 0
    consecutive_timesteps = 0
    # Step 2: Loop over the output steps in the y
    for i in range(Ty):
        # Step 3: Increment consecutive output steps
        consecutive_timesteps += 1
        # Step 4: If prediction is higher than the threshold and more than 75 consecutive output steps have passed
        if predictions[0, i, 0] > threshold and consecutive_timesteps > 75:
            # Step 5: Superpose audio and background using pydub
            audio_clip = audio_clip.overlay(chime, position = ((i / Ty) * audio_clip.duration_seconds) * 1000)
            # Step 6: Reset consecutive output steps to 0
            consecutive_timesteps = 0

    audio_clip.export("chime_output.wav", format='wav')

3.3 Testing on the Dev Set

  • First clip, which contains 1 trigger
filename = "./raw_data/dev/1.wav"
prediction = detect_triggerword(filename)
chime_on_activate(filename, prediction, 0.5)
IPython.display.Audio("./chime_output.wav")

  • Second clip, which contains 2 triggers

4. Testing with Your Own Sample

# Preprocess the audio to the correct format
def preprocess_audio(filename):
    # Trim or pad audio segment to 10000ms
    padding = AudioSegment.silent(duration=10000)
    segment = AudioSegment.from_wav(filename)[:10000]
    segment = padding.overlay(segment)
    # Set frame rate to 44100
    segment = segment.set_frame_rate(44100)
    # Export as wav
    segment.export(filename, format='wav')
your_filename = "audio_examples/my_audio.wav"
preprocess_audio(your_filename)
IPython.display.Audio(your_filename) # listen to the audio you uploaded 
chime_threshold = 0.5
prediction = detect_triggerword(your_filename)
chime_on_activate(your_filename, prediction, chime_threshold)
IPython.display.Audio("./chime_output.wav")


Original article: https://michael.blog.csdn.net/article/details/108933798

My CSDN blog: https://michael.blog.csdn.net/

