Importing the Necessary Libraries
These are standard Python import statements:
```python
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
# LogisticRegression is imported from sklearn.linear_model; the private
# sklearn.linear_model.logistic module path is not available in newer releases.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.utils import shuffle
from sklearn.metrics import precision_score, classification_report, accuracy_score

from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelEncoder

import re
import time
```
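Retrieving and Parsing the Data
Most of my time on this challenge went into figuring out how to parse the data efficiently: extracting the language name from each text sample, and then removing that information from the text so that it would not contaminate our training and test datasets.
Here is an example of two text strings/segments (each spans multiple lines and contains carriage returns):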
```
<pre lang="Swift">
@objc func handleTap(sender: UITapGestureRecognizer) {
    if let tappedSceneView = sender.view as? ARSCNView {
        let tapLocationInView = sender.location(in: tappedSceneView)
        let planeHitTest = tappedSceneView.hitTest(tapLocationInView,
            types: .existingPlaneUsingExtent)
        if !planeHitTest.isEmpty {
            addFurniture(hitTest: planeHitTest)
        }
    }
}</pre>
<pre lang="JavaScript">
var my_dataset = [
    {
        id: "1",
        text: "Chairman & CEO",
        title: "Henry Bennett"
    },
    {
        id: "2",
        text: "Manager",
        title: "Mildred Kim"
    },
    {
        id: "3",
        text: "Technical Director",
        title: "Jerry Wagner"
    },
    { id: "1-2", from: "1", to: "2", type: "line" },
    { id: "1-3", from: "1", to: "3", type: "line" }
];</pre>
```
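The tricky part was getting a regular expression to return the data inside the "<pre lang...></pre>" tags, and then creating another regular expression to return only the "lang" part of the "pre" tag.
It isn't pretty, and I'm sure it could be optimized, but it works: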
```python
def get_data():
    file_name = './LanguageSamples.txt'
    # Use a context manager so the file handle is always closed
    with open(file_name, 'r') as rawdata:
        lines = rawdata.readlines()
    return lines

def clean_data(input_lines):
    # find matches for all data within the pre tags
    all_found = re.findall(r'<pre[\s\S]*?<\/pre>', input_lines, re.MULTILINE)

    # clean the string of various tags: decode the escaped angle brackets,
    # drop the closing pre tag, and remove newlines
    clean_string = lambda x: x.replace('&lt;', '<').replace('&gt;', '>').replace('</pre>', '').replace('\n', '')
    all_found = [clean_string(item) for item in all_found]

    # get the language for all of the pre tags
    get_language = lambda x: re.findall(r'<pre lang="(.*?)">', x, re.MULTILINE)[0]
    lang_items = [get_language(item) for item in all_found]

    # remove all of the pre tags that contain the language
    remove_lang = lambda x: re.sub(r'<pre lang="(.*?)">', "", x)
    all_found = [remove_lang(item) for item in all_found]

    # return the text between the pre tags and their corresponding language
    return (all_found, lang_items)
```
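To make the two regular-expression steps easier to follow, here is a minimal sketch that runs them on a tiny, made-up input. The `sample_text` string and the combined `block_pattern` are assumptions for illustration only. The one-pass pattern is a variant, not the approach used in the code above, which extracts and removes the language in separate steps.

```python
import re

# Hypothetical two-sample input in the same shape as the examples above.
sample_text = (
    '<pre lang="Swift">\n'
    'let x = 1\n'
    '</pre>\n'
    '<pre lang="JavaScript">\n'
    'var y = 2;\n'
    '</pre>\n'
)

# Step 1: a non-greedy match returns each <pre ...>...</pre> block,
# with [\s\S] letting the match run across newlines.
blocks = re.findall(r'<pre[\s\S]*?</pre>', sample_text)

# Step 2 (variant): capture the lang attribute and the body in one pass,
# so the language name never remains inside the sample text.
block_pattern = re.compile(r'<pre lang="(.*?)">([\s\S]*?)</pre>')
pairs = [(m.group(1), m.group(2).strip()) for m in block_pattern.finditer(sample_text)]

print(len(blocks))  # 2
print(pairs)        # [('Swift', 'let x = 1'), ('JavaScript', 'var y = 2;')]
```

Capturing both groups at once keeps each label paired with its own sample, which may be easier to reason about than two parallel lists.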
Creating the Pandas DataFrame
Here we get the data, create a DataFrame, and populate it with the data.
```python
all_samples = ''.join(get_data())
cleaned_data, languages = clean_data(all_samples)

df = pd.DataFrame()
df['lang_text'] = languages
df['data'] = cleaned_data
```
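Here is what our DataFrame looks like: one row per code sample, with the language name in "lang_text" and the cleaned code in "data".
Creating the Categorical Column
The next thing we need to do is turn our "lang_text" column into a numeric column, since that is what many machine learning models expect for the "Y", or output, that they are trying to determine. To do this, we will use LabelEncoder to convert our "lang_text" column into a categorical column.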
```python
lb_enc = LabelEncoder()
df['language'] = lb_enc.fit_transform(df['lang_text'])
```
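Now our DataFrame looks like this: the same "lang_text" and "data" columns, plus a new integer "language" column.
We can see how the column was encoded by running the following: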
```python
lb_enc.classes_
```
This displays the following (the position in the array matches the integer value in the new "language" categorical column):
```
array(['ASM', 'ASP.NET', 'Angular', 'C#', 'C++', 'CSS', 'Delphi', 'HTML',
       'Java', 'JavaScript', 'Javascript', 'ObjectiveC', 'PERL', 'PHP',
       'Pascal', 'PowerShell', 'Powershell', 'Python', 'Razor', 'React',
       'Ruby', 'SQL', 'Scala', 'Swift', 'TypeScript', 'VB.NET', 'XML'], dtype=object)
```
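The mapping from label to integer can be imitated in a few lines of plain Python, which may help when reading the array above. This is only a sketch of the idea (sorted unique labels, position equals code), not LabelEncoder itself. The `labels` list is a made-up subset, and the lower-casing step at the end is an assumption about how one might merge case variants such as 'JavaScript' and 'Javascript', which appear as separate classes in the array.

```python
# Hypothetical subset of language labels, including a case variant.
labels = ['Python', 'JavaScript', 'Javascript', 'C#', 'Python']

# Sorted unique labels play the role of lb_enc.classes_;
# each label's position in that list is its integer code.
classes = sorted(set(labels))
code_for = {name: index for index, name in enumerate(classes)}
encoded = [code_for[name] for name in labels]

print(classes)  # ['C#', 'JavaScript', 'Javascript', 'Python']
print(encoded)  # [3, 1, 2, 0, 3]

# Assumption: if case variants should count as one language,
# normalizing the labels before encoding would merge them.
normalized_classes = sorted(set(name.lower() for name in labels))
print(normalized_classes)  # ['c#', 'javascript', 'python']
```

Because the comparison is case-sensitive, 'JavaScript' and 'Javascript' receive different codes, which is consistent with the 27 classes listed above.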
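Boilerplate Code
Here are the next steps:
- Declare a function for outputting the training results
- Declare a function for training and testing the models
- Declare a function for creating the models to test
- Shuffle the data
- Split the training and test data
- Pass the data and the models into the training and testing function and see the results: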
```python
def output_accuracy(actual_y, predicted_y, model_name, train_time, predict_time):
    # Print timing, overall accuracy, and the per-class report for one model
    print('Model Name: ' + model_name)
    print('Train time: ', round(train_time, 2))
    print('Predict time: ', round(predict_time, 2))
    print('Model Accuracy: {:.4f}'.format(accuracy_score(actual_y, predicted_y)))
    print('')
    print(classification_report(actual_y, predicted_y, digits=4))
    print("=======================================================")

def test_models(X_train_input_raw, y_train_input, X_test_input_raw, y_test_input, models_dict):
    return_trained_models = {}

    return_vectorizer = FeatureUnion([('tfidf_vect', TfidfVectorizer())])

    # Fit the vectorizer on the training text only, then reuse it for the test text
    X_train = return_vectorizer.fit_transform(X_train_input_raw)
    X_test = return_vectorizer.transform(X_test_input_raw)

    for key in models_dict:
        model_name = key
        model = models_dict[key]
        t1 = time.time()
        model.fit(X_train, y_train_input)
        t2 = time.time()
        predicted_y = model.predict(X_test)
        t3 = time.time()

        output_accuracy(y_test_input, predicted_y, model_name, t2 - t1, t3 - t2)
        return_trained_models[model_name] = model

    return (return_trained_models, return_vectorizer)

def create_models():
    models = {}
    models['LinearSVC'] = LinearSVC()
    models['LogisticRegression'] = LogisticRegression()
    models['RandomForestClassifier'] = RandomForestClassifier()
    models['DecisionTreeClassifier'] = DecisionTreeClassifier()
    models['MultinomialNB'] = MultinomialNB()
    return models

X_input, y_input = shuffle(df['data'], df['language'], random_state=7)
# test_size=0.7 holds out 70% of the samples for testing
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X_input, y_input, test_size=0.7)

models = create_models()
trained_models, fitted_vectorizer = test_models(X_train_raw, y_train, X_test_raw, y_test, models)
```
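Since `test_models` returns both the trained models and the fitted vectorizer, they could later be reused to classify a new snippet. The following is a minimal sketch of that idea on a made-up toy corpus. The `toy_*` names and the `predict_language` helper are hypothetical and not part of the article's code. With the article's objects, the equivalent inputs would presumably be `trained_models['LinearSVC']`, `fitted_vectorizer`, and `lb_enc`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; the article trains on LanguageSamples.txt instead.
toy_snippets = [
    "def add(a, b): return a + b",
    "import os print(os.getcwd())",
    "var total = 0; function add(a, b) { return a + b; }",
    "let items = []; console.log(items.length);",
]
toy_labels = ["Python", "Python", "JavaScript", "JavaScript"]

# Encode the labels and vectorize the text, mirroring the steps above.
toy_encoder = LabelEncoder()
toy_y = toy_encoder.fit_transform(toy_labels)
toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_snippets)

toy_model = LinearSVC()
toy_model.fit(toy_X, toy_y)

def predict_language(snippet, model, vectorizer, encoder):
    """Vectorize one snippet with the already-fitted vectorizer,
    predict its integer class, and map it back to the language name."""
    features = vectorizer.transform([snippet])
    predicted_code = model.predict(features)
    return encoder.inverse_transform(predicted_code)[0]

guess = predict_language("function greet() { console.log('hi'); }",
                         toy_model, toy_vectorizer, toy_encoder)
print(guess)
```

The important detail is that new text goes through `transform`, not `fit_transform`, so it is projected onto the vocabulary learned during training.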
The results look like this:
```
Model Name: LinearSVC
Train time: 0.99
Predict time: 0.0
Model Accuracy: 0.9262

            precision    recall  f1-score   support

          0    1.0000    1.0000    1.0000         6
          1    1.0000    1.0000    1.0000         2
          2    1.0000    1.0000    1.0000         1
          3    0.8968    1.0000    0.9456       339
          4    0.9695    0.8527    0.9074       224
          5    0.9032    1.0000    0.9492        28
          6    0.7000    1.0000    0.8235         7
          7    0.9032    0.7568    0.8235        74
          8    0.7778    0.5833    0.6667        36
          9    0.9613    0.9255    0.9430       161
         10    1.0000    0.5000    0.6667         6
         11    1.0000    1.0000    1.0000        14
         12    1.0000    1.0000    1.0000         5
         13    1.0000    1.0000    1.0000         2
         14    1.0000    0.4545    0.6250        11
         15    1.0000    1.0000    1.0000         6
         16    1.0000    0.4000    0.5714         5
         17    0.9589    0.9589    0.9589        73
         18    1.0000    1.0000    1.0000         8
         19    0.7600    0.9268    0.8352        41
         20    0.1818    1.0000    0.3077         2
         21    1.0000    1.0000    1.0000       137
         22    1.0000    0.8750    0.9333        24
         23    1.0000    1.0000    1.0000         7
         24    1.0000    1.0000    1.0000        25
         25    0.9571    0.9571    0.9571        70
         26    0.9211    0.9722    0.9459       108

avg / total    0.9339    0.9262    0.9255      1422

=========================================================================
Model Name: DecisionTreeClassifier
Train time: 0.13
Predict time: 0.0
Model Accuracy: 0.9388

            precision    recall  f1-score   support

          0    1.0000    1.0000    1.0000         6
          1    1.0000    1.0000    1.0000         2
          2    1.0000    1.0000    1.0000         1
          3    0.9123    0.9204    0.9163       339
          4    0.8408    0.9196    0.8785       224
          5    1.0000    0.8929    0.9434        28
          6    1.0000    1.0000    1.0000         7
          7    1.0000    0.9595    0.9793        74
          8    0.9091    0.8333    0.8696        36
          9    0.9817    1.0000    0.9908       161
         10    1.0000    0.5000    0.6667         6
         11    1.0000    1.0000    1.0000        14
         12    1.0000    1.0000    1.0000         5
         13    1.0000    1.0000    1.0000         2
         14    1.0000    0.4545    0.6250        11
         15    1.0000    0.5000    0.6667         6
         16    1.0000    0.4000    0.5714         5
         17    1.0000    1.0000    1.0000        73
         18    1.0000    1.0000    1.0000         8
         19    0.9268    0.9268    0.9268        41
         20    1.0000    1.0000    1.0000         2
         21    1.0000    1.0000    1.0000       137
         22    1.0000    0.7500    0.8571        24
         23    1.0000    1.0000    1.0000         7
         24    0.6786    0.7600    0.7170        25
         25    1.0000    1.0000    1.0000        70
         26    1.0000    1.0000    1.0000       108

avg / total    0.9419    0.9388    0.9376      1422

=========================================================================
Model Name: LogisticRegression
Train time: 0.71
Predict time: 0.01
Model Accuracy: 0.9304

            precision    recall  f1-score   support

          0    1.0000    1.0000    1.0000         6
          1    1.0000    1.0000    1.0000         2
          2    1.0000    1.0000    1.0000         1
          3    0.9040    1.0000    0.9496       339
          4    0.9569    0.8929    0.9238       224
          5    0.9032    1.0000    0.9492        28
          6    0.7000    1.0000    0.8235         7
          7    0.8929    0.6757    0.7692        74
          8    0.8750    0.5833    0.7000        36
          9    0.9281    0.9627    0.9451       161
         10    1.0000    0.5000    0.6667         6
         11    1.0000    1.0000    1.0000        14
         12    1.0000    1.0000    1.0000         5
         13    1.0000    1.0000    1.0000         2
         14    1.0000    0.4545    0.6250        11
         15    1.0000    1.0000    1.0000         6
         16    1.0000    0.4000    0.5714         5
         17    0.9589    0.9589    0.9589        73
         18    1.0000    1.0000    1.0000         8
         19    0.7600    0.9268    0.8352        41
         20    1.0000    1.0000    1.0000         2
         21    1.0000    0.9781    0.9889       137
         22    1.0000    0.8750    0.9333        24
         23    1.0000    1.0000    1.0000         7
         24    1.0000    1.0000    1.0000        25
         25    0.9571    0.9571    0.9571        70
         26    0.9211    0.9722    0.9459       108

avg / total    0.9329    0.9304    0.9272      1422

=========================================================================
Model Name: RandomForestClassifier
Train time: 0.04
Predict time: 0.01
Model Accuracy: 0.9374

            precision    recall  f1-score   support

          0    1.0000    1.0000    1.0000         6
          1    1.0000    1.0000    1.0000         2
          2    1.0000    1.0000    1.0000         1
          3    0.8760    1.0000    0.9339       339
          4    0.9452    0.9241    0.9345       224
          5    0.9032    1.0000    0.9492        28
          6    0.7000    1.0000    0.8235         7
          7    1.0000    0.8378    0.9118        74
          8    1.0000    0.5278    0.6909        36
          9    0.9527    1.0000    0.9758       161
         10    1.0000    0.1667    0.2857         6
         11    1.0000    1.0000    1.0000        14
         12    1.0000    1.0000    1.0000         5
         13    1.0000    1.0000    1.0000         2
         14    1.0000    0.4545    0.6250        11
         15    1.0000    0.5000    0.6667         6
         16    1.0000    0.4000    0.5714         5
         17    1.0000    1.0000    1.0000        73
         18    1.0000    0.6250    0.7692         8
         19    0.9268    0.9268    0.9268        41
         20    0.0000    0.0000    0.0000         2
         21    1.0000    1.0000    1.0000       137
         22    1.0000    1.0000    1.0000        24
         23    1.0000    0.5714    0.7273         7
         24    1.0000    1.0000    1.0000        25
         25    1.0000    0.9571    0.9781        70
         26    0.8889    0.8889    0.8889       108

avg / total    0.9411    0.9374    0.9324      1422

=========================================================================
Model Name: MultinomialNB
Train time: 0.01
Predict time: 0.0
Model Accuracy: 0.8776

            precision    recall  f1-score   support

          0    1.0000    1.0000    1.0000         6
          1    0.0000    0.0000    0.0000         2
          2    0.0000    0.0000    0.0000         1
          3    0.8380    0.9764    0.9019       339
          4    1.0000    0.8750    0.9333       224
          5    1.0000    1.0000    1.0000        28
          6    1.0000    1.0000    1.0000         7
          7    0.6628    0.7703    0.7125        74
          8    1.0000    0.5833    0.7368        36
          9    0.8952    0.6894    0.7789       161
         10    1.0000    0.3333    0.5000         6
         11    1.0000    1.0000    1.0000        14
         12    1.0000    1.0000    1.0000         5
         13    0.0000    0.0000    0.0000         2
         14    1.0000    0.7273    0.8421        11
         15    1.0000    1.0000    1.0000         6
         16    1.0000    0.4000    0.5714         5
         17    1.0000    0.9178    0.9571        73
         18    0.8000    1.0000    0.8889         8
         19    0.4607    1.0000    0.6308        41
         20    0.0000    0.0000    0.0000         2
         21    1.0000    1.0000    1.0000       137
         22    1.0000    1.0000    1.0000        24
         23    1.0000    1.0000    1.0000         7
         24    0.8462    0.8800    0.8627        25
         25    0.8642    1.0000    0.9272        70
         26    0.9630    0.7222    0.8254       108

avg / total    0.8982    0.8776    0.8770      1422

=========================================================================
```
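When reading these reports, it can help to know how the "avg / total" row relates to the per-class rows. The numbers are consistent with a support-weighted average, in which case the averaged recall equals the overall accuracy. The sketch below checks this for LinearSVC using the recall and support values copied from its report. The variable names are illustrative, and small differences can arise because the inputs are already rounded to four decimals.

```python
# (recall, support) for classes 0..26, copied from the LinearSVC report.
linear_svc_rows = [
    (1.0000, 6), (1.0000, 2), (1.0000, 1), (1.0000, 339), (0.8527, 224),
    (1.0000, 28), (1.0000, 7), (0.7568, 74), (0.5833, 36), (0.9255, 161),
    (0.5000, 6), (1.0000, 14), (1.0000, 5), (1.0000, 2), (0.4545, 11),
    (1.0000, 6), (0.4000, 5), (0.9589, 73), (1.0000, 8), (0.9268, 41),
    (1.0000, 2), (1.0000, 137), (0.8750, 24), (1.0000, 7), (1.0000, 25),
    (0.9571, 70), (0.9722, 108),
]

# Total number of test samples across all classes.
total_support = sum(support for _, support in linear_svc_rows)

# Weight each class's recall by how many test samples it has.
weighted_recall = sum(recall * support for recall, support in linear_svc_rows) / total_support

print(total_support)              # 1422
print(round(weighted_recall, 4))  # 0.9262, matching the reported accuracy
```

This also shows why the large classes (such as class 3 with 339 samples) dominate the summary row, while classes with only one or two test samples barely move it.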