[Text Mining and Knowledge Discovery] 02. Named Entity Recognition: A Detailed Guide to Threat-Intelligence Entity Recognition with BiLSTM-CRF

In August 2023 the author opened a new column, "Text Mining and Knowledge Discovery", which combines Python, big-data analysis, and artificial intelligence to share content on text mining, knowledge graphs, knowledge discovery, and library and information science. These articles also preview parts of the author's book "Text Mining and Knowledge Discovery (Python Edition)", expected to be published in 2024. Written in an accessible, illustrated style, the book covers the field more systematically in 20 chapters with over a hundred worked examples. Your follows, likes, and shares are the greatest support for Xiuzhang. Knowledge is priceless and people have heart; may we all stay happy and keep growing together on life's road.

The previous article introduced the basics of the literature-visualization tool CiteSpace, using CNKI literature on "Dream of the Red Chamber" as an example for topic mining, keyword clustering, and topic-evolution analysis. This article explains how to implement threat-intelligence entity recognition, using the BiLSTM-CRF algorithm to extract ATT&CK-related tactic and technique entities, an important building block for constructing security knowledge graphs. It is an introductory article; I hope it helps!

Version information:

  • keras-contrib V2.0.8
  • keras V2.3.1
  • tensorflow V2.2.0

Common entity-recognition frameworks are shown in the figure below:

  • https://aclanthology.org/2021.acl-short.4/

[Figure: common named-entity-recognition model frameworks]

Table of Contents

  • I. ATT&CK Data Collection
  • II. Data Splitting and Content Statistics
    • 1. Paragraph Splitting
    • 2. Sentence Splitting
  • III. Data Annotation
  • IV. Dataset Splitting
  • V. CRF-Based Entity Recognition
    • 1. Installing keras-contrib
    • 2. Installing Keras
    • 3. Complete Code
  • VI. BiLSTM-CRF-Based Entity Recognition
  • VII. Summary

Code download:

  • https://github.com/eastmountyxz/Text-Mining-Knowledge-Discovery

Previous articles in this series:

  • [Text Mining and Knowledge Discovery] 01. Topic Evolution Analysis of "Dream of the Red Chamber": Getting Started with the Literature-Visualization Tool CiteSpace
  • [Text Mining and Knowledge Discovery] 02. Named Entity Recognition: A Detailed Guide to Threat-Intelligence Entity Recognition with BiLSTM-CRF

I. ATT&CK Data Collection

Readers familiar with threat intelligence will know MITRE's ATT&CK website. This article collects the attack tactics-and-techniques data of the APT groups listed there and uses it for the threat-intelligence entity-recognition experiments. The URL is:

  • http://attack.mitre.org

[Figure: ATT&CK website homepage]

Step 1: locate the APT group names in the ATT&CK page source, then collect them systematically.

[Figure: APT group names located in the page source]

Install the BeautifulSoup package; the code for this part is shown below:

[Figure]

01-get-aptentity.py

#encoding:utf-8
#By:Eastmount CSDN
import re
import requests
from lxml import etree
from bs4 import BeautifulSoup
import urllib.request

#-------------------------------------------------------------------------------------------
#Fetch the APT group names and links
#Set a browser User-Agent header (a dictionary)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}
url = 'https://attack.mitre.org/groups/'

#Send the request to the server
r = requests.get(url=url, headers=headers).text

#Parse the DOM tree
html_etree = etree.HTML(r)
names = html_etree.xpath('//*[@class="table table-bordered table-alternate mt-2"]/tbody/tr/td[2]/a/text()')
print(names)
print(len(names), names[0])
filename = []
for name in names:
    filename.append(name.strip())
print(filename)

#Links
urls = html_etree.xpath('//*[@class="table table-bordered table-alternate mt-2"]/tbody/tr/td[2]/a/@href')
print(urls)
print(len(urls), urls[0])
print("\n")

The output, shown below, includes the APT group names and their corresponding URLs.

[Figure: scraped APT group names and URLs]

Step 2: visit each APT group's URL and collect its detailed information (the description text).

[Figure: APT group detail page]

Step 3: collect the corresponding TTPs (tactics, techniques, and procedures); their location in the page source is shown below.

[Figure: TTP table located in the page source]

Step 4: write the code that completes the threat-intelligence collection. The complete code of 01-spider-mitre.py is as follows:

#encoding:utf-8
#By:Eastmount CSDN
import re
import requests
from lxml import etree
from bs4 import BeautifulSoup
import urllib.request

#-------------------------------------------------------------------------------------------
#Fetch the APT group names and links
#Set a browser User-Agent header (a dictionary)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}
url = 'https://attack.mitre.org/groups/'

#Send the request to the server
r = requests.get(url=url, headers=headers).text

#Parse the DOM tree
html_etree = etree.HTML(r)
names = html_etree.xpath('//*[@class="table table-bordered table-alternate mt-2"]/tbody/tr/td[2]/a/text()')
print(names)
print(len(names), names[0])

#Links
urls = html_etree.xpath('//*[@class="table table-bordered table-alternate mt-2"]/tbody/tr/td[2]/a/@href')
print(urls)
print(len(urls), urls[0])
print("\n")

#-------------------------------------------------------------------------------------------
#Fetch the detailed information
k = 0
while k < len(names):
    filename = str(names[k]).strip() + ".txt"
    url = "https://attack.mitre.org" + urls[k]
    print(url)

    #Fetch the page body
    page = urllib.request.Request(url, headers=headers)
    page = urllib.request.urlopen(page)
    contents = page.read()
    soup = BeautifulSoup(contents, "html.parser")

    #Extract the description text
    content = ""
    for tag in soup.find_all(attrs={"class": "description-body"}):
        #contents = tag.find("p").get_text()
        contents = tag.find_all("p")
        for con in contents:
            content += con.get_text().strip() + "###\n"  #Mark sentence end (used for splitting in Part II)
    #print(content)

    #Extract the technique information from the table
    for tag in soup.find_all(attrs={"class": "table techniques-used table-bordered mt-2"}):
        contents = tag.find("tbody").find_all("tr")
        for con in contents:
            value = con.find("p").get_text()   #Rows have 4 or 5 columns, so take the <p> text
            #print(value)
            content += value.strip() + "###\n" #Mark sentence end (used for splitting in Part II)

    #Remove the reference brackets [n] from the content
    result = re.sub(u"\\[.*?]", "", content)
    print(result)

    #Write to file
    filename = "Mitre//" + filename
    print(filename)
    f = open(filename, "w", encoding="utf-8")
    f.write(result)
    f.close()
    k += 1

The output is shown below; information on 100 groups was collected in total.

[Figure: collected results, one TXT file per APT group]

The content of each file looks like this:

[Figure: sample content of a collected file]

Tip:
Because the site's layout keeps changing and being refined, readers should master the basic methods of data collection and DOM-tree locating, so that solid fundamentals can cope with any change. Readers can also try collecting every paragraph, and even the content behind the linked URLs; please explore and extend this on your own!


II. Data Splitting and Content Statistics

1. Paragraph Splitting

To enlarge the dataset and support the NLP processing that follows, we split the text into segments. The approach:

  • locate the previously inserted "###" markers
  • write every five sentences to one TXT file, named "10XX-GroupName"

The complete code of 02-dataset-split.py:

#encoding:utf-8
#By:Eastmount CSDN
import re
import os

#------------------------------------------------------------------------
#Get file paths and names
def get_filepath(path):
    entities = {}              #entity categories by field
    files = os.listdir(path)   #list the directory
    return files

#-----------------------------------------------------------------------
#Read file content
def get_content(filename):
    content = ""
    with open(filename, "r", encoding="utf8") as f:
        for line in f.readlines():
            content += line.replace("\n", " ")
    return content

#---------------------------------------------------------------------
#Split the text on the custom delimiter
def split_text(text):
    pattern = '###'
    nums = text.split(pattern)  #split at the marker positions
    return nums

#-----------------------------------------------------------------------
#Main
if __name__ == '__main__':
    #Get the file names
    path = "Mitre"
    savepath = "Mitre-Split"
    filenames = get_filepath(path)
    print(filenames)
    print("\n")

    #Iterate over the files
    k = 0
    begin = 1001  #naming counter
    while k < len(filenames):
        filename = "Mitre//" + filenames[k]
        print(filename)
        content = get_content(filename)
        print(content)

        #Split into sentences
        nums = split_text(content)

        #Write every five sentences to one TXT file
        n = 0
        result = ""
        while n < len(nums):
            if n > 0 and (n % 5) == 0:  #save
                savename = savepath + "//" + str(begin) + "-" + filenames[k]
                print(savename)
                f = open(savename, "w", encoding="utf8")
                f.write(result)
                result = ""
                result = nums[n].lstrip() + "### "  #first sentence of the next file
                begin += 1
                f.close()
            else:                       #accumulate
                result += nums[n].lstrip() + "### "
            n += 1
        k += 1

The split ultimately produces 381 files, stored in the "Mitre-Split" folder.

[Figure: the 381 split files in the Mitre-Split folder]

A single file looks like this:

[Figure: content of a single split file]


2. Sentence Splitting

Before data annotation, the NER task still requires us to:

  • split each paragraph into sentences
  • split each sentence into words, one word per line, so that each word can be paired with a label
  • the key call is text.split(" ")

The effect after sentence splitting is shown below:

[Figure: one word per line after sentence splitting]

The complete code is shown below; its output goes to the "Mitre-Split-Word" folder.

#encoding:utf-8
#By:Eastmount CSDN
import re
import os

#------------------------------------------------------------------------
#Get file paths and names
def get_filepath(path):
    entities = {}              #entity categories by field
    files = os.listdir(path)   #list the directory
    return files

#-----------------------------------------------------------------------
#Read file content
def get_content(filename):
    content = ""
    with open(filename, "r", encoding="utf8") as f:
        for line in f.readlines():
            content += line.replace("\n", " ")
    return content

#---------------------------------------------------------------------
#Split on spaces to get the English words
def split_word(text):
    nums = text.split(" ")
    #print(nums)
    return nums

#-----------------------------------------------------------------------
#Main
if __name__ == '__main__':
    #Get the file names
    path = "Mitre-Split"
    savepath = "Mitre-Split-Word"
    filenames = get_filepath(path)
    print(filenames)
    print("\n")

    #Iterate over the files
    k = 0
    while k < len(filenames):
        filename = path + "//" + filenames[k]
        print(filename)
        content = get_content(filename)
        content = content.replace("###", "\n")

        #Split into words
        nums = split_word(content)
        #print(nums)
        savename = savepath + "//" + filenames[k]
        f = open(savename, "w", encoding="utf8")
        for n in nums:
            if n != "":
                #Strip punctuation
                n = n.replace(",", "")
                n = n.replace(";", "")
                n = n.replace("!", "")
                n = n.replace("?", "")
                n = n.replace(":", "")
                n = n.replace('"', "")
                n = n.replace('(', "")
                n = n.replace(')', "")
                n = n.replace('’', "")
                n = n.replace('\'s', "")
                #Strip trailing periods (keep abbreviations such as U.S. and U.K.)
                if ("." in n) and (n not in ["U.S.", "U.K."]):
                    n = n.rstrip(".")
                    n = n.rstrip(".\n")
                    n = n + "\n"   #extra newline leaves a blank line at the sentence boundary
                f.write(n + "\n")
        f.close()
        k += 1

III. Data Annotation

Annotation is done in a deliberately brute-force way: we define name lists for each entity type and label the text in BIO format, following the ATT&CK tactics and techniques. The labels can later be corrected manually, and more entity types can be defined. The entity types are summarized in the table below, and a toy tagging example follows the table.

  • BIO tagging scheme

| Entity type | Tag | Count | Examples |
| ----------- | --- | ----- | -------- |
| APT group | B-AG | 128 | APT32, Lazarus Group |
| Vulnerability | B-AV | 56 | CVE-2009-0927 |
| Region/location | B-RL | 72 | America, Europe |
| Targeted industry | B-AI | 34 | companies, finance |
| Attack method | B-AM | 65 | C&C, RAT, DDoS |
| Software involved | B-SI | 48 | 7-Zip, Microsoft |
| Operating system | B-OS | 10 | Linux, Windows |
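
To make the scheme concrete, here is a toy sketch (not one of the article's scripts; the two word lists are tiny illustrative subsets of the dictionaries defined in the code below) showing how a dictionary lookup turns one sentence into word/label pairs:

#Toy BIO tagging: dictionary hits receive B- tags, all other words receive O
apt_groups = {"APT32", "Turla"}            #subset of the aptName list below
methods = {"spearphishing", "backdoor"}    #subset of the methodName list below

sentence = "APT32 used spearphishing against European targets".split(" ")
labels = []
for word in sentence:
    if word in apt_groups:
        labels.append("B-AG")   #APT group
    elif word in methods:
        labels.append("B-AM")   #attack method
    else:
        labels.append("O")      #outside any entity

for word, label in zip(sentence, labels):
    print(word, label)
#APT32 B-AG / used O / spearphishing B-AM / against O / European O / targets O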

Common data-annotation tools:

  • Image annotation: labelme, LabelImg, Labelbox, RectLabel, CVAT, VIA
  • Semi-automatic OCR annotation: PPOCRLabel
  • NLP annotation: labelstudio

The complete code for this part (04-BIO-data-annotation.py) is as follows:

#encoding:utf-8
import re
import os
import csv

#-----------------------------------------Define the entity types-------------------------------------
#APT groups
aptName = ['admin@338', 'Ajax Security Team', 'APT-C-36', 'APT1', 'APT12', 'APT16', 'APT17', 'APT18', 'APT19',
           'APT28', 'APT29', 'APT3', 'APT30', 'APT32', 'APT33', 'APT37', 'APT38', 'APT39', 'APT41', 'Axiom',
           'BlackOasis', 'BlackTech', 'Blue Mockingbird', 'Bouncing Golf', 'BRONZE BUTLER', 'Carbanak', 'Chimera',
           'Cleaver', 'Cobalt Group', 'CopyKittens', 'Dark Caracal', 'Darkhotel', 'DarkHydrus', 'DarkVishnya',
           'Deep Panda', 'Dragonfly', 'Dragonfly 2.0', 'DragonOK', 'Dust Storm', 'Elderwood', 'Equation', 'Evilnum',
           'FIN10', 'FIN4', 'FIN5', 'FIN6', 'FIN7', 'FIN8', 'Fox Kitten', 'Frankenstein', 'GALLIUM', 'Gallmaker',
           'Gamaredon Group', 'GCMAN', 'GOLD SOUTHFIELD', 'Gorgon Group', 'Group5', 'HAFNIUM', 'Higaisa', 'Honeybee',
           'Inception', 'Indrik Spider', 'Ke3chang', 'Kimsuky', 'Lazarus Group', 'Leafminer', 'Leviathan',
           'Lotus Blossom', 'Machete', 'Magic Hound', 'menuPass', 'Moafee', 'Mofang', 'Molerats', 'MuddyWater',
           'Mustang Panda', 'Naikon', 'NEODYMIUM', 'Night Dragon', 'OilRig', 'Operation Wocao', 'Orangeworm',
           'Patchwork', 'PittyTiger', 'PLATINUM', 'Poseidon Group', 'PROMETHIUM', 'Putter Panda', 'Rancor', 'Rocke',
           'RTM', 'Sandworm Team', 'Scarlet Mimic', 'Sharpshooter', 'Sidewinder', 'Silence', 'Silent Librarian',
           'SilverTerrier', 'Sowbug', 'Stealth Falcon', 'Stolen Pencil', 'Strider', 'Suckfly', 'TA459', 'TA505',
           'TA551', 'Taidoor', 'TEMP.Veles', 'The White Company', 'Threat Group-1314', 'Threat Group-3390', 'Thrip',
           'Tropic Trooper', 'Turla', 'Volatile Cedar', 'Whitefly', 'Windigo', 'Windshift', 'Winnti Group', 'WIRTE',
           'Wizard Spider', 'ZIRCONIUM', 'UNC2452', 'NOBELIUM', 'StellarParticle']

#Vulnerabilities with special names
cveName = ['CVE-2009-3129', 'CVE-2012-0158', 'CVE-2009-4324', 'CVE-2009-0927', 'CVE-2011-0609', 'CVE-2011-0611',
           'CVE-2012-0158', 'CVE-2017-0262', 'CVE-2015-4902', 'CVE-2015-1701', 'CVE-2014-4076', 'CVE-2015-2387',
           'CVE-2015-1701', 'CVE-2017-0263']

#Regions and locations
locationName = ['China-based', 'China', 'North', 'Korea', 'Russia', 'South', 'Asia', 'US', 'U.S.', 'UK', 'U.K.',
                'Iran', 'Iranian', 'America', 'Colombian', 'Chinese', "People’s", 'Liberation', 'Army', 'PLA',
                'General', 'Staff', "Department’s", 'GSD', 'MUCD', 'Unit', '61398', 'Chinese-based', "Russia's",
                "General", "Staff", "Main", "Intelligence", "Directorate", "GRU", "GTsSS", "unit", "26165", '74455',
                'Georgian', 'SVR', 'Europe', 'Asia', 'Hong Kong', 'Vietnam', 'Cambodia', 'Thailand', 'Germany',
                'Spain', 'Finland', 'Israel', 'India', 'Italy', 'South Asia', 'Korea', 'Kuwait', 'Lebanon',
                'Malaysia', 'United', 'Kingdom', 'Netherlands', 'Southeast', 'Asia', 'Pakistan', 'Canada',
                'Bangladesh', 'Ukraine', 'Austria', 'France', 'Korea']

#Targeted industries
industryName = ['financial', 'economic', 'trade', 'policy', 'defense', 'industrial', 'espionage', 'government',
                'institutions', 'institution', 'petroleum', 'industry', 'manufacturing', 'corporations', 'media',
                'outlets', 'high-tech', 'companies', 'governments', 'medical', 'defense', 'finance', 'energy',
                'pharmaceutical', 'telecommunications', 'high', 'tech', 'education', 'investment', 'firms',
                'organizations', 'research', 'institutes']

#Attack methods
methodName = ['RATs', 'RAT', 'SQL', 'injection', 'spearphishing', 'spear', 'phishing', 'backdoors', 'vulnerabilities',
              'vulnerability', 'commands', 'command', 'anti-censorship', 'keystrokes', 'VBScript', 'malicious',
              'document', 'scheduled', 'tasks', 'C2', 'C&C', 'communications', 'batch', 'script', 'shell',
              'scripting', 'social', 'engineering', 'privilege', 'escalation', 'credential', 'dumping', 'control',
              'obfuscates', 'obfuscate', 'payload', 'upload', 'payloads', 'encode', 'decrypts', 'attachments',
              'attachment', 'inject', 'collect', 'large-scale', 'scans', 'persistence',
              'brute-force/password-spray', 'password-spraying', 'backdoor', 'bypass', 'hijacking', 'escalate',
              'privileges', 'lateral', 'movement', 'Vulnerability', 'timestomping', 'keylogging', 'DDoS', 'bootkit',
              'UPX']

#Software involved
softwareName = ['Microsoft', 'Word', 'Office', 'Firefox', 'Google', 'RAR', 'WinRAR', 'zip', 'GETMAIL', 'MAPIGET',
                'Outlook', 'Exchange', "Adobe's", 'Adobe', 'Acrobat', 'Reader', 'RDP', 'PDFs', 'PDF', 'RTF', 'XLSM',
                'USB', 'SharePoint', 'Forfiles', 'Delphi', 'COM', 'Excel', 'NetBIOS', 'Tor', 'Defender', 'Scanner',
                'Gmail', 'Yahoo', 'Mail', '7-Zip', 'Twitter', 'gMSA', 'Azure', 'Exchange', 'OWA', 'SMB', 'Netbios',
                'WinRM']

#Operating systems
osName = ['Windows', 'windows', 'Mac', 'Linux', 'Android', 'android', 'linux', 'mac', 'unix', 'Unix']

#Working lists used to collect and report what was actually matched
saveCVE = cveName
saveAPT = aptName
saveLocation = locationName
saveIndustry = industryName
saveMethod = methodName
saveSoftware = softwareName
saveOS = osName

#------------------------------------------------------------------------
#Get file paths and names
def get_filepath(path):
    entities = {}              #entity categories by field
    files = os.listdir(path)   #list the directory
    return files

#-----------------------------------------------------------------------
#Read file content
def get_content(filename):
    content = []
    with open(filename, "r", encoding="utf8") as f:
        for line in f.readlines():
            content.append(line.strip())
    return content

#---------------------------------------------------------------------
#Annotate the space-separated English words
#Note: the look-ahead text[n+1] assumes each file ends with a blank line
def data_annotation(text):
    n = 0
    nums = []
    while n < len(text):
        word = text[n].strip()
        if word == "":   #blank line marks a sentence boundary
            n += 1
            nums.append("")
            continue
        #APT group
        if word in aptName:
            nums.append("B-AG")
        #Vulnerability
        elif "CVE-" in word or 'MS-' in word:
            nums.append("B-AV")
            print("CVE vulnerability:", word)
            if word not in saveCVE:
                saveCVE.append(word)
        #Region/location
        elif word in locationName:
            nums.append("B-RL")
        #Targeted industry
        elif word in industryName:
            nums.append("B-AI")
        #Attack method
        elif word in methodName:
            nums.append("B-AM")
        #Software involved
        elif word in softwareName:
            nums.append("B-SI")
        #Operating system
        elif word in osName:
            nums.append("B-OS")
        #Special cases - APT groups
        #Ajax Security Team, Deep Panda, Sandworm Team, Cozy Bear, The Dukes, Dark Halo
        elif ((word in "Ajax Security Team") and (text[n+1].strip() in "Ajax Security Team") and word!="a" and word!="it") or \
             ((word in "Ajax Security Team") and (text[n-1].strip() in "Ajax Security Team") and word!="a" and word!="it") or \
             ((word=="Deep") and (text[n+1].strip()=="Panda")) or \
             ((word=="Panda") and (text[n-1].strip()=="Deep")) or \
             ((word=="Sandworm") and (text[n+1].strip()=="Team")) or \
             ((word=="Team") and (text[n-1].strip()=="Sandworm")) or \
             ((word=="Cozy") and (text[n+1].strip()=="Bear")) or \
             ((word=="Bear") and (text[n-1].strip()=="Cozy")) or \
             ((word=="The") and (text[n+1].strip()=="Dukes")) or \
             ((word=="Dukes") and (text[n-1].strip()=="The")) or \
             ((word=="Dark") and (text[n+1].strip()=="Halo")) or \
             ((word=="Halo") and (text[n-1].strip()=="Dark")):
            nums.append("B-AG")
            if "Deep Panda" not in saveAPT:
                saveAPT.append("Deep Panda")
            if "Sandworm Team" not in saveAPT:
                saveAPT.append("Sandworm Team")
            if "Cozy Bear" not in saveAPT:
                saveAPT.append("Cozy Bear")
            if "The Dukes" not in saveAPT:
                saveAPT.append("The Dukes")
            if "Dark Halo" not in saveAPT:
                saveAPT.append("Dark Halo")
        #Special cases - targeted industries
        elif ((word=="legal") and (text[n+1].strip()=="services")) or \
             ((word=="services") and (text[n-1].strip()=="legal")):
            nums.append("B-AI")
            if "legal services" not in saveIndustry:
                saveIndustry.append("legal services")
        #Special cases - attack methods
        #watering hole attack, bypass application control, take screenshots
        elif ((word in "watering hole attack") and (text[n+1].strip() in "watering hole attack") and word!="a" and text[n+1].strip()!="a") or \
             ((word in "watering hole attack") and (text[n-1].strip() in "watering hole attack") and word!="a" and text[n+1].strip()!="a") or \
             ((word in "bypass application control") and (text[n+1].strip() in "bypass application control") and word!="a" and text[n+1].strip()!="a") or \
             ((word in "bypass application control") and (text[n-1].strip() in "bypass application control") and word!="a" and text[n-1].strip()!="a") or \
             ((word=="take") and (text[n+1].strip()=="screenshots")) or \
             ((word=="screenshots") and (text[n-1].strip()=="take")):
            nums.append("B-AM")
            if "watering hole attack" not in saveMethod:
                saveMethod.append("watering hole attack")
            if "bypass application control" not in saveMethod:
                saveMethod.append("bypass application control")
            if "take screenshots" not in saveMethod:
                saveMethod.append("take screenshots")
        #Special cases - software
        #MAC address, IP address, Port 22, Delivery Service, McAfee Email Protection
        elif ((word=="legal") and (text[n+1].strip()=="services")) or \
             ((word=="services") and (text[n-1].strip()=="legal")) or \
             ((word=="MAC") and (text[n+1].strip()=="address")) or \
             ((word=="address") and (text[n-1].strip()=="MAC")) or \
             ((word=="IP") and (text[n+1].strip()=="address")) or \
             ((word=="address") and (text[n-1].strip()=="IP")) or \
             ((word=="Port") and (text[n+1].strip()=="22")) or \
             ((word=="22") and (text[n-1].strip()=="Port")) or \
             ((word=="Delivery") and (text[n+1].strip()=="Service")) or \
             ((word=="Service") and (text[n-1].strip()=="Delivery")) or \
             ((word in "McAfee Email Protection") and (text[n+1].strip() in "McAfee Email Protection")) or \
             ((word in "McAfee Email Protection") and (text[n-1].strip() in "McAfee Email Protection")):
            nums.append("B-SI")
            if "MAC address" not in saveSoftware:
                saveSoftware.append("MAC address")
            if "IP address" not in saveSoftware:
                saveSoftware.append("IP address")
            if "Port 22" not in saveSoftware:
                saveSoftware.append("Port 22")
            if "Delivery Service" not in saveSoftware:
                saveSoftware.append("Delivery Service")
            if "McAfee Email Protection" not in saveSoftware:
                saveSoftware.append("McAfee Email Protection")
        #Special cases - regions
        #Russia's Foreign Intelligence Service, the Middle East
        elif ((word in "Russia's Foreign Intelligence Service") and (text[n+1].strip() in "Russia's Foreign Intelligence Service")) or \
             ((word in "Russia's Foreign Intelligence Service") and (text[n-1].strip() in "Russia's Foreign Intelligence Service")) or \
             ((word in "the Middle East") and (text[n+1].strip() in "the Middle East")) or \
             ((word in "the Middle East") and (text[n-1].strip() in "the Middle East")):
            nums.append("B-RL")
            if "Russia's Foreign Intelligence Service" not in saveLocation:
                saveLocation.append("Russia's Foreign Intelligence Service")
            if "the Middle East" not in saveLocation:
                saveLocation.append("the Middle East")
        else:
            nums.append("O")
        n += 1
    return nums

#-----------------------------------------------------------------------
#Main
if __name__ == '__main__':
    path = "Mitre-Split-Word"
    savepath = "Mitre-Split-Word-BIO"
    filenames = get_filepath(path)
    print(filenames)
    print("\n")

    #Iterate over the files
    k = 0
    while k < len(filenames):
        filename = path + "//" + filenames[k]
        print("-------------------------")
        print(filename)
        content = get_content(filename)

        #Annotate the words
        nums = data_annotation(content)
        #print(nums)
        print(len(content), len(nums))

        #Save the word/label pairs as CSV
        filename = filenames[k].replace(".txt", ".csv")
        savename = savepath + "//" + filename
        f = open(savename, "w", encoding="utf8", newline='')
        fwrite = csv.writer(f)
        fwrite.writerow(['word', 'label'])
        n = 0
        while n < len(content):
            fwrite.writerow([content[n], nums[n]])
            n += 1
        f.close()
        print("-------------------------\n\n")
        #if k>=28:
        #    break
        k += 1

    #-------------------------------------------------------------------------------------------------
    #Sort and print the collected entities
    saveCVE.sort()
    print(saveCVE)
    print("CVE vulnerabilities:", len(saveCVE))
    saveAPT.sort()
    print(saveAPT)
    print("APT groups:", len(saveAPT))
    saveLocation.sort()
    print(saveLocation)
    print("Regions:", len(saveLocation))
    saveIndustry.sort()
    print(saveIndustry)
    print("Industries:", len(saveIndustry))
    saveSoftware.sort()
    print(saveSoftware)
    print("Software:", len(saveSoftware))
    saveMethod.sort()
    print(saveMethod)
    print("Methods:", len(saveMethod))
    saveOS.sort()
    print(saveOS)
    print("Operating systems:", len(saveOS))

The output at this point is shown below:

[Figure: BIO annotation output]

Tip:
Correcting and refining the annotations is left for you to think through; in addition, the code still needs adjusting to emit the I- (inside) tags that complete the BIO scheme for multi-word entities. The more accurate the annotations, the more every entity-recognition study benefits.


IV. Dataset Splitting

Before training the entity-recognition models, we randomly divide the dataset into training, test, and validation sets.

  • randomly distribute the files in Mitre-Split-Word-BIO across three folders (a sketch of one way to do this follows the list)
  • write merge code that combines each folder into one TXT file; the later training and test code reads these files
    – dataset-train.txt, dataset-test.txt, dataset-val.txt
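
The article does not show the code that scatters the files into the three folders; the following is a minimal sketch of one way to do it, assuming an 8:1:1 ratio and the folder names train, test, and val (both are assumptions, not taken from the original scripts):

#Hypothetical random 8:1:1 file split into train/test/val folders
import os
import random
import shutil

source = "Mitre-Split-Word-BIO"
files = os.listdir(source)
random.shuffle(files)

n = len(files)
splits = {
    "train": files[:int(0.8 * n)],
    "test":  files[int(0.8 * n):int(0.9 * n)],
    "val":   files[int(0.9 * n):],
}
for folder, names in splits.items():
    os.makedirs(folder, exist_ok=True)   #create the target folder if missing
    for name in names:
        shutil.copy(os.path.join(source, name), os.path.join(folder, name))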

The resulting folders are shown below:

[Figure: train, test, and val folders]

The complete merge code is as follows:

#encoding:utf-8
#By:Eastmount CSDN
import re
import os
import csv

#------------------------------------------------------------------------
#Get file paths and names
def get_filepath(path):
    entities = {}              #entity categories by field
    files = os.listdir(path)   #list the directory
    return files

#-----------------------------------------------------------------------
#Read file content
def get_content(filename):
    content = ""
    fr = open(filename, "r", encoding="utf8")
    reader = csv.reader(fr)
    k = 0
    for r in reader:
        if k > 0 and (r[0] != "" and r[0] != " ") and r[1] != "":
            content += r[0] + " " + r[1] + "\n"
        elif (r[0] == "" or r[0] == " ") and r[1] != "":
            content += "UNK" + " " + r[1] + "\n"   #placeholder for an empty word
        elif (r[0] == "" or r[0] == " ") and r[1] == "":
            content += "\n"                        #blank line = sentence boundary
        k += 1
    return content

#-----------------------------------------------------------------------
#Main
if __name__ == '__main__':
    #Choose the folder to merge
    path = "train"
    #path = "test"
    #path = "val"
    filenames = get_filepath(path)
    print(filenames)
    print("\n")

    savefilename = "dataset-train.txt"
    #savefilename = "dataset-test.txt"
    #savefilename = "dataset-val.txt"
    f = open(savefilename, "w", encoding="utf8")

    #Iterate over the files
    k = 0
    while k < len(filenames):
        filename = path + "//" + filenames[k]
        print(filename)
        content = get_content(filename)
        print(content)
        f.write(content)
        k += 1
    f.close()

The result is shown below:

[Figure: merged dataset-train.txt, dataset-test.txt, and dataset-val.txt files]


V. CRF-Based Entity Recognition

With the data prepared, we can start the entity-recognition experiments, beginning with the classic Conditional Random Field (CRF) model. A full treatment of CRF theory is left to the reader; a compact statement of the linear-chain CRF is given below.

[Figure: CRF model illustration]
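
For quick reference (this is standard textbook material, not derived from the article), a linear-chain CRF scores an entire tag sequence y for an input sentence x and normalizes over all possible tag sequences:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \big( A_{y_{t-1},\,y_t} + s_t(y_t \mid x) \big) \Big), \qquad Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \big( A_{y'_{t-1},\,y'_t} + s_t(y'_t \mid x) \big) \Big)$$

Here A is a learned tag-transition matrix and s_t is the per-token emission score (in the code below, the output of the TimeDistributed Dense layer). Training maximizes the log-likelihood (the crf_loss used later), and prediction finds the highest-scoring tag sequence with Viterbi decoding, which is how the CRF captures label-to-label dependencies that a per-token softmax ignores.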

1. Installing keras-contrib

For the CRF layer, the author installed keras-contrib.

Step 1: installing directly with "pip install keras-contrib" may fail, and the remote install below can fail as well:

  • pip install git+https://www.github.com/keras-team/keras-contrib.git

You may even then hit: ModuleNotFoundError: No module named 'keras_contrib'.

[Figure: keras-contrib installation error]

Step 2: the author downloaded the package from GitHub and installed it locally.

  • https://github.com/keras-team/keras-contrib
  • keras-contrib version: 2.0.8

git clone https://www.github.com/keras-team/keras-contrib.git
cd keras-contrib
python setup.py install
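
To confirm the local install worked, a quick import check (a minimal sketch, not from the original article) is enough:

#Verify that keras_contrib and its CRF layer are importable
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy

print(CRF)  #should print the class instead of raising ModuleNotFoundError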

A successful installation is shown below:

[Figure: successful keras-contrib installation]

Readers can also download the code and packages from my repository:

  • https://github.com/eastmountyxz/When-AI-meet-Security

2. Installing Keras

The keras and TensorFlow packages need to be installed as well.

[Figure: installing keras]

If TensorFlow downloads too slowly, configure the Tsinghua University mirror; version 2.2 is what was actually installed.

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install tensorflow==2.2
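
Because keras-contrib 2.0.8 is sensitive to the Keras/TensorFlow combination, it is worth confirming that the installed versions match the ones listed at the top of this article. A quick check (a small sketch, not from the original):

#Print the installed versions to confirm they match the tested combination
import tensorflow as tf
import keras

print("tensorflow:", tf.__version__)  #expected: 2.2.0
print("keras:", keras.__version__)    #expected: 2.3.1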

[Figure: TensorFlow installation output]


3. Complete Code

The complete code is shown below. Recommended references:

  • https://github.com/huanghao128/zh-nlp-demo
  • https://blog.csdn.net/qq_35549634/article/details/106861168
#encoding:utf-8
#By:Eastmount CSDN
import re
import os
import csv
import numpy as np
import keras
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.models import Model
from keras.layers import Masking, Embedding, Bidirectional, LSTM, Dense
from keras.layers import Input, TimeDistributed, Activation
from keras.models import load_model
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy
from keras import backend as K
from sklearn import metrics

#------------------------------------------------------------------------
#Step 1: data preprocessing
#------------------------------------------------------------------------
train_data_path = "dataset-train.txt"  #training data
test_data_path = "dataset-test.txt"    #test data
val_data_path = "dataset-val.txt"      #validation data
char_vocab_path = "char_vocabs.txt"    #vocabulary file

special_words = ['<PAD>', '<UNK>']     #special tokens

#BIO labels
label2idx = {"O": 0, "B-AG": 1, "B-AV": 2, "B-RL": 3,
             "B-AI": 4, "B-AM": 5, "B-SI": 6, "B-OS": 7}

#Index-to-label mapping
idx2label = {idx: label for label, idx in label2idx.items()}
print(idx2label)

#Read the vocabulary file
with open(char_vocab_path, "r", encoding="utf8") as fo:
    char_vocabs = [line.strip() for line in fo]
char_vocabs = special_words + char_vocabs
print(char_vocabs)
print("--------------------------------------------\n\n")

#Token-index mappings, e.g. {'<PAD>': 0, '<UNK>': 1, 'APT-C-36': 2, ...}
idx2vocab = {idx: char for idx, char in enumerate(char_vocabs)}
vocab2idx = {char: idx for idx, char in idx2vocab.items()}
print(idx2vocab)
print("--------------------------------------------\n\n")
print(vocab2idx)
print("--------------------------------------------\n\n")

#------------------------------------------------------------------------
#Step 2: read the corpus
#------------------------------------------------------------------------
def read_corpus(corpus_path, vocab2idx, label2idx):
    datas, labels = [], []
    with open(corpus_path, encoding='utf-8') as fr:
        lines = fr.readlines()
        sent_, tag_ = [], []
        for line in lines:
            if line != '\n':        #inside a sentence
                line = line.strip()
                [char, label] = line.split()
                sent_.append(char)
                tag_.append(label)
            else:                   #sentence boundary
                #print(line)
                #vocab2idx[0] => <PAD>
                sent_ids = [vocab2idx[char] if char in vocab2idx else vocab2idx['<UNK>'] for char in sent_]
                tag_ids = [label2idx[label] if label in label2idx else 0 for label in tag_]
                datas.append(sent_ids)
                labels.append(tag_ids)
                sent_, tag_ = [], []
    return datas, labels

#Raw data
train_datas_, train_labels_ = read_corpus(train_data_path, vocab2idx, label2idx)
test_datas_, test_labels_ = read_corpus(test_data_path, vocab2idx, label2idx)

#Sanity check: 1639 1639 923 923
print(len(train_datas_), len(train_labels_), len(test_datas_), len(test_labels_))
print(train_datas_[5])
print([idx2vocab[idx] for idx in train_datas_[5]])
print(train_labels_[5])
print([idx2label[idx] for idx in train_labels_[5]])

#------------------------------------------------------------------------
#Step 3: padding and one-hot encoding
#------------------------------------------------------------------------
MAX_LEN = 100
VOCAB_SIZE = len(vocab2idx)
CLASS_NUMS = len(label2idx)

#Pad the sequences
print('padding sequences')
train_datas = sequence.pad_sequences(train_datas_, maxlen=MAX_LEN)
train_labels = sequence.pad_sequences(train_labels_, maxlen=MAX_LEN)
test_datas = sequence.pad_sequences(test_datas_, maxlen=MAX_LEN)
test_labels = sequence.pad_sequences(test_labels_, maxlen=MAX_LEN)
print('x_train shape:', train_datas.shape)
print('x_test shape:', test_datas.shape)
# (1639, 100) (923, 100)

#One-hot encode the labels
train_labels = keras.utils.to_categorical(train_labels, CLASS_NUMS)
test_labels = keras.utils.to_categorical(test_labels, CLASS_NUMS)
print('trainlabels shape:', train_labels.shape)
print('testlabels shape:', test_labels.shape)
# (1639, 100, 8) (923, 100, 8)

#------------------------------------------------------------------------
#Step 4: build the CRF model
#------------------------------------------------------------------------
EPOCHS = 20
BATCH_SIZE = 64
EMBED_DIM = 128
HIDDEN_SIZE = 64
MAX_LEN = 100
VOCAB_SIZE = len(vocab2idx)
CLASS_NUMS = len(label2idx)
K.clear_session()
print(VOCAB_SIZE, CLASS_NUMS, '\n') #3860 8

#Build the model: Embedding -> Dense -> CRF
inputs = Input(shape=(MAX_LEN,), dtype='int32')
x = Masking(mask_value=0)(inputs)
x = Embedding(VOCAB_SIZE, 32, mask_zero=False)(x)
x = TimeDistributed(Dense(CLASS_NUMS))(x)
outputs = CRF(CLASS_NUMS)(x)
model = Model(inputs=inputs, outputs=outputs)
model.summary()

flag = "test"
if flag == "train":
    #Train the model
    model.compile(loss=crf_loss, optimizer='adam', metrics=[crf_viterbi_accuracy])
    model.fit(train_datas, train_labels, epochs=EPOCHS, verbose=1, validation_split=0.1)
    score = model.evaluate(test_datas, test_labels, batch_size=BATCH_SIZE)
    print(model.metrics_names)
    print(score)
    model.save("ch_ner_model.h5")
else:
    #------------------------------------------------------------------------
    #Step 5: load the model and predict
    #------------------------------------------------------------------------
    char_vocab_path = "char_vocabs.txt"   #vocabulary file
    model_path = "ch_ner_model.h5"        #model file
    ner_labels = {"O": 0, "B-AG": 1, "B-AV": 2, "B-RL": 3,
                  "B-AI": 4, "B-AM": 5, "B-SI": 6, "B-OS": 7}
    special_words = ['<PAD>', '<UNK>']
    MAX_LEN = 100

    #Predict
    model = load_model(model_path, custom_objects={'CRF': CRF}, compile=False)
    y_pred = model.predict(test_datas)
    y_labels = np.argmax(y_pred, axis=2)         #predicted tag indices
    z_labels = np.argmax(test_labels, axis=2)    #ground-truth tag indices
    word_labels = test_datas                     #word indices

    k = 0
    final_y = []       #predicted labels
    final_z = []       #ground-truth labels
    final_word = []    #corresponding words
    while k < len(y_labels):
        y = y_labels[k]
        for idx in y:
            final_y.append(idx2label[idx])
        #print("Predicted:", [idx2label[idx] for idx in y])
        z = z_labels[k]
        #print(z)
        for idx in z:
            final_z.append(idx2label[idx])
        #print("Ground truth:", [idx2label[idx] for idx in z])
        word = word_labels[k]
        #print(word)
        for idx in word:
            final_word.append(idx2vocab[idx])
        k += 1
    print("Final result sizes:", len(final_y), len(final_z))

    #Count correct/incorrect predictions on non-O tokens
    n = 0
    numError = 0
    numRight = 0
    while n < len(final_y):
        if final_y[n] != final_z[n] and final_z[n] != 'O':
            numError += 1
        if final_y[n] == final_z[n] and final_z[n] != 'O':
            numRight += 1
        n += 1
    print("Wrong predictions:", numError)
    print("Correct predictions:", numRight)
    print("Acc:", numRight * 1.0 / (numError + numRight))
    print(y_pred.shape)
    print(len(test_datas_), len(test_labels_))
    print("Words:", [idx2vocab[idx] for idx in test_datas_[0]])
    print("Ground truth:", [idx2label[idx] for idx in test_labels_[0]])

    #Save to file
    fw = open("Final_CRF_Result.csv", "w", encoding="utf8", newline='')
    fwrite = csv.writer(fw)
    fwrite.writerow(['pre_label', 'real_label', 'word'])
    n = 0
    while n < len(final_y):
        fwrite.writerow([final_y[n], final_z[n], final_word[n]])
        n += 1
    fw.close()

The constructed model is shown below:

[Figure: model.summary() output for the CRF model]

The training output follows; after training completes, change the flag variable to "test" to run the evaluation.

  32/1475 [..............................] - ETA: 0s - loss: 0.0102 - crf_viterbi_accuracy: 0.9997
 416/1475 [=======>......................] - ETA: 5s - loss: 0.0143 - crf_viterbi_accuracy: 0.9982
 736/1475 [=============>................] - ETA: 4s - loss: 0.0147 - crf_viterbi_accuracy: 0.9981
1056/1475 [====================>.........] - ETA: 2s - loss: 0.0141 - crf_viterbi_accuracy: 0.9983
1344/1475 [==========================>...] - ETA: 0s - loss: 0.0138 - crf_viterbi_accuracy: 0.9984
1472/1475 [============================>.] - ETA: 0s - loss: 0.0136 - crf_viterbi_accuracy: 0.9984
['loss', 'crf_viterbi_accuracy']
[0.021301430796362854, 0.9972449541091919]

[Figure: CRF prediction results]


VI. BiLSTM-CRF-Based Entity Recognition

The following code builds a BiLSTM-CRF model for entity recognition.

#encoding:utf-8
#By:Eastmount CSDN
import re
import os
import csv
import numpy as np
import keras
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.models import Model
from keras.layers import Masking, Embedding, Bidirectional, LSTM, Dense
from keras.layers import Input, TimeDistributed, Activation
from keras.models import load_model
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy
from keras import backend as K
from sklearn import metrics

#------------------------------------------------------------------------
#Step 1: data preprocessing
#------------------------------------------------------------------------
train_data_path = "dataset-train.txt"  #training data
test_data_path = "dataset-test.txt"    #test data
val_data_path = "dataset-val.txt"      #validation data
char_vocab_path = "char_vocabs.txt"    #vocabulary file
special_words = ['<PAD>', '<UNK>']     #special tokens

#BIO labels
label2idx = {"O": 0, "B-AG": 1, "B-AV": 2, "B-RL": 3,
             "B-AI": 4, "B-AM": 5, "B-SI": 6, "B-OS": 7}

#Index-to-label mapping
idx2label = {idx: label for label, idx in label2idx.items()}
print(idx2label)

#Read the vocabulary file
with open(char_vocab_path, "r", encoding="utf8") as fo:
    char_vocabs = [line.strip() for line in fo]
char_vocabs = special_words + char_vocabs

#Token-index mappings, e.g. {'<PAD>': 0, '<UNK>': 1, 'APT-C-36': 2, ...}
idx2vocab = {idx: char for idx, char in enumerate(char_vocabs)}
vocab2idx = {char: idx for idx, char in idx2vocab.items()}

#------------------------------------------------------------------------
#Step 2: read the corpus
#------------------------------------------------------------------------
def read_corpus(corpus_path, vocab2idx, label2idx):
    datas, labels = [], []
    with open(corpus_path, encoding='utf-8') as fr:
        lines = fr.readlines()
        sent_, tag_ = [], []
        for line in lines:
            if line != '\n':        #inside a sentence
                line = line.strip()
                [char, label] = line.split()
                sent_.append(char)
                tag_.append(label)
            else:                   #sentence boundary
                sent_ids = [vocab2idx[char] if char in vocab2idx else vocab2idx['<UNK>'] for char in sent_]
                tag_ids = [label2idx[label] if label in label2idx else 0 for label in tag_]
                datas.append(sent_ids)
                labels.append(tag_ids)
                sent_, tag_ = [], []
    return datas, labels

#Raw data
train_datas_, train_labels_ = read_corpus(train_data_path, vocab2idx, label2idx)
test_datas_, test_labels_ = read_corpus(test_data_path, vocab2idx, label2idx)

#------------------------------------------------------------------------
#Step 3: padding and one-hot encoding
#------------------------------------------------------------------------
MAX_LEN = 100
VOCAB_SIZE = len(vocab2idx)
CLASS_NUMS = len(label2idx)

print('padding sequences')
train_datas = sequence.pad_sequences(train_datas_, maxlen=MAX_LEN)
train_labels = sequence.pad_sequences(train_labels_, maxlen=MAX_LEN)
test_datas = sequence.pad_sequences(test_datas_, maxlen=MAX_LEN)
test_labels = sequence.pad_sequences(test_labels_, maxlen=MAX_LEN)
print('x_train shape:', train_datas.shape)
print('x_test shape:', test_datas.shape)

train_labels = keras.utils.to_categorical(train_labels, CLASS_NUMS)
test_labels = keras.utils.to_categorical(test_labels, CLASS_NUMS)
print('trainlabels shape:', train_labels.shape)
print('testlabels shape:', test_labels.shape)

#------------------------------------------------------------------------
#Step 4: build the BiLSTM+CRF model
#------------------------------------------------------------------------
EPOCHS = 12
BATCH_SIZE = 64
EMBED_DIM = 128
HIDDEN_SIZE = 64
MAX_LEN = 100
VOCAB_SIZE = len(vocab2idx)
CLASS_NUMS = len(label2idx)
K.clear_session()
print(VOCAB_SIZE, CLASS_NUMS, '\n') #3860 8

#Build the model: Embedding -> BiLSTM -> Dense -> CRF
inputs = Input(shape=(MAX_LEN,), dtype='int32')
x = Masking(mask_value=0)(inputs)
x = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=False)(x)  #masking disabled here
x = Bidirectional(LSTM(HIDDEN_SIZE, return_sequences=True))(x)
x = TimeDistributed(Dense(CLASS_NUMS))(x)
outputs = CRF(CLASS_NUMS)(x)
model = Model(inputs=inputs, outputs=outputs)
model.summary()

flag = "train"
if flag == "train":
    #Train the model
    model.compile(loss=crf_loss, optimizer='adam', metrics=[crf_viterbi_accuracy])
    model.fit(train_datas, train_labels, epochs=EPOCHS, verbose=1, validation_split=0.1)
    score = model.evaluate(test_datas, test_labels, batch_size=BATCH_SIZE)
    print(model.metrics_names)
    print(score)
    model.save("bilstm_ner_model.h5")
else:
    #------------------------------------------------------------------------
    #Step 5: load the model and predict
    #------------------------------------------------------------------------
    char_vocab_path = "char_vocabs.txt"    #vocabulary file
    model_path = "bilstm_ner_model.h5"     #model file
    ner_labels = {"O": 0, "B-AG": 1, "B-AV": 2, "B-RL": 3,
                  "B-AI": 4, "B-AM": 5, "B-SI": 6, "B-OS": 7}
    special_words = ['<PAD>', '<UNK>']
    MAX_LEN = 100

    #Predict
    model = load_model(model_path, custom_objects={'CRF': CRF}, compile=False)
    y_pred = model.predict(test_datas)
    y_labels = np.argmax(y_pred, axis=2)         #predicted tag indices
    z_labels = np.argmax(test_labels, axis=2)    #ground-truth tag indices
    word_labels = test_datas                     #word indices

    k = 0
    final_y = []       #predicted labels
    final_z = []       #ground-truth labels
    final_word = []    #corresponding words
    while k < len(y_labels):
        y = y_labels[k]
        for idx in y:
            final_y.append(idx2label[idx])
        z = z_labels[k]
        for idx in z:
            final_z.append(idx2label[idx])
        word = word_labels[k]
        for idx in word:
            final_word.append(idx2vocab[idx])
        k += 1
    print("Final result sizes:", len(final_y), len(final_z))

    #Count correct/incorrect predictions on non-O tokens
    n = 0
    numError = 0
    numRight = 0
    while n < len(final_y):
        if final_y[n] != final_z[n] and final_z[n] != 'O':
            numError += 1
        if final_y[n] == final_z[n] and final_z[n] != 'O':
            numRight += 1
        n += 1
    print("Wrong predictions:", numError)
    print("Correct predictions:", numRight)
    print("Acc:", numRight * 1.0 / (numError + numRight))
    print("Words:", [idx2vocab[idx] for idx in test_datas_[0]])
    print("Ground truth:", [idx2label[idx] for idx in test_labels_[0]])

The constructed model is shown below:

[Figure: model.summary() output for the BiLSTM-CRF model]

Comparison experiments and hyperparameter tuning are left for you to try; I will share tuning experience when time allows. A per-class evaluation sketch follows.
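
Beyond the simple non-O accuracy computed above, a per-class report makes the CRF vs. BiLSTM-CRF comparison more informative. The following minimal sketch (an addition, not part of the original scripts) reads the Final_CRF_Result.csv written by the CRF script, with columns pre_label, real_label, and word, and prints precision, recall, and F1 per label with scikit-learn:

#Per-class precision/recall/F1 from the saved prediction CSV
import csv
from sklearn.metrics import classification_report

y_true, y_pred = [], []
with open("Final_CRF_Result.csv", encoding="utf8") as f:
    for row in csv.DictReader(f):
        y_pred.append(row["pre_label"])
        y_true.append(row["real_label"])

#Report per-label metrics, ignoring the dominant O class
labels = ["B-AG", "B-AV", "B-RL", "B-AI", "B-AM", "B-SI", "B-OS"]
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))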


VII. Summary

That brings this article to a close; I hope it helps. A follow-up will cover the classic BERT model. 2023 was a truly busy year: projects, proposals, papers, graduation, and work. Once things settle down I will write a few solid security posts. Thank you for the support and company, especially my family's encouragement. Onward!

  • I. ATT&CK Data Collection
  • II. Data Splitting and Content Statistics
    1. Paragraph Splitting
    2. Sentence Splitting
  • III. Data Annotation
  • IV. Dataset Splitting
  • V. CRF-Based Entity Recognition
    1. Installing keras-contrib
    2. Installing Keras
    3. Complete Code
  • VI. BiLSTM-CRF-Based Entity Recognition

Life is a series of crossroads, one round of weighing and choosing after another; gains and losses coexist, and different choices lead to different kinds of wonderful. Tired and busy as things are, seeing little Luoluo makes it all feel worthwhile. Thanks to my family for their company; may Xiao Luo grow up happy and healthy. Love you all. Back to work, onward!


(By: Eastmount, 2024-01-31, written at the provincial library. http://blog.csdn.net/eastmount/ )

