Detecting plagiarism with Elasticsearch (Part 2)

In the previous article, "Detecting plagiarism with Elasticsearch (Part 1)", I described how to detect plagiarized articles. This is very useful in practice: my own articles on CSDN are often quoted or copied, sometimes without any attribution, which is unfair to authors, and the approach is just as relevant for many other blog sites. However, the setup in that article may not run smoothly for every developer. In today's article, I deliberately use a self-managed deployment and demonstrate everything in a Jupyter notebook, so that developers can run it end to end, step by step.

Installation

Install Elasticsearch and Kibana

If you have not yet installed your own Elasticsearch and Kibana, please refer to the following articles:

  • How to install Elasticsearch on Linux, macOS, and Windows

  • Kibana: How to install Kibana from the Elastic Stack on Linux, macOS, and Windows

When installing, please choose Elastic Stack 8.x. During the first startup, Elasticsearch prints installation information such as the generated password for the elastic superuser; note it down, as we will need these credentials below.

To upload and deploy the models, we must have a Platinum subscription or activate a trial license.
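If you prefer to activate the trial from code rather than through Kibana's Stack Management → License Management page, a minimal sketch using the Python client might look like this (it assumes the client object we create in the "Connect to Elasticsearch" section below):

# A sketch, not part of the original notebook: start the 30-day trial
# license that unlocks the ML features used below. Assumes the `client`
# connection created in the "Connect to Elasticsearch" section.
resp = client.license.post_start_trial(acknowledge=True)
print(resp["trial_was_started"])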

Upload the models

Note: if we upload the models from the command line here, then you do not need to upload them again in the notebook code below and can skip those steps.

We can refer to the earlier article "Elasticsearch: Talk to your favorite Christmas songs with an NLP question-answering model". We use the following command to upload the OpenAI detector model:

eland_import_hub_model --url https://elastic:o6G_pvRL=8P*7on+o6XH@localhost:9200 \
  --hub-model-id roberta-base-openai-detector \
  --task-type text_classification \
  --ca-cert /Users/liuxg/elastic/elasticsearch-8.11.0/config/certs/http_ca.crt \
  --start

In the command above, adjust the certificate path and the Elasticsearch endpoint (including the password) to match your own deployment.

We can view the newly uploaded model in Kibana.

Next, we upload the text embedding model in the same way:

eland_import_hub_model --url https://elastic:o6G_pvRL=8P*7on+o6XH@localhost:9200 \
  --hub-model-id sentence-transformers/all-mpnet-base-v2 \
  --task-type text_embedding \
  --ca-cert /Users/liuxg/elastic/elasticsearch-8.11.0/config/certs/http_ca.crt \
  --start

To make it easier to follow along, the code can be downloaded from:

git clone https://github.com/liu-xiao-guo/elasticsearch-labs

The Jupyter notebook can be found at the following location:

$ pwd
/Users/liuxg/python/elasticsearch-labs/supporting-blog-content/plagiarism-detection-with-elasticsearch
$ ls
plagiarism_detection_es_self_managed.ipynb

Running the code

Next, we run the notebook. First, install the required Python packages:

pip3 install elasticsearch==8.11
pip3 -q install eland elasticsearch sentence_transformers transformers torch==2.1.0

Before running the code, set the following environment variables:

export ES_USER="elastic"
export ES_PASSWORD="o6G_pvRL=8P*7on+o6XH"
export ES_ENDPOINT="localhost"

We also need to copy the Elasticsearch certificate into the current directory:

$ pwd
/Users/liuxg/python/elasticsearch-labs/supporting-blog-content/plagiarism-detection-with-elasticsearch
$ cp ~/elastic/elasticsearch-8.11.0/config/certs/http_ca.crt .
$ ls
http_ca.crt                                plagiarism_detection_es_self_managed.ipynb
plagiarism_detection_es.ipynb

Import the packages:

from elasticsearch import Elasticsearch, helpers
from elasticsearch.client import MlClient
from eland.ml.pytorch import PyTorchModel
from eland.ml.pytorch.transformers import TransformerModel
from urllib.request import urlopen
import json
from pathlib import Path
import os

Connect to Elasticsearch

elastic_user = os.getenv('ES_USER')
elastic_password = os.getenv('ES_PASSWORD')
elastic_endpoint = os.getenv("ES_ENDPOINT")

url = f"https://{elastic_user}:{elastic_password}@{elastic_endpoint}:9200"
client = Elasticsearch(url, ca_certs="./http_ca.crt", verify_certs=True)
print(client.info())

Upload the detector model

hf_model_id = 'roberta-base-openai-detector'
tm = TransformerModel(model_id=hf_model_id, task_type="text_classification")

# Set the model ID as it is named in Elasticsearch
es_model_id = tm.elasticsearch_model_id()

# Download the model from Hugging Face
tmp_path = "models"
Path(tmp_path).mkdir(parents=True, exist_ok=True)
model_path, config, vocab_path = tm.save(tmp_path)

# Load the model into Elasticsearch
ptm = PyTorchModel(client, es_model_id)
ptm.import_model(model_path=model_path, config_path=None, vocab_path=vocab_path, config=config)

# Start the model
s = MlClient.start_trained_model_deployment(client, model_id=es_model_id)
s.body

We can verify the deployment in Kibana.
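We can also confirm it from the notebook itself; a small sketch following the same MlClient calling style used elsewhere in this post:

# Sketch: list the trained model and its deployment state from Python.
resp = MlClient.get_trained_models_stats(client, model_id="roberta-base-openai-detector")
for stats in resp["trained_model_stats"]:
    state = stats.get("deployment_stats", {}).get("state", "not deployed")
    print(stats["model_id"], "->", state)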

Upload the text embedding model

hf_model_id = 'sentence-transformers/all-mpnet-base-v2'
tm = TransformerModel(model_id=hf_model_id, task_type="text_embedding")

# Set the model ID as it is named in Elasticsearch
es_model_id = tm.elasticsearch_model_id()

# Download the model from Hugging Face
tmp_path = "models"
Path(tmp_path).mkdir(parents=True, exist_ok=True)
model_path, config, vocab_path = tm.save(tmp_path)

# Load the model into Elasticsearch
ptm = PyTorchModel(client, es_model_id)
ptm.import_model(model_path=model_path, config_path=None, vocab_path=vocab_path, config=config)

# Start the model
s = MlClient.start_trained_model_deployment(client, model_id=es_model_id)
s.body

We can verify it in Kibana.
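As a quick sanity check (a sketch, not part of the original notebook), we can run a test inference and confirm that the model returns a 768-dimensional vector, the dimension we will declare in the index mapping below:

# Sketch: verify that the embedding model produces 768-dimensional vectors.
docs = [{"text_field": "A short test sentence."}]
resp = MlClient.infer_trained_model(client, model_id=es_model_id, docs=docs)
vector = resp["inference_results"][0]["predicted_value"]
print(len(vector))  # expect 768 for all-mpnet-base-v2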

Create the source index

client.indices.create(
    index="plagiarism-docs",
    mappings={
        "properties": {
            "title": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
            "abstract": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
            "url": {"type": "keyword"},
            "venue": {"type": "keyword"},
            "year": {"type": "keyword"}
        }
    }
)

We can view the new index in Kibana.
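Or check the mapping directly from the notebook (a quick sketch):

# Sketch: print the field mappings of the newly created source index.
mapping = client.indices.get_mapping(index="plagiarism-docs")
print(json.dumps(mapping["plagiarism-docs"]["mappings"], indent=2))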

Create the checker ingest pipeline

client.ingest.put_pipeline(
    id="plagiarism-checker-pipeline",
    processors=[
        {
            "inference": {  # infer against the data being ingested in the pipeline
                "model_id": "roberta-base-openai-detector",  # text classification model id
                "target_field": "openai-detector",  # target field for the inference results
                "field_map": {  # maps document field names to the known field names of the model
                    "abstract": "text_field"  # typically for NLP models, the expected field name is text_field
                }
            }
        },
        {
            "inference": {
                "model_id": "sentence-transformers__all-mpnet-base-v2",  # text embedding model id
                "target_field": "abstract_vector",  # target field for the inference results
                "field_map": {
                    "abstract": "text_field"
                }
            }
        }
    ]
)

We can inspect the pipeline in Kibana.
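Before reindexing the whole dataset, it can be worth simulating the pipeline on a single test document. A minimal sketch (the abstract text here is made up for illustration):

# Sketch: run one test document through the pipeline without indexing it.
resp = client.ingest.simulate(
    id="plagiarism-checker-pipeline",
    docs=[{"_source": {"abstract": "We present a new dataset for machine comprehension."}}],
)
enriched = resp["docs"][0]["doc"]["_source"]
print(enriched["openai-detector"]["predicted_value"])       # 'Real' or 'Fake'
print(len(enriched["abstract_vector"]["predicted_value"]))  # expect 768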

Create the plagiarism checker index

client.indices.create(
    index="plagiarism-checker",
    mappings={
        "properties": {
            "title": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
            "abstract": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
            "url": {"type": "keyword"},
            "venue": {"type": "keyword"},
            "year": {"type": "keyword"},
            "abstract_vector.predicted_value": {  # inference results field: target_field.predicted_value
                "type": "dense_vector",
                "dims": 768,  # embedding size
                "index": "true",
                "similarity": "dot_product"  # similarity function used when indexing vectors for approximate kNN search
            }
        }
    }
)

We can view this index in Kibana.

Index the source documents

First, download the dataset from https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/emnlp2016-2018.json into the current directory:

$ pwd
/Users/liuxg/python/elasticsearch-labs/supporting-blog-content/plagiarism-detection-with-elasticsearch
$ ls
emnlp2016-2018.json                        plagiarism_detection_es.ipynb
http_ca.crt                                plagiarism_detection_es_self_managed.ipynb
models

As shown above, emnlp2016-2018.json is the file we downloaded.
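If you would rather fetch it from the notebook, a minimal sketch using the urlopen import from earlier:

# Sketch: download the dataset with urllib instead of a manual download.
dataset_url = "https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/emnlp2016-2018.json"
with urlopen(dataset_url) as resp, open("emnlp2016-2018.json", "wb") as f:
    f.write(resp.read())

Next, we load the file and bulk-index the documents into the source index: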

# Load data into a JSON object
with open('emnlp2016-2018.json') as f:
    data_json = json.load(f)

print(f"Successfully loaded {len(data_json)} documents")

def create_index_body(doc):
    """ Generate the body for an Elasticsearch document. """
    return {
        "_index": "plagiarism-docs",
        "_source": doc,
    }

# Prepare the documents to be indexed
documents = [create_index_body(doc) for doc in data_json]

# Use helpers.bulk to index
helpers.bulk(client, documents)

print("Done indexing documents into `plagiarism-docs` source index")

We can check the indexed documents in Kibana.

Reindex with the ingest pipeline

client.reindex(
    wait_for_completion=False,
    source={"index": "plagiarism-docs"},
    dest={"index": "plagiarism-checker", "pipeline": "plagiarism-checker-pipeline"}
)

Above, we set wait_for_completion=False, so the reindex runs as an asynchronous task and we need to give it some time to finish. We can monitor progress by comparing the document counts of the two indices, as in the sketch below.
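A minimal polling sketch (the exact counts depend on the dataset version you downloaded):

import time

# Sketch: poll until the checker index holds as many documents as the source index.
while True:
    client.indices.refresh(index="plagiarism-checker")
    done = client.count(index="plagiarism-checker")["count"]
    total = client.count(index="plagiarism-docs")["count"]
    print(f"{done}/{total} documents reindexed")
    if done >= total:
        break
    time.sleep(10)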

Once the counts match, the reindex is complete. Let's then look at a document in the plagiarism-checker index:
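For example, this sketch fetches one enriched document and prints the fields added by the pipeline:

# Sketch: inspect one document to confirm the pipeline enrichment.
hit = client.search(index="plagiarism-checker", size=1)["hits"]["hits"][0]["_source"]
print(hit["title"])
print(hit["openai-detector"]["predicted_value"])       # 'Real' or 'Fake'
print(len(hit["abstract_vector"]["predicted_value"]))  # expect 768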

Checking for plagiarized text

Direct plagiarism

model_text = 'Understanding and reasoning about cooking recipes is a fruitful research direction towards enabling machines to interpret procedural text. In this work, we introduce RecipeQA, a dataset for multimodal comprehension of cooking recipes. It comprises of approximately 20K instructional recipes with multiple modalities such as titles, descriptions and aligned set of images. With over 36K automatically generated question-answer pairs, we design a set of comprehension and reasoning tasks that require joint understanding of images and text, capturing the temporal flow of events and making sense of procedural knowledge. Our preliminary results indicate that RecipeQA will serve as a challenging test bed and an ideal benchmark for evaluating machine comprehension systems. The data and leaderboard are available at http://hucvl.github.io/recipeqa.'

response = client.search(index='plagiarism-checker', size=1,
    knn={
        "field": "abstract_vector.predicted_value",
        "k": 9,
        "num_candidates": 974,
        "query_vector_builder": {
            "text_embedding": {
                "model_id": "sentence-transformers__all-mpnet-base-v2",
                "model_text": model_text
            }
        }
    }
)

for hit in response['hits']['hits']:
    score = hit['_score']
    title = hit['_source']['title']
    abstract = hit['_source']['abstract']
    openai = hit['_source']['openai-detector']['predicted_value']
    url = hit['_source']['url']

    if score > 0.9:
        print("\nHigh similarity detected! This might be plagiarism.")
        print(f"\nMost similar document: '{title}'\n\nAbstract: {abstract}\n\nurl: {url}\n\nScore: {score}\n")
        if openai == 'Fake':
            print("This document may have been created by AI.\n")
    elif score < 0.7:
        print("\nLow similarity detected. This might not be plagiarism.")
        if openai == 'Fake':
            print("This document may have been created by AI.\n")
    else:
        print("\nModerate similarity detected.")
        print(f"\nMost similar document: '{title}'\n\nAbstract: {abstract}\n\nurl: {url}\n\nScore: {score}\n")
        if openai == 'Fake':
            print("This document may have been created by AI.\n")

# Check whether the query text itself looks AI-generated
ml_client = MlClient(client)
model_id = 'roberta-base-openai-detector'  # OpenAI text classification model
document = [{"text_field": model_text}]
ml_response = ml_client.infer_trained_model(model_id=model_id, docs=document)
predicted_value = ml_response['inference_results'][0]['predicted_value']
if predicted_value == 'Fake':
    print("Note: The text query you entered may have been generated by AI.\n")

Similar text: paraphrase plagiarism

model_text = 'Comprehending and deducing information from culinary instructions represents a promising avenue for research aimed at empowering artificial intelligence to decipher step-by-step text. In this study, we present CuisineInquiry, a database for the multifaceted understanding of cooking guidelines. It encompasses a substantial number of informative recipes featuring various elements such as headings, explanations, and a matched assortment of visuals. Utilizing an extensive set of automatically crafted question-answer pairings, we formulate a series of tasks focusing on understanding and logic that necessitate a combined interpretation of visuals and written content. This involves capturing the sequential progression of events and extracting meaning from procedural expertise. Our initial findings suggest that CuisineInquiry is poised to function as a demanding experimental platform.'

response = client.search(index='plagiarism-checker', size=1,
    knn={
        "field": "abstract_vector.predicted_value",
        "k": 9,
        "num_candidates": 974,
        "query_vector_builder": {
            "text_embedding": {
                "model_id": "sentence-transformers__all-mpnet-base-v2",
                "model_text": model_text
            }
        }
    }
)

for hit in response['hits']['hits']:
    score = hit['_score']
    title = hit['_source']['title']
    abstract = hit['_source']['abstract']
    openai = hit['_source']['openai-detector']['predicted_value']
    url = hit['_source']['url']

    if score > 0.9:
        print("\nHigh similarity detected! This might be plagiarism.")
        print(f"\nMost similar document: '{title}'\n\nAbstract: {abstract}\n\nurl: {url}\n\nScore: {score}\n")
        if openai == 'Fake':
            print("This document may have been created by AI.\n")
    elif score < 0.7:
        print("\nLow similarity detected. This might not be plagiarism.")
        if openai == 'Fake':
            print("This document may have been created by AI.\n")
    else:
        print("\nModerate similarity detected.")
        print(f"\nMost similar document: '{title}'\n\nAbstract: {abstract}\n\nurl: {url}\n\nScore: {score}\n")
        if openai == 'Fake':
            print("This document may have been created by AI.\n")

# Check whether the query text itself looks AI-generated
ml_client = MlClient(client)
model_id = 'roberta-base-openai-detector'  # OpenAI text classification model
document = [{"text_field": model_text}]
ml_response = ml_client.infer_trained_model(model_id=model_id, docs=document)
predicted_value = ml_response['inference_results'][0]['predicted_value']
if predicted_value == 'Fake':
    print("Note: The text query you entered may have been generated by AI.\n")

The complete code can be downloaded from: https://github.com/liu-xiao-guo/elasticsearch-labs/blob/main/supporting-blog-content/plagiarism-detection-with-elasticsearch/plagiarism_detection_es_self_managed.ipynb
