- 1. Introduction
- 2. Document Loading
- 2.1 Retrieval Augmented Generation (RAG)
- 2.2 Load PDFs
- 2.3 Load YouTube
- 2.4 Load URLs
- 2.5 Load Notion
- 3. Document Splitting
- 3.1 Splitter Flow
- 3.2 Character Splitter
- 3.3 Token Splitter
- 3.4 Markdown Splitter
- 4. Vector Stores and Embeddings
- 4.1 Embedding
- 4.2 Vector Store Workflow
- 4.3 Usage Examples
- 4.4 Edge Cases
- 5. Retrieval
- 5.1 Maximum marginal relevance (MMR)
- 5.2 SelfQuery
- 5.3 Compression
- 5.4 Usage Examples
- 5.5 Other Retrievals
- 6. Question Answering
- 6.1 RetrievalQA Chain
- 6.2 Usage Examples
- 7. Chat
- 7.1 ConversationalRetrievalChain
- 7.2 Usage Examples
- 7.3 Create a chatbot that works on your documents
- 8. Conclusion
1. Introduction
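- One of the preprocessing steps is splitting these documents into semantically meaningful chunks: this seemingly simple step actually involves many details.
- Semantic search: used to fetch the information relevant to the user's question
- Handling failure cases
- Answering the user's question with the retrieved documents
- The key to a chatbot that converses with your data: memory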
2. Document Loading
LangChain includes more than 80 different document loaders, as shown in the figure below:
2.1 Retrieval Augmented Generation (RAG)
2.2 Load PDFs
#! pip install langchain
#! pip install pypdf
import os
import openai
import sys
sys.path.append('../..')
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
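After loading, each page in the resulting data structure is a Document, with its text in page_content and its provenance in metadata.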
page = pages[0]
Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine
learning class. So what I wanna do today is ju st spend a little time going over the logistics
of the class, and then we'll start to talk a bit about machine learning.
By way of introduction, my name's Andrew Ng and I'll be instru ctor for this class. And so
I personally work in machine learning, and I' ve worked on it for about 15 years now, and
I actually think that machine learning i
{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}
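To make that shape concrete, here is a minimal stand-in for such a record. It assumes only the two fields LangChain documents expose, page_content and metadata; SimpleDocument is a hypothetical name used for illustration and is not LangChain's Document class.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for one loaded page: text plus provenance."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# One record per PDF page, mirroring the shape of the loader output.
pages = [
    SimpleDocument(
        page_content="Instructor (Andrew Ng): Okay. Good morning.",
        metadata={
            "source": "docs/cs229_lectures/MachineLearning-Lecture01.pdf",
            "page": 0,
        },
    )
]

page = pages[0]
print(page.page_content[:20])
print(page.metadata)
```

Keeping the source path and page number next to the text is what later makes metadata filtering and source attribution possible.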
2.3 Load YouTube
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
# ! pip install yt_dlp
# ! pip install pydub
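# url and save_dir as they appear in the download log below
url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir = "docs/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser()
)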
docs = loader.load()
[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I
[youtube] jGwO_UgTS7I: Downloading webpage
[youtube] jGwO_UgTS7I: Downloading android player API JSON
[info] jGwO_UgTS7I: Downloading 1 format(s): 140
[download] docs/youtube//Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a has already been downloaded
[download] 100% of 69.71MiB
[ExtractAudio] Not converting audio docs/youtube//Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a; file is already in target format m4a
Transcribing part 1!
2.4 Load URLs
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")
docs = loader.load()
2.5 Load Notion
2) Unzip the export and save it as a folder containing the Markdown files of the Notion pages.
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
# Blendle's Employee Handbook

This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that
{'source': "docs/Notion_DB/Blendle's Employee Handbook e367aa77e225482c849111687e114a56.md"}
3. Document Splitting
3.1 Splitter Flow
3.2 Character Splitter
The separator of this recursive text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
chunk_size = 26
chunk_overlap = 4
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
text1 = 'abcdefghijklmnopqrstuvwxyz'
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
['a b c d e f g h i j k l m n o p q r s t u v w x y z']
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=' '
)
['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
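The overlap seen on text2 can be reproduced in a few lines of plain Python. This is a minimal sketch of fixed-size windowing only, assuming no separators are involved; sliding_chunks is a hypothetical helper and not how the LangChain splitters are implemented.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while True:
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            # The current window already reaches the end of the text.
            break
        # Step forward, re-reading the last chunk_overlap characters.
        start += chunk_size - chunk_overlap
    return chunks


short_chunks = sliding_chunks('abcdefghijklmnopqrstuvwxyz', 26, 4)
long_chunks = sliding_chunks('abcdefghijklmnopqrstuvwxyzabcdefg', 26, 4)
print(short_chunks)
print(long_chunks)
```

The second chunk starts with wxyz because the window steps forward by chunk_size - chunk_overlap = 22 characters.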
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator=' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)
['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,','have a space.and words are separated by space.']
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "\. ", " ", ""]
)
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related",'. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.','Paragraphs are often delimited with a carriage return or two carriage returns','. Carriage returns are the "backslash n" you see embedded in this string','. Sentences have a period at the end, but also, have a space.and words are separated by space.']
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "(?<=\. )", " ", ""]
)
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related.",'For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.','Paragraphs are often delimited with a carriage return or two carriage returns.','Carriage returns are the "backslash n" you see embedded in this string.','Sentences have a period at the end, but also, have a space.and words are separated by space.']
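The difference between the two sentence separators can be seen with Python's re module alone. The sketch below uses a made-up sample string and only shows where the period ends up. It does not reproduce the splitter's chunk merging, and a bare re.split drops the matched separator, whereas in the output above it reappears at the start of the next chunk.

```python
import re

sample = "First idea. Second idea. Third idea."

# Plain pattern: the matched ". " is consumed, so the left piece loses its period.
consumed = re.split(r"\. ", sample)

# Lookbehind pattern: matches the empty position right after ". ",
# so nothing is consumed and the period stays with its sentence.
kept = re.split(r"(?<=\. )", sample)

print(consumed)
print(kept)
```

Splitting on an empty match like this requires Python 3.7 or later.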
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)
docs = text_splitter.split_documents(pages)
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
notion_db = loader.load()
docs = text_splitter.split_documents(notion_db)
3.3 Token Splitter
from langchain.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
text1 = "foo bar bazzyfoo"
['foo', ' bar', ' b', 'az', 'zy', 'foo']
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
docs = text_splitter.split_documents(pages)
Document(page_content='MachineLearning-Lecture01 \n', metadata={'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0})
{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}
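Once the text is tokenized, token-based splitting amounts to grouping the token list. The sketch below reuses the six tokens printed above for "foo bar bazzyfoo" rather than calling a real tokenizer; chunk_tokens is a hypothetical helper for illustration.

```python
def chunk_tokens(tokens, chunk_size):
    """Group tokens into consecutive chunks of at most chunk_size tokens and join each chunk back into text."""
    return [
        "".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


# Tokens copied from the chunk_size=1 output above, not produced by a tokenizer here.
tokens = ['foo', ' bar', ' b', 'az', 'zy', 'foo']

one_per_chunk = chunk_tokens(tokens, 1)
two_per_chunk = chunk_tokens(tokens, 2)
print(one_per_chunk)
print(two_per_chunk)
```

Chunk boundaries follow token boundaries, so a chunk can start or end in the middle of a word.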
3.4 Markdown Splitter
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n
## Chapter 2\n\n \
Hi this is Molly"""
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)
Document(page_content='Hi this is Jim \nHi this is Joe', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'})
Document(page_content='Hi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'})
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(txt)
Document(page_content="This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change. \n**Everything related to working at Blendle and the people of Blendle, made public.** \nThese are the lessons from three years of working with the people of Blendle. It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pvs=21) to [how we increase salaries](https://www.notion.so/Salary-Review-e11b6161c6d34f5c9568bb3e83ed96b6?pvs=21), from [how we hire](https://www.notion.so/Hiring-451bbcfe8d9b49438c0633326bb7af0a?pvs=21) and [fire](https://www.notion.so/Firing-5567687a2000496b8412e53cd58eed9d?pvs=21) to [how we think people should give each other feedback](https://www.notion.so/Our-Feedback-Process-eb64f1de796b4350aeab3bc068e3801f?pvs=21) — and much more. \nWe've made this document public because we want to learn from you. We're very much interested in your feedback (including weeding out typo's and Dunglish ;)). Email us at hr@blendle.com. If you're starting your own company or if you're curious as to how we do things at Blendle, we hope that our employee handbook inspires you. \nIf you want to work at Blendle you can check our [job ads here](https://blendle.homerun.co/). If you want to be kept in the loop about Blendle, you can sign up for [our behind the scenes newsletter](https://blendle.homerun.co/yes-keep-me-posted/tr/apply?token=8092d4128c306003d97dd3821bad06f2).", metadata={'Header 1': "Blendle's Employee Handbook"})
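The way header titles end up in each chunk's metadata can be sketched without LangChain. The following is a simplified illustration under stated assumptions (ATX-style headers only, one header per line); split_by_headers is a hypothetical helper, not the MarkdownHeaderTextSplitter implementation.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Split markdown text at the given headers and attach the active header titles to each chunk."""
    level_of = {name: len(prefix) for prefix, name in headers_to_split_on}
    active = {}    # header name -> title currently in effect
    buffer = []    # body lines collected since the last header
    sections = []

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            sections.append({"content": text, "metadata": dict(active)})
        buffer.clear()

    for raw in markdown.splitlines():
        line = raw.strip()
        for prefix, name in headers_to_split_on:
            if line.startswith(prefix + " "):
                flush()
                # A new header closes every header at the same or a deeper level.
                for stale in [n for n in active if level_of[n] >= len(prefix)]:
                    del active[stale]
                active[name] = line[len(prefix):].strip()
                break
        else:
            if line:
                buffer.append(line)
    flush()
    return sections


doc = (
    "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
    "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
)
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
sections = split_by_headers(doc, headers)
for section in sections:
    print(section)
```

Each chunk carries the titles of all headers it sits under, which is why the "Lance" chunk has three metadata entries while the "Molly" chunk has two.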
4. Vector Stores and Embeddings
4.1 Embedding
4.2 Vector Store Workflow
4.3 Usage Examples
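LangChain integrates with more than 30 different vector stores. The examples in this section use Chroma, which is lightweight, supports in-memory storage, and is very easy to get started with.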
from langchain.document_loaders import PyPDFLoader
# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150
)
splits = text_splitter.split_documents(docs)
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)
import numpy as np
len(embedding1)  # 1536
np.max(embedding1)  # 0.22333593666553497
# -0.6714226007461548
np.dot(embedding1, embedding2)
np.dot(embedding1, embedding3)
np.dot(embedding2, embedding3)
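The comparison above rests on the dot product of unit-length vectors, which equals their cosine similarity. A minimal sketch with made-up 3-dimensional vectors in place of real 1536-dimensional embeddings (the numbers are hypothetical and chosen only so the first two point in a similar direction):

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Hypothetical toy vectors, not real embedding values.
dogs = normalize([0.9, 0.8, 0.1])
canines = normalize([0.85, 0.75, 0.2])
weather = normalize([0.1, 0.2, 0.95])

print(dot(dogs, canines))
print(dot(dogs, weather))
print(dot(canines, weather))
```

The pair with similar meaning scores close to 1, while either of them paired with the unrelated sentence scores much lower.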
# ! pip install chromadb
!rm -rf ./docs/chroma # remove old database files if any
from langchain.vectorstores import Chroma
persist_directory = 'docs/chroma/'
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)
question = "is there an email i can ask for help"
docs = vectordb.similarity_search(question,k=3)
# Persist the database
4.4 Edge Cases
5. Retrieval
5.1 Maximum marginal relevance (MMR)
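MMR works as follows: when a query is issued, the fetch_k parameter determines how many responses the initial lookup returns; this step is based purely on similarity search. MMR then filters those documents further by semantic similarity and document diversity, and selects k documents.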
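The two stages can be sketched in plain Python. This is a minimal illustration, assuming cosine similarity and a greedy selection rule; mmr_select, lambda_mult and the toy 2-dimensional vectors are choices made for this sketch, not LangChain's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, doc_vecs, k, fetch_k, lambda_mult):
    """Return indices of k documents chosen for relevance and diversity."""
    # Stage 1: pure similarity search, keep the fetch_k closest candidates.
    ranked = sorted(range(len(doc_vecs)), key=lambda i: -cosine(query_vec, doc_vecs[i]))
    candidates = ranked[:fetch_k]
    selected = []
    # Stage 2: greedily pick the candidate with the best relevance/redundancy trade-off.
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
# Documents 0 and 1 are exact duplicates; document 2 is related but different.
docs = [[1.0, 0.0], [1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

similarity_only = mmr_select(query, docs, k=2, fetch_k=3, lambda_mult=1.0)
diverse = mmr_select(query, docs, k=2, fetch_k=3, lambda_mult=0.3)
print(similarity_only)
print(diverse)
```

With lambda_mult=1.0 the redundancy term vanishes and the duplicate is returned; lowering it makes the second pick skip the duplicate in favour of the different document.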
5.2 SelfQuery
5.3 Compression
5.4 Usage Examples
#!pip install lark
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'
embedding = OpenAIEmbeddings()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)
# 209
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]
smalldb = Chroma.from_texts(texts, embedding=embedding)
question = "Tell me about all-white mushrooms with large fruiting bodies"
smalldb.similarity_search(question, k=2)
[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).', metadata={})]
question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question,k=3)
'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '
'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)
'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '
'algorithm then? So what’s different? How come I was making all that noise earlier about \nleast squa'
question = "what did they say about regression in the third lecture?"
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)
for d in docs:
    print(d.metadata)
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 4}
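What the filter argument does can be pictured as an exact-match test on each chunk's metadata, applied alongside the similarity ranking. A minimal sketch with made-up chunk texts; filter_chunks is a hypothetical helper and says nothing about how Chroma evaluates filters internally.

```python
# Made-up chunks; only the metadata layout mirrors the output above.
chunks = [
    {"text": "made-up chunk about regression",
     "metadata": {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf", "page": 0}},
    {"text": "made-up chunk about regression",
     "metadata": {"source": "docs/cs229_lectures/MachineLearning-Lecture01.pdf", "page": 8}},
    {"text": "made-up chunk about least squares",
     "metadata": {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf", "page": 4}},
]


def filter_chunks(chunks, where):
    """Keep only chunks whose metadata matches every key/value pair in where."""
    return [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in where.items())
    ]


third_lecture = filter_chunks(
    chunks, {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)
for chunk in third_lecture:
    print(chunk["metadata"])
```

A self-query retriever automates the step of writing that where dictionary by asking an LLM to extract it from the question.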
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]
document_content_description = "Lecture notes"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)
question = "what did they say about regression in the third lecture?"
docs = retriever.get_relevant_documents(question)
query='regression' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='source', value='docs/cs229_lectures/MachineLearning-Lecture03.pdf') limit=None
for d in docs:
    print(d.metadata)
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 10}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 10}
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

# Wrap our vectorstore
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
Document 1:"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
Document 2:"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
Document 3:"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it."
Document 4:"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it."
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
Document 1:"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
Document 2:"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it."
5.5 Other Retrievals
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load PDF
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)
# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits,embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)
question = "What are major topics for this class?"
Document(page_content="let me just check what questions you have righ t now. So if there are no questions, I'll just \nclose with two reminders, which are after class today or as you start to talk with other \npeople in this class, I just encourage you again to start to form project partners, to try to \nfind project partners to do your project with. And also, this is a good time to start forming \nstudy groups, so either talk to your friends or post in the newsgroup, but we just \nencourage you to try to star t to do both of those today, okay? Form study groups, and try \nto find two other project partners. \nSo thank you. I'm looking forward to teaching this class, and I'll see you in a couple of \ndays. [End of Audio] \nDuration: 69 minutes", metadata={})
question = "what did they say about matlab?"
Document(page_content="Saxena and Min Sun here did, wh ich is given an image like this, right? This is actually a \npicture taken of the Stanford campus. You can apply that sort of cl ustering algorithm and \ngroup the picture into regions. Let me actually blow that up so that you can see it more \nclearly. Okay. So in the middle, you see the lines sort of groupi ng the image together, \ngrouping the image into [inaudible] regions. \nAnd what Ashutosh and Min did was they then applied the learning algorithm to say can \nwe take this clustering and us e it to build a 3D model of the world? And so using the \nclustering, they then had a lear ning algorithm try to learn what the 3D structure of the \nworld looks like so that they could come up with a 3D model that you can sort of fly \nthrough, okay? Although many people used to th ink it's not possible to take a single \nimage and build a 3D model, but using a lear ning algorithm and that sort of clustering \nalgorithm is the first step. They were able to. \nI'll just show you one more example. I like this because it's a picture of Stanford with our \nbeautiful Stanford campus. So again, taking th e same sort of clustering algorithms, taking \nthe same sort of unsupervised learning algor ithm, you can group the pixels into different \nregions. And using that as a pre-processing step, they eventually built this sort of 3D model of Stanford campus in a single picture. You can sort of walk into the ceiling, look", metadata={})
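Unlike the embedding-based retrievers, a TF-IDF retriever ranks chunks purely by word statistics. The sketch below uses a deliberately simplified weighting (term frequency times log inverse document frequency) on three made-up sentences; it is an illustration of the idea, not the formula used by TFIDFRetriever or scikit-learn.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization, enough for a toy example."""
    return text.lower().split()


def tfidf_rank(query, docs):
    """Return document indices ordered from best to worst match for the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = sum(
            (term_freq[term] / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term in tokenize(query)
            if term in term_freq
        )
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: -scores[i])


# Made-up documents for illustration.
docs = [
    "homework uses matlab or octave",
    "the class covers machine learning",
    "matlab makes matrix code easy matlab",
]
ranking = tfidf_rank("matlab", docs)
print(ranking)
```

Because only literal word overlap counts, a query phrased with different vocabulary than the chunk can miss it entirely, which is the kind of failure the semantic retrievers are meant to avoid.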
6. Question Answering
6.1 RetrievalQA Chain
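- Map_reduce: each retrieved document chunk is sent to the LLM to be summarized, and the shorter summaries of all chunks are then passed together as context to a final LLM call that answers the question. The advantage is that it can answer over any number of documents; the drawback is that when the relevant context is spread across different chunks, it may not produce a good answer.
- Refine: similar to a recursive process. The first document chunk is sent to the LLM to be summarized; that summary is then sent to the LLM together with the second chunk to be summarized again, and so on until a final summary is obtained, which is then passed to the LLM to answer the question.
- Map_rerank: the first stage is similar, but each summarized text is given a score; the summaries are then ranked and the highest-scoring one is passed to the LLM to answer the question.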
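The control flow of the three strategies, as described in the list above, can be sketched with a stub in place of a real model. Everything here is hypothetical scaffolding: fake_llm only records its prompts, the prompt wording is invented, and the scoring function passed to map_rerank is an arbitrary placeholder.

```python
calls = []  # every prompt sent to the stub, in order


def fake_llm(prompt):
    """Stand-in for an LLM call: records the prompt and returns a numbered placeholder."""
    calls.append(prompt)
    return f"<answer {len(calls)}>"


def map_reduce(chunks, question, llm):
    # Map: summarize every chunk independently. Reduce: answer over all summaries.
    summaries = [llm("Summarize: " + chunk) for chunk in chunks]
    return llm("Context: " + " ".join(summaries) + " Question: " + question)


def refine(chunks, question, llm):
    # Carry one running summary through the chunks, updating it at each step.
    summary = llm("Summarize: " + chunks[0])
    for chunk in chunks[1:]:
        summary = llm("Refine: " + summary + " With: " + chunk)
    return llm("Context: " + summary + " Question: " + question)


def map_rerank(chunks, question, llm, score):
    # Summarize every chunk, keep only the best-scoring summary, then answer.
    summaries = [llm("Summarize: " + chunk) for chunk in chunks]
    best = max(summaries, key=score)
    return llm("Context: " + best + " Question: " + question)


chunks = ["chunk A", "chunk B", "chunk C"]

calls.clear()
map_reduce(chunks, "Is probability a class topic?", fake_llm)
map_reduce_calls = list(calls)

calls.clear()
refine(chunks, "Is probability a class topic?", fake_llm)
refine_calls = list(calls)

calls.clear()
map_rerank(chunks, "Is probability a class topic?", fake_llm, score=len)
map_rerank_calls = list(calls)

print(len(map_reduce_calls), len(refine_calls), len(map_rerank_calls))
```

The map steps of map_reduce and map_rerank are independent of each other and could run in parallel, whereas each refine call depends on the previous result and must run in sequence.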
6.2 Usage Examples
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'
embedding = OpenAIEmbeddings()
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
question = "What are major topics for this class?"
docs = vectordb.similarity_search(question,k=3)
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name=llm_name, temperature=0)
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)
result = qa_chain({"query": question})
'The major topic for this class is machine learning. Additionally, the class may cover statistics and algebra as refreshers in the discussion sections. Later in the quarter, the discussion sections will also cover extensions for the material taught in the main lectures.'
from langchain.prompts import PromptTemplate
# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)
question = "Is probability a class topic?"
result = qa_chain({"query": question})
'Yes, probability is assumed to be a prerequisite for this class. The instructor assumes familiarity with basic probability and statistics, and will go over some of the prerequisites in the discussion sections as a refresher course. Thanks for asking!'
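Before the prompt reaches the model, the placeholders in the template are filled with the retrieved text and the user's question. A minimal sketch with plain str.format, assuming a shortened template and a made-up context string; it mimics the substitution step only and does not use PromptTemplate.

```python
# Shortened, hypothetical version of the QA template with both placeholders.
template = (
    "Use the following pieces of context to answer the question at the end.\n"
    "{context}\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

# Made-up stand-ins for retrieved chunks.
retrieved_chunks = [
    "made-up chunk: the course assumes basic probability and statistics.",
    "made-up chunk: discussion sections include refreshers.",
]

prompt = template.format(
    context="\n".join(retrieved_chunks),
    question="Is probability a class topic?",
)
print(prompt)
```

If the template has no context placeholder, the retrieved chunks never reach the model, so the answer cannot be grounded in the documents.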
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce"
)
result = qa_chain_mr({"query": question})
'There is no clear answer to this question based on the given portion of the document. The document mentions familiarity with basic probability and statistics as a prerequisite for the class, and there is a brief mention of probability in the text, but it is not clear if it is a main topic of the class. The instructor mentions using a probabilistic interpretation to derive a learning algorithm, but does not go into further detail about probability as a topic.'
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="refine"
)
result = qa_chain_mr({"query": question})
"The main topic of the class is machine learning, but the course assumes that students are familiar with basic probability and statistics, including random variables, expectation, variance, and basic linear algebra. The instructor will provide a refresher course on these topics in some of the discussion sections. Later in the quarter, the discussion sections will also cover extensions for the material taught in the main lectures. Machine learning is a vast field, and there are a few extensions that the instructor wants to teach but didn't have time to cover in the main lectures. The class will not be very programming-intensive, but some programming will be done in MATLAB or Octave."
7. Chat
7.1 ConversationalRetrievalChain
7.2 Usage Examples
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'
embedding = OpenAIEmbeddings()
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
question = "What are major topics for this class?"
docs = vectordb.similarity_search(question,k=3)
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name=llm_name, temperature=0)
llm.predict("Hello world!")
'Hello there! How can I assist you today?'
# Build prompt
from langchain.prompts import PromptTemplate
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context", "question"], template=template)
# Run chain
from langchain.chains import RetrievalQA
question = "Is probability a class topic?"
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)
result = qa_chain({"query": question})
'Yes, probability is assumed to be a prerequisite for this class. The instructor assumes familiarity with basic probability and statistics, and will go over some of the prerequisites in the discussion sections as a refresher course. Thanks for asking!'
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)
from langchain.chains import ConversationalRetrievalChain
# The retriever must exist before the chain is built
retriever = vectordb.as_retriever()
qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=retriever,
    memory=memory
)
question = "Is probability a class topic?"
result = qa({"question": question})
'Yes, probability is a topic that will be assumed to be familiar to students in this class. The instructor assumes that students have familiarity with basic probability and statistics, and that most undergraduate statistics classes will be more than enough.'
question = "why are those prerequesites needed?"
result = qa({"question": question})
'The reason for requiring familiarity with basic probability and statistics as prerequisites for this class is that the class assumes that students already know what random variables are, what expectation is, what a variance or a random variable is. The class will not be very programming intensive, but will involve some programming in either MATLAB or Octave.'
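The follow-up question only makes sense because the earlier exchange is kept and fed back in. A minimal sketch of that idea, assuming a plain list as the buffer and an invented prompt layout; BufferMemory and build_standalone_prompt are hypothetical names, and the stored answer is a made-up placeholder rather than real model output.

```python
class BufferMemory:
    """Hypothetical minimal chat memory: keeps every (question, answer) turn in order."""

    def __init__(self):
        self.turns = []

    def save(self, question, answer):
        self.turns.append((question, answer))

    def as_text(self):
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)


def build_standalone_prompt(memory, follow_up):
    """Combine the stored history with a follow-up so a model could rewrite it as a standalone question."""
    return (
        f"Chat history:\n{memory.as_text()}\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )


memory = BufferMemory()
# Placeholder answer text, written for this sketch.
memory.save("Is probability a class topic?", "made-up answer: basic probability is a prerequisite.")

prompt = build_standalone_prompt(memory, "why are those prerequesites needed?")
print(prompt)
```

Rewriting the follow-up into a standalone question before retrieval matters because the vector store sees only the query text, not the conversation that gave "those" its meaning.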
7.3 Create a chatbot that works on your documents
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.document_loaders import TextLoader
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
The chatbot code has been updated a bit since filming. The GUI appearance also varies depending on the platform it is running on.
def load_db(file, chain_type, k):
    # load documents
    loader = PyPDFLoader(file)
    documents = loader.load()
    # split documents
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    docs = text_splitter.split_documents(documents)
    # define embedding
    embeddings = OpenAIEmbeddings()
    # create vector database from data
    db = DocArrayInMemorySearch.from_documents(docs, embeddings)
    # define retriever
    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": k})
    # create a chatbot chain. Memory is managed externally.
    qa = ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(model_name=llm_name, temperature=0),
        chain_type=chain_type,
        retriever=retriever,
        return_source_documents=True,
        return_generated_question=True,
    )
    return qa
import panel as pn
import param

class cbfs(param.Parameterized):
    chat_history = param.List([])
    answer = param.String("")
    db_query = param.String("")
    db_response = param.List([])

    def __init__(self, **params):
        super(cbfs, self).__init__(**params)
        self.panels = []
        self.loaded_file = "docs/cs229_lectures/MachineLearning-Lecture01.pdf"
        self.qa = load_db(self.loaded_file, "stuff", 4)

    def call_load_db(self, count):
        if count == 0 or file_input.value is None:  # init or no file specified
            return pn.pane.Markdown(f"Loaded File: {self.loaded_file}")
        else:
            file_input.save("temp.pdf")  # local copy
            self.loaded_file = file_input.filename
            button_load.button_style = "outline"
            self.qa = load_db("temp.pdf", "stuff", 4)
            button_load.button_style = "solid"
        self.clr_history()
        return pn.pane.Markdown(f"Loaded File: {self.loaded_file}")

    def convchain(self, query):
        if not query:
            return pn.WidgetBox(pn.Row('User:', pn.pane.Markdown("", width=600)), scroll=True)
        result = self.qa({"question": query, "chat_history": self.chat_history})
        self.chat_history.extend([(query, result["answer"])])
        self.db_query = result["generated_question"]
        self.db_response = result["source_documents"]
        self.answer = result['answer']
        self.panels.extend([
            pn.Row('User:', pn.pane.Markdown(query, width=600)),
            pn.Row('ChatBot:', pn.pane.Markdown(self.answer, width=600, styles={'background-color': '#F6F6F6'}))
        ])
        inp.value = ''  # clears loading indicator when cleared
        return pn.WidgetBox(*self.panels, scroll=True)

    @param.depends('db_query', )
    def get_lquest(self):
        if not self.db_query:
            return pn.Column(
                pn.Row(pn.pane.Markdown(f"Last question to DB:", styles={'background-color': '#F6F6F6'})),
                pn.Row(pn.pane.Str("no DB accesses so far"))
            )
        return pn.Column(
            pn.Row(pn.pane.Markdown(f"DB query:", styles={'background-color': '#F6F6F6'})),
            pn.pane.Str(self.db_query)
        )

    @param.depends('db_response', )
    def get_sources(self):
        if not self.db_response:
            return
        rlist = [pn.Row(pn.pane.Markdown(f"Result of DB lookup:", styles={'background-color': '#F6F6F6'}))]
        for doc in self.db_response:
            rlist.append(pn.Row(pn.pane.Str(doc)))
        return pn.WidgetBox(*rlist, width=600, scroll=True)

    @param.depends('convchain', 'clr_history')
    def get_chats(self):
        if not self.chat_history:
            return pn.WidgetBox(pn.Row(pn.pane.Str("No History Yet")), width=600, scroll=True)
        rlist = [pn.Row(pn.pane.Markdown(f"Current Chat History variable", styles={'background-color': '#F6F6F6'}))]
        for exchange in self.chat_history:
            rlist.append(pn.Row(pn.pane.Str(exchange)))
        return pn.WidgetBox(*rlist, width=600, scroll=True)

    def clr_history(self, count=0):
        self.chat_history = []
        return
Create a chatbot
cb = cbfs()
file_input = pn.widgets.FileInput(accept='.pdf')
button_load = pn.widgets.Button(name="Load DB", button_type='primary')
button_clearhistory = pn.widgets.Button(name="Clear History", button_type='warning')
inp = pn.widgets.TextInput(placeholder='Enter text here…')
bound_button_load = pn.bind(cb.call_load_db, button_load.param.clicks)
conversation = pn.bind(cb.convchain, inp)
jpg_pane = pn.pane.Image('./img/convchain.jpg')
tab1 = pn.Column(
    pn.Row(inp),
    pn.layout.Divider(),
    pn.panel(conversation, loading_indicator=True, height=300),
    pn.layout.Divider(),
)
tab2 = pn.Column(
    pn.panel(cb.get_lquest),
    pn.layout.Divider(),
    pn.panel(cb.get_sources),
)
tab3 = pn.Column(
    pn.panel(cb.get_chats),
    pn.layout.Divider(),
)
tab4 = pn.Column(
    pn.Row(file_input, button_load, bound_button_load),
    pn.Row(button_clearhistory, pn.pane.Markdown("Clears chat history. Can use to start a new topic")),
    pn.layout.Divider(),
    pn.Row(jpg_pane.clone(width=400))
)
dashboard = pn.Column(
    pn.Row(pn.pane.Markdown('# ChatWithYourData_Bot')),
    pn.Tabs(('Conversation', tab1), ('Database', tab2), ('Chat History', tab3), ('Configure', tab4))
)