Note: langchain version 0.0.352.
When using langchain's UnstructuredCSVLoader to read a CSV file that contains Chinese text:
from langchain.document_loaders import UnstructuredCSVLoader

file_path = "chinese.csv"
loader = UnstructuredCSVLoader(file_path=str(file_path))
docs = loader.load()
an encoding problem causes the following error:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xxx in position x: illegal multibyte sequence
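This class of error can be reproduced without langchain. The sketch below is only an illustration under stated assumptions: it assumes the CSV is saved as UTF-8 and that GBK is the platform default encoding (the original report states neither), and the helper name try_read and the sample content are hypothetical. It reads the same file once in text mode with GBK forced and once in binary mode.

```python
import os
import tempfile


def try_read(path, mode, encoding=None):
    """Return ("ok", data) or ("error", message) for a single read attempt."""
    try:
        with open(path, mode, encoding=encoding) as f:
            return "ok", f.read()
    except UnicodeDecodeError as exc:
        return "error", str(exc)


# Hypothetical sample: one Chinese character plus a newline, saved as UTF-8.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("中\n".encode("utf-8"))
    sample_path = tmp.name

# Text mode with GBK forced, standing in for a GBK platform default (assumption).
text_status, text_result = try_read(sample_path, "r", encoding="gbk")

# Binary mode returns the raw bytes and performs no decoding at all.
binary_status, binary_result = try_read(sample_path, "rb")

os.remove(sample_path)

print(text_status, text_result)
print(binary_status, binary_result)
```

The text-mode attempt fails during decoding, while the binary attempt only hands back bytes, so whatever consumes the file handle decides how to decode it.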
To fix it, modify the _get_elements function in the UnstructuredCSVLoader class as follows:
def _get_elements(self) -> List:
    from unstructured.partition.csv import partition_csv

    # ##### debug code #####
    # Fix for UnstructuredCSVLoader failing to load CSV files with Chinese text
    try:
        elements = partition_csv(filename=self.file_path, **self.unstructured_kwargs)
    except UnicodeDecodeError:
        # Retry with a binary file handle instead of a file name
        with open(self.file_path, "rb") as f:
            elements = partition_csv(file=f, **self.unstructured_kwargs)
    # ##### code end #####
    return elements
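With this change the file loads without the error.
The problem comes from an encoding issue in how langchain integrates the third-party library unstructured. The 'gbk' named in the error message is the default text encoding on Chinese-locale Windows, which suggests the file was decoded with the platform default rather than with its actual encoding.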
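Before patching library code, it can help to confirm which encoding the CSV actually uses. The following is a minimal sketch using only the standard library; the function name guess_encoding, the candidate list, and the sample content are assumptions made for illustration and are not part of langchain or unstructured.

```python
import os
import tempfile


def guess_encoding(path, candidates=("utf-8", "gbk")):
    """Return the first candidate that decodes the whole file, else None."""
    with open(path, "rb") as f:
        raw = f.read()
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


def write_sample(text, encoding):
    """Write text to a temporary CSV file in the given encoding; return its path."""
    with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
        tmp.write(text.encode(encoding))
        return tmp.name


# Hypothetical sample content saved in two different encodings.
gbk_path = write_sample("中文\n", "gbk")
utf8_path = write_sample("中文\n", "utf-8")

gbk_guess = guess_encoding(gbk_path)
utf8_guess = guess_encoding(utf8_path)

os.remove(gbk_path)
os.remove(utf8_path)

print(gbk_guess, utf8_guess)
```

UTF-8 is tried first because it is the stricter of the two: GBK accepts many byte sequences that are not meaningful text, so a successful GBK decode is weaker evidence. Once the encoding is known, one option is to re-save the file as UTF-8 before loading it.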