Building Multimodal Search and RAG with Weaviate

Building Multimodal Search and RAG

This article is my study notes for the short course https://www.deeplearning.ai/short-courses/building-multimodal-search-and-rag/.

(image)

What you’ll learn in this course

Learn how to build multimodal search and RAG systems. RAG systems enhance an LLM by incorporating proprietary data into the prompt context. Typically, RAG applications use text documents, but, what if the desired context includes multimedia like images, audio, and video? This course covers the technical aspects of implementing RAG with multimodal data to accomplish this.

  • Learn how multimodal models are trained through contrastive learning and implement it on a real dataset.
  • Build any-to-any multimodal search to retrieve relevant context across different data types.
  • Learn how LLMs are trained to understand multimodal data through visual instruction tuning and use them on multiple image reasoning examples
  • Implement an end-to-end multimodal RAG system that analyzes retrieved multimodal context to generate insightful answers
  • Explore industry applications like visually analyzing invoices and flowcharts to output structured data.
  • Create a multi-vector recommender system that suggests relevant items by comparing their similarities across multiple modalities.

As AI systems increasingly need to process and reason over multiple data modalities, learning how to build such systems is an important skill for AI developers.

This course equips you with the key skills to embed, retrieve, and generate across different modalities. By gaining a strong foundation in multimodal AI, you’ll be prepared to build smarter search, RAG, and recommender systems.

Table of Contents

  • Building Multimodal Search and RAG
    • What you’ll learn in this course
  • L1: Overview of Multimodality
    • Introduction to multimodality
    • Code
    • Import libraries
    • Load MNIST Dataset
    • Setup our DataLoaders
      • Visualize datapoints
    • Build Neural Network Architecture
    • Contrastive Loss Function
    • Training Loop
      • Model Training
      • Load from Backup
      • Get the Model
      • Visualize the loss curve for your trained model
    • Visualize the Vector Space!
      • Generate 64d Representations of the Training Set
      • Reduce Dimensionality of Data: 64d -> 3d
      • Interactive Scatter Plot in 3d – with PCA
      • Scatterplot in 2d - with UMAP
      • UMAP with Euclidean Metric
    • Contrastive Training over 100 Epochs!
  • L2: Multimodal Search
    • Setup
      • Load environment variables and API keys
    • Connect to Weaviate
    • Create the Collection
    • Helper functions
    • Insert Images into Weaviate
    • Insert Video Files into Weaviate
    • Check count
    • Build MultiModal Search
      • Helper Functions
    • Text to Media Search
    • Image to Media Search
    • Image search - from web URL
    • Video to Media Search
    • Visualizing a Multimodal Vector Space
    • Load vector embeddings and mediaType from Weaviate
    • Plot the embeddings
  • L3: Large Multimodal Models (LMMs)
    • Code
    • Setup
      • Load environment variables and API keys
    • Helper functions
    • Analyze images with an LMM
    • Analyze a harder image
    • Decode the hidden message
    • How the model sees the picture!
  • L4: Multimodal Retrieval Augmented Generation (MM-RAG)
    • Setup
      • Load environment variables and API keys
      • Connect to Weaviate
      • Restore 13k+ prevectorized resources
      • Preview data count
    • Multimodal RAG
      • Step 1 – Retrieve content from the database with a query
      • Run image retrieval
      • Step 2 - Generate a description of the image
      • Run vision request
    • All together
  • L5: Industry Applications
    • Code
    • Vision Function
    • Extracting Structured Data from Retrieved Images
      • Analyzing an invoice
      • Extracting Tables from Images
      • Analyzing Flow Charts
  • L6: Multimodal Recommender System
    • Setup
      • Load environment variables and API keys
    • Connect to Weaviate
    • Create Multivector collection
    • Load in data
    • Helper function
    • Import text and image data
    • Text-search through the text vector space
    • Text-search through the posters vector space
    • Image-search through the posters vector space
  • Afterword

L1: Overview of Multimodality

Introduction to multimodality

(image)

Multimodal embedding

(image)
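The key idea behind multimodal embedding is that every modality is mapped into one shared vector space, so an image and a piece of text can be compared directly with a similarity measure. The sketch below is illustrative only (not course code): the random vectors stand in for real encoder outputs, and the 1408 dimensions simply mirror the multimodal embedding model used later in the Weaviate lessons.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Embeddings that live in the same vector space can be compared directly
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-ins for an image embedding and a text embedding;
# in practice these come from the image and text encoders of a multimodal model.
image_vec = np.random.rand(1408)
text_vec = np.random.rand(1408)

print(cosine_similarity(image_vec, text_vec))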

Training a multimodal model

(image)

The overall pipeline

(image)

Contrastive learning

(image)

The contrastive learning loss function

(image)

Understanding the Contrastive loss function

(image)
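In the notebook below, the contrastive objective is implemented as a mean-squared error between the cosine similarity of an anchor/contrastive pair and an ideal distance (1 for a positive pair, 0 for a negative pair). A rough LaTeX sketch of that loss, matching the ContrastiveLoss class defined later (f is the encoder, a_i the anchor, c_i the contrastive sample):

\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\Big(\cos\big(f(a_i), f(c_i)\big) - d_i\Big)^2,
\qquad d_i = \begin{cases} 1 & \text{positive pair} \\ 0 & \text{negative pair} \end{cases}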

Code

  • In this classroom, the libraries have been already installed for you.
  • If you would like to run this code on your own machine, you need to install the following:
    !pip install -q accelerate torch
    !pip install -U scikit-learn
    !pip install umap-learn
    !pip install tqdm

Import libraries

# Import neural network training libraries
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import transforms

# Import basic computation libraries along with data visualization and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
from sklearn.decomposition import PCA
import umap
import umap.plot
import plotly.graph_objs as go
import plotly.io as pio
pio.renderers.default = 'iframe'

# Import our data class which will organize MNIST and provide anchor, positive and negative samples.
from mnist_dataset import MNISTDataset

Load MNIST Dataset

# Load data from csv
data = pd.read_csv('digit-recognizer/train.csv')
val_count = 1000
# common transformation for both val and train
default_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.ToTensor(),
    transforms.Normalize(0.5, 0.5)
])

# Split data into val and train
dataset = MNISTDataset(data.iloc[:-val_count], default_transform)
val_dataset = MNISTDataset(data.iloc[-val_count:], default_transform)

Setup our DataLoaders

# Create torch dataloaders
trainLoader = DataLoader(
    dataset,
    batch_size=16,  # feel free to modify this value
    shuffle=True,
    pin_memory=True,
    num_workers=2,
    prefetch_factor=100
)

valLoader = DataLoader(
    val_dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=True,
    num_workers=2,
    prefetch_factor=100
)

Visualize datapoints

# Function to display images with labels
def show_images(images, title=''):num_images = len(images)fig, axes = plt.subplots(1, num_images, figsize=(9, 3))for i in range(num_images):img = np.squeeze(images[i])axes[i].imshow(img, cmap='gray')axes[i].axis('off')fig.suptitle(title)plt.show()# Visualize some examples
for batch_idx, (anchor_images, contrastive_images, distances, labels) in enumerate(trainLoader):# Convert tensors to numpy arraysanchor_images = anchor_images.numpy()contrastive_images = contrastive_images.numpy()labels = labels.numpy()# Display some samples from the batchshow_images(anchor_images[:4], title='Anchor Image')show_images(contrastive_images[:4], title='+/- Example')# Break after displaying one batch for demonstrationbreak

Output

(image)

Build Neural Network Architecture

# Define a neural network architecture with two convolution layers and two fully connected layers
# Input to the network is an MNIST image and Output is a 64 dimensional representation. 
class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(1, 32, 5),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d((2, 2), stride=2),
            nn.Dropout(0.3)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(32, 64, 5),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d((2, 2), stride=2),
            nn.Dropout(0.3)
        )
        self.linear1 = nn.Sequential(
            nn.Linear(64 * 4 * 4, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, 64),
        )

    def forward(self, x):
        x = self.conv1(x)          # x: d * 32 * 12 * 12
        x = self.conv2(x)          # x: d * 64 * 4 * 4
        x = x.view(x.size(0), -1)  # x: d * (64*4*4)
        x = self.linear1(x)        # x: d * 64
        return x

Contrastive Loss Function

# The ideal distance metric for a positive sample is set to 1, for a negative sample it is set to 0      
class ContrastiveLoss(nn.Module):
    def __init__(self):
        super(ContrastiveLoss, self).__init__()
        self.similarity = nn.CosineSimilarity(dim=-1, eps=1e-7)

    def forward(self, anchor, contrastive, distance):
        # use cosine similarity from torch to get score
        score = self.similarity(anchor, contrastive)
        # after cosine apply MSE between distance and score
        return nn.MSELoss()(score, distance)  # Ensures that the calculated score is close to the ideal distance (1 or 0)

Define the Training Configuration

net = Network()

device = 'cpu'
if torch.cuda.is_available():
    device = torch.device('cuda:0')
net = net.to(device)
device
# Define the training configuration
optimizer = optim.Adam(net.parameters(), lr=0.005)
loss_function = ContrastiveLoss()
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.3)

Training Loop

import os

# Define a directory to save the checkpoints
checkpoint_dir = 'checkpoints/'

# Ensure the directory exists
if not os.path.exists(checkpoint_dir):
    os.makedirs(checkpoint_dir)

Model Training

def train_model(epoch_count=10):
    # net = Network()
    lrs = []
    losses = []

    for epoch in range(epoch_count):
        epoch_loss = 0
        batches = 0
        print('epoch -', epoch)
        lrs.append(optimizer.param_groups[0]['lr'])
        print('learning rate', lrs[-1])

        for anchor, contrastive, distance, label in tqdm(trainLoader):
            batches += 1
            optimizer.zero_grad()
            anchor_out = net(anchor.to(device))
            contrastive_out = net(contrastive.to(device))
            distance = distance.to(torch.float32).to(device)
            loss = loss_function(anchor_out, contrastive_out, distance)
            epoch_loss += loss
            loss.backward()
            optimizer.step()

        losses.append(epoch_loss.cpu().detach().numpy() / batches)
        scheduler.step()
        print('epoch_loss', losses[-1])

        # Save a checkpoint of the model
        checkpoint_path = os.path.join(checkpoint_dir, f'model_epoch_{epoch}.pt')
        torch.save(net.state_dict(), checkpoint_path)

    return {"net": net, "losses": losses}

Load from Backup

def load_model_from_checkpoint():
    checkpoint = torch.load('checkpoints/model_epoch_99.pt')
    net = Network()
    net.load_state_dict(checkpoint)
    net.eval()
    return net

Get the Model

(Note: train = False): We've saved the trained model and are loading it here for speedier results, allowing you to observe the outcomes faster. Once you've done an initial run, you may set train to True to train the model yourself. This can take some time to finish, depending on the value you set for epoch_count.

train = False  # set to True to train the model

if train:
    training_result = train_model()
    model = training_result["net"]
else:
    model = load_model_from_checkpoint()

Visualize the loss curve for your trained model

from IPython.display import Image

if train:
    # show loss curve from your training.
    plt.plot(training_result["losses"])
    plt.show()
else:
    # If you are loading a checkpoint instead of training the model (train = False),
    # the following line will show a pre-saved loss curve from the checkpoint data.
    display(Image(filename="images/loss-curve.png", height=600, width=600))

Output

(image)

Visualize the Vector Space!

Generate 64d Representations of the Training Set

encoded_data = []
labels = []

with torch.no_grad():
    for anchor, _, _, label in tqdm(trainLoader):
        output = model(anchor.to(device))
        encoded_data.extend(output.cpu().numpy())
        labels.extend(label.cpu().numpy())

encoded_data = np.array(encoded_data)
labels = np.array(labels)

Reduce Dimensionality of Data: 64d -> 3d

# Apply PCA to reduce dimensionality of data from 64d -> 3d to make it easier to visualize!
pca = PCA(n_components=3)
encoded_data_3d = pca.fit_transform(encoded_data)

Interactive Scatter Plot in 3d – with PCA

scatter = go.Scatter3d(
    x=encoded_data_3d[:, 0],
    y=encoded_data_3d[:, 1],
    z=encoded_data_3d[:, 2],
    mode='markers',
    marker=dict(size=4, color=labels, colorscale='Viridis', opacity=0.8),
    text=labels,
    hoverinfo='text',
)

# Create layout
layout = go.Layout(
    title="MNIST Dataset - Encoded and PCA Reduced 3D Scatter Plot",
    scene=dict(
        xaxis=dict(title="PC1"),
        yaxis=dict(title="PC2"),
        zaxis=dict(title="PC3"),
    ),
    width=1000,
    height=750,
)

# Create figure and add scatter plot
fig = go.Figure(data=[scatter], layout=layout)

# Show the plot
fig.show()

Output

(image)

Scatterplot in 2d - with UMAP

mapper = umap.UMAP(random_state=42, metric='cosine').fit(encoded_data)
umap.plot.points(mapper, labels=labels);

Output

(image)

UMAP with Euclidean Metric

mapper = umap.UMAP(random_state=42).fit(encoded_data) 
umap.plot.points(mapper, labels=labels);

Output

(image)

Contrastive Training over 100 Epochs!
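This heading refers to the 100-epoch run that produced the model_epoch_99.pt checkpoint loaded above. A minimal sketch of reproducing it with the functions already defined in this lesson (expect it to be slow without a GPU):

training_result = train_model(epoch_count=100)  # reuses train_model() from the Training Loop section
model = training_result["net"]

# Plot the resulting loss curve
plt.plot(training_result["losses"])
plt.show()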

L2: Multimodal Search

(image)

  • In this classroom, the libraries have been already installed for you.
  • If you would like to run this code on your own machine, you need to install the following:
    !pip install -U weaviate-client

Setup

Load environment variables and API keys

import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
EMBEDDING_API_KEY = os.getenv("EMBEDDING_API_KEY")

Connect to Weaviate

import weaviate, os

client = weaviate.connect_to_embedded(
    version="1.24.4",
    environment_variables={
        "ENABLE_MODULES": "backup-filesystem,multi2vec-palm",
        "BACKUP_FILESYSTEM_PATH": "/home/jovyan/work/backups",
    },
    headers={
        "X-PALM-Api-Key": EMBEDDING_API_KEY,
    }
)

client.is_ready()

Create the Collection

from weaviate.classes.config import Configure

# Just checking if you ever need to re-run it
if client.collections.exists("Animals"):
    client.collections.delete("Animals")

client.collections.create(
    name="Animals",
    vectorizer_config=Configure.Vectorizer.multi2vec_palm(
        image_fields=["image"],
        video_fields=["video"],
        project_id="semi-random-dev",
        location="us-central1",
        model_id="multimodalembedding@001",
        dimensions=1408,
    )
)

Helper functions

import base64

# Helper function to convert a file to base64 representation
def toBase64(path):
    with open(path, 'rb') as file:
        return base64.b64encode(file.read()).decode('utf-8')

Insert Images into Weaviate

animals = client.collections.get("Animals")

source = os.listdir("./source/animal_image/")

with animals.batch.rate_limit(requests_per_minute=100) as batch:
    for name in source:
        print(f"Adding {name}")

        path = "./source/image/" + name

        batch.add_object({
            "name": name,            # name of the file
            "path": path,            # path to the file to display result
            "image": toBase64(path), # this gets vectorized - "image" was configured in vectorizer_config as the property holding images
            "mediaType": "image",    # a label telling us how to display the resource
        })

Output

Adding cat1.jpg
Adding dog3.jpg
Adding dog1.jpg
Adding cat3.jpg
Adding meerkat2.jpg
Adding cat2.jpg
Adding meerkat1.jpg
Adding dog2.jpg
Adding meerkat3.jpg
# Check for failed objects
if len(animals.batch.failed_objects) > 0:
    print(f"Failed to import {len(animals.batch.failed_objects)} objects")
    for failed in animals.batch.failed_objects:
        print(f"e.g. Failed to import object with error: {failed.message}")
else:
    print("No errors")

Output

No errors

Insert Video Files into Weaviate

Note: the input video must be at least 4 seconds long.
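If you import your own clips, it can help to check their length before inserting them. The sketch below is optional and not part of the course code; it assumes opencv-python is installed, and the file path is just an example.

import cv2  # assumes: pip install opencv-python

def video_duration_seconds(path: str) -> float:
    # Duration is approximately frame count divided by frames per second
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 1.0
    frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    cap.release()
    return frames / fps

print(video_duration_seconds("./source/video/cat-play.mp4") >= 4.0)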

animals = client.collections.get("Animals")

source = os.listdir("./source/video/")

for name in source:
    print(f"Adding {name}")

    path = "./source/video/" + name

    # insert videos one by one
    animals.data.insert({
        "name": name,
        "path": path,
        "video": toBase64(path),
        "mediaType": "video"
    })

Output

Adding meerkat-watch.mp4
Adding cat-play.mp4
Adding meerkat-dig.mp4
Adding dog-high-five.mp4
Adding dog-with-stick.mp4
Adding cat-clean.mp4
# Check for failed objects
if len(animals.batch.failed_objects) > 0:
    print(f"Failed to import {len(animals.batch.failed_objects)} objects")
    for failed in animals.batch.failed_objects:
        print(f"e.g. Failed to import object with error: {failed.message}")
else:
    print("No errors")

Output

No errors

Check count

Total count should be 15 (9x image + 6x video)

agg = animals.aggregate.over_all(
    group_by="mediaType"
)

for group in agg.groups:
    print(group)

Output

AggregateGroup(grouped_by=GroupedBy(prop='mediaType', value='image'), properties={}, total_count=9)
AggregateGroup(grouped_by=GroupedBy(prop='mediaType', value='video'), properties={}, total_count=6)

Build MultiModal Search

Helper Functions

# Helper functions to display results
import json
from IPython.display import Image, Video

def json_print(data):
    print(json.dumps(data, indent=2))

def display_media(item):
    path = item["path"]

    if item["mediaType"] == "image":
        display(Image(path, width=300))
    elif item["mediaType"] == "video":
        display(Video(path, width=300))

import base64, requests

# Helper function – get base64 representation from an online image
def url_to_base64(url):
    image_response = requests.get(url)
    content = image_response.content
    return base64.b64encode(content).decode('utf-8')

# Helper function - get base64 representation from a local file
def file_to_base64(path):
    with open(path, 'rb') as file:
        return base64.b64encode(file.read()).decode('utf-8')

Text to Media Search

animals = client.collections.get("Animals")

response = animals.query.near_text(
    query="dog playing with stick",
    return_properties=['name', 'path', 'mediaType'],
    limit=3
)

for obj in response.objects:
    json_print(obj.properties)
    display_media(obj.properties)

Output

(image)

Image to Media Search

# Use this image as an input for the query
Image("./test/test-cat.jpg", width=300)

Output

(image)

# The query
response = animals.query.near_image(
    near_image=file_to_base64("./test/test-cat.jpg"),
    return_properties=['name', 'path', 'mediaType'],
    limit=3
)

for obj in response.objects:
    json_print(obj.properties)
    display_media(obj.properties)

Output

(image)

Image search - from web URL

Image("https://raw.githubusercontent.com/weaviate-tutorials/multimodal-workshop/main/2-multimodal/test/test-meerkat.jpg", width=300)

Output

(image)

# The query
response = animals.query.near_image(
    near_image=url_to_base64("https://raw.githubusercontent.com/weaviate-tutorials/multimodal-workshop/main/2-multimodal/test/test-meerkat.jpg"),
    return_properties=['name', 'path', 'mediaType'],
    limit=3
)

for obj in response.objects:
    json_print(obj.properties)
    display_media(obj.properties)

Output

(image)

Video to Media Search

Note: the input video must be at least 4 seconds long.

Video("./test/test-meerkat.mp4", width=400)

Output

(image)

from weaviate.classes.query import NearMediaType

response = animals.query.near_media(
    media=file_to_base64("./test/test-meerkat.mp4"),
    media_type=NearMediaType.VIDEO,
    return_properties=['name', 'path', 'mediaType'],
    limit=3
)

for obj in response.objects:
    # json_print(obj.properties)
    display_media(obj.properties)

Output

(image)

Visualizing a Multimodal Vector Space

To make this more exciting, let's load up a large dataset!

import numpy as np
import sklearn.datasets
import pandas as pd
import umap
import umap.plot
import matplotlib.pyplot as plt

Load vector embeddings and mediaType from Weaviate

client.backup.restore(
    backup_id="resources-img-and-vid",
    include_collections="Resources",
    backend="filesystem"
)

# It can take a few seconds for the "Resources" collection to be ready.
# We add 5 seconds of sleep to make sure it is ready for the next cells to use.
import time
time.sleep(5)

# Collection named "Resources"
collection = client.collections.get("Resources")

embs = []
labs = []
for item in collection.iterator(include_vector=True):
    # print(item.properties)
    labs.append(item.properties['mediaType'])
    embs.append(item.vector)

embs2 = [emb['default'] for emb in embs]
emb_df = pd.DataFrame(embs2)

labels = pd.Series(labs)
labels[labels == 'image'] = 0
labels[labels == 'video'] = 1
%%time
mapper2 = umap.UMAP().fit(emb_df)

Output

CPU times: user 8min 4s, sys: 14.2 s, total: 8min 18s
Wall time: 3min 1s

Plot the embeddings

plt.figure(figsize=(10, 8))
umap.plot.points(mapper2, labels=labels, theme='fire')

# Show plot
plt.title('UMAP Visualization of Embedding Space')
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')
plt.show();

Output

(image)

L3: Large Multimodal Models (LMMs)

Language model

(image)

Understanding GPT

(image)

Vision transformers

(image)

(image)

Vision Instruction Tuning

(image)

Code

  • In this classroom, the libraries have been already installed for you.
  • If you would like to run this code on your own machine, you need to install the following:
    !pip install google-generativeai

Note: don't forget to set your GOOGLE_API_KEY in the .env file to use the Gemini Vision model.

   %env GOOGLE_API_KEY=************

Check the documentation for more information.
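For a local run, the values typically live in a .env file next to the notebook. A hypothetical layout is shown below; the key is a placeholder, and GOOGLE_API_BASE is only needed if you route requests through a custom endpoint as the classroom setup does.

# .env (hypothetical values, replace with your own)
GOOGLE_API_KEY=your-google-api-key-here
# Optional: only set this if you use a custom or proxy endpoint
GOOGLE_API_BASE=https://your-proxy-endpoint.example.com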

Setup

Load environment variables and API keys

import os
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())  # read local .env file
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')

# Set the genai library
import google.generativeai as genai
from google.api_core.client_options import ClientOptions

genai.configure(
    api_key=GOOGLE_API_KEY,
    transport="rest",
    client_options=ClientOptions(
        api_endpoint=os.getenv("GOOGLE_API_BASE"),
    ),
)

Note: learn more about GOOGLE_API_KEY to run it locally.

Helper functions

import textwrap
import PIL.Image
from IPython.display import Markdown, Image

def to_markdown(text):
    text = text.replace('•', '  *')
    return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))
  • Function to call LMM (Large Multimodal Model).
def call_LMM(image_path: str, prompt: str) -> str:
    # Load the image
    img = PIL.Image.open(image_path)

    # Call generative model
    model = genai.GenerativeModel('gemini-pro-vision')
    response = model.generate_content([prompt, img], stream=True)
    response.resolve()
    return to_markdown(response.text)

Analyze images with an LMM

# Pass in an image and see if the LMM can answer questions about it
Image(url= "SP-500-Index-Historical-Chart.jpg")

Output

(image)

# Use the LMM function
call_LMM("SP-500-Index-Historical-Chart.jpg", "Explain what you see in this image.")

Output

The image shows the historical chart of the S&P 500 index. The S&P 500 is a stock market index that tracks the 500 largest publicly traded companies in the United States. The index is considered to be a leading indicator of the overall U.S. stock market.The chart shows that the S&P 500 index has been on a long-term upward trend since its inception in 1926. However, the index has experienced several periods of volatility, including the Great Depression in the 1930s, the oil crisis in the 1970s, and the financial crisis in 2008.Despite these periods of volatility, the S&P 500 index has continued to climb over the long term. This is because the U.S. economy has continued to grow over time, and companies have generally been able to increase their earnings.The S&P 500 index is a valuable tool for investors who want to track the performance of the U.S. stock market. The index can also be used to compare the performance of different investment strategies.

Analyze a harder image

  • Try something harder: Here’s a figure we explained previously!
Image(url= "clip.png")

Output

(image)

call_LMM("clip.png", "Explain what this figure is and where is this used.")

Output

This figure shows a contrastive pre-training framework for image-text retrieval. Given a set of images and their corresponding texts, the text encoder encodes each text into a text embedding. Similarly, the image encoder encodes each image into an image embedding. To learn the relationship between images and texts, a contrastive loss is computed between the text embedding and the image embedding of each image-text pair. By minimizing the contrastive loss, the model learns to encode images and texts into embeddings that are semantically similar. The pre-trained model can then be used for image-text retrieval, where given an image, the model can retrieve the most relevant text descriptions.

Decode the hidden message

Image(url= "blankimage3.png")

Output

(image)

# Ask to find the hidden message
call_LMM("blankimage3.png", "Read what you see on this image.")

Output

You can vectorize the whole world with Weaviate!

How the model sees the picture!

You have to be careful! The model does not “see” in the same way that we see!

import imageio.v2 as imageio
import numpy as np
import matplotlib.pyplot as plt

image = imageio.imread("blankimage3.png")

# Convert the image to a NumPy array
image_array = np.array(image)

plt.imshow(np.where(image_array[:, :, 0] > 120, 0, 1), cmap='gray');

Output

(image)

EXTRA! You can use the function below to hide your own message in an image:

# Create a hidden text in an image
def create_image_with_text(text, font_size=20, font_family='sans-serif', text_color='#73D955', background_color='#7ED957'):
    fig, ax = plt.subplots(figsize=(5, 5))
    fig.patch.set_facecolor(background_color)
    ax.text(0.5, 0.5, text, fontsize=font_size, ha='center', va='center', color=text_color, fontfamily=font_family)
    ax.axis('off')
    plt.tight_layout()
    return fig

# Modify the text here to create a new hidden message image!
fig = create_image_with_text("Hello, world!")

# Plot the image with the hidden message
plt.show()
fig.savefig("extra_output_image.png")

Output

(image)

# Call the LMM function with the image just generated
call_LMM("extra_output_image.png", "Read what you see on this image.")

Output

Hello, world!

  • It worked! Now plot the image to decode the message.

image = imageio.imread("extra_output_image.png")

# Convert the image to a NumPy array
image_array = np.array(image)

plt.imshow(np.where(image_array[:, :, 0] > 120, 0, 1), cmap='gray');

Output

(image)

L4: Multimodal Retrieval Augmented Generation (MM-RAG)

RAG with Weaviate

(image)

Multimodal RAG

(image)

In this lesson you’ll learn how to leverage Weaviate and Google Gemini Pro Vision to carry out a simple multimodal RAG workflow.

  • In this classroom, the libraries have been already installed for you.
  • If you would like to run this code on your own machine, you need to install the following:
    !pip install -U weaviate-client
    !pip install google-generativeai

Setup

Load environment variables and API keys

import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read local .env file

EMBEDDING_API_KEY = os.getenv("EMBEDDING_API_KEY")
GOOGLE_API_KEY=os.getenv("GOOGLE_API_KEY")

Connect to Weaviate

import weaviate

client = weaviate.connect_to_embedded(
    version="1.24.4",
    environment_variables={
        "ENABLE_MODULES": "backup-filesystem,multi2vec-palm",
        "BACKUP_FILESYSTEM_PATH": "/home/jovyan/work/backups",
    },
    headers={
        "X-PALM-Api-Key": EMBEDDING_API_KEY,
    }
)

client.is_ready()

Restore 13k+ prevectorized resources

client.backup.restore(
    backup_id="resources-img-and-vid",
    include_collections="Resources",
    backend="filesystem"
)

# It can take a few seconds for the "Resources" collection to be ready.
# We add 5 seconds of sleep to make sure it is ready for the next cells to use.
import time
time.sleep(5)

Preview data count

from weaviate.classes.aggregate import GroupByAggregate

resources = client.collections.get("Resources")

response = resources.aggregate.over_all(
    group_by=GroupByAggregate(prop="mediaType")
)

# print the group names and the count for each
for group in response.groups:
    print(f"{group.grouped_by.value} count: {group.total_count}")

Output

image count: 13394
video count: 200

Multimodal RAG

Step 1 – Retrieve content from the database with a query

from IPython.display import Image
from weaviate.classes.query import Filter

def retrieve_image(query):
    resources = client.collections.get("Resources")
    # ============
    response = resources.query.near_text(
        query=query,
        filters=Filter.by_property("mediaType").equal("image"),  # only return image objects
        return_properties=["path"],
        limit=1,
    )
    # ============
    result = response.objects[0].properties
    return result["path"]  # Get the image path

Run image retrieval

# Try with different queries to retrieve an image
img_path = retrieve_image("fishing with my buddies")
display(Image(img_path))

Output

(image)

Step 2 - Generate a description of the image

import google.generativeai as genai
from google.api_core.client_options import ClientOptions# Set the Vision model key
genai.configure(api_key=GOOGLE_API_KEY,transport="rest",client_options=ClientOptions(api_endpoint=os.getenv("GOOGLE_API_BASE"),),
)
# Helper function
import textwrap
import PIL.Image
from IPython.display import Markdown, Imagedef to_markdown(text):text = text.replace("•", "  *")return Markdown(textwrap.indent(text, "> ", predicate=lambda _: True))def call_LMM(image_path: str, prompt: str) -> str:img = PIL.Image.open(image_path)model = genai.GenerativeModel("gemini-pro-vision")response = model.generate_content([prompt, img], stream=True)response.resolve()return to_markdown(response.text)    

Run vision request

call_LMM(img_path, "Please describe this image in detail.")

Output

The image shows a man kneeling on the grassy bank of a river. He is wearing a green hat and a khaki vest. He is holding a large fish in his hands. The fish is golden brown in color and has a long, pointed snout. The man is smiling and looking down at the fish. There is a dog standing next to the man. The dog is black and white and has a long, shaggy coat. The dog is looking up at the fish. In the background, there is a narrow river with a small weir.

All together

def mm_rag(query):
    # Step 1 - retrieve an image – Weaviate
    SOURCE_IMAGE = retrieve_image(query)
    display(Image(SOURCE_IMAGE))
    # ===========
    # Step 2 - generate a description – Gemini Pro Vision
    description = call_LMM(SOURCE_IMAGE, "Please describe this image in detail.")
    return description

# Call the mm_rag function
mm_rag("paragliding through the mountains")

Output

(image)

A paraglider is flying over a lush green mountain. The paraglider is red and white. The mountain is covered in trees. The sky is blue and there are some clouds in the distance.

L5: Industry Applications

(image)

(image)

Code

import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

import google.generativeai as genai
from google.api_core.client_options import ClientOptions

genai.configure(
    api_key=GOOGLE_API_KEY,
    transport="rest",
    client_options=ClientOptions(
        api_endpoint=os.getenv("GOOGLE_API_BASE"),
    )
)

Vision Function

import textwrap
import PIL.Image
from IPython.display import Markdown, Image

def to_markdown(text):
    text = text.replace("•", "  *")
    return Markdown(textwrap.indent(text, "> ", predicate=lambda _: True))

def call_LMM(image_path: str, prompt: str, plain_text: bool = False) -> str:
    img = PIL.Image.open(image_path)
    model = genai.GenerativeModel("gemini-pro-vision")
    response = model.generate_content([prompt, img], stream=True)
    response.resolve()

    if plain_text:
        return response.text
    else:
        return to_markdown(response.text)

Extracting Structured Data from Retrieved Images

Analyzing an invoice

from IPython.display import Image

Image(url="invoice.png")

Output

(image)

call_LMM("invoice.png",
    """Identify items on the invoice. Make sure you output JSON with quantity, description, unit price and amount.""")

Output

{
  "items": [
    {
      "quantity": 1,
      "description": "Front and rear brake cables",
      "unit_price": 100.00,
      "amount": 100.00
    },
    {
      "quantity": 2,
      "description": "New set of pedal arms",
      "unit_price": 15.00,
      "amount": 30.00
    },
    {
      "quantity": 3,
      "description": "Labor 3hrs",
      "unit_price": 5.00,
      "amount": 15.00
    }
  ],
  "subtotal": 145.00,
  "sales_tax": 9.06,
  "total": 154.06
}
# Ask something else
call_LMM("invoice.png",
    """How much would four sets of pedal arms cost
    and 6 hours of labour?""",
    plain_text=True
)

Output

' A set of pedal arms costs $15, so four sets would cost $60. Six hours of labor at $5 per hour would cost $30. So the total cost would be $60 + $30 = $90.'

Extracting Tables from Images

Image("prosus_table.png")

Output

(image)

call_LMM("prosus_table.png", "Print the contents of the image as a markdown table.")

Output

| Business | YoY Revenue Growth | TP Margin Improvement | YoY GMV Growth |
| --- | --- | --- | --- |
| Food Delivery | 17% | 12 pp | 15% |
| Classifieds | 32% | 12 pp | 34% |
| Payments & Fintech | 32% | 15 pp | 20% |
| Edtech | 11% | 15 pp | 15% |
| Etail | 4% | 2 pp | 5% |
call_LMM("prosus_table.png",
    """Analyse the contents of the image as a markdown table.
    Which of the business units has the highest revenue growth?""")

Output

| Business Unit | YoY Revenue Growth | TP Margin Improvement | YoY GMV Growth |
| --- | --- | --- | --- |
| Food Delivery | 17% | 12 pp | 15% |
| Classifieds | 32% | 12 pp | 34% |
| Payments & Fintech | 32% | 15 pp | 20% |
| Edtech | 11% | 15 pp | 15% |
| Etail | 4% | 2 pp | 5% |

The business unit with the highest revenue growth is Classifieds, with 32% YoY growth.

Analyzing Flow Charts

Image("swimlane-diagram-01.png")

Output

(image)

call_LMM("swimlane-diagram-01.png",
    """Provide a summarized breakdown of the flow chart in the image
    in a format of a numbered list.""")

Output

  1. The client places an order.
  2. The online shop sends an invoice.
  3. The client makes the payment.
  4. The online shop ships the order.
  5. The courier company transports the order.
  6. The client receives the order.
  7. If the client is not satisfied with the order, they can return it for a refund.
call_LMM("swimlane-diagram-01.png",
    """Analyse the flow chart in the image,
    then output Python code
    that implements this logical flow in one function""")

Output

def order_fulfillment(client_order):
    # Step 1: Place an order
    order = Order(client_order)

    # Step 2: Payment
    if order.payment():
        # Step 3: Invoice the order
        order.invoice()

        # Step 4: Ship the order
        if order.ship():
            # Step 5: Transport the order
            order.transport()

            # Step 6: Deliver the order to the customer
            order.deliver()
        else:
            # Step 7: Handle shipping exceptions
            order.handle_shipping_exceptions()
    else:
        # Step 8: Handle payment exceptions
        order.handle_payment_exceptions()
  • Test the code generated above.

Note: please be advised that the output may include errors or the functionality may not be fully operational, as it requires additional inputs to function properly.

def order_fulfillment(client, online_shop, courier_company):
    # This function takes three objects as input:
    # - client: the client who placed the order
    # - online_shop: the online shop that received the order
    # - courier_company: the courier company that will deliver the order

    # First, the client places an order.
    order = client.place_order()

    # Then, the client makes a payment for the order.
    payment = client.make_payment(order)

    # If the payment is successful, the order is shipped.
    if payment.status == "successful":
        online_shop.ship_order(order)
        courier_company.transport_order(order)
    else:
        # If the payment is not successful, the order is canceled.
        online_shop.cancel_order(order)
        client.refund_order(order)

    # Finally, the order is invoiced.
    online_shop.invoice_order(order)
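To actually execute the generated function, you need client, shop, and courier objects that expose the methods it calls. The stub classes below are hypothetical (not from the course) and exist only to satisfy that interface for a quick smoke test.

# Hypothetical stubs, just enough to exercise the generated order_fulfillment()
class Payment:
    status = "successful"

class Order:
    pass

class Client:
    def place_order(self): return Order()
    def make_payment(self, order): return Payment()
    def refund_order(self, order): print("refund issued")

class OnlineShop:
    def ship_order(self, order): print("order shipped")
    def cancel_order(self, order): print("order cancelled")
    def invoice_order(self, order): print("order invoiced")

class CourierCompany:
    def transport_order(self, order): print("order transported")

order_fulfillment(Client(), OnlineShop(), CourierCompany())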

L6: Multimodal Recommender System

Recommendation

(image)

Embeddings across modalities

(image)

  • In this classroom, the libraries have been already installed for you.
  • If you would like to run this code on your own machine, you need to install the following:
    !pip install -U weaviate-client
    !pip install google-generativeai
    !pip install openai

Setup

Load environment variables and API keys

import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read local .env file

MM_EMBEDDING_API_KEY = os.getenv("EMBEDDING_API_KEY")
TEXT_EMBEDDING_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_BASEURL = os.getenv("OPENAI_BASE_URL")

Connect to Weaviate

import weaviate

client = weaviate.connect_to_embedded(
    version="1.24.4",
    environment_variables={
        "ENABLE_MODULES": "multi2vec-palm,text2vec-openai"
    },
    headers={
        "X-PALM-Api-Key": MM_EMBEDDING_API_KEY,
        "X-OpenAI-Api-Key": TEXT_EMBEDDING_API_KEY,
        "X-OpenAI-BaseURL": OPENAI_BASEURL
    }
)

client.is_ready()

Create Multivector collection

from weaviate.classes.config import Configure, DataType, Property

# client.collections.delete("Movies")
client.collections.create(
    name="Movies",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="overview", data_type=DataType.TEXT),
        Property(name="vote_average", data_type=DataType.NUMBER),
        Property(name="release_year", data_type=DataType.INT),
        Property(name="tmdb_id", data_type=DataType.INT),
        Property(name="poster", data_type=DataType.BLOB),
        Property(name="poster_path", data_type=DataType.TEXT),
    ],

    # Define & configure the vector spaces
    vectorizer_config=[
        # Vectorize the movie title and overview – for text-based semantic search
        Configure.NamedVectors.text2vec_openai(
            name="txt_vector",                        # the name of the txt vector space
            source_properties=["title", "overview"],  # text properties to be used for vectorization
        ),

        # Vectorize the movie poster – for image-based semantic search
        Configure.NamedVectors.multi2vec_palm(
            name="poster_vector",       # the name of the image vector space
            image_fields=["poster"],    # use the poster property for multimodal vectorization
            project_id="semi-random-dev",
            location="us-central1",
            model_id="multimodalembedding@001",
            dimensions=1408,
        ),
    ]
)

Load in data

import pandas as pd
df = pd.read_json("movies_data.json")
df.head()

Output

(image)

Helper function

import base64

# Helper function to convert a file to base64 representation
def toBase64(path):
    with open(path, 'rb') as file:
        return base64.b64encode(file.read()).decode('utf-8')

Import text and image data

from weaviate.util import generate_uuid5

movies = client.collections.get("Movies")

with movies.batch.rate_limit(20) as batch:
    # for index, movie in df.sample(20).iterrows():
    for index, movie in df.iterrows():
        # In case you run it again - Don't import movies that are already in.
        if movies.data.exists(generate_uuid5(movie.id)):
            print(f'{index}: Skipping insert. The movie "{movie.title}" is already in the database.')
            continue

        print(f'{index}: Adding "{movie.title}"')

        # construct the path to the poster image file
        poster_path = f"./posters/{movie.id}_poster.jpg"
        # generate base64 representation of the poster
        posterb64 = toBase64(poster_path)

        # Build the object payload
        movie_obj = {
            "title": movie.title,
            "overview": movie.overview,
            "vote_average": movie.vote_average,
            "tmdb_id": movie.id,
            "poster_path": poster_path,
            "poster": posterb64
        }

        # Add object to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie.id),
        )

Output

0: Adding "Edward Scissorhands"
1: Adding "Beethoven"
2: Adding "The Nightmare Before Christmas"
3: Adding "Hocus Pocus"
4: Adding "Scream"
5: Adding "101 Dalmatians"
6: Adding "A Bug's Life"
7: Adding "Stuart Little"
8: Adding "Chicken Run"
9: Adding "Ice Age"
10: Adding "Lilo & Stitch"
11: Adding "Iron Man"
12: Adding "The Incredible Hulk"
13: Adding "Man of Steel"
14: Adding "Captain America: Civil War"
15: Adding "Batman v Superman: Dawn of Justice"
16: Adding "A Quiet Place"
17: Adding "Incredibles 2"
18: Adding "Shazam!"
19: Adding "Evil Dead Rise"
# Check for failed objects
if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")
    for failed in movies.batch.failed_objects:
        print(f"e.g. Failed to import object with error: {failed.message}")
else:
    print("Import complete with no errors")

Output

Import complete with no errors

Text-search through the text vector space

from IPython.display import Image

response = movies.query.near_text(
    query="Movie about lovable cute pets",
    target_vector="txt_vector",  # Search in the txt_vector space
    limit=3,
)

# Inspect the response
for item in response.objects:
    print(item.properties["title"])
    print(item.properties["overview"])
    display(Image(item.properties["poster_path"], width=200))

Output

(image)

# Perform query
response = movies.query.near_text(
    query="Epic super hero",
    target_vector="txt_vector",  # Search in the txt_vector space
    limit=3,
)

# Inspect the response
for item in response.objects:
    print(item.properties["title"])
    print(item.properties["overview"])
    display(Image(item.properties["poster_path"], width=200))

Output

(image)

Text-search through the posters vector space

# Perform query
response = movies.query.near_text(
    query="Movie about lovable cute pets",
    target_vector="poster_vector",  # Search in the poster_vector space
    limit=3,
)

# Inspect the response
for item in response.objects:
    print(item.properties["title"])
    print(item.properties["overview"])
    display(Image(item.properties["poster_path"], width=200))

Output

(image)

# Perform query
response = movies.query.near_text(
    query="Epic super hero",
    target_vector="poster_vector",  # Search in the poster_vector space
    limit=3,
)

# Inspect the response
for item in response.objects:
    print(item.properties["title"])
    print(item.properties["overview"])
    display(Image(item.properties["poster_path"], width=200))

Output

(image)

Image-search through the posters vector space

Image("test/spooky.jpg", width=300)

Output

(image)

# Perform query
response = movies.query.near_image(
    near_image=toBase64("test/spooky.jpg"),
    target_vector="poster_vector",  # Search in the poster_vector space
    limit=3,
)

# Inspect the response
for item in response.objects:
    print(item.properties["title"])
    display(Image(item.properties["poster_path"], width=200))

Output

(image)

Afterword

I finished this short course at 20:47 on June 2, 2024. It covered some multimodal-model topics, such as visual instruction tuning. Short courses like this quickly broaden my knowledge and also give me some understanding at the code-implementation level.
