Using RAG to Handle Text, Tables, and Images: A Comprehensive Guide (with Code)

二师兄talks 2024-04-05


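In information retrieval, Retrieval-Augmented Generation (RAG) has become a powerful tool for extracting knowledge from large volumes of text. This general-purpose technique combines retrieval and generation to summarize and synthesize information from relevant documents. However, although RAG has gained considerable traction, its application to a broader range of content types (text, tables, and images) remains relatively unexplored.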

The Challenge of Multimodal Content

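Most real-world documents are information-rich and typically combine text, tables, and images to convey complex ideas and insights. Traditional RAG pipelines handle text well, but they struggle to integrate and understand multimodal content. This limitation keeps RAG from fully capturing what such documents say and can lead to incomplete or inaccurate answers.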

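In this article, we explore how to build a multimodal RAG pipeline that can handle these kinds of documents.

The following diagram serves as our guide for processing such documents.

Figure: Multimodal RAG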

Step 1: Split the file into raw elements.

First, let's import all the necessary libraries into our environment.

import os
import openai
import io
import uuid
import base64
import time
from base64 import b64decode
import numpy as np
from PIL import Image


from unstructured.partition.pdf import partition_pdf


from langchain.chat_models import ChatOpenAI
from langchain.schema.messages import HumanMessage, SystemMessage
from langchain.vectorstores import Chroma
from langchain.storage import InMemoryStore
from langchain.schema.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough, RunnableLambda


from operator import itemgetter

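We will use unstructured to parse the images, text, and tables in the document (a PDF). You can run this code directly in the Google Colab notebook linked here, or download the PDF file and upload it to the session storage, then follow the steps that come next.

Google Colab: https://colab.research.google.com/drive/1I9-JMGL76XXzUel7XTjO8pWK7jV-V8WD?usp=sharing
PDF file: https://sgp.fas.org/crs/misc/IF10244.pdf
unstructured: https://unstructured.io/

(Before running the code, see the installation instructions in the Google Colab notebook to set up your venv.)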

# Load the PDF file into the session storage first
# Split the file into text, table, and image elements
def doc_partition(path, file_name):
    raw_pdf_elements = partition_pdf(
        filename=path + file_name,
        extract_images_in_pdf=True,
        infer_table_structure=True,
        chunking_strategy="by_title",
        max_characters=4000,
        new_after_n_chars=3800,
        combine_text_under_n_chars=2000,
        image_output_dir_path=path,
    )
    return raw_pdf_elements


path = "/content/"
file_name = "wildfire_stats.pdf"
raw_pdf_elements = doc_partition(path, file_name)

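After running the code above, every image contained in the file is extracted and saved under your path (in our case, path = "/content/").

Next, we sort each raw element into its category: texts go into the text list and tables into the table list. The images have already been handled by unstructured.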

# Collect the texts and tables parsed from the PDF file
def data_category(raw_pdf_elements):
    tables = []
    texts = []
    for element in raw_pdf_elements:
        element_type = str(type(element))
        if "unstructured.documents.elements.Table" in element_type:
            tables.append(str(element))
        elif "unstructured.documents.elements.CompositeElement" in element_type:
            texts.append(str(element))
    return [texts, tables]


# Unpack both lists from a single call instead of categorizing twice
texts, tables = data_category(raw_pdf_elements)

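Step 2: Image captioning and table summarization

To summarize the tables, we use LangChain with GPT-4. To caption the images, we use GPT-4-Vision-Preview. We chose it because, at the time of writing, it was the only model we knew of that could process multiple images at once, which matters for documents that contain several images. The text elements are left unchanged before they are embedded.

Have your OpenAI API key ready.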

os.environ["OPENAI_API_KEY"] = 'sk-xxxxxxxxxxxxxxx'
openai.api_key = os.environ["OPENAI_API_KEY"]


# Take the tables as input and summarize each of them
def tables_summarize(tables):
    prompt_text = (
        "You are an assistant tasked with summarizing tables. "
        "Give a concise summary of the table. Table chunk: {element} "
    )

    prompt = ChatPromptTemplate.from_template(prompt_text)
    model = ChatOpenAI(temperature=0, model="gpt-4")
    summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()
    table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})

    return table_summaries


# Pass the list of tables (the original passed the data_category function by mistake)
table_summaries = tables_summarize(tables)
text_summaries = texts
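
The `batch` call with a `max_concurrency` of 5 summarizes several tables at once instead of one by one. As a rough analogy only (this is not a description of LangChain's internals), the same idea can be sketched with a standard-library thread pool. Here `summarize` is a hypothetical stand-in for one model call, and the sample tables are made-up placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize(table_text):
    """Hypothetical stand-in for one model call: return the header row."""
    return table_text.splitlines()[0]


# Made-up placeholder tables, not data from the PDF
tables_demo = ["Year | Fires\nY1 | 10", "Year | Acres\nY1 | 20"]

# At most 5 summaries run at the same time; map() keeps the input order
with ThreadPoolExecutor(max_workers=5) as pool:
    summaries = list(pool.map(summarize, tables_demo))

print(summaries)
```

Because the results come back in input order, each summary can still be paired with its table by index, which is what the indexing code later in this article relies on.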

For the images, we need to encode them as base64 before passing them to the model for captioning.

def encode_image(image_path):
    ''' Getting the base64 string '''
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')


def image_captioning(img_base64,prompt):
    ''' Image summary '''
    chat = ChatOpenAI(model="gpt-4-vision-preview",
                      max_tokens=1024)


    msg = chat.invoke(
        [
            HumanMessage(
                content=[
                    {"type": "text", "text":prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{img_base64}"
                        },
                    },
                ]
            )
        ]
    )
    return msg.content
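
To make the encoding step concrete, here is a minimal, standard-library-only sketch of the round trip: raw bytes are base64-encoded, wrapped in a `data:` URL of the same shape that `image_captioning` sends, and decoded back. The helper names `to_data_url` and `is_base64` and the sample bytes are illustrative assumptions, not part of the original notebook.

```python
import base64


def to_data_url(raw_bytes, mime="image/jpeg"):
    """Base64-encode raw bytes and wrap them in a data URL (hypothetical helper)."""
    b64 = base64.b64encode(raw_bytes).decode("utf-8")
    return f"data:{mime};base64,{b64}"


def is_base64(s):
    """Return True if s decodes under strict base64 validation."""
    try:
        base64.b64decode(s, validate=True)
        return True
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False


# Stand-in for the contents of a .jpg file
fake_jpeg = b"\xff\xd8\xff\xe0" + b"demo-bytes"

url = to_data_url(fake_jpeg)
payload = url.split(",", 1)[1]

# Strict decoding recovers the original bytes exactly
restored = base64.b64decode(payload, validate=True)

# Ordinary prose contains spaces and punctuation, so strict validation rejects it
looks_like_image = is_base64(payload)
looks_like_text = is_base64("Acres burned increased over time.")
```

Strict validation is what makes it possible to tell an encoded image from a plain text chunk when both are stored as strings.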

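Now we can fill our img_base64_list and summarize the images. After that, we define a helper that separates base64-encoded images from their associated text.

When you run the next code block, you may get a RateLimitError (error code 429). This error occurs when you exceed your organization's requests-per-minute (RPM) limit for gpt-4-vision-preview. In my case the limit is 3 RPM, so after each image caption I wait 60 seconds as a safety margin to let the rate limit reset.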

# Store base64-encoded images
img_base64_list = []

# Store image summaries
image_summaries = []

# Prompt: tailored to the kind of images we have, which are charts in our case
prompt = "Describe the image in detail. Be specific about graphs, such as bar plots."

# Read the images, encode them as base64 strings, and caption each one
for img_file in sorted(os.listdir(path)):
    if img_file.endswith('.jpg'):
        img_path = os.path.join(path, img_file)
        base64_image = encode_image(img_path)
        img_base64_list.append(base64_image)
        # Caption each image once and store the caption itself
        # (the original sent the caption back to the model as if it were an image)
        img_capt = image_captioning(base64_image, prompt)
        image_summaries.append(img_capt)
        # Wait for the rate limit (3 RPM in the author's case) to reset
        time.sleep(60)


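def split_image_text_types(docs):
    ''' Split base64-encoded images and texts '''
    b64 = []
    text = []
    for doc in docs:
        try:
            # validate=True rejects non-base64 characters (spaces, punctuation),
            # so ordinary text is not mistaken for an image
            b64decode(doc, validate=True)
            b64.append(doc)
        except (ValueError, TypeError):
            # binascii.Error is a subclass of ValueError
            text.append(doc)
    return {
        "images": b64,
        "texts": text
    }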
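
The fixed 60-second pause used in the captioning loop is simple, but it also waits when the limit has not been hit. A common alternative is to retry only on failure, with exponential backoff. The sketch below uses only the standard library: `RateLimited` and `flaky_caption` are hypothetical stand-ins for the rate-limit error and for `image_captioning`, and the delays are recorded rather than slept so the example runs instantly.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the API's rate-limit (HTTP 429) error."""


def call_with_backoff(func, *args, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); on RateLimited, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries):
        try:
            return func(*args)
        except RateLimited:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Demo: a fake captioner that is rate-limited on its first two calls
calls = {"n": 0}


def flaky_caption(image_b64):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimited("429: too many requests")
    return f"caption for {image_b64}"


waits = []  # record the delays instead of really sleeping
result = call_with_backoff(flaky_caption, "img-0", base_delay=1.0, sleep=waits.append)
```

In the real loop, the `except` clause would catch the client's `RateLimitError` instead of the stand-in class, and `base_delay` would be chosen to match the account's RPM limit.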

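Step 3: Create a multi-vector retriever and store the texts, tables, images, and their IDs in a vector store

We have finished the first part: partitioning the document into raw elements and summarizing the tables and images. We are now ready for the second part, in which we create a multi-vector retriever and store the outputs of the first part, together with their IDs, in chromadb.

We create two components:
1. A vectorstore that indexes the child chunks (summary_texts, summary_tables, summary_img), embedded with OpenAIEmbeddings().
2. A document store that holds the parent documents: (doc_ids, texts), (table_ids, tables), and (img_ids, img_base64_list).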

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="multi_modal_rag",
                     embedding_function=OpenAIEmbeddings())


# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"


# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)


# Add texts
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))


# Add tables
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    for i, s in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))


# Add image summaries
img_ids = [str(uuid.uuid4()) for _ in img_base64_list]
summary_img = [
    Document(page_content=s, metadata={id_key: img_ids[i]})
    for i, s in enumerate(image_summaries)
]
retriever.vectorstore.add_documents(summary_img)
retriever.docstore.mset(list(zip(img_ids, img_base64_list)))
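
Conceptually, the multi-vector retriever searches over the summaries but returns the parent documents they point to. The toy sketch below illustrates only that indirection, using the standard library: word overlap stands in for embedding similarity, two dicts stand in for Chroma and `InMemoryStore`, and all names and sample strings are illustrative assumptions rather than LangChain's implementation.

```python
import uuid

summary_index = {}  # doc_id -> summary (what gets searched)
docstore = {}       # doc_id -> parent content (what gets returned)


def add(summary, parent):
    """Register a summary and its parent under one shared ID."""
    doc_id = str(uuid.uuid4())
    summary_index[doc_id] = summary
    docstore[doc_id] = parent
    return doc_id


def overlap(a, b):
    """Toy similarity: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def retrieve(query, k=1):
    """Rank summaries against the query, then return the matching parents."""
    ranked = sorted(
        summary_index,
        key=lambda doc_id: overlap(query, summary_index[doc_id]),
        reverse=True,
    )
    return [docstore[doc_id] for doc_id in ranked[:k]]


add("table of wildfires and acres burned per year", "<raw table text>")
add("line chart of number of fires over time", "<base64 image string>")

hits = retrieve("acres burned per year")
```

The query matches the table's summary, yet what comes back is the raw table itself. The same mechanism lets an image's text caption be searchable while the base64 image is what reaches the model.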

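Step 4: Wrap everything above with LangChain's RunnableLambda

We first compute the context (in this case, "texts" and "images") and the question (here just a RunnablePassthrough).
Then we pass both into our prompt template, a custom function that formats the message for the gpt-4-vision-preview model.
Finally, we parse the output as a string.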

from operator import itemgetter
from langchain.schema.runnable import RunnablePassthrough, RunnableLambda




def prompt_func(data):
    # Join the retrieved text and table chunks with real newlines
    format_texts = "\n".join(data["context"]["texts"])
    content = [
        {"type": "text", "text": f"""Answer the question based only on the following context, which can include text, tables, and the below image:
Question: {data['question']}

Text and tables:
{format_texts}
"""},
    ]
    # Attach the first retrieved image only if the retriever returned one
    images = data["context"]["images"]
    if images:
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{images[0]}"}}
        )
    return [HumanMessage(content=content)]




model = ChatOpenAI(temperature=0, model="gpt-4-vision-preview", max_tokens=1024)




# RAG pipeline
chain = (
    {"context": retriever | RunnableLambda(split_image_text_types), "question": RunnablePassthrough()}
    | RunnableLambda(prompt_func)
    | model
    | StrOutputParser()
      )
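
`prompt_func` forwards only the first retrieved image. If several images are retrieved, one possible extension is to append one `image_url` part per image. The sketch below builds just the message content as plain dictionaries, in the same shape used in `image_captioning`, so it runs without LangChain. `build_content` is a hypothetical helper, and how many images the model accepts per request depends on the API's limits.

```python
def build_content(question, texts, images_b64):
    """Return a list of content parts: one text part, then one part per image."""
    joined = "\n".join(texts)
    parts = [{
        "type": "text",
        "text": (
            "Answer the question based only on the following context.\n"
            f"Question: {question}\n\nText and tables:\n{joined}"
        ),
    }]
    for b64 in images_b64:
        parts.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return parts


# Placeholder strings stand in for real chunks and real base64 images
content = build_content("What changed?", ["chunk A", "chunk B"], ["aaaa", "bbbb"])
text_only = build_content("What changed?", ["chunk A"], [])
```

The resulting list could be passed as `HumanMessage(content=...)` in place of the list built inside `prompt_func`.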

Now we are ready to test our multi-vector retrieval RAG.

chain.invoke(
    "What is the change in wild fires from 1993 to 2022?"
)

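The answer is as follows:

Based on the provided chart, the number of wildfires increased from 1993 to 2022. The chart shows a line graph of the number of fires in thousands, and the line appears to start at a lower point in 1993 and end at a higher point in 2022. The exact figure for 1993 is not given in the text and is not visible on the chart, but the visual trend indicates an increase.
Likewise, the acres burned, represented by the shaded area in the chart, also increased from 1993 to 2022. The shaded area starts lower in 1993 than where it ends in 2022, which indicates that more acres burned in 2022 than in 1993. Again, no specific figure is provided for 1993, but the visual trend on the chart indicates that the acres burned increased over this period.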

References:

Google Colab: https://colab.research.google.com/drive/1I9-JMGL76XXzUel7XTjO8pWK7jV-V8WD?usp=sharing
PDF file: https://sgp.fas.org/crs/misc/IF10244.pdf
unstructured: https://unstructured.io/
multi_vector: https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector
semi-structured-multi-modal-rag: https://blog.langchain.dev/semi-structured-multi-modal-rag/
