在不断发展的数据科学和机器学习世界中,有一个改变游戏规则的LangChain承诺让与你的数据交谈变得轻而易举——进入。这个动态工具不仅仅是另一个玩家;它是您的伙伴,使棘手的数据分析世界变得更简单。和我一起踏上这段旅程,开始我们的设置,对我们的数据说一声“Hello World”,解开LangChain的魔力,并进行一些故障排除。
准备环境
让我们从让我们的环境准备好使用LangChain开始。无论您是数据专家还是新手,都没有关系;让我们让这段旅程变得有趣。我将向您展示使用 Teradata 的 LangChain,但您可以使用您喜欢的数据库。
1. 安装 Langchain
pip install langchain复制
2.设置数据库(这里是 Teradata)
在此演示中,数据集包含三个主要实体:客户、产品和订单。Order 表专门存储与客户订购的产品相关的信息。此信息包括product_id、数量和总花费金额。
数据库schema
Hello World 示例:
让我们用一个经典的“Hello World”例子来打破僵局。打开 Jupyter 笔记本,导入 Langchain 库,然后让我们访问我们的数据:
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.agents.agent_types import AgentType
from langchain.llms.openai import OpenAI
from langchain.sql_database import SQLDatabase
import os复制
# Setting up the SQL Database Connection
db = SQLDatabase.from_uri("teradatasql://dbc:dbc@192.168.11.7:1025/Sales")
# Configuring the OpenAI Language Model
model = OpenAI(temperature=0, openai_api_key=os.getenv("OPENAI_API"))
# Creating SQL Database Toolkit
toolkit = SQLDatabaseToolkit(db=db, llm=model)
# Creating and Running a SQL Agent
agent_executor = create_sql_agent(
llm=model,
toolkit=toolkit,
verbose=True,
agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)
agent_executor.run("Number of customer")复制
结果:
Thought: I should query the customer table to get the number of customers
Action: sql_db_query
Action Input: SELECT COUNT(*) FROM customer
Observation: [(10,)]
Thought: I now know the final answer
Final Answer: There are 10 customers.复制
祝贺!你刚刚启动了与数据集的对话。
第一个问题和解决方案
在我的 Langchain 和 Teradata 旅程中,当我问我的代理时,事情变得有点纠结,
agent_executor.run("Give me top 5 customer who have bill greater than 500")复制
生成的查询 Agent 如下所示:
SELECT customer_name, total_price FROM customer, orderdetail
WHERE customer.customer_id = orderdetail.customer_id
AND total_price > 500 ORDER BY total_price DESC LIMIT 5复制
问题
错误:(teradatasql.OperationalError) [版本 17.20.0.31] [会话 1116]
[Teradata 数据库] [错误 3706] 语法错误:预期“DESC”关键字和“LIMIT”关键字之间有内容
。
Agent 的 SQL 谈话更符合 PostgreSQL 等数据库,而不是 Teradata。
解决方法
在创建代理时,清晰度至关重要。该指令包括指定 Teradata 和生成对 Teradata 友好的 SQL。
这是我用来定义代理的提示:
# Define prompt
prefix = """
You are an agent designed to interact with a Teradata database.
Given an input question, create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Unless the user specifies a specific number of examples they wish to obtain,
always limit your query to at most {top_k}
results by using SELECT TOP {top_k}, note that LIMIT function does not works
in Teradata DB.
You can order the results by a relevant column to return the most interesting examples in the database.
Never query for all the columns from a specific table, only ask for the relevant columns given the question.
You have access to tools for interacting with the database.
Only use the below tools. Only use the information returned by the below tools to construct your final answer.
You MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again.
DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.
If the question does not seem related to the database, just return "I don't know" as the answer.
"""复制
使用更新的前缀内容重新定义代理执行程序。
agent_executor = create_sql_agent(
llm=model,
toolkit=toolkit,
verbose=True,
agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
prefix=prefix
)复制
结果:
使用新前缀内容重新定义代理后,它现在可以生成在 Teradata DB 上平滑运行的查询。
SELECT TOP 5 customer_name, total_price
FROM customer, orderdetail
WHERE customer.customer_id = orderdetail.customer_id AND total_price > 500复制
评估座席的能力
为了评估代理的熟练程度,我设计了一系列问题,从简单的查询到涉及多个表联接的更复杂的查询。
sql_questions = [
"1. Retrieve all columns from the customer table.",
"2. Retrieve the names and email addresses of all customers.",
"3. Retrieve the unique customer names from the customer table.",
"4. List the customers who have placed an order in the last month.",
"5. Retrieve the total number of orders placed by each customer.",
"6. List the customers who have not placed any orders.",
"7. Calculate the total revenue generated by each customer.",
"8. Find the customer who has spent the most money on orders.",
"9. Retrieve the customer who has placed the most orders in the last quarter.",
"10. List the customers who have placed orders on consecutive days."
]
for i, question in enumerate(sql_questions):
try:
ans = agent_executor(question)
print("Question {} : {}".format(i, question))
print("Answer : ", ans.get("output"))
except Exception as e:
print("Question {} : {}".format(i, question))
print("Error : ", e)复制
The result :
Question: Retrieve all columns from the customer table.
Answer: The customer table contains customer_id, customer_name, customer_email, and customer_address columns. Examples of data in the table include customer_id 8 with name Grace Davis, customer_id 7 with name Frank Miller, and customer_id 5 with name Charlie Brown.
Question: Retrieve the names and email addresses of all customers.
Answer: The names and email addresses of all customers are:
Grace Davis (grace@example.com), Frank Miller (frank@example.com), Charlie Brown (charlie@example.com), Jane Smith (jane@example.com), Harry Lee (harry@example.com), Eva White (eva@example.com), John Doe (john@example.com), Ivy Smith (ivy@example.com), Alice Johnson (alice@example.com), and Bob Johnson (bob@example.com).
Question: Retrieve the unique customer names from the customer table.
Answer: The unique customer names from the customer table are Grace Davis, Frank Miller, Charlie Brown, Jane Smith, Harry Lee, Eva White, John Doe, Ivy Smith, Alice Johnson, and Bob Johnson.
Question: List the customers who have placed an order in the last month.
Error: This model's maximum context length is 4097 tokens, however, you requested 4267 tokens (4011 in your prompt; 256 for the completion). Please reduce your prompt or completion length.
Question: Retrieve the total number of orders placed by each customer.
Error: This model's maximum context length is 4097 tokens, however, you requested 4179 tokens (3923 in your prompt; 256 for the completion). Please reduce your prompt or completion length.
Question: List the customers who have not placed any orders.
Answer: John Doe, Harry Lee, Alice Johnson.
Question: Calculate the total revenue generated by each customer.
Answer: The total revenue generated by each customer is calculated by querying the customer, orderdetail, and product tables using the provided SQL query.
Question: Find the customer who has spent the most money on orders.
Answer: The customer with ID '1' has spent the most money on orders.
Question: Retrieve the customer who has placed the most orders in the last quarter.
Answer: The customer who has placed the most orders in the last quarter is Frank Miller with 9 orders.
Question: List the customers who have placed orders on consecutive days.
Answer: The customers who have placed orders on consecutive days are customer 10, customer 2, customer 3, customer 4, customer 10, customer 1, customer 6, and customer 9.复制
该代理表现出值得称赞的熟练程度,有效地回答了各种问题。但是,需要注意的是,当用于反馈的数据超过其限制时,模型会遇到错误。
结论
总之,LangChain是一个强大的框架,用于开发语言模型驱动的应用程序。它的上下文感知和推理功能为数据交互的新时代打开了大门。当您踏上LangChain的旅程时,请考虑代理指令清晰度的细微差别,并注意大量数据的潜在限制。
使用LangChain释放数据的全部潜力 - 语言模型和数据探索无缝融合的门户。亲自尝试一下,见证它对您的数据科学工作可能产生的变革性影响。
原文链接:https://medium.com/@lucnguyen_61589/revolutionize-your-data-exploration-unveiling-the-power-of-langchain-bd9f18d97532