
Agentic AI Analytics: How to use an LLM to make sense of your data


Background

A few months ago we posted a Technically Speaking video on DuckDB and how it can be used to locally wrangle CSV and parquet files. We demoed this approach using the Iowa liquor sales dataset. This strategy works great when the user has a deep understanding of data analytics and SQL. But what if we would like to ask questions about our data in a more human-friendly, conversational way?

This is a problem that can be solved by leveraging Agentic AI. Ben McHone did another great video on Agentic AI that explains the concept. I highly recommend watching both videos before continuing here. Let's glue these two technologies together and build an AI agent that can answer questions about the dataset.

Initial Setup

We will start with a new Python project and install our initial dependencies using your preferred method (e.g. virtual environments). I am using AWS Bedrock as my LLM service, so we need to install that dependency too:

pip install langgraph langchain-core langchain-aws

Taking inspiration from the LangGraph Quick Start and Ben's video we will start with a simple graph that just calls Bedrock:

from typing import Annotated
from typing_extensions import TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_aws import ChatBedrock

# This holds data between nodes in our graph
class State(TypedDict):
    messages: Annotated[list, add_messages]

# Setup a client to call bedrock
llm = ChatBedrock(
    model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    model_kwargs=dict(temperature=0),
)


# chatbot just passes the message history to Bedrock
def chatbot(state: State):
    return {"messages": [llm.invoke(state["messages"])]}


# Setup a simple graph with 3 nodes in a line
graph_builder = StateGraph(State)
graph_builder.add_node("chatbot", chatbot)
graph_builder.add_edge(START, "chatbot")
graph_builder.add_edge("chatbot", END)
graph = graph_builder.compile()


# take the user input, call the graph and stream the responses to the console
def stream_graph_updates(user_input: str):
    for event in graph.stream({"messages": [("user", user_input)]}):
        for value in event.values():
            content = value["messages"][-1].content
            print("Assistant: ", content)


# Main loop to ask for user input
while True:
    user_input = input("User: ")
    if user_input.lower() in ["quit", "exit", "q"]:
        print("Goodbye!")
        break

    stream_graph_updates(user_input)

This is a very simple chatbot that calls the LLM once and dumps out the result, but it is a good test that we have the basics set up. Assuming we have AWS credentials available in the environment, we can run this Python file and ask it a question like this:

% python main.py
User: Give me a list of 3 fruits
Assistant:  Here are 3 fruits:

1. Apple
2. Banana
3. Orange
User: 

Tool Support

Good, we can ask the chatbot basic questions. Now let's add a tool that will query our dataset.

First, install the DuckDB Python library:

pip install duckdb

We can now make a few changes to our script:

Import duckdb and the LangChain tool-calling support:

import duckdb
from langchain_core.tools import tool
from langgraph.prebuilt import ToolNode, tools_condition

Create a tool that runs a query on DuckDB. This library could not be simpler to use.

@tool
def query_database(query: str):
    """
    This tool executes a SQL query on the `liquor_sales` table and returns the results.
    """
    # DuckDB's replacement scans let `query` reference this local variable by name
    liquor_sales = duckdb.sql("SELECT * FROM '../liquor_sales.parquet';")
    return duckdb.sql(query).fetchall()

tools = [query_database]

We need to bind our tools into the LLM, so change the LLM initialization to the following:

llm = ChatBedrock(
    model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    model_kwargs=dict(temperature=0),
).bind_tools(tools)

Finally, we add a new node to the graph that calls the tools, plus conditional edges that route to that node whenever the LLM wants to call a tool. Replace our graph-building code with this new configuration:

graph_builder = StateGraph(State)
graph_builder.add_node("chatbot", chatbot)
graph_builder.add_node("tools", ToolNode(tools=tools))
graph_builder.add_edge(START, "chatbot")
# After the tools run, hand the results back to the chatbot
graph_builder.add_edge("tools", "chatbot")
# tools_condition routes to "tools" when the LLM requests a tool call, else to END
graph_builder.add_conditional_edges("chatbot", tools_condition)
graph_builder.add_edge("chatbot", END)
graph = graph_builder.compile()

If we run our app and ask it a pretty basic question about liquor sales, we can see that it calls our tool and starts running queries:

% python main.py
User: What is the most popular item sold?
Assistant:  
Assistant:  Error: BinderException('Binder Error: Referenced column "item_description" not found in FROM clause!\nCandidate bindings: "Item Description", "Store Location", "Item Number", "Date", "City"')
 Please fix your mistakes.
Assistant:  
Assistant:  Error: ParserException('Parser Error: syntax error at or near "`"')
 Please fix your mistakes.
Assistant:  
Assistant:  Error: ParserException('Parser Error: syntax error at or near "`"')
 Please fix your mistakes.
Assistant:  
Assistant:  Error: BinderException('Binder Error: Referenced column "Item_Description" not found in FROM clause!\nCandidate bindings: "Item Description", "Item Number", "Store Location", "Date", "Bottles Sold"')
 Please fix your mistakes.
Assistant:  
Assistant:  Error: ParserException('Parser Error: syntax error at or near "`"')
 Please fix your mistakes.
Assistant:  
Assistant:  Error: BinderException('Binder Error: Referenced column "bottles_sold" not found in FROM clause!\nCandidate bindings: "Bottles Sold", "Bottle Volume (ml)", "Volume Sold (Liters)", "Volume Sold (Gallons)", "Address"')
 Please fix your mistakes.
Assistant:  
Assistant:  [["FIREBALL CINNAMON WHISKEY", 17353052]]
Assistant:  The query groups the liquor sales data by the "Item Description" column, sums the "Bottles Sold" values for each item, orders by the total bottles sold descending, and takes the first row - which shows that "FIREBALL CINNAMON WHISKEY" is the most popular item sold with a total of 17,353,052 bottles.

There is a lot going on here, but it is very interesting. We can see that the LLM took a wild guess at querying the table; nowhere in our setup did we say anything about what the table structure looks like. But since DuckDB's error message for an unknown column suggests candidate column names that might help a human user, the LLM can use that information and try again.

The LLM then tries to run another query against the Item Description column. This also falls flat: the column name contains a space, so it has to be escaped. The LLM tries to escape it with backticks, but DuckDB escapes identifiers with double quotes.

Amazingly, it tries a few more queries and stumbles upon a valid one that actually returns "FIREBALL CINNAMON WHISKEY" as the most popular item with 17,353,052 bottles.

It is almost too much to believe, so if we launch the DuckDB CLI manually and run the following query, we can confirm that the result above is actually correct:

select "Item Description", sum("Bottles Sold") 
from '../liquor_sales.parquet' 
group by "Item Description" 
order by 2 desc 
limit 1;

Improving Our Bot

As impressive as this is, it is a somewhat sloppy solution. We can add more description to our tool to tell the LLM that it needs to use double quotes to escape column names. Add this line to the docstring of our tool function:

Wrap every column name in double quotes. For example: SELECT "Store Number" from liquor_sales

We would not expect a data analyst to write queries against a table without letting them see the columns available in it. We can create a second tool that returns the available tables and columns. Add another tool function with this content:

@tool
def describe_schema():
    """
    This tool returns a description of a database schema with the tables and their columns.
    """

    return """
    The liquor_sales table has the following columns:
    |      column_name      | column_type | null | description |
    |-----------------------|-------------|------|-------------|
    | Invoice/Item Number   | VARCHAR     | YES  | Concatenated invoice and line number associated with the liquor order. This provides a unique identifier for the individual liquor products included in the store order    |
    | Date                  | DATE        | YES  | Date of order      |
    | Store Number          | BIGINT      | YES  | Unique number assigned to the store who ordered the liquor.    |
    | Store Name            | VARCHAR     | YES  | Name of store who ordered the liquor.    |
    | Address               | VARCHAR     | YES  | Address of store who ordered the liquor.    |
    | City                  | VARCHAR     | YES  | City where the store who ordered the liquor is located    |
    | Zip Code              | VARCHAR     | YES  | Zip code where the store who ordered the liquor is located    |
    | Store Location        | VARCHAR     | YES  | Location of store who ordered the liquor. The Address, City, State and Zip Code are geocoded to provide geographic coordinates. Accuracy of geocoding is dependent on how well the address is interpreted and the completeness of the reference data used. Left NULL where unable to provide point location.    |
    | County Number         | VARCHAR     | YES  | Iowa county number for the county where store who ordered the liquor is located    |
    | County                | VARCHAR     | YES  | County where the store who ordered the liquor is located    |
    | Category              | BIGINT      | YES  | Category code associated with the liquor ordered    |
    | Category Name         | VARCHAR     | YES  | Category of the liquor ordered.    |
    | Vendor Number         | VARCHAR     | YES  | The vendor number of the company for the brand of liquor ordered    |
    | Vendor Name           | VARCHAR     | YES  | The vendor name of the company for the brand of liquor ordered    |
    | Item Number           | VARCHAR     | YES  | Item number for the individual liquor product ordered.    |
    | Item Description      | VARCHAR     | YES  | Description of the individual liquor product ordered.    |
    | Pack                  | BIGINT      | YES  | The number of bottles in a case for the liquor ordered    |
    | Bottle Volume (ml)    | BIGINT      | YES  | Volume of each liquor bottle ordered in milliliters.    |
    | State Bottle Cost     | DOUBLE      | YES  | The amount that Alcoholic Beverages Division paid for each bottle of liquor ordered    |
    | State Bottle Retail   | DOUBLE      | YES  | The amount the store paid for each bottle of liquor ordered    |
    | Bottles Sold          | BIGINT      | YES  | The number of bottles of liquor ordered by the store    |
    | Sale (Dollars)        | DOUBLE      | YES  | Total cost of liquor order (number of bottles multiplied by the state bottle retail)    |
    | Volume Sold (Liters)  | DOUBLE      | YES  | Total volume of liquor ordered in liters. (i.e. (Bottle Volume (ml) x Bottles Sold)/1,000)    |
    | Volume Sold (Gallons) | DOUBLE      | YES  | Total volume of liquor ordered in gallons. (i.e. (Bottle Volume (ml) x Bottles Sold)/3785.411784)    |
    """

Remember we have to wire it into the tools array:

tools = [query_database, describe_schema]

Let's try asking it some questions with these changes:

% python main.py
User: What is the most popular item sold?
Assistant:  
Assistant:  
    The liquor_sales table has the following columns:
    |      column_name      | column_type | null | description |
    |-----------------------|-------------|------|-------------|
    | Invoice/Item Number   | VARCHAR     | YES  | Concatenated invoice and line number associated with the liquor order. This provides a unique identifier for the individual liquor products included in the store order    |
    | Date                  | DATE        | YES  | Date of order      |
    | Store Number          | BIGINT      | YES  | Unique number assigned to the store who ordered the liquor.    |
    | Store Name            | VARCHAR     | YES  | Name of store who ordered the liquor.    |
    | Address               | VARCHAR     | YES  | Address of store who ordered the liquor.    |
    | City                  | VARCHAR     | YES  | City where the store who ordered the liquor is located    |
    | Zip Code              | VARCHAR     | YES  | Zip code where the store who ordered the liquor is located    |
    | Store Location        | VARCHAR     | YES  | Location of store who ordered the liquor. The Address, City, State and Zip Code are geocoded to provide geographic coordinates. Accuracy of geocoding is dependent on how well the address is interpreted and the completeness of the reference data used. Left NULL where unable to provide point location.    |
    | County Number         | VARCHAR     | YES  | Iowa county number for the county where store who ordered the liquor is located    |
    | County                | VARCHAR     | YES  | County where the store who ordered the liquor is located    |
    | Category              | BIGINT      | YES  | Category code associated with the liquor ordered    |
    | Category Name         | VARCHAR     | YES  | Category of the liquor ordered.    |
    | Vendor Number         | VARCHAR     | YES  | The vendor number of the company for the brand of liquor ordered    |
    | Vendor Name           | VARCHAR     | YES  | The vendor name of the company for the brand of liquor ordered    |
    | Item Number           | VARCHAR     | YES  | Item number for the individual liquor product ordered.    |
    | Item Description      | VARCHAR     | YES  | Description of the individual liquor product ordered.    |
    | Pack                  | BIGINT      | YES  | The number of bottles in a case for the liquor ordered    |
    | Bottle Volume (ml)    | BIGINT      | YES  | Volume of each liquor bottle ordered in milliliters.    |
    | State Bottle Cost     | DOUBLE      | YES  | The amount that Alcoholic Beverages Division paid for each bottle of liquor ordered    |
    | State Bottle Retail   | DOUBLE      | YES  | The amount the store paid for each bottle of liquor ordered    |
    | Bottles Sold          | BIGINT      | YES  | The number of bottles of liquor ordered by the store    |
    | Sale (Dollars)        | DOUBLE      | YES  | Total cost of liquor order (number of bottles multiplied by the state bottle retail)    |
    | Volume Sold (Liters)  | DOUBLE      | YES  | Total volume of liquor ordered in liters. (i.e. (Bottle Volume (ml) x Bottles Sold)/1,000)    |
    | Volume Sold (Gallons) | DOUBLE      | YES  | Total volume of liquor ordered in gallons. (i.e. (Bottle Volume (ml) x Bottles Sold)/3785.411784)    |
    
Assistant:  
Assistant:  [["64870", "FIREBALL CINNAMON WHISKEY", 10652120]]
Assistant:  Based on the sales data, the most popular item sold is Fireball Cinnamon Whiskey (Item Number: 64870) with over 10.6 million bottles sold. This item significantly outperforms other products in terms of total bottles sold, making it the clear leader in popularity.

This is looking really good. We were able to ask it a couple of questions, and it built valid SQL statements and ran them on the first try.

Conclusion

This was a very simple and quick proof-of-concept demonstrating how an LLM can be empowered with tools, and how those tools can expose complicated interfaces like SQL execution. You can continue to iterate on this implementation by asking it different questions and adding more and more helpful information to the docstring to teach it how to use the tool more effectively. The source code for this example is available on our GitHub.

Next Steps

Here are some different ideas and considerations to expand this solution:

  • DuckDB supports querying S3 directly, so you could point the table declaration at an S3 path rather than a local file.
  • This entire app could be wrapped up in a Lambda that executes once per user input. The State instance would just need to be persisted between invocations on a per-session basis.
  • More tables can be added and the LLM can join them together to create more complicated queries. A tool would need to be added that lets the agent list the available tables.
  • Normally, we would show the user only the final result of the execution and not every step along the way. We left it in so that we can demonstrate how the LLM is solving the question.
  • We could store our state in a persistent state store by using one of these patterns:
    • LangGraph CheckPointers - Used if you have more state than just messages or want the ability to go back to a point in time
    • LangChain Message Histories - Used if you're only storing the messages as they were at the end of the last interaction