So you picked your model, and now you want to go to production with it. Is it actually going to work, or fail miserably? Once it's out in the wild, do you know how much it's really going to cost?
Scaling (Requests)
The two biggest factors in the scale and cost equation are requests and tokens. Let's start with requests: specifically, the expected number of requests into the system every 10 seconds. Why 10 seconds? Well, in Microsoft/Azure land, that's the window over which your rate limit is evaluated. Since that's one of the mainstream platforms, we'll use it as our starting point.
Once we have an idea of the number of requests every 10 seconds, we can start to work our way up the scale and cost stack. But determining requests per 10 seconds isn't as straightforward as you might think. Let's do an example:
You may be thinking: I'll simply do some kind of monthly active users (MAU) calculation. OK, let's go with that. Say you'll have 10,000 active users per month. Next, you need to determine how many requests each user will execute per month. If they are regular workers, they probably work 9-5, 40-hour weeks. If the system becomes a regular part of their day (think call center), they may be making 4-5 calls per hour. That's enough to get started, so let's do the math:
- Variables
- 10,000 MAU
- 40 hrs / week
- 20 work days / month
- 5 requests / hour
- Calculations:
- 5 * 8 * 20 * 10,000 = 8,000,000 requests / month (5 per hour, 8-hour days, 20 days, 10,000 users)
- 5 * 10,000 = 50,000 requests / hour
- 50,000 / 60 = 833 requests / minute
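If you want to play with these assumptions, here's the same back-of-the-napkin math as a quick Python sketch (the numbers are just the ones above, so swap in your own):

```python
# Back-of-the-napkin user-request math. All inputs are the assumptions above.
MAU = 10_000                    # monthly active users
HOURS_PER_DAY = 8               # 9-5 workday
WORK_DAYS_PER_MONTH = 20
REQUESTS_PER_USER_PER_HOUR = 5

requests_per_month = REQUESTS_PER_USER_PER_HOUR * HOURS_PER_DAY * WORK_DAYS_PER_MONTH * MAU
requests_per_hour = REQUESTS_PER_USER_PER_HOUR * MAU   # assumes everyone is active at once
requests_per_minute = requests_per_hour / 60

print(f"{requests_per_month:,} user requests / month")     # 8,000,000
print(f"{requests_per_hour:,} user requests / hour")       # 50,000
print(f"{requests_per_minute:.0f} user requests / minute") # 833
```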
So you are probably thinking, ah, I’ll have 833 requests / minute. That’s great, and should be very cheap to run! Not so fast.
In most well-designed GenAI systems (anything more than just a chatbot), a single user request does not map to a single model request. Huh? Typically, a user request is sent to the orchestrator, where a model request may be made to determine how to route it to the proper agent. Once it reaches the proper agent, that agent may have a set of tools that make up its data and logic, so it makes a model request to determine the best set of tools to execute. Then each of those tools gets executed, and each one could make zero or more model requests of its own. Again, let's take an example:
- GenAI system request received:
- Model is called to determine the best agent to respond (1 model request)
- Agent is called; it has zero or more tools, but nonetheless at least one model request is likely made to answer the user prompt based on zero or more tool outputs (1 model request).
- If the agent has tools, each tool called can make zero or more model requests. For this example, let's say the agent has to call two tools and each tool makes 2 model requests (2 * 2 = 4 model requests).
In the above example, a single user request has cost you 6 model requests. Your actual load is now around 5K model requests per minute (833 * 6). And since you are throttled in 10-second windows, a 5K RPM quota really gives you about 833 model requests per 10 seconds, which works out to roughly 138 user requests every 10 seconds.
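Here's that fan-out math in the same style of sketch. The 6 model requests per user request comes from the example above; your pipeline will have its own multiplier:

```python
# Fan-out math: how many *model* requests your *user* requests really generate.
USER_REQUESTS_PER_MINUTE = 833            # from the MAU math above
MODEL_REQUESTS_PER_USER_REQUEST = 6       # 1 router + 1 agent + 4 tool calls (example above)
WINDOWS_PER_MINUTE = 6                    # Azure evaluates the limit in 10-second windows

model_rpm_needed = USER_REQUESTS_PER_MINUTE * MODEL_REQUESTS_PER_USER_REQUEST  # 4,998 (~5K)

quota_rpm = 5_000                                               # what you provisioned
model_requests_per_window = quota_rpm // WINDOWS_PER_MINUTE     # 833 per 10 seconds
user_requests_per_window = model_requests_per_window // MODEL_REQUESTS_PER_USER_REQUEST  # 138

print(f"Model RPM needed: {model_rpm_needed:,}")
print(f"Model requests allowed per 10s window: {model_requests_per_window}")
print(f"User requests you can actually serve per 10s: {user_requests_per_window}")
```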
The above assumes that all the requests happen within that 10-second window; in reality, a single user request may span a lot longer than 10 seconds and your model requests may be spread out over a much longer time span. But hopefully you get the point: it's the number of model requests you are making that matters, not the number of user requests. This is a common miscalculation people make when trying to determine how to scale their instances.
With all that now in hand, what happens if you go over 138 user requests in 10 seconds? Remember, you have 10,000 users, so it's a very high-probability event. Once you hit your limit, you'll get the dreaded 429s, and you can pretty much kiss the system goodbye. Every new request just compounds the problem, you keep flooding the throttle limit, and the system becomes unusable. This is especially painful if you have made it all the way to the last tool call, another request thread pushes you into a 429, and that last call fails (even though all the previous ones succeeded), forcing you to fail the whole request.
Now some of you are thinking: hey, just do retries with exponential backoff. Sure, that can help keep the system going in a degraded mode, but users will start to see some pretty nasty latencies on their responses. This can be especially bad if you use the out-of-the-box Azure OpenAI SDKs. The HttpClient they build will respect the retry period sent back by the service, which we have seen land in the 30-90 second range. That's just crazy. In most cases, you will probably tell it to ignore that and fail fast, or write your own HttpClient that does its own retries based on configuration.
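To make that concrete, here's a rough sketch of a "fail fast" retry policy at the plain HTTP level. This is not the Azure OpenAI SDK's actual retry implementation, just an illustration of bounding the Retry-After value instead of blindly honoring a 30-90 second wait; the endpoint, headers and cap values are placeholder assumptions:

```python
# A "fail fast" retry sketch at the raw HTTP level (not the SDK's built-in policy).
import random
import time

import requests

MAX_RETRIES = 3
MAX_RETRY_AFTER_SECONDS = 5.0   # ignore the service's 30-90s suggestions beyond this


def call_model(url: str, payload: dict, headers: dict) -> requests.Response:
    """POST to a model endpoint, retrying 429s with short, capped backoff."""
    for attempt in range(MAX_RETRIES + 1):
        resp = requests.post(url, json=payload, headers=headers, timeout=30)
        if resp.status_code != 429:
            return resp   # success, or a non-throttling error for the caller to handle

        retry_after = float(resp.headers.get("Retry-After", "1"))
        if attempt == MAX_RETRIES or retry_after > MAX_RETRY_AFTER_SECONDS:
            resp.raise_for_status()   # surface the 429 instead of stalling the user

        # Exponential backoff with jitter, capped so latency stays tolerable.
        time.sleep(min(MAX_RETRY_AFTER_SECONDS, (2 ** attempt) + random.random()))

    return resp   # not reached in practice; keeps type checkers happy
```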
Scaling (Tokens)
OK, so you think you've got this handled, right? Nope. As a game show host would say, "but wait, there's more!"
Remember that requests-per-minute (RPM) limit? Well, there is another limit you have to consider: the tokens-per-minute (TPM) limit. You might be thinking, hey, we will never go over 138 concurrent user requests, so you allocate the 5K RPM quota and think you are good. Well, it turns out the agent router, the agents and the tools have a surprise for you. All of them need a prompt that tells the model what it is supposed to do.
The size of that prompt, and anything else that goes into it, matters. A LOT. Typically the agent router prompt will be relatively small, only a few hundred tokens, but once you get into the agents and tools, things start to get interesting. So far, in the systems I have built, the agent prompts tend to be pretty small, as they really only define the tools and potentially the personality of the agent. The tool prompts, however, are another story. Tool prompts can get super complicated.
Little side note: I'll admit, I originally laughed when I saw the salaries of prompt engineers in Silicon Valley and some other places, but now that I have been engaged with multiple clients, I actually think they are probably getting paid what they should!
OK, back to the tool prompts: these can get really large, ranging from 4,000-6,000 tokens on average in my experience. And that's just the main tool prompt. The tool itself could be designed to query data from some external system such as Azure AI Search or a database of some type. Not only does the size of the prompt matter, but so does the size of the context data that gets added to it. In some cases, we see tools that generate 25K-token prompts. That's really on the high side for today's models. But you're probably thinking: they can go to 128K! Ha, not when you are running 138 concurrent user requests.
If your original 5K RPM quota only comes with a 100K TPM quota, you are going to hit 429s pretty much right away, and the system will crash brilliantly. So, what do you do?
Determining tool prompt and context token sizes, along with the tool distribution (the share of requests that go to each agent/tool), will help you plan for what your token needs will be. Also, if you are returning 10 items from the Azure AI Search index when only 3 are needed, scale that down. The same goes for a database query: if you only need the top 5 rows, don't return 100.
The smaller you can make the prompts and context, the higher the throughput you are going to be able to achieve with your GenAI system.
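To plan that token budget, a sketch like the following can help. Every number in it is an illustrative assumption; replace them with measurements from your own router, agents and tools:

```python
# A rough TPM budgeting sketch. All numbers are illustrative assumptions;
# plug in measured averages from your own router, agents and tools.

USER_REQUESTS_PER_MIN = 833          # from the earlier MAU math

# Average tokens (prompt + completion) per model request, per component.
ROUTER_TOKENS = 500                  # assumption: small routing prompt
AGENT_TOKENS = 1_500                 # assumption: agent prompt + answer
TOOL_TOKENS = 5_000                  # assumption: 4K-6K tool prompt + context
TOOL_CALLS_PER_REQUEST = 4           # two tools, two model requests each (example above)


def tokens_per_user_request() -> int:
    """Total model tokens burned by one user request across the whole pipeline."""
    return ROUTER_TOKENS + AGENT_TOKENS + TOOL_CALLS_PER_REQUEST * TOOL_TOKENS


def required_tpm() -> int:
    """Tokens per minute you need to provision to avoid 429s at steady state."""
    return tokens_per_user_request() * USER_REQUESTS_PER_MIN


print(f"Tokens per user request: {tokens_per_user_request():,}")  # 22,000
print(f"Required TPM quota:      {required_tpm():,}")             # ~18,326,000 (a 100K TPM quota won't last a second)
```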
Stress Testing
This is a must. Learning where the system breaks in terms of concurrent users will teach you all kinds of things about your system design: how many requests the core layers can handle (note that I did not talk about these layers above), how many requests your model deployment can handle, and how your tools behave when the $%^& really hits the fan.
We will cover reporting and system metrics later on, but all the data you generate from stress testing is super important to keep handy.
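If you want a starting point, here's a minimal async load-test sketch. The endpoint, payload and concurrency numbers are placeholders; a real stress test should ramp load gradually and capture full latency distributions, not just averages:

```python
# Minimal async load-test sketch using asyncio + aiohttp.
# The endpoint URL and payload shape are hypothetical placeholders.
import asyncio
import time

import aiohttp

ENDPOINT = "https://your-genai-system.example.com/chat"  # hypothetical endpoint
CONCURRENT_USERS = 150
REQUESTS_PER_USER = 10


async def one_user(session: aiohttp.ClientSession, results: list) -> None:
    for _ in range(REQUESTS_PER_USER):
        start = time.monotonic()
        try:
            async with session.post(ENDPOINT, json={"prompt": "test question"}) as resp:
                await resp.text()
                results.append((resp.status, time.monotonic() - start))
        except aiohttp.ClientError:
            results.append(("error", time.monotonic() - start))


async def main() -> None:
    results: list = []
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(one_user(session, results) for _ in range(CONCURRENT_USERS)))
    throttled = sum(1 for status, _ in results if status == 429)
    latencies = [elapsed for _, elapsed in results]
    print(f"total={len(results)} 429s={throttled} "
          f"avg_latency={sum(latencies) / len(latencies):.2f}s max={max(latencies):.2f}s")


if __name__ == "__main__":
    asyncio.run(main())
```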
Optimization (Caching)
Caching is a great thing. If your users are asking the same question over and over again, why burn 6 model requests and the corresponding tokens every time? That’s the equivalent of just throwing your money in the firepit.
Implementing some kind of caching layer (semantic?) in your core/orchestration layer can help immensely with reducing your request counts and improving the system performance. Users will be very impressed with the speed when a request comes back in 1 second versus 30 seconds.
Not only should you consider a cache layer in the front end of the pipeline, but also in your agents and tools. If they are doing the same things over and over again, why waste the CPU/memory and token burn? Cache it!
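As a starting point, here's a minimal sketch of an exact-match response cache keyed on a normalized prompt. A semantic cache would swap the hash lookup for an embedding similarity search; the TTL and the answer_with_model() callback are assumptions for illustration:

```python
# Minimal exact-match response cache keyed on a normalized prompt.
import hashlib
import time
from typing import Callable, Optional

CACHE_TTL_SECONDS = 15 * 60
_cache: dict[str, tuple[float, str]] = {}


def _key(prompt: str) -> str:
    # Normalize whitespace and case so trivially different phrasings still hit the cache.
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cached_answer(prompt: str, answer_with_model: Callable[[str], str]) -> str:
    """Return a cached answer when available, otherwise call the (expensive) pipeline."""
    key = _key(prompt)
    hit: Optional[tuple[float, str]] = _cache.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]                      # no model requests, no token burn

    answer = answer_with_model(prompt)     # the full 6-model-request pipeline
    _cache[key] = (time.time(), answer)
    return answer
```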
Costs
You can find the current costs of the various models on the respective cloud providers' pricing pages.
Costs aren’t the only thing you should consider. Yeah, you might get higher RPM and TPM, but the completions are a complete disaster (comprehension, accuracy, etc).
Also note that a lot of what you see is noisy-neighbor pricing: you are all sharing the same GPUs, and during periods of high demand on the cloud provider, your latencies can be all over the place (sometimes in the 2-minute range). If you want guaranteed latencies, you typically have to pay extra for that priority queuing.
Summary
You can’t just build an GenAI system, add an agent and say go. It just doesn’t work that way. You have to plan for the number of requests and users and how the system will handle overloaded situations. None of which are easy tasks.
Oh, did I mention AI is expensive? More so than you might think. Those pennies per 1K tokens add up very quickly! Having an idea of how much a single request will burn in terms of input and output tokens will give you a better idea of just how much your GenAI system is going to cost you.
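Here's a rough way to ballpark it. The per-1K-token prices below are placeholders, not real list prices, so pull the real numbers from your provider's pricing page and plug in your measured token counts:

```python
# Rough monthly cost sketch. Prices and token counts are placeholder assumptions,
# NOT real list prices; use your provider's pricing page and your own measurements.
INPUT_PRICE_PER_1K = 0.005    # $ per 1K input tokens (placeholder)
OUTPUT_PRICE_PER_1K = 0.015   # $ per 1K output tokens (placeholder)

INPUT_TOKENS_PER_USER_REQUEST = 20_000   # router + agent + tool prompts and context
OUTPUT_TOKENS_PER_USER_REQUEST = 2_000   # completions across all 6 model requests
USER_REQUESTS_PER_MONTH = 8_000_000      # from the earlier MAU math


def monthly_cost() -> float:
    per_request = (
        INPUT_TOKENS_PER_USER_REQUEST / 1_000 * INPUT_PRICE_PER_1K
        + OUTPUT_TOKENS_PER_USER_REQUEST / 1_000 * OUTPUT_PRICE_PER_1K
    )
    return per_request * USER_REQUESTS_PER_MONTH


print(f"Estimated monthly model cost: ${monthly_cost():,.0f}")  # ~$1,040,000 with these placeholders
```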
Microsoft Copilot
I'll give a bit of credit to Microsoft Copilot here. Everything mentioned above is abstracted away from you: you get an all-you-can-eat buffet of GenAI for $30/user/month. You can't do much in terms of customizing the lower layers (picking your models, how vectorization works, etc.), and you are at the mercy of the backend (yeah, East US 2, I'm looking at you). But that's a pretty stellar deal for not having to deal with the headache of everything I have discussed so far, not to mention what I'm about to discuss in the next few posts.
Contact
Need help getting your GenAI project started and/or over the finish line? Ping me, always happy to help!
Email: givenscj@hotmail.com
Twitter: @givenscj
LinkedIn: http://linkedin.com/in/givenscj
GenAI Blog Series
- #1 – Build it or Buy it/RentIt
- #2 – Host it or get SaaS-y
- #3 – Train vs Mainstream models
- #4 – Scaling your solution (and not break the bank)
- #5 – Implementing Security (oh so many levels)
- #6 – Reporting, Logging and Metrics
- #7 – MLOps/GenAIOps, some kind of *Ops
- #8 – Measuring Return on Investment (ROI)