GenAI Blog Series #4 – Scaling your solution (without breaking the bank)

Break the bank with AI.

So you picked your model, and now you want to go to production with it. Is it actually going to work, or fail miserably? Once it's out in the wild, do you know how much it's really going to cost?

Scaling (Requests)

The two biggest factors in the scale and cost equation are requests and tokens. For requests, what matters is the expected number coming into the system every 10 seconds. Why 10 seconds? In Microsoft/Azure land, that's the interval at which your rate limit is evaluated. Since Azure OpenAI is one of the mainstream models/platforms, we'll use it as our starting point.

Once we have an idea of the number of requests every 10 seconds, we can start to work our way up the scale and cost stack. But determining requests per 10 seconds isn't as straightforward as you might think. Let's do an example:

You may be thinking: I'll simply do some kind of monthly active users (MAU) calculation. Ok, let's go with that. Say you'll have 10,000 active users per month. Next, you need to determine how many requests you think each user will execute per month. If they are regular workers, they probably do 9-5 with 40-hour work weeks. If the system becomes a regular part of their day (think call center), they may make 4-5 calls per hour. That's enough to get started, so let's do the math:

  • Variables:
    • 10,000 MAU
    • 8 hrs / day (40-hour work week)
    • 20 work days / month
    • 5 requests / hour
  • Calculations:
    • 5 * 8 * 20 * 10,000 = 8,000,000 requests / month
    • 5 * 10,000 = 50,000 requests / hour (peak, with every user active at once)
    • 50,000 / 60 ≈ 833 requests / minute
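Every number above is an assumption you'd replace with your own, but the arithmetic can be sanity-checked in a few lines of Python:

```python
# Back-of-envelope request math for the example above (all numbers are
# illustrative assumptions, not benchmarks).
MAU = 10_000          # monthly active users
HOURS_PER_DAY = 8     # 40-hour work week
WORK_DAYS = 20        # work days per month
REQ_PER_HOUR = 5      # requests per user per hour

monthly_requests = REQ_PER_HOUR * HOURS_PER_DAY * WORK_DAYS * MAU
peak_hourly = REQ_PER_HOUR * MAU        # worst case: every user active at once
peak_per_minute = peak_hourly // 60

print(monthly_requests)   # -> 8000000
print(peak_per_minute)    # -> 833
```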

So you're probably thinking: ah, I'll have 833 requests per minute. That's great, and should be very cheap to run! Not so fast.

In most well-designed GenAI systems that are more than just a chat bot, a single user request does not map to a single model request. Huh? What typically happens is that the request is sent to the orchestrator, which may make a model request to determine how to route to the proper agent. The agent may have a set of tools that make up its data and logic, so it makes another model request to determine the best set of tools to execute. You then execute those tools, each of which can make zero or more model requests of its own. Again, let's take an example:

  • GenAI system request received:
    • Model is called to determine the best agent to respond (1 model request).
    • Agent is called; it has zero or more tools, but nonetheless at least one request is likely made to answer the user prompt based on zero or more tool outputs (1 model request).
    • If the agent has tools, each tool called can make zero or more requests. For this example, say the agent calls two tools and each tool makes 2 model requests (2 * 2 = 4 model requests).

In the example above, a single user request has cost you 6 model requests. At 833 user requests per minute, your actual load is now around 5K model requests per minute. And because you are throttled on 10-second windows, you can really only make the original 833 model requests per 10 seconds, which works out to roughly 138 user requests every 10 seconds.
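Sticking with the example breakdown (1 router call + 1 agent call + 2 tools × 2 calls each), the fan-out math looks like this:

```python
# Model-request fan-out for one user request (counts from the example above;
# your own router/agent/tool numbers will differ).
router_calls = 1       # pick the right agent
agent_calls = 1        # final answer over the tool outputs
tools = 2              # tools the agent invokes
calls_per_tool = 2     # model requests inside each tool

model_calls_per_user_request = router_calls + agent_calls + tools * calls_per_tool
print(model_calls_per_user_request)   # -> 6

# If your quota works out to 833 model requests per 10-second window,
# the user-request ceiling is much lower:
quota_per_window = 833
user_requests_per_window = quota_per_window // model_calls_per_user_request
print(user_requests_per_window)       # -> 138
```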

The above assumes all the requests happen within that 10-second window; in reality, a single user request may span well beyond 10 seconds, spreading its model requests over a much longer time span. But hopefully you get the point: it's the number of model requests you are making, not the number of user requests, that matters. This is a common miscalculation people make when trying to determine how to scale their instances.

With all that in hand, what happens if you go over 138 user requests in 10 seconds? Remember, you have 10,000 users, so that seems like a very high-probability event. Once you hit your limit, you'll get the dreaded 429s, and at that point you can pretty much kiss the system goodbye: every new request compounds the problem, you keep flooding the throttle limit, and the system becomes unusable. This is especially painful if you have made it all the way to the last tool call and another request thread has pushed you into 429 territory: that last call fails (even though all the previous ones succeeded) and you have to fail the request completely.

Now some of you are thinking: hey, do retries with exponential backoff, etc. Sure, that can keep the system going in a degraded-state kind of mode, but users will start to see some pretty nasty latencies on their responses. This can be especially bad with the out-of-box Azure OpenAI SDKs: the HttpClient they build respects the retry period sent back by the service, which we have seen range from 30 to 90 seconds. That's just crazy. In most cases, you will probably tell it to ignore that and fail quickly, or write your own HttpClient that does its own retries based on configuration.
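As a sketch (this is not the actual Azure OpenAI SDK API; the function and parameter names here are illustrative), a custom retry policy that caps the wait instead of honoring a 30-90 second Retry-After might look like:

```python
import random
import time

def call_with_backoff(send, max_attempts=4, base_delay=0.5, max_retry_after=5.0):
    """Retry `send()` on 429s with exponential backoff and jitter.

    Instead of honoring a 30-90 second Retry-After from the service, cap
    the wait at `max_retry_after` seconds and fail fast once attempts are
    exhausted. `send` is any callable returning (status_code, body).
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # out of attempts: surface the 429 instead of waiting forever
        # Exponential backoff with jitter, never longer than max_retry_after.
        delay = min(base_delay * (2 ** attempt), max_retry_after)
        time.sleep(delay + random.uniform(0, 0.1))
    return 429, body
```

The key design choice is failing quickly: a capped, configurable backoff keeps a throttled request thread from holding a user connection open for a minute or more.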

Scaling (Tokens)

Ok, so you think you've got this handled, right? Nope. As a game show host would say, "but wait, there's more!"

Remember that requests-per-minute limit? Well, you have another limit to consider: tokens per minute. You might be thinking, hey, we'll never go over 138 concurrent user requests, so you allocate the 5K requests-per-minute quota and think you're good. Well, it turns out the agent prompt router, the agents, and the tools have a surprise for you: all of them need a prompt that tells the model what it is supposed to do.

The size of that prompt, and anything else that goes into it, matters. A LOT. Typically the agent router prompt is relatively small, only a few hundred tokens, but once you get into the agents and tools, things start to get interesting. So far, in the systems I have built, the agent prompts tend to be pretty small, as they really only define the tools and potentially the personality of the agent. The tool prompts, however, are where things can get really interesting: they can get super complicated.

Little side note: I'll admit I originally laughed when I saw the salaries of prompt engineers in Silicon Valley and some other places, but now that I have been engaged with multiple clients, I actually think they are probably getting paid what they should!

Ok, back to the tool prompts: these can get really large, ranging from 4,000-6,000 tokens on average in my experience. And that's just the main tool prompt. The tool itself could be designed to query data from an external system such as Azure AI Search or a database of some type, so it's not only the size of the prompt that matters, but also the size of the context data added to it. In some cases, we see tools that generate 25K-token prompts. That's really on the high side for today's models. But you're probably thinking, they can go to 128K! Ha, not when you are running 138 concurrent user requests.

If your original 5K RPM quota only comes with a 100K TPM quota, you are going to hit 429s pretty much right away, and the system will crash brilliantly. So, what do you do?

Determining tool prompt and context sizes, along with the tool distribution (the number of requests that go to each agent/tool), will help you plan what your token needs will be. Also, if you are returning 10 items from the Azure AI Search index when only 3 are needed, you should probably scale that down. The same goes for queries: if you only need the top 5 rows, don't return 100.
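A rough way to plan the token budget: every prompt size below is a made-up placeholder you'd replace with measurements from your own router, agents, and tools.

```python
# Rough tokens-per-minute estimate from call fan-out and prompt sizes.
# All sizes are illustrative assumptions; plug in your own measurements.
calls = {
    "router": {"per_user_request": 1, "tokens": 300},
    "agent":  {"per_user_request": 1, "tokens": 800},
    "tool_a": {"per_user_request": 2, "tokens": 5_000},   # prompt + context
    "tool_b": {"per_user_request": 2, "tokens": 6_000},
}

tokens_per_user_request = sum(
    c["per_user_request"] * c["tokens"] for c in calls.values()
)

user_requests_per_minute = 60   # planned peak
tpm_needed = tokens_per_user_request * user_requests_per_minute
print(tokens_per_user_request)  # -> 23100
print(tpm_needed)               # -> 1386000
```

Even at a modest 60 user requests per minute, these hypothetical prompt sizes demand close to 1.4M TPM, which is why a default 100K TPM quota falls over immediately.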

The smaller you can make the prompts and context, the higher the throughput you are going to be able to achieve with your GenAI system.

Stress Testing

This is a must. Learning where the system breaks in terms of concurrent users will teach you all kinds of things about your design: how many requests the core layers can handle (note that I did not talk about these layers above), how many requests your model can handle, and how your tools behave when the $%^& really hits the fan.

We will cover more about reporting and system metrics later on, but all the data you generate from stress testing is super important to keep handy.
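A minimal sketch of such a harness (a stand-in for a real load-testing tool like Locust, JMeter, or Azure Load Testing; `endpoint_call` is whatever function invokes your system):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def stress(endpoint_call, concurrent_users=50, requests_per_user=10):
    """Tiny stress harness: hammer `endpoint_call` and report latency and errors.

    The goal is to find the concurrency level where 429s and timeouts
    start appearing, and what latencies look like at that point.
    """
    def one_user(_):
        lats, errs = [], 0
        for _ in range(requests_per_user):
            start = time.perf_counter()
            try:
                endpoint_call()
            except Exception:
                errs += 1  # 429s, timeouts, etc. count as errors here
            lats.append(time.perf_counter() - start)
        return lats, errs

    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        results = list(pool.map(one_user, range(concurrent_users)))

    latencies = sorted(t for lats, _ in results for t in lats)
    errors = sum(e for _, e in results)
    p95 = latencies[max(0, int(len(latencies) * 0.95) - 1)]
    return {"requests": len(latencies), "errors": errors, "p95_s": round(p95, 4)}
```

Ramp `concurrent_users` up in steps and watch where the error count and p95 latency take off; that knee is your real capacity, regardless of what the quota sheet says.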

Optimization (Caching)

Caching is a great thing. If your users are asking the same question over and over again, why burn 6 model requests and the corresponding tokens every time? That's the equivalent of throwing your money in the fire pit.

Implementing some kind of caching layer (semantic?) in your core/orchestration layer can help immensely with reducing your request counts and improving the system performance. Users will be very impressed with the speed when a request comes back in 1 second versus 30 seconds.

Not only should you consider a cache layer in the front end of the pipeline, but also in your agents and tools. If they are doing the same things over and over again, why waste the CPU/memory and token burn? Cache it!
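A sketch of the cheapest version of this idea: an exact-match cache in front of the orchestrator. A true semantic cache would embed the prompt and match on similarity above a threshold; that part is only described in the docstring, not implemented here.

```python
import hashlib

class PromptCache:
    """Exact-match response cache keyed on a normalized prompt.

    A semantic cache would instead embed the prompt and look up nearest
    neighbors above a similarity threshold; this sketch is the cheaper
    exact-match layer you can put in front of one.
    """
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Lowercase and collapse whitespace so trivial variants hit the cache.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_call(self, prompt: str, model_call):
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = model_call(prompt)   # the expensive 6-model-request path
        self._store[key] = response
        return response
```

The same pattern works one level down: key an agent's or tool's cache on its own (prompt, context) pair so repeated tool work is also skipped. In production you would add TTLs and an eviction policy so stale answers age out.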

Costs

You can find the current costs of the various models on the respective cloud providers' websites.

Costs aren't the only thing you should consider. Yeah, you might get higher RPM and TPM from a cheaper model, but the completions can be a complete disaster (comprehension, accuracy, etc.).

Also note that a lot of what you see is noisy-neighbor pricing: you are all sharing the same GPUs, and during periods of high demand on the cloud provider, your latencies can be all over the place (sometimes in the 2-minute range). If you want guaranteed latencies, you typically have to pay extra for that priority queuing.

Summary

You can't just build a GenAI system, add an agent, and say go. It just doesn't work that way. You have to plan for the number of requests and users, and for how the system will handle overload. None of this is easy.

Oh, did I mention AI is expensive? More so than you might think. Those pennies per 1K tokens add up very quickly! Knowing how many input and output tokens a single request burns will give you a much better idea of just how much your GenAI system is going to cost you.
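To make that concrete, a per-request cost estimator. The per-1K prices below are placeholders, NOT real rate-card numbers, and the token counts and monthly volume are the hypothetical figures from earlier in this post; check your provider's current pricing.

```python
# Cost per user request from token burn. Prices are placeholder assumptions,
# not real quotes -- check your provider's current rate card.
PRICE_PER_1K_INPUT = 0.0025    # $ per 1K input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.01     # $ per 1K output tokens (assumed)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# Example: a 6-model-call user request totaling ~23K input / 2K output tokens,
# at a hypothetical monthly volume of 8M user requests.
per_request = request_cost(23_000, 2_000)
monthly = per_request * 8_000_000
print(round(per_request, 4))   # -> 0.0775
print(round(monthly))          # -> 620000
```

Even at fractions of a cent per 1K tokens, fan-out plus large tool prompts turns pennies into a six-figure monthly bill at scale, which is exactly why caching and prompt trimming matter.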

Microsoft Copilot

I'll give a bit of credit to Microsoft Copilot here: everything mentioned above is abstracted away from you. You get an all-you-can-eat buffet of GenAI for $30/user/month. You can't do much in terms of customizing the lower layers (picking your models, how vectorization works, etc.) and you are at the mercy of the backend (yeah, East US 2, I'm looking at you). But that's a pretty stellar deal for not having to deal with the headache of everything I have discussed so far, not to mention what I'm about to discuss in the next few posts.

Contact

Need help getting your GenAI project started and/or over the finish line? Ping me, always happy to help!

Email: givenscj@hotmail.com
Twitter: @givenscj
LinkedIn: http://linkedin.com/in/givenscj

GenAI Blog Series

GenAI Blog Series #3 – Train vs Mainstream models

As you continue to mull over your decision to build it, buy it and host it, or SaaS it, at some point you will need to pick a model or models to use. Since not every use case is the same, the latest mainstream model may not work for you. It could very well be the case that you have to build your own model, typically due to the highly sensitive or proprietary nature of the data and use case.

Mainstream Models

You probably get bombarded with blogs, tweets, and other things talking about how cool this model is and how cool that model is. The reality is, over the past two years we have seen models come and go. New models come out that are cheaper, provide better answers, etc. The latest news about DeepSeek definitely sent shockwaves through the AI community, and as you can imagine, people who have already deployed GenAI solutions are very much thinking: how fast can I get an agent up and running to test that model with my current setup? I have already had the conversations: "Can we switch to that model?"

You have to be careful with mainstream models. You can look at benchmarks all day, but the reality is: will it run at scale? Will it give you the performance and accuracy required AND not cause you to go broke? This isn't always a fast and easy question to answer, and it may require a proof-of-concept setup and a series of performance tests.

For example, take Azure's hosted version of GPT-4o. You can easily find the costs for this cloud-hosted model on the Azure website. Ultimately, it is up to you to request the size of the deployment you will need, and that isn't as easy as you might think.

There are two main metrics: requests per minute and tokens per minute. These can be a bit misleading: anyone who has built a GenAI platform at scale knows that the real limit is actually evaluated over 10-second windows. If you are on pace to exceed your per-minute limit, you will get 429s. We will explore more of this in the next blog post!
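In code form, converting a per-minute quota into an effective 10-second budget. The divide-by-six behavior is an observed approximation of the windowed evaluation, not a documented guarantee, so verify it against your own deployment:

```python
# Convert a per-minute quota into the effective short-window budget.
# The windowed evaluation is an observed approximation (assumption),
# not a contractual guarantee -- measure your own deployment.
def per_window_budget(rpm_quota: int, window_seconds: int = 10) -> int:
    return rpm_quota * window_seconds // 60

print(per_window_budget(5_000))    # -> 833
print(per_window_budget(60_000))   # -> 10000
```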

Saving the scaling conversation for later, suffice it to say you have to find a model that fits your use case. It might be an OpenAI model; it might not. There are lots of open-source models out there that may actually get you where you need to be without costing you a lot of money.

Train your Own

Mainstream models have been trained on a plethora of things. Most of the parameters and knowledge built into the model probably won't be very important, or even needed, for your specific use case(s). In that case, it's probably overkill and much too expensive to utilize. So what do you do?

First step: start browsing the models on Hugging Face and check out the advanced Kaggle competitions. It's highly likely you'll find something that comes close to what you are looking for. If you don't find anything you like, it's time to go to the drawing board and start doing some diagrams and math equations!

Using common tools like PyTorch, TensorFlow, and Keras, you can absolutely rebuild the mainstream models into your own model. Granted, not all models publish how they were trained or provide their source code, but several do, so you have plenty of examples to work from. If this isn't something you are comfortable with, go check out Kaggle and start with the beginner competitions. You'll really enjoy them, and they will give you a sense of how to get started.

Now, the advantage of the mainstream models: they have already been trained. Some took months and data centers full of very expensive GPUs, and someone else has already done that work and paid for that computing power. If you don't have the budget to buy, set up, and operate GPU clusters yourself, you are really left with two options: rent GPUs from the big three cloud providers (Azure, AWS, GCP), or find a specialized GPU data center provider (yeah, I didn't know they existed either until a few months ago) at hopefully cheaper rates than buying it all yourself.

This is where things can get even trickier. The model you send for training had better be accurate. The last thing you want is to spin up a 50-100 GPU cluster, send in your model, let it sit for the next few weeks running through thousands of epochs, and only then find out you forgot a neural network layer somewhere. That can hurt the pocketbook.

Fine-tune models

Then there is the meet-in-the-middle approach: the model you are using is 90-95% of the way there and just needs a little help to reach that 99-100% mark. This is where supplementing the model's knowledge with something extra (things specific to the domain in which it will be used) can help.

For example, the term "goal" in GPT land typically means a personal goal, something you are trying to achieve. In the domain of soccer/futbol, it means a totally different thing. Supplementing the core model with domain-specific knowledge can keep it from getting confused and potentially hallucinating weird stuff to your users.

If you don't have the budget to retrain a large billion-parameter model, but need the flexibility to make a mainstream model a bit smarter, then fine-tuning may be the way to go. But, disclaimer: your results will vary with fine-tuning, so don't expect too much here.

Just like prompting, the more refined and straightforward your data is, the higher chance you will get the results you are looking for. So effort has to be put into selecting the most appropriate fine tuning data.
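As an illustration of "refined and straightforward" data, here is the chat-style JSONL shape used by OpenAI-style fine-tuning: one JSON object per line, each a complete conversation demonstrating the domain-specific behavior you want. The soccer example content is invented for illustration.

```python
import json

# One training example per line, each showing the domain meaning of "goal"
# (example content is invented; build yours from real, vetted conversations).
examples = [
    {
        "messages": [
            {"role": "system",
             "content": "You are a soccer analytics assistant."},
            {"role": "user",
             "content": "How many goals did the home side score?"},
            {"role": "assistant",
             "content": "In this domain, a 'goal' is a score in the match, "
                        "not a personal objective. The home side scored 2."},
        ]
    },
]

jsonl = "\n".join(json.dumps(e) for e in examples)
with open("finetune_data.jsonl", "w") as f:
    f.write(jsonl)
```

The curation effort goes into the examples themselves: consistent terminology, no contradictory answers, and enough coverage of the confusing cases (like "goal") that the model actually learns the domain sense.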

Summary

Picking a model is not a simple task. There are many factors that play into the decision of what model you will eventually use for the use case. Again, not every use case will use the same model, so be ready to support lots of models.

You should also be ready to abandon old models in lieu of newer ones.

The platform you choose should ultimately allow you to do that. If it doesn't, you will be at its mercy as to which models it supports, and your competitors may leapfrog you because their solution was more agile in its ability to "use anything." Luckily, FoundationaLLM lets you plug and play at various levels so you can take advantage of whatever model you want. Given its modular design, it's pretty simple to copy an existing agent and tools, point them at your new model, and get your testing going. This should be a requirement for any platform you are looking at.

Lastly, and hopefully this has hit you already: each of these paths is expensive. Wait, did I mention that GenAI is expensive? After we get all the GenAI basics out of the way, we'll certainly explore ROI in a later blog post.


GenAI Blog Series #2 – Host it or get SaaS-y

Host It or SaaS It

In the previous post we presented the option of building or buying your GenAI platform. In this post, we will get a bit more specific about what it really takes to host your own solution and why choosing a SaaS-based product may be in your best interest.

GenAI solutions tend to need a lot of horsepower. The layers you build (UIs, APIs, pipelines, databases, orchestration, etc.) will very likely be packaged up as containers, and those containers need to be deployed somewhere. In the case of FoundationaLLM, you can choose between two supported Azure-based deployment paths: Azure Container Apps (ACA) and Azure Kubernetes Service (AKS). That doesn't mean you couldn't run it on Kubernetes in Google Cloud (GCP) or Amazon Web Services (AWS); as of today, we just have the Bicep that knows all about Azure.

When these services get deployed, you need to know how many users and requests your agents will be receiving. This is important for several reasons:

  • You will need to scale up your container instances (Nodes/Pods) to meet the demand (which typically requires an increase of your vCPU quotas to support said demand).
  • You will need to make sure you don't overload the backend model given the call pattern of your agents and tools. But this is more of a model issue than a hosting problem; we will explore it more in a later blog post.

Azure Container Apps (ACA)

ACA is an incredibly simple and easy way to get up and running quickly. If you are not an expert in YAML and Helm charts for AKS, this is a nice way to get something running without requiring much management or experience to keep things going.

Azure Kubernetes Services (AKS)

ACA is a great product, but it has some quirks that don't quite make it production-level for a GenAI deployment. It's great for a development/QA/staging environment, but not something I'd go with for production.

So Kubernetes is really your best option, and luckily it runs everywhere (Azure, GCP, AWS). This means you'll need to be comfortable deploying the initial resources and ultimately securing them (zero trust). Greenfield environments work great, but the moment other things come into play (your own DNS, hub network, peering, VPNs, paths, routes, TLS, etc.), aka brownfield, you will need to consider all kinds of things to get a solution up and running. If you don't have the resources to do this, you'll need to bring in someone to help plan and match things up so your deployment works flawlessly.

These folks will need some serious skills. If you plan on managing this yourself at some point, you will need to train up and/or hire people with the knowledge to do it.

Upgrading

How easy is it to upgrade the solution? Even if it's container-based, it won't be as simple as changing the container image in the ACA config or the AKS deployment; there will always be extra steps to move you from one version to another.

Getting SaaS-y

Hosting it sounds like a lot of work, right? Not to mention, it's going to cost you in cloud spend right away: just firing up the basic system will put you right around $5K in compute per month. So if you are not comfortable with all the work it takes to host a GenAI system yourself, you are probably better off going down a hosted/SaaS path.

There are several options out there, but note that you will need to consider the following:

  • Security/Identity – Does the solution support external identity providers? How easy is it to integrate? Does it support more than just users (groups)? How might the solution utilize models in other cloud platforms? Can it support cross-IdP auth to take advantage of various models? An API key is not the auth you are looking for…
  • Compliance – Has the platform gone through basic compliance checks like SOC Type II or other more stringent process and data control verifications? Do they store your data, if so, where and how?
  • High-availability – What is the SLA on the system? What if it goes down?
  • Customization – Can it do what you need it to do, plus give you the flexibility to mold the system to your requirements?


GenAI Blog Series #1 – Build it or Buy it

CIO contemplating whether to build or buy.

It's the first question that should come to mind. The decision you make here will shape everything you do after.

And no matter what decision you make, you should be aware: things are expensive in the AI world. Whether you buy it/rent it or build it, it's going to cost A LOT.

Option 1: Buy It/Rent It

There are many options out there to choose from. Each comes with some kind of advantage/disadvantage. Some of these include:

  • Simple knowledge management agents:
    • Microsoft Copilot / Copilot Studio ($30/user/month)
    • OpenAI – ChatGPT
  • SaaS offerings allowing varying levels of customization and UI friendliness; here are some options in no particular order:
    • Google Vertex AI
    • Azure AI Foundry / Prompt Flow
    • Amazon Bedrock
    • Stack-ai.com
    • Relevanceai.com
    • CrewAI
    • Praison.AI
    • Flowiseai.com
    • Abacas.ai
    • Humanloop
    • Klu.ai
    • Vishwa.ai
    • LangTail
    • Tune.app

As you can see from the list above, many companies are striving to build platforms and SaaS-based applications so you don't have to. However, there are some issues you should consider if you decide to explore these paths; here are just a few:

  • Features and functionality – If you sign up and look at each of the examples above, you will find large differences between them and what they feel is important for their customer bases. Your ability to influence them to add features that are important to you may be limited, or it could take a long time to get onto their roadmap.
  • Model support – Some platforms will be solely targeted at the models they are financially invested in or contractually set to use (hosted on AWS, GCP, Azure). Not to say that some platforms won’t allow you to use models in other places, but how difficult will it be to plug your fancy new model into the platform?
  • Tools – The orchestration layer is probably going to be built into the target platform. It could be LangChain, Semantic Kernel, or some homegrown custom one, with a high probability that you cannot replace it with your own. The ability to add an agent and its necessary tools is a very important aspect of the system. And because tools can have tools, a management UI that lets you configure these tools and sub-tools is a vital part of the solution. Without it, you will be relegated to whatever out-of-the-box agents and tools they give you.
  • Agent and Tools Extensibility – a topic so important it is separate from the Tools conversation! Having a set of agent types to select from is cool and all, but having only a static set of tools is not. You should be able to add your own agents and tools, and agents should be able to utilize any tools you load into the system. The only issue you'll run into: how do you route to these tools?
  • Router Workflow – Simply calling an agent and getting a single response is pretty…simple. But what if you have multiple agents, or agents with multiple tools? How do you route to these? What if more than one agent or tool needs to respond? Router design becomes a critical part of the system (if you are going past a simple GenAI chat bot anyway). Can you do these types of flows with a Buy It system?
  • Support – Will it cost you extra? Do they respond quickly enough?
  • Source Code – Is it open source, or is it closed? Will the company stick around long enough to succeed in the market or will you need a code escrow?

Option 2: Build it

So you have explored the Buy It/Rent It options and concluded that you need to build it on your own. Understand that the work you are about to undertake will be quite involved: a massive effort.

Resources will be needed.

You are going to need some skilled AI/ML folks (at least two), some front-end folks (again, at least two), middleware and backend folks (another couple), a DevOps team of another 2-3 individuals, and don't forget project managers and some testers. So right off the start, you'll need 12-14 people to Build It. Figure around $100K average per person, and you are sitting at $1.2-$1.4M/year just for the GenAI team. Some other roles I didn't include here that you likely already have: security, privacy, and compliance team members. However, you'd still need to allocate some of their time to review any proposed solutions.
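The staffing math above, spelled out (the $100K average salary is the same rough assumption as in the text; adjust for your market):

```python
# Staffing cost back-of-envelope from the roles above (low end of the range).
team = {
    "ai_ml": 2,
    "front_end": 2,
    "middleware_backend": 2,
    "devops": 3,
    "pm_and_testers": 3,
}
headcount = sum(team.values())
avg_salary = 100_000          # rough average, an assumption
annual_team_cost = headcount * avg_salary
print(headcount)          # -> 12
print(annual_team_cost)   # -> 1200000
```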

This does NOT include the hosting costs, which we will get to in a later post.

The challenge is the same at pretty much every company: budgets are stretched, you already have competing priorities and projects, and there are legacy applications and processes to support. Teams tend to be overwhelmed with what they already have on their plates. Although most people will be happy to learn more about AI and how to use and incorporate it, they will need to find the time to do so. AI can help with that: Microsoft Copilot could be the solution that increases productivity and frees up the time to build custom AI solutions. But now we have AI eating AI, but hey… why not?

In all seriousness, this blog series will enlighten you on just how hard it is to get a GenAI system to a successful end state. It is very likely you will need to bring in some talent that has done this a few times and knows how to keep you out of trouble. Taking classes, watching some videos, and getting simple POC examples from GitHub up and running does not a GenAI expert make (Yoda, 2025).

This means hitting the recruiting staff with some job descriptions and getting them out onto the job boards. I see GenAI jobs popping up everywhere as companies realize they need help and can't do it on their own. You should be really specific about what you are looking for: say what tech stack you are targeting (reference the next section), which cloud platform you use (Azure, AWS, GCP), and be explicit that you are looking for GenAI versus machine learning. The two are very different things: a GenAI person is not always skilled at building neural networks to train multi-billion-parameter models on costly GPUs, and a machine learning expert is not necessarily skilled at building a GenAI platform.

Components to build.

Now, you won't completely have to start from scratch; there are plenty of examples of various bits and pieces scattered across GitHub. The real decisions center on the main components, what tech stack they will run on, and how you will ultimately host them. For example:

  • Front end (User and Management Portals) – Dotnet, Vue, Node, React, etc. Ultimately these will turn into containers and be hosted somewhere, so as long as your front-end folks are comfortable with the tech, you'll probably be good here. The only nuance: will you need to support accessibility? This is not as easy as one would think and requires quite a bit of tedious work, but it's required if you are in a country/region/industry that mandates it.
  • API/Core/Orchestration – What will you write this in? Java? DotNet? Python? How will you support
  • Gateway layer – build your own API Management layer to offset some costs?
  • Gatekeeper layer – how will you do prompt attack prevention and sensitive information filtering?
  • Database – CosmosDB, PostgreSQL
  • Vector Database – Azure AI Search, CosmosDB, Pinecone, etc.
  • Agentic Frameworks – Semantic Kernel or LangChain? Will you only support one, or many? Allow bring your own customization? How will that plug into the system?
  • Security – At what layers will you build the security? What objects and scopes will be secured? Can you build it in a way that scales?
  • Monitoring – Keeping track of the usage of the system. This includes its performance both in terms of latency and accuracy will be a continual battle. You will need queries and dashboards to help show how the system is running.
  • Failover/Recovery – If your data center or hosting provider goes down, how business critical is it to get the system back up and running? Can you do a full regional failover (in Azure, from East US2 to Sweden?).

Example

Chris O'Brien has a really nicely written blog piece that fits into this topic of Buy It or Rent It. It does a good job of laying out the pros and cons of the two possible paths, with a focus on the costs involved. The use case is pretty specific, but it still reiterates the points made here.

It's interesting to note that many of these GenAI SaaS providers (Buy It/Rent It) are still in price-discovery mode. They probably didn't have a good grip on the backend costs of their solution, and as they scaled up, they realized they may have priced too low. Conversely, those that have found efficiencies, or realized the value they provide is becoming more of a commodity, are bringing their prices down to match new market competitors. This alone argues that you should always be evaluating your options and have an "exit strategy" for getting off one provider and moving to another.

Summary

It may not be obvious which path is the best to start with. If you go it alone, it's going to take some experimentation, mistake-making, and lessons learned. I'd encourage you to reach out to folks in the community (like myself or other AI MVPs) or at AI conferences and learn from those who have done this before. There really is no reason to reinvent the wheel or repeat the mistakes others have made.

As you can see from the above, neither choice presents an obvious or easy path to satisfy your management’s or stockholders’ push to use AI in the business. What is obvious is that GenAI presents opportunities for innovation and business solutions that were not remotely possible before. So it’s not really a matter of “should we” do it, but “when and how”.

Contact

If you are looking for some help in making your decision, or have questions about anything I mention in this blog series, feel free to reach out anytime, and definitely check out FoundationaLLM!

Email: givenscj@hotmail.com
Twitter: @givenscj
LinkedIn: http://linkedin.com/in/givenscj

GenAI Blog Series

GenAI Blog Series: What does it really take to build your company GenAI from scratch?

It’s been a while since my last post! But I can guarantee you, this next series of blogs will enlighten you on some pretty deep and interesting topics that I have gained insight into these past few years!

GenAI is a hot topic

The last couple of years have been a very interesting journey into the world of GenAI through my colleagues at Solliance and the new startup we call FoundationaLLM. This was a “build it on your own”, from-scratch project, which is open sourced here. This project has been 2+ years in the making, and still has a long way to go to solve some of the **REALLY** hard problems.

Where do you start?

One of the biggest challenges for customers today when determining how to integrate GenAI (and AI in general) is: where do we start? GenAI is being talked about everywhere. It has the power and potential to transform organizations in many ways. It can be used to generate revenue, or to cut costs and increase productivity. And, as a colleague once mentioned to me, to automate tasks; but then how do RPA (Power Automate, etc.) and GenAI really differ? Without a firm grip on what you want to accomplish, you should not be in a hurry to allocate resources to an AI project. No one wants to start a project only to see it fail, wasting precious time and money!

So let’s at least get you what you need to understand the journey you are about to undertake!

Questions to Ask

Typically the first question would be: what model(s) are we going to use? Which would then beg the question: what platform (Azure, AWS, GCP, cloud/co-lo GPU-hosted datacenters) are we going to use? But there are so many more questions that you don’t know you don’t know.

Here’s a quick list of common CEO/CIO/CTO questions that will make up this very interesting and insightful blog series:

  • Build it or Buy it
    • Build it
      • How long will it really take to build something from scratch, or even when leveraging various frameworks?
      • Do you really have the skills/team to get it past the finish line? (We have been working non-stop on this for 2+ years).
    • Buy it
      • What are you going to buy? Will it be hosted in your subscription, or will it be SaaS?
      • From who?
      • How much will it cost?
      • How will you manage it? Do you have the skills to manage it (ACA or AKS)?
      • Can you integrate your own applications to it? Are the calls secure and scalable?
      • Is it extensible (add your own agents and models)?
      • Is it flexible (will it support future models)?
  • Host it yourself or leverage SaaS?
    • Host it ourselves:
      • If we host it, where will we host it?
      • How will we host it?
      • How will we scale it?
      • How will we secure it?
      • What models will we support?
      • Do we have the skills to write it and maintain it?
      • Will we do our own monitoring, or hire someone else to do it?
    • Leverage SaaS
      • Will we get locked in? Is the source code available?
      • How customizable is it?
      • How secure is it?
      • How flexible is it? Will we have the ability to request features/roadmap items?
  • Use our own models, or utilize mainstream models?
    • What tools/frameworks might we use to train our own models?
      • PyTorch? TensorFlow? Keras?
    • Where will we run those model training tools/frameworks?
      • Azure Machine Learning?
      • Bedrock?
    • How long will it take and how much will it cost us?
      • If you want to buy a bunch of GPUs and host them to train and run your models, then you must have a nice budget.
      • Most people won’t have this and will need to utilize GPUs hosted by someone else. Also, not a cheap endeavor.
      • What if models are retired? (Best example, embedding models)
        • How easy is it to move to a new model?
    • Where will we run the model once we are done?
  • What data will you use?
    • Where is it?
    • What is it?
    • How much do you have?
    • Is it curated or random?
  • How much will we need to scale the solution?
    • Most models have token limit sizes
    • Most hosted solutions will limit your ability to max out the GPUs backing them
    • How will you scale your solution to maximize throughput, yet not destroy the system when the dreaded 429s start to take all your nodes/pods/threads down?
  • How will I do reporting?
    • How much do we keep? (Chat history, messages, token burn, etc)
    • What compliance issues will I need to address?
  • Security
    • At what layers will we need security?
      • Agent, Datasource, Items, Models, Endpoints, etc
    • Does the solution need to span multiple IdPs? (Azure->AWS)
  • Will the ROI match what the eventual production solution will present?
    • AI is expensive, whether you are hosting it yourself or using out of box solutions like Microsoft Copilot, will you actually attain ROI?
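On the 429 question in the list above: when a hosted model endpoint starts rate limiting, naive clients that retry immediately only make things worse. A common mitigation is exponential backoff with jitter; here is a minimal sketch, assuming a hypothetical `send_request` callable that returns an HTTP status code and a body (not tied to any particular SDK):

```python
import random
import time

def call_with_backoff(send_request, max_retries=5, base_delay=1.0):
    """Retry a model request on HTTP 429 with exponential backoff and jitter."""
    for attempt in range(max_retries):
        status, body = send_request()
        if status != 429:
            return body
        # Back off exponentially, with a little jitter so a fleet of
        # clients doesn't retry in lockstep and re-trigger the 429s.
        delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
        time.sleep(delay)
    raise RuntimeError(f"Rate limited after {max_retries} retries")

# Example with a stub that fails twice with 429, then succeeds:
responses = iter([(429, None), (429, None), (200, "ok")])
result = call_with_backoff(lambda: next(responses), base_delay=0.01)
```

In production you would also honor the `Retry-After` header that providers typically return with a 429, rather than relying on the computed delay alone.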

Proof of Concepts

POCs are easy to set up and can be quite compelling. But don’t let that shiny object/carrot being dangled in front of you distract from the work that it will take to move to production. As you can see from the questions above and the challenges to come below, it’s a complex path to navigate to a final state of a successful GenAI deployment.

Challenges To Come

As a CxO tasked with bringing GenAI into your organization, you can expect at least a few of the following challenges. Be prepared with answers for how you will overcome these when (not if) you hit them:

  • AI is expensive, be ready to allocate budget to it.
  • Just because you build it/buy it, doesn’t mean you will achieve 100% adoption.
  • If you do achieve 100% adoption, you will probably run into various scaling issues if the platform isn’t well designed.
  • You probably don’t have the staff skilled up enough (development, infrastructure) to make it happen. Be ready to hire/outsource.
  • Security and data integration problems.

Summary

As part of this GenAI blog series, we are going to explore each of the above questions and their various sub-layers in incredibly painful depth (put your seatbelts on, keep all arms and legs in the car at all times), with examples from the various pull requests and commits in our repo that illustrate the problems and issues you WILL eventually face.

It is possible to achieve GenAI nirvana; you just need to have the right expectations and be educated and prepared using the lessons learned from folks like myself and others. The opportunities are there, achieving them is possible, but it is going to take some dedication and drive.

Contact

Email: givenscj@hotmail.com
Twitter: @givenscj
LinkedIn: http://linkedin.com/in/givenscj
