GenAI Blog Series #8 – Measuring Return on Investment (ROI)

GenAI ROI

With the previous blogs as a basis for your GenAI knowledge, you can now focus on the question, “Is it worth it?” I have said this a few times now, but it’s worth repeating: AI is expensive. I always like to come back to an equation that was presented to me during my college internship at IBM.

P = R – C aka Profit = Revenue – Costs.

Use Cases

Most vendors, Microsoft likely more so than others, have spent a lot of time and money building out all the possible use cases for AI in an organization. And that makes total sense. People struggle with figuring out how to use AI in their organizations; I have seen it firsthand with the many clients we have worked with over the past few years.

Microsoft has broken use cases down into functional and departmental areas, and further into various industries. What you will typically find is that many of them focus on productivity. This makes sense, as Microsoft Copilot is probably the best known and most actively marketed product, and since it comes out of the Office group, it is all about productivity.

But did you know there are over 70 copilots at Microsoft that don’t just focus on productivity? Some of these teams also work hard to build out use cases for their products (Azure AI Foundry, PromptHub, Copilot for Security, etc.).

So it’s not all about productivity. Productivity doesn’t fit directly into the P = R – C equation; it’s more about whether you can get more out of the “C” part without increasing “C” drastically. Besides, who wants to spend money to increase productivity just to see profit decrease?

What is your Use Case?

Given the focus on use cases, it’s your turn to figure out what is going to give you the biggest bang for the buck. This means running through all the use cases you can think of (or asking an AI model to do it for you), then sorting them by the largest revenue increase, the largest cost decrease, or the biggest productivity gain.

For public companies with shareholders, you are likely to find that profit is going to be the bigger focus. Find the use cases that can have an immediate and profound impact on profit; those will be your top prospects.

Return On Investment (ROI)

Did I mention AI is expensive? Determining your Return on Investment (ROI) is the ONLY thing you should be focused on. You need to figure out exactly how much a GenAI solution is going to cost you. This is not as easy as one would think.

Let’s take the build it on your own approach:

  • Build It costs
    • Architects, Developers, Testers, Project Managers, Infrastructure – (~$1.5m/year)
    • Compute (ACA/AKS) – If you were to build a minimal architecture, you are going to have around 22 different containers supporting your solution, all of which will need somewhere to run. You will need nodes/workloads to support them. A system with one agent and a few hundred users will put you right around ~$5,000/month.
    • Model (Azure Open AI, etc)
      • As we explored in the scaling post, this can vary based on your usage patterns and the size of your user/app base.
      • Some models are more expensive than others, and finding the one that gives you the accuracy and consistency you are looking for can be a challenge.
      • Additionally, if you find the public services can’t meet your user SLAs, then you’ll need to move up to the higher cost PTUs or higher priority service plans of the model hoster.
    • CosmosDB/Database
      • You will have to consider the cost of the RUs for each container. In a custom solution that supports several thousand users, the cost is actually pretty reasonable.
    • Misc services
      • Application Configuration
      • Key Vault
    • Bing (NullDomain)
      • You may want to have some kind of fallback agent/tool. This could simply be a black hole that makes a single LLM call, or it could do something useful, like call Bing to answer the question. Bing is actually pretty expensive, so finding alternatives, such as your own Azure AI Search index over your own news or curated documents, will be much cheaper. Other search providers besides Bing are also worth considering.
    • Storage
      • You are going to have artifacts that live in a blob storage account of some sort. Although this won’t be a large cost of the solution, it can start to add up when you consider BCDR and replication.
    • API Management
      • I don’t even want to talk about this. Is it worth building your own layer to handle this? I would say yes. Do you have the skills to do it? Either way, be ready for some serious costs to be added to your solution. The pattern is mandatory, the implementation is not.
    • Logging
      • Application Insights + Log Analytics can vary greatly in cost, from a few hundred dollars to a few thousand. It really comes down to how much logging you want to keep. If you use OpenTelemetry in your solution, you can set up varying levels of logging for the various components. Setting the level to Information in production will reduce the logs and the costs, but you will lose important data about what might have caused an error. ($1,000–$2,500/month)
    • Azure AI Search
      • If you choose to store your vectors in Azure AI Search, you’ll need to deploy one of these services. The Basic SKU only supports up to 15 indexes, so if your design doesn’t allow multiple tenants/apps/vector stores to share a single index with hybrid search, you may find yourself having to upgrade to a higher SKU to support more indexes.
      • Additionally, semantic search capabilities have to be enabled and cost more than the base SKU alone.
    • Security
      • If the data in the system has no real corporate sensitivity and is meant for public consumption, then you probably won’t be too worried about the security of the system. However, if the system must enforce security (authentication/authorization), then you should also consider infrastructure security as well.
      • Enabling cloud based security technologies like Defender for Cloud can quickly add up in extra costs ($15/VM/month for example).
      • Adding in the cost of Microsoft Sentinel and Defender for Endpoint can also add to the cost of the solution.

Now…2.5x those costs. You need a development environment and a production environment with a “warm” regional failover (hence the extra 0.5 multiplier). Then add the same amount again for each separate environment you decide to add; for example, if you decide to have QA/Staging, add another 2x, which, as you can imagine, most people opt out of because it’s so cost prohibitive.
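To make that multiplier concrete, here is a minimal Python sketch of the per-environment math. Every line item and dollar figure below is an illustrative assumption (loosely echoing the list above), not a quote; plug in your own estimates.

```python
# Rough monthly cost model for a custom-built GenAI platform.
# All figures are illustrative assumptions, not vendor quotes.

base_monthly = {
    "compute_aca_aks": 5_000,   # nodes/pods for ~22 containers
    "model_tokens": 3_000,      # varies wildly with usage patterns
    "cosmosdb": 1_000,          # RUs across containers
    "ai_search": 1_000,         # SKU + semantic ranking
    "logging": 1_500,           # App Insights + Log Analytics
    "misc": 500,                # Key Vault, App Config, storage, etc.
}

prod = sum(base_monthly.values())
dev = prod                      # a full second environment
warm_failover = 0.5 * prod      # "warm" regional failover
qa_staging = 0                  # set to prod if you add QA/Staging (another 2x)

infra_total = prod + dev + warm_failover + qa_staging
personnel = 1_500_000 / 12      # architects, devs, testers, PMs (~$1.5m/yr)

print(f"Infra per month:     ${infra_total:,.0f}")
print(f"Personnel per month: ${personnel:,.0f}")
print(f"Total per month:     ${infra_total + personnel:,.0f}")
```

With these sample numbers the infra lands around $30K/month and the total around $155K/month, which lines up with the ROI example later in this post.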

And now let’s compare something like Microsoft Copilot…aka, the Buy It path.

The licensing is $30/user/month for an all-you-can-eat GenAI buffet. Albeit, Microsoft Copilot Studio custom messaging costs can come into play if you go that route (which many people do), and you also have to consider the costs of the Power Platform that drives some of the more interesting use cases as well.

Measuring ROI

Once you have the full view of the costs of the solution, you can start to quantify the actual ROI. Let’s take the following example:

  • Total monthly infra cost of the solution (dev+prod): $30,000
  • Total monthly personnel costs of supporting the solution: $125,000
  • Number of users (actively using the system): 2,500 MAU
  • Cost per user / month ($155,000 / 2,500): $62/month
  • Monthly hours saved per user (productivity metric): 4 hrs
  • Average loaded cost per user (~$75K/yr): ~$39/hr
  • Monthly savings per user (4 hrs × ~$39/hr): $156.25
  • Net savings per user per month (savings – cost): $94.25

So ultimately, what are you getting for this?

  • The solution is generating 120,000 extra hours of free time per year.
    • This can be used by the employees in any number of ways
  • Cost of the solution is around $155,000 / month (yeah, expect most solutions to be in this realm)
  • Gross time savings of the solution: $390,625/month (2,500 users × $156.25)
  • Net savings of $235,625/month, or roughly $2.8m/year
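Here is the same ROI math as a small Python sketch so you can plug in your own numbers. The hours saved, loaded hourly rate and user counts are the assumptions from the example above; swap them for your own measurements.

```python
# ROI back-of-the-envelope for the example above.
# All inputs are assumptions; replace them with your own measured values.

monthly_infra_cost = 30_000          # dev + prod + warm failover
monthly_personnel_cost = 125_000     # support/dev staff (~$1.5m/yr)
monthly_active_users = 2_500
hours_saved_per_user_per_month = 4   # productivity metric
loaded_hourly_rate = 39.0625         # ~$75K/yr salary

total_monthly_cost = monthly_infra_cost + monthly_personnel_cost
cost_per_user = total_monthly_cost / monthly_active_users

gross_savings = monthly_active_users * hours_saved_per_user_per_month * loaded_hourly_rate
net_savings = gross_savings - total_monthly_cost

print(f"Cost per user / month:      ${cost_per_user:,.2f}")   # ~$62
print(f"Gross time savings / month: ${gross_savings:,.0f}")   # ~$390,625
print(f"Net savings / month:        ${net_savings:,.0f}")     # ~$235,625
print(f"Net savings / year:         ${net_savings * 12:,.0f}")
```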

NOTE: Cost of the out-of-box Microsoft Copilot is $30/user/month = $75,000/month, so you might be tempted to go down that path based on that initial number, but you should also consider all the other items that end up being suggested as add-ons (E3, or more likely E5, licenses at $55/user/month, plus support). Then you’ll still need development and support staff to assist users and build custom agents when needed. And don’t forget, you lose a lot of the lower-level control that a custom-built solution would provide (albeit Copilot Studio with the Power Platform gives you the ability to build some of this back, so the messaging rates come into play). Total cost of this solution (if you elected E5) will push you into the $212,500+/month realm, which would in fact be more than your custom-built GenAI solution.

Take a moment and step back. See the numbers above? They are in the $200K/month range. Did I mention AI is expensive? You should expect similar numbers when you go to execute your strategy (if not more). That’s $2.4m per year to run GenAI. Will your solution generate some multiple of $2.4m worth of value? Will it be profitable? Will it break even?

Will it simply end up being a money pit?

Summary

If you don’t know how much the GenAI solution is going to cost you, do not even begin the GenAI journey until you have all of this information (and if you are in the middle of a project right now and you still don’t know, STOP…NOW). If the use case does not fit into P = R – C with massive potential as low-hanging fruit and a quick win, then you should probably hold off until you find one that does. You need that first win to keep the momentum going. Failing your first use case will not set you up to get the budget for trying a second time.

Vendors and salespeople are happy to sell you the “vision”, but don’t get blinded by fancy GenAI-generated presentations. Too many people have been burned badly by the “excitement” of AI and what it “can do”, only to brutally find out later, after spending millions of dollars, that they really should have planned things out better.

Instead of getting the results they wanted, they fell even further behind their competitors who thought it out, planned it, set budgets and maximized the ROI and got that first vital AI win.

Finally, there has been a lot of development, money, blood, sweat and tears put into the FoundationaLLM platform (and several others out there). Being tasked to build GenAI from scratch is not something I would wish on anyone, yet I have no doubt that many of you reading this have gone through it and know full well the pain I am talking about.

So, is it worth it? Did I scare you enough to think twice about jumping right into GenAI? I hope so.

It is worth it if you do it right: execute on every level, in a high-performance manner, with costs at the forefront of every decision. Do this, and you have a much higher chance of success.

Contact

I don’t want you to lose or get fired. I really want you to win! If you need help getting your GenAI project started and/or over the finish line…ping me, always happy to help!

Email: givenscj@hotmail.com
Twitter: @givenscj
LinkedIn: http://linkedin.com/in/givenscj

GenAI Blog Series

GenAI Blog Series #7 – MLOps/GenAIOps, some kind of *Ops

*Ops

Deploying artifacts should be simple and repeatable. Having development, QA, staging and production environments that mimic each other should be a seamless process that includes quick creation and end-to-end (E2E) testing. This can be accomplished through well-designed CI/CD processes and is much easier when things are greenfield, but we all know that’s not always the case (for example, brownfield, where a firewall, DNS and hub networks already exist and must be integrated with).

Code

If you went down the Build It path, then you are going to have a couple of years’ worth of code about to be (or already) built up. In today’s world, everything is driven by containerized solutions. As the various layers are built, you’ll need to ensure that Dockerfiles are created for your service layers. In addition, those Dockerfiles should have all the necessary scripting to properly build your images. NuGet packages, PyPI packages, all that great stuff has to be considered as part of the build, deploy and execution processes.

Hopefully you are using some kind of enterprise-level git repository such as GitHub or GitLab. In the case of GitHub, the shift-left mentality really shows its strengths in ensuring that any dependencies you add to your projects are vulnerability free. You get a lot of this for free for public projects, but unfortunately it is something you have to pay for on private projects.

CI/CD Workflows

So your code compiles and your local tests pass, and you want to commit those changes to your repo. You can do direct commits, or you can do merge/pull requests. The latter is the better approach, but believe me when I say I understand: under pressure and tight timelines, you may not always have this luxury. Especially when your one and only code owner is on vacation.

The setup of your repository is very important. You should be able to have multiple developers working on various features and then merge those changes back into your various branches (main, feature, release, etc.).

When a release branch is committed to, you may want it to do a full build and deployment to your container registry. This allows you to modify your dev/test deployments to point to the new container release tags and do your testing.

Infrastructure (Quotas, quotas, and more quotas)

GenAI application deployments require a lot of quota. And all kinds of different types of quotas.

  • Compute
    • Virtual Machines, Scale Sets, workloads, all used to host your containers
  • AI Models
    • Newer models will require approval. During approval and development, you may only get a small amount. This can make scale/stress/load testing difficult.
  • Function/App Service Plans
    • If you need some kind of logic app, function app or other type of app service, you may need to request the ability to create it in the target region. East US/East US 2 can get tricky.

Typically, customers would like to have all of their resources in a single region, maybe due to regulations or compliance, or just because. GenAI platforms require a myriad of features that must all exist in the target region. If they don’t, then your best-laid plans for that Bicep/Terraform go right out the window.

We have had a myriad of issues trying to find regions that support all the features (yeah, they are all ring zero) that allow for deployment without having to jump through quota requests and other escalation mechanisms (CSA/GBB).

Change is Inevitable

In addition to getting your environments up and running, you have to expect that changes will always be a part of the GenAI application evolution. Platform components are updated, prompts are updated, new models are rolled out, etc. Not all of these things will be driven by Infrastructure as Code (IaC). Some of these changes will be manual, while some will be script driven.

Ensuring that you have a development environment where these changes can be tested is vital, as you really don’t want to be deploying untested/unverified changes straight to production. You know better!

Backup

Any business-critical system should support some kind of backup or versioning mechanism. This is important for many reasons, the primary one being the ability to get back up and running in case of system failure; however, users and admins can also be a source of frustration when they change an agent and all hell breaks loose.

Things that should be backed up include:

  • System configurations (App Config)
  • Databases (Cosmos, Azure SQL)
  • Vector Database (Azure AI Search)
  • Agent configurations
  • Prompts
  • Data Pipeline configurations

Disaster Recovery

You just never know when a region or zone might go down and you need to get your GenAI platform up and running somewhere else (if you are lucky enough to find a region that supports all the GenAI platform features and are one of the first to fail over to said region). And of course, this assumes you decided to build it and host it on your own.

You will need to read the terms and conditions and service level agreements of your hosting provider to find out what SLAs they offer in case their stuff goes down. And as a vetting process, you should always find out what kind of BCDR they have behind the scenes.

Summary

Change is inevitable. Being able to adapt to those changes is important, and having well-defined processes and procedures in place to automate deployments and upgrades is mandatory. By doing this, you will get to your desired state quicker and more efficiently, and have a more reliable system as a result.

Contact

Need help getting your GenAI project started and/or over the finish line? Ping me, always happy to help!

Email: givenscj@hotmail.com
Twitter: @givenscj
LinkedIn: http://linkedin.com/in/givenscj

GenAI Blog Series

GenAI Blog Series #6 – Reporting, Logging and Metrics

AI Metrics

So you solved all the previous architectural challenges and have moved on to the Operationalize side of the equation, which is the dreaded Reporting and Metrics side of things.

In most cases, the system will be thrown over to the infrastructure team to manage, and since most of them aren’t coders, they are unlikely to be able to fix non-infrastructure-related issues if they don’t know anything about the code or the system. However, if it goes down, or performance starts to degrade, they are the first to be notified. The infrastructure team will need a set of guidelines and procedures for working any issues and knowing when to escalate to support or the developers. Having an understanding of the Reporting and Metrics will help guide them on what the issue may be and what the next steps are to resolve it.

Reporting

Measuring the system and understanding how it is running is another step in getting to a successful GenAI deployment. Some things people like to know:

  • Common reports
    • How many requests have been sent and when?
      • Helpful for finding peaks and valleys in your usage patterns
      • Can support auto scale of the system during times of high usage
    • How many active users are there?
      • As many conference presenters will tell you, this is typically referred to as your adoption rate.
    • Who is the most active user?
      • Are they using the system more than they should or are they using it the way it was meant to be used?
      • Your top users can potentially become your spokespeople for the rest of the users/company.
    • What are your users asking about the most?
      • Typically helpful for refinement and potentially caching.
    • Is anyone abusing the system through the sheer number of requests (denial of service, or costing you a lot of tokens)?
      • Sometimes you may need to limit the number of requests per user on some X timeframe.
  • Responsible AI
    • If you are using Content Safety, which requests/users are abusing the system?
    • If you are using Presidio, which requests/users are asking for or returning sensitive data?
  • Costs
    • Azure – You should be able to break down the costs across the various Azure infrastructure and create a baseline. Also, and this is MANDATORY, put budgets in place that fire every possible alert you can add.
  • Charge backs – If the system you have deployed supports multiple departments, you should be able to charge back the costs based on usage of the system. Utilizing some kind of “cost_center” tag across the resources can help with that. You will need to be able to export those reports on a monthly, automated basis.

Logging

It won’t be practical to log everything. The system should be able to specify the log levels (Info, Warn, Debug, Trace) that get sent to your logging solution. You should also be careful when turning on those lower levels to turn them back off when you are done!
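As a minimal sketch of what per-component log levels can look like, here is plain Python logging configuration. The component names are made up for illustration, and the same idea applies if you wire the handlers into OpenTelemetry or Application Insights instead of stdout.

```python
import logging

# One root handler; per-component loggers get their own levels so you can
# dial Debug up on a single noisy component instead of the whole system.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

LOG_LEVELS = {                      # hypothetical component names
    "orchestrator": logging.INFO,
    "agents.sales": logging.DEBUG,  # the one you are currently debugging
    "tools.sql": logging.INFO,
    "vectorization": logging.WARNING,
}

for component, level in LOG_LEVELS.items():
    logging.getLogger(component).setLevel(level)

logging.getLogger("agents.sales").debug("tool selection payload: %s", {"tools": ["sql", "search"]})
logging.getLogger("vectorization").info("this line is suppressed at WARNING level")
```

The key point is that the verbose level stays scoped to one component, and gets turned back off when you are done.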

  • Errors
    • Application Insights. Having your components log to application insights or some other logging solution is also mandatory. For example, FLLM uses OpenTelemetry which can pretty much send the logs anywhere.
    • Latency – Being able to measure the latency of your requests end to end, or anywhere in between, helps to find issues in your system. This can be especially important when the system is under high levels of load and stress.
      • Separate note: When using zero-trust deployments, you are likely going to be tempted to utilize Managed Identity for all cross-service calls. Although this is the right approach, MSI token creation NEVER uses the token cache and can cause some serious latency issues in your request flows. This may not be an issue with a simple “chat bot” style agent where you don’t care much about how long it takes to respond, but in systems with 10-second-or-lower latency targets, this can kill your design.
    • Common issues – things you should be able to easily diagnose:
      • Errors/Exceptions – These are going to occur; being able to find the error and diagnose its cause needs to be quick and efficient (even though the fix may be much more complex and time consuming to implement).
      • Problem agents/tools – You may have multiple agents as part of the overall system servicing the needs of many groups of individuals/departments. Being able to narrow down to the agent causing the most problems (requests/token burn) will be a vital skill. Additionally, if an agent utilizes more than just a simple LLM call and has various tools at its disposal, you should be able to narrow things down to the tool level and find problem tools that need attention.
      • Service-level issues – If you were not aware, ACA and AKS have limits on their request timeouts. ACA maxes out at 4 minutes, and no, you cannot increase it. AKS has more flexibility, but still has its limits for the sake of system stability. The system you build or use should never make synchronous calls to LLM models. Some of these models can have some pretty bad latencies (up to 2-3+ minutes). Add in your layers, and it’s a recipe for disaster. The platform you build/choose should utilize async patterns everywhere (see the sketch below). This allows your requests to pass through ACA/AKS without having to worry about timeouts, which otherwise produce nerve-racking 499s (which can also occur when your calling apps give up waiting for a response).
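One common way to apply that async pattern is submit-and-poll (HTTP 202): the client gets an operation ID back immediately and polls a cheap status endpoint while the long-running model/tool work happens in the background. Here is a minimal FastAPI-style sketch of the idea; the endpoints and in-memory store are illustrative assumptions, not how any particular platform implements it.

```python
import asyncio
import uuid
from fastapi import FastAPI

app = FastAPI()
operations: dict[str, dict] = {}   # in-memory for illustration; use a durable store for real

async def run_completion(op_id: str, prompt: str) -> None:
    # Stand-in for the real orchestrator/agent/tool chain, which may take minutes.
    await asyncio.sleep(120)
    operations[op_id] = {"status": "succeeded", "result": f"completion for: {prompt}"}

@app.post("/completions", status_code=202)
async def submit(prompt: str):
    op_id = str(uuid.uuid4())
    operations[op_id] = {"status": "running"}
    asyncio.create_task(run_completion(op_id, prompt))   # return immediately, work continues
    return {"operation_id": op_id}

@app.get("/completions/{op_id}")
async def poll(op_id: str):
    # The caller polls this cheap endpoint, so no single request ever sits
    # open long enough to trip the ACA/AKS ingress timeout.
    return operations.get(op_id, {"status": "not_found"})
```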

Metrics

The compute layers are responsible for handling the requests as they flow through the application layers and keeping track of request status (although easier said than done). Here are some basic things you should be watching at all times:

  • Models (Azure Open AI, etc)
    • Requests / minute
    • Tokens / minute
    • Latency
    • 429s
  • Compute layer (ACA + AKS + AML, etc)
    • Scale operations (adding nodes/pods)
    • CPU
    • Memory
    • Network
  • Data layer (Cosmos, Postgres, Azure SQL)
  • Vectorization Pipelines
    • Is your data/vector pipeline running as expected?
  • Vector Databases
    • Is your vector database running as originally intended from a performance standpoint?
    • Is the index you initially chose the right one for serving requests?

Costs

Log Analytics (or whatever cloud solution is used) can easily surpass the costs of your other services just by the sheer volume of data that could potentially get sent to it. By default, you get 30 days of log data before it starts to prune; some people may want to increase that to 90 days for compliance and reporting purposes. However, the more data you keep, the higher the costs.

AI Governance

It is easy to spend a few months on a GenAI project and suddenly realize that it has taken a course that is not in alignment with its original goals. Taking the data collected above and continually measuring it against the target baseline will help ensure the project stays on course toward its original purpose.

But if you find that the original premise was wrong, then a course correction may be warranted, and it is the role of your AI Governance Board / Center of Excellence to make the necessary decisions to adjust the direction based on new data and the priorities of the business/organization.

Having a meeting every quarter is a pretty good starting point. The primary executive sponsor, along with all relevant stakeholders and AI leads, should attend the meeting and review the data to discuss any concerns or areas that need improvement. With costs and ROI in mind, any course corrections should be agreed upon and documented, with clarity on everyone’s roles and responsibilities.

A target set of goals to achieve before the next meeting, along with a frequency and an agreed way to measure against those goals, should also be decided upon. This will help to show progress and whether the goals will be met in time.

Summary

Whatever solution you build, deploy or rent should have a wide variety of ways to get reporting and metrics out of the system. It should be one of the top things you consider when choosing a platform. Luckily, if your GenAI system is deployed to one of the major cloud providers, they tend to provide a lot of ways to set up and get that data for you.

The only caveat: the more reporting, logging and metrics you collect, the more you are going to pay for it.

Contact

Need help getting your GenAI project started and/or over the finish line? Ping me, always happy to help!

Email: givenscj@hotmail.com
Twitter: @givenscj
LinkedIn: http://linkedin.com/in/givenscj

GenAI Blog Series

GenAI Blog Series #5 – Implementing Security (oh so many levels)

Securing the system.

A lot of people just jump right in and start coding a Python notebook that makes some calls to a model endpoint. They will typically pass in some simple prompt, a little bit of data, and do a model request. They will likely get back some cool response and immediately get excited about what they just did. Next, they will run off to show their boss or team and say “hey, look how cool this is and what we can do”. The excitement builds and the company realizes the potential for the use case.

So management then asks, “How do we get this to production, and will it be secure?” Because, by the way, the data the data scientist passed in is actually corporate top secret.

Oh boy, so that little notebook now needs a whole platform built for it and it has to be ultra secure, so where do we start?

Whether you choose to build it or buy it, security should be the #1 thing on your mind. There are so many levels that security can apply to in the system. And it is not only the system itself; the users that access the GenAI system also have to be monitored and secured. If they get compromised, then the attacker has full access to any GenAI agents the user has access to! This is where the E5 licenses from Microsoft, and all the security they enable, actually start to make a lot of sense.

You are probably thinking, great, another thing to secure. And you’d be right, as the last thing you need is another attack vector that people can take advantage of to cause privacy breaches, legal exposures and lawsuits, or corporate blackmail events.

UIs (User)

First things first, you are going to need a UI or API. Let’s assume the application will need a chat UI. The UI will need to be secured with some kind of IdP (authentication), likely Microsoft Entra, Amazon IAM, Okta/Auth0, etc. Since most frameworks these days provide authentication provider integrations, this shouldn’t be a big task, but you will need to be able to pass that user credential to the rest of the layers, as they will need to know what the user can actually see (authorization).
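As a rough sketch of the authentication side, here is what validating an incoming bearer token can look like in Python with PyJWT. The tenant and audience values are placeholders, and real deployments should lean on the IdP’s official middleware where one exists; this just shows the shape of the check before claims flow to the lower layers.

```python
import jwt                      # PyJWT
from jwt import PyJWKClient

TENANT_ID = "<your-tenant-id>"           # placeholder
AUDIENCE = "api://<your-api-app-id>"     # placeholder; your API's app registration
JWKS_URL = f"https://login.microsoftonline.com/{TENANT_ID}/discovery/v2.0/keys"

def validate_bearer_token(token: str) -> dict:
    """Verify signature, audience and expiry; return the claims to pass downstream."""
    signing_key = PyJWKClient(JWKS_URL).get_signing_key_from_jwt(token)
    claims = jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=AUDIENCE,
    )
    # Downstream layers use these claims (oid, groups, roles) for authorization decisions.
    return claims
```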

Whatever system (remember Buy It/Build It?) you decide to use should be able to support external IdP integration; if it doesn’t, then I’d say it’s a no-go. After all, you may want to allow external partners to use the system and charge them for it! We’ll revisit this in the Return on Investment post later.

And lastly, if you are going to utilize this platform for all your employees, you will likely need to consider accessibility laws and ensure that whatever you are using meets the needs of your users.

UIs (Management)

How will you add new agents? Will you let end users create them? When you create an agent, what will the agent be able to use in terms of tools, models, features? Ultimately, this boils down to what can the “user” see in terms of the tools, models and features. How will this work?

APIs

If the system doesn’t need a UI, and all you need is to integrate it with your current applications via an API, then you can skip the User UI requirement and go for the Core and Orchestration layers. If you don’t need to keep track of things like chat history and token burn, then you can probably skip those layers and just go directly to the orchestration\workflow (similar to hosting your agent/tools/workflow in promptflow and Azure Machine Learning style). But you’ll still need to be able to authenticate and authorize the user/application.

Agents

Agents are at the top of the food chain. They provide the template for what will occur when a user/app makes a request. They define the type of workflow/orchestration, the tools, its prompt, etc.

Because an agent will have a configuration of tools which in turn have access to various data sources, it’s very important that you set the authorization at the agent level properly. One of the most common mistakes people make (especially with Microsoft Copilot) is indexing a bunch of data that is corporate top secret and then exposing that data via an agent without any permissions on it. That allows anyone who can log in to the UI/API to gain access to that data.

Having the ability to lock down your agents based on the tools and data they have access to is a vital feature for any GenAI platform.
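A minimal sketch of what agent-level authorization can look like: each agent declares the groups (or roles) allowed to use it, and the API layer checks the caller’s claims before the request ever reaches the orchestrator. The group names and agent definitions below are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    allowed_groups: set[str] = field(default_factory=set)   # empty set = nobody, not everybody

AGENTS = {
    "hr-policy": Agent("hr-policy", {"hr-team", "managers"}),       # hypothetical
    "finance-analytics": Agent("finance-analytics", {"finance"}),   # hypothetical
}

def authorize(agent_name: str, user_groups: set[str]) -> Agent:
    agent = AGENTS.get(agent_name)
    if agent is None:
        raise PermissionError(f"Unknown agent: {agent_name}")
    if not agent.allowed_groups & user_groups:
        # Deny by default: indexing secret data behind an unrestricted agent
        # is exactly the mistake described above.
        raise PermissionError(f"User is not allowed to use agent '{agent_name}'")
    return agent

# Example: the group claims come from the validated token in the earlier sketch.
agent = authorize("hr-policy", user_groups={"hr-team"})
```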

Tools

Most people build an agent, give a prompt, point it to a model…and that’s it. Simple. However, those of us that have been around a while know that’s not how it works anymore.

You need much more advanced agent types than just a simple knowledge management agent (an agent that points to a vector database).

People want AI solutions that *do things*.

  • They want analytical agents that take data from a database (using model generated dynamic SQL) and combine that with other data from some other source(s) and add that to a complex prompt that is then fed into the model to get some advanced completion.
  • They want agents that execute action/functions against external systems.
  • They want agents that can perform complex Plan/Replan workflows that run for hours/days.

Agents should be able to be built with plug and play workflows, plug and play tools that implement much more advanced patterns. And by the way, tools have tools.

Models

Agents should be able to plug and play the models they use. This allows for moving to the next model version, or to a completely new model, seamlessly, and hopefully improves your latency and accuracy as a result.

Additionally, creating multiple agents with different models allows for some great A/B testing.

As discussed in the “Train vs Mainstream models” post, there are many models out there that can be used for your agents and tools. Having the ability to add in any model from any external platform, and then have an agent or tool use that model, is pretty powerful.

Models can have various properties, such as temperature, top_p, top_k, etc. When creating a model, your UI should allow for dynamically adding model properties that the agent or tool knows how to pass to the model.

This is not an easy feature to implement, but something you should consider when going at it on your own.
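One simple way to approach it is to store the properties as an open-ended dictionary on the model definition and let the agent/tool forward them straight to the inference call. The property names below are the usual suspects, and the client call is shown generically (an OpenAI-style chat completions client is assumed), since the exact SDK varies by model hoster.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ModelDefinition:
    deployment_name: str
    endpoint: str
    # Open-ended so the UI can add new properties without code changes.
    properties: dict[str, Any] = field(default_factory=dict)

gpt4o_conservative = ModelDefinition(
    deployment_name="gpt-4o",
    endpoint="https://<your-endpoint>.openai.azure.com",   # placeholder
    properties={"temperature": 0.1, "top_p": 0.9, "max_tokens": 800},
)

def call_model(model: ModelDefinition, messages: list[dict], client) -> Any:
    # Whatever properties the agent/tool was configured with are forwarded as-is.
    return client.chat.completions.create(
        model=model.deployment_name,
        messages=messages,
        **model.properties,
    )
```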

If you fine-tune models, or use models built on your own data, it becomes even more important to secure access to the model, because it now has your corporate confidential information integrated into it. A compromise of an agent that uses the model can lead to corporate data leakage.

End Points

Models can be hosted in any number of places. The same model can be hosted in Azure, AWS and GCP. End points can be defined to specify where an agent or tool can make the call to a specific model. These endpoints should also be locked down. You don’t want someone adding a very expensive model to their agent and then exposing it to every person in your company and thusly getting a $100K bill that month.

Not only are there endpoints for your models, but if your GenAI system enables external application integration through API calls, those also need to be secured. This can be done through managed identities, API keys, certificates, etc.

Now that your applications are integrated, they have access to your model. Anyone that has access to the application, now has access to the model. What if the application is compromised and starts to send a high number of requests to your GenAI platform? How will you monitor and control this?

The layers are starting to add up! End to end zero trust security is looking even more complicated.

How are you passing the user credentials from the app to the platform to the model? What if multiple IdPs are involved (Entra to AWS to GCP, anyone)? This is not an easy task.

Chat Sessions

Users are typically the owners of their own chat sessions. Some platforms have started to toss around the idea of “sharing” chat sessions. This presents some interesting challenges, but is a very cool idea. Microsoft Security Copilot allows for this, but will put the thread into a read-only state after you have performed the operation.

Attachments

Users typically want to pass in attachments for the Agent and Tools to work on. This is very common for people that utilize the OpenAI Assistants API. Attachments must be stored and tracked and be available to be referenced as part of the chat history. In addition, any generated files from the model, must also be saved and tracked.

These files/attachments should not be visible to other users; therefore, security should be defined such that only the particular chat session’s user has access to the chat and its attachments.

This gets even more complicated when you decide you want to be able to “share” chat sessions with other people, as noted above.

Vectorization

Knowledge management agents are a common first step for most companies dipping their toes in the GenAI waters. These typically require you to vectorize some documents. Organizations have varying levels of requirements when exploring these paths. Some may have only a handful of documents; others may have several thousand documents or terabytes of data they want vectorized.

A data/vectorization pipeline is typically broken into four main steps (a minimal sketch in Python follows the list below):

  • Content Source
    • What content sources does the platform support? (SharePoint/M365, Datalake, Blob Storage, Snowflake, etc).
    • Does the system support pulling Access Control Lists (ACLs) from the source?
  • Text Partitioning
    • Once you download the data from the content source, you have to break it apart. There are many different ways to do this for different file types (remember iFilters?). PDF files are a common one. Chunking and overlap are common parameters that have to be experimented with.
  • Text Embedding
    • Once you have the chunks, you need to embed those chunks using some embedding model. The most popular option over the past two years has been text-embedding-ada-002, which produces 1,536-dimension vectors, which is pretty decent. However, if you look at where this model sits in the rankings of embedding models today, it’s somewhere around position #75.
    • At some point, you won’t want to utilize these older models anymore and you will want to migrate to a new model. This means a full re-vectorization of your content.
  • Indexing
    • So you have the embeddings, where are you going to put them?
    • Cosmos DB? Azure AI Search? Pinecone? PostgreSQL?
    • This step will save your embeddings to the target store.
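Here is a bare-bones sketch of those four steps in Python. The chunk size, overlap and embedding model are assumptions to illustrate the shape of the pipeline, and the indexing step is left generic since the target store (Azure AI Search, Cosmos DB, PostgreSQL, etc.) dictates the actual write API. Note that the source ACLs are carried along with each chunk, which matters for the next section.

```python
from openai import OpenAI   # or AzureOpenAI; shown generically

client = OpenAI()

def partition(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Step 2: naive fixed-size chunking with overlap (tune per file type)."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

def embed(chunks: list[str], model: str = "text-embedding-ada-002") -> list[list[float]]:
    """Step 3: embed each chunk (1,536-dimension vectors for ada-002)."""
    response = client.embeddings.create(model=model, input=chunks)
    return [item.embedding for item in response.data]

def index(chunks: list[str], vectors: list[list[float]], acl: list[str]) -> list[dict]:
    """Step 4: shape documents for your vector store, carrying the source ACLs along."""
    return [
        {"id": str(i), "content": c, "embedding": v, "allowed_groups": acl}
        for i, (c, v) in enumerate(zip(chunks, vectors))
    ]

# Step 1 (content source) is whatever pulled `raw_text` and its ACLs from SharePoint/blob/etc.
raw_text = "...document text pulled from the content source..."
chunks = partition(raw_text)
docs = index(chunks, embed(chunks), acl=["hr-team"])
```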

Content Sources

Although this was covered above in Vectorization, it is important to note that most platforms allow you to ingest/vectorize any data you want, but very few will allow you to bring in the ACLs. This presents an important security issue. Any knowledge management agent that is pointed at a vectorized data store that had its ACLs stripped off now falls back to the security on the agent itself. This was pointed out above, but it is worth noting again, as it presents a pretty big security hole if you go live without security in place.

System Data / Reporting

If you choose to store your chat sessions and agent completions, they will likely go into some kind of data store. Stakeholder users will want to gain access to basic reporting capabilities such as:

  • Requests per User
  • Tokens per User
  • Errors
  • Prompt and Completion token usage
  • Charge back (cost centers)
  • CPU, Memory and Network loads on the compute layer
  • Latency and Requests for your models

Power BI is typically asked for, but because of some of the limitations of the product, it doesn’t end up being a viable option without a lot of extra work.

So this typically falls back to creating some kind of customized reporting via Python notebooks. Most authentication should be done via Entra or IAM-based identities, not API keys (zero trust).

System Access (Azure resources)

In addition to the data plane, you need to consider the control plane. This is access to the compute and other resources that host the various layers. When issues start to pop up, someone will need to be able to log in to the containers/pods and look at the logs, as well as things like Application Insights and Log Analytics.

You also need to prepare for things like upgrades: who will be responsible for gaining access to the AKS cluster and storage resources to do those container and scripted upgrades?

Summary

Still want to build your own GenAI from scratch? Security is an important part of the design, but most people overlook what that really means. There is a lot to consider across all the layers that can cause some serious concerns if not implemented correctly.

It’s important to do a very detailed review of how users and applications access the system and how authentication and authorization flow through all the layers. Finding gaps where a malicious user/app can take advantage of weak security should be a top priority; either you fill those gaps, or you determine mitigation techniques.

The last thing you need is to have to deal with a corporate data leak, a compliance issue, or a negative hit to your organization’s reputation.

Contact

Need help getting your GenAI project started and/or over the finish line? Ping me, always happy to help!

Email: givenscj@hotmail.com
Twitter: @givenscj
LinkedIn: http://linkedin.com/in/givenscj

GenAI Blog Series

GenAI Blog Series #4 – Scaling your solution (and not break the bank)

Break the bank with AI.

So you picked your model, and now you want to go to production with it. Is it actually going to work, or fail miserably? Once it’s out in the wild, do you know how much it’s really going to cost?

Scaling (Requests)

One of the biggest factors in the scale and cost equation is the expected number of requests into the system every 10 seconds. Why 10 seconds? Well, in Microsoft/Azure land, that’s the frequency at which your rate limit is evaluated. Since that’s one of the mainstream models/platforms, we use it as our starting point.

Once we have an idea of the number of requests every 10 seconds, we can start to work our way up the scale and cost stack. But determining the requests per 10 seconds isn’t as straightforward as you might think. Let’s do an example:

You may be thinking, I’ll simply do some kind of monthly active users (MAU) calculation. Ok, let’s go with that. Let’s say you’ll have 10,000 active users per month. Next, you need to determine how many requests you think they will execute per user per month. If they are regular workers, they probably work 9-5 and have 40-hour work weeks. If the system becomes a regular part of their day (think call center), then you may have them doing 4-5 calls per hour. That’s enough to get started, so let’s do the math:

  • Variables
    • 10,000 MAU
    • 40 hrs / week
    • 20 work days / month
    • 5 requests / hour
  • Calculations:
    • 5 * 8 * 20 * 10,000 = 8,000,000 requests / month (5 requests/hour × 8 hours/day × 20 days × 10,000 users)
    • 5 * 10,000 = 50,000 requests / hour
    • 50,000 / 60 = 833 requests / minute

So you are probably thinking, ah, I’ll have 833 requests / minute. That’s great, and should be very cheap to run! Not so fast.

In most well-designed, more-than-just-a-chat-bot GenAI systems, a single user request does not map to a single model request. Huh? What typically happens is that a single request is sent to the orchestrator, where a model request may be made to determine how to route to the proper agent. Once at the proper agent, the agent may have a set of tools that make up its data and logic, so the agent will need to make a request to determine the best set of tools to execute. Once you have the set of tools, you then need to execute those tools, each of which could make 0 or more model requests. Again, let’s take an example:

  • GenAI system request received:
    • Model is called to determine the best agent to response (1 model request)
    • Agent is called; it has zero or more tools, but nonetheless at least one request is likely made to answer the user prompt based on zero or more tool outputs (1 model request).
    • If the agent has tools, each tool called, can have zero or more requests, for this example, let’s say that the agent has to call two tools, and each tool makes 2 model requests (2 * 2 = 4 model requests).

In the above example, a single user request has cost you 6 model requests. Your actual rate is now around 5K model requests per minute. And in reality, you are throttled at the 10-second mark, so you really can only do the original 833 model requests per 10 seconds, which works out to roughly 138 user requests every 10 seconds.

The above assumes all the requests happen within that 10-second window; in reality, a single user request may span well beyond 10 seconds, so the model requests may be spread out over a much longer time span. But hopefully you get the point: it’s the number of model requests you are making, not the number of user requests, that matters. This is a common miscalculation people make when trying to determine how to scale their instances.
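Here is the arithmetic from the example as a tiny script you can adapt to your own fanout; the numbers are the assumptions from above.

```python
# How model-request fanout shrinks your effective user throughput.
user_requests_per_min = 833          # from the MAU math above

router_calls = 1                     # pick the agent
agent_calls = 1                      # final answer from tool outputs
tool_calls = 2 * 2                   # 2 tools x 2 model requests each
fanout = router_calls + agent_calls + tool_calls   # 6 model requests per user request

model_requests_per_min = user_requests_per_min * fanout

rpm_quota = 5_000
quota_per_10s = rpm_quota / 6        # the limit is really enforced on ~10-second windows
effective_user_requests_per_10s = quota_per_10s / fanout

print(f"Model requests/min:    {model_requests_per_min:,}")              # ~5,000
print(f"User requests per 10s: {effective_user_requests_per_10s:.0f}")   # ~138
```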

With all that in hand, what happens if you go over the 138 user requests in 10 seconds? Remember, you have 10,000 users, so that seems like a very high-probability event. Once you hit your limit, you’ll get the dreaded 429s. Once this happens, you can pretty much kiss the system goodbye. Every request sent in will just compound the problem, you’ll keep flooding the throttle limit, and the system will become unusable. This can be especially bad if you have made it all the way to the last tool call and another request thread has caused you to hit a 429: that last call fails (yet all the previous ones succeeded) and you have to fail the request completely.

Now some of you are thinking, hey, do retries and exponential backoff, etc. Sure, that can help keep the system going in a degraded-state kind of mode, but users will start to see some pretty nasty latencies on their responses. This can be especially bad if you use the out-of-box Azure OpenAI SDKs: the HttpClient they build will respect the retry period sent from the service, which we have seen range from 30-90 seconds. That’s just crazy. In most cases, you will probably tell it to ignore that and fail quickly, or write your own HTTP client that does its own retry based on configuration.
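If you do roll your own HTTP client, a minimal sketch of a capped retry looks like this: exponential backoff with jitter, a hard cap on how long you are willing to honor Retry-After, and a fast failure once the budget is spent. The values are illustrative, not recommendations for any specific SDK.

```python
import random
import time
import requests

def post_with_backoff(url: str, payload: dict, headers: dict,
                      max_attempts: int = 4, max_retry_after: float = 5.0) -> requests.Response:
    """Retry on 429/5xx with capped exponential backoff instead of honoring a 30-90s Retry-After."""
    for attempt in range(max_attempts):
        response = requests.post(url, json=payload, headers=headers, timeout=30)
        if response.status_code not in (429, 500, 502, 503):
            return response

        # Respect the service hint only up to a cap we can tolerate; otherwise fail fast.
        retry_after = float(response.headers.get("Retry-After", 0))
        if retry_after > max_retry_after:
            break

        backoff = min(max_retry_after, (2 ** attempt) + random.random())
        time.sleep(max(backoff, retry_after))

    raise RuntimeError(f"Giving up after {attempt + 1} attempts, last status {response.status_code}")
```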

Scaling (Tokens)

Ok, so you think you’ve got this handled, right? Nope. As a game show host would say, “But wait, there’s more!”

Remember that requests-per-minute limit? Well, there is another limit you have to consider, and that is the tokens-per-minute limit. You might be thinking, hey, we will never go over the 138 concurrent user requests, so you allocate the 5K requests-per-minute quota and think you are good. Well, it turns out that the agent prompt router, agents and tools have a surprise for you. All of them need a prompt that tells the model what it is supposed to do.

The size of that prompt, and anything else that goes into it, matters. A LOT. Typically the agent router prompt will be relatively small, only a few hundred tokens, but once you get into the agents and tools, things start to get interesting. So far, in the systems I have built, the agent prompts tend to be pretty small, as they really only define the tools and potentially the personality of the agent. The tool prompts, however, are where things get really interesting; they can get super complicated.

Little side note: I’ll admit, I originally laughed when I saw the salaries of prompt engineers in Silicon Valley and some other places, but now that I have been engaged with multiple clients, I actually think they are probably getting paid what they should!

Ok, back to the tool prompts; these can get really large, ranging from 4,000-6,000 tokens on average in my experience. But that’s just the main tool prompt. The tool itself could be designed to query data from some external system such as Azure AI Search or a database of some type. Not only does the size of the prompt matter, but so does the size of the context data that is added to it. In some cases, we see tools that generate 25K-token prompts. That’s really on the high side for today’s models. But you’re probably thinking, they can go to 128K! Ha, not when you are running 138 concurrent user requests.

If your original 5K RPM quota only comes with 100K TPM quota, you are going to hit 429s pretty much right away, and the system will crash brilliantly. So, what do you do?

Determining tool prompt token and context sizes, along with the tool distribution (the number of requests that go to each agent/tool), will help you plan for what your token needs will be. Also, if you are returning 10 items from the Azure AI Search index when only 3 are needed, you should probably scale that down. The same goes if you only need the top 5 rows and not 100 rows returned.

The smaller you can make the prompts and context, the higher the throughput you are going to be able to achieve with your GenAI system.
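A quick way to sanity-check your quota request is to multiply the expected model calls by their typical prompt and completion sizes. The token figures below are rough assumptions in line with the averages mentioned above; plug in your own measurements.

```python
# Back-of-the-envelope tokens-per-minute budget.
model_requests_per_min = 5_000

avg_tokens = {
    "router": 300,       # small routing prompt
    "agent": 800,        # agent prompt + personality
    "tool": 5_000,       # tool prompt + retrieved context (can hit 25K on bad days)
    "completion": 500,   # output tokens count against the budget too
}

# Of the 6 model calls per user request: 1 router, 1 agent, 4 tool calls.
per_user_request = (avg_tokens["router"] + avg_tokens["agent"]
                    + 4 * avg_tokens["tool"] + 6 * avg_tokens["completion"])

user_requests_per_min = model_requests_per_min / 6
tpm_needed = user_requests_per_min * per_user_request
print(f"Tokens per user request: {per_user_request:,}")
print(f"TPM needed:              {tpm_needed:,.0f}")   # far beyond a 100K TPM quota
```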

Stress Testing

This is a must. Learning where the system will break in terms of concurrent users will teach you all kinds of things about your system design: from how many requests the core layers can handle (note that I did not talk about these layers above), to how many requests your model can handle, to how your tools will behave when the $%^& really hits the fan.

We will cover more about reporting and system metrics later on, but all the data you generate from stress testing is super important to keep handy for later.

Optimization (Caching)

Caching is a great thing. If your users are asking the same question over and over again, why burn 6 model requests and the corresponding tokens every time? That’s the equivalent of just throwing your money in the firepit.

Implementing some kind of caching layer (semantic?) in your core/orchestration layer can help immensely with reducing your request counts and improving the system performance. Users will be very impressed with the speed when a request comes back in 1 second versus 30 seconds.

Not only should you consider a cache layer at the front of the pipeline, but also in your agents and tools. If they are doing the same things over and over again, why waste the CPU/memory and token burn? Cache it!
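Here is a bare-bones exact-match cache sketch for the orchestration layer. A real semantic cache would key on embedding similarity rather than a normalized-string hash, but the flow is the same; all names are illustrative.

```python
import hashlib
import time
from typing import Callable

CACHE: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 15 * 60

def cache_key(agent: str, prompt: str) -> str:
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(f"{agent}:{normalized}".encode()).hexdigest()

def cached_completion(agent: str, prompt: str, run_pipeline: Callable[[str, str], str]) -> str:
    """Return a recent cached answer when available; otherwise run the full agent pipeline."""
    key = cache_key(agent, prompt)
    hit = CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]                      # zero model requests, zero token burn

    result = run_pipeline(agent, prompt)   # the expensive 6-model-request path
    CACHE[key] = (time.time(), result)
    return result
```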

Costs

You can find the current costs of the various models on the respective cloud provider websites.

Costs aren’t the only thing you should consider. Yeah, you might get higher RPM and TPM, but the completions are a complete disaster (comprehension, accuracy, etc).

Also note, a lot of what you see is noisy-neighbor pricing. It means you are all using the same GPUs, and during periods of high request volume to the cloud provider, your latencies could be all over the place (in the 2-minute range sometimes). If you want guaranteed latencies, you typically have to pay extra for that priority queuing.

Summary

You can’t just build a GenAI system, add an agent and say go. It just doesn’t work that way. You have to plan for the number of requests and users and how the system will handle overload situations. None of these are easy tasks.

Oh, did I mention AI is expensive? More so than you might think. Those pennies per 1K tokens add up very quickly! Having an idea of how much a single request will burn in terms of input and output tokens will give you a better idea of just how much your GenAI system is going to cost you.

Microsoft Copilot

I’ll give a bit of credit to Microsoft Copilot here. Everything mentioned above is abstracted away from you. You get an all-you-can-eat buffet of GenAI for $30/user/month. You can’t do much in terms of customizing the lower layers (picking your models, how vectorization works, etc.) and you are at the mercy of the backend (yeah, East US 2, I’m looking at you). But that’s a pretty stellar deal for not having to deal with the headache of everything I have discussed so far, not to mention what I’m about to discuss in the next few posts.

Contact

Need help getting your GenAI project started and/or over the finish line? Ping me, always happy to help!

Email: givenscj@hotmail.com
Twitter: @givenscj
LinkedIn: http://linkedin.com/in/givenscj

GenAI Blog Series

GenAI Blog Series #3 – Train vs Mainstream models

As you continue to mull over your decision to build it or buy it, and host it or SaaS it, at some point you will need to pick a model or models to use. Since not every use case is the same, it may be that the latest mainstream model won’t work for you. It could very well be the case that you have to build your own model. This typically occurs due to the highly sensitive or proprietary nature of the data and use case.

Main Stream Models

You probably get bombarded with blogs, tweets and other things talking about how cool this model is and how cool that model is. The reality is, over the past 2 years we have seen models come and go. New models come out that are cheaper, provide better answers, etc. The latest news about DeepSeek definitely sent some shockwaves through the AI community, and as you can imagine, people that have already deployed GenAI solutions are very much thinking: how fast can I get an agent up and running to test that model with my current setup? I know I have had the conversation already: “Can we switch to that model?”

You have to be careful with utilizing mainstream models. You could look at benchmarks all day, but the reality is…will it run at scale? Will it give you the performance and accuracy required AND not cause you to go broke? This isn’t always a fast and easy question to answer, and it will require a proof-of-concept setup and a series of performance tests.

For example, let’s take Azure AI’s version of GPT-4o. You can easily find the costs for this cloud-hosted model on the Azure website. Ultimately, it is up to you to request the size of the deployment you will need. This isn’t as easy as you think it is.

There are two main metrics: requests per minute and tokens per minute. This can be a bit misleading. For anyone that has built GenAI platforms at scale, you know that the real limit is actually at 10 seconds. If you are on pace to go over your per minute limit, you will get 429s. We will explore more of this in the next blog post!

Saving the scaling conversation for later, suffice to say, you have to find a model that fits your use case. It might be an OpenAI one, it might not. There are lots of open source models sitting out there that you can choose from that may actually get you where you need to be, without costing you a lot of money.

Train your Own

Mainstream models have been trained on a plethora of things. Most of the training parameters and knowledge built into the model probably won’t be very important, or even needed, for your specific use case(s). In that case, it’s probably overkill and going to be much too expensive to utilize. So what do you do?

First step: start browsing all the models on Hugging Face and check out the advanced Kaggle competitions. It’s highly likely you’ll find something that comes close to what you are looking for. However, if you don’t find anything you like, it’s time to go to the drawing board and start doing some diagrams and math equations!

Using some common tools like PyTorch, TensorFlow and Keras, you can absolutely rebuild the mainstream models into your own model. Granted, not all models publish how they were trained or provide their source code, but several do, so you have plenty of examples to work from. If this isn’t something you are comfortable with, go check out Kaggle and start with their beginner competitions. You’ll really enjoy them and they will give you a sense of how to get started.
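To give a feel for what “build your own” starts from, here is a deliberately tiny PyTorch classifier and training loop. It is a toy on random stand-in data, nothing like the scale of a foundation model, but the pieces (model definition, loss, optimizer, epochs) are the same ones you want to have debugged before renting that GPU cluster.

```python
import torch
from torch import nn

# Toy dataset: 1,000 samples, 32 features, 4 classes (random stand-in data).
X = torch.randn(1_000, 32)
y = torch.randint(0, 4, (1_000,))

model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Linear(64, 4),   # forget a layer like this and you find out weeks later on a real cluster
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```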

Now, the advantage of the mainstream models is that they have already been trained. Some have taken months to do so and needed data centers full of very expensive GPUs. Someone has already done that work and paid for that computing power. If you don’t have the budget to go buy, set up and operate a bunch of GPU clusters, then you are really stuck with two options: rent some GPUs from the big three cloud providers (Azure, AWS, GCP), or find a specialized GPU data center provider (yeah, I didn’t know they existed either until a few months ago) for hopefully cheaper rates than buying all of it yourself.

This is where things can get even more tricky. The model you send for training had better be accurate. The last thing you want is to spin up a 50/100-GPU cluster, send in your model and let it sit for the next few weeks running through thousands of epochs, only to find out you forgot to put a neural network layer somewhere. That can hurt the pocketbook.

Fine-tune models

Then there is the meet-in-the-middle approach. The model you are using is, say, 90-95% of the way there; it just needs a little bit of help to get to that 99-100% mark. This is where supplementing the knowledge of the model with something extra (things specific to the domain in which it will be used) can help.

For example, the term “goal” in GPT land typically means “a personal goal” aka something you are trying to achieve. However, in the domain of soccer/futbol, this means a totally different thing. Being able to supplement the core model with domain specific knowledge can help prevent it from getting confused and potentially hallucinating weird stuff to your users.

If you don’t have the budget to retrain a large billion-parameter model, but need the flexibility to make a mainstream model a bit smarter, then fine-tuning may be the way to go. But I’ll provide the disclaimer: your results will vary with fine-tuning, so don’t expect too much here.

Just like prompting, the more refined and straightforward your data is, the higher chance you will get the results you are looking for. So effort has to be put into selecting the most appropriate fine tuning data.
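As a small illustration of that data-selection effort, here is what preparing domain-specific fine-tuning examples can look like using the chat-style JSONL format accepted by OpenAI-compatible fine-tuning endpoints. The soccer examples echo the “goal” ambiguity above and are, of course, made up.

```python
import json

# Hand-curated, domain-specific examples: small, clean and unambiguous
# beats large and noisy for fine-tuning.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a soccer analytics assistant."},
        {"role": "user", "content": "How many goals did the striker score last season?"},
        {"role": "assistant", "content": "Here, 'goals' means goals scored in matches; the striker scored 18."},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a soccer analytics assistant."},
        {"role": "user", "content": "What does xG mean?"},
        {"role": "assistant", "content": "xG (expected goals) estimates the quality of a scoring chance."},
    ]},
]

with open("fine_tune_train.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```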

Summary

Picking a model is not a simple task. There are many factors that play into the decision of what model you will eventually use for the use case. Again, not every use case will use the same model, so be ready to support lots of models.

You should also be ready to abandon old models in lieu of newer models.

The platform you choose should ultimately allow you to do that. If it doesn’t, you will be at their mercy as to what they support, and your competitors may leapfrog you because their solution was much more agile in its ability to “use anything”. Luckily, FoundationaLLM allows you to plug and play at various levels and take advantage of whatever model you want. Given the modular design of its components, it’s pretty simple to copy an existing agent and tools, point them at your new model and get your testing going. This should be a requirement for any platform you are looking at.

Lastly, and hopefully this has hit you already, each of these paths is expensive. Wait, did I mention that GenAI is expensive? After we get all the GenAI basics out of the way, we’ll certainly explore ROI in a later blog post.

Contact

Need help getting your GenAI project started and/or over the finish line? Ping me, always happy to help!

Email: givenscj@hotmail.com
Twitter: @givenscj
LinkedIn: http://linkedin.com/in/givenscj

GenAI Blog Series

GenAI Blog Series #2 – Host it or get SaaS-y

Host It or SaaS It

In the previous posts we presented the option of building it or buying it for your GenAI platform. In this post, we will get a bit more specific about what it really takes to host your own solution and why choosing a SaaS-based product may be in your best interest.

GenAI solutions tend to need a lot of horsepower. The layers you build (UIs, APIs, pipelines, databases, orchestration, etc.) will very likely be packaged up as containers, and those containers will need to be deployed somewhere. In the case of FoundationaLLM, you can choose between two supported Azure-based deployment paths: Azure Container Apps (ACA) and Azure Kubernetes Service (AKS). That doesn’t mean you could not run it on Kubernetes in Google Cloud (GCP) or Amazon Web Services (AWS); as of today, we just have the Bicep templates that know all about Azure.

When these services get deployed, you have to know how many users and requests your agents will be receiving. This is important for several reasons.

  • You will need to scale up your container instances (nodes/pods) to meet the demand (which typically requires an increase in your vCPU quotas to support said demand).
  • You will need to make sure you don’t overload the backend model given the call pattern of your agent and tools. But this is more of a model issue than a hosting problem; we will explore it more in a later blog post.

Azure Container Apps (ACA)

ACA is an incredibly simple and easy way to get up and running quickly. If you are not an expert in YAML and Helm charts for AKS, this presents a nice way to get something deployed without requiring much management or experience to keep things running.

Azure Kubernetes Services (AKS)

ACA is a great product, but it does have some quirks to it that don’t quite make it a production-level choice for a GenAI deployment. Great for a development/QA/staging environment, but not something I’d go with for production.

So Kubernetes is really your best option. Luckily, Kubernetes runs everywhere (Azure, GCP, AWS). This means you’ll need to be comfortable with how to deploy the initial resources and how to ultimately secure them (zero trust). Greenfield environments work great, but the moment other things (your own DNS, hub network, peering, VPNs, paths, routes, TLS, etc.) come into play, aka brownfield, you will need to consider all kinds of things to get a solution up and running. If you don’t have the resources to do this, you’ll need to bring on someone to help plan and match things up so your deployment works flawlessly.

These folk(s) will need to have some serious skills. If you plan on managing this yourself at some point, you will need to train up and/or hire folks with the knowledge to do it.

Upgrading

How easy is it to upgrade the solution? Even if it’s container-based, it won’t be as simple as changing the container image in the ACA config or the AKS deployment. There will always be extra steps to move you from one version to another.

Getting SaaS-y

Hosting it sounds like a lot of work, right? Not to mention, it’s going to cost you in cloud spend right away; just firing up the basic system will put you right around $5K in compute per month. So if you are not comfortable with all the work it takes to host a GenAI system yourself, you are probably better off going down a hosted/SaaS-based path.

There are several options out there, but note that you will need to consider the following:

  • Security/Identity – Does the solution support external identity providers? How easy is it to integrate? Does it support more than just users (groups?)? How might the solution utilize models in other cloud platforms? Can it support cross-IdP auth to take advantage of various models? An API key is not the auth you are looking for…
  • Compliance – Has the platform gone through basic compliance checks like SOC 2 Type II or other more stringent process and data control verifications? Do they store your data? If so, where and how?
  • High-availability – What is the SLA on the system? What if it goes down?
  • Customization – Can it do what you need it to do, plus give you the flexibility to mold the system to your requirements?

Contact

Need help getting your GenAI project started and/or over the finish line? Ping me, always happy to help!

Email: givenscj@hotmail.com
Twitter: @givenscj
LinkedIn: http://linkedin.com/in/givenscj

GenAI Blog Series

GenAI Blog Series #1 – Build it or Buy it

CIO contemplating whether to build or buy.

It’s the first question that should come to mind. The decision you make here will shape everything you do after.

And no matter what decision you make, you should be aware that things are expensive in the AI world. Whether you Buy It/Rent It or Build It, it’s going to cost A LOT.

Option 1: Buy It/Rent It

There are many options out there to choose from. Each comes with some kind of advantage/disadvantage. Some of these include:

  • Simple knowledge management agents:
    • Microsoft Copilot / Copilot Studio ($30/user/month)
    • OpenAI – ChatGPT
  • SaaS offerings allowing varying levels of customization and UI friendliness; here are some options in no particular order:
    • Google Vertex AI
    • Azure AI Foundry / Prompt Flow
    • Amazon Bedrock
    • Stack-ai.com
    • Relevanceai.com
    • CrewAI
    • Praison.AI
    • Flowiseai.com
    • Abacus.AI
    • Humanloop
    • Klu.ai
    • Vishwa.ai
    • LangTail
    • Tune.app

As you can see from the list above, there are many companies striving to build platforms and SaaS-based applications so you don’t have to. However, there are some issues you should consider if you decide to explore these paths; here are just a few:

  • Features and functionality – If you were to sign up and look at each of the examples above, you would find a large set of differences between them and what they feel is important for their customer bases. Your ability to influence them to add features that are important to you may not exist, or those features could take a long time to get onto their roadmap.
  • Model support – Some platforms will be solely targeted at the models they are financially invested in or contractually set to use (hosted on AWS, GCP, Azure). Not to say that some platforms won’t allow you to use models in other places, but how difficult will it be to plug your fancy new model into the platform?
  • Tools – The orchestration layer is probably going to be built into the target platform. It could be LangChain, Semantic Kernel or some other home-grown custom one, with a high probability that you cannot replace it with your own. The ability to add an agent and its necessary tools is a very important aspect of the system. And given that tools can have tools, a management UI that allows you to configure these tools and sub-tools is a vital part of the solution. Without this, you will be relegated to whatever out-of-the-box agent and tools they give you.
  • Agent and Tools Extensibility – a topic so important, it is separate from the Tools conversation! Having a set of agent types you can select from is cool and all, but having only a static set of tools to select from is not. You should be able to add your own agents and tools, and agents should be able to utilize any tools you load into the system. The only issue you’ll run into is: how do you route to these tools?
  • Router Workflow – Simply calling an agent and getting a single response is pretty…simple. But what if you have multiple agents, or agents with multiple tools? How do you route to them? What if more than one agent or tool needs to respond? Router design becomes a critical part of the system (if you are going past a simple GenAI chat bot, anyway); see the routing sketch after this list. Can you do these types of flows with a Buy It system?
  • Support – Will it cost you extra? Do they respond quickly enough?
  • Source Code – Is it open source, or is it closed? Will the company stick around long enough to succeed in the market or will you need a code escrow?
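To illustrate the routing problem called out above, here is a deliberately minimal sketch. Real platforms typically route with an LLM or embedding similarity rather than keywords, and the agent/tool names here are placeholders:

from typing import Callable, Dict

def weather_tool(prompt: str) -> str:
    return "It is sunny."           # placeholder tool

def sales_agent(prompt: str) -> str:
    return "Q3 sales were up 4%."   # placeholder agent

ROUTES: Dict[str, Callable[[str], str]] = {
    "weather": weather_tool,
    "sales": sales_agent,
}

def route(prompt: str) -> str:
    matches = [fn for key, fn in ROUTES.items() if key in prompt.lower()]
    if not matches:
        return "No agent/tool matched; fall back to a default agent."
    # More than one match is where real router design starts:
    # priorities, an LLM-based router, fan-out and response merging, etc.
    return "\n".join(fn(prompt) for fn in matches)

print(route("What were sales last quarter, and what's the weather?"))

Even this toy version shows why a Buy It platform that hides the router from you can become a problem.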

Option 2: Build it

So you have explored the Buy It/Rent It options, and you have come to the conclusion that you need to build it on your own. Understand that the work you are about to undertake will be quite involved and a massive effort.

Resources will be needed.

You are going to need some skilled AI/ML folks (at least two), and some front-end folks (again, at least two). You will need middleware and backend folks (another couple). Gotta throw in the DevOps team, made up of another 2-3 individuals. Don’t forget project managers and some testers. So right off the start, you’ll need 12-14 people to Build It. Figure around $100K average per person, and you are sitting at $1.2-$1.4m/year just for the GenAI team. Some other roles that I didn’t include here, because you likely already have them: security, privacy and compliance team members. However, you’d need to allocate some of their time to review any solutions proposed.

This does NOT include the hosting costs, which we will get to in a later post.

The challenge is the same with pretty much every company. Budgets are stretched, you already have competing priorities and projects, plus legacy applications and processes to support. Teams tend to be overwhelmed with what they already have on their plates. Although most people will be happy to learn more about AI and how to use and incorporate it, they will need to find the time to do so. AI can help with that…Microsoft Copilot could be the solution that increases productivity and frees up the time to build custom AI solutions. But now we have AI eating AI, but hey…why not?

In all seriousness, this blog series will enlighten you to just how hard it is to get a GenAI solution to a successful end state. It is very likely you will need to bring in some talent that has done this a few times and knows how to keep you out of trouble. Taking classes, watching some videos and getting simple POC examples from GitHub up and running does not a GenAI expert make (Yoda, 2025).

This means hitting up the recruiting staff with some job descriptions and getting them out onto the job boards. I see GenAI jobs popping up everywhere as companies realize they need help and can’t do it on their own. You should be really specific about what you are looking for. Say what tech stack you are targeting (reference the next section), which cloud you use (Azure, AWS, GCP), and be specific that you are looking for GenAI versus machine learning. The two are very different things: a GenAI person is not always skilled at building neural networks to train multi-billion-parameter models on costly GPUs, nor is a machine learning expert necessarily skilled at building a GenAI platform.

Components to build.

Now, you won’t completely have to start from scratch; there are plenty of examples of various bits and pieces scattered across GitHub. The real decisions will center on the main components, what tech stack they will run on, and how you will ultimately host them. For example:

  • Front end (User and Management Portals) – .NET, Vue, Node, React, etc. Ultimately these will turn into containers and be hosted somewhere, so as long as your front-end folks are comfortable with the tech, you’ll probably be good here. The only nuance: will you need to support accessibility? This is not as easy as one would think and requires quite a bit of tedious work, but it is required if you are in a country/region/industry that mandates it.
  • API/Core/Orchestration – What will you write this in? Java? .NET? Python? How will you support it?
  • Gateway layer – build your own API Management layer to offset some costs?
  • Gatekeeper layer – How will you do prompt attack prevention and sensitive information filtering? (See the sketch after this list.)
  • Database – CosmosDB, PostgreSQL
  • Vector Database – Azure AI Search, CosmosDB, Pinecone, etc.
  • Agentic Frameworks – Semantic Kernel or LangChain? Will you support only one, or many? Will you allow bring-your-own customization? How will that plug into the system?
  • Security – At what layers will you build the security? What objects and scopes will be secured? Can you build it in a way that scales?
  • Monitoring – Keeping track of the usage of the system, including its performance in terms of both latency and accuracy, will be a continual battle. You will need queries and dashboards to help show how the system is running.
  • Failover/Recovery – If your data center or hosting provider goes down, how business critical is it to get the system back up and running? Can you do a full regional failover (in Azure, say from East US 2 to Sweden Central)?

Example

Chris O’Brien has a really nicely written blog piece that fits into this topic of Buy It or Rent It. It does a good job of pointing out the pros and cons of the two possible paths, as well as focusing on the costs involved. The use case is pretty specific, but it still reiterates the points made here.

It’s interesting to note that many of these GenAI SaaS providers (Buy It/Rent It) are still in price discovery mode. They probably didn’t have a good grip on the backend costs of their solution, and as they started to scale up, they realized they may have priced too low. Conversely, those that have found efficiencies, or realized the value they provide is becoming more of a commodity, are starting to bring their prices down to match new market competitors. This alone makes the argument that you should always be evaluating what your options are and what “exit strategy” you have for getting off one provider and moving to another.

Summary

It may not be obvious which path is the best one to start with. If you go it alone, it’s going to take some experimentation, mistake-making and lessons learned. I’d encourage you to reach out to folks in the community (like myself or other AI MVPs) or at AI conferences and learn from those that have done this before. There really is no reason to reinvent the wheel or make the same mistakes others have.

As you can see from the above, neither choice presents an obvious or easy path to satisfy your management’s or stockholders’ push to use AI in the business. What is obvious is that GenAI presents opportunities for innovation and business solutions that were not remotely possible before. So it’s not really a matter of “should we” do it, but “when and how”.

Contact

If you are looking for some help in making your decision, or have questions about anything I mention in this blog series, feel free to reach out anytime, and definitely check out FoundationaLLM!

Email: givenscj@hotmail.com
Twitter: @givenscj
LinkedIn: http://linkedin.com/in/givenscj

GenAI Blog Series

GenAI Blog Series: What does it really take to build your company’s GenAI from scratch?

It’s been a while since my last post! But I can guarantee you, this next series of blogs will enlighten you on some pretty deep and interesting topics that I have gained insight into over the past few years!

GenAI is a hot topic

The last couple of years have been a very interesting journey into the world of GenAI through my colleagues at Solliance and the new startup we call FoundationaLLM. This was a “build it on your own”, from-scratch project, which is open sourced here. This project has been 2+ years in the making, and still has a long way to go to solve some of the **REALLY** hard problems.

Where do you start?

One of the biggest challenges for customers today when determining how to integrate GenAI (and AI in general) is: where do we start? GenAI is being talked about everywhere. It has the power and potential to transform organizations in many ways. It can be used to generate revenue, or to cut costs and increase productivity. And, as a colleague once mentioned to me, it can automate tasks, but then where do RPA (Power Automate, etc.) and GenAI really differ? Without a firm grip on what you want to accomplish, you should not be in a hurry to allocate resources to an AI project. No one wants to start a project only to see it fail, wasting precious time and money!

So let’s at least get you what you need to understand the journey you are about to undertake!

Questions to Ask

Typically the first question would be: what model(s) are we going to use? Which then begs the question: which cloud or hosting provider (Azure, AWS, GCP, co-located GPU data centers) are we going to use? But there are so many more questions that you don’t know you don’t know.

Here’s a quick list of common CEO/CIO/CTO questions that will make up this very interesting and insightful blog series.

  • Build it or Buy it
    • Build it
      • How long will it really take to build something from scratch, or even leveraging various frameworks?
      • Do you really have the skills/team to get it past the finish line? (We have been working non-stop on this for 2+ years).
    • Buy it
      • What are you going to buy? Will it be hosted in your subscription or as SaaS?
      • From who?
      • How much will it cost?
      • How will you manage it? Do you have the skills to manage it (ACA or AKS)?
      • Can you integrate your own applications to it? Are the calls secure and scalable?
      • Is it extensible (add your own agents and models)?
      • Is it flexible (will it support future models)?
  • Host it yourself or leverage SaaS?
    • Host it ourselves:
      • If we host it, where will we host it?
      • How will we host it?
      • How will we scale it?
      • How will we secure it?
      • What models will we support?
      • Do we have the skills to write it and maintain it?
      • Do our own monitoring, or hire someone else to do it?
    • Leverage SaaS
      • Will we get locked in? Is the source code available?
      • How customizable is it?
      • How secure is it?
      • How flexible is it, will you have ability to request features/roadmap items?
  • Use our own models, or utilize mainstream models?
    • What tools/frameworks might we use to train our own models?
      • PyTorch? TensorFlow? Keras?
    • Where will we run those model training tools/frameworks?
      • Azure Machine Learning?
      • Amazon SageMaker / Bedrock?
    • How long will it take and how much will it cost us?
      • If you want to buy a bunch of GPUs and host them to train and run your models, then you must have a nice budget.
      • Most people won’t have this and will need to utilize GPUs hosted by someone else. Also, not a cheap endeavor.
      • What if models are retired? (Best example, embedding models)
        • How easy is it to move to a new model?
    • Where will we run the model once we are done?
  • What data will you use?
    • Where is it?
    • What is it?
    • How much do you have?
    • Is it curated or random?
  • How much will we need to scale the solution?
    • Most models have token size limits
    • Most hosted solutions will limit your ability to max out the GPUs backing them
    • How will you scale your solution to maximize throughput, yet not destroy the systems when the dreaded 429s start to take all your nodes/pods/threads down? (See the retry sketch after this list.)
  • How will I do reporting?
    • How much do we keep? (Chat history, messages, token burn, etc)
    • What compliance issues will I need to address?
  • Security
    • At what layers will we need security?
      • Agent, Datasource, Items, Models, Endpoints, etc
    • Does the solution need to span multiple IdPs? (Azure->AWS)
  • Will the ROI match what the eventual production solution will present?
    • AI is expensive. Whether you are hosting it yourself or using out-of-the-box solutions like Microsoft Copilot, will you actually attain ROI?
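As a concrete example of surviving the 429s mentioned in the scaling question above, here is a hedged sketch of a retry-with-backoff wrapper around a model endpoint call. The endpoint, payload and retry policy are illustrative only:

import random
import time
import requests

def call_model(url: str, payload: dict, max_retries: int = 5) -> dict:
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload, timeout=60)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Honor Retry-After if the service sends it; otherwise back off exponentially with jitter
        # so a burst of throttling doesn't take every node/pod/thread down with it.
        wait = float(resp.headers.get("Retry-After", 2 ** attempt)) + random.random()
        time.sleep(wait)
    raise RuntimeError("Model endpoint still throttling after retries")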

Proof of Concepts

POCs are easy to set up and can be quite compelling. But don’t let that shiny object/carrot being dangled in front of you distract from the work it will take to move to production. As you can see from the questions above and the challenges below, it’s a complex path to navigate to a final state of a successful GenAI deployment.

Challenges To Come

As a CxO tasked with bringing GenAI into your organization, you can expect at least a few of the following challenges. Be prepared with answers for how you will overcome these when (not if) you hit them:

  • AI is expensive, be ready to allocate budget to it.
  • Just because you build it/buy it, doesn’t mean you will achieve 100% adoption.
  • If you do achieve 100% adoption, you will probably run into various scaling issues if the platform isn’t well designed.
  • You probably don’t have the staff skilled up enough (development, infrastructure) to make it happen. Be ready to hire/outsource.
  • Security and data integration problems.

Summary

As part of this GenAI blog series, we are going to explore each of the above questions and their various sub-layers in incredibly painful depth (put your seatbelts on, keep all arms and legs in the car at all times), with examples from the various pull requests and commits in our repo illustrating the problems and issues you WILL eventually face.

It is possible to achieve GenAI nirvana; you just need to have the right expectations and be educated and prepared using the lessons learned from folks like myself and others. The opportunities are there and achieving them is possible, but it is going to take some dedication and drive.

Contact

Email: givenscj@hotmail.com
Twitter: @givenscj
LinkedIn: http://linkedin.com/in/givenscj

GenAI Blog Series

Splunk to Sentinel Migration – Part VI – Users and Permissions

The last few blog posts have been really exciting and enlightening, with a clear path for migration. Unfortunately, we have now hit the part where things get a bit bleak in the migration path.

Users and permissions are very rich and granular in Splunk. The concept of namespaces is innovative and very useful when you have to carve out sub-administration tasks and data targeting.

There is no concept of namespaces in Azure Sentinel, and no widespread ability to “hide” things with permissions in Sentinel. You can see some of the limited actions/roles that are available in the article Permissions in Azure Sentinel. That being said, there are some things that can be done, but they do not map one to one to the security features in Splunk.

What might we want to carve out permission-wise? Here’s a simple list:

  1. Queries
  2. Lookups
  3. Alerts / Incidents
  4. Dashboards
  5. Tables / Schemas

Let’s take a look at each of these.

Query Permissions

Queries are stored in various ways in Azure Sentinel. Since Log Analytics is the backing store for Azure Sentinel, most of them will be saved into Log Analytics. They can be stored directly in Log Analytics, or carry various extra properties that mark them as special for display in Sentinel. Others could be stored as sub-object attributes.

Basically, there is no concept of pointers to a query like in Splunk; it’s a copy-by-value type of thing. Queries can be copied/used across Analytics rules and Hunting queries. They can also be saved in Log Analytics for re-use, but only at the Log Analytics level. Queries saved as functions can be called at higher levels such as Sentinel.

Hunting queries and Analytics rules can’t be targeted at a user or group level. If you have access to Sentinel, you have access to all the hunting queries. If you have queries that are targeted at a user/namespace in Splunk, they will become global in a single instance of Azure Sentinel. And since Azure Sentinel can’t be pointed at multiple Log Analytics workspaces, you can’t really define things at an Azure Sentinel level that will span across all Log Analytics workspaces.

At a Log Analytics level, you can save queries to the workspace. Again, there is no user / group targeting here. If you have access to the Log Analytics workspace, you have access to all saved queries.

Lookup Permissions

As you read in a previous blog in this series, there are several ways to create lookups. Some of these can contain sensitive data, so having a way to lock them down is going to be important. Remember, the three types of lookups are:

  1. Log Analytics Table: As you will see later, log analytics tables can be targeted using custom roles.
  2. Watchlists: Global for the Sentinel instance, everyone can see them.
  3. Azure Storage: Using Azure Storage means you would have to know the file location and consequently have access to the file (via SAS). The problem is, it is easy to look across all queries and find any references to these external files.

The only real option you have for lockdown for lookups is via custom Log Analytics Table imports.

Alerts / Incidents

If you have access to the Azure Sentinel instance, you will see all incidents. There is currently no way to filter or hide incidents in Sentinel based on tagging or any other means. A SOC analyst will be able to see all alerts and incidents.

Dashboards

Since dashboards are an Azure resource that shows up in the resource group, you can set IAM on them such that individuals won’t even be able to see them when they navigate to the Azure Portal. This simply prevents them from seeing a pre-created dashboard. If they were to figure out a query that showed data they shouldn’t see and build their own dashboard (provided they have Contributor access to the Azure resource group), they could gain access to data they should probably not be looking at.

Tables / Schemas

Tables and data are stored in Log Analytics. As such, if you want to hide tables, you will need to utilize Azure custom roles with predefined table sets. You can reference how to do this in the article Manage access to log data and workspaces in Azure Monitor. Doing this manually could be a months-long process just by itself.

The way to make this work is to enumerate the namespaces in Splunk and then create custom roles for each namespace and the tables it has access to. This will only allow you to limit access to the underlying tables, not any other elements in Azure Sentinel. However, by its nature it would prevent a person from being able to execute Hunting queries or view dashboards that reference those tables, since they don’t have access to them.
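A rough sketch of that enumeration follows. The namespace-to-table mapping, scope and role names are placeholders, and the action strings follow the table-level access pattern described in the Azure Monitor article above; verify them against current documentation before assigning anything:

import json

# Hypothetical mapping exported from Splunk: namespace -> tables its users may read.
namespace_tables = {
    "soc_network": ["CommonSecurityLog", "DnsEvents"],
    "soc_identity": ["SigninLogs", "AuditLogs"],
}

def build_role(namespace: str, tables: list, workspace_scope: str) -> dict:
    return {
        "Name": f"LogReader-{namespace}",
        "IsCustom": True,
        "Description": f"Read-only access to tables for Splunk namespace '{namespace}'",
        "Actions": ["Microsoft.OperationalInsights/workspaces/read",
                    "Microsoft.OperationalInsights/workspaces/query/read"]
                   + [f"Microsoft.OperationalInsights/workspaces/query/{t}/read" for t in tables],
        "NotActions": [],
        "AssignableScopes": [workspace_scope],
    }

scope = "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<workspace>"
for ns, tables in namespace_tables.items():
    print(json.dumps(build_role(ns, tables, scope), indent=2))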

Users

Users can be exported from the Splunk instance. If they have their email property set up and they exist in the target Azure AD (provisioning could be automated in the case they do not), then they can be added with the basic permissions of Owner, Contributor or Reader in Log Analytics or Sentinel. Anything customized will need to be done via custom roles after the user has been created.

If you do go down the path of custom roles based on tables (which are likely based on namespaces), you will need a mapping file from the Splunk namespace to the role name in Sentinel.

Resolving the Permission problem.

As you can see, there really is no high-level concept of permissions in Azure Sentinel and Log Analytics. The lowest level at which you can set permissions is the Log Analytics workspace, via IAM/custom roles for tables, and then you hope everything works right in the higher layers.

Logging Queries

If you choose to migrate without considering any permissions during the migration, it is probably going to be a good idea to log and monitor any queries that are being sent to the Log Analytics workspace (directly or via Sentinel). If there are very sensitive tables in the workspace and folks who should not be running queries against them are doing so, you should be alerted to that (from another workspace/Sentinel?).

So, using Sentinel to monitor Sentinel becomes a thing. Isn’t recursion great?
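For illustration, here is a hedged sketch of pulling recent query-audit records from a workspace with Python. It assumes query auditing (the LAQueryLogs table) is enabled, the azure-monitor-query and azure-identity packages are installed, and the watched table name is a placeholder; the LAQueryLogs columns should be verified against your environment:

from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Who has been querying a sensitive custom table in the last day?
kql = """
LAQueryLogs
| where QueryText has "SensitiveCustomTable_CL"
| project TimeGenerated, AADEmail, QueryText
| order by TimeGenerated desc
"""

response = client.query_workspace(workspace_id="<monitoring-workspace-guid>",
                                  query=kql,
                                  timespan=timedelta(days=1))
for table in response.tables:
    for row in table.rows:
        print(row)

Something along these lines could feed an alert, so Sentinel really does end up watching Sentinel.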

Summary

Net net…Azure Sentinel is missing some pretty substantial permission-based features that you enjoy in other SIEMs. It is up to you to determine whether the functionality you are gaining outweighs the short-term drawbacks of losing these important security features.

Splunk to Sentinel Blog Series

  1. Part I – SPL to KQL, Exporting Objects
  2. Part II – Alerts and Alert Actions
  3. Part III – Lookups, Source Types and Indexes
  4. Part IV – Searches
  5. Part V – Dashboards and Reports
  6. Part VI – Users and Permissions
  7. Part VII – Apps

References:

  1. My Twitter
  2. My LinkedIn
  3. My Email