Deploying artifacts should be simple and repeatable. Standing up development, QA, staging, and production environments that mirror each other should be a seamless process that includes quick creation and end-to-end (E2E) testing. This can be accomplished through well-designed CI/CD processes and is much easier when things are greenfield, but we all know that’s not always the case (for example, brownfield environments where firewalls, DNS, and hub networks already exist and must be integrated with).
Code
If you went down the Build It path, then you are going to have a couple of years’ worth of code built up (or about to be). In today’s world, everything is driven by containerized solutions. As the various layers are built, you’ll need to ensure that Dockerfiles are created for each of your service layers and that those Dockerfiles contain all the necessary scripting to properly build your images. NuGet packages, PyPI packages, all that great stuff has to be considered as part of the build, deploy, and execution processes.
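To make that concrete, here is a minimal sketch of a build script that loops over your service layers and builds an image for each one. The service names, paths, and registry below are hypothetical placeholders, and each Dockerfile is assumed to handle its own NuGet/PyPI restore.

```python
# build_images.py - minimal sketch: build one image per service layer.
# Service names, paths, and the registry are hypothetical placeholders;
# adjust them to match your own repo layout.
import subprocess

REGISTRY = "myregistry.azurecr.io"   # assumption: an Azure Container Registry
SERVICES = {
    # service name -> path containing its Dockerfile
    "api": "./src/api",
    "orchestrator": "./src/orchestrator",
    "frontend": "./src/frontend",
}

def build(service: str, context: str, tag: str) -> None:
    """Build one service image; the Dockerfile handles its own package restore."""
    image = f"{REGISTRY}/{service}:{tag}"
    subprocess.run(
        ["docker", "build", "-t", image, "-f", f"{context}/Dockerfile", context],
        check=True,
    )
    print(f"built {image}")

if __name__ == "__main__":
    for name, path in SERVICES.items():
        build(name, path, tag="dev")
```

In practice you would wire this (or the equivalent docker build commands) into your CI pipeline rather than running it by hand.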
Hopefully, you are using some kind of enterprise-level git repository such as GitHub or GitLab. In the case of GitHub, the shift-left mentality really shows its strengths by ensuring that any dependencies you add to your projects are vulnerability free. You get a lot of this for free for public projects, but unfortunately, you have to pay for it for private projects.
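If you do pay for private-repo scanning, you can also pull the results programmatically. Below is a hedged sketch that lists open Dependabot alerts through the GitHub REST API; the owner/repo values are placeholders, and the token needs permission to read the repo's security alerts.

```python
# check_alerts.py - list open Dependabot alerts for a repo via the GitHub REST API.
# OWNER/REPO are placeholders; GITHUB_TOKEN must be allowed to read security alerts
# (and, for private repos, the paid Advanced Security features must be enabled).
import os
import requests

OWNER, REPO = "my-org", "my-genai-app"   # hypothetical repo
token = os.environ["GITHUB_TOKEN"]

resp = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/dependabot/alerts",
    headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    },
    params={"state": "open"},
)
resp.raise_for_status()

for alert in resp.json():
    advisory = alert["security_advisory"]
    package = alert["dependency"]["package"]["name"]
    print(f'{advisory["severity"]:>8}  {package}: {advisory["summary"]}')
```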
CI/CD Workflows
So your code compiles and your local tests pass; now you want to commit those changes to your repo. You can do direct commits, or you can do merge/pull requests. The latter is the better approach, but believe me when I say I understand: under pressure and tight timelines, you may not always have this luxury, especially when your one and only code owner is on vacation.
The setup of your repository is very important. Multiple developers should be able to work on various features and then merge those changes back into your various branches (main, feature, release, etc.).
When a release branch is committed to, you may want it to trigger a full build and deployment to your container registry. This would allow you to modify your dev/test deployments to point to the new container release tags and do your testing.
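As a sketch of what that release step could look like (the real thing would live in your GitHub Actions or GitLab CI pipeline), the script below derives a release tag from the current branch and commit, then tags and pushes the images built earlier. Registry and service names are again placeholders.

```python
# push_release.py - sketch of the "release branch -> registry" step. In practice
# this runs inside your CI pipeline after the build step; registry and image
# names are hypothetical.
import subprocess

REGISTRY = "myregistry.azurecr.io"
SERVICES = ["api", "orchestrator", "frontend"]

def sh(*cmd: str) -> str:
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()

branch = sh("git", "rev-parse", "--abbrev-ref", "HEAD")   # e.g. release/1.4
commit = sh("git", "rev-parse", "--short", "HEAD")        # e.g. a1b2c3d
release_tag = f'{branch.split("/")[-1]}-{commit}'         # e.g. 1.4-a1b2c3d

# assumes you have already authenticated, e.g. `az acr login --name myregistry`
for service in SERVICES:
    local = f"{REGISTRY}/{service}:dev"
    release = f"{REGISTRY}/{service}:{release_tag}"
    subprocess.run(["docker", "tag", local, release], check=True)
    subprocess.run(["docker", "push", release], check=True)
    print(f"pushed {release} - point your dev/test deployment at this tag")
```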
Infrastructure (Quotas, quotas, and more quotas)
GenAI application deployments require a lot of quota, and all kinds of different types of quota:
- Compute
  - Virtual Machines, Scale Sets, and other workloads used to host your containers (see the quota-check sketch after this list)
- AI Models
  - Newer models will require approval. During approval and development, you may only get a small amount of quota, which can make scale/stress/load testing difficult.
- Function/App Service Plans
  - If you need some kind of logic app, function app, or other type of app service, you may need to request the ability to create it in the target region. East US/East US 2 can get tricky.
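Before a deployment, it helps to check how close you already are to those limits. Here is a small sketch that shells out to the Azure CLI’s `az vm list-usage` command and flags compute quotas in a region that are above a threshold; other resource providers expose similar usage commands, so the same pattern can be extended.

```python
# quota_check.py - flag compute quotas nearing their limit in a region before you
# deploy. Requires the Azure CLI to be installed and logged in (`az login`).
import json
import subprocess

REGION = "eastus2"          # the region you intend to deploy into
THRESHOLD = 0.8             # warn when usage exceeds 80% of the limit

raw = subprocess.run(
    ["az", "vm", "list-usage", "--location", REGION, "--output", "json"],
    check=True, capture_output=True, text=True,
).stdout

for item in json.loads(raw):
    current, limit = int(item["currentValue"]), int(item["limit"])
    if limit > 0 and current / limit >= THRESHOLD:
        print(f'{item["name"]["localizedValue"]}: {current}/{limit} in {REGION}')
```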
Typically, customers would like to have all of their resources in a single region, maybe due to regulations or compliance, or just because. GenAI platforms require a myriad of features that must all exist in the target region. If they don’t, then your best-laid plans for that Bicep/Terraform go right out the window.
We have had a myriad of issues trying to find regions that support all the required features (yeah, they are all ring zero) and allow for deployment without having to jump through quota requests and other escalation mechanisms (CSA/GBB).
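One way to sanity-check a candidate region before writing any Bicep/Terraform is to ask the resource providers themselves which locations they support. The sketch below intersects the advertised locations for a few example resource types; note that this only covers resource-type availability, not model availability or quota, which are finer-grained.

```python
# region_check.py - find regions that expose every resource type your template
# needs, before committing your Bicep/Terraform to a single region. The
# provider/resource-type pairs below are examples; extend them as required.
import json
import subprocess

REQUIRED = [
    ("Microsoft.CognitiveServices", "accounts"),   # Azure OpenAI / AI services
    ("Microsoft.Search", "searchServices"),        # Azure AI Search
    ("Microsoft.DocumentDB", "databaseAccounts"),  # Cosmos DB
]

def locations(namespace: str, resource_type: str) -> set[str]:
    raw = subprocess.run(
        ["az", "provider", "show", "--namespace", namespace, "--output", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for rt in json.loads(raw)["resourceTypes"]:
        if rt["resourceType"] == resource_type:
            return set(rt["locations"])
    return set()

candidates = None
for namespace, resource_type in REQUIRED:
    regions = locations(namespace, resource_type)
    candidates = regions if candidates is None else candidates & regions

print("Regions supporting every required resource type:")
for region in sorted(candidates or []):
    print(f"  {region}")
```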
Change is Inevitable
In addition to getting your environments up and running, you have to expect that change will always be part of a GenAI application’s evolution. Platform components are updated, prompts are updated, new models are rolled out, etc. Not all of these things will be driven by Infrastructure as Code (IaC). Some of these changes will be manual, while some will be script driven.
Ensuring that you have a development environment where these changes can be tested is vital, as you really don’t want to be deploying untested/unverified changes straight to production. You know better!
Backup
Any business-critical system should support some kind of backup or versioning mechanism. This is important for many reasons, the primary one being the ability to get back up and running in case of system failure. However, users and admins can also be a source of frustration when they change an agent and all hell breaks loose.
Things that should be backed up include:
- System configurations (App Config)
- Databases (Cosmos, Azure SQL)
- Vector Database (Azure AI Search; see the index snapshot sketch after this list)
- Agent configurations
- Prompts
- Data Pipeline configurations
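As one concrete example, the sketch below snapshots Azure AI Search index definitions (the schemas, not the documents) to timestamped JSON files using the service’s REST API. The service name and key environment variable are placeholders; a similar export approach works for App Configuration and the other items above.

```python
# backup_search_indexes.py - snapshot Azure AI Search index definitions (schemas)
# to timestamped JSON files. This captures the index definitions, not the
# documents - plan a separate re-indexing or export path for the data itself.
import json
import os
from datetime import datetime, timezone
from pathlib import Path

import requests

SERVICE = "my-search-service"                    # hypothetical service name
ENDPOINT = f"https://{SERVICE}.search.windows.net"
API_VERSION = "2023-11-01"
HEADERS = {"api-key": os.environ["SEARCH_ADMIN_KEY"]}

stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
outdir = Path(f"backups/search/{stamp}")
outdir.mkdir(parents=True, exist_ok=True)

resp = requests.get(f"{ENDPOINT}/indexes",
                    params={"api-version": API_VERSION}, headers=HEADERS)
resp.raise_for_status()

for index in resp.json()["value"]:
    path = outdir / f'{index["name"]}.json'
    path.write_text(json.dumps(index, indent=2))
    print(f"saved {path}")
```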
Disaster Recovery
You just never know when a region or zone might go down and you need to get your GenAI platform up and running somewhere else (if you are lucky enough to find a region that supports all the GenAI platform features and are one of the first to fail over to said region). And of course, this assumes you decided to build it and host it on your own.
You would need to read the terms and conditions and service level agreements of your hosting provider to find out what SLAs they offer in case their stuff goes down. And as part of your vetting process, you should always find out what kind of BCDR they have behind the scenes.
Summary
Change is inevitable. Being able to adapt to those changes is important, and having well-defined processes and procedures in place to automate deployments and upgrades is mandatory. By doing this, you will get to your desired state more quickly and efficiently, and you will have a more reliable system as a result.
Contact
Need help getting your GenAI project started and/or over the finish line? Ping me, always happy to help!
Email: givenscj@hotmail.com
Twitter: @givenscj
LinkedIn: http://linkedin.com/in/givenscj
GenAI Blog Series
- #1 – Build it or Buy it/RentIt
- #2 – Host it or get SaaS-y
- #3 – Train vs Mainstream models
- #4 – Scaling your solution (and not break the bank)
- #5 – Implementing Security (oh so many levels)
- #6 – Reporting, Logging and Metrics
- #7 – MLOps/GenAIOps, some kind of *Ops
- #8 – Measuring Return on Investment (ROI)