GenAI Tech Talk
What is better for my business: introducing generative AI as a SaaS (software-as-a-service) solution, or hosting an LLM (large language model) yourself or in the cloud? And what kind of work and costs does this entail? These are the kinds of questions CDOs and CIOs are asking today. Especially in the regulated sector, the current trend is towards open-source-based on-prem LLMs. But here, too, there are still plenty of uncertainties. We have collected the ten most frequently asked questions and put them to our GenAI experts Dr. Lars Flöer and Michael Tannenbaum. One thing worth mentioning up front: there are on-prem LLM solutions that are secure as well as performant and affordable.
1. As-a-service or on-prem?
Lars, Michael, you and your colleagues at Comma Soft are working on an LLM solution based on open-source components. But you also incorporate as-a-service LLMs into enterprises. Which is the better option?
Lars: It really all depends on what your requirements are. A lot of companies would like to use a tool like ChatGPT and therefore think that they need to rely on OpenAI. But for many European companies, turning to a US provider is out of the question. For regulatory and compliance reasons, companies operating in regulated industries such as banking, insurance or healthcare, as well as government agencies and pharmaceutical & life science companies, are in many cases prevented from deploying a solution that does not give them complete control over their data.
Michael: Yes, that’s exactly right. In such cases, an as-a-service LLM can never deliver 100 percent security. So, using this example, it would be better to host an LLM in a cloud in your own environment or, better still, on-premises. There are currently also start-ups in Germany that offer hosted as-a-service solutions. That being said, the risk I see with these vendors is that they don’t really stand much of a chance against either the speed of the big US players or the open-source community. So, for most companies, the most efficient and future-proof solution would be to build on open-source LLMs. Additionally, the best open-source models are now up to par with the quality of ChatGPT – at least in English. This is why we also rely on open source for our LLM, currently the Llama 2 model.
Comma Soft’s open source-based GenAI technology
With Comma Soft’s LLM, which can be hosted in Germany and is currently based on Llama 2, corporations and larger medium-sized companies now have a technology that complies with data protection and regulatory requirements, and with which they can leverage the power of GenAI for their unique business purposes.
2. What are the benefits of open source?
How does an open-source-based LLM like yours differ from approaches taken by German start-ups, for example?
Michael: Many of the significant innovations in LLMs over the last ten months have been fueled by the open-source community. The collective knowledge, the quality and the speed at which new models and methods are published are extremely high. This now represents a competitive challenge, even for OpenAI. It is also important to remember that in the open-source arena, it’s not just individual developers who are actively involved, but also large corporations such as Meta, which in turn contribute vast amounts of data and resources. Additionally, with open source, further development is continuously driven forward by the community, whereas with individual providers, there is always the risk that updates will come about more slowly or perhaps even not at all at some point. Ultimately, start-ups in particular are dependent on investors and cannot finance development themselves in the same way that owner-managed companies can.
Our LLM from Comma Soft is currently based on Llama 2 from Meta AI. Nevertheless, in our architecture we are careful not to tie ourselves unnecessarily to a particular base model. Once the open-source community has developed a new, better base model, we can swap out the current model for it and retrain the LLM on that basis. For most models this is a purely technical effort and is uncomplicated to implement. In the open-source sector in particular, a great deal of emphasis is also placed on efficiency. Large LLMs can therefore be operated on comparatively inexpensive hardware. This offers the promise of cost benefits in the long run when it comes to self-hosting.
3. What does the on-prem architecture look like?
Suppose a company wants to host an LLM on-premises. What are the architectural requirements for this?
Lars: Self-hosting on-premises is, all in all, feasible with an acceptable amount of effort. Let’s start with the backend. The key issue here is scalability and performance – and, of course, the question of the costs involved. This is a topic that can cause headaches for many companies. Our solution is optimized to enable on-premises usage (inference) that is as fast as users are accustomed to with ChatGPT. A GPU server equipped with one or two A100 graphics cards can suffice for this. If over time there are more concurrent users as the company expands or the solution is rolled out to more areas, optimized inference procedures can be deployed, and the solution can be scaled horizontally. The quantization of the models also allows more throughput to be achieved on the same hardware. But since the hardware required is highly dependent on the individual use case and planned usage, we advise companies on the optimal hardware configuration for their unique circumstances prior to LLM implementation.
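The effect of quantization on hardware requirements can be illustrated with a back-of-envelope memory estimate. The figures below are illustrative assumptions (a flat 20 % overhead for activations and KV cache), not a sizing recommendation:

```python
def vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate for inference: model weights plus ~20%
    overhead for activations and KV cache (illustrative assumption)."""
    weight_bytes = params_billion * 1e9 * bits / 8
    return round(weight_bytes * overhead / 1e9, 1)

# A 70B model at 16-bit precision needs multiple A100-class cards;
# 4-bit quantization brings it within reach of a single 80 GB GPU.
print(vram_gb(70, 16))  # ≈ 168.0 GB
print(vram_gb(70, 4))   # ≈ 42.0 GB
```

This is why quantization translates directly into more throughput on the same hardware: the smaller the weights, the more concurrent requests fit on one card.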
The next important aspect is the interfaces. Our LLM has APIs that can be used to connect enterprise databases and other systems. On-premises operation even allows data and systems to be integrated that are not readily available with other SaaS solutions or may not be available for intellectual property or regulatory reasons. The connection of internal databases or APIs to the LLM is an important added value for knowledge management within the company.
And finally, we have the frontend. One example of a use case here is the classic chat window you might be familiar with from ChatGPT. But this is only one of many possibilities. With our solution, we can also quickly develop our own application-specific frontends, e.g., for use in a customer center context as a support system for administrative employees. In many cases, though, there is no need for a separate frontend at all, for example, when large volumes of text are to be processed automatically or when integrating with SharePoint, the intranet or a Microsoft Teams bot – an LLM can support all of this as well. With our own LLM, we have made provisions for a variety of cases and can work with our customers to create more.
So, all in all, there are viable technical solutions for companies to self-host an LLM. And for those who would prefer to rely on the cloud instead, our LLM solution enables hosting with a German or European cloud provider.
Dr. Lars Flöer
Lars is responsible for the AI division at Comma Soft. As a Lead Consultant Artificial Intelligence, he focuses on developing, delivering, and operationalizing machine learning solutions (MLOps) based on cutting-edge technologies and algorithms.
4. How is data protection compliance ensured?
How can an LLM on-premises solution be used to concretely control what happens to the data so that data protection requirements are complied with in a secure and verifiable way?
Michael: Unlike as-a-service LLMs or cloud hosting, an enterprise with a self-hosted LLM has complete control over what happens to the data and who has access to it. This includes the data used for training, but also logging and monitoring. Logging occurs on the company’s hardware – and can be protected accordingly. Moreover, the company itself determines which data is stored in the logs and has it directly at hand in the event of an audit. The control options available are therefore far superior to those offered by a third-party provider.
Lars: Things also get interesting when it comes to internal rights and roles. Just because I securely isolate my LLM from external access, it doesn’t necessarily mean that I want all employees to be able to retrieve any information they choose from it. Not all employees are granted permission to access personnel data or confidential data from software or product development, for example. Still, departments like HR and R&D often need to. So, we make sure that the data is protected between departments as well. We resolve this with our LLM in such a way that it does not access the training data directly itself, but instead retrieves it via a specialized routine that maps the roles defined, for example, in SharePoint or in Active Directory. This means that it is not possible to access protected information using a prompt.
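The principle behind such a retrieval routine can be sketched in a few lines: each document carries an allowed-roles set, and filtering happens before anything reaches the model. All names here are illustrative, not Comma Soft’s actual API:

```python
# Minimal sketch of role-aware retrieval: the filter runs *before*
# the LLM sees anything, so no prompt can reach protected documents.
from dataclasses import dataclass, field


@dataclass
class Document:
    text: str
    allowed_roles: set = field(default_factory=set)


DOCS = [
    Document("Q3 travel policy", {"all"}),
    Document("Salary bands 2024", {"hr"}),
    Document("Product roadmap draft", {"rnd"}),
]


def retrieve(query: str, user_roles: set) -> list[str]:
    """Return only documents the user's roles (e.g. synced from
    Active Directory or SharePoint) permit."""
    visible = user_roles | {"all"}
    return [d.text for d in DOCS if d.allowed_roles & visible]


print(retrieve("policy", {"hr"}))   # HR sees general + HR documents
print(retrieve("policy", set()))    # everyone else sees only general ones
```

The key design choice is that access control lives in the retrieval layer, not in the model: the LLM can never leak what it was never given.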
Michael Tannenbaum
Michael leads Comma Soft’s product development in the area of GenAI, especially in the area of LLM. Among other things, he develops solutions for companies that are based on open source components and meet both European data protection regulations and business requirements in terms of performance and scalability.
5. How can LLMs be used successfully in Europe?
Besides the security issue, many companies from Germany and other European countries see a challenge in the fact that the existing LLMs are not localized for their languages. Can this be solved with an open-source base model?
Michael: Language is indeed a sticking point for many solutions. An LLM that is used in Germany for use cases with a chat function must, of course, also be able to communicate in German in order to make the work of the employees using it genuinely easier. And even when extracting information where no chat interface is needed, the LLM must “understand” the information it is reading out.
Lars: Most open-source LLMs are not proficient enough in dealing with German or other non-English texts. However, because these LLMs are pre-trained to process abstract concepts and apply them to new information, they do not have to start from scratch with German either. We have solved this in our LLM and have already trained Llama 2 with German data. We are currently in the process of implementing other languages.
For companies, context is also important, of course. The LLM must be able to navigate various use cases, such as legal, regulatory, or industry-specific issues. We work together with companies and development partners on this. The aim is to develop industry-specific LLMs that can be deployed as out-of-the-box solutions and that bring the relevant knowledge in with them. They can also be further customized, e.g., to the use cases of different customer groups or departments. This all occurs on the basis of the same LLM, where only the differences are retrained. A few thousand examples are often enough for this purpose.
GenAI for businesses
The question many of our customers ask themselves is how they can use ChatGPT and other GenAI solutions in their company in a way that does not violate data protection, compliance or regulatory parameters and that keeps the effort involved to a level that is manageable. There are some formats & solutions that have proven themselves in practice.
6. Which data for fine-tuning?
What kind of data – and how much – do organizations need to fine-tune their on-premises LLM?
Lars: That depends on which cases are to be mapped. Fine-tuning is used, for example, to adjust the form of the generated text that is rendered in a chat frontend. This could be specific expressions, technical terms or formulations that should be taken into account. For this kind of fine-tuning, data is needed that already has the desired form. If I want my LLM to pre-generate email responses, I will need sample emails that reflect the desired form for the fine-tuning.
If I wish to include current information in my generated text, such as the latest terms and conditions or the list of contact persons in a given department, I can obtain this information from SharePoint via connectors and pass it to the LLM as context when making a request. This process is also known as retrieval-augmented generation (RAG). The data is not used for training but is retrieved ad hoc during the inference process either from vector databases or from other source systems. In principle, all company data can be used for this purpose. Any system can be integrated via APIs or special connectors that we set up.
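The RAG flow just described can be sketched very compactly: score the stored snippets against the query, keep the best matches, and prepend them to the prompt. A real system would use an embedding model and a vector database; simple word overlap stands in for semantic similarity here, and the snippets are invented for illustration:

```python
# Minimal RAG sketch: retrieve relevant snippets at query time and
# pass them to the model as context, instead of training on them.
SNIPPETS = [
    "Terms and conditions updated March 2024.",
    "Support contact: IT service desk, extension 42.",
    "Cafeteria menu changes weekly.",
]


def score(query: str, text: str) -> int:
    """Crude relevance score: number of shared words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def build_prompt(query: str, k: int = 2) -> str:
    """Pick the k best-matching snippets and build the final prompt."""
    top = sorted(SNIPPETS, key=lambda s: score(query, s), reverse=True)[:k]
    context = "\n".join(top)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


print(build_prompt("who is the support contact"))
```

Because the context is fetched fresh on every request, updating the source system (e.g. SharePoint) is enough to keep the answers current – no retraining involved.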
In most cases, the fine-tuning process does not require as much data as is often assumed. In fact, the opposite is true. One great thing about LLMs is that they already have a very broad general understanding of many concepts. Therefore, very few well-validated examples are often sufficient – keyword “few-shot learning”. We can therefore eliminate the concern here that LLM use could fail due to insufficient data. What counts here is the quality.
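In its simplest form, few-shot learning means putting a handful of validated examples directly into the prompt, so the model picks up the desired tone and format without any retraining. The examples below are invented for illustration:

```python
# Few-shot prompting sketch: two validated question/answer pairs steer
# the model's style for the new, unseen question.
EXAMPLES = [
    ("Where is my invoice?",
     "Dear customer, thank you for reaching out. Your invoice is attached."),
    ("How do I reset my password?",
     "Dear customer, thank you for reaching out. To reset your password, ..."),
]


def few_shot_prompt(new_question: str) -> str:
    """Prepend the example pairs, then leave the answer open for the model."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in EXAMPLES)
    return f"{shots}\n\nQ: {new_question}\nA:"


print(few_shot_prompt("Can I change my delivery address?"))
```

When a consistent style across thousands of requests is needed, the same example set can be promoted from the prompt into an actual fine-tuning dataset – which is why a few thousand validated examples often suffice there as well.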
7. How does the LLM stay up to date?
“An LLM needs the entire world to learn in order to stay current,” it is often said. So, do you have to constantly retrain it?
Michael: When we use open source, we can first assume that our LLM already pretty much knows “the whole world”. And retraining is of course possible at any time. But the more interesting question is whether this is even necessary at all. In a business context, the need does not arise that often. This is because employees enter new information into the systems and databases anyway, which the LLM then accesses via an interface – this is the RAG procedure we just described. Retraining only becomes relevant when the LLM is to be used for new cases, for example when a German company expands to Spain and wants to process information in that language. Apart from that, the business logic usually remains the same. The more important question is data literacy, i.e., whether all employees are aware of the importance of data quality in their applications, maintain the data correctly, and keep it up to date. Often, however, the existing data quality is better than companies themselves assume. Of course, this is something we verify together prior to an LLM implementation, and we help with validation if necessary.
8. Does a connection to the internet also work?
And what if a company wishes to connect its LLM to the internet after all? Is that possible?
Michael: Technically speaking, of course, this is possible. An LLM can learn to independently retrieve information through search engines. But in the process, it must not leak any internal information to the outside world. If I allow an LLM to formulate search queries against an external search engine, I cannot prevent with absolute certainty that this LLM-generated search query may contain internal information. The risk can be minimized by implementing appropriate regulations in the search routines. In this case, only “naïve”, non-specific questions are allowed, so that, if possible, no results or aggregations of internal information leak out. All the same, a residual risk remains. So, if you want to be on the safe side, it is better to avoid an LLM Internet connection.
Alongside this, there is also a scenario where it should be possible to link to the internet in order to allow customers to use the LLM directly, e.g., via chatbots. But these are usually difficult to control. At present, we do not yet see any successful implementations that do not come with a “beta version” disclaimer. The more realistic use case is LLMs that only solve very specific tasks for the user and do not provide a free chat prompt.
ChatGPT & Co under the microscope
Which new possibilities do LLMs like those used in ChatGPT, Microsoft Copilot and other applications open up? How is GenAI changing our (working) everyday life? Our colleagues will shed some light on this and other aspects in the following article.
9. How high is the cost factor?
As-a-Service LLMs now offer enterprise packages. Comparatively, what are the costs to enterprises of relying on a secure, self-hosted open source LLM?
Lars: As-a-service providers charge per token. Depending on the GPT version and the context size, this may only be a few cents. But if you imagine hundreds or even thousands of users in the company creating longer queries every day, the costs quickly add up. Added to this are high-volume processes such as the extraction of data from emails. And that is only at today’s prices – we can reasonably assume that they will continue to rise. When operating on-premises, these costs do not arise. As a company, you only make an initial investment in the hardware. The one-time acquisition costs depend on the server and the graphics cards; the decisive factor is what level of accuracy and performance is required. In many cases, a good standard configuration is currently a server with one or two A100 GPUs. If needed, this setup can be scaled horizontally to almost any desired level.
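How quickly per-token fees add up can be shown with a back-of-envelope calculation. All figures below are illustrative assumptions, not vendor quotes or Comma Soft pricing:

```python
# Back-of-envelope: yearly as-a-service cost at a flat per-token price.
def yearly_api_cost(users: int, tokens_per_user_per_day: int,
                    price_per_1k_tokens: float, workdays: int = 250) -> float:
    """Total yearly spend for per-token billing (illustrative model)."""
    return users * tokens_per_user_per_day / 1000 * price_per_1k_tokens * workdays


# 1,000 users, ~20k tokens each per working day, at $0.01 per 1k tokens:
print(yearly_api_cost(1000, 20_000, 0.01))  # 50000.0 per year
```

Against a recurring figure of that order, a one-off investment in an A100-class server can amortize within a few years – which is the core of the on-premises cost argument, before electricity, maintenance and licensing are factored in.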
If the models are to be retrained without the training data leaving the company network, another server equipped for this purpose is required. During operation, the usual rules for server redundancy then apply: in many cases you may want to run at least one fallback server, and a minimum of three machines in the cluster is considered best practice for high availability. These considerations can significantly impact costs but are not specific to hosting LLMs.
On top of this come licensing costs as well as the ongoing costs associated with electricity, IT and maintenance and, if necessary, fine-tuning. Depending on the infrastructure, number of users and LLM requirements, companies should consider the actual costs on a case-by-case basis. In this area, we have empirical values with which we can assist in the selection of hardware and the estimation of costs.
10. How long does the implementation take?
How much time do companies need to budget if they want to implement an LLM on-premises?
Michael: Our LLM is ready to go and only needs to be uploaded to the server. This means that the amount of initial time required is extremely low. Our architecture means that the technical implementation can be completed in two to three days, especially if you start out as a company with a streamlined variant – which also makes sense for gathering empirical values and then optimizing the LLM step by step to meet your own requirements. Our architecture includes everything for the most common scenarios.
If other user interfaces or systems are to be connected right from the start, this can be achieved in about two weeks, depending of course on the respective use case. What therefore needs to be done in advance is to clarify the intended use cases, the role concepts to be taken into account, and the company-specific features. We review this together with our customers in advance in an ideation workshop so that there are no delays later on during implementation and operation.
Do you have any other questions about hosting and fine-tuning LLMs? Please feel free to contact Dr. Lars Flöer and Michael Tannenbaum: You can contact us here.