Strengthening the Security of LLM Models

Addressing Security Concerns for Large-Scale LLM Deployment

Large Language Models have become increasingly popular and are now being deployed in various production environments. From cars to phones and from airlines to banking customer support, everyone seems to have a plan to utilize LLMs. While Information Security is generally a mature field, the rise of LLMs has introduced new challenges. Let's explore what this means.

Understanding LLM Threats

Before we look at the security implications, we need to recognize what can go wrong. The likely threats tell you which security practices you need to follow.

Security of your LLM itself

A large language model is typically a very large file with lots of numbers. Generating such a model involves training on expensive hardware, using data that only your organization might own. Engineers and scientists may have spent countless hours and millions of dollars on creating that file.

Once this model is created, it needs to be kept safe from hackers, your own employees, and other types of leaks. Unlike various other enterprise data, a model is typically just one file. While modern enterprises have mastered the art of hiding critical data behind all sorts of access lists and such, securing a single file is a much harder problem, especially when testing it requires that your employees have direct access to that file.

If an LLM leaks, something that you created with millions of your dollars is now obtained by someone for free. Other than the legal issues, it is also a pure loss for your business.

However, this is perhaps one of the simplest threats to address, as it still falls under the traditional file security model. Maybe the model can only be stored on a secure file system that can only be accessed by machines and never humans.
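As a minimal sketch of what "traditional file security" could look like in practice, the check below verifies that a model file is readable only by its owning service account and computes a checksum you could compare against a recorded value to detect tampering. The function names are hypothetical, and the permission check assumes a POSIX-style file system.

```python
import hashlib
import os
import stat


def is_owner_only(path: str) -> bool:
    """Return True if no group or other permission bits are set on the file."""
    mode = os.stat(path).st_mode
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so a multi-gigabyte model does not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A deployment script could refuse to load the model when `is_owner_only` returns False or when the checksum differs from the one recorded at training time.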

Security of your private data with third parties

An LLM is trained on data. Often, organizations own the data but lack the expertise needed for training. Such organizations might share this data with a third party to train an LLM on their private data. You may have heard of the controversy over whether Dropbox was sharing its users' data with OpenAI for AI training purposes.

Now that many businesses are training LLMs, we will see smaller enterprises collaborating with specialized organizations to train them. This will inevitably create new potential threats: if the external party leaks the data, it could compromise sensitive information belonging to those smaller enterprises.

What an organization needs to do in this situation is to apply traditional info-sec protocols when sharing data and ensure that the data is promptly deleted after the training is complete. There are also innovative solutions possible, where the training data may be converted into a trainable input that cannot be reverse-engineered to reveal the original data. Only this transformed data should be shared with the target organization.
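One simple form of such a transformation is pseudonymization: replacing direct identifiers with keyed hashes before the data leaves your organization. The sketch below is an illustration only. The field names and the `pseudonymize` helper are hypothetical, and replacing identifiers alone does not guarantee that records cannot be re-identified from the remaining fields.

```python
import hashlib
import hmac

# Hypothetical list of fields that identify a person directly.
SENSITIVE_FIELDS = {"name", "email", "address"}


def pseudonymize(record: dict, secret_key: bytes) -> dict:
    """Replace sensitive values with keyed hashes; leave other fields as-is."""
    out = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            mac = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            # The same input and key always give the same token, so joins
            # across records still work without exposing the raw value.
            out[field] = mac.hexdigest()[:16]
        else:
            out[field] = value
    return out
```

The secret key stays with the data owner. Without it, the third party cannot recompute tokens from guessed names or emails.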

LLM itself leaking private data

LLMs are very large, and we do not fully understand why they behave in a certain way in certain cases. Unlike conventional programs, whose correctness can be argued by reading the code or writing unit tests, there is no formal way to be 100% sure how a model will react to input it has not seen before.

LLMs might be trained on the private data of your users or customers. This knowledge helps the model get smarter, but the private data ends up encoded inside the model in some way.

Imagine you are a shopping website that sells products. You design a chatbot that is trained on all the sales information of your website. The chatbot is intended to assist users in finding the right and trending products.

Suppose one of your customers asks the chatbot which products are popular in a particular zip code, and eventually on a particular street. This might allow the user to figure out who at a particular address is buying what.

This is just a simple scenario, but you need strong guardrails to protect against such behaviour. As time goes by, attackers are going to get even smarter at prompt engineering to extract as much valuable information as they can.
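One possible guardrail for the zip code scenario is to refuse any aggregate that covers too few distinct customers. The sketch below assumes the chatbot answers popularity questions through a lookup function rather than from memory. The `popular_products` function, the tuple layout, and the threshold of 50 are all hypothetical.

```python
from collections import Counter

# Hypothetical threshold: smaller groups are treated as identifying.
MIN_GROUP_SIZE = 50


def popular_products(orders, zip_code, min_group_size=MIN_GROUP_SIZE, top_n=3):
    """Return the top products for a zip code, or None if the group is too small.

    orders: iterable of (customer_id, zip_code, product) tuples.
    """
    matching = [(c, p) for c, z, p in orders if z == zip_code]
    distinct_customers = {c for c, _ in matching}
    if len(distinct_customers) < min_group_size:
        return None  # Refuse: the answer could point at individual buyers.
    counts = Counter(p for _, p in matching)
    return [product for product, _ in counts.most_common(top_n)]
```

A street-level question would match only a handful of customers and be refused, while a city-wide question would still be answered.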

This field is still not well understood and we are still in uncharted waters here.

Rogue LLMs or Sleeper LLMs

A sleeper agent is someone who appears to be a normal agent, behaving as expected. However, once the sleeper agent receives a specific command, it activates the rogue part of itself and starts behaving like a villain. We have seen many cold war-themed movies covering this.

Now, imagine your company hired another company to create an LLM for your customer service chatbot. The other company had ties with a foreign government. You brought the model into your organization, deployed it, and it worked perfectly well. However, unknown to you, the model was trained to activate a special feature on a future date and then start sending your highest value customers to your competitor's website.

One might ask how this threat is any different from infected computer programs and rogue software. The thing is, with computer programs and software, you can always scan the source code for threats, and when things go bad, you can analyze why exactly they went bad. A model offers no such transparency: there is no source code to read, and a hidden trigger cannot be found simply by inspecting the numbers in the file.

While these sorts of threats are currently theoretical only, we have seen some research on the topic. One way to detect such threats is to deploy two separate models sourced from two separate companies. If both companies are competitive enough, their results would likely be similar. However, in production, you can constantly track their behavior against each other, and when it deviates significantly, you can raise an alarm.

Attacking your LLM via training data

Once malicious users determine that you are using a certain type of data to train your LLMs, and if this data is generated through actions that the malicious users can influence, they can work on corrupting your training data to the point where your model learns that bad behavior.

For example, your malicious users might chat with your bot and insist that they will be satisfied if the bot uses some harsh language against them. The bot might then correlate this with customer satisfaction and might start using similar language for your innocent customers.
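One way to limit this kind of influence is to cap how many training examples any single user can contribute, so a few determined users cannot dominate what the model learns. This is a minimal sketch, not a complete defense. The `cap_per_user` function and the limit of 5 are hypothetical.

```python
from collections import defaultdict


def cap_per_user(examples, max_per_user=5):
    """Keep at most max_per_user examples per user, in arrival order.

    examples: iterable of (user_id, text) tuples.
    """
    kept = []
    seen = defaultdict(int)
    for user_id, text in examples:
        if seen[user_id] < max_per_user:
            kept.append((user_id, text))
            seen[user_id] += 1
    return kept
```

Attackers can still spread their effort across many accounts, so a cap like this works best alongside content filtering of the examples themselves.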

Social harms

LLMs can promote stereotypes, fake news, made-up statistics, and biased opinions when they are not really relevant. If LLMs are not culturally sensitive in the environment where they are deployed, they can cause harm. For example, an image recognition bot might misinterpret Arabic letters if it does not know that the script is written from right to left.

Certain things that are totally appropriate in some cultures might be completely illegal in others. Bots that assist tourists might end up putting the tourists at risk by offering them advice meant for other countries if they are not sensitive to the context.

Security in the era of LLMs

Having explained the threats involved in LLMs, we must recognize that this field is still in its infancy. We will learn more about the threats and mitigation strategies as LLMs see widespread adoption. But it would be a mistake to ignore the security aspect of LLMs. Instead, it should be part of your day 0 strategy. From training to deployment and evaluation, you should constantly be on your toes to make sure the LLMs in your organization do not cause harm.

Guardrails approach

One common approach suggested by many is to draw red lines that the bot will never cross. The earlier in the process you draw these lines, the easier it is to rely on them working correctly.

For example, you can decide to never train your model on certain types of data. In that case, you can be sure that the model cannot leak data that was never encoded into it.
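A simple way to enforce such a red line is an allowlist: only fields you have explicitly approved ever reach the training pipeline. The field names and the `to_training_record` helper below are hypothetical.

```python
# Hypothetical allowlist of fields approved for training.
ALLOWED_FIELDS = ("product_category", "order_month", "rating")


def to_training_record(raw: dict, allowed=ALLOWED_FIELDS) -> dict:
    """Copy only allowlisted fields; anything else never reaches training."""
    return {key: raw[key] for key in allowed if key in raw}
```

An allowlist fails safe: a new column added to your database later is excluded by default until someone approves it.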

Similarly, you can apply guardrails to a model's output and simply refuse to pass that output on if it does not meet certain criteria. For example, whenever your model produces slang, you can treat it as a red line being crossed and discard that output.
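An output guardrail of this kind can be sketched as a filter that sits between the model and the customer. The blocked terms, the fallback message, and the function names below are placeholders. A real deployment would likely need a much richer check than a word list.

```python
import re

# Placeholder terms; a real list would be maintained by your own team.
BLOCKED_TERMS = ("darn", "heck")
FALLBACK = "Sorry, I can't help with that. Let me connect you with a colleague."


def build_pattern(terms):
    """Compile a case-insensitive pattern matching any term as a whole word."""
    joined = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + joined + r")\b", re.IGNORECASE)


_PATTERN = build_pattern(BLOCKED_TERMS)


def apply_output_guardrail(model_output: str) -> str:
    """Return the model output unchanged, or the fallback if a red line is crossed."""
    if _PATTERN.search(model_output):
        return FALLBACK
    return model_output
```

Because the check runs outside the model, it keeps working no matter how the model was trained or prompted.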

Supervisor model approach

Another approach is to deploy multiple models serving different purposes, one of which could be the guardrail enforcer. Imagine you have a chatbot that chats with your customers to handle their complaints. You can have a supervisor bot that "approves" every response the chatbot gives to customers. The supervisor bot can be trained more affordably and on less sensitive data. It might be easier and cheaper to re-train this supervisor than the primary bot.

Similarly, you can have shadow bots that process your training data itself to prune potentially sensitive or harmful training examples.
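The supervisor pattern can be sketched as a thin wrapper around two callables. Here `primary` and `supervisor` stand in for whatever models you deploy. Their signatures, the `supervised_reply` name, and the fallback message are assumptions made for illustration.

```python
from typing import Callable


def supervised_reply(
    customer_message: str,
    primary: Callable[[str], str],
    supervisor: Callable[[str, str], bool],
    fallback: str = "Let me hand you over to a human agent.",
) -> str:
    """Send the primary bot's draft only if the supervisor approves it."""
    draft = primary(customer_message)
    if supervisor(customer_message, draft):
        return draft
    # The rejected draft is never shown to the customer.
    return fallback
```

Keeping the supervisor behind a narrow yes/no interface means you can re-train or replace it without touching the primary bot.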


The nascent field of LLM security presents challenges and, consequently, opportunities. We can be certain that this field will create several billion-dollar products offering security solutions for various types of threats associated with LLMs.

Please let us know what you think about other threats we might have missed.
