Semantic Cache for LLM responses
Return cached responses not only for identical prompts but also for prompts that are similar in meaning.

Prerequisites
- API Management instance with an Azure OpenAI model deployment imported as an API
- Deployments for the following APIs:
  - Chat Completion API
  - Embeddings API
- API Management instance configured to use managed identity authentication to the Azure OpenAI service
- An Azure Managed Redis instance with the RediSearch module enabled
Authenticate with managed identity
Let’s quickly cover how to use a managed identity to authenticate to Azure OpenAI. First, make sure that managed identity is enabled on your API Management instance.

Assign the Cognitive Services OpenAI User role to the managed identity on the Azure OpenAI resource. Then add the following inbound policy section to authenticate requests to the API by using the managed identity.

<authentication-managed-identity resource="https://cognitiveservices.azure.com" output-token-variable-name="managed-id-access-token" ignore-error="false" />
<set-header name="Authorization" exists-action="override">
    <value>@("Bearer " + (string)context.Variables["managed-id-access-token"])</value>
</set-header>
Import an Azure OpenAI API
If you have already imported an Azure OpenAI instance as an API into API Management, it appears on the APIs page of your instance.

If not, you can either import an Azure OpenAI API directly from a deployment in Microsoft Foundry or download and edit the OpenAPI specification.
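If you go the OpenAPI route, the import can also be scripted against the Azure Resource Manager REST API. A rough sketch, assuming azure-identity and requests are installed; the resource names, API version, and spec URL are placeholders to adapt:

# Import an (edited) OpenAPI specification as an API in API Management.
# Every value in angle brackets is a placeholder for your own resources.
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token("https://management.azure.com/.default")

url = (
    "https://management.azure.com/subscriptions/<subscription-id>"
    "/resourceGroups/<resource-group>/providers/Microsoft.ApiManagement"
    "/service/<apim-name>/apis/azure-openai?api-version=2022-08-01"
)
body = {
    "properties": {
        "format": "openapi-link",               # import the spec from a URL
        "value": "<url-to-edited-openapi-spec>",
        "path": "openai",                       # API URL suffix on the gateway
        "protocols": ["https"],
    }
}
resp = requests.put(url, json=body, headers={"Authorization": f"Bearer {token.token}"})
resp.raise_for_status()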
Azure Managed Redis Setup
We will set up an external cache using the Azure Managed Redis instance.

Azure API Management uses a Redis connection string to connect to the cache. Because we are using Azure Managed Redis, we need to enable access key authentication on the cache and use the key as the password in the connection string. Currently, Microsoft Entra authentication can't be used to connect Azure API Management to Azure Managed Redis.
The connection string will look like:
<cache-name>:10000,password=<cache-access-key>,ssl=True,abortConnect=False
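Before configuring the external cache in API Management, it can be worth verifying that the connection string values work. A minimal sketch with the redis-py package; the host name format and all angle-bracket values are assumptions to replace with your own cache name, region, and access key:

# Quick connectivity check against the Azure Managed Redis instance.
# Host, port, and password mirror the connection string above.
import redis

r = redis.Redis(
    host="<cache-name>.<region>.redis.azure.net",  # placeholder host
    port=10000,                                    # Azure Managed Redis port
    password="<cache-access-key>",                 # placeholder access key
    ssl=True,
)
print(r.ping())  # True when the cache is reachable

If the ping succeeds, the same values can go into the external cache configuration of the API Management instance.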
Semantic Cache Configuration
Start by creating a backend for the Embeddings API. The backend URL points to the embeddings deployment on your Azure OpenAI resource, in the form https://<openai-resource>.openai.azure.com/openai/deployments/<embeddings-deployment>/embeddings.
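As with the API import, this step can be scripted through the Resource Manager REST API; a hedged sketch, again with placeholder names:

# Create the "Embedding" backend pointing at the embeddings deployment.
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token("https://management.azure.com/.default")

url = (
    "https://management.azure.com/subscriptions/<subscription-id>"
    "/resourceGroups/<resource-group>/providers/Microsoft.ApiManagement"
    "/service/<apim-name>/backends/Embedding?api-version=2022-08-01"
)
body = {
    "properties": {
        # Runtime URL of the embeddings deployment on the Azure OpenAI resource
        "url": "https://<openai-resource>.openai.azure.com/openai/deployments/<embeddings-deployment>/embeddings",
        "protocol": "http",
    }
}
resp = requests.put(url, json=body, headers={"Authorization": f"Bearer {token.token}"})
resp.raise_for_status()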

Next, let’s configure the semantic caching policies. In the Inbound processing section, add the azure-openai-semantic-cache-lookup policy. In the embeddings-backend-id attribute, specify the Embeddings API backend you created; in my case, the value is Embedding. The score-threshold attribute controls how similar a new prompt must be to a cached one to count as a hit (smaller values require greater semantic similarity), and the vary-by element partitions the cache, here per subscription.
<azure-openai-semantic-cache-lookup
    score-threshold="0.15"
    embeddings-backend-id="Embedding"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true"
    max-message-count="10">
    <vary-by>@(context.Subscription.Id)</vary-by>
</azure-openai-semantic-cache-lookup>
In the Outbound processing section for the API, add the azure-openai-semantic-cache-store policy. The duration attribute specifies the cache entry lifetime in seconds, so responses here are cached for one minute.
<azure-openai-semantic-cache-store duration="60" />
To confirm the semantic cache is working, trace a test Completion or Chat Completion operation by using the test console in the portal. The trace output shows whether the request was served from the semantic cache or forwarded to the backend.
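You can also observe the behavior from a client. A sketch under a few assumptions: the API is exposed at a hypothetical gateway URL, the subscription key is sent in the api-key header, and the deployment name and api-version are placeholders. Two semantically similar prompts should result in the second response coming from the cache, typically much faster:

# Fire two similar prompts at the APIM gateway and compare latency.
# Endpoint, deployment, api-version, and key are placeholders.
import time
import requests

ENDPOINT = (
    "https://<apim-name>.azure-api.net/<api-suffix>/deployments/"
    "<deployment>/chat/completions?api-version=<api-version>"
)
HEADERS = {"api-key": "<apim-subscription-key>"}

for prompt in [
    "How do I reset my password?",
    "What are the steps to reset my password?",
]:
    start = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        headers=HEADERS,
        json={"messages": [{"role": "user", "content": prompt}]},
    )
    resp.raise_for_status()
    answer = resp.json()["choices"][0]["message"]["content"]
    # The second request should be a semantic cache hit.
    print(f"{time.perf_counter() - start:.2f}s  {answer[:60]}")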