While integrating a self-hosted embeddings model with LangChain, served through Hugging Face's Text Embeddings Inference (TEI), I encountered an issue where the `OpenAIEmbeddings` client automatically tokenized my input using the OpenAI model tokenizer. This wasn't a compatibility error that failed loudly; instead, it silently produced very poor results because the wrong tokenizer was applied without warning.
## The Problem
I assumed I could use the `OpenAIEmbeddings` client with my self-hosted inference server, since TEI exposes an OpenAI-compatible API. However, LangChain's `OpenAIEmbeddings` client enables client-side tokenization using the `tiktoken` library. This behavior is controlled by the `tiktoken_enabled` parameter, which is set to `True` by default.
## Non-Working Code: OpenAIEmbeddings Client
This is the code I initially used, expecting it to work with my self-hosted server:
```python
from langchain_openai import OpenAIEmbeddings

client = OpenAIEmbeddings(
    base_url="http://localhost:80/v1",
    api_key="unused",  # TEI ignores the key, but the client requires one
)
```
However, this does not work as expected: LangChain tokenizes the input with the OpenAI model tokenizer before sending it. Since my self-hosted model is not an OpenAI model, the wrong tokenizer was applied, resulting in poor embedding quality.
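To make the mismatch concrete, here is a minimal sketch comparing the two tokenizers. The model name `sentence-transformers/all-MiniLM-L6-v2` is only an illustrative stand-in for whatever model your TEI server actually serves:

```python
import tiktoken
from transformers import AutoTokenizer

text = "Hello, embeddings!"

# Token IDs that OpenAIEmbeddings computes client-side via tiktoken
openai_ids = tiktoken.get_encoding("cl100k_base").encode(text)

# Token IDs the self-hosted model actually understands (example model)
hf_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
hf_ids = hf_tokenizer.encode(text)

# The two vocabularies are unrelated, so IDs produced by one tokenizer
# are meaningless to the other model.
print(openai_ids)
print(hf_ids)
```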
## Crucial Issue: Tokenization is Enabled by Default
By default, `tiktoken_enabled=True`, meaning LangChain will tokenize the input before sending it to the API:
```python
client = OpenAIEmbeddings(
    base_url="http://localhost:80/v1",
    tiktoken_enabled=True,  # the default setting causing the issue
)
```
Even when setting `tiktoken_enabled=False`, LangChain still tries to load a tokenizer for the model (falling back to the `transformers` library), though it no longer includes the tokenized output in the request. This means an additional dependency is needed just to tokenize input for a custom model, which seemed unnecessary for my use case.
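For completeness, the opt-out looks roughly like this. The model name below is a placeholder for whatever your server hosts, and `api_key` is a dummy value because the client requires one:

```python
from langchain_openai import OpenAIEmbeddings

# With tiktoken disabled, LangChain pulls in the `transformers` package and
# loads the model's own tokenizer just to length-check and chunk the input.
client = OpenAIEmbeddings(
    base_url="http://localhost:80/v1",
    api_key="unused",               # TEI ignores the key, the client needs one
    model="BAAI/bge-base-en-v1.5",  # placeholder for the model TEI serves
    tiktoken_enabled=False,
)
```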
## The Solution: Use HuggingFaceEndpointEmbeddings Instead
I found that using `HuggingFaceEndpointEmbeddings` was a better approach, as it allows specifying a custom endpoint without automatic tokenization:
```python
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings

client = HuggingFaceEndpointEmbeddings(model="http://localhost:80")
```
This bypassed the incorrect tokenization and aligned with my server’s expectations.
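A quick usage check with the standard LangChain embeddings interface (`embed_query` for single strings, `embed_documents` for batches); the texts are arbitrary:

```python
vector = client.embed_query("What is LangChain?")
print(len(vector))  # dimensionality of your model's embedding space

vectors = client.embed_documents(["first document", "second document"])
print(len(vectors), len(vectors[0]))  # 2 vectors of the same dimensionality
```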
## Documentation Gaps & Lessons Learned
While exploring LangChain's embedding clients, I expected to find a more suitable option for self-hosted models. However:
- None of the alternative clients explicitly supported custom endpoint URLs.
- The `HuggingFaceEndpointEmbeddings` client seemed designed for official Hugging Face endpoints, but it actually supports specifying a custom URL via the `model` parameter, which was not documented.
- I only discovered this via AI assistance, which suggested trying the `model` parameter for specifying the endpoint.
## Key Takeaways
- LangChain's `OpenAIEmbeddings` client applies tiktoken tokenization by default, so only use it with OpenAI's embedding models.
- `HuggingFaceEndpointEmbeddings` is the better choice for self-hosted models served through Hugging Face's text-embeddings-inference.
- LangChain's documentation is incomplete in some areas, and AI assistance can sometimes provide better insights than the official docs.
## Final Thoughts
If you’re hosting your own embeddings model, double-check how LangChain processes input before making API calls. The wrong tokenizer can silently degrade performance without errors, making troubleshooting difficult.
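One way to do that double-check, sketched under the assumption that your TEI server exposes its native `/embed` route: embed the same text through the LangChain client and directly against the server, and verify the vectors (nearly) match:

```python
import math

import requests
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings

text = "tokenizer sanity check"

# Embedding obtained through the LangChain client
client = HuggingFaceEndpointEmbeddings(model="http://localhost:80")
via_client = client.embed_query(text)

# Embedding fetched straight from TEI's /embed route (returns a list of vectors)
via_api = requests.post("http://localhost:80/embed", json={"inputs": text}).json()[0]

# Identical input should yield (nearly) identical vectors; a low similarity
# means something between the client and the server is mangling the input.
dot = sum(a * b for a, b in zip(via_client, via_api))
norm = math.sqrt(sum(a * a for a in via_client)) * math.sqrt(sum(b * b for b in via_api))
print(f"cosine similarity: {dot / norm:.4f}")
```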