Automatically add EOS via Tokenizer, add Sentence Transformers snippet

by tomaarsen HF Staff - opened 26 days ago

base: refs/heads/main ← from: refs/pr/2

Files changed (2): +54 -9

README.md +52 -7
tokenizer.json +2 -2

README.md CHANGED

@@ -5,9 +5,11 @@ datasets:
 - codefuse-ai/F2LLM
 language:
 - en
 license: apache-2.0
 pipeline_tag: feature-extraction
-library_name: transformers
 ---
 # F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data
@@ -19,7 +21,38 @@ F2LLMs (Foundation to Feature Large Language Models) are foundation models direc
 ## Usage
-To encode a batch of sentences:
 ```python
 from transformers import AutoModel, AutoTokenizer
@@ -31,22 +64,34 @@ model_path = "codefuse-ai/F2LLM-1.7B"
 tokenizer = AutoTokenizer.from_pretrained(model_path)
 model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map={'': 0})
-sentences = [
     'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
-    'Model checkpoints, training datasets, and training code are released, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future research in text embedding models.'
 ]
 def encode(sentences):
     batch_size = len(sentences)
-    sentences = [s+tokenizer.eos_token for s in sentences]
-    tokenized_inputs = tokenizer(sentences, padding=True, return_tensors='pt', add_special_tokens=False).to(model.device)
     last_hidden_state = model(**tokenized_inputs).last_hidden_state
     eos_positions = tokenized_inputs.attention_mask.sum(dim=1) - 1
     embeddings = last_hidden_state[torch.arange(batch_size, device=model.device), eos_positions]
     embeddings = F.normalize(embeddings, p=2, dim=1)
     return embeddings
-embeddings = encode(sentences)
 ```
 ## Evaluation

 - codefuse-ai/F2LLM
 language:
 - en
+tags:
+- transformers
 license: apache-2.0
 pipeline_tag: feature-extraction
+library_name: sentence-transformers
 ---
 # F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data
 ## Usage
+### With Sentence Transformers
+To encode text using F2LLM with the [Sentence Transformers](https://www.sbert.net/) library:
+```python
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer("codefuse-ai/F2LLM-1.7B", model_kwargs={"torch_dtype": "bfloat16"})
+# Some sample query and documents
+query = "What is F2LLM used for?"
+documents = [
+    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
+    'Model checkpoints, training datasets, and training code are released, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future research in text embedding models.',
+    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.'
+]
+# Encode the query and documents separately, the encode_query method uses the query prompt
+query_embedding = model.encode_query(query)
+document_embeddings = model.encode_document(documents)
+print(query_embedding.shape, document_embeddings.shape)
+# (2048,) (3, 2048)
+# Compute cosine similarity between the query and documents
+similarity = model.similarity(query_embedding, document_embeddings)
+print(similarity)
+# tensor([[0.5373, 0.6257, 0.8218]])
+```
+### With Transformers
+Or directly with the [Transformers](https://huggingface.co/docs/transformers/index) library:
 ```python
 from transformers import AutoModel, AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained(model_path)
 model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map={'': 0})
+query = "What is F2LLM used for?"
+query_prompt = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:"
+documents = [
     'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
+    'Model checkpoints, training datasets, and training code are released, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future research in text embedding models.',
+    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.'
 ]
 def encode(sentences):
     batch_size = len(sentences)
+    tokenized_inputs = tokenizer(sentences, padding=True, return_tensors='pt').to(model.device)
     last_hidden_state = model(**tokenized_inputs).last_hidden_state
     eos_positions = tokenized_inputs.attention_mask.sum(dim=1) - 1
     embeddings = last_hidden_state[torch.arange(batch_size, device=model.device), eos_positions]
     embeddings = F.normalize(embeddings, p=2, dim=1)
     return embeddings
+# Encode the query and documents
+query_embedding = encode([query_prompt + query])
+document_embeddings = encode(documents)
+print(query_embedding.shape, document_embeddings.shape)
+# torch.Size([1, 2048]) torch.Size([3, 2048])
+# Compute cosine similarity between the query and documents
+similarity = query_embedding @ document_embeddings.T
+print(similarity)
+# tensor([[0.5391, 0.6250, 0.8242]], device='cuda:0', dtype=torch.bfloat16,
+#        grad_fn=<MmBackward0>)
 ```
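
The pooling step in `encode` above takes the hidden state at each sequence's last non-padding position, computed as `attention_mask.sum(dim=1) - 1`. A minimal sketch of that index computation in plain Python, using toy lists instead of tensors. It assumes right padding (each mask row is a run of 1s followed by 0s). The helper names `last_token_positions` and `pool_last_token` are hypothetical and not part of this PR.

```python
def last_token_positions(attention_mask):
    """Return the index of the last real (non-padding) token in each row.

    Assumption: right padding, so every row is a run of 1s followed by 0s.
    """
    return [sum(row) - 1 for row in attention_mask]


def pool_last_token(hidden_states, attention_mask):
    """Select one vector per sequence at its last non-padding position."""
    positions = last_token_positions(attention_mask)
    return [seq[pos] for seq, pos in zip(hidden_states, positions)]


# Toy batch: 2 sequences padded to length 4, hidden size 2.
attention_mask = [
    [1, 1, 1, 0],  # 3 real tokens, last one at index 2
    [1, 1, 1, 1],  # 4 real tokens, last one at index 3
]
hidden_states = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]],
    [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6], [1.7, 1.8]],
]

positions = last_token_positions(attention_mask)
pooled = pool_last_token(hidden_states, attention_mask)
print(positions)  # [2, 3]
print(pooled)     # [[0.5, 0.6], [1.7, 1.8]]
```

With left padding the same formula would point at the wrong index, so the sketch only holds when the tokenizer pads on the right.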
 ## Evaluation

tokenizer.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
-size 11422654

 version https://git-lfs.github.com/spec/v1
+oid sha256:38360d5a512a43641b36d6fba2df87b8a3f5464c6b5c76f03e82d6d795175566
+size 11423195