This is a sentence-transformers model finetuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
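The Pooling and Normalize modules listed above can be illustrated with a small sketch. This is not the library's implementation: it is a pure-Python approximation that assumes mean pooling averages the non-padding token vectors and that Normalize rescales the result to unit length. The helper names `mean_pool` and `l2_normalize` and the 4-dimensional toy vectors are hypothetical (the real model works with 384 dimensions).

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is ignored)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Toy input: three token vectors of dimension 4; the last one is padding.
tokens = [[1.0, 2.0, 0.0, 0.0], [3.0, 2.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
sentence_embedding = l2_normalize(pooled)
print(pooled)
print(sentence_embedding)
```

Because the final vector has unit length, the dot product of two such embeddings equals their cosine similarity.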
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("danicafisher/dfisher-base-sentence-transformer")
# Run inference
sentences = [
    'How can organizations address risks associated with the use of third-party data for GAI model inputs?',
    # Adjacent string literals are concatenated, so each passage below is a single list item.
    '48 \n• Data protection \n• Data retention \n• Consistency in use of defining key terms \n• Decommissioning \n• Discouraging anonymous use \n• Education \n• Impact assessments \n• Incident response \n• Monitoring \n• Opt-outs \n• Risk-based controls \n• Risk mapping and measurement \n• Science-backed TEVV practices \n• Secure software development practices \n• Stakeholder engagement \n• Synthetic content detection and \nlabeling tools and techniques \n• Whistleblower protections \n• Workforce diversity and \ninterdisciplinary teams\nEstablishing acceptable use policies and guidance for the use of GAI in formal human-AI teaming settings \nas well as different levels of human-AI configurations can help to decrease risks arising from misuse, \nabuse, inappropriate repurpose, and misalignment between systems and users. These practices are just \none example of adapting existing governance protocols for GAI contexts. \nA.1.3. Third-Party Considerations \nOrganizations may seek to acquire, embed, incorporate, or use open-source or proprietary third-party \nGAI models, systems, or generated data for various applications across an enterprise. Use of these GAI \ntools and inputs has implications for all functions of the organization – including but not limited to \nacquisition, human resources, legal, compliance, and IT services – regardless of whether they are carried \nout by employees or third parties. Many of the actions cited above are relevant and options for \naddressing third-party considerations. \nThird party GAI integrations may give rise to increased intellectual property, data privacy, or information \nsecurity risks, pointing to the need for clear guidelines for transparency and risk management regarding \nthe collection and use of third-party data for model inputs. '
    'Organizations may consider varying risk \ncontrols for foundation models, fine-tuned models, and embedded tools, enhanced processes for \ninteracting with external GAI technologies or service providers. Organizations can apply standard or \nexisting risk controls and processes to proprietary or open-source GAI technologies, data, and third-party \nservice providers, including acquisition and procurement due diligence, requests for software bills of \nmaterials (SBOMs), application of service level agreements (SLAs), and statement on standards for \nattestation engagement (SSAE) reports to help with third-party transparency and risk management for \nGAI systems. \nA.1.4. Pre-Deployment Testing \nOverview \nThe diverse ways and contexts in which GAI systems may be developed, used, and repurposed \ncomplicates risk mapping and pre-deployment measurement efforts. Robust test, evaluation, validation, \nand verification (TEVV) processes can be iteratively applied – and documented – in early stages of the AI \nlifecycle and informed by representative AI Actors (see Figure 3 of the AI RMF). Until new and rigorous',
    '8 \nTrustworthy AI Characteristics: Accountable and Transparent, Privacy Enhanced, Safe, Secure and \nResilient \n2.5. Environmental Impacts \nTraining, maintaining, and operating (running inference on) GAI systems are resource-intensive activities, \nwith potentially large energy and environmental footprints. Energy and carbon emissions vary based on \nwhat is being done with the GAI model (i.e., pre-training, fine-tuning, inference), the modality of the \ncontent, hardware used, and type of task or application. \nCurrent estimates suggest that training a single transformer LLM can emit as much carbon as 300 round-\ntrip flights between San Francisco and New York. In a study comparing energy consumption and carbon \nemissions for LLM inference, generative tasks (e.g., text summarization) were found to be more energy- \nand carbon-intensive than discriminative or non-generative tasks (e.g., text classification). \nMethods for creating smaller versions of trained models, such as model distillation or compression, \ncould reduce environmental impacts at inference time, but training and tuning such models may still \ncontribute to their environmental impacts. Currently there is no agreed upon method to estimate \nenvironmental impacts from GAI. \nTrustworthy AI Characteristics: Accountable and Transparent, Safe \n2.6. Harmful Bias and Homogenization \nBias exists in many forms and can become ingrained in automated systems. AI systems, including GAI \nsystems, can increase the speed and scale at which harmful biases manifest and are acted upon, \npotentially perpetuating and amplifying harms to individuals, groups, communities, organizations, and \nsociety. For example, when prompted to generate images of CEOs, doctors, lawyers, and judges, current \ntext-to-image models underrepresent women and/or racial minorities, and people with disabilities. '
    '\nImage generator models have also produced biased or stereotyped output for various demographic \ngroups and have difficulty producing non-stereotyped content even when the prompt specifically \nrequests image features that are inconsistent with the stereotypes. Harmful bias in GAI models, which \nmay stem from their training data, can also cause representational harms or perpetuate or exacerbate \nbias based on race, gender, disability, or other protected classes. \nHarmful bias in GAI systems can also lead to harms via disparities between how a model performs for \ndifferent subgroups or languages (e.g., an LLM may perform less well for non-English languages or \ncertain dialects). Such disparities can contribute to discriminatory decision-making or amplification of \nexisting societal biases. In addition, GAI systems may be inappropriately trusted to perform similarly \nacross all subgroups, which could leave the groups facing underperformance with worse outcomes than \nif no GAI system were used. Disparate or reduced performance for lower-resource languages also \npresents challenges to model adoption, inclusion, and accessibility, and may make preservation of \nendangered languages more difficult if GAI systems become embedded in everyday processes that would \notherwise have been opportunities to use these languages. \nBias is mutually reinforcing with the problem of undesired homogenization, in which GAI systems \nproduce skewed distributions of outputs that are overly uniform (for example, repetitive aesthetic styles',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
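To make the 3 x 3 similarity matrix above more concrete, here is a minimal sketch of how pairwise scores could be computed and used to rank passages against a query. It assumes the model's similarity function is cosine similarity, which this card does not state explicitly. The helpers `cosine_similarity` and `similarity_matrix` and the 3-dimensional toy vectors are hypothetical stand-ins for the real 384-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Pairwise cosine similarities; entry [i][j] compares vectors i and j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical toy embeddings: index 0 plays the role of the query,
# indices 1 and 2 play the role of the two passages.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]

scores = similarity_matrix(toy_embeddings)

# Rank the passages by their similarity to the query (row 0), best first.
ranking = sorted(range(1, len(toy_embeddings)), key=lambda j: scores[0][j], reverse=True)
print(ranking)
```

With real embeddings, the first row of `similarities` can be read the same way: the column with the highest score (excluding the query itself) is the passage the model considers closest to the question.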
Columns: sentence_0 and sentence_1
|  | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |
| details |  |  |
| sentence_0 | sentence_1 |
|---|---|
| What measures are suggested to assess the environmental impact of AI model training and management activities? | 37 |
| What are some limitations of current pre-deployment testing approaches for GAI applications? | 49 |
| How can organizations adjust their governance regimes to effectively manage the unique risks associated with generative AI? | 47 |
MultipleNegativesRankingLoss with these parameters:
{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
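The two parameters above can be read against a small sketch of the loss. This is a pure-Python approximation, not the library's code. It assumes each anchor (sentence_0) is scored against every positive (sentence_1) in the batch with cosine similarity, the scores are multiplied by `scale`, and cross-entropy is applied with the matching pair as the correct class, so the other positives in the batch act as negatives. The function names and toy vectors are hypothetical.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Scaled similarity of this anchor to every candidate in the batch.
        logits = [scale * cos_sim(anchor, candidate) for candidate in positives]
        # Numerically stable log-sum-exp over the candidates.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(value - top) for value in logits))
        # Negative log-probability of the true pair.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two pairs with 2-dimensional embeddings.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each positive matches its own anchor
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each positive matches the other anchor

loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_swapped = multiple_negatives_ranking_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

In this sketch the loss is near zero when each anchor is closest to its own positive and close to `scale` when the pairs are swapped. That illustrates why a larger `scale` penalizes ranking mistakes more sharply.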
Non-default hyperparameters:
per_device_train_batch_size: 20
per_device_eval_batch_size: 20
num_train_epochs: 10
multi_dataset_batch_sampler: round_robin

All hyperparameters:
overwrite_output_dir: False
do_predict: False
eval_strategy: no
prediction_loss_only: True
per_device_train_batch_size: 20
per_device_eval_batch_size: 20
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 10
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
eval_use_gather_object: False
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin

@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}