Models
Common model zoos such as huggingface/transformers struggle with PyTorch-native model parallelism. Following the design principle of vLLM, we keep the model implementations in verl simple, parallelizable, and highly optimized, operating on packed inputs.
Adding a New Huggingface Model
Step 1: Copy the model file from HF to verl
- Add a new file under verl/models/hf
- Copy ONLY the model file from huggingface/transformers/models to verl/models/hf
Step 2: Modify the model file to use packed inputs
- Remove all the code related to inference (kv cache)
- Modify the inputs to include only:
  - input_ids (total_nnz,)
  - cu_seqlens (batch_size + 1,)
  - max_seqlen_in_batch: int
- Note that this requires flash attention with a causal mask (a sketch of a packed-input attention forward follows below).
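As a rough illustration, here is a minimal sketch of what a packed-input attention forward might look like when built on flash_attn_varlen_func. The class name, projection layers, and hyperparameters are placeholders rather than verl's actual implementation, and flash attention additionally requires CUDA with fp16/bf16 tensors.

```python
import torch
import torch.nn as nn
from flash_attn import flash_attn_varlen_func  # requires the flash-attn package, CUDA, fp16/bf16


class PackedSelfAttention(nn.Module):
    """Illustrative self-attention block that consumes packed (un-padded) inputs."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size)
        self.o_proj = nn.Linear(hidden_size, hidden_size)

    def forward(
        self,
        hidden_states: torch.Tensor,  # (total_nnz, hidden_size): all sequences concatenated
        cu_seqlens: torch.Tensor,     # (batch_size + 1,) int32 cumulative sequence lengths
        max_seqlen_in_batch: int,     # length of the longest sequence in the batch
    ) -> torch.Tensor:
        total_nnz = hidden_states.shape[0]
        q, k, v = self.qkv_proj(hidden_states).chunk(3, dim=-1)
        # flash_attn_varlen_func expects (total_nnz, num_heads, head_dim)
        q = q.view(total_nnz, self.num_heads, self.head_dim)
        k = k.view(total_nnz, self.num_heads, self.head_dim)
        v = v.view(total_nnz, self.num_heads, self.head_dim)
        attn_out = flash_attn_varlen_func(
            q, k, v,
            cu_seqlens_q=cu_seqlens,
            cu_seqlens_k=cu_seqlens,
            max_seqlen_q=max_seqlen_in_batch,
            max_seqlen_k=max_seqlen_in_batch,
            causal=True,  # the causal mask is applied per sequence, not across the packed batch
        )
        return self.o_proj(attn_out.reshape(total_nnz, -1))
```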
Step 2.5: Add tests
- Add a test to compare this version and the huggingface version
- Follow the existing test infrastructure and add the tests under tests/models/hf (see the sketch below)
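A hypothetical test could build a tiny model, run the Hugging Face reference one sequence at a time, and compare it against the packed forward. `build_verl_model_from_hf` below is a placeholder for however you load the ported model with the reference weights.

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM


def test_packed_forward_matches_hf():
    # tiny config so the test runs quickly
    config = LlamaConfig(
        vocab_size=1024, hidden_size=64, intermediate_size=128,
        num_hidden_layers=2, num_attention_heads=4,
    )
    hf_model = LlamaForCausalLM(config).eval()

    # pack two variable-length sequences into one flat tensor
    seqlens = [5, 8]
    sequences = [torch.randint(0, config.vocab_size, (l,)) for l in seqlens]
    input_ids = torch.cat(sequences)                           # (total_nnz,)
    cu_seqlens = torch.tensor([0, 5, 13], dtype=torch.int32)   # (batch_size + 1,)
    max_seqlen_in_batch = max(seqlens)

    # build_verl_model_from_hf is a hypothetical helper that copies hf_model's
    # weights into the ported packed-input model
    verl_model = build_verl_model_from_hf(hf_model).eval()
    packed_logits = verl_model(input_ids, cu_seqlens, max_seqlen_in_batch)

    # compare against the Hugging Face reference, one sequence at a time
    for i, ids in enumerate(sequences):
        ref_logits = hf_model(ids.unsqueeze(0)).logits.squeeze(0)
        start, end = int(cu_seqlens[i]), int(cu_seqlens[i + 1])
        torch.testing.assert_close(packed_logits[start:end], ref_logits, atol=1e-2, rtol=1e-2)
```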
Step 3: Add a function to apply tensor parallelism
- Please follow the PyTorch-native tensor parallel (DTensor) APIs
- General comments
- Tensor parallelism in native PyTorch is NOT auto-parallelism. It works by specifying, through a parallelization plan (config), how model parameters are sharded and how inputs/outputs are resharded. These configs are registered as hooks that perform the input/output resharding before/after the model forward (see the sketch below).
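For illustration, the sketch below applies the DTensor-based parallelize_module API to each decoder layer. The submodule names in the plan ("self_attn.q_proj", "mlp.down_proj", etc.) and the `model.layers` attribute are assumptions that must match the FQNs of the model you ported.

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


def apply_tensor_parallel(model: nn.Module, tp_size: int) -> nn.Module:
    tp_mesh = init_device_mesh("cuda", (tp_size,), mesh_dim_names=("tp",))
    # The plan maps submodule names to sharding styles. parallelize_module shards
    # the parameters and registers forward pre/post hooks that reshard the
    # inputs/outputs around each submodule's forward.
    layer_plan = {
        "self_attn.q_proj": ColwiseParallel(),
        "self_attn.k_proj": ColwiseParallel(),
        "self_attn.v_proj": ColwiseParallel(),
        "self_attn.o_proj": RowwiseParallel(),
        "mlp.gate_proj": ColwiseParallel(),
        "mlp.up_proj": ColwiseParallel(),
        "mlp.down_proj": RowwiseParallel(),
    }
    for layer in model.layers:  # assumes the decoder layers live in `model.layers`
        parallelize_module(layer, tp_mesh, layer_plan)
    return model
```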
Step 4: Add a function to apply data parallelism
- Please use FSDP2 APIs
- See demo here https://github.com/pytorch/torchtitan/blob/main/torchtitan/parallelisms/parallelize_llama.py#L413
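A minimal sketch, modeled loosely on torchtitan's parallelize_llama, is shown below. Note that the fully_shard import path has moved between releases (it lived under torch.distributed._composable.fsdp around PyTorch 2.4), and `model.layers` is again assumed to hold the decoder layers.

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._composable.fsdp import fully_shard  # path may differ by PyTorch version


def apply_fsdp2(model: nn.Module, dp_size: int) -> nn.Module:
    dp_mesh = init_device_mesh("cuda", (dp_size,), mesh_dim_names=("dp",))
    # Shard each transformer block on its own so parameters are gathered
    # layer by layer during forward/backward, then shard the remaining
    # parameters (embeddings, lm head) at the root module.
    for layer in model.layers:
        fully_shard(layer, mesh=dp_mesh)
    fully_shard(model, mesh=dp_mesh)
    return model
```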
Step 5: Add a function to apply pipeline parallelism
- Comes in PyTorch 2.4
- Currently only available as an alpha feature in the nightly builds
- Check torchtitan for more details
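Since the API is still in flux, the following is only an approximate sketch of the torch.distributed.pipelining prototype; the split point "layers.4" and the micro-batch count are arbitrary examples, so consult torchtitan for a maintained reference.

```python
from torch.distributed.pipelining import ScheduleGPipe, SplitPoint, pipeline


def apply_pipeline_parallel(model, example_input_ids, stage_index, device):
    # Trace and split the model at a module boundary; "layers.4" is an
    # illustrative split point (e.g. an 8-layer model cut into two stages).
    pipe = pipeline(
        model,
        mb_args=(example_input_ids,),
        split_spec={"layers.4": SplitPoint.BEGINNING},
    )
    stage = pipe.build_stage(stage_index, device)
    # GPipe-style schedule: each rank runs its stage over n micro-batches.
    return ScheduleGPipe(stage, n_microbatches=4)
```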