# Models
Common model zoos such as huggingface/transformers struggle with PyTorch native model parallelism. Following the design principle of vLLM, we keep a simple, parallelizable, highly optimized model implementation with packed inputs in verl.
## Adding a New Huggingface Model
### Step 1: Copy the model file from HF to verl
- Add a new file under verl/models/hf
- Copy ONLY the model file from huggingface/transformers/models to verl/models/hf
### Step 2: Modify the model file to use packed inputs
- Remove all the code related to inference (KV cache)
- Modify the inputs to include only
  - input_ids (total_nnz,)
  - cu_seqlens (total_nnz + 1,)
  - max_seqlen_in_batch: int
- Note that this requires using flash attention with a causal mask (see the sketch after this list).
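As an illustration, here is a minimal sketch of what the attention call looks like with packed inputs, assuming the flash-attn package is installed; the function and tensor names are illustrative, not verl's actual API:

```python
# Minimal sketch of an attention forward over packed (unpadded) inputs,
# assuming the flash-attn package. q, k, v are the projected hidden states
# of shape (total_nnz, num_heads, head_dim): all sequences in the batch are
# concatenated along dim 0, so there is no batch or padding dimension.
from flash_attn import flash_attn_varlen_func

def packed_attention(q, k, v, cu_seqlens, max_seqlen_in_batch):
    # cu_seqlens: (total_nnz + 1,) int32 cumulative sequence lengths, e.g.
    # [0, 5, 12, 20] for three packed sequences of lengths 5, 7 and 8.
    return flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens,
        cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_seqlen_in_batch,
        max_seqlen_k=max_seqlen_in_batch,
        causal=True,  # the causal mask is applied per packed sequence
    )
```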
### Step 2.5: Add tests
- Add a test that compares this version against the huggingface version (a sketch follows this list)
- Follow the existing test infrastructure and add the tests under tests/models/hf
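A parity test can build packed inputs from a list of variable-length sequences and compare the packed logits against per-sequence HF outputs. The sketch below assumes the packed model returns logits of shape (total_nnz, vocab); the model and argument names are placeholders:

```python
import torch

def test_packed_matches_hf(packed_model, hf_model, input_ids_list):
    # input_ids_list: list of 1-D LongTensors of varying lengths.
    input_ids = torch.cat(input_ids_list)                        # (total_nnz,)
    seqlens = torch.tensor([len(x) for x in input_ids_list])
    cu_seqlens = torch.zeros(len(seqlens) + 1, dtype=torch.int32)
    cu_seqlens[1:] = seqlens.cumsum(0)                           # (total_nnz + 1,)
    max_seqlen_in_batch = int(seqlens.max())

    packed_logits = packed_model(input_ids, cu_seqlens, max_seqlen_in_batch)

    # Reference: run each sequence through the HF model one at a time
    # (batch size 1, no padding) and concatenate along the token dim.
    hf_logits = torch.cat(
        [hf_model(x.unsqueeze(0)).logits.squeeze(0) for x in input_ids_list]
    )
    torch.testing.assert_close(packed_logits, hf_logits, rtol=1e-3, atol=1e-3)
```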
### Step 3: Add a function to apply tensor parallelism
- Please follow
  - https://pytorch.org/docs/stable/distributed.tensor.parallel.html
  - https://pytorch.org/tutorials/intermediate/TP_tutorial.html
- General comments
  - Tensor parallelism in native PyTorch is NOT auto-parallelism. The way it works is that you specify, via configs, how model parameters and module inputs/outputs are resharded. These configs are then registered as hooks that perform the input/output resharding before/after the module's forward (see the sketch after this list).
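For a Llama-style block, a TP plan might look like the following sketch. The module paths assume HF Llama naming; real code must also divide the attention head counts by the TP degree, which is elided here:

```python
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

def apply_tensor_parallelism(model, tp_size):
    tp_mesh = init_device_mesh("cuda", (tp_size,))
    for layer in model.layers:
        plan = {
            # Colwise shards the output dim; rowwise shards the input dim.
            # Pairing them (q/k/v + o, gate/up + down) keeps activations
            # sharded in between, costing one all-reduce per sublayer.
            "self_attn.q_proj": ColwiseParallel(),
            "self_attn.k_proj": ColwiseParallel(),
            "self_attn.v_proj": ColwiseParallel(),
            "self_attn.o_proj": RowwiseParallel(),
            "mlp.gate_proj": ColwiseParallel(),
            "mlp.up_proj": ColwiseParallel(),
            "mlp.down_proj": RowwiseParallel(),
        }
        # Registers the pre/post-forward hooks that reshard inputs/outputs.
        parallelize_module(layer, tp_mesh, plan)
    return model
```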
### Step 4: Add a function to apply data parallelism
- Please use the FSDP2 APIs (a sketch follows this list)
- See the demo here https://github.com/pytorch/torchtitan/blob/main/torchtitan/parallelisms/parallelize_llama.py#L413
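Following the torchtitan demo, a minimal FSDP2 sketch looks like this; note that the `_composable.fsdp` import path is where `fully_shard` lives in recent PyTorch releases and may move in the future:

```python
from torch.distributed._composable.fsdp import fully_shard
from torch.distributed.device_mesh import init_device_mesh

def apply_data_parallelism(model, dp_size):
    dp_mesh = init_device_mesh("cuda", (dp_size,))
    # Shard each transformer block individually so that only one block's
    # parameters are all-gathered at a time during forward/backward.
    for layer in model.layers:
        fully_shard(layer, mesh=dp_mesh)
    # Shard what remains at the root (embeddings, final norm, lm head).
    fully_shard(model, mesh=dp_mesh)
    return model
```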
### Step 5: Add a function to apply pipeline parallelism
- Lands in PyTorch 2.4
- Currently only in alpha in the nightly builds
- Check torchtitan for more details (a tentative sketch follows this list)
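Since the API is still alpha, treat the following only as a rough sketch of the `torch.distributed.pipelining` entry points with manual stage construction; argument details differ across PyTorch versions, so check torchtitan for a maintained reference:

```python
import torch
from torch.distributed.pipelining import PipelineStage, ScheduleGPipe

def apply_pipeline_parallelism(stage_module, stage_index, num_stages,
                               n_microbatches, example_input):
    # stage_module: a manually extracted slice of the model holding only
    # this rank's layers. In PyTorch 2.4 an example input microbatch is
    # passed for shape inference; newer versions can infer shapes lazily.
    device = torch.device("cuda", torch.cuda.current_device())
    stage = PipelineStage(stage_module, stage_index, num_stages, device,
                          input_args=example_input)
    # GPipe schedule: all forward microbatches first, then all backwards.
    return ScheduleGPipe(stage, n_microbatches)
```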