NextCoder-7B-FP8

This is an FP8 quantized version of microsoft/NextCoder-7B for efficient inference on NVIDIA Ada Lovelace and newer GPUs.

Model Description

FP8 (8-bit floating point) quantization of NextCoder-7B, optimized for fast code generation with minimal quality loss.

Quantization Details

  • Original Model: microsoft/NextCoder-7B
  • Quantization Method: FP8 (E4M3) via llm-compressor
  • Model Size: ~14GB (3 sharded safetensors files)
  • Target Hardware: NVIDIA Ada Lovelace (RTX 40xx, RTX 5000 Ada, etc.)
  • Quantization Date: 2025-11-22
  • Quantization Time: 47.0 minutes
  • Hardware Used: NVIDIA RTX 5000 Ada Generation (31.5 GB)

Quantization Infrastructure

Quantized on professional hardware to ensure quality and reliability:

  • CPUs: Dual Intel Xeon Max 9480 (224 threads, 128GB HBM2e)
  • GPU: NVIDIA RTX 5000 Ada Generation (32GB VRAM) with native FP8 support
  • Memory: 256GB DDR5 + 128GB HBM2e = 384GB total
  • Software: Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13 | llm-compressor

Usage

Loading the Model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model with FP8 quantization
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/NextCoder-7B-FP8",
    torch_dtype="auto",  # FP8 weight format is picked up from the checkpoint's quantization config
    device_map="auto",
    low_cpu_mem_usage=True,
)

tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-7B-FP8")

# Generate code
messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs, 
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
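
Once loaded, the checkpoint's quantization settings and approximate weight memory can be inspected. This is an optional sanity check, assuming a transformers version that exposes quantization_config on the model config:

# Optional sanity check (attribute availability depends on the transformers version)
print(model.config.quantization_config)   # FP8 / compressed-tensors details from the checkpoint
print(f"Weight memory: {model.get_memory_footprint() / 1e9:.1f} GB")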

Requirements

pip install "torch>=2.1.0"          # FP8 support requires PyTorch 2.1+
pip install "transformers>=4.40.0"
pip install accelerate
pip install compressed-tensors      # needed by transformers to load llm-compressor FP8 checkpoints

System Requirements:

  • PyTorch 2.1 or newer with CUDA support
  • NVIDIA GPU with FP8 support (Ada Lovelace or newer: RTX 40xx series, RTX 5000 Ada, H100, etc.); a quick capability check is sketched after this list
  • CUDA 11.8 or newer
  • ~14GB VRAM for inference
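
To confirm FP8 capability before loading, the GPU's CUDA compute capability can be checked (Ada Lovelace reports 8.9, Hopper 9.0). A minimal sketch, assuming PyTorch is installed with CUDA support:

import torch

# Ada Lovelace (SM 8.9) and Hopper (SM 9.0) expose native FP8 Tensor Cores;
# treat this as a rough capability check rather than a definitive FP8 probe.
assert torch.cuda.is_available(), "CUDA GPU required"
major, minor = torch.cuda.get_device_capability()
print(torch.cuda.get_device_name(), f"compute capability {major}.{minor}")
print("Native FP8 Tensor Cores:", (major, minor) >= (8, 9))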

Benefits of FP8

  • ~50% memory reduction compared to FP16/BF16
  • Faster inference on Ada Lovelace and Hopper GPUs with native FP8 Tensor Cores
  • Minimal quality loss compared to INT8 or INT4 quantization
  • Native hardware acceleration on modern NVIDIA GPUs

Model Files

This model is sharded into 3 safetensors files:

  • model-00001-of-00003.safetensors
  • model-00002-of-00003.safetensors
  • model-00003-of-00003.safetensors

All files are required for inference.
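
If the shards are needed locally (for example, for offline use), huggingface_hub can fetch the whole repository in one call. A minimal sketch, assuming the huggingface_hub package is installed:

from huggingface_hub import snapshot_download

# Downloads all three shards plus the config, tokenizer, and recipe files into the local cache
local_dir = snapshot_download(repo_id="TevunahAi/NextCoder-7B-FP8")
print(local_dir)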

Original Model

This quantization is based on microsoft/NextCoder-7B by Microsoft.

Please refer to the original model card for:

  • Training details
  • Intended use cases
  • Capabilities and limitations
  • Evaluation results
  • Ethical considerations

Quantization Recipe

This model was quantized using llm-compressor with the FP8 E4M3 format. The quantization recipe is included in recipe.yaml.
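
For reference, the sketch below shows the kind of llm-compressor call that produces an FP8 (E4M3) checkpoint. It is illustrative only and is not the exact recipe.yaml shipped with this model; details such as static versus dynamic activation scales and the ignore list may differ:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot  # older releases import oneshot from llmcompressor.transformers
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "microsoft/NextCoder-7B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP8 (E4M3) quantization of Linear layers, keeping lm_head in higher precision
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

model.save_pretrained("NextCoder-7B-FP8", save_compressed=True)
tokenizer.save_pretrained("NextCoder-7B-FP8")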

License

This model inherits the MIT license from the original NextCoder-7B model.

Citation

If you use this model, please cite the original NextCoder work:

@misc{nextcoder2024,
  title={NextCoder: Next-Generation Code LLM},
  author={Microsoft},
  year={2024},
  url={https://huggingface.co/microsoft/NextCoder-7B}
}

Acknowledgments

  • Original model by Microsoft
  • Quantization performed using Neural Magic's llm-compressor
  • Quantized by TevunahAi