---
license: apache-2.0
language:
- en
- code
tags:
- security
- code-repair
- oss-bench
- moe
- chain-of-thought
- nlp
- c
- cpp
- php
library_name: transformers
pipeline_tag: text-generation
---

<div align="center">

# 🛡️ NxCode-SafeCoder-30B

**The Next-Generation Mixture-of-Experts Model for Secure Code Intelligence**

[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
[🤗 Transformers](https://huggingface.co/docs/transformers/index)

</div>

---

## 🚀 Model Overview

**NxCode-SafeCoder-30B** is a state-of-the-art code generation model engineered specifically for **software security auditing and automated vulnerability remediation**.

Built upon a highly efficient **Mixture-of-Experts (MoE)** architecture, it delivers the knowledge density of a 30B-parameter model while maintaining the inference latency of a much smaller one (only ~3B active parameters per token).

Unlike general-purpose coding assistants, NxCode-SafeCoder is aligned using a **Security-First Chain-of-Thought (CoT)** methodology. It mimics the workflow of a senior security researcher: **Analyze -> Reason -> Fix**.

## ✨ Key Capabilities

* **🛡️ Surgical Vulnerability Patching**: Excels at fixing complex memory-safety issues (buffer overflows, use-after-free, double free) in C/C++ and PHP.
* **🧠 Dual-Phase Generation**: The model is trained to output a detailed **Security Analysis** (`### Analysis`) before generating the **Fixed Code**, ensuring the fix is logically sound and side-effect free (see the parsing sketch after this list).
* **⚡ High-Throughput Inference**: Fully optimized for **vLLM**, achieving **>600 tokens/s** on NVIDIA A100 GPUs, making it suitable for large-scale codebase scanning.
* **🔍 Minimal False Positives**: Produces drastically fewer sanitizer alerts than GPT-4o and Llama-3-70B in fuzzing benchmarks (OSS-Bench).

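As an illustration of the dual-phase output described above, the minimal sketch below splits a raw completion into its analysis and patched-code parts. It assumes the response contains an `### Analysis` section followed by a fenced code block with the fix; the helper name `split_dual_phase` and the exact regexes are illustrative, not part of the model's API.

````python
import re

def split_dual_phase(completion: str) -> tuple[str, str]:
    """Split a completion into (analysis, fixed_code).

    Assumes an '### Analysis' section followed by a fenced code block,
    as described in the capabilities above; adjust if the output differs.
    """
    # Text between '### Analysis' and the first code fence is the analysis.
    analysis_match = re.search(r"### Analysis\s*(.*?)```", completion, re.DOTALL)
    analysis = analysis_match.group(1).strip() if analysis_match else completion.strip()

    # The first fenced block (with an optional language tag) is the patched code.
    code_match = re.search(r"```[\w+]*\n(.*?)```", completion, re.DOTALL)
    fixed_code = code_match.group(1).strip() if code_match else ""

    return analysis, fixed_code
````

For example, `analysis, fix = split_dual_phase(generated_text)` recovers both parts from a decoded completion.
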
## 📊 Performance

*Evaluation based on the OSS-Bench framework (Random Split, PHP-src & SQLite targets).*

| Model | Architecture | Compilation Rate | Test Pass Rate | **Sanitizer Alerts** (Lower is Better) |
| :--- | :--- | :---: | :---: | :---: |
| **NxCode-SafeCoder-30B** | **MoE (30B)** | **High** | **High** | **Lowest** |
| GPT-4o | Dense | High | High | Medium |
| Llama-3-70B-Instruct | Dense | Medium | Medium | High |
| DeepSeek-Coder-33B | Dense | High | Medium | Medium |

> **Note**: While general-purpose models often generate code that compiles, they frequently miss subtle boundary checks or introduce new logic errors. NxCode-SafeCoder prioritizes memory safety above all else.

## 💻 Usage

### 1. Using Transformers

````python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "NxcodeOfficial/NxCode-SafeCoder-30B"

# Load with Flash Attention 2 for best performance
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)

# Standard Security Prompt Template
# Note: The model expects the function to be wrapped in C code blocks
prompt = """You are a Linux Kernel security expert. Fix the vulnerabilities in the following C function.

Repository: linux
File: mm/mmap.c

Function:
```c
void *simple_mmap(void *addr, size_t len) {
    // Vulnerable: No checks
    return mmap(addr, len, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```
"""

messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Generate (the model will output the Analysis first, then the Code)
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
````

### 2. Using vLLM (Production Recommended)

For maximum throughput (e.g., scanning entire repositories), use vLLM.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="NxcodeOfficial/NxCode-SafeCoder-30B",
    trust_remote_code=True,
    tensor_parallel_size=1,       # Fits on a single A100 80GB
    gpu_memory_utilization=0.95,
    max_model_len=8192            # Recommended limit to avoid OOM
)

# ... (inference code)
```

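Continuing the snippet above, here is a minimal inference sketch. It assumes the model's chat template is applied on the tokenizer side before handing plain-text prompts to vLLM; the example prompt content is illustrative.

```python
from transformers import AutoTokenizer

# Build a chat-formatted prompt with the model's tokenizer (assumed chat template).
tokenizer = AutoTokenizer.from_pretrained(
    "NxcodeOfficial/NxCode-SafeCoder-30B", trust_remote_code=True
)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Fix the vulnerabilities in the following C function:\n..."}],
    tokenize=False,
    add_generation_prompt=True,
)

# Low temperature keeps patches conservative; raise max_tokens for long functions.
sampling_params = SamplingParams(temperature=0.2, max_tokens=1024)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
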
## 🔬 Methodology

The model was fine-tuned on a proprietary dataset containing **10k+ high-quality security patches** distilled from advanced reasoning engines. The training process used:
1. **Expert Routing Optimization**: Tuning the MoE router so that specific experts specialize in code analysis vs. code generation.
2. **Conservative Alignment**: Reinforcing a preference for safer standard-library functions (e.g., `strncpy`, `snprintf`) and explicit null-pointer checks.

## 📖 Citation

If you use this model in your research or product, please cite:

```bibtex
@misc{nxcode2025safecoder,
  title        = {NxCode-SafeCoder: Automating Secure Code Repair with MoE},
  author       = {NxCode Team},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/NxcodeOfficial/NxCode-SafeCoder-30B}}
}
```

## ⚖️ License

Apache 2.0