---
license: apache-2.0
language:
- en
- code
tags:
- security
- code-repair
- oss-bench
- moe
- chain-of-thought
- nlp
- c
- cpp
- php
library_name: transformers
pipeline_tag: text-generation
---

<div align="center">

# 🛡️ NxCode-SafeCoder-30B

**The Next-Generation Mixture-of-Experts Model for Secure Code Intelligence**

[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
[🤗 Transformers](https://huggingface.co/docs/transformers/index)

</div>

---

## 🚀 Model Overview

**NxCode-SafeCoder-30B** is a state-of-the-art code generation model engineered specifically for **software security auditing and automated vulnerability remediation**.

Built upon a highly efficient **Mixture-of-Experts (MoE)** architecture, it delivers the knowledge density of a 30B-parameter model while maintaining the inference latency of a much smaller one (only ~3B active parameters per token).

Unlike general-purpose coding assistants, NxCode-SafeCoder is aligned using a **Security-First Chain-of-Thought (CoT)** methodology. It mimics the workflow of a senior security researcher: **Analyze -> Reason -> Fix**.

## ✨ Key Capabilities

* **🛡️ Surgical Vulnerability Patching**: Excels at fixing complex memory-safety issues (buffer overflows, use-after-free, double free) in C/C++ and PHP.
* **🧠 Dual-Phase Generation**: The model is trained to output a detailed **Security Analysis** (`### Analysis`) before generating the **Fixed Code**, ensuring the fix is logically sound and side-effect free (see the parsing sketch after this list).
* **⚡ High-Throughput Inference**: Fully optimized for **vLLM**, achieving **>600 tokens/s** on NVIDIA A100 GPUs, making it suitable for large-scale codebase scanning.
* **🔍 Minimal False Positives**: Produces drastically fewer sanitizer alerts than GPT-4o and Llama-3-70B in fuzzing benchmarks (OSS-Bench).

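As an illustration of the dual-phase output described above, the minimal sketch below splits a raw completion into its analysis and patched-code parts. It assumes the response contains an `### Analysis` section followed by a fenced code block with the fix; the helper name `split_dual_phase` and the exact regexes are illustrative, not part of the model's API.

````python
import re

def split_dual_phase(completion: str) -> tuple[str, str]:
    """Split a completion into (analysis, fixed_code).

    Assumes an '### Analysis' section followed by a fenced code block,
    as described in the capabilities above; adjust if the output differs.
    """
    # Text between '### Analysis' and the first code fence is the analysis.
    analysis_match = re.search(r"### Analysis\s*(.*?)```", completion, re.DOTALL)
    analysis = analysis_match.group(1).strip() if analysis_match else completion.strip()

    # The first fenced block (with an optional language tag) is the patched code.
    code_match = re.search(r"```[\w+]*\n(.*?)```", completion, re.DOTALL)
    fixed_code = code_match.group(1).strip() if code_match else ""

    return analysis, fixed_code
````

For example, `analysis, fix = split_dual_phase(generated_text)` recovers both parts from a decoded completion.
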
## 📊 Performance

*Evaluation based on the OSS-Bench framework (Random Split, PHP-src & SQLite targets).*

| Model | Architecture | Compilation Rate | Test Pass Rate | **Sanitizer Alerts** (Lower is Better) |
| :--- | :--- | :---: | :---: | :---: |
| **NxCode-SafeCoder-30B** | **MoE (30B)** | **High** | **High** | **Lowest** |
| GPT-4o | Dense | High | High | Medium |
| Llama-3-70B-Instruct | Dense | Medium | Medium | High |
| DeepSeek-Coder-33B | Dense | High | Medium | Medium |

> **Note**: While general-purpose models often generate code that compiles, they frequently miss subtle boundary checks or introduce new logic errors. NxCode-SafeCoder prioritizes memory safety above all else.

## 💻 Usage

### 1. Using Transformers

````python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "NxcodeOfficial/NxCode-SafeCoder-30B"

# Load with Flash Attention 2 for best performance
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)

# Standard Security Prompt Template
# Note: The model expects the function to be wrapped in C code blocks
prompt = """You are a Linux Kernel security expert. Fix the vulnerabilities in the following C function.

Repository: linux
File: mm/mmap.c

Function:
```c
void *simple_mmap(void *addr, size_t len) {
    // Vulnerable: No checks
    return mmap(addr, len, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```
"""

messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Generate (the model will output the Analysis first, then the Code)
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
````

### 2. Using vLLM (Production Recommended)

For maximum throughput (e.g., scanning entire repositories), use vLLM.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="NxcodeOfficial/NxCode-SafeCoder-30B",
    trust_remote_code=True,
    tensor_parallel_size=1,       # Fits on a single A100 80GB
    gpu_memory_utilization=0.95,
    max_model_len=8192            # Recommended limit to avoid OOM
)

# ... (inference code)
```

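Continuing the snippet above, here is a minimal inference sketch. It assumes the model's chat template is applied on the tokenizer side before handing plain-text prompts to vLLM; the example prompt content is illustrative.

```python
from transformers import AutoTokenizer

# Build a chat-formatted prompt with the model's tokenizer (assumed chat template).
tokenizer = AutoTokenizer.from_pretrained(
    "NxcodeOfficial/NxCode-SafeCoder-30B", trust_remote_code=True
)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Fix the vulnerabilities in the following C function:\n..."}],
    tokenize=False,
    add_generation_prompt=True,
)

# Low temperature keeps patches conservative; raise max_tokens for long functions.
sampling_params = SamplingParams(temperature=0.2, max_tokens=1024)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
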
## 🔬 Methodology

The model was fine-tuned on a proprietary dataset containing **10k+ high-quality security patches** distilled from advanced reasoning engines. The training process used:
1. **Expert Routing Optimization**: Tuning the MoE router so that specific experts specialize in code analysis vs. code generation.
2. **Conservative Alignment**: Reinforcing a preference for safer standard-library functions (e.g., `strncpy`, `snprintf`) and explicit null-pointer checks.

## 📖 Citation

If you use this model in your research or product, please cite:

```bibtex
@misc{nxcode2025safecoder,
  title        = {NxCode-SafeCoder: Automating Secure Code Repair with MoE},
  author       = {NxCode Team},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/NxcodeOfficial/NxCode-SafeCoder-30B}}
}
```

## ⚖️ License

Apache 2.0