DeepSeek-Coder-1.3B – Clean DSC Model (DSCc)

This repository hosts DSCc, a fine-tuned version of DeepSeek-Coder-1.3B trained for Python function generation from docstrings and function signatures, using a cleaned subset of The Stack.

The model is part of the study:

Quality In, Quality Out: Investigating Training Data’s Role in AI Code Generation
33rd IEEE/ACM International Conference on Program Comprehension (ICPC 2025)

DSCc is specifically trained on a Semgrep-filtered dataset that removes many low-quality and syntactically incorrect functions, allowing us to study how training data quality impacts code generation performance.


Model description

  • Base model: DeepSeek-Coder-1.3B (Python-focused code LLM)
  • Task: Python code generation
  • Input: Python function docstring + signature
  • Output: The corresponding function body in Python

In our experiments, the model is conditioned on a prompt consisting of:

  • A natural-language docstring describing the function behavior
  • The Python function signature

and is then asked to generate the rest of the function body.
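As a concrete illustration, a prompt of this shape can be assembled from a docstring and a signature. The exact template below is an assumption for illustration; the paper's actual prompt layout may differ.

```python
def build_prompt(docstring: str, signature: str) -> str:
    """Assemble a code-generation prompt from a natural-language
    docstring and a Python function signature.

    The concrete layout (signature first, docstring as the function's
    own docstring) is an assumed template, not necessarily the one
    used in the study.
    """
    return f'{signature}\n    """{docstring}"""\n'


# The model would be asked to complete the function body after this prompt.
prompt = build_prompt(
    "Return the sum of two integers.",
    "def add(a: int, b: int) -> int:",
)
print(prompt)
```

The resulting string would then be fed to the model (e.g. via a standard causal-LM `generate` call), with the model expected to continue with the indented function body.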


What does the model do?

The model generates Python functions that implement the behavior described in the docstring and implied by the signature. Typical use cases:

  • Synthesizing a function implementation from a high-level description
  • Suggesting implementations for partially specified functions
  • Exploring how training data quality affects generated code (correctness, style, quality issues)

“Clean” training set (for DSCc)

The initial training set contains ~4.4M pairs. To construct the clean dataset:

  • We run Semgrep (static analysis) on all training functions.
  • Semgrep detects:
    • Low-quality patterns
    • Potentially problematic constructs
    • Syntactically incorrect functions
  • All flagged low-quality / invalid functions are removed.

This yields:

  • clean_training_set.json — ~4.2M pairs
    • Derived from The Stack, with many quality issues and syntax errors filtered out.
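The syntactic-validity part of this filtering step can be sketched in a few lines. Here `ast.parse` stands in for the syntax check only; the actual pipeline runs Semgrep rules (not reproduced here) to also catch low-quality and problematic patterns, so this is an illustrative simplification.

```python
import ast


def is_syntactically_valid(source: str) -> bool:
    """Return True if the function source parses as valid Python.

    This mirrors only the syntax-error filter; the real pipeline
    additionally applies Semgrep static-analysis rules.
    """
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Toy training pairs: one valid function, one with a missing colon.
functions = [
    "def ok(x):\n    return x + 1\n",
    "def broken(x)\n    return x\n",
]
clean = [fn for fn in functions if is_syntactically_valid(fn)]
```

In the toy list above, only the parseable function survives, analogous to how flagged functions are dropped when producing the clean training set.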