Ratan1 committed
Commit aabde66 · 1 Parent(s): 1620846

made changes to attention and download

Files changed (4)
  1. PROJECT_SUMMARY.md +151 -0
  2. data/download.py +104 -49
  3. model/attention.py +110 -27
  4. requirements.txt +1 -0
PROJECT_SUMMARY.md ADDED
@@ -0,0 +1,151 @@
# TransLingo Project Summary

## ✅ Project Setup Complete!

All components of the TransLingo translation system have been implemented and tested.

## 📁 Project Structure

```
translingo/
├── data/                     # Data processing pipeline
│   ├── download.py           # Multi30k dataset downloader
│   └── preprocessing.py      # Dataset and dataloader utilities
├── model/                    # Transformer implementation
│   ├── transformer.py        # Main model class
│   ├── attention.py          # Multi-head attention
│   ├── embeddings.py         # Positional encoding
│   └── layers.py             # Encoder/decoder layers
├── training/                 # Training components
│   ├── train.py              # Main training script with CUDA support
│   ├── loss.py               # Label smoothing loss
│   └── optimizer.py          # Noam learning rate scheduler
├── inference/                # Inference modules
│   ├── beam_search.py        # Beam search decoder
│   └── translate.py          # Translation interface
├── frontend/                 # User interfaces
│   └── gradio_app.py         # Gradio web interface
├── notebooks/                # Training notebooks
│   └── colab_training.py     # Google Colab training script
└── configs/                  # Configuration
    └── config.yaml           # Model and training configs
```

## 🚀 Next Steps

### 1. Push to GitHub
```bash
# Add your GitHub repository as remote
git remote add origin https://github.com/YOUR_USERNAME/translingo.git

# Push the code
git push -u origin main
```

### 2. Train on Google Colab
1. Go to [Google Colab](https://colab.research.google.com/)
2. Create a new notebook
3. Copy the contents from `notebooks/colab_training.py`
4. Follow these steps in the notebook (sketched below):
   - Mount Google Drive (optional, for saving checkpoints)
   - Clone your GitHub repository
   - Install dependencies
   - Run the training script
5. Training will use GPU acceleration automatically

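A minimal sketch of those cells, assuming your fork's URL and the `training/train.py` entry point listed in the structure above (`!` and `%cd` are Colab/IPython cell syntax, not plain Python):

```python
# Hypothetical Colab cells — the repo URL is a placeholder for your fork
from google.colab import drive
drive.mount('/content/drive')   # optional: persist checkpoints to Drive

!git clone https://github.com/YOUR_USERNAME/translingo.git
%cd translingo
!pip install -r requirements.txt
!python training/train.py
```
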
### 3. Download Trained Model
After training completes:
1. Download the checkpoint files from Colab (see the snippet below)
2. Place them in your local `checkpoints/` directory
3. The files you need:
   - `best.pt` or `latest.pt` (model checkpoint)
   - `data/processed/tokenizer.model` (tokenizer)

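One way to pull them out of a Colab session — a sketch assuming the checkpoint paths above:

```python
# Run inside Colab after training finishes; paths assume the layout above
from google.colab import files

files.download('checkpoints/best.pt')
files.download('data/processed/tokenizer.model')
```
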
### 4. Run Gradio Demo
```bash
# Activate the virtual environment
source venv/bin/activate

# Run the demo
python frontend/gradio_app.py

# Or run without a public URL
python frontend/gradio_app.py --no-share
```

## 📊 Model Configuration

- **Architecture**: 3-layer Transformer (optimized for faster training)
- **Model dimension**: 256
- **Attention heads**: 4
- **Feed-forward dimension**: 1024
- **Vocabulary size**: 10,000 (shared BPE)
- **Expected BLEU score**: 18-22 (with full training)

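As a quick sanity check of these dimensions, a minimal sketch using the `MultiHeadAttention` class from `model/attention.py` in this commit (assumes the repo root is on `PYTHONPATH`; shapes follow the class docstrings):

```python
import torch
from model.attention import MultiHeadAttention

# d_model=256 with n_heads=4 gives d_k = 256 // 4 = 64 per head
mha = MultiHeadAttention(d_model=256, n_heads=4, dropout=0.1)
x = torch.randn(2, 10, 256)    # [batch_size, seq_len, d_model]
out, attn = mha(x, x, x)       # self-attention
print(out.shape)               # torch.Size([2, 10, 256])
print(attn.shape)              # torch.Size([2, 4, 10, 10])
```
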
## 🔧 Customization Options

### For Faster Testing
Edit `configs/config.yaml`:
```yaml
model:
  n_layers: 2       # Reduce layers
training:
  num_epochs: 5     # Fewer epochs
  batch_size: 16    # Smaller batches if memory is limited
```

97
+ ### For Better Quality
98
+ ```yaml
99
+ model:
100
+ n_layers: 6 # More layers
101
+ d_model: 512 # Larger model
102
+ training:
103
+ num_epochs: 50 # More training
104
+ vocab_size: 20000 # Larger vocabulary
105
+ ```
106
+
## 🐛 Troubleshooting

### CUDA/GPU Issues
- Ensure you're using a GPU runtime in Colab (Runtime → Change runtime type → GPU)
- Check GPU availability with `torch.cuda.is_available()`, as shown below

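A quick check, using standard PyTorch calls:

```python
# Run in a Colab cell or a local Python shell
import torch

print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```
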
### Memory Issues
- Reduce the batch size in `configs/config.yaml`
- Enable gradient accumulation (already configured; the pattern is sketched below)
- Clear the GPU cache periodically (automatic in the training script)

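Gradient accumulation sums gradients over several small batches and applies a single optimizer step, giving the effect of a larger batch in the same memory. A self-contained sketch of the pattern only — the dummy model and data are illustrative, not the actual loop in `training/train.py`:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)                    # stand-in for the Transformer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
accum_steps = 4                            # effective batch = batch_size * accum_steps

optimizer.zero_grad()
for step in range(16):
    x, y = torch.randn(4, 8), torch.randn(4, 2)   # stand-in mini-batch
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()        # scale so accumulated grads average
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
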
### Import Errors
- The torchtext warning on macOS is normal and handled
- All other imports should work correctly

## 📝 Additional Features

### While the Model is Training
You can work on these components locally:
- FastAPI backend (`api/` directory)
- React frontend (`frontend/web/` directory)
- Docker deployment (`deployment/` directory)
- Additional visualization tools

### Testing Translation
Once you have a trained model:
```bash
# Interactive translation
python inference/translate.py checkpoints/best.pt data/processed/tokenizer.model
```

## 🎯 Success Metrics

- **Training Loss**: should decrease below 2.0
- **Validation BLEU**: target 18-22 for this configuration
- **Inference Speed**: < 500 ms per sentence on GPU

## 📧 Support

If you encounter any issues:
1. Check the test script: `python test_setup.py`
2. Review the logs in the `logs/` directory
3. Ensure all dependencies are installed correctly

Good luck with your translation system! 🌍🔤
data/download.py CHANGED
@@ -1,22 +1,27 @@
 import os
 import torch
+import sentencepiece as spm
+from typing import List, Tuple, Optional, Dict
+import yaml
+import logging
+from tqdm import tqdm
+import urllib.request
+
+try:
+    from datasets import load_dataset
+    HUGGINGFACE_AVAILABLE = True
+except ImportError:
+    HUGGINGFACE_AVAILABLE = False
+    print("Warning: datasets library not available. Install with: pip install datasets")
+
 try:
     from torchtext.datasets import Multi30k
     from torchtext.data.utils import get_tokenizer
     from torchtext.vocab import build_vocab_from_iterator
     TORCHTEXT_AVAILABLE = True
 except Exception as e:
-    print(f"Warning: torchtext import failed: {e}")
-    print("Will use manual download method")
     TORCHTEXT_AVAILABLE = False
-    import sentencepiece as spm
-    from typing import List, Tuple, Optional, Dict
-    import yaml
-    import logging
-    from tqdm import tqdm
-    import urllib.request
-    import tarfile
-    import zipfile
+    print(f"Warning: torchtext import failed: {e}")
 
 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)
@@ -32,37 +37,68 @@ class DataDownloader:
         os.makedirs(os.path.join(self.data_dir, 'processed'), exist_ok=True)
 
     def download_multi30k(self) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]], List[Tuple[str, str]]]:
-        """Download Multi30k dataset"""
+        """Download Multi30k dataset - tries multiple methods"""
         logger.info("Downloading Multi30k dataset...")
 
-        try:
-            # Try using torchtext first if available
-            if TORCHTEXT_AVAILABLE:
+        # Method 1: Try Hugging Face first (most reliable)
+        if HUGGINGFACE_AVAILABLE:
+            try:
+                logger.info("Attempting download from Hugging Face...")
+                return self._download_from_huggingface()
+            except Exception as e:
+                logger.warning(f"Hugging Face download failed: {e}")
+
+        # Method 2: Try torchtext if available
+        if TORCHTEXT_AVAILABLE:
+            try:
+                logger.info("Attempting download with torchtext...")
                 train_data = list(Multi30k(split='train', language_pair=('de', 'en')))
                 valid_data = list(Multi30k(split='valid', language_pair=('de', 'en')))
                 test_data = list(Multi30k(split='test', language_pair=('de', 'en')))
-            else:
-                raise Exception("torchtext not available")
-
-            logger.info(f"Train samples: {len(train_data)}")
-            logger.info(f"Valid samples: {len(valid_data)}")
-            logger.info(f"Test samples: {len(test_data)}")
-
-            # Save to files for later use
-            self._save_data_to_files(train_data, valid_data, test_data)
-
-            return train_data, valid_data, test_data
-
-        except Exception as e:
-            logger.warning(f"Torchtext download failed: {e}")
-            logger.info("Attempting alternative download method...")
-
-            # Alternative: Download from direct URLs
-            return self._download_multi30k_manual()
+
+                logger.info(f"Train samples: {len(train_data)}")
+                logger.info(f"Valid samples: {len(valid_data)}")
+                logger.info(f"Test samples: {len(test_data)}")
+
+                self._save_data_to_files(train_data, valid_data, test_data)
+                return train_data, valid_data, test_data
+            except Exception as e:
+                logger.warning(f"Torchtext download failed: {e}")
+
+        # Method 3: Try manual download from GitHub
+        logger.info("Attempting manual download from GitHub...")
+        return self._download_multi30k_manual()
+
+    def _download_from_huggingface(self) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]], List[Tuple[str, str]]]:
+        """Download Multi30k from Hugging Face datasets hub"""
+        logger.info("Downloading from Hugging Face datasets hub...")
+
+        # Load dataset
+        dataset = load_dataset("bentrevett/multi30k")
+
+        # Convert to expected format: List[Tuple[str, str]]
+        train_data = [(item['de'], item['en']) for item in dataset['train']]
+        valid_data = [(item['de'], item['en']) for item in dataset['validation']]
+        test_data = [(item['de'], item['en']) for item in dataset['test']]
+
+        logger.info(f"✅ Downloaded from Hugging Face:")
+        logger.info(f"   Train samples: {len(train_data)}")
+        logger.info(f"   Valid samples: {len(valid_data)}")
+        logger.info(f"   Test samples: {len(test_data)}")
+
+        # Save to files for consistency with other methods
+        self._save_data_to_files(train_data, valid_data, test_data)
+
+        return train_data, valid_data, test_data
 
     def _download_multi30k_manual(self) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]], List[Tuple[str, str]]]:
-        """Manual download of Multi30k dataset"""
-        base_url = "https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw/"
+        """Manual download of Multi30k dataset from GitHub"""
+        # Try multiple mirror URLs
+        base_urls = [
+            "https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw/",
+            "https://github.com/multi30k/dataset/raw/master/data/task1/raw/",
+            "https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/master/assets/data/"
+        ]
 
         files_to_download = {
             'train.de': 'train.de',
@@ -73,17 +109,28 @@ class DataDownloader:
             'test_2016_flickr.en': 'test.en'
         }
 
-        for remote_file, local_file in files_to_download.items():
-            url = base_url + remote_file
-            output_path = os.path.join(self.data_dir, 'raw', local_file)
-
-            if not os.path.exists(output_path):
-                logger.info(f"Downloading {remote_file}...")
-                try:
-                    urllib.request.urlretrieve(url, output_path)
-                except Exception as e:
-                    logger.error(f"Failed to download {remote_file}: {e}")
-                    return [], [], []
+        success = False
+        for base_url in base_urls:
+            try:
+                for remote_file, local_file in files_to_download.items():
+                    url = base_url + remote_file
+                    output_path = os.path.join(self.data_dir, 'raw', local_file)
+
+                    if not os.path.exists(output_path):
+                        logger.info(f"Downloading {remote_file} from {base_url}...")
+                        urllib.request.urlretrieve(url, output_path)
+
+                success = True
+                logger.info(f"✅ Successfully downloaded from {base_url}")
+                break
+            except Exception as e:
+                logger.warning(f"Failed to download from {base_url}: {e}")
+                continue
+
+        if not success:
+            logger.error("❌ Failed to download from all sources")
+            logger.info("Please install datasets library: pip install datasets")
+            return [], [], []
 
         # Load data from files
         train_data = self._load_parallel_data('train')
@@ -159,12 +206,13 @@ class DataDownloader:
             pad_piece='<pad>',
             unk_piece='<unk>',
             bos_piece='<bos>',
-            eos_piece='<eos>'
+            eos_piece='<eos>',
+            character_coverage=1.0  # Important for handling all characters
         )
 
         # Clean up
         os.remove(temp_file)
-        logger.info(f"SentencePiece model saved to {model_path}")
+        logger.info(f"SentencePiece model saved to {model_path}")
 
     def prepare_tokenizer(self, train_data: List[Tuple[str, str]]) -> None:
         """Prepare tokenizer from training data"""
@@ -182,12 +230,19 @@ class DataDownloader:
         self.train_sentencepiece(all_texts, "tokenizer", vocab_size=self.config['model']['vocab_size'])
 
 if __name__ == "__main__":
+    # Install datasets if not available
+    if not HUGGINGFACE_AVAILABLE:
+        import subprocess
+        print("Installing datasets library...")
+        subprocess.run(["pip", "install", "datasets", "-q"])
+        from datasets import load_dataset
+
     downloader = DataDownloader()
     train_data, valid_data, test_data = downloader.download_multi30k()
 
     if train_data:
         # Train tokenizer
         downloader.prepare_tokenizer(train_data)
-        logger.info("Data download and tokenizer training completed!")
+        logger.info("Data download and tokenizer training completed!")
     else:
-        logger.error("Failed to download data.")
+        logger.error("Failed to download data.")
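For a standalone check of the new Hugging Face path, a minimal sketch — the dataset id and split names are taken from the diff above:

```python
from datasets import load_dataset

ds = load_dataset("bentrevett/multi30k")
print(ds)                # DatasetDict with 'train', 'validation', 'test' splits
print(ds["train"][0])    # one parallel pair with 'de' and 'en' fields
```
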
model/attention.py CHANGED
@@ -5,7 +5,7 @@ import math
 from typing import Optional, Tuple
 
 class ScaledDotProductAttention(nn.Module):
-    """Scaled Dot-Product Attention mechanism"""
+    """Scaled Dot-Product Attention mechanism with numerical stability"""
 
     def __init__(self, temperature: float = 1.0, dropout: float = 0.1):
         super().__init__()
@@ -25,15 +25,26 @@ class ScaledDotProductAttention(nn.Module):
             output: Attention output [batch_size, n_heads, seq_len, d_k]
             attention: Attention weights [batch_size, n_heads, seq_len, seq_len]
         """
-        # Calculate attention scores
-        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.temperature * math.sqrt(q.size(-1)))
+        # Calculate attention scores with temperature scaling
+        d_k = q.size(-1)
+        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.temperature * math.sqrt(d_k))
 
-        # Apply mask if provided
+        # Apply mask if provided - using fp16-safe value
         if mask is not None:
-            scores = scores.masked_fill(mask == 0, -1e9)
-
-        # Apply softmax
+            # Determine safe mask value based on dtype
+            if scores.dtype == torch.float16:
+                mask_value = -65504.0  # Max negative value for fp16
+            else:
+                mask_value = -1e9  # Original value for fp32
+
+            # Use torch.finfo for more robust dtype handling
+            mask_value = torch.finfo(scores.dtype).min if hasattr(torch, 'finfo') else mask_value
+            scores = scores.masked_fill(mask == 0, mask_value)
+
+        # Apply softmax with numerical stability
         attention = F.softmax(scores, dim=-1)
+
+        # Apply dropout
         attention = self.dropout(attention)
 
         # Apply attention to values
@@ -43,21 +54,26 @@
 
 
 class MultiHeadAttention(nn.Module):
-    """Multi-Head Attention mechanism"""
+    """Multi-Head Attention mechanism with improved stability"""
 
-    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
+    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1,
+                 use_bias: bool = True, pre_norm: bool = False):
         super().__init__()
         assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
 
         self.d_model = d_model
         self.n_heads = n_heads
         self.d_k = d_model // n_heads
+        self.pre_norm = pre_norm
+
+        # Linear projections with optional bias
+        self.W_q = nn.Linear(d_model, d_model, bias=use_bias)
+        self.W_k = nn.Linear(d_model, d_model, bias=use_bias)
+        self.W_v = nn.Linear(d_model, d_model, bias=use_bias)
+        self.W_o = nn.Linear(d_model, d_model, bias=use_bias)
 
-        # Linear projections
-        self.W_q = nn.Linear(d_model, d_model)
-        self.W_k = nn.Linear(d_model, d_model)
-        self.W_v = nn.Linear(d_model, d_model)
-        self.W_o = nn.Linear(d_model, d_model)
+        # Initialize weights using Xavier uniform
+        self._init_weights()
 
         # Attention
         self.attention = ScaledDotProductAttention(temperature=1.0, dropout=dropout)
@@ -66,8 +82,15 @@
         self.dropout = nn.Dropout(dropout)
 
         # Layer normalization
-        self.layer_norm = nn.LayerNorm(d_model)
-
+        self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)
+
+    def _init_weights(self):
+        """Initialize weights with Xavier uniform distribution"""
+        for module in [self.W_q, self.W_k, self.W_v, self.W_o]:
+            nn.init.xavier_uniform_(module.weight)
+            if module.bias is not None:
+                nn.init.zeros_(module.bias)
+
     def forward(self, query: torch.Tensor, key: torch.Tensor, value: torch.Tensor,
                 mask: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, torch.Tensor]:
         """
@@ -81,8 +104,13 @@
             output: Multi-head attention output [batch_size, seq_len, d_model]
             attention: Attention weights [batch_size, n_heads, seq_len, seq_len]
         """
-        batch_size = query.size(0)
-        seq_len = query.size(1)
+        batch_size, seq_len, _ = query.size()
+
+        # Pre-norm variant (if enabled)
+        if self.pre_norm:
+            query = self.layer_norm(query)
+            key = self.layer_norm(key)
+            value = self.layer_norm(value)
 
         # Store residual
         residual = query
@@ -104,8 +132,10 @@
         output = self.W_o(attn_output)
         output = self.dropout(output)
 
-        # Add and normalize
-        output = self.layer_norm(output + residual)
+        # Add residual and normalize
+        output = output + residual
+        if not self.pre_norm:
+            output = self.layer_norm(output)
 
         return output, attention_weights
 
@@ -121,7 +151,9 @@ def create_padding_mask(seq: torch.Tensor, pad_idx: int = 0) -> torch.Tensor:
     Returns:
         mask: Padding mask [batch_size, 1, 1, seq_len]
     """
-    return (seq != pad_idx).unsqueeze(1).unsqueeze(2)
+    # Create boolean mask
+    mask = (seq != pad_idx).unsqueeze(1).unsqueeze(2)
+    return mask.to(torch.bool)
 
 
 def create_look_ahead_mask(size: int, device: torch.device) -> torch.Tensor:
@@ -135,8 +167,11 @@ def create_look_ahead_mask(size: int, device: torch.device) -> torch.Tensor:
    Returns:
        mask: Look-ahead mask [1, 1, size, size]
    """
-    mask = torch.triu(torch.ones(size, size, device=device), diagonal=1)
-    return (1 - mask).unsqueeze(0).unsqueeze(0)
+    # Create upper triangular matrix
+    mask = torch.triu(torch.ones(size, size, device=device, dtype=torch.bool), diagonal=1)
+    # Invert it (1 for allowed positions, 0 for masked)
+    mask = ~mask
+    return mask.unsqueeze(0).unsqueeze(0)
 
 
 def create_masks(src: torch.Tensor, tgt: torch.Tensor,
@@ -157,14 +192,62 @@
     # Source mask (padding only)
     src_mask = create_padding_mask(src, pad_idx)
 
-    # Target mask (padding + look-ahead)
+    # Target padding mask
     tgt_pad_mask = create_padding_mask(tgt, pad_idx)
+
+    # Target look-ahead mask
     tgt_len = tgt.size(1)
     tgt_look_ahead_mask = create_look_ahead_mask(tgt_len, tgt.device)
-    tgt_mask = tgt_pad_mask.float() * tgt_look_ahead_mask.float()
-    tgt_mask = tgt_mask.bool()
 
-    # Memory mask (same as source mask but different shape)
+    # Combine padding and look-ahead masks for target
+    # Both masks should be True where attention is allowed
+    tgt_mask = tgt_pad_mask & tgt_look_ahead_mask
+
+    # Memory mask (same as source mask)
     memory_mask = src_mask
 
     return src_mask, tgt_mask, memory_mask
+
+
+# Optional: Flash Attention wrapper (if available)
+try:
+    from torch.nn.functional import scaled_dot_product_attention
+    FLASH_ATTENTION_AVAILABLE = True
+except ImportError:
+    FLASH_ATTENTION_AVAILABLE = False
+
+class FlashAttention(nn.Module):
+    """Flash Attention wrapper for better performance (if available)"""
+
+    def __init__(self, dropout: float = 0.1):
+        super().__init__()
+        self.dropout = dropout
+
+    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
+                mask: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, None]:
+        """
+        Uses PyTorch's scaled_dot_product_attention if available (includes Flash Attention)
+        """
+        if FLASH_ATTENTION_AVAILABLE and mask is None:
+            # Use efficient implementation when no mask
+            output = scaled_dot_product_attention(
+                q, k, v,
+                dropout_p=self.dropout if self.training else 0.0,
+                is_causal=False
+            )
+            return output, None
+        else:
+            # Fallback to standard implementation
+            d_k = q.size(-1)
+            scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)

+            if mask is not None:
+                mask_value = torch.finfo(scores.dtype).min
+                scores = scores.masked_fill(mask == 0, mask_value)
+
+            attention = F.softmax(scores, dim=-1)
+            if self.training and self.dropout > 0:
+                attention = F.dropout(attention, p=self.dropout)
+
+            output = torch.matmul(attention, v)
+            return output, attention
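The new boolean mask helpers can be exercised in isolation; a minimal sketch (assumes the repo root is on `PYTHONPATH`; `pad_idx=0` is the default in `create_padding_mask`'s signature):

```python
import torch
from model.attention import create_padding_mask, create_look_ahead_mask

src = torch.tensor([[5, 6, 7, 0]])            # last position is padding
src_mask = create_padding_mask(src, pad_idx=0)
print(src_mask.shape, src_mask.dtype)         # torch.Size([1, 1, 1, 4]) torch.bool

look_ahead = create_look_ahead_mask(4, torch.device("cpu"))
print(look_ahead[0, 0].int())                 # lower-triangular 1s = allowed positions

# Broadcasting [1, 1, 1, 4] & [1, 1, 4, 4] combines padding and causality,
# matching what create_masks does for the target
tgt_mask = src_mask & look_ahead
print(tgt_mask.shape)                         # torch.Size([1, 1, 4, 4])
```
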
requirements.txt CHANGED
@@ -17,3 +17,4 @@ aiofiles>=23.1.0
 pytest>=7.3.0
 black>=23.3.0
 flake8>=6.0.0
+datasets>=4.4.1