Domain: Foundations & Theory
Tokenization: converting text into discrete units (tokens) for modeling; subword tokenizers such as BPE balance vocabulary size against coverage.
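A minimal sketch of how a subword tokenizer's merge rules can be learned, in the spirit of byte-pair encoding (BPE): repeatedly merge the most frequent adjacent symbol pair. The function name and corpus are illustrative, not from any particular library.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    # Each word starts as a sequence of characters; merges build subwords.
    tokens = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in tokens:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        for seq in tokens:
            i = 0
            while i < len(seq) - 1:
                if (seq[i], seq[i + 1]) == best:
                    seq[i:i + 2] = [merged]  # replace the pair in place
                else:
                    i += 1
    return merges, tokens
```

More merges grow the vocabulary but shorten token sequences; stopping early keeps the vocabulary small at the cost of longer sequences, which is the trade-off the entry above describes.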
Top-k sampling: samples from only the k highest-probability tokens, cutting off the low-probability tail to limit unlikely outputs.
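A self-contained sketch of top-k sampling over raw logits (pure Python, names illustrative): keep the k highest-logit tokens, renormalize with a softmax over just that set, and sample.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    # Indices of the k highest-logit tokens.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the kept set (max-shifted for stability).
    mx = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - mx) for i in top]
    total = sum(weights)
    # Sample from the renormalized distribution.
    r = rng.random() * total
    acc = 0.0
    for idx, w in zip(top, weights):
        acc += w
        if r <= acc:
            return idx
    return top[-1]
```

With k=1 this reduces to greedy decoding; larger k trades determinism for diversity.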
Top-p (nucleus) sampling: samples from the smallest set of tokens whose cumulative probability reaches p, so the candidate set adapts in size to the context.
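A matching sketch of nucleus (top-p) sampling (pure Python, names illustrative): sort tokens by probability, keep the smallest prefix whose mass reaches p, then sample within that nucleus.

```python
import math
import random

def top_p_sample(logits, p, rng=random):
    # Full softmax (max-shifted for stability).
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    z = sum(exps)
    ranked = sorted(((e / z, i) for i, e in enumerate(exps)), reverse=True)
    # Smallest prefix whose cumulative probability reaches p.
    nucleus, cum = [], 0.0
    for prob, i in ranked:
        nucleus.append((prob, i))
        cum += prob
        if cum >= p:
            break
    # Sample within the nucleus, renormalized to its total mass.
    r = rng.random() * cum
    acc = 0.0
    for prob, i in nucleus:
        acc += prob
        if r <= acc:
            return i
    return nucleus[-1][1]
```

Unlike top-k, the nucleus shrinks when the model is confident (one token may already carry mass p) and widens when the distribution is flat.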
Trust region methods: constrain each optimization update to a region around the current parameters where the local approximation of the objective is trusted, preventing destructively large steps (e.g., TRPO in reinforcement learning).
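A toy sketch of the idea, assuming a simple norm-based trust region (not TRPO's KL constraint): take the proposed gradient step, but shrink it whenever it would leave a ball of the given radius.

```python
def trust_region_step(params, grad, lr, radius):
    # Proposed step from plain gradient descent.
    step = [lr * g for g in grad]
    norm = sum(s * s for s in step) ** 0.5
    # If the step leaves the trust region, rescale it onto the boundary.
    if norm > radius:
        step = [s * radius / norm for s in step]
    return [p - s for p, s in zip(params, step)]
```

Real trust-region methods also grow or shrink the radius based on how well the local model predicted the actual improvement; this sketch only shows the step-capping part.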
Underfitting: the model is too simple to capture the underlying structure of the data, so it performs poorly on both training and test sets.
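A toy illustration with assumed data: an underfit "model" that ignores its inputs and always predicts the training mean leaves large error on the training set and the test set alike, the signature described above.

```python
def constant_model_mse(train_y, test_y):
    # Deliberately underfit model: a single constant, the training mean.
    pred = sum(train_y) / len(train_y)
    mse = lambda ys: sum((pred - y) ** 2 for y in ys) / len(ys)
    # High error on BOTH splits is the hallmark of underfitting
    # (overfitting would show low train error but high test error).
    return mse(train_y), mse(test_y)
```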
Vanishing gradients: gradients shrink as they are backpropagated through layers, slowing learning in early layers; mitigated by ReLU activations, residual connections, and normalization.
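A numeric sketch of why the shrinkage happens, assuming a chain of sigmoid layers: each layer multiplies the backpropagated gradient by w * sigma'(z), and sigma' is at most 0.25, so the product decays roughly geometrically with depth.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gradient_magnitude(depth, weight=1.0, x=0.0):
    # Forward through `depth` sigmoid layers while accumulating the
    # backprop factor w * sigma'(z) per layer; sigma'(z) <= 0.25.
    grad = 1.0
    z = x
    for _ in range(depth):
        s = sigmoid(z)
        grad *= weight * s * (1.0 - s)
        z = weight * s
    return grad
```

ReLU (derivative 1 on the active side) and residual connections (an additive identity path for the gradient) avoid this per-layer multiplicative decay.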
Weight initialization: methods for setting starting weights so that signal and gradient magnitudes are preserved across layers (e.g., Xavier/Glorot and He initialization).
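A minimal sketch of Xavier/Glorot uniform initialization: drawing weights with variance proportional to 2 / (fan_in + fan_out) keeps activation and gradient scales roughly constant across layers. Function name is illustrative.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng=random):
    # Glorot/Xavier uniform: U(-limit, limit) with
    # limit = sqrt(6 / (fan_in + fan_out)), giving Var(w) = 2/(fan_in+fan_out).
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]
```

He initialization uses variance 2 / fan_in instead, compensating for ReLU zeroing half of its inputs.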