[LLM/NLP] A Thorough Examination of Decoding Methods in the Era of LLMs

Title

A Thorough Examination of Decoding Methods in the Era of LLMs

Links

arxiv

Summary

This paper compares and analyzes decoding methods, the algorithms that choose output tokens in Large Language Models (LLMs), and discusses the observations that follow from that comparison.

Decoding methods play a key role in turning next-token predictors into text generators, and they have a significant effect on LLM accuracy. The authors therefore set out to answer one question: "What is the best practice for choosing decoding methods in the era of LLMs?"

Decoding methods

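A decoding method defines how a token sequence is generated from the model's predicted next-token probabilities over a fixed vocabulary. Broadly, there are deterministic methods (the next token is always chosen by a fixed rule) and stochastic methods (the next token is chosen with some randomness). The methods covered in the paper are listed briefly below.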

Deterministic Methods

Greedy Search: selects the token with the highest probability at each time step.
Beam Search: maintains a beam of the k most probable sequences at each time step.
Diverse Beam Search: a variant of Beam Search that adds a diversity term to maximize inter-group diversity.
Contrastive Search: uses a look-ahead mechanism and penalizes tokens that compromise the isotropy of the LM's latent space.
Contrastive Decoding: maximizes the probability difference between the LLM and a weaker amateur model.
Frustratingly Simple Decoding: exploits the contrast between the LLM and an auxiliary anti-LM constructed from the current prefix.
DoLa: obtains the next-token distribution by contrasting the logits of the last layer with those of a premature layer.
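
As a rough illustration of the first two entries, the sketch below runs greedy search and beam search on a tiny hand-written probability table. `TOY_MODEL` and `next_token_probs` are hypothetical stand-ins for a real LLM forward pass and do not come from the paper; the table is chosen so that the two methods disagree.

```python
import math

# Hypothetical toy "language model": maps a prefix to next-token probabilities.
# A real LLM would compute these with a forward pass over the prefix.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}


def next_token_probs(prefix):
    """Look up the next-token distribution for a prefix (assumed helper)."""
    return TOY_MODEL[tuple(prefix)]


def greedy_search(steps):
    """Pick the single most probable token at every step."""
    seq = []
    for _ in range(steps):
        probs = next_token_probs(seq)
        seq.append(max(probs, key=probs.get))
    return seq


def beam_search(steps, k):
    """Keep the k highest-scoring partial sequences at every step."""
    beams = [([], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, p in next_token_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams[0][0]


print(greedy_search(2))   # ['a', 'x']  (sequence probability 0.6 * 0.55 = 0.33)
print(beam_search(2, 2))  # ['b', 'x']  (sequence probability 0.4 * 0.9 = 0.36)
```

Greedy search commits to "a" because it is the locally best first token, while a beam of width 2 keeps "b" alive long enough to find the more probable full sequence.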

Stochastic Methods

Temperature Sampling: controls the skewness of the distribution with a temperature hyperparameter.
Top-p Sampling: samples only from the smallest set of most probable tokens whose cumulative probability reaches p.
Top-k Sampling: samples only from the k most probable tokens.
η-Sampling: truncates words whose probabilities fall below an entropy-dependent threshold.
Mirostat Sampling: directly controls the perplexity of the generated text while sampling from the top-k tokens.
Typical Sampling: sorts the vocabulary by how far each token's information content (negative log-probability) is from the entropy of the distribution, and samples from the closest tokens.
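
The first three entries can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions (a short list of logits instead of a full vocabulary, hypothetical function names), not the implementation evaluated in the paper.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    keep_set = set(keep)
    total = sum(probs[i] for i in keep_set)
    return [probs[i] / total if i in keep_set else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


def sample(probs, rng=random):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = softmax_with_temperature([2.0, 1.0, 0.5, -1.0], temperature=0.7)
token_id = sample(top_p_filter(top_k_filter(probs, 3), 0.9))
```

The last two lines chain the filters the way many generation APIs do (temperature first, then top-k, then top-p), but that ordering is an assumption of this sketch rather than something the source specifies.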

Key findings

In general, deterministic methods work better on closed-ended tasks, while stochastic methods work better on open-ended tasks.
For unaligned models (models without instruction tuning), deterministic methods generally perform better on everything except open-ended text generation.
Aligned models (instruction-tuned models) are comparatively less affected by the choice of decoding method.
Deterministic methods hallucinate less and follow instructions better.
Among stochastic methods, temperature sampling generally performs well, and the gap is especially large for unaligned models.
The optimal hyperparameters for a decoding method are not globally optimal: they depend on the model, the task, the quantization settings, and so on. There is also a performance-sensitivity trade-off.
Because stochastic methods can produce diverse outputs, they can be combined with the self-consistency strategy (picking the final answer by majority vote over multiple generations) to improve performance.
As model size grows, the impact of the decoding strategy shrinks.
In general, there is a trade-off between diversity and quality.
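
The self-consistency strategy mentioned in the findings above comes down to a majority vote over several sampled answers. The sketch below assumes a hypothetical `sample_answer` callable that wraps one stochastic generation and returns its extracted final answer; here it is replaced by a canned stub so the example runs without a model.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def self_consistency(sample_answer, n_samples):
    """Call a stochastic generator n_samples times and vote on the result."""
    answers = [sample_answer() for _ in range(n_samples)]
    return majority_vote(answers)


# Stub standing in for five sampled generations (hypothetical outputs).
canned = iter(["18", "17", "18", "18", "20"])
print(self_consistency(lambda: next(canned), 5))  # prints 18
```

With a deterministic decoder every call would return the same answer and the vote would add nothing, which is why this strategy pairs naturally with stochastic methods.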