Large Language Models: A Comprehensive Evaluation on Natural Language Generation Tasks

Emad Dehnavi
2 min read · May 19, 2024

Large language models (LLMs) have been making waves lately, with new models like ChatGPT, Meta-Llama-3, and Flan-T5-XXL hitting the market. But how do these models stack up on natural language generation (NLG) tasks? That's what a team of researchers set out to discover in a recent study (read the full paper here).

The Players in the Game

The team evaluated 11 LLMs on a range of English and Chinese datasets, testing their performance on dialogue generation, text summarization, and story generation tasks.

  1. ChatGPT (175B)
  2. ChatGLM (6B)
  3. Flan-T5-XXL (13B)
  4. FastChat-T5 (3B)
  5. Open-LLaMA (7B)
  6. Vicuna (13B)
  7. Alpaca-Lora (7B)
  8. Chinese-Alpaca (13B)
  9. GPT4ALL (13B)
  10. Dolly (12B)
  11. Oasst-Pythia (12B)

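To make the setup a bit more concrete, here is a minimal sketch of what one of these evaluation runs might look like, not the authors' actual pipeline. It assumes the Hugging Face transformers and evaluate libraries, uses google/flan-t5-xxl as a stand-in for one of the evaluated models, and the prompt, toy dataset, and ROUGE scoring are illustrative assumptions rather than the paper's protocol.

```python
# Sketch of a summarization evaluation loop (illustrative, not the paper's setup).
# Assumes the Hugging Face `transformers` and `evaluate` libraries are installed.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import evaluate

model_name = "google/flan-t5-xxl"  # encoder-decoder model; swap in any model under test
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map="auto")

rouge = evaluate.load("rouge")

def summarize(document: str, max_new_tokens: int = 128) -> str:
    """Generate a summary for a single document."""
    prompt = f"Summarize the following article:\n\n{document}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Hypothetical evaluation pairs of (document, reference summary).
eval_set = [
    ("Large language models are trained on vast text corpora and ...",
     "LLMs learn language patterns from large text corpora."),
]

predictions = [summarize(doc) for doc, _ in eval_set]
references = [ref for _, ref in eval_set]
print(rouge.compute(predictions=predictions, references=references))
```

The same loop generalizes to the dialogue and story generation tasks by changing the prompt and the reference data; decoder-only models would simply swap in AutoModelForCausalLM.
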
The results offer valuable insights into the strengths and weaknesses of these cutting-edge models. One key finding was the superiority of encoder-decoder models in understanding input instructions. Flan-T5-XXL and FastChat-T5, both encoder-decoder models, consistently…
