
Large Language Models: A Comprehensive Evaluation on Natural Language Generation Tasks

Emad Dehnavi
2 min read · May 19, 2024


Large language models (LLMs) have been making waves lately, with new models like ChatGPT, Meta-Llama-3 and Flan-T5-XXL hitting the market. But how do these models stack up on natural language generation (NLG) tasks? That’s what a team of researchers set out to discover in a recent study (see the reference at the end of this post).

The Players in the Game

The team evaluated 11 LLMs on a range of English and Chinese datasets, testing their performance on dialogue generation, text summarization, and story generation tasks.

  1. ChatGPT (175B)
  2. ChatGLM (6B)
  3. Flan-T5-XXL (13B)
  4. FastChat-T5 (3B)
  5. Open-LLaMA (7B)
  6. Vicuna (13B)
  7. Alpaca-Lora (7B)
  8. Chinese-Alpaca (13B)
  9. GPT4ALL (13B)
  10. Dolly (12B)
  11. Oasst-Pythia (12B)

The results offer valuable insights into the strengths and weaknesses of these cutting-edge models. One key finding was the superiority of encoder-decoder models at understanding input instructions: Flan-T5-XXL and FastChat-T5, the two encoder-decoder models in the lineup, consistently produced outputs that needed the least post-processing. This suggests that a dedicated encoder helps a model grasp the task at hand.

The study also highlighted the importance of model scale. ChatGPT, the largest model in the study with 175 billion parameters, consistently ranked among the top performers across all datasets. This reinforces the idea that bigger models tend to perform better.

However, the evaluation also revealed some challenges for LLMs. When faced with the task of generating longer texts, as in the WritingPrompts dataset, many models struggled to follow instructions effectively, resulting in low BLEU scores.
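Since low BLEU scores are what signal this failure, it helps to see how the metric works. Below is a minimal sketch of a smoothed sentence-level BLEU score; the implementation details and example sentences are illustrative, not taken from the paper:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, hypothesis, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of smoothed
    n-gram precisions, multiplied by a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(ref, n))
        hyp_counts = Counter(ngrams(hyp, n))
        # Clipped overlap: an n-gram only counts as often as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages hypotheses much shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * geo_mean

ref = "the cat sat on the mat"
print(bleu(ref, "the cat sat on the mat"))              # perfect overlap -> 1.0
print(bleu(ref, "a story about something else entirely"))  # no overlap -> near zero
```

A model that drifts off-prompt in a long story shares almost no n-grams with the reference text, which is exactly why instruction-following failures show up as low BLEU.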

To address this, the researchers experimented with fine-tuning some of the LLMs using parameter-efficient methods like LoRA and P-Tuning V2. The results were impressive, with significant improvements across metrics, especially n-gram matching scores. This demonstrates that fine-tuning can enhance the performance of LLMs without requiring massive computational resources.
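To see why LoRA is parameter-efficient, here is a minimal numpy sketch of the low-rank update it adds to a frozen weight matrix. All names, dimensions, and the scaling value are assumptions for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical frozen weight matrix from one pretrained layer.
d_out, d_in, r = 512, 512, 8            # r is the small LoRA rank
W = rng.normal(size=(d_out, d_in))      # frozen: never updated during fine-tuning
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # trainable, zero-init so W' == W at the start
alpha = 16                              # LoRA scaling hyperparameter

def adapted_forward(x):
    # Effective weight W' = W + (alpha / r) * B @ A; only A and B get gradients.
    return x @ (W + (alpha / r) * B @ A).T

full_params = W.size            # what full fine-tuning would update
lora_params = A.size + B.size   # what LoRA updates instead
print(f"trainable: {lora_params} vs full fine-tuning: {full_params}")
# 8192 vs 262144 -- about 3% of this layer's parameters
```

Because only the two small factors are trained, a 7B- or 13B-parameter model can be adapted on a single GPU, which is the "without massive computational resources" point above.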

In conclusion, the comprehensive evaluation of LLMs on NLG tasks revealed both the strengths and limitations of these models. While they excel at certain tasks and demonstrate a strong understanding of instructions, there is still room for improvement, especially when it comes to generating longer texts. The study underscores the importance of continued research and development in this field, as well as the potential of fine-tuning methods to enhance the capabilities of LLMs.

Reference: Xuanfan Ni and Piji Li, “A Systematic Evaluation of Large Language Models for Natural Language Generation Task”.


Written by Emad Dehnavi

With 8 years as a software engineer, I write about AI and technology in a simple way. My goal is to make these topics easy and interesting for everyone.
