Massively Multilingual Instruction-Following Information Extraction

MASSIE - a large-scale instruction tuning benchmark for diverse and reliable evaluation of multilingual information extraction, featuring 5 core tasks and 95+ typologically distinct languages aggregated from 200+ human-annotated datasets.

[Teaser figure]

Abstract

Recent trends in generative information extraction (IE) have brought forth advancements in developing instruction-following language models that can handle unseen schemas with variable task requirements. However, existing instruction benchmarks for IE are limited in linguistic coverage, mostly featuring monolingual (English) or bilingual data sources (e.g. English and Chinese). To remedy this, we propose MASSIE - a large-scale instruction-based IE benchmark covering 5 tasks and 95+ languages. Utilizing MASSIE, we conduct experiments on state-of-the-art multilingual large language models, focusing on in-context learning (where models only have access to few-shot exemplars) and supervised fine-tuning (training with a subset of languages). Overall, we observe significant imbalance in model performance across combinations of tasks and languages, even in highly parallel datasets. Our analyses reveal much room for improvement in current instruction-following language models for multilingual IE.

The MASSIE Benchmark

MASSIE (MASSively multilingual instruction-tuned Information Extraction) is designed to enable rigorous evaluation and advance research on multilingual information extraction (IE) through instruction following. In particular, MASSIE features 5 extraction tasks across 95+ languages aggregated from 200+ human-annotated datasets. Samples are collected from a variety of domains (e.g. finance, health, entertainment) and contain inputs of varying context lengths. In addition, MASSIE includes traditionally under-explored but essential scenarios such as code-mixing and dialectal variants.

We construct two versions of MASSIE, namely M-Heavy and M-Light. M-Heavy is a comprehensive benchmark with 17M samples, but its sheer size makes it impractical for iterative development. M-Light is a condensed alternative to M-Heavy obtained through dataset-wise downsampling, containing 2.3M samples while preserving the same set of languages and domains. We encourage users to develop models on M-Light and reserve M-Heavy for final evaluation. A rough sketch of such dataset-wise downsampling is given below.
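
The following is a minimal sketch of how dataset-wise downsampling of this kind could be implemented; the per-dataset cap and the uniform sampling strategy are illustrative assumptions, not the exact procedure used to build M-Light.

import random

def downsample(datasets, cap=10000, seed=0):
    """Dataset-wise downsampling: keep at most `cap` samples per dataset.

    `datasets` maps a dataset name to its list of samples. Every dataset
    (and hence every language and domain it covers) stays represented;
    only the sample count per dataset shrinks. `cap` is a hypothetical
    parameter for illustration.
    """
    rng = random.Random(seed)
    light = {}
    for name, samples in datasets.items():
        if len(samples) <= cap:
            light[name] = list(samples)
        else:
            light[name] = rng.sample(samples, cap)
    return light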

[Figure: tree map of datasets sized by sample quantity]

Soft Evaluation

The standard F1 score with exact matching gives no credit to partially correct span outputs, while metrics for open-domain text generation (e.g. BLEU, ROUGE) are not structure-aware. Thus, we propose LF1 - a modified metric based on the Levenshtein distance that captures partial correctness while remaining structure-aware.
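
As a minimal sketch of the idea, the snippet below scores each predicted span by its normalized Levenshtein similarity to the best-matching gold span and aggregates into a soft F1. This is an illustration only: the exact matching, normalization, and structure-aware aggregation rules of LF1 (e.g. requiring label or relation agreement before comparing spans) may differ.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def soft_f1(pred_spans, gold_spans) -> float:
    """Soft F1: each span earns partial credit equal to its best
    similarity against the other set, instead of exact-match 0/1."""
    if not pred_spans or not gold_spans:
        return float(pred_spans == gold_spans)  # both empty -> 1.0, else 0.0
    precision = sum(max(similarity(p, g) for g in gold_spans)
                    for p in pred_spans) / len(pred_spans)
    recall = sum(max(similarity(g, p) for p in pred_spans)
                 for g in gold_spans) / len(gold_spans)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)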

In-context learning

By default, we randomly sample a set of K=3 exemplars from an English dataset for each task and fix this set throughout ICL evaluation. Since not every language contains enough samples, this ensures consistent prompts and facilitates language-wise comparison.
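
The sketch below illustrates this protocol: a single set of K=3 exemplars is drawn once from the English data of a task and reused for every target language. The prompt layout (field names and ordering) is a hypothetical template, not the exact prompt format used in the benchmark.

import random

K = 3  # number of in-context exemplars, fixed across all target languages

def select_exemplars(english_dataset, k=K, seed=0):
    """Sample k exemplars once from a task's English dataset and reuse
    the same set for every target language."""
    return random.Random(seed).sample(english_dataset, k)

def build_prompt(instruction, exemplars, test_input):
    """Hypothetical prompt layout: instruction, fixed exemplars, then the query."""
    demos = "\n\n".join(f"Input: {x['input']}\nOutput: {x['output']}"
                        for x in exemplars)
    return f"{instruction}\n\n{demos}\n\nInput: {test_input}\nOutput:"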

Task difficulty aligns with label complexity

Compared to single-span tasks (NER/SF/ED), LLMs achieve much lower results on complex tasks requiring triplets (RE) or hierarchical structures (EE).

Scaling law remains effective

Within the same architecture family, larger models achieve incrementally better results.

Language-specific instructions do not help

Surprisingly, replacing the English task-specific instructions with translated variants in the target languages either does not change or even harms model performance in most cases.

Increasing number of exemplars gives mixed results

Model performance fluctuates greatly with additional exemplars and shows no consistent trend. Often, using as few as 3 exemplars gives the best results.

Language-specific exemplars do not scale

Replacing the English exemplars with exemplars in the target language does not produce any consistent shift in model performance.

Performance varies significantly among demonstration sets

Large disparity between language groups

BibTeX

@article{2024massie,
  author    = {Anonymous},
  title     = {Massively Multilingual Instruction-Following Information Extraction},
  year      = {2024},
}