📜  AraBERT - Python

📅  Last modified: 2023-12-03 15:29:27.282000             🧑  Author: Mango

AraBERT - Python

AraBERT is an open-source, PyTorch-based library that provides pretrained language models for Arabic natural language processing (NLP) tasks. It is built on the BERT architecture, a state-of-the-art transformer model for NLP.

Features

AraBERT comes with pretrained models that can be applied to a range of language processing tasks, such as:

  • Sentiment Analysis
  • Named Entity Recognition (NER)
  • Text Classification
  • Question Answering
  • Masked Language Modeling (MLM)
  • Seq2Seq Tasks

AraBERT also supports fine-tuning of pretrained models on custom datasets for specific tasks (see the Fine-tuning section below).
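
For instance, the masked language modeling task listed above can be tried in a few lines through the Hugging Face fill-mask pipeline. This is a minimal sketch; 'aubmindlab/bert-base-arabertv2' is one of the publicly released AraBERT checkpoints and is used here purely as an illustration:

from transformers import pipeline

# Load an AraBERT checkpoint into a fill-mask pipeline.
fill_mask = pipeline('fill-mask', model='aubmindlab/bert-base-arabertv2')

# Ask the model to fill in the [MASK] token of an Arabic sentence
# ("The capital of France is [MASK].").
predictions = fill_mask("عاصمة فرنسا هي [MASK].")
for p in predictions:
    print(p['token_str'], p['score'])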

Installation

To install AraBERT, you can use pip:

pip install arabert

Note: AraBERT requires PyTorch >= 1.0 and transformers >= 3.0.
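
A quick sanity check that the environment meets these minimums (just a version printout, not part of the AraBERT API):

import torch
import transformers

# The printed versions should satisfy the minimums noted above.
print(torch.__version__)
print(transformers.__version__)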

Usage

Tokenization

AraBERT provides tokenizers for Arabic text that convert raw text into the subword tokens a model consumes; they are distributed as Hugging Face checkpoints. Here's an example:

from transformers import AutoTokenizer

# AraBERT tokenizers are loaded from the Hugging Face Hub;
# 'aubmindlab/bert-base-arabert' is the original AraBERT checkpoint.
tokenizer = AutoTokenizer.from_pretrained('aubmindlab/bert-base-arabert')

text = "مرحبا بالعالم"
tokens = tokenizer.tokenize(text)

print(tokens)
# A list of WordPiece tokens; the exact pieces depend on the checkpoint's vocabulary.

The tokenize method returns a list of subword tokens. The from_pretrained method downloads and loads the pretrained tokenizer that matches the named checkpoint.
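
The AraBERT authors also recommend normalizing text with the library's preprocessor before tokenization (for v1 models this includes Farasa segmentation). A minimal sketch following the pattern in the AraBERT README:

from arabert.preprocess import ArabertPreprocessor

# The preprocessor is configured with the target model name so that it
# applies the normalization/segmentation rules matching that checkpoint.
model_name = "aubmindlab/bert-base-arabertv2"
arabert_prep = ArabertPreprocessor(model_name=model_name)

text = "مرحبا بالعالم"
print(arabert_prep.preprocess(text))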

Sentiment Analysis

Pretrained Arabic sentiment analysis checkpoints built on this architecture can be loaded through transformers. Here's an example:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_name = 'asafaya/bert-base-arabic-sentiment-analysis'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "ايجابي جدا"
inputs = tokenizer(text, return_tensors='pt')

# Run the classifier without tracking gradients.
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.nn.functional.softmax(logits, dim=1)[0]
# The label order assumes index 0 = Negative, 1 = Positive;
# check model.config.id2label for the checkpoint's actual mapping.
sentiment_label = ['Negative', 'Positive'][probs.argmax().item()]

print(sentiment_label)
# Output: 'Positive'

The code above loads a pre-trained model for sentiment analysis and uses it to classify the sentiment of the given text.
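
The same classification can be written more compactly with the transformers pipeline API. A sketch, assuming the same checkpoint as above:

from transformers import pipeline

# A text-classification pipeline bundles tokenization, inference,
# and label mapping into a single call.
classifier = pipeline('text-classification',
                      model='asafaya/bert-base-arabic-sentiment-analysis')

results = classifier("ايجابي جدا")
print(results)  # a list of {'label': ..., 'score': ...} dicts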

Fine-tuning

AraBERT also supports fine-tuning of pretrained models on custom datasets for specific tasks. Here's a skeleton of the overall flow:

from transformers import AutoModelForSequenceClassification, AutoTokenizer, TextClassificationPipeline

model_name = 'asafaya/bert-base-arabic-sentiment-analysis'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Load a custom dataset (left as placeholders here)
train_dataset = ...
test_texts = ...

# Put the model in training mode and run a training loop over
# train_dataset (e.g. with transformers' Trainer; see the sketch below)
model.train()
...

# Evaluate the fine-tuned model
model.eval()
pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)
results = pipeline(test_texts)  # expects a string or a list of strings

print(results)

The skeleton above shows the overall flow of fine-tuning a pretrained model on a custom dataset; the TextClassificationPipeline then wraps the fine-tuned model and tokenizer to classify raw text.
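
One common way to fill in the training step is transformers' Trainer API. A minimal sketch, assuming train_dataset and eval_dataset are already tokenized datasets yielding input_ids, attention_mask, and labels (the names and hyperparameters are illustrative, not AraBERT defaults):

from transformers import Trainer, TrainingArguments

# Illustrative hyperparameters; tune them for the task at hand.
training_args = TrainingArguments(
    output_dir='./arabert-finetuned',
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,                  # the pretrained model loaded above
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()
trainer.save_model('./arabert-finetuned')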

Conclusion

AraBERT is a powerful library for Arabic natural language processing that provides pretrained models and supports fine-tuning on custom datasets. With its straightforward interface, you can quickly get started and build state-of-the-art Arabic NLP applications.