Metric prompt templates for model-based evaluation

This page provides a list of templates that you can use for model-based evaluation with the Gen AI evaluation service. For more information about model-based metrics, see Define your own metrics.

Overview

For model-based evaluation, we send a prompt to the judge model, which generates a metric score based on the specified criteria, rating rubric, and other instructions.

The following table summarizes the available example metric prompt templates:

	Text use cases	Multi-turn conversation use cases	Other key use cases
Pointwise	Fluency, Coherence, Groundedness, Safety, Instruction Following, Verbosity, Text Quality	Multi-turn Chat Quality, Multi-turn Safety	Summarization Quality, Question Answering Quality
Pairwise	Fluency, Coherence, Groundedness, Safety, Instruction Following, Verbosity, Text Quality	Multi-turn Chat Quality, Multi-turn Safety	Summarization Quality, Question Answering Quality

Structure a metric prompt template

A metric prompt template should include the following main sections:

Instruction
Evaluation
User inputs and AI-generated response

Each section can contain subsections.

Instruction

Component	Function	Type	Example
Instruction	Includes the persona for the judge model and a short description of its task.	Default value	`You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models. We will provide you with the user input and AI-generated responses. You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below. You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.`

Evaluation

Component	Function	Type	Example
Metric definition	Specifies the name and definition of the metric.	Optional user input	`You will be assessing a metric called SummarizationQuality, which measures the overall ability to summarize text`
Criteria	Specifies the criteria (and optional subcriteria) for the metric.	Required user input	`Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements. Groundedness: The response contains information included only in the context. The response does not reference any outside information.`
Rating rubric	Specifies the scoring scale for the metric, with an explanation of what each score means.	Required user input	`5: (Very good). The summary follows instructions, is grounded, is concise, and fluent. 4: (Good). The summary follows instructions, is grounded, concise, and fluent. 3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent. 2: (Bad). The summary is grounded, but does not follow the instructions. 1: (Very bad). The summary is not grounded.`
Few-shot examples	Examples of the task.	Optional user input. Note: Few-shot examples can improve not only performance but also the formatting of the judge model's response. We recommend starting with 5-10 few-shot examples.	`RESPONSE: Purple monkeys jumped onto the submarine while Beethoven's Fifth Symphony played loudly and the chef cooked spaghetti with meatballs. EXPLANATION: The provided response is a single sentence lacking any discernible structure or connections between the ideas presented. There's no logical flow to assess, no organization, and the juxtaposition of elements (monkeys, submarine, symphony, spaghetti) creates jarring incoherence. SCORE: 1`
Evaluation steps	Step-by-step instructions for completing the task.	Optional user input. Note: You can specify a ranking of the criteria in the evaluation steps.	`STEP 1: Assess the response in aspects of instruction following, groundedness, helpfulness, and verbosity according to the criteria. STEP 2: Score based on the rubrics.`

User inputs

Component	Function	Type	Example
Input variables	The inputs that the user must provide to complete the prompt for the autorater and get a response.	Required user input	`## User Inputs ### Prompt {prompt} ## AI-generated Response {response}`
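
The three sections described above can be assembled programmatically. The following is a minimal sketch in plain Python, not the service's own API: `build_metric_prompt_template` and its parameters are hypothetical names chosen for illustration, and the headings mirror the template layout used on this page. It assumes that none of the supplied texts contain curly braces, because the result is later filled with `str.format`.

```python
from typing import Mapping, Optional, Sequence

# Shortened version of the default instruction shown in the table above.
DEFAULT_INSTRUCTION = (
    "You are an expert evaluator. Your task is to evaluate the quality of "
    "the responses generated by AI models."
)


def build_metric_prompt_template(
    criteria: Mapping[str, str],
    rating_rubric: Mapping[str, str],
    metric_definition: Optional[str] = None,
    evaluation_steps: Optional[Sequence[str]] = None,
    input_variables: Sequence[str] = ("prompt",),
    instruction: str = DEFAULT_INSTRUCTION,
) -> str:
    """Hypothetical helper: join the three main sections into one template."""
    if not criteria or not rating_rubric:
        raise ValueError("criteria and rating_rubric are required user inputs")

    lines = ["# Instruction", instruction, "", "# Evaluation"]
    if metric_definition:  # optional user input
        lines += ["## Metric Definition", metric_definition, ""]
    lines.append("## Criteria")
    lines += [f"{name}: {text}" for name, text in criteria.items()]
    lines += ["", "## Rating Rubric"]
    lines += [f"{score}: {text}" for score, text in rating_rubric.items()]
    if evaluation_steps:  # optional user input
        lines += ["", "## Evaluation Steps"]
        lines += [f"STEP {i}: {step}" for i, step in enumerate(evaluation_steps, 1)]
    lines += ["", "# User Inputs and AI-generated Response", "## User Inputs"]
    for name in input_variables:
        # Each input variable becomes a placeholder such as {prompt}.
        lines += [f"### {name.capitalize()}", "{" + name + "}", ""]
    lines += ["## AI-generated Response", "{response}"]
    return "\n".join(lines)


template = build_metric_prompt_template(
    metric_definition=(
        "You will be assessing a metric called SummarizationQuality, "
        "which measures the overall ability to summarize text"
    ),
    criteria={
        "Groundedness": "The response contains information included only in the context."
    },
    rating_rubric={
        "5": "(Very good). The summary follows instructions, is grounded, is concise, and fluent.",
        "1": "(Very bad). The summary is not grounded.",
    },
    evaluation_steps=["Assess the response according to the criteria.",
                      "Score based on the rubrics."],
)

# Fill the placeholders for one (made-up) evaluation row.
filled = template.format(prompt="Summarize the article.", response="A short summary.")
```

Keeping the required parts (criteria and rating rubric) as mandatory arguments and the optional parts as keyword defaults mirrors the "Type" column of the tables above.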

Additionally, if the columns in the user data don't match the input variables and you don't want to rename the data, you can provide a mapping:

Component	Function	Type	Example
Metric column mapping	A mapping from the input variables in the user prompt to the columns in the user data.	Optional user input. Note: `prompt`, `response`, and `baseline_model_response` don't support mapping if `evaluate()` runs model inference.	`metric_column_mapping = {"reference":"ground_truth"}`
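
To make the effect of such a mapping concrete, the following plain-Python sketch shows one way the lookup could work. It only illustrates the idea and is not the service's implementation: `resolve_input_variables` is a hypothetical helper, and the row contents are made up.

```python
from typing import Dict, Mapping, Optional, Sequence


def resolve_input_variables(
    row: Mapping[str, str],
    input_variables: Sequence[str],
    metric_column_mapping: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Hypothetical helper: look up each input variable in a data row.

    A variable listed in the mapping is read from the mapped column;
    any other variable is read from the column with the same name.
    """
    mapping = metric_column_mapping or {}
    resolved = {}
    for variable in input_variables:
        column = mapping.get(variable, variable)
        if column not in row:
            raise KeyError(f"column {column!r} for variable {variable!r} is missing")
        resolved[variable] = row[column]
    return resolved


# The dataset stores the reference summary under "ground_truth",
# while the template expects a {reference} variable.
row = {
    "prompt": "Summarize the article.",
    "response": "A short summary.",
    "ground_truth": "The reference summary.",
}
values = resolve_input_variables(
    row,
    input_variables=["prompt", "response", "reference"],
    metric_column_mapping={"reference": "ground_truth"},
)
```

Here `values["reference"]` holds the text from the `ground_truth` column, so the dataset itself does not need to be renamed.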

Adapt a metric prompt template to your input data

To adapt a template to your specific data and evaluation criteria, follow these steps:

Identify missing criteria: Determine which criteria the existing template doesn't adequately address.
Add new criteria: Include the missing criteria in the prompt, clearly defining what you expect the model to consider.
Adapt the user input fields: If your evaluation dataset has additional columns that you want to use for evaluation, add them to the user input fields and tell the judge model how to use them.
Update the rating rubric: Modify the rating rubric to reflect the new criteria and their relative importance.

For example, if you want to evaluate a summarization model based on how well the response summary aligns with a reference summary, you can add a new criterion called "reference alignment" and add the reference data as part of User Inputs:

# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing summarization quality, which measures the overall ability to summarize text.

## Criteria
Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context. The response does not reference any outside information.
Conciseness: The response summarizes the relevant details in the original text without a significant loss of key information and without being too verbose or terse.
Fluency: The response is well-organized and easy to read.
Reference alignment: The response is consistent and aligned with the reference response.

## Rating Rubric
5: (Very good). The summary follows instructions, is grounded, concise, fluent and aligned with reference summary.
4: (Good). The summary follows instructions, is grounded, concise, and fluent but not aligned with reference summary.
3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent and is not aligned with reference summary.
2: (Bad). The summary is grounded, but does not follow the instructions.
1: (Very bad). The summary is not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, fluency and reference alignment according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs
### Reference
{reference}

### Prompt
{prompt}

## AI-generated Response
{response}
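
After adding a variable such as `{reference}` to a template, it can help to check that every placeholder has a matching column in the evaluation data before filling the template. The following is a minimal sketch using only the Python standard library; `template_variables` is a hypothetical helper name, and the row contents are made up.

```python
from string import Formatter
from typing import Set


def template_variables(template: str) -> Set[str]:
    """Return the placeholder names (such as 'prompt') found in a template."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    return {field for _, field, _, _ in Formatter().parse(template) if field}


# The user-input section of the customized template.
USER_INPUTS_SECTION = """# User Inputs and AI-generated Response
## User Inputs
### Reference
{reference}

### Prompt
{prompt}

## AI-generated Response
{response}"""

# A row that predates the template change: it has no reference column yet.
old_row = {"prompt": "Summarize the article.", "response": "A short summary."}
missing = template_variables(USER_INPUTS_SECTION) - set(old_row)

# Once the reference is present, the section can be filled.
new_row = dict(old_row, reference="The reference summary.")
filled_section = USER_INPUTS_SECTION.format(**new_row)
```

For `old_row`, `missing` contains only `reference`, which signals that the dataset needs a reference column (or a column mapping) before it can be used with the customized template.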

Provide few-shot examples to improve quality

Few-shot examples can significantly improve the quality and consistency of evaluation responses by guiding the model toward your preferred output format and style. We recommend starting with 5-10 few-shot examples.

To include few-shot examples:

Identify relevant examples: Choose examples that are similar to the type of input data that you will evaluate.
Include the examples in the prompt: Place the examples directly in the evaluation prompt, before the task or context.
Format the examples: Make sure that the examples follow your chosen output format and style.

For example, you can provide few-shot examples for the coherence metric and add an instruction to use them, as follows:

# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps, as shown in the few-shot examples. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
...

## Criteria
...

## Rating Rubric
...

## Few-shot Examples
Response: Purple monkeys jumped onto the submarine while Beethoven's Fifth Symphony played loudly and the chef cooked spaghetti with meatballs.
Explanation: The provided response is a single sentence lacking any discernible structure or connections between the ideas presented. There's no logical flow to assess, no organization, and the juxtaposition of elements (monkeys, submarine, symphony, spaghetti) creates jarring incoherence.
Score: 1

Response: Learning a new language can be a rewarding experience for children, opening doors to different cultures and expanding their understanding of the world. There are many resources available to help children learn languages, from online courses and apps to language exchange programs and immersion schools.
Explanation: The response presents two related ideas: the benefits of learning a new language for children and the resources available to aid in that process. However, there is no clear transition or connection between these two distinct points. While both sentences are relevant to the topic of language acquisition in children, the relationship between them could be made more explicit.
Score: 3

Response: Although the internet has revolutionized communication and information sharing, it has also created echo chambers where individuals are only exposed to opinions and beliefs that align with their own. This polarization can lead to increased hostility and misunderstanding between different groups, making it difficult to find common ground on important issues. Consequently, fostering media literacy and critical thinking skills is essential for navigating the vast and often biased landscape of online information. By teaching individuals to evaluate sources, identify biases, and consider diverse perspectives, we can empower them to break free from echo chambers and engage in meaningful dialogue with those who hold differing views.
Explanation: The response exhibits a clear and logical flow of ideas. The transition words 'although' and 'consequently' effectively signal the relationship between the internet's advantages, its drawbacks (echo chambers), and the proposed solution (media literacy). The text maintains cohesion through consistent focus on the central theme of online polarization and its remedies.
Score: 5

## Evaluation Steps
...

# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
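
When the few-shot examples live in a dataset rather than in the prompt text, the Few-shot Examples section can be rendered from structured data so that every example follows the same format. The following is a minimal sketch in plain Python: `FewShotExample` and `render_few_shot_section` are hypothetical names, and the two examples are abbreviated from the ones above (a real prompt would typically start with 5-10).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FewShotExample:
    """One scored example for the judge model (hypothetical structure)."""
    response: str
    explanation: str
    score: int


def render_few_shot_section(examples: Sequence[FewShotExample]) -> str:
    """Render examples in the Response / Explanation / Score layout."""
    blocks = [
        f"Response: {ex.response}\nExplanation: {ex.explanation}\nScore: {ex.score}"
        for ex in examples
    ]
    # Blank lines separate the examples, as in the template above.
    return "## Few-shot Examples\n" + "\n\n".join(blocks)


examples = [
    FewShotExample(
        response=(
            "Purple monkeys jumped onto the submarine while Beethoven's Fifth "
            "Symphony played loudly and the chef cooked spaghetti with meatballs."
        ),
        explanation=(
            "The provided response is a single sentence lacking any discernible "
            "structure or connections between the ideas presented."
        ),
        score=1,
    ),
    FewShotExample(
        response=(
            "Learning a new language can be a rewarding experience for children, "
            "opening doors to different cultures and expanding their understanding "
            "of the world."
        ),
        explanation=(
            "The response presents two related ideas, but there is no clear "
            "transition or connection between them."
        ),
        score=3,
    ),
]

few_shot_section = render_few_shot_section(examples)
```

The resulting string can be placed between the Rating Rubric and Evaluation Steps sections of the template, matching the position used in the example above.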