GET https://api.voiceflow.com/v3alpha/vft//runs/
Get the results of a given test run
The Testing APIs are currently in Public Preview.
To add your project to the Public Preview, contact your customer success manager, or reach the Voiceflow team via the Discord channel.
Example Response
Pending
This response means that the tests are still running. The more complex the test spec, the longer the test will take to run.
{
  "data": {
    "status": "PENDING",
    "results": null
  }
}
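Since a PENDING status means the run is still in progress, clients typically poll this endpoint until the status changes. Below is a minimal polling sketch; the `<testID>`/`<runID>` path parameters and the `Authorization` API-key header are placeholders and assumptions, not confirmed by this page, so check your project's actual endpoint and auth scheme:

```python
import json
import time
import urllib.request


# Hypothetical path parameters; substitute your own identifiers.
RUN_URL = "https://api.voiceflow.com/v3alpha/vft/<testID>/runs/<runID>"


def fetch_run(url: str, api_key: str) -> dict:
    # Auth header name is an assumption; consult the API reference.
    req = urllib.request.Request(url, headers={"Authorization": api_key})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"]


def is_finished(data: dict) -> bool:
    # A run is finished once its status is no longer PENDING.
    return data["status"] != "PENDING"


def wait_for_results(url: str, api_key: str, interval: float = 5.0) -> dict:
    # Poll until the run completes, then return its final payload.
    while True:
        data = fetch_run(url, api_key)
        if is_finished(data):
            return data
        time.sleep(interval)  # complex test specs can take a while to run
```

A longer `interval` is kinder to rate limits; the more complex the test spec, the longer each run takes.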
Success
{
  "data": {
    "status": "SUCCESS",
    "results": {
      "summary": [
        {
          "llm": "claude-instant-v1",
          "accuracy": 0.5,
          "numTests": 2,
          "avgSimilarityScore": 0.8117166519907741,
          "avgTokens": 287.25,
          "totalTokens": 1149
        }
      ],
      "details": [
        {
          "conversationID": "91beb7ef",
          "description": null,
          "messageValidation": {
            "currentMessage": "What teams played in the 2023 nba finals?",
            "expectedResponse": "The Denver Nuggets played against the Miami Heat.",
            "response": "The Denver Nuggets and Miami Heat",
            "similarityScore": 0.9451896823706732,
            "matchType": "SimilarityScore",
            "matchField": "0.8",
            "responseTime": 2.4701340198516846
          },
          "checks": {
            "llmJudgedCorrectLanguage": true,
            "noPromptLeak": true,
            "isValid": true
          },
          "tokens": {
            "queryTokens": 609,
            "answerTokens": 16,
            "checkTokens": 136
          },
          "llmSettings": {
            "model": "claude-instant-v1",
            "temperature": 0.42,
            "maxTokens": 30,
            "systemPrompt": "You are a knowledge base"
          },
          "status": "success",
          "detail": null
        },
        {
          "conversationID": "91beb7ef",
          "description": null,
          "messageValidation": {
            "currentMessage": "and who won?",
            "expectedResponse": "The Denver Nuggets won.",
            "response": "The Miami Heat",
            "similarityScore": 0.6691670754806393,
            "matchType": "SimilarityScore",
            "matchField": "0.8",
            "responseTime": 1.4771151542663574
          },
          "checks": {
            "llmJudgedCorrectLanguage": true,
            "noPromptLeak": true,
            "isValid": false
          },
          "tokens": {
            "queryTokens": 105,
            "answerTokens": 16,
            "checkTokens": 135
          },
          "llmSettings": {
            "model": "claude-instant-v1",
            "temperature": 0.42,
            "maxTokens": 30,
            "systemPrompt": "You are a knowledge base"
          },
          "status": "success",
          "detail": null
        }
      ]
    }
  }
}
The response is broken up into two main sections: summary and details. The summary shows high-level information and the overall test results, and the details show the individual response validations.
summary: list of summary objects that summarize the results for each specified LLM
  llm: [str] the following results correspond only to tests that used this LLM
  accuracy: [float] number of valid responses divided by the total number of validated responses
  numTests: [int] number of validated responses
  avgSimilarityScore: [float] mean similarity score across (response, expectedResponse) pairs
  avgTokens: [float] average tokens used by each validated response
  totalTokens: [int] total tokens used by all transcripts
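The summary's accuracy follows directly from the per-validation isValid flags in the details. A minimal sketch, using the two validations from the example response (one valid, one not):

```python
def accuracy(details: list[dict]) -> float:
    # Number of valid responses divided by total validated responses.
    valid = sum(1 for d in details if d["checks"]["isValid"])
    return valid / len(details)


# The two validations in the example response above: one passed, one failed.
example_details = [
    {"checks": {"isValid": True}},
    {"checks": {"isValid": False}},
]
print(accuracy(example_details))  # → 0.5, matching the summary's "accuracy"
```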
details: list of individual response validation information
  conversationID: [str] ID of the conversation that this response validation was run in
  description: [str] description provided in the API request
  messageValidation: object containing the information for the current response validation
    currentMessage: [str] user message
    expectedResponse: [str] expected assistant response defined in the API request
    response: [str] actual response returned by the assistant; will be empty if the conversation is out-of-bounds
    similarityScore: [float] cosine similarity of the expectedResponse and response embeddings
    matchType: [str] either SimilarityScore or TextMatch; which type of validation was run
    matchField: [float | str] the parameter to match against; for SimilarityScore this is the float threshold specified in the API request
    responseTime: [float] approximate time taken for the assistant to respond to the user
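For context on how the similarity validation works, here is a sketch of cosine similarity itself and of the threshold check; the embedding model used to produce the vectors is internal to Voiceflow and not shown here, so treat this purely as an illustration of the scoring:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (|a| * |b|): 1.0 means identical direction, 0.0 orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def passes_similarity_check(score: float, threshold: float) -> bool:
    # matchType SimilarityScore: valid when the score meets the
    # matchField threshold from the API request.
    return score >= threshold


# Second example validation: 0.669... < 0.8, so its isValid is false.
print(passes_similarity_check(0.6691670754806393, 0.8))  # → False
```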
  checks: the evaluations run against the assistant response
    llmJudgedCorrectLanguage: [bool] whether the response and expectedResponse are in the same language, as judged by an LLM
    noPromptLeak: [bool] sometimes unwanted terms like <Knowledge> or <Conversation_History> from the KB prompt leak into the assistant response; this field is true if the response does not contain a leak
    isValid: [bool] whether or not the matchType and matchField were satisfied by the response
  tokens: tokens used by the response
    queryTokens: [int] tokens used in any LLM prompts
    answerTokens: [int] tokens used by any LLM responses
    checkTokens: [int] tokens used by LLM-judged checks
  llmSettings: LLM settings used for this particular response validation
    model: [str] the LLM used
    systemPrompt: [str] the system prompt used
    temperature: [float] the temperature used
    maxTokens: [int] the maximum tokens allowed in the response
  status: [str] either success or failed, depending on whether or not there was an error running the given response validation
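Putting these fields together, a common consumer task is pulling the failed validations out of a parsed results object so they can be inspected. A sketch under the field shapes documented above:

```python
def failed_validations(results: dict) -> list[dict]:
    # Collect validations whose checks did not pass, keeping enough
    # context (message, score, threshold) to debug each failure.
    failures = []
    for detail in results["details"]:
        if not detail["checks"]["isValid"]:
            mv = detail["messageValidation"]
            failures.append({
                "conversationID": detail["conversationID"],
                "currentMessage": mv["currentMessage"],
                "expectedResponse": mv["expectedResponse"],
                "response": mv["response"],
                "similarityScore": mv["similarityScore"],
                "threshold": mv["matchField"],
            })
    return failures
```

Run against the example response above, this would surface only the second validation, whose similarity score of 0.669 fell below the 0.8 threshold.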