The REST API is compatible with the OpenAI Chat Completions API standard, making it easier to integrate into existing applications.

We currently host meta-llama/Meta-Llama-3.1-8B-Instruct-Q8, meta-llama/Meta-Llama-3.1-70B-Instruct-Q8, and meta-llama/Meta-Llama-3.1-405B-Instruct-Q8 for inference.

Request Example (Standard Mode)

curl --request POST \
     --url https://api.nmesh.io/api/v1/chat/completions \
     --header 'Authorization: Bearer YourAPIKey' \
     --header 'accept: application/json' \
     --header 'content-type: application/json' \
     --data '
{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct-Q8",
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    }
  ]
}
'

Input Parameters

  • messages (array): The messages that make up the conversation so far. Each message has:
    • role: The role of the message author. One of: system, user, or assistant.
    • content: The contents of the message.
  • model (string): The model to use for chat completion. The supported models are listed above.
  • stream (boolean): If true, tokens are returned as Server-Sent Events as they become available. If false, the full response is returned as plain text once generation finishes. Defaults to false. See the streaming example after this list.
  • top_p (float): Nucleus sampling; the model samples from the smallest set of tokens whose cumulative probability exceeds top_p.
  • top_k (integer): Limits sampling to the top_k most likely next tokens.
  • max_tokens (integer): The maximum number of tokens to generate.
  • temperature (float): Controls the degree of randomness in the response; higher values produce more varied output.
  • repetition_penalty (float): Controls the diversity of generated text by reducing the likelihood of repeated sequences. Higher values decrease repetition.
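
Request Example (Streaming Mode)

As a sketch of how these parameters fit together, the request below enables streaming and sets the sampling parameters explicitly; the values shown are illustrative, not recommended defaults. curl's -N flag disables output buffering so tokens print as each Server-Sent Event arrives.

curl -N --request POST \
     --url https://api.nmesh.io/api/v1/chat/completions \
     --header 'Authorization: Bearer YourAPIKey' \
     --header 'accept: application/json' \
     --header 'content-type: application/json' \
     --data '
{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct-Q8",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Write a haiku about the sea."
    }
  ],
  "stream": true,
  "max_tokens": 256,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 40,
  "repetition_penalty": 1.1
}
'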

Output

Right now, the plain text string (i.e., the message content) is returned directly to the user.
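
For example, the standard-mode request above would return nothing but a bare string (illustrative output; the actual text will vary):

Hello! How can I help you today?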

We plan to support more complex response formats soon, which should include additional information similar to what other providers such as OpenAI return. Stay tuned!


What’s Next