Building with Claude: A Practical Developer Guide to Models, APIs, Prompts, Tools, RAG, MCP, and Agents
Claude is not just a chatbot interface. It is a platform for building AI-powered applications: assistants, coding tools, document processors, data analysers, retrieval systems, workflow automations, and agents that can use external tools.
This guide walks through the main concepts a developer needs to understand when building with Claude: how to choose a model, how API requests work, how to manage conversations, how to control outputs, how to evaluate prompts, how tool use works, how to build RAG systems, how MCP fits into the architecture, and when to use workflows versus agents.
1. Claude Model Families
Claude models are organised into three broad families, each optimised for a different priority: intelligence, balance, or speed.
Opus
Opus is the highest-intelligence Claude model family. It is designed for complex, multi-step work where reasoning, planning, and careful analysis matter more than cost or latency.
Typical use cases include:
- Complex architecture design
- Multi-step reasoning tasks
- Deep analysis of long documents
- Planning and strategy
- High-stakes coding or debugging
- Tasks where accuracy matters more than speed
The trade-off is that Opus is generally slower and more expensive than the other families.
Sonnet
Sonnet is the balanced family. It offers strong reasoning, good speed, and cost efficiency. For most production applications, Sonnet is usually the best default choice.
It is especially strong for:
- Software development
- Precise code editing
- General assistants
- Business workflows
- Document processing
- Production chat applications
- Tasks that need quality without excessive latency
If you are unsure which model to choose, start with Sonnet.
Haiku
Haiku is optimised for speed and cost efficiency. It is the best fit when you need fast responses or high-volume processing.
Typical use cases include:
- Classification
- Routing
- Lightweight extraction
- Real-time user interactions
- High-throughput batch processing
- Simple summarisation
- Dataset generation for evaluations
Haiku is not the best choice for tasks that need deep reasoning, but it is extremely useful as part of a larger system.
Model selection framework
A simple selection framework:
| Priority | Model family |
|---|---|
| Highest intelligence | Opus |
| Best balance | Sonnet |
| Lowest latency / cost | Haiku |
In mature applications, the best architecture is often not “pick one model”. Instead, use different models for different parts of the system. For example:
- Haiku classifies the request.
- Sonnet generates the main response.
- Opus handles complex escalations.
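A minimal sketch of this kind of routing, assuming current model aliases (the IDs here are illustrative; check the model list for up-to-date names):

ROUTES = {
    "simple": "claude-3-5-haiku-latest",
    "standard": "claude-3-5-sonnet-latest",
    "complex": "claude-3-opus-latest",
}

def route_and_answer(client, text):
    # A fast model labels the request, then the label picks the model.
    label = client.messages.create(
        model=ROUTES["simple"],
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": f"Classify this request as simple, standard, or complex. Reply with one word only.\n\n{text}"
        }]
    ).content[0].text.strip().lower()

    response = client.messages.create(
        model=ROUTES.get(label, ROUTES["standard"]),
        max_tokens=1000,
        messages=[{"role": "user", "content": text}]
    )
    return response.content[0].text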
All Claude model families share core capabilities such as text generation, coding, and image analysis. The main difference is their optimisation profile.
2. How Claude API Access Works
A typical Claude-powered application should not call the Anthropic API directly from a browser or mobile client. The API key must remain secret, so requests should go through your backend.
The standard flow looks like this:
User → Client app → Your server → Anthropic API → Your server → Client app → User
Five-step request flow
- The user enters text in the client application.
- The client sends the text to your server.
- Your server calls the Anthropic API using an SDK or HTTP.
- Claude generates a response.
- The API returns the generated text, usage information, and a stop reason. Your server sends the result back to the client.
Anthropic provides SDKs for several languages, including Python, TypeScript, JavaScript, Go, and Ruby. You can also use plain HTTP.
A request normally includes:
- API key
- Model name
- Message list
- max_tokens limit
- Optional parameters such as system, temperature, tools, stop_sequences, or streaming options
3. What Happens During Text Generation
Although you do not need to implement the internals of a language model, understanding the high-level process helps when debugging behaviour.
Claude’s generation process can be understood in four broad stages.
1. Tokenisation
The input text is broken into tokens. A token may be a word, part of a word, punctuation mark, symbol, or space.
For example, a sentence is not necessarily processed word by word. Some words may be split into multiple tokens.
2. Embedding
Each token is converted into a numerical representation. This representation captures possible meanings and relationships.
An embedding is essentially a list of numbers representing semantic properties of the token.
3. Contextualisation
The model adjusts token meanings based on neighbouring tokens.
For example, the word “bank” can mean a financial institution or the side of a river. Context determines the intended meaning.
4. Generation
Claude predicts possible next tokens and assigns probabilities to them. It selects a token based on those probabilities and generation settings, then repeats the process until it stops.
The model stops when:
- It reaches the max_tokens limit.
- It generates an end-of-sequence token.
- It reaches a stop sequence.
- It pauses to request tool use.
The response includes a stop_reason, which tells you why generation ended.
4. Making Your First API Request
A basic Python setup usually involves installing the Anthropic SDK and loading your API key from an environment variable.
pip install anthropic python-dotenv
Create a .env file:
ANTHROPIC_API_KEY="your_key_here"
Do not commit this file to version control.
Then create a client and send a request:
import os
from dotenv import load_dotenv
import anthropic
load_dotenv()
client = anthropic.Anthropic(
api_key=os.environ["ANTHROPIC_API_KEY"]
)
model = "claude-3-5-sonnet-latest"
message = client.messages.create(
model=model,
max_tokens=1000,
messages=[
{
"role": "user",
"content": "What is quantum computing?"
}
]
)
print(message.content[0].text)
Required request fields
| Field | Purpose |
|---|---|
| model | The Claude model to use |
| max_tokens | Maximum generated tokens |
| messages | Conversation messages |
max_tokens is not a target response length. It is a safety limit.
Message structure
Claude messages use roles:
{"role": "user", "content": "Human-authored text"}
Assistant messages represent model-generated responses:
{"role": "assistant", "content": "Model-generated text"}
The API response contains metadata and nested content blocks. To access just the text:
text = message.content[0].text
5. Multi-Turn Conversations
The Anthropic API is stateless. It does not remember previous requests.
That means this will not work as expected:
Request 1: My name is Ana.
Request 2: What is my name?
Unless you include the first message again in the second request, Claude has no memory of it.
The solution: maintain message history yourself
You need to:
- Store the message list in your application.
- Append each user message.
- Append each assistant response.
- Send the full history with every new request.
Example:
messages = []
def add_user_message(messages, text):
messages.append({
"role": "user",
"content": text
})
def add_assistant_message(messages, text):
messages.append({
"role": "assistant",
"content": text
})
def chat(messages):
response = client.messages.create(
model=model,
max_tokens=1000,
messages=messages
)
return response.content[0].text
while True:
user_input = input("> ")
add_user_message(messages, user_input)
answer = chat(messages)
add_assistant_message(messages, answer)
print("---")
print(answer)
print("---")
This is the foundation for any chat application.
6. System Prompts
A system prompt lets you define Claude’s role, style, constraints, and behaviour.
It controls how Claude responds rather than simply what it responds with.
Example:
response = client.messages.create(
model=model,
max_tokens=1000,
system="You are a patient math tutor. Give hints instead of direct answers. Encourage the student to reason step by step.",
messages=[
{
"role": "user",
"content": "How do I solve 2x + 5 = 15?"
}
]
)
The same user question can produce very different answers depending on the system prompt.
Conditional system prompts
In production code, it is common to build request parameters dynamically:
def chat(messages, system_prompt=None):
params = {
"model": model,
"max_tokens": 1000,
"messages": messages,
}
if system_prompt is not None:
params["system"] = system_prompt
response = client.messages.create(**params)
return response.content[0].text
This avoids sending a None value where the API expects a string.
7. Temperature
Temperature controls randomness during token selection.
At each generation step, Claude has a probability distribution over possible next tokens. Temperature changes how dominant the most likely tokens are.
Low temperature
Low temperature makes outputs more consistent and deterministic.
Use it for:
- Data extraction
- Factual responses
- Classification
- JSON generation
- Code transformations
- Tasks where consistency matters
Example:
response = client.messages.create(
model=model,
max_tokens=1000,
temperature=0,
messages=[
{
"role": "user",
"content": "Extract the email addresses from this text."
}
]
)
High temperature
Higher temperature increases the chance of less obvious tokens being selected.
Use it for:
- Brainstorming
- Creative writing
- Jokes
- Marketing copy
- Ideation
Higher temperature does not guarantee different outputs every time. It only increases the likelihood of variation.
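For example, a brainstorming request with maximum randomness:

response = client.messages.create(
    model=model,
    max_tokens=1000,
    temperature=1,
    messages=[
        {
            "role": "user",
            "content": "Brainstorm ten unusual names for a coffee shop."
        }
    ]
)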
8. Streaming Responses
Without streaming, users may wait 10–30 seconds for long responses. A spinner is a poor experience compared with showing tokens as they arrive.
Streaming sends response chunks incrementally.
Basic streaming concept
- Your server sends a request to Claude.
- Claude acknowledges the message.
- Claude streams events as generation progresses.
- Your server forwards text chunks to the frontend.
- The final message is assembled for storage.
Common event types include:
| Event | Meaning |
|---|---|
| message_start | Response begins |
| content_block_start | A content block begins |
| content_block_delta | New text chunk arrives |
| content_block_stop | Content block ends |
| message_stop | Full response ends |
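At the lowest level you can iterate over these events yourself. A sketch with the Python SDK (the raw event objects expose type and delta fields as shown; treat the exact field access as an assumption to verify against your SDK version):

with client.messages.stream(
    model=model,
    max_tokens=1000,
    messages=[{"role": "user", "content": "Write a short guide to solar panels."}]
) as stream:
    for event in stream:
        # Only text deltas carry printable content.
        if event.type == "content_block_delta" and event.delta.type == "text_delta":
            print(event.delta.text, end="", flush=True)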
Simplified streaming with text_stream
with client.messages.stream(
model=model,
max_tokens=1000,
messages=[
{
"role": "user",
"content": "Write a short guide to solar panels."
}
]
) as stream:
for text in stream.text_stream:
print(text, end="", flush=True)
final_message = stream.get_final_message()
stream.get_final_message() gives you the assembled response, which is useful for database storage.
9. Controlling Model Output
Prompt wording is not the only way to control output. Two useful techniques are assistant prefill and stop sequences.
Assistant message prefill
You can manually add an assistant message at the end of the conversation. Claude treats it as already-written assistant content and continues from that exact point.
Example:
messages = [
{
"role": "user",
"content": "Which is better, coffee or tea?"
},
{
"role": "assistant",
"content": "Coffee is better because"
}
]
response = client.messages.create(
model=model,
max_tokens=500,
messages=messages
)
Claude continues after “Coffee is better because”.
Important detail: Claude continues from the exact end of the prefilled text. It does not restart the sentence or add missing context for you.
Stop sequences
Stop sequences force generation to halt when a specific string appears.
response = client.messages.create(
model=model,
max_tokens=1000,
stop_sequences=[", five"],
messages=[
{
"role": "user",
"content": "Count from one to ten."
}
]
)
If Claude would generate the string “, five”, generation stops at that point, and the stop sequence itself is excluded from the final output.
This is useful when you need precise control over length or delimiters.
10. Structured Data Generation
Claude often adds helpful explanations, Markdown headings, or code fences. That is useful for humans, but annoying when you need raw JSON, Python, YAML, regex, or another parseable format.
A common pattern combines:
- A user request for structured data.
- Assistant prefill with an opening delimiter.
- A stop sequence with the closing delimiter.
Example:
messages = [
{
"role": "user",
"content": "Generate a JSON array of five AWS automation tasks. Each item should have a task and format field."
},
{
"role": "assistant",
"content": "```json"
}
]
response = client.messages.create(
model=model,
max_tokens=1000,
messages=messages,
stop_sequences=["```"]
)
raw_json = response.content[0].text
The result is easier to parse because Claude continues inside the JSON block and stops when it reaches the closing fence.
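From there the text parses directly:

import json

tasks = json.loads(raw_json)
print(tasks[0]["task"])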
This technique works for:
- JSON
- Python code
- YAML
- Regex
- Lists
- Configuration files
- Any structured format where extra text is undesirable
For more reliability, especially with complex structures, use tool calling with a JSON schema.
11. Prompt Evaluation
Many teams make the same mistake: they write a prompt, test it once or twice, tweak it manually, and ship it.
That is not enough for production.
Prompt evaluation gives you an objective way to improve prompts.
Typical evaluation workflow
- Write an initial prompt.
- Create an evaluation dataset.
- Interpolate each test case into the prompt.
- Run the prompts through Claude.
- Grade the responses.
- Modify the prompt and repeat.
The dataset can be tiny at first. Even three to ten examples are better than relying on intuition.
Dataset generation
You can create datasets manually, or use a fast model such as Haiku to generate test cases.
Example dataset structure:
[
{
"task": "Create a Python script that lists all S3 buckets.",
"format": "python"
},
{
"task": "Create a JSON config for an AWS Lambda timeout setting.",
"format": "json"
},
{
"task": "Write a regex that matches IPv4 addresses.",
"format": "regex"
}
]
You can use structured generation techniques to produce this dataset, parse it, and save it as dataset.json.
Running the evaluation
A basic evaluation system often has three core functions:
def run_prompt(test_case, prompt_template):
prompt = prompt_template.format(**test_case)
response = client.messages.create(
model=model,
max_tokens=1000,
messages=[{"role": "user", "content": prompt}]
)
return response.content[0].text
def run_test_case(test_case, prompt_template, grader):
output = run_prompt(test_case, prompt_template)
score = grader(output, test_case)
return {
"test_case": test_case,
"output": output,
"score": score,
}
def run_eval(dataset, prompt_template, grader):
return [
run_test_case(test_case, prompt_template, grader)
for test_case in dataset
]
This gives you comparable results across prompt versions.
12. Grading Model Outputs
There are three common ways to grade LLM outputs.
1. Code-based graders
These are deterministic checks written in code. They are ideal for structured outputs.
Examples:
import json
import ast
import re
def validate_json(text):
try:
json.loads(text)
return 10
except Exception:
return 0
def validate_python(text):
try:
ast.parse(text)
return 10
except Exception:
return 0
def validate_regex(text):
try:
re.compile(text)
return 10
except Exception:
return 0
These validators do not prove the output is semantically correct, but they catch invalid syntax.
2. Model-based graders
A model-based grader asks another model to evaluate the output.
A good grading prompt should ask for:
- Strengths
- Weaknesses
- Reasoning
- Numerical score
Do not ask only for a score. Models often default to middling scores unless forced to reason.
Example expected grader output:
{
"strengths": ["Valid JSON", "Correct field names"],
"weaknesses": ["Missing one requested field"],
"reasoning": "The answer mostly follows the requested structure but omits the region field.",
"score": 7
}
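A minimal model-based grader sketch (the rubric, scale, and JSON keys here are assumptions to adapt to your task):

import json

GRADER_PROMPT = """Grade the response below against the task.

<task>
{task}
</task>

<response>
{output}
</response>

List strengths and weaknesses, explain your reasoning, then give a score
from 1 to 10. Return only JSON with the keys strengths, weaknesses,
reasoning, and score."""

def grade_by_model(output, test_case):
    message = client.messages.create(
        model=model,
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(task=test_case["task"], output=output)
        }]
    )
    # Raises if the reply is not pure JSON; prefill and stop sequences
    # (Section 10) make this more robust in practice.
    return json.loads(message.content[0].text)["score"]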
3. Human graders
Human review is the most flexible and often the most accurate, but it is slower and expensive. It is useful for calibrating model-based graders or reviewing high-impact outputs.
Combining graders
For technical tasks, combine semantic grading and syntax validation:
final_score = (model_score + syntax_score) / 2
This captures both correctness and technical validity.
13. Prompt Engineering Techniques That Actually Help
Prompt engineering is not magic phrasing. It is the process of making instructions clear, specific, structured, and testable.
Be clear and direct
The first line matters. Start with an action verb and state the task plainly.
Poor:
I need some help with food for someone who trains.
Better:
Generate a one-day meal plan for an athlete based on the provided height, weight, goal, and dietary restrictions.
Good first lines often follow this pattern:
[Action verb] + [specific task] + [output expectation]
Examples:
Write three paragraphs explaining how solar panels work.
Identify three countries that use geothermal energy and include generation statistics for each.
Generate a one-day meal plan for an athlete that respects their dietary restrictions.
Be specific
Specificity usually improves results dramatically.
There are two useful types of guidelines.
Type A: output attributes
These define what the answer should look like.
Examples:
- Length
- Structure
- Format
- Tone
- Required sections
- Required fields
- Forbidden content
Type B: reasoning or process steps
These tell the model how to approach the problem.
Examples:
First identify the user’s goal, then list constraints, then generate the plan, then verify that each constraint is satisfied.
Use Type A guidelines almost always. Use Type B guidelines when the task is complex or requires considering multiple perspectives.
Use XML tags for structure
When prompts contain multiple content blocks, XML-style tags make boundaries clear.
Example:
<athlete_information>
Height: 180 cm
Weight: 78 kg
Goal: gain muscle
Dietary restrictions: lactose-free
</athlete_information>
<output_requirements>
- Include breakfast, lunch, dinner, and two snacks
- Provide calories and protein estimates
- Explain why each meal supports the goal
- Return a Markdown table
</output_requirements>
Use descriptive tag names. <sales_records> is better than <data>.
This is especially useful for:
- Code debugging
- Document analysis
- Data extraction
- Multi-document prompts
- Long user-provided context
Provide examples
One-shot and multi-shot prompting mean giving Claude examples of the desired input/output pattern.
Examples are especially useful for:
- Complex formatting
- Corner cases
- Tone matching
- Sarcasm detection
- Data extraction
- Classification
A strong example includes both the sample output and a short explanation of why it is good.
14. Tool Use: Extending Claude Beyond Text
By default, Claude only knows what is in the prompt and its training. It cannot automatically know current weather, query your database, create reminders, or call your APIs.
Tool use solves that.
A tool is a function that your application exposes to Claude. Claude decides when it needs the tool, asks to call it, and your server executes it.
Tool use flow
User asks question
→ Claude decides a tool is needed
→ Claude returns a tool_use block
→ Your server runs the function
→ Your server sends a tool_result block back
→ Claude produces the final answer
Example: current weather.
User: What is the weather in Lisbon?
Claude: I need weather data for Lisbon.
Server: Calls weather API.
Claude: Uses the returned weather data to answer.
15. Designing Tool Functions
Tool functions are plain functions executed by your code.
Good tool functions should:
- Have descriptive names
- Have descriptive argument names
- Validate inputs
- Raise meaningful errors
- Return structured results when possible
Example:
from datetime import datetime
def get_current_datetime(date_format="%Y-%m-%d %H:%M:%S"):
if not date_format:
raise ValueError("date_format cannot be empty")
return datetime.now().strftime(date_format)
Errors matter. If a tool raises a useful error, Claude can often correct the arguments and retry.
Bad error:
Error
Good error:
date_format cannot be empty. Provide a valid strftime-compatible format.
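For reference, the add_duration_to_datetime tool used in later examples could be sketched like this (the argument names and units are assumptions):

from datetime import datetime, timedelta

def add_duration_to_datetime(datetime_str, duration, unit="days",
                             date_format="%Y-%m-%d %H:%M:%S"):
    # Parse the input, add the requested duration, and return the same format.
    multipliers = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}
    if unit not in multipliers:
        raise ValueError(f"Unsupported unit: {unit}. Use one of {sorted(multipliers)}.")
    parsed = datetime.strptime(datetime_str, date_format)
    result = parsed + timedelta(seconds=duration * multipliers[unit])
    return result.strftime(date_format)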
16. Tool Schemas
Claude needs a schema describing each tool.
A tool schema typically includes:
- name
- description
- input_schema
The description should explain:
- What the tool does
- When to use it
- What it returns
- Any constraints or caveats
Example schema:
get_current_datetime_schema = {
"name": "get_current_datetime",
"description": "Gets the current date and time. Use this when the user asks about the current time, current date, or relative scheduling. Returns a formatted datetime string.",
"input_schema": {
"type": "object",
"properties": {
"date_format": {
"type": "string",
"description": "A Python strftime-compatible date format."
}
},
"required": []
}
}
In Python, you may wrap tool definitions using SDK types such as ToolParam, depending on the SDK version.
17. Handling Message Blocks with Tools
Once tools are introduced, messages are no longer always plain text.
An assistant response may contain multiple content blocks:
- Text block
- Tool use block
- More text blocks
- More tool use blocks
This means your message history functions must store complete content blocks, not only message.content[0].text.
Tool-enabled request
response = client.messages.create(
model=model,
max_tokens=1000,
messages=messages,
tools=[get_current_datetime_schema]
)
If Claude wants to use a tool, the response may contain a block like:
{
"type": "tool_use",
"id": "toolu_123",
"name": "get_current_datetime",
"input": {
"date_format": "%Y-%m-%d"
}
}
You must append the full assistant response to the message history.
18. Sending Tool Results Back to Claude
After executing a tool, your application must send a tool result block back to Claude.
A tool result includes:
- tool_use_id
- content
- is_error
Example:
tool_result = {
"type": "tool_result",
"tool_use_id": "toolu_123",
"content": "2026-04-27 14:30:00",
"is_error": False
}
messages.append({
"role": "user",
"content": [tool_result]
})
The tool_use_id links the result to the original tool request. This is essential when Claude requests multiple tools.
Important: tool results are sent as a user message, not an assistant message.
19. Multi-Turn Tool Conversations
Some tasks require multiple tool calls.
Example:
User: What day is 103 days from today?
Claude may need to:
- Call get_current_datetime.
- Call add_duration_to_datetime.
- Return the final answer.
You cannot always predict how many tool calls will be required. The solution is a loop.
Conversation loop
def run_conversation(messages, tools):
while True:
response = client.messages.create(
model=model,
max_tokens=1000,
messages=messages,
tools=tools
)
messages.append({
"role": "assistant",
"content": response.content
})
if response.stop_reason != "tool_use":
return response
tool_results = run_tools(response.content)
messages.append({
"role": "user",
"content": tool_results
})
Running tools
import json

def run_tools(content_blocks):
tool_results = []
for block in content_blocks:
if block.type != "tool_use":
continue
try:
output = run_tool(block.name, block.input)
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": json.dumps(output),
"is_error": False
})
except Exception as e:
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": str(e),
"is_error": True
})
return tool_results
Dispatching tools
def run_tool(tool_name, tool_input):
if tool_name == "get_current_datetime":
return get_current_datetime(**tool_input)
if tool_name == "add_duration_to_datetime":
return add_duration_to_datetime(**tool_input)
if tool_name == "set_reminder":
return set_reminder(**tool_input)
raise ValueError(f"Unknown tool: {tool_name}")
This architecture supports arbitrary tool chains.
20. Adding Multiple Tools
Once the framework is in place, adding a tool usually requires three steps:
- Implement the function.
- Add its schema to the tools list.
- Add a case in the dispatcher.
For example, a reminder system might include:
- get_current_datetime
- add_duration_to_datetime
- set_reminder
Claude can then combine those tools dynamically.
This is where tools become powerful: the model can solve tasks by composing small capabilities.
21. Batch Tool Pattern
Claude can technically request multiple tools in one assistant message, but models may still call tools sequentially when they could be parallelised.
A batch tool is a workaround.
Instead of exposing only individual tools, expose a higher-level tool that accepts a list of invocations.
Example input:
{
"invocations": [
{
"tool_name": "get_weather",
"arguments": {"city": "Lisbon"}
},
{
"tool_name": "get_weather",
"arguments": {"city": "Madrid"}
}
]
}
The server runs each invocation and returns a list of outputs.
This reduces unnecessary request-response cycles when tool calls are independent.
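Server-side, the batch tool can reuse the single-tool dispatcher from Section 19. A sketch:

def run_batch_tool(invocations):
    # Execute each requested invocation and collect outputs in order.
    outputs = []
    for invocation in invocations:
        try:
            result = run_tool(invocation["tool_name"], invocation["arguments"])
            outputs.append({"tool_name": invocation["tool_name"], "result": result})
        except Exception as e:
            outputs.append({"tool_name": invocation["tool_name"], "error": str(e)})
    return outputs

Independent invocations can also be executed concurrently, for example with a thread pool.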
22. Tools for Structured Data
Tool use can also be used for structured extraction.
Instead of asking Claude to output JSON as text, define a tool whose input schema is the structure you want.
Then force Claude to call that tool.
response = client.messages.create(
model=model,
max_tokens=1000,
messages=[
{
"role": "user",
"content": "Extract the customer name, company, and requested product from this email."
}
],
tools=[extract_customer_request_schema],
tool_choice={
"type": "tool",
"name": "extract_customer_request"
}
)
Then read the structured arguments from the tool use block.
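Reading the arguments back might look like this (the field names depend entirely on your schema):

for block in response.content:
    if block.type == "tool_use" and block.name == "extract_customer_request":
        extracted = block.input
        print(extracted["customer_name"])  # field name defined by your schema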
This approach is more complex than prefill and stop sequences, but more reliable for production extraction.
23. Fine-Grained Tool Calling and Tool Streaming
When streaming tool calls, you may receive partial JSON chunks as Claude builds tool arguments.
Standard streaming includes text deltas. Tool streaming adds JSON-related deltas such as partial argument chunks.
Default behaviour
By default, the API may buffer chunks until it has a complete top-level key-value pair and can validate the JSON against the schema.
This gives you more safety, but can feel less like normal streaming because chunks arrive in bursts.
Fine-grained mode
Fine-grained tool calling disables some API-side validation and streams chunks as they are generated.
Trade-off:
| Mode | Benefit | Risk |
|---|---|---|
| Default | Validated JSON | Slower chunk delivery |
| Fine-grained | Faster streaming | Client must handle invalid JSON |
Use fine-grained mode when you need immediate UI updates or want to begin processing tool arguments before the full object is complete.
24. Built-In Tools: Text Editing, Web Search, Code Execution
Claude can work with several tool patterns, including built-in tool schemas that still require application-side integration.
Text editor tool
The text editor tool enables file operations such as:
- View files
- Create files
- Replace strings
- Edit text
- Undo operations
Claude has a built-in schema for this tool, but your application must implement the actual filesystem operations.
The typical flow is:
- Send a minimal tool schema stub.
- Claude expands it internally.
- Claude requests file operations.
- Your code performs the operations.
- Results are sent back to Claude.
This is the foundation of AI code editor behaviour.
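A request using the stub might look like this sketch (the type string is version-dated and varies by model generation, so check the current documentation before relying on it):

response = client.messages.create(
    model=model,
    max_tokens=1000,
    tools=[
        {
            "type": "text_editor_20250124",  # version-dated; check current docs
            "name": "str_replace_editor"
        }
    ],
    messages=[{"role": "user", "content": "Fix the syntax error in main.py"}]
)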
Web search tool
The web search tool allows Claude to search for current or specialised information.
A typical schema includes:
{
"type": "web_search_20250305",
"name": "web_search",
"max_uses": 5,
"allowed_domains": ["nih.gov"]
}
Useful features include:
- Multiple searches per request
- Domain restrictions
- Search result blocks
- Citation blocks
- Source-grounded answers
Domain restrictions are useful when quality matters. For example, medical or scientific queries can be restricted to trusted domains.
Code execution and Files API
The Files API lets you upload files once and reference them by file ID in future requests.
Code execution allows Claude to run Python code in an isolated Docker container.
Important constraints:
- The container has no network access.
- Data must be provided through uploaded files.
- Claude can generate files such as plots or reports.
Typical flow:
Upload file → Get file ID → Attach file to request → Claude writes/runs Python → Claude analyses results → Optional output files returned
Use cases include:
- Data analysis
- CSV processing
- Plot generation
- Report generation
- Complex file transformations
25. RAG: Retrieval-Augmented Generation
RAG is a technique for answering questions over large documents or collections of documents.
The problem: you may have hundreds or thousands of pages of content. Sending all of it to Claude is expensive, slow, and may exceed context limits.
RAG solves this by retrieving only relevant chunks.
Direct document prompting
The simplest approach is to put the whole document in the prompt.
Benefits:
- Simple
- No preprocessing
- Easy to prototype
Limitations:
- Context limits
- Higher cost
- Slower requests
- Lower effectiveness on very long prompts
RAG approach
RAG uses two steps:
- Break documents into chunks.
- Retrieve the most relevant chunks for each question.
Then only those chunks are included in the prompt.
Benefits:
- Scales to large document sets
- Reduces cost
- Improves focus
- Speeds up generation
Trade-off: more implementation complexity.
26. Chunking Strategies for RAG
Chunking quality has a major impact on RAG quality.
Bad chunking can make retrieval fail even if the answer exists in the source material.
1. Size-based chunking
Split text into fixed-size chunks.
Pros:
- Easy to implement
- Works with any text
- Common in production
Cons:
- Can cut sentences or concepts in half
- Loses context
A common improvement is overlap:
Chunk 1: characters 0–1000
Chunk 2: characters 800–1800
Chunk 3: characters 1600–2600
Overlap duplicates some text but preserves continuity.
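A sketch of size-based chunking with overlap:

def chunk_text(text, chunk_size=1000, overlap=200):
    # Step forward by chunk_size - overlap so consecutive chunks share text.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks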
2. Structure-based chunking
Split by document structure:
- Markdown headings
- HTML sections
- Paragraphs
- Chapters
- Page sections
This is often better when source documents are well structured.
Example: split Markdown on ## headings.
3. Semantic chunking
Semantic chunking groups sentences or sections based on meaning.
It is more advanced and can produce better chunks, but is more complex to implement.
Rule of thumb
There is no universal best chunking strategy. Choose based on document type and retrieval requirements.
27. Embeddings and Semantic Search
Embeddings are numerical representations of text meaning.
An embedding model takes text and returns a vector: a long list of numbers.
"software engineering incident response" → [0.12, -0.44, 0.03, ...]
Texts with similar meanings should have vectors that are close to one another.
Semantic search flow
- Generate embeddings for all document chunks.
- Store them in a vector database.
- Generate an embedding for the user query.
- Search for the closest chunk vectors.
- Add those chunks to the Claude prompt.
This enables meaning-based retrieval rather than simple keyword matching.
Anthropic commonly recommends using Voyage AI for embeddings, although many embedding providers can be used.
28. Full RAG Pipeline
A complete RAG pipeline usually looks like this:
- Split documents into chunks.
- Generate embeddings for each chunk.
- Store embeddings in a vector database.
- Receive a user query.
- Generate an embedding for the query.
- Search for similar vectors.
- Assemble a prompt with retrieved context.
- Send the prompt to Claude.
- Return the grounded answer.
Cosine similarity
A common similarity metric is cosine similarity.
- A similarity closer to 1 means more similar.
- Cosine distance is often calculated as 1 - cosine_similarity.
- Lower cosine distance means higher similarity.
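A sketch with NumPy:

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    return 1 - cosine_similarity(a, b)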
Minimal conceptual implementation
chunks = chunk_document(document_text)
chunk_embeddings = generate_embeddings(chunks)
store = VectorIndex()
for chunk, embedding in zip(chunks, chunk_embeddings):
store.add_vector(
embedding,
metadata={"content": chunk}
)
query = "What did the software engineering team do last year?"
query_embedding = generate_embedding(query)
results = store.search(query_embedding, limit=3)
context = "\n\n".join(result.metadata["content"] for result in results)
prompt = f"""
Answer the question using the context below.
<context>
{context}
</context>
<question>
{query}
</question>
"""
29. BM25 and Hybrid Search
Semantic search is powerful, but it can miss exact terms.
BM25 is a lexical search algorithm that ranks documents based on keyword relevance.
It considers:
- Query terms
- Term frequency
- How rare terms are across the corpus
- Document relevance based on exact matches
Rare terms are usually more informative than common terms.
For example, in a technical corpus, the term “Kubernetes” is more useful than “the”.
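A sketch using the rank_bm25 package (pip install rank-bm25), reusing the chunks list from the pipeline above:

from rank_bm25 import BM25Okapi

# Naive whitespace tokenisation; production systems use better tokenisers.
tokenised_chunks = [chunk.lower().split() for chunk in chunks]
bm25 = BM25Okapi(tokenised_chunks)

query_tokens = "kubernetes incident response".split()
scores = bm25.get_scores(query_tokens)  # one relevance score per chunk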
Hybrid search
A robust RAG system often combines:
- Vector search for semantic similarity
- BM25 for exact keyword matching
This improves retrieval when:
- User queries contain specific names
- Exact terms matter
- Acronyms are important
- Semantic search retrieves plausible but wrong chunks
30. Multi-Index RAG and Reciprocal Rank Fusion
A multi-index RAG pipeline uses more than one retrieval method.
Example:
Query → Vector index
→ BM25 index
→ Merge results
→ Claude
Reciprocal Rank Fusion
Reciprocal Rank Fusion merges ranked results from multiple search systems.
Formula:
score = sum(1 / (k + rank)), where k is a smoothing constant (often 60; the simplest variant uses k = 1)
Example:
Vector search: doc2, doc7, doc6
BM25 search: doc6, doc2, doc7
RRF combines ranks so documents that perform well across search methods rise to the top.
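A sketch using the common smoothing constant k = 60:

def reciprocal_rank_fusion(ranked_lists, k=60):
    # ranked_lists: one best-first list of document IDs per search system.
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([
    ["doc2", "doc7", "doc6"],  # vector search
    ["doc6", "doc2", "doc7"],  # BM25
])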
Benefits:
- Better retrieval accuracy
- Modular design
- Easy to add more indexes
- Handles edge cases better than one retrieval method alone
31. Reranking Results
Reranking is a post-processing step.
First, retrieve candidate documents using vector search, BM25, or hybrid search. Then ask an LLM to reorder the candidates by relevance.
Flow:
Retrieve candidates → Send query + candidates to Claude → Return ranked document IDs
To save tokens, pass document IDs and concise snippets rather than full documents.
Structured output can be enforced with prefill and stop sequences:
{
"ranked_document_ids": ["doc_2", "doc_6", "doc_7"]
}
Reranking improves quality, but adds latency because it requires another model call.
Use it when retrieval quality matters more than response speed.
32. Contextual Retrieval
When a document is chunked, each chunk loses some surrounding context. A chunk may refer to “this project”, “the incident”, or “the team” without explaining what those refer to.
Contextual retrieval solves this by adding generated context to each chunk before indexing.
Process
- Take the original document and one chunk.
- Ask Claude to write brief context explaining where the chunk fits.
- Prepend or append that context to the chunk.
- Store the contextualised chunk in the vector and BM25 indexes.
Example:
Context: This chunk is from the software engineering section of the 2023 annual report and describes incident response improvements.
Original chunk: The team reduced mean time to recovery by introducing automated rollback procedures...
This improves retrieval because the chunk now contains missing context.
For very large documents, include:
- The beginning of the document
- The chunks immediately before the target chunk
- The target chunk itself
This gives Claude enough context without sending the entire document.
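A sketch of the contextualisation step (the prompt wording is an assumption to adapt; because the same document is resent for every chunk, this step pairs naturally with prompt caching, covered in Section 36):

CONTEXT_PROMPT = """<document>
{document}
</document>

<chunk>
{chunk}
</chunk>

Write one or two sentences situating this chunk within the document so it
can be understood on its own. Return only that context."""

def contextualise_chunk(document, chunk):
    response = client.messages.create(
        model=model,
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(document=document, chunk=chunk)
        }]
    )
    return response.content[0].text.strip() + "\n\n" + chunk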
33. Extended Thinking
Extended thinking gives Claude additional reasoning time before producing a final answer.
It is useful for complex tasks where normal prompting and evaluation are not enough.
Important constraints:
- Thinking tokens are charged.
- Thinking increases latency.
- The thinking budget has a minimum size.
- max_tokens must be greater than the thinking budget.
Example configuration concept:
response = client.messages.create(
model=model,
max_tokens=4096,
thinking={
"type": "enabled",
"budget_tokens": 1024
},
messages=[
{
"role": "user",
"content": "Analyse this complex architecture trade-off."
}
]
)
Use extended thinking after normal prompt engineering fails to reach the required accuracy.
34. Image Support
Claude can process images in user messages.
Use cases include:
- Image description
- Counting objects
- Visual comparison
- UI analysis
- Diagram interpretation
- Risk assessment from satellite imagery
Images are included as special content blocks, either from base64 data or URLs depending on the API capability.
Example structure:
message = {
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/png",
"data": base64_image
}
},
{
"type": "text",
"text": "Analyse this image step by step."
}
]
}
Prompt quality matters enormously for image tasks. Simple prompts often produce weak results.
Good image prompts include:
- Step-by-step inspection instructions
- Clear criteria
- Verification steps
- Examples when possible
- Required output format
35. PDF Support and Citations
Claude can read PDFs directly using document blocks.
The structure is similar to image input, but the type is document and the media type is application/pdf.
Claude can analyse:
- Text
- Tables
- Charts
- Images
- Mixed document content
Citations
Citations allow Claude to reference where information came from.
Citation types include:
- Page-based citations for PDFs
- Character-location citations for text documents
To enable citations, include a citations setting on the document block and provide a title.
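A sketch of a document block with citations enabled, assuming base64_pdf holds the encoded file:

message = {
    "role": "user",
    "content": [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": base64_pdf
            },
            "title": "2023 Annual Report",
            "citations": {"enabled": True}
        },
        {
            "type": "text",
            "text": "Summarise the key findings and cite your sources."
        }
    ]
}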
Citations are useful because they let users verify answers instead of trusting the model blindly.
For document-heavy applications, citations are critical for transparency.
36. Prompt Caching
Prompt caching improves speed and reduces cost by reusing processing work from previous requests.
Normally, every request is processed from scratch. If you send the same large system prompt, tool definitions, or document context repeatedly, that work is repeated each time.
Prompt caching stores processed input temporarily so identical content can be reused.
Cache rules
Important rules:
- Cache entries are short-lived: the default lifetime is about five minutes (refreshed on each use), with a longer one-hour option available.
- You must add cache breakpoints manually.
- Content before the breakpoint must remain identical.
- Any change before the breakpoint invalidates that cache layer.
- Minimum cacheable content thresholds may apply.
- A limited number of breakpoints can be used per request.
Longhand text blocks
To use cache control, you often need longhand content blocks:
system = [
{
"type": "text",
"text": "You are a helpful assistant for analysing legal contracts...",
"cache_control": {"type": "ephemeral"}
}
]
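After a request, the usage object reports cache activity, which is how you verify a breakpoint is actually working:

response = client.messages.create(
    model=model,
    max_tokens=1000,
    system=system,  # the longhand block defined above
    messages=messages
)
print(response.usage.cache_creation_input_tokens)  # tokens written to cache
print(response.usage.cache_read_input_tokens)      # tokens served from cache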
Common cache locations
Good cache breakpoint candidates:
- Tool schemas
- Long system prompts
- Static document prefixes
- Repeated instruction blocks
- Stable conversation prefixes
The processing order is generally:
Tools → System prompt → Messages
So tool schema caching is often especially useful.
37. MCP: Model Context Protocol
MCP is a protocol for connecting models to tools, resources, and prompts without every application having to implement custom integrations manually.
Instead of your application defining every tool schema and function, an MCP server exposes capabilities in a standard way.
Architecture
Your application → MCP client → MCP server → External service
An MCP server can expose:
- Tools
- Resources
- Prompts
This shifts integration work from every application developer to reusable MCP servers.
Why MCP matters
Suppose you want Claude to interact with GitHub.
Without MCP, you might need to implement tools for:
- Listing repositories
- Reading files
- Creating issues
- Updating pull requests
- Searching commits
- Managing projects
With MCP, a GitHub MCP server can expose these capabilities for you.
MCP and tool use are complementary. Tool use is the model capability. MCP is a standard way to provide tools and context.
38. MCP Clients and Servers
An MCP client communicates with an MCP server.
Transport can vary:
- stdio
- HTTP
- WebSockets
A common local setup runs the client and server on the same machine using standard input/output.
Typical MCP flow
- User asks your application a question.
- Your server asks the MCP client for available tools.
- The MCP client sends a list_tools request to the MCP server.
- The MCP server returns tool definitions.
- Your server sends the user query and tools to Claude.
- Claude requests a tool call.
- Your server asks the MCP client to execute it.
- The MCP client sends a call_tool request to the MCP server.
- The MCP server executes the action.
- Results flow back to Claude and then to the user.
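With the Python MCP SDK, a local stdio session might look like this sketch (the command and tool name are illustrative):

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server_params = StdioServerParameters(command="uv", args=["run", "main.py"])

async def main():
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()  # discover capabilities
            result = await session.call_tool(
                "read_doc_contents", {"doc_id": "doc_123"}
            )

asyncio.run(main())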
39. Building MCP Tools
With the Python MCP SDK, tools can be defined using decorators instead of writing JSON schemas manually.
Example:
from mcp.server.fastmcp import FastMCP
from pydantic import Field

mcp = FastMCP("documents")

# In-memory document store used by these examples.
docs = {"doc_123": "Example document contents."}
@mcp.tool(
name="read_doc_contents",
description="Read the contents of a document by ID."
)
def read_doc_contents(
doc_id: str = Field(description="The ID of the document to read")
):
if doc_id not in docs:
raise ValueError(f"Document not found: {doc_id}")
return docs[doc_id]
The SDK generates schemas from type hints and field descriptions.
A second tool might edit a document:
@mcp.tool(
name="edit_document",
description="Replace a string inside a document."
)
def edit_document(
doc_id: str,
old_string: str,
new_string: str
):
if doc_id not in docs:
raise ValueError(f"Document not found: {doc_id}")
docs[doc_id] = docs[doc_id].replace(old_string, new_string)
return "Document updated"
40. MCP Resources
Resources expose data for read operations.
There are two common types:
Direct resources
Static URI:
docs://documents
Templated resources
Parameterized URI:
docs://documents/{doc_id}
Resources are useful when the client wants to fetch context proactively, rather than waiting for Claude to request a tool.
Example:
@mcp.resource("docs://documents/{doc_id}", mime_type="text/plain")
def read_document_resource(doc_id: str):
return docs[doc_id]
The client reads resources by URI and parses them based on MIME type.
41. MCP Prompts
MCP servers can also expose predefined prompt templates.
This is useful because server authors often know the best way to use their tools.
Example use case: a document server exposes a prompt called format_document_as_markdown.
The prompt might instruct Claude to:
- Read a document using available tools.
- Convert it to Markdown.
- Save the updated version.
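On the server side, such a prompt could be defined like this sketch (the decorator arguments mirror the tool decorator; treat the exact signature as an assumption):

@mcp.prompt(
    name="format_document_as_markdown",
    description="Rewrite a document as clean Markdown using the document tools."
)
def format_document_as_markdown(doc_id: str) -> str:
    return (
        f"Read the document with ID {doc_id} using the read_doc_contents tool, "
        "convert its contents to well-structured Markdown, and save the result "
        "with the edit_document tool."
    )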
Clients can list available prompts and invoke them with arguments.
Conceptual client flow:
prompts = await session.list_prompts()
messages = await session.get_prompt(
"format_document_as_markdown",
arguments={"document_id": "doc_123"}
)
The returned messages can be sent directly to Claude.
42. Claude Code
Claude Code is a terminal-based coding assistant. It is useful as a practical example of agent architecture.
It can:
- Search files
- Read files
- Edit files
- Run terminal commands
- Help set up projects
- Write tests
- Debug issues
- Work with Git
- Connect to MCP servers
Effective Claude Code workflow
A strong workflow is:
- Ask Claude to inspect relevant files.
- Ask it to propose a plan without coding yet.
- Review the plan.
- Ask it to implement.
- Ask it to run tests.
- Ask it to commit changes.
Another useful workflow is test-driven:
- Provide context.
- Ask Claude to suggest tests.
- Select tests.
- Ask Claude to implement until tests pass.
Claude Code is most effective when treated as a collaborative engineer, not a one-shot code generator.
Project memory
Claude Code can create a CLAUDE.md file after scanning a project (the /init command does this). This file stores project-specific context such as architecture, style, and conventions.
You can also add notes to memory so future requests have more context.
43. Enhancing Claude Code with MCP Servers
Claude Code includes an MCP client. That means you can connect it to custom MCP servers.
Example command:
claude mcp add document-server "uv run main.py"
After restarting Claude Code, the new server capabilities become available.
Use cases include:
- Sentry integration
- Jira integration
- Slack integration
- Custom document processing
- Internal deployment tools
- Production monitoring
This allows Claude Code to gain new capabilities without changing its core implementation.
44. Parallelising Claude Code with Git Worktrees
Running multiple Claude Code instances on the same repository can cause conflicts if they edit the same files.
Git worktrees solve this by creating isolated working directories tied to different branches.
Workflow:
Create worktree → Start Claude instance → Assign task → Commit changes → Merge branch → Remove worktree
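In git terms, the loop looks roughly like this (paths and branch names are illustrative):

git worktree add ../feature-auth -b feature-auth
cd ../feature-auth && claude        # start an isolated Claude Code instance
# ... Claude works and commits on the feature-auth branch ...
cd ../main-repo
git merge feature-auth
git worktree remove ../feature-auth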
This lets one developer manage multiple AI coding agents in parallel.
You can also create custom Claude commands in .claude/commands using Markdown files and $ARGUMENTS placeholders.
Parallelism is powerful, but the bottleneck becomes the developer’s ability to review and coordinate the work.
45. Automated Debugging
Claude can be integrated into automated debugging workflows.
Example daily production debugging workflow:
- GitHub Action runs every day.
- It fetches CloudWatch logs from the last 24 hours.
- Claude identifies errors and deduplicates them.
- Claude analyses likely causes.
- Claude proposes fixes.
- Claude Code creates a pull request.
This is valuable for catching production-only issues such as:
- Environment-specific configuration errors
- Invalid model IDs
- Missing API keys
- Deployment-only runtime failures
The key benefit is not automatic merging. The key benefit is a reviewable pull request with context and proposed fixes.
46. Computer Use
Computer use allows Claude to interact with graphical interfaces through screenshots and actions.
Claude can:
- Take screenshots
- Move the mouse
- Click
- Type
- Navigate applications
- Test web interfaces
- Report results
The important point: Claude does not directly control your computer. It requests tool actions, and an execution environment performs them.
Typical implementation:
User instruction → Claude observes screenshot → Claude requests click/type action → Container executes action → Screenshot returned → Claude continues
Computer use is useful for:
- UI testing
- QA automation
- Browser workflows
- Repetitive interface tasks
- Bug discovery
It relies heavily on environment inspection. After every action, Claude needs to observe the new state to decide what to do next.
47. Workflows vs Agents
A workflow is a predefined sequence of steps.
An agent dynamically decides what steps to take using available tools.
Use workflows when the steps are known
Workflows are more reliable, easier to test, and easier to optimise.
Example: image-to-3D-model workflow.
- Claude describes the image.
- Claude writes CADQuery code.
- The system renders the model.
- Claude compares render to source image.
- If needed, repeat with feedback.
This is an evaluator-optimizer workflow.
Use agents when the steps are unknown
Agents are useful when tasks vary and the model must decide how to proceed.
Agents need tools, feedback, and environment inspection.
Good agent tools are usually abstract rather than overly specific:
- bash
- read_file
- write_file
- web_fetch
- get_current_datetime
- add_duration
- set_reminder
Avoid creating only hyper-specific tools like install_dependencies_for_react_project, because generic tools compose better.
48. Workflow Patterns
Several workflow patterns are useful when building Claude applications.
Chaining
Break a large task into sequential steps.
Example:
Topic → Research → Outline → Draft → Rewrite for constraints → Final output
This helps when a single long prompt causes Claude to miss constraints.
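A chaining sketch, reusing the client and model from earlier sections:

def ask(prompt):
    response = client.messages.create(
        model=model,
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

def write_article(topic):
    # Each step consumes the previous step's output.
    outline = ask(f"Create a detailed outline for an article about {topic}.")
    draft = ask(f"Write an article following this outline:\n\n{outline}")
    return ask(f"Rewrite this draft to be under 800 words, in a neutral tone:\n\n{draft}")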
Parallelisation
Split a decision into independent subtasks, run them in parallel, then aggregate.
Example: choosing a material for a part.
Evaluate metal
Evaluate polymer
Evaluate ceramic
Evaluate composite
→ Aggregate recommendation
Each subtask gets focused attention.
Routing
First classify the input, then route to a specialised prompt or pipeline.
Example:
User topic → classify as educational / entertainment / technical → use matching script template
Routing is useful when different inputs require different tones, tools, or structures.
Evaluator-optimizer
Generate an output, evaluate it, then improve it.
Example:
Generate answer → Check constraints → Rewrite violations → Final answer
This pattern is very useful for quality control.
49. Environment Inspection for Agents
Agents need feedback.
After taking an action, they must inspect the result.
Examples:
- A computer-use agent takes screenshots after clicks.
- A coding agent reads files before editing.
- A video-generation agent extracts frames with FFmpeg to inspect visual output.
- A captioning agent checks timestamps after running Whisper.
Without environment inspection, agents act blindly. With inspection, they can detect errors and adapt.
This is one of the main differences between a simple tool call and a robust agent loop.
50. Practical Architecture Recommendations
A reliable Claude application usually combines several patterns.
Start simple
Begin with:
- One model
- Clear prompt
- Basic message history
- Simple API wrapper
Add structure
Then add:
- System prompts
- Output formatting rules
- Stop sequences or structured tool extraction
- Prompt evaluations
Add tools
When Claude needs external data or actions, add tools:
- Current time
- Database search
- Internal APIs
- File access
- Web search
- Code execution
Add retrieval
When the knowledge base grows, add RAG:
- Chunking
- Embeddings
- Vector search
- BM25
- Reranking
- Contextual retrieval
Add workflows before agents
If you know the steps, build a workflow. It will be more reliable.
Use agents only when flexibility is truly required.
Conclusion
Building with Claude is not only about sending a prompt and reading a response. A production-grade Claude application involves architecture decisions around models, prompts, state, streaming, tool execution, structured outputs, evaluations, retrieval, caching, MCP integrations, and agent design.
The most important principle is reliability. Start with the simplest architecture that works, evaluate it systematically, and only add advanced patterns when they solve a real problem.
Use:
- Sonnet as the default model for most production use cases.
- Haiku for speed, routing, classification, and high-volume tasks.
- Opus for the hardest reasoning and planning tasks.
- Prompt evaluations before shipping prompts.
- Tools when Claude needs external data or actions.
- RAG when knowledge is too large for direct prompting.
- MCP when you want reusable integrations.
- Workflows when the process is known.
- Agents when the process must be discovered dynamically.
The best Claude systems are not just clever prompts. They are well-designed software systems where the model, tools, data, and control flow work together.