Claude API Rate Limits: Complete Guide (2026)

If you are building applications with the Anthropic Claude API, understanding rate limits is critical. Hit a rate limit at the wrong time and your application stalls, users see errors, and your queue backs up. This guide covers every rate limit tier, how to detect when you are approaching limits, and proven strategies for handling them gracefully.

How Claude API Rate Limits Work

Anthropic enforces rate limits on the Claude API using three dimensions simultaneously:

Dimension	What It Measures	How It Resets
Requests per minute (RPM)	Number of API calls	Rolling 1-minute window
Input tokens per minute (ITPM)	Tokens sent to the API	Rolling 1-minute window
Output tokens per minute (OTPM)	Tokens generated by Claude	Rolling 1-minute window

You hit a rate limit when any one of these three dimensions is exceeded. This means even if you are well under your RPM limit, sending a few very long prompts can max out your input token limit.
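To make the "any one dimension" rule concrete, here is a minimal sketch that checks a planned minute of traffic against all three limits. The `Limits` dataclass and `exceeded_dimensions` helper are hypothetical names used only for illustration, and the numbers are the Tier 1 Claude Sonnet 4 figures listed later in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Per-minute limits for one model (hypothetical container)."""
    rpm: int
    itpm: int
    otpm: int


def exceeded_dimensions(limits, requests, input_tokens, output_tokens):
    """Return the name of every dimension the planned usage would exceed."""
    usage = {"rpm": requests, "itpm": input_tokens, "otpm": output_tokens}
    return [name for name, used in usage.items() if used > getattr(limits, name)]


tier1_sonnet = Limits(rpm=50, itpm=40_000, otpm=8_000)

# Five requests is far below 50 RPM, but 5 x 10,000 input tokens
# is more than the 40,000 ITPM allowance.
print(exceeded_dimensions(tier1_sonnet, requests=5,
                          input_tokens=50_000, output_tokens=2_000))
```

A single non-empty entry in the returned list is enough for the API to reject further requests in that window.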

Rate Limit Tiers

Anthropic uses a tiered system based on your account's usage history and spend. As of early 2026, the tiers are structured as follows:

Tier 1 (New Accounts)

Model	RPM	Input TPM	Output TPM
Claude Opus 4	50	20,000	4,000
Claude Sonnet 4	50	40,000	8,000
Claude Haiku 3.5	50	50,000	10,000

Tier 2

Model	RPM	Input TPM	Output TPM
Claude Opus 4	1,000	80,000	16,000
Claude Sonnet 4	1,000	160,000	32,000
Claude Haiku 3.5	2,000	200,000	40,000

Tier 3

Model	RPM	Input TPM	Output TPM
Claude Opus 4	2,000	400,000	80,000
Claude Sonnet 4	2,000	800,000	160,000
Claude Haiku 3.5	4,000	1,000,000	200,000

Tier 4 (High Volume)

Model	RPM	Input TPM	Output TPM
Claude Opus 4	4,000	2,000,000	400,000
Claude Sonnet 4	4,000	4,000,000	800,000
Claude Haiku 3.5	8,000	5,000,000	1,000,000

Note: Exact numbers may vary. Anthropic adjusts these limits periodically and may offer custom limits for enterprise accounts. Always check the official Anthropic documentation for the most current figures.

How to Check Your Current Tier

You can check your tier and current limits in the Anthropic Console under Settings > Limits. Your tier automatically upgrades as your account accumulates spend over time.

Rate Limit Response Headers

Every API response from Claude includes headers that tell you exactly where you stand relative to your limits:

anthropic-ratelimit-requests-limit: 1000
anthropic-ratelimit-requests-remaining: 998
anthropic-ratelimit-requests-reset: 2026-02-06T12:01:00Z
anthropic-ratelimit-tokens-limit: 160000
anthropic-ratelimit-tokens-remaining: 145230
anthropic-ratelimit-tokens-reset: 2026-02-06T12:01:00Z

Header	Meaning
`anthropic-ratelimit-requests-limit`	Your RPM limit
`anthropic-ratelimit-requests-remaining`	Requests left in the current window
`anthropic-ratelimit-requests-reset`	When the request counter resets
`anthropic-ratelimit-tokens-limit`	Your token-per-minute limit
`anthropic-ratelimit-tokens-remaining`	Tokens remaining in the current window
`anthropic-ratelimit-tokens-reset`	When the token counter resets
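As a rough sketch, the six headers can be parsed into a typed snapshot before your application acts on them. `RateLimitSnapshot` and `parse_rate_limit_headers` are hypothetical helpers, not part of any SDK, and the sample values are the ones shown above. The code assumes the reset headers are timestamps in the format shown, ending in `Z`.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class RateLimitSnapshot:
    requests_limit: int
    requests_remaining: int
    tokens_limit: int
    tokens_remaining: int
    tokens_reset: datetime


def parse_rate_limit_headers(headers):
    """Build a snapshot from a mapping of header names to string values."""
    prefix = "anthropic-ratelimit-"
    # fromisoformat on older Python versions rejects a trailing "Z",
    # so convert it to an explicit UTC offset first.
    reset = headers[prefix + "tokens-reset"].replace("Z", "+00:00")
    return RateLimitSnapshot(
        requests_limit=int(headers[prefix + "requests-limit"]),
        requests_remaining=int(headers[prefix + "requests-remaining"]),
        tokens_limit=int(headers[prefix + "tokens-limit"]),
        tokens_remaining=int(headers[prefix + "tokens-remaining"]),
        tokens_reset=datetime.fromisoformat(reset),
    )


sample_headers = {
    "anthropic-ratelimit-requests-limit": "1000",
    "anthropic-ratelimit-requests-remaining": "998",
    "anthropic-ratelimit-requests-reset": "2026-02-06T12:01:00Z",
    "anthropic-ratelimit-tokens-limit": "160000",
    "anthropic-ratelimit-tokens-remaining": "145230",
    "anthropic-ratelimit-tokens-reset": "2026-02-06T12:01:00Z",
}
snapshot = parse_rate_limit_headers(sample_headers)
print(snapshot.tokens_remaining, snapshot.tokens_reset.isoformat())
```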

Reading Headers in Code

import anthropic

client = anthropic.Anthropic()

# with_raw_response returns the HTTP response, so the headers are available
raw_response = client.messages.with_raw_response.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello, Claude!"}]
)
message = raw_response.parse()  # the usual Message object

# Access rate limit info from the response headers
headers = raw_response.headers
print(f"Requests remaining: {headers.get('anthropic-ratelimit-requests-remaining')}")
print(f"Tokens remaining: {headers.get('anthropic-ratelimit-tokens-remaining')}")
print(f"Resets at: {headers.get('anthropic-ratelimit-requests-reset')}")

What Happens When You Hit a Rate Limit

When you exceed any rate limit dimension, the API returns a 429 Too Many Requests response:

{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Number of request tokens has exceeded your per-minute rate limit (https://docs.anthropic.com/en/api/rate-limits); see the response headers for current usage. Please reduce the prompt length or the number of messages, and try again. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."
  }
}

The response also includes a retry-after header indicating how many seconds to wait before retrying.
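If you call the API over raw HTTP rather than through an SDK, you need to recognize this error shape yourself. The sketch below is one way to do that; `is_rate_limit_error` is a hypothetical helper, and the sample body is a shortened version of the JSON shown above.

```python
import json


def is_rate_limit_error(status_code, body_text):
    """Return True when a response matches the 429 error shape shown above."""
    if status_code != 429:
        return False
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        # A non-JSON body (for example from a proxy) is not the API's error format.
        return False
    if not isinstance(body, dict):
        return False
    error = body.get("error")
    return (
        body.get("type") == "error"
        and isinstance(error, dict)
        and error.get("type") == "rate_limit_error"
    )


sample_body = json.dumps({
    "type": "error",
    "error": {
        "type": "rate_limit_error",
        "message": "Number of request tokens has exceeded your per-minute rate limit",
    },
})
print(is_rate_limit_error(429, sample_body))
```

Checking both the status code and the `error.type` field keeps a 429 from an intermediate proxy from being mistaken for an API rate limit.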

Retry Strategies

Basic Exponential Backoff

The simplest approach is to retry with exponentially increasing delays:

import time
import anthropic

client = anthropic.Anthropic()

def call_claude_with_retry(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                messages=messages,
            )
            return response
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            wait_time = (2 ** attempt) + 0.5  # 1.5s, 2.5s, 4.5s, 8.5s (the fifth failure re-raises)
            print(f"Rate limited. Retrying in {wait_time}s...")
            time.sleep(wait_time)
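The fixed 0.5-second offset above means every client that was rate limited at the same moment retries at the same moment. A common refinement, which is not part of the snippet above, is "full jitter": pick a random delay between zero and the exponential ceiling. The sketch below shows only the delay calculation; `backoff_delay`, `base`, and `cap` are hypothetical names and values.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


# A seeded generator makes the sequence reproducible for this demo.
demo_rng = random.Random(0)
delays = [backoff_delay(attempt, rng=demo_rng) for attempt in range(6)]
print([round(delay, 2) for delay in delays])
```

To use it, replace the `wait_time` calculation in the retry loop with `backoff_delay(attempt)`. The cap keeps late retries from sleeping for minutes.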

Using the `retry-after` Header

A better approach reads the retry-after header from the 429 response:

import time
import anthropic

client = anthropic.Anthropic()

def call_claude_with_retry_after(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                messages=messages,
            )
            return response
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Use retry-after header if available, otherwise exponential backoff
            http_response = getattr(e, 'response', None)
            retry_after = http_response.headers.get('retry-after') if http_response is not None else None
            if retry_after:
                wait_time = float(retry_after)  # header value is in seconds
            else:
                wait_time = 2 ** attempt
            print(f"Rate limited. Waiting {wait_time}s before retry {attempt + 1}...")
            time.sleep(wait_time)

Token-Aware Request Queuing

For production systems handling many concurrent requests, implement a token-aware queue:

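import asyncio
import time
from dataclasses import dataclass
from datetime import datetime

@dataclass
class RateLimitState:
    requests_remaining: int = 1000
    tokens_remaining: int = 160000
    reset_time: float = 0.0  # Unix timestamp of the next token reset

class TokenAwareQueue:
    def __init__(self, client):
        # client is expected to be an anthropic.AsyncAnthropic instance
        self.client = client
        self.state = RateLimitState()
        self.lock = asyncio.Lock()

    async def call(self, messages, estimated_tokens=500):
        # The lock serializes calls so the state read below is never stale
        async with self.lock:
            # Wait for the window to reset if we are close to the limit
            if self.state.tokens_remaining < estimated_tokens:
                wait_time = max(0, self.state.reset_time - time.time())
                if wait_time > 0:
                    await asyncio.sleep(wait_time)

            # with_raw_response exposes the HTTP headers alongside the body
            raw = await self.client.messages.with_raw_response.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                messages=messages,
            )

            # Update state from response headers, keeping old values if absent
            headers = raw.headers
            self.state.requests_remaining = int(headers.get(
                'anthropic-ratelimit-requests-remaining', self.state.requests_remaining))
            self.state.tokens_remaining = int(headers.get(
                'anthropic-ratelimit-tokens-remaining', self.state.tokens_remaining))
            reset = headers.get('anthropic-ratelimit-tokens-reset')
            if reset:
                # The header is a timestamp such as 2026-02-06T12:01:00Z
                self.state.reset_time = datetime.fromisoformat(
                    reset.replace('Z', '+00:00')).timestamp()

            return raw.parse()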

Best Practices for Staying Under Rate Limits

1. Use the Right Model for the Job

Do not use Claude Opus for tasks that Claude Haiku can handle. Haiku has higher rate limits and is significantly cheaper:

Task	Recommended Model
Simple classification	Haiku 3.5
Summarization	Sonnet 4
Code generation	Sonnet 4
Complex reasoning	Opus 4
Quick extraction	Haiku 3.5

2. Reduce Input Token Usage

Trim system prompts. Every request sends your system prompt. Cut unnecessary instructions.
Use conversation summaries. Instead of sending entire conversation histories, summarize older messages.
Limit context. Only include the context the model actually needs.

# Bad: Sending entire file content for a simple question
messages = [{"role": "user", "content": f"What language is this file? {entire_10000_line_file}"}]

# Good: Send only what's needed
messages = [{"role": "user", "content": f"What language is this file? First 20 lines:\n{first_20_lines}"}]
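For conversation histories, one simple way to limit context is to keep only the most recent messages that fit a token budget. The sketch below assumes a rough heuristic of about four characters per token, which is only an approximation for English text; `estimate_tokens` and `trim_history` are hypothetical helpers, and a real system would use an actual token count.

```python
def estimate_tokens(text):
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def trim_history(messages, budget):
    """Keep the newest messages whose estimated tokens fit within budget."""
    kept, used = [], 0
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    # Drop leading assistant turns so the history still starts with a user message.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept


history = [
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 40},
]
print(len(trim_history(history, budget=250)), len(trim_history(history, budget=120)))
```

Walking the list from newest to oldest means the latest user turn is always considered first, so tightening the budget removes old context before recent context.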

3. Batch Requests Strategically

If you need to process 100 items, do not fire 100 simultaneous requests. Instead, batch them with concurrency limits:

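import asyncio

async def process_batch(items, max_concurrent=5):
    # The semaphore caps how many requests are in flight at once
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_one(item):
        async with semaphore:
            # call_claude is your own async wrapper around the API call
            return await call_claude(item)

    # gather returns results in the same order as items
    results = await asyncio.gather(*[process_one(item) for item in items])
    return results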
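To see that the semaphore really caps concurrency, the same pattern can be run against a stand-in worker that records how many calls are active at once. This is a self-contained sketch: `run_limited` and `ConcurrencyProbe` are hypothetical names, and the probe sleeps briefly instead of calling the API.

```python
import asyncio


async def run_limited(items, worker, max_concurrent=5):
    """Run worker(item) for every item with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


class ConcurrencyProbe:
    """Stand-in for an API call that tracks peak concurrency."""

    def __init__(self):
        self.active = 0
        self.peak = 0

    async def __call__(self, item):
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0.01)  # simulate network latency
        self.active -= 1
        return item * 2


probe = ConcurrencyProbe()
results = asyncio.run(run_limited(range(20), probe, max_concurrent=5))
print(probe.peak, results[:3])
```

All 20 items complete, but the probe never sees more than five running at the same time.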

4. Use the Message Batches API

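For non-time-sensitive workloads, Anthropic's Message Batches API lets you submit up to 100,000 requests (or 256 MB, whichever limit is reached first) in a single batch. Batch requests are rate-limited separately from the standard Messages API, are processed asynchronously with results available within 24 hours, and are billed at a 50% discount.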

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"request-{i}",
            "params": {
                "model": "claude-sonnet-4-20250514",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}],
            }
        }
        for i, prompt in enumerate(prompts)
    ]
)

5. Cache Repeated Requests

If multiple users ask similar questions, cache the responses:

import hashlib
import json

def get_cache_key(messages, model):
    content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()
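The key function above only matches requests whose messages and model are identical, so it catches exact repeats rather than loosely similar questions. A minimal in-memory cache built on that key might look like the sketch below. `ResponseCache`, its `ttl_seconds` default, and the injectable `clock` are assumptions for illustration; a production system would more likely use a shared store.

```python
import hashlib
import json
import time


def get_cache_key(messages, model):
    content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()


class ResponseCache:
    """Exact-match response cache with a time-to-live per entry."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    def get(self, messages, model):
        key = get_cache_key(messages, model)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired, so force a fresh API call
            return None
        return value

    def put(self, messages, model, value):
        self._entries[get_cache_key(messages, model)] = (self._clock(), value)


cache = ResponseCache(ttl_seconds=60)
question = [{"role": "user", "content": "What is a rate limit?"}]
cache.put(question, "claude-sonnet-4-20250514", "cached answer")
print(cache.get(question, "claude-sonnet-4-20250514"))
```

Check the cache before calling the API and store the response text afterwards. Every hit is a request that never counts against your limits.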

6. Use Prompt Caching

Anthropic supports prompt caching for system prompts and long context. Cached tokens do not count toward your input token rate limit on subsequent requests and cost 90% less:

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Your very long system prompt here...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Your question"}]
)

Monitoring Rate Limit Usage

For production systems, log your rate limit headers and set up alerts:

Alert at 80% usage to give yourself time to react
Track patterns to identify peak hours
Monitor by model since each model has independent limits
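The 80% alert can be computed directly from the limit and remaining headers. The sketch below is one possible shape for that check: `usage_fraction` and `dimensions_over_threshold` are hypothetical helpers, and the header values are made up for the example.

```python
def usage_fraction(headers, kind):
    """Fraction of the limit already used for 'requests' or 'tokens'."""
    limit = int(headers[f"anthropic-ratelimit-{kind}-limit"])
    remaining = int(headers[f"anthropic-ratelimit-{kind}-remaining"])
    if limit <= 0:
        return 0.0
    return (limit - remaining) / limit


def dimensions_over_threshold(headers, threshold=0.8):
    """Return the dimensions whose usage is at or above the threshold."""
    return [
        kind
        for kind in ("requests", "tokens")
        if usage_fraction(headers, kind) >= threshold
    ]


example_headers = {
    "anthropic-ratelimit-requests-limit": "1000",
    "anthropic-ratelimit-requests-remaining": "998",
    "anthropic-ratelimit-tokens-limit": "160000",
    "anthropic-ratelimit-tokens-remaining": "20000",
}
print(dimensions_over_threshold(example_headers))
```

In this example only the token dimension is flagged, because 140,000 of 160,000 tokens (87.5%) are used while requests are almost untouched. Logging the returned list per model gives you the data for alerting and for spotting peak hours.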

When to Request a Rate Limit Increase

If you consistently hit limits despite optimization, contact Anthropic sales for a custom plan. Be prepared with:

Your current usage patterns (RPM, TPM)
Expected growth over the next 3-6 months
Your use case description

Building AI Applications at Scale

Rate limits are one piece of the puzzle when building production AI applications. If your project involves media generation (images, video, audio, avatars) alongside text generation, consider using a unified API platform like Hypereal AI that handles rate limiting, queuing, and retries across multiple AI models, so you can focus on your application logic instead of infrastructure.

Summary

Managing Claude API rate limits comes down to three principles: know your limits (check headers), use tokens efficiently (right model, minimal context), and handle 429 errors gracefully (exponential backoff with retry-after). Implement these strategies and your application will stay reliable even under heavy load.
