
Claude API Rate Limits: Complete Guide (2026)

Every rate limit tier, header, and best practice for the Anthropic API

Hypereal AI Team
8 min read
February 6, 2026

If you are building applications with the Anthropic Claude API, understanding rate limits is critical. Hit a rate limit at the wrong time and your application stalls, users see errors, and your queue backs up. This guide covers every rate limit tier, how to detect when you are approaching limits, and proven strategies for handling them gracefully.

How Claude API Rate Limits Work

Anthropic enforces rate limits on the Claude API using three dimensions simultaneously:

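| Dimension | What It Measures | How It Resets |
| --- | --- | --- |
| Requests per minute (RPM) | Number of API calls | Continuously replenished (token bucket), not reset at fixed intervals |
| Input tokens per minute (ITPM) | Tokens sent to the API | Continuously replenished (token bucket), not reset at fixed intervals |
| Output tokens per minute (OTPM) | Tokens generated by Claude | Continuously replenished (token bucket), not reset at fixed intervals |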

You hit a rate limit when any one of these three dimensions is exceeded. This means even if you are well under your RPM limit, sending a few very long prompts can max out your input token limit.
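
To make the "any one dimension" rule concrete, here is a minimal sketch that compares one minute of usage against a set of limits and reports which dimensions are exceeded. The function name and the dict keys are illustrative assumptions, not Anthropic API fields, and the limit values are only example numbers.

```python
# Illustrative only: the names below are assumptions, not Anthropic API fields.
def exceeded_dimensions(usage: dict, limits: dict) -> list:
    """Return the rate limit dimensions where usage is above the limit."""
    return [name for name in ("rpm", "itpm", "otpm") if usage[name] > limits[name]]

# Example limits (requests, input tokens, output tokens per minute).
limits = {"rpm": 50, "itpm": 40_000, "otpm": 8_000}

# Only 5 requests in the last minute, but each carried a 10,000-token prompt.
usage = {"rpm": 5, "itpm": 50_000, "otpm": 2_500}

print(exceeded_dimensions(usage, limits))  # ['itpm']
```

Request volume is far below the RPM limit here, yet the input token dimension alone is enough to trigger a 429.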

Rate Limit Tiers

Anthropic uses a tiered system based on your account's usage history and spend. As of early 2026, the tiers are structured as follows:

Tier 1 (New Accounts)

| Model | RPM | Input TPM | Output TPM |
| --- | --- | --- | --- |
| Claude Opus 4 | 50 | 20,000 | 4,000 |
| Claude Sonnet 4 | 50 | 40,000 | 8,000 |
| Claude Haiku 3.5 | 50 | 50,000 | 10,000 |

Tier 2

| Model | RPM | Input TPM | Output TPM |
| --- | --- | --- | --- |
| Claude Opus 4 | 1,000 | 80,000 | 16,000 |
| Claude Sonnet 4 | 1,000 | 160,000 | 32,000 |
| Claude Haiku 3.5 | 2,000 | 200,000 | 40,000 |

Tier 3

| Model | RPM | Input TPM | Output TPM |
| --- | --- | --- | --- |
| Claude Opus 4 | 2,000 | 400,000 | 80,000 |
| Claude Sonnet 4 | 2,000 | 800,000 | 160,000 |
| Claude Haiku 3.5 | 4,000 | 1,000,000 | 200,000 |

Tier 4 (High Volume)

| Model | RPM | Input TPM | Output TPM |
| --- | --- | --- | --- |
| Claude Opus 4 | 4,000 | 2,000,000 | 400,000 |
| Claude Sonnet 4 | 4,000 | 4,000,000 | 800,000 |
| Claude Haiku 3.5 | 8,000 | 5,000,000 | 1,000,000 |

Note: Exact numbers may vary. Anthropic adjusts these limits periodically and may offer custom limits for enterprise accounts. Always check the official Anthropic documentation for the most current figures.
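
Because all three limits apply at once, the number of requests you can actually sustain depends on how many tokens each request uses. The sketch below works through that arithmetic with the Tier 1 Claude Sonnet 4 figures from the table above; the function name and the per-request token averages are assumptions chosen for illustration.

```python
def max_requests_per_minute(rpm, itpm, otpm, avg_input_tokens, avg_output_tokens):
    """Sustainable requests per minute given average token counts per request."""
    return min(rpm, itpm // avg_input_tokens, otpm // avg_output_tokens)

# Tier 1 Claude Sonnet 4 figures from the table above: 50 RPM, 40,000 ITPM, 8,000 OTPM.
# Assumed workload: 2,000 input tokens and 500 output tokens per request.
print(max_requests_per_minute(50, 40_000, 8_000, 2_000, 500))  # 16
```

With this workload the output token limit allows 16 requests per minute, the input token limit 20, and the RPM limit 50, so output tokens are the binding constraint.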

How to Check Your Current Tier

You can check your tier and current limits in the Anthropic Console under Settings > Limits. Your tier automatically upgrades as your account accumulates spend over time.

Rate Limit Response Headers

Every API response from Claude includes headers that tell you exactly where you stand relative to your limits:

anthropic-ratelimit-requests-limit: 1000
anthropic-ratelimit-requests-remaining: 998
anthropic-ratelimit-requests-reset: 2026-02-06T12:01:00Z
anthropic-ratelimit-tokens-limit: 160000
anthropic-ratelimit-tokens-remaining: 145230
anthropic-ratelimit-tokens-reset: 2026-02-06T12:01:00Z

| Header | Meaning |
| --- | --- |
| anthropic-ratelimit-requests-limit | Your RPM limit |
| anthropic-ratelimit-requests-remaining | Requests left before you are rate limited |
| anthropic-ratelimit-requests-reset | When the request limit will be fully replenished (RFC 3339 timestamp) |
| anthropic-ratelimit-tokens-limit | Your token-per-minute limit |
| anthropic-ratelimit-tokens-remaining | Tokens left before you are rate limited |
| anthropic-ratelimit-tokens-reset | When the token limit will be fully replenished (RFC 3339 timestamp) |
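
The reset headers are timestamps, so code that wants to pause until a reset first has to turn them into a duration. A minimal sketch, assuming the headers have already been collected into a plain dict of strings (the helper name is hypothetical):

```python
from datetime import datetime, timezone

def seconds_until_reset(reset_header: str, now: datetime) -> float:
    """Seconds between `now` and an RFC 3339 reset timestamp (never negative)."""
    # Python versions before 3.11 do not accept a trailing "Z" in fromisoformat().
    reset_at = datetime.fromisoformat(reset_header.replace("Z", "+00:00"))
    return max(0.0, (reset_at - now).total_seconds())

headers = {
    "anthropic-ratelimit-requests-remaining": "998",
    "anthropic-ratelimit-requests-reset": "2026-02-06T12:01:00Z",
}
now = datetime(2026, 2, 6, 12, 0, 30, tzinfo=timezone.utc)
print(seconds_until_reset(headers["anthropic-ratelimit-requests-reset"], now))  # 30.0
```

In real code `now` would be `datetime.now(timezone.utc)`; it is passed in here so the example is repeatable.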

Reading Headers in Code

import anthropic

client = anthropic.Anthropic()

# with_raw_response returns the HTTP response wrapper, which exposes the headers.
# A plain client.messages.create() call returns only the parsed Message.
raw_response = client.messages.with_raw_response.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello, Claude!"}]
)
message = raw_response.parse()  # the usual Message object

# Access rate limit info from the response headers
headers = raw_response.headers
print(f"Requests remaining: {headers.get('anthropic-ratelimit-requests-remaining')}")
print(f"Tokens remaining: {headers.get('anthropic-ratelimit-tokens-remaining')}")
print(f"Resets at: {headers.get('anthropic-ratelimit-requests-reset')}")

What Happens When You Hit a Rate Limit

When you exceed any rate limit dimension, the API returns a 429 Too Many Requests response:

{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Number of request tokens has exceeded your per-minute rate limit (https://docs.anthropic.com/en/api/rate-limits); see the response headers for current usage. Please reduce the prompt length or the number of messages, and try again. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."
  }
}

The response also includes a retry-after header indicating how many seconds to wait before retrying.
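
If you call the HTTP API directly rather than through an SDK, the 429 handling described above can be sketched as follows. This is a minimal illustration, assuming the status code, a lowercase-keyed header dict, and the raw body text are already available; the function name and the `default_delay` fallback are assumptions.

```python
import json

def retry_delay_from_429(status_code, headers, body_text, default_delay=1.0):
    """Return seconds to wait if this is a rate limit error, else None."""
    if status_code != 429:
        return None
    try:
        error_type = json.loads(body_text)["error"]["type"]
    except (ValueError, KeyError, TypeError):
        error_type = None
    if error_type != "rate_limit_error":
        return None
    try:
        # retry-after carries the number of seconds to wait
        return float(headers["retry-after"])
    except (KeyError, ValueError):
        return default_delay

body = '{"type": "error", "error": {"type": "rate_limit_error", "message": "..."}}'
print(retry_delay_from_429(429, {"retry-after": "12"}, body))  # 12.0
print(retry_delay_from_429(200, {}, "{}"))                     # None
```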

Retry Strategies

Basic Exponential Backoff

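The simplest approach is to retry with exponentially increasing delays. Note that the official Python SDK already retries 429 responses twice by default (configurable through the client's max_retries option), so a manual loop like this one runs on top of those built-in retries: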

import time
import anthropic

client = anthropic.Anthropic()

def call_claude_with_retry(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                messages=messages,
            )
            return response
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            wait_time = (2 ** attempt) + 0.5  # 1.5s, 2.5s, 4.5s, 8.5s (the fifth failure is re-raised)
            print(f"Rate limited. Retrying in {wait_time}s...")
            time.sleep(wait_time)
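
A common refinement to fixed exponential delays is to randomize them ("jitter"), so that many clients that were rate limited at the same moment do not all retry at the same moment too. The sketch below computes such a schedule without making any API calls; the function name, the `base` and `cap` defaults, and the choice of uniform ("full") jitter are assumptions for illustration.

```python
import random

def backoff_delays(max_retries, base=1.0, cap=30.0, rng=random):
    """Yield one randomized delay per retry: uniform in [0, min(cap, base * 2**attempt)]."""
    for attempt in range(max_retries):
        yield rng.uniform(0, min(cap, base * 2 ** attempt))

rng = random.Random(42)  # seeded only to make the demo repeatable
for attempt, delay in enumerate(backoff_delays(5, rng=rng)):
    ceiling = min(30.0, 2 ** attempt)
    print(f"retry {attempt + 1}: ceiling {ceiling:.0f}s, drew {delay:.2f}s")
```

Each drawn value would replace the fixed `wait_time` in a retry loop like the one above.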

Using the `retry-after` Header

A better approach reads the retry-after header from the 429 response:

import time
import anthropic

client = anthropic.Anthropic()

def call_claude_with_retry_after(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                messages=messages,
            )
            return response
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Use retry-after header if available, otherwise exponential backoff
            error_response = getattr(e, 'response', None)
            header_value = (
                error_response.headers.get('retry-after')
                if error_response is not None else None
            )
            try:
                wait_time = float(header_value)
            except (TypeError, ValueError):
                # Header missing or not a number of seconds
                wait_time = 2 ** attempt
            print(f"Rate limited. Waiting {wait_time}s before retry {attempt + 1}...")
            time.sleep(wait_time)

Token-Aware Request Queuing

For production systems handling many concurrent requests, implement a token-aware queue:

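import asyncio
import time
from dataclasses import dataclass
from datetime import datetime

import anthropic

@dataclass
class RateLimitState:
    requests_remaining: int = 1000
    tokens_remaining: int = 160000
    reset_time: float = 0.0  # Unix timestamp at which the token limit is replenished

class TokenAwareQueue:
    def __init__(self, client: anthropic.AsyncAnthropic):
        # Must be the async client so that create() can be awaited
        self.client = client
        self.state = RateLimitState()
        self.lock = asyncio.Lock()

    async def call(self, messages, estimated_tokens=500):
        # The lock is held for the whole call, so requests run one at a time
        # and the state read below always reflects the previous response.
        async with self.lock:
            # Wait if we are close to the limit
            if self.state.tokens_remaining < estimated_tokens:
                wait_time = max(0, self.state.reset_time - time.time())
                if wait_time > 0:
                    await asyncio.sleep(wait_time)

            # with_raw_response exposes the HTTP headers alongside the message
            raw_response = await self.client.messages.with_raw_response.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                messages=messages,
            )

            # Update state from response headers
            headers = raw_response.headers
            self.state.requests_remaining = int(
                headers.get('anthropic-ratelimit-requests-remaining', 0)
            )
            self.state.tokens_remaining = int(
                headers.get('anthropic-ratelimit-tokens-remaining', 0)
            )
            reset = headers.get('anthropic-ratelimit-tokens-reset')
            if reset:
                # RFC 3339 timestamp; replace "Z" for Python versions before 3.11
                self.state.reset_time = datetime.fromisoformat(
                    reset.replace('Z', '+00:00')
                ).timestamp()

            return raw_response.parse()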
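
A complementary approach is to throttle on the client side before a request is ever sent, so that you rarely reach the server's limit at all. The sketch below is a small token bucket that refills continuously; the class name, its methods, and the injected `clock` parameter are assumptions for illustration and are not part of the Anthropic SDK.

```python
import time

class TokenBucket:
    """Client-side token bucket: refills continuously up to `capacity`."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.available = float(capacity)
        self.updated_at = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.updated_at
        self.available = min(self.capacity, self.available + elapsed * self.refill_per_second)
        self.updated_at = now

    def try_consume(self, amount):
        """Take `amount` tokens if available; otherwise consume nothing and return False."""
        self._refill()
        if amount <= self.available:
            self.available -= amount
            return True
        return False

    def seconds_until(self, amount):
        """How long to wait before `amount` tokens will be available."""
        self._refill()
        deficit = max(0.0, amount - self.available)
        return deficit / self.refill_per_second

# A fake clock keeps the demo deterministic; real code would use the default.
fake_now = [0.0]
bucket = TokenBucket(capacity=160_000, refill_per_second=160_000 / 60, clock=lambda: fake_now[0])
print(bucket.try_consume(150_000))            # True
print(bucket.try_consume(20_000))             # False: only 10,000 tokens are left
print(round(bucket.seconds_until(20_000), 2)) # 3.75
```

A caller would check `try_consume(estimated_tokens)` before each request and sleep for `seconds_until(estimated_tokens)` when it returns False.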

Best Practices for Staying Under Rate Limits

1. Use the Right Model for the Job

Do not use Claude Opus for tasks that Claude Haiku can handle. Haiku has higher rate limits and is significantly cheaper:

| Task | Recommended Model |
| --- | --- |
| Simple classification | Haiku 3.5 |
| Summarization | Sonnet 4 |
| Code generation | Sonnet 4 |
| Complex reasoning | Opus 4 |
| Quick extraction | Haiku 3.5 |

2. Reduce Input Token Usage

  • Trim system prompts. Every request sends your system prompt. Cut unnecessary instructions.
  • Use conversation summaries. Instead of sending entire conversation histories, summarize older messages.
  • Limit context. Only include the context the model actually needs.
# Bad: Sending entire file content for a simple question
messages = [{"role": "user", "content": f"What language is this file? {entire_10000_line_file}"}]

# Good: Send only what's needed
messages = [{"role": "user", "content": f"What language is this file? First 20 lines:\n{first_20_lines}"}]
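
The "limit context" advice can also be applied to conversation history by keeping only the newest messages that fit a token budget. The sketch below is one way to do that; the helper names are hypothetical, and the 4-characters-per-token estimate is a rough assumption, not a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic (about 4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)

def trim_history(messages: list, max_tokens: int) -> list:
    """Keep the newest messages whose estimated total fits within max_tokens."""
    kept, total = [], 0
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if total + cost > max_tokens:
            break
        kept.append(message)
        total += cost
    return list(reversed(kept))

history = [
    {"role": "user", "content": "a" * 400},       # about 100 tokens
    {"role": "assistant", "content": "b" * 400},  # about 100 tokens
    {"role": "user", "content": "c" * 200},       # about 50 tokens
]
print([m["content"][0] for m in trim_history(history, max_tokens=160)])  # ['b', 'c']
```

A production version would also need to make sure the trimmed list still begins with a user message, and might replace the dropped messages with a short summary instead of discarding them.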

3. Batch Requests Strategically

If you need to process 100 items, do not fire 100 simultaneous requests. Instead, batch them with concurrency limits:

import asyncio

async def process_batch(items, max_concurrent=5):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_one(item):
        async with semaphore:
            # call_claude is a placeholder for your own async wrapper around the API call
            return await call_claude(item)

    results = await asyncio.gather(*[process_one(item) for item in items])
    return results
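
To see the concurrency cap in action without calling the API, the same pattern can be run against a stand-in coroutine. In this sketch `fake_call` is a hypothetical substitute for a real request, and the worker is passed in as a parameter so the example is self-contained.

```python
import asyncio

async def process_batch(items, worker, max_concurrent=5):
    """Run worker(item) for every item with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_one(item):
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(process_one(item) for item in items))

in_flight = 0
peak = 0

async def fake_call(item):
    """Hypothetical stand-in for a real API call; records peak concurrency."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)  # simulate network latency
    in_flight -= 1
    return item * 2

results = asyncio.run(process_batch(list(range(20)), fake_call))
print(results[:3], "peak concurrency:", peak)
```

Twenty items are processed, but the recorded peak never exceeds the `max_concurrent` value of 5, and `gather` returns results in input order.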

4. Use the Message Batches API

For non-time-sensitive workloads, Anthropic's Message Batches API lets you submit up to 100,000 requests (or 256 MB, whichever is reached first) in a single batch. Batch requests have separate, much higher limits and are processed within 24 hours at a 50% discount.

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"request-{i}",
            "params": {
                "model": "claude-sonnet-4-20250514",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}],
            }
        }
        for i, prompt in enumerate(prompts)
    ]
)
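
When a workload is larger than a single batch allows, the request list has to be split before submission. A minimal sketch of that step, assuming a hypothetical helper `build_batches` and a caller-supplied `max_per_batch`; it only builds the request dicts and does not call the API.

```python
def build_batches(prompts, model, max_per_batch, max_tokens=1024):
    """Split prompts into lists of request dicts, each at most max_per_batch long."""
    requests = [
        {
            "custom_id": f"request-{i}",  # unique across all batches
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts)
    ]
    return [requests[i:i + max_per_batch] for i in range(0, len(requests), max_per_batch)]

prompts = [f"Summarize item {n}" for n in range(25)]
batches = build_batches(prompts, "claude-sonnet-4-20250514", max_per_batch=10)
print([len(b) for b in batches])  # [10, 10, 5]
```

Each inner list could then be passed as the `requests` argument of a separate batch creation call.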

5. Cache Repeated Requests

If multiple users ask similar questions, cache the responses:

import hashlib
import json

def get_cache_key(messages, model):
    content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()
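
The key function above is only half of a cache. The sketch below adds a small in-memory store with a time-to-live around it; the `ResponseCache` class, its `get_or_call` method, and the `fake_fetch` stand-in are assumptions for illustration, and a real deployment would more likely use a shared store such as Redis.

```python
import hashlib
import json
import time

def get_cache_key(messages, model):
    content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()

class ResponseCache:
    """In-memory cache with a time-to-live; a sketch, not production storage."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_call(self, messages, model, fetch):
        """Return a cached value, or call fetch(messages, model) and store the result."""
        key = get_cache_key(messages, model)
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]
        value = fetch(messages, model)
        self._entries[key] = (self.clock() + self.ttl_seconds, value)
        return value

calls = []

def fake_fetch(messages, model):
    """Hypothetical stand-in for the real API call."""
    calls.append(model)
    return f"answer #{len(calls)}"

cache = ResponseCache(ttl_seconds=60)
question = [{"role": "user", "content": "What is a rate limit?"}]
print(cache.get_or_call(question, "claude-sonnet-4-20250514", fake_fetch))  # answer #1
print(cache.get_or_call(question, "claude-sonnet-4-20250514", fake_fetch))  # answer #1
print(len(calls))  # 1
```

Because the key is a hash of the exact messages, this only helps with identical requests; matching merely similar questions would need normalization before hashing.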

6. Use Prompt Caching

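Anthropic supports prompt caching for system prompts and long context. On most current models, tokens read from the cache do not count toward your input token rate limit, and cache reads are billed at roughly 10% of the normal input price (the first request, which writes the cache, costs somewhat more than an uncached request):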

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Your very long system prompt here...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Your question"}]
)
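
To confirm that caching is actually taking effect, you can inspect the `usage` block of each response, which reports cache writes and cache reads separately. The sketch below assumes the usage data has been converted to a plain dict; the helper name and the sample numbers are illustrative.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of all input tokens that were read from the prompt cache."""
    cached = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = cached + written + uncached
    return cached / total if total else 0.0

# Made-up usage values: the first call writes the cache, the second reads it.
first_call = {"input_tokens": 20, "cache_creation_input_tokens": 4000, "cache_read_input_tokens": 0}
second_call = {"input_tokens": 20, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 4000}

print(cache_hit_ratio(first_call))             # 0.0
print(round(cache_hit_ratio(second_call), 3))  # 0.995
```

A ratio that stays at zero across repeated requests usually means the cached prefix is changing between calls or is too short to be cached.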

Monitoring Rate Limit Usage

For production systems, log your rate limit headers and set up alerts:

  • Alert at 80% usage to give yourself time to react
  • Track patterns to identify peak hours
  • Monitor by model since each model has independent limits
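
The 80% alert can be computed directly from the limit and remaining headers. A minimal sketch, assuming the headers are available as a dict of strings; the function names and the warning format are hypothetical, and the strings would normally go to your logging or alerting system rather than stdout.

```python
def usage_fraction(headers: dict, kind: str) -> float:
    """Fraction of the limit already used, for kind in {'requests', 'tokens'}."""
    limit = int(headers[f"anthropic-ratelimit-{kind}-limit"])
    remaining = int(headers[f"anthropic-ratelimit-{kind}-remaining"])
    return (limit - remaining) / limit

def check_alerts(headers: dict, threshold: float = 0.8) -> list:
    """Return a warning string for each dimension at or above the threshold."""
    warnings = []
    for kind in ("requests", "tokens"):
        used = usage_fraction(headers, kind)
        if used >= threshold:
            warnings.append(f"{kind} usage at {used:.0%} of limit")
    return warnings

headers = {
    "anthropic-ratelimit-requests-limit": "1000",
    "anthropic-ratelimit-requests-remaining": "998",
    "anthropic-ratelimit-tokens-limit": "160000",
    "anthropic-ratelimit-tokens-remaining": "24000",
}
print(check_alerts(headers))  # ['tokens usage at 85% of limit']
```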

When to Request a Rate Limit Increase

If you consistently hit limits despite optimization, contact Anthropic sales for a custom plan. Be prepared with:

  • Your current usage patterns (RPM, TPM)
  • Expected growth over the next 3-6 months
  • Your use case description

Building AI Applications at Scale

Rate limits are one piece of the puzzle when building production AI applications. If your project involves media generation (images, video, audio, avatars) alongside text generation, consider using a unified API platform like Hypereal AI that handles rate limiting, queuing, and retries across multiple AI models, so you can focus on your application logic instead of infrastructure.

Summary

Managing Claude API rate limits comes down to three principles: know your limits (check headers), use tokens efficiently (right model, minimal context), and handle 429 errors gracefully (exponential backoff with retry-after). Implement these strategies and your application will stay reliable even under heavy load.
