Robots.txt Validator

Validate and analyze robots.txt files for syntax errors, missing directives, and common misconfigurations. Paste your robots.txt content or enter a domain to fetch and validate it automatically.

Figure 1 — Robots.txt Validator

What Is Robots.txt?

Robots.txt is a plain-text file placed at the root of a website (example.com/robots.txt) that tells search engine crawlers which pages or sections of the site they are allowed or forbidden to crawl. It follows the Robots Exclusion Protocol (REP), first introduced in 1994 and formalized as an internet standard in RFC 9309 (2022).

Every major search engine respects robots.txt, including Google, Bing, Yahoo, and Yandex. However, robots.txt is a directive, not enforcement — well-behaved crawlers follow the rules, but malicious bots can simply ignore them. For actual access control, use server-level authentication or firewall rules. Understanding DNS and web server configuration is essential background for managing robots.txt effectively.

Robots.txt Directive Reference

The robots.txt file supports a specific set of directives that control crawler behavior:

| Directive | Description | Example |
|---|---|---|
| User-agent | Specifies which crawler the rules apply to | User-agent: Googlebot |
| Disallow | Blocks crawling of a path or directory | Disallow: /admin/ |
| Allow | Explicitly allows crawling (overrides Disallow) | Allow: /admin/public/ |
| Sitemap | Points to the XML sitemap location | Sitemap: https://example.com/sitemap.xml |
| Crawl-delay | Seconds between requests (Bing, Yandex) | Crawl-delay: 10 |
| # (Comment) | Human-readable notes (ignored by crawlers) | # This blocks the admin area |

Pro Tip: Google does not support the Crawl-delay directive; to manage Google's crawl rate, use Google Search Console instead. For Bing and Yandex, Crawl-delay sets the minimum time between requests: a value of 10 means the crawler waits 10 seconds between requests. Set this carefully, since high values can severely limit how quickly your site gets indexed. Use our DNS Lookup tool to verify your site's DNS is configured correctly for search engine access.
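
For example, a dedicated group like this slows Bing's crawler without affecting anyone else (the delay value is illustrative):

User-agent: Bingbot
Crawl-delay: 10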

Robots.txt Syntax Rules

Proper robots.txt syntax follows strict formatting rules (a minimal checker sketch follows the list):

  • One directive per line — Each line contains one directive and its value, separated by a colon.
  • User-agent must come first — Every group of rules must start with a User-agent directive.
  • Case-sensitive paths — URL paths in Disallow/Allow are case-sensitive (/Admin/ is different from /admin/).
  • Wildcard support — Use * for pattern matching and $ for end-of-URL matching (Google and Bing extension).
  • UTF-8 encoding — The file must be UTF-8 encoded.
  • Maximum size — Google processes at most 500 KiB of the file; content beyond that limit is ignored.
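
These rules are mechanical enough to check in code. Here is a minimal sketch, in Python, of the kind of checks this validator performs; the function name and messages are illustrative, and this is nowhere near a full RFC 9309 parser:

KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def check_robots_txt(content: str) -> list[str]:
    """Return human-readable syntax problems (a sketch, not exhaustive)."""
    problems = []
    seen_user_agent = False
    for lineno, raw in enumerate(content.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # comments are ignored by crawlers
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {lineno}: missing ':' between directive and value")
            continue
        directive, value = (part.strip() for part in line.split(":", 1))
        name = directive.lower()
        if name not in KNOWN_DIRECTIVES:
            problems.append(f"line {lineno}: unknown directive '{directive}'")
        elif name == "user-agent":
            seen_user_agent = True
        elif name in ("allow", "disallow") and not seen_user_agent:
            problems.append(f"line {lineno}: rule appears before any User-agent")
        elif name == "crawl-delay" and not value.replace(".", "", 1).isdigit():
            problems.append(f"line {lineno}: Crawl-delay value is not a number")
    return problems

print(check_robots_txt("Disallow: /tmp/\nUser-agent: *\nDisalow: /admin/"))
# ['line 1: rule appears before any User-agent', "line 3: unknown directive 'Disalow'"]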

Common Robots.txt Examples

Here are robots.txt configurations for common scenarios:

Allow All Crawlers (Default)

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

Block All Crawlers (Staging/Development)

User-agent: *
Disallow: /

Block Specific Sections

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Disallow: /tmp/
Allow: /api/public/

Sitemap: https://example.com/sitemap.xml

Block Specific Bots

# Allow all crawlers by default
User-agent: *
Disallow:

# Block specific AI training bots
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

Sitemap: https://example.com/sitemap.xml

Note: Robots.txt does not hide content from the internet. Blocked pages can still appear in search results (showing the URL without a snippet) if other pages link to them. To fully remove a page from search results, use the noindex meta tag or the X-Robots-Tag HTTP header (crawlers must be able to fetch the page to see a noindex directive, so don't also block that page in robots.txt). For access control, configure authentication on your web server or network firewall. Learn how DNS resolution connects crawlers to your server.
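
For reference, the noindex signal can be sent either as an HTTP response header or as an equivalent meta tag in the page's <head>:

X-Robots-Tag: noindex

<meta name="robots" content="noindex">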

Pattern Matching with Wildcards

Google and Bing support wildcard patterns in robots.txt for more flexible rules:

| Pattern | Matches | Example Use |
|---|---|---|
| * | Any sequence of characters | Disallow: /*.pdf$ — blocks all PDFs |
| $ | End of URL | Disallow: /page$ — blocks /page but not /page/sub |
| /dir/ | Directory and all contents | Disallow: /images/ — blocks everything under /images/ |
| /file | Path prefix match | Disallow: /search — blocks /search, /search?q=test, /searching |

# Block all PDF files
User-agent: *
Disallow: /*.pdf$

# Block query string URLs
Disallow: /*?*

# Block specific file types
Disallow: /*.json$
Disallow: /*.xml$
Allow: /sitemap.xml
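
Under the hood, these patterns map cleanly onto regular expressions. Here is a minimal Python sketch of that mapping (pattern_to_regex is a hypothetical helper, not part of any crawler's API):

import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern (with * and $) into an anchored regex."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile(body + ("$" if anchored else ""))

rule = pattern_to_regex("/*.pdf$")
print(bool(rule.match("/docs/report.pdf")))     # True: the rule applies
print(bool(rule.match("/docs/report.pdf?v=2"))) # False: $ requires end of URL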

Testing Robots.txt Rules

Always test your robots.txt before deployment:

  • Google Search Console — Check the robots.txt report under Settings to confirm which robots.txt file Google fetched and that it parsed without errors.
  • This validator — Paste your robots.txt content above to check for syntax errors and common issues.
  • Command line verification — Fetch your live robots.txt to confirm it's accessible and correctly formatted.

# Fetch and display your robots.txt:
curl https://example.com/robots.txt

# List the Disallow rules, then trace a path against them manually:
curl -s https://example.com/robots.txt | grep -i "disallow"
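
You can also check paths programmatically. Python's standard library includes a robots.txt parser (urllib.robotparser implements the basic exclusion rules, not the full Google/Bing wildcard extension):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the live file

# can_fetch(user_agent, url) returns True if crawling that URL is permitted
print(rp.can_fetch("Googlebot", "https://example.com/admin/page"))
print(rp.can_fetch("*", "https://example.com/blog/post"))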

After updating robots.txt, verify your DNS records are correct and the file is accessible. If you're using a CDN like Cloudflare, make sure the robots.txt isn't being cached with old content. Check your website's IP to confirm the server is reachable.

Robots.txt and SEO

Proper robots.txt configuration directly affects how search engines crawl and index your site:

  • Crawl budget optimization — Block low-value pages (search results, filters, duplicates) to focus crawl budget on important content.
  • Prevent duplicate content — Block parameter URLs, print pages, and filtered views that create duplicate content; combine this with canonical tags and redirects for comprehensive duplicate management (see the example after this list).
  • Protect sensitive paths — Block admin panels, login pages, and internal tools from appearing in search results.
  • Sitemap discovery — The Sitemap directive helps search engines find and crawl all important pages, especially on large sites.
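
As a sketch of the crawl-budget and duplicate-content points above, rules like these keep crawlers out of internal search results and filtered views (the paths are illustrative):

User-agent: *
Disallow: /search
Disallow: /*?sort=
Disallow: /print/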

Robots.txt works alongside other SEO mechanisms like meta robots tags, canonical URLs, and redirects. For comprehensive site management, also verify your DNS configuration, MX records for email delivery, and port availability.

Common Robots.txt Mistakes

| Mistake | Impact | Fix |
|---|---|---|
| Blocking CSS/JS files | Google can't render pages properly | Allow crawling of CSS and JS resources |
| Leaving a staging Disallow: / live | Production site gets deindexed | Remove Disallow: / before launch |
| Blocking sitemap access | Contradicts the Sitemap directive | Ensure the sitemap URL is not blocked by Disallow rules |
| Wrong file location | Crawlers won't find the file | Must be at the root: /robots.txt |
| Missing User-agent | Rules have no target | Always start each group with a User-agent directive |

Key Takeaways
  • Robots.txt controls search engine crawling but doesn't enforce access restrictions — it's a directive, not a security measure.
  • Every group of rules must start with a User-agent directive. Use * for all crawlers.
  • Include a Sitemap directive pointing to your XML sitemap for better discoverability.
  • Google ignores Crawl-delay — use Search Console for Google crawl rate control.
  • Test changes with this validator before deploying to avoid accidentally blocking important pages.
  • Don't block CSS or JavaScript files — Google needs them to render and index pages properly.

Video: Robots.txt Explained

Frequently Asked Questions

Does robots.txt block pages from Google search results?

Not entirely. Robots.txt prevents crawling, but blocked URLs can still appear in search results if other pages link to them (shown as URL-only results without snippets). To fully remove pages, use the noindex meta tag or X-Robots-Tag header instead.

Where should robots.txt be located?

Robots.txt must be at the root of your domain: https://example.com/robots.txt. It cannot be in a subdirectory. Each subdomain needs its own robots.txt file (e.g., blog.example.com/robots.txt is separate from example.com/robots.txt).

Do all search engines follow robots.txt?

All major search engines (Google, Bing, Yahoo, Yandex, Baidu) respect robots.txt. However, malicious bots, scrapers, and some AI crawlers may ignore it. Robots.txt is an honor system — for actual access control, use server authentication or firewall rules.

Can robots.txt block images from appearing in search?

Yes. You can block image directories with Disallow: /images/ to prevent image crawling. However, Google may still index the image URL if it's linked from other crawlable pages. Use the noindex X-Robots-Tag for definitive removal.

Should I block /wp-admin/ in robots.txt?

Blocking /wp-admin/ is common but not strictly necessary since WordPress login pages typically have noindex meta tags. If you do block it, make sure to allow /wp-admin/admin-ajax.php since many themes and plugins need it for proper rendering.
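
If you do block it, the exception looks like this (this mirrors WordPress's own default rules):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php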

How quickly do search engines pick up robots.txt changes?

Google typically rechecks robots.txt within 24-48 hours. You can request a recrawl through Google Search Console to speed this up. The file is cached by crawlers, so changes aren't instant. Verify your DNS is configured correctly so crawlers can reach the updated file.

What happens if robots.txt is missing?

If no robots.txt file exists (returns 404), search engines assume they can crawl everything on the site. A missing robots.txt is not an error — it simply means no restrictions are in place. However, adding a Sitemap directive in robots.txt helps search engines discover content more efficiently.

About Tommy N.

Tommy is the founder of RouterHax and a network engineer with 10+ years of experience in home and enterprise networking. He specializes in router configuration, WiFi optimization, and network security. When not writing guides, he's testing the latest mesh WiFi systems and helping readers troubleshoot their home networks.
