Validate and analyze robots.txt files for syntax errors, missing directives, and common misconfigurations. Paste your robots.txt content or enter a domain to fetch and validate it automatically.

Robots.txt is a plain-text file placed at the root of a website (example.com/robots.txt) that tells search engine crawlers which pages or sections of the site they are allowed or forbidden to crawl. It follows the Robots Exclusion Protocol (REP), first introduced in 1994 and formalized as an internet standard in RFC 9309 (2022).
Every major search engine respects robots.txt, including Google, Bing, Yahoo, and Yandex. However, robots.txt is a directive, not enforcement: well-behaved crawlers follow the rules, but malicious bots may ignore them. For actual access control, use server-level authentication or firewall rules. Understanding DNS and web server configuration is essential background for managing robots.txt effectively.
The robots.txt file supports a specific set of directives that control crawler behavior:
| Directive | Description | Example |
|---|---|---|
| User-agent | Specifies which crawler the rules apply to | User-agent: Googlebot |
| Disallow | Blocks crawling of a path or directory | Disallow: /admin/ |
| Allow | Explicitly allows crawling (overrides Disallow) | Allow: /admin/public/ |
| Sitemap | Points to XML sitemap location | Sitemap: https://example.com/sitemap.xml |
| Crawl-delay | Seconds between requests (Bing, Yandex) | Crawl-delay: 10 |
| # (Comment) | Human-readable notes (ignored by crawlers) | # This blocks the admin area |
Pro Tip: Google does not support the Crawl-delay directive. To control Google's crawl rate, use Google Search Console's crawl rate settings instead. For Bing and Yandex, Crawl-delay sets the minimum time between requests: a value of 10 means the crawler waits 10 seconds between requests. Set this carefully, since high values can severely limit indexing. Use our DNS Lookup tool to verify your site's DNS is configured correctly for search engine access.
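If you want to confirm how a Crawl-delay value will be read, Python's standard urllib.robotparser module can parse a policy and report the declared delay per user agent. The snippet below is a minimal sketch using a made-up policy string, not a real site's file:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical policy declaring a 10-second delay for Bingbot only.
policy = """\
User-agent: bingbot
Crawl-delay: 10
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(policy.splitlines())

# crawl_delay() returns the declared delay, or None when absent (Python 3.6+).
print(rp.crawl_delay("bingbot"))    # 10
print(rp.crawl_delay("Googlebot"))  # None; Google ignores Crawl-delay anyway
```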
Proper robots.txt syntax follows strict formatting rules: each directive goes on its own line, and Google and Bing additionally support * for pattern matching and $ for end-of-URL matching.

Here are robots.txt configurations for common scenarios:
Allow all crawlers (the default) and point them at your sitemap:

```
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
```

Block all crawlers from the entire site:

```
User-agent: *
Disallow: /
```

Block specific directories, with one public exception:

```
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Disallow: /tmp/
Allow: /api/public/
Sitemap: https://example.com/sitemap.xml
```
Block AI training bots while allowing everything else:

```
# Allow all crawlers by default
User-agent: *
Disallow:
# Block specific AI training bots
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
Sitemap: https://example.com/sitemap.xml
```
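Before uploading a policy like the one above, you can sanity-check its crawl rules with Python's urllib.robotparser. The sketch below parses the draft from a string and confirms which bots are blocked; the page path is just an example:

```python
from urllib.robotparser import RobotFileParser

# The draft policy from above, held in a string for testing.
draft = """\
User-agent: *
Disallow:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(draft.splitlines())

# Expect: GPTBot and CCBot blocked, regular search crawlers allowed.
for bot in ("GPTBot", "CCBot", "Googlebot", "bingbot"):
    allowed = rp.can_fetch(bot, "https://example.com/blog/post-1")
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```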
Note that robots.txt controls crawling, not indexing: to keep a page out of search results entirely, use the noindex meta tag or the X-Robots-Tag HTTP header. For access control, configure authentication on your web server or network firewall. Learn how DNS resolution connects crawlers to your server.
Google and Bing support wildcard patterns in robots.txt for more flexible rules:
| Pattern | Matches | Example Use |
|---|---|---|
| * | Any sequence of characters | Disallow: /*.pdf$ — blocks all PDFs |
| $ | End of URL | Disallow: /page$ — blocks /page but not /page/sub |
| /dir/ | Directory and all contents | Disallow: /images/ — blocks everything under /images/ |
| /file | Path prefix match | Disallow: /search — blocks /search, /search?q=test, /searching |
```
# Block all PDF files
User-agent: *
Disallow: /*.pdf$
# Block query string URLs
Disallow: /*?*
# Block specific file types
Disallow: /*.json$
Disallow: /*.xml$
Allow: /sitemap.xml
```
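Note that Python's urllib.robotparser treats * and $ literally rather than as wildcards, so it is not a reliable way to test these extended patterns. As a rough illustration of how Google/Bing-style matching works, the hypothetical helper below (rule_matches is not a real library function) converts a path rule into a regular expression:

```python
import re

def rule_matches(rule: str, path: str) -> bool:
    """Approximate Google/Bing-style matching of a robots.txt path rule.

    '*' matches any sequence of characters and '$' anchors the end of
    the URL; otherwise the rule behaves as a simple prefix match.
    """
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"  # anchor at end of URL
    return re.match(pattern, path) is not None

print(rule_matches("/*.pdf$", "/files/report.pdf"))  # True  (rule matches, URL blocked)
print(rule_matches("/page$", "/page/sub"))           # False (rule matches /page only)
print(rule_matches("/search", "/searching"))         # True  (prefix match)
```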
Always test your robots.txt before deployment:
```bash
# Fetch and display your robots.txt:
curl https://example.com/robots.txt

# Check if a specific path is blocked:
# (Manual check — read the file and trace the rules)
curl -s https://example.com/robots.txt | grep -i "disallow"
```
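If tracing the rules by hand gets tedious, a short script using Python's standard urllib.robotparser can do the check for you. The sketch below (check_robots.py is just a hypothetical name) fetches the live file and reports whether a URL is crawlable; keep in mind that urllib.robotparser does not implement the * and $ wildcard extensions, so its verdict can differ from Google's for wildcard rules:

```python
#!/usr/bin/env python3
"""Check whether a URL is blocked by a site's robots.txt.

Usage: python check_robots.py <url> [user-agent]
"""
import sys
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

url = sys.argv[1]
agent = sys.argv[2] if len(sys.argv) > 2 else "Googlebot"

# robots.txt always lives at the root of the URL's scheme + host.
parts = urlsplit(url)
robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"

rp = RobotFileParser()
rp.set_url(robots_url)
rp.read()  # fetches and parses the live file

verdict = "allowed" if rp.can_fetch(agent, url) else "blocked"
print(f"{agent} is {verdict} for {url} (per {robots_url})")
```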
After updating robots.txt, verify your DNS records are correct and the file is accessible. If you're using a CDN like Cloudflare, make sure the robots.txt isn't being cached with old content. Check your website's IP to confirm the server is reachable.
Proper robots.txt configuration directly affects how search engines crawl and index your site.
Robots.txt works alongside other SEO mechanisms like meta robots tags, canonical URLs, and redirect chains. For comprehensive site management, also verify your DNS configuration, MX records for email delivery, and port availability.
| Mistake | Impact | Fix |
|---|---|---|
| Blocking CSS/JS files | Google can't render pages properly | Allow crawling of CSS and JS resources |
| Deploying a staging Disallow: / to the live site | Production site gets deindexed | Remove Disallow: / before launch |
| Blocking sitemap access | Contradicts Sitemap directive | Ensure sitemap URL is not blocked by Disallow rules |
| Wrong file location | Crawlers won't find the file | Must be at root: /robots.txt |
| Missing User-agent | Rules have no target | Always start with User-agent directive |
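One way to guard against the first and third mistakes above is to re-check a handful of critical URLs after every robots.txt change. The sketch below again uses urllib.robotparser; the asset paths are hypothetical and should be replaced with your site's real CSS/JS locations:

```python
from urllib.robotparser import RobotFileParser

SITE = "https://example.com"

# Paths Googlebot should always be able to fetch; adjust to your own site.
CRITICAL_PATHS = [
    "/",                 # homepage
    "/assets/main.css",  # hypothetical CSS bundle
    "/assets/app.js",    # hypothetical JS bundle
    "/sitemap.xml",      # must not contradict the Sitemap directive
]

rp = RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()

for path in CRITICAL_PATHS:
    if not rp.can_fetch("Googlebot", SITE + path):
        print(f"WARNING: {path} is blocked for Googlebot")
```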
If you are not sure which crawler to target, use User-agent: * to apply rules to all crawlers.

Robots.txt does not entirely remove pages from search results. It prevents crawling, but blocked URLs can still appear in results if other pages link to them (shown as URL-only results without snippets). To fully remove pages, use the noindex meta tag or X-Robots-Tag header instead.
Robots.txt must be at the root of your domain: https://example.com/robots.txt. It cannot be in a subdirectory. Each subdomain needs its own robots.txt file (e.g., blog.example.com/robots.txt is separate from example.com/robots.txt).
All major search engines (Google, Bing, Yahoo, Yandex, Baidu) respect robots.txt. However, malicious bots, scrapers, and some AI crawlers may ignore it. Robots.txt is an honor system — for actual access control, use server authentication or firewall rules.
Yes. You can block image directories with Disallow: /images/ to prevent image crawling. However, Google may still index the image URL if it's linked from other crawlable pages. Use the noindex X-Robots-Tag for definitive removal.
Blocking /wp-admin/ is common but not strictly necessary since WordPress login pages typically have noindex meta tags. If you do block it, make sure to allow /wp-admin/admin-ajax.php since many themes and plugins need it for proper rendering.
Google typically rechecks robots.txt within 24-48 hours. You can request a recrawl through Google Search Console to speed this up. The file is cached by crawlers, so changes aren't instant. Verify your DNS is configured correctly so crawlers can reach the updated file.
If no robots.txt file exists (returns 404), search engines assume they can crawl everything on the site. A missing robots.txt is not an error — it simply means no restrictions are in place. However, adding a Sitemap directive in robots.txt helps search engines discover content more efficiently.
About Tommy N.
Tommy is the founder of RouterHax and a network engineer with 10+ years of experience in home and enterprise networking. He specializes in router configuration, WiFi optimization, and network security. When not writing guides, he's testing the latest mesh WiFi systems and helping readers troubleshoot their home networks.