thirty bees forum

Major issue: blackhole being crawled by Bing and DuckDuckGo even though "Disallow: /blackhole/" is in robots.txt; losing lots of customers because of it.


Recommended Posts

Posted

I just found out that when I search for my business on Bing and DuckDuckGo, the page https://www.mysite.com/blackhole/ comes up in the search results.

If a visitor clicks on this link, they are served the blackhole page and blocked from my site.

Google, however, doesn't crawl this page.

How can I fix this?

Posted (edited)
  On 10/1/2024 at 2:14 PM, papagino said:

I just found out that when I search for my business on Bing and DuckDuckGo, the page https://www.mysite.com/blackhole/ comes up in the search results.

If a visitor clicks on this link, they are served the blackhole page and blocked from my site.

Google, however, doesn't crawl this page.

How can I fix this?


@datakick I see the same issue with DuckDuckGo, but not with Bing.

Edited by x97wehner
Posted

You should ask DuckDuckGo this question, not me. If robots.txt explicitly blocks the URL, there shouldn't be any reason for them to index it.

Posted
  On 10/1/2024 at 4:02 PM, wakabayashi said:

What does your robots.txt look like?

And please really check the file on the server, not what any tool is saying...


The file robots.txt on the server does have Disallow: /blackhole/ at the bottom...
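If you want to sanity-check that the rule actually blocks the trap path, Python's standard-library robots.txt parser can evaluate it locally (note: it handles plain prefix rules like this one, but not * wildcards). The URLs below are stand-ins for the real site:

```python
from urllib.robotparser import RobotFileParser

# Parse the same rules the server's robots.txt contains
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /blackhole/",
])

# The trap path should be blocked for any crawler...
print(rp.can_fetch("Bingbot", "https://www.mysite.com/blackhole/"))  # False

# ...while normal pages stay crawlable
print(rp.can_fetch("Bingbot", "https://www.mysite.com/contact"))     # True
```

Keep in mind that Disallow only tells crawlers not to *fetch* the page; a URL discovered via links can still appear in search results, which may be what happened here.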

Posted
  On 10/1/2024 at 4:20 PM, papagino said:

The file robots.txt on the server does have Disallow: /blackhole/ at the bottom...


Did you add that entry to robots.txt from the very beginning - at the same time you installed the blackholebots module? If you added it later, then Bing might have already indexed the page.

Posted
  On 10/1/2024 at 4:29 PM, datakick said:

Did you add that entry to robots.txt from the very beginning - at the same time you installed the blackholebots module? If you added it later, then Bing might have already indexed the page.


Yes I did... and that was a very long time ago, a year maybe...

 

Posted

I don't have this issue with my domains. They even show up on the first page when I search for the addon shops (something Google does not do, for reasons)... Shame that nobody uses those in Europe... 🙂


They say that robots.txt is respected, but who knows...

Posted

I've released a new version of the module that allows you to change the trap URL.

If Bing is already indexing your trap URL for any reason, you can change it from https://domain.com/blackhole to something new like https://domain.com/my-honey-trap (and change robots.txt accordingly).

This way, when Bing sends traffic to the /blackhole address on your website, visitors will not be blocked. To prevent a 404, I suggest you also add a redirect from /blackhole to the homepage in your .htaccess file.
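A minimal sketch of that redirect for an Apache .htaccess file, assuming mod_alias is available and the old trap lived at /blackhole:

```apache
# Send any hit on the retired trap URL back to the homepage
# (302 while testing; switch to 301 once the new setup is confirmed working)
RedirectMatch 302 ^/blackhole/?$ /
```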

Hopefully, Bing will not add the new trap URL to the index again.

I've added some extra precautions to prevent this as well -- if a known good bot (Google, Bing, etc.) somehow makes it to your trap URL (even though robots.txt blocks it), the content of the trap page will be mostly empty, and the page headers will contain <meta name="robots" content="noindex">, which instructs the bot not to index the page.
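To verify that behaviour on your own store, you can fetch the trap page and check for the robots meta tag. A small sketch using Python's standard-library HTML parser, with a hypothetical response body inlined in place of a real HTTP fetch:

```python
from html.parser import HTMLParser

class NoindexCheck(HTMLParser):
    """Collects the content of any <meta name="robots"> tags on a page."""
    def __init__(self):
        super().__init__()
        self.robots_content = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.robots_content.append(a.get("content", ""))

# Hypothetical body returned by the new trap URL:
html = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
checker = NoindexCheck()
checker.feed(html)
print("noindex" in " ".join(checker.robots_content))  # True
```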

Posted (edited)

Great, hope this works with the new version of the module.

Should there be 2 lines in robots.txt:

Disallow: /blackholenew/
Disallow: /modules/blackholebots/blackholenew/

Edited by DRMasterChief
Posted
  On 10/2/2024 at 8:28 AM, DRMasterChief said:

Great, hope this works with the new version of the module.

Should it be like this in robots.txt:

Disallow: /blackholenew/
Disallow: /modules/blackholebots/blackholenew/


The other disallow directive is for stores without friendly URLs enabled - the blackhole name doesn't change there.

The robots.txt should look like this:

User-agent: *
Disallow: */blackholenew/
Disallow: /modules/blackholebots/blackhole/*

Note the * before /blackholenew/ -- it's to block language variants as well
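For reference, wildcard-aware crawlers such as Googlebot and Bingbot treat * in a rule as "any sequence of characters", matched against the URL path from its start. A rough Python sketch of that matching, using a hypothetical /en/ language prefix:

```python
import re

def rule_matches(rule: str, path: str) -> bool:
    # Convert a robots.txt rule with '*' wildcards into a regex and
    # match it against the start of the path, the way wildcard-aware
    # crawlers (Google, Bing) interpret Disallow rules.
    pattern = re.escape(rule).replace(r"\*", ".*")
    return re.match(pattern, path) is not None

# With the leading *, language-prefixed variants are caught:
print(rule_matches("*/blackholenew/", "/en/blackholenew/"))  # True
# Without it, only the bare path matches:
print(rule_matches("/blackholenew/", "/en/blackholenew/"))   # False
```

This is a simplification (real crawlers also support the $ end-anchor and longest-match precedence), but it shows why the leading * matters on a multi-language store.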

Posted
  On 10/2/2024 at 8:44 AM, datakick said:

The other disallow directive is for stores without friendly URLs enabled - the blackhole name doesn't change there.

The robots.txt should look like this:

User-agent: *
Disallow: */blackholenew/
Disallow: /modules/blackholebots/blackhole/*

Note the * before /blackholenew/ -- it's to block language variants as well


Thanks datakick for the updates. My site is bilingual, so maybe the missing "*" in robots.txt was the problem in my case...

I will try the new version and investigate further...

Cheers
