
Facebook crawler with no user agent spamming our site in possible DoS attack

@Alves908

Posted in: #Cdn #Cloudflare #Ddos #Facebook #WebCrawlers

Crawlers registered to Facebook (IPv6 ending in :face:b00c::1) were slamming our site; we saw tens of thousands of hits in just 20 minutes. We noticed they didn't have a user agent in the header and implemented a rule on Cloudflare to protect ourselves.

It appears they've patched the crawler and added the user agent 'facebookexternalhit/1.1', which is a recognised crawler. Now that they're circumventing the rule, I'm seeing 11,000 hits in 15 minutes, often multiple times to the same page! This is crippling our database and preventing customers from legitimately using the site.

We've implemented a broad block on all of Facebook's IPs to try to remedy this, but we've likely already lost business because of it.
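For anyone facing the same problem, a narrower Cloudflare firewall rule can target only the anonymous variant instead of blocking all of Facebook. The sketch below uses Cloudflare's firewall rules expression language; AS32934 is Facebook's ASN, and the choice of Block versus Challenge as the action is up to you:

```
(ip.geoip.asnum eq 32934 and http.user_agent eq "")
```

With action set to Block, this drops only Facebook-originated requests that send no user agent, while the legitimate facebookexternalhit/1.1 scraper continues to get through.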

My question is: has anyone seen this before? Any idea what's causing it? Is there a channel for getting a response from Facebook, or is there a legal route we should take?

Link to our tweet: twitter.com/TicketSource/status/969148062290599937. We tried the FB developers group and a Facebook rep, and were directed to Support. We filed a ticket; no response.

Log sample:

2018-03-01 09:00:33 10.0.1.175 GET /dylanthomas - 443 - facebookexternalhit/1.1 - 200 0 0 5394 2a03:2880:30:7fcf:face:b00c:0:8000
2018-03-01 09:00:33 10.0.1.175 GET /dylanthomas - 443 - facebookexternalhit/1.1 - 200 0 0 5362 2a03:2880:30:afd1:face:b00c:0:8000
2018-03-01 09:00:33 10.0.1.175 GET /dylanthomas - 443 - facebookexternalhit/1.1 - 200 0 0 5378 2a03:2880:30:7fcf:face:b00c:0:8000
2018-03-01 09:00:33 10.0.1.175 GET /dylanthomas - 443 - facebookexternalhit/1.1 - 200 0 0 5425 2a03:2880:30:2fea:face:b00c:0:8000
2018-03-01 09:00:33 10.0.1.175 GET /dylanthomas - 443 - facebookexternalhit/1.1 - 200 0 0 5394 2a03:2880:30:2fea:face:b00c:0:8000
2018-03-01 09:00:33 10.0.1.175 GET /dylanthomas - 443 - facebookexternalhit/1.1 - 200 0 0 5659 2a03:2880:30:2fd8:face:b00c:0:8000
2018-03-01 09:00:33 10.0.1.175 GET /dylanthomas - 443 - facebookexternalhit/1.1 - 200 0 0 5659 2a03:2880:11:dff3:face:b00c:0:8000
2018-03-01 09:00:36 10.0.1.175 GET /whitedreamspremiere - 443 - facebookexternalhit/1.1 - 200 0 0 5048 2a03:2880:2020:bffb:face:b00c:0:8000
2018-03-01 09:00:36 10.0.1.175 GET /helioscollective - 443 - facebookexternalhit/1.1 - 200 0 0 4633 2a03:2880:3020:1ffd:face:b00c:0:8000
2018-03-01 09:00:36 10.0.1.175 GET /helioscollective - 443 - facebookexternalhit/1.1 - 200 0 0 4727 2a03:2880:3011:afc5:face:b00c:0:8000
2018-03-01 09:00:36 10.0.1.175 GET /helioscollective - 443 - facebookexternalhit/1.1 - 200 0 0 4977 2a03:2880:3020:1ffd:face:b00c:0:8000
2018-03-01 09:00:36 10.0.1.175 GET /event/FDMEJD - 443 - facebookexternalhit/1.1 - 200 0 0 4868 2a03:2880:2111:1ff9:face:b00c:0:8000


Edit2: These IPs are genuinely crawling, as we've found hits on URLs from our payment process. So they followed a link and ended up on a session-only URL.

Edit3: Facebook have updated the bug report developers.facebook.com/bugs/1894024420610804


2 Comments


@Cugini213

developers.facebook.com/bugs/1894024420610804
Per the answer from Facebook, any site whose content is shared on Facebook should expect its crawlers to generate traffic of roughly 10-20x the number of shares.

This sounds like Facebook is scraping the content every single time it's accessed, with little to no caching in place.

In our case, while Facebook is probably good for advertising overall, this is an immense strain when you run a database-intensive page that gets shared. We'll have to rate limit the traffic on our end to prevent a denial of service: a resource-intensive answer to Facebook's overactive bot.
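The rate limiting described above can be sketched as a sliding-window counter keyed by user agent, applied before any database work. This is an illustrative sketch, not a production implementation; the limit, window, and choice of key are assumptions you would tune for your own traffic:

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds, per key."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # key -> timestamps of recent hits

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[key]
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False  # over the limit: serve a 429 instead of hitting the DB
        q.append(now)
        return True

limiter = SlidingWindowLimiter(limit=3, window=60)
ua = "facebookexternalhit/1.1"
results = [limiter.allow(ua, now=t) for t in (0, 1, 2, 3)]
# results == [True, True, True, False]: the fourth hit in the window is rejected
```

Keying by user agent throttles the scraper as a whole; keying by (user agent, path) instead would only suppress repeated fetches of the same page.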



 

@Steve110

Sources say that facebookexternalhit does not respect Crawl-delay in robots.txt because Facebook doesn't use a crawler, it uses a scraper.

Whenever one of your pages is shared on Facebook, it scrapes your site for your meta title, description and image.
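As a side note, you can at least confirm the traffic really is Facebook's by checking source addresses against Facebook's published ranges (retrievable via a whois query against its ASN, AS32934). A small sketch using Python's ipaddress module; the /29 prefix below is inferred from the log sample in the question and is an assumption, so verify it against the current published list:

```python
import ipaddress

# Illustrative Facebook prefix inferred from the question's log sample.
# Fetch the authoritative list yourself, e.g.:
#   whois -h whois.radb.net -- '-i origin AS32934'
FACEBOOK_NETS = [ipaddress.ip_network("2a03:2880::/29")]

def is_facebook_ip(addr):
    """Return True if addr falls inside a known Facebook network."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in FACEBOOK_NETS)

is_facebook_ip("2a03:2880:30:7fcf:face:b00c:0:8000")  # True (from the logs)
is_facebook_ip("2001:db8::1")                         # False (documentation range)
```

This distinguishes genuine Facebook scraper traffic from a third party spoofing the facebookexternalhit user agent string.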

If Facebook is scraping your site 11,000 times in 15 minutes, the most likely scenario is that someone has figured out how to abuse the Facebook scraper to DDoS your site.

Perhaps they are running a bot that is clicking on your share link over and over, and Facebook is scraping your page every time that it does.

Off the top of my head, the first thing I would try is to cache the pages that Facebook is scraping. You can set expiry headers in .htaccess. This will hopefully tell Facebook not to reload your page with every share until the cache expires.

Because of your issue, I would set the HTML expiry to longer than usual.

In .htaccess:

<IfModule mod_expires.c>
ExpiresActive On
ExpiresDefault "access plus 60 seconds"
ExpiresByType text/html "access plus 900 seconds"
</IfModule>


Setting HTML to expire after 900 seconds will hopefully prevent Facebook from fetching any individual page more than once per 15 minutes.
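One caveat: Expires headers only ask the client to cache, and the scraper may ignore them. A server-side cache can absorb repeat fetches regardless of what the client does. A sketch assuming Apache with mod_cache and mod_cache_disk enabled (these directives belong in the server config rather than .htaccess, and the paths and expiry are illustrative):

```
# Server config (not .htaccess); requires mod_cache and mod_cache_disk.
<IfModule mod_cache.c>
    CacheQuickHandler on
    <IfModule mod_cache_disk.c>
        CacheRoot /var/cache/apache2/mod_cache_disk
        CacheEnable disk /
        CacheDefaultExpire 900
    </IfModule>
</IfModule>
```

With this in place, repeated hits to the same URL within the expiry window are served from disk instead of regenerating the page from the database. Take care not to cache the session-only URLs mentioned in the question.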



Edit:
I ran a quick search and found a page written a few years ago that discusses the very issue you're encountering. The author discovered that websites could be flooded by the Facebook scraper through its share feature and reported it to Facebook, but they chose to do nothing about it. The article may clarify what is happening to you and point you in the right direction for dealing with the situation:
chr13.com/2014/04/20/using-facebook-notes-to-ddos-any-website/
