Parsing all the links of the internet for a backlink checker

@Berryessa370

Posted in: #Hyperlink #Seo

For a backlink checker service to be accurate, it would have to parse every site on the internet, which must amount to some zettabytes of data, to say nothing of the processing time all that parsing would need (and of the spider traps and other strange things it could run into along the way). It also puzzles me where it would find all the sites of the internet before parsing them.

So, my questions are: how does a backlink service work? Is something like that achievable if you are not Google, or are these services just not very accurate? Is there an easier workflow, or is it really such a difficult task that it can only be done as I described above?

I was thinking of zmap.io, which can at least probe all IPv4 addresses in less than an hour, and after that it may be possible to get the domains. But again, it seems like a really difficult problem that needs a lot of resources. Can you think of any easier way, or do they just do it as described above?


1 Comment


@Jamie184

I checked out the zmap.io link. This is something you want to avoid.

I work in the Internet security research realm, and using this tool will get you blacklisted and blocked from whole networks very quickly. You will likely be blocked by Google too if it goes too far!

Backlink scanners simply do what other search engines do. They spider the web, but focus on backlinks and catalog them into a database. From there, you can write some code to analyze the data however you like. It is a really simple process; however, like any search engine, you need seed data. Generally, you would use the various domain name lists you can find on the web. You can also use directory sites and so forth.
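
To make that concrete, here is a minimal sketch of that spider-and-catalog loop in Python. The seed list, the page limit, and the in-memory storage are placeholders for illustration only, not how any real service is built.

```python
# Minimal sketch of a seeded backlink spider (illustration only): fetch pages
# starting from a seed list, extract outbound links, and record
# (source_page, target_url) pairs for later analysis.
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

SEED_URLS = ["https://example.com/"]   # hypothetical seed list
MAX_PAGES = 50                         # keep the sketch small and polite


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a single page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds):
    backlinks = []            # list of (source_page, target_url)
    queue = list(seeds)
    seen = set()
    while queue and len(seen) < MAX_PAGES:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue          # skip unreachable or non-text pages
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)
            backlinks.append((url, target))
            if urlparse(target).scheme in ("http", "https"):
                # A real crawler would add politeness delays, robots.txt
                # checks, and per-host rules before following the link.
                queue.append(target)
    return backlinks


if __name__ == "__main__":
    for source, target in crawl(SEED_URLS):
        print(source, "->", target)
```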

However, I warn you this is also a way of getting blocked rather quickly since webmasters see a TON of garbage scrapers/scanners that profit off of their bandwidth and hard work.

While I cannot tell you how to develop this kind of site, you can likely use an open source search engine spider such as Nutch, which is now part of the Apache project. There may be specific tools for this as well. You will likely need to know how to write code in order to create filters and regular expressions, so be prepared. I used to do some of this, and it can be a real learning curve if you are a newbie.
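
To give a feel for the kind of filters and regular expressions involved, here is a small Python sketch. The patterns are made-up examples for this answer, not Nutch's shipped configuration.

```python
# Illustrative URL filters (the patterns are assumptions for the sketch):
# keep only http(s) URLs and drop non-HTML assets, session-id loops, and
# obviously trap-like repeated path segments before fetching.
import re

EXCLUDE_PATTERNS = [
    re.compile(r"\.(?:jpg|jpeg|png|gif|css|js|pdf|zip|exe)$", re.I),  # non-HTML assets
    re.compile(r"[?&](?:sessionid|sid|phpsessid)=", re.I),            # session-id loops
    re.compile(r"(?:/calendar){3,}", re.I),                           # endlessly repeated segment (trap)
]
INCLUDE_PATTERN = re.compile(r"^https?://", re.I)                     # only http(s) URLs


def should_fetch(url: str) -> bool:
    """Return True if the URL passes the include rule and no exclude rule."""
    if not INCLUDE_PATTERN.match(url):
        return False
    return not any(p.search(url) for p in EXCLUDE_PATTERNS)


if __name__ == "__main__":
    for u in ["https://example.com/page.html",
              "https://example.com/logo.png",
              "https://example.com/?PHPSESSID=abc123"]:
        print(u, should_fetch(u))
```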

Even if you do not use an existing tool and instead write the code yourself (and it really is not that hard anymore), you will have to plan things out in extreme detail. A fairly significant amount of code and a sizable database would be required.

The biggest problem you have is resources. It will take more time than you can imagine to begin spidering the web and cataloging the results, even if they are just links. Most of the existing sites are so far behind and miss links left and right. While at one point I had over 8000 confirmed links to my site, most services reported only about 100 or fewer. It is a huge uphill climb. You will need a lot of bandwidth, a huge database, and a fair amount of CPU to handle several processes that spider, index, maintain the database, serve your front-end website, and so forth.
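
For a sense of what the database side looks like, here is a rough SQLite sketch; the table and column names are assumptions for the example, not anyone's production schema.

```python
# Rough sketch of a link store: one row per (source_page, target_page) pair,
# deduplicated by the primary key, with a helper to count distinct
# referring pages for a target host.
import sqlite3

conn = sqlite3.connect("backlinks.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS backlinks (
        source_page TEXT NOT NULL,                  -- page where the link was found
        target_page TEXT NOT NULL,                  -- page the link points to
        target_host TEXT NOT NULL,                  -- host of the target, for fast lookups
        first_seen  TEXT DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (source_page, target_page)
    )
""")


def record_link(source_page, target_page, target_host):
    """Insert a link once; repeat sightings of the same pair are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO backlinks (source_page, target_page, target_host) "
        "VALUES (?, ?, ?)",
        (source_page, target_page, target_host),
    )
    conn.commit()


def count_backlinks(target_host):
    """Count distinct pages that link to anything on this host."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT source_page) FROM backlinks WHERE target_host = ?",
        (target_host,),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    record_link("https://blog.example.org/post", "https://example.com/", "example.com")
    print("confirmed backlinks:", count_backlinks("example.com"))
```

Deduplicating on the (source_page, target_page) pair keeps repeat crawls from inflating the counts.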

Then there is the monetizing of the data. If you are not an SEO expert, I cannot see what this data is good for. You will have to sell a service of some sort. Backlinks may not be enough. And then there is the existing competition that has been around for a long time. You will need to ask yourself, what is my unique value proposition and can I develop it and sell it? This is a really tough business to be in. Just be prepared for that.
