
Question: Crawling data or using an API?


How do these sites gather all their data: questionhub, bigresource, thedevsea, developerbay?

Is it legal to show data in a frame, as bigresource does?

asked Sep 13, 2013 in Crawl by rajesh
edited Sep 12, 2013
0 votes
28 views




1 Answer

0 votes

EDITED: fixed some spelling issues 20110310

How do these sites gather all their data: questionhub, bigresource ...

Here's a very general sketch of what is probably happening in the background at a website like questionhub.com (a toy code sketch follows the list):

  1. Spider program (google "spider program" to learn more)

    a. Configured to start reading web pages at stackoverflow.com (for example).

    b. Run the program so it goes to the home page of stackoverflow.com and starts visiting all the links it finds on those pages.

    c. Returns the HTML data from all of those pages.

  2. Search index program

    Reads the HTML data returned by the spider and creates a search index, storing the words that it found AND the URL each word was found at.

  3. User-interface web page

    Provides a feature-rich user interface so you can search the sites that have been spidered.
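
To make steps 1 and 2 concrete, here is a toy Python sketch of a spider feeding an inverted index. Everything in it (the class, function names, and the example URL) is made up for illustration; a real crawler would also need politeness delays, deduplication beyond a simple seen-set, and the robots.txt handling discussed further down.

    # Toy spider (step 1) + inverted index (step 2), standard library only.
    from collections import defaultdict
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen


    class LinkAndTextParser(HTMLParser):
        """Collects href links and visible text from one HTML page."""

        def __init__(self):
            super().__init__()
            self.links = []
            self.words = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

        def handle_data(self, data):
            self.words.extend(data.lower().split())


    def crawl(start_url, max_pages=10):
        """Visit pages starting at start_url, return word -> set of URLs."""
        index = defaultdict(set)
        to_visit, seen = [start_url], set()
        while to_visit and len(seen) < max_pages:
            url = to_visit.pop()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue  # skip pages that fail to load
            parser = LinkAndTextParser()
            parser.feed(html)
            for word in parser.words:
                index[word].add(url)  # the inverted index of step 2
            to_visit.extend(urljoin(url, link) for link in parser.links)
        return index


    # Step 3 then boils down to looking words up in the index:
    # index = crawl("https://example.com")
    # print(index.get("crawler", "no matches"))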

Is it legal to show data in a frame, as bigresource does?

To be technical, "it all depends" ;-)

Normally, websites want to be visible in Google, so why not in other search engines too?

Just as Google displays part of the text that was found when a site was spidered, questionhub.com (or others) has chosen to show more of the text found on the original page, possibly keeping the formatting that was in the original HTML OR changing the formatting to fit their standard visual styling.

A remote site can 'request' that spiders NOT go through some or all of its web pages by adding rules to a well-known file called robots.txt. Spiders do not have to honor robots.txt, but a vigilant website will track the IP addresses of spiders that do not honor its robots.txt file and then block those addresses from looking at anything on the website. You can find plenty of information about robots.txt here on Stack Overflow OR by running a query on Google.
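
As a small illustration, Python's standard library already ships a robots.txt parser, so a well-behaved spider can check before every fetch. The URL and user-agent below are made-up examples:

    # Checking robots.txt with the standard library. A typical rule in
    # the remote file looks like:
    #   User-agent: *
    #   Disallow: /private/
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetches and parses the remote robots.txt

    if rp.can_fetch("MyToySpider", "https://example.com/questions/123"):
        print("allowed to crawl this page")
    else:
        print("the site asked spiders to stay away")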

There are several industries (besides Google) built around what you are asking. There are tags here on Stack Overflow for search-engine and search; read some of those questions/answers. Lucene/Solr are open-source search-engine components. There is a companion open-source spider, but the name eludes me right now. Good luck.
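
For a taste of what the Lucene/Solr stack gives you, here is a hedged sketch of querying a Solr instance over its standard /select HTTP API. The host, port, core name ("posts"), and field are assumptions for illustration, not anything from the question:

    # Querying a hypothetical local Solr core over its /select endpoint.
    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    params = urlencode({"q": "text:crawler", "rows": 5, "wt": "json"})
    url = f"http://localhost:8983/solr/posts/select?{params}"

    with urlopen(url, timeout=10) as resp:
        results = json.load(resp)

    for doc in results["response"]["docs"]:
        print(doc.get("id"))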

I hope this helps.

P.S. As you appear to be a new user, if you get an answer that helps you, please remember to mark it as accepted, or give it a + (or -) as a useful answer. This goes for your other posts here too ;-)

answered Sep 13, 2013 by rajesh
edited Sep 12, 2013

...