What Web Crawlers See

Web Crawlers are also called Automatic Indexers, Bots, Web Spiders or Web Robots.

How does a web crawler work?

Many people seem to be under the impression that a web crawler or web spider crawls along the Internet, installs itself on the web servers it finds and reads the contents of their hard disks to find what it wants. This is what a virus would do: establish itself on a machine and then execute various actions. Most web servers, ideally all, wouldn't allow such a thing to happen. It breaches all security concepts. So, how does a web crawler work?

Very simple: the executable code never leaves the machine on which it runs. It sends requests to web servers and is served web pages, or other resources, by those servers. If it finds links in a page, it sends requests for the pages to which these links point. That's it. The page served to the web crawler is normally the same as the one served to your browser. It doesn't matter whether the data on the page is static or loaded dynamically from a database on the server: the web crawler sees it.
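The request-and-follow loop described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not how any particular search engine's crawler is built. The names LinkCollector, fetch and crawl, and the max_pages limit, are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Send an ordinary HTTP request and return the response body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=10):
    """Follow links breadth-first from start_url and return {url: html}."""
    pages = {}          # pages successfully served to the crawler
    visited = set()     # every URL already requested, served or not
    queue = [start_url]
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = fetch_page(url)
        except Exception:
            continue    # the server did not serve this page: skip it
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop "#fragment" parts.
            queue.append(urldefrag(urljoin(url, href)).url)
    return pages
```

Note that the crawler only ever calls fetch_page: everything it learns about a site comes from what the server chooses to send back. A real crawler would also restrict itself to certain hosts, pause between requests and consult robots.txt.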

Does the web crawler see what I see in my browser?

The web crawler sees what the server serves it. The question is, can the web crawler make sense of everything served to it? Your browser executes JavaScript on your machine and renders the result for you. Can a web crawler do this? Only if it was programmed to do so, and the conventional wisdom is that anything generated by JavaScript is not visible to a web crawler. The same goes for any Flash content: your browser has a plug-in that renders the Flash content on your machine. Do web crawlers see Flash content? Some may, some may not. I'd play it safe and not rely on web crawlers to see any Flash content. It is generally accepted that web crawlers don't see images and videos or hear sound files. So, if you want the crawler to make some sense of your images, they must have an "alt" attribute with a text value briefly describing the image, like "ACME company logo".
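To check your own pages for images that a crawler cannot make sense of, something like the following sketch could be used. It relies only on Python's standard html.parser module. The names AltAudit and audit_images are made up for this example.

```python
from html.parser import HTMLParser


class AltAudit(HTMLParser):
    """Sorts the <img> tags of a page by whether they carry alt text."""

    def __init__(self):
        super().__init__()
        self.described = {}   # image src -> alt text
        self.missing = []     # src of images with no alt text, or an empty one

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src", "")
        alt = (attributes.get("alt") or "").strip()
        if alt:
            self.described[src] = alt
        else:
            self.missing.append(src)


def audit_images(html):
    """Return (described, missing) for the images found in html."""
    audit = AltAudit()
    audit.feed(html)
    return audit.described, audit.missing


page = '<img src="logo.png" alt="ACME company logo"><img src="banner.png">'
described, missing = audit_images(page)
print("Described:", described)
print("No alt text:", missing)
```

An empty alt attribute is reported as missing here. That is a simplification: purely decorative images are often given an empty alt on purpose.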

So what does the web crawler see?

It sees plain old text, whether delivered from a database or as the static contents of a web page. This includes the alt attributes of images and the metadata elements in the page head. To see a web page much as a crawler would see it, download the Lynx browser, which is a text-only browser, and use it to look at your website.
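If you don't have Lynx to hand, a rough approximation of the same text-only view can be sketched in Python. The example below assumes a crawler that does not execute JavaScript. It keeps page text, image alt text and named meta elements, and ignores the contents of script and style tags. The class name CrawlerView, the function crawler_text and the sample page are invented for illustration.

```python
from html.parser import HTMLParser


class CrawlerView(HTMLParser):
    """Keeps only the text a non-JavaScript crawler could read."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0   # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img" and attributes.get("alt"):
            self.parts.append(attributes["alt"])
        elif tag == "meta" and attributes.get("name") and attributes.get("content"):
            self.parts.append(attributes["name"] + ": " + attributes["content"])

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())   # collapse whitespace
        if text and self._skip_depth == 0:
            self.parts.append(text)


def crawler_text(html):
    """Return the list of text fragments found in html."""
    view = CrawlerView()
    view.feed(html)
    view.close()
    return view.parts


SAMPLE_PAGE = """<html><head><title>ACME</title>
<meta name="description" content="Widgets for everyone">
<script>document.write("Built by JavaScript");</script></head>
<body><h1>Welcome</h1><img src="logo.png" alt="ACME company logo">
<p>Prices loaded from the database: 10 EUR</p></body></html>"""

for fragment in crawler_text(SAMPLE_PAGE):
    print(fragment)
```

The text that the script would have written into the page never shows up in the output, because nothing here runs the script. The text that the server put into the page, including the line that stands in for database content, does.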

What do I take from this?

You want your site to feature in search engines, so you do some Search Engine Optimisation. The lesson here is that whatever you want the search engine to know about your pages must be present as text in the page as it is served to the web crawler. You don't have to be afraid that text loaded from a database won't be visible to the web crawler.

And finally...

Let us know if you have anything to add or any remarks. There is much I haven't said about web crawlers, but the idea was to let you know what they see and take into account and what they don't see. The form for sending mail is below.

If you find what you learned on this page useful, please use the social media widgets at the bottom and pin, tweet, plus-one or whatever this page.






LTXQBwYl

whWniZPCSITBJJ

Thanks Niyaz, I know these things. But I want to download a whole directory full of images, which is protected by robots.txt. I know everything is possible, but I am a desktop programmer: I can fool the registry, but being new to web tech I found it difficult to get past robots.txt. As soon as I have in-depth knowledge of crawlers I will write one to bypass robots.txt. I just asked for help writing that one. Enjoy

dbapps_chris

The robots.txt file

It's not the robots.txt file that protects the directory where the images are stored. Right-click any image on a website, choose to open it in a new tab, and you will get the image on its own and be able to download it.

On many websites you can get into the directory (if you know its name and where it is) and see a list of the images.

This site uses a certain framework for putting the web pages together using a template and page fragments. I think this is what protects the directory where the images are stored.
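To illustrate why robots.txt is not a protection mechanism: it is a list of rules that a well-behaved crawler reads and chooses to obey before it sends a request. The sketch below uses Python's standard urllib.robotparser module. The rules, the host example.test and the crawler name ExampleBot are invented for the example and are not taken from this site.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt asking all crawlers to stay out of /images/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "ExampleBot", "http://example.test/index.html"))
print(allowed(ROBOTS_TXT, "ExampleBot", "http://example.test/images/logo.png"))
```

The check runs entirely inside the crawler. A polite crawler skips the URLs for which it returns False. The web server itself does nothing to enforce the rules, which is why real protection has to come from the server or the framework behind it.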