Global Aircraft -- IungamBot

Technical Details for IungamBot

Statuses of our web crawlers:

IungamBot (PERL) is currently running (since 2012-12-24 20:59:39 PDT).
IungamBot (PHP) is currently sleeping (since 2012-12-24 20:59:39 PDT).
IungamBotJr is currently exclusive (since 2012-12-24 20:59:39 PDT).
IungamBot-Photos is currently running (since 2012-12-24 20:59:39 PDT).
IungamBot-Update is currently idling (since 2012-12-24 20:59:39 PDT).

Progress of our web crawlers:

4846 docs have been crawled, giving us 4653 docs cached into our temporary queue.
The remaining 193 docs either returned a 3xx, 4xx, or 5xx HTTP status code or were excluded from our search.

Document Index

What exactly is IungamBot?
How to get IungamBot not to index part or any of your website
How to ask IungamBot to exclude certain pages
Technical information for the IungamBot web crawler
How to get your website indexed by the Iungam Search engine
What if IungamBot isn't obeying your robots.txt file?
Find out what IungamBot looks for when it indexes your website
A list of protocols for the Iungam search engine
What to do if IungamBot has already added pages you don't want on our search engine
Find out what the statuses of the IungamBots mean
Find out more about the different web crawlers that make up IungamBot
Do you have any questions that aren't answered here?

What exactly is IungamBot?

IungamBot is the web crawler that scours the internet for (primarily) aviation documents to add to the Global Aircraft ("GAC") Iungam Search Engine. Iungam is Latin for "I shall unite", which describes the purpose of this engine: to unite the pages of the internet in one place for easy access. Iungam is composed mostly of member websites, but IungamBot may sometimes follow links to outside websites and could begin adding those sites to our queue. This is a good thing! We are currently trying to break free of the members-only definition so that we can offer an even better service to all of our visitors.

Many people wonder how to pronounce "Iungam". Classically, Iungam is pronounced "yungum", but later the Italians invented the 'J' as a consonantal 'i', so it can also be pronounced "jungum". Iungere (Jungere) is ultimately the source of the English words join, subjugate, and juxtapose.

How to get IungamBot not to index part or any of your website

If you have access to the root of your server (that is, you aren't hosted on a service such as geocities, where your top level is www.geocities.com/usrname/), you can create a file called 'robots.txt' and put it in the root of your website. For instance, http://www.foo.com/bar/robots.txt would not work, but http://www.foo.com/robots.txt is correct. Inside the file, write your rules using the following syntax:

 #this will exclude all robots from the entire server
 User-agent: *
 Disallow: /
 
 #this will allow all robots access to the entire server
 User-agent: *
 Disallow:
 
 #this will disallow all robots from accessing anything in /cgi-bin/ 
 User-agent: *
 Disallow: /cgi-bin/
 
 #this will disallow IungamBot from accessing the file /logbook.doc 
 User-agent: IungamBot
 Disallow: /logbook.doc
 
 #this will exclude IungamBot from the entire server
 User-agent: IungamBot
 Disallow: /
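
One way to see how rules like these are interpreted is Python's standard urllib.robotparser module. The sketch below is only an illustration: the robots.txt text is a hypothetical file that combines two of the examples above, and the result shows how this particular parser reads the rules, which is not necessarily identical to IungamBot's own matching.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt combining two of the examples above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/

User-agent: IungamBot
Disallow: /logbook.doc
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("GAC IungamBot/1.1", "http://www.foo.com/logbook.doc"),
        ("GAC IungamBot/1.1", "http://www.foo.com/index.html"),
        ("OtherBot", "http://www.foo.com/cgi-bin/search"),
    ]
    for agent, url in checks:
        print(agent, url, allowed(agent, url))
```

With this parser, a bot that has its own User-agent group is judged by that group alone, so rules meant for every robot need to be repeated inside the IungamBot group.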

How to ask IungamBot to exclude certain pages

IungamBot follows the noindex, nofollow, and noarchive instructions in the standard robots meta tag. If you put these meta tags in the head of your document, you can tell IungamBot not to index, archive, or follow links on that particular page. IungamBot looks for the following syntax:

 #IungamBot can index, follow, and archive the document
 <meta name="robots" content="all" />

 #IungamBot will retrieve the document but will not add it to our database 
 <meta name="robots" content="noindex" />
 
 #IungamBot will not take a cache snapshot of the current document. 
 # Keep in mind that this will significantly lower this document's 
 # appearance in the results on our search engine.
 <meta name="robots" content="noarchive" />
 
 #IungamBot will not follow any links on the current page
 <meta name="robots" content="nofollow" />

The robots meta tag applies to all robots that visit your page, not just IungamBot. If you want more options, or if you want to set restrictions strictly for IungamBot, use the iungambot meta tag instead of the robots tag. For example:

 #IungamBot will not add the document to our database or follow links on 
 # it, but other robots may 
 <meta name="iungambot" content="noindex,nofollow" />
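
The meta tag rules above can be sketched with Python's standard html.parser module. This is a minimal illustration, not IungamBot's actual code: the names RobotsMetaParser and page_policy are hypothetical, and the rule that an iungambot tag takes precedence over a robots tag when both are present is an assumption, since the text above does not say how the two are combined.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from robots and iungambot meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta tag name -> set of directive tokens

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name in ("robots", "iungambot"):
            tokens = {
                token.strip().lower()
                for token in attr.get("content", "").split(",")
                if token.strip()
            }
            self.directives.setdefault(name, set()).update(tokens)


def page_policy(html):
    """Return what a crawler may do with the page described by html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    # Assumption: the bot-specific tag wins over the generic one.
    tokens = parser.directives.get(
        "iungambot", parser.directives.get("robots", set())
    )
    return {
        "index": "noindex" not in tokens,
        "follow": "nofollow" not in tokens,
        "archive": "noarchive" not in tokens,
    }


if __name__ == "__main__":
    print(page_policy('<meta name="iungambot" content="noindex,nofollow" />'))
```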

Technical information for the IungamBot web crawler

 robot-id: iungambot
 robot-name: IungamBot
 robot-cover-url: http://search.globalaircraft.org/
 robot-details-url: http://search.globalaircraft.org/iungambot/about.pl
 robot-owner-name: Charles Munson
 robot-owner-url: http://www.globalaircraft.org/
 robot-owner-email: server@globalaircraft.org
 robot-status: active
 robot-purpose: indexing
 robot-type: browser
 robot-platform: linux
 robot-availability: data
 robot-exclusion: yes
 robot-exclusion-useragent: iungambot
 robot-noindex: yes
 robot-host: *.globalaircraft.org
 robot-from: yes
 robot-useragent: GAC IungamBot/1.x
 robot-language: php, perl
 robot-description: used to build databases of aviation docs for the Iungam search engine 
 robot-history: Developed by the Global Aircraft Organization
 robot-environment: research
 modified-date: Wed Jul 02 10:50:38 EST 2003
 modified-by: server@globalaircraft.org

How to get your website indexed by the Iungam Search engine

Currently, only GAC Members can submit a request to add their website to our search engine. If you would like to add your site, please see the Add Website form.

What if IungamBot isn't obeying your robots.txt file?

There can be a few reasons why IungamBot is being disobedient. One is that the syntax of your robots.txt file is incorrect, or the file isn't placed in the top directory of your server. Another is that IungamBot did not look for the file in the first place, or failed to retrieve it. A third is that your website was crawled before you put the file into place. In that case, please wait a few weeks for IungamBot to pick up the new restrictions.
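
A quick way to rule out the placement problem is to compute where a crawler would look for robots.txt for a given page. The sketch below uses Python's standard urllib.parse; the function names robots_txt_url and is_at_root are hypothetical helpers for illustration, not part of IungamBot.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for the host serving page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def is_at_root(robots_url):
    """Return True if robots_url points at the top directory of its server."""
    return urlsplit(robots_url).path == "/robots.txt"


if __name__ == "__main__":
    print(robots_txt_url("http://www.foo.com/bar/index.html"))
    print(is_at_root("http://www.foo.com/bar/robots.txt"))
```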

Find out what IungamBot looks for when it indexes your website

Currently, IungamBot searches for and follows all HREF and SRC links. It then strips out all HTML and JScript and saves the result as the cache (unless you specified noarchive). If some links aren't being picked up or followed, IungamBot may have exceeded its maximum follow allowance, or your HTML may contain too many errors for the links to be extracted. Broken tags in particular usually cause problems for IungamBot when it tries to follow links. An example of a broken tag:

 #A broken tag spans across two lines
 <body text="#000000" bgcolor="#FFFFFF" 
 >
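
To preview which links a crawler of this kind might pick up from a page, the HREF and SRC extraction step can be sketched with Python's standard html.parser. This is an assumption-laden illustration: LinkCollector and extract_links are hypothetical names, and Python's parser is more tolerant than the description above suggests IungamBot is (it accepts a tag that spans two lines), so a link found here is not guaranteed to be found by IungamBot.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every href and src attribute."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the HREF and SRC targets in html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    page = '<a href="planes/f-15.pl">F-15</a><img src="/img/f15.jpg">'
    for link in extract_links(page, "http://www.foo.com/index.html"):
        print(link)
```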

A list of protocols for the Iungam search engine

The current list of Iungam Search protocols is as follows:

 This option will give you the cache of a certain document in our search engine:
     cache:URL of page [search terms[ search terms]]
     ex) cache:www.globalaircraft.org/planes/yf-17_cobra.pl f-17 cobra

 This option will return all documents under the given domain; it is sub-domain sensitive:
     site:domain root [search terms[ search terms]]
     ex) site:www.globalaircraft.org f-15 eagle
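
The shape of these queries (an optional cache: or site: prefix followed by search terms) can be sketched as a small parser. This is an illustration of the syntax only; parse_query is a hypothetical name, and how the Iungam search engine actually splits queries is not documented here.

```python
def parse_query(query):
    """Split a query into (operator, target, search_terms).

    operator is "cache" or "site" when the first word uses that prefix,
    otherwise None, in which case every word is a search term.
    """
    words = query.split()
    if words:
        prefix, separator, target = words[0].partition(":")
        if separator and target and prefix.lower() in ("cache", "site"):
            return prefix.lower(), target, words[1:]
    return None, None, words


if __name__ == "__main__":
    print(parse_query("site:www.globalaircraft.org f-15 eagle"))
    print(parse_query("f-15 eagle"))
```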

What to do if IungamBot has already added pages you don't want on our search engine

The only option currently available is to wait for IungamBot to detect that you no longer want this page indexed. This may take a few weeks. If IungamBot hasn't removed the document within a month, please contact the bot's administrator at server@globalaircraft.org.

Find out what the statuses of the IungamBots mean

Status	Description
Running	This status means that the IungamBot is currently running, or it was running the last time our script checked the bot.
Idling	This status means that either the bot was running a while ago and is currently not doing anything, or it is currently awaiting instructions from the bot master.
Offline (sleeping)	This status means that the bot is not currently running, and is not accepting any commands from the bot master.
Retired	This means that the bot is no longer in standard use - however, the bot may be tweaked to run as any of the IungamBots to help out if one is bogged down with work at the time.
Not Responding	This would imply that the bot isn't responding to our requests for a status. Either the bot is retaliating or it has run into an internal error and needs to be checked by the bot master.

Find out more about the different web crawlers that make up IungamBot

The IungamBot crawler is currently composed of five systems. All of the following bots identify themselves with the user-agent "GAC IungamBot/1.x", where 'x' is the particular bot's ID number. For a more in-depth view of each system, please refer to the chart below:

ID	Name	Language	Description
1	IungamBot	PERL	This web crawler is responsible for scouring the internet for new documents that we don't currently list and sending these links to other IungamBots, which complete the appropriate routine to add them to our database. It also has the job of actually grabbing the pages and creating a cache of them. It adds each page to our database along with any information needed for our server.
2	IungamBot	PHP	This crawler is used as a mirror to any IungamBot to help relieve the duties of a bot which is bogged down with work.
3	IungamBotJr	PERL	IungamBotJr is reserved for administration use only. This means that only certain pages will be searched with this bot -- but when run it will create a cache of a page and add it to our database with full stats needed.
4	IungamBot-Photos	PHP	This web crawler is responsible for retrieving images from the internet, using the queue created by IungamBot, and adding them to our database. Currently this bot is restricted to grabbing images from globalaircraft.org and other trusted websites only.
5	IungamBot-Update	PERL	The job of this crawler is to go through our current database of documents and update them. This will update the cache for the documents, update the age of the documents, and could remove documents if they are no longer active.
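
If you want to tell from your server logs which of these bots visited, the user-agent scheme described above can be decoded with a short lookup. The sketch below assumes the logged string contains "GAC IungamBot/1.x" literally with a numeric ID in place of 'x'; identify_bot and the BOTS table are hypothetical names that simply mirror the chart.

```python
import re

# ID -> (name, language), copied from the chart above.
BOTS = {
    1: ("IungamBot", "PERL"),
    2: ("IungamBot", "PHP"),
    3: ("IungamBotJr", "PERL"),
    4: ("IungamBot-Photos", "PHP"),
    5: ("IungamBot-Update", "PERL"),
}

# Assumption: the ID appears as digits directly after "1.".
USER_AGENT_PATTERN = re.compile(r"GAC IungamBot/1\.(\d+)")


def identify_bot(user_agent):
    """Return (name, language) for an IungamBot user-agent, else None."""
    match = USER_AGENT_PATTERN.search(user_agent)
    if match is None:
        return None
    return BOTS.get(int(match.group(1)))


if __name__ == "__main__":
    print(identify_bot("GAC IungamBot/1.3"))
    print(identify_bot("Mozilla/5.0"))
```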

Do you have any questions that aren't answered here?

You may send questions regarding the IungamBot technology to server@globalaircraft.org.