A search engine (SE) is a computer system designed to search for information.
The most well-known search engine applications are web services for searching textual or graphical information on the World Wide Web. There are also systems that search for files on FTP servers, products in online stores, and information in newsgroups.
To search for information using a search engine, the user formulates a search query. The job of the search engine is to find documents containing either the specified keywords or words related to the user's request. The engine then generates a search results page. Some engines also extract information from relevant databases and resource directories on the Internet.
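The keyword-matching step described above can be sketched as a lookup against an inverted index that maps words to the documents containing them. This is a minimal illustration; the index contents and document names below are invented, and real engines also rank the matching documents.

```python
# Toy inverted index: word -> set of documents containing it (hypothetical data).
index = {
    "search": {"doc1", "doc2"},
    "engine": {"doc1", "doc3"},
    "crawler": {"doc2"},
}

def query(words):
    """Return the documents that contain every queried keyword."""
    result = None
    for word in words:
        docs = index.get(word.lower(), set())
        # Intersect with the documents found so far.
        result = docs if result is None else result & docs
    return result or set()

print(query(["search", "engine"]))  # {'doc1'}
```

A production engine would then score and order these matches before building the results page.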
By their search and maintenance methods, search engines fall into four types: systems using search robots, human-curated systems, hybrid systems, and metasearch engines. The architecture usually includes a crawler, an indexer, and a query processor.
As a rule, such systems operate in stages: pages are crawled, their content is indexed, and user queries are answered from the index. To keep the search engine's collected information up to date, this indexing cycle is repeated.
Search engines work by storing information about web pages, which they obtain from the pages' HTML code and URLs (Uniform Resource Locators).
A crawler is a program that automatically follows all the links found on a page and extracts them. Starting from these links, or from a predefined list of addresses, the crawler discovers new documents not yet known to the search engine. A site owner can exclude certain pages using robots.txt, which prevents the indexing of specific files, pages, or directories of the site. The search engine analyzes the content of each page for further indexing; words can be extracted from headings, page text, or special fields called meta tags.
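The robots.txt check mentioned above can be sketched with Python's standard `urllib.robotparser`. The rules and URLs below are hypothetical; a real crawler would fetch the site's actual robots.txt before requesting any page.

```python
# Sketch of honoring robots.txt before fetching a page,
# using the standard-library robots.txt parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Normally one would call rp.set_url("https://example.com/robots.txt")
# followed by rp.read(); here we parse an example file directly.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))         # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/page.html"))  # False
```

Pages disallowed for the crawler's user agent are simply skipped and never reach the indexer.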
One crawler looks for new URLs by scanning links across the Web, while another robot visits each newly found page, analyzes its content, and adds it to the index database.
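The link-discovery step can be illustrated with the standard `html.parser` module: the crawler extracts every `href` from a fetched page and adds unseen URLs to its queue. The HTML snippet here is invented for illustration.

```python
# Sketch of extracting outgoing links from a page's HTML.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the href attribute of every anchor tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = '<p>See <a href="https://example.com/a">A</a> and <a href="/b">B</a>.</p>'
extractor = LinkExtractor()
extractor.feed(html)
print(extractor.links)  # ['https://example.com/a', '/b']
```

Relative links such as `/b` would be resolved against the page's own URL before being queued.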
An indexer is a module that analyzes a page, first breaking it into parts using its own lexical and morphological algorithms. All the elements of the web page are isolated and analyzed separately. Data about web pages is stored in the index database for use in answering subsequent queries.
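The indexing step above can be sketched as follows: page text is broken into word tokens and, for each word, the set of pages containing it is recorded. The tokenizer here is deliberately crude (lowercasing and splitting on non-alphanumerics stands in for real lexical and morphological analysis), and the page URLs and texts are invented.

```python
# Sketch of building an inverted index from crawled pages.
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """pages: mapping of URL -> page text; returns word -> set of URLs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

pages = {  # hypothetical crawled documents
    "https://example.com/a": "Search engines index web pages",
    "https://example.com/b": "A crawler fetches web pages",
}
index = build_index(pages)
print(sorted(index["pages"]))  # both pages contain the word "pages"
```

Storing the index as word-to-pages (rather than page-to-words) is what makes later keyword queries fast: answering a query is a dictionary lookup per keyword.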