A different take on how search engines work

The functions search engines perform

Search engines have two primary functions. The first is to crawl the internet and build what is called an index, a sort of map that organises the information within. If every student in a school wrote an original story or an essay on a particular area of interest, the search engine would be the equivalent of the teacher who takes all the material, finds out what it is all about, and then organises it into something like an anthology, so that other people can access it.
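For readers who like to see the idea in code, here is a minimal Python sketch of an index. It is only an illustration: the essay numbers and the build_index function are made up for this example, and a real search engine does far more than split titles into words.

```python
from collections import defaultdict

# Hypothetical essays standing in for web pages (the ids are made up).
essays = {
    1: "The Adventures of Zippy the Hippo",
    2: "Physical differences between giraffes and horses",
    3: "Fast trains in England",
}

def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

index = build_index(essays)
print(sorted(index["trains"]))  # ids of the essays that mention "trains"
```

Looking up a word is now a single step, rather than a fresh read through every essay.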

Another function of the search engine is to provide search users with a ranked list of relevant websites. This, of course, happens after it has indexed the sites and determined what they are about. Using the previous example of the anthology: if another individual, say for argument’s sake a parent, asks the school a question about the collection – “Is there any material like children’s animal stories?” – then the school has to pass that request on to the teacher, who will return a list of the essays in the book ordered by how relevant they are. Not everything on the returned list will be equally relevant. If someone has written about “The Adventures of Zippy the Hippo”, that will be judged to be of most use to the parent.

Ranking by relevance

But if the book contains another essay about “Physical differences between giraffes and horses”, the teacher may include it in the note to the parent as an afterthought: while it is not a children’s story, it may contain some facts about animals that the parent may wish to pass on to the child. That essay will be lower down the list than the one about Zippy the Hippo, because the teacher judges it to be less relevant. And so it is with search engines: judgement is applied in deciding how relevant a page is and where it ranks.
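A rough way to picture that judgement in Python: count how many of the words in the request match the keywords noted against each essay, and list the essays with the most matches first. The keywords and the rank function below are assumptions made for this sketch; real relevance scoring weighs many more signals.

```python
# Hypothetical keywords a teacher might have noted against each essay.
catalogue = {
    "The Adventures of Zippy the Hippo": {"children", "animal", "stories"},
    "Physical differences between giraffes and horses": {"animal", "facts"},
    "Fast trains in England": {"transport", "speed"},
}

def rank(query, catalogue):
    """Return titles that match the query, most matching keywords first."""
    terms = set(query.lower().split())
    scored = []
    for title, keywords in catalogue.items():
        score = len(terms & keywords)  # number of query words that match
        if score > 0:                  # leave out essays with no match at all
            scored.append((score, title))
    # Highest score first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]

results = rank("children animal stories", catalogue)
print(results)
```

With this request the Zippy essay matches three keywords and the giraffe essay only one, so Zippy is listed first and the trains essay is left out altogether.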

Cataloguing written work and other media

Imagine this scenario – School A decides to undertake a revolutionary approach to schooling. Instead of teachers dishing out material, students will be allowed to focus on things that interest them, as long as they produce a report that incorporates some elements of critical thinking, calculation, and creativity. Students produce as many reports as they like, and all the work they produce is assessed on its merit to give them a final grade. In classes, teachers facilitate rather than instruct. Students are encouraged not just to work on their own project, but to read each other’s work. One teacher in the whole school is in charge of co-ordinating all these various projects and keeping track of them.

Suppose the published anthology contains not just stories and essays: some students have also included their own art work to accompany their writing, and a few have written essays on music and submitted CDs of accompanying recordings. The teacher must catalogue the existence of such art work and music recordings, in case someone asks, “Have you got anything that might be of interest, about animals and music?”

Search queries

Imagine that students go up to the co-ordinating teacher to ask whether work on certain topics exists.

But imagine, too, that parents are slightly uncomfortable with this seemingly unregulated way of learning and ring the school to ask whether work on certain topics exists. This is their way of checking that their children have submitted the work that they said they would be doing.

Requests may be put by parents to office staff who take the call. But the office staff will probably hand a post-it to the teacher with keywords such as “animals, music” to expedite things. The teacher in charge of the project then has to scour the materials in the collection and return a list of useful materials which may include recordings, essays and art work.

Keeping information up to date with crawls

Imagine that different schools in a district decide to start their own collections, and over the years the various collections accumulate. Someone working on a project in one school may find it useful to know what someone else has done in another. But this means that schools A, B, C and D each get the same call, so before long the co-ordinating teachers at each school decide that it would be worthwhile meeting up every so often to establish a common catalogue of all the work in existence in the district from the various schools.

At a pre-determined interval, the teachers get together to collate and catalogue the different essays, art work, recordings and other materials that may have evolved, such as videos.

This group of teachers performs the role of crawlers or spiders, which reach the many billions of interconnected documents on the web.

So the teachers decide to create an Excel file cataloguing the existing work they already have, in their schools, and each item has a few keywords next to it to describe what it is generally about.

For example, “#3576. Scary Ghosts in the Castle. Horror. Children’s. Video.”

This Excel file is emailed to all the teachers and periodically updated as new projects are added.
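The way a crawler moves from one document to the next can be sketched in a few lines of Python. The page names and links below are invented for illustration, and the crawl function is only a sketch: it walks a small link graph held in memory rather than fetching anything from the web.

```python
from collections import deque

# Hypothetical link graph: each page lists the pages it links to.
links = {
    "school-a/home": ["school-a/zippy", "school-b/home"],
    "school-a/zippy": [],
    "school-b/home": ["school-b/ghosts", "school-a/home"],
    "school-b/ghosts": [],
}

def crawl(start, links):
    """Visit every page reachable from start, breadth first, once each."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)  # this is where a page would be catalogued
        for target in links.get(page, []):
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return order

visited = crawl("school-a/home", links)
print(visited)
```

Starting from School A’s home page, the crawl reaches School B’s pages only because one page links to them, which is also how the teachers come to hear about work in other schools.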

Indexing, Cached Results and Different Datacenters

When someone asks a teacher at the school for some information, the teacher goes to the Excel file and organises the required information by relevance.

This Excel file may soon prove to be out of date, and recent work may take a while to appear in it.

The teachers are busy and can only meet to update the files with recent work when their schedules coincide.

Each teacher may maintain a copy of the Excel file, adding only the new projects in their own school, but when they all get together they collate the projects completed in other schools and produce a new Excel file.

So when a parent rings school A for information, the results will be historical ones with recent additions from school A. If the same parent rings school B, he is going to get historical results with recent school B additions.
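Here is a small Python sketch of why two schools can give different answers between meetings. The item numbers above 9000 and both helper functions are made up for this example; only entry #3576 comes from the catalogue described earlier.

```python
# The common catalogue agreed at the last meeting.
shared = {3576: "Scary Ghosts in the Castle"}

# Hypothetical projects added at each school since that meeting.
new_at_a = {9001: "New essay from School A"}
new_at_b = {9002: "New recording from School B"}

def school_view(shared, local):
    """What one school can see: the shared catalogue plus its own additions."""
    view = dict(shared)
    view.update(local)
    return view

def collate(shared, *local_additions):
    """What every school sees after the teachers meet and merge their files."""
    merged = dict(shared)
    for local in local_additions:
        merged.update(local)
    return merged

view_a = school_view(shared, new_at_a)
view_b = school_view(shared, new_at_b)
after_meeting = collate(shared, new_at_a, new_at_b)
print(sorted(view_a), sorted(view_b), sorted(after_meeting))
```

Until the files are collated, School A’s copy lacks item 9002 and School B’s copy lacks item 9001, so the same question can produce different lists.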

Filtering Search Results

Imagine the schools have essays about trains. A common one is “#684. Fast trains in England. Transport. Speed.”

Then someone in school A submits an essay about the Shinkansen, the Japanese Bullet Train. Someone in school B writes an essay about Running.

If both schools are approached by a parent who wants to know about a train that is fast, both teachers may be handed post-its with the words “train fast”. The teacher in school A might reply with a list of works in this order:

Fast Trains in England
Bullet Trains

The teacher from School B may have a list that looks like:

Fast Trains in England
Running (train to run fast – get it?)

It depends on the information in their Excel files, which may be different.

But even when the teachers have convened and the Excel file is up to date, the lists returned by the two schools may still differ because the search has been interpreted differently.

If teacher A is aware that the parent making the enquiry has a Japanese relative, the bullet train essay may top the list produced from school A.

If teacher B knows that the parent has recently moved to England from Japan, and gauges that the parent is keen to know more about the new country, then Fast Trains in England will top School B’s list.
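One simple way to picture this kind of tailoring in Python is to start from a basic relevance score and add a small bonus whenever an essay touches on something the teacher knows about the person asking. The scores, topics and the personalise function are all assumptions made for this sketch, not a description of how any particular search engine does it.

```python
def personalise(results, interests, boost=1.0):
    """Reorder results so essays matching the asker's interests move up.

    results   -- list of (title, base_score, topics) tuples
    interests -- set of topics known to matter to the person asking
    """
    def final_score(item):
        title, base_score, topics = item
        return base_score + boost * len(topics & interests)

    ordered = sorted(results, key=final_score, reverse=True)
    return [title for title, _, _ in ordered]

# Hypothetical base scores for the query "train fast".
results = [
    ("Fast Trains in England", 2.0, {"england", "trains"}),
    ("Bullet Trains", 1.5, {"japan", "trains"}),
]

print(personalise(results, {"japan"}))    # asker with a Japanese relative
print(personalise(results, {"england"}))  # asker keen to learn about England
```

With an interest in Japan, Bullet Trains scores 2.5 against 2.0 and tops the list; with an interest in England, Fast Trains in England stays on top.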

Assume the office of School B receives a query from a parent who wants to pick up and view any video recordings about jazz. The teacher in School B is off on long-term sick leave, the enquiry is fairly urgent (so the caller says), and the Excel file held by School B cannot be traced, so the office of School B rings School C to ask if they would mind handling the query. Of course, School C’s Excel file will contain the historical common data as well as a catalogue of items specific to School C.

But the teacher at School C may realise that the parent making the enquiry lives closer to School A and, if the parent intends to view jazz video recordings on a VHS player, it might make more sense to see whether School A has any first, so the enquiry is passed on to School A.

Search and Privacy

Mr S is a secretive individual. He doesn’t like people keeping tabs on him. So when he rings School A to ask if there are student submissions about How Computers Work, he always calls from an unlisted phone number. Or he calls from different payphones. Or he disguises his voice, sometimes lower, sometimes higher. He is keen to track How Computers Work because his son at the school has promised to do a report on it, and ringing the school to ask about it is his way of making sure his son is making a proper effort at school. He also figures that if his son George has put in a stellar effort, then his teacher is likely to recommend his work above all other similar projects.

George S has submitted his project on How Computers Work, but it is not yet showing on the Excel file at the school: it is still buried under a mound of other work to catalogue, and it may be a while before his teacher logs it and has a good look at it to assign the keywords.

So Mr S keeps ringing the school frequently to check How Computers Work, to see if George’s work has been catalogued, and when it does appear, he hopes to make an assessment of its quality by where it appears on the list. But until the Excel file is updated, he is wasting his time. And when will George’s work be recorded? Well, the teacher has to slowly work through the pile which never seems to stop accumulating.

Despite the attempts at secrecy, the teacher at the school knows it is Mr S making these enquiries. Despite the unlisted numbers, or calls from various payphones, or different voices, there is something unique about Mr S. Perhaps it is the way he always says, “Could I possibly trouble you with my enquiry about …” that uniquely identifies him, so even though he has taken steps to appear anonymous, the teacher at the other end knows, and pretty soon the other co-ordinating teachers know too.

The calls from Mr S create problems. Initially, the teacher thinks the list of projects given to him is somehow incorrect, or that there are issues with its relevance, because the caller keeps calling back. So in later requests, even though the same projects are listed, the order is sometimes changed.

Or sometimes other work, which might not have seemed relevant enough to mention, is placed in the list in the hope that it may be what Mr S is looking for. Mr S, really, is looking for the appearance of a new project on the list the teacher gives him – the indexing of his son’s work. But the teacher keeps giving him a changing list, which frustrates and puzzles him. And he sometimes wonders why, when he asks for How Computers Work, the relevance of the recommended projects keeps changing even though the enquiry is the same and the projects are the same.

Returned Results from Saved Searches in the Index

Sometimes he rings the other schools just to see if they have an Excel file that is more up to date than the one at School A. But when he makes too many enquiries in a short space of time, all the teachers collude to give him a set list of projects without even considering his query seriously.

But when Mr James rings with the same query on How Computers Work, the query is treated as a separate one.
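The teachers’ behaviour can be sketched in Python as a front desk that keeps count of recent calls per caller and, once a caller goes over a limit, hands back the list it saved earlier instead of looking anything up again. The EnquiryDesk class, the limit of three calls and the one-minute window are all invented for this illustration; it is a guess at the general idea, not a description of what any search engine actually does.

```python
from collections import defaultdict, deque

class EnquiryDesk:
    """Hypothetical front desk that stops doing fresh work for over-eager callers."""

    def __init__(self, search, limit=3, window=60.0):
        self.search = search             # function that does the real lookup
        self.limit = limit               # fresh lookups allowed per window
        self.window = window             # window length in seconds
        self.calls = defaultdict(deque)  # caller -> timestamps of recent calls
        self.saved = {}                  # (caller, query) -> last list handed out

    def ask(self, caller, query, now):
        recent = self.calls[caller]
        # Forget calls that fall outside the time window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        recent.append(now)
        key = (caller, query)
        if len(recent) > self.limit and key in self.saved:
            return self.saved[key]       # set list, no fresh lookup
        self.saved[key] = self.search(query)
        return self.saved[key]

lookups = []

def search(query):
    lookups.append(query)  # record every real lookup that is carried out
    return ["Project list for " + query]

desk = EnquiryDesk(search)
for second in range(5):
    desk.ask("Mr S", "how computers work", now=float(second))
desk.ask("Mr James", "how computers work", now=5.0)
print(len(lookups))  # 4: three for Mr S, one for Mr James
```

Mr S rings five times in five seconds but only his first three calls trigger a real lookup, while the single call from Mr James is handled on its own merits.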

Filtering Results by Device

School A knows Mr James lives near the school itself because he always calls from his landline. But one day when Mr James makes a call, the person taking it hears a lot of background noise and realises he is calling from a mobile, within the vicinity of School D. So when Mr James asks about projects around the theme of natural disasters, there is some sense in the decision to run his search using the School D Excel file, so he might be able to pick up what he needs from around his immediate location.
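Choosing which school’s file to use can be pictured as a nearest-neighbour lookup. In the Python sketch below the map coordinates and the nearest_school function are made up for illustration; it simply picks the school with the shortest straight-line distance to wherever the caller seems to be.

```python
import math

# Hypothetical map positions for two of the schools (arbitrary units).
schools = {
    "School A": (0.0, 0.0),
    "School D": (8.0, 6.0),
}

def nearest_school(caller_position, schools):
    """Return the name of the school closest to the caller's position."""
    return min(
        schools,
        key=lambda name: math.dist(caller_position, schools[name]),
    )

print(nearest_school((0.5, 0.2), schools))  # calling from home, near School A
print(nearest_school((7.0, 6.5), schools))  # calling from a mobile, near School D
```

The same caller is sent to a different catalogue depending on where the call appears to come from.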

And so this is how it is. Search engines crawl the internet to find pages to catalogue. Once the engines find these pages, they decipher the code from them and store selected pieces in massive databases, to be recalled later when needed for a search query. This is similar to the Excel file kept by the teachers. Search engine companies have various data centers so that the load of a search is spread across different data centers, rather than converging at one point, which would cause an overload. But the data held by each data center varies depending on when it was last updated.

When you search for a term in Google, the results returned to you depend on the data the search engine holds and on which data center is assigned to handle your query. The ranking of the returned results is further adjusted by factors such as your location and the device you are searching from – whether it is a mobile device or a desktop.

Information on your previous searches can be held on file to refine your future searches. If you do not click on any of the results that Google offers you, it gets the impression that they are not relevant and may demote certain results in future.

Could Google customise search results to fit an individual? If you are signed in to your Google account on your device, then your search records are overtly held and you are already being served specifically customised results. And it is not far-fetched to say Google could obtain records about previous searches if it recorded IMEI numbers together with searches within a database. The IMEI is a 15-digit number that uniquely identifies a mobile phone or other device with a cellular connection.

Suppose then, you have a website. You want it to be catalogued or indexed, to display prominently in searches regardless of user device, specific to the location you want to target. In order for your website to feature prominently, you really need someone who understands the mechanics of search – this is where I come in.

So get in touch.